OCR - convert pdf to text with Casedo

Convert PDF to text on the fly with OCR

Right now, new technology is making it easier for lawyers to work with digitised documents. Most lawyers are familiar with pdfs: the standard file format for storing scanned documents, as well as for exchanging them with other parties. On the flip side, if you have ever tried to edit, copy or search through text in such a file, you’ll know just how frustrating pdfs can be to work with.

 

Optical Character Recognition (OCR) technology changes all of this. Designed with lawyers in mind, Casedo’s OCR feature enables you to convert pdf to text on the fly, making unreadable pdfs readable and allowing you to manipulate, search and extract text, just as you would with a Word document.

Here’s a closer look at how OCR works, and why this can mean a welcome boost in productivity and reduction in stress levels for lawyers, paralegals and support staff alike.

 

What’s the problem with scanned pdfs for lawyers?

There are good reasons why pdfs (Portable Document Format files) are used so frequently within law firms. First released by Adobe in 1993, this file format lets you easily convert both electronic and paper documents into accurate digital versions of the original. Pdf documents are easy to share and to view, thanks to Adobe’s free-to-use Reader plugin. The files themselves are also relatively small, which makes them easier to send via email and makes it possible to store vast volumes of scanned documents on the firm’s hard drive.


Scanning and converting a document into a pdf creates an electronic image of the original. The file is non-editable, which can be useful from a security perspective when you need to show that a document is a true copy of the original. However, this characteristic also stops you from annotating, copying, searching through and extracting the text: bad news when you need to work on the document.

 

What does OCR software do?

Optical Character Recognition software changes the way your device processes pdf files. By converting pdf to text it enables the device to actually read the text, rather than treating it as an image.

For you, this means the document is transformed into an editable, machine-readable format.



How can I put OCR software to work?

Here are some of the many situations where OCR can prove especially valuable in law firms and chambers:


Post-disclosure investigation

As part of the disclosure and inspection process, you receive a large volume of the other party’s scanned bank statements in pdf format. As part of your investigations, you want to isolate all transactions relating to a particular payee. Rather than printing out the statements and examining them line-by-line with a highlighter, an OCR feature lets you do a text search for the payee and identify all relevant entries in an instant.
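As a rough illustration of what such a search involves once the text has been recognised, here is a minimal Python sketch. It assumes the OCR step has already produced one string of text per page; the function name `find_payee_entries` and the sample data are hypothetical and are not part of Casedo.

```python
import re
from typing import List, Tuple


def find_payee_entries(pages: List[str], payee: str) -> List[Tuple[int, str]]:
    """Return (page number, line) for every line that mentions the payee.

    `pages` is assumed to hold the OCR-recognised text of each page.
    """
    words = payee.split()
    if not words:
        return []
    # Allow any run of whitespace between the words of the payee name,
    # because recognised text often has irregular spacing.
    pattern = re.compile(r"\s+".join(re.escape(w) for w in words), re.IGNORECASE)

    matches = []
    for page_number, text in enumerate(pages, start=1):
        for line in text.splitlines():
            if pattern.search(line):
                matches.append((page_number, line.strip()))
    return matches


# Hypothetical recognised text from two pages of a bank statement.
statement_pages = [
    "01/03 ACME  Ltd 120.00\n02/03 Grocer 15.20",
    "05/03 acme ltd 99.00",
]
for page_number, line in find_payee_entries(statement_pages, "Acme Ltd"):
    print(f"page {page_number}: {line}")
```

The match ignores case and spacing, but it will not catch a payee name that the OCR engine has misread, so a spot check of the results is still sensible.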


Expert evidence

Experts will often attach reference documents, such as research reports and articles from academic journals, to the likes of medical and engineering reports. These documents can be dense. Nevertheless, it is generally important to review them for anything that might be especially relevant to the expert report itself and to your client’s case as a whole.

Once you convert PDF to text, OCR allows you to search for the segments of these documents that are likely to be most relevant. It also allows you to easily copy sections and paste them into opinions, correspondence and pleadings.

 

Research

One of the barristers’ chambers you frequently instruct has prepared a handy guide to tax law changes and has sent you a scanned version in pdf format. Some of the contents are directly relevant to a number of your cases. OCR enables you to annotate the document and extract useful sections and charts so you can add them to your casenote file on your case management system.



What are the benefits of OCR for lawyers?

Why should I use this feature to convert PDF to text? OCR can help you in the following ways:


Speed


By effectively ‘unlocking’ scanned documents, OCR removes the (frustrating!) requirement of having to retype sections of text contained in scanned documents.

For many of the scanned files lawyers deal with, only certain sections are specifically relevant to the litigation in hand. As we explored in our article, How lawyers can reduce stress at work with legal tech, as much as 20% of a working day can be wasted in searching for the information you need to get the job done.(1) Trying to identify the relevant parts of huge files can be a big part of this. By letting you search for and then highlight specific areas, OCR can cut out a lot of this waste.

 

Accuracy

Accuracy is especially important when you have large volumes of financial records to analyse. When assessing text manually, even the most experienced lawyer can miss something important. With OCR enabled, you can use the search function to check every page, making it far less likely that something relevant will be missed.

 

Profitability

OCR makes it quicker to work with scanned documents, freeing up your time to devote to more valuable activities such as wider case strategy and building stronger client relationships. What’s more, because it creates less scope for error, there is often greater scope for delegating tasks such as document checking to more junior staff.

 

How to use Casedo software to convert PDF to text

With Casedo, you can now make unreadable scanned documents readable by following these simple steps:

  • Import the scanned document into the Casedo workspace
  • Right-click the imported file and select ‘Recognise text’
  • The software will then process the text (this can take a minute or two, depending on the size of your pdf)
  • Once the OCR feature has processed the text, you can search and edit it as you would a standard text document.
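The steps above can be sketched in a few lines of Python. This is only an illustration of the import, recognise and search flow, not Casedo's actual implementation: the function names are hypothetical, and the OCR engine is passed in as a plain function so that any engine could be substituted.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for an OCR engine: it takes one scanned page
# (an opaque object here) and returns the text it recognises.
OcrEngine = Callable[[object], str]


def recognise_documents(
    scans: Dict[str, List[object]], ocr: OcrEngine
) -> Dict[str, List[str]]:
    """Run the OCR engine over every page of every imported document."""
    return {name: [ocr(page) for page in pages] for name, pages in scans.items()}


def search_documents(
    texts: Dict[str, List[str]], term: str
) -> List[Tuple[str, int]]:
    """Return (document name, page number) for each page containing the term."""
    needle = term.casefold()
    hits = []
    for name, pages in texts.items():
        for page_number, text in enumerate(pages, start=1):
            if needle in text.casefold():
                hits.append((name, page_number))
    return hits
```

Passing the whole dictionary of recognised text to `search_documents` searches all documents together, while passing a dictionary with a single entry searches one document on its own.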

If you have one or more scanned documents and you want to search for specific words, you can simply import them all into Casedo, apply the OCR feature and search them together or individually. Request a Free Casedo 30 Day Trial today, and see how easy it is for yourself.

 

References

  1. Noi, D. (2018). Do workers still waste time searching for information?. [online] Blog.xenit.eu. Available at: https://xenit.eu/do-workers-still-waste-time-searching-for-information/ [Accessed 11 Nov. 2022].

 

UPDATED: 2022.11.11



PDF types - How many are there?

Eight apparently, or is it three? PDFs are ubiquitous these days, and yet, like the internet, they haven't been around for long. The PDF first appeared in 1993 and for most people it is now the de facto way to share digital documents. For those of us using PDFs, or building products that use them, it's worth knowing that the humble PDF is not humble at all: there are many PDF types, each defined by its own standard.

 

These types can be categorised in roughly two ways: technical and everyday. Technically, PDFs are defined by ISO standards and the like, with standards for different business sectors, for archiving, for engineering and for printing. There are point releases (have you heard of PDF 2.0?) and subsets (surely you know PDF/VT?). Like any good ISO standard, none of these impinge on our daily life, but they are the hidden backbone to it.

Of more interest to most of us are the types of PDF in everyday parlance, which are much simpler to grasp. Depending on the way the file originated, there are three main types of PDF documents. How the PDF was originally created defines whether the content of the PDF (text, images, tables) can be accessed or whether it is “locked” in an image of the page.

 

Everyday PDF Types:

  • Real PDFs
  • Scanned PDFs
  • Searchable PDFs

 

1. Real PDFs:

Real PDFs, also known as digitally created PDFs, are ideal for most applications. They allow users to mark up, annotate, search and copy/paste without any extra step. You can easily create them in-app or via the "print" function. You can search these types of PDFs by default, and content such as text and images can be copied and pasted into other file formats.

Both the meta-information and the characters in the text are stored as electronic character data rather than as an image. With PDF editors and other document readers you can search through these PDFs. You can also edit, select or delete any of the content they hold, unless the document itself has password protection.

 

2. Scanned PDFs:

Scanned PDFs are just an image of the actual text, so the content is "locked" in a snapshot-like image. This is the same as converting a camera image, a screenshot, jpg or tiff into a PDF. These image-only PDF files are not searchable, and their text usually cannot be modified or easily marked up. This is because they are scanned/photographed images of the pages, and thus without an underlying text layer.

You can convert these kinds of image-only PDFs from non-readable text into readable text through an Optical Character Recognition (OCR) engine. This engine adds an underlying text layer to the image-like PDF. Do note that this is not the same as simply producing text output, which results in a plain text document that is probably quite different in layout to the original PDF; see below for more detail.

 

3. Searchable PDFs:

A searchable PDF is the result of applying Optical Character Recognition (OCR) to a non-readable or image-like PDF. During the text recognition process, the software analyses and 'reads' the characters and document structure. This results in the PDF file having two layers: one containing the image and a second containing the recognised text, which can be searched, annotated and copied/pasted just as in a real PDF. Such PDF files are almost indistinguishable from the original documents. The gold standard is being able to convert PDF to text on the fly, in-application, when you need to.

 

Casedo can do that for you; take a look at this article to find out more.
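As a recap, the three everyday PDF types described above can be modelled in a short Python sketch. It is a deliberately simplified illustration rather than a description of the PDF file format: the `Page` fields, the `classify` rule and the `add_text_layer` helper are all assumptions made for this example, and 'mixed' PDFs that combine scanned and digitally created pages are not treated separately.

```python
from dataclasses import dataclass, replace
from enum import Enum
from typing import List


class PdfType(Enum):
    REAL = "real"
    SCANNED = "scanned"
    SEARCHABLE = "searchable"


@dataclass(frozen=True)
class Page:
    is_page_image: bool  # True if the page is a scan or photograph
    text: str = ""       # text layer; empty if the page has none


def classify(pages: List[Page]) -> PdfType:
    """Classify a document as one of the three everyday PDF types."""
    if not pages:
        raise ValueError("a document needs at least one page")
    image_pages = [page for page in pages if page.is_page_image]
    if not image_pages:
        return PdfType.REAL        # created digitally, text is native
    if all(page.text for page in image_pages):
        return PdfType.SEARCHABLE  # image layer plus recognised text layer
    return PdfType.SCANNED         # image only, no text layer


def add_text_layer(page: Page, recognised_text: str) -> Page:
    """Return a copy of the page with OCR output attached as a text layer."""
    return replace(page, text=recognised_text)
```

In this model, running OCR does not replace the page image. It attaches recognised text alongside it, which is why a scanned PDF becomes a searchable PDF rather than a real one.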

 


 

UPDATED: 2022.11.04