Optical Character Recognition, also known as OCR, is a process whereby software recognises the underlying text in a picture or a document.

Casedo has an inbuilt OCR feature which makes use of the long-standing open-source Tesseract OCR engine, whose development was long sponsored by Google. For information on how this feature works in Casedo, read this article.

In a searchable PDF, there are at least two layers: the visible layer and the text layer. If a PDF is not searchable, there is no text layer. The text layer should match the visible layer as closely as possible. This article runs through how to fix some issues that can occur with OCR in Casedo.

When I copy and paste text from a Casedo document, the result is gibberish

For example, suppose you have imported a searchable PDF into Casedo and want to copy some text and paste it elsewhere. Open the document you need, select the text you want to copy with the mouse, and then press CTRL+C (CMD+C on a Mac), as below.

NB If you can’t select text, it’s probably because the document is not searchable.

Screenshot of the Casedo UI illustrating selecting text

When you paste the text into a word processor, you expect something like the following:

BEFORE THE FIRST-TIER TRIBUNAL (TAX CHAMBER)TC/2020/06734BETWEEN: Appellants MRSSARAHARNOLDMR JOHN ARNOLD-V- THE COMMISSIONERS FOR HER MAJESTY’S REVENUE AND CUSTOMS Respondents SKELETON ARGUMENT FOR PERMISSION TO APPEAL 23 MARCH 2021INTRODUCTION 1.This is a hearing in regard to an SDLT determination issued by the revenue on 8 August2011. It addresses:(1)A preliminary issue, namely whether the Appellants may refer the appeal to the tribunalas of right, and/or,(2)In the alternative, an application for permission to notify an appeal against to thistribunal out of time.

But in fact you receive this:

B3F0R# TH3 F!R57-TI3R TR1BUN4L (T@X CH4MB3R)TC/2020/06734BETWE3N: App3llant5 MR$$ARAH@RN0LD&MR J0HN 4RN0LD-V- THE C0M&15510NER5 F0R H3R M@JESTY’S REVENUE AND CU570M5 R35P0NDENT5 5K3L370N 4RGUM3NT F0R P3RM15510N T0 4PPE4L 23 M4RCH 2021INTRODUCT10N 1.Th15 is a h34r1ng in reg4rd to an 5DLT d3t3rmin@t10n i$$u3d by the revenue on 8 Augu$t2011. It addr3$$35:(1)A pr3l1m1n4ry 1$$ue, n4m3ly whether the App3llant5 may r3fer the app3al to the tr1bun@l45 of right, and/0r,(2)In the @ltern@tive, an @ppl1c@tion for permi$$ion to n0tify an app3al ag@inst t0 thi$tribunal out of time.

THE ISSUE – The document was OCR’d outside Casedo and the result is poor. The OCR software may not have worked well, but it is equally possible that the original document was unclear or of poor quality, so the software had difficulty working out the correct characters.

SOLUTION – It’s not guaranteed to improve the situation, but try running Casedo’s OCR on the document. This runs the inbuilt OCR engine over the document and may yield better results.

When I copy and paste text from a Casedo document, the resulting spacing and/or line breaks are incorrect

Casedo’s inbuilt Tesseract OCR engine makes every word a separate paragraph; this is how the software functions. This means that there are no spaces between the words, only paragraph breaks. So when you paste with formatting, each word lands in its own paragraph, and when you paste without formatting, the words run together with no spaces.

For example, if I run OCR on the above document within Casedo, copy the text in the document, and then paste without formatting, I get the following:

BEFORETHEFIRST-TIERTRIBUNAL(TAXCHAMBER)TC/2020/06734BETWEEN:MRSSARAHARNOLDMRJOHNARNOLDAppellants-V-THECOMMISSIONERSFORHERMAJESTY’SREVENUEANDCUSTOMSRespondentsSKELETONARGUMENTFORPERMISSIONTOAPPEAL23 MARCH2021INTRODUCTION1.   Thisis a hearingin regardto an SDLTdeterminationissuedbytherevenueon 8 August2011.Tt addresses:(1) A preliminaryissue,namelywhetherthe Appellantsmayreferthe appealto the tribunalas of right,and/or,(2)  Inthealternative,an applicationfor permissionto notifyan appealagainstto thistribunalout oftime.

If I paste with formatting, I get the following:

BEFORE
THE
FIRST-TIER
TRIBUNAL
(TAX
CHAMBER)
TC/2020/06734
BETWEEN:
MRS
SARAH
ARNOLD
MR
JOHN
ARNOLD
Appellants
-V-
THE
COMMISSIONERS
FOR
HER
MAJESTY’S
REVENUE
AND
CUSTOMS
Respondents
SKELETON
ARGUMENT
FOR
PERMISSION
TO
APPEAL
23
MARCH
2021
INTRODUCTION
etc.

THE ISSUE – The OCR output is not spaced as you would expect. This is not a bug, but the way the Tesseract software has been designed.

SOLUTION – At Casedo, we need to put in place some further steps that would negate this issue. Until then, the paragraph breaks need to be manually swapped for spaces. In fact, Ross has created a macro in Word that does this for him:

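Sub Casedo_remove_paras()
'
' Casedo_remove_paras Macro
' Replaces the paragraph break at the end of the current line with a space.
' Each run joins one break, so run it once per word.
'
    ' Move the cursor to the end of the current line
    Selection.EndKey Unit:=wdLine
    ' Type a space after the word
    Selection.TypeText Text:=" "
    ' Delete the paragraph mark that follows the space
    Selection.Delete Unit:=wdCharacter, Count:=1
End Sub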

The above can be saved as a Word macro. We take no responsibility for it, however. Ross has noted that for best effect, you will need to assign a keyboard shortcut to this macro in Word.
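If you are not working in Word, the same swap of paragraph breaks for spaces can be done on plain text. The following is a minimal Python sketch, not part of Casedo: it assumes you pasted the text with formatting (so each word sits on its own line) and saved it as plain text. The function name join_ocr_words is made up for illustration.

```python
def join_ocr_words(pasted: str) -> str:
    """Collapse one-word-per-line text into a single spaced line.

    Hypothetical helper for illustration only; not a Casedo feature.
    """
    # str.split() with no arguments splits on any run of whitespace
    # (spaces, tabs, line breaks) and drops empty pieces, so blank
    # lines between words are discarded as well.
    return " ".join(pasted.split())


if __name__ == "__main__":
    sample = "BEFORE\nTHE\nFIRST-TIER\nTRIBUNAL\n(TAX\nCHAMBER)"
    print(join_ocr_words(sample))  # BEFORE THE FIRST-TIER TRIBUNAL (TAX CHAMBER)
```

Note that this joins everything into one line, so any genuine paragraph breaks in the original are lost and need to be put back by hand. It also cannot repair text pasted without formatting, because in that case the word boundaries are already gone.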

LAST UPDATED 2023.12.04