Eight apparently, or is it three? PDFs are ubiquitous these days, and yet, like the internet, they haven’t been around for long. The PDF first appeared in 1993 and for most people it is now the de facto way to share digital documents. For those of us using PDFs, or building products that use them, it’s worth knowing that the humble PDF is not humble at all: there is a whole range of them, each conforming to a given standard.
This ‘range’ can be categorised in roughly two ways: Technical and Everyday. Technically, PDFs are governed by ISO standards, with separate standards for different business sectors, for archiving, for engineering and for printing. There are point releases (have you heard of PDF 2.0?) and subsets (surely you know PDF/VT?). Like any good ISO standard, none of these impinges on our daily life, yet they are the hidden backbone to it.
Of more interest to most of us are the types of PDF in everyday parlance, which are much simpler to grasp. Depending on the way the file originated, PDF documents can be categorised into three types. How the PDF was originally created determines whether its content (text, images, tables) can be accessed or whether it is “locked” in an image of the page.
Everyday Types of PDF:
- Real PDFs
- Scanned PDFs
- Searchable PDFs
1. Real PDFs:
Real PDFs, also known as digitally created PDFs, are ideal for most applications: users can mark up, annotate, search and copy/paste without any extra step. They can easily be created in-app or via the “print” function. These PDFs are searchable by default, and content such as text and images can be copied and pasted into other file formats.
Both the meta-information and the characters in the text hold an electronic character designation, meaning each character is stored as a character rather than as a picture of one. With PDF editors and other document readers you can search through these PDFs and edit, select or delete any of the content they hold, unless the document itself has been password protected.
2. Scanned PDFs:
Scanned PDFs are just an image of the actual text, so the content is “locked” in a snapshot-like image. This is the same as converting a camera image, a screenshot, a JPG or a TIFF into a PDF. These image-only PDF files are not searchable, and their text usually cannot be modified or easily marked up. This is because they are scanned or photographed images of the pages, and thus have no underlying text layer.
These image-only PDFs can, however, be converted from non-readable into readable text by an Optical Character Recognition (OCR) engine, which adds an underlying text layer to the image-like PDF. It should be noted that this is not the same as simply producing text output, which results in a text document that is probably quite different in layout from the original PDF; see below for more detail.
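The difference between a text layer and plain text output can be sketched in Python. This is a toy model under stated assumptions, not how any particular OCR engine or PDF library represents its data: the `Word` class, its coordinate fields, both helper functions and the sample words are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """One recognised word plus where it sits on the page image (hypothetical model)."""
    text: str
    x: float  # left edge, in page units
    y: float  # top edge, in page units


def as_plain_text(words):
    """Plain text output: the words survive, their positions are thrown away."""
    return " ".join(w.text for w in words)


def find_on_page(words, query):
    """Text-layer search: return the position of every word matching the query."""
    q = query.lower()
    return [(w.x, w.y) for w in words if w.text.lower() == q]


# Assumed OCR result for a small scanned page: each word keeps its coordinates.
ocr_words = [
    Word("Invoice", 72.0, 60.0),
    Word("Total", 72.0, 700.0),
    Word("Total", 400.0, 720.0),
]
```

In this model, `find_on_page(ocr_words, "total")` returns the two positions at which a viewer could highlight the match on top of the scanned image, whereas `as_plain_text(ocr_words)` keeps the words but none of the layout.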
3. Searchable PDFs:
A searchable PDF is the result of applying Optical Character Recognition (OCR) to a non-readable, image-like PDF. During the text recognition process, characters and the document structure are analysed and “read”. This results in a PDF file with two layers: one containing the image and a second containing the recognised text, which can be searched, annotated, marked up and copy/pasted just as it can in a real PDF. Such PDF files are almost indistinguishable from the original documents.
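The three everyday types come down to two questions about each page: is there selectable text, and is there a scan-like image of the whole page? A minimal Python sketch of that decision follows. It assumes some PDF library has already answered both questions per page; `PageSummary`, `classify_page` and `classify_pdf` are hypothetical names, and the “mixed” and “empty” labels are additions for documents that do not fit a single type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageSummary:
    """Hypothetical per-page facts, assumed to come from a PDF library."""
    has_text: bool        # selectable text is present on the page
    has_page_image: bool  # a scan-like image covers the page


def classify_page(page):
    """Map one page onto the everyday types described in this article."""
    if page.has_text and page.has_page_image:
        return "searchable"  # image layer plus recognised text layer
    if page.has_text:
        return "real"        # digitally created, text only
    if page.has_page_image:
        return "scanned"     # image only, no text layer
    return "empty"


def classify_pdf(pages):
    """Classify a whole document; pages of different types make it 'mixed'."""
    kinds = {classify_page(p) for p in pages} - {"empty"}
    if not kinds:
        return "empty"
    if len(kinds) == 1:
        return kinds.pop()
    return "mixed"
```

For example, `classify_pdf([PageSummary(True, False), PageSummary(False, True)])` gives `"mixed"`, because one page is digitally created and the other is a bare scan. This is only a heuristic: a real PDF may also contain ordinary embedded pictures, which is why the sketch asks specifically for an image that covers the page.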
Since version 1.1.0, Casedo has an integrated OCR feature. For more information follow this LINK.
- For another spin on the ‘three types of PDFs’, go to the Abbyy website article HERE
- If you want a technical look at the 8 different actual ‘standards’ that exist for PDFs, Marconet has a good explanation HERE
- Iceni Technology talks, briefly, about ‘mixed’ PDFs HERE
- Over at Investintech.com there is another more technical article which puts PDFs into a historical perspective, HERE