Eight apparently, or is it three? PDFs are ubiquitous these days, and yet, like the internet, they haven’t been around for long. The PDF first appeared in 1993, and for most people it is now the de facto way to share digital documents. For those of us using PDFs, or building products that use them, it’s worth knowing that the humble PDF is not humble at all: there are many PDF types, each defined by a given standard.

 

This range falls roughly into two ways of categorising PDF types: Technical and Everyday. On the technical side, PDFs are governed by ISO standards, with variants for different business sectors, for archiving, for engineering and for printing. There are point releases (have you heard of PDF 2.0?) and subsets (surely you know PDF/VT?). Like any good ISO standard, none of these impinge on our daily life, but they are the hidden backbone to it.

Of more interest to most of us are the PDF types of everyday parlance, which are much simpler to grasp. Depending on how the file originated, there are three main types of PDF document. How the PDF was originally created determines whether its content (text, images, tables) can be accessed or whether it is “locked” in an image of the page.

 

Everyday PDF Types:

  • Real PDFs
  • Scanned PDFs
  • Searchable PDFs

 

1. Real PDFs:

Real PDFs, also known as digitally created PDFs, are ideal for most applications. They allow users to mark up, annotate, search, and copy/paste without any extra step. You can easily create them in-app or via the “print” function. These PDFs are searchable by default, and content such as text and images can be copied and pasted into other file formats.

Both the meta-information and the characters in the text are stored as electronic character codes rather than as pictures of text. With PDF editors and other document readers you can search through these PDFs. You can also edit, select, or delete any of the content they hold, unless the document itself is password protected.

 

2. Scanned PDFs:

Scanned PDFs are just an image of the actual text, so the content is “locked” in a snapshot-like image. This is the same as converting a camera image, a screenshot, a JPG or a TIFF into a PDF. These image-only PDF files are not searchable, and their text usually cannot be modified or easily marked up. This is because they are scanned or photographed images of the pages and so have no underlying text layer.

You can convert these image-only PDFs from non-readable text into readable text using an Optical Character Recognition (OCR) engine. This engine adds an underlying text layer to the image-like PDF. Do note that this is not the same as simply producing text output, which results in a text document that is probably quite different in layout from the original PDF; see below for more detail.

 

3. Searchable PDFs:

A searchable PDF is the result of applying Optical Character Recognition (OCR) to a non-readable, image-like PDF. During the text recognition process, the software analyses and ‘reads’ the characters and document structure. This results in a PDF file with two layers: one containing the page image and a second containing the recognised text, which can be searched, annotated, and copied and pasted just as in a real PDF. Such PDF files are almost indistinguishable from the original documents. The gold standard is being able to convert PDF to text on the fly, in-application, when you need to.
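For those building products around PDFs, the three everyday types can be told apart by two questions per page: is there a page image, and is there a text layer? The following is only a minimal sketch of that idea in Python. The `PageSummary` type and its two fields are hypothetical stand-ins for whatever a PDF library would actually report, and the rule is a rough heuristic: a digitally created page that happens to contain a picture would need a finer check than this.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class PageSummary:
    """Hypothetical per-page facts; a real PDF library would supply these."""
    has_page_image: bool  # assumption: a raster image of the page is present
    has_text_layer: bool  # assumption: extractable characters are present


def classify_page(page: PageSummary) -> str:
    """Map one page onto the three everyday PDF types."""
    if page.has_text_layer and page.has_page_image:
        return "searchable"  # image layer plus recognised-text layer
    if page.has_text_layer:
        return "real"        # digitally created, text only
    if page.has_page_image:
        return "scanned"     # image only, content is "locked"
    return "empty"


def classify_document(pages: Iterable[PageSummary]) -> str:
    """Classify a whole document, reporting 'mixed' when page types differ."""
    kinds = {classify_page(p) for p in pages} - {"empty"}
    if not kinds:
        return "empty"
    if len(kinds) == 1:
        return kinds.pop()
    return "mixed"


if __name__ == "__main__":
    doc = [PageSummary(has_page_image=True, has_text_layer=False)]
    print(classify_document(doc))
```

A document classified as "scanned" here is the kind that would need an OCR pass before it can be searched or copied from.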

 

Casedo can do that for you; take a look at this article to find out more.

 

References:

  • For another spin on the ‘three types of PDFs’, see the article on the ABBYY website
  • For a technical look at the 8 different actual ‘standards’ that exist for PDFs, Marconet has a good explanation
  • Iceni Technology talks, briefly, about ‘mixed’ PDFs
  • Over at Investintech.com there is another, more technical article which puts PDFs into a historical perspective

 

UPDATED: 2022.11.04