Although PDF is a powerful format for text, it comes with potential issues that can make PDF files painful to work with.
I will start with the strengths of PDF files when it comes to text: how well the format works with different types of fonts, how it uses embedded objects for better text support, and how flags and encoding rules can provide better control.
I will follow with some of the issues faced with text inside PDFs. These include the lack of enforcement of the encoding rules, text stored as images that cannot be searched, and the lack of a standard structure and layout by default.
Before I go through the good points and potential issues, you can find general information on text in PDFs in the following articles:
- PDF Text – An Overview
- PDF to HTML5 Conversion – Extracting PDF Text and mapping Glyphs
- Understanding the PDF File Format – PDF Text Extraction with java
The good points:
One strength of the PDF file format is its versatility: it supports many different data types, including searchable text. Text itself comes in many types and forms, and PDF is flexible in how it supports them. There is a lot to consider on the technical side of text; for example, handling the copyright symbol (©).
The PDF format makes this easy through the use of map tables. An extracted value can be mapped to a display value, so a character code stored in the file can be shown as the correct symbol. You can find more on this in the article linked here.
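To make the idea of a map table more concrete, here is a minimal sketch in Java. The class `GlyphMapSketch` and its methods are hypothetical names invented for this illustration and do not come from any PDF library. It assumes one character code per glyph and simply looks each code up in a table, falling back to the Unicode replacement character when a code has no mapping.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of a map table; not a real PDF library API.
final class GlyphMapSketch {

    // Maps raw character codes (as stored in the file) to display text.
    private final Map<Integer, String> codeToText = new HashMap<>();

    // Registers the display text for one character code.
    void put(int code, String text) {
        codeToText.put(code, text);
    }

    // Decodes a sequence of codes; unmapped codes become U+FFFD.
    String decode(int[] codes) {
        StringBuilder out = new StringBuilder();
        for (int code : codes) {
            out.append(codeToText.getOrDefault(code, "\uFFFD"));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        GlyphMapSketch table = new GlyphMapSketch();
        table.put(0x01, "\u00A9"); // code 1 displays as the copyright symbol
        table.put(0x02, " ");
        table.put(0x03, "fi");     // one glyph can map to several characters
        System.out.println(table.decode(new int[] {0x01, 0x02, 0x03}));
    }
}
```

The point of the sketch is that the codes in the file do not need to mean anything on their own; the table is what gives them a display value.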
The main purpose of PDF (Portable Document Format) files is to make documents portable. One of the ways Adobe has made PDFs portable is through support for embedded objects. Embedded objects allow useful data, such as fonts, to be included in the PDF file so that users do not need to download anything (apart from a PDF viewer) to read it. This gives PDF files better support for all kinds of fonts, including fonts for right-to-left scripts and fonts containing unusual glyphs. PDF viewers and creators handle it all for the user.
Flags and rules can be set for text in PDF files, which gives better control. They can specify rules or conditions that the PDF viewer must follow in order to display text in the desired format. They can also control the encoding of the file, which affects how the file is decoded, including the time taken to decode it and the final output. For example, a PDF can have a flag to say whether the document is tagged or not. This tells software whether the text carries structure information, which can make it easier to extract.
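As a rough illustration of the tagged flag, the sketch below assumes the document catalog has already been parsed into a plain `Map` (a simplification for this example; real libraries use their own object types). In the PDF specification, the catalog can hold a `MarkInfo` dictionary whose `Marked` entry is a boolean. The class and method names are hypothetical.

```java
import java.util.Map;

// Hypothetical sketch: the catalog is modelled as a plain Map for illustration.
final class TaggedFlagSketch {

    // Returns true only when the catalog has a MarkInfo dictionary with Marked = true.
    static boolean isTagged(Map<String, Object> catalog) {
        Object markInfo = catalog.get("MarkInfo");
        if (!(markInfo instanceof Map)) {
            return false; // no MarkInfo dictionary, so treat the file as untagged
        }
        Object marked = ((Map<?, ?>) markInfo).get("Marked");
        return Boolean.TRUE.equals(marked);
    }

    public static void main(String[] args) {
        Map<String, Object> tagged =
                Map.of("Type", "Catalog", "MarkInfo", Map.of("Marked", true));
        Map<String, Object> untagged = Map.of("Type", "Catalog");
        System.out.println(isTagged(tagged));
        System.out.println(isTagged(untagged));
    }
}
```

Note that a flag like this is only a claim made by the file; it does not guarantee that the structure information is complete or useful.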
The potential issues:
A problem with text in PDF files is that the encoding rules are not enforced. This means that a PDF file may look perfect on screen but contain garbage in terms of text content. Without enforcement of the rules, PDF files are prone to corrupt content, unexpected outcomes and errors. More information on the encoding of text in PDFs can be found in the article linked here.
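Because nothing stops a file from containing garbage text, it can help to sanity-check extracted text before trusting it. The sketch below is one possible heuristic, not a standard technique: it counts replacement characters, private-use and unassigned code points, and non-whitespace control characters, and reports them as a fraction of the text. The class name, method names and the choice of what counts as suspicious are all assumptions made for this example.

```java
// Hypothetical heuristic for spotting garbage in extracted text.
final class GarbageTextSketch {

    // Fraction of code points that look suspicious (0.0 for empty text).
    static double suspiciousRatio(String text) {
        if (text.isEmpty()) {
            return 0.0;
        }
        int suspicious = 0;
        int total = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            total++;
            int type = Character.getType(cp);
            boolean control = type == Character.CONTROL && !Character.isWhitespace(cp);
            if (cp == 0xFFFD
                    || type == Character.PRIVATE_USE
                    || type == Character.UNASSIGNED
                    || control) {
                suspicious++;
            }
        }
        return (double) suspicious / total;
    }

    // True when the suspicious fraction is above the caller's threshold.
    static boolean looksLikeGarbage(String text, double threshold) {
        return suspiciousRatio(text) > threshold;
    }

    public static void main(String[] args) {
        System.out.println(looksLikeGarbage("Plain readable text", 0.3));
        System.out.println(looksLikeGarbage("\uE000\uE001\uE002a", 0.3));
    }
}
```

A check like this cannot catch text that is wrong but still made of ordinary letters, so it only reduces the problem rather than solving it.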
Some PDF files display text without actually including any searchable text. This generally occurs when the pages consist of images; text in images cannot be recognised as text without OCR (optical character recognition). Text in the form of images is usually found in scanned documents that have been converted into PDF files, which is what a lot of PDF files are. Extracting text from images is a completely different problem and requires a separate tool. It is a common problem that cannot be handled by PDF software alone.
PDF files do not have a fixed structure for their content that they must follow. A file can be laid out in any manner the creator desires and may not have any logical structure at all. This brings inconsistency; there is no expectation of how the text in a PDF is to be structured. This is frustrating for users, and for developers especially. Without a set of rules to follow, text in PDFs can be difficult to handle.
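To show why missing structure matters, here is a minimal sketch of how software might guess the reading order when all it has is pieces of text and their positions. It assumes PDF-style coordinates where y grows upwards, groups fragments whose y values are within a tolerance into the same line, and orders each line by x. The `Fragment` type, the method name and the tolerance value are hypothetical, and real documents with columns or rotated text need far more than this.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: rebuild lines of text from positioned fragments.
final class ReadingOrderSketch {

    // A piece of text and the position where it is drawn on the page.
    static final class Fragment {
        final double x;
        final double y;
        final String text;

        Fragment(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    static List<String> toLines(List<Fragment> fragments, double yTolerance) {
        // Top of the page first: larger y values come earlier.
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y).reversed());

        // Group fragments into lines by comparing with the first fragment of the current line.
        List<List<Fragment>> lines = new ArrayList<>();
        for (Fragment f : sorted) {
            List<Fragment> last = lines.isEmpty() ? null : lines.get(lines.size() - 1);
            if (last != null && Math.abs(last.get(0).y - f.y) <= yTolerance) {
                last.add(f);
            } else {
                List<Fragment> line = new ArrayList<>();
                line.add(f);
                lines.add(line);
            }
        }

        // Within each line, read left to right and join with single spaces.
        List<String> result = new ArrayList<>();
        for (List<Fragment> line : lines) {
            line.sort(Comparator.comparingDouble((Fragment f) -> f.x));
            StringBuilder sb = new StringBuilder();
            for (Fragment f : line) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(f.text);
            }
            result.add(sb.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        List<Fragment> page = new ArrayList<>();
        page.add(new Fragment(60, 700.0, "world"));
        page.add(new Fragment(10, 680.0, "Second"));
        page.add(new Fragment(10, 700.5, "Hello"));
        System.out.println(toLines(page, 2.0));
    }
}
```

The fragments in the example are deliberately stored out of order, which is allowed in a PDF; the order in the file says nothing reliable about the order a person would read them in.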
In my next article, I will be talking about structured text in PDF files.
IDRsolutions develop a Java PDF library, a PDF forms to HTML5 converter, a PDF to HTML5 or SVG converter and a Java Image Library that doubles as an ImageIO replacement. On the blog our team post about anything interesting they learn about.