This post was written in response to a request about how PDF text extraction works. If you have a specific PDF question, please feel free to let us know and we will try to make a blog post to answer it.
Text is defined in PDF files by a Font object and a set of TJ commands. So in the command stream you will see a Tf command that selects a font, followed by one or more TJ commands that draw the text.
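As a rough illustration, the command stream might contain a fragment such as BT /F1 12 Tf [(Hello World)] TJ ET. That fragment is a hand-written assumption rather than output from a real file. The Java sketch below uses two simple regular expressions to pick out the font name and the string, so it ignores escapes, hex strings and most of the real syntax. The class and method names are ours, for illustration only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: a hand-written content stream fragment, not taken from a real PDF.
class ContentStreamSketch {

    // Assumed sample: select font F1 at 12pt, then show one string with TJ.
    static final String SAMPLE = "BT /F1 12 Tf [(Hello World)] TJ ET";

    // Matches "/Name size Tf" and captures the font resource name.
    private static final Pattern TF = Pattern.compile("/(\\S+)\\s+[\\d.]+\\s+Tf");

    // Matches one literal string inside a TJ array (no escapes or nesting handled).
    private static final Pattern TJ =
            Pattern.compile("\\[\\s*\\(([^()\\\\]*)\\)\\s*\\]\\s*TJ");

    // Returns the font resource name used by the first Tf command, or null.
    static String fontName(String stream) {
        Matcher m = TF.matcher(stream);
        return m.find() ? m.group(1) : null;
    }

    // Returns the string operand of the first simple TJ command, or null.
    static String shownString(String stream) {
        Matcher m = TJ.matcher(stream);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println("Font resource: " + fontName(SAMPLE));
        System.out.println("TJ operand:    " + shownString(SAMPLE));
    }
}
```

A real extractor needs a proper tokenizer for the command stream. The regular expressions here are only enough to show where the two pieces of information sit.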
The Tf command specifies that we are using the Font object registered under a name such as F1 in the Resources object. The Font object defines all the font details (font name, glyph widths, encoding).
The TJ command lists the glyph numbers to use. These are not text values. Each one is an index which identifies a glyph. It so happens that in the most common form of encoding (WIN), these values are the same as the ASCII values of the text, but they are still not text. So when the values in a TJ command can be read as ordinary text, it is a good sign that the font is WIN encoded.
The Font encoding object specifies how to translate this value into a character. The standard encodings are defined in Appendix D of the PDF Reference specification, or you can create your own custom encoding (a /Differences object). We can also specify an alternative Unicode value to be used for text extraction. A common use of this is for ligatures such as fl, which look better onscreen as one character but which we want to extract as 2 characters.
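To make that lookup chain concrete, here is a minimal Java sketch of a much simplified font model. Everything in it is an assumption for illustration: the class name, the use of code 31 for the ligature, the one-entry glyph table, and the shortcut of treating codes 32 to 126 as ASCII. It is not how any particular library does it.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified, hypothetical model of a font encoding; not a real PDF parser.
class GlyphDecoderSketch {

    // Tiny stand-in for a glyph-name-to-Unicode table (only one entry here).
    private static final Map<String, String> GLYPH_TO_UNICODE = new HashMap<>();
    static {
        GLYPH_TO_UNICODE.put("fl", "\uFB02"); // single-character fl ligature
    }

    // Custom entries in the spirit of /Differences: glyph code -> glyph name.
    private final Map<Integer, String> differences = new HashMap<>();

    // Alternative text to use for extraction: glyph code -> replacement text.
    private final Map<Integer, String> extractionText = new HashMap<>();

    void addDifference(int code, String glyphName) {
        differences.put(code, glyphName);
    }

    void addExtractionText(int code, String text) {
        extractionText.put(code, text);
    }

    // Translate one glyph code into the text we would extract for it.
    String decode(int code) {
        String override = extractionText.get(code);
        if (override != null) {
            return override; // the extraction value wins when one is given
        }
        String glyphName = differences.get(code);
        if (glyphName != null) {
            return GLYPH_TO_UNICODE.getOrDefault(glyphName, "?");
        }
        // Assumption: the base encoding matches ASCII for printable codes.
        if (code >= 32 && code <= 126) {
            return String.valueOf((char) code);
        }
        return "?"; // unknown code in this sketch
    }

    String decode(int[] codes) {
        StringBuilder out = new StringBuilder();
        for (int code : codes) {
            out.append(decode(code));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        int[] codes = {31, 'o', 'w', 'e', 'r'};

        GlyphDecoderSketch plain = new GlyphDecoderSketch();
        plain.addDifference(31, "fl"); // code 31 is an arbitrary choice here
        System.out.println(plain.decode(codes).length()); // 5: ligature is one character

        GlyphDecoderSketch forExtraction = new GlyphDecoderSketch();
        forExtraction.addDifference(31, "fl");
        forExtraction.addExtractionText(31, "fl"); // extract as two characters
        System.out.println(forExtraction.decode(codes)); // flower
    }
}
```

The order of the checks is the design choice that matters: the extraction value is consulted first, then the custom encoding, then the base encoding.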
Generally, these values are Unicode, which works well with Java because Java strings are also Unicode (held as 16-bit values). When we write the text out, though, we have a choice. We can write it out as 16-bit Unicode, but we then require 2 bytes for each character, even for languages which only need 1 byte per character, such as most European languages. So we use the following compromise, which covers most cases.
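The size difference is easy to check in Java. This small sketch, with a sample word and class name of our own choosing, counts the bytes needed for the same text in a two-byte encoding (UTF-16, big-endian, without a byte order mark), a single-byte one (ISO-8859-1) and UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Compares how many bytes the same text needs in different encodings.
class EncodedSizeSketch {

    static int byteCount(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String text = "caf\u00e9"; // "café", four characters

        System.out.println(byteCount(text, StandardCharsets.UTF_16BE));   // 8
        System.out.println(byteCount(text, StandardCharsets.ISO_8859_1)); // 4
        System.out.println(byteCount(text, StandardCharsets.UTF_8));      // 5
    }
}
```

UTF-8 only spends the extra byte on the accented letter, which is part of why it is a reasonable middle ground for mostly European text.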
If we are writing out XML, we use UTF-8, which is the usual encoding for XML. Otherwise we use the encoding of the machine being used for extraction. In Java you can find this easily with a call such as Charset.defaultCharset().
We then write out the text in that format. The only slight issue here is that if you extract Chinese text on a machine not expecting it, you might not get the right values. But you can always edit our examples to use UTF-16 or another encoding, or set the machine encoding to what you need.
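The choice described above can be sketched in a few lines of Java. This is a minimal illustration under our own assumptions, not the code from our examples: the class and method names are hypothetical, and the text is written to an in-memory stream so the result is easy to inspect.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Picks an output encoding and writes extracted text with it.
class OutputEncodingSketch {

    // UTF-8 for XML output, otherwise the charset the JVM reports as its default.
    static Charset chooseCharset(boolean writingXml) {
        return writingXml ? StandardCharsets.UTF_8 : Charset.defaultCharset();
    }

    // Writes the text through a Writer using the given charset and returns the bytes.
    static byte[] write(String text, Charset charset) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(bytes, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println("Default charset: " + Charset.defaultCharset());
        System.out.println("file.encoding:   " + System.getProperty("file.encoding"));

        byte[] xmlBytes = write("caf\u00e9", chooseCharset(true));
        System.out.println("XML output uses " + xmlBytes.length + " bytes"); // 5
    }
}
```

To force a specific encoding, pass something like StandardCharsets.UTF_16 to write instead of the chosen default. Note that on recent JDKs (18 onwards, to the best of our knowledge) the default charset is UTF-8 unless file.encoding is set otherwise, so it may not reflect the machine setting.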
So Java and PDF work well together for handling PDF text. The only real issue is which encoding you choose to write the text out in, and this will obviously depend on your requirements.
This post is part of our “Understanding the PDF File Format” series. In each article, we discuss a PDF feature, bug, gotcha or tip.