Understanding the PDF File Format – PDF Text Extraction with Java

This post was written in response to a request about how PDF text extraction works. If you have a specific PDF question, please feel free to let us know and we will try to make a blog post to answer it.

Text is defined in PDF files by a Font object and a set of TJ commands. So you will see something like this in the content stream (operands come before the operator, and the 12 here is just an illustrative font size).

/F1 12 Tf

[(Text)] TJ

The Tf command specifies that we are using the Font object defined as F1 in the Resources object. This object defines all the font details (name of font, width of glyphs, encoding).

The TJ command lists the glyph codes to use. This is not a text value: it contains an index which identifies the glyph. It so happens that in the most common form of encoding (WinAnsiEncoding, listed as WIN in the PDF Reference), these values are the same as the ASCII values for the basic Latin characters, but it is still not text. That is why my example above is readable as Text: the font is presumably WIN encoded, or uses another ASCII-compatible encoding.
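To make the point that these are codes rather than text, here is a minimal sketch of a lookup-table decoder. The class name GlyphCodeDecoder and its table are hypothetical and only fill in a handful of entries in the spirit of the WIN encoding; a real decoder would load the full table for whichever encoding the font declares.

```java
import java.util.Arrays;

public class GlyphCodeDecoder {

    // Partial code-to-character table (assumption: a small WIN-like subset).
    private static final char[] TABLE = buildTable();

    private static char[] buildTable() {
        char[] table = new char[256];
        Arrays.fill(table, '\uFFFD');      // codes we have not mapped
        for (int code = 32; code <= 126; code++) {
            table[code] = (char) code;     // printable range matches ASCII
        }
        table[0x80] = '\u20AC';            // Euro sign
        table[0x93] = '\u201C';            // left double quotation mark
        table[0x94] = '\u201D';            // right double quotation mark
        return table;
    }

    // Translate the raw codes from a text-showing operand into characters.
    public static String decode(byte[] codes) {
        StringBuilder sb = new StringBuilder(codes.length);
        for (byte b : codes) {
            sb.append(TABLE[b & 0xFF]);    // mask so the byte is treated as unsigned
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] operand = {0x54, 0x65, 0x78, 0x74}; // the codes behind (Text)
        System.out.println(decode(operand));
    }
}
```

The codes 0x54 0x65 0x78 0x74 only come out as Text because the table maps them that way. With a different table the same bytes would produce different characters.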

The Font encoding object specifies how to translate this value into a character. You can use one of the predefined encodings (these are defined in Appendix D of the PDF Reference) or you can create your own custom encoding (a /Differences array). We can also specify an alternative Unicode value to be used for text extraction. A common use of this is where we have ligatures such as fl, which look better onscreen as one glyph but which we want to extract as 2 characters.
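As a rough illustration of a custom encoding, the sketch below overrides one code with a glyph name (the way a /Differences entry would) and then expands the ligature into two characters on extraction. The class CustomEncodingSketch, its method names and the tiny glyph-name table are assumptions made for this example, not part of any real library.

```java
import java.util.HashMap;
import java.util.Map;

public class CustomEncodingSketch {

    // glyph name -> text to emit on extraction (tiny illustrative subset)
    private static final Map<String, String> GLYPH_TO_TEXT = new HashMap<>();
    static {
        GLYPH_TO_TEXT.put("fi", "fi");       // one glyph, two characters
        GLYPH_TO_TEXT.put("fl", "fl");
        GLYPH_TO_TEXT.put("Euro", "\u20AC");
    }

    // code -> glyph name, as supplied by the custom encoding
    private final Map<Integer, String> differences = new HashMap<>();

    public void addDifference(int code, String glyphName) {
        differences.put(code, glyphName);
    }

    public String decode(int[] codes) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            String glyphName = differences.get(code);
            if (glyphName != null && GLYPH_TO_TEXT.containsKey(glyphName)) {
                sb.append(GLYPH_TO_TEXT.get(glyphName)); // overridden code
            } else if (glyphName == null && code >= 32 && code <= 126) {
                sb.append((char) code);                  // fall back to the ASCII-like base
            } else {
                sb.append('\uFFFD');                     // unknown code or glyph name
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        CustomEncodingSketch encoding = new CustomEncodingSketch();
        encoding.addDifference(2, "fl");                 // code 2 now means the fl ligature
        System.out.println(encoding.decode(new int[] {2, 'o', 'w'}));
    }
}
```

Here three codes are drawn as three glyphs on the page, but the extracted text is the four-character word flow.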

Generally, these values are Unicode, which works well with Java, as Java strings are also Unicode (stored as UTF-16). When we write the text out, though, we have a choice. We can write it out as UTF-16, but we then need at least 2 bytes for each character, even for languages which only need 1 byte per character in a single-byte encoding, such as most European languages. So we use the following compromise, which covers most cases.
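The size trade-off is easy to check for yourself. The following sketch (the class name EncodingSizeDemo is made up for this post) counts the bytes needed for the same string in a few encodings. It uses UTF-16BE rather than plain UTF-16 so that the byte order mark does not inflate the count.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingSizeDemo {

    // Number of bytes needed to store the text in the given charset.
    public static int byteCount(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String european = "Caf\u00E9";       // 4 characters, one of them accented
        String chinese = "\u4E2D\u6587";     // 2 Chinese characters

        System.out.println(byteCount(european, StandardCharsets.UTF_16BE));   // 8
        System.out.println(byteCount(european, StandardCharsets.UTF_8));      // 5
        System.out.println(byteCount(european, StandardCharsets.ISO_8859_1)); // 4

        System.out.println(byteCount(chinese, StandardCharsets.UTF_16BE));    // 4
        System.out.println(byteCount(chinese, StandardCharsets.UTF_8));       // 6
    }
}
```

For mostly Latin text the 2-byte form roughly doubles the output, while for Chinese text it is actually the more compact choice.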

If we are writing out XML, we use UTF-8, which is the usual encoding for XML. Otherwise we use the encoding of the machine being used for extraction. You can find this easily with:

String encoding = System.getProperty("file.encoding");

Then write out the text in that encoding. The only slight issue here is that if you extract Chinese text on a machine whose default encoding cannot represent it, you might not get the right values. But you can always edit our examples to use UTF-16 or another encoding, or set the machine encoding to what you need.
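Here is a minimal sketch of that choice and of what goes wrong when the encoding cannot represent the text. The class ExtractedTextWriter and its methods are hypothetical names for this post, not taken from our examples. It uses Charset.defaultCharset() as the machine encoding; note that from Java 18 onwards this defaults to UTF-8 unless file.encoding is set explicitly.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExtractedTextWriter {

    // UTF-8 for XML output, otherwise the default encoding of this JVM.
    public static Charset chooseCharset(boolean writingXml) {
        return writingXml ? StandardCharsets.UTF_8 : Charset.defaultCharset();
    }

    // Write the text through a Writer and return the bytes that were produced.
    public static byte[] encode(String text, Charset charset) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(buffer, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();     // writer is closed and flushed by now
    }

    public static void main(String[] args) {
        String chinese = "\u4E2D\u6587";

        // UTF-8 can represent the text: 3 bytes per character here.
        System.out.println(encode(chinese, chooseCharset(true)).length);

        // ISO-8859-1 cannot: each character is silently replaced with '?'.
        byte[] lossy = encode(chinese, StandardCharsets.ISO_8859_1);
        System.out.println(new String(lossy, StandardCharsets.ISO_8859_1));
    }
}
```

The second call prints ?? rather than failing, which is why a wrong output encoding can go unnoticed until someone reads the extracted text.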

So Java and PDF work well together for handling PDF text. The only real issue is which encoding you choose to write the text out in, and this will depend on your requirements.

This post is part of our “Understanding the PDF File Format” series. In each article, we discuss a PDF feature, bug, gotcha or tip.

Mark Stephens

System Architect and Lead Developer at IDRSolutions
Mark Stephens has been working with Java and PDF since 1999 and has diversified into HTML5, SVG and JavaFX. He also enjoys speaking at conferences and has been a Speaker at user groups, Business of Software, Seybold and JavaOne conferences. He has a very dry sense of humor and an MA in Medieval History for which he has not yet found a practical use.