Mark Stephens Mark has been working with Java and PDF since 1999 and is a big NetBeans fan. He enjoys speaking at conferences. He has an MA in Medieval History and a passion for reading.

How is text stored in a PDF file?

Text in a PDF file is defined by a Font object and a set of text-showing commands such as TJ. You will see something like this in the content stream:

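/F1 12 Tf

[(Text)] TJ

PDF commands are written with their operands first and the operator last. The Tf command specifies that we are using the Font object defined as F1 in the Resources object, here at a size of 12. This Font object defines all the font details (name of font, width of glyphs, encoding).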
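To make the relationship between the two commands concrete, here is a minimal sketch in Python that walks a content stream and pairs each run of text with the font selected before it. It is a toy, not a real PDF parser: it assumes operands come before their operator (as in `/F1 12 Tf`), and it ignores escapes, nested parentheses, hex strings and compressed streams. The name `text_runs` is made up for this illustration.

```python
import re

# Toy tokenizer: names (/F1), literal strings without escapes or nested
# parentheses, array brackets, numbers, and operator keywords.
TOKEN = re.compile(r"/[^\s\[\]()/]+|\([^()\\]*\)|[\[\]]|[-+]?\d*\.?\d+|[A-Za-z*]+")
OPERATOR = re.compile(r"[A-Za-z*]+")


def text_runs(stream):
    """Yield (font_name, raw_string) for each Tj/TJ operator in the stream."""
    operands = []
    font = None
    for tok in TOKEN.findall(stream):
        if tok == "Tf":
            # Operands are "/FontName size", so the name is second from last.
            font = operands[-2][1:]
            operands = []
        elif tok in ("Tj", "TJ"):
            # Keep only the string operands; numbers in a TJ array are
            # spacing adjustments, not glyph codes.
            strings = [t[1:-1] for t in operands if t.startswith("(")]
            yield font, "".join(strings)
            operands = []
        elif OPERATOR.fullmatch(tok):
            operands = []  # some other operator: discard its operands
        else:
            operands.append(tok)


runs = list(text_runs("BT /F1 12 Tf [(Text)] TJ ET"))
print(runs)
```

The strings that come back are still glyph codes, not text. Turning them into characters needs the encoding from the Font object.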

The TJ command lists the glyph codes to show. These are not text values. Each one is an index which identifies a glyph in the font. It so happens that in the most common form of encoding (WinAnsiEncoding, or WIN for short), these values are the same as the ASCII values of the text, but they are still not text. That is how you can tell that the text in my example above must be WIN encoded.
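As a small illustration of why WIN-encoded codes look like text, the sketch below decodes glyph codes using Python's built-in `cp1252` codec. That codec is only a stand-in: it is close to WinAnsiEncoding but not identical for every code, so treat this as an approximation. The function name `decode_winansi` is invented for the example.

```python
def decode_winansi(codes: bytes) -> str:
    """Map glyph codes to characters, assuming a WinAnsi-like encoding.

    cp1252 is used here as an approximation of WinAnsiEncoding; codes
    that cp1252 leaves undefined become U+FFFD.
    """
    return codes.decode("cp1252", errors="replace")


# The codes for "Text" coincide with ASCII...
ascii_like = decode_winansi(bytes([0x54, 0x65, 0x78, 0x74]))

# ...but a code above 127 shows that these are indexes, not ASCII text:
# 0x93 selects the left double quotation mark glyph.
curly_quote = decode_winansi(bytes([0x93]))
```

With a different encoding on the Font object, the same code values would select different glyphs.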

The Font's Encoding object specifies how to translate each value into a character. The standard encodings are defined in Appendix D of the PDF Reference, or you can create your own custom encoding (a /Differences object). A font can also specify an alternative Unicode value to be used for text extraction (a ToUnicode table). A common use of this is for ligatures such as fl, which look better on the screen as one glyph but which we want to extract as 2 characters.
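The lookup order described above can be sketched as follows. This is an illustration under simplifying assumptions, not how any particular library does it: the encoding is a plain dict from code to glyph name, the /Differences array is a Python list of numbers and names, and the function names (`apply_differences`, `extract`) and the tiny tables are invented for the example.

```python
def apply_differences(base, differences):
    """Return a copy of base (code -> glyph name) patched by a /Differences list.

    The list has the shape [code, name, name, ..., code, name, ...]:
    a number sets the current code and each following name takes the
    next successive code.
    """
    table = dict(base)
    code = None
    for item in differences:
        if isinstance(item, int):
            code = item
        else:
            table[code] = item
            code += 1
    return table


def extract(codes, encoding, glyph_to_unicode, to_unicode=None):
    """Turn glyph codes into text, preferring extraction overrides if present."""
    to_unicode = to_unicode or {}
    out = []
    for c in codes:
        if c in to_unicode:
            out.append(to_unicode[c])  # explicit extraction value wins
        else:
            out.append(glyph_to_unicode.get(encoding.get(c), "\ufffd"))
    return "".join(out)


# Hypothetical miniature tables, just enough for the word "flat".
base = {0x61: "a", 0x74: "t"}
encoding = apply_differences(base, [31, "fl"])       # code 31 is the fl ligature
glyph_to_unicode = {"a": "a", "t": "t", "fl": "\ufb02"}

codes = bytes([31, 0x61, 0x74])
as_drawn = extract(codes, encoding, glyph_to_unicode)               # one ligature character
as_extracted = extract(codes, encoding, glyph_to_unicode, {31: "fl"})  # two letters
```

Without the override the ligature comes out as the single character U+FB02. With it, extraction gives the plain word "flat", which is friendlier for search and copy and paste.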

You can generally use the extraction values as Unicode.

This post was written in response to a request about how PDF text extraction works. If you have a specific PDF question, please feel free to let us know and we will try to make a blog post to answer it.


