Mark Stephens Mark has been working with Java and PDF since 1999 and is a big NetBeans fan. He enjoys speaking at conferences. He has an MA in Medieval History and a passion for reading.

How is text stored in a PDF file?

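Text is defined in a PDF file by a Font object and a set of text-showing commands such as TJ. PDF writes the operands before the operator, and Tf takes a font size as well as a font name (12 here, as an example). So you will see something like this in the content stream.

/F1 12 Tf

[(Text)] TJ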

The Tf command specifies that we are using the Font object defined as F1 in the Resources object. The Font object defines all the font details (font name, glyph widths, encoding).
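To make the operand and operator pairing concrete, here is a minimal Python sketch that groups each operator with the operands written before it, which is the order a real content stream uses (for example /F1 12 Tf). It is not how any particular PDF library works. The names read_operations and TEXT_OPERATORS are made up for this illustration, and the tokenizer assumes strings contain no nested parentheses or backslash escapes.

```python
import re

# Operators handled by this sketch; real content streams have many more.
TEXT_OPERATORS = {"Tf", "Tj", "TJ"}

# A token is a literal string "(...)", an array "[...]", or a run of
# non-space characters.
# Assumption: no nested parentheses or backslash escapes inside strings.
TOKEN = re.compile(r"\([^)]*\)|\[[^\]]*\]|\S+")


def read_operations(stream):
    """Group operands with the operator that follows them (postfix order)."""
    operations = []
    operands = []
    for token in TOKEN.findall(stream):
        if token in TEXT_OPERATORS:
            operations.append((token, operands))
            operands = []
        else:
            operands.append(token)
    return operations


sample = "/F1 12 Tf [(Text)] TJ"
for operator, operands in read_operations(sample):
    print(operator, operands)
```

The loop prints one line per operator, with Tf carrying the font name and size and TJ carrying the array of glyph codes.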

The TJ command lists the glyph codes to use. This is not a text value. Each code is an index which identifies a glyph in the font. It so happens that in the most common form of encoding (WinAnsiEncoding, or WIN for short), these codes are the same as the ASCII values for ordinary letters, digits and punctuation, but it is still not text. That is why my example above is readable: its font is presumably WIN encoded.
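A short Python sketch shows why the same codes are only text by coincidence. It assumes Python's cp1252 codec is a close stand-in for WinAnsiEncoding (the two differ in a few positions), and custom_encoding is a made-up table for a hypothetical font whose glyphs are numbered differently.

```python
def decode_codes(codes, table=None):
    """Turn glyph codes into characters using an encoding table.

    Without a table, fall back to cp1252 as an approximation of WinAnsi.
    """
    if table is None:
        return codes.decode("cp1252")
    return "".join(table[code] for code in codes)


# The operand of the text operator is a sequence of codes, not text.
codes = bytes([84, 101, 120, 116])

# With a WinAnsi-like encoding the codes happen to match ASCII.
print(decode_codes(codes))  # Text

# Hypothetical custom encoding: the same codes select different glyphs.
custom_encoding = {84: "P", 101: "D", 120: "F", 116: "!"}
print(decode_codes(codes, custom_encoding))  # PDF!
```

The bytes are identical in both calls. Only the font's encoding decides what they mean.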

The Font's Encoding entry specifies how to translate each code into a character. The standard encodings are defined in Appendix D of the PDF Reference, or you can create your own custom encoding (a /Differences array). We can also specify an alternative Unicode value to be used for text extraction (a ToUnicode table). A common use of this is where we have ligatures such as fl, which look better on the screen as one glyph but which we want to extract as 2 characters.
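The two mechanisms can be sketched in Python as follows. This is an illustration under stated assumptions, not a real parser: apply_differences and extract_text are hypothetical helper names, the base encoding holds only the letters needed here, and the glyph name fl_ligature is invented. The /Differences handling follows the usual layout, where a number sets the starting code and each name after it takes the next code in sequence. In a PDF the extraction override is carried by the font's ToUnicode entry, which is modelled here as a plain dictionary.

```python
def apply_differences(base, differences):
    """Return a copy of base (code -> glyph name) with a Differences list applied.

    An integer sets the next code; each name that follows takes that code,
    and the code then advances by one.
    """
    encoding = dict(base)
    code = None
    for item in differences:
        if isinstance(item, int):
            code = item
        else:
            encoding[code] = item
            code += 1
    return encoding


def extract_text(codes, encoding, to_unicode):
    """Prefer the extraction override for a code, else use the glyph name.

    Simplification: a real extractor maps glyph names to Unicode through a
    glyph list; here single-letter names stand for themselves.
    """
    return "".join(to_unicode.get(c, encoding.get(c, "?")) for c in codes)


# Hypothetical base encoding holding only the letters this example needs.
base = {ord(ch): ch for ch in "ow"}
# Hypothetical custom entry: code 1 draws a single ligature glyph.
encoding = apply_differences(base, [1, "fl_ligature"])
# Extraction override: code 1 should come out as the two characters "fl".
to_unicode = {1: "fl"}

codes = bytes([1, ord("o"), ord("w")])
print(extract_text(codes, encoding, to_unicode))  # flow
```

On screen, code 1 would be drawn as one ligature glyph, while extraction yields the two separate characters.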

You can generally use the extraction values as Unicode.

This post was written in response to a request about how PDF text extraction works. If you have a specific PDF question, please feel free to let us know and we will try to make a blog post to answer it.



IDRsolutions Ltd 2022. All rights reserved.