Text is defined in PDF files by a Font object and a set of text-showing commands (Tj and TJ). So you will see something like this in the content stream.
/F1 12 Tf
[(Text)] TJ
PDF operators are written after their operands. The Tf command specifies that we are using the Font object defined as F1 in the Resources object, here at a size of 12. This object defines all the font details (name of font, width of glyphs, encoding).
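To make the operand-then-operator layout concrete, here is a minimal Python sketch that scans a content stream fragment and reports which font each text-showing command uses. It is an illustration only: the function name find_text_operations is made up for this post, and the tokenizer assumes literal strings with no escapes or nested parentheses, which real content streams do not guarantee.

```python
import re

# Simplified tokenizer (assumption: no escapes, nested parentheses,
# hex strings, or dictionaries appear in the fragment).
TOKEN = re.compile(r"""
    \((?P<string>[^()\\]*)\)      # literal string such as (Text)
  | (?P<name>/[^\s/\[\]()<>]+)    # name such as /F1
  | (?P<number>[-+]?\d*\.?\d+)    # number such as 12 or -0.5
  | (?P<bracket>[\[\]])           # array delimiters, ignored below
  | (?P<operator>[A-Za-z'"*]+)    # operator such as Tf, Tj, TJ
""", re.VERBOSE)


def find_text_operations(stream):
    """Return a list of (font_name, strings) for each Tj/TJ operator."""
    operands = []   # operands collected since the previous operator
    font = None     # resource name set by the most recent Tf
    results = []
    for match in TOKEN.finditer(stream):
        kind = match.lastgroup
        if kind == "operator":
            op = match.group()
            if op == "Tf" and operands:
                font = operands[0]          # operands are [name, size]
            elif op in ("Tj", "TJ"):
                shown = [o for o in operands if isinstance(o, bytes)]
                results.append((font, shown))
            operands = []                   # operands belong to one operator
        elif kind == "string":
            # Keep the raw codes as bytes: they are not text yet.
            operands.append(match.group("string").encode("latin-1"))
        elif kind == "name":
            operands.append(match.group())
        elif kind == "number":
            operands.append(float(match.group()))
    return results


print(find_text_operations("BT /F1 12 Tf [(Te) -20 (xt)] TJ ET"))
# [('/F1', [b'Te', b'xt'])]
```

The strings are kept as bytes on purpose, because at this stage they are only codes that still need the font's encoding to become characters.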
The TJ command lists the character codes to show. This is not a text value. It actually contains indexes, each of which identifies a glyph through the font's encoding. It so happens that in the most common form of encoding (WinAnsiEncoding), these values are the same as the ASCII values for ordinary Latin text, but it is not text. So in my example above, where the string reads as plain text, the text is most likely WinAnsi encoded.
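A small Python sketch shows why the codes look like text in the common case. It assumes Python's cp1252 codec is a close enough stand-in for WinAnsiEncoding; the two are not guaranteed to agree on every code, and the helper name decode_winansi is made up for this example.

```python
def decode_winansi(codes: bytes) -> str:
    """Map character codes to characters, assuming a WinAnsi-like encoding.

    cp1252 is used as an approximation; codes it leaves undefined are
    replaced with U+FFFD rather than raising an error.
    """
    return codes.decode("cp1252", errors="replace")


# In the ASCII range the codes and the characters coincide...
print(decode_winansi(bytes([84, 101, 120, 116])))  # Text

# ...but above it they do not: code 0x93 is a curly opening quote.
print(decode_winansi(b"\x93") == "\u201c")  # True
```

With a different encoding on the font, the same four codes could select completely different glyphs, which is why the string cannot be treated as text on its own.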
The Font's Encoding entry specifies how to translate this value into a character. It can name one of the standard encodings (these are defined in Appendix D of the PDF Reference), or you can create your own custom encoding (using a /Differences array). We can also specify an alternative Unicode value to be used for text extraction (the font's ToUnicode entry). A common use of this is where we have ligatures such as fl, which look better on the screen as one glyph but which we want to extract as 2 characters.
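As a rough sketch of how a custom encoding and an extraction override fit together, the Python below applies a /Differences-style list to a base table and then extracts text with a ligature expanded to two characters. The tables are tiny made-up subsets, and the function names and the overrides dictionary are hypothetical stand-ins for what a real font dictionary and its ToUnicode data would provide.

```python
def apply_differences(base, differences):
    """Return a copy of base (code -> glyph name) with differences applied.

    The list alternates a starting code with one or more glyph names;
    each name is assigned to the next code in sequence.
    """
    encoding = dict(base)
    code = None
    for item in differences:
        if isinstance(item, int):
            code = item                 # new starting code
        else:
            if code is None:
                raise ValueError("glyph name before any starting code")
            encoding[code] = item
            code += 1
    return encoding


def extract_text(codes, encoding, overrides):
    """Turn character codes into extracted text.

    overrides maps a glyph name to the text to extract for it. Names
    without an override are used directly when they are one character.
    """
    out = []
    for code in codes:
        name = encoding.get(code, ".notdef")
        fallback = name if len(name) == 1 else "\ufffd"
        out.append(overrides.get(name, fallback))
    return "".join(out)


# Illustrative base table: only the letters this example needs.
base = {ord(c): c for c in "ower"}
# Codes 1 and 2 are remapped to the fl and fi ligature glyphs.
encoding = apply_differences(base, [1, "fl", "fi"])
# Extract each ligature as two ordinary characters.
overrides = {"fl": "fl", "fi": "fi"}

print(extract_text(b"\x01ower", encoding, overrides))  # flower
```

On the page the first code draws a single fl glyph, while the extracted string contains the separate letters f and l, which is what makes searching and copying behave as a reader expects.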
You can generally use the extraction values as Unicode.
This post was written in response to a request about how PDF text extraction works. If you have a specific PDF question, please feel free to let us know and we will try to make a blog post to answer it.