Every so often people send us files and ask why we cannot extract the text from them. After all, they can view the PDF file onscreen and see the text. Very often the files were created by certain versions of Ghostscript.
The problem is that a PDF file does not contain text as such. It contains a list of glyphs, and for each glyph there is information on how to display it and which character it represents (its encoding). Often the encoding is a standard built-in pattern (e.g. MacRomanEncoding or WinAnsiEncoding).
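To see why the encoding matters even in the standard case, here is a minimal sketch in Python. It assumes Python's `cp1252` and `mac_roman` codecs as rough stand-ins for the PDF WinAnsiEncoding and MacRomanEncoding tables (they are close but not identical), and `decode_codes` is a hypothetical helper, not part of any PDF library.

```python
# Python's "cp1252" and "mac_roman" codecs are used here as approximations
# of the PDF WinAnsiEncoding and MacRomanEncoding tables (an assumption).
def decode_codes(codes, codec):
    """Turn a list of one-byte character codes into text using a named codec."""
    return bytes(codes).decode(codec)


# The same four codes, as they might appear in a PDF text-showing operation.
codes = [0x63, 0x61, 0x66, 0x8E]

as_win = decode_codes(codes, "cp1252")     # last code read as a Windows character
as_mac = decode_codes(codes, "mac_roman")  # last code read as a Mac character

print(as_win)
print(as_mac)
```

The first three codes decode to the same letters either way, but the last one differs between the two tables, so even with a standard encoding the extractor has to know which table the font uses.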
However, these versions of Ghostscript create a custom pattern for each font based on character usage. So if the first word is cat, glyph 1 would be c, glyph 2 would be a and glyph 3 would be t. If the first word is dog, glyph 1 is d, glyph 2 is o, and glyph 3 is g. But there is no encoding data to tell us what the pattern for each font is. All we know is that glyph 1 is drawn using this outline.
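The cat and dog example above can be sketched in a few lines of Python. This is only an illustration of usage-based numbering, not Ghostscript's actual code; the function names and the use of U+FFFD as the "unknown" marker are assumptions made for the sketch.

```python
def build_usage_encoding(text):
    """Assign glyph codes 1, 2, 3... in order of first appearance in the text."""
    char_to_code = {}
    for ch in text:
        if ch not in char_to_code:
            char_to_code[ch] = len(char_to_code) + 1
    return char_to_code


def encode(text, char_to_code):
    """Replace each character with its glyph code."""
    return [char_to_code[ch] for ch in text]


def decode(codes, code_to_char):
    """Map glyph codes back to characters; unknown codes become U+FFFD."""
    return "".join(code_to_char.get(code, "\ufffd") for code in codes)


cat_map = build_usage_encoding("cat")
dog_map = build_usage_encoding("dog")

cat_codes = encode("cat", cat_map)
dog_codes = encode("dog", dog_map)

# Both words produce the same glyph codes, so the codes alone say nothing.
print(cat_codes, dog_codes)

# With the per-font mapping the text can be recovered...
print(decode(cat_codes, {code: ch for ch, code in cat_map.items()}))

# ...but without it (the situation described above) every glyph is unknown.
print(decode(cat_codes, {}))
```

Both words end up as the same sequence of codes, which is why the text cannot be recovered once the per-font mapping is missing from the file.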
As a result, these PDF files contain no valid text. So if you want to extract text from a PDF file, make sure it has a valid encoding.
This post is part of our “Understanding the PDF File Format” series. In each article, we discuss a PDF feature, bug, gotcha or tip. If you wish to learn more about PDF, we have 13 years' worth of PDF knowledge and tips in our series index.