Every so often, people send us files and ask why we cannot extract the text from them. After all, they can view the PDF file onscreen and see the text. Very often the files come from certain versions of Ghostscript.
The problem is that a PDF file does not contain text as such. It contains a list of glyphs, and for each glyph there is information on how to display it and which character it represents (its encoding). Often the encoding is a standard built-in pattern (e.g. the Mac or Windows encoding, which PDF calls MacRomanEncoding and WinAnsiEncoding).
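To see why the encoding matters, here is a minimal Python sketch. It uses Python's cp1252 and mac_roman codecs as stand-ins for the Windows and Mac encodings. That is an assumption made for illustration: the PDF encodings differ from these codecs in a few details, and a real PDF stores glyph codes in a content stream rather than in a plain byte string like the hypothetical `raw` below.

```python
# Four glyph codes as they might appear in a page's text operator
# (hypothetical sample data, not taken from a real PDF).
raw = b"caf\xe9"

# The same codes read through two different standard encodings.
as_win = raw.decode("cp1252")      # stand-in for the Windows encoding
as_mac = raw.decode("mac_roman")   # stand-in for the Mac encoding

print(as_win)   # the last character is e-acute under the Windows table
print(as_mac)   # the last character is a different letter under the Mac table
```

The codes are identical in both cases. Only the encoding decides which characters they stand for, so without knowing the encoding the last code cannot be turned into text reliably.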
However, Ghostscript creates a custom pattern for each font based on character usage. So if the first word is cat, glyph 1 would be c, glyph 2 would be a and glyph 3 would be t. If the first word is dog, glyph 1 is d, glyph 2 is o, and glyph 3 is g. But there is no encoding data to tell us what the pattern for each font is. All we know is that glyph 1 is drawn using a particular outline.
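The cat and dog example can be sketched in a few lines of Python. This is only a toy model of the numbering scheme described above, not Ghostscript's actual code, and the function names `assign_codes` and `decode` are made up for the illustration.

```python
def assign_codes(text):
    """Give each distinct character a glyph code in order of first use."""
    code_for = {}   # character -> glyph code
    codes = []      # the glyph codes that would be stored in the file
    for ch in text:
        if ch not in code_for:
            code_for[ch] = len(code_for) + 1
        codes.append(code_for[ch])
    return codes, code_for


def decode(codes, code_for):
    """Recover the text, which is only possible if the mapping was kept."""
    char_for = {code: ch for ch, code in code_for.items()}
    return "".join(char_for[code] for code in codes)


cat_codes, cat_map = assign_codes("cat")
dog_codes, dog_map = assign_codes("dog")

print(cat_codes, dog_codes)         # [1, 2, 3] [1, 2, 3]
print(decode(cat_codes, cat_map))   # cat
```

Both words produce the same codes. With the mapping in hand, `decode` returns the text. If the mapping is discarded, which is the situation in these files, the codes 1, 2, 3 could stand for either word.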
As a result, these PDF files contain no extractable text, only glyph numbers that cannot be mapped back to characters. So if you want to extract text from a PDF file, make sure it has a valid encoding.
This post is part of our “Understanding the PDF File Format” series. In each article, we discuss a PDF feature, bug, gotcha or tip. If you wish to learn more about PDF, the series index collects 13 years' worth of PDF knowledge and tips.