Why can I not extract text from this Ghostscript generated PDF file

Every so often people send us files and ask why we cannot extract the text from them – I mean we can view the PDF file onscreen and see the text. Very often the files are from certain version of Ghostscript.

The problem is that the PDF file does not contain text – they contain a list of glyphs and for each glyph there is information on how to display it, and which characters it represents (its encoding). Often the encoding is a standard built-in pattern (ie MAC encoding or WIN encoding).

However Ghostscript creates a custom pattern for each font based on character usage. So if the first word is cat, glyph 1 would be c, glyph 2 would be a and glyph 3 would be t. If the first word is dog, glyph 1 is d, glyph 2 is o, and glyph 3 is g. But there is no encoding data to tell us what the pattern for each font is – all we know is the glyph 1 is drawn using this outline.

As a result, the PDF files contain no valid text. So if you want to extract text from a PDF file, make sure it has valid encoding.

This post is part of our “Understanding the PDF File Format” series. In each article, we discuss a PDF feature, bug, gotcha or tip. If you wish to learn more about PDF, we have 13 years worth of PDF knowledge and tips, so click here to visit our series index!

Related Posts:

The following two tabs change content below.

Mark Stephens

System Architect and Lead Developer at IDRSolutions
Mark Stephens has been working with Java and PDF since 1999 and has diversified into HTML5, SVG and JavaFX. He also enjoys speaking at conferences and has been a Speaker at user groups, Business of Software, Seybold and JavaOne conferences. He has a very dry sense of humor and an MA in Medieval History for which he has not yet found a practical use.
Markee174

About Mark Stephens

Mark Stephens has been working with Java and PDF since 1999 and has diversified into HTML5, SVG and JavaFX.

He also enjoys speaking at conferences and has been a Speaker at user groups, Business of Software, Seybold and JavaOne conferences. He has a very dry sense of humor and an MA in Medieval History for which he has not yet found a practical use.

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes:

<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>