PDF Text extraction - Why can I not extract text from a PDF file?

Every so often people send us files and ask why we cannot extract the text from them. After all, we can view the PDF file onscreen and see the text. Here are the two most common reasons.

File contains no Encoding data

Sometimes the PDF file will contain Font data but no Encoding information. Very often these files come from certain versions of Ghostscript, but it applies to other PDF files as well, and you can deliberately do this to stop people accessing your content.

The issue is that PDF files do not contain text as such. They contain a list of glyphs, and for each glyph there is information on how to display it and which character it represents (its encoding). Often the encoding is a standard built-in pattern (e.g. MAC encoding, MacRomanEncoding, or WIN encoding, WinAnsiEncoding).
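To make the role of the encoding concrete, here is a small Python sketch showing the same stored codes turning into different text depending on which standard encoding is applied. It is only an approximation: Python's built-in mac_roman and cp1252 codecs are used as stand-ins for the PDF Mac and Windows encodings, which are close to them but not identical, and the byte values are made up for illustration.

```python
# Assumption: Python's "mac_roman" and "cp1252" codecs approximate the
# PDF Mac and Windows standard encodings (close, but not identical).

# Illustrative character codes, as they might be stored for a run of text.
raw = bytes([0x63, 0x61, 0x66, 0x8E])

# The same codes decoded with two different standard encodings.
as_mac = raw.decode("mac_roman")
as_win = raw.decode("cp1252")

print(as_mac)
print(as_win)
```

The first three codes are plain ASCII and agree in both encodings. The last one does not, which is why an extractor has to know which encoding the font uses before it can trust the text it produces.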

It is also possible to create a custom pattern for each font based on character usage. So if the first word is cat, glyph 1 would be c, glyph 2 would be a and glyph 3 would be t. If the first word is dog, glyph 1 is d, glyph 2 is o and glyph 3 is g. But if there is no encoding data to tell us what the pattern for each font is, all we know is that glyph 1 is drawn using this outline.
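The cat/dog example can be sketched in a few lines of Python. This is a toy model, not how any particular PDF tool works: the function names build_custom_encoding and decode are hypothetical, and glyph numbers are simply handed out in order of first use.

```python
def build_custom_encoding(text):
    """Number glyphs 1, 2, 3, ... in order of first use (toy scheme)."""
    encoding = {}
    for ch in text:
        if ch not in encoding.values():
            encoding[len(encoding) + 1] = ch
    return encoding


def decode(glyph_ids, encoding):
    """Turn glyph numbers back into text, or None if no encoding exists."""
    if encoding is None:
        # All we know is which outline to draw, not what it means.
        return None
    # Unknown glyph numbers become the Unicode replacement character.
    return "".join(encoding.get(g, "\ufffd") for g in glyph_ids)


cat_font = build_custom_encoding("cat")  # glyph 1 is c, 2 is a, 3 is t
dog_font = build_custom_encoding("dog")  # glyph 1 is d, 2 is o, 3 is g

glyphs_on_page = [1, 2, 3]
print(decode(glyphs_on_page, cat_font))
print(decode(glyphs_on_page, dog_font))
print(decode(glyphs_on_page, None))
```

The same three glyph numbers give different words depending on the font's table, and nothing at all once the table is missing. A viewer can still draw the page in that last case, because drawing only needs the outlines.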

As a result, the PDF file contains no valid text, just references to the glyphs to display.

No actual text

A potential client sent me a PDF file which displays Arabic text and asked why they could not extract the Arabic text from it. I am quite often asked this question, so I thought it would make a good blog post.

The page itself contains lots of Arabic text which the user wants to extract. The first thing to look at is whether the PDF contains any font objects. These define how the text is drawn, and their Encoding shows how to map it back to actual text. Here are the font properties for this file in Acrobat 9.0. As you will notice, the list is empty.