Every so often people send us PDF files and ask why we cannot extract the text from them. After all, the file displays onscreen and the text is clearly visible. Here are the 2 most common reasons…
File contains no Encoding data
Sometimes the PDF file will contain Font data, but no Encoding information. Very often these files come from certain versions of Ghostscript, but it applies to other PDF files as well, and you can do this deliberately to stop people accessing your content.
The issue is that PDF files do not contain text as such: they contain a list of glyphs, and for each glyph there is information on how to display it and which character it represents (its encoding). Often the encoding is a standard built-in pattern (e.g. MacRomanEncoding or WinAnsiEncoding).
It is possible to create a custom pattern for each font based on character usage. So if the first word is cat, glyph 1 would be c, glyph 2 would be a and glyph 3 would be t. If the first word is dog, glyph 1 is d, glyph 2 is o, and glyph 3 is g. But if there is no encoding data to tell us what the pattern for each font is, all we know is that glyph 1 is drawn using this outline.
As a result, the PDF file contains no valid text, just references to the glyphs to display.
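The cat/dog example above can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular PDF tool works: the function names are made up for this post, and a real font stores far more than a simple character-to-number table.

```python
def build_custom_encoding(text):
    """Number each character in order of first use (hypothetical helper)."""
    glyph_for_char = {}
    for ch in text:
        if ch not in glyph_for_char:
            glyph_for_char[ch] = len(glyph_for_char) + 1
    return glyph_for_char


def to_glyphs(text, glyph_for_char):
    """What ends up in the page: a list of glyph numbers, not characters."""
    return [glyph_for_char[ch] for ch in text]


def to_text(glyph_ids, glyph_for_char=None):
    """Recover the text, which is only possible if the encoding was stored."""
    if glyph_for_char is None:
        # No encoding data: we can draw glyph 1, 2, 3 but cannot name them.
        return None
    char_for_glyph = {glyph: ch for ch, glyph in glyph_for_char.items()}
    return "".join(char_for_glyph[g] for g in glyph_ids)


cat_encoding = build_custom_encoding("cat")
dog_encoding = build_custom_encoding("dog")

# Both words are stored as the same glyph numbers, one set per font.
print(to_glyphs("cat", cat_encoding))  # [1, 2, 3]
print(to_glyphs("dog", dog_encoding))  # [1, 2, 3]

print(to_text([1, 2, 3], cat_encoding))  # cat
print(to_text([1, 2, 3]))                # None
```

The same glyph numbers mean different things in different fonts, so once the table is missing there is nothing left in the file to tell cat from dog.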
No actual text
A potential client sent me a PDF file which displays Arabic text and asked why they could not extract the text from it. I am quite often asked this question so I thought it would make a good blog post.
The page itself contains lots of Arabic text which the user wants to extract. The first thing to look at is whether the PDF contains any font objects. These define how the text is drawn, and the Encoding shows how to map it back to actual text. Here are the font properties for this file in Acrobat 9.0: as you will notice, the list is empty.
Closer inspection shows that the pages are composed of a single large image. Looking at the page data in Acrobat 9.0 you can see the image details.
So there is no actual text in the PDF to extract. You would need to use an OCR tool to try to get any text from the image.
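The checks described in both sections can be summed up in a rough sketch. It assumes some PDF library has already parsed a page's resources into plain Python dictionaries; the `classify_page` function and its return labels are made up for this post, and only the key names (Font, XObject, Subtype, Encoding, ToUnicode) come from the PDF format itself. Treat the result as a first hint rather than a verdict.

```python
def classify_page(resources):
    """Guess why text extraction might fail for one page (hypothetical helper).

    `resources` is assumed to be the page's resource dictionary, already
    converted to nested plain dicts by whatever PDF library you use.
    """
    fonts = resources.get("Font", {})
    xobjects = resources.get("XObject", {})
    has_image = any(x.get("Subtype") == "Image" for x in xobjects.values())

    if not fonts:
        # No font objects at all: any "text" you see is part of a picture.
        return "image-only" if has_image else "empty"

    # A font with an Encoding or ToUnicode entry gives us a way back to text.
    if any("Encoding" in f or "ToUnicode" in f for f in fonts.values()):
        return "extractable"

    # Fonts are present, but nothing says which character each glyph means.
    return "glyphs-only"


scanned_page = {"XObject": {"Im0": {"Subtype": "Image"}}}
print(classify_page(scanned_page))  # image-only
```

A page reported as image-only is a candidate for OCR, while glyphs-only corresponds to the missing Encoding case described earlier.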
So while PDF files may ‘display’ text, you can only extract the text if it actually exists in the file.