PDF text extraction – Why can I not extract text from this PDF file?

A potential client sent me a PDF file that displays Arabic text and asked why they could not extract the text from it. I am asked this question quite often, so I thought it would make a good blog post.

The page itself contains lots of Arabic text which the user wants to extract. The first thing to check is whether the PDF contains any font objects. Fonts define how the text is drawn, and each font's Encoding shows how to map the character codes on the page back to actual text. Here are the font properties for this file in Acrobat 9.0. As you will notice, the list is empty.

Closer inspection shows that each page is composed of a single large image. Looking at the page data in Acrobat 9.0, you can see the image details.

So there is no actual text in the PDF to extract. You would need to use an OCR tool to try to get any text from the image.
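
The same check can be roughed out without Acrobat. The following is a minimal sketch, not a real PDF parser: it scans the raw bytes of a file for font and image dictionary markers. The function names scan_pdf_bytes and looks_image_only are hypothetical, and the approach assumes the dictionaries are stored uncompressed. Newer PDF files can pack them into compressed object streams, in which case the counts will be too low and a proper PDF library is the better tool.

```python
import re

# Matches a font dictionary marker such as "/Type /Font" or "/Type/Font",
# but not "/Type /FontDescriptor".
FONT_RE = re.compile(rb"/Type\s*/Font(?![A-Za-z])")
# Matches an image XObject marker such as "/Subtype /Image".
IMAGE_RE = re.compile(rb"/Subtype\s*/Image(?![A-Za-z])")


def scan_pdf_bytes(data: bytes) -> dict:
    """Count font and image markers in raw PDF bytes.

    Assumption: the dictionaries are not hidden inside compressed
    object streams, so this is only a heuristic.
    """
    return {
        "fonts": len(FONT_RE.findall(data)),
        "images": len(IMAGE_RE.findall(data)),
    }


def looks_image_only(data: bytes) -> bool:
    """Guess whether the file has images but no fonts (so OCR is likely needed)."""
    counts = scan_pdf_bytes(data)
    return counts["fonts"] == 0 and counts["images"] > 0
```

To try it, read a file in binary mode (for example with open(path, "rb")) and pass the bytes to looks_image_only. A result of True suggests a file like the one described here, with scanned pages and no fonts, but it is a hint rather than proof.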

So while PDF files may ‘display’ text, you can only extract the text if it actually exists in the file.

This post is part of our “Understanding the PDF File Format” series. In each article, we discuss a PDF feature, bug, gotcha or tip. If you wish to learn more about PDF, we have 13 years’ worth of PDF knowledge and tips in our series index.

Mark Stephens

System Architect and Lead Developer at IDRSolutions
Mark Stephens has been working with Java and PDF since 1999 and has diversified into HTML5, SVG and JavaFX. He also enjoys speaking at conferences and has been a Speaker at user groups, Business of Software, Seybold and JavaOne conferences. He has a very dry sense of humor and an MA in Medieval History for which he has not yet found a practical use.