Why is PDF text extraction problematic?

Mark Stephens

15 years ago

PDF text is a subject which causes much confusion. PDF files are a fantastic way to present content. If set up correctly, you can be sure they will appear exactly as you intended (with none of that horrid wrong formatting you get in Microsoft Word if the user does not have your fonts). They are secure, self-contained and cross-platform.

The problem comes with extraction. People look at the text layout and, because the human eye is very good at working out the flow of the page, they assume that this information is all in the file.

A PDF file is really a form of vector graphic: it contains a whole load of commands to draw shapes, images and text. The key requirement is only that the end result looks correct. Often the text will be in the correct order, but there is no guarantee. Nothing in the PDF specification requires text to be stored in reading order.
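To make this concrete, here is a minimal sketch in Python. It does not parse a real PDF; the TextDraw type and the coordinates are hypothetical stand-ins for the "draw this text here" commands a page contains. It shows that reading the commands in file order can give scrambled text, while sorting by position recovers what the eye sees.

```python
from typing import NamedTuple

class TextDraw(NamedTuple):
    """Hypothetical stand-in for one text-drawing command on a page."""
    x: float   # left edge, in points
    y: float   # baseline, measured up from the bottom of the page
    text: str

# The commands are stored in whatever order suited the generating tool.
page = [
    TextDraw(72, 650, "is not guaranteed."),
    TextDraw(72, 700, "Text order"),
    TextDraw(72, 675, "inside a PDF"),
]

def in_file_order(draws):
    # What you get if you simply read the commands as they appear.
    return " ".join(d.text for d in draws)

def in_visual_order(draws):
    # Top of the page first (larger y), then left to right.
    ordered = sorted(draws, key=lambda d: (-d.y, d.x))
    return " ".join(d.text for d in ordered)

file_text = in_file_order(page)
visual_text = in_visual_order(page)
print(file_text)
print(visual_text)
```

Both versions are drawn identically on screen; only the second matches the order in which a person would read the page.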

Complex structures such as tables exist only because your brain perceives them on the finished document. There is nothing describing them in the PDF beyond a set of draw-string commands at certain locations.
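As an illustration of how a table has to be guessed back from positions alone, the sketch below clusters x coordinates into columns and y coordinates into rows. The draw positions, the cluster and guess_table helpers, and the 3-point tolerance are all assumptions made up for this example, not how any particular library does it.

```python
# Hypothetical (x, y, text) positions for a small table, in arbitrary order.
# y is measured up from the bottom of the page.
draws = [
    (200, 700, "Price"),
    (72, 680, "Apple"),
    (72, 700, "Item"),
    (201, 660, "0.25"),   # one point off: real files are rarely perfectly aligned
    (72, 660, "Pear"),
    (200, 680, "0.30"),
]

def cluster(values, tolerance):
    """Group values so that neighbours within `tolerance` share a group."""
    groups = []
    for v in sorted(values):
        if groups and v - groups[-1][-1] <= tolerance:
            groups[-1].append(v)
        else:
            groups.append([v])
    return groups

def guess_table(draws, tolerance=3):
    col_groups = cluster({x for x, _, _ in draws}, tolerance)
    row_groups = cluster({y for _, y, _ in draws}, tolerance)
    col_of = {x: i for i, g in enumerate(col_groups) for x in g}
    # Largest y is the top row, so number the row groups in reverse.
    row_of = {y: i for i, g in enumerate(reversed(row_groups)) for y in g}
    grid = [["" for _ in col_groups] for _ in row_groups]
    for x, y, text in draws:
        grid[row_of[y]][col_of[x]] = text
    return grid

table = guess_table(draws)
for row in table:
    print(row)
```

The tolerance is a guess. Set it too small and a slightly misaligned cell becomes its own column; set it too large and neighbouring columns merge.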

I was sent a PDF file once created with a tool which obviously started life as a plotter. With a plotter, the slowest activity is to change the pen on the plotter, so where possible you try to draw all the black lines first, then change the pen. In this PDF, all the bold text on the page was drawn first, and then the rest, font by font. It looked perfect to view, but it was very messy inside.

You can extract text from PDF files, but you have to allow for all of these possibilities, and there are no heuristics capable of handling all the possible options. The more general you make the algorithm, the less able it becomes. In our software we took the decision to try to provide some solutions focused on specific tasks (i.e. keywords, tables, flowing text). Many times this works perfectly, but it is always a guessing game.
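A small sketch of why one heuristic cannot fit every page: a rule that reads top to bottom, then left to right interleaves the lines of a two-column layout, while a column-aware rule needs to be told (or to guess) where the gutter is. The page data, function names and gutter_x value are hypothetical.

```python
# Hypothetical two-column page: (x, y, text), y measured up from the bottom.
page = [
    (72, 700, "Left column,"),
    (72, 680, "second line."),
    (320, 700, "Right column,"),
    (320, 680, "second line too."),
]

def read_rows_first(draws):
    """Naive heuristic: top to bottom, then left to right."""
    return [t for _, _, t in sorted(draws, key=lambda d: (-d[1], d[0]))]

def read_columns_first(draws, gutter_x):
    """Column-aware heuristic; gutter_x is a guess supplied by the caller."""
    left = [d for d in draws if d[0] < gutter_x]
    right = [d for d in draws if d[0] >= gutter_x]
    return read_rows_first(left) + read_rows_first(right)

rows_first = read_rows_first(page)
columns_first = read_columns_first(page, gutter_x=300)
print(rows_first)      # the two columns come out interleaved
print(columns_first)   # each column is read in full before the next
```

The column-aware version is right for flowing text but would read a table column by column instead of row by row, which is one reason for focusing each solution on a specific task.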

If you are creating PDF files and want the text to be extractable, you should use Adobe’s Structured Content feature. This includes tagging information in the PDF, making it possible to extract a perfect XML version of the page, but it needs to be switched on to be in the file.

Otherwise, remember that a PDF is essentially a vector graphic and, while it may look perfectly structured, the structure is probably your brain's perception of the picture and not in the file.