Extracting structured text from PDF files

Several customers have asked about ‘structured’ text extraction recently, so this blog post is intended to clarify the topic. If there are any PDF-related topics you would like to know more about, please let us know.

When Adobe created the PDF file format, it was designed as a final-form format, not one for editing and reuse. It works like a vector graphics file, not a text document: it contains ‘draw’ commands for images, text and shapes, but no details of structure. There are no styles, no line or paragraph markers, and often not even spaces. The page looks perfect, but the structure is added by your brain looking at the display; none of it is in the file.
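To make the missing-spaces point concrete, here is a minimal sketch of the guesswork an extractor has to do when all it is given is pieces of text and the positions where they were painted. The `Glyph` class, the `rebuild_line` function, the sample coordinates and the `gap_ratio` threshold are all made up for illustration; this is not how JPedal or any particular library does it.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """One piece of text as a renderer sees it: a string and where to paint it."""
    text: str
    x: float      # left edge on the page, in points (hypothetical sample values)
    width: float  # horizontal advance of the painted text, in points


def rebuild_line(glyphs, gap_ratio=0.3):
    """Guess word breaks on one line from the horizontal gaps between glyphs.

    gap_ratio is an assumed tuning value, not something defined by the PDF spec.
    """
    pieces = []
    previous = None
    # Paint order need not be reading order, so sort left to right first.
    for glyph in sorted(glyphs, key=lambda g: g.x):
        if previous is not None:
            gap = glyph.x - (previous.x + previous.width)
            # A gap that is large relative to the previous glyph is treated as a space.
            if gap > gap_ratio * previous.width:
                pieces.append(" ")
        pieces.append(glyph.text)
        previous = glyph
    return "".join(pieces)


sample = [
    Glyph("y", 14.0, 6.0),
    Glyph("H", 0.0, 7.0),
    Glyph("i", 7.0, 3.0),
    Glyph("o", 20.0, 6.0),
    Glyph("u", 26.0, 6.0),
]
print(rebuild_line(sample))  # prints "Hi you"
```

Nothing in the sample data says there is a space between "Hi" and "you"; it is inferred from a 4-point gap. Choose a different threshold, or a font with unusual spacing, and the guess changes, which is why unstructured extraction so often disappoints.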

It turned out that lots of people wanted to extract text from PDF files and were very disappointed by what they got back. So Adobe extended the specification so that extra metadata could be added to the file to preserve this information and make it easy to retrieve. This is called Marked Content and the results are very good, but it needs to be added to the PDF when it is created. You can find out whether it is present by reading my blog post "The easy way to discover if a PDF file contains structured content" (http://blog.idrsolutions.com/2010/09/the-easy-way-to-discover-if-a-pdf-file-contains-structured-content/).
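As a rough illustration of what "present" means at the file level, the sketch below scans the raw bytes of a PDF for two markers that the PDF specification associates with tagged files: a `/StructTreeRoot` entry and a `/MarkInfo` dictionary containing `/Marked true`. The function name `tagging_hints` and the sample bytes are invented for this example. Treat it as a crude first check only: it assumes the document catalog is stored uncompressed, and it will miss the markers when they sit inside a compressed object stream, where a real PDF parser is needed.

```python
import re


def tagging_hints(pdf_bytes):
    """Report which tagged-PDF markers appear as plain text in the raw file."""
    return {
        # The catalog of a structured PDF points at its structure tree.
        "struct_tree_root": b"/StructTreeRoot" in pdf_bytes,
        # Tagged PDFs also declare /Marked true inside the /MarkInfo dictionary.
        "marked_true": re.search(rb"/Marked\s+true", pdf_bytes) is not None,
    }


# Hypothetical catalog object, written out by hand for the example.
sample_catalog = (
    b"1 0 obj << /Type /Catalog /MarkInfo << /Marked true >> "
    b"/StructTreeRoot 5 0 R >> endobj"
)
print(tagging_hints(sample_catalog))
```

For a file on disk, the bytes can be read with `open(path, "rb").read()`. If both values come back `False`, the file may still be tagged but compressed, so a negative result is a hint rather than proof.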

There are several tools which claim they can add this information to existing PDF files or recreate it, with varying degrees of success. But the bottom line is that if you want to extract structured content from a PDF file, the file needs to contain it in the first place.

We have a code example that extracts structured content if it is present. If it is missing, you will now get this output file, which we hope makes the situation clearer. Would you like any other help?


<?xml version="1.0" encoding="UTF-8"?>

<!-- http://www.jpedal.org -->
<TaggedPDF-doc/>
<!--There is NO Structured text in the file to extract!!-->

<!--Please read our blog post at http://blog.idrsolutions.com/2010/09/the-easy-way-to-discover-if-a-pdf-file-contains-structured-content/ -->
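If you process these output files in a script, one simple way to tell the "nothing to extract" case apart from a real result is to check whether the root element is empty. The sketch below uses Python's standard `xml.etree.ElementTree` and assumes only what the output above shows: a `TaggedPDF-doc` root that is empty when the PDF has no structure. The function name `has_structured_text` is made up for the example, and the layout of a populated file is not assumed.

```python
import xml.etree.ElementTree as ET

# The placeholder output shown above, as bytes (comments are ignored by the parser).
EMPTY_RESULT = b"""<?xml version="1.0" encoding="UTF-8"?>
<!-- http://www.jpedal.org -->
<TaggedPDF-doc/>
<!--There is NO Structured text in the file to extract!!-->
"""


def has_structured_text(xml_bytes):
    """Return True if the root element holds any child elements or text."""
    root = ET.fromstring(xml_bytes)
    has_children = len(root) > 0
    has_text = bool((root.text or "").strip())
    return has_children or has_text


print(has_structured_text(EMPTY_RESULT))  # prints False
```

Checking for an empty root avoids depending on the wording of the comments, which could change between releases.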

This post is part of our “Understanding the PDF File Format” series. In each article, we discuss a PDF feature, bug, gotcha or tip. If you wish to learn more about PDF, we have 13 years' worth of PDF knowledge and tips in our series index.


Mark Stephens

System Architect and Lead Developer at IDRSolutions
Mark Stephens has been working with Java and PDF since 1999 and has diversified into HTML5, SVG and JavaFX. He also enjoys speaking at conferences and has been a Speaker at user groups, Business of Software, Seybold and JavaOne conferences. He has a very dry sense of humor and an MA in Medieval History for which he has not yet found a practical use.
