Mark Stephens

Mark Stephens has been working with Java and PDF since 1999 and has diversified into HTML5, SVG and JavaFX.

He also enjoys speaking at conferences and has been a Speaker at user groups, Business of Software, Seybold and JavaOne conferences. He has a very dry sense of humor and an MA in Medieval History for which he has not yet found a practical use.

Extracting structured text from PDF files

Several customers have asked about ‘structured’ text extraction recently, so this blog post is intended to clarify the topic. If there are any PDF-related topics you are interested in and would like to know more about, please let us know.

When Adobe created the PDF file format, it was designed as a final output format, not one for editing and reuse. It works like a vector graphics file, not a text document: it contains ‘draw’ commands for images, text and shapes, but no details of structure. There are no styles, no line or paragraph markers, and often not even spaces. The page looks perfect, but the structure is supplied by your brain as you look at the display; there is nothing in the file.
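The point about missing spaces can be seen in a small sketch. The content stream below is a hand-written fragment, not taken from a real file, and `naiveExtract` is a hypothetical helper that only understands simple `(...) Tj` operators. It ignores escaped parentheses, hex strings and `TJ` arrays, so treat it as an illustration rather than a working extractor.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ContentStreamDemo {

    // Matches a literal string operand followed by the Tj (show text) operator.
    // Simplification: no escaped or nested parentheses, no hex strings, no TJ arrays.
    private static final Pattern SHOW_TEXT = Pattern.compile("\\(([^)]*)\\)\\s*Tj");

    // Concatenates the text operands in the order they appear in the stream.
    static String naiveExtract(String contentStream) {
        StringBuilder out = new StringBuilder();
        Matcher m = SHOW_TEXT.matcher(contentStream);
        while (m.find()) {
            out.append(m.group(1));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hand-written fragment: two words drawn at two different positions.
        // The gap a reader sees comes from the second Td (move text position),
        // not from a space character in the data.
        String stream =
              "BT\n"
            + "/F1 12 Tf\n"
            + "72 720 Td\n"
            + "(Hello) Tj\n"
            + "40 0 Td\n"
            + "(World) Tj\n"
            + "ET\n";

        System.out.println(naiveExtract(stream)); // prints: HelloWorld
    }
}
```

A real extractor has to guess where the spaces and line breaks belong from the coordinates, which is why results on untagged files vary so much.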

It turned out that lots of people wanted to extract text from PDF files and were very disappointed by what they got back. So Adobe extended the spec to allow extra metadata to be stored in the file, preserving this information and making it easy to retrieve. This is called Marked Content and the results are very good, but it has to be added to the PDF when the file is created. You can find out if it is present by reading my blog post “The easy way to discover if a PDF file contains structured content”.
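As a rough first check, a tagged file normally has a `/StructTreeRoot` entry in its document catalog, so a crude byte scan can often tell you whether structure is likely to be present. The sketch below is an assumption-laden shortcut, not the method from the blog post mentioned above and not part of any library: the class and method names are made up for illustration. A match is a strong hint, but a miss proves nothing, because newer files may keep the catalog inside a compressed object stream where the name is not visible as plain bytes.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class TaggedPdfSniffer {

    // Returns true if the raw bytes contain the /StructTreeRoot name.
    // Heuristic only: it does not parse the PDF and cannot see inside
    // compressed object streams.
    static boolean looksTagged(byte[] pdfBytes) {
        // ISO-8859-1 maps every byte to exactly one char, so binary data
        // survives the conversion and the search stays byte-accurate.
        String raw = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        return raw.contains("/StructTreeRoot");
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java TaggedPdfSniffer <file.pdf>");
            return;
        }
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(looksTagged(bytes)
                ? "Structure tree reference found: file is probably tagged"
                : "No structure tree reference found in the uncompressed bytes");
    }
}
```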

There are several tools which claim they can add this information to existing PDF files or recreate it (with varying degrees of success). But the bottom line is that if you want to extract structured content from a PDF file, the file needs to contain it in the first place.

Our code example extracts structured content if it is present. If it is missing, you will now get the output file shown below, which says so explicitly. We hope that is clearer; let us know if you would like any other help.


<?xml version="1.0" encoding="UTF-8"?>

<!-- http://www.jpedal.org -->
<TaggedPDF-doc/>
<!--There is NO Structured text in the file to extract!!-->

<!--Please read our blog post at http://blog.idrsolutions.com/2010/09/the-easy-way-to-discover-if-a-pdf-file-contains-structured-content/ -->
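If you process this output in your own code, one way to tell the two cases apart is to check whether the `TaggedPDF-doc` root element has any child elements. The following is a minimal sketch using only the standard Java XML parser. The class name, the method name and the `<P>` element in the second sample are assumptions made for illustration; the real output for a tagged file may use different element names.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class StructuredOutputCheck {

    // Returns true if the root element contains at least one child element.
    // Comments and whitespace are ignored, so the "empty" output is detected.
    static boolean hasStructuredContent(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList children = doc.getDocumentElement().getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                if (children.item(i).getNodeType() == Node.ELEMENT_NODE) {
                    return true;
                }
            }
            return false;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML output", e);
        }
    }

    public static void main(String[] args) {
        // The output shown above for a file with no structured text.
        String noStructure =
              "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
            + "<!-- http://www.jpedal.org -->\n"
            + "<TaggedPDF-doc/>\n"
            + "<!--There is NO Structured text in the file to extract!!-->\n";

        // Hypothetical output for a tagged file; the <P> element is an assumption.
        String withStructure = "<TaggedPDF-doc><P>Hello World</P></TaggedPDF-doc>";

        System.out.println(hasStructuredContent(noStructure));   // prints: false
        System.out.println(hasStructuredContent(withStructure)); // prints: true
    }
}
```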

This post is part of our “Understanding the PDF File Format” series. In each article, we discuss a PDF feature, bug, gotcha or tip. If you wish to learn more about PDF, we have 13 years’ worth of PDF knowledge and tips, so visit our series index!
