Several customers have recently asked about ‘structured’ text extraction, so this blog post is intended to clarify the topic. If there are any PDF-related topics you would like to know more about, please let us know.
When Adobe created the PDF file format, it was designed as a final-form format, not one for editing and reuse. It works like a vector graphics file, not a text document: it contains ‘draw’ commands for images, text and shapes, but no details of structure. There are no styles, no line or paragraph markers, and often not even spaces. It looks perfect, but the structure is added by your brain looking at the display; there is nothing in the file.
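To make the ‘draw commands’ point concrete, here is a minimal sketch in Java. The sample content stream is invented for illustration, although `BT`, `ET`, `Tf`, `Td` and `Tj` are real PDF operators. The class name `ContentStreamDemo` and the method `shownStrings` are hypothetical and are not part of any library. The regular expression is a deliberate simplification: it ignores nested parentheses, hex strings, `TJ` arrays and font encodings, all of which a real extractor has to handle.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ContentStreamDemo {

    // Matches a literal string operand followed by the Tj (show text) operator.
    // Simplification: no nested parentheses, hex strings or TJ arrays.
    private static final Pattern SHOW_TEXT =
            Pattern.compile("\\(((?:\\\\.|[^\\\\()])*)\\)\\s*Tj");

    // Returns the raw strings drawn by Tj, in the order they appear in the stream.
    public static List<String> shownStrings(String contentStream) {
        List<String> result = new ArrayList<>();
        Matcher m = SHOW_TEXT.matcher(contentStream);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        // Invented sample: two words drawn at two positions on one line.
        String stream = String.join("\n",
                "BT",            // begin text object
                "/F1 12 Tf",     // select font F1 at 12pt
                "72 720 Td",     // move to the start position
                "(Hello) Tj",    // draw "Hello"
                "40 0 Td",       // move 40 units to the right
                "(World) Tj",    // draw "World"
                "ET");           // end text object
        System.out.println(shownStrings(stream));
    }
}
```

The two words come back as separate fragments, and the gap between them exists only as the `40 0 Td` move. Nothing in the stream says ‘space’, ‘line’ or ‘paragraph’, so an extractor has to guess those from coordinates.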
It turned out that lots of people wanted to extract text from PDF files and were very disappointed by what they got back. So Adobe added functionality to the spec that lets you store extra metadata in the file to preserve this information and retrieve it easily. This is called Marked Content (it is the basis of what is often called Tagged PDF), and the results are very good, but it needs to be added to the PDF when the file is created. You can find out whether it is present by reading my blog post ‘The easy way to discover if a PDF file contains structured content’.
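As a rough illustration of what ‘present in the file’ means, the sketch below scans the raw bytes of a PDF for the `/StructTreeRoot` key, which a Tagged PDF carries in its document catalog. This is not the method described in the blog post mentioned above, and the class name `StructureHint` and method `mentionsStructure` are hypothetical. It is only a heuristic: in files that use compressed object streams the key may not be visible in the raw bytes, so a negative result is not conclusive and a proper PDF library should be used for a reliable answer.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class StructureHint {

    // Heuristic only: true if the uncompressed bytes mention a structure tree root.
    public static boolean mentionsStructure(byte[] pdfBytes) {
        // ISO-8859-1 maps each byte to exactly one char, so binary data is not mangled.
        String raw = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        return raw.contains("/StructTreeRoot");
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java StructureHint <file.pdf>");
            return;
        }
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(mentionsStructure(bytes)
                ? "File mentions a structure tree; structured content may be present."
                : "No structure tree found in the raw bytes (not conclusive).");
    }
}
```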
There are several tools that claim they can add this information to existing PDF files or recreate it, with varying degrees of success. But the bottom line is that if you want to extract structured content from a PDF file, the file needs to contain it in the first place.
We hope that makes things clearer; if you would like any other help, please ask. We have a code example that extracts structured content if it is present. If it is missing, you will now get this output file instead:
<?xml version="1.0" encoding="UTF-8"?>
<!-- http://www.jpedal.org -->
<!--There is NO Structured text in the file to extract!!-->
<!--Please read our blog post at http://blog.idrsolutions.com/2010/09/the-easy-way-to-discover-if-a-pdf-file-contains-structured-content/ -->
This post is part of our “Understanding the PDF File Format” series. In each article, we discuss a PDF feature, bug, gotcha or tip. If you wish to learn more about PDF, we have 13 years’ worth of PDF knowledge and tips in our series index.
IDRsolutions develops a Java PDF library, a PDF forms to HTML5 converter, a PDF to HTML5 or SVG converter, and a Java Image Library that doubles as an ImageIO replacement. On the blog, our team posts about anything interesting they learn.