Because it turned out that people wanted to extract text from PDFs (and not just view them), Adobe added a feature called marked content, which underpins what is known as Tagged PDF. This allows the PDF file to carry additional tags that preserve the structure of the text. However, the feature has to be used when the PDF is created, otherwise the additional information is simply not there!
There is a very easy way to tell whether a PDF file has been created in this way. Open the file in Acrobat Reader and look at the document properties: the Tagged PDF entry (bottom left of the Advanced section) tells you whether the PDF contains these extra tags. This file does not.
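If you would rather check from a script than open Acrobat Reader, the same question can be approximated by looking at the raw bytes. A Tagged PDF declares itself with a `/MarkInfo` dictionary in the document catalog whose `/Marked` entry is `true`. The sketch below is only a rough heuristic and not how any particular library does it: the function name `looks_tagged` is made up for this example, and it assumes the catalog is stored uncompressed with `/MarkInfo` written inline. Files that keep the catalog inside a compressed object stream, or that reference `/MarkInfo` indirectly, would be reported as untagged even when they are not.

```python
import re

# Heuristic pattern (an assumption of this sketch): an inline
# /MarkInfo dictionary that contains "/Marked true".
_MARKED = re.compile(rb"/MarkInfo\s*<<[^>]*?/Marked\s+true")


def looks_tagged(pdf_bytes: bytes) -> bool:
    """Return True if the raw bytes contain /MarkInfo << ... /Marked true."""
    return _MARKED.search(pdf_bytes) is not None


if __name__ == "__main__":
    import sys

    # Usage: python check_tagged.py some_file.pdf
    with open(sys.argv[1], "rb") as handle:
        print(looks_tagged(handle.read()))
```

A result of `False` from this check only means the marker was not found in plain sight, so the properties dialog in Acrobat Reader remains the more reliable test.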
If you can create Tagged PDF, it is worth turning this on by default: the files are not much larger, and it makes text extraction much more viable if you need it in the future.
This post is part of our “Understanding the PDF File Format” series. In each article, we discuss a PDF feature, bug, gotcha or tip. If you wish to learn more about PDF, we have 13 years' worth of PDF knowledge and tips, so take a look at our series index!