Mark Stephens Mark has been working with Java and PDF since 1999 and is a big NetBeans fan. He enjoys speaking at conferences. He has an MA in Medieval History and a passion for reading.

The easy way to discover if a PDF file contains ‘structured content’


Because people wanted to extract text from PDFs (and not just view them), Adobe added a feature called marked content. It lets the PDF file carry additional tags that preserve the structure of the text. However, the feature has to be used when the PDF is created – otherwise the additional information is simply not there!

There is a very easy way to tell if a PDF file has been created in this way. Open the file in Acrobat Reader and look at the document properties – the Tagged PDF entry (bottom left of the Advanced section) tells you whether the PDF contains these extra tags. The file shown here does not.

[Image: tagged PDF menu]

So this PDF file will contain only limited structure tags.
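If you would rather check from code than from the Acrobat dialog, here is a minimal sketch in plain Java. It relies on the fact that a Tagged PDF declares itself with a `/MarkInfo` dictionary containing `/Marked true` in the document catalog. The class name `TaggedPdfCheck` and the method `looksTagged` are hypothetical names made up for this illustration, and the approach is only a heuristic: it scans the raw bytes, so it will miss the flag when the catalog is stored inside a compressed object stream (common from PDF 1.5 onwards). A proper PDF library that parses the catalog is the reliable route.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of any PDF library.
class TaggedPdfCheck {

    // A Tagged PDF sets /Marked true inside the catalog's /MarkInfo dictionary.
    // Assumption: the entry is stored uncompressed, so it is visible in the raw bytes.
    private static final Pattern MARKED_TRUE = Pattern.compile("/Marked\\s+true");

    static boolean looksTagged(byte[] pdfBytes) {
        // ISO-8859-1 maps every byte to exactly one char, so binary data survives the conversion.
        String raw = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        return MARKED_TRUE.matcher(raw).find();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("Usage: java TaggedPdfCheck <file.pdf>");
            return;
        }
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(looksTagged(bytes)
                ? "Tagged PDF: probably yes"
                : "Tagged PDF: no marker found (or it is hidden in a compressed stream)");
    }
}
```

A "no marker found" result from this sketch is therefore not conclusive, whereas the Acrobat properties dialog reads the parsed document and can be trusted either way.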

If your software can create Tagged PDF, it is worth turning this on by default – the files are not much larger and it makes text extraction much more viable if you need it in the future.

This post is part of our “Understanding the PDF File Format” series. In each article, we discuss a PDF feature, bug, gotcha or tip. If you wish to learn more about PDF, we have 13 years' worth of PDF knowledge and tips in our series index.







IDRsolutions Ltd 2021. All rights reserved.