How to find out if a PDF file has 'structured content'

Because people wanted to make PDF files accessible and to extract content from them (not just view them), Adobe added a feature called marked content. It lets a Tagged PDF file carry additional tags that preserve the structure of the text. However, this feature has to be used when the PDF is created; otherwise the additional information is simply not there!

There is a very easy way to tell if a PDF file has been created in this way. Open the file in Acrobat Reader and look at the document properties: the Tagged PDF entry (bottom left of the Advanced section) tells you whether the PDF contains these extra tags. The example file examined here does not, so it will contain only limited structure tags.

JPedal (a Java library to convert, print and view PDF files) also contains a PDFUtilities class which allows you to check programmatically whether the file is fully tagged according to the PDF specification (if it is not, you may still be able to extract some structured content from it).
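If you just want a rough first check without any PDF library, the sketch below scans the raw bytes of a file for the two markers the PDF specification uses for Tagged PDF: a `/MarkInfo` dictionary with `/Marked true`, and a `/StructTreeRoot` entry. It is not the JPedal API, and the class and method names (`TaggedPdfHeuristic`, `looksTagged`) are made up for illustration. It is only a heuristic: it will miss files where the document catalog is stored in a compressed object stream, where `/MarkInfo` is an indirect reference, or where the file is encrypted, so treat a negative result as "unknown" rather than "untagged".

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Pattern;

/** Hypothetical helper: a crude byte-level check for Tagged PDF markers. */
final class TaggedPdfHeuristic {

    // An inline /MarkInfo dictionary whose /Marked entry is true.
    private static final Pattern MARKED_TRUE =
            Pattern.compile("/MarkInfo\\s*<<[^>]*?/Marked\\s+true");

    // The catalog entry that points at the logical structure tree.
    private static final Pattern STRUCT_TREE_ROOT =
            Pattern.compile("/StructTreeRoot\\b");

    private TaggedPdfHeuristic() {
    }

    /** Returns true only if both markers appear uncompressed in the file. */
    public static boolean looksTagged(byte[] pdfBytes) {
        // ISO-8859-1 maps each byte to exactly one char, so binary data is not mangled.
        String raw = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        return MARKED_TRUE.matcher(raw).find()
                && STRUCT_TREE_ROOT.matcher(raw).find();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("usage: java TaggedPdfHeuristic <file.pdf>");
            return;
        }
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(looksTagged(bytes)
                ? "Tagged PDF markers found"
                : "No uncompressed Tagged PDF markers found");
    }
}
```

For anything beyond a quick triage, a real PDF parser that reads the document catalog properly is the safer route.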

If you can create Tagged PDF, it is worth turning this on by default: the files are not much larger, and it makes text extraction much more viable if you need it in the future.

There is a related article, How to extract text from PDF files, explaining how to extract XML from structured PDF files with JPedal.

3 Replies to “How to find out if a PDF file has…”

Thanks Mark for a great article! I was rather confused about this issue (needed to extract some data from a pdf and didn’t have a clue), and your blog was very informative and helpful. Keep the great posts coming!

Thanks for the encouragement. If you have anything you would like to see, just let us know.
