What is tagged PDF?
A tagged PDF is a PDF file that contains additional information defining its structure (reading order, headings, tables, paragraphs, etc.). This is really useful because it makes the content accessible (the text can be read out because the flow is clearly defined) and makes it available for reuse and processing. The content from tagged PDF files can be extracted as XML/HTML with many libraries, including our JPedal PDF library.
Aren’t all PDF files tagged?
Sadly, no. About 20% of PDF files out there are tagged and the rest are much less usable. Tagging is an option when the file is created, and it is hard to add reliably afterwards. The old argument against it was that it made PDF files slightly bigger. But in an age where we have terabytes of storage, that argument no longer holds: slightly smaller files are worth far less than files that are easily accessible, searchable and reusable.
Which PDF creators will make properly tagged files?
LibreOffice, Microsoft Office, InDesign and Acrobat will all create tagged PDF files (make sure the setting is enabled). If you want to check if your PDF files contain tagged content, read our post on How to find out if a PDF file has structured content.
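For a quick first check, a tagged PDF normally has two entries in its document catalog: a /StructTreeRoot entry pointing to the structure tree, and a /MarkInfo dictionary containing /Marked true. The sketch below is a rough heuristic of our own for illustration, not the JPedal API: it simply searches the raw bytes of the file for those keys. The class and method names are made up for this example, and the scan will miss files where the catalog is stored in a compressed object stream (common from PDF 1.5 onwards), so treat a negative result as "unknown" rather than "untagged".

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; it does not parse the PDF.
class TaggedPdfCheck {

    // Matches "/Marked true" with any amount of whitespace between the two tokens.
    private static final Pattern MARKED_TRUE = Pattern.compile("/Marked\\s+true");

    // ISO-8859-1 maps every byte to exactly one char, so binary data survives the conversion.
    private static String asText(byte[] pdfBytes) {
        return new String(pdfBytes, StandardCharsets.ISO_8859_1);
    }

    /** True if the raw bytes mention a structure tree root. */
    static boolean hasStructTreeRoot(byte[] pdfBytes) {
        return asText(pdfBytes).contains("/StructTreeRoot");
    }

    /** True if the raw bytes contain a "/Marked true" entry. */
    static boolean isMarked(byte[] pdfBytes) {
        return MARKED_TRUE.matcher(asText(pdfBytes)).find();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java TaggedPdfCheck.java <file.pdf>");
            return;
        }
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println("StructTreeRoot found: " + hasStructTreeRoot(bytes));
        System.out.println("Marked true found:    " + isMarked(bytes));
    }
}
```

If both checks report true, the file very likely carries tagged content. For anything beyond a quick triage, use a real PDF library or viewer to inspect the structure tree itself.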
Please use tagged PDF files!
Our recommendation is ALWAYS to create tagged PDF files. Even if you do not think it matters now, it will make your PDF files much easier to use in the future.