In my last article, I looked at how well PDF integrates text, along with some common issues with text inside PDFs. In this article, I will be more specific and talk about structured text content. I will cover how marked content enables text extraction, how it follows a tree structure, and the issues it can cause, including the fact that marked content has to be added when the file is created. The following article also gives an overview of text in PDF.
What is marked content?
Adobe originally introduced PDF without any form of structure, which made it difficult for users to extract text from PDF files. Users and developers wanted to be able to extract the text content from a PDF for their own uses. In later versions, Adobe added marked content to the PDF format, which allows the text in a file to be identified and extracted separately.
Marked content makes tagged PDFs possible. A PDF can contain a flag to say that it is tagged and therefore contains marked content. Extra information CAN be added which defines the structure of the text and allows that structure to be included in the extracted content. When people complain that PDFs are unstructured, they really mean that the file was not created properly!
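To make this more concrete, here is a minimal sketch of what marked content can look like inside a page's content stream. A marked-content sequence opens with the `BDC` (or `BMC`) operator and closes with `EMC`, and an `MCID` entry gives the sequence an identifier. The sample stream, the function name `marked_sequences` and the regular expressions are assumptions made for illustration. Real streams are usually compressed and may use nested sequences, `TJ` arrays, escaped parentheses and font encodings, all of which this sketch ignores.

```python
import re

# Hypothetical, uncompressed content stream fragment (illustration only).
SAMPLE_STREAM = """
/H1 <</MCID 0>> BDC
BT /F1 18 Tf 72 720 Td (Marked content) Tj ET
EMC
/P <</MCID 1>> BDC
BT /F1 12 Tf 72 690 Td (Text inside) Tj 0 -14 Td (a tagged paragraph) Tj ET
EMC
"""

# Captures the tag name, the MCID number and everything up to the next EMC.
SEQUENCE = re.compile(r"/(\w+)\s*<<\s*/MCID\s+(\d+)\s*>>\s*BDC(.*?)EMC", re.DOTALL)

# Captures literal strings shown with the Tj operator (escapes not handled).
SHOWN_TEXT = re.compile(r"\(([^()]*)\)\s*Tj")


def marked_sequences(stream):
    """Return a (tag, mcid, text) tuple for each simple marked-content sequence."""
    result = []
    for match in SEQUENCE.finditer(stream):
        tag, mcid, body = match.groups()
        text = " ".join(SHOWN_TEXT.findall(body))
        result.append((tag, int(mcid), text))
    return result


for tag, mcid, text in marked_sequences(SAMPLE_STREAM):
    print(tag, mcid, text)
```

The point of the sketch is that each piece of text is wrapped in a tag with an identifier, so software does not have to guess where a heading or paragraph starts and ends.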
The tags inside a PDF follow a tree-like structure similar to XML. There is a root object which contains nodes, and those nodes contain child nodes recursively. The text in the PDF is tagged according to where in the tree it should appear. This makes it easier for software to determine how the text should be ordered and structured when extracting it. You can find out how to determine whether a PDF is tagged or not by reading this article.
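The tree idea can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not how any particular PDF library represents the structure: the class `StructElem`, the function `text_in_logical_order` and the `page_text` dictionary are all hypothetical names. In a real file the children of an element can mix marked-content IDs and child elements in any order, whereas this sketch always reads an element's own IDs before its children.

```python
from dataclasses import dataclass, field


@dataclass
class StructElem:
    """Simplified stand-in for one node in the tag tree."""
    tag: str
    mcids: list = field(default_factory=list)     # marked-content IDs owned by this node
    children: list = field(default_factory=list)  # child nodes, in logical order


def text_in_logical_order(node, page_text):
    """Depth-first walk: the tree, not the drawing order, decides the text order."""
    parts = [page_text[mcid] for mcid in node.mcids]
    for child in node.children:
        parts.extend(text_in_logical_order(child, page_text))
    return parts


# Text keyed by marked-content ID, listed in the order it happens to be drawn.
page_text = {2: "Second paragraph.", 0: "Title", 1: "First paragraph."}

root = StructElem("Document", children=[
    StructElem("H1", mcids=[0]),
    StructElem("Sect", children=[
        StructElem("P", mcids=[1]),
        StructElem("P", mcids=[2]),
    ]),
])

print(text_in_logical_order(root, page_text))
```

Even though the second paragraph is drawn first in this made-up page, walking the tree returns the title, then the first paragraph, then the second.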
As mentioned above, the biggest issue with marked content is that it is an optional feature: not all PDF files contain marked content, and some tools do not even offer the option to create it. OpenOffice creates nicely structured PDF files (if you tick the option telling it to do so).
So, marked content will only be present if the creator of the PDF file included it when creating it. PDF creation software can add it for you. Although there is software that allows marked content to be added to existing PDFs, the results are not always successful or accurate. This is frustrating for users as it brings inconsistency. Without marked content, it is difficult to extract text from PDF files because the structure and layout can differ; there is no set rule or syntax for structure which PDFs must follow.
Most PDFs do come with marked content, which includes a StructTreeRoot object (the root of the tree structure), and so text can be extracted from most PDF files. But occasionally you may stumble across a PDF file which does not contain marked content, making it difficult to extract text. You may also find the odd file which doesn't follow the PDF spec, e.g. marked content without a StructTreeRoot object. Without marked content, it is close to impossible to extract text from PDF files unless they follow a simple structure.
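As a rough illustration of the cases above, the following sketch scans the raw bytes of a file for the `/Marked true` flag and a `/StructTreeRoot` entry. It is only a heuristic: the function name `tagging_status` and the status strings are made up for this example, and the scan can miss these entries when the document catalog is stored inside a compressed object stream. A proper PDF library that parses the catalog is the reliable way to check.

```python
import re

# Byte patterns for the tagged flag and the structure tree root entry.
MARKED_TRUE = re.compile(rb"/Marked\s+true")
STRUCT_ROOT = re.compile(rb"/StructTreeRoot\b")


def tagging_status(pdf_bytes):
    """Classify a PDF by a naive scan of its raw bytes (heuristic only)."""
    marked = MARKED_TRUE.search(pdf_bytes) is not None
    has_root = STRUCT_ROOT.search(pdf_bytes) is not None
    if marked and has_root:
        return "tagged"
    if marked:
        # The odd case described above: flagged as marked, but no tree root.
        return "marked but no StructTreeRoot"
    if has_root:
        return "structure tree but no Marked flag"
    return "untagged"


# Hypothetical catalog fragments standing in for real files.
tagged = b"<< /Type /Catalog /MarkInfo << /Marked true >> /StructTreeRoot 5 0 R >>"
broken = b"<< /Type /Catalog /MarkInfo << /Marked true >> >>"
plain = b"<< /Type /Catalog /Pages 2 0 R >>"

for sample in (tagged, broken, plain):
    print(tagging_status(sample))
```

For a real file you would pass in the result of `open(path, "rb").read()`, keeping in mind that an "untagged" result from this scan may simply mean the catalog is compressed.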
But, when used properly, PDFs can be created with clear structure for the content. And PDF 2.0 adds some really nice features to make this even better…
Next time, we carry on looking at PDF files with a discussion on Image in PDF Files.