I see a lot of complaints about the PDF file format on various forums. They tend to focus mainly on two issues:
1. The PDF file format is complicated.
2. Extraction, especially of text, is not always straightforward.
Both of these, I think, are essentially unfair. PDF arose out of PostScript and is more akin to a program, with the final display as its output. It offers a very powerful and elegant structure for doing this, but getting into PDF is a bit like learning a programming language. As with any programming language, you need a decent set of tools and a good working knowledge to achieve anything.
Many so-called ‘PDF killers’ have appeared over the years, and yet PDF remains because it is an excellent technical solution to many problems. PDF was never envisaged as something you could hack in a text editor.
The issue with text extraction arises because PDF was designed as a final-form display format, so it does not contain the details of text structure and layout that you might find in other formats. Adobe remedied this by adding a feature to embed structure tags in the PDF (Tagged PDF), and if this is used, text can be extracted very accurately. The problem is that very few people use it when creating PDFs. So again, don’t blame the format: used correctly, it works very well.
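To make the extraction problem concrete, here is a minimal Python sketch. It assumes a hand-written, uncompressed content stream fragment using the standard `BT`/`ET`, `Tf`, `Td` and `Tj` operators; the fragment, the regular expressions and the helper names `positioned_strings` and `reading_order` are all made up for illustration and are not a real PDF parser. It shows that the stream only says "draw this string at this position", so a naive extractor gets text in drawing order and has to guess the reading order from coordinates.

```python
import re

# Hypothetical, simplified content stream: three text blocks, deliberately
# drawn out of reading order. Real streams are usually compressed and use
# font encodings, escapes and TJ arrays that this sketch ignores.
CONTENT_STREAM = """
BT /F1 12 Tf 72 650 Td (second line) Tj ET
BT /F1 12 Tf 72 700 Td (Hello) Tj ET
BT /F1 12 Tf 110 700 Td (world) Tj ET
"""

TEXT_BLOCK = re.compile(r"BT(.*?)ET", re.S)  # one BT ... ET text object
MOVE = re.compile(r"(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s+Td")  # x y Td
SHOW = re.compile(r"\((.*?)\)\s*Tj")  # (string) Tj


def positioned_strings(stream):
    """Return (x, y, text) tuples in the order they appear in the stream."""
    items = []
    for block in TEXT_BLOCK.findall(stream):
        move = MOVE.search(block)
        show = SHOW.search(block)
        if move and show:
            x, y = float(move.group(1)), float(move.group(2))
            items.append((x, y, show.group(1)))
    return items


def reading_order(items):
    """Guess reading order: group by y (top of page first), then sort by x."""
    lines = {}
    for x, y, text in items:
        lines.setdefault(y, []).append((x, text))
    ordered = []
    for y in sorted(lines, reverse=True):
        ordered.append(" ".join(text for _, text in sorted(lines[y])))
    return "\n".join(ordered)


items = positioned_strings(CONTENT_STREAM)
print([text for _, _, text in items])  # drawing order
print(reading_order(items))            # order guessed from coordinates
```

Even in this toy case, the words and lines have to be reassembled from positions alone. Columns, rotated text or letters drawn one at a time make the guesswork much harder, which is the gap that structure tags are meant to fill.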
The PDF format’s biggest issue is really that it has been so successful that people are trying to push it into areas which are not its strength, or beyond what it was designed to do.
This post is part of our “Understanding the PDF File Format” series. In each article, we discuss a PDF feature, bug, gotcha or tip. If you wish to learn more about PDF, we have 13 years’ worth of PDF knowledge and tips, so visit our series index!
IDRsolutions develop a Java PDF library, a PDF forms to HTML5 converter, a PDF to HTML5 or SVG converter and a Java Image Library that doubles as an ImageIO replacement. On the blog, our team post about anything interesting they learn.