Don’t blame the PDF file format

I see a lot of complaints about the PDF file format on various forums. They tend to focus on two issues: 1. The PDF file format is complicated. 2. Extraction, especially of text, is not always straightforward. Both of these, I think, are essentially unfair. PDF arose out of PostScript and…

PDF format and style information

Because PDF is very much an output and display format, it does not contain much formatting information, such as paragraph breaks and spaces, unless these tags are explicitly added (Adobe calls it MarkedContent). When the tags are present, it is possible to extract an almost perfect copy of the text data in a PDF. Otherwise, the software…
