I have been looking at a customer PDF file today which highlights how ‘elastic’ the PDF file specification can be…
In theory, all objects are pointed to by a reference, so if you go to the byte offset for object 100, you would see
100 0 obj ...... endobj
The data starts with the object number and generation number. So far so good 🙂
I was looking at a file today which gave me this data for object 100
0 endobj 100 0 obj end
In other words, the offset points 8 bytes too early, so you get the end of the previous object before the correct data for object 100. Most people would regard this as a PDF bug, but the file opens in Acrobat (of course it would), which is very forgiving and does lots of error checking.
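To make the problem concrete, here is a minimal Java sketch of one way a parser could tolerate an offset that lands slightly too early. It is not the actual fix in our library. The class name, the method names and the 32-byte scan window are all made up for illustration, and it assumes the header is written with single spaces (real files can use other whitespace between the numbers).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
final class ObjectOffsetCheck {

    // How far past the recorded offset we are willing to look (arbitrary assumption).
    static final int SCAN_WINDOW = 32;

    // True if "objNum gen obj" starts exactly at pos.
    static boolean headerAt(byte[] data, int pos, int objNum, int gen) {
        byte[] header = (objNum + " " + gen + " obj").getBytes(StandardCharsets.US_ASCII);
        if (pos < 0 || pos + header.length > data.length) {
            return false;
        }
        for (int i = 0; i < header.length; i++) {
            if (data[pos + i] != header[i]) {
                return false;
            }
        }
        // Reject a match that is really the tail of a longer number, e.g. "1100 0 obj".
        return pos == 0 || data[pos - 1] < '0' || data[pos - 1] > '9';
    }

    // Returns the position where the object header really starts, or -1 if not found.
    static int resolveOffset(byte[] data, int recordedOffset, int objNum, int gen) {
        // Fast path: a well-formed file matches here and pays nothing extra.
        if (headerAt(data, recordedOffset, objNum, gen)) {
            return recordedOffset;
        }
        // Slow path: the offset was too early, so look a little further on.
        int limit = Math.min(data.length, recordedOffset + SCAN_WINDOW);
        for (int pos = Math.max(0, recordedOffset + 1); pos < limit; pos++) {
            if (headerAt(data, pos, objNum, gen)) {
                return pos;
            }
        }
        return -1;
    }
}
```

The exact-match check comes first, so a correct file only costs one short comparison. The forward scan only runs once the header is already known to be missing at the recorded offset.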
The real problem is not to fix this file correctly, but to fix the issue without adding code that slows down or breaks all the billions of PDF files out there which work correctly. It is also why you need a very large library of PDF files to regression test any code changes. I have just bought some new i7 PCs with SSD drives to help with running our continuous testing on our collection of PDF files!
After a morning of coding, I now have a code tweak which can handle this (and does not slow down our library in any way), but it is a good example of the issues which you can find in badly created PDF files.
This post is part of our “Understanding the PDF File Format” series. In each article, we discuss a PDF feature, bug, gotcha or tip. If you wish to learn more about PDF, we have 13 years’ worth of PDF knowledge and tips in our series index.
IDRsolutions develop a Java PDF library, a PDF forms to HTML5 converter, a PDF to HTML5 or SVG converter and a Java Image Library that doubles as an ImageIO replacement. On the blog, our team post about anything interesting they learn.