Interesting PDF bugs – How wrong can the references be?

I have been looking at a customer PDF file today which highlights how ‘elastic’ the PDF file specification can be…

In theory, every object is pointed to by a reference, so if you go to the byte offset given for object 100, you should see:

100 0 obj

......

endobj

The data starts with the object number and generation number, followed by the obj keyword. So far so good 🙂
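As a minimal sketch of that expectation, the check below tests whether the bytes at a given offset begin with the expected object header. The class and method names are hypothetical (they are not from any particular PDF library), and the sketch assumes the header is written with single spaces between its three tokens, which is the common case but not the only form a PDF file can use.

```java
import java.nio.charset.StandardCharsets;

final class ObjectHeaderCheck {

    /**
     * Returns true if the bytes at offset begin with "<objNum> <gen> obj".
     * Assumption: tokens are separated by exactly one space.
     */
    static boolean startsWithHeader(byte[] data, int offset, int objNum, int gen) {
        byte[] expected = (objNum + " " + gen + " obj").getBytes(StandardCharsets.US_ASCII);
        // Reject offsets that leave no room for the header.
        if (offset < 0 || offset > data.length - expected.length) {
            return false;
        }
        for (int i = 0; i < expected.length; i++) {
            if (data[offset + i] != expected[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] wellFormed = "100 0 obj\n<< >>\nendobj\n".getBytes(StandardCharsets.US_ASCII);
        byte[] tooEarly = "0 endobj\n100 0 obj\n<< >>\nendobj\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(startsWithHeader(wellFormed, 0, 100, 0)); // true
        System.out.println(startsWithHeader(tooEarly, 0, 100, 0));   // false
    }
}
```

A strict check like this is cheap, so it can run on every object lookup without noticeably affecting files whose offsets are correct.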

The file I was looking at gave me this data at the offset for object 100:

0 endobj

100 0 obj

end

In other words, the offset points 8 bytes too early in the file, so you get the end of the previous object before the correct data for object 100. Most people would regard this as a PDF bug, but the file opens in Acrobat (of course it would), which is very forgiving and does lots of error checking.

The real problem is not fixing this one file correctly, but fixing the issue without adding code that slows down or breaks the billions of PDF files out there which work correctly. It is also why you need a very large library of PDF files to regression test any code changes. I have just bought some new i7 PCs with SSD drives to help with running continuous tests against our collection of PDF files!

After a morning of coding, I now have a code tweak which can handle this (and does not slow down our library in any way), but it is a good example of the issues which you can find in badly created PDF files.
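One way such a tolerant lookup could be structured is sketched below. This is an illustration only, not the actual change made in our library: the class name, the method names and the MAX_SCAN limit are all hypothetical. The idea is to try the stated offset first, so well-formed files take exactly the same path as before, and to scan a short distance forward only when that strict check fails.

```java
import java.nio.charset.StandardCharsets;

final class TolerantObjectLocator {

    // Hypothetical limit on how far past the stated offset we are willing to look.
    static final int MAX_SCAN = 64;

    /** Returns the offset where "<objNum> <gen> obj" really starts, or -1 if not found. */
    static int locate(byte[] data, int statedOffset, int objNum, int gen) {
        byte[] header = (objNum + " " + gen + " obj").getBytes(StandardCharsets.US_ASCII);

        // Fast path: correct files match here and never reach the scan.
        if (matchesAt(data, statedOffset, header)) {
            return statedOffset;
        }

        // Fallback: assume the offset was slightly too early and look ahead.
        int start = Math.max(statedOffset + 1, 1);
        int last = Math.min(statedOffset + MAX_SCAN, data.length - header.length);
        for (int pos = start; pos <= last; pos++) {
            // Require whitespace before the match so "1100 0 obj" is not
            // mistaken for "100 0 obj".
            if (isWhitespace(data[pos - 1]) && matchesAt(data, pos, header)) {
                return pos;
            }
        }
        return -1;
    }

    static boolean matchesAt(byte[] data, int pos, byte[] header) {
        if (pos < 0 || pos > data.length - header.length) {
            return false;
        }
        for (int i = 0; i < header.length; i++) {
            if (data[pos + i] != header[i]) {
                return false;
            }
        }
        return true;
    }

    // The six PDF whitespace characters: NUL, TAB, LF, FF, CR, SPACE.
    static boolean isWhitespace(byte b) {
        return b == 0 || b == 9 || b == 10 || b == 12 || b == 13 || b == 32;
    }

    public static void main(String[] args) {
        byte[] data = "0 endobj\n100 0 obj\n<< >>\nendobj\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(locate(data, 0, 100, 0)); // 9
    }
}
```

Because the scan runs only after the strict comparison fails, files with correct offsets pay no extra cost, and the short scan window limits the chance of matching an unrelated object further along in the file.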

This post is part of our “Understanding the PDF File Format” series. In each article, we discuss a PDF feature, bug, gotcha or tip. If you wish to learn more about PDF, the series index collects 13 years' worth of our PDF knowledge and tips.


Mark Stephens

System Architect and Lead Developer at IDRSolutions
Mark Stephens has been working with Java and PDF since 1999 and has diversified into HTML5, SVG and JavaFX. He also enjoys speaking at conferences and has been a Speaker at user groups, Business of Software, Seybold and JavaOne conferences. He has a very dry sense of humor and an MA in Medieval History for which he has not yet found a practical use.