Interesting PDF bugs – Missing image data

I have been looking at a customer file which illustrates one of my least favourite features of the PDF file spec – the need to constantly check everything. Take for example the case of image data.

In the customer PDF file I have an image. To make the maths simple, let us say it is a 1 bit image 80 pixels wide and 10 pixels high. As each byte contains 8 pixels, we need 10 bytes to hold each row of pixel data, so a total of 100 bytes for the 10 rows.
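The arithmetic above generalises to widths that are not a multiple of 8, because each row is padded out to a whole byte. The following is a minimal sketch for a single-component image; the class and method names are made up for illustration and are not part of any library.

```java
// Hypothetical helper (names are illustrative): expected size of packed image data
// for a single-component image whose rows are padded to a byte boundary.
final class ImageDataSize {

    // Bytes needed for one row of samples, rounded up to a whole byte.
    static int bytesPerRow(int width, int bitsPerSample) {
        return (width * bitsPerSample + 7) >> 3;
    }

    // Bytes needed for the whole image.
    static int requiredBytes(int width, int height, int bitsPerSample) {
        return bytesPerRow(width, bitsPerSample) * height;
    }

    public static void main(String[] args) {
        System.out.println(requiredBytes(80, 10, 1)); // 100, the example in the text
        System.out.println(requiredBytes(83, 10, 1)); // 110, as each row rounds up to 11 bytes
    }
}
```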

However, some tools leave out any block of white at the end of the image data. In this case the bottom of the image is white (i.e. blank), so I am given only 60 bytes of data and am expected to assume that the last 40 bytes are white and fill them in. Java is strict about the data array for an image containing data for every pixel (a larger array is allowed, and the excess is ignored), so it will throw an error if I try to turn the short array into a BufferedImage. I therefore have to check the size of all image data and ‘correct’ it. Here is some code if you ever need to do this yourself.

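if(decodeColorData.getID()== ColorSpaces.DeviceGray){
    if(d==1){
        int requiredSize=((w+7)>>3)*h;
        int oldSize=data.length;
        if(oldSize<requiredSize){ // listing was cut off here; body below is a reconstruction
            data=java.util.Arrays.copyOf(data,requiredSize);            // grow to the full size
            java.util.Arrays.fill(data,oldSize,requiredSize,(byte)255); // missing rows are white
        }
    }
}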

So when you extract the data from a PDF file for your own uses, you will often need to ‘fix’ it and fill in the missing bits (literally in this case).
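The check-and-pad step can also be sketched as a self-contained example. This is only an illustration under stated assumptions: the class and method names are invented, it handles 1 bit images only, and it fills the missing bytes with 255 on the assumption that set bits mean white, which is how Java's default two-colour palette for TYPE_BYTE_BINARY behaves.

```java
import java.awt.image.BufferedImage;
import java.awt.image.DataBufferByte;
import java.util.Arrays;

// Hypothetical helper (names are illustrative, not from any PDF library).
final class MissingImageData {

    // Returns data unchanged if it is long enough for a 1 bit w x h image,
    // otherwise a copy padded to the required size with the fill byte.
    static byte[] padImageData(byte[] data, int w, int h, byte fill) {
        int requiredSize = ((w + 7) >> 3) * h;
        if (data.length >= requiredSize) {
            return data; // larger is fine, the excess is ignored later
        }
        byte[] padded = Arrays.copyOf(data, requiredSize);
        Arrays.fill(padded, data.length, requiredSize, fill);
        return padded;
    }

    // Builds a 1 bit BufferedImage, padding short data with white (all bits set).
    static BufferedImage toOneBitImage(byte[] data, int w, int h) {
        byte[] padded = padImageData(data, w, h, (byte) 0xFF);
        BufferedImage image = new BufferedImage(w, h, BufferedImage.TYPE_BYTE_BINARY);
        byte[] target = ((DataBufferByte) image.getRaster().getDataBuffer()).getData();
        System.arraycopy(padded, 0, target, 0, target.length);
        return image;
    }

    public static void main(String[] args) {
        byte[] truncated = new byte[60]; // only 6 of the 10 rows, all black
        BufferedImage image = toOneBitImage(truncated, 80, 10);
        System.out.println(Integer.toHexString(image.getRGB(0, 0))); // ff000000 (black, from the data)
        System.out.println(Integer.toHexString(image.getRGB(0, 9))); // ffffffff (white, filled in)
    }
}
```

Copying into the image's own buffer, rather than wrapping the array in a raster, keeps the sketch independent of how strictly a given raster constructor validates the array length.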

Another file fixed, but I do sometimes wish for a format where the spec is actually enforced. Do you think this would be a good idea?

This post is part of our “Understanding the PDF File Format” series. In each article, we discuss a PDF feature, bug, gotcha or tip.


Mark Stephens

System Architect and Lead Developer at IDRSolutions
Mark Stephens has been working with Java and PDF since 1999 and has diversified into HTML5, SVG and JavaFX. He also enjoys speaking at conferences and has been a Speaker at user groups, Business of Software, Seybold and JavaOne conferences. He has a very dry sense of humor and an MA in Medieval History for which he has not yet found a practical use.

2 thoughts on “Interesting PDF bugs – Missing image data”

  1. Mark,
    I don’t think that this is a problem of the PDF spec. As far as I know (and I have read that darn thing several times), it never says that you can leave out image data. When you look at section 8.9.3 Sample Representation, it does not even talk about pixels, it talks about samples, and the size of the image is described by its width and height in samples. The stream that’s associated with the image then contains those samples. And my interpretation of the spec is that you would have width times height samples in that data. The only thing that should be in that stream that is not in the image is the padding bits at the end of a line if the row of samples does not end on a byte boundary (and you therefore end up with a line of samples that does not fill the last byte completely).

    The file you are analyzing is therefore not a PDF file, because it does not conform to the spec. I know, your customer does not want to hear that 🙂 but unfortunately there are too many files out there that claim to be PDF, but do not actually conform to the spec. Everybody is expected to just work with them (hey, they have a file extension of .pdf, so your software better deal with them) and not complain. The other thing I found out is that those files usually have no meaningful “Producer” entry in the document information metadata, so it’s impossible to complain to the maker of the software that created those bad PDF files.

    So, don’t blame the PDF spec, blame whoever created the bad PDF file, and blame Adobe for adding support for bad PDFs to their viewer applications (“The file can’t be bad, Adobe Reader displays it correctly”).

    Thanks for sharing another interesting problem you’ve encountered.

    Karl Heinz

  2. Karl,

    Thanks for sharing your knowledge. Another really cool thing about this file was that because of the way Java color works, the missing bytes need to be 255 (not 0). So the array not only has to be resized but also populated with a default value. There are hours of fun to be had on what we mean by set/unset and white in PDF!
