Mark Stephens Mark has been working with Java and PDF since 1999 and is a big NetBeans fan. He enjoys speaking at conferences. He has an MA in Medieval History and a passion for reading.

Interesting PDF bugs – Missing image data

I have been looking at a customer file which illustrates one of my least favourite features of the PDF file spec – the need to constantly check everything. Take for example the case of image data.

In the customer PDF file I have an image. To make the maths simple, let us say it is a 1-bit image 80 pixels wide and 10 pixels high. As each byte contains 8 pixels, we need 10 bytes to hold each row of pixel data, so 100 bytes in total for the 10 rows.
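
The same arithmetic can be written as a small sketch. The names `ImageDataSize` and `expectedBytes` are made up for illustration; the only assumption is that each row of samples is padded up to a whole number of bytes, so a width that is not a multiple of 8 still rounds up.

```java
final class ImageDataSize {

    /** Bytes needed for a single-component image whose rows are padded to a byte boundary. */
    static int expectedBytes(int w, int h, int bitsPerSample) {
        int bytesPerRow = (w * bitsPerSample + 7) >> 3; // round up to a whole byte
        return bytesPerRow * h;
    }

    public static void main(String[] args) {
        System.out.println(expectedBytes(80, 10, 1)); // 10 bytes per row * 10 rows = 100
        System.out.println(expectedBytes(83, 10, 1)); // 11 bytes per row * 10 rows = 110
    }
}
```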

However, some tools drop any block of white at the end of the image data. In this case, the bottom of the image is white (i.e. blank), so I am given only 60 bytes of data and am expected to assume that the last 40 bytes are white and fill them in. Java is strict about the data array for an image containing data for all the pixels (a larger array is allowed, and the surplus is ignored), so it will throw an error if I try to turn the short array into a BufferedImage. I have to check the size of all image data and ‘correct’ it. Here is some code if you ever need to do this yourself.

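if (decodeColorData.getID() == ColorSpaces.DeviceGray) {
    // bytes per row (width rounded up to a whole byte) times the number of rows
    int requiredSize = ((w + 7) >> 3) * h;
    int oldSize = data.length;
    // ... if oldSize < requiredSize, pad data out to requiredSize with white
}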
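
A fuller, self-contained sketch of this check-and-pad step might look like the following. The class and method names are made up for illustration and are not taken from JPedal. It handles only the 1-bit case and assumes that the missing tail should be filled with set bits (0xFF) to come out white, which is what this particular file needed; other colour setups may need a different fill value.

```java
import java.util.Arrays;

final class ImageDataPadder {

    /**
     * Returns data unchanged if it already covers every row; otherwise returns
     * a copy extended to the required size with the missing tail set to 0xFF.
     */
    static byte[] padOneBitData(byte[] data, int w, int h) {
        int requiredSize = ((w + 7) >> 3) * h;
        int oldSize = data.length;
        if (oldSize >= requiredSize) {
            return data; // enough (or surplus) data: nothing to fix
        }
        byte[] padded = Arrays.copyOf(data, requiredSize);       // new bytes start as 0
        Arrays.fill(padded, oldSize, requiredSize, (byte) 0xFF); // assumed: set bits = white
        return padded;
    }

    public static void main(String[] args) {
        byte[] truncated = new byte[60];               // stands in for the 60 bytes supplied
        byte[] fixed = padOneBitData(truncated, 80, 10);
        System.out.println(fixed.length);              // 100
        System.out.println(fixed[99] & 0xFF);          // 255
    }
}
```

Returning the original array when it is already long enough avoids a copy in the common case where the file is well formed.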

So when you extract the data from a PDF file for your own uses, you will often need to ‘fix’ it and fill in the missing bits (literally in this case).

Another file fixed, but I do sometimes wish the spec was actually enforced. Do you think this would be a good idea?

This post is part of our “Understanding the PDF File Format” series. In each article, we discuss a PDF feature, bug, gotcha or tip.


2 Replies to “Interesting PDF bugs – Missing image data”

  1. Mark,
    I don’t think that this is a problem of the PDF spec. As far as I know (and I have read that darn thing several times), it never says that you can leave out image data. When you look at section 8.9.3 Sample Representation, it does not even talk about pixels, it talks about samples, and the size of the image is described by its width and height in samples. The stream that’s associated with the image then contains those samples. And my interpretation of the spec is that you would have width times height samples in that data. The only thing that should be in that stream that is not in the image are the padding bits at the end of a line if the number of bits in a line is not a multiple of 8 (and you therefore end up with a line of samples that does not fill the last byte completely).

    The file you are analyzing is therefore not a PDF file, because it does not conform to the spec. I know, your customer does not want to hear that 🙂 but unfortunately there are too many files out there that claim to be PDF, but do not actually conform to the spec. Everybody is expected to just work with them (hey, they have a file extension of .pdf, so your software better deal with them) and not complain. The other thing I found out is that those files rarely have meaningful “Producer” information in the document information metadata, so it’s impossible to complain to the maker of the software that created those bad PDF files.

    So, don’t blame the PDF spec, blame whoever created the bad PDF file, and blame Adobe for adding support for bad PDFs to their viewer applications (“The file can’t be bad, Adobe Reader displays it correctly”).

    Thanks for sharing another interesting problem you’ve encountered.

    Karl Heinz

  2. Karl,

    Thanks for sharing your knowledge. Another really cool thing about this file was that because of the way Java color works, the missing bytes need to be 255 (not 0). So the array not only has to be resized but also populated with a default value. There are hours of fun to be had over what we mean by set/unset and white in PDF!
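
The point about 255 versus 0 in the reply above can be checked with a small sketch. It assumes the image is built as a `BufferedImage.TYPE_BYTE_BINARY`, whose default palette maps bit 0 to black and bit 1 to white; the class and method names are made up for illustration, and a different colour model could reverse the mapping.

```java
import java.awt.image.BufferedImage;
import java.awt.image.DataBufferByte;
import java.util.Arrays;

final class OneBitWhiteDemo {

    /** Fills an 8x1 1-bit image with the given byte and returns the RGB of its first pixel. */
    static int rgbOfFilledImage(byte fill) {
        BufferedImage img = new BufferedImage(8, 1, BufferedImage.TYPE_BYTE_BINARY);
        // The raster of a TYPE_BYTE_BINARY image is backed by a packed byte array.
        byte[] raw = ((DataBufferByte) img.getRaster().getDataBuffer()).getData();
        Arrays.fill(raw, fill);
        return img.getRGB(0, 0) & 0xFFFFFF; // drop the alpha byte
    }

    public static void main(String[] args) {
        System.out.printf("%06X%n", rgbOfFilledImage((byte) 0xFF)); // FFFFFF (white)
        System.out.printf("%06X%n", rgbOfFilledImage((byte) 0x00)); // 000000 (black)
    }
}
```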


IDRsolutions Ltd 2022. All rights reserved.