Part of the PDF file format is the ability to compress large chunks of binary data (fonts, image data, etc.) using a variety of compression filters (FlateDecode, CCITTFaxDecode, LZWDecode, JBIG2Decode, etc.). This is very useful because it reduces the size of the data, so the file is smaller and loads more quickly. It also allows the PDF creation tool to choose the most appropriate compression method for the data (CCITTFaxDecode does a brilliant job on black and white image data but it is not a great choice for binary font data). So far so good.
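To make the size saving concrete, here is a minimal sketch of a Flate round trip using the JDK's own `java.util.zip` classes. The class name `FlateRoundTrip` and its two helper methods are my own illustrative names, not part of any PDF library, and a real PDF reader would also have to deal with predictors and chained filters, which this ignores.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper class for illustration only.
final class FlateRoundTrip {

    /** Compress raw bytes into a zlib-wrapped deflate stream. */
    static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Strict decode: any truncation or corruption is reported as an error. */
    static byte[] inflate(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                // No output and not finished: the input ran out early
                // (or a preset dictionary is required, which we do not supply).
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("Truncated Flate stream");
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("Corrupt Flate stream", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }
}
```

Calling `FlateRoundTrip.deflate` on something repetitive, such as a page content stream full of similar text operators, should give a much smaller array, and `FlateRoundTrip.inflate` should return the original bytes. Note that this decoder is deliberately strict, which is fine for well-formed streams.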
But I have been debugging some ‘interesting PDF files’ recently. In particular, there are 2 specific issues with compressed data streams…
1. The standards are not ‘tight’
Several of the compression formats used do not have tight definitions. CCITTFaxDecode encoded data in particular can contain all sorts of little hacks which do not appear to be in the specification but which a decoder needs to handle. And not all Flate decoders can handle all Flate streams. The only real definition of acceptable is whether the file opens in Acrobat (which has its own internal repair heuristics).
2. Corrupted data streams can contain valid data, partly valid data or total garbage
If the data stream is corrupted midway through, it may still contain valid data (especially on the page contents). The damage might be as small as a single byte at the very end. This also means that any code which uses the data needs to be very robust: it could be handling valid data, valid data with bits missing at the end or total rubbish. And it has to avoid slowing down on the 99% of PDF files which are good just to allow for the 1% which may be only partly correct.
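One way to cope with this is a decoder that keeps whatever it managed to decode before things went wrong, and tells the caller whether the stream ended cleanly. The sketch below does that for Flate data with the JDK's `java.util.zip.Inflater`. The names `TolerantFlate`, `Result` and `decode` are hypothetical, and this is only an illustration of the idea, not how Acrobat or any particular library repairs streams.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

// Hypothetical helper class for illustration only.
final class TolerantFlate {

    /** Decoded bytes plus a flag saying whether the stream ended cleanly. */
    static final class Result {
        final byte[] data;
        final boolean complete;

        Result(byte[] data, boolean complete) {
            this.data = data;
            this.complete = complete;
        }
    }

    /** Decode as much as possible; never throws on bad input. */
    static Result decode(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        boolean complete = false;
        try {
            while (true) {
                int n = inflater.inflate(buffer);
                out.write(buffer, 0, n);
                if (inflater.finished()) {
                    complete = true; // clean end of stream
                    break;
                }
                if (n == 0) {
                    break; // input ran out before the end marker: truncated
                }
            }
        } catch (DataFormatException e) {
            // Corrupt data: keep the bytes decoded by earlier calls.
        } finally {
            inflater.end();
        }
        return new Result(out.toByteArray(), complete);
    }
}
```

A caller can then decide what to do with a `Result` whose `complete` flag is false: render the partial page contents, fall back to another strategy, or ignore the stream. The loop does the same work as a strict decoder on valid data, so good files should not pay for the recovery path.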
The PDF file specification is very powerful but I think it would be much more useful if standards were enforced – such as data streams must be valid or ignored. What do you think?
This post is part of our “Understanding the PDF File Format” series. In each article, we discuss a PDF feature, bug, gotcha or tip.