2 problems with corrupt PDF data streams

Part of the PDF file format is the ability to compress large chunks of binary data (fonts, image data, etc.) using a variety of compression methods (Flate, CCITTFaxDecode, LZWDecode, JBIG2, etc.). This is very useful because it reduces the size of the data, so the file is smaller and loads quicker. It also allows the PDF creation tool to choose the most appropriate compression method for each piece of data (CCITTFaxDecode does a brilliant job on black and white image data but is not a great choice for binary font data). So far so good.
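To give a feel for what this means in code, here is a minimal sketch of decompressing the raw bytes of a /FlateDecode stream once they have been read out of the file. It uses the JDK's standard java.util.zip.Inflater; the class and method names are purely illustrative and not taken from any particular PDF library.

import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

public class FlateExample {

    // Illustrative helper: decompress the raw bytes of a /FlateDecode stream.
    static byte[] flateDecode(byte[] raw) throws DataFormatException {
        Inflater inflater = new Inflater();
        inflater.setInput(raw);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        while (!inflater.finished()) {
            int count = inflater.inflate(buffer);
            if (count == 0) {
                break; // no more output can be produced (input exhausted)
            }
            out.write(buffer, 0, count);
        }
        inflater.end();
        return out.toByteArray();
    }
}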

But I have been debugging some ‘interesting PDF files’ recently. In particular, there are 2 specific issues with compressed data streams…

1. The standards are not ‘tight’

Several of the compression formats used do not have tight definitions. CCITTFaxDecode encoded data in particular can contain all sorts of little hacks which do not appear in the specification but which need to be supported. And not all Flate decoders can handle all Flate streams. The only real definition of 'acceptable' is whether the file opens in Acrobat (which has its own internal repair heuristics).
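As an illustration of the kind of workaround I mean (a sketch only, with made-up names, not code from our library or any other): some files store a Flate stream as raw deflate data with no zlib header, so a tolerant reader tries a strict decode first and then retries in 'nowrap' mode.

import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

public class FlateFallbackExample {

    // Try a strict zlib decode first; if that fails, retry the same bytes
    // as a headerless raw deflate stream ('nowrap' mode).
    static byte[] decodeWithFallback(byte[] raw) throws DataFormatException {
        try {
            return inflate(raw, false); // strict: expects a zlib header and checksum
        } catch (DataFormatException e) {
            return inflate(raw, true);  // lenient retry: no zlib wrapper expected
        }
    }

    private static byte[] inflate(byte[] raw, boolean nowrap) throws DataFormatException {
        Inflater inflater = new Inflater(nowrap);
        inflater.setInput(raw);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        while (!inflater.finished()) {
            int count = inflater.inflate(buffer);
            if (count == 0) {
                break; // input exhausted
            }
            out.write(buffer, 0, count);
        }
        inflater.end();
        return out.toByteArray();
    }
}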

2. Corrupt data streams can contain valid data, partially valid data or total garbage

If the data stream is corrupted midway through, it may still contain valid data up to that point (especially in the page contents); the damage might be nothing more than the last byte of the file. This also means that any code which uses the data needs to be very robust: it could be handling valid data, valid data with bits missing at the end, or total rubbish. And it has to avoid slowing down the 99% of good PDF files just to allow for the 1% which may be only partly correct.
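As a rough sketch of what that robustness can look like for a Flate stream (again purely illustrative, not taken from any real library), the decoder can keep whatever decompressed cleanly and simply stop at the point where the data turns to rubbish:

import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

public class PartialFlateExample {

    // Recover as much data as possible from a possibly truncated or
    // corrupted /FlateDecode stream.
    static byte[] flateDecodePartial(byte[] raw) {
        Inflater inflater = new Inflater();
        inflater.setInput(raw);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        try {
            while (!inflater.finished()) {
                int count = inflater.inflate(buffer);
                if (count == 0) {
                    break; // input ran out early: the stream is truncated
                }
                out.write(buffer, 0, count);
            }
        } catch (DataFormatException e) {
            // Corruption midway through: fall through and return what we have.
        } finally {
            inflater.end();
        }
        return out.toByteArray(); // complete, partial, or empty
    }
}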

The PDF file specification is very powerful, but I think it would be much more useful if the standards were enforced, for example by requiring that data streams either be valid or be ignored. What do you think?

This post is part of our “Understanding the PDF File Format” series. In each article, we discuss a PDF feature, bug, gotcha or tip. If you wish to learn more about PDF, we have 13 years' worth of PDF knowledge and tips, so click here to visit our series index!


One thought on “2 problems with corrupt PDF data streams”

  1. This is one of the nice features of the XPS specification: readers are required to reject malformed documents, and most of it is well specified.

    That and Microsoft have managed to avoid putting in everything but the kitchen sink.
