Working on a Java PDF library means that we see all sorts of PDF documents that take liberties with the PDF specification. This week, however, I altered our LZW decompression algorithm to take into account a parameter in the PDF reference that we hadn't encountered in the 10-odd years of developing JPedal.
We were sent a PDF document to debug that displayed a rare but depressing sight in our viewer: a blank page where there should be stuff. The image that was meant to be shown used the LZWDecode filter with the parameter /EarlyChange 0. The PDF specification says:
(LZWDecode only) An indication of when to increase the code length. If the value of this entry is 0, code length increases are postponed as long as possible. If the value is 1, code length increases occur one code early. This parameter is included because LZW sample code distributed by some vendors increases the code length one code earlier than necessary. Default value: 1.
The meaning of this may appear obvious to you, but it seemed a little vague to me and certainly seemed more complicated than what the eventual solution turned out to be.
LZW compression involves creating codes for sequences of data and replacing the data with the codes, hopefully ending up with less data than you started with. The compressor has no idea how many codes it will eventually need, so it starts with each code being a certain width (9 bits), and when it runs out of codes at that width it makes the code width one bit longer.
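To make the growth rule concrete, here is a minimal sketch (not JPedal's code) of the width needed to represent a given code. The class and method names are made up for illustration, and the 12-bit ceiling is an addition of mine: it is the limit PDF's LZWDecode uses, whereas the paragraph above only mentions the 9-bit start.

```java
final class LzwCodeWidth {
    static final int MIN_WIDTH = 9;  // starting width, as described above
    static final int MAX_WIDTH = 12; // assumption stated in the lead-in: PDF LZW stops at 12 bits

    /** Smallest width between 9 and 12 bits that can represent highestCode. */
    static int widthFor(int highestCode) {
        int width = MIN_WIDTH;
        // A width of w bits can hold codes 0 .. 2^w - 1, so grow while the code does not fit.
        while (width < MAX_WIDTH && highestCode >= (1 << width)) {
            width++;
        }
        return width;
    }
}
```

So code 511 still fits in 9 bits, while code 512 is the first one that needs 10.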
It turns out that, depending on how the LZW compressor was implemented, the point at which the code length increases can differ by one code. If the decompression algorithm isn’t aware of this, all the data associated with the codes gets out of whack and you end up with a blank page and red writing spewed into the output console.
The solution is simple: if early change is enabled, you read the data associated with the code and then check to see if you should increase the bit width of the code. If early change is disabled, you check to see if you should increase the bit width, increase it if so, and then get the data associated with the code. I think I would have got the answer quicker if I hadn't read the PDF spec first!
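For anyone who wants to see the off-by-one in code, here is a rough sketch of a decoder that honours the flag. It is not the JPedal implementation: rather than reordering the steps as described above, it expresses the same idea as a threshold that sits one table entry lower when early change is on. All names are hypothetical, and the 258 starting entries (256 literals plus the clear-table code 256 and end-of-data code 257) and the 12-bit cap are taken from the PDF LZWDecode definition rather than from this post.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class LzwEarlyChangeSketch {
    static final int CLEAR_TABLE = 256;
    static final int END_OF_DATA = 257;
    static final int MAX_TABLE_SIZE = 4096; // 12-bit codes

    /** Width of the next code, given how many table entries exist right now. */
    static int codeWidth(int tableSize, boolean earlyChange) {
        int offset = earlyChange ? 1 : 0; // early change widens one entry sooner
        if (tableSize >= 2048 - offset) return 12;
        if (tableSize >= 1024 - offset) return 11;
        if (tableSize >= 512 - offset) return 10;
        return 9;
    }

    static List<byte[]> freshTable() {
        List<byte[]> table = new ArrayList<>(MAX_TABLE_SIZE);
        for (int i = 0; i < 256; i++) table.add(new byte[] {(byte) i});
        table.add(new byte[0]); // placeholder for code 256 (clear table)
        table.add(new byte[0]); // placeholder for code 257 (end of data)
        return table;
    }

    static byte[] append(byte[] prefix, byte last) {
        byte[] result = Arrays.copyOf(prefix, prefix.length + 1);
        result[prefix.length] = last;
        return result;
    }

    static byte[] decode(byte[] input, boolean earlyChange) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        List<byte[]> table = freshTable();
        byte[] previous = null;
        int bitPos = 0;
        int width = 9;
        while (bitPos + width <= input.length * 8) {
            // Read the next code, most significant bit first.
            int code = 0;
            for (int i = 0; i < width; i++, bitPos++) {
                int bit = (input[bitPos >> 3] >> (7 - (bitPos & 7))) & 1;
                code = (code << 1) | bit;
            }
            if (code == END_OF_DATA) break;
            if (code == CLEAR_TABLE) {
                table = freshTable();
                previous = null;
                width = 9;
                continue;
            }
            byte[] entry;
            if (code < table.size()) {
                entry = table.get(code);
            } else if (code == table.size() && previous != null) {
                // The code refers to the entry that is about to be created.
                entry = append(previous, previous[0]);
            } else {
                throw new IllegalArgumentException("Invalid LZW code " + code);
            }
            out.write(entry, 0, entry.length);
            if (previous != null && table.size() < MAX_TABLE_SIZE) {
                table.add(append(previous, entry[0]));
            }
            previous = entry;
            // The only place the flag matters: when to widen the next code.
            width = codeWidth(table.size(), earlyChange);
        }
        return out.toByteArray();
    }
}
```

Decoding the nine bytes 80 0B 60 50 22 0C 0C 85 01 (the codes 256, 45, 258, 258, 65, 259, 66, 257 at 9 bits each) gives 45 45 45 45 45 65 45 45 45 66 with either setting, because the stream is far too short to reach a width change. The two settings only diverge once the table holds 511 entries: at that point codeWidth returns 10 with early change and 9 without, which is exactly the kind of mismatch that produces a blank page.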
IDRsolutions develop a Java PDF library, a PDF forms to HTML5 converter, a PDF to HTML5 or SVG converter and a Java Image Library that doubles as an ImageIO replacement. On the blog our team post about anything interesting they learn about.