Now to complete the series on CCITT, I will go as in-depth on how to decode a G31D encoded PDF file.
G31D CCITT is a variation to the Huffman keyed compression scheme. Essentially to decode a G31D PDF file, a scan line is read in single bit pixel runs. Each of these bits representing a number of white or black pixels. The black and white run length alternate and vary in length making them uniquely identified when decoded, the maximum size of a the run lengths is bounded to the maximum width of the scan line(page width).
More frequently occurring run-lengths are assigned to smaller code words while less frequently occurring run-lengths are assigned to longer code words. This is particularly useful as in a typical hand written or printed document more short run-lengths are encountered than long run-lengths.
While still on the subject of pixel runs and run-lengths it is important to mention facts about how pixel runs are encoded which in turn makes it easier to to decode. Pixel runs which are between 0 and 63 pixels in length are generally encoded using a single terminating code while runs between 64 and 2623 are encoded by a single make up code and a terminating code. However, when the run length is above 2623 pixels they are encoded using as many make up codes as needed and only a terminating code.
Firstly a pre-calculated lookup table for both the black and white pixel runs have to be created to which the current data is compared against. You want to be able to keep track of your current bit location in the scan line. This is so that when a different bit is hit, be it black or white the decoder can group the previous bits into code word of either make up (longer code words) or terminating (shorter code words) code words which are then checked against the table and decoded as needed. The make-up code word represents long run-lengths while the short run-length is represented by the terminating cord-words. The sum of the length values of each code word makes up the run length. The process is repeated as new EOLs are hit.
It is also worth mentioning that each EOL usually starts with a white run length code word. But there are some unusual cases where it does not follow the norm i.e. begins with a black run-length. In this situation, the beginning of that scan is preceded by a zero length white run-length code word. However, if 6 EOLs are hit consecutively then this denotes the end of the file i.e. RTC.
I have found that the big advantages of G31D CCITT is good compression of black and white data. The disadvantage is that it cannot optimise across lines or for multiple empty lines. It also takes a while to get to grips with the algorithm.
What do you think of CCITT?? Would you like to know more about other types of CCITT compression?
This post is part of our “Understanding the PDF File Format” series. In each article, we discuss a PDF feature, bug, gotcha or tip. If you wish to learn more about PDF, we have 13 years worth of PDF knowledge and tips, so click here to visit our series index!
IDRsolutions develop a Java PDF library, a PDF forms to HTML5 converter, a PDF to HTML5 or SVG converter and a Java Image Library that doubles as an ImageIO replacement. On the blog our team post about anything interesting they learn about.