How does CCITT compression work?
CCITT compression encodes black-and-white image data as runs of black or white pixels. It comes in several variants (G31D, G32D and G42D), also known as Group 3 and Group 4 compression. We explain how the most common variant (G31D) works in detail below.
Because most images contain more white than black, we assume that each line starts with white. Where a line does not start with white, a marker is added at the start to show this. If we encode black as the value 1, we only need to set those bits in our decompressed data; white values never need to be set explicitly (the data is binary, so any bit not set to black is white).
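The write-out step described above can be sketched as follows. This is a minimal, hypothetical helper (not JDeli code) that writes alternating white/black run lengths into a packed 1-bit row, setting only the black bits:

```java
public class RunWriter {

    // Writes alternating runs (starting with white) into a packed 1-bit row.
    // Only black pixels are set; white pixels are simply left as 0.
    static byte[] runsToBitmap(int[] runs, int width) {
        byte[] row = new byte[(width + 7) / 8];
        int pos = 0;
        boolean black = false; // scan lines are assumed to start white
        for (int run : runs) {
            if (black) {
                for (int i = pos; i < pos + run; i++) {
                    row[i / 8] |= (byte) (0x80 >> (i % 8)); // 1 = black
                }
            }
            pos += run;
            black = !black; // runs always alternate colour
        }
        return row;
    }
}
```

For example, the runs 3 white, 2 black, 3 white in an 8-pixel row produce the single byte `00011000`.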
Sometimes, however, an image contains more black pixels than white. In that case we can simply invert the image (flipping bits is very fast) and get the best compression. All we need is a flag (BlackIs1 in the PDF file format, whose default value is false) to signal that the image data needs to be inverted to appear correctly.
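The inversion itself really is trivial: a sketch of flipping every bit of packed 1-bit image data (in practice a reader would apply this while honouring the BlackIs1 flag):

```java
public class BitInverter {

    // Inverts packed 1-bit image data by flipping every bit,
    // swapping black and white in one pass over the bytes.
    static byte[] invert(byte[] packed) {
        byte[] out = new byte[packed.length];
        for (int i = 0; i < packed.length; i++) {
            out[i] = (byte) ~packed[i];
        }
        return out;
    }
}
```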
How does G31D compression work?
This is the simpler form of CCITT to decode. First, here are some key terms that make it easier to understand how G31D works.
Pixel run – A block of consecutive pixels of the same colour. Each pixel is usually 1 bit: 1 for black, 0 for white.
Scan line – One full line of pixel data from one side of the page to the other.
Code word – A bit pattern telling the decoder what the data means, e.g. a make-up or terminating code.
Run length – A block of either white or black pixels to be encoded or decoded.
End of line (EOL) – A unique 12-bit code word marking the start and end of a scan line.
Return to control (RTC) – Six consecutive EOL code words, which usually denote the end of the file. EOL and RTC will become clearer in later blogs.
G31D CCITT is a variation on the Huffman keyed compression scheme. To decode a G31D stream, a scan line is read as a sequence of pixel runs, each code word representing a number of white or black pixels. Black and white run lengths alternate, and the code words vary in length so that each can be uniquely identified when decoded. The maximum size of a run length is bounded by the maximum width of the scan line (the page width).
More frequently occurring run lengths are assigned shorter code words, while less frequently occurring run lengths are assigned longer code words. This is particularly effective because a typical handwritten or printed document contains far more short run lengths than long ones.
While still on the subject of pixel runs and run lengths, it is worth explaining how pixel runs are encoded, which in turn makes them easier to decode. Runs of 0 to 63 pixels are encoded with a single terminating code, while runs of 64 to 2623 pixels are encoded with a single make-up code followed by a terminating code. Runs longer than 2623 pixels are encoded using as many make-up codes as needed, followed by a single terminating code.
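The arithmetic behind this split can be sketched as follows, assuming make-up codes cover multiples of 64 up to 2560 (so 2560 + 63 = 2623, matching the single make-up limit above). This is an illustrative helper, not JDeli code:

```java
import java.util.ArrayList;
import java.util.List;

public class RunSplitter {

    // Splits a run length into the make-up/terminating structure:
    // zero or more make-up values (multiples of 64, capped at 2560),
    // always followed by one terminating value in the range 0-63.
    static List<Integer> splitRun(int run) {
        List<Integer> parts = new ArrayList<>();
        while (run >= 64) {
            int makeup = Math.min(2560, (run / 64) * 64);
            parts.add(makeup);   // encoded with a make-up code word
            run -= makeup;
        }
        parts.add(run);          // encoded with a terminating code word
        return parts;
    }
}
```

So a run of 40 needs only a terminating code, a run of 1000 becomes make-up 960 plus terminating 40, and a run of 5000 becomes make-ups 2560 and 2432 plus terminating 8.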
First, pre-calculated lookup tables for both the black and the white pixel runs have to be created, against which the incoming data is compared. The decoder keeps track of its current bit position in the scan line, so that as bits are read they can be grouped into code words, either make-up (longer) or terminating (shorter), which are then checked against the tables and decoded as needed. Make-up code words represent long run lengths, while short run lengths are represented by terminating code words. The sum of the length values of the decoded code words makes up the run length. The process is repeated as each new EOL is hit.
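As a toy illustration of that lookup loop (not the real JDeli implementation), here is a decoder using just a handful of terminating code words taken from the ITU-T T.4 tables. A real decoder needs the full white and black tables, the make-up codes, and EOL handling:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class ToyG31DDecoder {

    // A small subset of the T.4 terminating code words, for illustration only.
    static final Map<String, Integer> WHITE = Map.of(
            "000111", 1, "0111", 2, "1000", 3, "1011", 4);
    static final Map<String, Integer> BLACK = Map.of(
            "010", 1, "11", 2, "10", 3, "011", 4);

    // Grows the current code word one bit at a time until it matches an
    // entry in the active colour's table, then emits the run length and
    // switches colour (black and white runs always alternate).
    static List<Integer> decode(String bits) {
        List<Integer> runs = new ArrayList<>();
        Map<String, Integer> table = WHITE; // scan lines start white
        StringBuilder code = new StringBuilder();
        for (char bit : bits.toCharArray()) {
            code.append(bit);
            Integer run = table.get(code.toString());
            if (run != null) {
                runs.add(run);
                table = (table == WHITE) ? BLACK : WHITE;
                code.setLength(0);
            }
        }
        return runs;
    }
}
```

Because the code words form a prefix code, the shortest match is always the correct one: the bits `1000110111` decode unambiguously as 3 white, 2 black, 2 white.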
It is also worth mentioning that each scan line normally starts with a white run-length code word. In the unusual case where a line begins with a black run, the scan line is instead preceded by a zero-length white run-length code word. And if six EOLs are hit consecutively, this denotes the end of the file, i.e. the RTC.
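On the encoding side this rule is a one-line check; a tiny sketch, where the white run-of-0 terminating code word (00110101) comes from the ITU-T T.4 table:

```java
public class LineStart {

    // T.4 terminating code word for a white run of length 0.
    static final String WHITE_ZERO = "00110101";

    // Scan lines are assumed to start white, so a line that actually
    // begins with black is prefixed with a zero-length white run.
    static String linePrefix(boolean startsBlack) {
        return startsBlack ? WHITE_ZERO : "";
    }
}
```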
I have found that the big advantage of G31D CCITT is good compression of black-and-white data. The disadvantages are that it cannot optimise across lines or for multiple empty lines, and that the algorithm takes a while to get to grips with.
Do you need to read or write Tiff files in Java?
Our JDeli image library offers a range of advantages over ImageIO and alternatives for Tiff files, including:
- prevents heap-related JVM crashes
- reads 1-32 bit bilevel, grayscale, RGB, ARGB, CMYK, ACMYK and YCbCr colorspaces, and converts to sRGB BufferedImage
- implements both Little and Big Endian byte ordering
- decompresses uncompressed, CCITT Group 3 and 4, Deflate/Adobe Deflate, LZW and PackBits data
- supports Single and Multi-file, Tiling, Planar (Chunky, Separated), Predictor, and 16/32-bit floating-point samples
- improved read performance
- supports threading
- superior image scaling algorithms