One of the biggest issues within PDF files we find is the use of carriage returns, spaces and line feeds as gaps with the PDF file data. Most examples within the PDF file references show space and return used as deliminator in the dictionaries so that the PDF file data is easily readable. Here is an example which is what you would see if you opened an uncompressed PDF file in a text editor.
246 0 obj << /Type /Encoding /BaseEncoding /MacRomanEncoding /Differences [32/space 97/a 99/c/d/e/f 104/h/i 108/l 110/n/o 115/s/t/u 121/y] >> endobj
A space is used as a deliminators between pairs of values with each key set on a separate line (ie carriage return at the end). In practice, the PDF file format is much more flexible (or depressingly vague if you are trying to write a parser). All the gaps can be either a space, tab, or return character. Or you can have no separator between values. So this is still valid:-
246 0 objMacRomanEncoding
The format means that you need to be very careful parsing data and allow for returns where you might expect spaces, and also allow for no gaps. You can also have longer gaps (ie lots of spaces rather than one). Tools like Ghostscript write out the PDF data using very different deliminators.
You can also get deliminators in the middle of the Postscript commands or binary data (which is very annoying). Some tools have an aversion to lines longer than 80 characters, so they embed a return in the binary data. These have to be stripped out and ignored to avoid breaking the binary stream data.
All this calls for great care and flexibility when reading and parsing the data from a PDF files. Even after developing our JPedal PDF library for the last 10 years, we still sometimes find examples which we need to tweak our parser to handle.
This post is part of our “Understanding the PDF File Format” series. In each article, we aim to take a specific PDF feature and explain it in simple terms. If you wish to learn more about PDF, we have 13 years worth of PDF knowledge and tips, so click here to visit our series index!
IDRsolutions develop a Java PDF library, a PDF forms to HTML5 converter, a PDF to HTML5 or SVG converter and a Java Image Library that doubles as an ImageIO replacement. On the blog our team post about anything interesting they learn about.