Understanding the PDF File Format – Carriage returns, spaces and other gaps

One of the biggest issues we find within PDF files is the use of carriage returns, spaces and line feeds as gaps within the PDF file data. Most examples in the PDF file references show spaces and returns used as delimiters in the dictionaries so that the PDF file data is easily readable. Here is an example of what you would see if you opened an uncompressed PDF file in a text editor.

246 0 obj
<<
/Type /Encoding
/BaseEncoding /MacRomanEncoding
/Differences [32/space 97/a 99/c/d/e/f 104/h/i 108/l
110/n/o 115/s/t/u 121/y]
>>
endobj

A space is used as a delimiter between the key and value of each pair, with each key set on a separate line (i.e. a carriage return at the end). In practice, the PDF file format is much more flexible (or depressingly vague if you are trying to write a parser). Each gap can be a space, a tab, or a return character. You can also have no separator at all between values wherever a delimiter character such as /, <<, >>, [ or ] already marks the boundary. So this is still valid:

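246 0 obj<</Type/Encoding/BaseEncoding/MacRomanEncoding/Differences[32/space 97/a 99/c/d/e/f 104/h/i 108/l 110/n/o 115/s/t/u 121/y]>>endobj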

The format means that you need to be very careful when parsing data: allow for returns where you might expect spaces, and also allow for no gaps at all. You can also have longer gaps (i.e. lots of spaces rather than one). Tools like Ghostscript write out the PDF data using very different delimiters.
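To make this concrete, here is a minimal sketch in Python of a tokenizer that treats any run of whitespace as a single gap and also splits on delimiter characters when there is no gap at all. It is not how JPedal does it; the function name `tokenize` and the constants are our own for illustration, and the sketch deliberately ignores strings, comments and streams, which a real parser would also need to handle.

```python
# Whitespace and delimiter characters as listed in the PDF specification.
PDF_WHITESPACE = b"\x00\t\n\x0c\r "
PDF_DELIMITERS = b"()<>[]{}/%"


def tokenize(data: bytes) -> list:
    """Split simple PDF object syntax into tokens.

    Illustrative only: handles numbers, keywords, names, << >> and [ ].
    Strings, hex strings, comments and streams are NOT supported.
    """
    tokens = []
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b in PDF_WHITESPACE:
            # Any amount of any whitespace is just a gap.
            i += 1
        elif data[i:i + 2] in (b"<<", b">>"):
            tokens.append(data[i:i + 2])
            i += 2
        elif b in b"[]":
            tokens.append(data[i:i + 1])
            i += 1
        elif b in PDF_DELIMITERS and b != ord("/"):
            raise ValueError("unsupported syntax at offset %d" % i)
        else:
            # A name (starting with '/') or a regular token such as a
            # number or keyword. It ends at whitespace OR at a delimiter,
            # so no gap is required before '/', '<<', '[' and so on.
            j = i + 1
            while (j < n and data[j] not in PDF_WHITESPACE
                   and data[j] not in PDF_DELIMITERS):
                j += 1
            tokens.append(data[i:j])
            i = j
    return tokens


spaced = b"246 0 obj\n<<\n/Type /Encoding\n/BaseEncoding /MacRomanEncoding\n>>\nendobj"
compact = b"246 0 obj<</Type/Encoding/BaseEncoding/MacRomanEncoding>>endobj"
loose = b"246\t0   obj\r\n<<  /Type\r/Encoding /BaseEncoding\n\n/MacRomanEncoding >> endobj"

print(tokenize(spaced) == tokenize(compact) == tokenize(loose))
```

All three byte strings describe the same object, so a parser built this way sees the same token list regardless of which tool wrote the file.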

You can also get delimiters in the middle of the PostScript commands or binary data (which is very annoying). Some tools have an aversion to lines longer than 80 characters, so they embed a return in the binary data. These have to be stripped out and ignored to avoid breaking the binary stream data.
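One case where stripping is well defined is data stored as hexadecimal text (the ASCIIHexDecode filter), where whitespace between digits is ignored and `>` marks the end of the data. The sketch below assumes that case; the function name `decode_ascii_hex` is our own, and the same approach would not be safe on raw binary, where a byte that looks like a return may be real data.

```python
PDF_WHITESPACE = b"\x00\t\n\x0c\r "


def decode_ascii_hex(data: bytes) -> bytes:
    """Decode hex-encoded stream data, ignoring embedded whitespace.

    Illustrative sketch: stops at the '>' end-of-data marker and pads an
    odd number of digits with a trailing '0'. Raises ValueError on any
    character that is not a hex digit.
    """
    digits = bytearray()
    for b in data:
        if b in PDF_WHITESPACE:
            continue  # returns, spaces and tabs inside the data are gaps
        if b == ord(">"):
            break  # end-of-data marker
        digits.append(b)
    if len(digits) % 2:
        digits.append(ord("0"))
    return bytes.fromhex(digits.decode("ascii"))


# The same five bytes, wrapped across lines by a tool that dislikes long lines.
wrapped = b"4865\n6C6C\r\n6F>"
print(decode_ascii_hex(wrapped))
```

Skipping the whitespace before pairing up the digits is the important step: pairing first would split a byte across a line break and corrupt everything after it.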

All this calls for great care and flexibility when reading and parsing the data from a PDF file. Even after developing our JPedal PDF library for the last 10 years, we still sometimes find examples that require us to tweak our parser.

This post is part of our “Understanding the PDF File Format” series. In each article, we aim to take a specific PDF feature and explain it in simple terms. If you wish to learn more about PDF, we have 13 years' worth of PDF knowledge and tips in our series index.

Mark Stephens

System Architect and Lead Developer at IDRSolutions
Mark Stephens has been working with Java and PDF since 1999 and has diversified into HTML5, SVG and JavaFX. He also enjoys speaking at conferences and has been a Speaker at user groups, Business of Software, Seybold and JavaOne conferences. He has a very dry sense of humor and an MA in Medieval History for which he has not yet found a practical use.