Mark Stephens Mark has been working with Java and PDF since 1999 and is a big NetBeans fan. He enjoys speaking at conferences. He has an MA in Medieval History and a passion for reading.

Understanding the PDF File Format – Carriage returns, spaces and other gaps

One of the biggest issues we find within PDF files is the use of carriage returns, spaces and line feeds as gaps within the PDF file data. Most examples in the PDF file reference show spaces and returns used as delimiters in the dictionaries so that the PDF file data is easily readable. Here is an example of what you would see if you opened an uncompressed PDF file in a text editor.

246 0 obj
<<
/Type /Encoding
/BaseEncoding /MacRomanEncoding
/Differences [32/space 97/a 99/c/d/e/f 104/h/i 108/l
110/n/o 115/s/t/u 121/y]
>>
endobj

A space is used as a delimiter between each key and its value, with each pair set on a separate line (i.e. a carriage return at the end). In practice, the PDF file format is much more flexible (or depressingly vague if you are trying to write a parser). All the gaps can be a space, a tab, or a return character. You can also have no separator at all between values where a character such as / or << already marks the start of the next one. So this is still valid:

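246 0 obj<</Type/Encoding/BaseEncoding/MacRomanEncoding/Differences[32/space 97/a 99/c/d/e/f 104/h/i 108/l 110/n/o 115/s/t/u 121/y]>>endobj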

The format means that you need to be very careful when parsing data: allow for returns where you might expect spaces, and also allow for no gaps at all. You can also have longer gaps (i.e. lots of spaces rather than one). Tools like Ghostscript write out the PDF data using very different delimiters.
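
To make this concrete, here is a minimal sketch of a tolerant tokenizer in Java. It is not the JPedal parser: the class name PdfTokenSketch and its methods are made up for this illustration, and it only covers names, numbers, keywords, dictionaries and arrays (strings, comments and streams are deliberately left out). It treats any run of white space as a single gap and lets a delimiter character end the previous token even when there is no gap at all.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration only; not part of JPedal. */
final class PdfTokenSketch {

    // White-space characters in PDF syntax: NUL, tab, line feed, form feed, carriage return, space.
    static boolean isWhitespace(char c) {
        return c == 0 || c == '\t' || c == '\n' || c == '\f' || c == '\r' || c == ' ';
    }

    // Delimiter characters end the previous token even when no white space precedes them.
    static boolean isDelimiter(char c) {
        return "()<>[]{}/%".indexOf(c) >= 0;
    }

    static List<String> tokenize(String data) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        int n = data.length();
        while (i < n) {
            char c = data.charAt(i);
            if (isWhitespace(c)) {
                i++; // any run of white space counts as one gap
            } else if ((c == '<' || c == '>') && i + 1 < n && data.charAt(i + 1) == c) {
                tokens.add(data.substring(i, i + 2)); // dictionary markers << and >>
                i += 2;
            } else if (c == '[' || c == ']') {
                tokens.add(String.valueOf(c)); // array markers
                i++;
            } else {
                // A name (leading '/'), number or keyword: read up to the next gap or delimiter.
                int start = i++;
                while (i < n && !isWhitespace(data.charAt(i)) && !isDelimiter(data.charAt(i))) {
                    i++;
                }
                tokens.add(data.substring(start, i));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String spaced = "246 0 obj\n<<\n/Type /Encoding\n/BaseEncoding /MacRomanEncoding\n>>\nendobj";
        String compact = "246\t0   obj<</Type/Encoding/BaseEncoding/MacRomanEncoding>>endobj";
        System.out.println(tokenize(spaced));
        // Both layouts produce the same token list, so this prints true.
        System.out.println(tokenize(spaced).equals(tokenize(compact)));
    }
}
```

The point of the design is that the parser never assumes which gap character it will see, or that it will see one at all; it only asks whether the current character can still belong to the token being read.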

You can also get delimiters in the middle of the PostScript commands or binary data (which is very annoying). Some tools have an aversion to lines longer than 80 characters, so they embed a return in the binary data. These have to be stripped out and ignored to avoid breaking the binary stream data.
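
As one hedged illustration of stripping such returns, the sketch below assumes the stream data is hex-encoded text (as with the ASCIIHexDecode filter), where line breaks carry no meaning and a > marks the end of the data. The class name HexStreamSketch is invented for this example and is not JPedal code. It is also more lenient than a strict decoder, because it skips every character that is not a hex digit rather than white space only.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration only; not part of JPedal. */
final class HexStreamSketch {

    static byte[] decode(String encoded) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int high = -1; // first hex digit of the current pair, or -1 if none is pending
        for (int i = 0; i < encoded.length(); i++) {
            char c = encoded.charAt(i);
            if (c == '>') {
                break; // end-of-data marker
            }
            int digit = Character.digit(c, 16);
            if (digit < 0) {
                continue; // not a hex digit (for example an embedded return): skip it
            }
            if (high < 0) {
                high = digit;
            } else {
                out.write((high << 4) | digit);
                high = -1;
            }
        }
        if (high >= 0) {
            out.write(high << 4); // an odd final digit is treated as if followed by 0
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // The writer has wrapped the data, leaving a return in the middle of it.
        byte[] decoded = decode("48656C\r\n6C6F>");
        System.out.println(new String(decoded, StandardCharsets.US_ASCII)); // prints Hello
    }
}
```

Skipping the gap inside the decoder, rather than in a separate clean-up pass, means a wrapped line can never split a pair of digits into two wrong bytes.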

All this calls for great care and flexibility when reading and parsing the data from PDF files. Even after developing our JPedal PDF library for the last 10 years, we still sometimes find examples that need a tweak to our parser.

This post is part of our “Understanding the PDF File Format” series. In each article, we aim to take a specific PDF feature and explain it in simple terms. If you wish to learn more about PDF, we have 13 years' worth of PDF knowledge and tips, so visit our series index!

IDRsolutions develop a Java PDF Viewer and SDK, an Adobe forms to HTML5 forms converter, a PDF to HTML5 converter and a Java ImageIO replacement. On the blog our team post anything interesting they learn about.

IDRsolutions Ltd 2019. All rights reserved.