Mark Stephens Mark has been working with Java and PDF since 1999 and is a big NetBeans fan. He enjoys speaking at conferences. He has an MA in Medieval History and a passion for reading.

How are carriage returns, spaces and other gaps defined in a PDF file?


Understanding the PDF file format - carriage returns, spaces and other gaps


One of the biggest issues we find within PDF files is the use of carriage returns, spaces and line feeds as gaps within the PDF file data. Most examples in the PDF file reference show spaces and returns used as delimiters in the dictionaries so that the PDF file data is easily readable. Here is an example of what you would see if you opened an uncompressed PDF file in a text editor.

246 0 obj
<<
/Type /Encoding
/BaseEncoding /MacRomanEncoding
/Differences [32/space 97/a 99/c/d/e/f 104/h/i 108/l
110/n/o 115/s/t/u 121/y]
>>
endobj

A space is used as a delimiter between the key and value of each pair, with each pair set on a separate line (i.e. a carriage return at the end). In practice, the PDF file format is much more flexible (or depressingly vague if you are trying to write a parser). Any gap can be a space, tab or return character. You can also have no separator at all wherever a delimiter character such as / or << already marks the start of the next value. So this is still valid:

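246 0 obj<</Type/Encoding/BaseEncoding/MacRomanEncoding/Differences[32/space 97/a 99/c/d/e/f 104/h/i 108/l 110/n/o 115/s/t/u 121/y]>>endobj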

The format means that you need to be very careful when parsing data: allow for returns where you might expect spaces, and also allow for no gaps at all. You can also have longer gaps (i.e. lots of spaces rather than one). Tools like Ghostscript write out the PDF data using very different delimiters.
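To make this concrete, here is a minimal tokenizer sketch in Java. It is not our parser, and the class and method names (PdfTokenSketch, tokenize) are made up for illustration. It assumes the input is a plain dictionary object with no strings, hex strings, comments or stream data, and it treats any run of the six PDF whitespace characters as a single gap while letting delimiter characters end a token on their own.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not taken from any real PDF library.
final class PdfTokenSketch {

    // The six whitespace characters defined by the PDF specification:
    // NUL, tab, line feed, form feed, carriage return and space.
    static boolean isWhitespace(char c) {
        return c == 0 || c == '\t' || c == '\n' || c == '\f' || c == '\r' || c == ' ';
    }

    // Delimiter characters end the previous token even when no whitespace is present.
    static boolean isDelimiter(char c) {
        return "()<>[]{}/%".indexOf(c) >= 0;
    }

    // Splits simple dictionary syntax into tokens. Assumption: the input contains
    // no strings, hex strings, comments or stream data, which need extra handling.
    static List<String> tokenize(String data) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        int n = data.length();
        while (i < n) {
            char c = data.charAt(i);
            if (isWhitespace(c)) {
                i++; // a gap of any length or kind is simply skipped
            } else if ((c == '<' || c == '>') && i + 1 < n && data.charAt(i + 1) == c) {
                tokens.add(data.substring(i, i + 2)); // dictionary markers << and >>
                i += 2;
            } else if (c == '[' || c == ']') {
                tokens.add(String.valueOf(c)); // array markers
                i++;
            } else {
                // A name (leading '/') or a regular token such as a number or keyword:
                // read up to the next whitespace or delimiter character.
                int start = i++;
                while (i < n && !isWhitespace(data.charAt(i)) && !isDelimiter(data.charAt(i))) {
                    i++;
                }
                tokens.add(data.substring(start, i));
            }
        }
        return tokens;
    }
}
```

With this sketch, the readable layout with one key per line and the same object written with no gaps at all produce an identical token list, which is the property a real parser needs before it starts interpreting the values.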

You can also get delimiters in the middle of the PostScript commands or binary data (which is very annoying). Some tools have an aversion to lines longer than 80 characters, so they embed a return in the binary data. These have to be stripped out and ignored to avoid breaking the binary stream data.
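As a rough illustration of stripping embedded returns, the sketch below decodes stream data that has been written as hexadecimal text. It assumes the stream uses the ASCIIHexDecode filter, for which the PDF specification says whitespace is ignored, > marks the end of the data and a missing final digit is treated as 0. The class and method names (HexStreamSketch, decodeAsciiHex) are hypothetical, and other encodings need their own handling.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper for illustration only; assumes ASCIIHexDecode stream data.
final class HexStreamSketch {

    // Returns the value of a hex digit, or -1 if the character is not one.
    static int hexValue(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    static byte[] decodeAsciiHex(String encoded) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int pending = -1; // first hex digit of the current byte, or -1 if none yet
        for (int i = 0; i < encoded.length(); i++) {
            char c = encoded.charAt(i);
            if (c == '>') {
                break; // end-of-data marker
            }
            if (c == 0 || c == '\t' || c == '\n' || c == '\f' || c == '\r' || c == ' ') {
                continue; // embedded returns and other gaps are skipped, not decoded
            }
            int digit = hexValue(c);
            if (digit < 0) {
                throw new IllegalArgumentException("Not a hex digit: " + c);
            }
            if (pending < 0) {
                pending = digit;
            } else {
                out.write((pending << 4) | digit);
                pending = -1;
            }
        }
        if (pending >= 0) {
            out.write(pending << 4); // odd digit count: behave as if a 0 followed
        }
        return out.toByteArray();
    }
}
```

Here a call such as decodeAsciiHex("48656C\r\n6C6F>") skips the embedded return and yields the same five bytes as the unbroken text would.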

All this calls for great care and flexibility when reading and parsing the data from PDF files. Even after developing with PDF files since 1999, we still sometimes find examples that we need to tweak our parser to handle.


