One of the biggest issues we find within PDF files is the use of carriage returns, spaces and line feeds as gaps within the PDF file data. Most examples in the PDF file reference show a space or return used as the delimiter in dictionaries so that the PDF file data is easily readable. Here is an example of what you would see if you opened an uncompressed PDF file in a text editor.
246 0 obj
<<
/Type /Encoding
/BaseEncoding /MacRomanEncoding
/Differences [32/space 97/a 99/c/d/e/f 104/h/i 108/l 110/n/o 115/s/t/u 121/y]
>>
endobj
A space is used as the delimiter between each key and its value, with each key set on a separate line (ie a carriage return at the end). In practice, the PDF file format is much more flexible (or depressingly vague if you are trying to write a parser). Any gap can be a space, a tab, a carriage return or a line feed. Or you can have no separator at all where a character such as /, <, > or [ already marks the start of the next value. So this is still valid:-
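246 0 obj<</Type/Encoding/BaseEncoding/MacRomanEncoding/Differences[32/space 97/a 99/c/d/e/f 104/h/i 108/l 110/n/o 115/s/t/u 121/y]>>endobj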
The format means that you need to be very careful when parsing data: allow for returns where you might expect spaces, and also allow for no gaps at all. You can also have longer gaps (ie lots of spaces rather than one). Tools like Ghostscript write out the PDF data using very different delimiters.
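To make this concrete, here is a minimal sketch of a tolerant tokenizer in Java. It is an illustration only, not the parser we ship: the class name PdfTokenSketch and its methods are made up for this example, and it only understands names, numbers, keywords, dictionary markers and array brackets (no strings, comments or streams). It treats any run of PDF whitespace as a single gap and lets a delimiter character end the previous token even when there is no gap at all.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: splits simple PDF object data into tokens.
// Not handled: literal strings, hex strings, comments, stream bodies.
final class PdfTokenSketch {

    // The PDF whitespace characters: NUL, tab, line feed, form feed,
    // carriage return and space.
    static boolean isWhitespace(char c) {
        return c == 0 || c == '\t' || c == '\n' || c == '\f' || c == '\r' || c == ' ';
    }

    // Delimiter characters end the previous token even with no gap before them.
    static boolean isDelimiter(char c) {
        return "()<>[]{}/%".indexOf(c) >= 0;
    }

    static List<String> tokenize(String data) {
        List<String> tokens = new ArrayList<>();
        final int n = data.length();
        int i = 0;
        while (i < n) {
            char c = data.charAt(i);
            if (isWhitespace(c)) {
                i++; // one space, a return, or twenty spaces: all just a gap
            } else if ((c == '<' || c == '>') && i + 1 < n && data.charAt(i + 1) == c) {
                tokens.add(data.substring(i, i + 2)); // dictionary markers << and >>
                i += 2;
            } else if (c == '[' || c == ']') {
                tokens.add(String.valueOf(c)); // array brackets
                i++;
            } else {
                // A name (leading /), number or keyword: read up to the next
                // whitespace or delimiter character, whichever comes first.
                int start = i++;
                while (i < n && !isWhitespace(data.charAt(i)) && !isDelimiter(data.charAt(i))) {
                    i++;
                }
                tokens.add(data.substring(start, i));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String spaced = "246 0 obj\r\n<<\r\n/Type   /Encoding\r\n>>\r\nendobj";
        String compact = "246 0 obj<</Type/Encoding>>endobj";
        // Both layouts should produce the same token list.
        System.out.println(tokenize(spaced).equals(tokenize(compact)));
    }
}
```

The point of the design is that the tokenizer never assumes which gap character it will see, or that there will be one at all. It only asks whether the current character can still belong to the token it is reading.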
You can also get delimiters in the middle of the PostScript commands or binary data (which is very annoying). Some tools have an aversion to lines longer than 80 characters, so they embed a return in the binary data. These have to be stripped out and ignored to avoid breaking the binary stream data.
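As a rough illustration of stripping these out, here is a small Java sketch for one specific case: data that has been written as hexadecimal text (as in a hex string or an ASCIIHexDecode stream), where the PDF format says whitespace between digits is ignored. The class name HexDataSketch and its methods are made up for this example, and it assumes the data really is hex-encoded. It is not a general fix for returns embedded in raw binary.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical sketch: decodes hex-encoded PDF data, ignoring any
// whitespace (such as line breaks added every 80 characters).
final class HexDataSketch {

    // Returns 0-15 for a hex digit, or -1 for anything else.
    static int hexValue(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    static boolean isWhitespace(char c) {
        return c == 0 || c == '\t' || c == '\n' || c == '\f' || c == '\r' || c == ' ';
    }

    static byte[] decodeHex(String hex) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int high = -1; // first digit of the current pair, or -1 if none yet
        for (int i = 0; i < hex.length(); i++) {
            char c = hex.charAt(i);
            if (c == '>') {
                break; // end-of-data marker
            }
            if (isWhitespace(c)) {
                continue; // embedded returns and spaces are simply skipped
            }
            int digit = hexValue(c);
            if (digit < 0) {
                throw new IllegalArgumentException("Unexpected character at offset " + i);
            }
            if (high < 0) {
                high = digit;
            } else {
                out.write((high << 4) | digit);
                high = -1;
            }
        }
        if (high >= 0) {
            out.write(high << 4); // odd digit count: last digit is padded with 0
        }
        return out.toByteArray();
    }
}
```

For example, decodeHex("48656C\r\n6C6F>") gives the five bytes of the ASCII text Hello, despite the line break in the middle of the data.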
All this calls for great care and flexibility when reading and parsing the data from PDF files. Even after developing with PDF files since 1999, we still sometimes find examples which we need to tweak our parser to handle.