The PDF file format is a very ‘flexible’ file format. You can put returns into the middle of most objects. There is a certain PDF creation tool which believes lines should never exceed 80 characters and so inserts a break if the line is too long…
So when parsing a PDF file you need to be very ‘flexible’. In the following cases, we have string objects containing escaped octal character sequences with some returns in the middle. Some of these returns need to be ignored and some are legitimate parts of the data. So in which cases should we use the value and which should we ignore?
I have added (13) to show exactly where the return byte (decimal 13, a carriage return) sits.
2 0 obj << /Title (\376\377\000U\000m\000l\000a\000u\000t\000e\000:\000\344\000,\000 \000\304\(13)\000,\000 \000\366\000,\000 \000\326\000,\000 \000\374\000,\000 \000\334\(13))
/V (\376\377\000N\000\260\000 \000i\000d\000e\000n\000t\000i\000f\000i\000a\000n\000t\000 \0009\0009\000\(13)9\0009\0009\0009\0009\000X)
/V (\376\377\000O\000b\000j\000e\000t\000 \000:\000 \000A\000t\000t\000e\000s\000t\000a\000t\000i\000o\000\(13)
n\000 \000d\000e\000 \000p\000a\000i\000e\000m\000e\000n\000t\000 \000d\000\351\000l\000i\000v\000r\000\(13)
\351\000e\000 \000p\000a\000r\000 \000p\000o\000l\000e\000-\000e\000m\000p\000l\000o\000i\000.\000f\000\(13)
Over to you???
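If you want to check your answer, here is a minimal sketch of how a parser might handle these cases. It follows my reading of the literal string rules in the PDF specification (ISO 32000-1, section 7.3.4.2): a backslash followed by an end-of-line marker is a line continuation and both are dropped, while a return with no backslash in front of it is kept as a single line feed. The function name `decode_literal_string` is hypothetical, and the input is assumed to be the bytes between the outer parentheses of the string object.

```python
# Sketch only: decodes the body of a PDF literal string (the bytes between
# the outer parentheses). The function name is made up for this example.

# Single-character escapes and the byte each one stands for.
SIMPLE_ESCAPES = {
    ord("n"): 0x0A, ord("r"): 0x0D, ord("t"): 0x09,
    ord("b"): 0x08, ord("f"): 0x0C,
    ord("("): 0x28, ord(")"): 0x29, ord("\\"): 0x5C,
}

CR, LF, BACKSLASH = 0x0D, 0x0A, 0x5C


def decode_literal_string(body: bytes) -> bytes:
    out = bytearray()
    i, n = 0, len(body)
    while i < n:
        c = body[i]
        if c == BACKSLASH:
            i += 1
            if i >= n:
                break  # trailing backslash with nothing after it
            c = body[i]
            if c in (CR, LF):
                # Line continuation: drop the backslash and the EOL marker.
                if c == CR and i + 1 < n and body[i + 1] == LF:
                    i += 1  # CR LF counts as one EOL marker
                i += 1
            elif 0x30 <= c <= 0x37:
                # Octal escape: one to three octal digits.
                value, digits = 0, 0
                while digits < 3 and i < n and 0x30 <= body[i] <= 0x37:
                    value = value * 8 + (body[i] - 0x30)
                    i += 1
                    digits += 1
                out.append(value & 0xFF)  # ignore high-order overflow
            else:
                # Known escape, or an unknown one where the backslash is dropped.
                out.append(SIMPLE_ESCAPES.get(c, c))
                i += 1
        elif c == CR:
            # Unescaped return is real data, normalised to a single line feed.
            out.append(LF)
            i += 1
            if i < n and body[i] == LF:
                i += 1
        else:
            out.append(c)
            i += 1
    return bytes(out)


# The tail of the second example above, with a real carriage return
# where the post shows (13).
raw = b"\\376\\377\\0009\\000\\\r9\\0009"
decoded = decode_literal_string(raw)
print(decoded[2:].decode("utf-16-be"))  # prints 999
```

In every example above the return sits directly after a backslash, so under this reading it is a line continuation and is dropped. That also keeps the UTF-16 byte pairs aligned, which is a useful sanity check when debugging files like these.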