Why writing a PDF parser is such a 'challenging' task (part 234)

In theory the PDF file format is specified in detail and is very precise. In practice, you meet all sorts of ‘interesting problems’. The trick is to make your code robust enough to handle all of them without making it slow or complex. Here is an interesting example I have been working on today…

Here is some raw data from inside a PDF file, showing four of the PDF objects (929, 928, 938 and 145):

929 0 obj
FormType 1/BBox[ 0 0 12.48 11.084]
/Matrix[ 1 0 0 1 0 0]/Resources<>>>>>
/Length 12/Filter/FlateDecode>>
stream
xú+‰
‰í◊
endstream
endobj

928 0 obj
<>>>>>
/Length 12/Filter/FlateDecode>>stream
xú+‰
‰í◊
endstream
endobj

938 0 obj <>
/ID[(&ó$‰K”Fû«\)oÆl)(\\ZπÇ˚ühGã-ıÅ√:%)]
/Info 298 0 R /Root 300 0 R /Type/XRef/Size 939
/Prev 102048/W[0 4 1]
/Index[158 (edited for space) 1 ]/Length 1440>>stream
(binary data here) endstream

startxref
187067
%%EOF

145 0 obj
<>/DA(/Helv 9 Tf 0 g)/DR<<
/Encoding<<
/PDFDocEncoding 129 0 R >>/Font<>>>
/F 4/FT/Tx
/Ff 8392704/P 1 0 R 
/Rect[ 166.2 228.484 538.2 241.084]/Subtype/Widget/T(undefined_19)
/TU(undefined)/Type/Annot
/V(I am hoping to replace my Fall work study award with student loans. )>>
endobj

In theory every object starts with objectNumber generationNumber obj (the generation number is usually 0) and ends with endobj. Except, of course, object 938, which skips the endobj and is followed immediately by a startxref pointer and a spurious End of File marker (it is not the end of the file, as you can see). So you cannot assume that there will be an endobj marker at the end of each object. Acrobat does not! My code was assuming there would always be an endobj, and it hung when it did not find one.
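One way to guard against a missing endobj is to treat several tokens as possible ends of an object and stop at whichever comes first. The following is only a minimal sketch in Python, not the code from my parser: the names find_object_end and _END_MARKERS are made up for illustration, and it assumes the whole file is already in memory as bytes. It also naively scans through stream data, so a real parser would want to skip the stream body first (using /Length where it can be trusted), since compressed bytes could contain one of these tokens by chance.

```python
import re
from typing import Tuple

# Tokens that may mark the end of an object body. Only "endobj" is the
# proper terminator; the others are fallbacks for files that omit it.
# (Hypothetical helper, written for illustration only.)
_END_MARKERS = re.compile(rb"endobj|startxref|%%EOF|\d+\s+\d+\s+obj\b")


def find_object_end(data: bytes, body_start: int) -> Tuple[int, str]:
    """Return (end_offset, how) for the object whose body begins at body_start.

    body_start must point just past the object's own "N G obj" header.
    how is "endobj" when the proper terminator was found, "recovered" when
    another marker cut the object short, and "eof" when nothing was found.
    """
    match = _END_MARKERS.search(data, body_start)
    if match is None:
        # No terminator at all: stop at the end of the data instead of looping.
        return len(data), "eof"
    if match.group(0) == b"endobj":
        # Well-formed object: the end offset includes the endobj keyword.
        return match.end(), "endobj"
    # Missing endobj: stop just before the next header, startxref or %%EOF.
    return match.start(), "recovered"


if __name__ == "__main__":
    sample = (
        b"938 0 obj <</Type/XRef/Length 4>>stream\nabcd endstream\n\n"
        b"startxref\n187067\n%%EOF\n\n"
        b"145 0 obj\n<</FT/Tx>>\nendobj\n"
    )
    start = sample.index(b"obj") + 3
    print(find_object_end(sample, start))
```

The point of the design is that the scan always terminates: either it finds a marker or it runs out of data, so a malformed object like 938 can no longer make the parser hang.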

Imagine if XML markup behaved like this! And that is why it is ‘challenging’ to write a decent PDF parser…