Why writing a PDF parser is such a ‘challenging’ task (part 234)

In theory, the PDF file format is specified in precise detail. In practice, you meet all sorts of ‘interesting problems’. The trick is to make your code robust enough to handle them all without making it slow or complex. Here is an interesting example I have been working on today…

Here is some raw data from inside a PDF file, showing the PDF objects:

929 0 obj
FormType 1/BBox[ 0 0 12.48 11.084]
/Matrix[ 1 0 0 1 0 0]/Resources<>>>>>
/Length 12/Filter/FlateDecode>>
stream
xú+‰
‰í◊
endstream
endobj

928 0 obj
<>>>>>
/Length 12/Filter/FlateDecode>>stream
xú+‰
‰í◊
endstream
endobj

938 0 obj <>
/ID[(&ó$‰K”Fû«\)oÆl)(\\ZπÇ˚ühGã-ıÅ√:%)]
/Info 298 0 R /Root 300 0 R /Type/XRef/Size 939
/Prev 102048/W[0 4 1]
/Index[158 (edited for space) 1 ]/Length 1440>>stream
(binary data here) endstream

startxref
187067
%%EOF

145 0 obj
<>/DA(/Helv 9 Tf 0 g)/DR<<
/Encoding<<
/PDFDocEncoding 129 0 R >>/Font<>>>
/F 4/FT/Tx
/Ff 8392704/P 1 0 R 
/Rect[ 166.2 228.484 538.2 241.084]/Subtype/Widget/T(undefined_19)
/TU(undefined)/Type/Annot
/V(I am hoping to replace my Fall work study award with student loans. )>>
endobj

In theory, every object starts with objectNumber generationNumber obj (the generation number is usually 0, as it is here) and ends with endobj. Except, of course, object 938, which skips the endobj and is followed immediately by a startxref pointer and a spurious end-of-file marker (%%EOF). It is not the end of the file, as you can see, because object 145 follows. So you cannot assume that there will be an endobj marker at the end of each object. Acrobat does not! My code was assuming there would always be an endobj, and hanging when it never found one.
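One way to tolerate a missing endobj is to treat several tokens as possible object terminators rather than just one. The sketch below is purely illustrative and is not the parser discussed in this post: the function name split_objects and both regular expressions are hypothetical. It also makes one large simplifying assumption, which is that it scans raw bytes without honouring /Length, so binary stream data or a string that happens to contain one of these tokens would fool it.

```python
import re

# Tokens that may end an object body. Besides the expected `endobj`, also
# stop at the next "N G obj" header, at `startxref`, or at `%%EOF`.
# Assumption: stream contents are NOT skipped via /Length in this sketch.
_END_OF_OBJECT = re.compile(rb"endobj|\d+\s+\d+\s+obj\b|startxref|%%EOF")
_OBJECT_HEADER = re.compile(rb"(\d+)\s+(\d+)\s+obj\b")


def split_objects(data: bytes):
    """Yield (object_number, generation, body, had_endobj) for each object."""
    pos = 0
    while True:
        header = _OBJECT_HEADER.search(data, pos)
        if header is None:
            return  # no more objects, so the loop can never spin forever
        body_start = header.end()
        end = _END_OF_OBJECT.search(data, body_start)
        if end is None:
            # Truncated file: take whatever is left, then stop next time round.
            body_end, pos, had_endobj = len(data), len(data), False
        else:
            body_end = end.start()
            had_endobj = end.group() == b"endobj"
            # Resume after `endobj`; otherwise resume AT the terminator so a
            # following object header is not swallowed.
            pos = end.end() if had_endobj else end.start()
        body = data[body_start:body_end].strip()
        yield int(header.group(1)), int(header.group(2)), body, had_endobj


if __name__ == "__main__":
    sample = (
        b"1 0 obj\n<< /A 1 >>\nendobj\n"
        b"2 0 obj\n<< /B 2 >>\nstartxref\n42\n%%EOF\n"
        b"3 0 obj\n<< /C 3 >>\nendobj\n"
    )
    for number, generation, body, had_endobj in split_objects(sample):
        print(number, generation, body, had_endobj)
```

The had_endobj flag lets the caller log or count malformed objects instead of failing on them, and every path through the loop either advances pos or returns, so a missing marker cannot cause a hang.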

Imagine if XML markup behaved like this! And that is why it is ‘challenging’ to write a decent PDF parser…

This post is part of our “Understanding the PDF File Format” series. In each article, we aim to take a specific PDF feature and explain it in simple terms. If you wish to learn more about PDF, we have 13 years’ worth of PDF knowledge and tips in our series index.

Mark Stephens

System Architect and Lead Developer at IDRSolutions
Mark Stephens has been working with Java and PDF since 1999 and has diversified into HTML5, SVG and JavaFX. He also enjoys speaking at conferences and has been a Speaker at user groups, Business of Software, Seybold and JavaOne conferences. He has a very dry sense of humor and an MA in Medieval History for which he has not yet found a practical use.