Mark Stephens Mark has been working with Java and PDF since 1999 and is a big NetBeans fan. He enjoys speaking at conferences. He has an MA in Medieval History and a passion for reading.

Why writing a PDF parser is such a ‘challenging’ task (part 234)


In theory, the PDF file format is specified precisely and in detail. In practice, you meet all sorts of ‘interesting problems’. The trick is to make your code robust enough to handle all of them without making it slow or complex. Here is an interesting example I have been working on today…

Here is some raw data from inside a PDF file, showing several PDF objects:

929 0 obj
FormType 1/BBox[ 0 0 12.48 11.084]
/Matrix[ 1 0 0 1 0 0]/Resources<>>>>>
/Length 12/Filter/FlateDecode>>
stream
xú+‰
‰í◊
endstream
endobj

928 0 obj
<>>>>>
/Length 12/Filter/FlateDecode>>stream
xú+‰
‰í◊
endstream
endobj

938 0 obj <>
/ID[(&ó$‰K”Fû«\)oÆl)(\\ZπÇ˚ühGã-ıÅ√:%)]
/Info 298 0 R /Root 300 0 R /Type/XRef/Size 939
/Prev 102048/W[0 4 1]
/Index[158 (edited for space) 1 ]/Length 1440>>stream
(binary data here) endstream

startxref
187067
%%EOF

145 0 obj
<>/DA(/Helv 9 Tf 0 g)/DR<<
/Encoding<<
/PDFDocEncoding 129 0 R >>/Font<>>>
/F 4/FT/Tx
/Ff 8392704/P 1 0 R 
/Rect[ 166.2 228.484 538.2 241.084]/Subtype/Widget/T(undefined_19)
/TU(undefined)/Type/Annot
/V(I am hoping to replace my Fall work study award with student loans. )>>
endobj

In theory, every object starts with a header of the form objectNumber generationNumber obj (the generation number is 0 throughout this file) and ends with endobj. Except, of course, object 938, which skips the endobj and is followed immediately by a startxref pointer and a spurious end-of-file marker (%%EOF); as you can see, it is not the end of the file. So you cannot assume that there will be an endobj marker at the end of each object. Acrobat does not! My code was assuming there would always be an endobj, and it hung when one was missing.
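To make the idea concrete, here is a minimal Java sketch of a more forgiving way to find the end of an object. It is not the code from our parser: the class TolerantObjectScanner and its methods findObjectEnd and decode are hypothetical names invented for this illustration. It assumes the file bytes have been decoded one-to-one into a String, and it treats endobj, a startxref keyword, or the header of the next object as the end of the current one, whichever comes first.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of any real PDF library. */
final class TolerantObjectScanner {

    // An object normally ends at "endobj". If that is missing, fall back to
    // whichever comes first: a "startxref" keyword or the header of the
    // next object (for example "145 0 obj").
    private static final Pattern BOUNDARY = Pattern.compile(
            "\\bendobj\\b|\\bstartxref\\b|(?<![0-9])[0-9]+\\s+[0-9]+\\s+obj\\b");

    /**
     * Returns the offset just past the object whose body starts at bodyStart.
     * Without an "endobj", the object is cut off in front of the next
     * boundary, or at the end of the data if there is no boundary at all.
     */
    static int findObjectEnd(String data, int bodyStart) {
        Matcher m = BOUNDARY.matcher(data);
        if (!m.find(bodyStart)) {
            return data.length(); // nothing found: stop at the end, never loop
        }
        // A real "endobj" belongs to the object; anything else does not.
        return m.group().equals("endobj") ? m.end() : m.start();
    }

    /** Decodes raw bytes one-to-one into chars so offsets match the file. */
    static String decode(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }
}
```

Called with the offset just after "938 0 obj", findObjectEnd would stop in front of startxref instead of searching forever for an endobj that never comes. One caveat: compressed stream data can contain these keywords by chance, so a real parser would probably skip over the stream using its /Length value first and only then look for the boundary.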

Imagine if XML markup behaved like this! And that is why it is ‘challenging’ to write a decent PDF parser…

This post is part of our “Understanding the PDF File Format” series. In each article, we aim to take a specific PDF feature and explain it in simple terms. If you wish to learn more about PDF, we have 13 years’ worth of PDF knowledge and tips, so do visit our series index!

IDRsolutions develop a Java PDF Viewer and SDK, an Adobe forms to HTML5 forms converter, a PDF to HTML5 converter and a Java ImageIO replacement. On the blog our team post anything interesting they learn about.

IDRsolutions Ltd 2019. All rights reserved.