In theory the PDF file format is specified in precise detail. In practice, you meet all sorts of ‘interesting problems’, and the trick is to make your code robust enough to handle them all without making it slow or complex. Here is an interesting example I have been working on today…
Here is some raw data from inside a PDF file, showing the PDF objects:
929 0 obj FormType 1/BBox[ 0 0 12.48 11.084] /Matrix[ 1 0 0 1 0 0]/Resources<>>>>> /Length 12/Filter/FlateDecode>> stream xú+‰ ‰í◊ endstream endobj 928 0 obj <>>>>> /Length 12/Filter/FlateDecode>>stream xú+‰ ‰í◊ endstream endobj 938 0 obj <> /ID[(&ó$‰K”Fû«\)oÆl)(\\ZπÇ˚ühGã-ıÅ√:%)] /Info 298 0 R /Root 300 0 R /Type/XRef/Size 939 /Prev 102048/W[0 4 1] /Index[158 (edited for space) 1 ]/Length 1440>>stream (binary data here) endstream startxref 187067 %%EOF 145 0 obj <>/DA(/Helv 9 Tf 0 g)/DR<< /Encoding<< /PDFDocEncoding 129 0 R >>/Font<>>> /F 4/FT/Tx /Ff 8392704/P 1 0 R /Rect[ 166.2 228.484 538.2 241.084]/Subtype/Widget/T(undefined_19) /TU(undefined)/Type/Annot /V(I am hoping to replace my Fall work study award with student loans. )>> endobj
In theory every object starts with objectNumber 0 obj (the 0 is the generation number) and ends with endobj. Except, of course, object 938, which skips the endobj and is followed immediately by a startxref pointer and a spurious %%EOF (End of File) marker. It is not the end of the file, as you can see, because object 145 follows it. So you cannot assume that there will be an endobj marker at the end of each object. Acrobat does not! My code was assuming there would always be an endobj, and hanging when there was none.
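To make the idea concrete, here is a minimal sketch in Python of a scanner that does not rely on endobj. It is not my actual parser code. The names `find_object_end` and `split_objects` are made up for illustration, and the choice of fallback markers (startxref, %%EOF, or the next object header) is an assumption based on the example above. It also ignores the fact that binary stream data could contain these tokens by accident, which a real parser would have to handle, for example by honouring /Length.

```python
import re

# Tokens that may terminate an object body. Only `endobj` is the expected
# one; the others are fallbacks (an assumption of this sketch) for files
# that leave it out.
_END_MARKERS = re.compile(rb"endobj|startxref|%%EOF|\d+\s+\d+\s+obj\b")

# Object header of the form `objectNumber generationNumber obj`.
_HEADER = re.compile(rb"(\d+)\s+(\d+)\s+obj\b")


def find_object_end(data: bytes, body_start: int) -> int:
    """Return the offset at which the object body starting at body_start ends.

    If no marker is found at all, fall back to the end of the data so the
    caller never waits forever for an `endobj` that is not there.
    """
    match = _END_MARKERS.search(data, body_start)
    return match.start() if match else len(data)


def split_objects(data: bytes):
    """Yield (object_number, generation, body) for each object header found."""
    pos = 0
    while True:
        header = _HEADER.search(data, pos)
        if header is None:
            break
        end = find_object_end(data, header.end())
        body = data[header.end():end].strip()
        yield int(header.group(1)), int(header.group(2)), body
        # `end` is always past the start of the header we just consumed,
        # so the scan position strictly advances on every pass.
        pos = end


if __name__ == "__main__":
    sample = (b"1 0 obj <</A 1>> endobj "
              b"2 0 obj <</B 2>> startxref 123 %%EOF "
              b"3 0 obj <</C 3>> endobj")
    for number, generation, body in split_objects(sample):
        print(number, generation, body)
```

The point of the design is that every pass through the loop moves forward, so a missing endobj costs you a slightly less precise object boundary instead of a hang.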
Imagine if XML markup behaved like this! And that is why it is ‘challenging’ to write a decent PDF parser…