Where do your PDF objects start in a PDF file?

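In theory, this is a really easy question to answer for PDF files that do not use compressed objects. There is a cross-reference (xref) table for the PDF objects in the file, giving you the byte offset of the start of each PDF object. The line after the xref keyword gives the number of the first object in the table and the number of entries that follow, so 6153 30 below means 30 entries, starting with object 6153. So if I have a cross-reference table which looks like this...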

xref
6153 30
0000000016 00000 n
0000004052 00000 n
0000004206 00000 n
0000004691 00000 n
0000004730 00000 n
0000004845 00000 n
0000005665 00000 n
0000006418 00000 n
0000007221 00000 n
0000007946 00000 n
0000008730 00000 n
0000009275 00000 n
0000009768 00000 n
0000010025 00000 n
0000010605 00000 n
0000010875 00000 n
0000011349 00000 n
0000012081 00000 n
0000012618 00000 n
0000012880 00000 n
0000013448 00000 n
0000014283 00000 n
0000014904 00000 n
0000015265 00000 n
0000044743 00000 n
0000051663 00000 n
0000062772 00000 n
0000066596 00000 n
0000003791 00000 n
0000000914 00000 n

trailer
<<908F0712C6BF4DCDBB6825BD22FB3D57>]/Prev 53534925/XRefStm 3791>>
startxref
0
%%EOF

I would find, 16 bytes from the start of the file (the offset in the first entry, which belongs to object 6153),

6153 0 obj
<>
endobj

and 4052 bytes from the start of the file.

6154 0 obj
<>/Metadata 2037 0 R/Pages 2023 0 R/StructTreeRoot 2039 0 R/Type/Catalog/ViewerPreferences<>>>
endobj
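The lookup described above can be sketched in a few lines of Python. This is only an illustration, not a full parser: the function name parse_xref_section is made up for this example, and it assumes a classic (uncompressed) cross-reference table with well-formed entries.

```python
def parse_xref_section(text):
    """Map object number -> byte offset for the in-use entries of a
    classic xref table. Hypothetical helper, for illustration only."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines or lines[0] != "xref":
        raise ValueError("expected the 'xref' keyword")
    offsets = {}
    i = 1
    while i < len(lines):
        parts = lines[i].split()
        # A subsection header is two integers: first object number, entry count.
        if len(parts) != 2 or not all(p.isdigit() for p in parts):
            break  # reached 'trailer' (or something unexpected)
        first, count = int(parts[0]), int(parts[1])
        for k in range(count):
            offset, generation, kind = lines[i + 1 + k].split()
            if kind == "n":  # 'n' = in use, 'f' = free
                offsets[first + k] = int(offset)
        i += 1 + count
    return offsets


# The first three entries of the table shown above.
sample = """xref
6153 3
0000000016 00000 n
0000004052 00000 n
0000004206 00000 n
trailer
"""
print(parse_xref_section(sample))  # {6153: 16, 6154: 4052, 6155: 4206}
```

The object number is never stored in an entry. It comes from the position of the entry within its subsection.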

Things became more complicated with compressed objects, where objects can be embedded inside binary streams (allowing you to make the file smaller). So object 6154 might be embedded in compressed object data attached to object 2086. This is why you cannot see all the PDF objects inside PDF files if you open these PDF files in a text editor.
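As a rough sketch of what a reader has to do with such a compressed stream: once the stream data has been decompressed, it starts with pairs of integers (object number, then offset relative to the first object), followed by the objects themselves. The example below assumes the stream uses Flate (zlib) compression, that the offsets are in increasing order, and that the values of /N and /First have already been read from the stream dictionary. The name split_object_stream is invented for this example.

```python
import zlib


def split_object_stream(raw, n, first):
    """Split decompressed object stream data into {object number: bytes}.
    raw   - the compressed stream bytes (Flate assumed)
    n     - value of /N (number of objects in the stream)
    first - value of /First (offset of the first object in the decoded data)
    Hypothetical helper, for illustration only."""
    data = zlib.decompress(raw)
    header = data[:first].split()
    pairs = [(int(header[2 * i]), int(header[2 * i + 1])) for i in range(n)]
    objects = {}
    for i, (num, off) in enumerate(pairs):
        # Each object runs up to the start of the next one (or the end of the data).
        end = pairs[i + 1][1] if i + 1 < n else len(data) - first
        objects[num] = data[first + off:first + end].strip()
    return objects


# Build a toy stream holding two objects, then read it back.
obj_a = b"<</Type/Catalog>>"
obj_b = b"[1 2 3]"
index = b"6154 0 6155 %d " % (len(obj_a) + 1)
toy = zlib.compress(index + obj_a + b" " + obj_b)
print(split_object_stream(toy, 2, len(index)))
```

Note that the objects inside the stream have no "6154 0 obj" header and no endobj, so the index at the front is the only way to tell them apart.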

Where it gets very messy, however, is that Adobe Acrobat does not enforce the rules about where an object starts (and generally adjusts to allow for errors). So you could find that

at 16 bytes from the start of the file, you see

<>
endobj

and 4052 bytes from the start of the file.

<>/Metadata 2037 0 R/Pages 2023 0 R/StructTreeRoot 2039 0 R/Type/Catalog/ViewerPreferences<>>>
endobj

I have recently seen several tools that do this and, because the files work in Adobe Acrobat, their authors assume they are writing out ‘correct’ PDF files. This makes life very hard for us developers!
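One way a reader can cope with this is to check what is actually at the offset and, if the expected header is not there, look nearby. The sketch below is an assumption about how that could be done, not a description of what Adobe Acrobat does. The function name find_object_start and the 64-byte search window are arbitrary choices for this example, and it only handles an offset that has drifted. If the header is missing from the file altogether, it returns None and the caller would have to parse the object body directly.

```python
import re

# "<object number> <generation> obj", not preceded by another digit.
OBJ_HEADER = re.compile(rb"(?<!\d)(\d+)\s+(\d+)\s+obj\b")


def find_object_start(data, obj_num, offset, window=64):
    """Return the offset of the 'obj_num G obj' header at or near `offset`,
    or None if no such header is found within `window` bytes."""
    m = OBJ_HEADER.match(data, offset)
    if m and int(m.group(1)) == obj_num:
        return offset  # the xref entry was correct
    # Fall back to scanning a small region around the claimed offset.
    start = max(0, offset - window)
    end = min(len(data), offset + window)
    best = None
    for m in OBJ_HEADER.finditer(data, start, end):
        if int(m.group(1)) != obj_num:
            continue
        if best is None or abs(m.start() - offset) < abs(best - offset):
            best = m.start()
    return best


pdf = b"%PDF-1.7\n6153 0 obj\n<<>>\nendobj\n"
print(find_object_start(pdf, 6153, 9))   # 9, the offset is exact
print(find_object_start(pdf, 6153, 20))  # 9, the offset pointed past the header
```

How large to make the window is exactly the kind of judgement call this article is about: too small and sloppy files fail, too large and you risk picking up the wrong object.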

As with a lot of things in the PDF file format, there are clearly laid-out rules, but they are not enforced. So this is where your PDF objects should start, but you need to know that the values may not be totally correct. How much error do you allow for with the PDF file specification?