Mark Stephens Mark has been working with Java and PDF since 1999 and is a big NetBeans fan. He enjoys speaking at conferences. He has an MA in Medieval History and a passion for reading.


Where do your PDF objects start in a PDF file?


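In theory, this is a really easy question to answer for non-compressed PDF files. A cross-reference (xref) table lists the PDF objects in the file and gives, for each one, the byte offset of the start of the object from the start of the file. The line after the xref keyword holds the number of the first object in the table and the number of entries that follow. So if I have a cross-reference table which looks like this...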

xref
6153 30
0000000016 00000 n
0000004052 00000 n
0000004206 00000 n
0000004691 00000 n
0000004730 00000 n
0000004845 00000 n
0000005665 00000 n
0000006418 00000 n
0000007221 00000 n
0000007946 00000 n
0000008730 00000 n
0000009275 00000 n
0000009768 00000 n
0000010025 00000 n
0000010605 00000 n
0000010875 00000 n
0000011349 00000 n
0000012081 00000 n
0000012618 00000 n
0000012880 00000 n
0000013448 00000 n
0000014283 00000 n
0000014904 00000 n
0000015265 00000 n
0000044743 00000 n
0000051663 00000 n
0000062772 00000 n
0000066596 00000 n
0000003791 00000 n
0000000914 00000 n

trailer
<<908F0712C6BF4DCDBB6825BD22FB3D57>]/Prev 53534925/XRefStm 3791>>
startxref
0
%%EOF

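I would find this 16 bytes from the start of the file (the offset in the first entry; the 30 in the line "6153 30" is the number of entries in the table, not an offset)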

6153 0 obj
<>
endobj

and 4052 bytes from the start of the file.

6154 0 obj
<>/Metadata 2037 0 R/Pages 2023 0 R/StructTreeRoot 2039 0 R/Type/Catalog/ViewerPreferences<>>>
endobj
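The lookup described above can be sketched in a few lines of Python. This is only an illustration, not how any particular PDF library does it: the function name parse_xref_table is made up for this example, the sample is a shortened copy of the table above (first three entries only), and it handles just the classic text table, not cross-reference streams.

```python
import re

def parse_xref_table(text):
    """Map object number -> (byte offset, generation) for in-use entries.

    Hypothetical helper for illustration; handles only the classic
    text-based xref table.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines or lines[0] != "xref":
        raise ValueError("expected the 'xref' keyword")
    offsets = {}
    i = 1
    while i < len(lines):
        # Subsection header: "<first object number> <entry count>"
        header = re.fullmatch(r"(\d+) (\d+)", lines[i])
        if header is None:
            break  # reached 'trailer' (or something that is not a subsection)
        first, count = int(header.group(1)), int(header.group(2))
        i += 1
        if i + count > len(lines):
            raise ValueError("table is shorter than its header claims")
        for n in range(count):
            # Entry: 10-digit offset, 5-digit generation, 'n' (in use) or 'f' (free)
            entry = re.fullmatch(r"(\d{10}) (\d{5}) ([nf])", lines[i + n])
            if entry is None:
                raise ValueError(f"malformed entry: {lines[i + n]!r}")
            if entry.group(3) == "n":
                offsets[first + n] = (int(entry.group(1)), int(entry.group(2)))
        i += count
    return offsets

# Shortened copy of the table above: first three entries only.
sample = """xref
6153 3
0000000016 00000 n
0000004052 00000 n
0000004206 00000 n
trailer
"""
table = parse_xref_table(sample)
print(table[6153], table[6154])
```

The entries carry no object numbers of their own. Entry k in a subsection belongs to object (first + k), which is why the header line matters.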

Things became more complicated with compressed objects (object streams, added in PDF 1.5), where objects can be embedded inside compressed binary streams to make the file smaller. So object 6154 might be stored inside the compressed stream data belonging to object 2086, and its cross-reference entry then points to that containing object rather than to a byte offset. This is why you cannot see all the PDF objects if you open these PDF files in a text editor.
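To make the idea of embedded objects more concrete, here is a minimal Python sketch of unpacking an object stream. It assumes the stream uses plain Flate (zlib) compression with no predictor, and that the /N (object count) and /First (offset of the first object) values from the stream dictionary are passed in by the caller. The function name and the two sample object bodies are invented for this example.

```python
import zlib

def unpack_object_stream(compressed, n, first):
    """Return {object number: raw bytes} for the objects in an object stream.

    Hypothetical helper: assumes Flate compression without a predictor.
    n and first are the /N and /First values from the stream dictionary.
    """
    data = zlib.decompress(compressed)
    # The stream begins with n pairs of integers:
    # (object number, offset relative to 'first').
    numbers = [int(tok) for tok in data[:first].split()]
    if len(numbers) < 2 * n:
        raise ValueError("index section is shorter than /N claims")
    pairs = [(numbers[2 * k], numbers[2 * k + 1]) for k in range(n)]
    objects = {}
    for k, (obj_num, rel) in enumerate(pairs):
        start = first + rel
        # Each object runs up to the start of the next one (or to the end).
        end = first + pairs[k + 1][1] if k + 1 < n else len(data)
        objects[obj_num] = data[start:end].strip()
    return objects

# Build a tiny object stream holding two made-up objects, 6154 and 6155.
body_a = b"<</Type/Catalog/Pages 2023 0 R>>"
body_b = b"<</Type/Pages/Count 0/Kids[]>>"
index = b"6154 0 6155 %d " % (len(body_a) + 1)
payload = zlib.compress(index + body_a + b" " + body_b)

objects = unpack_object_stream(payload, n=2, first=len(index))
print(objects[6154])
```

Objects stored this way have no "obj" and "endobj" keywords around them, so even after decompression they do not look like the objects shown earlier.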

Where it gets very messy, however, is that Adobe Acrobat does not enforce the rules about where an object starts (and generally adjusts to allow for errors). So you could find that the offset skips the object header, and

at the first offset in the table, you see

<>
endobj

and 4052 bytes from the start of the file.

<>/Metadata 2037 0 R/Pages 2023 0 R/StructTreeRoot 2039 0 R/Type/Catalog/ViewerPreferences<>>>
endobj

I have recently seen several tools that do this, and because the files work in Adobe Acrobat, their authors assume they are writing out ‘correct’ PDF files. This makes life very hard for us developers!

As with a lot of things in the PDF file format, there are clearly laid-out rules, but they are not enforced. So this is where your PDF objects should start, but you need to know that the values may not be totally correct. How much error do you allow for with the PDF file specification?
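One way to allow for some error is to check whether the expected object header really sits at the stated offset, and to look nearby if it does not. The Python sketch below is an assumption about how such a check could work, not a description of what Acrobat or any library actually does. The function name, the status strings and the 64-byte search window are arbitrary choices for illustration.

```python
import re

def locate_object(data, obj_num, offset, window=64):
    """Return (position, status) for an object's 'N G obj' header.

    Hypothetical tolerant lookup: trust the xref offset if the header is
    there, otherwise search within 'window' bytes either side of it.
    """
    # (?<!\d) stops object 154 from matching inside "6154 0 obj".
    header = re.compile(rb"(?<!\d)%d\s+\d+\s+obj\b" % obj_num)
    if header.match(data, offset):
        return offset, "exact"
    start = max(0, offset - window)
    nearby = [m.start() for m in header.finditer(data, start, offset + window)]
    if nearby:
        # Pick the candidate closest to where the table said it would be.
        return min(nearby, key=lambda pos: abs(pos - offset)), "repaired"
    # No header nearby: the caller must decide whether to parse the
    # object body directly at the stated offset.
    return offset, "missing header"

prefix = b"%PDF-1.7\n%\xe2\xe3\xcf\xd3\n"
data = prefix + b"6153 0 obj\n<</Linearized 1>>\nendobj\n"

good = len(prefix)       # offset of "6153 0 obj"
bad = data.index(b"<<")  # offset pointing past the header
print(locate_object(data, 6153, good))
print(locate_object(data, 6153, bad))
```

The window size is the real design decision here. A small window keeps the parser strict and fast, while a large one accepts more broken files at the risk of latching onto the wrong object.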


