How to read a PDF file

To show how to read a PDF file, I have created a step-by-step guide to reading a very ‘simple’ PDF (it just shows Hello World). You can download the PDF here.

Read the last 1024 bytes of the file
startxref
6338
%%EOF
Near the end of the file there is the keyword startxref followed by a number. This is the byte offset, counted from the start of the file, of the Cross Reference table.
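As a rough illustration of this step, here is a minimal Python sketch. The function name find_startxref and the stand-in bytes are made up for this example; a real caller would pass in the contents of the PDF file read in binary mode.

```python
import re

def find_startxref(data: bytes, tail_size: int = 1024) -> int:
    """Return the number that follows the last 'startxref' keyword."""
    # Only look at the end of the file, as described above.
    tail = data[-tail_size:]
    matches = re.findall(rb"startxref\s+(\d+)\s+%%EOF", tail)
    if not matches:
        raise ValueError("startxref not found in the last %d bytes" % tail_size)
    # If the keyword appears more than once, the last one wins.
    return int(matches[-1])

# Stand-in for the end of the sample file (hypothetical bytes); a real
# caller would use open(path, "rb").read() instead.
sample_tail = b"...rest of file...\nstartxref\n6338\n%%EOF\n"
print(find_startxref(sample_tail))  # 6338
```

Taking the last match matters because a file can contain more than one startxref if it has been updated after it was first written.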
Read the object offsets
The Cross Reference table tells us that there are 16 objects: the line after xref gives the first object number (0) and the number of entries (16). Each entry is a 10-digit byte offset, a 5-digit generation number and a flag (n for in use, f for free). Object 0 is unused and the final object is 15. Object 1 is found at byte offset 285, Object 2 at 3440, etc. You can use this information to read in any object.
xref
0 16
0000000000 65535 f
0000000285 00000 n
0000003440 00000 n
0000000022 00000 n
0000000395 00000 n
0000003204 00000 n
0000003673 00000 n
0000003631 00000 n
0000000492 00000 n
0000003292 00000 n
0000003239 00000 n
0000003369 00000 n
0000003529 00000 n
0000004040 00000 n
0000004306 00000 n
0000006138 00000 n
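The table above can be turned into a lookup from object number to byte offset. This is only a sketch: parse_xref_section is a made-up name, and it assumes a single subsection like the one in this file (real files can contain several).

```python
def parse_xref_section(text: str) -> dict:
    """Map object number -> byte offset for entries that are in use ('n')."""
    tokens = text.split()
    if tokens[0] != "xref":
        raise ValueError("not a cross-reference table")
    first_obj, count = int(tokens[1]), int(tokens[2])
    offsets = {}
    for i in range(count):
        # Each entry is three tokens: offset, generation number, flag.
        offset, generation, flag = tokens[3 + 3 * i : 6 + 3 * i]
        if flag == "n":
            offsets[first_obj + i] = int(offset)
    return offsets

xref_text = """xref
0 16
0000000000 65535 f
0000000285 00000 n
0000003440 00000 n
0000000022 00000 n
0000000395 00000 n
0000003204 00000 n
0000003673 00000 n
0000003631 00000 n
0000000492 00000 n
0000003292 00000 n
0000003239 00000 n
0000003369 00000 n
0000003529 00000 n
0000004040 00000 n
0000004306 00000 n
0000006138 00000 n
"""

offsets = parse_xref_section(xref_text)
print(offsets[1], offsets[2], offsets[12])  # 285 3440 3529
```

Object 0 is skipped because it is marked f (free), so the dictionary holds the 15 objects that actually exist.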
Read the trailer
The trailer is immediately after the cross reference table. It is a Dictionary so all keys start with a / (/Size, /Root, /Info, /ID). Values can be numbers, references to other objects (written like 12 0 R and found using the offsets above) or other data types such as strings and arrays.
The actual PDF data structure is a tree of objects (not a binary tree, as a node can have any number of children), which we parse to access any part of the document. It starts with the Root Node (Object 12).
The PDF file also has an information Object (Object 15) which contains metadata about the file (information about Author, tools used to create, date, etc).
ID is two strings (stored as Hex values) which you will need if the file is encrypted.
trailer <</Size 16 /Root 12 0 R /Info 15 0 R /ID [ <ee0351b21bd521fdd345ea49d40844bb> <ee0351b21bd521fdd345ea49d40844bb> ] >>
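To illustrate, the values we need can be pulled out of this trailer with a few regular expressions. This is a simplified sketch, not a general Dictionary parser: parse_trailer is a made-up name, and it only handles the value types that appear in this example.

```python
import re

def parse_trailer(text: str) -> dict:
    """Extract /Size, the object references and the /ID strings from a trailer."""
    # References look like "/Root 12 0 R": key, object number, generation, R.
    refs = {key: int(num)
            for key, num in re.findall(r"/(\w+)\s+(\d+)\s+\d+\s+R\b", text)}
    size = re.search(r"/Size\s+(\d+)", text)
    # Hex strings are written between single angle brackets.
    ids = [bytes.fromhex(h) for h in re.findall(r"<([0-9A-Fa-f]+)>", text)]
    return {"Size": int(size.group(1)), "refs": refs, "ID": ids}

trailer_text = ("trailer <</Size 16 /Root 12 0 R /Info 15 0 R /ID [ "
                "<ee0351b21bd521fdd345ea49d40844bb> "
                "<ee0351b21bd521fdd345ea49d40844bb> ] >>")

trailer = parse_trailer(trailer_text)
print(trailer["Size"])   # 16
print(trailer["refs"])   # {'Root': 12, 'Info': 15}
print(len(trailer["ID"][0]))  # 16
```

The object numbers in refs can then be looked up in the cross reference table to find where the Root and Info objects start.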
Read the Root Object
The Root object is object 12, which starts at byte offset 3529 from the start of the file. This is another Dictionary Object (it is the Catalog). It tells us that we can find the Page tree defined in Object 2. This PDF also has some structure information, defined in Object 10, which we would use if we wanted to extract structured content.
12 0 obj << /Type /Catalog /Pages 2 0 R /MarkInfo << /Marked true >> /StructTreeRoot 10 0 R >> endobj
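Reading an object at a known offset and following one of its references could look something like the sketch below. The read_object helper and the tiny in-memory file are invented for this example, so the offset here is not the 3529 of the real file. Searching for endobj like this is a shortcut that would break if those bytes happened to appear inside binary stream data.

```python
import re

OBJ_HEADER = re.compile(rb"(\d+)\s+(\d+)\s+obj\b")

def read_object(data: bytes, offset: int) -> tuple:
    """Return (object number, generation, raw body) for the object at offset."""
    m = OBJ_HEADER.match(data, offset)
    if m is None:
        raise ValueError("no object starts at offset %d" % offset)
    end = data.find(b"endobj", m.end())
    if end == -1:
        raise ValueError("endobj not found")
    return int(m.group(1)), int(m.group(2)), data[m.end():end].strip()

# Tiny stand-in file: a header line followed by the Catalog object.
header = b"%PDF-1.4\n"
data = header + (b"12 0 obj << /Type /Catalog /Pages 2 0 R "
                 b"/MarkInfo << /Marked true >> /StructTreeRoot 10 0 R >> endobj\n")

number, generation, body = read_object(data, len(header))
pages_ref = int(re.search(rb"/Pages\s+(\d+)\s+\d+\s+R", body).group(1))
print(number, generation, pages_ref)  # 12 0 2
```

The same function can then be called again with the offset of object 2 to carry on down the tree.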
Read the Pages Object
The Pages object is object 2, which gives us the page dimensions (those values are A4). It tells us there is one Page, defined in Object 1.
2 0 obj << /Type /Pages /MediaBox [0 0 595.28 841.89] /Count 1 /Kids [ 1 0 R ] >> endobj
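To check that the MediaBox values really are A4, we can convert them to millimetres. This sketch assumes the default PDF unit of 1/72 inch (the file does not set anything different); the function name points_to_mm is made up for this example.

```python
POINTS_PER_INCH = 72.0
MM_PER_INCH = 25.4

def points_to_mm(points: float) -> float:
    """Convert a length in PDF units (1/72 inch by default) to millimetres."""
    return points / POINTS_PER_INCH * MM_PER_INCH

# MediaBox is [left, bottom, right, top].
media_box = [0, 0, 595.28, 841.89]
width_mm = points_to_mm(media_box[2] - media_box[0])
height_mm = points_to_mm(media_box[3] - media_box[1])
print(f"{width_mm:.1f} x {height_mm:.1f} mm")  # 210.0 x 297.0 mm
```

A4 is 210 x 297 mm, so the numbers match.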
Read the Page Object
The Page object is object 1 and also gives us the page dimensions (those values are A4). If it did not, the page would inherit the value defined in the parent Pages Object. The page is drawn using the stream of PostScript-like drawing commands defined in Contents (Object 3) with the fonts, colors and images from Resources (Object 4).
The data in object 3 is compressed (the /Filter /FlateDecode entry tells us it uses Flate compression), so it needs to be read as binary data and decompressed.
Explaining how all this data is parsed is way beyond the scope of this tutorial!
3 0 obj
<< /Filter /FlateDecode /Length 191 >>
stream
(191 bytes of compressed binary data, not printable as text)
endstream
endobj
1 0 obj
<< /Type /Page /Parent 2 0 R /Resources 4 0 R /Contents 3 0 R /MediaBox [0 0 595.28 841.89] >>
endobj
4 0 obj
<< /ProcSet [ /PDF /Text ] /ColorSpace << /Cs1 5 0 R >> /Font << /TT1 6 0 R >> >>
endobj
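The decompression step on its own is short, because /FlateDecode is the same compression that Python's zlib module handles. The sketch below builds its own stand-in object, since the real bytes of object 3 cannot be reproduced as text. The content stream in it is made up for illustration and is not what the sample file contains, and decode_flate_stream is a hypothetical helper that assumes /Length is a plain number.

```python
import re
import zlib

def decode_flate_stream(obj: bytes) -> bytes:
    """Decompress the stream data of an object that uses /FlateDecode."""
    length = int(re.search(rb"/Length\s+(\d+)", obj).group(1))
    start = obj.index(b"stream") + len(b"stream")
    # The stream keyword is followed by a line ending before the data starts.
    if obj[start:start + 2] == b"\r\n":
        start += 2
    elif obj[start:start + 1] == b"\n":
        start += 1
    return zlib.decompress(obj[start:start + length])

# Made-up drawing commands, compressed to stand in for the real object 3.
content = b"BT /TT1 12 Tf 72 720 Td (Hello World) Tj ET"
compressed = zlib.compress(content)
obj = (b"3 0 obj\n<< /Filter /FlateDecode /Length %d >>\nstream\n" % len(compressed)
       + compressed + b"\nendstream\nendobj\n")

print(decode_flate_stream(obj).decode("ascii"))
```

Using /Length to decide how many bytes to read is safer than searching for endstream, because the compressed data can contain any byte values.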

Final comments

PDF files have a very complex structure (we have been writing our PDF parser since 2001 and are still tweaking it). So our advice is always to use a library (there are lots of Open Source and commercial ones out there, including our JPedal library) for anything other than basic access. If you want to view or rasterize a PDF file, you will definitely need a third-party library.