Make your own PDF file – Part 2: Structure of a PDF file

Before we can start hacking together our own simple PDF file, a quick look at the high-level structure of a PDF is in order. The file is broken down into four parts: the header, the body, the cross-reference table and the trailer. The first two are pretty straightforward. First there is the header section, whose only requirement is to have the version number in it:

%PDF-1.3

If you open up any old PDF document in a text editor you will see one at the top. You might also see a line or two of % symbols followed by some nonsense (usually a few non-ASCII bytes that hint to other programs that the file contains binary data). Normally % means the rest of the text on that line is ignored (i.e. it’s a comment), but some things, like %PDF-1.3, mean something to a PDF reader. After the header section you’ll see a list of objects. That’s the body section, which contains all the objects I spoke about in Part One.
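To make the header rule concrete, here is a minimal Python sketch that pulls the version number out of the header. The function name pdf_version is made up for this example, and it assumes the header sits at the very start of the data.

```python
import re

def pdf_version(data: bytes) -> str:
    """Return the version string from a PDF header such as b'%PDF-1.3'."""
    # re.match only matches at the beginning of the data
    match = re.match(rb"%PDF-(\d+\.\d+)", data)
    if match is None:
        raise ValueError("no %PDF-x.y header at the start of the data")
    return match.group(1).decode("ascii")

# A header line followed by a comment line of binary "nonsense"
print(pdf_version(b"%PDF-1.3\n%\xe2\xe3\xcf\xd3\n"))
```

On a real file you would pass in the first few bytes read in binary mode.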

The next section requires a bit more of a long-winded explanation. It starts with a table that follows the xref keyword. If you still have your text editor open you can search for xref and see that it’s preceded by an endobj; that’s the end of the body section.

Here’s an example:

xref
0 4
0000000003 65535 f
0000017496 00000 n
0000000721 00003 n
0000000000 00007 f

The table is mainly a list of the addresses of each object in the body section. This is so objects can be accessed quickly just by using their ID number instead of traversing the object tree.

The first number on the line after xref (0) says that this list starts at object 0. You won’t find a 0 0 obj (Object 0) in the PDF file because it’s a special sort of entry that represents the head of a linked list. That’s why the first line in the list (which represents object 0) has an f at the end. The lines with n at the end refer to the objects you find in the body section. They go up in sequential order, so object 1 is the second one on the list, object 2 is the third, etc. The second number (4) is a count of how many entries are in this table (called the Cross Reference Table). Note that as a PDF gets updated there could be a number of these lists in this section of the file.

The entries are all formatted the same way, but some of the blocks of numbers mean different things depending on the entry type. Let’s get entry 0 out of the way first. The first block of numbers in an entry ending with an f points to the next node in the linked list. You can see it says 3, and if you look at the entry for object 3 you’ll see that it also ends with an f. The last node in the linked list points back to object 0, like entry 3 in this example. The point of this list is to reuse the numbers of objects that were removed when the PDF file was edited. If this PDF was edited again, the list could be examined and you’d know that the number 3 can be used for a new object ID.
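The linked list can be followed with a few lines of Python. This is only a sketch: the entries list is the sample table typed in by hand, and free_object_numbers is a name invented for this example.

```python
# The sample table, one tuple per object number:
# (first block of numbers, second block of numbers, type flag)
entries = [
    (3, 65535, "f"),   # object 0: head of the free list, next free object is 3
    (17496, 0, "n"),   # object 1: in use
    (721, 3, "n"),     # object 2: in use
    (0, 7, "f"),       # object 3: free, points back to object 0
]

def free_object_numbers(entries):
    """Follow the free list from object 0 and return the reusable object numbers."""
    free = []
    seen = {0}
    current = entries[0][0]  # entry 0 is the head; its first block is the next free object
    while current != 0:      # pointing back at object 0 marks the end of the list
        if current in seen or current >= len(entries) or entries[current][2] != "f":
            raise ValueError("free list is corrupt")
        seen.add(current)
        free.append(current)
        current = entries[current][0]
    return free

print(free_object_numbers(entries))  # [3]
```

The seen set is there so a damaged file with a loop in its free list raises an error instead of spinning forever.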

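The second block of numbers in object 0 is a special number (65535) that just means that this entry is the head of the list (therefore it can’t be re-used as a normal object). In all other cases, including entries ending in n, the second block of numbers is a generation number, which goes up each time an object number is recycled. In our example object 2 has been recycled 3 times (so it shows up in the body as 2 3 obj rather than 2 0 obj), and object 3 will be given generation 7 the next time its number is re-used.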

An entry ending in n refers to an object that is in use. The second block of numbers is still the generation number, but the first block represents the number of bytes (in decimal) from the beginning of the PDF file to where the object appears in the body.
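Putting the two entry types together, here is a rough Python sketch that reads the sample table and prints what each line means. The name parse_xref is hypothetical, and the sketch assumes a table with a single run of entries like the one shown here.

```python
def parse_xref(text: str):
    """Parse a single-run xref table into {object number: (first block, generation, flag)}."""
    lines = [line.strip() for line in text.strip().splitlines()]
    if lines[0] != "xref":
        raise ValueError("expected the xref keyword")
    # The line after xref holds the first object number and the entry count
    start, count = (int(n) for n in lines[1].split())
    table = {}
    for number, line in enumerate(lines[2:2 + count], start=start):
        first, generation, flag = line.split()
        if flag not in ("n", "f"):
            raise ValueError(f"unknown entry type {flag!r}")
        table[number] = (int(first), int(generation), flag)
    return table

sample = """xref
0 4
0000000003 65535 f
0000017496 00000 n
0000000721 00003 n
0000000000 00007 f
"""

for number, (first, generation, flag) in parse_xref(sample).items():
    if flag == "n":
        print(f"object {number}: in use at byte offset {first}, generation {generation}")
    else:
        print(f"object {number}: free, next free object is {first}, generation {generation}")
```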

After that lot is the trailer section, which contains a bunch of information about the PDF file. It’s a dictionary (remember the pointy brackets?) which at a bare minimum must contain the number of entries in the cross-reference table (/Size 4, which counts object 0 too) and a reference to the root object of the document. After that is the keyword startxref, followed by the number of bytes from the start of the file to the xref keyword. The %%EOF marks the end of the file.

trailer
<< /Size 4 /Root 1 0 R>>
startxref
1205
%%EOF
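Because startxref sits at the very end, a reader can find the table by searching backwards from the end of the file. Here is a minimal sketch of that idea in Python; find_startxref is a made-up name, and the example only feeds it the last few lines shown above rather than a whole file.

```python
def find_startxref(data: bytes) -> int:
    """Return the byte offset recorded after the last startxref keyword."""
    keyword = b"startxref"
    position = data.rfind(keyword)  # search from the end of the data
    if position == -1:
        raise ValueError("no startxref keyword found")
    tail = data[position + len(keyword):].split()
    if not tail or not tail[0].isdigit():
        raise ValueError("startxref is not followed by a number")
    return int(tail[0])

ending = b"trailer\n<< /Size 4 /Root 1 0 R>>\nstartxref\n1205\n%%EOF\n"
print(find_startxref(ending))  # 1205
```

With a complete file loaded into data, the bytes starting at that offset should begin with the xref keyword.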

Next time: Create your own non-working PDF file!