Make your own PDF file – Part 2: Structure of a PDF file

This article is part of a 7 part series to create a hello world PDF. Click here to visit the series index.

Before we can start hacking together our own simple PDF file, a quick look at the high-level structure of a PDF is in order. The file is broken down into four parts: the header, the body, the cross reference table and the trailer. The first two are pretty straightforward. First there is the header section, whose only requirement is to have the version number in it:

%PDF-1.3

If you open up any old PDF document in a text editor you will see a line like this at the top. You might also see a line or two of % symbols followed by some nonsense. Normally % means the rest of the text on that line is ignored (i.e. it is a comment), but some things, like %PDF-1.3, mean something to a PDF reader. You will see that a list of objects follows the header section. That’s the body section, which contains all the objects I spoke about in Part One.
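As a rough illustration of the header rule, here is a minimal Python sketch that pulls the version number out of the first bytes of a file. The function name pdf_version and the sample bytes are made up for this example, and the check is deliberately simplistic: it only looks for %PDF-x.y at the very start of the data.

```python
import re

def pdf_version(data: bytes):
    """Return the version from a '%PDF-x.y' header, or None if it is missing.

    Hypothetical helper for illustration; it only checks the very start
    of the data.
    """
    match = re.match(rb"%PDF-(\d+\.\d+)", data)
    return match.group(1).decode("ascii") if match else None

# Made-up sample: a header line, a comment line of "nonsense" bytes,
# then the start of the body section.
sample = b"%PDF-1.3\n%\xe2\xe3\xcf\xd3\n1 0 obj\n"
print(pdf_version(sample))
```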

The next section requires a bit more of a long-winded explanation. It starts with a table that follows the xref keyword. If you still have your text editor open you can do a search for xref and see it’s preceded by an endobj; that’s the end of the body section.

Here’s an example:

xref
0 4
0000000003 65535 f
0000017496 00000 n
0000000721 00003 n
0000000000 00007 f

The table is mainly a list of the addresses of each object in the body section. This is so objects can be accessed quickly just by using their ID number instead of traversing the object tree.

The first number after xref says that this list starts at object 0. You won’t find a 0 0 obj (Object 0) in the PDF file because it’s a special sort of entry that represents the head of a linked list. That’s why the first line in the list (which represents object 0) has an f at the end. The lines with n at the end refer to the objects you find in the body section. They go up in sequential order, so object 1 is the second one on the list, object 2 is the third, and so on. The second number after xref is a count of how many entries (4) are in this table (called the Cross Reference Table). Note that as a PDF gets updated there could be a number of these lists in this section of the file.

The entries are all formatted the same way, but some of the blocks of numbers mean different things depending on the entry. Let’s get entry 0 out of the way first. The first block of numbers in an entry ending with an f points to the next node in the linked list. You can see it says 3, and if you look at the entry for object 3 you’ll see that it also ends with an f. The last node in the linked list points back to object 0, like entry 3 in this example. The point of this list is to reuse the numbers of objects that were removed when the PDF file was edited. If this PDF was edited again, the list could be examined and you’d know that the number 3 can be used for a new object ID.
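To make the linked list a little more concrete, here is a small Python sketch that follows it through the example table. The entries dictionary and the function name free_object_numbers are invented for this illustration; each value is simply the three fields of one line of the table.

```python
def free_object_numbers(entries):
    """Follow the free list from object 0 until it points back to 0.

    entries maps an object number to (first block, second block, flag).
    Hypothetical helper for illustration only.
    """
    free = []
    current = entries[0][0]  # entry 0 is the head of the list
    # The second check is a guard against a damaged table that loops.
    while current != 0 and current not in free:
        free.append(current)
        current = entries[current][0]  # first block = next free object
    return free

# The four entries from the example table above.
entries = {
    0: (3, 65535, "f"),
    1: (17496, 0, "n"),
    2: (721, 3, "n"),
    3: (0, 7, "f"),
}
print(free_object_numbers(entries))  # [3]
```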

The second block of numbers in entry 0 is a special number (65535) that just means that this entry is the head of the list (therefore it can’t be reused as a normal object). In all other cases, including entries ending in n, the second block of numbers is a generation number. In our example object 2 has been recycled 3 times and object 3 has been recycled 7 times.

An entry ending in n refers to an object that is in use. The second block of numbers is still the generation number, but the first block represents the number of bytes (in decimal) from the beginning of the PDF file to where the object appears in the body.
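Putting the pieces together, the example table can be read with a few lines of Python. This is only a sketch under simple assumptions: it handles a single table passed in as text, and the names parse_xref, value, generation and in_use are made up for this illustration rather than taken from any PDF library.

```python
def parse_xref(text):
    """Parse one cross reference table into {object number: entry}.

    Hypothetical helper: assumes a single table given as plain text.
    """
    lines = [line.strip() for line in text.strip().splitlines()]
    if lines[0] != "xref":
        raise ValueError("expected the xref keyword")
    # First number: the object the list starts at. Second: the entry count.
    first, count = (int(n) for n in lines[1].split())
    table = {}
    for obj_num, line in enumerate(lines[2:2 + count], start=first):
        number, generation, flag = line.split()
        table[obj_num] = {
            # Byte offset for "n" entries, next free object for "f" entries.
            "value": int(number),
            "generation": int(generation),
            "in_use": flag == "n",
        }
    return table

XREF = """xref
0 4
0000000003 65535 f
0000017496 00000 n
0000000721 00003 n
0000000000 00007 f
"""

table = parse_xref(XREF)
for obj_num, entry in table.items():
    state = "in use" if entry["in_use"] else "free"
    print(obj_num, state, entry["value"], entry["generation"])
```

For an in-use entry, value is the offset you would seek to in the file to find that object. For a free entry it is the number of the next free object.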

After that lot is the trailer section, which contains a bunch of information about the PDF file. It’s a dictionary (remember the pointy brackets?) which at a bare minimum must contain the number of entries in the cross reference table, counting object 0 (/Size 4), and a reference to the root object of the document. After that is the keyword startxref followed by the number of bytes from the start of the file to the xref keyword. The %%EOF indicates the end of the file.

trailer
<< /Size 4 /Root 1 0 R>>
startxref
1205
%%EOF
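A reader typically starts at the end of the file, so here is a short Python sketch of that first step: find the last startxref, read the number after it, and check that it really lands on the xref keyword. The function name find_xref_offset is invented for this example, and the bytes below are a toy fragment built just to get the offsets right, not a complete working PDF.

```python
def find_xref_offset(data: bytes) -> int:
    """Return the number that follows the last startxref keyword.

    Hypothetical helper for illustration only.
    """
    pos = data.rfind(b"startxref")
    if pos == -1:
        raise ValueError("no startxref keyword found")
    tail = data[pos + len(b"startxref"):].split()
    return int(tail[0])

# Toy file: header, one object, cross reference table, trailer.
body = b"%PDF-1.3\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"
xref_offset = len(body)  # the table starts right after the body
data = (
    body
    + b"xref\n0 2\n0000000000 65535 f \n0000000009 00000 n \n"
    + b"trailer\n<< /Size 2 /Root 1 0 R >>\nstartxref\n"
    + str(xref_offset).encode("ascii")
    + b"\n%%EOF\n"
)

offset = find_xref_offset(data)
print(data[offset:offset + 4])  # b'xref'
```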

Next time:  Create your own non-working PDF file!


Daniel

Developer at IDR Solutions
When not delving into obscure PDF or Java bugs, Daniel is exploring the new features in JavaFX.
