Brace yourself! You could soon be the proud owner of your own shiny, handmade, completely blank, one-page PDF document. But before you embark on what will no doubt be the defining moment of your life, I’m going to have to spill the beans about the body of a PDF document. I mentioned the body in my previous articles: it is the part that contains all the objects that describe the actual document you see in a PDF viewer. It has to contain some important dictionary objects so a PDF reader can figure out what’s going on.
The objects in the body section of a PDF file are arranged in a tree hierarchy, with a dictionary object called the Catalog at the root of it all. This is the Root object that the dictionary in the trailer section was referring to. The Catalog dictionary must contain a minimum of two entries: /Type /Catalog (so the PDF reader knows the dictionary is a catalog) and a reference to another dictionary object that represents the root of the Page Tree, i.e. /Pages 2 0 R.
1 0 obj << /Type /Catalog /Pages 2 0 R>> endobj
The Page Tree contains two types of nodes, both of which are dictionary objects: Page Tree nodes, like the one the Catalog needs to know about, and Page nodes, each of which represents a single page in the document. Page Tree nodes contain references to Page nodes and/or to other Page Tree nodes. Bear in mind that the Page Tree doesn’t reflect the logical structure of the document, like chapters and stuff; it is used to balance a tree of pages so that a large document can be accessed quickly.
So we need to put some stuff in our Page Tree node, which is 2 0 obj. As usual it needs an entry that says what it is: /Type /Pages. It also needs an array called /Kids which holds references to the other parts of the Page Tree it’s connected to. We are going to be really adventurous and have one page in our document: /Kids [3 0 R]. You then need a /Count of all the Page nodes that sit below this Page Tree node. So if it was at the root of a 100-page document the count would be 100, but a different Page Tree node further down the same tree with only one page attached to it would have a count of 1. Anyway, this is what we want in our root Page Tree node:
2 0 obj << /Type /Pages /Kids [3 0 R] /Count 1 >> endobj
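If the /Count business seems abstract, here is a rough Python sketch of why it is useful. It is not how any particular PDF reader works; the dict-based node model and the names make_page, make_pages, find_page and the "Label" key are all made up for illustration. The idea is just that each Page Tree node stores how many pages sit below it, so a lookup can skip whole subtrees instead of walking every page.

```python
# Hypothetical in-memory model of a page tree: every node is a plain dict.

def make_page(label):
    # "Label" is not a real PDF key; it only lets us tell pages apart here.
    return {"Type": "Page", "Label": label}

def make_pages(kids):
    # Count is the number of Page nodes (leaves) below this Page Tree node.
    count = sum(k["Count"] if k["Type"] == "Pages" else 1 for k in kids)
    return {"Type": "Pages", "Kids": kids, "Count": count}

def find_page(node, index):
    """Return the index-th page (0-based) below node, using Count to skip subtrees."""
    if node["Type"] == "Page":
        return node
    for kid in node["Kids"]:
        pages_below = kid["Count"] if kid["Type"] == "Pages" else 1
        if index < pages_below:
            return find_page(kid, index)
        index -= pages_below  # the page is not in this subtree, skip it
    raise IndexError("page index out of range")

# A 100-page document: one node holding 99 pages, another holding 1.
left = make_pages([make_page(i) for i in range(99)])
right = make_pages([make_page(99)])
root = make_pages([left, right])

print(root["Count"], left["Count"], right["Count"])  # 100 99 1
print(find_page(root, 99)["Label"])                  # 99
```

Our blank document is the degenerate case: one Page Tree node, one kid, a count of 1.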
This object references 3 0 obj, which is going to be a Page node. Again we need an entry that says what it is: /Type /Page. We need a reference to its parent node: /Parent 2 0 R. Then we need some slightly more exciting elements.
An object that is a page should have a resources entry which describes lots of fun stuff about the page. The /Resources key maps to a dictionary where you put in what fonts the page is going to use and lots of weird and wonderful things involving graphics, shading and things with funny names. As we are avoiding fun stuff like the plague for the time being, we’re going to ignore it. We’ll be back to the resources in the next part. Leaving it out just means that we will get a blank page. We could say /Resources << >> or /Resources null; it will all mean the same thing to a PDF reader: a blank page.
The last thing we need to make our Page object satisfy a PDF viewer is a /MediaBox entry. The Media Box refers to an array which describes the entire size of a page. But before I go on, this is a good time to explain something related to the contents of our MediaBox array. You know I said objects have different types? Well, there is also a whole bunch of data types that are not object types themselves but are used for modelling the data that is contained in objects. For example, we are going to have a Media Box entry:
/MediaBox [0 0 500 800].
This is a Name object (/MediaBox) and an Array object ([0 0 500 800]). The actual data in the array represents two corners of a rectangle: the coordinates of the bottom left corner and the coordinates of the top right corner (so it’s 500 – 0 wide and 800 – 0 high). Those four numbers represent the parts needed for a PDF data type that must be contained in an array object: type Rectangle. Type Rectangle is pretty useful as it is a means to specify where you want to put stuff. There is a whole load of other types such as strings and streams and even functions, so you can do lots of nifty things with them. Anyway, here is our Page object:
3 0 obj << /Type /Page /Parent 2 0 R /Resources << >> /MediaBox [0 0 500 800] >> endobj
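To make the Rectangle arithmetic concrete, here is a tiny Python sketch. The function name rectangle_size is made up; the only PDF fact it leans on, beyond the corner layout described above, is that the default unit in a PDF page is 1/72 of an inch.

```python
def rectangle_size(rect):
    """Width and height of a PDF Rectangle given as [llx, lly, urx, ury]."""
    llx, lly, urx, ury = rect  # lower-left corner, then upper-right corner
    return urx - llx, ury - lly

media_box = [0, 0, 500, 800]
width, height = rectangle_size(media_box)
print(width, height)  # 500 800

# Default user space units are 1/72 inch, so the page size in inches is:
print(round(width / 72, 2), round(height / 72, 2))  # 6.94 11.11
```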
Anyway, that’s all we need to get a blank page. Adjusting what we had from Part 2b: Create your own non-working PDF, we get:
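%PDF-2.0
1 0 obj << /Type /Catalog /Pages 2 0 R>> endobj
2 0 obj << /Type /Pages /Kids [3 0 R] /Count 1 >> endobj
3 0 obj << /Type /Page /Parent 2 0 R /Resources << >> /MediaBox [0 0 500 800] >> endobj
xref
0 4
0000000000 65535 f
0000000010 00000 n
0000000060 00000 n
0000000115 00000 n
trailer << /Size 4 /Root 1 0 R >>
startxref
199
%%EOF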
Note that I have had to make a few adjustments to the Cross Reference Table and the Trailer section. I’ve changed the size of the table to 4, both in the part after the xref keyword and in the trailer dictionary, added entries corresponding to the positions of our new objects in the Cross Reference Table, and adjusted the startxref address to the new position of the xref keyword. The offsets depend on every byte in the file, including the spacing inside the objects and the kind of line endings you use, so you probably (definitely!) have to get the hex editor out to get the numbers correct if you’re doing it yourself.
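If counting bytes in a hex editor is not your idea of fun, the bookkeeping can be handed to a few lines of Python. This is only a sketch under my own assumptions: the function name build_blank_pdf is made up, and the exact spacing inside each object is my choice, so the later offsets it produces will not necessarily match the numbers above. What it shows is the principle: record the byte position of each object as you write it, then write those positions into the table.

```python
def build_blank_pdf(newline=b"\r\n"):
    """Assemble a one-page blank PDF and return (bytes, object offsets, xref offset)."""
    objects = [
        b"1 0 obj << /Type /Catalog /Pages 2 0 R >> endobj",
        b"2 0 obj << /Type /Pages /Kids [3 0 R] /Count 1 >> endobj",
        b"3 0 obj << /Type /Page /Parent 2 0 R /Resources << >> "
        b"/MediaBox [0 0 500 800] >> endobj",
    ]
    out = bytearray(b"%PDF-2.0" + newline)
    offsets = []
    for obj in objects:
        offsets.append(len(out))  # byte position where this object starts
        out += obj + newline

    xref_pos = len(out)           # byte position of the xref keyword
    size = len(objects) + 1       # +1 for the free entry for object 0
    out += b"xref" + newline
    out += b"0 %d" % size + newline
    # Each table entry must be exactly 20 bytes including its end-of-line,
    # so the entries always use CR LF here, whatever `newline` is.
    out += b"0000000000 65535 f\r\n"
    for off in offsets:
        out += b"%010d 00000 n\r\n" % off
    out += b"trailer << /Size %d /Root 1 0 R >>" % size + newline
    out += b"startxref" + newline
    out += b"%d" % xref_pos + newline
    out += b"%%EOF" + newline
    return bytes(out), offsets, xref_pos

pdf, offsets, xref_pos = build_blank_pdf()
print(offsets, xref_pos)
```

To try the result in a viewer, write the returned bytes to a file opened in binary mode, for example open("blank.pdf", "wb"). Passing newline=b"\n" gives the single line feed variant, which shifts every offset, the first object moving from byte 10 to byte 9.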
Next time: Hello World PDF!
2 Replies to “Make your own PDF file – Part 3: DIY…”
I think there is a typo in the row of the mediabox. I assume it should be << /MediaBox [0 0 500 800]>>. A small thing that can be quite confusing for newcomers.
I was unable to get the blank.pdf as written above to work. I needed to change the xref table offsets since I’m on a macOS machine and line endings are a single line feed. Also the /MediaBox section was missing information that I added. Here is a blank.pdf that works on my macbook:
1 0 obj <>
2 0 obj <>
3 0 obj <>
0000000000 65535 f
0000000009 00000 n
0000000056 00000 n
0000000111 00000 n