This series of articles is part of my learning experience and is intended to give you a practical tour of how a PDF document works. I’m going to start with a brief description of the elements needed to make a PDF. This will lead (eventually!) to busting out the text editor and a hex editor to create your very own low-level Hello World PDF file that you can show off to all your friends and loved ones.
Before I started working with PDF files I would have assumed they were just some kind of fancy text document with scripts embedded to render graphics, and then carried on happily with my life. PDF is a bit like that, but I feel it’s more helpful to think of it as a programming language where your PDF file is the code, and a PDF viewer like Adobe Reader or our own PDF viewer in JPedal is an interpreter that converts the code into a document you can look at.
A PDF file is predominantly made up of objects organised into a tree structure where each node of the tree is an object. Each object represents one of the eight data types that a PDF reader can understand: strings, arrays, numbers (integer and real), boolean values (true/false), name objects (see later), associative arrays (called dictionaries), streams (which consist of a dictionary followed by a load of binary stuff), and a null object. If you open up a PDF file in a text editor you will notice lots of objects.
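To make the eight types a little more concrete, here is a rough Python sketch that guesses the type of a single value from how it is written. The function name classify and the sample values are my own inventions for illustration, and the rules are deliberately simplistic (a real PDF parser has to cope with nesting, escapes and comments), so treat it as a cheat sheet rather than a parser.

```python
import re

# Integers and reals such as 3, -17, 3.14 or .5
NUMBER_RE = re.compile(r"[+-]?(\d+\.?\d*|\.\d+)")


def classify(token: str) -> str:
    """Guess the PDF data type of one value from its opening characters."""
    token = token.strip()
    if token in ("true", "false"):
        return "boolean"
    if token == "null":
        return "null"
    if token.startswith("<<"):
        # A stream is a dictionary followed by stream ... endstream.
        return "stream" if token.endswith("endstream") else "dictionary"
    if token.startswith("(") or token.startswith("<"):
        return "string"  # literal (...) or hexadecimal <...>
    if token.startswith("["):
        return "array"
    if token.startswith("/"):
        return "name"
    if NUMBER_RE.fullmatch(token):
        return "number"
    return "unknown"


# Hypothetical sample values, one per type.
samples = [
    "(Hello World)",
    "[34 0 R 43 0 R 52 0 R]",
    "3",
    "true",
    "/Type",
    "<</Count 3>>",
    "<</Length 5>>stream\nhello\nendstream",
    "null",
]

for sample in samples:
    print(f"{classify(sample):<10} {sample!r}")
```

The order of the checks matters: the dictionary test has to come before the string test, because hexadecimal strings open with a single < and dictionaries with a double <<.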
Here’s an example:
41 0 obj<</Type/Pages/Kids[34 0 R 43 0 R 52 0 R]/Count 3>>
This is object 41. The first number is the object number, which acts as the object's name. The second is a generation number (this doesn't appear to get used much, as it nearly always seems to be zero). The obj part says it's an object. If you follow the text you eventually get to an endobj, indicating that all the stuff in between belongs to object 41. The pointy brackets (<< >>) indicate that object 41 is of type Dictionary. A dictionary is a frequently used data type in PDF files and contains a list of key and value pairs describing some aspect of the document. The value part can be any of the eight types of objects, including other dictionaries.
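Here is a small Python sketch of how you might pick an object like this apart. It is not how any real PDF library does it. I have added the closing endobj to the sample so the pattern has something to stop at, and the variable names (including generation_number, the formal name for the second number) are my own.

```python
import re

# The sample object from above, with its closing endobj keyword added.
raw = "41 0 obj<</Type/Pages/Kids[34 0 R 43 0 R 52 0 R]/Count 3>>endobj"

# Two integers, the keyword obj, the body, and finally endobj.
OBJECT_RE = re.compile(r"(\d+)\s+(\d+)\s+obj(.*?)endobj", re.DOTALL)

match = OBJECT_RE.search(raw)
if match is None:
    raise ValueError("no object found")

object_number = int(match.group(1))      # 41
generation_number = int(match.group(2))  # 0
body = match.group(3).strip()            # everything between obj and endobj

# Pointy brackets at both ends mean the body is a dictionary.
is_dictionary = body.startswith("<<") and body.endswith(">>")

print(object_number, generation_number, is_dictionary)
```

A regular expression is fine for a one-line sample, but it would trip over a stream whose binary data happens to contain the text endobj, so a real reader has to be smarter than this.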
The example I’ve used contains 3 keys: /Type, /Kids and /Count. These are name objects; you can tell because they start with a /. A name object is basically a name that means something to a PDF reader. Keys in dictionaries are always name objects. This dictionary says it contains information about pages through its /Type key (/Type/Pages). Next comes a key whose value is an array of child nodes. The key is the name object /Kids, and the whole pair reads /Kids[34 0 R 43 0 R 52 0 R].
Arrays are enclosed in square brackets. The contents of this array may look familiar: they are pointers to other objects (indirect references), written with an R in place of the obj keyword. For example, if I wanted to create a link to the object 41 0 obj, I would say 41 0 R. The last entry in the dictionary has a key called /Count that maps to the number object 3. Therefore object 41 says this document has 3 pages and you can find the information about each page at objects 34, 43 and 52.
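To show how those references might be read by a program, here is a minimal Python sketch that pulls the /Kids references and the /Count value out of the dictionary. The regular expressions and variable names are my own assumptions, and they only work because this sample is flat; a nested dictionary would need a proper parser.

```python
import re

# The dictionary body of object 41 from the example.
body = "<</Type/Pages/Kids[34 0 R 43 0 R 52 0 R]/Count 3>>"

# Grab whatever sits between the square brackets after /Kids.
kids_array = re.search(r"/Kids\s*\[(.*?)\]", body).group(1)

# Each reference is "object number, generation number, R".
kids = [
    (int(number), int(generation))
    for number, generation in re.findall(r"(\d+)\s+(\d+)\s+R", kids_array)
]

# /Count maps to a plain number object.
count = int(re.search(r"/Count\s+(\d+)", body).group(1))

for number, generation in kids:
    print(f"page information lives in object {number} (generation {generation})")
print(f"{count} pages in total")
```

In this sample the number of references in /Kids matches /Count, which is what you would expect when every child is a single page.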
Next time: Overall PDF file structure.
This article is part of a 7-part series on creating a Hello World PDF.