Daniel When not delving into obscure PDF or Java bugs, Daniel is exploring the new features in JavaFX.

Make your own PDF file – Part 1: PDF Objects and Data Types


This is part of a series on How to make your own PDF files.

This series of articles is part of my learning experience and intended to give you a practical tour of how a PDF document works. I’m going to start by giving a brief description of what elements are needed to make a pdf. This will lead (eventually!) to busting out the text editor and a hex editor to create your very own low level Hello World pdf file that you can show off to all your friends and loved ones.


Part 1: PDF Objects and Data Types

Before I started working with PDF files I would have assumed they were just some kind of fancy text document with scripts embedded into it to render graphics, and then carried on happily with my life. PDF is a bit like that but I feel it’s more helpful to think of it as a programming language where your pdf file is the code, and a PDF viewer like Adobe Reader or our own PDF viewer in JPedal is an interpreter that converts the code into a document you can look at.

A PDF file is predominantly made up of objects organised into a tree structure where each node of the tree is an object. Each object can represent one of the eight data types that a PDF reader can understand: strings, arrays, numbers (integer and real), boolean values (true/false), name objects (see later), associative arrays (called dictionaries), streams (which consist of a dictionary and a load of binary stuff), and a null object. If you open up a PDF file in a text editor you will notice lots of objects.

Here’s an example:

41 0 obj<< /Type/Pages/Kids[34 0 R 43 0 R 52 0 R]/Count 3>>
endobj

This is object 41. The first number is the object number, which serves as its name. The second is a generation number (this rarely seems to get used, as it is almost always zero). The obj keyword says it’s an object. If you follow the text you eventually get to an endobj, indicating that all the stuff in between belongs to object 41. The pointy brackets (<< >>) indicate that object 41 is of type Dictionary. A dictionary is a frequently used data type in PDF files and contains a list of key and value pairs describing some aspect of the document. The value part can be any of the eight types of objects, including other dictionaries.

The example I’ve used contains 3 keys: /Type, /Kids and /Count. These are name objects; you can tell because they start with a /. A name object is basically a name that means something to a PDF reader. Keys in dictionaries are always name objects. This dictionary says it contains information about pages through its /Type key (/Type/Pages). Next is a name/array pair describing some child nodes, indicated by the name object /Kids: /Kids[34 0 R 43 0 R 52 0 R].

Arrays are enclosed in square brackets. The contents of this array may look familiar: they are pointers to other objects, with an R in place of the obj keyword. For example, if I wanted to create a link to the object 41 0 obj, I would write 41 0 R. The last entry has a key called /Count that maps to the number object 3. Therefore object 41 says this document has 3 pages and you can find the information about each page at objects 34, 43 and 52.
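
To make the data types a bit more concrete, here is a minimal Python sketch that maps Python values onto the PDF syntax described above and rebuilds object 41. The Name and Ref classes and the to_pdf function are names I have made up for illustration, streams are left out, and string escaping and number formatting are kept deliberately simple.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Name:
    """A PDF name object such as /Type (stored without the slash)."""
    value: str


@dataclass(frozen=True)
class Ref:
    """An indirect reference such as 34 0 R."""
    number: int
    generation: int = 0


def to_pdf(value) -> str:
    """Serialise a Python value using the PDF syntax described above."""
    if value is None:
        return "null"
    if isinstance(value, bool):          # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, (int, float)):  # simple formatting, no exponent handling
        return str(value)
    if isinstance(value, Name):
        return "/" + value.value
    if isinstance(value, Ref):
        return f"{value.number} {value.generation} R"
    if isinstance(value, str):           # literal string with minimal escaping
        escaped = value.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
        return f"({escaped})"
    if isinstance(value, list):
        return "[" + " ".join(to_pdf(item) for item in value) + "]"
    if isinstance(value, dict):          # keys are expected to be Name objects
        pairs = " ".join(f"{to_pdf(k)} {to_pdf(v)}" for k, v in value.items())
        return f"<< {pairs} >>"
    raise TypeError(f"no PDF representation for {type(value).__name__}")


pages = {
    Name("Type"): Name("Pages"),
    Name("Kids"): [Ref(34), Ref(43), Ref(52)],
    Name("Count"): 3,
}
object_41 = f"41 0 obj\n{to_pdf(pages)}\nendobj"
print(object_41)
```

The spacing differs slightly from the hand-written example, but white space between tokens does not change the meaning of the object.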


Part 2: Structure of a PDF File

Before we can start hacking together our own simple PDF file, a quick look at the high level structure of a PDF is in order. The file is broken down into four parts. The first two are pretty straightforward. First there is the header section whose only requirement is to have the version number in it:

%PDF-1.3

If you open up any old PDF document in a text editor you will see one at the top. You might also see a line or two of % symbols followed by some nonsense. Normally % means the rest of the text on that line is ignored (i.e. comments) but some things, like %PDF-1.3, mean something to a PDF reader. You see that a list of objects follows the header section. That’s the body section which contains all the objects I spoke about in Part One.

The next section requires a more long-winded explanation. It starts with a table that follows the xref keyword. If you still have your text editor open you can do a search for xref and see it’s preceded by an endobj; that’s the end of the body section.

xref
0 4
0000000003 65535 f
0000017496 00000 n
0000000721 00003 n
0000000000 00007 f

The table is mainly a list of the addresses of each object in the body section. This is so objects can be accessed quickly just by using their ID number instead of traversing the object tree.

The first number after xref says that this list starts at object 0. You won’t find a 0 0 obj (Object 0) in the PDF file because it’s a special sort of entry that represents the head of a linked list. That’s why the first line in the list (which represents object 0) has an f at the end. The lines with n at the end refer to the objects you find in the body section. They go up in sequential order so object 1 0 obj is the second one on the list, 2 0 obj is the third etc. The second number after xref is a count of how many objects (4) are in this table (called the Cross Reference Table). Note that as a PDF gets updated there could be a number of these lists in this section of the file.

The entries are all formatted the same way but some of the blocks of numbers might mean different things. Let’s get entry 0 out of the way first. The first block of numbers in an entry ending with an f points to the next node in the linked list. You see it says 3, and if you look at the entry for object 3 you’ll see that it also ends with an f. The last node in the linked list points back to object 0, like entry 3 in this example. The point of this list is to reuse the numbers of objects that were removed when the PDF file was edited. If this PDF was edited again, the list could be examined and you’d know that the number 3 can be used for a new object id.

The second block of numbers in entry 0 is a special number (65535) that just means that this entry is the head of the list (therefore it can’t be re-used as a normal object). In all other cases, including entries ending in n, the second block of numbers is a generation number. In our example object 2 has been recycled 3 times and object 3 has been recycled 7 times.

An entry ending in n refers to an object that is in use. The second block of numbers is still the generation number but the first block represents the number of bytes (in decimal) from the beginning of the PDF file to where the object appears in the body.
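
The rules for reading the table can be sketched in a few lines of Python. This is only an illustration that handles a single subsection like the example above; XrefEntry, parse_xref and free_list are hypothetical names of my own, not part of any library.

```python
from typing import NamedTuple


class XrefEntry(NamedTuple):
    number: int       # object number (its position in the table)
    value: int        # byte offset for "n" entries, next free object for "f" entries
    generation: int
    in_use: bool


def parse_xref(text: str) -> list[XrefEntry]:
    """Parse a single xref subsection such as the example above."""
    lines = [line.strip() for line in text.strip().splitlines()]
    if lines[0] != "xref":
        raise ValueError("expected the xref keyword")
    first, count = (int(part) for part in lines[1].split())
    entries = []
    for index, line in enumerate(lines[2:2 + count]):
        value, generation, kind = line.split()
        entries.append(XrefEntry(first + index, int(value), int(generation), kind == "n"))
    return entries


def free_list(entries: list[XrefEntry]) -> list[int]:
    """Follow the linked list of free entries, starting from object 0."""
    by_number = {entry.number: entry for entry in entries}
    chain, current = [], by_number[0].value
    while current != 0 and current not in chain:   # stop at the tail or on a cycle
        chain.append(current)
        current = by_number[current].value
    return chain


EXAMPLE = """\
xref
0 4
0000000003 65535 f
0000017496 00000 n
0000000721 00003 n
0000000000 00007 f
"""

entries = parse_xref(EXAMPLE)
for entry in entries:
    print(entry)
print("free object numbers:", free_list(entries))
```

For the example table the free list contains only object 3, which matches the walk through the linked list described above.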

After that lot is the trailer section which contains a bunch of information about the PDF file. It’s a dictionary (remember the pointy brackets?) which at a bare minimum must contain the number of entries in the cross reference table (/Size 4) and a reference to the root object of the document (/Root). After that is the keyword startxref followed by the number of bytes from the start of the file to the xref keyword. The %%EOF indicates the end of the file.

trailer
<< /Size 4 /Root 1 0 R >>
startxref
1205
%%EOF
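
A reader starts at the end of the file, so a rough Python sketch of that first step might look like the following. The function names are my own, the regular expressions only cope with simple trailers like the one above, and a real reader would need a proper tokenizer.

```python
import re


def read_startxref(data: bytes) -> int:
    """Return the byte offset that follows the last startxref keyword."""
    tail = data[-1024:]                      # the trailer sits at the end of the file
    matches = re.findall(rb"startxref\s+(\d+)\s+%%EOF", tail)
    if not matches:
        raise ValueError("no startxref/%%EOF pair found")
    return int(matches[-1])


def read_trailer_entry(data: bytes, key: bytes) -> bytes:
    """Fetch one simple value from the last trailer dictionary, e.g. b"Size"."""
    trailer = data[data.rindex(b"trailer"):]
    match = re.search(rb"/" + re.escape(key) + rb"\s*([^/>]+)", trailer)
    if match is None:
        raise KeyError(key.decode())
    return match.group(1).strip()


SAMPLE = b"trailer\n<< /Size 4 /Root 1 0 R >>\nstartxref\n1205\n%%EOF\n"

print(read_startxref(SAMPLE))               # offset of the xref keyword
print(read_trailer_entry(SAMPLE, b"Size"))  # number of entries in the table
print(read_trailer_entry(SAMPLE, b"Root"))  # reference to the root object
```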

Part 3: Create a Non-Working PDF

In this article of the series, we use a text editor to structure and create a PDF file. The only problem with the PDF we are going to make is that it is not going to work. It will however give us an error message we can understand in the Acrobat PDF viewer. This is going to form the basis for creating a working PDF file in the posts that follow. The ingredients you require are: a text editor, a hex editor (I’m going to use HxD) and an at least partially functioning human brain. Preferably your own.

We are going to create all the parts I mentioned in the last article in a text editor and figure out the addresses of the things we put in our file using HxD. We can also see what error messages we can produce from Acrobat.

Firstly I’m gonna make a new blank file called myPdf.pdf. Just because I can I’m gonna load it in Acrobat to see what it says:

“Adobe Reader could not open ‘myPdf.pdf’ because it is either not a supported file type or because the file has been damaged.”

Hardly surprising, but if you get this message from a supposedly working PDF in the future you can be sure it’s a bit knackered.

Now I’m adding the header part, which only requires a version number in the form: %PDF-2.0. Next we have the body section where all the objects go. For this section we’ll just have one object: object number 1, and it’s going to be a dictionary object (that we are not going to put anything in…yet!).

%PDF-2.0
1 0 obj << >>
endobj

Next we want the Cross Reference Table section. First we need the xref keyword. Then the number of the first object in our list and the number of entries in the table. So far we have two: 1 0 obj that is in our body section and object 0 which is the head of the linked list that I described in Part 2. So we end up with a line with 0 2 on it.

xref
0 2
0000000000 65535 f
0000000010 00000 n

Next you need the final part which is the trailer section:

trailer << /Size 2 /Root 1 0 R >>
startxref
33
%%EOF

If you open this in Acrobat you’ll get a different kind of error. If you hold down Ctrl while clicking OK you see another part of the error message: “Expected a dict object.”
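
If you would rather not count bytes in a hex editor, a few lines of Python can find the same addresses. This is a sketch under one assumption of mine: that every line ends with a two-byte CRLF, as Windows text editors write, which is what makes the offsets come out as the 10 and 33 used in the listing above.

```python
# The Part 3 file, joined with CRLF line endings (an assumption, see above).
LINES = [
    "%PDF-2.0",
    "1 0 obj << >>",
    "endobj",
    "xref",
    "0 2",
    "0000000000 65535 f",
    "0000000010 00000 n",
    "trailer << /Size 2 /Root 1 0 R >>",
    "startxref",
    "33",
    "%%EOF",
]
data = ("\r\n".join(LINES) + "\r\n").encode("ascii")

# Address of object 1: this is the number that goes in its "n" entry.
object_offset = data.index(b"1 0 obj")

# Address of the xref keyword at the start of a line (skipping the CRLF before
# it so that the "xref" inside "startxref" is not matched by mistake).
xref_offset = data.index(b"\r\nxref") + 2

print(f"{object_offset:010d} 00000 n")
print("startxref", xref_offset)
```

If you write data to myPdf.pdf yourself, open the file in binary mode ("wb") so the line endings are not altered on the way out. Saving the same text with one-byte line endings would shift both offsets.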


Part 4: Hello World PDF

Back when dinosaurs roamed the earth I talked about the different objects that are used to form a PDF file. One type I mentioned was the stream object. Stream objects are the objects that contain all the instructions describing what a PDF page is going to look like. By the end of this article we are going to be able to make a Hello World PDF. I’m going to have to make use of a stream object so I can put some text in a PDF document.

If you open up any old PDF in a text editor the majority of text you will see will be contained in stream objects. A stream’s format is slightly different from the other objects: it starts with a dictionary. This must have a /Length entry saying how long the stream is in bytes. The length of the stream is everything between the keywords stream and endstream (minus the final end-of-line characters if the stream has them). Normally when you open a PDF the stuff in the stream is compressed. You can tell what kind of compression is used from the /Filter key in the stream’s main dictionary.

10 0 obj<</Length 40 /Filter /FlateDecode>>
stream

...bunch of compressed stuff...

endstream
endobj

If you went to the trouble of uncompressing this stuff you would find a list of instructions. These instructions are the commands that create all the content in a PDF. Here are the contents of the stream uncompressed:

BT
/F1 24 Tf
175 720 Td
(Hello World!)Tj
ET

BT means Begin Text and ET means End Text. The stuff in between sets the font, the position and what it’s going to say. The instructions are Tf, Td and Tj. Note how the values that these instructions need are written first.
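
As for the compression mentioned earlier, FlateDecode data is zlib data, so Python's standard zlib module can do the round trip. This is just a sketch: the object number 10 is borrowed from the example above and the variable names are my own.

```python
import zlib

content = b"BT\n/F1 24 Tf\n175 720 Td\n(Hello World!)Tj\nET"

compressed = zlib.compress(content)          # what a /FlateDecode stream holds
assert zlib.decompress(compressed) == content

# /Length counts the bytes stored between stream and endstream, so for a
# filtered stream it is the size of the compressed data.
stream_object = (
    b"10 0 obj<</Length %d /Filter /FlateDecode>>\nstream\n" % len(compressed)
    + compressed
    + b"\nendstream\nendobj\n"
)

print(len(content), "bytes uncompressed,", len(compressed), "bytes compressed")
print(zlib.decompress(compressed).decode("ascii"))
```

For a stream this short the compressed form may not be any smaller; the saving only shows up on real page content.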

Before I add that to my PDF document we have to sort that reference to /F1 out. In streams you can’t reference objects in the same way you do outside a stream (i.e. 10 0 R); you have to map /F1 to an object and make that mapping available through the page’s /Resources dictionary.

3 0 obj<</Type /Page /Parent 2 0 R /Resources 4 0 R /MediaBox [0 0 500 800] /Contents 7 0 R>>
endobj
4 0 obj<</Font 5 0 R>>
endobj
5 0 obj<</F1 6 0 R>>
endobj
6 0 obj<</Type /Font /Subtype /Type1 /BaseFont /Helvetica>>
endobj
7 0 obj<</Length 40>>
stream
BT
/F1 24 Tf
.....
endstream
endobj

So we are making use of a /Page object. The page’s /Contents entry points to a stream object that prints our text. The stream needs to know which object /F1 points to, and it finds out through the page’s /Resources entry.

Anyway put it all together and you get, possibly, a world first: How to make a “Hello World” pdf document!

%PDF-2.0
1 0 obj <</Type /Catalog /Pages 2 0 R>>
endobj
2 0 obj <</Type /Pages /Kids [3 0 R] /Count 1>>
endobj
3 0 obj<</Type /Page /Parent 2 0 R /Resources 4 0 R /MediaBox [0 0 500 800] /Contents 6 0 R>>
endobj
4 0 obj<</Font <</F1 5 0 R>>>>
endobj
5 0 obj<</Type /Font /Subtype /Type1 /BaseFont /Helvetica>>
endobj
6 0 obj
<</Length 44>>
stream
BT /F1 24 Tf 175 720 Td (Hello World!)Tj ET
endstream
endobj
xref
0 7
0000000000 65535 f
0000000009 00000 n
0000000056 00000 n
0000000111 00000 n
0000000212 00000 n
0000000250 00000 n
0000000317 00000 n
trailer <</Size 7/Root 1 0 R>>
startxref
408
%%EOF
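
Working out every offset by hand gets old quickly, so here is a sketch of how the bookkeeping could be automated in Python. The build_pdf and stream helpers are hypothetical names of my own, the layout (one-byte line endings, each object body on its own line) is a choice I made for the sketch, and the offsets it produces will therefore differ from the hand-made file above.

```python
def build_pdf(objects: list[bytes], root: int = 1) -> bytes:
    """Assemble header, body, xref table and trailer.

    objects[i] is the body of object i + 1; every object gets generation 0.
    """
    out = bytearray(b"%PDF-2.0\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))                   # where "N 0 obj" starts
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    xref_offset = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"                # head of the free list
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset        # each entry is 20 bytes long
    out += b"trailer <</Size %d/Root %d 0 R>>\n" % (len(objects) + 1, root)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_offset
    return bytes(out)


def stream(content: bytes) -> bytes:
    """Wrap content in a stream object body with a matching /Length."""
    return b"<</Length %d>>\nstream\n" % len(content) + content + b"\nendstream"


text = b"BT /F1 24 Tf 175 720 Td (Hello World!)Tj ET"
pdf = build_pdf([
    b"<</Type /Catalog /Pages 2 0 R>>",
    b"<</Type /Pages /Kids [3 0 R] /Count 1>>",
    b"<</Type /Page /Parent 2 0 R /Resources 4 0 R"
    b" /MediaBox [0 0 500 800] /Contents 6 0 R>>",
    b"<</Font <</F1 5 0 R>>>>",
    b"<</Type /Font /Subtype /Type1 /BaseFont /Helvetica>>",
    stream(text),
])

# Sanity check: the number after startxref should point at the xref keyword.
start = int(pdf.rsplit(b"startxref\n", 1)[1].split()[0])
print(pdf[start:start + 4])
```

Writing pdf to a file opened in binary mode ("wb") should give you something a viewer can open, and changing the text only means editing one line rather than recounting every address.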

Part 5: Path Objects

A PDF is drawn using a load of commands that sit in stream objects. With these commands a PDF viewer can figure out how to draw all the content that you see on a page. I’m going to explore the graphics commands and create a PDF in a text editor that draws a couple of lines on a page.

In a content stream you can create a combination of different graphics objects that a PDF viewer can understand. The one we are going to muck about with in this article is the Path object. A Path object is basically a list of points. Each Path object has a starting point and new points are added to the list. Each segment of the path can be a straight line, curve or rectangle. The collection of points is treated as one Path object because the painting operation you apply is applied to all the segments of your path.

175 720 m 175 50 l h S

The previous line contains 3 path commands followed by a paint command. The first one is m. This means start a path at the coordinates given. The l command means draw a line segment from the previous point in the path to the coordinates 175, 50. The h command closes off the path. When you have set your path out you then have to give it some sort of paint command so that it will do something. I used the command S which strokes the path.

175 720 m 175 700 l 300 800 400 720 v h S

This adds another command, v, which draws a curved line. The first two coordinates are a control point which sets the bend of the line (the current point doubles as the other control point), and the next two are where the line ends.

175 720 m 175 700 l 300 800 400 600 v 100 650 50 75 re h S

The re command draws a rectangle at coordinates 100 650 with a width and height of 50 and 75 respectively.
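
To get a feel for the curve the v command produces, here is a small Python sketch that samples points along it using the standard cubic Bézier formula. The function names are my own, and the only PDF-specific detail is that v uses the current point as its first control point.

```python
Point = tuple[float, float]


def cubic_bezier(p0: Point, p1: Point, p2: Point, p3: Point, t: float) -> Point:
    """Point on a cubic Bezier curve for 0 <= t <= 1."""
    u = 1.0 - t
    weights = (u ** 3, 3 * u * u * t, 3 * u * t * t, t ** 3)
    points = (p0, p1, p2, p3)
    x = sum(w * p[0] for w, p in zip(weights, points))
    y = sum(w * p[1] for w, p in zip(weights, points))
    return (x, y)


def v_segment(current: Point, control: Point, end: Point, steps: int = 4) -> list[Point]:
    """Sample the curve drawn by `x2 y2 x3 y3 v` starting from `current`.

    The v operator reuses the current point as the first control point.
    """
    return [
        cubic_bezier(current, current, control, end, i / steps)
        for i in range(steps + 1)
    ]


# "175 720 m 175 700 l 300 800 400 720 v": the curve starts at (175, 700),
# bends towards the control point (300, 800) and ends at (400, 720).
for point in v_segment((175, 700), (300, 800), (400, 720)):
    print(point)
```

The first and last samples are the start and end points of the segment; the control point itself is generally not on the curve, it only pulls the line towards it.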

%PDF-2.0
1 0 obj <</Type /Catalog /Pages 2 0 R>>
endobj
2 0 obj <</Type /Pages /Kids [3 0 R] /Count 1 /MediaBox [0 0 500 800]>>
endobj
3 0 obj<</Type /Page /Parent 2 0 R /Contents 4 0 R>>
endobj
4 0 obj
<</Length 61>>
stream
175 720 m 175 500 l 300 800 400 600 v 100 650 50 75 re h S
endstream
endobj
xref
0 5
0000000000 65535 f
0000000010 00000 n
0000000059 00000 n
0000000140 00000 n
0000000202 00000 n
trailer <</Size 5/Root 1 0 R>>
startxref
314
%%EOF

Part 6: Graphics State

It would be nice to get some color on the screen this time round and in doing so give an introduction to the Graphics State.

Associated with a PDF file is a Graphics State. This data structure holds information that describes how graphics are rendered to the screen. Values such as what the current colour is and what colours are available are stored in the Graphics State, as well as weird and wonderful elements like the current clip, the transformation matrix, funny things you can do with lines and other settings that alter the way the graphics will be rendered from the user space (the coordinate system of the PDF) to the device space (the monitor).

As there is just one Graphics State available and a PDF may contain lots of graphics objects that may want to do entirely different things, the current graphics state is usually saved when a content stream is going to draw something. The graphics state is pushed onto a stack with the q command. Capital Q pops the stack and restores the previously saved graphics state.

The Graphics State also has an associated colorspace. A colorspace basically describes what colours are available and how they are rendered to the current page. You can define your own, and there are also default ones that a PDF viewer has to know about. For example, there is DeviceGray (grayscale colours), DeviceRGB (red-green-blue) and a load more that represent colour in different ways. We’re going to stick with DeviceRGB for this article.

Graphics state settings that live outside the content stream are reached through the resource dictionary, in the same way as the Font entry that I discussed earlier. For example, you associate an ExtGState (extended graphics state) dictionary with a name like /GS1 and apply it using the gs command; colorspaces you define yourself are listed under a /ColorSpace entry and selected by name with the cs and CS commands. Fortunately, you don’t have to bother with that for the default ones so you should find that a content stream containing:

0.9 0.5 0.0 rg 100 400 300 300 re f

will draw an orange box on the screen. The rg command sets the colour space to DeviceRGB and describes the red/green/blue components (maximum 1.0 and minimum 0.0) of the colour used to fill the rectangle (if you use a capital RG it sets the stroke colour instead).

%PDF-2.0
1 0 obj <</Type /Catalog /Pages 2 0 R>>
endobj
2 0 obj <</Type /Pages /Kids [3 0 R] /Count 1 /MediaBox [0 0 500 800]>>
endobj
3 0 obj <</Type /Page /Parent 2 0 R /Contents 4 0 R>>
endobj
4 0 obj
<</Length 105>>
stream
0.9 0.5 0.0 rg 100 400 300 300 re f q 0.1 0.9 0.5 rg 100 200 200 200 re f Q 350 200 50 50 re f
endstream
endobj
xref
0 5
0000000000 65535 f
0000000010 00000 n
0000000059 00000 n
0000000140 00000 n
0000000203 00000 n
trailer <</Size 5/Root 1 0 R>>
startxref
352
%%EOF
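
The content stream in this file uses q and Q around the second rectangle, so here is a toy Python interpreter that tracks only the fill colour to show what happens to the third one. It is a sketch that understands just rg, q, Q and f; fill_colours is a name I made up, and a real graphics state holds far more than a colour.

```python
def fill_colours(content: str) -> list[tuple[float, ...]]:
    """Return the fill colour in effect for each `f` operator in content.

    Only rg, q, Q and f are interpreted; other operators just clear operands.
    """
    current = (0.0, 0.0, 0.0)      # the initial fill colour is black
    saved = []                     # the graphics state stack
    operands = []
    fills = []
    for token in content.split():
        try:
            operands.append(float(token))
            continue
        except ValueError:
            pass                   # not a number, so it is an operator
        if token == "rg":
            current = tuple(operands[-3:])
        elif token == "q":
            saved.append(current)  # save the current state
        elif token == "Q":
            current = saved.pop()  # restore the most recently saved state
        elif token == "f":
            fills.append(current)
        operands.clear()
    return fills


STREAM = (
    "0.9 0.5 0.0 rg 100 400 300 300 re f "
    "q 0.1 0.9 0.5 rg 100 200 200 200 re f Q "
    "350 200 50 50 re f"
)
for colour in fill_colours(STREAM):
    print(colour)
```

The third rectangle comes out orange again because Q throws away the green that was set after q and restores the state saved before it.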

Final Words

And that’s it. We are done. Congratulations on manually creating your own PDF file.

I hope you at least appreciate how complex it is, and why we generally recommend using software libraries to do it.


