Table of Contents
- Intro: What a PDF Actually Is
- 1: PDF Objects and Data Types
- 2: Structure of a PDF File
- 3: Create a Non-Working PDF
- 4: Create a Blank Page
- 5: Hello World PDF
- 6: Path Objects
- 7: Graphics State
What a PDF Actually Is — And Why It’s Worth Understanding
PDF is everywhere: contracts, invoices, reports, ebooks. But very few developers ever look under the hood. Most just call a library, get a file back, and move on — which is completely fine until something breaks, a PDF won’t render correctly, or you need to extract data from a malformed file and have no idea why it’s failing.
Understanding how a PDF is structured gives you a real advantage. It helps you debug rendering issues, write better extraction logic, and have an informed conversation with tools like JPedal when they’re doing the heavy lifting for you.
This guide walks through the PDF format from scratch — literally building one in a text editor, byte offset by byte offset. It’s technical, occasionally tedious, and exactly the kind of thing that makes everything click.
If you’d rather start at the spec level, the PDF file format series covers the full picture in detail.
1: PDF Objects and Data Types
Before I started working with PDF files, I would have assumed they were just some kind of fancy text document with scripts embedded in it to render graphics, and then carried on happily with my life. PDF is a bit like that, but I find it more helpful to think of it as a programming language where your PDF file is the code, and a PDF viewer like Adobe Reader or our own PDF viewer in JPedal is an interpreter that converts the code into a document you can look at.
A PDF file is predominantly made up of objects organised into a tree structure, where each node of the tree is an object. Each object is one of the eight data types that a PDF reader understands: strings, arrays, numbers (integer and real), boolean values (true/false), name objects (see later), associative arrays (called dictionaries), streams (which consist of a dictionary and a load of binary stuff), and the null object. If you open up a PDF file in a text editor you will notice lots of objects.
Here’s an example:
41 0 obj<< /Type/Pages/Kids[34 0 R 43 0 R 52 0 R]/Count 3>>
endobj
This is an object called 41: the first number is the object number. The second is the generation number, a kind of revision counter (it rarely gets used and is almost always zero). The obj part says it’s an object. If you follow the text you eventually get to an endobj, indicating that all the stuff in between belongs to object 41. The double angle brackets (<< >>) indicate that object 41 is a dictionary. A dictionary is a frequently used data type in PDF files and contains a list of key and value pairs describing some aspect of the document. The value part can be any of the eight types of objects, including other dictionaries.
The example I’ve used contains 3 keys: /Type, /Kids and /Count. These are name objects; you can tell because they start with a /. A name object is basically a name that means something to a PDF reader. Keys in dictionaries are always name objects. This dictionary says it contains information about pages through its /Type key (/Type/Pages). Next comes a name/array pair describing some child nodes, introduced by the name object /Kids: /Kids[34 0 R 43 0 R 52 0 R].
Arrays are enclosed in square brackets. The contents of this array may look familiar: they are pointers to other objects, with an R in place of the obj keyword. For example, if I wanted to create a link to the object 41 0 obj, I would write 41 0 R. The last entry has a key called /Count that maps to the number object 3. Therefore object 41 says this document has 3 pages and you can find the information about each page at objects 34, 43 and 52.
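To make the idea of references concrete, here is a minimal Python sketch that pulls the "34 0 R" style pointers out of object 41. The helper name find_references is made up for this illustration, and the regular expression is deliberately naive: it assumes the text contains no strings or stream data that happen to look like a reference.

```python
import re

# Hypothetical helper, not part of any PDF library.
# Matches "<object number> <generation number> R".
REF_PATTERN = re.compile(r"(\d+)\s+(\d+)\s+R\b")


def find_references(pdf_text: str) -> list[tuple[int, int]]:
    """Return (object number, generation number) for each indirect reference."""
    return [(int(num), int(gen)) for num, gen in REF_PATTERN.findall(pdf_text)]


obj_41 = "41 0 obj<< /Type/Pages/Kids[34 0 R 43 0 R 52 0 R]/Count 3>>\nendobj"

# "41 0 obj" is a definition, not a reference, so it is not matched.
print(find_references(obj_41))
```

A real parser would tokenise the file properly rather than pattern-match, but this is enough to see that the /Kids array is just a list of object numbers to go and look up.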
Want a more thorough breakdown of all eight object types and how PDF viewers parse them? The PDF file format guide goes deeper.
2: Structure of a PDF File
Before we can start hacking together our own simple PDF file, a quick look at the high-level structure of a PDF is in order. The file is broken down into four parts. The first two are pretty straightforward. First there is the header section, whose only requirement is to have the version number in it:
%PDF-1.3
If you open up any old PDF document in a text editor you will see one at the top. You might also see a line or two of % symbols followed by some nonsense. Normally % means the rest of the text on that line is ignored (i.e. it is a comment), but some things, like %PDF-1.3, mean something to a PDF reader. You will see that a list of objects follows the header section. That’s the body section, which contains all the objects described earlier.
The next section requires a bit more of a long-winded explanation. It starts with a table that follows the xref keyword.
Here’s an example:
xref
0 4
0000000003 65535 f
0000017496 00000 n
0000000721 00003 n
0000000000 00007 f
The table is mainly a list of the addresses of each object in the body section. This is so objects can be accessed quickly just by using their ID number instead of traversing the object tree.
The first number after xref says that this list starts at object 0, and the second says that it has 4 entries. You won’t find a 0 0 obj in the PDF file because it’s a special sort of entry that represents the head of a linked list of free (unused) object numbers. That’s why the first line in the list has an f at the end.
An entry ending in n refers to an object that is in use. Its first block of numbers is the byte offset of the object from the start of the file, and the second block is its generation number.
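As a rough illustration of how a reader might take that table apart, here is a small Python sketch. XrefEntry and parse_xref_section are names invented for this example, and it assumes a single subsection like the one above (real files can have several).

```python
from typing import NamedTuple


class XrefEntry(NamedTuple):
    object_number: int
    offset: int       # byte offset for "n" entries; next free object number for "f" entries
    generation: int
    in_use: bool


def parse_xref_section(text: str) -> list[XrefEntry]:
    """Parse one 'xref' table with a single subsection (an assumption of this sketch)."""
    lines = [line.strip() for line in text.strip().splitlines()]
    if lines[0] != "xref":
        raise ValueError("expected the xref keyword")
    first, count = (int(part) for part in lines[1].split())
    entries = []
    for index, line in enumerate(lines[2:2 + count]):
        offset, generation, kind = line.split()
        entries.append(XrefEntry(first + index, int(offset), int(generation), kind == "n"))
    return entries


example = """xref
0 4
0000000003 65535 f
0000017496 00000 n
0000000721 00003 n
0000000000 00007 f
"""

for entry in parse_xref_section(example):
    print(entry)
```

With the table in this form, finding object 1 is just a matter of seeking to byte 17496 in the file.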
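After that lot is the trailer section, which tells a PDF reader how many entries the cross-reference table has (/Size), which object is the root of the object tree (/Root) and, after the startxref keyword, the byte offset at which the xref keyword starts.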
trailer
<< /Size 4 /Root 1 0 R>>
startxref
1205
%%EOF
3: Create a Non-Working PDF
In the previous section we looked at how a PDF file is structured. Now we are going to use a text editor to create one. The only problem with the PDF we are going to make is that it is not going to work. It will, however, give us an error message we can understand in the Acrobat PDF viewer. This is going to form the basis for creating a working PDF file in the sections that follow. The ingredients you require are: a text editor, a hex editor (I’m going to use HxD) and an at least partially functioning human brain. Preferably your own.
We are going to create all the parts mentioned earlier in a text editor and figure out the addresses of the things we put in our file using HxD. We can also see what error messages we can produce from Acrobat.
Firstly I’m gonna make a new blank file called myPdf.pdf. Just because I can, I’m gonna load it in Acrobat to see what it says:
“Adobe Reader could not open ‘myPdf.pdf’ because it is either not a supported file type or because the file has been damaged.”
Hardly surprising, but if you get this message from a supposedly working PDF in the future you can be sure it’s a bit knackered.
Now I’m adding the header part, which only requires a version number in the form: %PDF-2.0. Next we have the body section, where all the objects go. For this section we’re just going to have one object: object number 1, and it’s going to be a dictionary object (that we are not going to put anything in… yet!).
%PDF-2.0
1 0 obj << >>
endobj
Next we want the Cross Reference Table section. First we need the xref keyword. Then the number of the first object in our list and the number of objects in our file. So far we have two objects: 1 0 obj, which is in our body section, and object 0, which is the head of the linked list described earlier. So we end up with a line with 0 2 on it. The entries that follow have the information about our objects. They all have the same format, which is 10 digits, a space, 5 digits, a space and then a letter saying whether the entry is in use (n) or free (f).
xref
0 2
0000000000 65535 f
0000000010 00000 n
Notice I’ve put 10 as the address of object 1. As each character is a byte, it’s pretty easy to count up %PDF-2.0 plus the return characters (8 bytes plus a carriage return and a line feed), but if you want to check you can open your file in HxD (set the width box to 10 and the number system to decimal to make life easier) and click on the 1 of 1 0 obj to get the starting address of 1 0 obj.
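If you would rather not count by hand, the same check can be done in a few lines of Python. This is only a sketch: offset_of is a hypothetical helper, and the bytes below assume the file was saved with Windows-style line endings (a carriage return plus a line feed), which is what makes the answer 10 rather than 9.

```python
# The file so far, assuming CRLF ("\r\n") line endings.
body = b"%PDF-2.0\r\n1 0 obj << >>\r\nendobj\r\n"


def offset_of(data: bytes, marker: bytes) -> int:
    """Byte offset of the first occurrence of marker, counted from the start of the file."""
    position = data.find(marker)
    if position < 0:
        raise ValueError(f"{marker!r} not found")
    return position


address = offset_of(body, b"1 0 obj")
print(address)                    # the number a hex editor shows for the "1" of "1 0 obj"
print(f"{address:010d} 00000 n")  # the same number formatted as a cross-reference entry
```

Saving the same text with Unix line endings moves the object one byte earlier, which is why it is worth checking the offsets against the file you actually saved.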
Next you need the final part, which is the trailer section. You need the trailer keyword, then a trailer dictionary with the size of the file in objects and a reference to the root object, then the startxref keyword:
trailer <</Size 2/Root 1 0 R>>
startxref
Then you need the address of the Cross Reference Table (where the xref keyword starts, in bytes), which is 33 on mine. Finish off the file with %%EOF. So you end up with:
%PDF-2.0
1 0 obj << >>
endobj
xref
0 2
0000000000 65535 f
0000000010 00000 n
trailer <</Size 2/Root 1 0 R>>
startxref
33
%%EOF
If you open this in Acrobat you’ll get a different kind of error. If you hold down Ctrl while clicking OK you see another part of the error message: “Expected a dict object.” Which is fair enough, as we haven’t put any values in it.
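All of that address counting can also be automated. The following Python sketch assembles the same four sections and works the offsets out as it goes. build_pdf is a name invented for this example; it assumes every object has generation 0, a single cross-reference section, and CRLF line endings by default (so each table entry comes to exactly 20 bytes, which is what the format expects).

```python
def build_pdf(objects: list[bytes], root: int = 1, eol: bytes = b"\r\n") -> bytes:
    """Assemble header, body, cross-reference table and trailer.

    objects[i] holds the content of object i + 1, i.e. whatever sits
    between "N 0 obj" and "endobj".
    """
    # Cross-reference entries must be 20 bytes long, so pad a one-byte line ending.
    entry_eol = eol if len(eol) == 2 else b" " + eol

    out = bytearray(b"%PDF-2.0" + eol)
    offsets = []
    for number, content in enumerate(objects, start=1):
        offsets.append(len(out))              # address where "N 0 obj" starts
        out += b"%d 0 obj " % number + content + eol + b"endobj" + eol

    xref_offset = len(out)                    # address where the xref keyword starts
    size = len(objects) + 1                   # our objects plus object 0
    out += b"xref" + eol + b"0 %d" % size + eol
    out += b"0000000000 65535 f" + entry_eol  # object 0, head of the free list
    for offset in offsets:
        out += b"%010d 00000 n" % offset + entry_eol

    out += b"trailer <</Size %d/Root %d 0 R>>" % (size, root) + eol
    out += b"startxref" + eol + b"%d" % xref_offset + eol + b"%%EOF"
    return bytes(out)


# Rebuild the non-working file from this section: one empty dictionary.
print(build_pdf([b"<< >>"]).decode("ascii"))
```

To get a file on disk you could write the returned bytes out in binary mode, for example with open("myPdf.pdf", "wb"). Passing in more objects gives you a quick way to regenerate the table whenever the body changes.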
4: Create a Blank Page
Brace yourself! For you could soon be the proud owner of your own shiny, handmade, completely blank, one-page PDF document. But before you embark on what will no doubt be the defining moment of your life, I’m going to have to spill the beans about the body of a PDF document. I mentioned the body earlier: it is the part that contains all the objects that describe the actual document you see in a PDF viewer. It has to contain some important dictionary objects so a PDF reader can figure out what’s going on.
The objects in the body section of a PDF file are arranged in a tree hierarchy with a dictionary object at the root of it all called the Catalog. This is the Root object that the dictionary in the trailer section was referring to. The catalog dictionary must contain a minimum of two entries: /Type /Catalog and a reference to another dictionary object that represents the root of the Page Tree, i.e. /Pages 2 0 R.
1 0 obj << /Type /Catalog /Pages 2 0 R>>
endobj
The Page Tree contains two types of nodes, both of which are dictionary objects: Page Tree nodes, and Page nodes, each of which represents a single page in the document.
So we need to put some stuff in our Page Tree node which is 2 0 obj:
2 0 obj <</Type /Pages /Kids [3 0 R] /Count 1>>
endobj
This object references 3 0 obj, which is going to be a Page node.
A Page node needs a /Type of /Page and a /Parent reference back to its Page Tree node. The last thing we need is a /MediaBox entry:
/MediaBox [0 0 500 800]
This represents the page size: the four numbers are the coordinates of the lower-left and upper-right corners of the page.
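To get a feel for how big that page is, here is a small Python sketch that converts the MediaBox into physical sizes. It assumes the default user space, where one unit is 1/72 of an inch, and that the page has no /UserUnit entry changing that. media_box_size is a name made up for this example.

```python
POINTS_PER_INCH = 72   # default user space: 1 unit = 1/72 inch (assumes no /UserUnit)
MM_PER_INCH = 25.4


def media_box_size(media_box):
    """Return (width, height) in user space units for a [llx lly urx ury] box."""
    llx, lly, urx, ury = media_box
    return urx - llx, ury - lly


width, height = media_box_size([0, 0, 500, 800])
print(f"{width} x {height} units")
print(f"{width / POINTS_PER_INCH:.2f} x {height / POINTS_PER_INCH:.2f} inches")
print(f"{width / POINTS_PER_INCH * MM_PER_INCH:.1f} x {height / POINTS_PER_INCH * MM_PER_INCH:.1f} mm")
```

So our 500 by 800 page is a little narrower and a little shorter than A4, which would be roughly 595 by 842 units.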
Anyway here is our Page object:
3 0 obj<</Type /Page /Parent 2 0 R /MediaBox [0 0 500 800]>>
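endobj
Putting it all together (the offsets in the cross-reference table depend on the exact spacing and line endings in your file, so check them in your hex editor):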
%PDF-2.0
1 0 obj <</Type /Catalog /Pages 2 0 R>>
endobj
2 0 obj <</Type /Pages /Kids [3 0 R] /Count 1>>
endobj
3 0 obj<</Type /Page /Parent 2 0 R /MediaBox [0 0 500 800]>>
endobj
xref
0 4
0000000000 65535 f
0000000010 00000 n
0000000060 00000 n
0000000115 00000 n
trailer <</Size 4/Root 1 0 R>>
startxref
199
%%EOF
5: Hello World PDF
I have previously talked about the different objects that are used to form a PDF file and you can read up on them in our PDF file format guide. One type I mentioned was the stream object. Stream objects contain all the instructions describing what a PDF page is going to look like. By the end of this section we are going to be able to make a Hello World PDF. I’m going to have to make use of a stream object so I can put some text in a PDF document.
If you open up any old PDF in a text editor, the majority of text you will see will be contained in stream objects. Their format is slightly different from the other objects: a stream starts with a dictionary. This must have a /Length entry saying how long the stream is in bytes. The length of the stream is everything between the keywords stream and endstream (minus the final end-of-line characters if the stream has them). Normally when you open a PDF the stuff in the stream is compressed. You can tell what kind of compression is used by the /Filter key in the stream’s dictionary. For example:
10 0 obj<</Length 40 /Filter /FlateDecode>>
stream
...bunch of compressed stuff...
endstream
endobj
If you went to the trouble of uncompressing this stuff you would find a list of instructions. These instructions are the commands that create all the content in a PDF. Here are the contents of the stream uncompressed:
BT
/F1 24 Tf
175 720 Td
(Hello World!)Tj
ET
BT means Begin Text and ET means End Text. The stuff in between sets the font, the position and what it’s going to say. The instructions are Tf, Td and Tj. Note how the values that these instructions need are written first. So the first instruction, Tf, needs a reference to a font (/F1, I’ll come back to that in a bit) and a font size (24). The Td operator sets the text position. The first number is the number of units from the left and the second is the number of units from the bottom. The units are quite interesting. They belong to a logical coordinate system (by default one unit is 1/72 of an inch) that only gets translated to real-world coordinates when something has to be rendered to a real-life thing, such as a printer or a monitor. This allows, for example, the size and positioning of text to be consistent on different media. Finally we have the Tj instruction, and the characters in the brackets get drawn on the PDF document.
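Going back to the compressed version of this stream for a moment: /FlateDecode means the data has been squashed with the deflate algorithm in zlib format, so Python’s standard zlib module can go in both directions. The sketch below compresses the instructions above and wraps them up as a stream object. The object number 10 is just carried over from the earlier example, and the point to notice is that /Length counts the compressed bytes, not the original text.

```python
import zlib

# The uncompressed instructions from above, one per line.
content = b"BT\n/F1 24 Tf\n175 720 Td\n(Hello World!)Tj\nET"

compressed = zlib.compress(content)

# /Length is the number of bytes between "stream" and "endstream",
# which for a filtered stream means the compressed size.
stream_object = (
    b"10 0 obj<</Length %d /Filter /FlateDecode>>\nstream\n" % len(compressed)
    + compressed
    + b"\nendstream\nendobj\n"
)

# Going the other way is what a viewer does before it reads the instructions.
print(zlib.decompress(compressed).decode("ascii"))
print(len(content), "bytes uncompressed,", len(compressed), "bytes compressed")
```

For a stream this short the compressed form will not save much, if anything, which is one reason the hand-written examples here leave the /Filter entry out altogether.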
Before I add that to my PDF document we have to sort out that reference to /F1. In streams you can’t reference objects in the same way you do outside a stream (i.e. 10 0 R); you have to map /F1 to an object and make that mapping available through the /Resources dictionary. This dictionary of resources sits on the page alongside the /Contents entry, which points to your stream object:
3 0 obj<</Type /Page /Parent 2 0 R /Resources 4 0 R /MediaBox [0 0 500 800] /Contents 7 0 R>>
endobj
4 0 obj<</Font 5 0 R>>
endobj
5 0 obj<</F1 6 0 R>>
endobj
6 0 obj<</Type /Font /Subtype /Type1 /BaseFont /Helvetica>>
endobj
7 0 obj<</Length 40>>
stream
BT
/F1 24 Tf
.....
endstream
endobj
So we are making use of a /Page object. The page’s /Contents entry points to a stream object that prints our text. The stream needs to know what object /F1 points to. Our /Resources dictionary is at 4 0 R and only contains a /Font entry, which points to where /F1 is mapped. You can see in 5 0 obj that it maps to an object that represents one of the default fonts: Helvetica. Even though it seems a bit long-winded, it actually helps towards speeding up a PDF viewer. Instead of loading a font you just hang on to the reference; if it doesn’t get called (you don’t look at the page the font is on) you don’t have to load the font.
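The chain from page to font can be walked in a few lines of Python. This is only a model of the objects above using plain dictionaries: Ref, resolve and find_font are names invented for the illustration, and generation numbers are left out to keep it short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ref:
    """An indirect reference such as 4 0 R (generation number omitted)."""
    number: int


# Objects 3 to 6 from the example above, keyed by object number.
objects = {
    3: {"Type": "Page", "Parent": Ref(2), "Resources": Ref(4), "Contents": Ref(7)},
    4: {"Font": Ref(5)},
    5: {"F1": Ref(6)},
    6: {"Type": "Font", "Subtype": "Type1", "BaseFont": "Helvetica"},
}


def resolve(value):
    """Follow indirect references until a direct object is reached."""
    while isinstance(value, Ref):
        value = objects[value.number]
    return value


def find_font(page_number: int, font_name: str) -> dict:
    """Look a font name from a content stream up through the page's resources."""
    page = resolve(Ref(page_number))
    resources = resolve(page["Resources"])
    fonts = resolve(resources["Font"])
    return resolve(fonts[font_name])


print(find_font(3, "F1")["BaseFont"])
```

Nothing is looked up until find_font is called, which is the lazy behaviour described above: a viewer only needs to chase the references for fonts on pages it actually draws.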
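Anyway, put it all together and you get, possibly, a world first: how to make a “Hello World” PDF document! (In the full listing the /Font dictionary is written inline inside the /Resources dictionary, so the font and the stream move up to objects 5 and 6.)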
%PDF-2.0
1 0 obj <</Type /Catalog /Pages 2 0 R>>
endobj
2 0 obj <</Type /Pages /Kids [3 0 R] /Count 1>>
endobj
3 0 obj<</Type /Page /Parent 2 0 R /Resources 4 0 R /MediaBox [0 0 500 800] /Contents 6 0 R>>
endobj
4 0 obj<</Font <</F1 5 0 R>>>>
endobj
5 0 obj<</Type /Font /Subtype /Type1 /BaseFont /Helvetica>>
endobj
6 0 obj
<</Length 44>>
stream
BT /F1 24 Tf 175 720 Td (Hello World!)Tj ET
endstream
endobj
xref
0 7
0000000000 65535 f
0000000009 00000 n
0000000056 00000 n
0000000111 00000 n
0000000212 00000 n
0000000250 00000 n
0000000317 00000 n
trailer <</Size 7/Root 1 0 R>>
startxref
406
%%EOF
6: Path Objects
A PDF is drawn using a load of commands that sit in stream objects. With these commands a PDF viewer can figure out how to draw all the content that you see on a page. I’m going to explore the graphics commands and create a PDF in a text editor that draws a couple of lines on a page.
In a content stream you can create a combination of different graphics objects that a PDF viewer can understand. The one we are going to muck about with in this section is the Path object. A Path object is basically a list of points. Each Path object has a starting point and new points are added to the list. Each segment of the path can be a straight line, curve or rectangle. The collection of points is treated as one Path object because the painting operation you apply is applied to all the segments of your path. For example:
175 720 m 175 50 l h S
The previous line contains three path commands followed by a paint command. The first one is m. This means start a path at the coordinates given. The l command means draw a line segment from the previous point in the path to the coordinates 175, 50. The h command closes off the path. When you have set your path out you then have to give it some sort of paint command so that it will do something. I used the command S, which strokes the path.
175 720 m 175 700 l 300 800 400 720 v h S
This adds another command, v, which draws a curved line (a Bézier curve). The first two coordinates are the control point, which sets the bend of the line; the next two are where the line ends.
175 720 m 175 700 l 300 800 400 600 v 100 650 50 75 re h S
The re command draws a rectangle with its corner at coordinates 100 650 and a width and height of 50 and 75 respectively. You may notice that when this gets drawn the rectangle is not physically connected to the rest of the path; it is, however, part of the same path, and the painting operation you apply at the end (S) is applied to the whole lot.
%PDF-2.0
1 0 obj <</Type /Catalog /Pages 2 0 R>>
endobj
2 0 obj <</Type /Pages /Kids [3 0 R] /Count 1 /MediaBox [0 0 500 800]>>
endobj
3 0 obj<</Type /Page /Parent 2 0 R /Contents 4 0 R>>
endobj
4 0 obj
<</Length 61>>
stream
175 720 m 175 500 l 300 800 400 600 v 100 650 50 75 re h S
endstream
endobj
xref
0 5
0000000000 65535 f
0000000010 00000 n
0000000059 00000 n
0000000140 00000 n
0000000202 00000 n
trailer <</Size 5/Root 1 0 R>>
startxref
314
%%EOF
If you save the above as a file and open it in a PDF viewer you should be able to see a few lines and a rectangle about the place. It looks pretty boring at the moment, so in the next section we’ll look at the different ways to fill paths in.
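Because the values always come before the command that uses them, a content stream like the one in this file is easy to take apart. Here is a rough Python sketch that splits it into commands. parse_path and the PATH_OPERATORS table are made up for this example and only know about the six commands used in this section.

```python
# Operator -> number of operands it expects (only the commands used in this section).
PATH_OPERATORS = {"m": 2, "l": 2, "v": 4, "re": 4, "h": 0, "S": 0}


def parse_path(stream: str):
    """Split a content stream into (operator, operands) pairs."""
    commands, operands = [], []
    for token in stream.split():
        if token in PATH_OPERATORS:
            expected = PATH_OPERATORS[token]
            if len(operands) != expected:
                raise ValueError(f"{token} expects {expected} operands, got {len(operands)}")
            commands.append((token, operands))
            operands = []                    # the next command starts with a clean slate
        else:
            operands.append(float(token))    # numbers pile up until a command arrives
    return commands


stream = "175 720 m 175 500 l 300 800 400 600 v 100 650 50 75 re h S"
for operator, operands in parse_path(stream):
    print(operator, operands)
```

This is essentially how a viewer reads the stream: numbers are collected until an operator turns up, and the operator then consumes them.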
7: Graphics State
It would be nice to get some colour on the screen this time round, and in doing so give an introduction to the Graphics State. Associated with a PDF file is a Graphics State. This data structure holds information that describes how graphics are rendered to the screen. Values such as the current colour and the colours available are stored in the Graphics State, as well as weird and wonderful elements like the current clip, the transformation matrix, funny things you can do with lines and other settings that alter the way the graphics will be rendered from the user space (the coordinate system of the PDF) to the device space (the monitor).
As there is just one Graphics State available and a PDF may contain lots of graphics objects that may want to do entirely different things, the current graphics state is usually saved when a content stream is going to draw something. The graphics state is saved on a stack with the q command. Capital Q pops the stack and restores the previously saved graphics state.
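The save and restore behaviour is easier to see with a toy model. The Python sketch below tracks just one piece of the graphics state, the fill colour, and reports which colour is in force each time something is filled. It uses the rg (set fill colour), re (rectangle) and f (fill) commands that are explained later in this section, and fill_colours is a name invented for the example.

```python
def fill_colours(stream: str):
    """Return the fill colour in effect at each f (fill) command."""
    state = {"fill": (0.0, 0.0, 0.0)}      # the fill colour starts out as black
    stack, operands, fills = [], [], []
    for token in stream.split():
        if token == "q":
            stack.append(dict(state))      # save a copy of the current state
        elif token == "Q":
            state = stack.pop()            # go back to the last saved state
        elif token == "rg":
            state["fill"] = tuple(operands[-3:])
            operands = []
        elif token == "re":
            operands = []                  # rectangle geometry is ignored in this model
        elif token == "f":
            fills.append(state["fill"])
        else:
            operands.append(float(token))
    return fills


stream = (
    "0.9 0.5 0.0 rg 100 400 300 300 re f "
    "q 0.1 0.9 0.5 rg 100 200 200 200 re f Q "
    "350 200 50 50 re f"
)
for colour in fill_colours(stream):
    print(colour)
```

The third rectangle comes out in the first colour again, because the Q throws away the change that was made between q and Q.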
The Graphics State also has an associated colorspace. A colorspace basically describes what colours are available and how they are rendered to the current page. You can define your own, and there are also default ones that a PDF viewer has to know about. For example, there is DeviceGray (grayscale colours), DeviceRGB (red-green-blue) and a load more that represent colour in different ways. We’re going to stick with DeviceRGB for this section. If you want to learn a bit more about colorspaces, Mark has done an article on color in PDF files.
Some graphics state settings are changed through an ExtGState (extended graphics state) dictionary. This is used in the same way as Font is accessed in the resource dictionary discussed earlier. You associate an ExtGState with a name like /GS1 and then apply it using the gs command. Colorspaces you define yourself are set up in a similar way, through a /ColorSpace entry in the resource dictionary and the cs and CS commands. Fortunately, you don’t have to bother with that for the default ones, so you should find that a content stream containing:
0.9 0.5 0.0 rg 100 400 300 300 re f
will draw an orange box on the screen. The rg command sets the colour space to DeviceRGB and gives the red/green/blue components (maximum 1.0 and minimum 0.0) of the colour used to fill the rectangle (a capital RG sets the stroke colour instead). The following PDF document draws three coloured rectangles on a page; note how the graphics state is saved and restored.
%PDF-2.0
1 0 obj <</Type /Catalog /Pages 2 0 R>>
endobj
2 0 obj <</Type /Pages /Kids [3 0 R] /Count 1 /MediaBox [0 0 500 800]>>
endobj
3 0 obj <</Type /Page /Parent 2 0 R /Contents 4 0 R>>
endobj
4 0 obj
<</Length 105>>
stream
0.9 0.5 0.0 rg 100 400 300 300 re f q 0.1 0.9 0.5 rg 100 200 200 200 re f Q 350 200 50 50 re f
endstream
endobj
xref
0 5
0000000000 65535 f
0000000010 00000 n
0000000059 00000 n
0000000140 00000 n
0000000203 00000 n
trailer <</Size 5/Root 1 0 R>>
startxref
352
%%EOF
Final Words
And that’s it. You’ve manually built a PDF file — header, objects, cross-reference table, byte offsets and all. Congratulations.
Hopefully it’s now clear why nobody does this by hand in production. The spec is vast, the edge cases are many, and we didn’t even touch encryption, embedded fonts, form fields, or digital signatures.
This Is Exactly Why Developers Use JPedal
If you’ve just manually calculated byte offsets in a hex editor, you have a very concrete sense of what a PDF library is actually doing for you. JPedal handles all of this — parsing the object tree, managing the graphics state, resolving cross-reference tables, rendering text and paths correctly — so you don’t have to.
It’s a mature Java PDF library built for developers who need accurate, reliable PDF rendering and extraction without reinventing the wheel. Whether you’re extracting text, rendering pages, or processing documents at scale, it’s worth a look.
And if you want to go deeper on the PDF spec itself, the PDF file format series covers everything from linearisation to cross-reference streams to the full graphics model.