Understanding the PDF File Format: Document and Page level

In my latest article on “Understanding the PDF File Format”, I want to give you a better understanding of how a PDF file is structured and how this can affect the way you create and use PDF files.

What is inside a PDF?

A PDF file is a binary serialisation of a set of objects. These are best imagined as a tree of linked objects which can be walked from the root down. There is one PDF tree for the whole document, with each Page having its own Object, so you can visualise a PDF as having a Document and a Page level.
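To make the tree idea concrete, here is a minimal sketch in Python. It is a toy model, not a parser: every PDF object is just a dict, and the names collect_pages, page1, inner and so on are made up for illustration. The key names (Root, Pages, Kids, Type) are the ones the PDF format uses.

```python
# Toy model (an assumption for illustration): each PDF object is a plain dict
# and references between objects are ordinary Python references.

def collect_pages(node):
    """Return the leaf Page objects under a page-tree node, in document order."""
    if node.get("Type") == "Page":
        return [node]
    pages = []
    for kid in node.get("Kids", []):
        pages.extend(collect_pages(kid))
    return pages

page1 = {"Type": "Page", "Contents": "page 1 drawing commands"}
page2 = {"Type": "Page", "Contents": "page 2 drawing commands"}
page3 = {"Type": "Page", "Contents": "page 3 drawing commands"}

# Intermediate Pages nodes may group pages, so the tree can be several levels deep.
inner = {"Type": "Pages", "Kids": [page2, page3]}
page_tree_root = {"Type": "Pages", "Kids": [page1, inner]}

# Document level: the trailer points at the catalog, which points at the page tree.
catalog = {"Type": "Catalog", "Pages": page_tree_root}
trailer = {"Root": catalog}

pages = collect_pages(trailer["Root"]["Pages"])
print(len(pages))
```

Everything reached before a Page object belongs to the Document level; everything hanging off a Page object belongs to the Page level.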

What is at the Document Level?

The PDF will have a single tree of objects (which may be created by combining multiple reference tables). This will contain objects which are shared by all pages. These include:-

An information object (with metadata held as fields, typically alongside an XML metadata stream).
An encryption object (which is used to encrypt and decrypt all objects in the file).
An ID (used for encryption).
A form object (containing the form objects for all pages or possibly the XML streams which define the actual forms and pages).
Structure information on the flow of the data in the PDF.
Other document content such as Thumbnails, etc.
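As a rough guide to where some of these live, the sketch below checks a dict-based toy model of a PDF for the usual Document level entries. The key names (Info, Encrypt, ID, AcroForm, StructTreeRoot, Metadata) come from the PDF format; the function name document_level_entries and the sample values are hypothetical.

```python
# Toy model (an assumption for illustration): the trailer and catalog are plain dicts.
TRAILER_KEYS = ("Info", "Encrypt", "ID")                    # found in the trailer
CATALOG_KEYS = ("AcroForm", "StructTreeRoot", "Metadata")   # found in the catalog

def document_level_entries(trailer):
    """List which Document level entries are present, and where they were found."""
    catalog = trailer.get("Root", {})
    found = [("trailer", key) for key in TRAILER_KEYS if key in trailer]
    found += [("catalog", key) for key in CATALOG_KEYS if key in catalog]
    return found

# Made-up sample: an unencrypted file with a form and XML metadata.
trailer = {
    "Root": {
        "Type": "Catalog",
        "AcroForm": {"Fields": []},
        "Metadata": "<x:xmpmeta>...</x:xmpmeta>",
    },
    "Info": {"Title": "Example", "Producer": "Hypothetical Writer"},
    "ID": ["0123abcd", "0123abcd"],
}

entries = document_level_entries(trailer)
for location, key in entries:
    print(location, key)
```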

What is at the Page level?

The page level contains the content specific to the page, so it will contain the unique contents, fonts and images for each page. It can also contain a list of Annotations (for just that page). Because the PDF is a tree, a page will inherit all the Document level values, and any common data further up the tree. For example, a font could be shared by several pages.

In most PDF files, the font will be unique to each page, which is why our PDF to HTML5 converter writes out a font for every page.
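If you want to check whether fonts are shared in a particular file, one approach is to see which font objects are referenced from more than one page. This is only a sketch on a dict-based toy model: shared_fonts is a hypothetical helper, Python object identity stands in for the PDF object number, and Resources inherited from further up the tree are ignored for brevity.

```python
def shared_fonts(pages):
    """Return (BaseFont, page indexes) for every font object used on more than one page."""
    users = {}  # id(font object) -> (font, list of page indexes that reference it)
    for index, page in enumerate(pages):
        fonts = page.get("Resources", {}).get("Font", {})
        for font in fonts.values():
            _, indexes = users.setdefault(id(font), (font, []))
            if index not in indexes:  # a page may list the same font under two names
                indexes.append(index)
    return [(font["BaseFont"], indexes)
            for font, indexes in users.values() if len(indexes) > 1]

# Made-up sample: one font object shared by two pages, one subset font used once.
helvetica = {"Type": "Font", "BaseFont": "Helvetica"}
subset = {"Type": "Font", "BaseFont": "ABCDEF+Arial"}
pages = [
    {"Type": "Page", "Resources": {"Font": {"F1": helvetica}}},
    {"Type": "Page", "Resources": {"Font": {"F1": helvetica, "F2": subset}}},
    {"Type": "Page", "Resources": {"Font": {}}},
]

result = shared_fonts(pages)
print(result)
```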

Page and Document level?

Some objects (such as Interactive objects) can appear at both a Page (in the Annots) and a Document (in the Form Object) level. So if you are parsing the PDF, you may need to allow for this.
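One way to allow for this when parsing is to gather interactive objects from both places and de-duplicate them. The sketch below does that on a dict-based toy model; walk_fields and interactive_objects are hypothetical names, and Python object identity stands in for the PDF object number. The keys AcroForm, Fields, Kids, Annots and the Widget subtype are the ones the format uses.

```python
def walk_fields(fields):
    """Yield every field in a form's field tree, parents before their Kids."""
    for field in fields:
        yield field
        yield from walk_fields(field.get("Kids", []))

def interactive_objects(catalog, pages):
    """Collect form objects from the Document level and the Page level, once each."""
    seen = {}  # id(object) -> object, so the same object is never counted twice
    for field in walk_fields(catalog.get("AcroForm", {}).get("Fields", [])):
        seen.setdefault(id(field), field)
    for page in pages:
        for annot in page.get("Annots", []):
            if annot.get("Subtype") == "Widget":  # skip links, comments, etc.
                seen.setdefault(id(annot), annot)
    return list(seen.values())

# Made-up sample: one field listed at both levels, one widget only on the page.
name_field = {"FT": "Tx", "T": "name", "Subtype": "Widget"}
orphan = {"FT": "Btn", "T": "submit", "Subtype": "Widget"}
link = {"Subtype": "Link"}

catalog = {"Type": "Catalog", "AcroForm": {"Fields": [name_field]}}
pages = [{"Type": "Page", "Annots": [name_field, link, orphan]}]

found = [obj["T"] for obj in interactive_objects(catalog, pages)]
print(found)
```

Scanning only the Form Object would miss the second widget here, and scanning only the Annots would miss any field that has no widget on a page.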

Some objects such as CropBox and MediaBox can have Document level settings (held on a parent node of the page tree) which are then used as the default settings unless overridden. So if a PDF contains one different page size, it would have a Document level MediaBox used for most pages and then a unique page level MediaBox setting for just the odd page. This gives the PDF file format great flexibility.
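The lookup rule can be sketched in a few lines of Python: start at the page and follow the Parent links up the tree until the key is found. Again this uses a dict-based toy model, and resolve_inherited and crop_box are hypothetical helper names. The fallback from CropBox to MediaBox reflects the format's default when no CropBox is set.

```python
def resolve_inherited(page, key):
    """Look for key on the page, then on each ancestor reached through Parent."""
    node = page
    while node is not None:
        if key in node:
            return node[key]
        node = node.get("Parent")
    return None

def crop_box(page):
    """CropBox falls back to the page's (possibly inherited) MediaBox."""
    box = resolve_inherited(page, "CropBox")
    return box if box is not None else resolve_inherited(page, "MediaBox")

# Made-up sample: A4 set once at the top of the page tree, one larger page overriding it.
root = {"Type": "Pages", "MediaBox": [0, 0, 595, 842]}
normal = {"Type": "Page", "Parent": root}
odd = {"Type": "Page", "Parent": root, "MediaBox": [0, 0, 842, 1191]}
root["Kids"] = [normal, odd]

print(resolve_inherited(normal, "MediaBox"))
print(resolve_inherited(odd, "MediaBox"))
print(crop_box(normal))
```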

I find that when I am working with a PDF file, thinking of it at the Document and the Page level helps me to better understand how the file works and how I can optimise it. Do you have any tips on understanding PDF files?

Want to learn more about PDF files?

This post is part of our “Understanding the PDF File Format” series. In each article, we aim to take a specific PDF feature and explain it in simple terms. If you wish to learn more about PDF, we have 13 years' worth of PDF knowledge and tips, so click here to visit our series index!