Leon Atherton Leon is a developer at IDRsolutions and product manager for BuildVu. He oversees the BuildVu product strategy and roadmap in addition to spending lots of time writing code.

Understanding the PDF File Format


The PDF file format is a complex document structure that we have spent over 20 years working with. In this guide we have gathered a series of articles covering all aspects of the PDF format, including bugs, gotchas and even how to create your own PDF file manually.

General:

General questions about PDF files and the format.

What new PDF developers need to know
Learning about PDF
Text, Shapes and Images
OCR (Optical Character Recognition) PDF files
Bookmarks and Links
What is PDF Pagesize? CropBox, MediaBox, ArtBox, BleedBox, TrimBox?
PDF Format and Style Information
A quick guide to PDF for Java (and non-Java) developers
Why writing a PDF parser is such a challenging task (Part 234)
Searching PDF Files
How do stacks work in PDF files
How do PDF files manage limitless position accuracy of shapes & images?
Why even Acrobat Reader can’t support 100% PDF Specification
Choosing sensible optimisations for PDF files
Corrupt PDFs? Maybe this is your problem
How to compare 2 PDF Files
Working out PDF Page Size in Inches or Centimetres
There is more than one PDF File Specification
Don’t Blame the PDF File Format
Be careful how you remove critical data from a PDF File
Find out what’s really in your PDF files
3 Reasons why PDF Commands matter.
The definitive PDF book from the top PDF expert
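
As a rough illustration of the page-size topics listed above: page boxes such as the MediaBox are given as four numbers in PDF units, and in default user space one unit is 1/72 of an inch. The sketch below is a minimal Python example under that assumption (it ignores any UserUnit scaling or page rotation). The function names and the sample box values are made up for illustration.

```python
POINTS_PER_INCH = 72.0  # default user space: 1 unit = 1/72 inch
CM_PER_INCH = 2.54


def box_size(box):
    """Return (width, height) in points for a [llx, lly, urx, ury] page box."""
    llx, lly, urx, ury = box
    return abs(urx - llx), abs(ury - lly)


def points_to_inches(points):
    return points / POINTS_PER_INCH


def points_to_cm(points):
    return points_to_inches(points) * CM_PER_INCH


# Illustrative MediaBox values (US Letter), not taken from a real file.
media_box = [0, 0, 612, 792]
width_pt, height_pt = box_size(media_box)
print(points_to_inches(width_pt), points_to_inches(height_pt))
print(round(points_to_cm(width_pt), 2), round(points_to_cm(height_pt), 2))
```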

The PDF File itself:

This section covers the actual file format used to store a PDF: what you see when you open a PDF in a text editor.

Viewing PDF Objects
PDF Object Streams
Multiple Trailers on PDF Files
PDF Xref Tables Explained
Text Streams
decodeArray
How are images stored?
PDF Dictionary
Named Locations
Linearized PDF Files
Form XObjects
2 Problems with Corrupt PDF Data Streams
How can a PDF file be broken?
Identifying a PDF File from its first line
No Startxref found in last 1024 bytes?
Embedding your own data in PDF Files
Intriguing PDF xref Issue
Strange PDF File of the Week
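
Two of the topics above lend themselves to a small sketch: a PDF normally begins with a `%PDF-x.y` header line, and the `startxref` keyword near the end of the file is followed by the byte offset of the cross-reference data. The Python below is a minimal, assumption-laden example. The function names are hypothetical, the 1024-byte search windows are a common convention rather than a guarantee, and the sample bytes are a stand-in, not a valid PDF.

```python
import re


def pdf_version(data: bytes):
    """Return the version from a '%PDF-x.y' header, or None if none is found."""
    # Search only near the start; some files carry junk bytes before the header.
    match = re.search(rb"%PDF-(\d\.\d)", data[:1024])
    return match.group(1).decode("ascii") if match else None


def find_startxref(data: bytes, window: int = 1024):
    """Return the offset written after the last 'startxref' keyword, or None."""
    tail = data[-window:]
    pos = tail.rfind(b"startxref")
    if pos == -1:
        return None
    match = re.match(rb"startxref\s+(\d+)", tail[pos:])
    return int(match.group(1)) if match else None


# Stand-in bytes for illustration only; this is not a complete PDF file.
sample = b"%PDF-1.7\n...objects...\nstartxref\n1234\n%%EOF\n"
print(pdf_version(sample))
print(find_startxref(sample))
```

Note that the header version is only a starting point: a document can also declare a later version inside the file, so treat the result as a hint rather than a definitive answer.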

Images in PDF:

Images can be stored in PDF files in several ways

Images – An Overview
3 Examples of unusual ways to use PDF Image Masks
3 Types of Image Mask
PDF Image DPI
Advantages of JBIG2 compression in PDF explained
There are several versions of each image inside your PDF file
Do you need an image that big in your PDF file?
Small Images can cause big problems in PDF Files
A suggestion to the Prawn development team on making smaller PDF files
Making sure image names are unique in PDF files
Large images in a PDF File
Extract Raw JPEG Images from a PDF File
Filter and DecodeParms Objects for a PDF Image

Color handling in PDF:

Color support inside PDF files is very powerful and complex.

Color – An Overview
PDF Image Color Depth
Indexed Colorspaces
The Color White in PDF Files
ICCBased Colorspaces
YCCK Color Conversion in PDF Files
CMYK does not always mean CMYK
Fine Tuning PDF Image Color with ICC Profiles
Convert PDF to Grayscale or Black and White

Text in PDF:

How Text is stored, displayed and extracted from a PDF file

PDF Text – An Overview
ActualText
PDF Text Co-ordinates
Carriage returns, spaces and other gaps
PDF Mystery – What is the correct value for a Text Field
PDF Text Extraction with Java
The easy way to discover if a PDF File contains structured content
Why can I not extract text from this GhostScript generated PDF file?
Why can’t I extract text from this PDF file?
Extracting Text References from a PDF File
Extracting Structured Text from PDF Files
Space is a special character
Text Spaces in PDF Files
Space: The Final Frontier… in PDF

Fonts in PDF:

PDF files can use three different font technologies for display

PDF Fonts – An Overview
Introduction to PDF Font Technologies
Embedded CMAP Tables
What are CID Fonts?
Custom Font Encodings
Are there really 3 types of fonts in PDF files?
Standard Font Information
Glyph Names – What is in a name?
TrueType Font Hinting
Why the TrueType Hinting Patent Expiration Matters
Be careful with your PDF Fonts
Are your TrueType CMap Tables lying to you?
Mystery of the PDF file and the missing euro character
Problems caused by arial fonts in PDF files
Differences in the PDF Differences Tables
TrueType Hinting – Big Screens for Small Details
Why are CID Fonts far more complicated than non-CID Fonts?
Embedded PDF Truetype Fonts are always MAC encoded unless they are not
PDF with odd Type3 Fonts in Ghostscript 8.50

PDF Forms, Annotations & Interactive Elements:

PDF files can contain interactive elements

Introduction to PDF Forms
Introduction to FDF Forms
Introduction to XFA Forms
Interactive Elements
Layers in PDFs
Extracting Flattened Form Data from a PDF File
The Mystery Behind PDF Form Names
What is PDF Form Flattening?
What are PDF readonly text fields?
Not all forms are PDF forms

PDF Security:

PDF files have their own security systems and processes

PDF Security (Passwords and Certificates)
Brief Overview of Security Features offered by the PDF file format
PDF Password Protection
Protecting PDF Content
Why do I need the PDF password to open the PDF file?
Creating your own test certificates and keys for signing PDF files

Q&A:

Questions developers often ask us

Why use the PDF File Format?
How big is a PDF Page in bytes?
Why can’t I just open and edit a PDF File?
How do I find out the PDF version used?
How do barcodes appear inside a PDF file?
Do I have to download the whole PDF if I view it across the internet?
Why is my PDF Producer showing in Chinese?
What happens if the CropBox is smaller than the MediaBox?
Should Broken PDF Files Fail in Acrobat?
Where do your PDF objects start in a PDF file?

PDF Bugs we have investigated:

Here we write up some of the more intriguing bugs we have investigated in PDF files.

An Extreme Case of Recursion
Using SMask and Image ‘the opposite way’ round
Zero Bytes in a String
X Marks the spot (or not)
ICC Colorspace Alt Setting
Simulating an SMask with Vector Graphics
Mixed up Font Object
PDF Text is really a tiny image with a big SMask
Tiny Dash Values and the Java JVM
Values out of Range
Missing Image Data
Missing Image Data 2
Dealing with 3 Types of Fonts
Pointless Font Inclusion
Odd text rendering issue in Acrobat on Mac
Phantom PDF Objects

CCITT Encoding in PDF:

CCITT is used to store compressed data inside PDF files.

CCITT Encoding in PDF – Converting CCITT data into a TIFF Image
CCITT Encoding in PDF – Black and White Facts
CCITT Encoding in PDF – Rows and Height Gotcha
CCITT Encoding in PDF – Decoding CCITT Data
CCITT Encoding in PDF – G31D CCITT Data Overview
CCITT Encoding in PDF – Decoding G31D CCITT Data

Make your own PDF file manually with our ‘Hello World’ coding example

One of our developers bravely set out to write the ‘Hello World’ tutorial of PDF files, creating a PDF file from scratch manually, in a text editor. Follow the series here:

Part 1: PDF Objects and Data Types
Part 2: Structure of a PDF file
Part 2.5: Create a non working PDF
Part 3: DIY Blank Page
Part 4: Hello World Pdf
Part 5: Path objects
Part 6: Graphics State
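
For readers who would rather generate the bytes than type them, here is a minimal Python sketch of the same idea: a one-page 'Hello World' PDF assembled by hand, with the cross-reference offsets computed as the objects are written. This is an illustration and not the code from the series. The function name, object layout, page size and font choice (the standard Helvetica font, not embedded) are all assumptions, and the text is assumed to be plain ASCII without parentheses or backslashes, which would need escaping.

```python
def build_hello_pdf(text="Hello World"):
    """Assemble a one-page PDF as bytes, recording each object's byte offset."""
    # Content stream: begin text, pick font F1 at 24pt, move to (72, 720), show text.
    content = f"BT /F1 24 Tf 72 720 Td ({text}) Tj ET".encode("ascii")
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Resources << /Font << /F1 4 0 R >> >> /Contents 5 0 R >>",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
        b"<< /Length %d >>\nstream\n" % len(content) + content + b"\nendstream",
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset where this object starts
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # entry for the free object 0
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset  # each entry is exactly 20 bytes

    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)


pdf_bytes = build_hello_pdf()
```

To try it, write `pdf_bytes` to a file opened in binary mode (for example `open("hello.pdf", "wb")`) and open the result in a PDF viewer or a text editor.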



Are you a Developer working with PDF files?

Our developer's guide contains a large number of technical posts to help you understand the PDF file format.

Find out more about our software for Developers

Convert PDF to HTML5 or SVG
Convert AcroForms and XFA to HTML5
Java PDF SDK for working with PDF files

13 Replies to “Understanding the PDF File Format”

  1. Hello, JPedal team 🙂
    First of all, allow me to express my satisfaction at reading such simple and clear background knowledge on PDF. This is really helpful for anyone just starting out in the PDF world. Thank you for this great job.
    I found these posts while looking for a *real* way of doing some redactions. I thought I had everything when I found ‘pdfedit’ (with a combination of its ‘replaceText’, its ‘findText’, its ‘drawRect’ and its ‘flattener’ functionalities), but then the ugly truth came up: sometimes not all characters are available. I guess it’s the embedded fonts’ fault, but I am not quite sure. This is when I started to read your posts 😉
    The fact is, I guess I will be able to implement a functional redact feature if I find a way to ensure my replacement char (the one I use to replace each char of the redacted phrase) is in fact available. I see two scenarios here:
    1- the char is available (and so, everybody is happy :P), or
    2- the char is not available, but I am able to insert it into the right (embedded?) font.
    Could you help me to accomplish this, please? Some hints would be appreciated.
    Thanks in advance.
    Best regards and keep this spirit!


    Alejandro

  2. Hello Leon,
    thanks for your work here. This is a really nice collection of helpful hints and tips. I’m searching the web for an explanation of how to embed an XML file in a PDF/A-3 file in code. Can you help me?
    Kind regards,
    Al Mudy

  3. I have a question.
    What data format is used in PDF to draw a table? Is there some type of native table object in PDF that we can use, or is it just vector graphics that paint a table? And how is table extraction done in PDF content extraction libraries?
    Could you please explain this.

    1. If the PDF was created with additional tagged meta data then there may be tags (there is no specification for these so they might be HTML or some custom user creation). Most files do not have this feature enabled so I am afraid it is usually just content painted in an arbitrary order which your brain then interprets as a table.

  4. Hi, thanks for the excellent guide; it helped me to understand this much better.

    I have a question though. When I grab the text from a page of some PDFs that I read using the PDFSharp library for Visual Studio, I get weird text.

    Instead of being clear text, it seems to be encoded or possibly encrypted, e.g.
    Td
    (/0\(11$2#\(11#$2’3#45’67″$8) Tj

    Reading Chapter 9 of PDF 32000-1:2008, I can’t work out whether this is a font encoding or not.
    How can I go about decoding the above text?


IDRsolutions Ltd 2020. All rights reserved.