Understanding the PDF File Format: Bugs, Gotchas and Tips

While categorizing our posts about understanding the PDF file format, it became clear that we have written far too much to fit into a single index, so we have created this second index for the hardcore PDF fans who really want to get inside the PDF file format.

We recommend starting with part 1 to learn the main concepts of the PDF format, then heading back here if you want to get into the nitty-gritty details and gotchas we have found in over 13 years of working with PDF!

The PDF File Format:

This section contains in-depth information about how content is actually stored in a PDF file – what you see when you open a PDF in a text editor.

2 Problems with Corrupt PDF Data Streams
How can a PDF file be broken?
Identifying a PDF File from its first line
No Startxref found in last 1024 bytes?
Embedding your own data in PDF Files
Intriguing PDF xref Issue
Strange PDF File of the Week

Interesting PDF Bugs:


Interesting PDF Bugs – An Extreme Case of Recursion
Interesting PDF Bugs – Using SMask and Image ‘the opposite way’ round
Interesting PDF Bugs – Zero Bytes in a String
Interesting PDF Bugs – X Marks the spot (or not)
Interesting PDF Bugs – ICC Colorspace Alt Setting
Interesting PDF Bugs – Simulating an SMask with Vector Graphics
Interesting PDF Bugs – Mixed up Font Object
Interesting PDF Bugs – PDF Text is really a tiny image with a big SMask
Interesting PDF Bugs – Tiny Dash Values and the Java JVM
Interesting PDF Bugs – Values out of Range
Interesting PDF Bugs – Missing Image Data
Interesting PDF Bugs – Missing Image Data 2
Interesting PDF Bugs – Dealing with 3 Types of Fonts
Interesting PDF Bugs – Pointless Font Inclusion
Interesting PDF Bugs – Odd text rendering issue in Acrobat on Mac
Interesting PDF Bugs – Phantom PDF Objects

Images in PDF:


Do you need an image that big in your PDF file?
Small Images can cause big problems in PDF Files
A suggestion to the Prawn development team on making smaller PDF files
Making sure image names are unique in PDF files
Large images in a PDF File
Extract Raw JPEG Images from a PDF File
Filter and DecodeParms Objects for a PDF Image

If you require PDF to Image Conversion or Image Extraction from PDF, you may be interested in our Java PDF Library.

Colors in PDF:


CMYK does not always mean CMYK
Fine Tuning PDF Image Color with ICC Profiles
Convert PDF to Grayscale or Black and White

CCITT Encoding in PDF:


CCITT Encoding in PDF – Converting CCITT data into a TIFF Image
CCITT Encoding in PDF – Black and White Facts
CCITT Encoding in PDF – Rows and Height Gotcha
CCITT Encoding in PDF – Decoding CCITT Data
CCITT Encoding in PDF – G31D CCITT Data Overview
CCITT Encoding in PDF – Decoding G31D CCITT Data

Text in PDF:


PDF Mystery – What is the correct value for a Text Field
PDF Text Extraction with Java
The easy way to discover if a PDF File contains structured content
Why can I not extract text from this GhostScript generated PDF file?
Why can’t I extract text from this PDF file?
Extracting Text References from a PDF File
Extracting Structured Text from PDF Files
Space is a special character
Text Spaces in PDF Files
Space: The Final Frontier… in PDF

If you require PDF Text Extraction, PDF to Text Conversion or PDF Text Search, you may be interested in our Java PDF Library.

Fonts in PDF:


Why the TrueType Hinting Patent Expiration Matters
Be careful with your PDF Fonts
Are your TrueType CMap Tables lying to you?
Mystery of the PDF file and the missing euro character
Problems caused by arial fonts in PDF files
Differences in the PDF Differences Tables
TrueType Hinting – Big Screens for Small Details
Why are CID Fonts far more complicated than non-CID Fonts?
Embedded PDF Truetype Fonts are always MAC encoded unless they are not
PDF with odd Type3 Fonts in Ghostscript 8.50

PDF Forms, Annotations & Interactive Elements:


Extracting Flattened Form Data from a PDF File
The Mystery Behind PDF Form Names
What is PDF Form Flattening?
What are PDF readonly text fields?
Not all forms are PDF forms

PDF Security:


Why do I need the PDF password to open the PDF file?
Creating your own test certificates and keys for signing PDF files
Why even Acrobat Reader can’t support 100% PDF Specification
Choosing sensible optimisations for PDF files
Corrupt PDFs? Maybe this is your problem
How to compare 2 PDF Files
Working out PDF Page Size in Inches or Centimetres
There is more than one PDF File Specification
Don’t Blame the PDF File Format
Be careful how you remove critical data from a PDF File
Find out what’s really in your PDF files
3 Reasons why PDF Commands matter.
The definitive PDF book from the top PDF expert

Is there something that we haven’t covered? Leave us a comment and we will see what we can do!

Leon Atherton Leon is a developer at IDRsolutions and product manager for BuildVu. He oversees the BuildVu product strategy and roadmap in addition to spending lots of time writing code.

5 Replies to “Understanding the PDF File Format: Bugs, Gotchas and Tips”

  1. The ToUnicode mappings in my PDF are not correct. Is there any way I can programmatically edit the mappings and reproduce the table?

  2. I would like to know if we can extract text that was changed in a PDF. For example, if a PDF had the word “hello”
    and, let’s say, I used a PDF editor to change it to “world” and saved it, can I get the original text “hello” from the new PDF’s raw data?

  3. So if I replace the text “hello” with “world” and save the file, will it be possible to get the original word “hello” from the raw PDF? I would really appreciate it if you could help me with that, if it is possible.

