Leon Atherton Leon is a developer at IDRsolutions and product manager for BuildVu. He oversees the BuildVu product strategy and roadmap in addition to spending lots of time writing code.

Understanding the PDF File Format: Overview


At IDR Solutions we have been developing a range of PDF software since 1999. We have a Java PDF Viewer and SDK, an Acrobat forms to HTML5 converter, a PDF to HTML5 converter and a Java ImageIO replacement. This has given us a lot of experience with the PDF file format, and we have tried to share this knowledge on our blog. If you really want to know what goes on inside PDF files, these articles will give you all the details!

This is part 1 of the index, which aims to give an overview of the format. In part 2, we talk a lot more about PDF bugs, gotchas and tips!

The PDF File Format:

This section contains in-depth information on how content is actually stored in a PDF file – what you see when you open a PDF in a text editor.

Viewing PDF Objects
PDF Object Streams
Multiple Trailers on PDF Files
PDF Xref Tables Explained
Text Streams
How are images stored?
PDF Dictionary
Named Locations
Linearized PDF Files
Form XObjects
More articles…
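
As a rough illustration of what is visible when a PDF is opened as raw bytes, the sketch below scans for three landmarks: the `%PDF-` header, `N G obj` object markers, and the `startxref` offset near the end of the file. The function name `summarize_pdf` and the sample bytes are made up for this example; the sample is an incomplete fragment (it has no xref table, and its `startxref` value is arbitrary), and a plain byte scan like this will miss objects stored inside compressed object streams and may be confused by binary stream data.

```python
import re


def summarize_pdf(data: bytes) -> dict:
    """Return a few structural facts visible in the raw bytes of a PDF.

    Hypothetical helper for illustration only; it is a naive scan, not a parser.
    """
    # The file starts with a header such as %PDF-1.4
    header = re.match(rb"%PDF-(\d+\.\d+)", data)
    # Indirect objects are introduced by "<number> <generation> obj"
    objects = re.findall(rb"(\d+)\s+(\d+)\s+obj\b", data)
    # "startxref" is followed by the byte offset of the cross-reference section
    startxref = re.findall(rb"startxref\s+(\d+)", data)
    return {
        "version": header.group(1).decode("ascii") if header else None,
        "object_numbers": [int(num) for num, _gen in objects],
        # With multiple trailers, the last startxref is the one read first
        "startxref": int(startxref[-1]) if startxref else None,
    }


# Illustrative fragment only: not a complete, valid PDF file.
sample = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
    b"2 0 obj\n<< /Type /Pages /Kids [] /Count 0 >>\nendobj\n"
    b"startxref\n116\n%%EOF\n"
)

if __name__ == "__main__":
    print(summarize_pdf(sample))
```

To try it on a real file, pass the result of `open(path, "rb").read()` instead of `sample`.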

Images in PDF:


Images – An Overview
3 Examples of unusual ways to use PDF Image Masks
3 Types of Image Mask
Advantages of JBIG2 compression in PDF explained
There are several versions of each image inside your PDF file
More articles…

If you require PDF to Image Conversion or Image Extraction from PDF, you may be interested in JPedal, our Java PDF Library.

Colors in PDF:


Color – An Overview
PDF Image Color Depth
Indexed Colorspaces
The Color White in PDF Files
ICCBased Colorspaces
YCCK Color Conversion in PDF Files
More articles…

Text in PDF:


PDF Text – An Overview
PDF Text Co-ordinates
Carriage returns, spaces and other gaps
More articles…

If you require PDF Text Extraction or PDF Text search, you may be interested in JPedal, our Java PDF Library.

Fonts in PDF:


PDF Fonts – An Overview
Introduction to PDF Font Technologies
Embedded CMAP Tables
What are CID Fonts?
Custom Font Encodings
Are there really 3 types of fonts in PDF files?
Standard Font Information
Glyph Names – What is in a name?
TrueType Font Hinting

More articles…

PDF Forms, Annotations & Interactive Elements:


Introduction to PDF Forms
Introduction to FDF Forms
Introduction to XFA Forms
Interactive Elements
Layers in PDFs
More articles…

PDF Security:


PDF Security (Passwords and Certificates)
Brief Overview of Security Features offered by the PDF file format
PDF Password Protection
Protecting PDF Content
More articles…



What new PDF developers need to know
Learning about PDF
Text, Shapes and Images
OCR (Optical Character Recognition) PDF files
Bookmarks and Links
What is PDF Pagesize? CropBox, MediaBox, ArtBox, BleedBox, TrimBox?
PDF Format and Style Information
A quick guide to PDF for Java (and non-Java) developers
Why writing a PDF parser is such a challenging task (Part 234)
Searching PDF Files
How do stacks work in PDF files
How do PDF files manage limitless position accuracy of shapes & images?
More articles…

Make your own PDF file – Hello World:

One of our developers bravely set out to write the ‘Hello World’ tutorial of PDF files, creating a PDF file from scratch manually, in a text editor. Follow the series here:

Part 1: PDF Objects and Data Types
Part 2: Structure of a PDF file
Part 2.5: Create a non working PDF
Part 3: DIY Blank Page
Part 4: Hello World Pdf
Part 5: Path objects
Part 6: Graphics State
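
For readers who would rather generate the file than type it by hand, here is a minimal sketch in Python of the same general idea. It is not the file built in the series: the function name `build_hello_pdf`, the page size, the font choice (unembedded Helvetica) and the text position are all assumptions made for this example. The part that is fiddly to do by hand, and that the code automates, is recording the byte offset of each object so the xref table and `startxref` value are correct.

```python
def build_hello_pdf(text: str = "Hello World") -> bytes:
    """Build a one-page PDF as bytes (illustrative sketch, not a full writer)."""
    # Escape the characters that are special inside a PDF literal string.
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    # Content stream: begin text, pick font F1 at 24pt, move to (72, 720), show text.
    # Assumes the text fits in Latin-1; other text would need a different font setup.
    content = f"BT /F1 24 Tf 72 720 Td ({escaped}) Tj ET".encode("latin-1")

    # Object bodies, numbered 1..5 in this order.
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>",
        b"<< /Length %d >>\nstream\n" % len(content) + content + b"\nendstream",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset where "N 0 obj" starts
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    # Cross-reference table: every entry must be exactly 20 bytes long.
    xref_offset = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # entry 0 is always the free-list head
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset

    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_offset
    return bytes(out)


if __name__ == "__main__":
    with open("hello.pdf", "wb") as f:
        f.write(build_hello_pdf())
```

Running the script should write a `hello.pdf` that can be opened both in a PDF viewer and in a text editor, which makes it easy to compare against the structure described in the series.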

If you enjoyed this index, we also have a second, longer index covering all the nitty-gritty details and gotchas we have found from over 13 years of working with PDF!

Is there something that we haven’t covered? Leave us a comment and we will see what we can do!

Find out more about our software for Developers

Convert PDF to HTML5 or SVG
Convert AcroForms and XFA to HTML5
Java PDF SDK for working with PDF files

13 Replies to “Understanding the PDF File Format: Overview”

  1. Hello, JPedal team 🙂
    First of all, allow me to express my satisfaction at reading such a simple and clear background on PDF. This is really helpful for anyone just starting out in the PDF world. Thank you for this great job.
    I found these posts while looking for a *real* way of doing some redactions. I thought I had everything when I found ‘pdfedit’ (with a combination of its ‘replaceText’, its ‘findText’, its ‘drawRect’ and its ‘flattener’ functionalities), but then the ugly truth came up: sometimes not all characters are available. I guess it’s the fault of the embedded fonts, but I am not quite sure. This is when I started to read your posts 😉
    The fact is, I guess I will be able to implement a functional redact feature if I find a way to ensure my replacement char (the one I use to replace each char of the redacted phrase) is in fact available. I see two scenarios here:
    1- the char is available (and so, everybody is happy :P), or
    2- the char is not available, but I am able to insert it into the right (embedded?) font.
    Could you help me to accomplish this, please? Some hints would be appreciated.
    Thanks in advance.
    Best regards and keep this spirit!


  2. Hello Leon,
    thanks for your work here. This is a really nice collection of helpful hints and tips. I’m searching the web for an explanation of how to embed an XML file in a PDF/A-3 file in code. Can you help me?
    Kind regards,
    Al Mudy

  3. I have a question.
    What data format is used in PDF to draw a table? Is there some type of native table object in PDF we can use, or is it just vector graphics that paint a table? And how is table extraction done in PDF content extraction libraries?
    Could you please explain this.

    1. If the PDF was created with additional tagged meta data then there may be tags (there is no specification for these so they might be HTML or some custom user creation). Most files do not have this feature enabled so I am afraid it is usually just content painted in an arbitrary order which your brain then interprets as a table.

  4. Hi, thanks for the excellent guide; it helped me to understand this much better.

    I have a question though. When I grab the text from a page of some PDFs which I read using the PDFSharp library for Visual Studio, I get weird text.

    Instead of it being in clear text it seems to be encoded or possibly encrypted
    (/0\(11$2#\(11#$2’3#45’67″$8) Tj

    Reading Chapter 9 of PDF 32000-1:2008, I can’t work out whether this is a font encoding or not.
    How can I go about decoding the above text ?


IDRsolutions Ltd 2020. All rights reserved.