Understanding the PDF File Format

We have been working with PDF files since 1999 and have developed complex software to display them. We have learnt a lot about the PDF file format in that time and share our knowledge in the articles below.

PDF also uses a large number of technical terms, so we have created a Glossary of Terms with all the keywords.

We created our own PDF Java library, JPedal, to make it easier for Java developers to do more with the PDF file format. By downloading the trial, you can test it in your own workflows and see why it is trusted by companies worldwide.

If you are interested in using our software to display your PDF documents, we suggest BuildVu. It can rasterize your documents and convert PDF to clean HTML5 or SVG.

FormVu converts PDF forms to HTML forms, combining the control of clean markup with the functionality of PDF forms.

The PDF File itself:

This section covers the actual file format and how it works

Images in PDF:

This section explores image-related topics in the PDF file format

Color handling in PDF:

Color support inside PDF files is very powerful and complex.

Text in PDF:

How Text is stored, displayed and extracted from a PDF file

Fonts in PDF:

PDF files can use three different font technologies for display

PDF Forms, Annotations & Interactive Elements:

PDF files can contain interactive elements with Forms and Annotations

PDF File Encryption:

PDF files can have their content protected using encryption.

PDF compression:

PDF files use CCITT, DCT, Flate, LZW and other forms of Compression to reduce the size of a PDF file.
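
As a rough illustration of the Flate case, the sketch below round-trips a small content stream through the JDK's own `java.util.zip` classes, which implement the same zlib/deflate scheme that `/FlateDecode` streams rely on. The class name `FlateSketch` and the sample data are made up for this example, and it assumes the stream has no predictor set in `/DecodeParms`.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper class, not part of any PDF library.
public class FlateSketch {

    // Compress raw bytes as a writer might before storing them in a Flate stream.
    static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Decompress the bytes found between "stream" and "endstream".
    // Assumption: no predictor is in use, so the inflated bytes are the final data.
    static byte[] inflate(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("Truncated or unsupported Flate data");
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("Not valid Flate data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Repetitive page content compresses well, which is why Flate is so common.
        byte[] raw = "BT /F1 12 Tf 72 720 Td (Hello World) Tj ET\n".repeat(20)
                .getBytes(StandardCharsets.US_ASCII);
        byte[] packed = deflate(raw);
        byte[] unpacked = inflate(packed);
        System.out.println(raw.length + " bytes -> " + packed.length + " bytes");
        System.out.println("Round trip OK: " + java.util.Arrays.equals(raw, unpacked));
    }
}
```

The other filters (CCITT, DCT, LZW) need their own decoders and are not covered by this sketch.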

Quick Tutorials:

How to solve common PDF tasks in Java with our software

BuildVu

Convert PDF to HTML/SVG with a clean API.

Converting PDF to HTML

Converting PDF to SVG

JPedal

Our comprehensive PDF toolkit.

Java Viewer

How to view PDF files

Rasterize

Print PDF

How to print a PDF file

Process Documents

Manipulate Pages

Extract Content

Interaction

PDF Inspector

How to find PDF page size

Guides:

Frequently Asked Questions:

Questions developers often ask us

Make your own PDF file manually with our ‘Hello World’ coding example

One of our developers bravely set out to write the ‘Hello World’ tutorial of PDF files, creating a PDF file from scratch manually in a text editor, and wrote it up as a series of articles.
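
For readers who would rather generate the same kind of file from code, here is a minimal sketch of what such a hand-written file contains: a header, five objects, a cross-reference table with byte offsets, and a trailer. It is not the tutorial's own code. The class name `HelloPdfSketch`, the page size and the font choice are assumptions made for this example, and the message is assumed to be plain ASCII.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical example class: writes a one-page PDF by hand.
public class HelloPdfSketch {

    static byte[] build(String message) {
        // Backslashes and parentheses must be escaped inside a PDF literal string.
        String escaped = message.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)");
        String content = "BT /F1 24 Tf 72 720 Td (" + escaped + ") Tj ET";
        int length = content.getBytes(StandardCharsets.US_ASCII).length;

        String[] objects = {
            "<< /Type /Catalog /Pages 2 0 R >>",
            "<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
            "<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
                + "/Resources << /Font << /F1 4 0 R >> >> /Contents 5 0 R >>",
            "<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
            "<< /Length " + length + " >>\nstream\n" + content + "\nendstream"
        };

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        List<Integer> offsets = new ArrayList<>();
        write(out, "%PDF-1.4\n");
        for (int i = 0; i < objects.length; i++) {
            offsets.add(out.size()); // byte offset of "N 0 obj", needed by the xref table
            write(out, (i + 1) + " 0 obj\n" + objects[i] + "\nendobj\n");
        }

        int xrefStart = out.size();
        write(out, "xref\n0 " + (objects.length + 1) + "\n");
        write(out, "0000000000 65535 f \n"); // entry 0 is always the free-list head
        for (int offset : offsets) {
            // Each entry is exactly 20 bytes: 10-digit offset, 5-digit generation, flag, EOL.
            write(out, String.format(Locale.ROOT, "%010d 00000 n \n", offset));
        }
        write(out, "trailer\n<< /Size " + (objects.length + 1) + " /Root 1 0 R >>\n");
        write(out, "startxref\n" + xrefStart + "\n%%EOF\n");
        return out.toByteArray();
    }

    private static void write(ByteArrayOutputStream out, String text) {
        byte[] bytes = text.getBytes(StandardCharsets.US_ASCII);
        out.write(bytes, 0, bytes.length);
    }

    public static void main(String[] args) throws IOException {
        Files.write(Paths.get("hello.pdf"), build("Hello World"));
    }
}
```

Running `main` should leave a `hello.pdf` in the working directory that opens in most viewers. Computing the offsets in code avoids the most common mistake when typing such a file by hand, which is an xref table that no longer matches the object positions after an edit.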

Our software libraries allow you to

Convert PDF files to HTML

Use PDF Forms in a web browser

Convert PDF Documents to an image

Work with PDF Documents in Java

Read and write HEIC and other Image formats in Java

15 Replies to “Understanding the PDF File Format”

Alejandro says:
January 21, 2013 at 3:43 am
Hello, JPedal team 🙂
First of all, allow me to express my satisfaction at reading such a simple and clear background on PDF. This is really helpful for anyone just starting out in the PDF world. Thank you for this great job.
I found these posts while looking for a *real* way of doing some redactions. I thought I had everything when I found ‘pdfedit’ (with a combination of its ‘replaceText’, its ‘findText’, its ‘drawRect’ and its ‘flattener’ functionalities), but then the ugly truth came up: sometimes not all characters are available. I guess it is the embedded fonts’ fault, but I am not quite sure. This is when I started to read your posts 😉
The fact is, I guess I will be able to implement a functional redact feature if I find a way to ensure my replacement char (the one I use to replace each char of the redacted phrase) is in fact available. I see two scenarios here:
1- the char is available (and so, everybody is happy :P), or
2- the char is not available, but I am able to insert it to the right (embedded?) font.
Could you help me to accomplish this, please? Some hints could be appreciated.
Thanks in advance.
Best regards and keep this spirit!
—
Alejandro
Al Mudy says:
December 17, 2015 at 12:38 pm
Hello Leon,
thanks for your work here. This is a really nice collection of helpful hints and tips. I’m searching the web for an explanation of how to embed an XML file in a PDF/A-3 file in code. Can you help me?
Kind regards,
Al Mudy
1. Alex Marshall says:
  December 21, 2015 at 3:16 pm
  We recommend iText for embedding data in PDF.
harsha says:
May 4, 2016 at 2:36 am
I have a question.
What data format does PDF use to draw a table? Is there some type of native table object in PDF we can use, or is it just vector graphics that paint a table? And how is table extraction done in PDF content extraction libraries?
Could you please explain this.
1. Mark Stephens says:
  May 9, 2016 at 9:42 am
  If the PDF was created with additional tagged metadata then there may be tags (there is no specification for these so they might be HTML or some custom user creation). Most files do not have this feature enabled so I am afraid it is usually just content painted in an arbitrary order which your brain then interprets as a table.
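
To make the "painted in an arbitrary order" point concrete, here is a minimal sketch of one common heuristic for rebuilding rows: group text fragments whose baselines are close together, then sort each group from left to right. It is only an illustration and not how any particular library does it. The `Fragment` type, the coordinates and the tolerance value are all made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical example class: rebuilds table rows from positioned text.
public class TableRowSketch {

    // A piece of text with the position it was painted at (PDF units, y grows upwards).
    static final class Fragment {
        final double x;
        final double y;
        final String text;

        Fragment(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    static List<List<String>> groupIntoRows(List<Fragment> fragments, double tolerance) {
        // Top of the page first: larger y values come earlier.
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y).reversed());

        // Start a new row whenever the baseline moves by more than the tolerance.
        List<List<Fragment>> rows = new ArrayList<>();
        for (Fragment f : sorted) {
            List<Fragment> last = rows.isEmpty() ? null : rows.get(rows.size() - 1);
            if (last != null && Math.abs(last.get(0).y - f.y) <= tolerance) {
                last.add(f);
            } else {
                List<Fragment> row = new ArrayList<>();
                row.add(f);
                rows.add(row);
            }
        }

        // Within each row, read cells from left to right.
        List<List<String>> result = new ArrayList<>();
        for (List<Fragment> row : rows) {
            row.sort(Comparator.comparingDouble((Fragment f) -> f.x));
            List<String> cells = new ArrayList<>();
            for (Fragment f : row) {
                cells.add(f.text);
            }
            result.add(cells);
        }
        return result;
    }

    public static void main(String[] args) {
        // Fragments listed in paint order, which need not match reading order.
        List<Fragment> painted = Arrays.asList(
                new Fragment(200, 700.0, "Qty"),
                new Fragment(72, 680.2, "Pen"),
                new Fragment(72, 700.4, "Item"),
                new Fragment(200, 680.0, "2"));
        System.out.println(groupIntoRows(painted, 2.0)); // prints [[Item, Qty], [Pen, 2]]
    }
}
```

Real documents need more than this (column detection, wrapped cells, rotated text), which is why table extraction remains guesswork when no tags are present.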
Xuan-Loi Vu says:
September 25, 2016 at 2:53 pm
Hello, JPedal team
Thanks for sharing your knowledge. The articles helped me a lot.
Martin says:
November 30, 2017 at 7:14 pm
Hi, thanks for the excellent guide. It helped me to understand this much better.
I have a question though. With some PDFs that I read using the PDFsharp library for Visual Studio, I get weird text when I grab the text from a page.
Instead of being clear text, it seems to be encoded or possibly encrypted, e.g.
Td
(/0\(11$2#\(11#$2’3#45’67″$8) Tj
Reading Chapter 9 of PDF 32000-1:2008, I can’t gather whether this is a font encoding or not.
How can I go about decoding the above text?
Mark Stephens says:
December 1, 2017 at 10:42 am
You need to stop thinking of it as text. It is encoded binary content which may look like text if the PDF uses WIN encoding.
There are lots of articles to help you on our blog, like
https://blog.idrsolutions.com/2011/03/understanding-the-pdf-file-format-–-pdf-text-extraction-with-java/
https://blog.idrsolutions.com/2011/04/understanding-the-pdf-file-format-–-custom-font-encodings/
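
As a minimal sketch of the idea, the code below treats the string bytes as character codes and looks each one up in a table, which is roughly the role a font's encoding or ToUnicode data plays. The mapping used here is invented purely for illustration and is not taken from the file in the question. It also assumes a simple font with one byte per character.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical example class: decodes string bytes through a font-specific table.
public class CustomEncodingSketch {

    // Each byte is a code for a glyph in the current font, not a character.
    // Codes missing from the table become U+FFFD so gaps stay visible.
    static String decode(byte[] codes, Map<Integer, String> codeToUnicode) {
        StringBuilder text = new StringBuilder();
        for (byte b : codes) {
            int code = b & 0xFF; // bytes are signed in Java, codes are 0..255
            text.append(codeToUnicode.getOrDefault(code, "\uFFFD"));
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Invented table: in a real file this would be built from the font dictionary.
        Map<Integer, String> table = new HashMap<>();
        table.put((int) '/', "H");
        table.put((int) '0', "e");
        table.put((int) '(', "l");
        table.put((int) '1', "o");

        // The raw bytes look like the string "/0((1" but decode to something else.
        byte[] raw = {'/', '0', '(', '(', '1'};
        System.out.println(decode(raw, table)); // prints Hello
    }
}
```

The same bytes can decode differently under another font, so the table has to be rebuilt whenever the content stream switches font.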
Kumar says:
May 18, 2021 at 9:25 pm
A PDF file was edited programmatically using iTextPdf. The PDF has a few radio buttons. I can see the resulting PDF file in the Chrome browser without issues. However, when I open the same file in Acrobat Reader, the radio buttons do not show up.
What is missing here?
Mark Stephens says:
May 19, 2021 at 8:38 am
It could be anything. I would recommend using the excellent free iText tool RUPS to look at the values set.
CuongND says:
September 17, 2021 at 9:33 am
Hello Mark, your blogs are amazing. I am looking for some technical guidance on developing software that can edit PDF files (text and images), like Foxit Reader. Have you and your colleagues done this? Can you give me some advice? Thank you and Best Regards
1. Alicia says:
  September 21, 2021 at 10:08 am
  Many thanks for the compliment! I will pass on your feedback to the team. To answer your question, we have considered it in the past but have chosen to focus on our current product range for developers.
Chris Jewell says:
February 24, 2022 at 1:57 am
Hello Mark – at my last job as a software engineer, I worked with a third-party PDF parser (PDFLib) to extract metadata (fonts, colors, page size, document info, etc.) – and arrived at a pretty good Java API to retrieve detailed metadata in JSON and protobuf format. Unfortunately the tool I was using could not extract page background color, at least according to the vendor when we contacted them about this. I was incredulous that such a simple attribute as page background color could not be retrieved. Anyway, now that I have retired (that was my last full-time job), I plan to see if I can crack this issue in my own time – obviously I don’t have access to the source code I developed at my last job but all I really want to do for now is to discover how the PDF specification represents page background color, choose a parser that allows me to extract that, and write a simple demo utility that extracts that.
Any thoughts? Thanks in advance.
1. Mark Stephens says:
  February 24, 2022 at 9:29 am
  Hi Chris,
  It is certainly an interesting problem. What do you mean by page background colour? There is no global setting for this value and the final colour of any pixel will depend on parsing and executing the PDF commands – the easiest way to get this is to rasterize the page so you get the end result, unless you mean colour behind text. So it is not a simple task.
  As regards recommending Open Source PDF libraries, it really depends what language you want to use. You have PDF.js (JavaScript), PDFBox/iText (Java) or xpdf (C).
Chris J says:
February 25, 2022 at 3:20 am
Thanks for that Mark. I wasn’t aware that there is no single setting, although of course it’s easy to set from Acrobat. Regarding rasterizing the PDF, I will certainly look into that, but yes, the problem the software was trying to address was to detect ‘hidden’ text: that is, text with the same color as its background.
Likely I will be using Java or even Python (just for fun!).
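
One way the rasterizing suggestion could be applied to hidden text is sketched below: once a page has been rendered to an image, check whether the area where a piece of text was drawn shows any contrast at all. This is only a rough heuristic under several assumptions. The rendered image and the text bounding box would have to come from whichever library is used, and the class name, method names and tolerance are made up for this example.

```java
import java.awt.Rectangle;
import java.awt.image.BufferedImage;

// Hypothetical example class: flags text areas that show no visible contrast.
public class HiddenTextSketch {

    // True when every pixel inside the box has nearly the same colour,
    // meaning whatever was painted there cannot be told apart from its background.
    static boolean looksHidden(BufferedImage page, Rectangle textBox, int tolerance) {
        Rectangle area = textBox.intersection(
                new Rectangle(0, 0, page.getWidth(), page.getHeight()));
        if (area.isEmpty()) {
            return true; // the box lies off the page, so nothing can be seen
        }
        int[] min = {255, 255, 255};
        int[] max = {0, 0, 0};
        for (int y = area.y; y < area.y + area.height; y++) {
            for (int x = area.x; x < area.x + area.width; x++) {
                int rgb = page.getRGB(x, y);
                int[] channels = {(rgb >> 16) & 0xFF, (rgb >> 8) & 0xFF, rgb & 0xFF};
                for (int c = 0; c < 3; c++) {
                    min[c] = Math.min(min[c], channels[c]);
                    max[c] = Math.max(max[c], channels[c]);
                }
            }
        }
        for (int c = 0; c < 3; c++) {
            if (max[c] - min[c] > tolerance) {
                return false; // some contrast exists, so the text is visible
            }
        }
        return true;
    }

    // Test helper: paints a solid rectangle, standing in for rendered content.
    static void fill(BufferedImage image, Rectangle r, int rgb) {
        for (int y = r.y; y < r.y + r.height; y++) {
            for (int x = r.x; x < r.x + r.width; x++) {
                image.setRGB(x, y, rgb);
            }
        }
    }

    public static void main(String[] args) {
        BufferedImage page = new BufferedImage(200, 100, BufferedImage.TYPE_INT_RGB);
        fill(page, new Rectangle(0, 0, 200, 100), 0xFFFFFF); // white page
        fill(page, new Rectangle(20, 20, 30, 10), 0x000000); // stand-in for black glyphs

        System.out.println(looksHidden(page, new Rectangle(10, 15, 60, 20), 8));  // prints false
        System.out.println(looksHidden(page, new Rectangle(110, 15, 60, 20), 8)); // prints true
    }
}
```

The check cannot tell white-on-white text from text covered by a later image, and it would also flag text drawn in an invisible rendering mode, so results would still need a second look.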
