We have been working with PDF files since 1999 and developed complex software to display PDF files. We have learnt a lot about the PDF file format in that time and share our knowledge in the articles below.
If you are interested in using our software to display your PDF documents (we can rasterize them, convert them to HTML5 or SVG, or provide a complete Java PDF Viewer), why not set up a call with us and see if we can help?
Here is an overview of the topics covered in this article:
- Quick Tutorials
- Frequently Asked Questions
- The PDF File itself
- Images in PDF
- Color handling in PDF
- Text in PDF
- Fonts in PDF
- PDF Forms, Annotations & Interactive Elements
- PDF Security
- CCITT Encoding in PDF
- Make your own PDF file manually
How to solve common PDF tasks in Java with JPedal
How to convert a PDF file to an image
How to rasterize PDF files
How to search a PDF file
How to print a PDF file
How to access PDF metadata
How to extract text from PDF files
How to extract structured text from PDF files
How to create or edit Annotations in a PDF file
How to extract images from a PDF file
How to extract clipped Images from a PDF file
How to extract PDF Bookmark data
How to find PDF page size
How to Compare PDF files
How to view PDF files
How to extract PDF file form data
What new PDF developers need to know
Learning about PDF
A quick guide to PDF for Java (and non-Java) developers
Frequently Asked Questions:
Questions developers often ask us
Why can’t I just open and edit a PDF File?
How do I find out the PDF version used?
What is a PDF renderer?
How big is a PDF Page in bytes?
What does an OCR PDF file contain?
What is PDF Pagesize? CropBox, MediaBox, ArtBox, BleedBox, TrimBox?
How to calculate PDF Page Size in Inches or Centimetres?
Why is my PDF Producer showing in Chinese?
How to Embed PDF files in HTML Web Pages
The PDF File itself:
This section covers the actual file format and how it works
How to view PDF objects
How to read a PDF file
Where do your PDF objects start in a PDF file?
Understanding the PDF file format – Text, shapes and images
What are PDF Object Streams?
Multiple Trailers in a PDF File
What are PDF Xref tables?
Understanding PDF Text Objects
How does a decodeArray work on Images?
What is a PDF Dictionary?
What is a Linearized PDF File?
What are Form XObjects?
How are stacks used in PDF files?
How to identify a PDF File
No Startxref found in last 1024 bytes?
How to Embed your own data in PDF files
Why writing a PDF parser is such a challenging task (Part 234)
Corrupt PDFs? Maybe this is your problem
Images in PDF:
This section explores image related topics in the PDF File format
How are images stored in a PDF file?
How are images displayed in a PDF file?
What are PDF Image Masks?
How to calculate PDF Image DPI?
How to extract Raw JPEG Images from a PDF File?
How do Filter and DecodeParms Objects change a PDF Image?
Color handling in PDF:
Color support inside PDF files is very powerful and complex.
How does Color work in PDF files?
How does image color depth work in PDF files?
What is an Indexed Colorspace in a PDF file?
Why is white a special color in PDF Files?
What are ICCBased Colorspaces?
What is a YCCK colorspace in a PDF file?
How to convert YCCK color to RGB color
Text in PDF:
How Text is stored, displayed and extracted from a PDF file
How is text stored in a PDF file?
Why is pdf text extraction problematic?
What text format and style information is in a PDF file?
How to find out if a PDF file contains ‘structured content’
What does the ActualText dictionary tag do?
How do PDF Text Coordinates work?
How are carriage returns, spaces and other gaps defined in a PDF file?
PDF Mystery – What is the correct value for a Text Field?
PDF Text extraction – Why can I not extract text from a PDF file?
How are text links defined in a PDF file?
How are Text spaces created in a PDF file?
Fonts in PDF:
PDF files can use three different font technologies for display
Introductory PDF font tutorial
Introduction to PDF Font Technologies
How are Embedded CMAP tables defined in a PDF File?
What are CID Fonts?
What are subsetted fonts in PDF files?
Where do PDF viewers get font data for non-embedded fonts?
Glyph Names – What is in a name?
Are your TrueType CMap Tables lying to you?
Embedded Truetype Fonts are always MAC encoded unless they are not
Hercule Poirot solves the mystery of the PDF file and the missing Euro
Problems caused by arial fonts in PDF files
How does TrueType Hinting work?
Why are CID Fonts far more complicated than non-CID Fonts?
PDF Forms, Annotations & Interactive Elements:
PDF files can contain interactive elements with Forms and Annotations
What are PDF Forms?
What are AcroForms?
What are XFA Forms?
How do PDF files add interactive elements?
How do Layers work in a PDF file?
Is it possible to extract flattened form data from a PDF file?
PDF Form Names explained
What is PDF Form Flattening?
How to display PDF forms in a browser
PDF Security:
PDF files have their own security systems and processes
How are PDF files protected?
Overview of Security Features offered by the PDF file format
How are PDF files password protected?
How to create your own test certificates and keys for signing PDF files
CCITT Encoding in PDF:
CCITT is used to store compressed data inside PDF files.
CCITT Encoding in PDF – Converting CCITT data into a TIFF Image
CCITT Encoding in PDF – Black and White Facts
CCITT Encoding in PDF – Rows and Height Gotcha
CCITT Encoding in PDF – Decoding CCITT Data
CCITT Encoding in PDF – G31D CCITT Data Overview
CCITT Encoding in PDF – Decoding G31D CCITT Data
Make your own PDF file manually with our ‘Hello World’ coding example
One of our developers bravely set out to write the ‘Hello World’ tutorial of PDF files, creating a PDF file from scratch manually, in a text editor. Follow the series:
Part 1: PDF Objects and Data Types
Part 2: Structure of a PDF file
Part 2.5: Create a non working PDF
Part 3: DIY Blank Page
Part 4: Hello World Pdf
Part 5: Path objects
Part 6: Graphics State
How to edit PDF files using Incremental Updates
Are you a Developer working with PDF files?
Our developer’s guide contains a large number of technical posts to help you understand the PDF file format.
Do you need to solve any of these problems?
- Display PDF documents in a Web app
- Use PDF Forms in a web browser
- Convert PDF Documents to an image
- Work with PDF Documents in Java
20 Replies to “Understanding the PDF File Format”
Hello, JPedal team 🙂
First of all, allow me to express my satisfaction at reading such a simple and clear background on PDF. This is really helpful for anyone who is just starting out in the PDF world. Thank you for this great job.
I found these posts while looking for a *real* way of doing some redacts. I thought I had everything when I found ‘pdfedit’ (with a combination of its ‘replaceText’, its ‘findText’, its ‘drawRect’ and its ‘flattener’ functionalities), but then the ugly truth came up: sometimes not all characters are available. I guess it’s embedded fonts fault, but I am not quite sure. Here is when I started to read your posts 😉
The fact is, I guess I will be able to implement a functional redact feature if I find a way to ensure my replacement char (the one I use to replace each char of the redacted phrase) is in fact available. I see two scenarios here:
1- the char is available (and so, everybody is happy :P), or
2- the char is not available, but I am able to insert it to the right (embedded?) font.
Could you help me to accomplish this, please? Some hints would be appreciated.
Thanks in advance.
Best regards and keep this spirit!
Thanks for your work here. This is a really nice collection of helpful hints and tips. I’m searching the web for an explanation of how to embed an XML file in a PDF/A-3 file in code. Can you help me?
We recommend iText for embedding data in PDF.
I have a question.
What is the data format used in PDF to draw a table? Is there some type of native table object in PDF we can use, or is it just a vector graphic that paints a table? And how is table extraction done in PDF content extraction libraries?
Could you please explain this.
If the PDF was created with additional tagged metadata then there may be tags (there is no specification for these so they might be HTML or some custom user creation). Most files do not have this feature enabled so I am afraid it is usually just content painted in an arbitrary order which your brain then interprets as a table.
Hello, JPedal team
Thanks for sharing your knowledge. The articles help me a lot.
Hi, thanks for the excellent guide. It helped me to understand this much better.
I have a question though. With some PDFs which I read using the PDFSharp Library for Visual Studio, I get weird text when I grab the text from a page.
Instead of being clear text, it seems to be encoded or possibly encrypted.
Reading Chapter 9 of PDF 32000-1:2008, I can’t work out whether this is a font encoding or not.
How can I go about decoding the above text?
You need to stop thinking of it as text. It is encoded binary content which may look like text if the PDF uses WIN encoding.
There are lots of articles to help you on our blog.
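To make the point about encoded content concrete, here is a minimal sketch of why the raw bytes of a text string are not readable on their own. It assumes a simple font with one-byte character codes and an invented code-to-Unicode table; in a real file that table would come from the font's encoding or its ToUnicode data, and composite (CID) fonts use multi-byte codes that this sketch does not handle. The class name `CharCodeDecoder` and its `decode` method are hypothetical, not part of any library.

```java
import java.util.HashMap;
import java.util.Map;

public class CharCodeDecoder {

    /** Maps each single-byte character code to Unicode using the supplied table. */
    public static String decode(byte[] codes, Map<Integer, String> toUnicode) {
        StringBuilder out = new StringBuilder();
        for (byte b : codes) {
            int code = b & 0xFF; // treat the byte as an unsigned character code
            // Codes with no mapping become the Unicode replacement character
            out.append(toUnicode.getOrDefault(code, "\uFFFD"));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Invented table for illustration: a subsetted font may number its
        // glyphs 1, 2, 3... so the raw bytes look nothing like ASCII text.
        Map<Integer, String> toUnicode = new HashMap<>();
        toUnicode.put(1, "H");
        toUnicode.put(2, "e");
        toUnicode.put(3, "l");
        toUnicode.put(4, "o");

        byte[] raw = {1, 2, 3, 3, 4};
        System.out.println(decode(raw, toUnicode)); // prints "Hello"
    }
}
```

If the font happens to use WIN encoding, the codes for basic Latin letters coincide with ASCII, which is why the raw content sometimes looks like text without any decoding.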
A PDF file was edited programmatically using iTextPdf. The PDF has a few radio buttons. I can see the resulting PDF file in the Chrome browser without issues. However, when I open the same file in Acrobat Reader, the radio buttons do not show up in the file.
What is missing here?
It could be anything. I would recommend using the excellent free iText tool RUPS to look at the values set.
Hello Mark, your blogs are amazing. I am looking for some technical guidance on developing software that can edit PDF files (text and images), like Foxit Reader. Have you and your colleagues done this? Can you give me some advice? Thank you and Best Regards
Many thanks for the compliment! I will pass on your feedback to the team. To answer your question, we have considered it in the past but have chosen to focus on our current product range for developers.
Hello Mark – at my last job as a software engineer, I worked with a third party PDF parser (PDFLib) to extract metadata (fonts, colors, page size, document info, etc.) – and arrived at a pretty good Java API to retrieve detailed metadata in JSON and protobuf format. Unfortunately the tool I was using could not extract page background color, at least according to the vendor when we contacted them about this. I was incredulous that such a simple attribute as page background color could not be retrieved. Anyway, now that I have retired (that was my last full time job), I plan to see if I can crack this issue in my own time – obviously I don’t have access to the source code I developed at my last job but all I really want to do for now is to discover how the PDF specification represents page background color, choose a parser that allows me to extract that, and write a simple demo utility that extracts it.
Any thoughts? Thanks in advance.
It is certainly an interesting problem. What do you mean by page background colour? There is no global setting for this value and the final colour of any pixel will depend on parsing and executing the PDF commands – the easiest way to get this is to rasterize the page so you get the end result, unless you mean colour behind text. So it is not a simple task.
As regards recommending Open Source PDF libraries, it really depends what language you want to use. You have PDF.js (JavaScript), PDFBox/iText (Java) or xpdf (C).
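As a rough illustration of the rasterize-then-inspect approach, the sketch below takes a page that has already been rendered to a `BufferedImage` (by whichever library you choose; the rendering step is not shown) and reports the most frequent pixel colour as a first guess at the "background". The class name `PageBackgroundGuess` and its method are hypothetical, and treating the most common colour as the background is an assumption that will not hold for every page.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.util.HashMap;
import java.util.Map;

public class PageBackgroundGuess {

    /** Returns the most common RGB value in the image (alpha is ignored). */
    public static int dominantRgb(BufferedImage page) {
        Map<Integer, Integer> counts = new HashMap<>();
        int best = 0;
        int bestCount = -1;
        for (int y = 0; y < page.getHeight(); y++) {
            for (int x = 0; x < page.getWidth(); x++) {
                int rgb = page.getRGB(x, y) & 0xFFFFFF; // drop the alpha byte
                int count = counts.merge(rgb, 1, Integer::sum);
                if (count > bestCount) {
                    bestCount = count;
                    best = rgb;
                }
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Stand-in for a rasterized page: white paper with a small black block
        BufferedImage page = new BufferedImage(100, 100, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = page.createGraphics();
        g.setColor(Color.WHITE);
        g.fillRect(0, 0, 100, 100);
        g.setColor(Color.BLACK);
        g.fillRect(10, 10, 20, 5);
        g.dispose();

        System.out.printf("Dominant colour: #%06X%n", dominantRgb(page)); // #FFFFFF
    }
}
```

For the colour behind a particular piece of text, the same counting could be restricted to the pixels of that text's bounding box rather than the whole page.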
Thanks for that Mark. I wasn’t aware that there is no single setting, although of course it’s easy to set from Acrobat. Regarding rasterizing the PDF, I will certainly look into that, but yes the problem the software was trying to address was to detect ‘hidden’ text: that is text with identical color as its background.
Likely I will be using Java or even Python (just for fun!)