We have been working with PDF files since 1999 and built our own software to parse, render, and convert them. In that time we have developed a detailed understanding of how the format works at a technical level: not just how to use it, but why it is designed the way it is.
PDF (Portable Document Format) was created by Adobe in 1993 to solve a specific problem: a document should look identical regardless of the device, operating system, or application used to open it. It achieves this by embedding everything needed to render the file, including fonts, images, colour profiles, and layout instructions, directly inside the file itself. The format was standardised as ISO 32000 in 2008, and PDF 2.0 (ISO 32000-2) followed in 2017 with improvements to encryption, digital signatures, and accessibility.
What makes PDF technically distinctive is its page-description model. Rather than storing text as a linear stream as HTML does, PDF describes exactly where every character, shape, and image should appear on a fixed coordinate plane. This is why PDF renders consistently across devices, but also why editing, text extraction, reflow, and accessibility are hard engineering problems with no clean solutions.
All the technical knowledge we have accumulated is documented in the articles below, organised by topic. There is also a Glossary of PDF Terms covering the key vocabulary used throughout the format.
PDF File Structure & Syntax
Every PDF is built from a small set of primitive object types: booleans, numbers, strings, names, arrays, dictionaries, streams, and the null object, assembled into a structure that describes pages, resources, and metadata. The file begins with a header identifying the PDF version, followed by a body of objects, a cross-reference table that allows random access to those objects, and a trailer pointing to the document root. Understanding that structure is the foundation for everything else: rendering, extraction, editing, and repair.
- How to view PDF objects
- How to read a PDF file
- Where do your PDF objects start in a PDF file?
- Understanding the PDF file format – Text, shapes and images
- What are PDF Object Streams?
- Multiple Trailers in a PDF File
- What are PDF Xref tables?
- Understanding PDF Text Objects
- How does a decodeArray work on Images?
- What is a PDF Dictionary?
- What is a Linearized PDF File?
- What are Form XObjects?
- How are stacks used in PDF files?
- How to identify a PDF File
- No Startxref found in last 1024 bytes?
- How to Embed your own data in PDF files
- Why writing a PDF parser is such a challenging task (Part 234)
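As a rough illustration of the header and trailer described above, the following Python sketch reads the version from the header and locates the cross-reference table via the `startxref` keyword. The function name and the tiny sample file are our own illustrative assumptions; a real parser also has to cope with junk before the header, cross-reference streams, and incremental updates.

```python
import re


def read_pdf_header_and_startxref(data: bytes):
    """Return (version, xref_offset) for raw PDF bytes, or raise ValueError."""
    # The header normally sits at the very start of the file, e.g. b"%PDF-1.7".
    header = re.match(rb"%PDF-(\d\.\d)", data)
    if header is None:
        raise ValueError("no %PDF- header at start of data")
    version = header.group(1).decode("ascii")

    # Readers conventionally search the last 1024 bytes for 'startxref'.
    tail = data[-1024:]
    pos = tail.rfind(b"startxref")
    if pos == -1:
        raise ValueError("no startxref found in last 1024 bytes")
    match = re.match(rb"startxref\s+(\d+)", tail[pos:])
    if match is None:
        raise ValueError("startxref keyword has no offset")
    return version, int(match.group(1))


# Hypothetical one-object file, just enough to exercise the function.
body = b"%PDF-1.4\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"
sample = (
    body
    + b"xref\n0 2\n0000000000 65535 f \n0000000009 00000 n \n"
    + b"trailer\n<< /Size 2 /Root 1 0 R >>\n"
    + b"startxref\n" + str(len(body)).encode("ascii") + b"\n%%EOF\n"
)
version, xref_offset = read_pdf_header_and_startxref(sample)
```

The offset after `startxref` is a byte position from the start of the file, which is why editing a PDF by hand without updating it usually breaks the file.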
Images in PDF
PDF supports raster images (JPEG, JPEG 2000, JBIG2, raw bitmap) and treats vector graphics as drawn paths rather than embedded files. Images are stored as stream objects with associated filter chains that handle compression and colour transformation. A single image object can pass through multiple filters before producing pixel data, which is why decoding PDF images correctly requires a more careful implementation than most developers expect.
- How are images stored in a PDF file?
- What are Blend Modes in PDF files?
- What are PDF Image Masks?
- How to calculate PDF Image DPI?
- How to extract Raw JPEG Images from a PDF File?
- How do Filter and DecodeParms Objects change a PDF Image?
- How to convert an image into a PDF file
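To make the idea of a filter chain concrete, here is a minimal Python sketch that applies two of the standard filters, ASCIIHexDecode and FlateDecode, in the order they would be listed in a stream's /Filter array. The helper names and the sample data are illustrative assumptions, and the sketch ignores /DecodeParms (predictors and similar), which a real decoder must honour.

```python
import binascii
import zlib


def ascii_hex_decode(data: bytes) -> bytes:
    # Hex digits with whitespace ignored; '>' marks the end of the data.
    hex_digits = b"".join(data.split(b">")[0].split())
    if len(hex_digits) % 2:
        hex_digits += b"0"  # an odd final digit is treated as followed by 0
    return binascii.unhexlify(hex_digits)


def flate_decode(data: bytes) -> bytes:
    # FlateDecode data is a zlib stream.
    return zlib.decompress(data)


FILTERS = {"ASCIIHexDecode": ascii_hex_decode, "FlateDecode": flate_decode}


def decode_stream(raw: bytes, filter_names):
    """Apply each named filter in order, feeding the output to the next."""
    for name in filter_names:
        raw = FILTERS[name](raw)
    return raw


# Hypothetical 4x4 greyscale image, compressed and then hex-encoded.
pixels = bytes([0, 64, 128, 255] * 4)
encoded = binascii.hexlify(zlib.compress(pixels)) + b">"
decoded = decode_stream(encoded, ["ASCIIHexDecode", "FlateDecode"])
```

Filters are undone in the order they are listed, so the outermost encoding (here the hex layer) is removed first.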
Color Handling in PDF
PDF colour support is one of the most sophisticated parts of the specification. The format supports DeviceRGB, DeviceCMYK, DeviceGray, Lab, ICC-based colour spaces, and indexed palettes, with colour transformations definable at the individual object level. This matters in print production workflows, where incorrect colour handling produces commercially unacceptable output, and in any application where colour fidelity across devices is a requirement.
- How does Color work in PDF files?
- How does image color depth work in PDF files?
- What is an Indexed Colorspace in a PDF file?
- Why is white a special color in PDF Files?
- What are ICCBased Colorspaces?
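As one small example of a colour transformation, an indexed palette can be expanded with a simple lookup. The Python sketch below assumes 8-bit indices and a DeviceRGB base colour space; the function name and palette are made up for illustration.

```python
def expand_indexed_rgb(indices: bytes, palette: bytes):
    """Map 8-bit palette indices to (r, g, b) tuples for a DeviceRGB base."""
    colours = []
    for i in indices:
        start = i * 3  # three bytes (R, G, B) per palette entry
        colours.append(tuple(palette[start:start + 3]))
    return colours


# Hypothetical three-entry palette.
palette = bytes([
    255, 255, 255,  # index 0: white
    255, 0, 0,      # index 1: red
    0, 0, 255,      # index 2: blue
])
pixels = expand_indexed_rgb(bytes([0, 1, 2, 1]), palette)
```

With other base colour spaces the number of bytes per palette entry changes (four for DeviceCMYK, for instance), so the stride of 3 is specific to this sketch.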
Text in PDF
Text in PDF is not stored as readable strings. It is encoded as sequences of character codes that are mapped to glyphs through font encoding tables, which is why copying text from a PDF often produces garbled output, and why text extraction is an active engineering problem rather than a solved one. The articles in this section explain how that encoding works, where it breaks down, and what can be done about it.
- How is text stored in a PDF file?
- Why is pdf text extraction problematic?
- What is Unicode?
- What text format and style information is in a PDF file?
- How to find out if a PDF file contains ‘structured content’
- What does the ActualText dictionary tag do?
- How do PDF Text Coordinates work?
- How are carriage returns, spaces and other gaps defined in a PDF file?
- PDF Mystery – What is the correct value for a Text Field?
- PDF Text extraction – Why can I not extract text from a PDF file?
- How are text links defined in a PDF file?
- How are Text spaces created in a PDF file?
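The gap between stored codes and readable text can be shown with a toy example. This Python sketch assumes a simple single-byte font whose codes were assigned in order of first use, plus a code-to-Unicode mapping of the kind a ToUnicode table provides; the mapping and names here are hypothetical.

```python
def decode_show_string(codes: bytes, to_unicode: dict) -> str:
    """Translate single-byte character codes through a code-to-Unicode map."""
    # Codes with no mapping become U+FFFD, the Unicode replacement character.
    return "".join(to_unicode.get(code, "\ufffd") for code in codes)


# Hypothetical subset font: codes 1 to 4 stand for the glyphs H, e, l, o.
to_unicode = {1: "H", 2: "e", 3: "l", 4: "o"}
codes = bytes([1, 2, 3, 3, 4])

naive = codes.decode("latin-1")                  # control characters, not text
mapped = decode_show_string(codes, to_unicode)   # readable result
```

When a file carries no usable mapping, the `naive` result is roughly what copy and paste gives you. CID fonts use multi-byte codes, which this sketch does not handle.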
Fonts in PDF
PDF supports Type 1, TrueType, OpenType, and CID-keyed fonts, and can embed them fully or as subsets to reduce file size. When fonts are not embedded, viewers substitute alternatives using metrics stored in the file, which is a common source of rendering differences across platforms. The CID font system, designed for East Asian character sets with tens of thousands of glyphs, adds significant complexity that most developers encounter only when something breaks.
- Introductory PDF font tutorial
- Introduction to PDF Font Technologies
- How are Embedded CMAP tables defined in a PDF File?
- What are CID Fonts?
- What are subsetted fonts in PDF files?
- Where do PDF viewers get font data for non-embedded fonts?
- Problems caused by arial fonts in PDF files
- How does TrueType Hinting work?
- Why are CID Fonts far more complicated than non-CID Fonts?
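Subset fonts are conventionally named with a tag of six uppercase letters and a plus sign in front of the original font name, for example `ABCDEF+Arial`. The Python sketch below splits such a name; the function name is our own, and the presence of a tag only suggests subsetting, since whether font data is actually embedded is determined by the font descriptor's font file entries.

```python
import re

# Six uppercase letters, a plus sign, then the original font name.
SUBSET_TAG = re.compile(r"^([A-Z]{6})\+(.+)$")


def split_subset_name(base_font: str):
    """Return (tag, font_name); tag is None when there is no subset prefix."""
    match = SUBSET_TAG.match(base_font)
    if match:
        return match.group(1), match.group(2)
    return None, base_font


subset = split_subset_name("ABCDEF+Arial")
plain = split_subset_name("Helvetica")
```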
PDF Forms, Annotations & Interactive Elements
PDF forms exist in two incompatible formats: AcroForm, the original standard defined in the core PDF specification, and XFA, an XML-based format introduced by Adobe and effectively deprecated after Adobe Reader 2021. AcroForm is the format that matters today. Annotations cover a broader category that includes hyperlinks, comments, stamps, and digital signatures, each defined by its own dictionary structure within the file.
- What are PDF Forms?
- What are AcroForms?
- What are XFA Forms?
- What is the future of PDF forms?
- How to convert PDF forms to HTML in Java or PHP
- How to convert fillable PDF forms to HTML forms in the Command Line
- How to convert fillable PDF forms to HTML forms in Java
- How to extract Form data from PDF files in Java
- How do PDF files add interactive elements?
- How do Layers work in a PDF file?
- Is it possible to extract flattened form data from a PDF file?
- What is PDF Form Flattening?
- How to display PDF forms in a browser
PDF Standards and Accessibility
The base PDF specification is supplemented by a set of ISO sub-standards for specific use cases. PDF/A restricts the format for long-term archival by prohibiting features like encryption and embedded JavaScript. PDF/X defines requirements for print-ready exchange. PDF/UA mandates the tagging and logical structure needed to support assistive technologies. PDF 2.0 (ISO 32000-2) consolidated many of these requirements into the core specification and introduced improvements that affect how compliant files should be created and processed today.
PDF File Encryption
PDF supports encryption from 40-bit RC4 through to 256-bit AES, with separate controls for opening a document and for restricting operations such as printing, copying, or modifying content. The permission model is enforced by the viewer rather than the operating system, which has specific security implications that are worth understanding before relying on it.
- How are PDF files protected?
- Overview of Security Features offered by the PDF file format
- How are PDF files password protected?
- How to create your own test certificates and keys for signing PDF files
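The operation restrictions are stored as bit flags in the /P entry of the encryption dictionary, which is a signed 32-bit integer. The Python sketch below decodes four of the flags using the bit positions from the standard security handler (bit 1 is the lowest-order bit); the function and dictionary names are illustrative, and nothing here enforces anything, which is exactly the point made above about viewer-side enforcement.

```python
# Bit positions are 1-based, as in the specification's permission table.
PERMISSION_BITS = {
    "print": 3,
    "modify": 4,
    "copy": 5,
    "annotate": 6,
}


def decode_permissions(p: int) -> dict:
    """Decode a /P value into a {permission: allowed} dictionary."""
    p &= 0xFFFFFFFF  # view the signed value as 32 unsigned bits
    return {
        name: bool(p & (1 << (bit - 1)))
        for name, bit in PERMISSION_BITS.items()
    }


everything = decode_permissions(-4)      # all four operations allowed
nothing = decode_permissions(-3904)      # all four operations denied
print_and_copy = decode_permissions(-44)
```

A tool that ignores these flags can still perform the operations once the document is decrypted, so the flags are best treated as a statement of intent rather than a security boundary.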
PDF Compression
PDF applies compression at the stream level, meaning individual objects within the same file can each use a different algorithm. Flate (zlib/DEFLATE) is the most common for general content. CCITT and JBIG2 are optimised for black-and-white images. JPEG and JPEG 2000 handle photographic content. Choosing the wrong algorithm for a given content type has a measurable effect on both file size and decode performance.
- What is CCITT compression?
- How to Convert CCITT data to TIFF image
- What is the best option to compress a PDF?
- How does CCITT compress image data?
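The effect of matching the algorithm to the content can be seen with a quick experiment. This Python sketch compares Flate on a repetitive page content stream with Flate on random bytes, which stand in for data that is already compressed (such as JPEG image data); the sample stream and the helper name are assumptions made for the example.

```python
import random
import zlib


def flate_ratio(data: bytes) -> float:
    """Compressed size divided by original size (lower is better)."""
    return len(zlib.compress(data)) / len(data)


# Repetitive text-drawing operators, typical of a page content stream.
content_stream = b"BT /F1 12 Tf 72 720 Td (Hello) Tj ET\n" * 200

# Seeded pseudo-random bytes as a stand-in for already-compressed data.
rng = random.Random(0)
noise = bytes(rng.getrandbits(8) for _ in range(len(content_stream)))

text_ratio = flate_ratio(content_stream)   # shrinks to a small fraction
noise_ratio = flate_ratio(noise)           # barely changes, may even grow
```

Running Flate over data that is already compressed costs decode time for no size benefit.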
Quick Tutorials
Code-first examples for converting PDF files in production workflows across multiple languages. These tutorials use BuildVu, our PDF to HTML5 and SVG conversion tool, available as both a cloud service and a server deployment.
Converting PDF to HTML
- How to convert a PDF file into HTML in Java
- How to convert PDF to HTML in Python
- How to convert PDF to HTML in C#
- How to convert PDF to HTML in PHP
- How to convert PDF to HTML in Other Languages
- Convert PDF to HTML/SVG on Your Phone
Converting PDF to SVG
- How to convert PDF to SVG
- How to convert PDF to SVG in C# (Tutorial)
- How to convert PDF to SVG in JavaScript (Tutorial)
- How to convert PDF to SVG in Python (Tutorial)
Guides
- Top 9 PDF file questions with answers for developers
- What is the PDF file format?
- What Java Developers need to know about PDF Files
- PDF Association cheat sheets reviewed
- What is inside a PDF file?
- Localization in Java Apps – Add Language Support
Frequently Asked Questions
Questions we are regularly asked by developers working with PDF files in code.
- Why can’t I just open and edit a PDF File?
- How do I find out the PDF version used?
- What is a PDF renderer?
- What is a tagged PDF?
- How big is a PDF Page in bytes?
- What does an OCR PDF file contain?
- What is PDF Pagesize? CropBox, MediaBox, ArtBox, BleedBox, TrimBox?
- How to calculate PDF Page Size in Inches or Centimetres?
- Why is my PDF Producer showing in Chinese?
- How to Embed PDF files in HTML Web Pages
- How to Compare PDF files
- How to handle corrupt PDF files
Build Your Own PDF File
The fastest way to understand how PDF works internally is to write one from scratch. This series walks through creating a valid PDF in a text editor, starting from raw object types and building up to a complete document with text and path drawing. It covers the object structure, cross-reference table, page tree, and graphics state, the internals that library abstractions hide but that matter when something goes wrong.
- Part 1: PDF Objects and Data Types
- Part 2: Structure of a PDF file
- Part 2.5: Create a non working PDF
- Part 3: DIY Blank Page
- Part 4: Hello World PDF
- Part 5: Path objects
- Part 6: Graphics State
- How to edit PDF files using Incremental Updates
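The same exercise can be scripted, which removes the chore of counting byte offsets by hand. The Python sketch below assembles a one-page "Hello World" document and computes the cross-reference table from the actual object positions. It is a minimal illustration under our own assumptions (US Letter page, the standard Helvetica font, uncompressed content) rather than the code from the series.

```python
def build_minimal_pdf() -> bytes:
    """Assemble a one-page PDF with a computed cross-reference table."""
    content = b"BT /F1 24 Tf 72 720 Td (Hello World) Tj ET"
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>",
        b"<< /Length %d >>\nstream\n" % len(content) + content + b"\nendstream",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset where this object starts
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    xref_offset = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # entry 0 is always the free-list head
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset  # each entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n" + str(xref_offset).encode("ascii") + b"\n%%EOF\n"
    return bytes(out)


pdf = build_minimal_pdf()
# To inspect the result, write it out: open("hello.pdf", "wb").write(pdf)
```

Changing even one character in an object shifts every later offset, which is why hand-edited files so often end up with a broken cross-reference table.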
Our PDF Software
We build Java-based PDF tools used by companies worldwide. JPedal is our Java library for developers who need programmatic access to PDF content, covering rendering, text extraction, annotation handling, and form processing. BuildVu converts PDF to clean HTML5 or SVG, suitable for browser display without a plugin. FormVu converts PDF forms to HTML forms, preserving field behaviour in native markup.
Our software libraries allow you to:
- Convert PDF files to HTML
- Use PDF Forms in a web browser
- Convert PDF Documents to an image
- Work with PDF Documents in Java
- Read and write HEIC and other Image formats in Java
Hello, JPedal team 🙂
First of all, allow me to express my satisfaction at reading such a simple and clear background on PDF. This is really helpful for anyone just starting out in the PDF world. Thank you for this great job.
I found these posts while looking for a *real* way of doing redaction. I thought I had everything when I found ‘pdfedit’ (with a combination of its ‘replaceText’, its ‘findText’, its ‘drawRect’ and its ‘flattener’ functionalities), but then the ugly truth came up: sometimes not all characters are available. I guess it’s the embedded fonts’ fault, but I am not quite sure. This is when I started to read your posts 😉
The fact is, I guess I will be able to implement a functional redact feature if I find a way to ensure my replacement char (the one I use to replace each char of the redacted phrase) is in fact available. I see two scenarios here:
1- the char is available (and so, everybody is happy :P), or
2- the char is not available, but I am able to insert it to the right (embedded?) font.
Could you help me to accomplish this, please? Some hints would be appreciated.
Thanks in advance.
Best regards and keep this spirit!
—
Alejandro
Hello Leon,
thanks for your work here. This is a really nice collection of helpful hints and tips. I’m searching the web for an explanation of how to embed an XML file in a PDF/A-3 file by code. Can you help me?
Kind regards,
Al Mudy
We recommend iText for embedding data in PDF.
I have a question.
What data format does PDF use to draw a table? Is there some type of native table object in PDF we can use, or is it just vector graphics that paint a table? And how is table extraction done in PDF content extraction libraries?
Could you please explain this.
If the PDF was created with additional tagged meta data then there may be tags (there is no specification for these so they might be HTML or some custom user creation). Most files do not have this feature enabled so I am afraid it is usually just content painted in an arbitrary order which your brain then interprets as a table.
Hello, JPedal team
Thanks for sharing your knowledge. The articles help me a lot.
Hi, thanks for the excellent guide. It helped me to understand this much better.
I have a question though. With some PDFs that I read using the PDFSharp library for Visual Studio, when I grab the text from a page I get weird text.
Instead of being in clear text, it seems to be encoded or possibly encrypted.
For example:
Td
(/0\(11$2#\(11#$2’3#45’67″$8) Tj
Reading Chapter 9 of PDF 32000-1:2008, I can’t work out whether this is a font encoding or not.
How can I go about decoding the above text ?
You need to stop thinking of it as text. It is encoded binary content which may look like text if the PDF uses WIN encoding.
There are lots of articles to help you on our blog, such as
https://blog.idrsolutions.com/2011/03/understanding-the-pdf-file-format-–-pdf-text-extraction-with-java/
https://blog.idrsolutions.com/2011/04/understanding-the-pdf-file-format-–-custom-font-encodings/
A PDF file was edited programmatically using iTextPdf. The PDF has a few radio buttons. I can see the resulting PDF file in the Chrome browser without issues. However, when I open the same file in Acrobat Reader, the radio buttons do not show up.
What is missing here?
It could be anything. I would recommend using the excellent free iText tool RUPS to look at the values set.
Hello Mark, your blogs are amazing. I am looking for some technical guidance on developing software that can edit PDF files (text and images), like Foxit Reader. Have you and your colleagues done this? Can you give me some advice? Thank you and Best Regards
Many thanks for the compliment! I will pass on your feedback to the team. To answer your question, we have considered it in the past but have chosen to focus on our current product range for developers.
Hello Mark – at my last job as a software engineer, I worked with a third party PDF parser (PDFLib) to extract metadata (fonts, colors, page size, document info, etc.) – and arrived at a pretty good Java API to retrieve detailed metadata in json and protobuf format. Unfortunately the tool I was using could not extract page background color, at least according to the vendor when we contacted them about this. I was incredulous that such a simple attribute as page background color could not be retrieved. Anyway, now that I have retired (that was my last full time job), I plan to see if I can crack this issue in my own time – obviously I don’t have access to the source code I developed at my last job but all I really want to do for now is to discover how the PDF specification represents page background color, choose a parser that allows me to extract that, and write a simple demo utility that extracts that.
Any thoughts? Thanks in advance.
Hi Chris,
It is certainly an interesting problem. What do you mean by page background colour? There is no global setting for this value and the final colour of any pixel will depend on parsing and executing the PDF commands – the easiest way to get this is to rasterize the page so you get the end result, unless you mean colour behind text. So it is not a simple task.
As regards recommending Open Source PDF libraries, it really depends what language you want to use. You have PDF.js (JavaScript), PDFBox/iText (Java) or xpdf (C).
Thanks for that Mark. I wasn’t aware that there is no single setting, although of course it’s easy to set from Acrobat. Regarding rasterizing the PDF, I will certainly look into that, but yes the problem the software was trying to address was to detect ‘hidden’ text: that is, text with the same color as its background.
Likely I will be using Java or even Python (just for fun!)