We had a discussion last week about what tips would help new developers get to grips with PDF files when they start working with them. Here are some of the ideas which came out of that. It is very much a personal list of suggestions, so please feel free to add your own.
Do not think of a PDF file as a ‘file’
When you start to learn HTML, you can open a file, hack it in a text editor and see what happens. You can’t do this with a PDF file. It is essentially a binary data structure – lots of the information cannot be seen if you open the raw file and editing one byte could potentially break the whole file. There are lots of really good tools out there on multiple platforms for examining the contents of a PDF file so you should not need to try and open the file directly.
PDF is all about objects
What a PDF file essentially contains is a whole lot of PDF objects. Each has a unique ID made up of an object number and a generation number, and other objects refer to it in the format number generation R (so you might see 3 0 R, 144 0 R). Most of the time the generation is zero, but not always.
There are lots of types of objects: a Page object describes a particular page, a Font object contains all the information about a specific font, and a Form object contains information about a form. Objects can reference other objects, so Page object 5 0 R might reference Resources object 10 0 R, which contains a list of the Font objects used for the page, including Font objects 16 0 R, 17 0 R and 18 0 R.
The objects can also be thought of as a Tree. This is what allows any page to be opened quickly. The PDF root object points to the list of pages which point to the resources they use and their contents.
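To make the idea of objects referencing objects concrete, here is a minimal sketch in Python. It scans a small hand-written fragment in PDF syntax (the SAMPLE data and the object_graph helper are made up for illustration) and lists which objects each object points to. It assumes the objects are stored uncompressed; real files often pack objects into compressed streams, so this is not a substitute for a proper PDF parser.

```python
import re

# A tiny hand-written fragment in PDF syntax (hypothetical, not a complete file).
SAMPLE = b"""\
1 0 obj
<< /Type /Catalog /Pages 2 0 R >>
endobj
2 0 obj
<< /Type /Pages /Kids [5 0 R] /Count 1 >>
endobj
5 0 obj
<< /Type /Page /Parent 2 0 R /Resources 10 0 R /Contents 20 0 R >>
endobj
10 0 obj
<< /Font << /F1 16 0 R /F2 17 0 R /F3 18 0 R >> >>
endobj
"""

# "number generation obj ... endobj" defines an object.
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)endobj", re.DOTALL)
# "number generation R" is a reference to an object defined elsewhere.
REF_RE = re.compile(rb"(\d+)\s+(\d+)\s+R\b")


def object_graph(data):
    """Map each (number, generation) to the list of objects it references."""
    graph = {}
    for match in OBJ_RE.finditer(data):
        key = (int(match.group(1)), int(match.group(2)))
        body = match.group(3)
        graph[key] = [(int(n), int(g)) for n, g in REF_RE.findall(body)]
    return graph


graph = object_graph(SAMPLE)
for key, refs in graph.items():
    print(key, "->", refs)
```

Starting at the root (object 1 0 in this fragment) and following the references leads to the page list, then to the page, then to its resources and fonts, which is the tree described above.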
Two identical looking PDFs can be very different inside
The PDF specification is very broad and flexible so there are lots of different ways to achieve the same result. The specification does not enforce any approach so all the PDF creation tools do things in different ways. If you have a strange PDF, it is always worth seeing what the Producer or Creator settings are.
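As a rough illustration of checking the Producer and Creator values, the sketch below pulls them out of a made-up document information dictionary with a regular expression. The SAMPLE_INFO data and the info_field helper are hypothetical. It assumes the dictionary is uncompressed and the values are simple literal strings in parentheses with no nested parentheses; hex strings, UTF-16 text and encrypted files would all need a real PDF library.

```python
import re

# Hypothetical Info dictionary as it might appear uncompressed in a file.
SAMPLE_INFO = b"""\
7 0 obj
<< /Title (Quarterly report) /Creator (Some Word Processor) /Producer (Some PDF Library 1.2) >>
endobj
"""


def info_field(data, key):
    """Return the literal-string value of /key, or None if it is not found."""
    pattern = (
        rb"/" + re.escape(key.encode("ascii"))
        # A literal string: escaped characters or anything except \ ( )
        + rb"\s*\(((?:\\.|[^\\()])*)\)"
    )
    match = re.search(pattern, data)
    return match.group(1).decode("latin-1") if match else None


print("Creator: ", info_field(SAMPLE_INFO, "Creator"))
print("Producer:", info_field(SAMPLE_INFO, "Producer"))
```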
Images are ‘ripped’ up inside a PDF
When a PDF is created, images are broken up into their pixel and colour data so that they can be compressed as efficiently as possible. JPEG data may well be stored in a JPEG compression format (DCTDecode for JPEG, JPXDecode for JPEG 2000) but it may still need to have colour information applied.
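The sketch below shows why the pixel data on its own is not enough. It uses a made-up 2 x 2 image compressed with zlib, which is the compression behind the FlateDecode filter (a filter not mentioned above, chosen here only because Python can decode it with the standard library). The image_dict entries, the COMPONENTS table and the decode_pixels helper are illustrative assumptions; real images may also use predictors, other bit depths or indexed colour spaces, none of which are handled here.

```python
import zlib

# Hypothetical image: dictionary entries plus a FlateDecode-compressed stream.
image_dict = {
    "Width": 2,
    "Height": 2,
    "ColorSpace": "DeviceRGB",
    "BitsPerComponent": 8,
    "Filter": "FlateDecode",
}
raw_pixels = bytes([255, 0, 0,    0, 255, 0,       # red, green
                    0, 0, 255,    255, 255, 255])  # blue, white
stream = zlib.compress(raw_pixels)

# Number of colour components per pixel for a few device colour spaces.
COMPONENTS = {"DeviceGray": 1, "DeviceRGB": 3, "DeviceCMYK": 4}


def decode_pixels(image_dict, stream):
    """Return rows of pixel tuples for an 8-bit FlateDecode image."""
    if image_dict["Filter"] != "FlateDecode" or image_dict["BitsPerComponent"] != 8:
        raise ValueError("this sketch only handles 8-bit FlateDecode images")
    data = zlib.decompress(stream)
    n = COMPONENTS[image_dict["ColorSpace"]]
    width, height = image_dict["Width"], image_dict["Height"]
    row_len = width * n
    return [
        [tuple(data[y * row_len + x * n: y * row_len + (x + 1) * n])
         for x in range(width)]
        for y in range(height)
    ]


rows = decode_pixels(image_dict, stream)
for row in rows:
    print(row)
```

The same twelve bytes would be read as twelve grey pixels or three CMYK pixels if the colour space said so, which is why the colour information has to travel with the data.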
Essential reference material – The PDF Reference Guide
Adobe publishes a detailed specification, the PDF Reference, which is free to download. It is very big and there is an awful lot to it. Ideally, a beginner should start with the outline of the file format and just the areas they need to understand.
The PDF Reference goes into considerable detail. But it may not be written from the precise viewpoint you need, and Adobe allows considerable latitude in the interpretation of what is acceptable. While there are lots of examples, it is possible for tools to do things in other ways.
What makes a PDF
A PDF file should ideally have a .pdf file extension, an xref pointer in the last 1024 bytes of its data, and the version as the first line of the file. But there is quite a lot of variation in what is actually allowed in a PDF and how useful a PDF is. A PDF file can contain fonts and editable text or just be a wrapper around an image.
At the end of the day, if it opens in Acrobat it is accepted as a PDF and you need to handle it…
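A minimal sketch of those two checks in Python follows. It looks for a %PDF- version header at the very start of the data and a startxref entry followed by %%EOF in the last 1024 bytes. The looks_like_pdf helper and the SAMPLE bytes are made up for illustration, and the check is deliberately stricter than real viewers, which will often open files that fail it.

```python
import re


def looks_like_pdf(data):
    """Return (version, xref_offset) if the header and trailer look right, else None."""
    # The first line should carry the version, e.g. %PDF-1.4
    header = re.match(rb"%PDF-(\d\.\d)", data)
    # The pointer to the cross-reference data should be near the end of the file.
    tail = data[-1024:]
    offsets = re.findall(rb"startxref\s+(\d+)\s+%%EOF", tail)
    if not header or not offsets:
        return None
    # A file that has been updated may contain several; the last one wins.
    return header.group(1).decode("ascii"), int(offsets[-1])


# Hypothetical skeleton: only the header and trailer are realistic here.
SAMPLE = b"%PDF-1.4\n...objects would go here...\nstartxref\n116\n%%EOF\n"

print(looks_like_pdf(SAMPLE))
print(looks_like_pdf(b"just some text"))
```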
PDF is a collection of other technologies
Use the tools
There are lots of tools (both free and commercial) on all platforms and in different languages (C, Java, Perl, PHP, etc). They make it much easier to work with PDF files, and experimenting with them (especially if you can access the source code) is a good way to understand how PDF works.
There are people to ask
I remember meeting Tom Phelps, the developer of Multivalent, at a conference in 2002. We were so pleased to find someone else we could actually have a conversation with, we spent the whole night discussing PDF issues at the pub afterwards. Everyone else in the bar complained it was the most boring night of their lives, but we both had a good time…
Thanks to the Internet, you can discuss PDF issues without totally destroying your street credibility! Many of the people or companies producing PDF tools run mailing lists or discussion forums (my first job every morning is to check the JPedal Support forums) and there are more general forums. I personally find Stack Overflow a good place to ask questions.
Becoming an expert in PDF is not an overnight process
I started working with PDF files over 10 years ago and I still learn new things every day. PDF is a big, complex file format which includes a lot of technologies, so it takes time to become proficient with it.
So that is my advice. What would you say to a new PDF developer? Or do you have any tips or advice?
This post is part of our “Understanding the PDF File Format” series. In each article, we aim to take a specific PDF feature and explain it in simple terms. If you wish to learn more about PDF, we have 13 years' worth of PDF knowledge and tips in our series index.