The PDF file format is very useful and well-documented, but it is also quite complicated and does not work the way most people imagine. It is structured very differently from a Word or Excel document.
Most of the time this is not an issue: you can use PDF files without knowing anything about their internals and simply enjoy the benefits. There comes a time, though, when you may need to start dabbling, so this article is designed to give you some starting points.
It is worth getting to grips first with the basic idea that a PDF file is essentially a set of linked objects. Each page has a page object, which may refer to font objects defining the fonts, XObjects storing image data and so on. Once that idea is clear, you can look at all the different types of objects. The PDF file contains all these objects along with a record of their locations (the cross-reference table), so that a reader can fetch each one as needed.
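To make the idea of linked objects concrete, here is a minimal sketch in Python. The SAMPLE data is a hand-written, heavily simplified fragment in PDF syntax (it is not a complete PDF file), and the list_objects helper is a hypothetical name used only for this illustration. It scans for "N G obj ... endobj" blocks and collects the "N G R" references inside each one. A real PDF reader does not scan with regular expressions like this; it uses the location information stored in the file, and real files often compress their objects, so treat this purely as a way to see the links.

```python
import re

# Hand-written, simplified fragment in PDF syntax (an assumption for this
# example; it is NOT a complete, valid PDF file).
SAMPLE = b"""\
1 0 obj
<< /Type /Catalog /Pages 2 0 R >>
endobj
2 0 obj
<< /Type /Pages /Kids [3 0 R] /Count 1 >>
endobj
3 0 obj
<< /Type /Page /Parent 2 0 R /Resources << /Font << /F1 4 0 R >> >> >>
endobj
4 0 obj
<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>
endobj
"""

# "N G obj ... endobj" marks one indirect object (N = object number,
# G = generation number).
OBJECT = re.compile(rb"(\d+) (\d+) obj\s*(.*?)\s*endobj", re.DOTALL)
# "N G R" is a reference to another object.
REFERENCE = re.compile(rb"(\d+) (\d+) R\b")


def list_objects(data):
    """Map each object number to the object numbers it references."""
    links = {}
    for match in OBJECT.finditer(data):
        number = int(match.group(1))
        body = match.group(3)
        links[number] = [int(ref.group(1)) for ref in REFERENCE.finditer(body)]
    return links


links = list_objects(SAMPLE)
for number, targets in links.items():
    print(number, "->", targets)
```

In this fragment the catalog (object 1) points to the page tree (object 2), which points to the page (object 3), which in turn points back to its parent and on to its font (object 4).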
The definitive guide to the PDF file is the Adobe PDF reference guide. It is a very complete and comprehensive (and equally dull) volume which explains most of the internal workings of the PDF file format. It is not designed to tell you how to create or modify a PDF file, just to provide all the details. It is not an easy read, but the first 2 chapters do provide an excellent introduction to the PDF file format.
A slightly less technical introduction to the internals of a PDF file can be found on Wikipedia. This also gives you a detailed insight into the structure of the file.
Once you have started to explore the internals of the PDF file format, you can open up a few PDF files. It is not recommended that you edit a PDF file directly (even adding a space can break it, because it shifts the recorded object locations), but you can open one in a text editor and view it. Much of the data is encrypted or compressed, so a more useful tool is Acrobat 9. I explained how you can use this to examine the internals of a PDF file in my first posting.
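To show why a text editor only gets you so far, here is a small sketch in Python of what a compressed stream looks like and how it can be expanded. It assumes the common case of a stream compressed with the FlateDecode filter (zlib/deflate) and a /Length given as a plain number; the read_stream helper and the stand-in object built here are hypothetical and exist only for this illustration. Encrypted files, other filters, nested dictionaries and a /Length stored as a reference to another object are not handled.

```python
import re
import zlib

# Build a stand-in stream object so the example is self-contained.
# The page content and the object number 5 are made up for illustration.
content = b"BT /F1 12 Tf 72 720 Td (Hello) Tj ET"
compressed = zlib.compress(content)
obj = (
    b"5 0 obj\n<< /Length " + str(len(compressed)).encode()
    + b" /Filter /FlateDecode >>\nstream\n"
    + compressed
    + b"\nendstream\nendobj\n"
)

# A stream is a dictionary (<< ... >>) followed by the keyword "stream",
# an end-of-line marker, and then /Length bytes of data.
STREAM = re.compile(rb"<<(.*?)>>\s*stream\r?\n", re.DOTALL)


def read_stream(data):
    """Return the decoded bytes of the first stream found in data."""
    match = STREAM.search(data)
    dictionary = match.group(1)
    length = int(re.search(rb"/Length (\d+)", dictionary).group(1))
    raw = data[match.end():match.end() + length]
    if b"/FlateDecode" in dictionary:
        return zlib.decompress(raw)  # FlateDecode is zlib/deflate compression
    return raw


print(compressed)        # what a text editor would show: unreadable bytes
print(read_stream(obj))  # the readable page instructions once decompressed
```

The first print shows the kind of unreadable bytes you would see in a text editor, while the second shows the drawing instructions hidden inside them.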
To really do much with PDF files you will need a third-party library to manipulate them. We always recommend iText as a good starting point, as it's free and well-documented, with lots of examples.
So if you have reached the point where you want to start exploring the PDF file format, I hope this has provided some useful starting points. Do feel free to post your experiences.
This post is part of our “Understanding the PDF File Format” series. In each article, we aim to take a specific PDF feature and explain it in simple terms. If you wish to learn more about PDF, we have 13 years' worth of PDF knowledge and tips in our series index.
IDRsolutions develop a Java PDF library, a PDF forms to HTML5 converter, a PDF to HTML5 or SVG converter and a Java Image Library that doubles as an ImageIO replacement. On the blog, our team post about anything interesting they learn.