The PDF file format is very useful and well-documented, but it is also quite complicated and it does not work the way most people imagine. It is structured very differently from a Word or Excel document.
Most of the time, this is not an issue: you can use PDF files without knowing anything about their internals and simply enjoy the benefits. There comes a time, though, when you may need to look under the hood, so this article is designed to give you some starting points.
What is a PDF file?
It is worth getting to grips first with the basic idea that a PDF file is essentially a set of linked objects. Each page has a page object, which may refer to font objects defining the fonts, XObjects storing image data and so on. Once that idea is clear, you can look at all the different types of objects. The PDF file contains all these objects, together with a cross-reference table recording their locations, so that they can be read as needed. The file only makes sense once it is decoded by a parser and all the elements are assembled together for the final output.
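To make the idea of linked objects more concrete, here is a minimal Python sketch. It scans a small hand-written fragment in PDF syntax and lists which objects refer to which. The fragment, the object numbers and the `link_graph` helper are all invented for illustration. A real parser locates objects through the cross-reference table rather than by pattern matching, and this approach would not cope with compressed object streams.

```python
import re

# A hand-written fragment in PDF syntax (illustrative only, not a complete
# file: it has no header, cross-reference table or trailer).
SAMPLE = b"""1 0 obj
<< /Type /Catalog /Pages 2 0 R >>
endobj
2 0 obj
<< /Type /Pages /Kids [3 0 R] /Count 1 >>
endobj
3 0 obj
<< /Type /Page /Parent 2 0 R /Resources << /Font << /F1 4 0 R >> >> >>
endobj
4 0 obj
<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>
endobj
"""

# "N G obj ... endobj" defines an object; "N G R" is a reference to one.
OBJ_RE = re.compile(rb"(\d+) (\d+) obj\s*(.*?)\s*endobj", re.S)
REF_RE = re.compile(rb"(\d+) (\d+) R\b")
TYPE_RE = re.compile(rb"/Type\s*/(\w+)")


def link_graph(data):
    """Map each object number to (its /Type, the object numbers it references)."""
    graph = {}
    for match in OBJ_RE.finditer(data):
        number = int(match.group(1))
        body = match.group(3)
        type_match = TYPE_RE.search(body)
        obj_type = type_match.group(1).decode() if type_match else None
        refs = [int(ref.group(1)) for ref in REF_RE.finditer(body)]
        graph[number] = (obj_type, refs)
    return graph


graph = link_graph(SAMPLE)
for number, (obj_type, refs) in sorted(graph.items()):
    print(number, obj_type, refs)
```

In this fragment the page object (3) points back to its parent page tree (2) and on to a font object (4), which is the kind of linking described above.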
Who is in charge of the PDF file format?
The PDF Association is the overall governing body (and well worth joining if you work with PDF files). Adobe is an important member, but the organisation contains lots of other PDF vendors (both large and small). It also organises conferences and provides lots of resources online.
Is the PDF file format Open?
The PDF file format was originally produced by Adobe, but it is now an open specification (ISO 32000) and anyone can join the committees defining new features and versions.
How do I learn more about the format?
The definitive guide to the PDF format is the PDF reference manual. This is a very complete and comprehensive (and equally dull) volume which explains most of the internal workings of the PDF file format. It is not designed to tell you how to create or modify a PDF file, just to provide all the details. You will not find it an easy read, but the first two chapters do provide an excellent introduction to the PDF file format.
A slightly less technical introduction to the internals of a PDF file can be found on Wikipedia. This also gives you a detailed insight into the structure of the file.
First steps?
Once you have started to explore the internals of the PDF file format, you can open up a few PDF files. It is not recommended that you edit these files directly (even adding a space can break them), but you can open one in a text editor and view it. Much of the data is encrypted or compressed, so a more useful tool is RUPS. I explained how you can use this to examine the internals of a PDF file in another article.
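If you are wondering why so much of a PDF looks like gibberish in a text editor, the short Python sketch below may help. It compresses a small page content stream using the standard `zlib` module (the compression behind the PDF FlateDecode filter), wraps it the way a stream object appears in a file, and then decodes it again. The object number, the sample content and the `extract_stream` helper are assumptions made up for this example. A real parser would use the `/Length` entry and handle other filters and encryption.

```python
import zlib

# A typical uncompressed page content stream: show "Hello" in font F1 at 24pt.
content = b"BT /F1 24 Tf 72 720 Td (Hello) Tj ET"

# Roughly what a PDF writer does when it applies the FlateDecode filter.
compressed = zlib.compress(content)

# The stream object more or less as it would appear inside the file.
# The object number 5 is arbitrary.
stream_object = (
    b"5 0 obj\n<< /Length %d /Filter /FlateDecode >>\nstream\n" % len(compressed)
    + compressed
    + b"\nendstream\nendobj\n"
)


def extract_stream(obj_bytes):
    """Return the decoded bytes of a single FlateDecode stream object.

    Simplification: finds the data by searching for the stream keywords
    instead of reading /Length as a real parser would.
    """
    start = obj_bytes.index(b"stream\n") + len(b"stream\n")
    end = obj_bytes.rindex(b"\nendstream")
    return zlib.decompress(obj_bytes[start:end])


print(stream_object)                  # what a text editor would show you
print(extract_stream(stream_object))  # the readable drawing operators
```

Tools such as RUPS do this decoding for you, which is why they are more pleasant to use than a plain text editor.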
How do I work with PDF files directly?
To do much with PDF files, you will need a third-party library to manipulate them. We always recommend using libraries and tools (there are lots of commercial and Open Source ones) to work with PDF files. If you want to see how complex it is to edit PDF files manually, have a look at our series on How to make your own PDF file.
Do you have any other recommended articles?
You may also find our series on Understanding the PDF file Format useful, especially the related post 10 things new PDF Developers need to know.
So if you have reached the point where you want to start to explore the PDF file format, I hope this has provided some useful starting points and please do post your own experiences or recommendations.
Are you a Developer working with PDF files?
Our developer's guide contains a large number of technical posts to help you understand the PDF file format.
Do you need to solve any of these problems?
Display PDF documents in a Web app
Use PDF Forms in a web browser
Convert PDF Documents to an image
Work with PDF Documents in Java