As I have said many times before, one of the ‘issues’ with the PDF spec is that a file can contain a huge number of errors and still open. Adobe Acrobat has a large number of built-in fixing tools, so often the best solution is to open and resave the PDF file. I have been looking at a good example today.
In theory, the first line of a PDF file should be the %PDF identifier. The PDF spec requires the file to begin with a header of the form %PDF-1.N, where N is the version digit.
However, in the PDF file I was looking at, the %PDF header is not at the start of the file.
Some random data has been added to the start of the file, ahead of the %PDF header. This is a problem because the PDF file contains a large number of tables which use byte offsets from the start of the file (assuming that to be %PDF). How to handle these sorts of cases is not formally defined, and different tools will handle it in different ways; we do not currently allow for it, for example. It really depends on what sort of ‘rubbish’ files the developers of a library have met.
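To make the problem concrete, here is a minimal Python sketch of how a tolerant reader might look for the header and measure how much junk sits in front of it. The function names are hypothetical and not taken from any library. The 1024-byte search window is an assumption: it mirrors the leniency Acrobat is documented to apply, but other tools may pick a different limit or none at all.

```python
def find_pdf_header(data: bytes, search_limit: int = 1024) -> int:
    """Return the byte offset of the first '%PDF-' marker, or -1 if absent.

    search_limit is an assumed leniency window, not a rule from the spec.
    """
    return data[:search_limit].find(b"%PDF-")


def strip_leading_junk(data: bytes, search_limit: int = 1024) -> bytes:
    """Return the data starting at the '%PDF-' header, dropping earlier bytes."""
    offset = find_pdf_header(data, search_limit)
    if offset < 0:
        raise ValueError("no %PDF- header found in the searched range")
    return data[offset:]


if __name__ == "__main__":
    # A made-up file: 8 bytes of junk, then a normal-looking header.
    sample = b"\x00\x01junk\r\n%PDF-1.4\n1 0 obj\n"
    junk_length = find_pdf_header(sample)
    print("bytes before header:", junk_length)
    print("cleaned start:", strip_leading_junk(sample)[:8])
```

Dropping the junk only helps if the offsets inside the file were written relative to %PDF. If the tool that produced the file counted the junk as well, the offsets would be wrong after stripping, which is one reason a full resave is usually the safer fix.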
Generally, the best solution with these files is to open and resave them in Adobe Acrobat, which has some very powerful tools to fix and repair PDF files. Interestingly, the PDF I have been looking at drops in size from 318K to 278K and now works in all PDF tools.
This post is part of our “Understanding the PDF File Format” series. In each article, we discuss a PDF feature, bug, gotcha or tip. If you wish to learn more about PDF, we have 13 years’ worth of PDF knowledge and tips in our series index.