Java PDF Blog

How to identify a PDF file

The best way to identify a PDF file is to check the first line of the file. In theory, the first line of a PDF file should be the %PDF identifier followed by a version number (for example, %PDF-1.4). The number tells you which version of the PDF file format the file uses (later versions are backwards compatible with earlier ones). Here is what the PDF Spec says

As with the rule that the EOF marker must appear in the last 1024 bytes, this is also liberally interpreted, and you may find some rubbish added before the header of a PDF file. This is what I found in one PDF file.

Some random data has been added at the start of this file. This is a problem because a PDF file contains tables (notably the cross-reference table) which use byte offsets from the start of the file (assuming that to be %PDF). How to handle these sorts of cases is not formally defined, and different tools handle them in different ways; for example, we do not currently allow for it. It really depends on what sort of ‘rubbish’ files the developers of a library have met.
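To make the problem concrete, here is a minimal Java sketch of a tolerant header check. It is not how our library works (as noted, we do not currently allow for leading junk); it only illustrates one way a tool might cope. The class name PdfHeaderSniffer, its methods, and the 1024-byte search window are all assumptions chosen for this example. An offset of 0 means a clean file; a positive offset is the number of junk bytes in front of %PDF, which is also how far every stored offset would be off.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper for illustration only; not part of any real PDF library.
final class PdfHeaderSniffer {

    // Assumption: how far into the file we are willing to look for the header.
    static final int SEARCH_WINDOW = 1024;

    private static final byte[] MARKER = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Returns the byte offset of "%PDF-" inside the search window, or -1 if absent. */
    static int findHeaderOffset(byte[] data) {
        int limit = Math.min(data.length, SEARCH_WINDOW) - MARKER.length;
        for (int i = 0; i <= limit; i++) {
            boolean match = true;
            for (int j = 0; j < MARKER.length; j++) {
                if (data[i + j] != MARKER[j]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                return i;
            }
        }
        return -1;
    }

    /** Returns the version text that follows "%PDF-" (digits and dots), or null if no header. */
    static String readVersion(byte[] data) {
        int start = findHeaderOffset(data);
        if (start < 0) {
            return null;
        }
        StringBuilder version = new StringBuilder();
        for (int pos = start + MARKER.length; pos < data.length; pos++) {
            byte b = data[pos];
            if ((b >= '0' && b <= '9') || b == '.') {
                version.append((char) b);
            } else {
                break; // end of the version number
            }
        }
        return version.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java PdfHeaderSniffer <file>");
            return;
        }
        byte[] start;
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            // Read slightly past the window so the version number is not cut off.
            start = in.readNBytes(SEARCH_WINDOW + 16);
        }
        int offset = findHeaderOffset(start);
        if (offset < 0) {
            System.out.println("No %PDF header found in the first " + SEARCH_WINDOW + " bytes");
        } else {
            System.out.println("Version " + readVersion(start) + ", " + offset + " junk byte(s) before header");
        }
    }
}
```

A tool that wanted to read such a file could, in principle, treat the reported offset as the real start of the file and adjust every stored offset by that amount, but whether that is safe depends on how the junk got there, which is why resaving the file is usually the more reliable route. Note that readNBytes(int) needs Java 11 or later.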

Generally, the best solution with these files is to open and resave them in Adobe Acrobat, which has some very powerful tools to fix and repair PDF files. Interestingly, the PDF I have been looking at drops in size from 318K to 278K and now works in all PDF tools.