The best way to identify a PDF file is to scan the first line of the file. In theory the first line of a PDF file should be the %PDF identifier with a number. The number tells you which version of the PDF File format it is using (they are backwards compatible with early versions). Here is what the PDF Spec says
As with the EOF marker in the last 1024 bytes rule, this is also liberally interpreted and you may find some rubbish appended to a PDF file. This is what I found in one PDF file.
Some random data has been appended to this file. This is a problem because the PDF file contains a large number of tables which use offsets from the start of the file (assuming that to be %PDF). How to handle these sorts of cases is not formally defined and different tools will handle it in different ways – we do not currently allow for it for example. It really depends on what sort of ‘rubbish’ files the developers of a library have met.
Generally the best solution with these files is to open and resave in Adobe Acrobat. This has some very powerful tools to fix and repair PDF files. Interestingly, the PDF I have been looking at drops from a size of 318K to 278k and now works in all PDF tools.
Our software libraries allow you to
Convert PDF files to HTML |
Use PDF Forms in a web browser |
Convert PDF Documents to an image |
Work with PDF Documents in Java |
Read and write HEIC and other Image formats in Java |
Mark,
This behavior used to be documented… In the olden days, when Adobe was the owner of the PDF spec, there used to be an appendix called “Implementation Notes” – in the PDF 1.4 version that I am looking at, this was section H.3. In that appendix, section 3.4.1 “File Header” included the following statement: “Acrobat viewers require only that the header appear somewhere within the first 1024 bytes of the file.”
Even though this was part of the spec, this implementation note was specifically about Adobe Acrobat and Reader, and no other PDF processor was obligated to implement the same behavior. However, as you’ve seen, there are PDF files out there that will have “stuff” before the PDF header, and it’s not a bad idea to try to find the PDF header in the first 1024 bytes and do as Acrobat does. You have to adjust your file offsets to make the XRef table work, but that’s just simple math 🙂
I actually worked on some printer based software a while ago where every now and then the 1024 bytes that Adobe software checks was not sufficient, and I wrote a custom routine that would search for the PDF header in the first 100kB of a file. This was necessary because of prepended print ticket information.
Karl Heinz
Thanks for adding your detailled explanation. You would not get that kind of tolerance in any other filetype.
Mark,
I think this comes from the PostScript legacy that PDF still carries around with it: WIth PostScript it was possible to pre-pend a print ticket in the form of PostScript comments. A lot of printers still use either PostScript or PCL commands to switch emulation modes, so in order to switch a printer into PDF mode, it may need a couple of lines of PCL or PostScript before the actual PDF content. This is fine if you are going to a printer, because that printer will “know” and will be able to strip that extra content out before processing the file. If the user however decides to save that “PDF” to disk, you end up with what the PDF parser considers garbage before the PDF file header.
Working with real PDFs is hard… Even if you understand the spec 100%, it still does not replace hands on experience with all those crazy PDF files that are floating around…
When working with PDFs, the approach you should take is then when you create PDF, you need to stick to the spec, but when you read PDF files, you need to be as open and inviting to all those non-conforming PDF files as possible. I would not go as far as Adobe goes in Acrobat and Reader where they repair severely broken PDF files, but I usually provide some accommodation for non-comforming files.
KHK
That still makes sense but we could not think of another filetype with that level of ‘flexibility’.