Mark Stephens Mark has been working with Java and PDF since 1999 and is a big NetBeans fan. He enjoys speaking at conferences. He has an MA in Medieval History and a passion for reading.

How to identify a PDF file

49 sec read

The best way to identify a PDF file is to scan the first line of the file. In theory the first line of a PDF file should be the %PDF identifier with a number. The number tells you which version of the PDF File format it is using (they are backwards compatible with early versions). Here is what the PDF Spec says

PDF spec

As with the EOF marker in the last 1024 bytes rule, this is also liberally interpreted and you may find some rubbish appended to a PDF file. This is what I found in one PDF file.

rubbish at start of PDF file

Some random data has been appended to this file. This is a problem because the PDF file contains a large number of tables which use offsets from the start of the file (assuming that to be %PDF). How to handle these sorts of cases is not formally defined and different tools will handle it in different ways – we do not currently allow for it for example. It really depends on what sort of ‘rubbish’ files the developers of a library have met.

Generally the best solution with these files is to open and resave in Adobe Acrobat. This has some very powerful tools to fix and repair PDF files. Interestingly, the PDF I have been looking at drops from a size of 318K to 278k and now works in all PDF tools.



Our software libraries allow you to

Convert PDF files to HTML
Use PDF Forms in a web browser
Convert PDF Documents to an image
Work with PDF Documents in Java
Read and write HEIC and other Image formats in Java
Mark Stephens Mark has been working with Java and PDF since 1999 and is a big NetBeans fan. He enjoys speaking at conferences. He has an MA in Medieval History and a passion for reading.

4 Replies to “How to identify a PDF file”

  1. Mark,

    This behavior used to be documented… In the olden days, when Adobe was the owner of the PDF spec, there used to be an appendix called “Implementation Notes” – in the PDF 1.4 version that I am looking at, this was section H.3. In that appendix, section 3.4.1 “File Header” included the following statement: “Acrobat viewers require only that the header appear somewhere within the first 1024 bytes of the file.”
    Even though this was part of the spec, this implementation note was specifically about Adobe Acrobat and Reader, and no other PDF processor was obligated to implement the same behavior. However, as you’ve seen, there are PDF files out there that will have “stuff” before the PDF header, and it’s not a bad idea to try to find the PDF header in the first 1024 bytes and do as Acrobat does. You have to adjust your file offsets to make the XRef table work, but that’s just simple math 🙂

    I actually worked on some printer based software a while ago where every now and then the 1024 bytes that Adobe software checks was not sufficient, and I wrote a custom routine that would search for the PDF header in the first 100kB of a file. This was necessary because of prepended print ticket information.

    Karl Heinz

    1. Mark,
      I think this comes from the PostScript legacy that PDF still carries around with it: WIth PostScript it was possible to pre-pend a print ticket in the form of PostScript comments. A lot of printers still use either PostScript or PCL commands to switch emulation modes, so in order to switch a printer into PDF mode, it may need a couple of lines of PCL or PostScript before the actual PDF content. This is fine if you are going to a printer, because that printer will “know” and will be able to strip that extra content out before processing the file. If the user however decides to save that “PDF” to disk, you end up with what the PDF parser considers garbage before the PDF file header.
      Working with real PDFs is hard… Even if you understand the spec 100%, it still does not replace hands on experience with all those crazy PDF files that are floating around…

      When working with PDFs, the approach you should take is then when you create PDF, you need to stick to the spec, but when you read PDF files, you need to be as open and inviting to all those non-conforming PDF files as possible. I would not go as far as Adobe goes in Acrobat and Reader where they repair severely broken PDF files, but I usually provide some accommodation for non-comforming files.

      KHK

Comments are closed.