Identifying a PDF file from its first line

As I have said many times before, one of the ‘issues’ with the PDF spec is that some files can have a huge number of errors and still open. Adobe Acrobat has a large number of built-in fixing tools, so often the best solution is to open and resave the PDF file. I have been looking at a good example today…

In theory, the first line of a PDF file should be the %PDF identifier. Here is what the PDF spec says:

[Image: PDF spec]

However, this is what I found in this PDF file.

[Image: rubbish at start of PDF file]

Some random data has been prepended to the file. This is a problem because the PDF file contains a large number of tables which use offsets from the start of the file (assuming that to be %PDF). How to handle these sorts of cases is not formally defined, and different tools will handle it in different ways; we do not currently allow for it, for example. It really depends on what sort of ‘rubbish’ files the developers of a library have met.
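One way a tolerant reader could cope is sketched below in Python. This is only an illustration, not how any particular library does it: the function names `find_header_offset` and `to_absolute` are hypothetical, the 1024-byte search window is an assumption borrowed from Acrobat's documented behaviour rather than a requirement of the spec, and the junk prefix in the example is made up.

```python
PDF_MARKER = b"%PDF-"


def find_header_offset(data: bytes, window: int = 1024) -> int:
    """Return the position of the %PDF- marker, or -1 if no marker
    starts within the first `window` bytes (window size is an assumption)."""
    # Extend the search end so a marker that starts just inside the
    # window is still matched in full.
    return data.find(PDF_MARKER, 0, window + len(PDF_MARKER) - 1)


def to_absolute(recorded_offset: int, header_offset: int) -> int:
    """Translate an offset recorded relative to %PDF into a real file position."""
    return recorded_offset + header_offset


# Made-up example: some junk in front of a tiny PDF-like body.
junk = b"PRINT-TICKET: example\r\n"
body = b"%PDF-1.4\n1 0 obj\n<< >>\nendobj\n"
sample = junk + body

shift = find_header_offset(sample)       # number of junk bytes to skip
recorded = body.find(b"1 0 obj")         # offset as the file's tables would store it
actual = to_absolute(recorded, shift)    # where the object really sits in the file
found = sample[actual:actual + 7]        # the bytes at the corrected position
```

Every offset read from the file's tables would need the same shift applied before seeking. A file whose marker lies beyond the window would still be rejected, so the window size is a trade-off between tolerance and the risk of matching a stray `%PDF-` inside unrelated data.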

Generally, the best solution with these files is to open and resave them in Adobe Acrobat, which has some very powerful tools to fix and repair PDF files. Interestingly, the PDF I have been looking at drops in size from 318K to 278K and now works in all PDF tools.

This post is part of our “Understanding the PDF File Format” series. In each article, we discuss a PDF feature, bug, gotcha or tip. If you wish to learn more about PDF, we have 13 years’ worth of PDF knowledge and tips, so click here to visit our series index!


Mark Stephens

System Architect and Lead Developer at IDRSolutions
Mark Stephens has been working with Java and PDF since 1999 and has diversified into HTML5, SVG and JavaFX. He also enjoys speaking at conferences and has been a Speaker at user groups, Business of Software, Seybold and JavaOne conferences. He has a very dry sense of humor and an MA in Medieval History for which he has not yet found a practical use.

4 thoughts on “Identifying a PDF file from its first line”

  1. Mark,

    This behavior used to be documented… In the olden days, when Adobe was the owner of the PDF spec, there used to be an appendix called “Implementation Notes” – in the PDF 1.4 version that I am looking at, this was section H.3. In that appendix, section 3.4.1 “File Header” included the following statement: “Acrobat viewers require only that the header appear somewhere within the first 1024 bytes of the file.”
    Even though this was part of the spec, this implementation note was specifically about Adobe Acrobat and Reader, and no other PDF processor was obligated to implement the same behavior. However, as you’ve seen, there are PDF files out there that will have “stuff” before the PDF header, and it’s not a bad idea to try to find the PDF header in the first 1024 bytes and do as Acrobat does. You have to adjust your file offsets to make the XRef table work, but that’s just simple math 🙂

    I actually worked on some printer based software a while ago where every now and then the 1024 bytes that Adobe software checks was not sufficient, and I wrote a custom routine that would search for the PDF header in the first 100kB of a file. This was necessary because of prepended print ticket information.

    Karl Heinz

  2. Thanks for adding your detailed explanation. You would not get that kind of tolerance in any other filetype.

    • Mark,
      I think this comes from the PostScript legacy that PDF still carries around with it: With PostScript it was possible to pre-pend a print ticket in the form of PostScript comments. A lot of printers still use either PostScript or PCL commands to switch emulation modes, so in order to switch a printer into PDF mode, it may need a couple of lines of PCL or PostScript before the actual PDF content. This is fine if you are going to a printer, because that printer will “know” and will be able to strip that extra content out before processing the file. If the user however decides to save that “PDF” to disk, you end up with what the PDF parser considers garbage before the PDF file header.
      Working with real PDFs is hard… Even if you understand the spec 100%, it still does not replace hands on experience with all those crazy PDF files that are floating around…

      When working with PDFs, the approach you should take is that when you create PDF files, you need to stick to the spec, but when you read PDF files, you need to be as open and inviting to all those non-conforming PDF files as possible. I would not go as far as Adobe goes in Acrobat and Reader, where they repair severely broken PDF files, but I usually provide some accommodation for non-conforming files.

      KHK

  3. That still makes sense but we could not think of another filetype with that level of ‘flexibility’.
