10 Replies to “What are PDF Xref tables?”

  1. As part of our process where I work, we split incoming PDFs into individual pages before archiving them. We came across a few that broke the library we used to split them. Looking closer at the xref table I noticed these weird offsets:


    >grep -Eabo "[0-9]{10} [0-9]{5} (n|f)" "brokenpdf.pdf"
    ...
    337232:0000225190 00000 n
    337252:0032768500 00000 n
    337272:0000225389 00000 n
    337292:0000225565 00000 n
    337312:0000225763 00000 n
    337332:0000235424 00000 n
    337352:0000235612 00000 n
    337372:0000235815 00000 n
    337392:0000245138 00000 n
    337412:0000245318 00000 n
    337432:0032768500 00000 n
    337452:0000245520 00000 n
    ...

    In hex this offset is 0x01f401f4, which looks suspicious, especially since the PDF itself was only 200K. Then I searched for all the objects in the document:
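
    A quick way to check that arithmetic, as a minimal Python sketch (not something from the original file or library): the decimal offset splits into the same 16-bit value, 500 (0x01f4), twice, which may be part of why it looks suspicious.

    ```python
    offset = 32768500  # the odd value from the xref entries above

    # Hex form of the offset
    print(hex(offset))

    # Split into the upper and lower 16-bit halves
    high, low = divmod(offset, 1 << 16)
    print(high, low)
    ```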

    >grep -Eabo "[0-9]+ [0-9]+ obj" "brokenpdf.pdf"
    ...
    225190:28 0 obj
    225389:30 0 obj
    225565:31 0 obj
    225763:32 0 obj
    235424:33 0 obj
    235612:34 0 obj
    235815:35 0 obj
    245138:36 0 obj
    245318:37 0 obj
    245520:39 0 obj
    ...

    Objects 29 and 38 were missing, and where their offsets would have been recorded in the xref table, there was a ridiculously large number. Changing the “n” to “f” for these xref entries seemed to fix the issue well enough to split the document.
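
    The same check can be scripted. The following is only a rough sketch in Python, assuming a classic (uncompressed) xref table and reusing the two grep patterns above; the function name `find_bad_xref_entries` and the sample bytes are made up for illustration. It flags in-use (“n”) entries whose offset lies past the end of the file or does not land on an “N G obj” header. Like the grep, it can produce false matches if similar-looking bytes occur inside a stream.

    ```python
    import re

    # Same shapes as the grep patterns used above
    ENTRY = re.compile(rb"(\d{10}) (\d{5}) ([nf])")
    OBJ_HEADER = re.compile(rb"\d+ \d+ obj")

    def find_bad_xref_entries(data: bytes):
        """Return (position of entry, recorded offset) for suspicious 'n' entries."""
        bad = []
        for m in ENTRY.finditer(data):
            if m.group(3) != b"n":
                continue  # free entries do not point at an object
            offset = int(m.group(1))
            # Suspicious if the offset is outside the file or no object starts there
            if offset >= len(data) or not OBJ_HEADER.match(data, offset):
                bad.append((m.start(), offset))
        return bad

    # Tiny hypothetical buffer: one real object and one bogus entry
    sample = (
        b"1 0 obj\n<<>>\nendobj\n"
        b"xref\n0 3\n"
        b"0000000000 65535 f \n"
        b"0000000000 00000 n \n"
        b"0032768500 00000 n \n"
    )
    print(find_bad_xref_entries(sample))
    ```

    On a real file, the bytes would come from something like `open("brokenpdf.pdf", "rb").read()`.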

  2. If I understand correctly, this is how to update an existing PDF to add a signature:

    1. Get the /Root from the trailer.
    2. Get the /Pages from the root object.
    3. Get the /Kids from the pages object.
    4. Recreate the first kid reference, including an /Annots key that points to your objects.

    Is that the proper procedure?

  3. Hi Mark,

    I’m looking at a file whose “xref” starts at 1 instead of 0. I have checked the file and it contains no other “xref”. Does this mean the PDF has been modified or forged?

    To add some more information, this is what I got from exiftool …

    […]
    MIME Type : application/pdf
    PDF Version : 1.7
    Linearized : No
    Warning : Root object (11 0 obj) not found at offset 3474428
    — press ENTER —

    Thanks in advance.

  4. Yes, I can open the file without any problems. As for the tool used to create the file, all the pages were generated by a scanner (without OCR), each one with its own image.

    Another strange point: if the object count starts at 1 (as the xref says), the Root object’s address is 0003474428, as shown in the message “Warning: Root object (11 0 obj) not found at offset 3474428”. However, the object found at that address is 10. Finally, /Info contains no information. Given all this, is it possible to say that the file has been manipulated?

    xref 1 13
    3474846:0000000000 65535 f
    3474866:0000000009 00000 n
    3474886:0001414967 00000 n
    3474906:0001415064 00000 n
    3474926:0003474616 00000 n
    3474946:0001415251 00000 n
    3474966:0002852031 00000 n
    3474986:0002852128 00000 n
    3475006:0002852315 00000 n
    3475026:0003474331 00000 n
    3475046:0003474428 00000 n
    3475066:0003474687 00000 n
    3475086:0003474739 00000 n

    9:1 0 obj
    1414967:2 0 obj
    1415064:3 0 obj
    1415251:5 0 obj
    2852031:6 0 obj
    2852128:7 0 obj
    2852315:8 0 obj
    3474331:9 0 obj
    3474428:10 0 obj
    3474616:4 0 obj
    3474687:11 0 obj
    3474739:12 0 obj

    12 0 obj
    <>
    endobj

    1. It might have been tampered with or it might just have been created by a poor tool (see what the Creator/Producer says). Acrobat is very tolerant of errors in PDF files and will try to open anything. If you save from Acrobat, it should hopefully save a repaired version of the file.
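
    The mismatch can be reproduced from the numbers quoted in the comment above. This is only a sketch in Python, with the offsets copied from the listing and the names `entries`, `found` and `mismatches` chosen for illustration. An xref subsection header gives the first object number and the entry count, so the i-th entry belongs to object `first + i`. Numbering the 13 entries from 0 makes every in-use entry line up with the objects actually found. Numbering them from 1, as the file does, shifts everything by one and sends object 11 to offset 3474428, where object 10 lives, which matches the exiftool warning.

    ```python
    # (offset, type) for the 13 xref entries, in table order
    entries = [
        (0, "f"), (9, "n"), (1414967, "n"), (1415064, "n"), (3474616, "n"),
        (1415251, "n"), (2852031, "n"), (2852128, "n"), (2852315, "n"),
        (3474331, "n"), (3474428, "n"), (3474687, "n"), (3474739, "n"),
    ]

    # byte offset -> object number actually found there ("N 0 obj" listing)
    found = {
        9: 1, 1414967: 2, 1415064: 3, 1415251: 5, 2852031: 6, 2852128: 7,
        2852315: 8, 3474331: 9, 3474428: 10, 3474616: 4, 3474687: 11,
        3474739: 12,
    }

    def mismatches(first):
        """List (expected object, offset, object found) for a given first object number."""
        bad = []
        for i, (offset, kind) in enumerate(entries):
            if kind != "n":
                continue  # skip the free entry
            expected = first + i
            if found.get(offset) != expected:
                bad.append((expected, offset, found.get(offset)))
        return bad

    print(mismatches(0))       # entries numbered from 0
    print(len(mismatches(1)))  # entries numbered from 1, as in the file
    ```

    This only shows that the table is consistent with a header of “0 13”. It says nothing about whether the wrong header came from tampering or from a sloppy tool.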

  5. So I can’t assume that every original file (not altered, scanned or converted) has an xref that starts at zero, right?

    Depending on the tool or the scanner used, could there be original files (never updated) whose xref starts at some other value?
