Understanding the PDF file Format – PDF Xref tables explained

Understanding PDF Xref Tables:

Xref tables are part of the original PDF file specification and one of the features that give the PDF file format its flexibility. If you open a PDF file in a text editor and search for the word ‘xref’, you will find something like this:

xref
0 271
0000000000 65535 f
0000000015 00000 n
0000000102 00000 n

This is the xref (cross-reference) table. A PDF consists of lots of COS objects, and the table tells you where each one is located in the file. This is very useful: a PDF reader only has to read these values and can then load each object when it is needed. It does not need to parse or load the whole file.

The first line after the xref keyword describes the entries that follow: the first number is the object number of the first entry and the second is the number of entries. In this case the object numbers start at zero and the table has 271 entries. Each following line gives the object's byte offset from the start of the file, then the generation number (you can have several revisions of an object) and a flag to say whether the object is in use (n) or free (f). For a free entry, the first number is the object number of the next free object rather than an offset. If the PDF file has been edited and objects changed, the changed versions are often appended to the end of the file along with an updated xref table showing their new locations. So it is possible for a PDF file to contain several xref tables, and the later values take precedence.
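
To make the layout concrete, here is a minimal sketch in Python that parses the sample table above. The names XrefEntry and parse_xref_section are my own for illustration, not part of any PDF library, and the UPDATE section (with its offset of 400) is invented to show a later table overriding an earlier one. It tolerates the truncated sample, which declares 271 entries but only shows three.

```python
import re
from typing import NamedTuple


class XrefEntry(NamedTuple):
    offset: int      # byte offset for 'n' entries (next free object for 'f')
    generation: int
    in_use: bool     # True for 'n', False for 'f'


def parse_xref_section(text: str) -> dict[int, XrefEntry]:
    """Parse one classic xref section into {object number: entry}."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines or lines[0] != "xref":
        raise ValueError("section does not start with 'xref'")
    entries: dict[int, XrefEntry] = {}
    i = 1
    while i < len(lines):
        # Subsection header: first object number, then entry count.
        header = re.fullmatch(r"(\d+) (\d+)", lines[i])
        if header is None:
            break  # e.g. the 'trailer' keyword that follows the table
        first, count = int(header.group(1)), int(header.group(2))
        i += 1
        for obj_num in range(first, first + count):
            if i >= len(lines):
                break  # sample ended before 'count' entries were seen
            m = re.fullmatch(r"(\d{10}) (\d{5}) ([nf])", lines[i])
            if m is None:
                break
            entries[obj_num] = XrefEntry(
                int(m.group(1)), int(m.group(2)), m.group(3) == "n"
            )
            i += 1
    return entries


SAMPLE = """xref
0 271
0000000000 65535 f
0000000015 00000 n
0000000102 00000 n
"""

# Hypothetical later section, as an edit might append: object 1 has moved.
UPDATE = """xref
1 1
0000000400 00000 n
"""

table = parse_xref_section(SAMPLE)
print(table[1])  # XrefEntry(offset=15, generation=0, in_use=True)

# Later sections win, so apply them on top of the earlier ones.
merged = {**table, **parse_xref_section(UPDATE)}
print(merged[1].offset, merged[2].offset)  # 400 102
```

A real reader would find the newest table first and follow links back to the older ones, but the precedence rule is the same: the most recent entry for an object number is the one that counts.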

If you look at byte offset 15 in the PDF file I took the xref table from, you will find the start of object 1:

1 0 obj<</Type/Font ...
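
The same lookup can be sketched in code. The example below is only an illustration: build_demo_file creates a tiny PDF-like byte string in memory (it is not a valid PDF) and records where each object starts, and read_object_header is a hypothetical helper that seeks straight to an offset without reading anything before it.

```python
import io
import re


def build_demo_file() -> tuple[bytes, dict[int, int]]:
    """Build a tiny PDF-like byte string and record each object's offset."""
    objects = {
        1: b"1 0 obj\n<</Type/Font>>\nendobj\n",
        2: b"2 0 obj\n<</Type/Page>>\nendobj\n",
    }
    data = b"%PDF-1.4\n"
    offsets: dict[int, int] = {}
    for num, body in objects.items():
        offsets[num] = len(data)  # the object starts where the data ends so far
        data += body
    return data, offsets


def read_object_header(stream, offset: int) -> tuple[int, int]:
    """Seek to an xref offset and return (object number, generation)."""
    stream.seek(offset)
    head = stream.read(32)  # enough bytes to hold 'N G obj'
    m = re.match(rb"(\d+) (\d+) obj", head)
    if m is None:
        raise ValueError(f"no object starts at byte {offset}")
    return int(m.group(1)), int(m.group(2))


data, offsets = build_demo_file()
stream = io.BytesIO(data)
print(read_object_header(stream, offsets[2]))  # (2, 0)
```

Object 2 is read without touching object 1, which is the point of the table: the reader jumps directly to what it needs.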

If you are looking at a PDF file created with version 1.5 or later, you may not find an xref table, because that version introduced an alternative way to store the object locations (cross-reference streams). That is for another article.

Xref tables also explain why a PDF file becomes corrupted if you add or remove a byte: every object after that point moves, so the offsets pointing to them are now wrong.
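
Here is a rough sketch of that failure. It uses an in-memory, PDF-like byte string rather than a real file, and stale_entries is a made-up helper that checks whether each recorded offset still points at the matching "N G obj" marker.

```python
import re


def build_demo() -> tuple[bytes, dict[int, int]]:
    """Build a tiny PDF-like byte string plus the offset of each object."""
    bodies = [
        b"1 0 obj\n<</Type/Font>>\nendobj\n",
        b"2 0 obj\n<</Type/Page>>\nendobj\n",
    ]
    data = b"%PDF-1.4\n"
    offsets: dict[int, int] = {}
    for num, body in enumerate(bodies, start=1):
        offsets[num] = len(data)
        data += body
    return data, offsets


def stale_entries(data: bytes, offsets: dict[int, int]) -> list[int]:
    """Return object numbers whose offset no longer points at 'N G obj'."""
    bad = []
    for num, offset in sorted(offsets.items()):
        # An offset past the end of the data yields an empty slice: no match.
        m = re.match(rb"(\d+) \d+ obj", data[offset:offset + 32])
        if m is None or int(m.group(1)) != num:
            bad.append(num)
    return bad


data, offsets = build_demo()
print(stale_entries(data, offsets))     # []

# Insert a single byte near the start: every object after it shifts by one.
damaged = data[:5] + b" " + data[5:]
print(stale_entries(damaged, offsets))  # [1, 2]
```

The same check also flags an offset that lies beyond the end of the file, which is another common sign of a damaged table.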

This post is part of our “Understanding the PDF File Format” series. In each article, we aim to take a specific PDF feature and explain it in simple terms. If you wish to learn more about PDF, we have 13 years' worth of PDF knowledge and tips in our series index.

Mark Stephens

System Architect and Lead Developer at IDRSolutions
Mark Stephens has been working with Java and PDF since 1999 and has diversified into HTML5, SVG and JavaFX. He also enjoys speaking at conferences and has been a Speaker at user groups, Business of Software, Seybold and JavaOne conferences. He has a very dry sense of humor and an MA in Medieval History for which he has not yet found a practical use.

3 thoughts on “Understanding the PDF file Format – PDF Xref tables explained”

  1. nospampls

    Does the xref list the objects in order?

  2. Joel

    As part of our process where I work, we split incoming PDFs into individual pages before archiving them. We came across a few that broke the library we used to split them. Looking closer at the xref table I noticed these weird offsets:


    >grep -Eabo "[0-9]{10} [0-9]{5} (n|f)" "brokenpdf.pdf"
    ...
    337232:0000225190 00000 n
    337252:0032768500 00000 n
    337272:0000225389 00000 n
    337292:0000225565 00000 n
    337312:0000225763 00000 n
    337332:0000235424 00000 n
    337352:0000235612 00000 n
    337372:0000235815 00000 n
    337392:0000245138 00000 n
    337412:0000245318 00000 n
    337432:0032768500 00000 n
    337452:0000245520 00000 n
    ...

    In hex this offset is 0x01f401f4, which looks suspicious, especially as the PDF itself was only 200K. Then I searched for all the objects in the document:

    >grep -Eabo "[0-9]+ [0-9]+ obj" "brokenpdf.pdf"
    ...
    225190:28 0 obj
    225389:30 0 obj
    225565:31 0 obj
    225763:32 0 obj
    235424:33 0 obj
    235612:34 0 obj
    235815:35 0 obj
    245138:36 0 obj
    245318:37 0 obj
    245520:39 0 obj
    ...

    Objects 29 and 38 were missing, and where their offsets would have been recorded in the xref table, there was a ridiculously large number. Changing the “n” to “f” for these xref entries seemed to fix the issue well enough to split the document.
