What are PDF Xref tables?

Q: Can an Xref table be repaired if a PDF is corrupted?

Yes. Since PDF objects begin with standard markers (like obj), recovery tools can scan the file's raw bytes to find these markers, recalculate their locations, and generate a new Xref table from scratch to restore the file.

Table of Contents show

TL;DR:

A PDF Xref table is an internal “map” of byte offsets that tells a reader exactly where to find specific objects (like fonts or images) within a file. It allows for fast, random access to data without loading the whole document, but any manual change to the file’s bytes will break these pointers and corrupt the PDF. They are a fundamental component defined in the ISO 32000-1 (PDF 1.7) Specification.

What are PDF Xref tables?

Xref tables are part of the original PDF file specification and one of the features which gives the PDF file format its flexibility. It is found in the trailer. A PDF file can contain multiple linked xref tables which all need to be read.

The pointer to the first xref table is usually found at the end of the PDF files in the last 1024 bytes. Linearized PDF files includes some data at the start of the file so you can display the file before it has been fully read.

Identifying the Xref Table in a Text Editor

If you open a PDF file in a text editor and search for the word ‘xref’ you will find something like this

xref
0 271
0000000000 65535 f
0000000015 00000 n
0000000102 00000 n

This is the xref table. A PDF consists of lots of COS objects and this tells you where they are located in the file. This is actually very useful. A PDF Reader just has to read these values and then it loads the objects only when they are needed. It does not need to parse or load the whole file.

Understanding the Xref Entry Structure

The first line tells you about the table entries. In this case the xref table has 271 entries and the object numbers start at zero. The following lines give:

The object offset from the start of the file.
Then the generation number (you can have several revisions of an object).
A flag to say whether the object is in use (n) or not (f).

If the PDF file has been edited and objects changed, the changed version is often tagged onto the PDF with an updated xref table showing the new location. So it is possible for a PDF file to contain several xref tables and the later values are used.

Practical Example of Byte Offsets

If you look at byte offset 15 in the PDF file I took the xref table from you will find the start of object 1

1 0 obj<</Type/Font ...

PDF Version 1.5 and Compressed Streams

If you are looking at PDF file created with version 1.5 and above, you may not find an xref entry because they introduced an alternative way to store the objects inside Compressed streams.

The Importance of Precise File Structure

Xref tables also explain why if you alter a byte or add a byte to a PDF file it will become corrupted – all the pointers are now wrong.

Tools for Exploring PDF Xref Tables

Is there an easy way to explore PDF Xref tables? Our JPedal Java PDF Viewer has a inspection mode which you can use to view Xref tables and see what data they point to. JPedal is the best Java PDF library for developers.

It works in trial version as well (so you do not need to buy our software to use it). Find out more…

FAQs

Q: How do “Incremental Updates” relate to Xref tables?

A: Instead of rewriting the whole file, PDF software appends new data to the end. A new Xref table is added that “shadows” the original, pointing the reader to the most recent version of edited objects without losing the document’s history.

Q: Can an Xref table be repaired if a PDF is corrupted?

A: Yes. Since PDF objects begin with standard markers (like obj), recovery tools can scan the file’s raw bytes to find these markers, recalculate their locations, and generate a new Xref table from scratch to restore the file.

Q: What is the difference between an Xref Table and an Xref Stream?

A: A Table is a human-readable text list (PDF 1.4 and older), whereas a Stream is a compressed binary object (PDF 1.5+). Streams are more efficient because they reduce file size and allow multiple objects to be grouped together.

The JPedal PDF library allows you to solve these problems in Java

//Convenience static method (see class for additional options)
ExtractClippedImages.writeAllClippedImagesToDir("inputFileOrDirectory", "outputDir", "outputImageFormat", new String[] {"imageHeightAsFloat", "subDirectoryForHeight"});

final PdfManipulator pdf = new PdfManipulator();
pdf.loadDocument(new File("inputFile.pdf"));
pdf.addPage(1, PaperSize.A4_LANDSCAPE);
pdf.addText(1, "Hello World", 10, 10, BaseFont.HelveticaBold, 12, 1, 0.3f, 0.2f);
pdf.addImage(1, new BufferedImage(), new float[] {0, 0, 100, 100});
pdf.rotatePage(1, 90);
pdf.apply();
pdf.writeDocument(new File("outputFile.pdf"));

Viewer viewer = new Viewer();
viewer.setupViewer();
viewer.executeCommand(ViewerCommands.OPENFILE, "pdfFile.pdf");

//Convenience static method (see class for additional options)
ExtractTextAsWordList.writeAllWordlistsToDir("inputFileOrDirectory", "outputDir", -1);

PdfMerge.mergeFiles(new File("inputFile1.pdf"), new File("inputFile2.pdf"), new File("outputFile.pdf"));

PdfManipulator.splitInHalf(new File("inputFile.pdf"), new File("outputFolder"), pageToSplitAt);

PrintPdfPages print = new PrintPdfPages("C:/pdfs/mypdf.pdf");

if (print.openPDFFile()) {
    print.printAllPages("Printer Name");
}

//Convenience static method (see class for additional options)
ExtractClippedImages.writeAllClippedImagesToDir("inputFileOrDirectory", "outputDir", "outputImageFormat", new String[] {"imageHeightAsFloat", "subDirectoryForHeight"});

//Convenience static method (see class for additional options)
ArrayList resultsForPages = FindTextInRectangle.findTextOnAllPages("/path/to/file.pdf", "textToFind");

java -jar jpedal.jar --inspect "inputFile.pdf"

PdfSigner.signPdf(
        "inputFile.pdf",
        "outputFile.pdf",
        "keystorePassword",
        "keystoreFile.p12",
        "signerName",
        "signerLocation",
        "signingReason",
        ACCESS_PERMISSION.P1
);

10 Replies to “What are PDF Xref tables?”

nospampls says:
April 2, 2017 at 9:36 pm
Does the xref list the objects in order?
1. Mark Stephens says:
  April 3, 2017 at 8:30 am
  It is in object order from the first number listed
Joel says:
July 25, 2017 at 10:46 pm
As part of our process where I work, we split incoming PDFs into individual pages before archiving them. We came across a few that broke the library we used to split them. Looking closer at the xref table I noticed these weird offsets:
>grep -Eabo "[0-9]{10} [0-9]{5} (n|f)" "brokenpdf.pdf" ... 337232:0000225190 00000 n 337252:0032768500 00000 n 337272:0000225389 00000 n 337292:0000225565 00000 n 337312:0000225763 00000 n 337332:0000235424 00000 n 337352:0000235612 00000 n 337372:0000235815 00000 n 337392:0000245138 00000 n 337412:0000245318 00000 n 337432:0032768500 00000 n 337452:0000245520 00000 n ...
This offset in hex looks like 0x01f401f4, which is suspicious, plus the fact that the PDF itself was only 200K. Then I searched for all the objects in the document:
>grep -Eabo "[0-9]+ [0-9]+ obj" "brokenpdf.pdf" ... 225190:28 0 obj 225389:30 0 obj 225565:31 0 obj 225763:32 0 obj 235424:33 0 obj 235612:34 0 obj 235815:35 0 obj 245138:36 0 obj 245318:37 0 obj 245520:39 0 obj ...
Objects 29 and 38 were missing, and where their offsets would have been recorded in the xref table, there was a rediculously large number. Changing the “n” to “f” for these xref entries seemed to fix the issue well enough to split the document.
MICHAEL CHOURDAKIS says:
October 3, 2018 at 8:06 pm
If I understand correctly on how to update an existing PDF to add a signature:
1. Get the root from trailer
2. Get the /Pages from root object
3. Get the /Kids from the pages object
4. Recreate the first kid reference, including an /Annots key to point to your objects.
Is that the proper procedure?
Mark Stephens says:
October 19, 2018 at 5:21 am
That would work in theory but you would need to write a new trailer to the end of the file with your new Annotation and with the new Root and Annots
Rui says:
July 26, 2022 at 11:06 pm
Hi mark,
I’m looking at a file where “xref” starts with 1 instead of 0. It’s a file that has been checked and has no other “xref”. I would like to ask if I am facing a PDF that has been modified or forged?
Just to complement with more information, using exiftool, I got …
[…]
MIME Type : application/pdf
PDF Version : 1.7
Linearized : No
Warning : Root object (11 0 obj) not found at offset 3474428
— press ENTER —
Thanks in advance.
1. Mark Stephens says:
  July 27, 2022 at 9:04 am
  Or just broken. Does it open in Acrobat? What tool was it created with?
Rui says:
July 27, 2022 at 1:30 pm
Yes, I can open the file without any problems. About the tool used in the creation of the file, all the pages of the file were generated by a scanner (without OCR), each one with a unique image.
Another strange point was the following: starting the object count with the number 1 (according to xref), the Root obj address is 0003474428, as shown in the message “Warning: Root object (11 0 obj) not found at offset 3474428 “. However, the obj found at that address is 10. Finally, /Info is out of information. With all this, is it possible to say that there was some manipulation of the file?
xref 1 13
3474846:0000000000 65535 f
3474866:0000000009 00000 n
3474886:0001414967 00000 n
3474906:0001415064 00000 n
3474926:0003474616 00000 n
3474946:0001415251 00000 n
3474966:0002852031 00000 n
3474986:0002852128 00000 n
3475006:0002852315 00000 n
3475026:0003474331 00000 n
3475046:0003474428 00000 n
3475066:0003474687 00000 n
3475086:0003474739 00000 n
9:1 0 obj
1414967:2 0 obj
1415064:3 0 obj
1415251:5 0 obj
2852031:6 0 obj
2852128:7 0 obj
2852315:8 0 obj
3474331:9 0 obj
3474428:10 0 obj
3474616:4 0 obj
3474687:11 0 obj
3474739:12 0 obj
12 0 obj
<>
endobj
1. Mark Stephens says:
  July 27, 2022 at 2:01 pm
  It might have been tampered with or it might just have been created by a poor tool (see what the Creator/Producer says). Acrobat is very tolerant of errors in PDF files and will try to open anything. If you save from Acrobat, it should hopefully save a repaired version of the file.
Rui says:
July 27, 2022 at 2:24 pm
So I can’t say that every original file (not altered, scanned or converted) should start xref with zero, right?
Depending on the tool, or the scanner used, there may be cases of original files (not updated) that the xref is started with other values?

Comments are closed.

What are PDF Xref tables?

TL;DR:

What are PDF Xref tables?

Identifying the Xref Table in a Text Editor

Understanding the Xref Entry Structure

Practical Example of Byte Offsets

PDF Version 1.5 and Compressed Streams

The Importance of Precise File Structure

Tools for Exploring PDF Xref Tables

FAQs

The JPedal PDF library allows you to solve these problems in Java

What is JPedal?

Why use JPedal?

What licenses are available?

How to use JPedal?

What is PDF/A?

Apache Commons Imaging Alternative for Java: JDeli

TwelveMonkeys Alternative for Java Image Processing

10 Replies to “What are PDF Xref tables?”

What are PDF Xref tables?

TL;DR:

What are PDF Xref tables?

Identifying the Xref Table in a Text Editor

Understanding the Xref Entry Structure

Practical Example of Byte Offsets

PDF Version 1.5 and Compressed Streams

The Importance of Precise File Structure

Tools for Exploring PDF Xref Tables

FAQs

Related posts:

The JPedal PDF library allows you to solve these problems in Java

What is JPedal?

Why use JPedal?

What licenses are available?

How to use JPedal?

What is PDF/A?

Apache Commons Imaging Alternative for Java: JDeli

TwelveMonkeys Alternative for Java Image Processing

10 Replies to “What are PDF Xref tables?”