Comparing 2 PDF files

A frequent question on the PDF forums is how to compare 2 versions of a PDF file to see what has changed. This is actually one of those cases where the person asking generally means something slightly different.

Usually this means ‘how can I see what has changed visually’. PDF is a flexible file format in which you can do the same thing in many different ways. So you could create 2 different PDF versions of the same document using Acrobat and Ghostscript (as an example). The pages would (hopefully) look identical. But the files would be different sizes and the internal structure of each would be very different, so comparing the raw files tells you very little.

As part of developing a PDF library, we want to do an awful lot of regression testing to make sure that we do not break anything. So we need to compare a lot of files. We also like to test each change individually so we can investigate any problems.

So the way we compare PDF files is to extract the text and to convert each page of the PDF to a PNG. We do this in Java and compare the results against a stored baseline. You still need a human to verify any changes, but it does provide very quick regression tests. If the results are identical, we can be confident that the output has not changed. And doing the same with 2 PDF files allows you to quickly review any changes, especially if you get the comparison to highlight the area on the PNG which has changed.
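To make the PNG comparison concrete, here is a minimal sketch of the image half of that idea. It is not the code referred to above: the class `PageDiff` and its methods are hypothetical names, and it assumes you have already rendered each page to a `BufferedImage` with whatever PDF library you use. It finds the smallest rectangle containing every differing pixel and can outline that area in red.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.Rectangle;
import java.awt.image.BufferedImage;

/** Hypothetical helper for illustration only. */
final class PageDiff {

    private PageDiff() {}

    /**
     * Returns the smallest rectangle containing every pixel that differs
     * between the two images, or null when they are pixel-identical.
     * Images of different sizes are treated as entirely different.
     */
    static Rectangle changedRegion(BufferedImage baseline, BufferedImage candidate) {
        int w = baseline.getWidth();
        int h = baseline.getHeight();
        if (w != candidate.getWidth() || h != candidate.getHeight()) {
            return new Rectangle(0, 0,
                    Math.max(w, candidate.getWidth()),
                    Math.max(h, candidate.getHeight()));
        }
        int minX = w, minY = h, maxX = -1, maxY = -1;
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                // getRGB returns packed ARGB, so one int comparison covers all channels
                if (baseline.getRGB(x, y) != candidate.getRGB(x, y)) {
                    minX = Math.min(minX, x);
                    minY = Math.min(minY, y);
                    maxX = Math.max(maxX, x);
                    maxY = Math.max(maxY, y);
                }
            }
        }
        if (maxX < 0) {
            return null; // no differing pixel found
        }
        return new Rectangle(minX, minY, maxX - minX + 1, maxY - minY + 1);
    }

    /** Returns a copy of the candidate image with a red outline around the region. */
    static BufferedImage highlight(BufferedImage candidate, Rectangle region) {
        BufferedImage out = new BufferedImage(
                candidate.getWidth(), candidate.getHeight(), BufferedImage.TYPE_INT_ARGB);
        Graphics2D g = out.createGraphics();
        g.drawImage(candidate, 0, 0, null);
        g.setColor(Color.RED);
        // drawRect covers width + 1 pixels, so subtract 1 to stay on the region's edge
        g.drawRect(region.x, region.y, region.width - 1, region.height - 1);
        g.dispose();
        return out;
    }
}
```

A regression run would call `changedRegion` for each page against its baseline image, treat `null` as a pass, and otherwise save the result of `highlight` for a human to review. An exact pixel match is a deliberately strict choice; if your renderer produces small anti-aliasing differences between runs, you may prefer a tolerance per channel. The extracted text can be checked separately with a plain string comparison.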

We find this a very good way to compare PDF output. What works for you?

This post is part of our “Understanding the PDF File Format” series. In each article, we discuss a PDF feature, bug, gotcha or tip. If you wish to learn more about PDF, we have 13 years’ worth of PDF knowledge and tips, so visit our series index!

Mark Stephens

System Architect and Lead Developer at IDRSolutions
Mark Stephens has been working with Java and PDF since 1999 and has diversified into HTML5, SVG and JavaFX. He also enjoys speaking at conferences and has been a Speaker at user groups, Business of Software, Seybold and JavaOne conferences. He has a very dry sense of humor and an MA in Medieval History for which he has not yet found a practical use.

One thought on “Comparing 2 PDF files”

  1. I wrote a tool for comparing two PDFs easily, see http://github.com/vslavik/diff-pdf
    It can produce a new PDF with the changes marked up and has an interactive GUI for inspecting the changes as well.
