Extracting flattened form data from a PDF file

Like many of my best articles, this post was inspired by a question on our help forums. We try to answer every question, and we often use a blog post to give a more detailed reply when the topic is of general interest.

The question was about extracting form data from a PDF file. The PDF file format stores interactive form data separately from the general page data, in its own set of objects, so it is very easy to extract. However, it is possible to ‘flatten the forms’. I explained this in more detail in a previous post, “What is form flattening”.

Once this has been done, the forms no longer exist. They are just text or shapes inside the stream of commands used to draw the page, and you cannot interact with them. So is it still possible to extract the data?

The problem is that there is no single defined way to flatten Form data into a PDF file. The data can be stored in the PDF in several possible ways:-

1. As ordinary text (especially for form objects which show text data, such as combo boxes, lists and text fields).

2. As a special embedded character (especially for radio buttons or checked boxes where you can define one character as a ticked box and one character as a blank box).

3. As a text character for the blank box, with the check or tick drawn on top of it for checked boxes.

4. Purely as a set of draw commands.

Options 1 and 4 are probably too complex to allow easy extraction.

You could reconstruct the form data for 2 and 3 for radio buttons or check boxes as follows:-

Embedded character extraction

1. Identify the 2 characters used.

2. Do a page search for the characters.

This would give you the locations of all the boxes, and which of the two characters appears at each location tells you whether that box is checked.
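
The two steps above can be sketched roughly as follows. This is only an illustration: it assumes your PDF library can hand you each character on the page together with its position, and the `Glyph` class, the `find_boxes` function and the two example characters are all made up for this sketch rather than taken from any real API or file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Glyph:
    """One positioned character, as a text extractor might report it (hypothetical shape)."""
    char: str
    x: float
    y: float


def find_boxes(glyphs, checked_char, blank_char):
    """Return (x, y, is_checked) for every glyph that is one of the two box characters."""
    boxes = []
    for g in glyphs:
        if g.char == checked_char:
            boxes.append((g.x, g.y, True))
        elif g.char == blank_char:
            boxes.append((g.x, g.y, False))
    return boxes


# Assumed example: in this made-up symbol font, "4" draws a ticked box and "o" a blank one.
page = [
    Glyph("Y", 50, 700), Glyph("4", 100, 700),
    Glyph("N", 50, 680), Glyph("o", 100, 680),
]
print(find_boxes(page, checked_char="4", blank_char="o"))
```

In practice you would probably also need to check which font each character uses, since the same character code in the ordinary text font is just a letter or digit.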

Blank box drawn over

1. Identify all the blank boxes. This will give you the locations of all the boxes.

2. See if an XForm co-ordinate falls inside the box (i.e. it is checked). This will give you the status.
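
As a rough sketch of that second step, the check is a simple point-in-rectangle test. It assumes you have already worked out the bounds of each blank box and the page co-ordinate of each drawn tick; the `Rect` class, the `box_states` function and the numbers are invented for illustration and do not come from any particular library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Bounds of a blank box in page co-ordinates (hypothetical shape)."""
    x: float
    y: float
    width: float
    height: float

    def contains(self, px, py):
        # True when the point lies inside the box or on its edge.
        return (self.x <= px <= self.x + self.width
                and self.y <= py <= self.y + self.height)


def box_states(blank_boxes, mark_points):
    """For each blank box, report True if any tick co-ordinate falls inside it."""
    return [
        any(box.contains(px, py) for px, py in mark_points)
        for box in blank_boxes
    ]


# Assumed example: two 12x12 boxes and a single tick drawn over the first one.
boxes = [Rect(100, 700, 12, 12), Rect(100, 680, 12, 12)]
marks = [(105, 706)]
print(box_states(boxes, marks))
```

A real file may need some tolerance around each box, because a tick is often drawn slightly outside the box it belongs to.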

So it is a non-trivial task, and it may vary from file to file, but it is possible. If you have access to the original non-flattened PDF files, though, it is a lot easier.

This post is part of our “Understanding the PDF File Format” series. In each article, we discuss a PDF feature, bug, gotcha or tip. If you wish to learn more about PDF, we have 13 years' worth of PDF knowledge and tips in our series index.

Mark Stephens

System Architect and Lead Developer at IDRSolutions
Mark Stephens has been working with Java and PDF since 1999 and has diversified into HTML5, SVG and JavaFX. He also enjoys speaking at conferences and has been a Speaker at user groups, Business of Software, Seybold and JavaOne conferences. He has a very dry sense of humor and an MA in Medieval History for which he has not yet found a practical use.