Mark Stephens

Mark Stephens has been working with Java and PDF since 1999 and has diversified into HTML5, SVG and JavaFX.

He also enjoys speaking at conferences and has been a Speaker at user groups, Business of Software, Seybold and JavaOne conferences. He has a very dry sense of humor and an MA in Medieval History for which he has not yet found a practical use.

Extracting flattened form data from a PDF file

Like many of my best articles, this post was inspired by a question on our help forums. We try to answer all questions, and we often use a blog post to give a more detailed reply when the topic is of general interest.

The question related to extracting form data from a PDF file. The PDF file format stores interactive form data separately from the general page data, in its own set of objects, so it is normally very easy to extract from the PDF file. However, it is possible to ‘flatten the forms’. I explained this in more detail in a previous post, “What is form flattening”.

Once this has been done, the forms no longer exist. They are just text or shapes inside the stream of commands used to draw the PDF, and you cannot interact with them. So is it still possible to extract the data?

The problem is that there is no single defined way to flatten Form data into a PDF file. The data can be stored in the PDF in several possible ways:-

1. As ordinary text (especially for Form objects which show text data such as comboBoxes, Lists, and Text fields).

2. As a special embedded character (especially for radio buttons or checked boxes where you can define one character as a ticked box and one character as a blank box).

3. As a combination: a text character for the blank box, with the check or tick drawn on top for checked boxes.

4. Purely as a set of draw commands.

Cases 1 and 4 are probably too complex to allow easy extraction.

For cases 2 and 3, you could reconstruct the form data for radio buttons or check boxes as follows:-

Embedded character extraction

1. Identify the 2 characters used.

2. Do a page search for the characters.

This would give you the locations of all the boxes, and which of the two characters was found at each location tells you its status.
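As a rough sketch of these two steps in Java, assume your PDF library can hand you each character on the page together with its position. The `Glyph` and `Box` classes, the `findBoxes` method and the two Unicode characters used in `main` are all hypothetical stand-ins chosen for illustration; they are not part of any particular PDF API, and the real characters depend on the font embedded in your file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class FlattenedCheckboxScan {

    /** One character on the page with its position (hypothetical; fill it from your PDF library's text extraction). */
    static final class Glyph {
        final char ch;
        final double x;
        final double y;

        Glyph(char ch, double x, double y) {
            this.ch = ch;
            this.x = x;
            this.y = y;
        }
    }

    /** A reconstructed check box: where it is and whether it is ticked. */
    static final class Box {
        final double x;
        final double y;
        final boolean checked;

        Box(double x, double y, boolean checked) {
            this.x = x;
            this.y = y;
            this.checked = checked;
        }
    }

    /** Step 2: search the page's glyphs for the two characters identified in step 1. */
    static List<Box> findBoxes(List<Glyph> glyphs, char checkedChar, char blankChar) {
        List<Box> boxes = new ArrayList<>();
        for (Glyph g : glyphs) {
            if (g.ch == checkedChar) {
                boxes.add(new Box(g.x, g.y, true));
            } else if (g.ch == blankChar) {
                boxes.add(new Box(g.x, g.y, false));
            }
            // Any other character is ordinary page text and is ignored.
        }
        return boxes;
    }

    public static void main(String[] args) {
        // Assumed characters: U+2611 as the ticked box, U+2610 as the blank box.
        List<Glyph> page = Arrays.asList(
                new Glyph('Y', 50, 700),
                new Glyph('\u2611', 100, 700),
                new Glyph('\u2610', 100, 680));
        for (Box b : findBoxes(page, '\u2611', '\u2610')) {
            System.out.println(b.x + "," + b.y + " checked=" + b.checked);
        }
    }
}
```

Step 1, identifying which two characters a given file uses, still has to be done by inspecting the file, as it varies with the tool that flattened the form.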

Blank box drawn over

1. Identify all the blank boxes. This will give you the locations of all the boxes.

2. See if an XForm co-ordinate falls inside the box (i.e. it is checked). This will give you the status.
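The second step is a simple point-in-rectangle test. Here is a minimal Java sketch, assuming you have already extracted the bounds of each blank box and the co-ordinate at which each tick is drawn. The `Rect` and `Mark` classes and the `boxStates` method are hypothetical names used for illustration only, and the numbers in `main` are invented sample values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class OverlayCheckDetector {

    /** Bounds of a blank box in page co-ordinates (hypothetical helper type). */
    static final class Rect {
        final double x;
        final double y;
        final double w;
        final double h;

        Rect(double x, double y, double w, double h) {
            this.x = x;
            this.y = y;
            this.w = w;
            this.h = h;
        }

        boolean contains(double px, double py) {
            return px >= x && px <= x + w && py >= y && py <= y + h;
        }
    }

    /** Co-ordinate at which a tick or check is drawn (hypothetical helper type). */
    static final class Mark {
        final double x;
        final double y;

        Mark(double x, double y) {
            this.x = x;
            this.y = y;
        }
    }

    /** Returns one flag per box, in the same order: true if any mark falls inside that box. */
    static List<Boolean> boxStates(List<Rect> boxes, List<Mark> marks) {
        List<Boolean> states = new ArrayList<>();
        for (Rect box : boxes) {
            boolean checked = false;
            for (Mark m : marks) {
                if (box.contains(m.x, m.y)) {
                    checked = true;
                    break;
                }
            }
            states.add(checked);
        }
        return states;
    }

    public static void main(String[] args) {
        List<Rect> boxes = Arrays.asList(
                new Rect(100, 700, 12, 12),
                new Rect(100, 680, 12, 12));
        List<Mark> marks = Arrays.asList(new Mark(104, 703));
        System.out.println(boxStates(boxes, marks));
    }
}
```

In a real file you may need a small tolerance around each box, as the tick is not always drawn exactly inside the box outline.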

So it is a non-trivial task, and it may vary from file to file, but it is possible. If you have access to the original non-flattened PDF files, though, it is a lot easier.

This post is part of our “Understanding the PDF File Format” series. In each article, we discuss a PDF feature, bug, gotcha or tip. If you wish to learn more about PDF, we have 13 years’ worth of PDF knowledge and tips, so click here to visit our series index!
