Like many of my best articles, this post was inspired by a question on our help forums. We try to answer all questions, and often use a blog post to give a more detailed reply when the topic is of general interest.
The question was about extracting form data from a PDF file. The PDF file format stores interactive form data separately from the general page data, in its own set of objects, so it is normally very easy to extract. However, it is possible to ‘flatten’ the forms. I explained this in more detail in a previous post, “What is form flattening”.
Once this has been done, the forms no longer exist. They are just text or shapes inside the stream of commands used to draw the page, and you cannot interact with them. So is it still possible to extract the data?
The problem is that there is no single defined way to flatten form data into a PDF file. The data can end up in the PDF in several possible ways:-
1. As ordinary text (especially for form objects which show text data, such as combo boxes, lists and text fields).
2. As a special embedded character (especially for radio buttons or checked boxes where you can define one character as a ticked box and one character as a blank box).
3. As a combination of a text character for the blank box, with the checks or ticks drawn on top for checked boxes.
4. Purely as a set of draw commands.
Cases 1 and 4 are probably too complex to allow easy extraction.
You could reconstruct the form data for cases 2 and 3 (radio buttons or check boxes) as follows:-
Embedded character extraction
1. Identify the 2 characters used.
2. Do a page search for the characters.
This would give you the locations of all the boxes, and which of the two characters appears at each location tells you its status.
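The two steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Glyph` record, the `find_boxes` helper and the two Unicode box characters are hypothetical stand-ins. In a real file you would get the glyphs and their positions from whatever text extraction tool you use, and you would identify the two characters by inspecting the embedded font.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Glyph:
    """One extracted character and its page position (hypothetical structure)."""
    char: str
    x: float
    y: float


def find_boxes(glyphs, checked_char, blank_char):
    """Return (x, y, is_checked) for every glyph that is one of the two box characters."""
    boxes = []
    for g in glyphs:
        if g.char == checked_char:
            boxes.append((g.x, g.y, True))
        elif g.char == blank_char:
            boxes.append((g.x, g.y, False))
    return boxes


# Made-up page content: one ordinary letter, one ticked box, one blank box.
page_glyphs = [
    Glyph("Y", 50, 700),
    Glyph("\u2611", 72, 700),  # assumed "ticked box" character
    Glyph("\u2610", 72, 680),  # assumed "blank box" character
]
result = find_boxes(page_glyphs, "\u2611", "\u2610")
```

Each entry in `result` pairs a box location with its status. Matching those locations back to field names is a separate problem, because flattening discards the names.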
Blank box drawn over
1. Identify all the blank boxes. This will give you the locations of all the boxes.
2. See if an XForm co-ordinate falls inside the box (i.e. it is checked). This will give you the status.
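The second step is a point-in-rectangle test, which might look roughly like the sketch below. It assumes you have already recovered each blank box as a rectangle and each XForm as a single (x, y) point in the same page co-ordinates. The `Box` class and `box_states` function are hypothetical names for illustration, not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Bounds of one blank box in page co-ordinates (hypothetical structure)."""
    x: float
    y: float
    width: float
    height: float

    def contains(self, px, py):
        # True when the point lies inside or on the edge of the box.
        return (self.x <= px <= self.x + self.width
                and self.y <= py <= self.y + self.height)


def box_states(boxes, mark_points):
    """Return one boolean per box: True if any mark point falls inside it."""
    return [any(box.contains(px, py) for px, py in mark_points) for box in boxes]


# Made-up data: two blank boxes and a single tick drawn over the first one.
boxes = [Box(72, 700, 10, 10), Box(72, 680, 10, 10)]
marks = [(75, 703)]
states = box_states(boxes, marks)
```

In practice the tick may be drawn slightly offset from the box, so you may want to grow each rectangle by a small tolerance before testing.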
So it is a non-trivial task, and the approach may vary from file to file, but it is possible. If you have access to the original non-flattened PDF files, though, it is a lot easier.
This post is part of our “Understanding the PDF File Format” series. In each article, we discuss a PDF feature, bug, gotcha or tip.