Like many of my best articles, this post was inspired by a question on our help forums. We try to answer all questions, and we often use a blog post to give a more detailed reply when the topic is of general interest.
The question related to extracting form data from a PDF file. The PDF file format stores interactive form data apart from the general page data, in a set of separate objects, so it is very easy to extract from the PDF file. However, it is possible to ‘flatten the forms’. I explained this in more detail in a previous post, “What is form flattening”.
Once this has been done, the forms no longer exist: they are just text or shapes inside the stream of commands used to draw the PDF, and you cannot interact with them. So is it still possible to extract the data?
The problem is that there is no single defined way to flatten Form data into a PDF file. The data can be stored in the PDF in several possible ways:-
1. As ordinary text (especially for Form objects which show text data, such as comboBoxes, Lists, and Text fields).
2. As a special embedded character (especially for radio buttons or checked boxes where you can define one character as a ticked box and one character as a blank box).
3. As a combination of a text character for the blank box, with the checks or ticks then drawn on top for checked boxes.
4. Purely as a set of draw commands.
Options 1 and 4 are probably too complex to enable easy extraction.
You could reconstruct the form data for 2 and 3 for radio buttons or check boxes as follows:-
Embedded character extraction
1. Identify the 2 characters used.
2. Do a page search for the characters.
This would give you the locations of all the boxes, and which of the two characters appears at each location gives you the status.
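The two steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Glyph` record and the `find_checkbox_states` function are hypothetical names, not part of any PDF library, and the glyph list stands in for whatever your text extractor reports for the page. The two characters used here (U+2611 and U+2610) are also just an example, as the real ones depend on the font embedded in your file.

```python
from dataclasses import dataclass


# Hypothetical record for one extracted character and its page position.
# A real text extractor would supply something similar.
@dataclass(frozen=True)
class Glyph:
    char: str
    x: float
    y: float


def find_checkbox_states(glyphs, ticked_char, blank_char):
    """Return (x, y, is_checked) for each glyph that is one of the two box characters."""
    states = []
    for glyph in glyphs:
        if glyph.char == ticked_char:
            states.append((glyph.x, glyph.y, True))
        elif glyph.char == blank_char:
            states.append((glyph.x, glyph.y, False))
        # Any other character is ordinary page text and is ignored.
    return states


# Assumed example data: one ordinary letter, one ticked box, one blank box.
page_glyphs = [
    Glyph("Y", 50.0, 700.0),
    Glyph("\u2611", 72.0, 700.0),
    Glyph("\u2610", 72.0, 680.0),
]

boxes = find_checkbox_states(page_glyphs, "\u2611", "\u2610")
for x, y, is_checked in boxes:
    print(f"box at ({x}, {y}) checked={is_checked}")
```

Matching each box back to a field name is a separate problem, since a flattened file no longer carries the field names; you would probably need to pair each location with nearby label text.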
Blank box drawn over
1. Identify all the blank boxes. This will give you the locations of all the boxes.
2. See if an XForm co-ordinate falls inside the box (i.e. it is checked). This will give you the status.
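The check in step 2 is a simple point-in-rectangle test. Here is a minimal Python sketch, assuming you have already recovered the bounding rectangle of each blank box and the co-ordinate at which each tick was drawn. The `Box` class and `checked_status` function are hypothetical names for illustration, and the co-ordinates are made-up values in page units with the origin at the bottom left.

```python
from dataclasses import dataclass


# Hypothetical bounding rectangle of a blank box, with (x, y) as its
# bottom-left corner.
@dataclass(frozen=True)
class Box:
    x: float
    y: float
    width: float
    height: float

    def contains(self, px, py):
        """True if the point (px, py) lies inside or on the edge of the box."""
        return (self.x <= px <= self.x + self.width
                and self.y <= py <= self.y + self.height)


def checked_status(boxes, mark_points):
    """For each box, report whether any drawn mark falls inside it."""
    return [
        any(box.contains(px, py) for px, py in mark_points)
        for box in boxes
    ]


# Assumed example data: two blank boxes and a single tick drawn over the first.
blank_boxes = [Box(72.0, 700.0, 10.0, 10.0), Box(72.0, 680.0, 10.0, 10.0)]
marks = [(76.5, 704.0)]

status = checked_status(blank_boxes, marks)
print(status)
```

In a real file the tick may be positioned slightly outside the box it belongs to, so you may want to grow each rectangle by a small tolerance before testing.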
So it is a non-trivial task, and it may vary from file to file, but it is possible. If you have access to the original non-flattened PDF files, though, it is a lot easier.
This post is part of our “Understanding the PDF File Format” series. In each article, we discuss a PDF feature, bug, gotcha or tip. If you wish to learn more about PDF, we have 13 years' worth of PDF knowledge and tips, so click here to visit our series index!