Mark Stephens has been working with Java and PDF since 1999 and has diversified into HTML5, SVG and JavaFX. He also enjoys speaking at conferences and has been a Speaker at user groups, Business of Software, Seybold and JavaOne conferences. He has a very dry sense of humor and an MA in Medieval History for which he has not yet found a practical use.

3 steps to finding a range of numeric values on a PDF page


Over the holiday I read an interesting question asking how to find numbers within a specific range on a PDF page. This set me thinking…

Our perception is very much affected by the tools we generally use. If we spend our lives in spreadsheets, we think of pages as lots of cells which can have a type. If we use XML, we expect everything to be nested and tagged.

The PDF format is very much an output format, so it looks great but there is often little or no metadata. There are no numbers, strings or other typed data objects. It is all text on the page. This does not mean we cannot search for specific types, but we have to alter our thinking. So here is how I would find values within a number range on a PDF page.

1. Convert the PDF page text data into a wordlist. This will give you all the words and their position on the page. If you want to use JPedal, see PDF to text as a word list.

2. Ignore all values which are clearly not numbers (numbers can only contain the characters 0-9, comma, decimal point and a plus or minus sign).

3. This will give us a set of possible values and locations. We can then convert them to numbers with Integer.parseInt(str) (or Double.parseDouble(str) for values with a decimal point), after removing any commas, and see if they fall within our range.
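Steps 2 and 3 can be sketched in plain Java as follows. This is only a minimal illustration under some assumptions: the `Word` class is a hypothetical stand-in for whatever your word-list extraction returns (it is not JPedal's API), `NumberRangeFinder` and `findInRange` are names invented for this example, and commas are treated as thousands separators, which will not suit every locale.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class NumberRangeFinder {

    /** Hypothetical word-list entry: the extracted text plus its position on the page. */
    public static final class Word {
        public final String text;
        public final float x;
        public final float y;

        public Word(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Step 2: optional sign, digits (optionally grouped by commas), optional decimal part.
    private static final Pattern NUMERIC =
            Pattern.compile("[+-]?(\\d{1,3}(,\\d{3})+|\\d+)(\\.\\d+)?");

    /** Returns the words whose text parses to a number in [min, max]. */
    public static List<Word> findInRange(List<Word> words, double min, double max) {
        List<Word> hits = new ArrayList<>();
        for (Word word : words) {
            String text = word.text.trim();
            if (!NUMERIC.matcher(text).matches()) {
                continue; // clearly not a number, so ignore it
            }
            // Step 3: strip the grouping commas (assumed to be thousands separators) and parse.
            double value = Double.parseDouble(text.replace(",", ""));
            if (value >= min && value <= max) {
                hits.add(word);
            }
        }
        return hits;
    }
}
```

Because each match keeps its `x` and `y` values, the caller can still highlight or report where on the page the number was found. Double.parseDouble is used here rather than Integer.parseInt so that values with a decimal point are not rejected.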

So it is perfectly possible to find a numeric range of values on a PDF page, although it is not as easy as in Excel. Or do you have a better solution?
