With our update to the PDF search code, we now route all our search functionality through the Java regular expressions engine. This has let us offer some clever search features by adding regular expression syntax to search terms. What follows are the different search features we have set up using regular expression symbols and how we have achieved them in our own software. It shows just how flexible regular expressions can be and should give you some ideas for your own usage.
Search for Whole Words Only
To ensure each result we find is not part of a larger word we can surround the search term with word boundary tags ( \b ).
e.g. \bsearch term\b
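As a minimal sketch of the idea (the class and method names here are our own illustration, not JPedal code), the snippet below wraps a term in \b and compares the match positions with an unwrapped search. It assumes the term itself contains no regular expression symbols.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class WholeWordSearch {

    // Surrounds the term with word boundaries so it only matches as a whole word.
    static String wholeWord(String term) {
        return "\\b" + term + "\\b";
    }

    // Returns the start offset of every match of the regex in the text.
    static List<Integer> findStarts(String regex, String text) {
        List<Integer> starts = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        String text = "search, research and searching";
        // Plain search also finds the term inside "research" and "searching".
        System.out.println(findStarts("search", text));            // [0, 10, 21]
        // With word boundaries only the standalone word is found.
        System.out.println(findStarts(wholeWord("search"), text)); // [0]
    }
}
```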
Multi Line Results
In order to find search terms that are split across multiple lines, we have set the regular expressions engine up to look for them. We then replace every instance of a space in the search term with a character class containing a space and a new line character ( [ \n] ).
This allows a search result to be found across multiple lines.
e.g. search[ \n]term
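The replacement can be sketched in a few lines of Java. The names below are hypothetical, and the example assumes lines are separated by a plain \n character. Text using \r\n line endings would need a wider character class.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class MultiLineSearch {

    // Replaces each space in the term with the character class [ \n],
    // so a line break is accepted wherever a space was typed.
    static String acrossLines(String term) {
        return term.replace(" ", "[ \\n]");
    }

    // True if the regex matches anywhere in the text.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        String text = "the search\nterm wraps onto a second line";
        System.out.println(acrossLines("search term"));              // search[ \n]term
        System.out.println(found("search term", text));              // false
        System.out.println(found(acrossLines("search term"), text)); // true
    }
}
```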
Use Reg Ex
Having this flag on doesn’t actually do anything to the search term. We only alter the search term if we don’t want to use regular expressions. In that case we surround the search term with a \Q at the start and a \E at the end. We then put a \E at the start of each space / separator and a \Q at the end of it. This makes the engine treat anything in the search term that could be a regular expression symbol as literal text, whilst still allowing the regular expressions required for the other options.
e.g. \Qsearch\E[ \n]\Qterm\E
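One way to sketch this in Java is shown below. It is not our production code: the class and method names are made up for the example, and it uses the standard Pattern.quote method, which produces the same \Q...\E wrapping and also copes with a term that itself contains \E.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class LiteralSearch {

    // Quotes each word of the term so regex symbols are treated literally,
    // and joins the words with [ \n] so the multi line option still works.
    static String literalAcrossLines(String term) {
        String[] words = term.split(" ", -1);
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < words.length; i++) {
            if (i > 0) {
                regex.append("[ \\n]");
            }
            regex.append(Pattern.quote(words[i]));
        }
        return regex.toString();
    }

    // True if the regex matches anywhere in the text.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(literalAcrossLines("search term")); // \Qsearch\E[ \n]\Qterm\E

        String regex = literalAcrossLines("a.b (c)");
        System.out.println(found(regex, "x a.b\n(c) y")); // true, the dot and brackets are literal
        System.out.println(found(regex, "x aXb (c) y"));  // false, the dot no longer matches any character
    }
}
```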
When we perform a search within the JPedal PDF viewer we also generate what we call a teaser. A teaser is a segment of text around the search result, displayed to give a rough idea of the context of the result. To generate these we perform two searches, one for the result and another for the teaser, which finds the search term along with up to two words before and after it. This can be done by adding (?:\S+\s)?\S*(?:\S+\s)?\S* before the search term and \S*(?:\s\S+)?\S*(?:\s\S+)? after.
e.g. (?:\S+\s)?\S*(?:\S+\s)?\S*search term\S*(?:\s\S+)?\S*(?:\s\S+)?
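The two searches can be sketched as follows. The class, constant and method names are hypothetical, and the sample sentence is made up. Only the two pattern fragments come from the text above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class TeaserSearch {

    // Fragments placed before and after the search term to pick up surrounding words.
    static final String BEFORE = "(?:\\S+\\s)?\\S*(?:\\S+\\s)?\\S*";
    static final String AFTER = "\\S*(?:\\s\\S+)?\\S*(?:\\s\\S+)?";

    // Builds the teaser regex around an already prepared search regex.
    static String teaserRegex(String searchRegex) {
        return BEFORE + searchRegex + AFTER;
    }

    // Returns the text of the first match, or null if there is none.
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "we can find the search term in any document";
        // First search: the result itself.
        System.out.println(firstMatch("search term", text));
        // Second search: the result plus its surrounding words.
        System.out.println(firstMatch(teaserRegex("search term"), text)); // find the search term in any
    }
}
```

Because every surrounding group is optional, the teaser still matches when the result sits at the very start or end of the text, just with fewer words of context.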
What clever regular expressions have you used in your searches?
This article is part of our Search PDF Files With Regular Expressions series. The articles in this series cover our use of regular expressions with JPedal in order to search PDF files. By using the link above you will find the other articles in the series.