Update: JPDF2HTML5 has been rebranded as BuildVu and JPDFForms has been rebranded as FormVu.

Search PDF Files With Regular Expressions – Customizing Your Search of PDF files

With our update to the PDF search code, we now route all our search functionality through the Java regular expressions engine. This has let us build some clever search features by adding regular expression syntax to search terms. What follows are the different search features we have set up using regular expression symbols and how we have achieved them in our own software. It shows just how flexible regular expressions can be and should give you some ideas for your own usage.

Search for Whole Words Only

To ensure each result we find is not part of a larger word, we can surround the search term with word boundary anchors ( \b ).
e.g. \bsearch term\b
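
As a rough illustration of the whole-word option, here is a minimal sketch using java.util.regex. The class and method names (WholeWordSearch, wholeWord, findStarts) are made up for this example and are not part of JPedal. It also assumes the term starts and ends with a word character, since \b behaves differently next to punctuation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only (not JPedal API).
class WholeWordSearch {

    // Wraps the term in \b anchors so matches inside longer words are rejected.
    static Pattern wholeWord(String term) {
        return Pattern.compile("\\b" + term + "\\b");
    }

    // Returns the start offset of every whole-word match in the text.
    static List<Integer> findStarts(String term, String text) {
        List<Integer> starts = new ArrayList<>();
        Matcher m = wholeWord(term).matcher(text);
        while (m.find()) {
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        // "research" and "searching" are skipped; prints [0, 26]
        System.out.println(findStarts("search", "search research searching search."));
    }
}
```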

Multi Line Results

In order to find search terms that are split across multiple lines, we have set the regular expressions engine to look for them. We then replace every instance of a space in the search term with a character class containing a space and a new line character ( [ \n] ).
This allows a search result to be found across multiple lines.
e.g. search[ \n]term
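
The space replacement could look something like the sketch below. The names MultiLineSearch, acrossLines and found are hypothetical, and the sketch only covers the substitution step, not whatever engine flags the real search code sets.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only (not JPedal API).
class MultiLineSearch {

    // Replaces each space in the term with the character class [ \n],
    // so a space in the term can match either a space or a line break.
    static Pattern acrossLines(String term) {
        return Pattern.compile(term.replace(" ", "[ \\n]"));
    }

    // True if the term occurs in the text, on one line or split over two.
    static boolean found(String term, String text) {
        return acrossLines(term).matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(found("search term", "a search\nterm here")); // true
        System.out.println(found("search term", "a searchterm here"));   // false
    }
}
```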

Use Reg Ex

Having this flag on doesn’t actually do anything to the search term. We only modify the search term if we don’t want to use regular expressions. We surround the search term with a \Q at the start and a \E at the end. We then put a \E at the start of each space / separator and a \Q at the end of it. This makes the engine treat anything in the search term that could be a regular expression symbol as literal text, whilst still allowing the regular expressions required for the other options.
e.g. \Qsearch\E[ \n]\Qterm\E
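
A minimal sketch of this quoting step, combined with the multi line separator, might look like the following. LiteralSearch and literalPattern are invented names for the example. It assumes the term itself does not contain the sequence \E, which would end the quoted section early; java.util.regex.Pattern.quote handles that case if you quote each word separately.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only (not JPedal API).
class LiteralSearch {

    // Quotes the whole term with \Q...\E, but closes the quote before each
    // space and reopens it afterwards so [ \n] still acts as a regex.
    // Assumes the term does not itself contain "\E".
    static String literalPattern(String term) {
        return "\\Q" + term.replace(" ", "\\E[ \\n]\\Q") + "\\E";
    }

    public static void main(String[] args) {
        // prints \Qsearch\E[ \n]\Qterm\E
        System.out.println(literalPattern("search term"));

        // The dot and brackets are matched literally, not as regex symbols.
        Pattern p = Pattern.compile(literalPattern("a.b (c)"));
        System.out.println(p.matcher("a.b\n(c)").find()); // true
        System.out.println(p.matcher("axb (c)").find());  // false
    }
}
```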

Teaser Generation

When we perform a search within the JPedal PDF viewer we also generate what we call a teaser. A teaser is a segment of text around the search result, displayed to give a rough idea of the context of the result. To generate these we perform two searches, one for the result and another for the teaser, which will find the search term along with two words before and after the result. This can be done by adding (?:\S+\s)?\S*(?:\S+\s)?\S* before the search term and \S*(?:\s\S+)?\S*(?:\s\S+)? after.
e.g. (?:\S+\s)?\S*(?:\S+\s)?\S*search term\S*(?:\s\S+)?\S*(?:\s\S+)?
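
To show the teaser expression in action, here is a small sketch that wraps a term in the prefix and suffix above and returns the first match. TeaserSearch and teaser are hypothetical names, and the term is assumed to already be a valid regular expression (or to have been quoted beforehand).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only (not JPedal API).
class TeaserSearch {

    // Up to two words of context before and after the term, as in the article.
    static final String BEFORE = "(?:\\S+\\s)?\\S*(?:\\S+\\s)?\\S*";
    static final String AFTER = "\\S*(?:\\s\\S+)?\\S*(?:\\s\\S+)?";

    // Returns the first teaser found in the text, or null if the term is absent.
    // The term is used as-is, so it must already be a valid regex.
    static String teaser(String term, String text) {
        Matcher m = Pattern.compile(BEFORE + term + AFTER).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "the quick brown fox performs a search term lookup in this file";
        // prints: performs a search term lookup in
        System.out.println(teaser("search term", text));
    }
}
```

Because every context group is optional, a result at the very start or end of the text still produces a teaser, just with less surrounding text.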

What clever regular expressions have you used in your searches?

This article is part of our Search PDF Files With Regular Expressions series. The articles in this series cover our use of regular expressions with JPedal in order to search PDF files. By using the link above you will find the other articles in the series.

Kieran France is a programmer for IDRSolutions. He enjoys tinkering with most things including gadgets, code and electronics. He spends his time working on the JPedal library and our internal test suite.