Kieran France is a programmer for IDRSolutions. He enjoys tinkering with most things, including gadgets, code and electronics. He spends his time working on the JPedal library and our internal test suite.

Search PDF Files With Regular Expressions – Customizing Your Search of PDF files


With our update to the PDF search code, we now route all our search functionality through the Java regular expressions engine. This has let us offer some clever search features by adding regular expressions to search terms. What follows are the different search features we have set up using regular expression symbols and how we have achieved them in our own software. It shows just how flexible regular expressions can be and should give you some ideas for your own usage.

Search for Whole Words Only

To ensure each result we find is not part of a larger word we can surround the search term with word boundary tags ( \b ).
e.g. \bsearch term\b
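
As a rough illustration of the whole-word option, the sketch below wraps a term in \b and compares the number of hits with and without it. The class and method names (WholeWordSearch, wholeWord, countMatches) are made up for this example and are not part of JPedal.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WholeWordSearch {
    // Hypothetical helper: surround the term with \b word boundaries.
    // The term is assumed to be plain text with no regex symbols in it.
    static Pattern wholeWord(String term) {
        return Pattern.compile("\\b" + term + "\\b");
    }

    // Count how many times the pattern is found in the text.
    static int countMatches(Pattern pattern, String text) {
        Matcher matcher = pattern.matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "search, research and searching";
        // Without boundaries the term is also found inside longer words.
        System.out.println(countMatches(Pattern.compile("search"), text)); // 3
        // With boundaries only the standalone word is found.
        System.out.println(countMatches(wholeWord("search"), text));       // 1
    }
}
```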

Multi Line Results

In order to find search terms that are split across multiple lines, we have set the regular expressions engine to look for them. We then replace every instance of a space in the search term with a character class containing a space and a newline character ( [ \n] ).
This allows a search result to be found across multiple lines.
e.g. search[ \n]term
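
The space replacement can be sketched in a few lines of Java. This is only a minimal example under the assumption that the term is plain text; the names MultiLineSearch and acrossLines are invented here for illustration.

```java
import java.util.regex.Pattern;

class MultiLineSearch {
    // Hypothetical helper: every space in the term becomes [ \n],
    // so the gap between words may be a space or a line break.
    static Pattern acrossLines(String term) {
        return Pattern.compile(term.replace(" ", "[ \\n]"));
    }

    public static void main(String[] args) {
        Pattern pattern = acrossLines("search term");
        System.out.println(pattern.pattern());                             // search[ \n]term
        System.out.println(pattern.matcher("a search\nterm here").find()); // true
        System.out.println(pattern.matcher("a search term here").find());  // true
    }
}
```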

Use Reg Ex

Having this flag on doesn’t actually do anything to the search term. We only alter the search term if we don’t want to use regular expressions. We surround the search term with a \Q at the start and a \E at the end. We then put a \E at the start of each space / separator and a \Q at the end of it. This lets the search term ignore anything that may be used as a regular expression symbol whilst still being able to use the regular expressions required for other options.
e.g. \Qsearch\E[ \n]\Qterm\E
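
One possible way to build such a pattern is sketched below. It follows the description above rather than the actual JPedal code, and the names LiteralSearch and literalAcrossLines are hypothetical.

```java
import java.util.regex.Pattern;

class LiteralSearch {
    // Hypothetical helper: quote the whole term with \Q...\E, but close and
    // reopen the quote around each space so [ \n] still works as a regex.
    // Assumes the term itself does not contain the two characters \E.
    static Pattern literalAcrossLines(String term) {
        String body = term.replace(" ", "\\E[ \\n]\\Q");
        return Pattern.compile("\\Q" + body + "\\E");
    }

    public static void main(String[] args) {
        Pattern pattern = literalAcrossLines("a.b (c)");
        System.out.println(pattern.pattern());                    // \Qa.b\E[ \n]\Q(c)\E
        // The dot and brackets are matched literally, the space may be a line break.
        System.out.println(pattern.matcher("a.b\n(c)").find());   // true
        System.out.println(pattern.matcher("axb (c)").find());    // false
    }
}
```

If a term might itself contain \E, the standard Pattern.quote method could be applied to each word instead, since it copes with that case.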

Teaser Generation

When we perform a search within the JPedal PDF viewer we also generate what we call a teaser. A teaser is a segment of text around the search result, displayed to give a rough idea of the context of the result. To generate these we perform two searches, one for the result and another for the teaser, which will find the search term and up to two words before and after the result. This can be done by appending (?:\S+\s)?\S*(?:\S+\s)?\S* before the search term and \S*(?:\s\S+)?\S*(?:\s\S+)? after.
e.g. (?:\S+\s)?\S*(?:\S+\s)?\S*search term\S*(?:\s\S+)?\S*(?:\s\S+)?
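
To show the teaser expression in action, here is a small sketch that reuses the prefix and suffix given above. The class and method names (TeaserSearch, teaser) are invented for the example, and the term is assumed to already be a valid regular expression.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TeaserSearch {
    // Prefix and suffix copied from the article: up to two words either side.
    static final String BEFORE = "(?:\\S+\\s)?\\S*(?:\\S+\\s)?\\S*";
    static final String AFTER = "\\S*(?:\\s\\S+)?\\S*(?:\\s\\S+)?";

    // Hypothetical helper: returns the first teaser found, if any.
    static Optional<String> teaser(String term, String text) {
        Matcher matcher = Pattern.compile(BEFORE + term + AFTER).matcher(text);
        return matcher.find() ? Optional.of(matcher.group()) : Optional.empty();
    }

    public static void main(String[] args) {
        String text = "You can find every search term inside the document.";
        // Prints: find every search term inside the
        System.out.println(teaser("search term", text).orElse("no match"));
    }
}
```

Because every part of the prefix and suffix is optional, a result at the very start or end of the text still produces a teaser, just with fewer surrounding words.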

What clever regular expressions have you used in your searches?

This article is part of our Search PDF Files With Regular Expressions series. The articles in this series cover our use of regular expressions with JPedal in order to search PDF files. By using the link above you will find the other articles in the series.

