Kieran France is a programmer for IDRSolutions in charge of their internal test suite. In his spare time he enjoys tinkering with gadgets and code.

Search PDF Files With Regular Expressions – Customizing Your Search of PDF files


With our update to the PDF search code, we now route all our search functionality through the Java regular expressions engine. This has let us build some clever search features by adding regular expressions to search terms. What follows are the different search features we have set up using regular expression symbols, and how we have achieved them in our own software. It shows just how flexible regular expressions can be and should give you some ideas for your own usage.

Search for Whole Words Only

To ensure each result we find is not part of a larger word, we surround the search term with word boundary anchors ( \b ).
e.g. \bsearch term\b
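A minimal sketch of how this could look in Java. The class and method names (WholeWordSearch, wholeWord, findStarts) are illustrative and not taken from JPedal, and the sketch assumes the search term contains no other regular expression symbols.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WholeWordSearch {

    // Surrounds the term with \b so it cannot match inside a larger word.
    static String wholeWord(String term) {
        return "\\b" + term + "\\b";
    }

    // Collects the start offset of every match of the regex in the text.
    static List<Integer> findStarts(String regex, String text) {
        List<Integer> starts = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            starts.add(matcher.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        String text = "search term, research term, search terms";

        // Without boundaries the term is also found inside "research term"
        // and "search terms".
        System.out.println(findStarts("search term", text));            // [0, 15, 28]

        // With boundaries only the standalone occurrence is reported.
        System.out.println(findStarts(wholeWord("search term"), text)); // [0]
    }
}
```

Note that \b only matches between a word character and a non-word character, so a term that starts or ends with punctuation may behave differently from what a user expects.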

Multi Line Results

In order to find search terms that are split across multiple lines, we have set the regular expressions engine to look for them. We replace every space in the search term with a character class containing a space and a newline character ( [ \n] ).
This allows a search result to be found even when it spans multiple lines.
e.g. search[ \n]term
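The replacement described above can be sketched in a few lines of Java. The names MultiLineSearch, acrossLines and foundIn are hypothetical, and the sketch assumes lines are separated by a plain \n, as in the article's example.

```java
import java.util.regex.Pattern;

class MultiLineSearch {

    // Replaces each space in the term with a class matching a space or a newline.
    static String acrossLines(String term) {
        return term.replace(" ", "[ \\n]");
    }

    // True if the term occurs in the text, on one line or split across two.
    static boolean foundIn(String term, String text) {
        return Pattern.compile(acrossLines(term)).matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(acrossLines("search term"));                    // search[ \n]term
        System.out.println(foundIn("search term", "a search term here"));  // true
        System.out.println(foundIn("search term", "a search\nterm here")); // true
    }
}
```

Text that uses \r\n line endings would need a wider character class; that case is outside what is described here.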

Use Reg Ex

Having this flag on doesn’t actually do anything to the search term. We only alter the search term if we don’t want to use regular expressions. In that case we surround the search term with a \Q at the start and a \E at the end. We then put a \E at the start of each space / separator and a \Q at the end of it. This makes the engine treat anything in the search term that could be a regular expression symbol as literal text, whilst still allowing the regular expressions required for the other options.
e.g. \Qsearch\E[ \n]\Qterm\E
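A rough sketch of the quoting step, combined with the multi-line separator from the previous section. The names LiteralSearch and literalAcrossLines are illustrative only, and the sketch assumes the search term does not itself contain the sequence \E, which would end the quoted section early.

```java
import java.util.regex.Pattern;

class LiteralSearch {

    // Quotes the whole term with \Q...\E, closing and reopening the quote
    // around each space so the [ \n] separator stays a live regex.
    static String literalAcrossLines(String term) {
        return "\\Q" + term.replace(" ", "\\E[ \\n]\\Q") + "\\E";
    }

    public static void main(String[] args) {
        String regex = literalAcrossLines("a.b c");
        System.out.println(regex); // \Qa.b\E[ \n]\Qc\E

        Pattern pattern = Pattern.compile(regex);
        // The dot is literal, so "axb c" is not a match.
        System.out.println(pattern.matcher("axb c").find());   // false
        // The separator still matches a newline.
        System.out.println(pattern.matcher("a.b\nc").find());  // true
    }
}
```

If the term might contain \E, quoting each word with Pattern.quote, which handles that case, would be a safer variation on the same idea.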

Teaser Generation

When we perform a search within the JPedal PDF viewer we also generate what we call a teaser. A teaser is a segment of text around the search result, displayed to give a rough idea of the context of the result. To generate these we perform two searches: one for the result, and another for the teaser, which finds the search term plus up to two words before and after it. This can be done by adding (?:\S+\s)?\S*(?:\S+\s)?\S* before the search term and \S*(?:\s\S+)?\S*(?:\s\S+)? after.
e.g. (?:\S+\s)?\S*(?:\S+\s)?\S*search term\S*(?:\s\S+)?\S*(?:\s\S+)?
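To show the teaser expression at work, here is a small sketch that wraps a search term in the prefix and suffix given above and returns the first teaser found. The class, constant and method names are made up for illustration and are not JPedal's API; the term is passed in as an already prepared regular expression.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TeaserSearch {

    // Up to two words of context before the term (prefix quoted in the article).
    static final String BEFORE = "(?:\\S+\\s)?\\S*(?:\\S+\\s)?\\S*";

    // Up to two words of context after the term (suffix quoted in the article).
    static final String AFTER = "\\S*(?:\\s\\S+)?\\S*(?:\\s\\S+)?";

    // Builds the teaser expression around an already prepared term regex.
    static String teaserPattern(String termRegex) {
        return BEFORE + termRegex + AFTER;
    }

    // Returns the first teaser in the text, or null if the term is not found.
    static String firstTeaser(String termRegex, String text) {
        Matcher matcher = Pattern.compile(teaserPattern(termRegex)).matcher(text);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        String text = "one two three search term four five six";
        System.out.println(firstTeaser("search term", text));
        // two three search term four five
    }
}
```

Because every context group is optional, a term at the very start or end of the text still produces a teaser, just with fewer surrounding words.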

What clever regular expressions have you used in your searches?

This article is part of our Search PDF Files With Regular Expressions series. The articles in this series cover our use of regular expressions with JPedal in order to search PDF files. By using the link above you will find the other articles in the series.


IDRsolutions Ltd 2020. All rights reserved.