Mark Stephens has been working with Java and PDF since 1999 and has diversified into HTML5, SVG and JavaFX. He also enjoys speaking at conferences and has been a speaker at user groups, Business of Software, Seybold and JavaOne conferences. He has a very dry sense of humor and an MA in Medieval History for which he has not yet found a practical use.

PDF search

1 min read

PDF search is a topic I have seen some very strange discussions on recently in several places, so I felt a blog post would be useful.

Firstly, you cannot generally do PDF search directly on a PDF document. You cannot just grep the file! There are four reasons for this:

1. The text content is often stored inside compressed binary objects, so it is encoded and invisible to a plain-text search.

2. Even if it is not, it is often not in a searchable format: the text is not assembled in the correct order and it is not really even text. It is a binary lookup for a value, which coincidentally happens to look like text if WinAnsi encoding is used.

3. It often contains other information, such as tracking (spacing) values, embedded inside it, which means you would be unlikely to find the actual text in your PDF search, e.g. PDF(100)S(10)earch

4. Even if you could find it, the values you would get back would not be very meaningful. All you would know is that it is at a certain offset from the start of the PDF file or in a certain PDF object. What you really want is the page number and co-ordinates.
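To make the first reason concrete, here is a minimal sketch using only the Java standard library. It assumes a content stream compressed with zlib/deflate (the algorithm behind PDF's commonly used FlateDecode filter); the class name, method names and the sample content stream are invented for illustration and are not taken from any real PDF or library.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical demo class: shows why a byte-level search misses compressed text.
class FlateGrepDemo {

    /** Compresses bytes with zlib/deflate, as a PDF writer might do for a stream. */
    static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Decompresses zlib/deflate data; this is the step a plain grep never performs. */
    static byte[] inflate(byte[] input) {
        Inflater inflater = new Inflater();
        inflater.setInput(input);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    break; // truncated or unsupported data: stop rather than spin
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("Not valid deflate data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    /** Naive byte-for-byte search, which is effectively what grep does. */
    static boolean containsBytes(byte[] haystack, byte[] needle) {
        outer:
        for (int i = 0; i + needle.length <= haystack.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (haystack[i + j] != needle[j]) {
                    continue outer;
                }
            }
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // Made-up content stream, purely for illustration.
        byte[] content = ("BT /F1 12 Tf 72 720 Td (PDF Search) Tj "
                + "0 -14 Td (is harder than it looks) Tj ET")
                .getBytes(StandardCharsets.ISO_8859_1);
        byte[] needle = "PDF Search".getBytes(StandardCharsets.ISO_8859_1);

        byte[] stored = deflate(content);
        System.out.println("Found in compressed bytes:   " + containsBytes(stored, needle));
        System.out.println("Found after decompressing:   " + containsBytes(inflate(stored), needle));
    }
}
```

The search on the compressed bytes should come back empty, while the same search after decompression succeeds. Even then, this only deals with reason 1; the decompressed data still has the ordering, encoding and spacing problems described in reasons 2 to 4.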

So you really do need to parse the raw PDF content and convert the raw data into textual data. You need a PDF library to do this (Acrobat has some nice search features, and there is a library capable of PDF search for just about every language/platform). You can then either dump this text into a raw format to scan, or use the library directly: most PDF viewers will allow you to access the page number and actual co-ordinates of the text.
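As a rough illustration of one small piece of that conversion, the sketch below reassembles the text from a single text-showing operand in which the string is split by spacing adjustments. It is a deliberate simplification and not how JPedal or any other library actually does it: the class and method names are hypothetical, the sample operand is made up, and it ignores font encodings, hex strings, octal escapes and text positioning, all of which a real library has to handle.

```java
// Hypothetical demo class: joins the string fragments of a TJ-style array.
class TjArrayDemo {

    /**
     * Collects the characters inside (...) literal strings and discards
     * everything else, including the numeric spacing adjustments between them.
     * Only the escapes \( \) and \\ are handled correctly; other escapes
     * (such as \n or octal codes) are an acknowledged gap in this sketch.
     */
    static String joinTjStrings(String operand) {
        StringBuilder text = new StringBuilder();
        int depth = 0; // nesting level of parentheses inside a literal string
        for (int i = 0; i < operand.length(); i++) {
            char c = operand.charAt(i);
            if (c == '\\' && depth > 0 && i + 1 < operand.length()) {
                text.append(operand.charAt(++i)); // keep the escaped character
            } else if (c == '(') {
                if (depth++ > 0) {
                    text.append(c); // nested parenthesis is part of the string
                }
            } else if (c == ')' && depth > 0) {
                if (--depth > 0) {
                    text.append(c);
                }
            } else if (depth > 0) {
                text.append(c);
            }
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Made-up operand: the phrase is broken up by spacing values.
        String raw = "[(PDF ) 100 (S) 10 (earch)] TJ";
        System.out.println("Raw contains phrase:  " + raw.contains("PDF Search"));
        System.out.println("Reassembled text:     " + joinTjStrings(raw));
    }
}
```

Searching the raw operand for the phrase fails because the spacing values sit between the fragments, whereas the reassembled string can be searched normally. Getting from here to a page number and co-ordinates needs the full page and text state, which is why a proper PDF library is the practical route.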

If you are interested in using JPedal for PDF search, we have just revamped the search page with lots of examples, tutorials and hints.

This post is part of our “Understanding the PDF File Format” series. In each article, we aim to take a specific PDF feature and explain it in simple terms. If you wish to learn more about PDF, we have 13 years' worth of PDF knowledge and tips, so click here to visit our series index!

