PDF search

I have recently seen some very strange discussions about PDF search in several places, so I felt a blog post would be useful.

Firstly, you cannot generally search a PDF document directly. You cannot just grep the file! There are four reasons for this:

1. The text content is often stored inside compressed binary objects (streams), so it is encoded and invisible to a plain text search.

2. Even if it is not encoded, it is often not in a searchable format. The text is not necessarily assembled in the correct reading order, and it is not really even text. It is a sequence of binary values used to look up glyphs in a font, which coincidentally happens to look like text if WinAnsi encoding is used.

3. It may often contain other information inside it, such as tracking (spacing) values, which means you are unlikely to find the actual text with your PDF search. For example, "PDF Search" might be stored as something like PDF(100)S(10)earch.

4. Even if you could find it, the values you would get back would not be very meaningful. All you would know is that it is at a certain offset from the start of the PDF file or in a certain PDF object. What you really want is the page number and co-ordinates.
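To make the first and third reasons concrete, here is a minimal Python sketch. It assumes a tiny hand-written content stream (not taken from a real file) that shows "PDF Search" with the TJ operator, and it compresses that stream with zlib, which is what the common FlateDecode filter uses. The helper names grep and text_from_tj are made up for this illustration, and the regular expressions ignore escaped brackets, hex strings and font encodings, so this is not a substitute for a real PDF parser.

```python
import re
import zlib

# Hand-written example content stream (an assumption for illustration).
# The TJ operator takes an array of string fragments and spacing numbers.
content = b"BT /F1 12 Tf 72 720 Td [(PDF )100(S)10(earch)] TJ ET"

# Many PDFs store content streams compressed with FlateDecode (zlib/deflate).
compressed = zlib.compress(content)


def grep(data: bytes, needle: bytes) -> bool:
    """Naive byte search, which is all that grep on the raw file can do."""
    return needle in data


def text_from_tj(stream: bytes) -> str:
    """Join the string fragments of each TJ array, dropping the spacing numbers.

    Simplified: does not handle escaped brackets, hex strings or font encodings.
    """
    parts = []
    for array in re.findall(rb"\[(.*?)\]\s*TJ", stream, re.S):
        parts.extend(re.findall(rb"\((.*?)\)", array))
    return b"".join(parts).decode("latin-1")


# Reason 3: even uncompressed, the phrase is broken up by spacing values.
print(grep(content, b"PDF Search"))
# Reason 1: once compressed, even a fragment is no longer visible as bytes.
print(grep(compressed, b"earch"))
# After decompressing and reassembling the fragments, the text is searchable.
print(text_from_tj(zlib.decompress(compressed)))
```

Only the last step, which decodes the stream and rebuilds the string, produces something you can actually search, and even then it tells you nothing about the page or position.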

So you really do need to parse the raw PDF content and convert it into textual data. You need a PDF library to do this (Acrobat has some nice search features, and there is a library capable of PDF search for just about every language and platform). You can then either dump this text into a raw format to scan, or use a PDF viewer, most of which will give you access to the page number and actual co-ordinates of the text.
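As a rough sketch of what searching the extracted text can look like, the Python below assumes a PDF library has already handed you each line of text together with its page number and co-ordinates. The TextLine and Hit classes and the search function are hypothetical names invented for this example, not the API of JPedal or any other library, and the co-ordinates returned are those of the start of the line rather than of the matched word.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class TextLine:
    """One extracted line of text (hypothetical structure for illustration)."""
    page: int
    x: float
    y: float
    text: str


@dataclass(frozen=True)
class Hit:
    """A search result: where the line is, and where the match starts in it."""
    page: int
    x: float
    y: float
    offset: int


def search(lines: Iterable[TextLine], term: str) -> List[Hit]:
    """Case-insensitive search over extracted lines, keeping page and position."""
    needle = term.lower()
    hits = []
    if not needle:
        return hits
    for line in lines:
        haystack = line.text.lower()
        start = haystack.find(needle)
        while start != -1:
            hits.append(Hit(line.page, line.x, line.y, start))
            start = haystack.find(needle, start + 1)
    return hits


# Made-up sample data standing in for the output of a PDF library.
sample = [
    TextLine(1, 72.0, 720.0, "PDF Search"),
    TextLine(1, 72.0, 700.0, "You cannot just grep the file"),
    TextLine(2, 72.0, 720.0, "Search results need a page number"),
]

for hit in search(sample, "search"):
    print(hit.page, hit.x, hit.y, hit.offset)
```

Keeping the page and position attached to every piece of text is the design choice that matters here, because it is exactly the information a raw file offset cannot give you.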

If you are interested in using JPedal for PDF search, we have just revamped the search page with lots of examples, tutorials and hints.

This post is part of our “Understanding the PDF File Format” series. In each article, we aim to take a specific PDF feature and explain it in simple terms. If you wish to learn more about PDF, we have 13 years' worth of PDF knowledge and tips, so click here to visit our series index!

 


Mark Stephens

System Architect and Lead Developer at IDRSolutions
Mark Stephens has been working with Java and PDF since 1999 and has diversified into HTML5, SVG and JavaFX. He also enjoys speaking at conferences and has been a Speaker at user groups, Business of Software, Seybold and JavaOne conferences. He has a very dry sense of humor and an MA in Medieval History for which he has not yet found a practical use.
