PDF search is a topic I have recently seen some very strange discussions about in several places, so I felt a blog post would be useful.
Firstly, you generally cannot search for text directly in the raw bytes of a PDF document. You cannot just grep the file! There are four reasons for this:
1. The text content is often stored inside compressed binary streams, so it is encoded and invisible to a plain byte search.
2. Even if it is not compressed, it is often not in a searchable format. The text is not necessarily assembled in the correct reading order, and it is not really even text: each value is a binary lookup that only coincidentally looks like text when WinAnsi encoding is used.
3. The text often has other information, such as tracking (spacing) values, mixed in with it, so the words you are looking for in your PDF search may be stored in fragments, i.e. PDF(100)S(10)earch.
4. Even if you could find it, the values you would get back would not be very meaningful. All you would know is that the text is at a certain offset from the start of the PDF file or in a certain PDF object. What you really want is the page number and co-ordinates.
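To make the first and third reasons concrete, here is a minimal Python sketch. It is an illustration only, not how any particular library works: the content stream is a hand-written, simplified example, the fragment layout is an assumed version of a TJ text-showing array, and the helper names naive_search and tj_text are made up for this post.

```python
import re
import zlib

# A simplified, hand-written page content stream (an assumption for this demo).
# The numbers between the string fragments are spacing adjustments, not text.
content = b"BT /F1 12 Tf 72 720 Td [(PDF)100( S)10(earch)] TJ ET"

# Content streams are commonly stored compressed (FlateDecode is zlib/deflate).
compressed = zlib.compress(content)


def naive_search(data: bytes, term: bytes) -> bool:
    """Roughly what grep does: look for the raw bytes of the term."""
    return term in data


def tj_text(stream: bytes) -> str:
    """Join the string fragments of every TJ array, dropping the numbers.

    Assumes simple literal strings with no escapes or nested parentheses.
    """
    parts = []
    for array in re.findall(rb"\[(.*?)\]\s*TJ", stream, flags=re.S):
        parts.extend(re.findall(rb"\((.*?)\)", array, flags=re.S))
    return b"".join(parts).decode("latin-1")


# Searching the compressed bytes finds nothing.
print(naive_search(compressed, b"PDF Search"))
# Searching the decompressed bytes still fails: the phrase is split up.
print(naive_search(content, b"PDF Search"))
# Only after reassembling the fragments does the phrase appear.
print(tj_text(zlib.decompress(compressed)))
```

Even this toy version only works because the font here is assumed to use a plain single-byte encoding. With a custom or multi-byte encoding, the reassembled bytes would still need mapping back to real characters.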
So you really do need to parse the raw PDF content and convert the raw data into textual data. You need a PDF library to do this (Acrobat has some nice search features, and there is a library capable of PDF search for just about every language/platform). You can then either dump this text into a raw format to scan, or use the page number and actual coordinates of the text, which most PDF viewers will let you access.
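As a rough idea of what "parse, then search" means, the following Python sketch walks a heavily simplified content stream and reports the page number and position of each match. Everything in it is an assumption made for illustration: the PAGES data is invented, the function names are hypothetical, and only the BT, Td, Tj and TJ operators are handled. A real PDF library also has to deal with fonts, encodings, transformation matrices and much more.

```python
import re
from typing import List, Tuple

# Hypothetical input: one already-decompressed content stream per page.
PAGES = [
    b"BT /F1 12 Tf 72 720 Td (Introduction) Tj ET",
    b"BT /F1 12 Tf 72 700 Td [(PDF)100( S)10(earch)] TJ 0 -20 Td (made simple) Tj ET",
]

# Tokens: an array [...], a literal string (...), or any other bare word.
TOKEN = re.compile(rb"\[.*?\]|\(.*?\)|[^\s\[\]()]+", re.S)
NUMBER = re.compile(rb"[-+]?\d*\.?\d+")


def page_text_runs(stream: bytes) -> List[Tuple[str, float, float]]:
    """Return (text, x, y) for each text-showing operator in the stream."""
    runs = []
    x = y = 0.0
    operands: List[bytes] = []
    for tok in TOKEN.findall(stream):
        # Strings, arrays, names and numbers are operands; collect them.
        if tok[:1] in (b"(", b"[", b"/") or NUMBER.fullmatch(tok):
            operands.append(tok)
            continue
        if tok == b"BT":            # begin text: reset the position
            x = y = 0.0
        elif tok == b"Td":          # move the text position by (tx, ty)
            x += float(operands[-2].decode("ascii"))
            y += float(operands[-1].decode("ascii"))
        elif tok in (b"Tj", b"TJ"):  # show text, ignoring spacing numbers
            parts = re.findall(rb"\((.*?)\)", operands[-1], re.S)
            runs.append((b"".join(parts).decode("latin-1"), x, y))
        operands.clear()
    return runs


def search(pages: List[bytes], term: str) -> List[Tuple[int, float, float]]:
    """Case-insensitive search returning (page number, x, y) per hit."""
    hits = []
    for page_number, stream in enumerate(pages, start=1):
        for text, x, y in page_text_runs(stream):
            if term.lower() in text.lower():
                hits.append((page_number, x, y))
    return hits


print(search(PAGES, "pdf search"))  # [(2, 72.0, 700.0)]
```

Note that the coordinates returned are those of the start of the text run containing the match, not of the matched word itself, and a phrase split across two runs would be missed. Handling those cases properly is exactly the sort of work a PDF library does for you.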
If you are interested in using JPedal for PDF search, we have just revamped the search page with lots of examples, tutorials and hints.
This post is part of our “Understanding the PDF File Format” series. In each article, we aim to take a specific PDF feature and explain it in simple terms. If you wish to learn more about PDF, we have 13 years’ worth of PDF knowledge and tips, so click here to visit our series index!