Can you determine whether a PDF contains searchable text without opening it? Yes, but you need software that can read the file programmatically. This is useful when you want to search a PDF for keywords without opening it manually in a viewer.
In this tutorial we will use the Java PDF library JPedal.
To search a PDF file for text without opening it:
1. First, download a copy of the JPedal jar and add it to your project.
2. Then call the API method that matches the kind of text you need:
Extract words on a page
ExtractTextAsWordList.writeAllWordlistsToDir("inputFileOrDirectory", "outputDir", -1);
Extract unstructured text
ExtractTextInRectangle extract = new ExtractTextInRectangle("inputFile.pdf");
extract.setOutputFormat(OUTPUT_FORMAT.TXT);
if (extract.openPDFFile()) {
    int pageCount = extract.getPageCount();
    // Pages are numbered from 1
    for (int page = 1; page <= pageCount; page++) {
        // text holds the extracted text for this page; search or store it here
        String text = extract.getTextOnPage(page);
    }
}
Extract structured text
This requires a tagged PDF file.
ExtractStructuredTextProperties properties = new ExtractStructuredTextProperties();
properties.setFileOutputMode(OutputModes.XML);
ExtractStructuredText extract = new ExtractStructuredText("C:/pdfs/mypdf.pdf", properties);
if (extract.openPDFFile()) {
    Document anyStructuredText = extract.getStructuredTextContent();
}
extract.closePDFfile();
3. You can then search the returned text, for example with String.indexOf, which returns -1 when the keyword is not found:
int index = text.indexOf("Java");
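The search step can be taken a little further. The sketch below is plain Java and does not call JPedal: it assumes you have already collected the text of each page into a List<String> (for example from getTextOnPage inside the loop above). The class name PdfTextSearch and both helper methods are hypothetical names chosen for illustration. It also assumes that pages without a text layer, such as scanned images, come back as empty or whitespace-only strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class; not part of JPedal.
class PdfTextSearch {

    // Returns true if at least one page produced non-whitespace text.
    // Assumption: pages with no text layer yield empty or blank strings.
    static boolean hasSearchableText(List<String> pageTexts) {
        for (String text : pageTexts) {
            if (text != null && !text.trim().isEmpty()) {
                return true;
            }
        }
        return false;
    }

    // Returns the 1-based page numbers whose text contains the keyword,
    // ignoring case.
    static List<Integer> pagesContaining(List<String> pageTexts, String keyword) {
        String needle = keyword.toLowerCase(Locale.ROOT);
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < pageTexts.size(); i++) {
            String text = pageTexts.get(i);
            if (text != null && text.toLowerCase(Locale.ROOT).contains(needle)) {
                hits.add(i + 1); // convert list index to page number
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Stand-in data; in practice each entry would be the text of one page.
        List<String> pageTexts = Arrays.asList("Intro to Java", "   ", "JAVA and PDF files");
        System.out.println(hasSearchableText(pageTexts));       // prints true
        System.out.println(pagesContaining(pageTexts, "java")); // prints [1, 3]
    }
}
```

Lower-casing both sides with Locale.ROOT keeps the comparison independent of the machine's default locale. If you need the exact position of each match rather than just the page number, indexOf on the individual page text is still the simpler tool.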
This tutorial showed you how you can search a PDF for text without opening it. Learn more on our support site.