How to search for text in a PDF file without opening it

Can you check whether a PDF contains a piece of text without opening it? Yes, but you will need some software to do it. This is useful if you want to search a PDF for keywords without having to open it manually in a viewer.

In this tutorial we are going to be using the Java PDF library JPedal.

To search a PDF file for text without opening it:

1. First, download a copy of the JPedal jar and add it to your project.

2. Then call the API method that matches the kind of text you want to extract:

Extract words on a page

ExtractTextAsWordList.writeAllWordlistsToDir("inputFileOrDirectory", "outputDir", -1);

Extract unstructured text

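ExtractTextInRectangle extract = new ExtractTextInRectangle("inputFile.pdf");
extract.setOutputFormat(OUTPUT_FORMAT.TXT);
if (extract.openPDFFile()) {
    int pageCount = extract.getPageCount();
    // Pages are numbered from 1
    for (int page = 1; page <= pageCount; page++) {
        // Text of the current page, ready to be searched
        String text = extract.getTextOnPage(page);
    }
}
// Release the file when finished, as in the structured text example
extract.closePDFfile();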

Extract structured text
This only works if the PDF file is tagged.

ExtractStructuredTextProperties properties = new ExtractStructuredTextProperties();
// Ask for the structured content as XML
properties.setFileOutputMode(OutputModes.XML);
ExtractStructuredText extract = new ExtractStructuredText("C:/pdfs/mypdf.pdf", properties);
if (extract.openPDFFile()) {
    // Holds the tagged structure and its text
    Document anyStructuredText = extract.getStructuredTextContent();
}
extract.closePDFfile();
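
If the Document returned above is a standard org.w3c.dom.Document (an assumption, since the snippet does not show its import), one way to search it is to read the combined text of its root element. The sketch below uses only the JDK and builds a small stand-in document from a string so that it runs without a PDF. The class name, the method names and the sample XML are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class StructuredTextSearch {

    // True if the text inside the document's root element contains the keyword, ignoring case.
    public static boolean containsKeyword(Document doc, String keyword) {
        if (doc == null || doc.getDocumentElement() == null) {
            return false;
        }
        // getTextContent() on an element concatenates the text of all its descendants
        String text = doc.getDocumentElement().getTextContent();
        return text.toLowerCase(Locale.ROOT).contains(keyword.toLowerCase(Locale.ROOT));
    }

    // Hypothetical helper: parses an XML string into a DOM Document (stand-in for extracted content).
    public static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Sample XML invented for this example; real output depends on the PDF's tags
        Document doc = parse("<Document><H1>Annual Report</H1><P>Revenue grew this year.</P></Document>");
        System.out.println(containsKeyword(doc, "revenue")); // prints "true"
        System.out.println(containsKeyword(doc, "forecast")); // prints "false"
    }
}
```

Because the text of neighbouring elements is joined without a separator, this approach suits single keywords better than phrases that might span two tags.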

3. You can then search through the returned text for your keywords.
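
As a rough sketch of this last step, the example below takes the text of each page as a plain String and reports which pages mention a keyword, ignoring case. It uses only the JDK, so the page strings here are stand-ins for what you would collect from getTextOnPage(page). The class and method names are invented for illustration and are not part of JPedal.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class PageTextSearch {

    // Returns the 1-based page numbers whose text contains the keyword, ignoring case.
    public static List<Integer> findPagesContaining(List<String> pageTexts, String keyword) {
        List<Integer> matches = new ArrayList<>();
        String needle = keyword.toLowerCase(Locale.ROOT);
        for (int i = 0; i < pageTexts.size(); i++) {
            String text = pageTexts.get(i);
            // Skip pages with no extracted text
            if (text != null && text.toLowerCase(Locale.ROOT).contains(needle)) {
                matches.add(i + 1); // list index 0 is page 1
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Stand-in page text; in practice, add each getTextOnPage(page) result to this list
        List<String> pageTexts = Arrays.asList(
                "Invoice number 1234",
                "Terms and conditions",
                "Total due on invoice");
        System.out.println(findPagesContaining(pageTexts, "invoice")); // prints [1, 3]
    }
}
```

Lower-casing both sides keeps the match simple. If you need whole-word matches or patterns, java.util.regex.Pattern is the usual next step.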