Apache Tika PDF support in JPedal

JPedal now contains an Apache Tika Parser which can parse and extract structured and unstructured text from PDF files.

How to use an Apache Tika PDF Parser

The integration allows you to use Tika’s parse() method with JPedal, giving you streamlined access to robust PDF text extraction and all the additional metadata and error-handling capabilities from both libraries.

How JPedal Implements Apache Tika PDF Parsing

First, you must pass a TikaInputStream created from the path to your PDF file.

Second, you must pass a ContentHandler. Set its character limit to -1; otherwise a long PDF may be cut off before all of its text is extracted (Tika's BodyContentHandler defaults to a limit of 100,000 characters).

Next, you pass a Metadata. This can be a blank instance or it can contain the password to the PDF file if it is encrypted.

Finally, JPedal does not need a ParseContext, so the last argument can be null.

Once parse() returns, the extracted text is stored in the ContentHandler and can be read with its toString() method.
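To illustrate why the character limit in the second step matters, here is a minimal sketch using only the SAX classes that ship with the JDK. LimitedTextHandler and LimitDemo are hypothetical names invented for this example. They are not Tika's BodyContentHandler or any JPedal class, and they only mimic the general idea of a handler that stops collecting text once a limit is reached.

```java
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for a text-collecting SAX handler with a write limit.
// It is NOT Tika's BodyContentHandler; it only mimics the limit behaviour.
class LimitedTextHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private final int limit; // any negative value means "no limit"

    LimitedTextHandler(int limit) {
        this.limit = limit;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (limit < 0 || text.length() + length <= limit) {
            text.append(ch, start, length);
            return;
        }
        // Keep whatever still fits, then abort: the rest of the document is lost.
        text.append(ch, start, limit - text.length());
        throw new SAXException("Write limit of " + limit + " characters reached");
    }

    @Override
    public String toString() {
        return text.toString();
    }
}

class LimitDemo {
    // Feeds the chunks to the handler in order and returns the text it kept.
    static String collect(LimitedTextHandler handler, String... chunks) {
        try {
            for (String chunk : chunks) {
                handler.characters(chunk.toCharArray(), 0, chunk.length());
            }
        } catch (SAXException e) {
            // Collection stopped early; the handler only holds a prefix of the text.
        }
        return handler.toString();
    }

    public static void main(String[] args) {
        String[] pages = {"Page one. ", "Page two. ", "Page three."};
        System.out.println(collect(new LimitedTextHandler(10), pages));
        System.out.println(collect(new LimitedTextHandler(-1), pages));
    }
}
```

With a limit of 10 the handler keeps only the first chunk, while a limit of -1 keeps all three. The same reasoning is why the steps above recommend -1 when the whole PDF should be extracted.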

Key Parameters

InputStream: Use a TikaInputStream created from the file path. parse() consumes the stream but does not close it, so you must close it yourself after parsing.
ContentHandler: Stores extracted content. It’s recommended to set the character limit to -1 to ensure the entire PDF is parsed.
Metadata: Can be a blank instance or contain the PDF password if needed.
ParseContext: Optional; not required for JPedal usage.
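The note on the InputStream parameter (consumed, but not closed) can be sketched with plain JDK streams. TrackingStream and StreamOwnershipDemo are hypothetical names for this example, and consume() is only a stand-in for a parse() call, not JPedal or Tika code. The sketch shows that reading a stream to the end leaves it open, and that try-with-resources is a simple way for the caller to close it.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical in-memory stream that records whether close() was called.
class TrackingStream extends ByteArrayInputStream {
    boolean closed = false;

    TrackingStream(String content) {
        super(content.getBytes(StandardCharsets.UTF_8));
    }

    @Override
    public void close() {
        // ByteArrayInputStream.close() has no effect, so only the flag is set.
        closed = true;
    }
}

class StreamOwnershipDemo {
    // Stand-in for a parse() call: reads to the end of the stream, never closes it.
    static int consume(InputStream in) {
        try {
            int count = 0;
            while (in.read() != -1) {
                count++;
            }
            return count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        TrackingStream stream = new TrackingStream("%PDF-1.7");
        // The caller owns the stream, so the caller closes it.
        try (TrackingStream s = stream) {
            int bytesRead = consume(s);
            System.out.println("bytes read: " + bytesRead);
            System.out.println("closed inside try: " + s.closed);
        }
        System.out.println("closed after try: " + stream.closed);
    }
}
```

The stream is still open when consume() returns and is closed only when the try block exits, which is the responsibility the caller has when handing a TikaInputStream to the parser.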

Sample Usage

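import java.nio.file.Paths;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
// PDFParser is JPedal's Tika parser class; import it from the JPedal jar.
ContentHandler handler = new BodyContentHandler(-1); // -1 disables the character limit
Metadata metadata = new Metadata();
metadata.set(PDFParser.PASSWORD, "optionalPasswordIfEncrypted"); // only for encrypted PDFs
PDFParser parser = new PDFParser();
try (TikaInputStream tikaStream = TikaInputStream.get(Paths.get("input.pdf"))) { // closes the stream after parsing
    parser.parse(tikaStream, handler, metadata, null); // ParseContext is not needed
}
String extractedText = handler.toString();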

Learn More

You can find more information about our Apache Tika Parser here.

By leveraging JPedal’s Apache Tika integration, Java developers gain a fully-supported, commercial-grade solution for PDF parsing with consistent API design, easy handling of unstructured text, and built-in extensibility for more advanced PDF tasks.

The JPedal PDF library allows you to solve these problems in Java

//Convenience static method (see class for additional options)
ExtractClippedImages.writeAllClippedImagesToDir("inputFileOrDirectory", "outputDir", "outputImageFormat", new String[] {"imageHeightAsFloat", "subDirectoryForHeight"});

final PdfManipulator pdf = new PdfManipulator();
pdf.loadDocument(new File("inputFile.pdf"));
pdf.addPage(1, PaperSize.A4_LANDSCAPE);
pdf.addText(1, "Hello World", 10, 10, BaseFont.HelveticaBold, 12, 1, 0.3f, 0.2f);
pdf.addImage(1, new BufferedImage(100, 100, BufferedImage.TYPE_INT_RGB), new float[] {0, 0, 100, 100});
pdf.rotatePage(1, 90);
pdf.apply();
pdf.writeDocument(new File("outputFile.pdf"));

Viewer viewer = new Viewer();
viewer.setupViewer();
viewer.executeCommand(ViewerCommands.OPENFILE, "pdfFile.pdf");

//Convenience static method (see class for additional options)
ExtractTextAsWordList.writeAllWordlistsToDir("inputFileOrDirectory", "outputDir", -1);

PdfMerge.mergeFiles(new File("inputFile1.pdf"), new File("inputFile2.pdf"), new File("outputFile.pdf"));

PdfManipulator.splitInHalf(new File("inputFile.pdf"), new File("outputFolder"), pageToSplitAt);

PrintPdfPages print = new PrintPdfPages("C:/pdfs/mypdf.pdf");

if (print.openPDFFile()) {
    print.printAllPages("Printer Name");
}

//Convenience static method (see class for additional options)
ArrayList resultsForPages = FindTextInRectangle.findTextOnAllPages("/path/to/file.pdf", "textToFind");

java -jar jpedal.jar --inspect "inputFile.pdf"

PdfSigner.signPdf(
        "inputFile.pdf",
        "outputFile.pdf",
        "keystorePassword",
        "keystoreFile.p12",
        "signerName",
        "signerLocation",
        "signingReason",
        ACCESS_PERMISSION.P1
);
