TL;DR

Java has no native PDF support, so you need a library. Apache PDFBox is free and adequate for basic extraction. iText adds better generation but comes with AGPL licensing restrictions. JPedal is the commercial option for production workloads that need reliable rendering and an embedded viewer.

Reading a PDF in Java is not a single operation. Depending on what you need (raw text, structured content, images, metadata, or a rendered page), you’ll reach for different libraries and different APIs.

This guide covers the three most widely used options: Apache PDFBox (open source, Apache 2.0), OpenPDF (open source, LGPL/MPL), and JPedal (commercial, with a free trial).

Quick Reference: Which Library Should You Use?

Library	Licence	Min. Java	Maven Group ID	Strengths
Apache PDFBox 3.x	Apache 2.0	Java 8	org.apache.pdfbox	Text extraction, metadata, general PDF processing
OpenPDF 2.x	LGPL + MPL	Java 17	com.github.librepdf	iText 4 fork; good for combined read + write workflows
JPedal	Commercial	Java 17	com.idrsolutions	Pixel-accurate rendering, embedded viewer, enterprise support

We will look at the open-source options first, as they are well suited to students and solo developers. If you need a pure Java library for a commercial application, JPedal's long development history makes it the more viable choice.

Reading a PDF with Apache PDFBox

Apache PDFBox 3.x is the most common starting point for Java PDF processing and works great for solo developers and students looking to read PDF files in Java.

Maven Dependency

 <dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>3.0.3</version>
 </dependency>

Gradle

implementation 'org.apache.pdfbox:pdfbox:3.0.3'

Extract All Text from a PDF

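 import java.io.File;
 import java.io.IOException;
 import org.apache.pdfbox.Loader;
 import org.apache.pdfbox.pdmodel.PDDocument;
 import org.apache.pdfbox.text.PDFTextStripper;

 public class ExtractText {
     public static void main(String[] args) throws IOException {
         File file = new File("sample.pdf");
         // try-with-resources closes the document even if extraction fails
         try (PDDocument document = Loader.loadPDF(file)) {
             PDFTextStripper stripper = new PDFTextStripper();
             String text = stripper.getText(document);
             System.out.println(text);
         }
     }
 }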

Loader.loadPDF() is the correct entry point in PDFBox 3.x. The old PDDocument.load() method was removed in 3.0.

Extract Text from a Specific Page Range

try (PDDocument document = Loader.loadPDF(new File("sample.pdf"))) {
    PDFTextStripper stripper = new PDFTextStripper();
    stripper.setStartPage(2);
    stripper.setEndPage(4);
    String text = stripper.getText(document);
    System.out.println(text);
}
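
The page numbers passed to setStartPage() and setEndPage() are 1-based and inclusive. If the range comes from user input, it can help to normalise it against the document's page count first. The following is a minimal sketch of that idea using only the JDK; PageRangeSketch and clamp are hypothetical names for illustration and are not part of PDFBox.

```java
// Hypothetical helper, not part of PDFBox: normalises a 1-based, inclusive
// page range so that it fits inside a document with pageCount pages.
class PageRangeSketch {

    // Returns {start, end} trimmed to 1..pageCount, or throws if the
    // requested range does not overlap the document at all.
    static int[] clamp(int start, int end, int pageCount) {
        if (pageCount < 1) {
            throw new IllegalArgumentException("document has no pages");
        }
        if (start > end) {
            throw new IllegalArgumentException("start page is after end page");
        }
        if (end < 1 || start > pageCount) {
            throw new IllegalArgumentException("range lies outside the document");
        }
        return new int[] {Math.max(1, start), Math.min(end, pageCount)};
    }

    public static void main(String[] args) {
        // Asking for pages 2 to 4 of a 3-page document is trimmed to 2 to 3.
        int[] range = clamp(2, 4, 3);
        System.out.println(range[0] + "-" + range[1]);
    }
}
```

With PDFBox, the page count would come from document.getNumberOfPages(), and the two values returned would be passed to setStartPage() and setEndPage().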

Fix Non-Linear Text Extraction

PDFBox extracts text in the order it appears in the PDF’s content stream, which does not always match reading order, particularly on multi-column layouts or documents with complex positioning. Enable sort-by-position to get left-to-right, top-to-bottom output:

PDFTextStripper stripper = new PDFTextStripper();
stripper.setSortByPosition(true);
String text = stripper.getText(document);
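
To see why ordering matters, here is a conceptual sketch of what sorting by position means, written with the JDK only. It is not PDFBox's implementation, which is more involved. Fragment, sortByPosition and join are hypothetical names, and the sketch assumes a coordinate system where y grows down the page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Conceptual sketch only: none of these names are PDFBox classes or methods.
class ReadingOrderSketch {

    // A piece of text plus the position where it is drawn on the page.
    // Assumption: y grows downwards, so a smaller y is higher on the page.
    static final class Fragment {
        final String text;
        final float x;
        final float y;

        Fragment(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Returns a sorted copy: top-to-bottom first, then left-to-right.
    static List<Fragment> sortByPosition(List<Fragment> streamOrder) {
        List<Fragment> sorted = new ArrayList<>(streamOrder);
        sorted.sort(Comparator
                .comparingDouble((Fragment f) -> f.y)
                .thenComparingDouble(f -> f.x));
        return sorted;
    }

    // Joins the fragment texts with single spaces, in list order.
    static String join(List<Fragment> fragments) {
        StringBuilder sb = new StringBuilder();
        for (Fragment f : fragments) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(f.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Content-stream order: the footer was written first, the heading last.
        List<Fragment> streamOrder = Arrays.asList(
                new Fragment("Footer", 50f, 780f),
                new Fragment("world", 120f, 100f),
                new Fragment("Hello", 50f, 100f),
                new Fragment("Heading", 50f, 40f));

        System.out.println(join(streamOrder));                 // Footer world Hello Heading
        System.out.println(join(sortByPosition(streamOrder))); // Heading Hello world Footer
    }
}
```

The first line printed mirrors the order in which the fragments were written to the file. The second is the order a reader would expect.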

Reading a PDF with OpenPDF

OpenPDF is a fork of iText 4 and is maintained under LGPL and MPL licences. It is a better fit if your project already uses it for PDF generation and you want a single dependency for both reading and writing.

For pure text extraction, PDFBox is generally more capable with complex documents.

Maven Dependency

<dependency>
    <groupId>com.github.librepdf</groupId>
    <artifactId>openpdf</artifactId>
    <version>2.0.3</version>
</dependency>

Extract Text with OpenPDF

OpenPDF has no high-level text stripper equivalent to PDFBox's PDFTextStripper. To extract text from an existing PDF, either pair it with pdf-renderer or open the file with PdfReader and pull the text out one page at a time with PdfTextExtractor:

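import java.io.IOException;
import com.lowagie.text.pdf.PdfReader;
import com.lowagie.text.pdf.parser.PdfTextExtractor;

public class ExtractTextOpenPdf {
    public static void main(String[] args) throws IOException {
        PdfReader reader = new PdfReader("sample.pdf");
        try {
            // In OpenPDF the extractor wraps the reader; getTextFromPage is an instance method
            PdfTextExtractor extractor = new PdfTextExtractor(reader);
            int pages = reader.getNumberOfPages();
            for (int i = 1; i <= pages; i++) { // page numbers are 1-based
                String pageText = extractor.getTextFromPage(i);
                System.out.println("Page " + i + ":\n" + pageText);
            }
        } finally {
            reader.close();
        }
    }
}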

If you are already generating PDFs with OpenPDF (form filling, watermarking, page manipulation) and need to read back content from the same documents, keeping a single dependency is reasonable. For reading third-party PDFs with complex structure, prefer PDFBox.

Reading a PDF with JPedal

JPedal is our commercial Java PDF library. It covers the same text extraction and metadata use cases as PDFBox, with additional capabilities that open-source libraries do not offer out of the box: pixel-accurate page rendering, an embeddable Swing/JavaFX viewer component and active commercial support.

You can also access our GitHub repository for text extraction, which includes code examples and various text extraction methods.

Add JPedal to Your Project

Download the JPedal trial JAR and add it to your classpath or local Maven repository.

Extract Text with JPedal

JPedal can extract both structured and unstructured text from a PDF.

Extract Unstructured Text from a PDF file

ExtractTextInRectangle extract = new ExtractTextInRectangle("C:/pdfs/mypdf.pdf");
//extract.setPassword("password");
if (extract.openPDFFile()) {
    int pageCount = extract.getPageCount();
    for (int page = 1; page <= pageCount; page++) {
        String text = extract.getTextOnPage(page);
    }
}
extract.closePDFfile();

Extract Structured Text from a PDF file

ExtractStructuredTextProperties properties = new ExtractStructuredTextProperties();
properties.setFileOutputMode(OutputModes.XML);
//properties.setFileOutputMode(OutputModes.HTML);
ExtractStructuredText extract = new ExtractStructuredText("C:/pdfs/mypdf.pdf", properties);
//extract.setPassword("password");
if (extract.openPDFFile()) {
    Document anyStructuredText = extract.getStructuredTextContent();
}
extract.closePDFfile();

Extract Wordlist from a PDF file

ExtractTextAsWordlist extract = new ExtractTextAsWordlist("C:/pdfs/mypdf.pdf");
//extract.setPassword("password");
if (extract.openPDFFile()) {
    int pageCount = extract.getPageCount();
    for (int page = 1; page <= pageCount; page++) {
        List wordList = extract.getWordsOnPage(page);
    }
}
extract.closePDFfile();

Extract Document outline from PDF files

ExtractOutline extract = new ExtractOutline("C:/pdfs/mypdf.pdf");
//extract.setPassword("password");
if (extract.openPDFFile()) {
    Document pdfOutline = extract.getPDFTextOutline();
}
extract.closePDFfile();

When to Choose JPedal Over the Open-Source Options

You need accurate rendering of PDFs as images (e.g., for a document viewer in a Java Swing application)
You’re processing high volumes of real-world PDFs, including edge case files
You need production support, not Stack Overflow answers
Your project is closed-source and the AGPL terms of iText are a problem

Which Java PDF library is Right for You?

Use PDFBox if you need a quick, free solution for basic text extraction from clean PDFs in a non-commercial or open-source project. Use OpenPDF if you already generate PDFs with it and want a single dependency for reading and writing. Use iText if you need both reading and PDF generation, and you’re fine with the AGPL or have a commercial licence.

Use JPedal if you’re building a production Java application that needs reliable extraction, accurate rendering, or an embedded viewer, especially if you’re processing real-world PDFs that don’t always conform to spec.

Learn more

Looking for a pure Java PDF library to handle processing your documents? Check out JPedal.
Want to learn more about the PDF file format? We have been developing PDF software for over 20 years!
