TL;DR
Java has no native PDF support, so you need a library. Apache PDFBox is free and adequate for basic extraction. iText adds better generation but has AGPL licensing limitations. JPedal is the commercial option for production workloads that need reliable rendering and an embedded viewer.
Reading a PDF in Java is not a single operation. Depending on what you need (raw text, structured content, images, metadata, or a rendered page), you’ll reach for different libraries and different APIs.
This guide covers the three most widely used options: Apache PDFBox (open source, Apache 2.0), OpenPDF (open source, LGPL/MPL), and JPedal (commercial, with a free trial).
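Whichever library you pick, it can help to confirm that a file at least looks like a PDF before handing it over. PDF files start with the ASCII marker %PDF-, so a minimal sketch using only the JDK could check the first five bytes. The class and method names here are hypothetical, and the check is deliberately strict: some readers tolerate a few junk bytes before the marker, so this may reject files that a library would still open.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper, not part of PDFBox, OpenPDF or JPedal.
final class PdfHeaderCheck {

    // Every PDF header begins with these five ASCII bytes.
    private static final byte[] MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the given leading bytes begin with the %PDF- marker. */
    static boolean startsWithPdfHeader(byte[] leadingBytes) {
        if (leadingBytes.length < MAGIC.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(leadingBytes, MAGIC.length), MAGIC);
    }

    /** Reads just enough of the file to test for the marker. */
    static boolean looksLikePdf(Path file) throws IOException {
        byte[] head = new byte[MAGIC.length];
        int read = 0;
        try (InputStream in = Files.newInputStream(file)) {
            // A single read() may return fewer bytes than requested, so loop.
            while (read < head.length) {
                int n = in.read(head, read, head.length - read);
                if (n < 0) {
                    break;
                }
                read += n;
            }
        }
        return read == head.length && startsWithPdfHeader(head);
    }

    public static void main(String[] args) throws IOException {
        Path file = Paths.get(args.length > 0 ? args[0] : "sample.pdf");
        if (Files.isRegularFile(file)) {
            System.out.println(file + " looks like a PDF: " + looksLikePdf(file));
        } else {
            System.out.println("No such file: " + file);
        }
    }
}
```

A passing check only says the header is plausible. The library still has to parse the rest of the file, so keep the usual exception handling around the load call.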
Quick Reference: Which Library Should You Use?
| Library | Licence | Min. Java | Maven Group ID | Strengths |
|---|---|---|---|---|
| Apache PDFBox 3.x | Apache 2.0 | Java 8 | org.apache.pdfbox | Text extraction, metadata, general PDF processing |
| OpenPDF 2.x | LGPL + MPL | Java 17 | com.github.librepdf | iText 4 fork; good for combined read + write workflows |
| JPedal | Commercial | Java 17 | com.idrsolutions | Pixel-accurate rendering, embedded viewer, enterprise support |
We will look at the open-source options first, as they suit students and solo developers well. If you need a pure Java library for a commercial application, JPedal’s long development history makes it the most viable choice.
Reading a PDF with Apache PDFBox
Apache PDFBox 3.x is the most common starting point for Java PDF processing and works great for solo developers and students looking to read PDF files in Java.
Maven Dependency
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>3.0.3</version>
</dependency>
Gradle
implementation 'org.apache.pdfbox:pdfbox:3.0.3'
Extract All Text from a PDF
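import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.Loader;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class ExtractText {
    public static void main(String[] args) throws IOException {
        File file = new File("sample.pdf");
        // try-with-resources closes the document even if extraction fails
        try (PDDocument document = Loader.loadPDF(file)) {
            PDFTextStripper stripper = new PDFTextStripper();
            String text = stripper.getText(document);
            System.out.println(text);
        }
    }
}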
Loader.loadPDF() is the correct entry point in PDFBox 3.x. The old PDDocument.load() method was removed in 3.0.
Extract Text from a Specific Page Range
try (PDDocument document = Loader.loadPDF(new File("sample.pdf"))) {
    PDFTextStripper stripper = new PDFTextStripper();
    // Page numbers are 1-based and both bounds are inclusive: this reads pages 2, 3 and 4
    stripper.setStartPage(2);
    stripper.setEndPage(4);
    String text = stripper.getText(document);
    System.out.println(text);
}
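PDFTextStripper page numbers start at 1 and the end page is inclusive, so a request for pages 2 to 4 on a three-page document quietly yields less text than expected. If you would rather catch that up front, a small pre-check can clamp the requested range to the document and fail when nothing is left. This is a sketch under that assumption; PageRangeSketch and clampToDocument are hypothetical names, not PDFBox API.

```java
// Hypothetical helper, not part of PDFBox.
final class PageRangeSketch {

    /**
     * Clamps a 1-based, inclusive page range to a document of pageCount pages.
     * Returns {start, end}, or throws if the range misses the document entirely.
     */
    static int[] clampToDocument(int requestedStart, int requestedEnd, int pageCount) {
        if (pageCount < 1) {
            throw new IllegalArgumentException("Document has no pages");
        }
        int start = Math.max(1, requestedStart);
        int end = Math.min(pageCount, requestedEnd);
        if (start > end) {
            throw new IllegalArgumentException(
                    "Pages " + requestedStart + "-" + requestedEnd
                            + " fall outside a " + pageCount + "-page document");
        }
        return new int[] {start, end};
    }

    public static void main(String[] args) {
        // Pages 2 to 4 requested, but the document only has 3 pages.
        int[] range = clampToDocument(2, 4, 3);
        System.out.println("Extract pages " + range[0] + " to " + range[1]);
        // prints "Extract pages 2 to 3"
    }
}
```

With PDFBox you would take the page count from document.getNumberOfPages(), then pass range[0] to setStartPage and range[1] to setEndPage.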
Fix Non-Linear Text Extraction
PDFBox extracts text in the order it appears in the PDF’s content stream, which does not always match reading order, particularly on multi-column layouts or documents with complex positioning. Enable sort-by-position to get left-to-right, top-to-bottom output:
PDFTextStripper stripper = new PDFTextStripper();
stripper.setSortByPosition(true);
String text = stripper.getText(document);
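To see why sorting matters, it helps to picture the extracted text as fragments with page coordinates. The following is a conceptual sketch only, using plain JDK classes: it is not PDFBox code and not necessarily how PDFBox implements the option. The Fragment class, the lineTolerance parameter and the assumption that y is measured from the top of the page are all ours, chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Conceptual sketch only: not PDFBox code and not its algorithm.
final class ReadingOrderSketch {

    /** Hypothetical text fragment; y is assumed to be measured from the top of the page. */
    static final class Fragment {
        final String text;
        final double x;
        final double y;

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Groups fragments into lines (y within lineTolerance of the line's first fragment), then orders each line by x. */
    static List<Fragment> sortByPosition(List<Fragment> fragments, double lineTolerance) {
        List<Fragment> byY = new ArrayList<>(fragments);
        byY.sort(Comparator.comparingDouble((Fragment f) -> f.y));

        List<Fragment> result = new ArrayList<>();
        List<Fragment> line = new ArrayList<>();
        for (Fragment f : byY) {
            // Start a new line once a fragment sits clearly below the current one.
            if (!line.isEmpty() && f.y - line.get(0).y > lineTolerance) {
                flushLine(line, result);
            }
            line.add(f);
        }
        flushLine(line, result);
        return result;
    }

    /** Sorts the collected line left to right, appends it to the result and resets it. */
    private static void flushLine(List<Fragment> line, List<Fragment> result) {
        line.sort(Comparator.comparingDouble((Fragment f) -> f.x));
        result.addAll(line);
        line.clear();
    }

    static String join(List<Fragment> fragments) {
        return fragments.stream().map(f -> f.text).collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // The order in which a content stream might happen to emit the text.
        List<Fragment> streamOrder = Arrays.asList(
                new Fragment("world", 60, 10.0),
                new Fragment("Second", 10, 30.0),
                new Fragment("Hello", 10, 10.5));

        System.out.println(join(streamOrder));                      // world Second Hello
        System.out.println(join(sortByPosition(streamOrder, 2.0))); // Hello world Second
    }
}
```

The tolerance is what lets "Hello" and "world" count as one line even though their baselines differ slightly. A simple sort like this also interleaves the columns of a multi-column page line by line, so check the sorted output on your own documents.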
Reading a PDF with OpenPDF
OpenPDF is a fork of iText 4 and is maintained under LGPL and MPL licences. It is a better fit if your project already uses it for PDF generation and you want a single dependency for both reading and writing.
For pure text extraction, PDFBox is generally more capable with complex documents.
Maven Dependency
<dependency>
    <groupId>com.github.librepdf</groupId>
    <artifactId>openpdf</artifactId>
    <version>2.0.3</version>
</dependency>
Extract Text with OpenPDF
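OpenPDF has no high-level text stripper equivalent to PDFBox's PDFTextStripper, with its sorting and page-range options. For text extraction from existing PDFs, use it alongside pdf-renderer, or open the file with PdfReader and pull the text page by page with PdfTextExtractor:
import java.io.IOException;
import com.lowagie.text.pdf.PdfReader;
import com.lowagie.text.pdf.parser.PdfTextExtractor;

public class OpenPdfExtractText {
    public static void main(String[] args) throws IOException {
        PdfReader reader = new PdfReader("sample.pdf");
        try {
            // In OpenPDF the extractor is an instance bound to the reader, not a static helper
            PdfTextExtractor extractor = new PdfTextExtractor(reader);
            int pages = reader.getNumberOfPages();
            for (int i = 1; i <= pages; i++) {
                String pageText = extractor.getTextFromPage(i);
                System.out.println("Page " + i + ":\n" + pageText);
            }
        } finally {
            reader.close();
        }
    }
}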
If you are already generating PDFs with OpenPDF (form filling, watermarking, page manipulation) and need to read back content from the same documents, keeping a single dependency is reasonable. For reading third-party PDFs with complex structure, prefer PDFBox.
Reading a PDF with JPedal
JPedal is our commercial Java PDF library. It covers the same text extraction and metadata use cases as PDFBox, with additional capabilities that open-source libraries do not offer out of the box: pixel-accurate page rendering, an embeddable Swing/JavaFX viewer component and active commercial support.
You can also access our GitHub repository for text extraction, which includes code examples and various text extraction methods.
Add JPedal to Your Project
Download the JPedal trial JAR and add it to your classpath or local Maven repository.
Extract Text with JPedal
JPedal can extract both structured and unstructured text from a PDF.
Extract Unstructured Text from a PDF file
ExtractTextInRectangle extract = new ExtractTextInRectangle("C:/pdfs/mypdf.pdf");
//extract.setPassword("password");
if (extract.openPDFFile()) {
    int pageCount = extract.getPageCount();
    for (int page = 1; page <= pageCount; page++) {
        String text = extract.getTextOnPage(page);
    }
}
extract.closePDFfile();
Extract Structured Text from a PDF file
ExtractStructuredTextProperties properties = new ExtractStructuredTextProperties();
properties.setFileOutputMode(OutputModes.XML);
//properties.setFileOutputMode(OutputModes.HTML);
ExtractStructuredText extract = new ExtractStructuredText("C:/pdfs/mypdf.pdf", properties);
//extract.setPassword("password");
if (extract.openPDFFile()) {
    Document anyStructuredText = extract.getStructuredTextContent();
}
extract.closePDFfile();
Extract Wordlist from a PDF file
ExtractTextAsWordlist extract = new ExtractTextAsWordlist("C:/pdfs/mypdf.pdf");
//extract.setPassword("password");
if (extract.openPDFFile()) {
    int pageCount = extract.getPageCount();
    for (int page = 1; page <= pageCount; page++) {
        List wordList = extract.getWordsOnPage(page);
    }
}
extract.closePDFfile();
Extract Document Outline from a PDF file
ExtractOutline extract = new ExtractOutline("C:/pdfs/mypdf.pdf");
//extract.setPassword("password");
if (extract.openPDFFile()) {
    Document pdfOutline = extract.getPDFTextOutline();
}
extract.closePDFfile();
When to Choose JPedal Over the Open-Source Options
- You need accurate rendering of PDFs as images (e.g., for a document viewer in a Java Swing application)
- You’re processing high volumes of real-world PDFs, including edge case files
- You need production support, not Stack Overflow answers
- Your project is closed-source and the AGPL terms of iText are a problem
Which Java PDF Library is Right for You?
Use PDFBox if you need a quick, free solution for basic text extraction from clean PDFs in a non-commercial or open-source project. Use OpenPDF if you need both reading and PDF generation from a single LGPL/MPL dependency. iText covers the same ground, but only if you’re fine with AGPL or have a commercial licence.
Use JPedal if you’re building a production Java application that needs reliable extraction, accurate rendering, or an embedded viewer, especially if you’re processing real-world PDFs that don’t always conform to the spec.
Learn more
Looking for a pure Java PDF library to handle processing your documents? Check out JPedal.
Want to learn more about the PDF file format? We have been developing PDF software for over 20 years!