How to extract text from PDF files in Java

Java has no built-in support for PDF files. This tutorial shows you how to extract text from a PDF file in a few simple steps using the JPedal Java PDF library.

Why use a third party library to handle PDF files?

PDF files are a complex hybrid of binary and text data. The text shown on a page is not stored as one readable block, so it has to be parsed and assembled from many separate pieces of the file. In this example, we will use our JPedal Java PDF library to make this task simple.
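To make the "hybrid" point concrete, here is a minimal sketch that uses only the standard Java library (not JPedal). It reads the plain-text header that a PDF file starts with, which is typically followed by a comment line of binary bytes. The class and method names are our own, chosen for illustration, and the check assumes the header sits at the very start of the data.

```java
import java.nio.charset.StandardCharsets;

final class PdfHeaderCheck {

    // Returns the version from a "%PDF-x.y" header (for example "1.7"),
    // or null if the data does not start with the PDF marker.
    // Assumption: the header is at byte 0 and the version is 3 characters.
    static String pdfVersion(byte[] data) {
        byte[] marker = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        if (data == null || data.length < marker.length + 3) {
            return null;
        }
        for (int i = 0; i < marker.length; i++) {
            if (data[i] != marker[i]) {
                return null;
            }
        }
        // The three bytes after the marker hold the version number.
        return new String(data, marker.length, 3, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // A text header followed by a comment line made of non-ASCII bytes.
        byte[] sample = "%PDF-1.7\n%\u00e2\u00e3\u00cf\u00d3\n"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(pdfVersion(sample)); // prints 1.7
        System.out.println(pdfVersion("hello world".getBytes(StandardCharsets.US_ASCII))); // prints null
    }
}
```

Reading the header is the easy part. The page text itself sits in separate, usually compressed, objects elsewhere in the file, which is why a dedicated library is needed.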

How to extract Unstructured Text from a PDF file

If a PDF contains unstructured, extractable text, this API will allow it to be extracted from the page (Javadoc).

ExtractTextInRectangle extract = new ExtractTextInRectangle("C:/pdfs/mypdf.pdf");
// Uncomment if the file is password protected
//extract.setPassword("password");
if (extract.openPDFFile()) {
    int pageCount = extract.getPageCount();
    // Page numbers start at 1
    for (int page = 1; page <= pageCount; page++) {
        // All the extractable text on this page
        String text = extract.getTextOnPage(page);
    }
}

extract.closePDFfile();

How to extract Structured Text from a tagged PDF file

If a PDF document was correctly created as a tagged PDF with structured, extractable text, this API will allow the text content to be extracted as structured content in a Java Document (Javadoc).

// Check the class and method names against the Javadoc for your JPedal version
ExtractStructuredText extract = new ExtractStructuredText("C:/pdfs/mypdf.pdf");
// Uncomment if the file is password protected
//extract.setPassword("password");
if (extract.openPDFFile()) {
    // The tagged content of the whole file as a single Document
    Document anyStructuredText = extract.getStructuredTextContent();
}

extract.closePDFfile();
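Once you have structured content in a Document, a common next step is to write it out as XML. The sketch below assumes the Document is a standard org.w3c.dom.Document and uses only the JDK's XML classes. The sample tree, including the tag names Document, H1 and P, is a hand-built stand-in for illustration, not real JPedal output.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class StructuredTextDump {

    // Serializes a DOM Document to an XML string without the XML declaration.
    static String toXml(Document doc) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            transformer.setOutputProperty(OutputKeys.INDENT, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Could not serialize the document", e);
        }
    }

    // Hypothetical stand-in for the Document returned by the extraction API.
    static Document sampleDocument() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element root = doc.createElement("Document");
            doc.appendChild(root);
            Element heading = doc.createElement("H1");
            heading.setTextContent("Annual report");
            root.appendChild(heading);
            Element para = doc.createElement("P");
            para.setTextContent("Revenue grew in every region.");
            root.appendChild(para);
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No XML parser available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toXml(sampleDocument()));
    }
}
```

In real use you would pass the Document returned by the extraction call to toXml in place of sampleDocument().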

How to extract Wordlist from a PDF file

Many customers use JPedal to pre-index the text content of their PDF files in a database. This API makes it easy to extract all the words from a PDF file together with their positions on the page (Javadoc).

ExtractTextAsWordlist extract = new ExtractTextAsWordlist("C:/pdfs/mypdf.pdf");
// Uncomment if the file is password protected
//extract.setPassword("password");
if (extract.openPDFFile()) {
    int pageCount = extract.getPageCount();
    // Page numbers start at 1
    for (int page = 1; page <= pageCount; page++) {
        List wordList = extract.getWordsOnPage(page);
    }
}

extract.closePDFfile();
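To illustrate the pre-indexing idea, here is a minimal in-memory sketch that maps each word to the pages it appears on. It uses only the standard Java library. The WordIndex class is our own illustration, and it assumes you have already reduced each page to a plain list of words. The list returned by the API also carries position data, so check the Javadoc for its exact layout before feeding it in.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class WordIndex {

    // Lower-cased word -> sorted set of page numbers containing it.
    private final Map<String, Set<Integer>> pagesByWord = new TreeMap<>();

    // Records every word of one page. Assumption: 'words' holds plain words only.
    void addPage(int page, List<String> words) {
        for (String word : words) {
            String key = word.toLowerCase(Locale.ROOT);
            pagesByWord.computeIfAbsent(key, k -> new TreeSet<>()).add(page);
        }
    }

    // Case-insensitive lookup; returns an empty set for unknown words.
    Set<Integer> pagesContaining(String word) {
        return pagesByWord.getOrDefault(word.toLowerCase(Locale.ROOT), Set.of());
    }

    public static void main(String[] args) {
        WordIndex index = new WordIndex();
        index.addPage(1, List.of("Invoice", "total", "due"));
        index.addPage(2, List.of("Payment", "terms", "invoice"));
        System.out.println(index.pagesContaining("invoice")); // prints [1, 2]
        System.out.println(index.pagesContaining("refund"));  // prints []
    }
}
```

A database-backed index would follow the same shape, with one row per word and page instead of an in-memory map.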

How to extract Document outline from PDF files

PDF files often contain a Document outline, which provides a Table of Contents for the file (Javadoc).

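ExtractOutline extract = new ExtractOutline("C:/pdfs/mypdf.pdf");
// Uncomment if the file is password protected
//extract.setPassword("password");
if (extract.openPDFFile()) {
    // The outline (bookmarks) of the whole file as a single Document
    Document pdfOutline = extract.getPDFTextOutline();
}

extract.closePDFfile();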
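An outline is a tree, so it is usually read with a recursive walk. The sketch below assumes the outline is a standard org.w3c.dom.Document and turns it into an indented list of titles using only the JDK. The element name "entry" and the attribute name "title" are hypothetical, so inspect the Document you actually get back and adjust the names to match.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

final class OutlineWalker {

    // Flattens the outline tree into one line per entry, indented by depth.
    static List<String> flatten(Document outline) {
        List<String> lines = new ArrayList<>();
        Element root = outline.getDocumentElement();
        if (root != null) {
            collect(root, 0, lines);
        }
        return lines;
    }

    private static void collect(Node parent, int depth, List<String> lines) {
        for (Node child = parent.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child.getNodeType() != Node.ELEMENT_NODE) {
                continue; // skip text and comment nodes
            }
            Element entry = (Element) child;
            // "title" is a hypothetical attribute name.
            lines.add("  ".repeat(depth) + entry.getAttribute("title"));
            collect(entry, depth + 1, lines);
        }
    }

    // Hand-built stand-in for the Document returned by the outline API.
    static Document sampleOutline() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element root = doc.createElement("outline");
            doc.appendChild(root);
            Element chapter1 = entry(doc, "Chapter 1");
            chapter1.appendChild(entry(doc, "1.1 Scope"));
            root.appendChild(chapter1);
            root.appendChild(entry(doc, "Chapter 2"));
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No XML parser available", e);
        }
    }

    private static Element entry(Document doc, String title) {
        Element e = doc.createElement("entry");
        e.setAttribute("title", title);
        return e;
    }

    public static void main(String[] args) {
        flatten(sampleOutline()).forEach(System.out::println);
    }
}
```

Nested entries are indented by two spaces per level, so the result can be printed directly as a simple Table of Contents.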

Is there a way to extract text from a PDF?

Yes. One option is JPedal, a Java PDF library specifically designed for high-quality PDF text extraction and rendering.

The JPedal PDF library allows you to

Display PDF files in Java Apps

View PDF files in Java

Convert PDF Files to image
