Mark Stephens Mark has been working with Java and PDF since 1999 and is a big NetBeans fan. He enjoys speaking at conferences. He has an MA in Medieval History and a passion for reading.

How to extract text from PDF files in Java

Java has no built-in support for PDF files. This tutorial shows you how to extract text from a PDF file in a few simple steps using the JPedal PDF library.

Why use a third party library to handle PDF files?

A PDF file is a very complex hybrid of binary and text data. The text on a page has to be parsed and assembled from many separate parts of the file. In this example, we will use our JPedal PDF library to make this task simple.

Extract Unstructured Text from a PDF file

If a PDF contains unstructured, extractable text, this API will extract it from each page (Javadoc).

import org.jpedal.examples.text.ExtractTextInRectangle;

ExtractTextInRectangle extract = new ExtractTextInRectangle("C:/pdfs/mypdf.pdf");
// Uncomment for a password-protected PDF
// extract.setPassword("password");
if (extract.openPDFFile()) {
    int pageCount = extract.getPageCount();
    // Pages are numbered from 1
    for (int page = 1; page <= pageCount; page++) {
        String text = extract.getTextOnPage(page);
    }
}

extract.closePDFfile();
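
The loop above reads the text of each page but does not keep it. As a minimal sketch of one way to gather the results, the helper below joins the text of every page into a single string. It uses no JPedal classes: the `textOnPage` parameter is a hypothetical stand-in for a call such as `extract.getTextOnPage(page)`, and the form feed separator and the null check are our own assumptions rather than anything JPedal requires.

```java
import java.util.function.IntFunction;

/** Sketch: join the text of every page into one string. */
final class PageTextCollector {

    /**
     * @param pageCount  number of pages (pages are numbered from 1)
     * @param textOnPage hypothetical stand-in for a call such as extract.getTextOnPage(page)
     */
    static String collect(int pageCount, IntFunction<String> textOnPage) {
        StringBuilder all = new StringBuilder();
        for (int page = 1; page <= pageCount; page++) {
            String text = textOnPage.apply(page);
            if (text == null || text.isEmpty()) {
                continue; // defensive: skip pages that yield no text
            }
            if (all.length() > 0) {
                all.append('\f'); // form feed as page separator (our choice)
            }
            all.append(text);
        }
        return all.toString();
    }

    public static void main(String[] args) {
        // Fake two-page document standing in for a real extractor
        String[] pages = {"First page", "Second page"};
        System.out.println(collect(pages.length, page -> pages[page - 1]));
    }
}
```

With a real extractor, the call would look something like `PageTextCollector.collect(extract.getPageCount(), extract::getTextOnPage)`, made inside the `if (extract.openPDFFile())` block.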

Extract Structured Text from a PDF file

If a PDF was correctly created with structured, extractable text, this API will extract the text content as structured content in a Java Document (Javadoc).

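import org.jpedal.examples.text.ExtractStructuredText;
import org.w3c.dom.Document;

// Class and method names follow the JPedal Javadoc; check them against your JPedal version
ExtractStructuredText extract = new ExtractStructuredText("C:/pdfs/mypdf.pdf");
// Uncomment for a password-protected PDF
// extract.setPassword("password");
if (extract.openPDFFile()) {
    // The structured text is returned as a single Document for the whole file
    Document content = extract.getStructuredTextContent();
}

extract.closePDFfile();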
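
Once the structured content is in a Java Document, a common next step is to turn it into an XML string for saving or inspection. The sketch below does this with the standard `javax.xml.transform` API. It does not call JPedal at all: `sampleDocument()` builds a tiny stand-in Document, and its element names (`document`, `H1`) are hypothetical placeholders, not JPedal's actual output format.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

/** Sketch: serialise an org.w3c.dom.Document to an XML string. */
final class DocumentToXml {

    static String toXml(Document doc) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            // Leave out the <?xml ...?> header so the output is just the content
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialise document", e);
        }
    }

    /** Builds a small stand-in Document; the element names are hypothetical. */
    static Document sampleDocument() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element root = doc.createElement("document");
            Element heading = doc.createElement("H1");
            heading.setTextContent("Introduction");
            root.appendChild(heading);
            doc.appendChild(root);
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException("Could not build sample document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toXml(sampleDocument()));
    }
}
```

In real use, the Document passed to `toXml` would be the one returned by the extraction call rather than the sample built here.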

Extract Wordlist from a PDF file

Many customers use JPedal to pre-index the text content of their PDF files in a database. This API makes it easy to extract all the words from a PDF file together with their positions on the page (Javadoc).

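import java.util.List;
import org.jpedal.examples.text.ExtractTextAsWordlist;

ExtractTextAsWordlist extract = new ExtractTextAsWordlist("C:/pdfs/mypdf.pdf");
// Uncomment for a password-protected PDF
// extract.setPassword("password");
if (extract.openPDFFile()) {
    int pageCount = extract.getPageCount();
    // Pages are numbered from 1
    for (int page = 1; page <= pageCount; page++) {
        // Words on this page together with their coordinates
        List<String> wordList = extract.getWordsOnPage(page);
    }
}

extract.closePDFfile();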
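
To illustrate the pre-indexing use case, here is a minimal in-memory index that maps each word to the pages and positions where it occurs. It is a sketch under a stated assumption: that each page's list is flat, with five entries per word (the word followed by x1, y1, x2, y2 as strings). Check the Javadoc for the exact layout your JPedal version returns. `WordIndex` and `Hit` are hypothetical names for this example, and a real system would write to a database instead of a map.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Sketch: in-memory word index built from per-page word lists. */
final class WordIndex {

    /** One occurrence of a word and its bounding box on a page. */
    static final class Hit {
        final int page;
        final float x1, y1, x2, y2;

        Hit(int page, float x1, float y1, float x2, float y2) {
            this.page = page;
            this.x1 = x1;
            this.y1 = y1;
            this.x2 = x2;
            this.y2 = y2;
        }
    }

    private final Map<String, List<Hit>> hits = new LinkedHashMap<>();

    /** Adds one page. ASSUMED layout: word, x1, y1, x2, y2, word, x1, ... */
    void addPage(int page, List<String> flat) {
        if (flat.size() % 5 != 0) {
            throw new IllegalArgumentException("Expected 5 entries per word");
        }
        for (int i = 0; i < flat.size(); i += 5) {
            // Lower-case the key so lookups are case-insensitive
            String word = flat.get(i).toLowerCase(Locale.ROOT);
            Hit hit = new Hit(page,
                    Float.parseFloat(flat.get(i + 1)),
                    Float.parseFloat(flat.get(i + 2)),
                    Float.parseFloat(flat.get(i + 3)),
                    Float.parseFloat(flat.get(i + 4)));
            hits.computeIfAbsent(word, k -> new ArrayList<>()).add(hit);
        }
    }

    /** Returns every occurrence of the word, or an empty list if there is none. */
    List<Hit> find(String word) {
        return hits.getOrDefault(word.toLowerCase(Locale.ROOT), Collections.<Hit>emptyList());
    }
}
```

Inside the page loop you would call something like `index.addPage(page, wordList)`, and later `index.find("invoice")` to list where a word appears.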

Extract Document outline from a PDF file

PDF files often contain a Document outline that provides a Table of Contents (Javadoc).

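import org.jpedal.examples.text.ExtractOutline;
import org.w3c.dom.Document;

ExtractOutline extract = new ExtractOutline("C:/pdfs/mypdf.pdf");
// Uncomment for a password-protected PDF
// extract.setPassword("password");
if (extract.openPDFFile()) {
    // The outline is returned as a Document
    Document pdfOutline = extract.getPDFTextOutline();
}

extract.closePDFfile();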
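
Because the outline comes back as a Document, it can be walked with the standard DOM API to produce an indented Table of Contents. The sketch below shows the idea on a hand-built sample. The element name (`title`) and attribute names (`title`, `page`) are assumptions made for illustration only, so inspect the Document your JPedal version returns and adjust the names to match.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

/** Sketch: turn a nested outline Document into indented lines. */
final class OutlineWalker {

    // ASSUMED attribute names; adjust to the real outline format
    private static final String TITLE = "title";
    private static final String PAGE = "page";

    static List<String> toLines(Document outline) {
        List<String> lines = new ArrayList<>();
        Element root = outline.getDocumentElement();
        if (root != null) {
            walk(root, 0, lines);
        }
        return lines;
    }

    /** Depth-first walk: one line per child element, indented two spaces per level. */
    private static void walk(Element parent, int depth, List<String> lines) {
        StringBuilder indent = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            indent.append("  ");
        }
        for (Node child = parent.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child.getNodeType() != Node.ELEMENT_NODE) {
                continue; // skip text and comment nodes
            }
            Element entry = (Element) child;
            lines.add(indent + entry.getAttribute(TITLE)
                    + " (page " + entry.getAttribute(PAGE) + ")");
            walk(entry, depth + 1, lines);
        }
    }

    /** Builds a small stand-in outline with hypothetical element names. */
    static Document sampleOutline() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element root = doc.createElement("root");
            doc.appendChild(root);
            Element chapter1 = entry(doc, "Chapter 1", "1");
            chapter1.appendChild(entry(doc, "Section 1.1", "2"));
            root.appendChild(chapter1);
            root.appendChild(entry(doc, "Chapter 2", "5"));
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("Could not build sample outline", e);
        }
    }

    private static Element entry(Document doc, String title, String page) {
        Element e = doc.createElement("title");
        e.setAttribute(TITLE, title);
        e.setAttribute(PAGE, page);
        return e;
    }

    public static void main(String[] args) {
        toLines(sampleOutline()).forEach(System.out::println);
    }
}
```

For a real file, pass the extracted outline Document to `OutlineWalker.toLines` in place of the sample.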


JPedal makes it easy to extract text from PDF files


IDRsolutions Ltd 2022. All rights reserved.