Java has no built-in support for PDF files. This tutorial shows you how to extract text from a PDF file in a few simple steps using the JPedal PDF library.
Why use a third party library to handle PDF files?
A PDF file is a very complex hybrid of binary and text data. The text on a page is not stored in one place; it has to be parsed and assembled from many sources inside the file. In this example, we will use our JPedal PDF library to make this task simple.
Extract Unstructured Text from a PDF file
If a PDF contains unstructured, extractable text, this API will allow it to be extracted from the page (Javadoc).
ExtractTextInRectangle extract=new ExtractTextInRectangle("C:/pdfs/mypdf.pdf");
// uncomment for a password-protected file
//extract.setPassword("password");
if (extract.openPDFFile()) {
    int pageCount=extract.getPageCount();
    // pages are numbered from 1
    for (int page=1; page<=pageCount; page++) {
        String text=extract.getTextOnPage(page);
        // use the text here, for example print or store it
    }
}
extract.closePDFfile();
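The loop above reads the text of each page but does nothing with it. As a minimal sketch of one way to keep the result, the following gathers the text of all pages into a single string. `PageTextSource` and `PageTextCollector` are hypothetical names introduced here so the example runs without JPedal; the interface only mirrors the two calls used in the loop and is not part of the library. Treating a null page text as empty is also an assumption.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the two calls used in the tutorial loop
// (getPageCount and getTextOnPage). It is not part of JPedal.
interface PageTextSource {
    int getPageCount();
    String getTextOnPage(int page); // pages are numbered from 1
}

class PageTextCollector {

    // Gathers the text of every page into one string, separating pages
    // with a form feed so page boundaries can be recovered later.
    static String collect(PageTextSource source) {
        StringBuilder all = new StringBuilder();
        int pageCount = source.getPageCount();
        for (int page = 1; page <= pageCount; page++) {
            if (page > 1) {
                all.append('\f');
            }
            String text = source.getTextOnPage(page);
            if (text != null) { // assumption: a page without text may yield null
                all.append(text);
            }
        }
        return all.toString();
    }

    public static void main(String[] args) {
        // In-memory fake used in place of a real PDF.
        final List<String> pages = Arrays.asList("First page", "Second page");
        PageTextSource fake = new PageTextSource() {
            public int getPageCount() { return pages.size(); }
            public String getTextOnPage(int page) { return pages.get(page - 1); }
        };
        System.out.println(collect(fake).replace("\f", "\n-----\n"));
    }
}
```

With JPedal, a thin adapter that forwards both methods to the `extract` object would play the role of the fake source used in `main`.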
Extract Structured Text from a PDF file
If a PDF was created correctly with structured, extractable text, then this API will allow the text content to be extracted from the page as structured content in a Java Document (Javadoc).
ExtractStructuredText extract=new ExtractStructuredText("C:/pdfs/mypdf.pdf");
// uncomment for a password-protected file
//extract.setPassword("password");
if (extract.openPDFFile()) {
    // the structured content is returned as a single Document
    Document content=extract.getStructuredTextContent();
}
extract.closePDFfile();
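Structured content held in a Java Document is easiest to inspect once it is written out as XML. The sketch below assumes the Document is a standard `org.w3c.dom.Document` and serialises it with the JDK's own `Transformer`. The class name `StructuredTextDump` and the sample tags (`Document`, `H1`, `P`) are made up for illustration and do not describe the tags JPedal actually emits.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class StructuredTextDump {

    // Serialises a DOM Document to an XML string without the XML declaration.
    static String toXml(Document doc) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Could not serialise document", e);
        }
    }

    // Builds a tiny stand-in document; the tag names are illustrative only.
    static Document sample() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element root = doc.createElement("Document");
            doc.appendChild(root);
            Element heading = doc.createElement("H1");
            heading.setTextContent("Introduction");
            root.appendChild(heading);
            Element paragraph = doc.createElement("P");
            paragraph.setTextContent("Some body text.");
            root.appendChild(paragraph);
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("Could not create document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toXml(sample()));
    }
}
```

The same `toXml` call could be applied to the Document returned by the extraction code, provided it really is a W3C DOM Document.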
Extract Wordlist from a PDF file
Many customers use JPedal to pre-index the text content of their PDF files in a database. This API makes it easy to extract all the words from a PDF file together with their positions on the page (Javadoc).
ExtractTextAsWordlist extract=new ExtractTextAsWordlist("C:/pdfs/mypdf.pdf");
//extract.setPassword("password");
if (extract.openPDFFile()) {
    int pageCount=extract.getPageCount();
    for (int page=1; page<=pageCount; page++) {
        List wordList=extract.getWordsOnPage(page);
    }
}
extract.closePDFfile();
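To illustrate the pre-indexing use case, here is a rough sketch of an in-memory index that maps each word to the pages and positions where it occurs. It assumes the list is flat, with every word followed by four coordinate values (x1, y1, x2, y2); that layout is an assumption made for this example and should be checked against the Javadoc. `WordIndex` and `Hit` are hypothetical names, not JPedal classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class WordIndex {

    // One occurrence of a word: the page number plus its bounding box.
    static final class Hit {
        final int page;
        final float x1, y1, x2, y2;

        Hit(int page, float x1, float y1, float x2, float y2) {
            this.page = page;
            this.x1 = x1;
            this.y1 = y1;
            this.x2 = x2;
            this.y2 = y2;
        }
    }

    private final Map<String, List<Hit>> hits = new TreeMap<String, List<Hit>>();

    // Adds one page of words.
    // Assumed layout: word, x1, y1, x2, y2, word, x1, y1, x2, y2, ...
    void addPage(int page, List<?> wordList) {
        if (wordList.size() % 5 != 0) {
            throw new IllegalArgumentException("Expected groups of five values");
        }
        for (int i = 0; i < wordList.size(); i += 5) {
            String word = String.valueOf(wordList.get(i)).toLowerCase(Locale.ROOT);
            float x1 = Float.parseFloat(String.valueOf(wordList.get(i + 1)));
            float y1 = Float.parseFloat(String.valueOf(wordList.get(i + 2)));
            float x2 = Float.parseFloat(String.valueOf(wordList.get(i + 3)));
            float y2 = Float.parseFloat(String.valueOf(wordList.get(i + 4)));
            List<Hit> list = hits.get(word);
            if (list == null) {
                list = new ArrayList<Hit>();
                hits.put(word, list);
            }
            list.add(new Hit(page, x1, y1, x2, y2));
        }
    }

    // Case-insensitive lookup; returns an empty list for unknown words.
    List<Hit> lookup(String word) {
        List<Hit> list = hits.get(word.toLowerCase(Locale.ROOT));
        return list == null ? Collections.<Hit>emptyList() : list;
    }

    public static void main(String[] args) {
        WordIndex index = new WordIndex();
        index.addPage(1, Arrays.asList("Hello", "10", "700", "40", "688"));
        for (Hit hit : index.lookup("hello")) {
            System.out.println("page " + hit.page + " at " + hit.x1 + "," + hit.y1);
        }
    }
}
```

In a real application the map would be replaced by inserts into a database table, but the per-page loop stays the same.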
Extract Document outline from PDF files
PDF files often contain a Document outline to provide a Table of Contents (Javadoc).
ExtractOutline extract=new ExtractOutline("C:/pdfs/mypdf.pdf");
//extract.setPassword("password");
if (extract.openPDFFile()) {
    Document pdfOutline=extract.getPDFTextOutline();
}
extract.closePDFfile();
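Once the outline is available as a Document, it can be turned into a plain indented Table of Contents. The sketch below assumes the Document is an `org.w3c.dom.Document` in which each outline entry is an element carrying a `title` attribute, with child entries nested inside their parent. Both the attribute name and the nesting are assumptions made for illustration, and `OutlinePrinter` is a hypothetical helper rather than part of JPedal.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

class OutlinePrinter {

    // Returns one line per outline entry, indented two spaces per nesting level.
    static List<String> flatten(Document outline) {
        List<String> lines = new ArrayList<String>();
        walk(outline, 0, lines);
        return lines;
    }

    // Depth-first walk; only elements with a "title" attribute count as
    // entries (the attribute name is an assumption).
    private static void walk(Node node, int depth, List<String> lines) {
        int childDepth = depth;
        if (node instanceof Element) {
            Element element = (Element) node;
            if (element.hasAttribute("title")) {
                StringBuilder line = new StringBuilder();
                for (int i = 0; i < depth; i++) {
                    line.append("  ");
                }
                lines.add(line.append(element.getAttribute("title")).toString());
                childDepth = depth + 1;
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            walk(child, childDepth, lines);
        }
    }

    // Builds a small stand-in outline; element names are illustrative only.
    static Document sample() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element root = doc.createElement("root");
            doc.appendChild(root);
            Element chapter1 = doc.createElement("title");
            chapter1.setAttribute("title", "Chapter 1");
            root.appendChild(chapter1);
            Element section = doc.createElement("title");
            section.setAttribute("title", "Section 1.1");
            chapter1.appendChild(section);
            Element chapter2 = doc.createElement("title");
            chapter2.setAttribute("title", "Chapter 2");
            root.appendChild(chapter2);
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("Could not create document", e);
        }
    }

    public static void main(String[] args) {
        for (String line : flatten(sample())) {
            System.out.println(line);
        }
    }
}
```

Elements without a `title` attribute, such as the root in the sample, are skipped without adding an indentation level.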