This tutorial shows you how to extract text from a PDF file in a few simple steps using the JPedal Java PDF library, a PDF library aimed at Java developers. It covers the different forms of text you can extract (unstructured text, structured text, a wordlist and the document outline) and the Java code for each of these variations.
How to extract Unstructured Text from a PDF file
- Download the JPedal trial jar.
- Create a File handle, InputStream or URL pointing to the PDF file
- Include a password if the file is password protected
- Open the PDF file
- Iterate over the pages to extract the text
- Close the PDF file
The Java code to extract unstructured text from a PDF:
ExtractTextInRectangle extract = new ExtractTextInRectangle("C:/pdfs/mypdf.pdf");
//extract.setPassword("password");
if (extract.openPDFFile()) {
    int pageCount = extract.getPageCount();
    // Pages are numbered from 1
    for (int page = 1; page <= pageCount; page++) {
        String text = extract.getTextOnPage(page);
    }
}
extract.closePDFfile();
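The loop above reads each page's text into a local variable and then discards it. As a minimal sketch of what you might do with it, the helper below joins the text of all pages into one string with a marker line per page. It uses only the JDK: the `IntFunction<String>` parameter is a hypothetical stand-in for a per-page call such as `extract.getTextOnPage(page)`, and the class name, the marker format and the handling of empty pages are assumptions made for illustration, not part of JPedal.

```java
import java.util.function.IntFunction;

/** Sketch: joins per-page text into a single string with a marker line per page. */
final class PageTextCollector {

    /**
     * @param pageCount number of pages (pages are numbered from 1)
     * @param pageText  hypothetical stand-in for a call such as extract.getTextOnPage(page)
     */
    static String collect(int pageCount, IntFunction<String> pageText) {
        StringBuilder all = new StringBuilder();
        for (int page = 1; page <= pageCount; page++) {
            String text = pageText.apply(page);
            // Assumption: a page without text yields null or an empty string
            if (text == null || text.isEmpty()) {
                continue;
            }
            if (all.length() > 0) {
                all.append('\n');
            }
            all.append("--- page ").append(page).append(" ---\n");
            all.append(text);
        }
        return all.toString();
    }

    public static void main(String[] args) {
        // Fake pages stand in for a real PDF so the sketch runs without JPedal
        String[] fakePages = {"First page text", "", "Third page text"};
        System.out.println(collect(fakePages.length, page -> fakePages[page - 1]));
    }
}
```

With a real file you would pass something like `page -> extract.getTextOnPage(page)` inside the `openPDFFile()` block, wrapping the call if it throws a checked exception.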
Below is an example of original PDF vs extracted unstructured text:
How to extract Structured Text from a tagged PDF file
- Download the JPedal trial jar.
- Choose the output format
- Create a File handle, InputStream or URL pointing to the PDF file
- Include a password if the file is password protected
- Open the PDF file
- Extract the Document text
- Close the PDF file
The Java code to extract structured text from a tagged PDF:
ExtractStructuredTextProperties properties = new ExtractStructuredTextProperties();
// Choose the output format: XML here, HTML as the alternative
properties.setFileOutputMode(OutputModes.XML);
//properties.setFileOutputMode(OutputModes.HTML);
ExtractStructuredText extract = new ExtractStructuredText("C:/pdfs/mypdf.pdf", properties);
//extract.setPassword("password");
if (extract.openPDFFile()) {
    Document anyStructuredText = extract.getStructuredTextContent();
}
extract.closePDFfile();
For demonstration purposes, I’ve added a simple check to see whether any structured text exists in my sample PDF.
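The check itself does not appear in the snippet above. One way it might look is sketched below using only the JDK's DOM API. It assumes that `Document` is `org.w3c.dom.Document` and that a PDF without structured text yields either `null` or a root element with no children. Both are assumptions to verify against the JPedal documentation. The `sample` method and its element names are hypothetical and exist only so the sketch runs without a PDF.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

/** Sketch: decides whether an extracted DOM Document holds any content. */
final class StructuredTextCheck {

    /** Returns true if the document has a root element with at least one child node. */
    static boolean hasContent(Document doc) {
        if (doc == null) {
            return false; // assumption: null may signal "nothing extracted"
        }
        Element root = doc.getDocumentElement();
        return root != null && root.hasChildNodes();
    }

    /** Builds a hypothetical stand-in document; element names are illustrative only. */
    static Document sample(boolean withContent) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element root = doc.createElement("Document");
            doc.appendChild(root);
            if (withContent) {
                Element paragraph = doc.createElement("P");
                paragraph.setTextContent("Hello");
                root.appendChild(paragraph);
            }
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Empty document has content: " + hasContent(sample(false)));
        System.out.println("Tagged document has content: " + hasContent(sample(true)));
    }
}
```

In the extraction code you would call `StructuredTextCheck.hasContent(anyStructuredText)` right after `getStructuredTextContent()`.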
How to extract a Wordlist from a PDF file
- Download the JPedal trial jar.
- Create a File handle, InputStream or URL pointing to the PDF file
- Include a password if the file is password protected
- Open the PDF file
- Iterate over the pages to extract the text
- Close the PDF file
The Java code to extract a wordlist from a PDF:
ExtractTextAsWordlist extract = new ExtractTextAsWordlist("C:/pdfs/mypdf.pdf");
//extract.setPassword("password");
if (extract.openPDFFile()) {
    int pageCount = extract.getPageCount();
    // Pages are numbered from 1
    for (int page = 1; page <= pageCount; page++) {
        List<String> wordList = extract.getWordsOnPage(page);
    }
}
extract.closePDFfile();
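The snippet above stops at obtaining the list for each page. The sketch below shows one way to regroup such a list, under the assumption that it is flat and stores each word followed by four coordinate values as strings (word, x1, y1, x2, y2). Please confirm that layout against the JPedal documentation before relying on it. The `WordListParser` and `Word` names are hypothetical, and the sample values are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch: regroups a flat word list into word-plus-rectangle entries. */
final class WordListParser {

    /** One word and its bounding rectangle (the coordinate order is an assumption). */
    static final class Word {
        final String text;
        final float x1, y1, x2, y2;

        Word(String text, float x1, float y1, float x2, float y2) {
            this.text = text;
            this.x1 = x1;
            this.y1 = y1;
            this.x2 = x2;
            this.y2 = y2;
        }
    }

    /** Reads the flat list five values at a time: word, x1, y1, x2, y2. */
    static List<Word> parse(List<String> flat) {
        if (flat.size() % 5 != 0) {
            throw new IllegalArgumentException("Expected groups of 5 values, got " + flat.size());
        }
        List<Word> words = new ArrayList<>();
        for (int i = 0; i < flat.size(); i += 5) {
            words.add(new Word(
                    flat.get(i),
                    Float.parseFloat(flat.get(i + 1)),
                    Float.parseFloat(flat.get(i + 2)),
                    Float.parseFloat(flat.get(i + 3)),
                    Float.parseFloat(flat.get(i + 4))));
        }
        return words;
    }

    public static void main(String[] args) {
        // Made-up values standing in for the list returned for one page
        List<String> sample = Arrays.asList("Hello", "72", "700", "100", "688");
        for (Word w : parse(sample)) {
            System.out.println(w.text + " at (" + w.x1 + ", " + w.y1 + ")");
        }
    }
}
```

Rejecting lists whose size is not a multiple of five makes a wrong layout assumption fail loudly instead of silently misaligning words and coordinates.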
Below is an example of original PDF vs extracted wordlist:
How to extract the Document outline from a PDF file
- Download the JPedal trial jar.
- Create a File handle, InputStream or URL pointing to the PDF file
- Include a password if the file is password protected
- Open the PDF file
- Extract the document outline
- Close the PDF file
The Java code to extract the document outline from a PDF:
ExtractOutline extract = new ExtractOutline("C:/pdfs/mypdf.pdf");
//extract.setPassword("password");
if (extract.openPDFFile()) {
    Document pdfOutline = extract.getPDFTextOutline();
}
extract.closePDFfile();
For demonstration purposes, I’ve added a simple check to see whether the outline has been extracted from my sample PDF.
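A simple way to inspect what came back is to serialize the outline to a string and print it. The sketch below does this with the JDK's `javax.xml.transform` API, assuming the outline is an `org.w3c.dom.Document` and that a file without an outline may yield `null`. That second point is an assumption, not confirmed behaviour. The `OutlinePrinter` class and the element and attribute names in `sample` are hypothetical and only there to make the example runnable without a PDF.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

/** Sketch: turns a DOM Document into an XML string for inspection. */
final class OutlinePrinter {

    /** Returns the document as XML text, or an empty string if there is no document. */
    static String toXml(Document doc) {
        if (doc == null) {
            return ""; // assumption: null may mean the PDF has no outline
        }
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            transformer.setOutputProperty(OutputKeys.INDENT, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Builds a hypothetical one-entry outline; names are illustrative only. */
    static Document sample() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element root = doc.createElement("root");
            Element entry = doc.createElement("title");
            entry.setAttribute("title", "Chapter 1");
            root.appendChild(entry);
            doc.appendChild(root);
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toXml(sample()));
    }
}
```

In the extraction code you would call `OutlinePrinter.toXml(pdfOutline)` after `getPDFTextOutline()` and treat an empty string as "no outline found".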