This tutorial shows you how to extract text from a PDF file in a few simple steps using the JPedal Java PDF library. It covers the different forms of text you can extract and the Java code for each.
How to extract Unstructured Text from a PDF file
- Download the JPedal trial jar.
- Create a File handle, InputStream or URL pointing to the PDF file
- Include a password if the file is password protected
- Open the PDF file
- Iterate over the pages to extract the text
- Close the PDF file
Here is the Java code to extract unstructured text from a PDF:
ExtractTextInRectangle extract = new ExtractTextInRectangle("C:/pdfs/mypdf.pdf");
//extract.setPassword("password");
if (extract.openPDFFile()) {
    int pageCount = extract.getPageCount();
    for (int page = 1; page <= pageCount; page++) {
        String text = extract.getTextOnPage(page);
    }
}
extract.closePDFfile();
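The loop above reads each page's text into a local variable and then discards it. As a minimal sketch of one way to keep the results, the helper below collects the text of every page into a list. `PageTextCollector` and `PageTextSource` are hypothetical names introduced for illustration and are not part of JPedal; the interface simply stands in for a per-page call such as `extract.getTextOnPage(page)`.

```java
import java.util.ArrayList;
import java.util.List;

/** Collects text page by page from any source that can return text for a page number. */
final class PageTextCollector {

    /** Hypothetical stand-in for a per-page call such as extract.getTextOnPage(page). */
    interface PageTextSource {
        String textFor(int page) throws Exception;
    }

    /**
     * Returns one entry per page. Pages are 1-based, as in the loop above.
     * A null page text is stored as an empty string, and any exception from
     * the source is rethrown as an unchecked IllegalStateException.
     */
    static List<String> collect(int pageCount, PageTextSource source) {
        List<String> pages = new ArrayList<>();
        for (int page = 1; page <= pageCount; page++) {
            try {
                String text = source.textFor(page);
                pages.add(text == null ? "" : text);
            } catch (Exception e) {
                throw new IllegalStateException("Could not read page " + page, e);
            }
        }
        return pages;
    }

    /** Joins the pages into one string, separated by a form feed character. */
    static String join(List<String> pages) {
        return String.join("\f", pages);
    }
}
```

Inside the `if (extract.openPDFFile())` block, a call along the lines of `PageTextCollector.collect(extract.getPageCount(), extract::getTextOnPage)` should then return the text of all pages, assuming the method shapes shown in the snippet above.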
Below is an example of original PDF vs extracted unstructured text:
How to extract Structured Text from a tagged PDF file
- Download the JPedal trial jar.
- Choose the output format
- Create a File handle, InputStream or URL pointing to the PDF file
- Include a password if the file is password protected
- Open the PDF file
- Extract the Document text
- Close the PDF file
Here is the Java code to extract structured text:
ExtractStructuredTextProperties properties = new ExtractStructuredTextProperties();
properties.setFileOutputMode(OutputModes.XML);
//properties.setFileOutputMode(OutputModes.HTML);
ExtractStructuredText extract = new ExtractStructuredText("C:/pdfs/mypdf.pdf", properties);
//extract.setPassword("password");
if (extract.openPDFFile()) {
    Document anyStructuredText = extract.getStructuredTextContent();
}
extract.closePDFfile();
For demonstration purposes, I've added a simple check to see whether my sample PDF contains any structured text.
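A check of this kind could look roughly like the sketch below. It assumes that the `Document` returned by `getStructuredTextContent()` is a standard `org.w3c.dom.Document`, and that a PDF without structured text yields either `null` or a root element with no children; both are assumptions to verify against the JPedal documentation. The class name `StructuredTextCheck` and the `parse` helper are hypothetical and exist only so the check can be tried without a PDF.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class StructuredTextCheck {

    /** True if the document has a root element with at least one child node. */
    static boolean hasContent(Document doc) {
        return doc != null
                && doc.getDocumentElement() != null
                && doc.getDocumentElement().hasChildNodes();
    }

    /** Serializes the document to an XML string (no XML declaration) for inspection. */
    static String toXml(Document doc) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialize document", e);
        }
    }

    /** Demo-only helper: builds a Document from an XML string. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Invalid XML", e);
        }
    }
}
```

With the extraction code above, `StructuredTextCheck.hasContent(anyStructuredText)` would be the check, and `toXml` gives a quick way to print what was extracted.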
How to extract Wordlist from a PDF file
- Download the JPedal trial jar.
- Create a File handle, InputStream or URL pointing to the PDF file
- Include a password if the file is password protected
- Open the PDF file
- Iterate over the pages to extract the text
- Close the PDF file
Here is the Java code to extract a wordlist from a PDF:
ExtractTextAsWordList extract = new ExtractTextAsWordList("C:/pdfs/mypdf.pdf");
//extract.setPassword("password");
if (extract.openPDFFile()) {
    int pageCount = extract.getPageCount();
    for (int page = 1; page <= pageCount; page++) {
        List wordList = extract.getWordsOnPage(page);
    }
}
extract.closePDFfile();
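To work with the returned list, it helps to turn it into typed objects. The sketch below assumes the list is flat, with each word followed by four coordinate values (word, x1, y1, x2, y2); that layout is an assumption here and should be confirmed against the JPedal documentation before relying on it. `WordListParser` and `Word` are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class WordListParser {

    /** One word with its bounding box. Field names are illustrative. */
    static final class Word {
        final String text;
        final float x1, y1, x2, y2;

        Word(String text, float x1, float y1, float x2, float y2) {
            this.text = text;
            this.x1 = x1;
            this.y1 = y1;
            this.x2 = x2;
            this.y2 = y2;
        }
    }

    /**
     * Groups a flat list into Word objects.
     * ASSUMPTION: entries come in groups of five: word, x1, y1, x2, y2.
     */
    static List<Word> parse(List<?> flat) {
        if (flat.size() % 5 != 0) {
            throw new IllegalArgumentException("Expected groups of 5 values, got " + flat.size());
        }
        List<Word> words = new ArrayList<>();
        for (int i = 0; i < flat.size(); i += 5) {
            words.add(new Word(
                    String.valueOf(flat.get(i)),
                    toFloat(flat.get(i + 1)),
                    toFloat(flat.get(i + 2)),
                    toFloat(flat.get(i + 3)),
                    toFloat(flat.get(i + 4))));
        }
        return words;
    }

    /** Converts a list entry to a float via its string form. */
    private static float toFloat(Object value) {
        return Float.parseFloat(String.valueOf(value));
    }
}
```

The size check fails fast if the list does not match the assumed layout, which is preferable to silently pairing words with the wrong coordinates.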
Below is an example of original PDF vs extracted wordlist:
How to extract the Document outline from a PDF file
- Download the JPedal trial jar.
- Create a File handle, InputStream or URL pointing to the PDF file
- Include a password if the file is password protected
- Open the PDF file
- Extract the document outline
- Close the PDF file
Here is the Java code to extract a Document outline from a PDF:
ExtractOutline extract = new ExtractOutline("C:/pdfs/mypdf.pdf");
//extract.setPassword("password");
if (extract.openPDFFile()) {
    Document pdfOutline = extract.getPDFTextOutline();
}
extract.closePDFfile();
For demonstration purposes, I've added a simple check to see whether the outline has been extracted from my sample PDF.
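One simple way to inspect an extracted outline is to print its element tree with indentation. The sketch below assumes the outline is a standard `org.w3c.dom.Document`; it makes no assumption about JPedal's element or attribute names and just prints whatever it finds. `OutlineDump` and its `parse` helper are hypothetical names, with `parse` included only so the code can be tried without a PDF.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class OutlineDump {

    /** Returns one line per element, indented two spaces per nesting level. */
    static String dump(Document doc) {
        StringBuilder out = new StringBuilder();
        if (doc != null && doc.getDocumentElement() != null) {
            append(doc.getDocumentElement(), 0, out);
        }
        return out.toString();
    }

    /** Appends the element's name and attributes, then recurses into child elements. */
    private static void append(Node node, int depth, StringBuilder out) {
        if (node.getNodeType() != Node.ELEMENT_NODE) {
            return; // skip text, comments and other non-element nodes
        }
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
        out.append(node.getNodeName());
        NamedNodeMap attributes = node.getAttributes();
        for (int i = 0; i < attributes.getLength(); i++) {
            Node attribute = attributes.item(i);
            out.append(' ').append(attribute.getNodeName())
               .append('=').append(attribute.getNodeValue());
        }
        out.append('\n');
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            append(child, depth + 1, out);
        }
    }

    /** Demo-only helper: builds a Document from an XML string. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Invalid XML", e);
        }
    }
}
```

An empty string from `OutlineDump.dump(pdfOutline)` would indicate that no outline was extracted, under the assumptions stated above.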
The JPedal PDF library lets you solve these problems in Java:
//View a PDF
Viewer viewer = new Viewer();
viewer.setupViewer();
viewer.executeCommand(ViewerCommands.OPENFILE, "pdfFile.pdf");
//Extract clipped images
//Convenience static method (see class for additional options)
ExtractClippedImages.writeAllClippedImagesToDir("inputFileOrDirectory", "outputDir", "outputImageFormat", new String[] {"imageHeightAsFloat", "subDirectoryForHeight"});
//Extract text as wordlists
//Convenience static method (see class for additional options)
ExtractTextAsWordList.writeAllWordlistsToDir("inputFileOrDirectory", "outputDir", -1);
//Search for text
//Convenience static method (see class for additional options)
ArrayList resultsForPages = FindTextInRectangle.findTextOnAllPages("/path/to/file.pdf", "textToFind");
//Print a PDF
PrintPdfPages print = new PrintPdfPages("C:/pdfs/mypdf.pdf");
if (print.openPDFFile()) {
    print.printAllPages("Printer Name");
}
Why do developers choose JPedal over alternatives?
- Actively developed commercial library with full support and no third-party dependencies.
- Simple licensing options and source code access for OEM users.
- Process PDF files up to 3x faster than alternative Java PDF libraries.