Mark Stephens Mark has been working with Java and PDF since 1999 and is a big NetBeans fan. He enjoys speaking at conferences. He has an MA in Medieval History and a passion for reading.

How to extract text from PDF files in Java (Tutorial)

This tutorial shows you how to extract text from a PDF file in a few simple steps using the JPedal Java PDF library. It covers the different kinds of text you can extract and the Java code for each of these variations.

How to extract Unstructured Text from a PDF file

  1. Download the JPedal trial jar.
  2. Create a File handle, InputStream or URL pointing to the PDF file.
  3. Include a password if the file is password protected.
  4. Open the PDF file.
  5. Iterate over the pages to extract the text.
  6. Close the PDF file.

and the Java code to extract Unstructured text from PDF…
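The open, iterate, close flow in the steps above can be sketched without the library on the classpath. The `PageTextSource` interface and the `InMemorySource` class below are hypothetical stand-ins invented for illustration, not JPedal classes; the real extraction class and its method names should be taken from the JPedal documentation. Numbering pages from 1 is also an assumption here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class UnstructuredTextSketch {
    // Open the source, visit every page in order, and always close it,
    // even if reading a page throws partway through.
    static List<String> extractAllPages(PageTextSource source) {
        List<String> text = new ArrayList<>();
        try {
            if (source.open()) {
                for (int page = 1; page <= source.getPageCount(); page++) {
                    text.add(source.getTextOnPage(page));
                }
            }
        } finally {
            source.close();
        }
        return text;
    }

    public static void main(String[] args) {
        InMemorySource source = new InMemorySource("Page one text", "Page two text");
        for (String pageText : extractAllPages(source)) {
            System.out.println(pageText);
        }
    }
}

// Hypothetical stand-in for a PDF text extractor; NOT part of the JPedal API.
interface PageTextSource {
    boolean open();                  // false if the file cannot be opened
    int getPageCount();              // number of pages (assumed to be numbered from 1)
    String getTextOnPage(int page);  // unstructured text for one page
    void close();
}

// In-memory implementation so the sketch runs without a PDF or a library jar.
class InMemorySource implements PageTextSource {
    private final List<String> pages;
    boolean closed = false;

    InMemorySource(String... pages) {
        this.pages = Arrays.asList(pages);
    }

    public boolean open() { return true; }
    public int getPageCount() { return pages.size(); }
    public String getTextOnPage(int page) { return pages.get(page - 1); }
    public void close() { closed = true; }
}
```

The try/finally block mirrors the last step of the list: the file is closed whether or not every page was read successfully.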

Below is an example of original PDF vs extracted unstructured text:

How to extract Structured Text from a tagged PDF file

  1. Download the JPedal trial jar.
  2. Choose the output format.
  3. Create a File handle, InputStream or URL pointing to the PDF file.
  4. Include a password if the file is password protected.
  5. Open the PDF file.
  6. Extract the Document text.
  7. Close the PDF file.

Java code to extract Structured Text…

For demonstration purposes, I’ve added a simple check to see whether any structured text exists in my sample PDF.
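A check of this kind might look like the sketch below. It assumes the structured text is handed back as a W3C DOM `Document`, which is an assumption about the output format and not something confirmed here. The class `StructuredTextCheck` and its methods are illustrative names, and the sample XML is made up for the demo. Only standard JDK XML classes are used.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class StructuredTextCheck {
    // True when the document has a root element with at least one child node.
    static boolean hasStructuredText(Document doc) {
        return doc != null
                && doc.getDocumentElement() != null
                && doc.getDocumentElement().hasChildNodes();
    }

    // Serialize the DOM tree to an XML string without the XML declaration.
    static String toXml(Document doc) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialize document", e);
        }
    }

    // Demo helper: parse an XML string into a DOM Document.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample standing in for the content of a tagged PDF.
        Document tagged = parse("<Document><H1>Title</H1><P>Body text</P></Document>");
        // An empty root stands in for a PDF with no structure tags.
        Document untagged = parse("<Document/>");

        System.out.println("tagged has structured text: " + hasStructuredText(tagged));
        System.out.println("untagged has structured text: " + hasStructuredText(untagged));
        if (hasStructuredText(tagged)) {
            System.out.println(toXml(tagged));
        }
    }
}
```

Checking before serializing matters because an untagged PDF has no structure tree to read, so the result may be empty.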

How to extract a Wordlist from a PDF file

  1. Download the JPedal trial jar.
  2. Create a File handle, InputStream or URL pointing to the PDF file.
  3. Include a password if the file is password protected.
  4. Open the PDF file.
  5. Iterate over the pages to extract the text.
  6. Close the PDF file.

and the Java code to extract a wordlist text from PDF…
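Once a page's word list is in hand, it usually needs turning into something easier to work with. The sketch below assumes the list arrives as a flat `List<String>` in which every word is followed by four coordinate values (x1, y1, x2, y2). That layout is an assumption to verify against the JPedal documentation. The `WordListSketch` and `Word` names are illustrative, and the sample values are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class WordListSketch {
    // One word together with the two corners of its bounding box.
    static final class Word {
        final String text;
        final float x1, y1, x2, y2;

        Word(String text, float x1, float y1, float x2, float y2) {
            this.text = text;
            this.x1 = x1;
            this.y1 = y1;
            this.x2 = x2;
            this.y2 = y2;
        }
    }

    // Group a flat list (word, x1, y1, x2, y2, word, x1, ...) into Word objects.
    // The five-values-per-word layout is an assumption of this sketch.
    static List<Word> parse(List<String> flat) {
        if (flat.size() % 5 != 0) {
            throw new IllegalArgumentException("Expected 5 values per word, got " + flat.size());
        }
        List<Word> words = new ArrayList<>();
        for (int i = 0; i < flat.size(); i += 5) {
            words.add(new Word(
                    flat.get(i),
                    Float.parseFloat(flat.get(i + 1)),
                    Float.parseFloat(flat.get(i + 2)),
                    Float.parseFloat(flat.get(i + 3)),
                    Float.parseFloat(flat.get(i + 4))));
        }
        return words;
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for one page's word list.
        List<String> flat = Arrays.asList(
                "Hello", "10", "700", "40", "712",
                "World", "45", "700", "80", "712");
        for (Word w : parse(flat)) {
            System.out.println(w.text + " at (" + w.x1 + ", " + w.y1 + ")");
        }
    }
}
```

Keeping the coordinates with each word makes it possible to sort words into reading order or to filter them by region afterwards.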

Below is an example of original PDF vs extracted wordlist:

How to extract a Document outline from PDF files

  1. Download the JPedal trial jar.
  2. Create a File handle, InputStream or URL pointing to the PDF file.
  3. Include a password if the file is password protected.
  4. Open the PDF file.
  5. Extract the document outline.
  6. Close the PDF file.

and the Java code to extract a Document outline from PDF…

For demonstration purposes, I’ve added a simple check to see whether the outline has been extracted from my sample PDF.
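A document outline (the bookmark tree) is hierarchical, so a natural follow-up to the check is to walk it. The sketch below assumes the outline is available as a W3C DOM `Document` whose nested elements carry `title` and `page` attributes. Both the DOM representation and the attribute names are assumptions for illustration, and `OutlineSketch` is a hypothetical class. A missing outline is treated as an empty result.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class OutlineSketch {
    // Flatten the outline tree into indented lines; empty list if there is no outline.
    static List<String> flatten(Document outline) {
        List<String> lines = new ArrayList<>();
        if (outline == null || outline.getDocumentElement() == null) {
            return lines;
        }
        collect(outline.getDocumentElement(), 0, lines);
        return lines;
    }

    // Depth-first walk over child elements, indenting two spaces per nesting level.
    private static void collect(Element parent, int depth, List<String> lines) {
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() != Node.ELEMENT_NODE) {
                continue; // skip whitespace and other non-element nodes
            }
            Element entry = (Element) child;
            StringBuilder line = new StringBuilder();
            for (int d = 0; d < depth; d++) {
                line.append("  ");
            }
            // "title" and "page" are assumed attribute names for this sketch.
            line.append(entry.getAttribute("title"))
                .append(" (page ").append(entry.getAttribute("page")).append(")");
            lines.add(line.toString());
            collect(entry, depth + 1, lines);
        }
    }

    // Demo helper: parse an XML string into a DOM Document.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Made-up outline standing in for one extracted from a PDF.
        Document outline = parse(
                "<root>"
                + "<title title=\"Chapter 1\" page=\"1\">"
                + "<title title=\"Section 1.1\" page=\"2\"/>"
                + "</title>"
                + "<title title=\"Chapter 2\" page=\"5\"/>"
                + "</root>");
        List<String> lines = flatten(outline);
        if (lines.isEmpty()) {
            System.out.println("No outline found");
        }
        for (String line : lines) {
            System.out.println(line);
        }
    }
}
```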



The JPedal PDF library allows you to solve these problems in Java


Why do developers choose JPedal over alternatives?

  1. Actively developed commercial library with full support and no third-party dependencies.
  2. Simple licensing options and source code access for OEM users.
  3. Processes PDF files up to 3x faster than alternative Java PDF libraries.
