Mark Stephens Mark has been working with Java and PDF since 1999 and is a big NetBeans fan. He enjoys speaking at conferences. He has an MA in Medieval History and a passion for reading.

How to extract Structured text from PDF files in Java (Tutorial)


Developers hoping to extract content from PDF documents whilst maintaining the structure of the text should follow this tutorial. Some (but not all) PDF files contain text content which can be extracted in a structured format, retaining paragraphs and other layout and formatting information.

How to extract Structured text from PDF files in Java

  1. Download the JPedal trial jar
  2. Choose the output format
  3. Create a File handle, InputStream or URL pointing to the PDF file
  4. Include a password if the file is password protected
  5. Open the PDF file
  6. Extract the document text
  7. Close the PDF file

Java code to extract Structured Text…

How do I know if a PDF file contains Structured text?

You can find out if it is present by reading this blog post.
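For a quick first look before reaching for a full PDF library, a rough heuristic is to search the raw bytes of the file for the /StructTreeRoot key, which is the document catalog entry that points at the structure tree in a tagged PDF. The sketch below is only that, a heuristic: the class and method names (StructureHint, mentionsStructTree, fileMentionsStructTree) are made up for this illustration and are not part of JPedal, and the check can miss the key when the catalog is stored inside a compressed object stream. A hit also says nothing about how good the structure is.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for illustration only; not part of JPedal.
final class StructureHint {

    // Returns true if the raw bytes contain the /StructTreeRoot key.
    // Assumption: the document catalog is stored uncompressed, so the key is visible as plain bytes.
    static boolean mentionsStructTree(byte[] pdfBytes) {
        // ISO-8859-1 maps every byte to exactly one char, so binary data survives the conversion
        String raw = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        return raw.contains("/StructTreeRoot");
    }

    // Convenience wrapper that reads the whole file into memory first.
    static boolean fileMentionsStructTree(Path pdf) throws IOException {
        return mentionsStructTree(Files.readAllBytes(pdf));
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("Usage: java StructureHint <file.pdf>");
            return;
        }
        System.out.println(fileMentionsStructTree(Paths.get(args[0]))
                ? "Found /StructTreeRoot: the file may contain structured text"
                : "No /StructTreeRoot found in the raw bytes (it may still be hidden in a compressed object)");
    }
}
```

Treat a negative result as "unknown" rather than "no": a proper check needs a PDF parser that can read the document catalog.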

What is Structured text?

When Adobe created the PDF file format, it was designed as a final output format, not one for editing and reuse. It works like a vector graphics file rather than a text document: it contains ‘draw’ commands for images, text and shapes, but no details of structure. There are no styles, no line or paragraph markers, and often not even spaces. The page looks perfect, but the structure is supplied by your brain as you look at the display; there is nothing in the file.
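To make the ‘draw’ commands idea concrete, here is a small sketch that pulls the text out of a hand-written page content stream. BT, ET, Tf, Td and Tj are real PDF operators (begin text, end text, set font, move position, show text), but the sample stream, the class name DrawCommandDemo and its method are invented for this illustration. The regular expression assumes simple strings with no escaped or nested brackets, and real PDF files usually compress their content streams and use font encodings, so this is not a working extractor.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for illustration only; not part of JPedal.
final class DrawCommandDemo {

    // Hand-written sample: three strings, each positioned with Td and drawn with Tj.
    static final String SAMPLE =
            "BT\n"
          + "/F1 12 Tf\n"
          + "72 700 Td (Hello) Tj\n"      // move to x=72, y=700 and draw "Hello"
          + "40 0 Td (World) Tj\n"        // shift 40 units right and draw "World"
          + "-40 -14 Td (Next line) Tj\n" // shift back left and 14 units down
          + "ET\n";

    // Matches a simple (string) operand followed by the Tj operator.
    // Assumption: no escaped or nested brackets inside the string.
    private static final Pattern SHOW_TEXT = Pattern.compile("\\(([^()\\\\]*)\\)\\s*Tj");

    // Returns the strings drawn by Tj, in the order they appear in the stream.
    static List<String> shownStrings(String contentStream) {
        List<String> out = new ArrayList<>();
        Matcher m = SHOW_TEXT.matcher(contentStream);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        // Joining the strings as stored shows what is (and is not) in the file
        System.out.println(String.join("", shownStrings(SAMPLE)));
    }
}
```

Running this prints HelloWorldNext line. The gap between the two words and the line break exist only as coordinates in the Td commands, so any extractor has to guess them back from the positions.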

It turned out that lots of people wanted to extract text from PDF files and were very disappointed by what they got back. So Adobe added some additional functionality into the spec so that you could add extra metadata into the file to preserve all this information and easily retrieve it. This is called Marked Content and the results are very good, but it needs to be added into the PDF when it is created.
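As a rough illustration of what Marked Content adds, the sketch below groups the drawn text by the marked-content blocks that wrap it. BDC and EMC are the real PDF operators that open and close a marked-content sequence, and MCID is the real key that links a block to the structure tree. Everything else is an assumption made for the example: the sample stream is hand-written, the class MarkedContentDemo and its method are invented and not part of JPedal, and the regular expressions only cope with this simplified layout. In a real tagged PDF the authoritative element types live in the structure tree, not in the content stream tags.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for illustration only; not part of JPedal.
final class MarkedContentDemo {

    // Hand-written sample: a heading and a paragraph, each wrapped in BDC ... EMC.
    static final String SAMPLE =
            "/H1 <</MCID 0>> BDC\n"
          + "BT /F1 18 Tf 72 720 Td (Structured text) Tj ET\n"
          + "EMC\n"
          + "/P <</MCID 1>> BDC\n"
          + "BT /F1 12 Tf 72 690 Td (Hello ) Tj (World) Tj ET\n"
          + "EMC\n";

    // Captures the tag name, the MCID number and everything up to the next EMC.
    // Assumption: blocks are not nested and "EMC" never appears inside a string.
    private static final Pattern MARKED = Pattern.compile(
            "/(\\w+)\\s*<<\\s*/MCID\\s+(\\d+)\\s*>>\\s*BDC(.*?)EMC", Pattern.DOTALL);

    // Matches a simple (string) operand followed by Tj; no escaped or nested brackets.
    private static final Pattern SHOW_TEXT = Pattern.compile("\\(([^()\\\\]*)\\)\\s*Tj");

    // Returns "TAG: text" for each marked-content block, keyed by MCID in stream order.
    static Map<Integer, String> textByMcid(String contentStream) {
        Map<Integer, String> result = new LinkedHashMap<>();
        Matcher block = MARKED.matcher(contentStream);
        while (block.find()) {
            StringBuilder text = new StringBuilder();
            Matcher show = SHOW_TEXT.matcher(block.group(3));
            while (show.find()) {
                text.append(show.group(1));
            }
            result.put(Integer.parseInt(block.group(2)), block.group(1) + ": " + text);
        }
        return result;
    }

    public static void main(String[] args) {
        textByMcid(SAMPLE).forEach((mcid, text) -> System.out.println(mcid + " -> " + text));
    }
}
```

With the sample above, the heading and the paragraph come back as separate, labelled blocks, which is the kind of information that plain draw commands cannot provide.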


