How to extract Structured text from PDF files in Java (Tutorial)

TL;DR:

PDFs use complex binary/compressed data that standard text editors can’t read. To inspect the internal structure, use JPedal (for debugging content streams), RUPS (for a visual object hierarchy), or PDFXplorer (for Windows-based JavaScript/image extraction).

Why Structure Matters in PDF Text Extraction

Developers hoping to extract content from PDF documents whilst maintaining the structure of the text should follow this tutorial. Some (but not all) PDF files contain text content which can be extracted in a structured format, retaining paragraphs and other layout and formatting information.

Step-by-Step Extraction Guide

Initial Setup and File Preparation

Download the JPedal trial jar
Choose the output format
Create a File handle, InputStream or URL pointing to the PDF file
Include a password if the file is password protected

The Extraction Process

Open the PDF file
Extract the Document text
Close the PDF file

Java code to extract Structured Text…

// Choose the output format: XML (used here) or HTML
ExtractStructuredTextProperties properties = new ExtractStructuredTextProperties();
properties.setFileOutputMode(OutputModes.XML);
//properties.setFileOutputMode(OutputModes.HTML);

// Point the extractor at the PDF file
ExtractStructuredText extract = new ExtractStructuredText("C:/pdfs/mypdf.pdf", properties);

// Uncomment if the file is password protected
//extract.setPassword("password");

if (extract.openPDFFile()) {
    // Holds any structured text found in the file
    Document anyStructuredText = extract.getStructuredTextContent();
}

extract.closePDFfile();
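
Once the content has been extracted, you will usually want to print or save it. The sketch below assumes the Document type in the snippet above is a standard org.w3c.dom.Document (check the JPedal Javadoc to confirm) and serialises it with the JDK's own javax.xml.transform classes. The class name DomToString, the helper methods and the element names are hypothetical; the hand-built sample document only stands in for real extraction output.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class DomToString {

    /** Serialises any W3C DOM document to an XML string without the XML declaration. */
    static String toXml(Document doc) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Could not serialise the document", e);
        }
    }

    /** Builds a tiny stand-in document; real content would come from the extraction call. */
    static Document sampleDocument() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element root = doc.createElement("Document");
            Element paragraph = doc.createElement("P");
            paragraph.setTextContent("Hello structured text");
            root.appendChild(paragraph);
            doc.appendChild(root);
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("Could not create a DOM builder", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toXml(sampleDocument()));
    }
}
```

In real use you would pass the extracted document to toXml instead of the sample one.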

Understanding PDF Structure and Compatibility

How do I know if a PDF file contains Structured text?

You can find out whether it is present by reading this blog post.
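
As a rough first check that needs no PDF library, you can also look for the /StructTreeRoot key, which a PDF with a structure tree declares in its document catalog. The sketch below is only a heuristic. In newer PDFs the catalog may sit inside a compressed object stream, in which case the key is not visible in the plain bytes. Finding the key also does not guarantee that the structure is complete or useful. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

final class StructureTreeCheck {

    /** Returns true if the raw bytes mention a structure tree root key. */
    static boolean mentionsStructTree(byte[] pdfBytes) {
        // ISO-8859-1 maps every byte to exactly one char, so binary data survives the conversion.
        String raw = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        return raw.contains("/StructTreeRoot");
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("Usage: java StructureTreeCheck.java <file.pdf>");
            return;
        }
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(mentionsStructTree(bytes)
                ? "Structure tree key found"
                : "No structure tree key visible in the uncompressed bytes");
    }
}
```

A negative result is inconclusive, so treat it as a prompt to inspect the file with a proper PDF tool, not as proof that structure is absent.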

What is Structured text?

When Adobe created the PDF file format, it was designed as an end file format, not one for editing and reusing. It works like a vector graphics file rather than a text document, so it contains ‘draw’ commands for images, text and shapes but no details of structure: there are no styles, no line or paragraph markers, and often not even spaces.

It looks perfect, but the structure is added by your brain as it looks at the display; there is nothing in the file.
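
To make this concrete, the sketch below pulls the literal strings out of a small, hand-written content stream. The stream uses real PDF operators (BT/ET to delimit text, Tf to select a font, Td to move the text position, Tj to show a string), but the regular expression is deliberately simplistic: it ignores TJ arrays, hex strings, escape sequences and font encodings. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DrawCommandDemo {

    // Matches a simple literal string followed by the Tj (show text) operator.
    private static final Pattern SHOW_TEXT = Pattern.compile("\\(([^()\\\\]*)\\)\\s*Tj");

    /** Returns the strings shown by Tj operators, in the order they appear in the stream. */
    static List<String> shownStrings(String contentStream) {
        List<String> result = new ArrayList<>();
        Matcher matcher = SHOW_TEXT.matcher(contentStream);
        while (matcher.find()) {
            result.add(matcher.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        // Two words on one line, then a second line 20 units lower.
        String stream = "BT /F1 12 Tf 72 700 Td (Hello) Tj 40 0 Td (World) Tj 0 -20 Td (Next line) Tj ET";
        System.out.println(shownStrings(stream)); // prints [Hello, World, Next line]
    }
}
```

Nothing in the stream says that "Hello" and "World" belong to the same line or that a space separates them. That relationship exists only in the coordinates passed to Td.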

To understand why this happens, you can consult the Adobe PDF Reference which explains the framework needed to transform a collection of “draw” commands into a readable document.

The Role of Marked Content

It turned out that lots of people wanted to extract text from PDF files and were very disappointed by what they got back.

Adobe therefore extended the spec so that extra metadata can be added to the file to preserve this information and make it easy to retrieve.

This is called Marked Content and the results are very good, but it needs to be added into the PDF when it is created.

For developers working in sectors that require high compliance (like government or legal), it is also worth referencing PDF/UA (ISO 14289). This is the international standard for “Universal Accessibility,” ensuring that the structured text you extract is accessible to all users and assistive technologies.

FAQs

Q: Can I extract structured text if the PDF wasn’t created with “Marked Content”?

A: Unfortunately, no. If the PDF was not originally generated with the necessary structural metadata, the file only contains raw “draw” commands. In these cases, you would need to use a content grouping or layout engine to “guess” the structure, rather than extracting it directly.
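
To give a feel for what such “guessing” involves, here is a minimal sketch that groups positioned text fragments into lines. It is not how JPedal or any particular library does it. The Fragment type, the method name and both thresholds are hypothetical, and a real engine would also have to handle fonts, rotation, columns and tables.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class LineGrouper {

    /** A piece of text with its left edge, right edge and baseline (hypothetical type). */
    static final class Fragment {
        final String text;
        final double x;
        final double right;
        final double baseline;

        Fragment(String text, double x, double right, double baseline) {
            this.text = text;
            this.x = x;
            this.right = right;
            this.baseline = baseline;
        }
    }

    /** Groups fragments with similar baselines into lines and inserts spaces at wide gaps. */
    static List<String> groupIntoLines(List<Fragment> fragments, double baselineTolerance, double spaceGap) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        // PDF y coordinates grow upwards, so the highest baseline is the first line.
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.baseline).reversed());

        List<List<Fragment>> lines = new ArrayList<>();
        for (Fragment f : sorted) {
            List<Fragment> current = lines.isEmpty() ? null : lines.get(lines.size() - 1);
            if (current != null && Math.abs(current.get(0).baseline - f.baseline) <= baselineTolerance) {
                current.add(f);
            } else {
                List<Fragment> fresh = new ArrayList<>();
                fresh.add(f);
                lines.add(fresh);
            }
        }

        List<String> result = new ArrayList<>();
        for (List<Fragment> line : lines) {
            line.sort(Comparator.comparingDouble(f -> f.x)); // left to right within a line
            StringBuilder text = new StringBuilder();
            Fragment previous = null;
            for (Fragment f : line) {
                if (previous != null && f.x - previous.right > spaceGap) {
                    text.append(' '); // a wide horizontal gap is treated as a word break
                }
                text.append(f.text);
                previous = f;
            }
            result.add(text.toString());
        }
        return result;
    }
}
```

Both thresholds are guesses that depend on font size and layout, which is exactly why this approach is less reliable than reading structure the creator stored in the file.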

Q: Does structured text extraction work on scanned documents?

A: Standard structured text extraction relies on internal metadata, which scanned documents (essentially just images) lack. To get structured text from a scan, you would first need to run Optical Character Recognition (OCR) to generate the text layer and then apply structure to those results.

Q: How does structured extraction handle multi-column layouts?

A: Because structured text uses “Marked Content,” it understands the reading order defined by the creator. Unlike basic text extraction, which might read across the page and mix sentences from two different columns, structured extraction follows the intended flow of the article or report.
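
The difference is easy to reproduce with a toy example. The sketch below orders four lines from a two-column page in two ways: naively (top to bottom, then left to right) and with a column boundary supplied by hand. The Line type, the method names and the boundary value are hypothetical. With structured text no such geometric guess is needed, because the order comes from the file itself.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class ColumnOrderDemo {

    /** A line of text with its left edge and baseline (hypothetical type). */
    static final class Line {
        final String text;
        final double x;
        final double baseline;

        Line(String text, double x, double baseline) {
            this.text = text;
            this.x = x;
            this.baseline = baseline;
        }
    }

    /** Naive order: top to bottom, then left to right, ignoring columns. */
    static List<String> naiveOrder(List<Line> lines) {
        List<Line> sorted = new ArrayList<>(lines);
        sorted.sort(Comparator.comparingDouble((Line l) -> l.baseline).reversed()
                .thenComparingDouble(l -> l.x));
        return texts(sorted);
    }

    /** Column-aware order: lines left of the boundary first, each column top to bottom. */
    static List<String> columnOrder(List<Line> lines, double columnBoundary) {
        List<Line> sorted = new ArrayList<>(lines);
        // Boolean keys sort false before true, so the left column comes first.
        sorted.sort(Comparator.comparing((Line l) -> l.x >= columnBoundary)
                .thenComparing(Comparator.comparingDouble((Line l) -> l.baseline).reversed()));
        return texts(sorted);
    }

    private static List<String> texts(List<Line> lines) {
        List<String> result = new ArrayList<>();
        for (Line line : lines) {
            result.add(line.text);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Line> page = new ArrayList<>();
        page.add(new Line("The cat", 50, 700));
        page.add(new Line("Stocks rose", 300, 700));
        page.add(new Line("sat down.", 50, 680));
        page.add(new Line("on Monday.", 300, 680));
        System.out.println(naiveOrder(page));       // prints [The cat, Stocks rose, sat down., on Monday.]
        System.out.println(columnOrder(page, 200)); // prints [The cat, sat down., Stocks rose, on Monday.]
    }
}
```

The naive order interleaves the two articles, while the column-aware order keeps each one intact.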
