How do I check whether a PDF is tagged before I run extraction?

Read the structure tree root. In JPedal the extractor returns empty output on an untagged file; in PDFBox, getDocumentCatalog().getStructureTreeRoot() returns null. A null root means the file has no logical structure to extract, so an empty result is correct rather than a bug.

Which output format should I use for an AI or RAG pipeline?

OutputModes.MARKDOWN and OutputModes.JSON both preserve the heading and table hierarchy in a shape that drops directly into a RAG workflow, so either is the practical choice for feeding a language model. ePUB, HTML, XML and YAML are also available for other targets.

How to extract Structured text from PDF files in Java (Tutorial)


TL;DR

Structured text extraction only works on tagged PDFs, the ones that carry an internal structure tree describing headings, paragraphs, lists and tables. If a PDF is not tagged (most scanned files and a lot of older documents), there is no structure to pull out and you get an empty result. JPedal extracts the structure in one call and writes it as ePUB, HTML, JSON, Markdown, XML or YAML. Apache PDFBox is free and can reach the same data, but it hands you the raw structure tree and marked content and leaves you to assemble the output yourself.

Why Structure Matters in PDF Text Extraction

A PDF was never designed as an editable text document. Internally it holds draw commands for text, images and shapes, positioned by coordinates. There are no paragraph markers, no reading order, often not even spaces between words.

Later, Adobe added a way to embed the missing information: marked content and, on top of it, a logical structure tree, which together make up a tagged PDF. The tree works much like HTML, labelling each run of text with a tag such as H1, P or Table. When that tree is present, extraction is accurate and the reading order is the one the author intended. When it is absent, there is nothing to extract structurally, and any tool will either return empty output or fall back to guessing layout from coordinates.
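To make the HTML comparison concrete, here is a minimal sketch of a structure tree being rendered to Markdown. The Node record and the StructureTreeSketch class are hypothetical names invented for this illustration and are not JPedal or PDFBox types. Only three tags are handled, so treat it as a picture of the idea rather than a converter.

```java
import java.util.List;

// Hypothetical model of a tagged PDF's structure tree; not a JPedal or PDFBox type.
final class StructureTreeSketch {

    /** One structure element: a tag such as "H1" or "P", optional text, and child elements. */
    record Node(String tag, String text, List<Node> kids) {
        static Node leaf(String tag, String text) {
            return new Node(tag, text, List.of());
        }
    }

    /** Walks the tree depth-first, which is the order the author tagged the content in. */
    static String toMarkdown(Node node) {
        StringBuilder out = new StringBuilder();
        render(node, out);
        return out.toString();
    }

    private static void render(Node node, StringBuilder out) {
        switch (node.tag()) {
            case "H1" -> out.append("# ").append(node.text()).append("\n\n");
            case "H2" -> out.append("## ").append(node.text()).append("\n\n");
            case "P" -> out.append(node.text()).append("\n\n");
            default -> { /* grouping tags such as Document or Sect carry no text of their own */ }
        }
        for (Node kid : node.kids()) {
            render(kid, out);
        }
    }

    public static void main(String[] args) {
        Node doc = new Node("Document", null, List.of(
                Node.leaf("H1", "Annual Report"),
                new Node("Sect", null, List.of(
                        Node.leaf("H2", "Summary"),
                        Node.leaf("P", "Revenue grew this year.")))));
        System.out.print(toMarkdown(doc));
    }
}
```

The output order falls out of the tree walk alone. No coordinates are consulted, which is why a tagged file can be extracted without guessing at layout.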

For developers working in high-compliance sectors such as government or legal, PDF/UA (ISO 14289) is also worth knowing. It is the international standard for PDF “Universal Accessibility”, and it requires the content of a conforming file to be tagged so that assistive technologies can read it. A file that claims PDF/UA conformance should therefore carry the structure tree that structured extraction depends on.

So before you debug your code, check whether the file is actually tagged. A tool returning nothing is usually correct, not broken.

Extract structured text with JPedal

Initial Setup and File Preparation

Download the JPedal trial jar.
Choose an output format.
Create a File handle, InputStream or URL pointing to the PDF file.
Include a password if the file is password protected.

The Extraction Process

Open the PDF file
Extract the Document text
Close the PDF file

Java code to extract Structured Text

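// Pick the output format before constructing the extractor
ExtractStructuredTextProperties properties = new ExtractStructuredTextProperties();
properties.setFileOutputMode(OutputModes.XML);
//properties.setFileOutputMode(OutputModes.HTML);
ExtractStructuredText extract = new ExtractStructuredText("C:/pdfs/mypdf.pdf", properties);
// Uncomment for a password-protected file
//extract.setPassword("password");
if (extract.openPDFFile()) {
    // On an untagged PDF there is no structure tree, so expect empty content here
    Document anyStructuredText = extract.getStructuredTextContent();
}
// Release the file once extraction has finished
extract.closePDFfile();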

JPedal can write structured text as ePUB, HTML, JSON, Markdown, XML or YAML. The full details are in the JPedal structured text extraction docs.

OutputModes.MARKDOWN and OutputModes.JSON are the ones worth knowing if the output is headed into an AI pipeline. Both preserve the heading and table hierarchy in a shape that drops straight into a RAG workflow. There are dedicated walkthroughs for PDF to Markdown and PDF to JSON if that is your target.
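As a rough illustration of why preserved headings help in a RAG workflow, the sketch below splits Markdown into one chunk per heading, so each chunk keeps the heading it belongs to. It is plain Java and does not call JPedal. MarkdownChunkerSketch and Chunk are hypothetical names, and the splitter is deliberately naive: it treats any line starting with # as a heading, so a # inside a code fence would be split incorrectly.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of JPedal.
final class MarkdownChunkerSketch {

    /** A retrieval chunk: the heading it sits under and the body text below that heading. */
    record Chunk(String heading, String body) {}

    /** Splits Markdown into one chunk per heading line (any line starting with '#'). */
    static List<Chunk> chunkByHeading(String markdown) {
        List<Chunk> chunks = new ArrayList<>();
        String heading = "";
        StringBuilder body = new StringBuilder();
        for (String line : markdown.split("\n")) {
            if (line.startsWith("#")) {
                addChunk(chunks, heading, body);
                heading = line.replaceFirst("^#+\\s*", "");
                body.setLength(0);
            } else if (!line.isBlank()) {
                if (body.length() > 0) {
                    body.append(' ');
                }
                body.append(line.trim());
            }
        }
        addChunk(chunks, heading, body);
        return chunks;
    }

    private static void addChunk(List<Chunk> chunks, String heading, StringBuilder body) {
        // Skip the empty preamble that precedes the first heading.
        if (!heading.isEmpty() || body.length() > 0) {
            chunks.add(new Chunk(heading, body.toString()));
        }
    }

    public static void main(String[] args) {
        String markdown = "# Intro\n\nFirst para.\n\n## Details\n\nSecond para.\nMore text.\n";
        for (Chunk chunk : chunkByHeading(markdown)) {
            System.out.println(chunk.heading() + " -> " + chunk.body());
        }
    }
}
```

Each chunk can then be embedded or indexed with its heading attached. That is only possible because the extracted output kept the heading structure in the first place.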

JPedal is pure Java with no native binaries and no third-party dependencies, and it needs Java 17 as a minimum.

Extracting structured text with open source (Apache PDFBox)

Apache PDFBox is the standard free option. It is Apache 2.0 licensed, so it is safe to use in commercial software without the copyleft strings that come with some alternatives. The trade-off is that PDFBox gives you the building blocks, not a finished structured document.

Add the dependency:

<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>3.0.8</version>
</dependency>

If you only need the words in roughly the right order and can live without a real hierarchy, PDFTextStripper is the quick answer. It works on any text-based PDF, tagged or not. This is plain text extraction, not structure:

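// Apache PDFBox 3.0.8
import java.io.File;
import org.apache.pdfbox.Loader;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

try (PDDocument doc = Loader.loadPDF(new File("C:/pdfs/mypdf.pdf"))) {
    PDFTextStripper stripper = new PDFTextStripper();
    stripper.setSortByPosition(true);   // best-effort reading order
    String text = stripper.getText(doc);
    System.out.println(text);
}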

To preserve the actual logical structure, you read the structure tree yourself. This is the part that JPedal does in one call. In PDFBox you walk PDStructureTreeRoot and print the tag of each element:

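import java.io.File;
import org.apache.pdfbox.Loader;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.documentinterchange.logicalstructure.PDStructureElement;
import org.apache.pdfbox.pdmodel.documentinterchange.logicalstructure.PDStructureNode;
import org.apache.pdfbox.pdmodel.documentinterchange.logicalstructure.PDStructureTreeRoot;

// Inside a method: load the file and check for a structure tree before walking it
try (PDDocument doc = Loader.loadPDF(new File("C:/pdfs/mypdf.pdf"))) {
    PDStructureTreeRoot root = doc.getDocumentCatalog().getStructureTreeRoot();
    if (root == null) {
        System.out.println("No structure tree: this PDF is not tagged.");
        return;
    }
    printTags(root, 0);
}

// Recurse through the logical structure, printing the tag hierarchy (H1, P, Table, …)
static void printTags(PDStructureNode node, int depth) {
    for (Object kid : node.getKids()) {
        // Kids can also be marked-content references or bare MCIDs; only elements carry a tag
        if (kid instanceof PDStructureElement) {
            PDStructureElement element = (PDStructureElement) kid;
            System.out.println("  ".repeat(depth) + element.getStructureType());
            printTags(element, depth + 1);
        }
    }
}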

That prints the skeleton: the order and nesting of headings, paragraphs and tables. It does not print the text inside each tag. For that you pair the walk with PDFMarkedContentExtractor, which collects the marked content on each page (getMarkedContents()), each piece carrying its MCID. You then match each structure element’s MCID references back to that content. It works, but you are writing and maintaining the assembly logic that a dedicated structured extractor already ships.
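The matching step is essentially a join, and the sketch below shows its shape using plain Java collections instead of PDFBox objects. Element and the textByMcid map are hypothetical stand-ins: in real code the tags and MCID references would come from the structure tree walk, and the text from the marked content on the page. It also assumes a single page. MCIDs are only unique within one page's content, so a real implementation would need to key on the page as well.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for the two data sources; these are not PDFBox classes.
final class McidJoinSketch {

    /** A structure element: its tag, the MCIDs it references directly, and its children. */
    record Element(String tag, List<Integer> mcids, List<Element> kids) {}

    /**
     * Flattens the tree into "tag: text" lines. textByMcid stands in for the marked
     * content collected from one page, indexed by MCID.
     */
    static List<String> assemble(Element element, Map<Integer, String> textByMcid) {
        List<String> lines = new ArrayList<>();
        collect(element, textByMcid, lines);
        return lines;
    }

    private static void collect(Element element, Map<Integer, String> textByMcid, List<String> lines) {
        StringBuilder text = new StringBuilder();
        for (int mcid : element.mcids()) {
            // An MCID with no matching content is skipped rather than treated as an error.
            text.append(textByMcid.getOrDefault(mcid, ""));
        }
        if (text.length() > 0) {
            lines.add(element.tag() + ": " + text);
        }
        for (Element kid : element.kids()) {
            collect(kid, textByMcid, lines);
        }
    }

    public static void main(String[] args) {
        Map<Integer, String> textByMcid = Map.of(0, "Annual Report", 1, "Revenue grew ", 2, "this year.");
        Element doc = new Element("Document", List.of(), List.of(
                new Element("H1", List.of(0), List.of()),
                new Element("P", List.of(1, 2), List.of())));
        assemble(doc, textByMcid).forEach(System.out::println);
    }
}
```

Even in this reduced form you own the decisions about missing MCIDs, elements that span several pieces of content, and output formatting. That is the maintenance cost the paragraph above describes.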

iText can also read the structure tree, with the same manual assembly effort, and its licensing is AGPL or commercial rather than permissive. If open source with no copyleft is the requirement, PDFBox is the practical choice.

JPedal vs PDFBox vs iText

Library	License	One-call structured output	Output formats	Setup effort
Apache PDFBox	Apache 2.0	No	Manual assembly	High
iText	AGPL or commercial	No	Manual assembly	High
JPedal	Commercial	Yes	ePUB, HTML, JSON, Markdown, XML, YAML	Low

Which PDF Text Extraction Library to use?

If the job is a one-off script, the PDFs are simple, and you have time to write structure-tree handling, PDFBox costs nothing and does the work. If you are shipping structured extraction as a feature, feeding tagged PDFs into a RAG or documentation pipeline, or you need consistent JSON or Markdown out of the box, JPedal already ships the assembly code you would otherwise write around PDFBox.

FAQs

Q: Can I extract structured text if the PDF wasn’t created with “Marked Content”?

A: Unfortunately, no. If the PDF was not originally generated with the necessary structural metadata, the file only contains raw “draw” commands. In these cases, you would need to use a content grouping or layout engine to “guess” the structure, rather than extracting it directly.

Q: Does structured text extraction work on scanned documents?

A: Standard structured text extraction relies on internal metadata, which scanned documents (essentially just images) lack. To get structured text from a scan, you would first need to run Optical Character Recognition (OCR) to generate the text layer and then apply structure to those results.

Q: How does structured extraction handle multi-column layouts?

A: Because structured text uses “Marked Content,” it understands the reading order defined by the creator. Unlike basic text extraction, which might read across the page and mix sentences from two different columns, structured extraction follows the intended flow of the article or report.
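The difference is easy to reproduce with a toy two-column page. The sketch below is a simplified model, not JPedal or PDFBox code: Line is a hypothetical record, and y is assumed to grow down the page. It compares a naive top-to-bottom, left-to-right sort with the order in which the lines were tagged.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical data model for illustration; not a JPedal or PDFBox type.
final class ReadingOrderSketch {

    /** A line of text with its top-left position; y grows down the page in this sketch. */
    record Line(String text, int x, int y) {}

    /** Naive coordinate order: top to bottom, then left to right. */
    static String byPosition(List<Line> lines) {
        List<Line> sorted = new ArrayList<>(lines);
        sorted.sort(Comparator.comparingInt(Line::y).thenComparingInt(Line::x));
        return join(sorted);
    }

    /** Structure order: the sequence the author tagged, used exactly as given. */
    static String byStructure(List<Line> taggedOrder) {
        return join(taggedOrder);
    }

    private static String join(List<Line> lines) {
        return lines.stream().map(Line::text).collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // Two columns of two lines each, listed in the author's tagged order.
        List<Line> tagged = List.of(
                new Line("Left column starts", 50, 100),
                new Line("and ends here.", 50, 120),
                new Line("Right column starts", 320, 100),
                new Line("and ends there.", 320, 120));
        System.out.println(byPosition(tagged));   // columns interleaved
        System.out.println(byStructure(tagged));  // left column, then right column
    }
}
```

The positional sort interleaves the two columns because lines at the same height are read straight across, while the tagged order finishes the left column before starting the right.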
