How to extract text from a PDF as JSON

Some PDF files are “tagged”, which means they contain information about the structure of the file. This structure is embedded as metadata within the PDF and is made up of a hierarchy of tags that label elements such as headings, paragraphs, lists, tables, and images.

This is very similar to HTML where text is contained within elements that have meaning, such as <p> for paragraph, or <table> for table.

If a PDF file does contain structured content (also known as marked content), then it can be processed and converted into other formats.
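
To make the idea of a tag hierarchy concrete, here is a minimal sketch in Java that models a structure tree and prints it as an indented outline. The Tag class, the outline method, and the sample document are hypothetical and only for illustration; this is not how JPedal or the PDF format stores tags internally. The tag names (Document, H1, P, L, LI) follow the standard PDF structure types.

```java
import java.util.List;

public class TagTreeSketch {

    /** One node of a hypothetical structure tree: a tag name, optional text, and child nodes. */
    static final class Tag {
        final String name;
        final String text;
        final List<Tag> children;

        Tag(String name, String text, List<Tag> children) {
            this.name = name;
            this.text = text;
            this.children = children;
        }
    }

    /** Renders the tree as an indented outline, one tag per line. */
    static String outline(Tag tag, int depth) {
        StringBuilder sb = new StringBuilder();
        sb.append("  ".repeat(depth)).append('<').append(tag.name).append('>');
        if (!tag.text.isEmpty()) {
            sb.append(' ').append(tag.text);
        }
        sb.append('\n');
        for (Tag child : tag.children) {
            sb.append(outline(child, depth + 1));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Invented sample content, used only to show the nesting.
        Tag document = new Tag("Document", "", List.of(
                new Tag("H1", "Quarterly report", List.of()),
                new Tag("P", "Sales rose in the third quarter.", List.of()),
                new Tag("L", "", List.of(
                        new Tag("LI", "North region", List.of()),
                        new Tag("LI", "South region", List.of())))));
        System.out.print(outline(document, 0));
    }
}
```

Each level of indentation in the printed outline corresponds to one level of nesting in the tree, much like nested elements in an HTML page.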

What is JSON?

A JSON file (short for JavaScript Object Notation) is a lightweight, text-based format used for storing and exchanging structured data between systems.

It represents data as key-value pairs and arrays, making it easy for both machines and developers to read and write. JSON is commonly used in web applications to transmit data between a server and a client, and is natively supported by most programming languages.

Despite its origin in JavaScript, JSON is language-independent and has become a universal data format across APIs and software systems.
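
As a rough illustration of key-value pairs and arrays, the sketch below serialises Java maps and lists to JSON text by hand. The class name, the helper methods, and the "tag"/"text"/"children" keys are assumptions made up for this example; they are not the schema JPedal produces. In a real project a JSON library would normally be a better choice than a hand-written serialiser.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class JsonSketch {

    /**
     * Serialises null, strings, numbers, booleans, maps and lists to JSON text.
     * Non-finite doubles (NaN, Infinity) are not valid JSON and are not handled here.
     */
    static String toJson(Object value) {
        if (value == null) return "null";
        if (value instanceof String) return quote((String) value);
        if (value instanceof Number || value instanceof Boolean) return value.toString();
        if (value instanceof Map) {
            // A JSON object: comma-separated "key":value pairs inside braces.
            StringBuilder sb = new StringBuilder("{");
            boolean first = true;
            for (Map.Entry<?, ?> e : ((Map<?, ?>) value).entrySet()) {
                if (!first) sb.append(',');
                sb.append(quote(String.valueOf(e.getKey()))).append(':').append(toJson(e.getValue()));
                first = false;
            }
            return sb.append('}').toString();
        }
        if (value instanceof List) {
            // A JSON array: comma-separated values inside square brackets.
            StringBuilder sb = new StringBuilder("[");
            boolean first = true;
            for (Object item : (List<?>) value) {
                if (!first) sb.append(',');
                sb.append(toJson(item));
                first = false;
            }
            return sb.append(']').toString();
        }
        throw new IllegalArgumentException("Unsupported type: " + value.getClass());
    }

    /** Wraps a string in double quotes and escapes the characters JSON requires. */
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"': sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                    else sb.append(c);
            }
        }
        return sb.append('"').toString();
    }

    public static void main(String[] args) {
        // Invented sample data; LinkedHashMap keeps the keys in insertion order.
        Map<String, Object> paragraph = new LinkedHashMap<>();
        paragraph.put("tag", "P");
        paragraph.put("text", "Sales rose in the third quarter.");

        Map<String, Object> document = new LinkedHashMap<>();
        document.put("tag", "Document");
        document.put("children", List.of(paragraph));

        System.out.println(toJson(document));
    }
}
```

The map becomes a JSON object and the list becomes a JSON array, which is all the structure the format needs to describe nested content.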

PDF vs JSON

The two formats serve distinct purposes. PDF preserves the fixed visual layout of a document, making it ideal for sharing print-ready content like reports, contracts and official documents. It’s widely used when consistent appearance and layout across devices are critical, and is one of the most universally supported document formats.

In contrast, JSON is designed for data exchange and storage, not presentation. JSON is commonly used in web development, APIs, and software applications to send structured data like user profiles, settings, or real-time content between servers and clients.

JSON is plain text, so it can be opened and edited directly in a text editor such as Visual Studio Code without affecting the format; that is not the case with PDFs. Its explicit structure also makes JSON well suited to being processed by AI technologies such as LLMs.

Converting Structured PDF files to JSON

Recently, we added PDF to JSON support to JPedal. If your PDF file contains structured content (how do I know?), then JPedal will be able to convert it to JSON using the following code snippet:

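// Configure the extraction to write JSON output.
ExtractStructuredTextProperties properties = new ExtractStructuredTextProperties();
properties.setFileOutputMode(OutputModes.JSON);

// password is a String variable holding the PDF's password.
// properties is passed as the final argument so the JSON setting takes effect;
// this assumes the last parameter accepts it, so check the JPedal Javadoc.
ExtractStructuredText.writeAllStructuredTextOutlinesToDir(
        "inputFile.pdf", password, "outputFolder", null, properties);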
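
Once the conversion has run, a quick way to check the result is to list what was written to the output folder. The sketch below uses only the standard java.nio.file API and assumes the generated files end in .json and sit directly inside the folder; the class and method names are hypothetical, and JPedal's actual file naming may differ.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ListJsonOutput {

    /** Returns the .json files directly inside the given folder, sorted by path. */
    static List<Path> jsonFiles(Path folder) throws IOException {
        // Files.list holds a directory handle open, so close the stream when done.
        try (Stream<Path> entries = Files.list(folder)) {
            return entries
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().toLowerCase().endsWith(".json"))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the folder name used in the snippet above.
        Path outputFolder = Path.of(args.length > 0 ? args[0] : "outputFolder");
        if (!Files.isDirectory(outputFolder)) {
            System.err.println("Not a directory: " + outputFolder);
            return;
        }
        for (Path file : jsonFiles(outputFolder)) {
            System.out.println(file + " (" + Files.size(file) + " bytes)");
        }
    }
}
```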

Learn more about tagged PDF files.
Learn more about JPedal, our powerful PDF toolkit.

This guide demonstrated how to convert structured PDF files into JSON format using just a few lines of Java code. It also highlighted the key differences between PDF and JSON to help you determine which format best suits your needs.

For more in-depth insights into PDFs, feel free to explore our other articles — we’ve been working with the format for over a decade!