How to process PDFs for use with AI

As Artificial Intelligence becomes more popular for processing large bodies of text, it becomes apparent that PDF files pose a challenge. PDF is a binary format, and the text within a PDF file is often compressed or represented as paint commands, which an LLM cannot interpret directly.
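To make the problem concrete, here is a minimal sketch using only the Java standard library (it does not use JPedal). It assumes a hypothetical one-line content stream that paints the word "Hello", compresses it with Deflate in the way a PDF writer would for a Flate-encoded stream, and then decodes it again. The class name and the sample operators are illustrative only; real PDF files add fonts, encodings, and object structure on top of this.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class ContentStreamDemo {

    // Compress bytes with zlib/Deflate, as is commonly done for PDF content streams.
    static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        while (!deflater.finished()) {
            int written = deflater.deflate(buffer);
            out.write(buffer, 0, written);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Reverse the compression to recover the original paint operators.
    static byte[] inflate(byte[] compressed) throws DataFormatException {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        while (!inflater.finished()) {
            int read = inflater.inflate(buffer);
            if (read == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                break; // truncated or unsupported input: stop rather than loop forever
            }
            out.write(buffer, 0, read);
        }
        inflater.end();
        return out.toByteArray();
    }

    public static void main(String[] args) throws DataFormatException {
        // Hypothetical content stream: select a font, move to a position, paint "Hello".
        String operators = "BT /F1 12 Tf 72 720 Td (Hello) Tj ET";
        byte[] stored = deflate(operators.getBytes(StandardCharsets.US_ASCII));

        System.out.println("Bytes as stored in the file: " + stored.length);
        System.out.println("Decoded paint operators: "
                + new String(inflate(stored), StandardCharsets.US_ASCII));
    }
}
```

Even after decompression, the result is a list of drawing instructions rather than readable prose, which is why a dedicated extraction step is needed.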

For an AI to ingest and process a PDF file, the text must first be extracted in a pre-processing step. LLMs such as GPT-4 work with plain text, so the content of the PDF has to be converted into that form before it can be used.

Our Java PDF library, JPedal, can do just that! It supports many different output formats, including HTML, JSON, TXT, and XML – all of which are AI-compatible formats commonly used for training and processing models.

For most PDF files, you will only be able to extract plain text. However, some files contain additional structured content tags that define the semantic structure of the document. For these files, you will be able to generate HTML, JSON, or XML output.

To achieve this with JPedal, you may use the following code snippet:

final String password = null; // null is used when no password is required
final ErrorTracker tracker = null; // an ErrorTracker implementation can be supplied to monitor extraction
ExtractStructuredTextProperties properties = new ExtractStructuredTextProperties();
properties.setFileOutputMode(OutputModes.XML);

ExtractStructuredText.
        writeAllStructuredTextOutlinesToDir("inputFileOrFolder", password, "outputFolder", tracker, properties);
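Once the extraction has run, the generated files still need to be read before they can be passed to a model. The following is a rough sketch using only the Java standard library. It assumes that the output folder contains files ending in ".xml" and that they are encoded as UTF-8; neither the file naming nor the encoding is confirmed here, so check the actual output of your run. The class and method names are hypothetical and are not part of JPedal.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ExtractedOutputLoader {

    // Reads every file under outputFolder whose name ends with the given extension.
    // Keys are paths relative to outputFolder; values are the file contents.
    static Map<String, String> loadByExtension(Path outputFolder, String extension) throws IOException {
        Map<String, String> contents = new TreeMap<>();
        List<Path> files;
        try (Stream<Path> paths = Files.walk(outputFolder)) {
            files = paths
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().toLowerCase(Locale.ROOT).endsWith(extension))
                    .collect(Collectors.toList());
        }
        for (Path file : files) {
            // Assumption: the extracted files are UTF-8 text.
            String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
            contents.put(outputFolder.relativize(file).toString(), text);
        }
        return contents;
    }

    public static void main(String[] args) throws IOException {
        // "outputFolder" matches the placeholder used in the snippet above.
        Path folder = Paths.get(args.length > 0 ? args[0] : "outputFolder");
        loadByExtension(folder, ".xml").forEach((name, text) ->
                System.out.println(name + ": " + text.length() + " characters"));
    }
}
```

Keeping the relative file name alongside each document's text makes it easier to trace a model's answer back to the PDF it came from.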