This blog post shows you how to convert a PDF to YAML by extracting its marked content along with the structure tags.
What is a structured PDF?
Some PDF files contain structured content, also known as tagged or marked content, which is semantic information that gives meaning to the structure of the file. It is a similar concept to HTML, where text is surrounded by tags which indicate what the text represents.
What is YAML?
YAML is a data serialisation format, similar to JSON, which makes it very useful for transferring content between systems.
A key difference between YAML and other data formats like JSON and XML is that YAML uses indentation rather than brackets or tags to define hierarchy.
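To make the indentation point concrete, here is a minimal Java sketch that prints a nested map as YAML-style text. The class name, method names and sample data are all hypothetical, and the renderer only handles maps, lists and plain scalars (no quoting or escaping), so treat it as an illustration rather than a real YAML writer.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class YamlIndentSketch {

    // Appends the value to out, using two extra spaces of indent per nesting level.
    static void render(Object value, String indent, StringBuilder out) {
        if (value instanceof Map) {
            for (Map.Entry<?, ?> entry : ((Map<?, ?>) value).entrySet()) {
                Object child = entry.getValue();
                if (child instanceof Map || child instanceof List) {
                    // Nested structure: key on its own line, children indented below it
                    out.append(indent).append(entry.getKey()).append(":\n");
                    render(child, indent + "  ", out);
                } else {
                    out.append(indent).append(entry.getKey()).append(": ").append(child).append('\n');
                }
            }
        } else if (value instanceof List) {
            for (Object item : (List<?>) value) {
                if (item instanceof Map || item instanceof List) {
                    out.append(indent).append("-\n");
                    render(item, indent + "  ", out);
                } else {
                    out.append(indent).append("- ").append(item).append('\n');
                }
            }
        } else {
            out.append(indent).append(value).append('\n');
        }
    }

    static String toYaml(Map<String, Object> root) {
        StringBuilder out = new StringBuilder();
        render(root, "", out);
        return out.toString();
    }

    // Hypothetical sample data; in JSON this would be
    // {"title": "Report", "section": {"heading": "Intro", "paragraphs": ["First", "Second"]}}
    static Map<String, Object> sample() {
        Map<String, Object> section = new LinkedHashMap<>();
        section.put("heading", "Intro");
        section.put("paragraphs", Arrays.asList("First", "Second"));
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("title", "Report");
        doc.put("section", section);
        return doc;
    }

    public static void main(String[] args) {
        System.out.print(toYaml(sample()));
    }
}
```

In the printed result, "heading" and "paragraphs" sit two spaces deeper than "section", and that indentation alone marks them as its children.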
Can I convert PDF to YAML?
If a PDF file contains marked content, then you can process the structure tags to produce a YAML output.
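As a rough first check for marked content, you can look for the /StructTreeRoot entry that a tagged PDF carries in its document catalog. The Java sketch below is a heuristic under stated assumptions, not a definitive test: it scans the raw bytes, so it will miss files whose catalog is stored inside a compressed object stream, and a reliable answer needs a real PDF parser. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

class TaggedPdfCheck {

    // Returns true if the raw bytes contain the /StructTreeRoot name.
    // ISO-8859-1 maps every byte to one char, so binary data cannot break the decoding.
    static boolean looksTagged(byte[] pdfBytes) {
        String raw = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        return raw.contains("/StructTreeRoot");
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("Usage: java TaggedPdfCheck <file.pdf>");
            return;
        }
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(looksTagged(bytes)
                ? "Structure tree found: the file appears to be tagged"
                : "No structure tree found in the uncompressed data");
    }
}
```

A negative result here only means the entry was not visible in the uncompressed data, so it is worth confirming with a PDF library before ruling a file out.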
Our PDF library JPedal has support for converting tagged PDFs into the YAML format. JPedal is the best Java PDF library for developers. You can achieve this using the following code snippet:
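ExtractStructuredTextProperties properties = new ExtractStructuredTextProperties();
// Request YAML as the output format
properties.setFileOutputMode(OutputModes.YAML);
// password is the document's password, if it has one.
// The final argument is assumed to accept the properties object so that the YAML setting takes effect; confirm against the JPedal Javadoc.
ExtractStructuredText.writeAllStructuredTextOutlinesToDir("inputFile.pdf", password, "outputFolder", null, properties);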
For more in-depth insights into PDFs, feel free to explore our other articles. We’ve been working with the format for over a decade!