Mark Stephens Mark has been working with Java and PDF since 1999 and is a big NetBeans fan. He enjoys speaking at conferences. He has an MA in Medieval History and a passion for reading.

Extracting Maths formulae in PDF files


Maths formulae look really good in PDF files, but how easy are they to extract? They are quite an extreme example of the issues with text extraction, because they can include fractions and other special symbols. So the answer really is: it depends on how they were created.

The PDF file format contains a set of specific glyphs for the common fractions (onehalf, onequarter, etc.), but some tools instead create fractions by drawing two tiny numbers, one above the other. There is no standard way to do this, so it can be different with each tool.
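When the dedicated fraction glyphs are used, turning them into searchable text is mostly a lookup. The following is a minimal sketch, not any particular library's API: the class name `FractionGlyphs` and its methods are made up for illustration, and it assumes the extractor hands you either the glyph name or the already-decoded Unicode character. It maps the three common fraction glyph names to Unicode and then uses the JDK's `java.text.Normalizer` to expand them into plain digits.

```java
import java.text.Normalizer;
import java.util.Map;

// Hypothetical helper for illustration; not part of any PDF library.
final class FractionGlyphs {

    // Glyph names for the common fractions mapped to their Unicode characters.
    private static final Map<String, String> GLYPH_TO_UNICODE = Map.of(
            "onehalf", "\u00BD",
            "onequarter", "\u00BC",
            "threequarters", "\u00BE");

    /** Returns the Unicode character for a fraction glyph name, or null if unknown. */
    static String toUnicode(String glyphName) {
        return GLYPH_TO_UNICODE.get(glyphName);
    }

    /** Rewrites single-character fractions as plain text such as "1/2". */
    static String toPlainText(String extracted) {
        // NFKC expands U+00BD into '1', the fraction slash U+2044, and '2'.
        String expanded = Normalizer.normalize(extracted, Normalizer.Form.NFKC);
        // Swap the fraction slash for an ordinary '/' so the text is easy to search.
        return expanded.replace('\u2044', '/');
    }

    public static void main(String[] args) {
        System.out.println(toPlainText("x = " + toUnicode("onehalf") + "y"));
    }
}
```

One caveat with this approach: NFKC also flattens other compatibility characters, so a superscript two would become an ordinary 2. For formulae it may be safer to replace only the fraction characters you recognise.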

So, as with text extraction in general, the answer may be:

1. Excellent, because it was generated with marked content so that it can be extracted as an XML structure describing the formulae exactly. The article ‘The easy way to discover if a PDF file contains structured content’ explains how to tell if the file contains marked content.

2. Okay, because the extended PDF character set has been used.

3. Poor, because it was drawn in an arbitrary way with no real structure, and you would need to write a custom extraction routine to pick up what the PDF creation tool is doing.
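For the third case, a custom routine usually comes down to geometry. The sketch below shows one possible heuristic, under assumptions that will not hold for every tool: the `Fragment` type is hypothetical (a real extractor will have its own positioned-text class), y coordinates are assumed to increase up the page, and a fraction is assumed to be two fragments smaller than the body text that overlap horizontally, with one sitting just above the other.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical heuristic for illustration; thresholds would need tuning per PDF creator.
final class StackedFractionFinder {

    /** Hypothetical positioned text run; a real extractor supplies its own type. */
    static final class Fragment {
        final String text;
        final double x;        // left edge
        final double baseline; // baseline position, assumed to increase up the page
        final double width;
        final double fontSize;

        Fragment(String text, double x, double baseline, double width, double fontSize) {
            this.text = text;
            this.x = x;
            this.baseline = baseline;
            this.width = width;
            this.fontSize = fontSize;
        }
    }

    /** Returns "numerator/denominator" for each pair that looks like a drawn fraction. */
    static List<String> findFractions(List<Fragment> fragments, double bodyFontSize) {
        List<String> found = new ArrayList<>();
        for (Fragment top : fragments) {
            for (Fragment bottom : fragments) {
                if (looksStacked(top, bottom, bodyFontSize)) {
                    found.add(top.text + "/" + bottom.text);
                }
            }
        }
        return found;
    }

    private static boolean looksStacked(Fragment top, Fragment bottom, double bodyFontSize) {
        // Both parts are drawn smaller than the surrounding text.
        boolean bothSmall = top.fontSize < bodyFontSize && bottom.fontSize < bodyFontSize;
        // Their horizontal extents overlap.
        boolean overlapX = top.x < bottom.x + bottom.width && bottom.x < top.x + top.width;
        // The first fragment sits above the second, within one body line height.
        double gap = top.baseline - bottom.baseline;
        boolean closeAbove = gap > 0 && gap <= bodyFontSize;
        return bothSmall && overlapX && closeAbove;
    }
}
```

A rule like this would also match other stacked pairs, such as a superscript directly above a subscript, which is one reason this kind of extraction tends to need adjusting for each creation tool.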

Do you have any tips on extracting Maths formulae from PDF or recommendations on which types of PDF creator produce the best content for extraction?




IDRsolutions Ltd 2021. All rights reserved.