Mark Stephens Mark Stephens has been working with Java and PDF since 1999 and has diversified into HTML5, SVG and JavaFX. He also enjoys speaking at conferences and has been a Speaker at user groups, Business of Software, Seybold and JavaOne conferences. He has a very dry sense of humor and an MA in Medieval History for which he has not yet found a practical use.

Extracting Maths formulae in PDF files

Maths formulae look really good in PDF files, but how easy are they to extract? They are quite an extreme example of the issues with text extraction because they can include fractions and other special symbols. So the answer really is: it depends on how they were created.

The standard Latin character set in the PDF specification includes specific glyphs for the common fractions (onehalf, onequarter, etc.), but some tools instead create fractions by drawing two tiny numbers, one above the other. There is no standard way to do this, so it can be different with each tool.
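To make the first case concrete, here is a minimal sketch of turning those glyph names into text. It assumes your extraction code already gives you the glyph name as a string. The class `FractionGlyphs` and its methods are hypothetical names invented for this example and are not part of any PDF library. The three mappings come from the Adobe Glyph List.

```java
import java.text.Normalizer;
import java.util.Map;

final class FractionGlyphs {

    // Glyph names and their Unicode characters, as listed in the Adobe Glyph List.
    private static final Map<String, String> GLYPH_TO_UNICODE = Map.of(
            "onequarter", "\u00BC",
            "onehalf", "\u00BD",
            "threequarters", "\u00BE");

    /** Returns the Unicode character for a fraction glyph name, or null if unknown. */
    static String toUnicode(String glyphName) {
        return GLYPH_TO_UNICODE.get(glyphName);
    }

    /** Returns a plain-text form such as "1/2", or null if the glyph name is unknown. */
    static String toPlainText(String glyphName) {
        String unicode = toUnicode(glyphName);
        if (unicode == null) {
            return null;
        }
        // NFKC expands the single fraction character to digit, FRACTION SLASH, digit.
        String expanded = Normalizer.normalize(unicode, Normalizer.Form.NFKC);
        // Swap the FRACTION SLASH (U+2044) for an ordinary ASCII slash.
        return expanded.replace('\u2044', '/');
    }

    public static void main(String[] args) {
        System.out.println(toPlainText("onehalf"));
        System.out.println(toPlainText("threequarters"));
    }
}
```

This only helps when the PDF creator actually used the fraction glyphs. Fractions drawn as two separate numbers never pass through a lookup like this.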

So, as with text extraction in general, the answer may be:

1. Excellent, because it was generated with marked content so that it can be extracted as an XML structure describing the formulae exactly. The article ‘The easy way to discover if a PDF file contains structured content’ explains how to tell if a file contains marked content.

2. Okay, because the extended PDF character set has been used.

3. Poor, because it was drawn in an arbitrary way with no real structure, and you would need to write a custom extraction routine to pick up what the PDF creation tool is doing.
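For the third case, a custom routine might look something like the sketch below. It is only a heuristic under stated assumptions: the `Fragment` class is a hypothetical stand-in for whatever positioned text your extractor returns, coordinates follow PDF user space (y increases upwards), and the 1.5 × font size threshold is an arbitrary guess that would need tuning for each PDF creation tool.

```java
import java.util.ArrayList;
import java.util.List;

final class StackedFractionDetector {

    /** Hypothetical positioned text fragment; not taken from any real library. */
    static final class Fragment {
        final String text;
        final double x, y, width, fontSize;

        Fragment(String text, double x, double y, double width, double fontSize) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
            this.fontSize = fontSize;
        }
    }

    /** True if both fragments are digits, overlap horizontally, and top sits just above bottom. */
    static boolean isStackedFraction(Fragment top, Fragment bottom) {
        boolean digits = top.text.matches("\\d+") && bottom.text.matches("\\d+");
        boolean overlapX = top.x < bottom.x + bottom.width && bottom.x < top.x + top.width;
        double gap = top.y - bottom.y; // positive when top is higher on the page
        boolean closeVertically = gap > 0 && gap <= 1.5 * Math.max(top.fontSize, bottom.fontSize);
        return digits && overlapX && closeVertically;
    }

    /** Returns the fragment texts in order, with each stacked pair merged into "n/d". */
    static List<String> mergeFractions(List<Fragment> fragments) {
        List<String> out = new ArrayList<>();
        boolean[] used = new boolean[fragments.size()];
        for (int i = 0; i < fragments.size(); i++) {
            if (used[i]) {
                continue;
            }
            Fragment a = fragments.get(i);
            String text = a.text;
            for (int j = 0; j < fragments.size(); j++) {
                if (j == i || used[j]) {
                    continue;
                }
                Fragment b = fragments.get(j);
                if (isStackedFraction(a, b)) {
                    text = a.text + "/" + b.text;
                    used[j] = true;
                    break;
                }
                if (isStackedFraction(b, a)) {
                    text = b.text + "/" + a.text;
                    used[j] = true;
                    break;
                }
            }
            used[i] = true;
            out.add(text);
        }
        return out;
    }
}
```

A routine like this will miss fractions drawn with a diagonal slash, letters in the numerator or anything nested, which is exactly why this category is the hardest to extract reliably.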

Do you have any tips on extracting Maths formulae from PDF or recommendations on which types of PDF creator produce the best content for extraction?
