Mark Stephens Mark has been working with Java and PDF since 1999 and is a big NetBeans fan. He enjoys speaking at conferences. He has an MA in Medieval History and a passion for reading.

Extracting Maths formulae in PDF files


Maths formulae look really good in PDF files but how easy are they to extract? They are actually quite an extreme example of the issues with text extraction because they can include fractions and other special symbols. So the answer really is – it depends on how they were created.

The PDF file format contains a set of specific glyphs for the common fractions (onehalf, onequarter, etc.), but some tools instead create fractions by drawing two tiny numbers, one above the other. There is no standard way to do this, so it can be different with each tool.
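
As a rough illustration of the first case, here is a minimal Java sketch of how extracted fraction glyph names could be turned into searchable text. It assumes your extraction tool hands you the glyph name as a string. The class and method names (FractionGlyphs, toUnicode, toPlainText) are made up for this example and are not part of any PDF library. The mapping covers only three glyph names; a real routine would need a fuller table.

```java
import java.text.Normalizer;
import java.util.Map;

final class FractionGlyphs {

    // Three fraction glyph names and the Unicode characters they stand for.
    // Illustrative subset only, not a complete table.
    private static final Map<String, String> GLYPH_TO_UNICODE = Map.of(
            "onehalf", "\u00BD",
            "onequarter", "\u00BC",
            "threequarters", "\u00BE");

    /** Returns the Unicode character for a fraction glyph name, or null if it is not in the table. */
    static String toUnicode(String glyphName) {
        return GLYPH_TO_UNICODE.get(glyphName);
    }

    /** Expands precomposed fractions into digits around a plain slash, which is easier to search for. */
    static String toPlainText(String text) {
        // NFKC splits a precomposed fraction into digit, fraction slash (U+2044), digit.
        return Normalizer.normalize(text, Normalizer.Form.NFKC).replace('\u2044', '/');
    }

    public static void main(String[] args) {
        System.out.println(toPlainText(toUnicode("onehalf"))); // prints 1/2
    }
}
```

Note that NFKC normalisation also rewrites other compatibility characters (ligatures, superscript digits and so on), so it would probably be safer to apply it only to the characters you have already identified as fractions.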

So, as with text extraction in general, the answer may be:

1. Excellent because it was generated with marked content so that it can be extracted as an XML structure describing the formulae exactly. There is an article, ‘The easy way to discover if a PDF file contains structured content’, which explains how to tell if the file contains marked content.

2. Okay because the extended PDF character set has been used.

3. Poor because it was drawn in an arbitrary way with no real structure and you would need to write a custom extraction routine to pick up what the PDF creation tool is doing.
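
To give a feel for what a custom routine in the third case might involve, here is a minimal Java sketch that looks for two small numbers drawn one above the other. It is only a heuristic under stated assumptions: the Run class is a hypothetical stand-in for whatever positioned text your extraction tool provides, and the thresholds (0.8 and 1.5 times the body font size) are guesses that would need tuning for each PDF creation tool.

```java
import java.util.ArrayList;
import java.util.List;

final class StackedFractionFinder {

    /** Hypothetical positioned text run; a real PDF library would supply its own equivalent. */
    static final class Run {
        final String text;
        final double x;        // left edge
        final double y;        // baseline, increasing up the page as in PDF coordinates
        final double width;
        final double fontSize;

        Run(String text, double x, double y, double width, double fontSize) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
            this.fontSize = fontSize;
        }
    }

    /** Reports each pair of small digit runs stacked vertically as "numerator/denominator". */
    static List<String> findFractions(List<Run> runs, double bodyFontSize) {
        List<String> found = new ArrayList<>();
        for (Run top : runs) {
            for (Run bottom : runs) {
                if (top != bottom
                        && isSmallNumber(top, bodyFontSize)
                        && isSmallNumber(bottom, bodyFontSize)
                        && top.y > bottom.y                        // top really is above bottom
                        && top.y - bottom.y < 1.5 * bodyFontSize   // and close enough to belong together
                        && overlapsHorizontally(top, bottom)) {
                    found.add(top.text + "/" + bottom.text);
                }
            }
        }
        return found;
    }

    // Assumed threshold: fraction digits are drawn noticeably smaller than body text.
    private static boolean isSmallNumber(Run r, double bodyFontSize) {
        return r.fontSize < 0.8 * bodyFontSize && r.text.matches("\\d+");
    }

    private static boolean overlapsHorizontally(Run a, Run b) {
        return a.x < b.x + b.width && b.x < a.x + a.width;
    }
}
```

This simple check would also flag a superscript sitting directly above a subscript, so a real routine would probably want to look for the fraction bar (a thin horizontal line between the two numbers) as well.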

Do you have any tips on extracting Maths formulae from PDF or recommendations on which types of PDF creator produce the best content for extraction?



IDRsolutions Ltd 2020. All rights reserved.