Maths formulae look really good in PDF files, but how easy are they to extract? They are an extreme example of the problems with text extraction because they can include fractions and other special symbols. So the answer really is: it depends on how they were created.
The PDF file format defines specific glyph names for the common fractions (onehalf, onequarter, etc.), but some tools instead create fractions by drawing two tiny numbers, one above the other. There is no standard way to do this, so it can differ with each tool.
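When a file does use the dedicated fraction glyphs, extraction mostly comes down to mapping the glyph name to its Unicode character. Here is a minimal sketch of that lookup in Python. The function name `glyph_to_fraction` and the table are our own illustration, not part of any PDF library, and the table covers only three glyph names.

```python
import unicodedata

# Fraction glyph names and the Unicode characters they correspond to.
# Only three common fractions are listed here; a real extractor would
# need a fuller glyph-name table.
FRACTION_GLYPHS = {
    "onequarter": "\u00BC",
    "onehalf": "\u00BD",
    "threequarters": "\u00BE",
}


def glyph_to_fraction(glyph_name):
    """Return (character, numeric value) for a known fraction glyph name.

    Returns None when the glyph name is not in the table.
    """
    char = FRACTION_GLYPHS.get(glyph_name)
    if char is None:
        return None
    # unicodedata.numeric gives the numeric value of the character.
    return char, unicodedata.numeric(char)


if __name__ == "__main__":
    for name in ("onehalf", "onequarter", "threequarters", "A"):
        print(name, glyph_to_fraction(name))
```

Keeping the numeric value alongside the character is a design choice: it lets later code treat the fraction as a number rather than as an opaque symbol.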
So, as with text extraction in general, the answer may be:
1. Excellent because it was generated with marked content, so it can be extracted as an XML structure describing the formulae exactly. Our article 'The easy way to discover if a PDF file contains structured content' explains how to tell if a file contains marked content.
2. Okay because the extended PDF character set has been used.
3. Poor because it was drawn in an arbitrary way with no real structure, and you would need to write a custom extraction routine to pick up what the PDF creation tool is doing.
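To give a flavour of what a custom routine for the third case might involve, here is a rough Python sketch that looks for two small numbers drawn one above the other. Everything in it is an assumption for illustration: the `Fragment` type stands in for whatever positioned text your extraction tool returns, coordinates follow the PDF convention of y increasing upwards, and the 1.5 × body size threshold is a guess that would need tuning for each PDF creation tool.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """Hypothetical positioned piece of text from a PDF page."""
    text: str
    x: float      # left edge
    y: float      # baseline (y increases upwards, as in PDF)
    width: float
    size: float   # font size


def find_stacked_fractions(fragments, body_size):
    """Pair small numeric fragments that sit one above the other.

    Returns a list of (numerator, denominator) text pairs.
    """
    # Candidates: digits drawn smaller than the surrounding body text.
    small = [f for f in fragments if f.text.isdigit() and f.size < body_size]
    pairs = []
    used = set()
    for i, top in enumerate(small):
        for j, bottom in enumerate(small):
            if i == j or i in used or j in used:
                continue
            # Horizontal overlap between the two fragments.
            overlap = (min(top.x + top.width, bottom.x + bottom.width)
                       - max(top.x, bottom.x))
            # Vertical distance; positive when 'top' really is above.
            gap = top.y - bottom.y
            if overlap > 0 and 0 < gap <= 1.5 * body_size:
                pairs.append((top.text, bottom.text))
                used.update((i, j))
    return pairs


if __name__ == "__main__":
    page = [
        Fragment("x =", 10, 100, 20, 12),
        Fragment("1", 32, 104, 4, 7),   # numerator, raised
        Fragment("2", 32, 97, 4, 7),    # denominator, lowered
        Fragment("+", 40, 100, 6, 12),
        Fragment("3", 50, 100, 6, 12),  # full-size digit, not a fraction
    ]
    print(find_stacked_fractions(page, body_size=12))
```

A heuristic like this will miss fractions drawn with a slash or with full-size digits, which is exactly why this case is rated as poor: the routine has to be adapted to each tool's habits.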
Do you have any tips on extracting Maths formulae from PDF or recommendations on which types of PDF creator produce the best content for extraction?