Maths formulae look really good in PDF files but how easy are they to extract? They are actually quite an extreme example of the issues with text extraction because they can include fractions and other special symbols. So the answer really is – it depends on how they were created.
The PDF file format contains a set of specific glyphs for the common fractions (onehalf, onequarter, etc.), but some tools instead create fractions by drawing two tiny numbers, one above the other. There is no standard way to do this, so it can be different with each tool.
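When a tool does use the dedicated fraction glyphs, turning them into plain text is fairly mechanical. Here is a minimal Python sketch, assuming your extraction library hands you the glyph name as a string. The function name `fraction_glyph_to_text` and the lookup table are made up for illustration, and only the three common fractions are covered.

```python
import unicodedata

# Adobe glyph names for the common fractions and their Unicode characters.
# Illustrative table only: it covers just these three names.
FRACTION_GLYPHS = {
    "onequarter": "\u00bc",     # ¼
    "onehalf": "\u00bd",        # ½
    "threequarters": "\u00be",  # ¾
}


def fraction_glyph_to_text(glyph_name):
    """Return text such as '1/2' for a fraction glyph name, or None if unknown."""
    char = FRACTION_GLYPHS.get(glyph_name)
    if char is None:
        return None
    # NFKC expands the single character into digit, FRACTION SLASH (U+2044), digit.
    expanded = unicodedata.normalize("NFKC", char)
    # Swap the fraction slash for an ordinary '/' so the result is plain ASCII.
    return expanded.replace("\u2044", "/")


print(fraction_glyph_to_text("onehalf"))
```

Fractions drawn as two separate tiny numbers never reach this code path, because no fraction glyph is present in the file to look up.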
So as with text extraction in general, the answer may be:
1. Excellent, because it was generated with marked content so that it can be extracted as an XML structure describing the formulae exactly. The article "The easy way to discover if a PDF file contains ‘structured content’" explains how to tell if the file contains marked content.
2. Okay, because the extended PDF character set has been used.
3. Poor, because it was drawn in an arbitrary way with no real structure, and you would need to write a custom extraction routine to pick up what the PDF creation tool is doing.
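To give a feel for what a custom extraction routine in the third case might involve, here is a rough Python sketch that pairs up small numbers sitting one above the other. Everything in it is an assumption for illustration: the `Fragment` type, the `find_stacked_fractions` function, the PDF-style coordinates (larger y is higher on the page) and the `max_gap_factor` threshold. A real tool may lay fractions out differently, so treat this as a starting point rather than a general solution.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """A piece of positioned text, as a hypothetical extractor might report it."""
    text: str
    x: float      # left edge
    y: float      # baseline; larger y is higher on the page (assumed)
    width: float
    size: float   # font size in points


def find_stacked_fractions(fragments, body_size, max_gap_factor=1.5):
    """Pair small numeric fragments that sit one directly above the other."""
    # Only consider all-digit fragments that are smaller than the body text.
    small = [f for f in fragments if f.text.isdigit() and f.size < body_size]
    fractions = []
    used = set()
    for i, top in enumerate(small):
        if i in used:
            continue
        for j, bottom in enumerate(small):
            if i == j or j in used:
                continue
            # Vertical distance between baselines; positive if 'top' is higher.
            gap = top.y - bottom.y
            # Horizontal overlap between the two fragments.
            overlap = (min(top.x + top.width, bottom.x + bottom.width)
                       - max(top.x, bottom.x))
            if 0 < gap <= max_gap_factor * top.size and overlap > 0:
                fractions.append(f"{top.text}/{bottom.text}")
                used.update((i, j))
                break
    return fractions


sample = [
    Fragment("x =", 10, 100, 20, 12),
    Fragment("3", 32, 104, 4, 7),   # numerator, raised above the baseline
    Fragment("4", 32, 97, 4, 7),    # denominator, dropped below it
    Fragment("+ 1", 40, 100, 15, 12),
]
print(find_stacked_fractions(sample, body_size=12))
```

The heuristic ignores the fraction bar itself, which is usually a drawn line rather than text, and it would need tuning for each PDF creation tool you come across.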
Do you have any tips on extracting Maths formulae from PDF or recommendations on which types of PDF creator produce the best content for extraction?