How do we get the text from a PDF file, especially if the font is embedded?
Luckily for us, the PDF file format was designed from the beginning to allow extraction of the text. A properly constructed PDF file specifies the Unicode value for every glyph it uses, so we can read these values back as the text. Most PDF files embed this mapping information for all of their glyphs, and it is what we use to generate the text in our HTML5 file.
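To make the idea concrete, here is a minimal sketch in Python of turning glyph codes into text with a lookup table. It is only an illustration, not our converter's code: the table `TO_UNICODE`, its values and the function `decode_glyphs` are all hypothetical, loosely modelled on the kind of glyph-to-Unicode mapping a PDF font can carry.

```python
# Hypothetical glyph-code -> Unicode table for one embedded font.
# The codes and values are invented for illustration only.
TO_UNICODE = {
    0x01: "H",
    0x02: "e",
    0x03: "l",
    0x04: "o",
    0x05: "\ufb01",  # a single glyph can stand for a ligature such as "fi"
}

def decode_glyphs(codes, to_unicode, missing="\ufffd"):
    """Map each glyph code to its Unicode text.

    Codes with no entry become U+FFFD (the replacement character),
    so gaps in the mapping stay visible instead of being dropped.
    """
    return "".join(to_unicode.get(code, missing) for code in codes)

text = decode_glyphs([0x01, 0x02, 0x03, 0x03, 0x04], TO_UNICODE)
```

If the PDF does not supply a mapping for a glyph, there is nothing reliable to look up, which is why the sketch substitutes a visible placeholder rather than guessing.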
For version 1 we are only looking at substitution (i.e. using fonts on the system rather than extracting the font from the PDF file). So we also allow the user to map each font onto a web font, so that they can ensure the font is suitable. This is achieved with a lookup table, and we also allow you to control how this happens. I will cover this in the next article…
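As a rough sketch of what such a lookup table could look like, the Python below maps a PDF font name onto a web font stack and lets the user override entries. Every name here (`DEFAULT_FONT_MAP`, `normalise`, `map_font`) and every table entry is an assumption made for illustration; it is not the converter's actual table or API. The one PDF-specific detail it relies on is that subsetted fonts are named with a six-letter tag and a plus sign, as in `ABCDEF+Arial`.

```python
import re

# Hypothetical default table: normalised PDF font name -> CSS font stack.
DEFAULT_FONT_MAP = {
    "arial": "Arial, Helvetica, sans-serif",
    "timesnewroman": "'Times New Roman', Times, serif",
    "courier": "'Courier New', Courier, monospace",
}

def normalise(pdf_font_name):
    """Reduce a PDF font name to a simple lookup key."""
    # Drop a subset tag such as "ABCDEF+".
    name = re.sub(r"^[A-Z]{6}\+", "", pdf_font_name)
    # Drop a style suffix such as "-BoldMT" or ",Italic".
    name = re.split(r"[,-]", name)[0]
    # Lowercase and remove spaces and other punctuation.
    return re.sub(r"[^a-z0-9]", "", name.lower())

def map_font(pdf_font_name, user_overrides=None, fallback="sans-serif"):
    """Return the web font for a PDF font; user entries take priority."""
    table = dict(DEFAULT_FONT_MAP)
    if user_overrides:
        table.update({normalise(k): v for k, v in user_overrides.items()})
    return table.get(normalise(pdf_font_name), fallback)

web_font = map_font("ABCDEF+Arial-BoldMT")
custom = map_font("Arial", user_overrides={"Arial": "Roboto, sans-serif"})
```

Keeping the user's entries in a separate dictionary that is merged over the defaults means a custom mapping never has to edit the built-in table.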
This post is part of our PDF to HTML5 conversion series and of our “Fonts Articles Index”, in which we explore fonts.