How do we get the text from a PDF file, especially if the font is embedded?
Luckily for us, the PDF file format was designed from the beginning to allow extraction of the text. A properly constructed PDF file specifies the Unicode value for every glyph it uses, so we can extract these values as the text. Most PDF files embed this mapping for all their glyphs, and it is what we use to generate the text in our HTML5 file.
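To make the idea concrete, here is a minimal sketch in Java of turning glyph codes into text with a Unicode lookup. It is not our converter's actual code: the class `GlyphTextDecoder` and its methods are hypothetical names, and the table is filled by hand rather than read from a real PDF file. Note that one glyph can map to several characters (a ligature glyph for "fi", for example), which is why the table stores strings rather than single characters.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only: the mapping is supplied by the caller,
// not parsed from a PDF file.
final class GlyphTextDecoder {

    // Maps a font-specific character code to the Unicode text it represents.
    private final Map<Integer, String> toUnicode = new HashMap<>();

    // Registers the Unicode text for one character code.
    void addMapping(int charCode, String unicodeText) {
        toUnicode.put(charCode, unicodeText);
    }

    // Decodes a run of character codes into text.
    // Codes with no mapping become U+FFFD (the replacement character).
    String decode(int[] charCodes) {
        StringBuilder text = new StringBuilder();
        for (int code : charCodes) {
            text.append(toUnicode.getOrDefault(code, "\uFFFD"));
        }
        return text.toString();
    }
}
```

With code 3 mapped to "fi" and code 1 mapped to "x", decoding the codes 3, 1 gives the text "fix", even though the page only drew two glyphs.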
For version 1 we are only looking at substitution (i.e. using fonts on the system rather than extracting the font from the PDF file). So we also allow the user to map each font onto a web font, so that they can ensure the font is suitable. This is achieved with a lookup table, and we also allow you to control how this happens. I will cover this in the next article…
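As a rough sketch of what such a lookup table could look like (an assumption for illustration, not our actual implementation), the Java class below maps a PDF font name onto a CSS font-family string. The names `WebFontMapper`, `map`, `lookup` and `normalise` are all hypothetical. The one PDF-specific detail it handles is that subsetted fonts carry a six-letter tag and a plus sign in front of the name, such as "ABCDEF+Arial", which is stripped before the lookup.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration of a font substitution table.
final class WebFontMapper {

    // Normalised PDF font name -> CSS font-family value.
    private final Map<String, String> table = new HashMap<>();

    // Used when the PDF font has no entry in the table.
    private final String fallback;

    WebFontMapper(String fallback) {
        this.fallback = fallback;
    }

    // Lets the user choose the web font for a given PDF font.
    void map(String pdfFontName, String cssFontFamily) {
        table.put(normalise(pdfFontName), cssFontFamily);
    }

    // Returns the mapped web font, or the fallback if none was set.
    String lookup(String pdfFontName) {
        return table.getOrDefault(normalise(pdfFontName), fallback);
    }

    // Strips a subset tag such as "ABCDEF+" and ignores case.
    static String normalise(String name) {
        String base = name.replaceFirst("^[A-Z]{6}\\+", "");
        return base.toLowerCase(Locale.ROOT);
    }
}
```

For example, after `map("Arial", "Arial, Helvetica, sans-serif")`, a lookup for "ABCDEF+Arial" returns the same font-family string, while an unknown font name returns whatever fallback was passed to the constructor.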
Written by Mark Stephens.