How do we get the text from a PDF file, especially if the font is embedded?
Luckily for us, the PDF file format was designed from the beginning to allow extraction of the text. A properly constructed PDF file will specify the Unicode value for every glyph it uses, so we can extract these values as the text. Most PDF files embed this information, which gives the correct Unicode text for all the glyphs. This is what we then use to generate the text in our HTML5 file.
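To make the idea concrete, here is a minimal sketch of the lookup step in Java. It assumes the glyph-to-Unicode table has already been read from the PDF (fonts can carry this in a ToUnicode table) and simply turns a run of character codes into text. The class name `GlyphTextDecoder`, its methods and the sample codes are all made up for illustration and are not taken from our converter.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: class and method names are hypothetical.
class GlyphTextDecoder {

    // Character code used in the page content -> Unicode text.
    // The value is a String because one glyph (for example an "fi"
    // ligature) can stand for more than one character.
    private final Map<Integer, String> toUnicode = new HashMap<>();

    void addMapping(int code, String unicodeText) {
        toUnicode.put(code, unicodeText);
    }

    // Converts a run of character codes into text. Codes with no mapping
    // become the Unicode replacement character so the gap stays visible.
    String decode(int[] codes) {
        StringBuilder text = new StringBuilder();
        for (int code : codes) {
            text.append(toUnicode.getOrDefault(code, "\uFFFD"));
        }
        return text.toString();
    }

    public static void main(String[] args) {
        GlyphTextDecoder decoder = new GlyphTextDecoder();
        // Made-up codes; an embedded subset font often numbers its glyphs arbitrarily.
        decoder.addMapping(1, "P");
        decoder.addMapping(2, "D");
        decoder.addMapping(3, "F");
        decoder.addMapping(4, "fi"); // ligature glyph mapped to two characters

        System.out.println(decoder.decode(new int[] {1, 2, 3}));
    }
}
```

The point of the sketch is that the codes in the page content mean nothing on their own with an embedded font. It is the mapping stored in the file that lets us recover real text.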
For version 1 we are only looking at substitution (i.e. using fonts on the system and not extracting the font from the PDF file). So we also allow the user to map the font onto a web font so that they can ensure that the font is suitable. This is achieved with a lookup table, and we also allow you to control how this happens. I will cover this in the next article…
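As a rough idea of what such a lookup table could look like, here is a small Java sketch. It is an assumption about the shape of the feature, not our actual implementation: `FontSubstitutionTable`, its methods and the example font names are hypothetical. The one PDF detail it relies on is that subsetted embedded fonts carry a six-letter tag and a plus sign in front of the real name (for example `ABCDEF+TimesNewRoman`), which we strip before the lookup.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Illustrative sketch only: class and method names are hypothetical.
class FontSubstitutionTable {

    private final Map<String, String> table = new HashMap<>();
    private final String fallbackFamily;

    FontSubstitutionTable(String fallbackFamily) {
        this.fallbackFamily = fallbackFamily;
    }

    // Lets the user decide which web font a given PDF font maps onto.
    void map(String pdfFontName, String cssFontFamily) {
        table.put(normalise(pdfFontName), cssFontFamily);
    }

    // Returns the mapped web font, or the fallback if nothing was configured.
    String lookup(String pdfFontName) {
        return table.getOrDefault(normalise(pdfFontName), fallbackFamily);
    }

    // Strips the subset tag (six capitals and a plus sign) and ignores case.
    static String normalise(String pdfFontName) {
        return pdfFontName.replaceFirst("^[A-Z]{6}\\+", "").toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        FontSubstitutionTable fonts = new FontSubstitutionTable("sans-serif");
        fonts.map("TimesNewRoman", "'Times New Roman', serif");

        System.out.println(fonts.lookup("ABCDEF+TimesNewRoman"));
        System.out.println(fonts.lookup("SomeUnknownFont"));
    }
}
```

Keeping the fallback separate from the table means an unmapped font still produces valid output, while the user stays in control of every font they care about.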
This post is part of our “Fonts Articles Index”, a series of articles in which we explore fonts.
IDRsolutions develop a Java PDF library, a PDF forms to HTML5 converter, a PDF to HTML5 or SVG converter and a Java Image Library that doubles as an ImageIO replacement. On the blog our team post about anything interesting they learn about.