How do we get the text from a PDF file, especially if the font is embedded?
Luckily for us, the PDF file format was designed from the beginning to allow extraction of the text. A properly constructed PDF file specifies the Unicode value for every glyph it uses, so we can read those values back as text. Most PDF files embed this information, giving the correct Unicode text for all the glyphs, and this is what we use to generate the text in our HTML5 file.
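To make the glyph-to-Unicode idea concrete, here is a minimal sketch in Python. It assumes the mapping arrives as a ToUnicode CMap, one of the ways a PDF font can carry this information, and it only handles the simple `bfchar` entries. The sample data and the function names `parse_bfchar` and `glyphs_to_text` are hypothetical and are not taken from our converter.

```python
import re

# Hypothetical, simplified ToUnicode CMap fragment for illustration only.
# Real CMaps also contain bfrange sections, which this sketch ignores.
SAMPLE_CMAP = """
2 beginbfchar
<0001> <0048>
<0002> <0069>
endbfchar
"""

def parse_bfchar(cmap_text):
    """Build a {glyph code: Unicode string} table from bfchar sections."""
    mapping = {}
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        # Each entry is <source code> <destination>, both written in hex.
        for src, dst in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", block):
            # The destination is UTF-16BE and may hold more than one character.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping

def glyphs_to_text(codes, mapping):
    """Turn a run of glyph codes into text, marking unmapped codes with U+FFFD."""
    return "".join(mapping.get(code, "\ufffd") for code in codes)

table = parse_bfchar(SAMPLE_CMAP)
print(glyphs_to_text([1, 2], table))
```

Glyph codes with no entry come out as the Unicode replacement character, which makes a badly constructed file easy to spot in the output.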
For version 1 we are only looking at substitution (i.e. using fonts on the system rather than extracting the font from the PDF file). So we also allow the user to map each font onto a web font, so that they can ensure the font is suitable. This is achieved with a lookup table, and we also allow you to control how this happens. I will cover this in the next article…
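As a rough illustration of the lookup-table idea, the sketch below maps a PDF font name onto a CSS font stack and lets the user override entries. The table contents, the fallback value, and the names `normalise` and `web_font_for` are all assumptions made for this example, not the converter's actual settings or API.

```python
import re

# Hypothetical default table: normalised PDF font name -> CSS font stack.
DEFAULT_FONT_MAP = {
    "arial": "Arial, Helvetica, sans-serif",
    "timesnewroman": "'Times New Roman', Times, serif",
    "courier": "'Courier New', Courier, monospace",
}
FALLBACK = "sans-serif"  # assumed fallback when nothing matches

def normalise(pdf_font_name):
    """Reduce a PDF font name to a lookup key."""
    # Subset fonts carry a six-letter tag such as "ABCDEF+" in front of the name.
    name = re.sub(r"^[A-Z]{6}\+", "", pdf_font_name)
    # Drop a style suffix such as "-Bold" or ",Italic".
    name = re.split(r"[,-]", name)[0]
    return re.sub(r"[^a-z0-9]", "", name.lower())

def web_font_for(pdf_font_name, user_map=None):
    """Return the web font for a PDF font; user entries override the defaults."""
    table = dict(DEFAULT_FONT_MAP)
    if user_map:
        table.update({normalise(k): v for k, v in user_map.items()})
    return table.get(normalise(pdf_font_name), FALLBACK)

print(web_font_for("ABCDEF+Arial-Bold"))
print(web_font_for("Arial", {"Arial": "Roboto, sans-serif"}))
```

Keeping the user's entries in a separate map that is merged over the defaults means an override never has to edit the built-in table.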
Click here to see all the articles in the PDF to HTML5 conversion series.
This post is part of our “Fonts Articles Index”, a series of articles in which we explore fonts.