PDF to HTML5 conversion – Text and fonts (Part 1)

How do we get the text from a PDF file, especially if the font is embedded?

Luckily for us, the PDF file format was designed from the beginning to allow extraction of the text. A properly constructed PDF file specifies the Unicode value for every glyph it uses, and most PDF files embed this information. We read these values and use them to generate the text in our HTML5 file.
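In a PDF, this glyph-to-Unicode information is typically carried in a font's ToUnicode CMap. The idea can be illustrated with a minimal Python sketch. The table contents, the glyph codes and the `decode_glyphs` helper are all made up for illustration; this is not our converter's code, and a real CMap has to be parsed out of the PDF first.

```python
# Hypothetical glyph-code -> Unicode table, in the spirit of a ToUnicode CMap.
# The codes and values here are invented for illustration only.
TO_UNICODE = {
    0x01: "H",
    0x02: "e",
    0x03: "l",
    0x04: "o",
    0x05: "fi",  # one glyph (a ligature) may map to several characters
}


def decode_glyphs(codes, to_unicode, missing="\ufffd"):
    """Turn a sequence of glyph codes into text.

    Codes with no entry in the table become the Unicode replacement
    character, so gaps in the mapping stay visible in the output.
    """
    return "".join(to_unicode.get(code, missing) for code in codes)


text = decode_glyphs([0x01, 0x02, 0x03, 0x03, 0x04], TO_UNICODE)
print(text)  # Hello
```

The fallback to U+FFFD is a design choice for the sketch: a badly constructed PDF may lack some mappings, and marking the gap is usually better than silently dropping characters.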

For version 1 we are only looking at substitution (i.e. using fonts on the system rather than extracting the font from the PDF file). So we also allow the user to map each font onto a web font, so that they can ensure the font is suitable. This is achieved with a lookup table, and we also allow you to control how this happens. I will cover this in the next article…
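Until then, here is a rough Python sketch of what a substitution lookup table could look like. Everything in it is an assumption for illustration: the default table entries, the `normalise` and `css_font_family` helpers, and the choice of `sans-serif` as the fallback. The one PDF-specific detail is that subset fonts carry a six-letter prefix followed by `+` (for example `ABCDEF+Arial-BoldMT`), which is stripped before the lookup.

```python
import re

# Hypothetical default table: normalised PDF font name -> CSS font-family.
DEFAULT_FONT_MAP = {
    "arial": "Arial, Helvetica, sans-serif",
    "timesnewroman": "'Times New Roman', Times, serif",
    "courier": "'Courier New', Courier, monospace",
}

# Subset fonts in a PDF are named with six uppercase letters and a '+'.
SUBSET_PREFIX = re.compile(r"^[A-Z]{6}\+")


def normalise(pdf_font_name):
    """Reduce a PDF font name to a simple lookup key."""
    name = SUBSET_PREFIX.sub("", pdf_font_name)
    # Drop a style suffix such as "-BoldMT" or ",Bold".
    base = re.split(r"[-,]", name, maxsplit=1)[0]
    # Keep lowercase letters only, so spacing and case do not matter.
    return re.sub(r"[^a-z]", "", base.lower())


def css_font_family(pdf_font_name, user_map=None, fallback="sans-serif"):
    """Look up a web font for a PDF font; user entries override defaults."""
    table = dict(DEFAULT_FONT_MAP)
    for key, value in (user_map or {}).items():
        table[normalise(key)] = value
    return table.get(normalise(pdf_font_name), fallback)


print(css_font_family("ABCDEF+Arial-BoldMT"))
print(css_font_family("XYZABC+Frutiger-Roman",
                      {"Frutiger": "'Open Sans', sans-serif"}))
```

The `user_map` argument stands in for the user control mentioned above: entries supplied by the user are merged over the defaults, so a house font can be pointed at whichever web font is considered suitable.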

This article is part of the PDF to HTML5 conversion series.

This post is also part of our “Fonts Articles Index”, a set of articles in which we explore fonts.


Mark Stephens

System Architect and Lead Developer at IDRSolutions
Mark Stephens has been working with Java and PDF since 1999 and has diversified into HTML5, SVG and JavaFX. He also enjoys speaking at conferences and has been a Speaker at user groups, Business of Software, Seybold and JavaOne conferences. He has a very dry sense of humor and an MA in Medieval History for which he has not yet found a practical use.
