Sam Howard

Sam is a developer at IDRsolutions who mostly specialises in font support and conversion. He’s also enjoyed working with Java 3D, Java FX and Swing. His other interests include music and game design.

PDF to HTML5 conversion – Extracting PDF text and mapping glyphs


Text works differently in PDFs and in HTML files, which makes getting great text output from PDF to HTML5 conversion a surprisingly complex problem.

Each character of PDF text actually carries two entirely separate values: one for choosing which glyph to display (the display value), and one for extraction (the extraction value). You might find that a title appears entirely in capitals, but when you copy the text and paste it elsewhere it’s all lower case.

Figure: In PDF text, different values are used for extraction and display.
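
To make the two-value idea concrete, here is a minimal sketch in Python. It is not how any real PDF library models text: the `PdfGlyph` class, the numeric display values and the `GLYPH_SHAPES` table are all made up for illustration. It simply reproduces the capitals-versus-lower-case title described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PdfGlyph:
    display_value: int     # picks which glyph shape the font draws
    extraction_value: str  # what copy/paste and search tools receive


# Hypothetical font: which shape each display value draws.
GLYPH_SHAPES = {1: "T", 2: "I", 3: "L", 4: "E"}

# A title drawn with capital glyphs but tagged with lower-case text.
title = [
    PdfGlyph(1, "t"),
    PdfGlyph(2, "i"),
    PdfGlyph(1, "t"),
    PdfGlyph(3, "l"),
    PdfGlyph(4, "e"),
]


def rendered_text(run):
    """What the reader sees on screen."""
    return "".join(GLYPH_SHAPES[g.display_value] for g in run)


def extracted_text(run):
    """What ends up on the clipboard."""
    return "".join(g.extraction_value for g in run)


print(rendered_text(title))   # TITLE
print(extracted_text(title))  # title
```

Nothing forces the two values to agree, which is exactly why the problems below can happen.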

Slightly less usefully, you sometimes find PDF files where the text you copy out is complete gibberish. The extraction values might be completely wrong, but because the display values are correct the page looks right in Adobe Reader.

Equally, you sometimes find that the extraction values are fine, but if you open up the font you see that the glyphs are completely mislabelled, so typing ‘U’ could make a ‘!’ appear. Again, it will look right in Adobe Reader if the display values match the wrong values in the font.

HTML treats text much more simply – there’s no concept of separating the display and extraction values, so what you see is what you get.

So, when you’re converting from PDF text to HTML5 output, which value should you use?

It’s a trick question – neither.

If you use the display value, it should look right, but there’s a pretty high chance the text will be completely wrong when you copy it out of the HTML. Amongst other things, that means search engines can’t understand your content, so one of the benefits of having HTML documents is lost.

If you use the extraction value, there’s no guarantee that it will map onto the right glyph in the font, or that it will map onto any value in the font. Even if you rewrote the font to get the values to match, you could have the same extraction values being used for multiple different glyphs, which could make for some ugly problems. One file I’ve seen uses the extraction value ‘h’ for bullet points – if we used that mapping, you could see bullet points popping up in the middle of words.
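
Here is a rough sketch of how that bullet point problem could arise. It assumes a naive converter that rewrites the font so each extraction value points at the first glyph it was seen with; the glyph names and the `naive_font_map` helper are hypothetical, not taken from any real file or library.

```python
# (extraction value, glyph name) pairs in the order the PDF uses them.
# Here 'h' is used first for a bullet point, later for the real letter h.
glyph_uses = [
    ("h", "bullet"),
    ("t", "letter_t"),
    ("h", "letter_h"),
    ("e", "letter_e"),
]


def naive_font_map(uses):
    """Map each extraction value to the first glyph seen for it."""
    font = {}
    for extraction_value, glyph in uses:
        font.setdefault(extraction_value, glyph)  # later glyphs are dropped
    return font


def render(text, font):
    """Return the glyphs the rewritten font would draw for some HTML text."""
    return [font[ch] for ch in text]


font = naive_font_map(glyph_uses)
print(render("the", font))  # ['letter_t', 'bullet', 'letter_e']
```

With only one slot per value, whichever glyph claims ‘h’ first wins, and the other use of ‘h’ is drawn with the wrong shape.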

So what we actually do now is use a potentially modified version of the extraction value. If an extraction value has already been used to show a different glyph, we use a different value instead. We build up a map of these values and rewrite the font so that each one maps onto the right glyph. This gives us a good shot at preserving all of the content in the file whilst also making sure that the text looks right.
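
The approach just described might be sketched like this. It is a simplified illustration, not our actual implementation: the `GlyphRemapper` class is hypothetical, it assumes single-character extraction values, and drawing the replacement values from the Unicode Private Use Area (starting at U+E000) is an assumption made for the example.

```python
PRIVATE_USE_START = 0xE000  # assumption: substitutes come from the Private Use Area


class GlyphRemapper:
    """Pick the character to write to the HTML for each glyph (sketch only)."""

    def __init__(self):
        self.char_to_glyph = {}  # character in HTML -> glyph in rewritten font
        self.assigned = {}       # (extraction value, glyph) -> chosen character
        self.next_substitute = PRIVATE_USE_START

    def character_for(self, extraction_value, glyph):
        key = (extraction_value, glyph)
        if key in self.assigned:
            return self.assigned[key]  # same pairing as before, reuse it
        if extraction_value not in self.char_to_glyph:
            ch = extraction_value      # value is free, keep it unmodified
        else:
            # Value already shows a different glyph: find an unused substitute.
            while chr(self.next_substitute) in self.char_to_glyph:
                self.next_substitute += 1
            ch = chr(self.next_substitute)
        self.char_to_glyph[ch] = glyph
        self.assigned[key] = ch
        return ch


remapper = GlyphRemapper()
first = remapper.character_for("h", "bullet")     # 'h' is free, so it is kept
second = remapper.character_for("h", "letter_h")  # clash, so a substitute is used
print(repr(first), repr(second))  # 'h' '\ue000'
```

At the end, `char_to_glyph` is the map used to rewrite the font. In this sketch the first glyph to use a value keeps it, so here the bullet keeps ‘h’ and the real letter gets the substitute; a real converter would want a smarter rule for deciding which glyph keeps the original value.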

Figure: When converting to HTML, we use both values to create the font and the HTML. (This flowchart is very simplified but it does give a general impression of how things work.)

This is just one of the many improvements we’ve been making to our output, with many more – including improving our Type 1 support – on the way.



