Sam Howard Sam is a developer at IDRsolutions who specialises in font rendering and conversion. He's also enjoyed working with SVG, Java 3D, Java FX and Swing.

PDF to HTML5 conversion – Extracting PDF text and mapping glyphs

Text works differently in PDFs and in HTML files, which makes getting great output from PDF to HTML5 conversion a surprisingly complex problem.

Each character of PDF text actually carries two entirely separate values: one for choosing which glyph to display (the display value), and one for extraction (the extraction value). You might find that a title appears entirely in capitals, but when you copy the text and paste it elsewhere it’s all lower case.

In PDF text, different values are used for extraction and display.
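
To make the split concrete, here is a minimal sketch of the capitals-versus-lower-case title described above. The `PdfChar` type, the `font_glyphs` table and the numeric codes are all hypothetical names invented for illustration; real PDF fonts and encodings are considerably more involved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PdfChar:
    display_code: int      # picks which glyph in the font gets drawn
    extraction_value: str  # what copy/paste or a search tool would see


# Hypothetical embedded font: display code -> glyph that is drawn
font_glyphs = {1: "T", 2: "I", 3: "L", 4: "E"}

# A title that is drawn in capitals but extracts as lower case
title = [
    PdfChar(1, "t"),
    PdfChar(2, "i"),
    PdfChar(1, "t"),
    PdfChar(3, "l"),
    PdfChar(4, "e"),
]


def displayed(chars, glyphs):
    """What a viewer draws: follows the display values only."""
    return "".join(glyphs[c.display_code] for c in chars)


def extracted(chars):
    """What copy/paste returns: follows the extraction values only."""
    return "".join(c.extraction_value for c in chars)


print(displayed(title, font_glyphs))  # TITLE
print(extracted(title))               # title
```

Nothing forces the two strings to agree, which is exactly why the cases below can go wrong in either direction.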

Slightly less usefully, you sometimes find PDF files where the text you copy out is complete gibberish. The extraction values might be completely wrong, but because the display values are correct it looks right in Adobe Reader.

Equally, you sometimes find that the extraction values are fine, but if you open up the font you see that the glyphs are completely mislabelled, so typing ‘U’ could make a ‘!’ appear. Again, it will look right in Adobe Reader if the display values match the wrong values in the font.

HTML treats text much more simply – there’s no concept of separating the display and extraction values, so what you see is what you get.

So, when you’re converting from PDF text to HTML5 output, which value should you use?

It’s a trick question – neither.

If you use the display value, it should look right, but there’s a pretty high chance the text will be completely wrong when you copy it out of the converted HTML. Amongst other things, that means search engines can’t understand your content, so one of the benefits of having HTML documents is lost.

If you use the extraction value, there’s no guarantee that it will map onto the right glyph in the font, or that it will map onto any value in the font. Even if you rewrote the font to get the values to match, you could have the same extraction values being used for multiple different glyphs, which could make for some ugly problems. One file I’ve seen uses the extraction value ‘h’ for bullet points – if we used that mapping, you could see bullet points popping up in the middle of words.
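
The bullet-point problem can be spotted with a simple check. The sketch below is only an illustration: it assumes each character on the page has already been reduced to a hypothetical `(glyph_name, extraction_value)` pair, and it reports every extraction value that is shared by more than one glyph.

```python
def find_collisions(chars):
    """Return extraction values that are used for more than one glyph.

    chars: iterable of (glyph_name, extraction_value) pairs.
    """
    glyphs_by_value = {}
    for glyph_name, extraction_value in chars:
        glyphs_by_value.setdefault(extraction_value, set()).add(glyph_name)
    return {
        value: sorted(glyphs)
        for value, glyphs in glyphs_by_value.items()
        if len(glyphs) > 1
    }


# A bullet point that extracts as 'h', followed by the word "the"
page = [("bullet", "h"), ("t", "t"), ("h", "h"), ("e", "e")]

print(find_collisions(page))  # {'h': ['bullet', 'h']}
```

If the font were rewritten so that ‘h’ always drew the bullet glyph, the word “the” in this example would be rendered with a bullet in the middle.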

So what we actually do now is use a potentially modified version of the extraction value. If an extraction value has already been used to show a different glyph before, we use a different value. We build up a map of these values and rewrite the font to map them onto the right glyph. This gives us a good shot at preserving all of the content in the file whilst also making sure that the text looks right.
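
A rough sketch of that idea follows. It is not the actual converter code: the function name, the `(glyph_id, extraction_value)` input format, the single-character extraction values, and the choice of the Unicode Private Use Area for the “different value” are all assumptions made for illustration.

```python
PRIVATE_USE_START = 0xE000  # assumption: replacement values come from the Private Use Area


def build_glyph_map(chars):
    """Choose an output character for every glyph on the page.

    chars: iterable of (glyph_id, extraction_value) pairs in reading order,
           where each extraction value is assumed to be a single character.
    Returns (output_text, cmap), where cmap maps each output character to
    the glyph the rewritten font must draw for it.
    """
    cmap = {}      # output character -> glyph_id
    assigned = {}  # (glyph_id, extraction_value) -> output character
    next_private = PRIVATE_USE_START
    out = []
    for glyph_id, value in chars:
        key = (glyph_id, value)
        if key not in assigned:
            if cmap.get(value, glyph_id) == glyph_id:
                # Value is unused, or already draws this glyph: keep it
                chosen = value
            else:
                # Value already draws a different glyph: pick an unused one
                while chr(next_private) in cmap:
                    next_private += 1
                chosen = chr(next_private)
                next_private += 1
            cmap[chosen] = glyph_id
            assigned[key] = chosen
        out.append(assigned[key])
    return "".join(out), cmap


page = [("bullet", "h"), ("t", "t"), ("h", "h"), ("e", "e"), ("bullet", "h")]
text, cmap = build_glyph_map(page)
print(repr(text))  # 'ht\ue000eh'
print(cmap)        # {'h': 'bullet', 't': 't', '\ue000': 'h', 'e': 'e'}
```

In this sketch, whichever glyph claims an extraction value first keeps it, and later glyphs with the same value get a replacement. Every character in the output then maps to exactly one glyph, so the page looks right, and the text stays correct wherever the original value could be kept.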

When converting to HTML, we use both values to create the font and the HTML. (This flowchart is very simplified but it does give a general impression of how things work.)

This is just one of the many improvements we’ve been making to our output, with many more – including improving our Type 1 support – on the way.


