Text works differently in PDFs and in HTML files, which can make it a surprisingly complex problem to get great output during PDF to HTML5 conversion.
PDF text is actually two entirely separate values – a value for choosing what glyph to display (display value), and a value for extraction (extraction value). You might find that a title appears entirely in capitals, but when you copy the text and paste it elsewhere it’s all lower case.
Less usefully, you sometimes find PDF files whose text turns into complete gibberish when you copy it out. The extraction values might be completely wrong, but because the display values are correct the page still looks right in Adobe Reader.
Equally, you sometimes find that the extraction values are fine, but if you open up the font you see that the glyphs are completely mislabelled, so typing ‘U’ could make a ‘!’ appear. Again, it will look right in Adobe Reader if the display values match the wrong values in the font.
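To make the split between the two values concrete, here is a minimal Python sketch. The names (PdfChar, FONT_SHAPES, rendered, extracted) and the glyph numbers are made up for illustration and are not taken from any real PDF library; a real file stores this information in the font encoding and related tables rather than in a tidy list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PdfChar:
    """One character as a PDF might describe it (hypothetical model)."""
    glyph_id: int    # display value: which glyph in the font gets drawn
    extraction: str  # extraction value: what copy and paste hands back


# Hypothetical font: glyph id -> the shape that glyph actually draws.
FONT_SHAPES = {1: "T", 2: "I", 3: "L", 4: "E"}

# A title drawn with capital glyphs but carrying lower-case extraction values.
title = [
    PdfChar(1, "t"),
    PdfChar(2, "i"),
    PdfChar(1, "t"),
    PdfChar(3, "l"),
    PdfChar(4, "e"),
]


def rendered(chars, shapes):
    """What the reader sees on the page."""
    return "".join(shapes[c.glyph_id] for c in chars)


def extracted(chars):
    """What the reader gets when copying the text out."""
    return "".join(c.extraction for c in chars)
```

Here rendered(title, FONT_SHAPES) gives "TITLE" while extracted(title) gives "title", which is the capitals-versus-lower-case surprise described above. Nothing in the model forces the two values to agree.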
HTML treats text much more simply – there’s no concept of separating the display and extraction values, so what you see is what you get.
So, when you’re converting from PDF text to HTML5 output, which value should you use?
It’s a trick question – neither.
If you use the display value, it should look right, but there’s a pretty high chance the text will be completely wrong when you copy it out of the converted page. Amongst other things, that means search engines can’t understand your content, and one of the main benefits of having HTML documents is lost.
If you use the extraction value, there’s no guarantee that it will map onto the right glyph in the font, or that it will map onto any value in the font. Even if you rewrote the font to get the values to match, you could have the same extraction values being used for multiple different glyphs, which could make for some ugly problems. One file I’ve seen uses the extraction value ‘h’ for bullet points – if we used that mapping, you could see bullet points popping up in the middle of words.
So what we actually do now is use a potentially modified version of the extraction value. If an extraction value has already been used to show a different glyph before, we use a different value. We build up a map of these values and rewrite the font to map them onto the right glyph. This gives us a good shot at preserving all of the content in the file whilst also making sure that the text looks right.
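The approach above can be sketched roughly as follows. This is an illustration under stated assumptions, not our actual converter code: the class and method names are invented, extraction values are assumed to be single characters, and the choice of substitute value (the next unused code point in the Unicode Private Use Area) is only one plausible option.

```python
PUA_START = 0xE000  # first code point of the Unicode Private Use Area (assumed fallback)


class GlyphRemapper:
    """Hands out one output character per (extraction value, glyph) pair."""

    def __init__(self):
        self.value_to_glyph = {}  # output character -> glyph id it displays
        self.pair_to_value = {}   # (extraction value, glyph id) -> output character
        self.next_pua = PUA_START

    def output_value(self, extraction, glyph_id):
        key = (extraction, glyph_id)
        if key in self.pair_to_value:
            # Seen this pairing before: reuse the same output character.
            return self.pair_to_value[key]

        owner = self.value_to_glyph.get(extraction)
        if owner is None or owner == glyph_id:
            # The extraction value is free (or already shows this glyph): keep it.
            value = extraction
        else:
            # The value already shows a different glyph, so pick a substitute.
            while chr(self.next_pua) in self.value_to_glyph:
                self.next_pua += 1
            value = chr(self.next_pua)
            self.next_pua += 1

        self.value_to_glyph[value] = glyph_id
        self.pair_to_value[key] = value
        return value

    def font_map(self):
        """The character -> glyph table the rewritten font would need."""
        return dict(self.value_to_glyph)
```

With the bullet-point file in mind: the first glyph seen with extraction value 'h' keeps 'h', and a later, different glyph that also claims 'h' is given a substitute character instead, so bullets no longer pop up in the middle of words. The trade-off in this sketch is that the substituted character no longer carries meaningful text when copied, which is why the original value is kept whenever it is still free.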
This is just one of the many improvements we’ve been making to our output, with many more – including improving our Type 1 support – on the way.