PDF to HTML5 conversion – Invisible text appears in HTML5

I have been looking at some interesting PDF files recently. In one of them, extra text appeared on the HTML5 pages: where the PDF showed a blank white box, the HTML5 page showed a white box with text on it. Very odd 🙁

 

So we drilled down into the PDF to see what was going on.

It turns out that the text is selectable in the PDF and you can extract it with copy and paste. It is just not visible, because the white box is drawn on top of the text, hiding it.

When we generate an HTML5 version of the page, we usually use two separate layers: the canvas layer and the text divs. These layers are not interleaved as closely as the content in the PDF, so the text layer is effectively drawn on top of the canvas. This is why we can see the text. There are several solutions to this problem.
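To make the layering difference concrete, here is a toy sketch in Python. It is not our converter code: the one-row "page", the `#` character standing in for the white box, and the function names `paint_pdf` and `paint_html5` are all assumptions made up for illustration.

```python
def paint_pdf(ops, width):
    """Single layer: each operation paints over whatever is already there."""
    row = [" "] * width
    for _kind, start, content in ops:
        for offset, ch in enumerate(content):
            row[start + offset] = ch
    return "".join(row)


def paint_html5(ops, width):
    """Two layers: shapes go to the canvas, text goes to divs stacked above it."""
    canvas = [op for op in ops if op[0] == "shape"]
    text_divs = [op for op in ops if op[0] == "text"]
    # The text layer always ends up on top, whatever the original order was.
    return paint_pdf(canvas + text_divs, width)


# Page order in the PDF: the text is drawn first, then the box covers it.
ops = [("text", 2, "secret"), ("shape", 0, "#" * 10)]

print(paint_pdf(ops, 10))    # ##########
print(paint_html5(ops, 10))  # ##secret##
```

In the single-layer version the box wins because it is painted last; in the two-layer version the original drawing order between text and shapes is lost, so the hidden text shows through.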

Firstly, we could draw all the text on the canvas layer. We have actually added a method to rasterize the text, so that the page looks as it does in the PDF.

There are also two future fixes we could add. We could write all the text out to SVG, or we could scan the page twice, using the first pass to find all the shapes so that we could then decide whether the text was hidden. The downside of the second approach is that it would slow down processing. Both are possible commercial enhancements we could add for clients in the future.
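The two-pass idea could look roughly like the Python sketch below. This is only an illustration under simplifying assumptions, not an implemented feature: `Rect`, `PageItem` and `find_hidden_text` are hypothetical names, every item is reduced to an axis-aligned bounding box, and text counts as hidden only when a single opaque shape drawn later covers it completely. Clipping, transparency, partial overlap and several shapes jointly covering the text are all ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    x0: float
    y0: float
    x1: float
    y1: float

    def contains(self, other: "Rect") -> bool:
        """True if `other` lies entirely inside this rectangle."""
        return (self.x0 <= other.x0 and self.y0 <= other.y0
                and self.x1 >= other.x1 and self.y1 >= other.y1)


@dataclass(frozen=True)
class PageItem:
    kind: str            # "text" or "shape" (hypothetical labels)
    bounds: Rect
    order: int           # position in the drawing order; higher paints later
    opaque: bool = True


def find_hidden_text(items):
    # Pass 1: collect every opaque shape on the page.
    shapes = [i for i in items if i.kind == "shape" and i.opaque]

    # Pass 2: a text item is hidden if a later shape fully covers it.
    hidden = []
    for item in items:
        if item.kind != "text":
            continue
        if any(s.order > item.order and s.bounds.contains(item.bounds)
               for s in shapes):
            hidden.append(item)
    return hidden


page = [
    PageItem("text", Rect(10, 10, 50, 20), order=0),    # covered by the box
    PageItem("shape", Rect(0, 0, 100, 100), order=1),   # the white box
    PageItem("text", Rect(10, 30, 50, 40), order=2),    # drawn after the box
]
print([item.order for item in find_hidden_text(page)])  # [0]
```

The extra pass over the page, plus the comparison of every text item against every shape, is where the slowdown mentioned above would come from.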

Do you have any interesting PDF to HTML5 conversion issues you would like us to investigate?

Mark Stephens

System Architect and Lead Developer at IDRSolutions
Mark Stephens has been working with Java and PDF since 1999 and has diversified into HTML5, SVG and JavaFX. He also enjoys speaking at conferences and has been a Speaker at user groups, Business of Software, Seybold and JavaOne conferences. He has a very dry sense of humor and an MA in Medieval History for which he has not yet found a practical use.