I have been looking at some interesting PDF files recently… In one of them, extra text appeared on the HTML5 pages: the PDF showed a blank white box, while the HTML5 page showed a white box with text on it. Very odd 🙁
It turns out that the text is present in the PDF: you can select it and extract it with copy and paste. It is just not visible, because the white box is drawn on top of the text, hiding it.
When we generate an HTML5 version of the page we usually use two separate layers: the canvas layer and the text divs. These layers are not interleaved in the PDF's paint order, so the text layer is effectively drawn on top of everything on the canvas, including the white box. This is why we can see the text. There are several solutions to this problem.
Firstly, we could draw all the text on the canvas layer. We have actually added a method to rasterize the text so that the page looks as it does in the PDF.
There are also two future fixes we could add. We could write all the text out to SVG, or we could scan the page twice, using the first pass to find all the shapes so that we could then decide whether the text was hidden. The downside is that this would slow down processing. Both are possible commercial enhancements we could add for clients in the future.
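To make the two-pass idea more concrete, here is a minimal sketch in Java. It is not our converter's code: the `HiddenTextCheck` and `PageItem` names are hypothetical, it assumes every item on the page has already been reduced to an axis-aligned bounding box in paint order, and it only treats text as hidden when a single opaque shape painted later covers it completely. A real page would also need to deal with clipping, transparency and partial overlaps.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a two-pass hidden-text check (not a real library API).
final class HiddenTextCheck {

    // One painted item; its position in the list is its paint order.
    static final class PageItem {
        final String label;
        final Rectangle2D bounds;
        final boolean isText;

        private PageItem(String label, Rectangle2D bounds, boolean isText) {
            this.label = label;
            this.bounds = bounds;
            this.isText = isText;
        }

        static PageItem text(String label, Rectangle2D bounds) {
            return new PageItem(label, bounds, true);
        }

        // Assumption: every shape passed in here is filled and fully opaque.
        static PageItem opaqueShape(String label, Rectangle2D bounds) {
            return new PageItem(label, bounds, false);
        }
    }

    static List<String> findHiddenText(List<PageItem> paintOrder) {
        // Pass 1: remember the paint index of every opaque shape.
        List<Integer> shapeIndexes = new ArrayList<>();
        for (int i = 0; i < paintOrder.size(); i++) {
            if (!paintOrder.get(i).isText) {
                shapeIndexes.add(i);
            }
        }

        // Pass 2: text is hidden if a shape painted later fully contains it.
        List<String> hidden = new ArrayList<>();
        for (int i = 0; i < paintOrder.size(); i++) {
            PageItem item = paintOrder.get(i);
            if (!item.isText) {
                continue;
            }
            for (int shapeIndex : shapeIndexes) {
                if (shapeIndex > i
                        && paintOrder.get(shapeIndex).bounds.contains(item.bounds)) {
                    hidden.add(item.label);
                    break;
                }
            }
        }
        return hidden;
    }

    public static void main(String[] args) {
        List<PageItem> page = new ArrayList<>();
        page.add(PageItem.text("secret", new Rectangle2D.Double(20, 20, 60, 10)));
        page.add(PageItem.opaqueShape("white box", new Rectangle2D.Double(10, 10, 100, 40)));
        page.add(PageItem.text("caption", new Rectangle2D.Double(20, 60, 60, 10)));
        System.out.println(findHiddenText(page));
    }
}
```

Text flagged this way could then be left out of the text divs, which is where the extra processing time would go: every text run has to be compared against the shapes painted after it.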
Do you have any interesting PDF to HTML5 conversion issues you would like us to investigate?
IDRsolutions develop a Java PDF library, a PDF forms to HTML5 converter, a PDF to HTML5 or SVG converter and a Java Image Library that doubles as an ImageIO replacement. On the blog our team post about anything interesting they learn about.