While debugging our PDF to HTML5 converter, we have come across all sorts of interesting ‘PDF’ features which need converting to an HTML5 equivalent.
Today, I have been looking at a PDF page which had extra text in the HTML5 version. It turns out that the text is also in the PDF, but it is invisible. You can select it but you cannot see it. In the PDF a white box has been drawn over it…
In general this is not a good way to delete PDF text (especially if it is sensitive or confidential!). The text is still there in the PDF and can be easily extracted.
The white box is also drawn in the HTML5, but because the shape is on the canvas layer (and the text is in a div on the separate text layer), the box cannot cover the text and so the text is not hidden.
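To make the layering problem concrete, here is a minimal Java sketch of how text covered by a later-painted opaque shape might be detected. It is only an illustration under stated assumptions: the HiddenTextCheck, TextRun and FilledShape names are hypothetical and are not part of our converter or of any real PDF library, and the check assumes simple rectangular bounds and fully opaque fills.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model for illustration only; not a real converter API. */
final class HiddenTextCheck {

    /** A run of text, its bounding box, and its position in the page's paint order. */
    static final class TextRun {
        final String text;
        final Rectangle2D bounds;
        final int paintOrder;

        TextRun(String text, Rectangle2D bounds, int paintOrder) {
            this.text = text;
            this.bounds = bounds;
            this.paintOrder = paintOrder;
        }
    }

    /** An opaque filled rectangle and its position in the page's paint order. */
    static final class FilledShape {
        final Rectangle2D bounds;
        final int paintOrder;

        FilledShape(Rectangle2D bounds, int paintOrder) {
            this.bounds = bounds;
            this.paintOrder = paintOrder;
        }
    }

    /**
     * Returns the text of every run that is completely covered by a shape
     * painted after it. In a single-layer renderer (such as a PDF viewer)
     * this text would not be visible.
     */
    static List<String> findHiddenText(List<TextRun> runs, List<FilledShape> shapes) {
        List<String> hidden = new ArrayList<>();
        for (TextRun run : runs) {
            for (FilledShape shape : shapes) {
                boolean paintedLater = shape.paintOrder > run.paintOrder;
                if (paintedLater && shape.bounds.contains(run.bounds)) {
                    hidden.add(run.text);
                    break; // one covering shape is enough
                }
            }
        }
        return hidden;
    }
}
```

A PDF viewer paints everything into one surface, so paint order alone decides what is visible. Once the text moves to its own layer above the canvas, that ordering is lost, which is why a check along these lines (or drawing the text on the canvas) would be needed to reproduce the original appearance.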
The practical fix is to put the text onto the canvas and we have a flag to do this. This is not totally satisfactory because text on the canvas acts like a bitmap. It does not scale without pixellation.
As is often the case, the quality of the PDF affects what we can do in HTML5.