Mark Stephens Mark has been working with Java and PDF since 1999 and is a big NetBeans fan. He enjoys speaking at conferences. He has an MA in Medieval History and a passion for reading.

PDF hacks and HTML5 – ‘hidden’ PDF text


While debugging our PDF to HTML5 converter, we have come across all sorts of interesting PDF 'features' which need converting to an HTML5 equivalent.

Today, I have been looking at a PDF page whose HTML5 version showed extra text. It turns out that the text is also in the PDF, but it is invisible there. You can select it but you cannot see it, because a white box has been drawn over it in the PDF.

In general this is not a good way to delete PDF text (especially if it is sensitive or confidential!). The text is still there in the PDF and can be easily extracted.

The white box is also drawn in the HTML5 output, but the shape is on the canvas layer and the text is in a div on the separate text layer, so the box does not hide the text.
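To make the layering problem concrete, here is a minimal sketch in Java of how you could spot text that a PDF hides under a shape. It is not our converter's code: the `Box` and `Item` types, the `paintOrder` field and the `findCoveredText` method are all hypothetical names invented for this illustration, and it assumes simple axis-aligned rectangles with fully opaque fills. It flags any text item that is completely covered by an opaque shape painted later in the page's drawing sequence, which is exactly the text that stays visible once text and shapes end up on separate layers.

```java
import java.util.ArrayList;
import java.util.List;

public class HiddenTextCheck {

    /** Axis-aligned box in page coordinates (a hypothetical simplification). */
    record Box(double x, double y, double w, double h) {
        /** True if this box completely contains the other box. */
        boolean covers(Box o) {
            return x <= o.x && y <= o.y
                && x + w >= o.x + o.w
                && y + h >= o.y + o.h;
        }
    }

    /** One painted item; paintOrder is its position in the page's drawing sequence. */
    record Item(String label, Box bounds, boolean isText, boolean opaqueFill, int paintOrder) {}

    /** Returns the text items fully covered by an opaque shape painted after them. */
    static List<Item> findCoveredText(List<Item> items) {
        List<Item> covered = new ArrayList<>();
        for (Item text : items) {
            if (!text.isText()) {
                continue;
            }
            for (Item shape : items) {
                if (!shape.isText() && shape.opaqueFill()
                        && shape.paintOrder() > text.paintOrder()
                        && shape.bounds().covers(text.bounds())) {
                    covered.add(text);
                    break; // one covering shape is enough
                }
            }
        }
        return covered;
    }

    public static void main(String[] args) {
        // Made-up page content: text, then a white box over it, then more text.
        List<Item> page = List.of(
            new Item("Secret figure", new Box(100, 100, 80, 12), true, false, 0),
            new Item("white box", new Box(95, 95, 100, 25), false, true, 1),
            new Item("Visible heading", new Box(100, 200, 120, 14), true, false, 2));

        for (Item t : findCoveredText(page)) {
            System.out.println("Hidden in PDF, visible on text layer: " + t.label());
        }
    }
}
```

A real page is messier than this: shapes can be partly transparent, clipped, rotated or only partly overlap the text, so treat the check as a way of thinking about the problem and not as a complete detector.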

The practical fix is to put the text onto the canvas, and we have a flag to do this. This is not totally satisfactory, because text on the canvas behaves like a bitmap: it does not scale without pixellation.

As is often the case, the quality of the PDF affects what we can do in HTML5.






IDRsolutions Ltd 2021. All rights reserved.