A potential customer asked how PDF to HTML conversion works, so here is an explanation…
A PDF file is more like a traditional computer program than a traditional file format. Executing the instructions in the PDF file draws shapes, text and images onto a display, and the finished result is your page. PDF was developed from PostScript, the programming language which revolutionised printing by letting computers and cheap printers produce beautiful copy (in the right hands, that is: you still need some design talent to produce great design).
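To make the "PDF as a program" idea concrete, here is a minimal sketch of how a page's instructions might be executed. It assumes a heavily simplified content stream: tokens separated by whitespace, text strings without spaces, and only a handful of real PDF operators (`m`, `l`, `S`, `BT`, `Tj`, `ET`). The class and method names are made up for illustration and are not part of our library.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class TinyContentStream {

    /** Runs a simplified content stream and returns a log of the drawing calls it makes. */
    public static List<String> run(String stream) {
        Deque<String> operands = new ArrayDeque<>(); // PDF is postfix: operands come before the operator
        List<String> calls = new ArrayList<>();

        for (String token : stream.trim().split("\\s+")) {
            switch (token) {
                case "m": { // x y m : start a new path at (x, y)
                    String y = operands.pop();
                    String x = operands.pop();
                    calls.add("moveTo(" + x + ", " + y + ")");
                    break;
                }
                case "l": { // x y l : add a straight line to (x, y)
                    String y = operands.pop();
                    String x = operands.pop();
                    calls.add("lineTo(" + x + ", " + y + ")");
                    break;
                }
                case "S": // stroke the current path
                    calls.add("stroke()");
                    break;
                case "BT": // begin a text object
                    calls.add("beginText()");
                    break;
                case "ET": // end a text object
                    calls.add("endText()");
                    break;
                case "Tj": { // (string) Tj : show a text string
                    String s = operands.pop();
                    calls.add("showText(" + s.substring(1, s.length() - 1) + ")");
                    break;
                }
                default: // anything else is treated as an operand and stacked
                    operands.push(token);
            }
        }
        return calls;
    }

    public static void main(String[] args) {
        run("10 20 m 100 20 l S BT (Hello) Tj ET").forEach(System.out::println);
    }
}
```

A real parser also has to deal with fonts, colour spaces, graphics state, compression and many more operators, but the basic loop is the same: read operands, hit an operator, do some drawing.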
So the first thing you need is a PDF parser. Luckily we happen to have one of those lying around, which we have been developing for the last 11 years, so it is robust, powerful and tested. We added hooks at the points where it writes out the text, shapes and images, and altered those to generate the required HTML5 instead. Usefully, because the code was designed to draw using Java (which works in sRGB), we also had all the colour conversions in place, so we could use RGB as the display format whatever was in the PDF.
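As a rough illustration of the hook idea, the sketch below shows a callback that emits HTML instead of painting to a display. Everything here is hypothetical: `PageOutput`, `Html5Output` and `cmykToRgb` are names invented for this example, the positioning ignores real font metrics, and the CMYK conversion is the well-known naive formula rather than the profile-based conversion a production converter would use.

```java
import java.util.Locale;

public class HtmlHookSketch {

    /** Hypothetical callback the parser would invoke instead of drawing text itself. */
    public interface PageOutput {
        void drawText(String text, double x, double y, double fontSize, int rgb);
    }

    /** Writes absolutely positioned HTML rather than painting pixels. */
    public static class Html5Output implements PageOutput {
        private final StringBuilder html = new StringBuilder();
        private final double pageHeight;

        public Html5Output(double pageHeight) {
            this.pageHeight = pageHeight;
        }

        @Override
        public void drawText(String text, double x, double y, double fontSize, int rgb) {
            // PDF measures y upwards from the bottom of the page; CSS measures down from the top.
            // Rough approximation: treat the font size as the height above the baseline.
            double top = pageHeight - y - fontSize;
            html.append(String.format(Locale.ROOT,
                    "<span style=\"position:absolute;left:%.1fpx;top:%.1fpx;"
                            + "font-size:%.1fpx;color:#%06x\">%s</span>\n",
                    x, top, fontSize, rgb & 0xFFFFFF, escape(text)));
        }

        public String html() {
            return html.toString();
        }

        private static String escape(String s) {
            return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
        }
    }

    /** Naive CMYK (each 0..1) to packed RGB. Real conversions use ICC colour profiles. */
    public static int cmykToRgb(double c, double m, double y, double k) {
        int r = (int) Math.round(255 * (1 - c) * (1 - k));
        int g = (int) Math.round(255 * (1 - m) * (1 - k));
        int b = (int) Math.round(255 * (1 - y) * (1 - k));
        return (r << 16) | (g << 8) | b;
    }

    public static void main(String[] args) {
        Html5Output out = new Html5Output(800);
        out.drawText("Hello", 72, 700, 12, cmykToRgb(0, 1, 1, 0));
        System.out.print(out.html());
    }
}
```

The point of the interface is that the parser does not need to know where its output goes: the same calls can paint to a screen or build up a web page.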
Sometimes you need to make changes to the HTML output to allow for differences between HTML and PDF. For example, there is a clip in PDF but not in HTML, so images need to be preclipped. That’s where we are now: testing a lot of files and improving the output. There are also lots of features we are adding; I am currently working on TrueType fonts. Hope that helps explain it. Give the converter a try and let us know what you think… There are instructions on using it here.
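For anyone curious about the preclipping mentioned above, it could be sketched with standard Java 2D calls as follows. This is only an illustration, not our actual code: `PreclipSketch` and `preclip` are invented names, and the clip shape is assumed to be already expressed in the image's own pixel coordinates.

```java
import java.awt.Graphics2D;
import java.awt.Shape;
import java.awt.image.BufferedImage;

public class PreclipSketch {

    /**
     * Returns a copy of the image in which everything outside the clip shape
     * is fully transparent, so the page itself does not need to clip it.
     */
    public static BufferedImage preclip(BufferedImage source, Shape clip) {
        // ARGB so that the clipped-away pixels can stay transparent
        BufferedImage result = new BufferedImage(
                source.getWidth(), source.getHeight(), BufferedImage.TYPE_INT_ARGB);
        Graphics2D g = result.createGraphics();
        try {
            g.setClip(clip);                 // only pixels inside the shape get painted
            g.drawImage(source, 0, 0, null); // draw the original through the clip
        } finally {
            g.dispose();
        }
        return result;
    }
}
```

The clipped image can then be saved in a format that supports transparency (PNG, for example) and referenced from the generated page.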
IDRsolutions develop a Java PDF library, a PDF forms to HTML5 converter, a PDF to HTML5 or SVG converter and a Java Image Library that doubles as an ImageIO replacement. On the blog our team post about anything interesting they learn about.