Why PDF to HTML conversion does not work very well

When people convert PDF files into HTML, they are often disappointed with the results. The main reason is that a straight conversion is not possible. PDF files can contain a large number of structures which have no direct equivalent in HTML (even in the new HTML5). PDF was designed as a format to be viewed: the content is painted onto the page and the user sees the end result. Many PDFs are built from strips of images or overlapping layers which need to fit together exactly.

With the latest PDF versions, this is even worse. How do you translate XML, transparency, colorspace models, JavaScript and interactive elements into HTML correctly?

People also expect the text in an HTML file to be in the correct order. Because a PDF describes a ‘picture’ of the page, this is not always going to happen. Some PDF creation tools draw the text in very odd ways – I explained this in more detail in a previous article: PDF text. The text looks correct because your brain sees the finished output and interprets it.
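To make the ordering problem concrete, here is a minimal Python sketch. It assumes the text has already been extracted as fragments with page coordinates; the `Fragment` class, the `reading_order` function and the sample coordinates are all hypothetical, invented for illustration. It groups fragments into lines by baseline and sorts each line left to right, which is a naive heuristic and not how any particular converter works.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """A piece of text as drawn on the page (hypothetical structure)."""
    text: str
    x: float  # left edge, in points
    y: float  # baseline, measured up from the bottom of the page


def reading_order(fragments, line_tolerance=2.0):
    """Group fragments into lines by baseline, then sort each line left to right."""
    lines = []  # each entry is a list of fragments sharing a baseline
    # Visit fragments from the top of the page down (largest y first).
    for frag in sorted(fragments, key=lambda f: -f.y):
        if lines and abs(lines[-1][0].y - frag.y) <= line_tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])
    return [
        " ".join(f.text for f in sorted(line, key=lambda f: f.x))
        for line in lines
    ]


# Fragments listed in the order the (imaginary) PDF draws them.
drawn = [
    Fragment("world", 60, 700),
    Fragment("Second", 10, 680),
    Fragment("Hello", 10, 700),
    Fragment("line", 60, 680),
]

print(" ".join(f.text for f in drawn))  # drawing order: scrambled
print(reading_order(drawn))             # position-based order
```

A heuristic like this breaks down quickly on multi-column layouts, rotated text or tables, which is part of why converted text so often comes out in the wrong order.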

If image quality in the HTML is important, you could convert the PDF into an image and display that, but then all interaction is lost and you need big files for high resolution.
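As a rough illustration of why high resolution means big files, the following Python sketch works out the pixel dimensions and uncompressed size of one rendered page. The `raster_size` function is hypothetical, and the US Letter page size and 3 bytes per pixel (8-bit RGB) are assumptions chosen for the example. Real image formats compress, so actual files will be smaller, but the growth with resolution follows the same pattern.

```python
def raster_size(width_in, height_in, dpi, bytes_per_pixel=3):
    """Return (width_px, height_px, uncompressed_bytes) for one rendered page."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px, height_px, width_px * height_px * bytes_per_pixel


# US Letter is 8.5 x 11 inches (assumed page size for this example).
for dpi in (72, 150, 300):
    w, h, size = raster_size(8.5, 11, dpi)
    print(f"{dpi} dpi: {w} x {h} px, about {size / 1_000_000:.1f} MB uncompressed")
```

Because the pixel count grows with the square of the resolution, going from 72 dpi to 300 dpi multiplies the raw data by roughly 17.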

When HTML5, CSS and JavaScript are well supported, this may well change. But in the meantime (2010), be careful about trying to turn PDF into HTML – try to keep it in PDF if possible or be prepared to live with less than perfect results.

Updated 2012 – since I wrote this, I have indeed had a look at HTML5 and you can read the results in other blog articles.

This post is part of our “HTML5 Article index”. In these articles, we aim to help you understand the world of HTML5.

Mark Stephens

System Architect and Lead Developer at IDRSolutions
Mark Stephens has been working with Java and PDF since 1999 and has diversified into HTML5, SVG and JavaFX. He also enjoys speaking at conferences and has been a Speaker at user groups, Business of Software, Seybold and JavaOne conferences. He has a very dry sense of humor and an MA in Medieval History for which he has not yet found a practical use.

4 thoughts on “Why PDF to HTML conversion does not work very well”

  1. Samraj

    Hi
    I am trying to convert a file in PageMaker 7.0 into HTML. Since a direct conversion is not yielding the desired results, I converted it into PDF first. This step is OK. But in the next step, when I convert PDF to HTML, not all fonts are properly displayed. Although the system has all the fonts, the HTML does not recognize them. I tried to embed the fonts into the document and then convert. Still the HTML does not recognize the fonts. Does it have anything to do with the font type, such as TrueType, OpenType etc.? Please help.

    • Are you trying to convert in PageMaker 7.0 or using JPDF2GTML5?

      • Samraj Kirubaharan

        I used PageMaker 7.0 to create my first and original files. By the way, what is JPDF2GTML5? Is it a tool to convert PDF to HTML5?
        I am trying to convert into HTML 4.01 from PDF. And the original files are in PageMaker 7.0, as I mentioned. Do you need any other information?

        • JPDF2GTML5 is our PDF to HTML5 converter. There is a free online version at https://convert.idrsolutions.com

          You really need a PageMaker forum to ask questions related to PageMaker 7.0. I am sorry, but we do not use it.
