Having spent nearly 3 years building our own PDF to HTML5/SVG/Android converter (which itself relies on our development of a Java PDF library), we know how hard this conversion is and we have a grudging respect for any others trying it. In particular we have been interested in the development of PDF.js…
1. Save PDF as HTML5
What PDF.js does is take the PDF file, decode and extract the content, and display it as HTML5 with some eye candy and controls around the edge. Wouldn’t it be great if you could save that HTML5 version and put it on your website instead of the PDF?
One of the downsides of the PDF file format is that it’s not particularly web friendly. Whilst it’s possible to create PDF files with Fast Web View, the majority of PDF files don’t have it, and the user is left to download the entire PDF file (which could be very large), regardless of how many pages they actually view. There’s an opportunity here to alleviate the issue by storing the HTML5 version on a page by page basis.
One of the things that PDF does particularly well is store very complex documents using very little data. The “markup” (I use this term loosely) is very well optimised. Unfortunately, HTML’s is not, nor is HTML suited (in any way) to the fine control of text positioning and layout that PDF offers. What this adds up to is (depending on how you optimise) very large file sizes.
There’s a penalty if you want an immaculate representation of the PDF file, and at the same time a lot of savings to be had if you are prepared to compromise. In PDF.js’s case, it’s probably best that the content stays as PDF: it doesn’t matter how large your converted files are if they are stored locally.
The screenshot below shows an example of this. It demonstrates the HTML markup required to display a short phrase as the PDF intended. (Höchste Zeit für einen Original VELUX Aussenrollladen.)
For reference, here’s how it looks in the PDF file:
Unfortunately Velux have not paid me for this…
2. Copy and Paste Text from PDF files legibly
Following on from the point above, my tests have shown that it is very common for PDF files to contain finely positioned (or spaced) text, and positioning it correctly in HTML requires many separate divs. When you copy text that spans multiple divs, a new line gets inserted for each div. For the text above, this means that what should appear over 4 lines actually appears over 22, with a couple of characters on each.
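One way a converter could reduce this problem is to merge fragments that share a baseline before writing them out, so that each visual line becomes a single element. The following is only a minimal sketch of that idea in Python, not how PDF.js or our converter actually works. The `Fragment` type, the `merge_fragments` function and the `y_tolerance` value are all hypothetical names invented for illustration, and the coordinates assume a top-down page where `y` is the baseline.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """A run of text with its position on the page (hypothetical type)."""
    text: str
    x: float  # left edge of the run
    y: float  # baseline of the run, measured from the top of the page


def merge_fragments(fragments, y_tolerance=1.0):
    """Group fragments whose baselines are within y_tolerance of the first
    fragment on a line, then join each group from left to right.

    Fragments are joined with no separator, so this assumes each fragment
    already carries its own spaces.
    """
    lines = []  # each entry is (baseline_y, [fragments on that line])
    for frag in sorted(fragments, key=lambda f: (f.y, f.x)):
        if lines and abs(lines[-1][0] - frag.y) <= y_tolerance:
            lines[-1][1].append(frag)
        else:
            lines.append((frag.y, [frag]))
    return [
        "".join(f.text for f in sorted(group, key=lambda f: f.x))
        for _, group in lines
    ]


fragments = [
    Fragment("Hö", 10, 100),
    Fragment("chste ", 22, 100.4),
    Fragment("Zeit", 60, 100),
    Fragment("für", 10, 120),
]
merged = merge_fragments(fragments)
```

With the four fragments above, the first three collapse into one line and the fourth stays on its own, so copying the result would give two lines rather than four. The trade-off is that merged runs lose their individual positioning, which is exactly the compromise between fidelity and usability described in point 1.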
3. Edit & Save The HTML5 Version
This is a similar groan to point 1, except that I will focus on the ability to edit. It is a well known fact that if you made a typing error in a document you have saved as a PDF, and you have lost the original, it is generally a huge pain to correct this. There are many attempts at software that allow you to edit PDF files, but none that are particularly good.
Regardless, saving out as HTML5 is the perfect opportunity to allow you to edit any text or images, and then share a widely compatible HTML5 version of your content. However, the text that you think you can see, and highlight, and copy/paste is actually a very good illusion.
If you were to edit the text, you would notice that visually nothing is changing, however when you select and copy you are selecting the updated text. The reason for this is that to ensure the perfect visual representation, PDF.js is actually using invisible text overlaid on the text drawn as an image.
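To make the illusion concrete, here is a rough sketch in Python of the general technique: a page image with transparent, absolutely positioned text laid over it. This is an illustration under my own assumptions and is not the markup PDF.js actually produces. The function name `overlay_page_html` and the shape of the `runs` tuples are hypothetical.

```python
from html import escape


def overlay_page_html(image_url, width, height, runs):
    """Build HTML for one page: a visible image plus invisible selectable text.

    runs is an iterable of (text, left_px, top_px, font_size_px) tuples
    (a hypothetical structure chosen for this example).
    """
    spans = "".join(
        f'<span style="position:absolute;left:{left}px;top:{top}px;'
        f'font-size:{size}px;color:transparent;white-space:pre">'
        f"{escape(text)}</span>"
        for text, left, top, size in runs
    )
    return (
        f'<div style="position:relative;width:{width}px;height:{height}px">'
        f'<img src="{escape(image_url, quote=True)}" '
        f'width="{width}" height="{height}" alt="">'
        f"{spans}</div>"
    )


page = overlay_page_html("page1.png", 600, 800, [("A < B", 10, 20, 12)])
```

Editing the text inside a span changes what gets selected and copied, but the image underneath is what you actually see, so nothing changes visually.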
PDF.js takes this approach because of the huge issue that is font conversion. PDF allows for a very wide spectrum of fonts, but web browsers are definitely not font friendly. This means that in order to display fonts from PDF files in a web browser, they need to be converted to a supported font format, which is fraught with difficulty as each browser has its own expectations of how a font should be defined.
You can see this optimisation in action if you highlight text, and focus on another window. Here’s the same example from earlier:
4. Layout the pages differently (e.g. magazine style)
It’s quite common for publishers to save magazines and books as PDF files which are intended to be viewed as if they were an open book or magazine, with even numbered pages on the left, and odd numbered pages on the right.
Here’s a nice example:
Unfortunately, PDF.js only has one viewing mode – a single column.
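The pairing logic behind a magazine-style view is simple. As a minimal sketch in Python (the function name `two_page_spreads` is my own, and I am assuming page 1 is a cover that sits alone on the right), the pages can be grouped into spreads like this:

```python
def two_page_spreads(page_count):
    """Return a list of (left, right) page numbers for a book-style view.

    Even pages go on the left and odd pages on the right. None marks an
    empty side, for example opposite the cover or after a final even page.
    """
    if page_count < 1:
        return []
    spreads = [(None, 1)]  # the cover sits alone on the right
    for left in range(2, page_count + 1, 2):
        right = left + 1 if left + 1 <= page_count else None
        spreads.append((left, right))
    return spreads


spreads = two_page_spreads(5)
```

A viewer would then lay out each tuple side by side instead of stacking every page in a single column.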
5. Support Interactive Elements
As previously mentioned, PDF is a very powerful format, and it allows for lots of interactivity through elements such as forms, audio and video. These are all things that HTML5 can also do well, so perhaps we will see them in a future Firefox release?
If these features are of interest to you, we are the creators of a PDF to HTML5 converter with many of these features that may be worth taking for a spin. You can find out more about it and try it online here.
What do you think of the PDF.js viewer?
This post is part of our “SVG Article Index”. In these articles, we aim to help you build knowledge and understand SVG.
IDRsolutions develop a Java PDF library, a PDF forms to HTML5 converter, a PDF to HTML5 or SVG converter and a Java Image Library that doubles as an ImageIO replacement. On the blog our team post about anything interesting they learn about.