Ever since we began writing our PDF to HTML5 converter a little over 2 years ago, we have used the HTML5 canvas to present PDF files as HTML5. It allows us to output PDF vector graphics and images as JavaScript commands that draw onto the canvas when the file is loaded. We can then add selectable text and form components on top. At the time this made sense – the canvas is well supported and works on nearly all mobile devices, giving us good compatibility.
But since then, we have discovered many ‘features’ of the canvas that have forced us to get creative with how we convert PDF files into HTML5. For example, the HTML5 Canvas does not currently support filling shapes with the EvenOdd rule, or specifying settings for dashed lines. To solve these issues, we have had to output those shapes as images instead. Unfortunately, this can bloat the output of pages where many shapes end up as images.
There are also other interesting issues – for example using Save and Restore on Chrome on Android will result in a shape being incorrectly repeated, and using a scale CSS transform in Safari on Mac rasterizes text when you scale rather than redrawing at the correct size. These are all things I will go into more detail about in coming weeks.
But perhaps the biggest flaw with the canvas is that it’s a raster format. If you draw shapes to canvas, they get rasterized and do not scale well. In many cases, we could actually get a better result if we just provide an image of the page, and we do already offer this as an option in our converter.
There are several advantages to providing an image instead:
1. Lower file size – The file size of the image representation of the page can actually be smaller than the draw commands.
2. Visual feedback when loading – Browsers display images as they are loading, whereas with the canvas you don’t see anything until everything has loaded.
3. Faster load times – As the page is pre-rasterized there is no longer the overhead of having to rasterize the page to canvas each time it is loaded.
4. Everything is simplified – Currently we have some not so nice JavaScript to load and draw the page. We can replace all of this with a simple HTML image tag, which also greatly tidies up our conversion code.
5. Better IE support – Images work in older versions of IE (even IE6).
Outputting content as an image is a very nice compromise if you want fast loading files at the cost of not so nice zoom, and we will continue to offer this as an output option.
The PDF file format is a vector file format, and rasterizing the output is a very poor way to convert – it doesn’t make full use of HTML5 features and it does not scale well. We are planning to replace Canvas with inline SVG to produce vector HTML5 representations of PDF pages. SVG support in all mainstream browsers has improved vastly over the last 3 years. It is now a viable (even superior) alternative to Canvas.
This means that if you choose the SVG conversion option, instead of an image tag, you will in fact get an object tag that displays the content of an SVG file. This has a significant advantage in that it offers flawless zooming, as you would expect from a PDF file. Like images, SVG also displays the content as it is loaded, making for an improved user experience.
In fact, what we will actually output is both an image and SVG representation of the page. If SVG is supported, the SVG will be used, otherwise the image will be used. This means that even when using the SVG mode, the output can use the fallback image and will even work on Internet Explorer 6!
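As a rough illustration of the image-plus-SVG idea, one common HTML pattern is to nest the fallback image inside the object tag, so a browser that cannot render the SVG shows the nested content instead. The sketch below only builds such markup as a string. The helper name `pageMarkup` and the file names are assumptions made for this example and may not match what the converter actually emits.

```javascript
// Hypothetical helper: builds an <object> that points at an SVG page and
// nests an <img> as fallback content. Names and attributes are illustrative.
function pageMarkup(svgUrl, imageUrl, altText) {
  // Escape characters that would break out of an attribute value.
  const esc = (s) => String(s)
    .replace(/&/g, "&amp;")
    .replace(/"/g, "&quot;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
  return (
    '<object data="' + esc(svgUrl) + '" type="image/svg+xml">' +
    '<img src="' + esc(imageUrl) + '" alt="' + esc(altText) + '">' +
    "</object>"
  );
}
```

For example, `pageMarkup("1.svg", "1.png", "Page 1")` returns a single object element wrapping the fallback image.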
We see this as a huge improvement over our current modes, a significant advantage over other available conversion tools, and very deserving of being announced as part of our version 5 release, in line with version 5 of the Java PDF Library that we also produce.
As this is quite a major change, we would like to take the opportunity to request your feedback. Do you think we are wrong to drop the canvas? Please let us know!
If you are curious about how our output may look in the future, a preliminary example has been created to preview. Please do zoom in to the map in the bottom left!
Hi,
I think you are totally right. And SVG is the right path!
I heard good things about something called Flash. Did you look into that?
We thought we would rewrite it all in assembler with Flash support.
What method do you use to detect SVG support so you can fall back to an image?
Hi Dan, we will likely use document.implementation.hasFeature("http://www.w3.org/TR/SVG11/feature#Image", "1.1").
I have tested on a range of browsers/platforms including Mac/Windows/Linux/IE6-10/Android 2.2, 4.1/iPhone/iPad and it has been flawless every time.
Leon
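A minimal sketch of how that check might be wrapped is shown below. The helper name `supportsSvg` is an assumption made for this example, and the document object is passed in as a parameter so the function can also be exercised outside a browser. Note that the current DOM Standard defines `hasFeature()` to always return true, so the check is only informative in older browsers such as those listed above.

```javascript
// Feature string and version quoted in the reply above.
const SVG_IMAGE_FEATURE = "http://www.w3.org/TR/SVG11/feature#Image";

// Hypothetical helper: returns true only when the given document reports
// support for the SVG 1.1 Image feature; returns false if the API is missing.
function supportsSvg(doc) {
  const impl = doc && doc.implementation;
  if (!impl || typeof impl.hasFeature !== "function") {
    return false;
  }
  return impl.hasFeature(SVG_IMAGE_FEATURE, "1.1") === true;
}
```

In a page this would be called as `supportsSvg(document)`, with the image shown when it returns false.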
What features, if any, are not supported by SVG and would need a workaround, e.g. using rasterized images instead?
If there’s any significant amount of workarounds required, I’m thinking it might be a bad idea after all. If ‘none’, more power to ya.
I can’t think of any features that would need workarounds in SVG output.
Canvas, on the other hand, had a growing list, e.g. EvenOdd shape filling and dashed lines. EvenOdd filled shapes are especially problematic, as it’s very difficult to (quickly) determine whether filling with EvenOdd actually makes a difference and the shape needs to be converted to an image.
We test on several hundred PDF files, and in only about 5 of them does it matter whether you use EvenOdd instead of NonZero filling. However, the required fix actually made changes to the majority of our files.
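To illustrate why the two fill rules so rarely disagree, the sketch below tests a point against a set of closed polygons under both rules. It casts a ray from the point towards positive x, counting every edge crossing for EvenOdd and summing signed crossings for NonZero. The function name `fillRules` and the polygon representation (arrays of [x, y] points, implicitly closed) are assumptions made for this example and are not taken from the converter.

```javascript
// Hypothetical helper: reports whether point (px, py) is filled under each rule.
// `subpaths` is an array of polygons; each polygon is an array of [x, y] points.
function fillRules(subpaths, px, py) {
  let winding = 0;   // signed crossings, used by the NonZero rule
  let crossings = 0; // total crossings, used by the EvenOdd rule
  for (const points of subpaths) {
    for (let i = 0; i < points.length; i++) {
      const [x1, y1] = points[i];
      const [x2, y2] = points[(i + 1) % points.length]; // wrap to close the polygon
      const rising = y1 <= py && y2 > py;  // edge passes the ray with increasing y
      const falling = y1 > py && y2 <= py; // edge passes the ray with decreasing y
      if (!rising && !falling) continue;   // horizontal edge or no overlap in y
      // x coordinate where the edge meets the horizontal line y = py.
      const xHit = x1 + ((py - y1) * (x2 - x1)) / (y2 - y1);
      if (xHit > px) {
        crossings += 1;
        winding += rising ? 1 : -1;
      }
    }
  }
  return { nonzero: winding !== 0, evenodd: crossings % 2 === 1 };
}

const outer = [[0, 0], [10, 0], [10, 10], [0, 10]];
const inner = [[3, 3], [7, 3], [7, 7], [3, 7]];       // same direction as outer
const innerReversed = inner.slice().reverse();         // opposite direction
```

With `outer` and `inner` drawn in the same direction, the point (5, 5) is filled under NonZero but is a hole under EvenOdd. With `innerReversed`, both rules treat it as a hole, which is the kind of case where the choice of rule makes no visible difference.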
Switching to SVG has several advantages for this issue alone:
1. Cleaner output. No unnecessary images required to be loaded and drawn.
2. They are now shapes again and can scale nicely (unlike the rasters).
3. Cleaner conversion code as we don’t need additional code for EvenOdd filled shapes.
4. Faster conversion as we don’t need to use slow image classes.
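For comparison, SVG expresses both features directly as presentation attributes: `fill-rule="evenodd"` for the fill rule, and `stroke-dasharray` with `stroke-dashoffset` for dashed lines. The sketch below builds a path element string with those attributes. The helper name `svgPath` and its options object are assumptions made for this example, not the converter's actual code.

```javascript
// Hypothetical helper: emits an SVG <path> element as a string.
// options.evenOdd   - true to fill with the EvenOdd rule
// options.dash      - array of dash and gap lengths, e.g. [4, 2]
// options.dashOffset - where in the pattern the stroke starts (defaults to 0)
function svgPath(d, options) {
  const opts = options || {};
  const attrs = ['d="' + d + '"'];
  if (opts.evenOdd) {
    attrs.push('fill-rule="evenodd"');
  }
  if (opts.dash && opts.dash.length > 0) {
    attrs.push('stroke-dasharray="' + opts.dash.join(" ") + '"');
    attrs.push('stroke-dashoffset="' + (opts.dashOffset || 0) + '"');
  }
  return "<path " + attrs.join(" ") + "/>";
}
```

Calling `svgPath("M0 0H10V10H0Z M3 3H7V7H3Z", { evenOdd: true })` yields a path whose inner square is cut out as a hole, with no image required.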