Mark Stephens I have been working with Java and PDF since 1999 and am a big NetBeans fan. I enjoy speaking at conferences. I have an MA in Medieval History and a passion for reading.

PDF to HTML5 conversion – Newspaper page layout


I started IDRsolutions while working for the Times Newspaper group in the 1990s, so I know that the complex layout of newspaper pages raises a whole load of special issues. It also provides some really good case studies for honing our technology. Here is an example I would like to share.

In our PDF to HTML5 conversion process, there is a trade-off. We can position every glyph on its own, which gives accurate but very large HTML files. Or we can roll the text together into lines, losing a little accuracy but producing much smaller files. This grouping also matters because our JavaScript will attempt to auto-fit the text blocks into their correct spaces, and one long line will look much better than two blocks.
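To make the size difference concrete, here is a minimal Java sketch (Java 16+ for records). It is not our converter code: the `Glyph` record, the fixed-advance `layout` helper and the inline-style markup are all assumptions made purely for illustration, and HTML escaping is left out to keep it short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical sketch comparing per-glyph and per-line HTML output. */
final class GlyphVsLineDemo {

    /** A positioned glyph; the fields are illustrative, not a real API. */
    record Glyph(char ch, double x, double y) {}

    /** Lays text out with a fixed advance per character (a simplification). */
    static List<Glyph> layout(String text, double x0, double y, double advance) {
        List<Glyph> out = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            out.add(new Glyph(text.charAt(i), x0 + i * advance, y));
        }
        return out;
    }

    /** Builds one absolutely positioned span (no HTML escaping in this sketch). */
    static String span(double x, double y, CharSequence content) {
        return String.format(Locale.ROOT,
                "<span style=\"position:absolute;left:%.1fpx;top:%.1fpx\">%s</span>",
                x, y, content);
    }

    /** One span per glyph: exact placement, but a lot of markup. */
    static String perGlyphHtml(List<Glyph> glyphs) {
        StringBuilder sb = new StringBuilder();
        for (Glyph g : glyphs) {
            sb.append(span(g.x(), g.y(), String.valueOf(g.ch())));
        }
        return sb.toString();
    }

    /** One span for the whole line, anchored at the first glyph. */
    static String perLineHtml(List<Glyph> glyphs) {
        if (glyphs.isEmpty()) {
            return "";
        }
        StringBuilder text = new StringBuilder();
        for (Glyph g : glyphs) {
            text.append(g.ch());
        }
        Glyph first = glyphs.get(0);
        return span(first.x(), first.y(), text);
    }

    public static void main(String[] args) {
        List<Glyph> line = layout("complex page layout", 10.0, 20.0, 6.0);
        System.out.println("per glyph: " + perGlyphHtml(line).length() + " chars");
        System.out.println("per line:  " + perLineHtml(line).length() + " chars");
    }
}
```

The per-line version carries the positioning markup once rather than once per character, which is where the saving comes from. The cost is that the browser, not the PDF, now decides where each glyph inside the line lands.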

Here is an example with one line highlighted. You will notice there are big spaces between the words on the highlighted line. It comes from a live newspaper page (reproduced with permission) to show the issue.

[Image: newspaper page from the PDF, with one line highlighted]

If we split out the individual words, we get this, which does not look too good.

[Image: first HTML5 attempt]

So let us be fussier about which breaks we allow and try to keep the text as a single block.
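One way to picture "being fussy about breaks" is a gap threshold: only start a new block when the space between two words is wider than some limit. The Java sketch below (Java 16+) is an illustration under that assumption, not the rule our converter actually uses; the `Word` record, the `maxGap` parameter and the coordinates are all made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: merging words into blocks using a gap threshold. */
final class LineBreakDemo {

    /** A word with its left edge and width; the fields are illustrative only. */
    record Word(String text, double x, double width) {}

    /**
     * Walks the words left to right and starts a new block only where the
     * gap to the previous word is wider than maxGap.
     */
    static List<String> group(List<Word> words, double maxGap) {
        List<String> blocks = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        double prevEnd = 0;
        for (Word w : words) {
            if (current.length() > 0 && w.x() - prevEnd > maxGap) {
                blocks.add(current.toString());
                current.setLength(0);
            }
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(w.text());
            prevEnd = w.x() + w.width();
        }
        if (current.length() > 0) {
            blocks.add(current.toString());
        }
        return blocks;
    }

    public static void main(String[] args) {
        // A justified line: each gap between words is 12 units wide.
        List<Word> line = List.of(
                new Word("complex", 0, 40),
                new Word("page", 52, 24),
                new Word("layout", 88, 34));
        System.out.println(group(line, 5));   // prints [complex, page, layout]
        System.out.println(group(line, 20));  // prints [complex page layout]
    }
}
```

With the tight limit, the wide justified gaps are treated as breaks and every word becomes its own block. Raising the limit keeps the line together, at the risk of merging text that really belongs to separate columns, which is why the value needs tuning against real pages.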

 

This second attempt needs some more work and tuning, but it is definitely a step in the right direction. What do you think?

IDRsolutions develop a Java PDF Viewer and SDK, an Adobe forms to HTML5 forms converter, a PDF to HTML5 converter and a Java ImageIO replacement. On the blog our team post anything interesting they learn about.



IDRsolutions Ltd 2019. All rights reserved.