Mark Stephens Mark has been working with Java and PDF since 1999 and is a big NetBeans fan. He enjoys speaking at conferences. He has an MA in Medieval History and a passion for reading.

PDF to HTML5 conversion – Newspaper page layout


I started IDRsolutions while working for the Times Newspaper group in the 1990s, so I know that the complex page layout of newspaper pages tends to raise a whole load of special issues. It also provides some really good case studies for honing our technology. Here is an example I would like to share.

In our PDF to HTML5 conversion process, there is a trade-off. We can position every glyph on its own, in which case we get accurate but very large HTML files. Or we can roll the text together into lines, losing a little accuracy but producing much smaller files. This grouping is also important because our JavaScript will attempt to auto-fit the text blocks into their correct spaces, and one long line will look much better than two blocks.
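To make the trade-off concrete, here is a minimal Java sketch. It is not the converter's actual code: the `Glyph` class, the method names, the sample coordinates and the `spaceGap` threshold are all assumptions made up for illustration. It emits the same short run of text twice, once as one positioned span per glyph and once as a single span for the whole line.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Illustrative sketch only: every name here is hypothetical, not a real converter API.
class GlyphLayoutSketch {

    // One positioned glyph; x and y are its left edge and top in CSS pixels.
    static final class Glyph {
        final char ch;
        final double x, y, width;

        Glyph(char ch, double x, double y, double width) {
            this.ch = ch;
            this.x = x;
            this.y = y;
            this.width = width;
        }
    }

    // Builds one absolutely positioned span (HTML escaping omitted for brevity).
    private static String span(double x, double y, String text) {
        return String.format(Locale.ROOT,
                "<span style=\"position:absolute;left:%.1fpx;top:%.1fpx\">%s</span>",
                x, y, text);
    }

    // Option 1: every glyph gets its own span. Accurate, but the markup is large.
    static String perGlyphHtml(List<Glyph> glyphs) {
        StringBuilder html = new StringBuilder();
        for (Glyph g : glyphs) {
            html.append(span(g.x, g.y, String.valueOf(g.ch)));
        }
        return html.toString();
    }

    // Option 2: the glyphs are rolled into one span, positioned at the first glyph.
    // Assumes the glyphs share a baseline and are already sorted left to right.
    static String perLineHtml(List<Glyph> glyphs, double spaceGap) {
        if (glyphs.isEmpty()) {
            return "";
        }
        StringBuilder text = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            // A gap wider than spaceGap is treated as a word space.
            if (prev != null && g.x - (prev.x + prev.width) > spaceGap) {
                text.append(' ');
            }
            text.append(g.ch);
            prev = g;
        }
        Glyph first = glyphs.get(0);
        return span(first.x, first.y, text.toString());
    }

    // Made-up sample data: the words "Hi" and "yo" separated by an 8px gap.
    static List<Glyph> sample() {
        return Arrays.asList(
                new Glyph('H', 0, 10, 8),
                new Glyph('i', 8, 10, 4),
                new Glyph('y', 20, 10, 8),
                new Glyph('o', 28, 10, 8));
    }

    public static void main(String[] args) {
        List<Glyph> line = sample();
        System.out.println("per glyph: " + perGlyphHtml(line).length() + " characters");
        System.out.println("per line:  " + perLineHtml(line, 3.0).length() + " characters");
    }
}
```

Even for four glyphs the per-glyph markup is several times longer than the per-line markup. The accuracy lost in the second form is the exact position of each glyph after the first, which is now left to the browser's font metrics.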

Here is an example with one line highlighted, taken from a live newspaper page (reproduced with permission) to show the issue. You will notice there are big spaces between the words on the highlighted line.

paper from pdf

If we split out the individual words, we get this, which does not look too good.

first HTML5 attempt

So let us be fussier about which breaks we allow and try to keep the text as a single block.
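One way to be fussier can be sketched in Java as follows. This is only an assumption about what such a rule might look like, not the rule the converter actually uses: the class name, both methods and the sample gap widths are hypothetical. A fixed gap threshold splits a widely spaced line at every word, whereas comparing each gap with the typical gap on the same line keeps evenly stretched text together and still splits at a genuinely unusual gap such as a column gutter.

```java
import java.util.Arrays;

// Illustrative sketch only: hypothetical names and made-up numbers.
class LineBreakSketch {

    // Naive rule: start a new block at every gap wider than a fixed threshold.
    static int countBlocksFixed(double[] gaps, double maxGap) {
        int blocks = 1;
        for (double gap : gaps) {
            if (gap > maxGap) {
                blocks++;
            }
        }
        return blocks;
    }

    // Fussier rule: start a new block only where a gap is much wider than
    // the line's own typical gap (the upper median is used here).
    static int countBlocksRelative(double[] gaps, double factor) {
        if (gaps.length == 0) {
            return 1;
        }
        double[] sorted = gaps.clone();
        Arrays.sort(sorted);
        double median = sorted[sorted.length / 2];
        int blocks = 1;
        for (double gap : gaps) {
            if (gap > factor * median) {
                blocks++;
            }
        }
        return blocks;
    }

    public static void main(String[] args) {
        // Gaps between words, in pixels, on a widely spaced line.
        double[] stretched = {14, 15, 13, 16};
        // Normal word gaps with one column gutter in the middle.
        double[] gutter = {4, 5, 40, 4};

        System.out.println(countBlocksFixed(stretched, 6.0));    // prints 5
        System.out.println(countBlocksRelative(stretched, 2.5)); // prints 1
        System.out.println(countBlocksRelative(gutter, 2.5));    // prints 2
    }
}
```

The factor of 2.5 is arbitrary and would need tuning against real pages.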

 

It needs some more work and tuning, but it is definitely a step in the right direction. What do you think?





IDRsolutions Ltd 2022. All rights reserved.