I started IDRsolutions while working for the Times Newspaper group in the 1990s, so I know that the complex page layout on newspaper pages tends to raise a whole load of special issues. It also provides some really good case studies to hone our technology. Here is an example I would like to share.
In our PDF to HTML5 conversion process, there is a trade-off. We can position every glyph on its own, in which case we get accurate but very large HTML files. Or we can roll the text together into lines, losing a little accuracy but producing much smaller files. This grouping is also important because our JavaScript will attempt to auto-fit the text blocks into their correct spaces, and one long line will look much better than two blocks.
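To give a feel for the size side of that trade-off, here is a minimal sketch in Python. It is not our converter code: the `Glyph` class, the `layout` helper, the fixed 6px advance and the inline CSS are all assumptions made up for illustration. It simply emits the same word once as one positioned span per glyph and once as a single span for the whole run.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class Glyph:
    """Hypothetical glyph record: a character and its page position in pixels."""
    char: str
    x: float
    y: float


def layout(text, x=10.0, y=20.0, advance=6.0):
    """Place characters on one line with a fixed advance (illustrative only)."""
    return [Glyph(c, x + i * advance, y) for i, c in enumerate(text)]


def per_glyph_html(glyphs):
    """One absolutely positioned span per glyph: exact, but verbose."""
    return "".join(
        f'<span style="position:absolute;left:{g.x:.1f}px;top:{g.y:.1f}px">'
        f"{escape(g.char)}</span>"
        for g in glyphs
    )


def per_line_html(glyphs):
    """One span for the whole run, positioned at the first glyph."""
    if not glyphs:
        return ""
    first = glyphs[0]
    text = escape("".join(g.char for g in glyphs))
    return (
        f'<span style="position:absolute;left:{first.x:.1f}px;'
        f'top:{first.y:.1f}px">{text}</span>'
    )


if __name__ == "__main__":
    glyphs = layout("Newspaper")
    print(len(per_glyph_html(glyphs)), "characters of HTML, one span per glyph")
    print(len(per_line_html(glyphs)), "characters of HTML, one span per line")
```

In the grouped version the browser decides where each glyph lands inside the span, which is where the small loss of accuracy comes from.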
Here is an example with one line highlighted. You will notice there are big spaces between the words on the highlighted line. It comes from a live newspaper page (reproduced with permission) to show the issue.
If we split out the individual words, we get this, which does not look too good.
So let us be fussier about which breaks we allow and try to keep the text as a single block.
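One way to picture "being fussy about breaks" is a gap threshold: only start a new block when the horizontal gap between two words is clearly too large to be a stretched word space. The Python sketch below assumes exactly that and nothing more. The `Word` class, the `group_words` function and the `max_gap_ratio` parameter are hypothetical names for illustration, and the real rules need more tuning than a single ratio.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical word record: text plus left and right edges in pixels."""
    text: str
    x_start: float
    x_end: float


def group_words(words, font_size, max_gap_ratio):
    """Merge words on one line into blocks.

    A new block starts only when the gap to the previous word exceeds
    max_gap_ratio * font_size. Both the rule and the parameter are
    assumptions for this sketch.
    """
    blocks = []
    for word in sorted(words, key=lambda w: w.x_start):
        if blocks and word.x_start - blocks[-1][-1].x_end <= max_gap_ratio * font_size:
            blocks[-1].append(word)   # gap is small enough: same block
        else:
            blocks.append([word])     # first word, or gap too wide: new block
    return [" ".join(w.text for w in block) for block in blocks]


if __name__ == "__main__":
    # A justified line: every word gap is 12px at a 10px font size.
    line = [
        Word("big", 0, 18),
        Word("spaces", 30, 66),
        Word("between", 78, 120),
        Word("words", 132, 162),
    ]
    print(group_words(line, font_size=10, max_gap_ratio=0.5))  # strict threshold
    print(group_words(line, font_size=10, max_gap_ratio=1.5))  # tolerant threshold
```

With the strict ratio every stretched space counts as a break and the line falls apart into separate words. With the more tolerant ratio the line stays together, while a much wider gap, such as a column gutter, would still start a new block.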
It needs some more work and tuning, but it is definitely a step in the right direction. What do you think?