Mark Stephens
Mark Stephens has been working with Java and PDF since 1999 and has diversified into HTML5, SVG and JavaFX. He also enjoys speaking at conferences and has been a Speaker at user groups, Business of Software, Seybold and JavaOne conferences. He has a very dry sense of humor and an MA in Medieval History for which he has not yet found a practical use.

PDF to HTML5 conversion – Tradeoff of precision versus filesize


When I was at school we used to have endless arguments in our Maths lessons about whether 3.000000000 was a more correct answer than 3 on its own. The answer is that it depends on the scenario.

We have a similar issue in PDF to HTML conversion, where we have to decide how precise the output should be. Internally we can work to about 6-8 decimal places with a reasonable degree of confidence, but what should we put into the HTML output? Consider this example of two versions of HTML generated from the same PDF file. The first version is arguably less accurate, but because an HTML file is a text file, every extra digit is an extra character, so the second version produces a larger file. On some of our large sample PDF files this makes a big difference.

#t105 {
position:absolute;
left:213px;
top:506px;
FONT-SIZE: 12px;
FONT-FAMILY: 'Times New Roman', Times, serif;
color:rgb(0,0,0);
}

pdf_context.moveTo(90,703);
pdf_context.lineTo(234,703);
pdf_context.lineTo(234,702);
pdf_context.lineTo(90,702);
pdf_context.lineTo(90,703);

#t105 {
position:absolute;
left:213.07166px;
top:506.77997px;
FONT-SIZE: 12px;
FONT-FAMILY: 'Times New Roman', Times, serif;
color:rgb(0,0,0);
}

pdf_context.moveTo(90.0,703.199996948);
pdf_context.lineTo(234.0,703.199996948);
pdf_context.lineTo(234.0,702.599998474);
pdf_context.lineTo(90.0,702.599998474);
pdf_context.lineTo(90.0,703.199996948);
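To get a feel for the numbers, here is a minimal Java sketch that limits a coordinate to a maximum number of decimal places and shows how long the resulting line is. It is not the converter's own code: the class `CoordinatePrecision` and its `format` and `lineTo` helpers are hypothetical names made up for illustration. It truncates rather than rounds only because that reproduces the values in the first version above (506.77997 becomes 506), and it also drops trailing zeros, so 234.0 is written as 234.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

class CoordinatePrecision {

    // Hypothetical helper: cut a coordinate down to at most maxDecimalPlaces
    // digits after the point (truncating) and drop any trailing zeros.
    static String format(double value, int maxDecimalPlaces) {
        BigDecimal limited = BigDecimal.valueOf(value)
                .setScale(maxDecimalPlaces, RoundingMode.DOWN);
        return limited.stripTrailingZeros().toPlainString();
    }

    // Hypothetical helper: build one canvas drawing command as text.
    static String lineTo(double x, double y, int maxDecimalPlaces) {
        return "pdf_context.lineTo(" + format(x, maxDecimalPlaces) + ","
                + format(y, maxDecimalPlaces) + ");";
    }

    public static void main(String[] args) {
        double x = 234.0;
        double y = 702.599998474;
        // Print the same command at several precision settings with its length.
        for (int places : new int[] {0, 2, 9}) {
            String line = lineTo(x, y, places);
            System.out.println(places + " places, " + line.length() + " chars: " + line);
        }
    }
}
```

A few characters per coordinate does not sound like much, but a page with thousands of drawing commands repeats that saving on every line.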

So which is better? The answer depends on the PDF and the tradeoffs that the user is prepared to make. In cases like this, we set a default and allow the user to choose. The latest release of our PDF to HTML conversion software adds a new line to the example code (the call to setMaxNumberOfDecimalPlaces) so you can set it as you wish:

DynamicVectorRenderer HTMLoutput=new HTMLDisplay(page, cropBox ,false,100, new ObjectStore());
HTMLoutput.setMaxNumberOfDecimalPlaces(0); //let the user select the max number of decimal places
HTMLoutput.setOutputDir(output_dir,outputName); //root for output
FormFactory HTMLFormFactory=new HTMLFormFactory(HTMLoutput, decode_pdf.getPdfPageData().getMediaBoxHeight(page));
HTMLFormFactory.setDecoder(decode_pdf);
decode_pdf.addExternalHandler(HTMLoutput, Options.CustomOutput); //custom object to draw PDF
decode_pdf.addExternalHandler(HTMLFormFactory, Options.FormFactory); //custom object to draw Forms
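When picking a value, it may help to remember that not every digit carries information. A value such as 703.199996948 looks like the by-product of single-precision arithmetic rather than a deliberately chosen coordinate. That is a guess about where the digits come from, not something we have checked for this file. The sketch below shows the effect with made-up inputs: a page height of 792 and a float offset of 88.8. The class `FloatNoise` and the `flipY` helper are hypothetical names, not part of our API.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

class FloatNoise {

    // Hypothetical helper: convert a y value measured from the top of the page
    // into one measured from the bottom. The float argument is widened to a
    // double, which exposes the float's rounding error as extra digits.
    static double flipY(double pageHeight, float yFromTop) {
        return pageHeight - yFromTop;
    }

    public static void main(String[] args) {
        // 792 and 88.8 are made-up inputs chosen for illustration.
        double y = flipY(792.0, 88.8f);

        // Prints a value that starts with 703.199996948 instead of 703.2.
        System.out.println(y);

        // Limiting the output to one decimal place recovers 703.2.
        System.out.println(BigDecimal.valueOf(y).setScale(1, RoundingMode.HALF_UP));
    }
}
```

If the trailing digits in a file are noise of this kind, dropping them costs nothing in accuracy, which is an argument for a small maximum rather than the full internal precision.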

What do you think is the best tradeoff?


IDRsolutions develop a Java PDF Viewer and SDK, an Adobe forms to HTML5 forms converter, a PDF to HTML5 converter and a Java ImageIO replacement. On the blog our team post anything interesting they learn about.
