PDF to HTML5 conversion – Tradeoff of precision versus filesize

When I was at school we used to have endless arguments in our Maths lessons about whether 3.000000000 was a more correct answer than 3 on its own. The answer is that it depends on the scenario.

We have a similar issue in PDF to HTML conversion, where we have to decide how precise the output should be. Internally we can work to about 6-8 decimal places with a reasonable degree of confidence, but what should we put into the HTML output? Consider this example of two versions of the HTML generated from the same PDF file. The first version is arguably less accurate, but because an HTML file is a text file, every extra digit is an extra character, so the second version produces a larger file. We have some large sample PDF files where it can make a big difference.

#t105 {
position:absolute;
left:213px;
top:506px;
FONT-SIZE: 12px;
FONT-FAMILY: 'Times New Roman', Times, serif;
color:rgb(0,0,0);
}

pdf_context.moveTo(90,703);
pdf_context.lineTo(234,703);
pdf_context.lineTo(234,702);
pdf_context.lineTo(90,702);
pdf_context.lineTo(90,703);

#t105 {
position:absolute;
left:213.07166px;
top:506.77997px;
FONT-SIZE: 12px;
FONT-FAMILY: 'Times New Roman', Times, serif;
color:rgb(0,0,0);
}

pdf_context.moveTo(90.0,703.199996948);
pdf_context.lineTo(234.0,703.199996948);
pdf_context.lineTo(234.0,702.599998474);
pdf_context.lineTo(90.0,702.599998474);
pdf_context.lineTo(90.0,703.199996948);
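
To make the size difference concrete, here is a minimal Java sketch of how a coordinate could be cut to a maximum number of decimal places before it is written out. It is not the converter's actual code: the class `CoordinateFormatter` and its methods `format` and `lineTo` are hypothetical names invented for this illustration. The whole-number values in the first example look truncated rather than rounded to nearest (702.599998474 appears as 702), so the sketch uses `RoundingMode.DOWN`; that is an inference from the listing above, not a confirmed detail of the software.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

// Hypothetical helper for illustration only; not part of the conversion library.
public class CoordinateFormatter {

    /** Cuts value to at most maxDecimalPlaces digits and drops trailing zeros. */
    public static String format(double value, int maxDecimalPlaces) {
        // Assumption: truncate (RoundingMode.DOWN), which matches the sample output above.
        BigDecimal cut = BigDecimal.valueOf(value)
                .setScale(maxDecimalPlaces, RoundingMode.DOWN)
                .stripTrailingZeros();
        // toPlainString avoids scientific notation such as 9E+1 for 90.
        return cut.toPlainString();
    }

    /** Builds one canvas drawing command using the chosen precision. */
    public static String lineTo(double x, double y, int maxDecimalPlaces) {
        return "pdf_context.lineTo(" + format(x, maxDecimalPlaces) + ","
                + format(y, maxDecimalPlaces) + ");";
    }

    public static void main(String[] args) {
        String compact = lineTo(234.0, 702.599998474, 0);
        String precise = lineTo(234.0, 702.599998474, 9);
        // Each character is one byte in an ASCII text file, so length is file size.
        System.out.println(compact + " -> " + compact.length() + " characters");
        System.out.println(precise + " -> " + precise.length() + " characters");
    }
}
```

The saving per command is only a few characters, but it repeats for every coordinate of every shape and text position on every page, which is why it adds up on large files.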

So which is better? The answer depends on the PDF and the tradeoffs that the user is prepared to make. In cases like this, we set a default and allow the user to choose. The latest release of our PDF to HTML conversion software adds a new line to the example code (the call to setMaxNumberOfDecimalPlaces) so you can set it as you wish:

DynamicVectorRenderer HTMLoutput = new HTMLDisplay(page, cropBox, false, 100, new ObjectStore());
HTMLoutput.setMaxNumberOfDecimalPlaces(0); //let user select max number of decimal places
HTMLoutput.setOutputDir(output_dir,outputName); //root for output
FormFactory HTMLFormFactory=new HTMLFormFactory(HTMLoutput, decode_pdf.getPdfPageData().getMediaBoxHeight(page));
HTMLFormFactory.setDecoder(decode_pdf);
decode_pdf.addExternalHandler(HTMLoutput, Options.CustomOutput); //custom object to draw PDF
decode_pdf.addExternalHandler(HTMLFormFactory, Options.FormFactory); //custom object to draw Forms

What do you think is the best tradeoff?

Click here to see all the articles in the PDF to HTML5 conversion series.

Mark Stephens

System Architect and Lead Developer at IDRSolutions
Mark Stephens has been working with Java and PDF since 1999 and has diversified into HTML5, SVG and JavaFX. He also enjoys speaking at conferences and has been a Speaker at user groups, Business of Software, Seybold and JavaOne conferences. He has a very dry sense of humor and an MA in Medieval History for which he has not yet found a practical use.