
How to convert Microsoft Office documents to PDF, HTML5 or SVG


As this is a question we get asked a lot at IDRsolutions, I decided to write a blog article on the topic, which may well develop into a series…

Microsoft Office files are an industry standard, and lots of people want to convert them into PDF, HTML5 or SVG. One option is to use Microsoft Office itself, but there is an alternative that is cross-platform and free: LibreOffice. It is a fork of the open source office suite OpenOffice and has excellent support for Word, PowerPoint and other Office file formats. The two are very similar, with slightly different strengths and weaknesses (and both are free, so try both yourself and choose).

LibreOffice has two very useful features. Firstly, it is cross-platform, so it will run on Linux and OS X boxes and not just Windows. Secondly, it does not need a user to run it: in headless mode it shows no user interface, so the software can be called from your own programs and scripts. This is really easy to do. So

libreoffice --headless --convert-to pdf myFile.docx

will turn the Word file myFile.docx into a PDF file. We get to see a lot of PDF files and the PDF files created by LibreOffice are generally very good.
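The command name is not the same on every platform. Linux packages usually provide a `libreoffice` launcher, while the underlying executable is called `soffice`. The sketch below picks a likely command for the current operating system. The class name `LibreOfficeLocator` is made up for this example, and the install paths are assumptions based on default installations, so check them against your own machines.

```java
import java.util.Locale;

class LibreOfficeLocator {

    // Default install locations; these are assumptions and may differ on your system
    static final String MAC_PATH = "/Applications/LibreOffice.app/Contents/MacOS/soffice";
    static final String WINDOWS_PATH = "C:\\Program Files\\LibreOffice\\program\\soffice.exe";

    /** Returns a likely LibreOffice command for the given os.name value. */
    static String commandFor(String osName) {
        String os = osName.toLowerCase(Locale.ROOT);
        if (os.contains("mac")) {
            return MAC_PATH;
        }
        if (os.contains("win")) {
            return WINDOWS_PATH;
        }
        // On Linux and other Unix systems, rely on the launcher being on the PATH
        return "libreoffice";
    }

    public static void main(String[] args) {
        System.out.println(commandFor(System.getProperty("os.name")));
    }
}
```

If LibreOffice is installed somewhere else, it is probably simpler to read the path from a configuration setting than to extend the guesses above.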

LibreOffice has several APIs (including one for Java), or you can just call it as an external process with Java code like this.

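// fileName and userInputDirPath are defined elsewhere in your program.
// Passing the arguments separately (rather than building a shell command string)
// means spaces or special characters in fileName cannot break the command.
ProcessBuilder builder = new ProcessBuilder(
        "libreoffice", "--headless", "--convert-to", "pdf", fileName);

// Run in the input directory; the PDF is written to the working directory
builder.directory(new java.io.File(userInputDirPath));

// Pass LibreOffice's output through so the process cannot block on a full pipe
builder.inheritIO();

try {
    Process process = builder.start();
    int exitCode = process.waitFor();
    if (exitCode != 0) {
        System.err.println("LibreOffice exited with code " + exitCode);
    }
} catch (java.io.IOException ex) {
    // The libreoffice executable could not be started
    ex.printStackTrace();
} catch (InterruptedException ex) {
    // Restore the interrupt flag so callers can see we were interrupted
    Thread.currentThread().interrupt();
}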
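On a server you may not want to wait forever if a conversion ever stalls. The sketch below is one possible way to add a timeout using the standard `ProcessBuilder` and `Process.waitFor(long, TimeUnit)` calls. The class name `ConversionRunner`, the method `run` and the 60 second limit are made up for this example, and it needs Java 9 or later.

```java
import java.io.File;
import java.io.IOException;
import java.util.List;
import java.util.concurrent.TimeUnit;

class ConversionRunner {

    /** Runs the command in workDir; returns true only if it exits with code 0 in time. */
    static boolean run(List<String> command, File workDir, long timeoutSeconds) {
        ProcessBuilder builder = new ProcessBuilder(command);
        builder.directory(workDir);
        // Discard the output so the child process cannot block on a full pipe
        builder.redirectErrorStream(true);
        builder.redirectOutput(ProcessBuilder.Redirect.DISCARD);
        Process process = null;
        try {
            process = builder.start();
            if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                return false; // timed out; the finally block kills the process
            }
            return process.exitValue() == 0;
        } catch (IOException ex) {
            return false; // the executable could not be started
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            if (process != null) {
                process.destroyForcibly(); // harmless if it has already exited
            }
        }
    }

    public static void main(String[] args) {
        List<String> command = List.of(
                "libreoffice", "--headless", "--convert-to", "pdf", "myFile.docx");
        System.out.println(run(command, new File("."), 60) ? "converted" : "failed");
    }
}
```

Returning a simple boolean keeps the example short; in a real application you would probably want to report why the conversion failed.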

The --convert-to parameter takes the output file type as its value, and accepts any output format LibreOffice supports (e.g. txt for Office to text, html for Office to HTML). There are lots of additional features which we may document in later articles…
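To show how the output type fits into the command, here is a small sketch that builds the argument list for any target format. It also uses LibreOffice's `--outdir` option to choose where the result is written. The class `ConvertCommand` and its methods are made up for this example, and `outputName` assumes the usual behaviour where the output keeps the input's base name and takes the format as its extension.

```java
import java.io.File;
import java.util.List;

class ConvertCommand {

    /** Builds the argument list for converting inputFile to the given format. */
    static List<String> build(String inputFile, String format, String outDir) {
        return List.of("libreoffice", "--headless",
                "--convert-to", format,
                "--outdir", outDir,
                inputFile);
    }

    /** Expected output name: the input's base name with the format as its extension. */
    static String outputName(String inputFile, String format) {
        String name = new File(inputFile).getName();
        int dot = name.lastIndexOf('.');
        String base = dot > 0 ? name.substring(0, dot) : name;
        return base + "." + format;
    }

    public static void main(String[] args) {
        // prints: libreoffice --headless --convert-to txt --outdir out myFile.docx
        System.out.println(String.join(" ", build("myFile.docx", "txt", "out")));
        // prints: myFile.txt
        System.out.println(outputName("myFile.docx", "txt"));
    }
}
```

The list returned by `build` can be passed straight to `new ProcessBuilder(...)`, and `outputName` gives you a file to look for once the process has finished.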

The HTML output produced directly by LibreOffice is quite simple, so instead we have been feeding the PDF files created by LibreOffice into our PDF to HTML5 converter, and have been testing this for several months now. We (and our test customers) have been very pleased with the results, and we know of lots of companies using LibreOffice internally for file conversion.

So we have added LibreOffice to our free online converter, which now allows people to convert not just PDF files but also Word documents, Excel documents and PowerPoint files to HTML5.

We recommend this additional functionality to our commercial clients who want to process a wider range of documents with our PDF to HTML5 converter.

We are very impressed with the possibilities of LibreOffice as part of a two-stage conversion process to turn Office documents into HTML5 via PDF. I am less enthusiastic about direct Office to HTML conversion. I hope that if you are doing anything with Office documents on a server or desktop, you will have a look and experiment with it as part of your solution.