As this is a question we get asked a lot at IDRsolutions, I decided to write a blog article on the topic, which may well develop into a series…
Microsoft Office files are an industry standard and lots of people want to convert them into PDF, HTML5 or SVG. One option is to use Microsoft Office itself, but there is an alternative that is cross-platform and free: LibreOffice. It is a fork of the open-source office suite OpenOffice and has excellent support for Word, PowerPoint and other Office file formats. The two are very similar, with slightly different strengths and weaknesses (and both are free, so try both yourself and choose).
LibreOffice has two very useful features. Firstly, it is cross-platform, so it will run on Linux and OS X boxes and not just Windows. Secondly, it does not need a user to run it: the software can run headless (with no user interface) and be called from the command line or from your own programs. This is really easy to do. So
libreoffice --headless --convert-to pdf myFile.docx
will turn the Word file myFile.docx into a PDF file. We get to see a lot of PDF files and the PDF files created by LibreOffice are generally very good.
LibreOffice has several APIs (including Java) or you can just call it as an external process with this code in Java.
// Get an instance of shell
Process pqShell = Runtime.getRuntime().exec("sh");
String shellCommand = "libreoffice --headless --convert-to pdf " + fileName;
try {
    java.io.DataOutputStream dos = new java.io.DataOutputStream(pqShell.getOutputStream());
    dos.writeBytes("cd " + userInputDirPath + "\n");
    dos.writeBytes(shellCommand + "\n");
    dos.writeBytes("exit\n");
    dos.flush();
    dos.close();
    pqShell.waitFor();
} catch (Exception ex) {
    ex.printStackTrace();
} finally {
    pqShell.destroy();
}
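A variation on the code above, offered as a rough sketch and not as the way we run it ourselves: ProcessBuilder can start LibreOffice directly, without going through sh. Each argument is passed as-is, so file names containing spaces need no quoting, and a timeout stops a stuck conversion from hanging the caller. The class name OfficeToPdf, its method names and the timeout parameter are made up for this example, and it assumes the libreoffice binary is on the PATH (on some installs it is called soffice) and Java 9 or later.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.TimeUnit;

/** Sketch: run LibreOffice headless via ProcessBuilder instead of piping commands into sh. */
final class OfficeToPdf {

    /** Builds the argument list; each element is passed as-is, so spaces in paths need no quoting. */
    static List<String> buildCommand(Path input, Path outputDir) {
        return List.of(
                "libreoffice",                      // assumption: binary is on the PATH
                "--headless",
                "--convert-to", "pdf",
                "--outdir", outputDir.toString(),   // where the PDF is written
                input.toString());
    }

    /** Returns true if LibreOffice exited with code 0 within the timeout. */
    static boolean convert(Path input, Path outputDir, long timeoutSeconds)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(input, outputDir))
                .redirectErrorStream(true)                        // merge stderr into stdout
                .redirectOutput(ProcessBuilder.Redirect.DISCARD)  // never block on a full pipe
                .start();
        if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            process.destroyForcibly();                            // give up on a stuck conversion
            return false;
        }
        return process.exitValue() == 0;
    }
}
```

A call might look like OfficeToPdf.convert(Paths.get("myFile.docx"), Paths.get("out"), 120). It is probably worth also checking that the PDF actually exists afterwards, as the exit code alone may not tell the whole story.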
The --convert-to parameter takes the target file type as its value (e.g. txt for Office to text, html for Office to HTML), and LibreOffice supports many output formats. There are lots of additional features which we may document in later articles…
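To make the role of the target type concrete, here is a small sketch that builds the command for any target extension and works out where the result should appear. LibreOffice keeps the input's base name and swaps the extension, writing into the directory given by --outdir. The class ConvertTarget and its method names are invented for illustration, and the binary name libreoffice is an assumption about your install.

```java
import java.nio.file.Path;
import java.util.List;

/** Sketch: vary the --convert-to target and predict where the result lands. */
final class ConvertTarget {

    /** Command for a target extension such as "pdf", "txt" or "html". */
    static List<String> command(String targetExtension, Path input, Path outputDir) {
        return List.of(
                "libreoffice", "--headless",        // assumption: binary is on the PATH
                "--convert-to", targetExtension,
                "--outdir", outputDir.toString(),
                input.toString());
    }

    /** Same base name, new extension, inside outputDir (report.docx becomes report.pdf). */
    static Path expectedOutput(Path input, Path outputDir, String targetExtension) {
        String name = input.getFileName().toString();
        int dot = name.lastIndexOf('.');
        String base = dot > 0 ? name.substring(0, dot) : name;  // keep dot-files intact
        return outputDir.resolve(base + "." + targetExtension);
    }
}
```

Checking that the file returned by expectedOutput exists after the process finishes is a simple way to confirm the conversion worked.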
The HTML output is quite simple, so we have been linking the PDF files created via LibreOffice to our PDF to HTML5 converter and testing for several months now. We (and our test customers) have been very pleased with the results and we know of lots of companies using LibreOffice internally for file conversion.
So we have added LibreOffice to our free online converter, which now lets people convert not just PDF files but also Word documents, Excel documents and PowerPoint presentations to HTML5.
We recommend this additional functionality to our commercial clients who want to process a wider range of documents with our PDF to HTML5 converter.
We are very impressed with the possibilities of LibreOffice as part of a two-stage conversion process that turns Office documents into HTML5 via PDF. We are less enthusiastic about direct Office to HTML conversion. I hope that if you are doing anything with Office documents on a server or desktop, you will have a look and experiment with it as part of your solution.
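The two-stage idea can be sketched in a few lines of Java. This is only an outline of the shape of such a pipeline, not our actual implementation: the class TwoStagePipeline is invented for the example, and both stages are plain functions you would supply yourself (the first wrapping the headless LibreOffice call, the second wrapping whichever PDF to HTML5 converter you use).

```java
import java.nio.file.Path;
import java.util.Locale;
import java.util.function.UnaryOperator;

/** Sketch of a two-stage Office -> PDF -> HTML5 pipeline; both stages are placeholders. */
final class TwoStagePipeline {

    private final UnaryOperator<Path> officeToPdf;  // e.g. wraps the headless LibreOffice call
    private final UnaryOperator<Path> pdfToHtml;    // e.g. wraps a PDF to HTML5 converter

    TwoStagePipeline(UnaryOperator<Path> officeToPdf, UnaryOperator<Path> pdfToHtml) {
        this.officeToPdf = officeToPdf;
        this.pdfToHtml = pdfToHtml;
    }

    /** PDF input skips stage one; anything else is converted to PDF first. */
    Path convert(Path input) {
        String name = input.getFileName().toString().toLowerCase(Locale.ROOT);
        Path pdf = name.endsWith(".pdf") ? input : officeToPdf.apply(input);
        return pdfToHtml.apply(pdf);
    }
}
```

Keeping the stages separate means the PDF to HTML5 step sees the same kind of input whether the original was a PDF or an Office document.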
Behind the scenes, Office Documents are printed to PDF before making them viewable. For Excel files, this will often lead to weird pagination. Unfortunately, there is no setting to dynamically scale the PDF output as these settings are stored per document. Therefore, it is important that page settings are made for each Excel document before uploading it to WebCenter.
You can view Microsoft Office documents and other file types in WebCenter (16.1 onwards) by specifying the file extensions and the corresponding OBGE Workflow tickets. The system administrator can define the extensions in the configuration file. For each of these extensions, you need a custom view generation ticket. This is a ticket on the OBGE that will generate the required files to make a document viewable in the WebCenter 2D Viewer.