As the developers of a PDF to HTML5 conversion tool (JPDF2HTML5), we are regularly asked whether we can extend the converter to convert PDF to EPUB. The problem is that the PDF file format has a fixed layout, whereas the EPUB file format is generally intended to be reflowable, so what we’re really being asked is whether we can make fixed-layout PDF content reflowable. The short answer is easy: ‘no’. The long answer is that it depends. Here’s why.
To start with, it’s helpful to understand what PDF and EPUB are, and what the differences are.
What is PDF?
PDF is an initialism for Portable Document Format. Contrary to its name, however, PDF files are more like vector images than traditional documents. PDF is a vector graphic format that has support for text elements. This is significant because it determines how content is defined within the PDF file, and therefore what data is available when converting the PDF into a different file format.
At best, text is defined line by line. If any properties change (e.g. font face, text size, color), that requires a new text draw command. The end result is that whilst it is quite obvious to a human which bits of text are headings, which lines combine into paragraphs, and how the text interacts with the rest of the content on the page, that type of information is not actually contained within the document.
PDF files contain a stream of commands such as ‘draw text’, ‘draw image’, ‘draw line’, ‘draw curve’, ‘draw rectangle’, ‘fill shape’, ‘clip shape’, etc. The process is additive, so if you draw a line of blue text and subsequently draw a red rectangle that covers the text, the blue text will still exist in the document, but will no longer be ‘visible’ to the person viewing the document. This method is often wrongly used to ‘delete’ sensitive information from a document. However, the content remains in the document and in many PDF viewers can even still be selected.
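To make the additive model concrete, here is a minimal sketch in Python. The `DrawText` and `FillRect` classes are hypothetical stand-ins for PDF drawing commands, not the API of any real PDF library, and the page contents are invented for illustration. It shows that a naive text extractor still returns text that a later rectangle has painted over.

```python
from dataclasses import dataclass


@dataclass
class DrawText:
    """Hypothetical 'draw text' command with a simple bounding box."""
    text: str
    x: float
    y: float
    width: float
    height: float
    color: str


@dataclass
class FillRect:
    """Hypothetical 'fill rectangle' command."""
    x: float
    y: float
    width: float
    height: float
    color: str


def extract_text(commands):
    """Return every string drawn on the page, regardless of what was painted later."""
    return [c.text for c in commands if isinstance(c, DrawText)]


def covers(rect, text):
    """True if the rectangle fully covers the text's bounding box."""
    return (rect.x <= text.x
            and rect.y <= text.y
            and rect.x + rect.width >= text.x + text.width
            and rect.y + rect.height >= text.y + text.height)


def visible_text(commands):
    """Return only strings that no later rectangle fully covers."""
    visible = []
    for i, command in enumerate(commands):
        if not isinstance(command, DrawText):
            continue
        hidden = any(isinstance(later, FillRect) and covers(later, command)
                     for later in commands[i + 1:])
        if not hidden:
            visible.append(command.text)
    return visible


# Commands are applied in order: the red rectangle is painted over the blue text.
page = [
    DrawText("Account number: 12345678", x=50, y=700, width=200, height=12, color="blue"),
    FillRect(x=45, y=695, width=210, height=22, color="red"),
    DrawText("Public summary", x=50, y=650, width=120, height=12, color="black"),
]

print(extract_text(page))   # includes the covered text
print(visible_text(page))   # what a person viewing the page would see
```

The covered string is absent from what the viewer sees but is still present in the command stream, which is why drawing a shape over text does not remove it from the file.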
What is EPUB?
EPUB (short for Electronic PUBlication) is a file type designed for storing reflowable content, often aimed at being consumed on handheld devices such as mobile phones and e-readers. The format is mainly composed of styled (CSS2), reflowable text, as well as inline raster or vector images. Feature support varies across devices and applications, and some features, such as audio and video, depend on the device.
Can PDF be made responsive?
If you limited the input to PDF files that contain only the most basic document processor features, it would be fairly trivial to write heuristics that could pick out headings, paragraphs, and perhaps even headers and footers, allowing conversion to EPUB. However, the strategy falls apart very quickly as soon as any complexity is introduced. Here are some pages that would be difficult to make responsive:
If we were to imagine that it was possible to pick out the headings, text and columns on this page, extracting the graph from the bottom right would not be simple. Internally, the graph is a series of lines and shapes. The labels are real text, so it would be difficult to infer that they are in fact labels for a graph.
It is particularly difficult to imagine how a magazine cover would appear if it were to reflow. The only option here would be to rasterise the entire page.
This page would be a candidate for ‘relatively trivial’ reflow if it were rotated 90 degrees clockwise. Unfortunately, as soon as a document contains text that has been transformed or rotated, it becomes next to impossible, even for a human, to ascertain how it should be reflowed.
In addition to containing short sections of text whose order would be difficult to predict, this page also contains text that has been clipped by shapes. This text would need to be rasterised, but even then it’s not clear where the shapes would be inlined in the document.
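For contrast with the difficult pages above, the kind of heuristic that works on very simple documents can be sketched as follows. This is only an illustration under stated assumptions: `TextLine` is a hypothetical record of one extracted line, the thresholds `heading_ratio` and `paragraph_gap` are arbitrary values chosen for the example, and the page is assumed to be a single column of unrotated text.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class TextLine:
    """Hypothetical extracted line of text from a single-column page."""
    text: str
    y: float          # baseline position; larger values are higher on the page
    font_size: float


def group_lines(lines, heading_ratio=1.3, paragraph_gap=1.5):
    """Group lines into ('heading' | 'paragraph', text) blocks.

    A line is treated as a heading if its font is noticeably larger than the
    median font size. Consecutive body lines are merged into one paragraph
    when the vertical gap between them is small.
    """
    if not lines:
        return []
    body_size = median(line.font_size for line in lines)
    blocks = []
    previous = None
    for line in sorted(lines, key=lambda l: -l.y):  # top of page first
        kind = "heading" if line.font_size >= body_size * heading_ratio else "paragraph"
        same_block = (
            previous is not None
            and blocks[-1][0] == kind == "paragraph"
            and (previous.y - line.y) <= paragraph_gap * line.font_size
        )
        if same_block:
            blocks[-1] = (kind, blocks[-1][1] + " " + line.text)
        else:
            blocks.append((kind, line.text))
        previous = line
    return blocks


sample = [
    TextLine("What is PDF?", y=700, font_size=18),
    TextLine("PDF is a vector graphic format", y=676, font_size=10),
    TextLine("that has support for text elements.", y=664, font_size=10),
    TextLine("At best, text is defined line by line.", y=640, font_size=10),
]

for kind, text in group_lines(sample):
    print(kind, "|", text)
```

Every assumption baked into this sketch (one column, one reading direction, font size as the only heading signal) is violated by at least one of the pages described above, which is why the approach does not generalise.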
One of EPUB’s layout options is Fixed Layout. This means that it’s possible to define an EPUB document that has a layout which does not reflow. Unfortunately, even when this is used, the EPUB format is not powerful enough to properly replicate the PDF file format. For example, there is no support for transformed text in EPUB. When converting to another fixed-layout format, it would be preferable to use a technology such as HTML5, which offers a better equivalent and is also natively supported on mobile devices that contain a web browser.
It is possible for PDF documents to contain tagged content. What this means is that in addition to the standard draw commands, the document also contains a marked up version of the text that could be used to better convert to a format such as EPUB. Unfortunately, in the real world this option is not widely used and is therefore not helpful in converting a generic PDF document into EPUB.
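As a rough illustration of why tagged content helps, the sketch below walks a small structure tree and emits the kind of markup an EPUB content document would need. The `StructElem` class and the `TAG_MAP` mapping are hypothetical and do not correspond to any real library's API. The tag names used (`Document`, `H1`, `H2`, `P`, `L`, `LI`) follow PDF's standard structure types, but a real tagged PDF may use other types or custom ones.

```python
from dataclasses import dataclass, field
from html import escape


@dataclass
class StructElem:
    """Hypothetical node in a tagged PDF's structure tree."""
    tag: str
    text: str = ""
    children: list = field(default_factory=list)


# Assumed mapping from structure types to XHTML elements; anything unknown becomes a div.
TAG_MAP = {"Document": "body", "H1": "h1", "H2": "h2", "P": "p", "L": "ul", "LI": "li"}


def to_xhtml(elem):
    """Recursively convert a structure element into an XHTML fragment."""
    name = TAG_MAP.get(elem.tag, "div")
    inner = escape(elem.text) + "".join(to_xhtml(child) for child in elem.children)
    return f"<{name}>{inner}</{name}>"


tree = StructElem("Document", children=[
    StructElem("H1", "What is PDF?"),
    StructElem("P", "Fixed & portable."),
])

print(to_xhtml(tree))
```

With the structure supplied by the document itself, no guessing about headings or paragraphs is needed. Without tags, that information has to be inferred from the draw commands.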
Is there an alternative?
We recommend HTML5 as the best format for making PDF content available natively across mobile devices. Although we do not intend to pursue PDF to EPUB conversion, we do plan to explore how we can make better use of tagged content to improve the markup of our converted content, and to investigate whether it can be used to create a text-only version of a document where possible.
You can find out more about our PDF to HTML5 converter here, or try it online for free at convert.idrsolutions.com.