As the developers of a PDF to HTML5 conversion tool (BuildVu), we are regularly asked about converting PDF to EPUB. The main problem is that PDF is a fixed-layout file format, whilst EPUB is generally intended to be reflowable. So what we’re really being asked is whether fixed-layout PDF content can be made responsive. The short answer is: ‘no’. The long answer is that it depends – here’s why.
To start with, it’s helpful to understand the differences between the PDF and EPUB file formats.
What is PDF?
PDF is an initialism for ‘Portable Document Format’. Contrary to its name, however, PDF files are more like vector images than rich-text documents. It would be more accurate to describe PDF as a vector graphic format with support for text elements.
In PDF, the text is drawn one line at a time. When a new line occurs or any other properties change (e.g. font face, text size, color), then a new text draw command is required. A human can infer which bits of text are headings, which lines combine into paragraphs, and how the text interacts with the rest of the content on the page, but that type of information is not actually contained within the document.
PDF files contain a stream of commands such as ‘draw text’, ‘draw image’, ‘draw line’, ‘draw curve’, ‘draw rectangle’, ‘fill shape’, ‘clip shape’, etc. The process is additive, so drawing a line of blue text followed by a red rectangle that covers the text may obscure the blue text, but it still exists within the document. This method is often wrongly used to ‘delete’ sensitive information from a document. In most PDF viewers the blue text can still be selected even though it’s behind the red rectangle.
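To make the additive model concrete, here is a minimal Python sketch. It is not how PDF is actually encoded: the `DrawText` and `FillRect` classes and the `extract_text` function are hypothetical names invented for illustration. It only shows that painting a rectangle over text adds a command rather than removing one.

```python
from dataclasses import dataclass


@dataclass
class DrawText:
    """Hypothetical stand-in for a 'draw text' command."""
    text: str
    x: float
    y: float
    color: str


@dataclass
class FillRect:
    """Hypothetical stand-in for a 'fill rectangle' command."""
    x: float
    y: float
    width: float
    height: float
    color: str


def extract_text(commands):
    """Return every string drawn by a text command, visible or not."""
    return [c.text for c in commands if isinstance(c, DrawText)]


# Commands are applied in order; later ones paint over earlier ones.
page = [
    DrawText("Account number: 12345", x=72, y=700, color="blue"),
    FillRect(x=70, y=695, width=200, height=16, color="red"),
]

# The rectangle hides the text visually, but the text command is still there.
print(extract_text(page))
```

Properly removing the text would mean deleting the `DrawText` entry itself, not adding something on top of it.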
What is EPUB?
EPUB (short for Electronic PUBlication) is a file format for reflowable content intended to be read on handheld devices such as mobile phones and e-readers. The format consists mainly of styled (CSS2) text and may include inline raster or vector images. Feature support varies across devices and reading applications; audio and video, for example, only work where the device supports them.
Can PDF be made responsive?
For very simple documents, it may be possible to write heuristics to detect headings, paragraphs, and perhaps even headers and footers, thereby allowing conversion to EPUB. However, the strategy very quickly falls apart as soon as any complexity is introduced. Here are some examples that would be difficult to make responsive:
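As a rough illustration of the kind of heuristic described above, the following Python sketch groups extracted lines of text into headings and paragraphs using only font size and vertical spacing. The `TextLine` structure, the `group_lines` function and both threshold values are assumptions made for this example, not part of any real tool, and it assumes a single column of unrotated text.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class TextLine:
    text: str
    y: float          # baseline position, measured from the top of the page
    font_size: float


def group_lines(lines, heading_ratio=1.3, gap_ratio=1.5):
    """Group lines into ("heading" | "paragraph", text) blocks.

    A line counts as a heading if its font is noticeably larger than the
    median size on the page. Consecutive body lines are merged into one
    paragraph while the vertical gap between them stays small.
    """
    if not lines:
        return []
    body_size = median(line.font_size for line in lines)
    blocks = []
    previous = None
    for line in sorted(lines, key=lambda ln: ln.y):
        is_heading = line.font_size >= body_size * heading_ratio
        kind = "heading" if is_heading else "paragraph"
        continues = (
            previous is not None
            and kind == "paragraph"
            and blocks[-1][0] == "paragraph"
            and line.y - previous.y <= line.font_size * gap_ratio
        )
        if continues:
            blocks[-1] = ("paragraph", blocks[-1][1] + " " + line.text)
        else:
            blocks.append((kind, line.text))
        previous = line
    return blocks


sample = [
    TextLine("What is PDF?", y=100, font_size=18),
    TextLine("PDF is a fixed-layout", y=124, font_size=10),
    TextLine("file format.", y=136, font_size=10),
    TextLine("EPUB is reflowable.", y=166, font_size=10),
]
for kind, text in group_lines(sample):
    print(kind, "|", text)
```

Even this toy version depends on two hand-picked thresholds, and it has no notion of columns, captions, rotated text or content drawn out of reading order.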
Detecting the headings, text and columns on this page may be possible, but extracting the graph from the bottom right would not be simple. In this example, the blue gradient background is drawn first, followed by the image in the top left, followed by the other shapes on the page, followed by the rest of the line graph. The graph labels are text commands. The only way to extract this graph would be to draw the page in full and crop it to that rectangle.
It is particularly difficult to imagine how a magazine cover would appear if it were to reflow. The only option here would be to rasterise the entire page.
This page would be relatively trivial to reflow if it were rotated 90 degrees clockwise. Unfortunately, as soon as a document contains text that has been transformed or rotated, it becomes next to impossible even for a human to ascertain how it would be reflowed.
In addition to containing short sections of text for which it would be difficult to predict reading order, this page also contains text that has been clipped by shapes. This text would need to be rasterised, but even then it’s not clear where the shapes would be inlined in this document.
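Several of the examples above fall back on drawing the page in full and cropping out a region. One detail that is easy to get wrong is the coordinate conversion: PDF measures in points (1/72 inch) with the origin at the bottom left by default, whereas raster images are normally addressed from the top left. The sketch below is a hypothetical helper, `pdf_rect_to_pixel_box`, which assumes an unrotated page whose origin is at (0, 0).

```python
def pdf_rect_to_pixel_box(rect, page_height_pt, dpi=144):
    """Convert (x0, y0, x1, y1) in PDF points to (left, top, right, bottom) pixels.

    Assumes an unrotated page with its origin at the bottom-left corner.
    """
    x0, y0, x1, y1 = rect
    scale = dpi / 72.0  # one PDF point is 1/72 inch
    left = round(min(x0, x1) * scale)
    right = round(max(x0, x1) * scale)
    # Flip the vertical axis: the top edge in pixels comes from the larger y.
    top = round((page_height_pt - max(y0, y1)) * scale)
    bottom = round((page_height_pt - min(y0, y1)) * scale)
    return left, top, right, bottom


# A region in the bottom right of a US Letter page (612 x 792 points).
print(pdf_rect_to_pixel_box((306, 72, 540, 252), page_height_pt=792))
# → (612, 1080, 1080, 1440)
```

The resulting box can be passed to whatever image library performs the crop once the page has been rendered at the same DPI.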
1. One of EPUB’s layout options is ‘Fixed Layout’. This means that it’s possible to define an EPUB document whose layout does not reflow. However, the EPUB format is not powerful enough to properly replicate the PDF file format. For example, there is no support for transformed text in EPUB. When converting to another fixed-layout format, it would be preferable to use a technology such as HTML5, which offers a better equivalent and is natively supported on mobile devices that contain a web browser.
2. PDF documents can contain tagged content. In addition to the standard draw commands, such documents also contain a marked-up version of the text that could be used to better convert to a format such as EPUB. Unfortunately, in the real world, this option is not widely used and therefore cannot be relied on for converting arbitrary PDF documents into EPUB.
Is there an alternative?
We recommend HTML5 as the best format for making PDF content available natively across mobile devices. Although we do not intend to pursue PDF to EPUB conversion, we do plan to look at how we can make better use of tagged content, and to investigate whether it can be used to create a text-only version of a document where possible.