Leon Atherton Leon has over 15 years’ Java experience and leads BuildVu, also contributing to cloud services and internal tooling. Wearing many hats across technical and growth roles as a core member, he enjoys motorsport, strategy games, and software side projects.

How to convert PDF to HTML in Python



Converting a PDF to HTML in Python involves a real trade-off between setup complexity and output quality, and the right choice depends on whether you’re building something throwaway or something that ships to clients.

The pdftohtml CLI tool (part of the Poppler suite) is the lowest-friction starting point: one system install, one terminal command, and a subprocess call to wire it into Python. PyMuPDF (fitz) skips the system dependency and gives you a proper Python API with per-page HTML extraction, which makes it a solid choice for internal tooling and solo projects. However, if you’re handling client-facing documents where layout and font fidelity actually matter, BuildVu is a reliable, battle-tested solution.

This tutorial covers all three. Each section includes a working code example, the exact install step, and the specific trade-offs that will affect your choice. The comparison table below maps each option against cost, integration complexity and output quality so you can make the call in under a minute.

Feature | pdftohtml (CLI) | PyMuPDF | BuildVu
Cost | Free / Open Source | Free / Open Source | Commercial / Paid
Installation | Requires Poppler system install | pip install pymupdf | pip install IDRCloudClient
Visual Fidelity | Basic | Good | High (layout, font & visual accuracy)
Python Integration | Via subprocess module | Native Python library | REST API wrapper
Best For | Quick local conversions | Solo developers / students | Enterprise / commercial workflows

Using the pdftohtml CLI Tool

If you prefer to perform conversions locally using open-source tools, pdftohtml (part of the Poppler utility suite) is a widely used command-line interface (CLI) tool. It is a good option for quick conversions and can be easily triggered from within a Python script.

Step 1: Install Poppler

Before using the command, you need to install the Poppler library which contains the pdftohtml utility.

On Windows:
Download the latest binary from the Poppler for Windows releases and add the bin folder to your System PATH.

On macOS (via Homebrew):



brew install poppler

 

Step 2: Basic Command Usage

Once installed, you can convert a PDF to HTML directly from your terminal.

The basic syntax is:



pdftohtml input.pdf output.html

 

Step 3: Automating pdftohtml with Python

To integrate this into a Python workflow, you can use the built-in subprocess module. This allows your Python application to execute terminal commands and handle the output.



import subprocess

def convert_pdf_to_html_cli(input_path, output_path):
    try:
        # Running the pdftohtml command with the -c (complex) flag
        command = ['pdftohtml', '-c', input_path, output_path]
        
        result = subprocess.run(command, capture_output=True, text=True)

        if result.returncode == 0:
            print(f'Success! Conversion saved to: {output_path}')
        else:
            print(f'Error during conversion: {result.stderr}')
            
    except FileNotFoundError:
        # Double quotes outside so the inner single quotes do not end the string
        print("Error: 'pdftohtml' is not installed or not in your PATH.")

# Example usage
convert_pdf_to_html_cli('my_document.pdf', 'output_folder/my_document.html')
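
The wrapper above only reports a missing binary after the call has failed. As a minimal sketch, the check can be done up front with the standard library's shutil.which, and the command can be built by a small helper so the flags are easy to vary. The helper names below are illustrative and not part of Poppler; -c (complex output), -f (first page) and -l (last page) are existing pdftohtml options.

```python
import shutil
import subprocess


def build_pdftohtml_command(input_path, output_path, first_page=None, last_page=None):
    """Build the argument list for pdftohtml (hypothetical helper)."""
    command = ['pdftohtml', '-c']  # -c: complex output that tries to preserve layout
    if first_page is not None:
        command += ['-f', str(first_page)]  # -f: first page to convert
    if last_page is not None:
        command += ['-l', str(last_page)]   # -l: last page to convert
    command += [input_path, output_path]
    return command


def run_pdftohtml(input_path, output_path, **page_range):
    """Run pdftohtml, raising an exception instead of printing on failure."""
    if shutil.which('pdftohtml') is None:
        raise RuntimeError('pdftohtml was not found on PATH; install Poppler first.')
    command = build_pdftohtml_command(input_path, output_path, **page_range)
    # check=True raises subprocess.CalledProcessError on a non-zero exit code
    return subprocess.run(command, capture_output=True, text=True, check=True)
```

Raising exceptions rather than printing lets the calling code decide how to handle a failed conversion, which tends to matter once the script runs unattended.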

 

Using PyMuPDF

Another option is PyMuPDF, a Python library that is imported as fitz. It is a better choice than pdftohtml when conversion quality and visual fidelity matter.

Step 1: Install pymupdf

The package is listed on PyPI as pymupdf. You can install it via pip:



pip install pymupdf

 

Step 2: Running the Conversion

In PyMuPDF, the conversion happens page by page. You can extract the content of a page directly as an HTML string using the .get_text("html") method.


import fitz  # PyMuPDF is imported as fitz

def simple_pdf_to_html(pdf_path, output_html):
    # 1. Open the PDF document
    doc = fitz.open(pdf_path)
    full_html = ''

    # 2. Iterate through each page and extract HTML
    for page_num in range(len(doc)):
        page = doc.load_page(page_num)
        page_html = page.get_text('html')
        full_html += page_html

    # 3. Save to a file
    with open(output_html, 'w', encoding='utf-8') as f:
        f.write(full_html)

    print(f'Conversion complete: {output_html}')
    doc.close()

simple_pdf_to_html('sample.pdf', 'output.html')
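
The per-page output of get_text('html') is, as far as we are aware, an HTML fragment rather than a complete document, so the concatenated file has no doctype or charset declaration. A minimal sketch of one way to wrap the fragments is shown below; the function names are hypothetical, and the PyMuPDF import is kept inside the second function so the wrapping helper can be used on its own.

```python
import html


def wrap_pages_as_document(page_fragments, title='Converted PDF'):
    """Join per-page HTML fragments into one standalone HTML document."""
    body = '\n'.join(page_fragments)
    return (
        '<!DOCTYPE html>\n'
        '<html>\n<head>\n'
        '<meta charset="utf-8">\n'
        f'<title>{html.escape(title)}</title>\n'
        '</head>\n<body>\n'
        f'{body}\n'
        '</body>\n</html>\n'
    )


def pdf_to_html_document(pdf_path, output_html):
    """Extract every page with PyMuPDF and save a single wrapped HTML file."""
    import fitz  # PyMuPDF; imported here so the helper above has no dependency on it

    # The document is closed automatically when the with-block exits
    with fitz.open(pdf_path) as doc:
        fragments = [page.get_text('html') for page in doc]

    with open(output_html, 'w', encoding='utf-8') as f:
        f.write(wrap_pages_as_document(fragments, title=pdf_path))
```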

 

Convert PDF to HTML using BuildVu

If you're looking for a faithful recreation of your documents in HTML in terms of layout, fonts and visual fidelity, a commercial library might be better suited to your needs. We've developed BuildVu over the last 20 years with a focus on robust performance and enterprise reliability. This tutorial uses our open-source Python IDRCloudClient, which provides a simple Python wrapper around the REST API.

Prerequisites

Using pip, install the IDRCloudClient package with the following command:

pip install IDRCloudClient
 

Code Examples

Below is a basic code example for converting PDF files to HTML. Additional configuration options and advanced features are detailed below.
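
As a hedged sketch of what a basic conversion might look like: the convert method and its input and file values come from this tutorial, while the import path, the UPLOAD and BUILDVU constants, the downloadResult method, the 'downloadUrl' key and the endpoint URL are assumptions about the client library that should be checked against its current documentation. The function takes the client as a parameter so that it can be exercised without a running service.

```python
def convert_pdf(client, pdf_path, output_dir):
    """Upload a local PDF, convert it, and download the result.

    `client` is expected to be an IDRCloudClient instance (assumed API).
    """
    # input=UPLOAD signals that a local file is being sent (assumed constant)
    result = client.convert(input=client.UPLOAD, file=pdf_path)
    # Save the converted output locally (assumed method name)
    client.downloadResult(result, output_dir)
    # The result is assumed to behave like a dict with a 'downloadUrl' entry
    return result.get('downloadUrl')


# Example usage (assumed import path and endpoint; replace with your own server):
#   from IDRSolutions import IDRCloudClient
#   client = IDRCloudClient('http://localhost:8080/' + IDRCloudClient.BUILDVU)
#   print(convert_pdf(client, 'path/to/file.pdf', 'path/to/output/dir'))
```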

 

Return result to a callback URL

The BuildVu Microservice supports a callback URL to send the status of a conversion on completion. Using a callback URL eliminates the need to continually check the service for updates. You can provide the callback URL to the `convert` method as shown below:
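
A minimal sketch of passing a callback URL, assuming the parameter is named callbackUrl and that the client exposes an UPLOAD constant; both names are assumptions, and the listener URL is a placeholder.

```python
def convert_with_callback(client, pdf_path, callback_url):
    """Start a conversion and ask the service to notify callback_url on completion."""
    return client.convert(
        input=client.UPLOAD,       # assumed constant for local uploads
        file=pdf_path,
        callbackUrl=callback_url,  # assumed parameter name
    )


# Example usage (placeholder listener URL):
#   convert_with_callback(client, 'path/to/file.pdf', 'http://listener.url')
```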

 

Configuration Options

The BuildVu API allows for conversion customization using a stringified JSON object with key-value pair configuration options. Provide these settings to the convert method. A comprehensive list of options for converting PDF files to HTML is available here.
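
Building the stringified JSON by hand is error-prone, so one option is to keep the options in a dict and serialise them with json.dumps. This is a sketch only: the settings parameter name and the UPLOAD constant are assumptions, and 'exampleOption' is a hypothetical key rather than a real BuildVu option.

```python
import json


def convert_with_settings(client, pdf_path, options):
    """Convert a local PDF, passing options as a stringified JSON object."""
    settings = json.dumps(options)  # e.g. '{"exampleOption": "value"}'
    return client.convert(
        input=client.UPLOAD,  # assumed constant for local uploads
        file=pdf_path,
        settings=settings,    # assumed parameter name
    )


# Example usage with a hypothetical option key:
#   convert_with_settings(client, 'path/to/file.pdf', {'exampleOption': 'value'})
```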


 

Upload by URL

In addition to uploading a local file, you can provide a URL for the BuildVu Microservice to download and convert. Simply replace the input and file values in the convert method with the following.
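
A sketch of the URL variant, assuming the client exposes a DOWNLOAD constant for the input value and accepts the address in a url parameter; both names are assumptions to verify against the client documentation.

```python
def convert_from_url(client, pdf_url):
    """Ask the service to fetch the PDF from pdf_url and convert it."""
    return client.convert(
        input=client.DOWNLOAD,  # assumed constant: the service downloads the file itself
        url=pdf_url,            # assumed parameter name, used instead of file
    )


# Example usage (placeholder URL):
#   convert_from_url(client, 'http://link.to/filename')
```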

 

Using Authentication

For deployments of your own BuildVu Microservice that require a username and password for PDF-to-HTML conversions, provide these credentials with each conversion. Pass a variable named auth to the convert method as demonstrated below.
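
A sketch of supplying credentials, assuming auth takes a (username, password) tuple; the tuple shape is an assumption, and the environment variable names are hypothetical. Reading the credentials from the environment keeps them out of source control.

```python
import os


def convert_with_auth(client, pdf_path):
    """Convert a local PDF on a password-protected BuildVu Microservice."""
    # Hypothetical environment variable names; a KeyError is raised if either is unset
    username = os.environ['BUILDVU_USERNAME']
    password = os.environ['BUILDVU_PASSWORD']
    return client.convert(
        input=client.UPLOAD,        # assumed constant for local uploads
        file=pdf_path,
        auth=(username, password),  # assumed to be a (username, password) tuple
    )
```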

 

Which Method for PDF to HTML Conversion using Python?

If you're a student or a solo developer looking to set up a quick PDF to HTML automation in Python, PyMuPDF or the pdftohtml CLI tool might be the better option for you. However, if you're looking to integrate a commercially viable solution, BuildVu would be the more sensible option, as it has a dedicated support team of developers (the same ones who originally created the library).



BuildVu allows you to

View PDF files in a Web app
Convert PDF documents to HTML5
Parse PDF documents as HTML