How to convert PDF to HTML in Python

Converting a PDF to HTML in Python involves a trade-off between setup complexity and output quality, and the right choice depends on whether you’re building something throwaway or something that ships to clients.

The pdftohtml CLI tool (part of the Poppler suite) is the lowest-friction starting point: one system install, one terminal command, and a subprocess call to wire it into Python. PyMuPDF (fitz) skips the system dependency and gives you a proper Python API with per-page HTML extraction, which makes it solid for internal tooling and solo projects. However, if you’re handling client-facing documents where layout and font fidelity actually matter, BuildVu is a reliable, battle-tested solution.

This tutorial covers all three. Each section includes a working code example, the exact install step, and the specific trade-offs that will affect your choice. The comparison table below maps each option against cost, integration complexity and output quality so you can make the call in under a minute.

Feature	pdftohtml (CLI)	PyMuPDF	BuildVu
Cost	Free / Open Source	Free / Open Source	Commercial / Paid
Installation	Requires Poppler system install	pip install pymupdf	pip install IDRCloudClient
Visual Fidelity	Basic	Good	High (layout, font & visual accuracy)
Python Integration	Via subprocess module	Native Python library	REST API wrapper
Best For	Quick local conversions	Solo developers / students	Enterprise / commercial workflows

Using the `pdftohtml` CLI Tool

If you prefer to perform conversions locally using open-source tools, pdftohtml (part of the Poppler utility suite) is a widely used command-line interface (CLI) tool. It is a good option for quick conversions and can be easily triggered from within a Python script.

Step 1: Install Poppler

Before using the command, you need to install the Poppler library which contains the pdftohtml utility.

On Windows:
Download the latest binary from the Poppler for Windows releases and add the bin folder to your System PATH.

On macOS (via Homebrew):

brew install poppler

Step 2: Basic Command Usage

Once installed, you can convert a PDF to HTML directly from your terminal.

The basic syntax is:

pdftohtml input.pdf output.html

Step 3: Automating pdftohtml with Python

To integrate this into a Python workflow, you can use the built-in subprocess module. This allows your Python application to execute terminal commands and handle the output.

import subprocess


def convert_pdf_to_html_cli(input_path, output_path):
    try:
        # Run the pdftohtml command with the -c (complex) flag
        command = ['pdftohtml', '-c', input_path, output_path]
        result = subprocess.run(command, capture_output=True, text=True)
        if result.returncode == 0:
            print(f'Success! Conversion saved to: {output_path}')
        else:
            print(f'Error during conversion: {result.stderr}')
    except FileNotFoundError:
        # Raised when the pdftohtml executable cannot be found
        print("Error: 'pdftohtml' is not installed or not in your PATH.")


# Example usage
convert_pdf_to_html_cli('my_document.pdf', 'output_folder/my_document.html')
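The function above prints its outcome, which is fine for a one-off script. In a larger workflow it may be more useful to fail loudly. The sketch below is one possible variant, not the only way to wrap the tool: it checks for the binary with `shutil.which`, builds the argument list in a separate function, and raises on failure. The helper names (`build_pdftohtml_command`, `convert_with_checks`) are illustrative. `-f` and `-l` are the pdftohtml options for the first and last page to convert. Creating the output folder up front is a precaution, on the assumption that pdftohtml will not create it for you.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdftohtml_command(input_path, output_path, first_page=None, last_page=None):
    """Return the pdftohtml argument list (helper name is illustrative)."""
    command = ["pdftohtml", "-c"]  # -c: complex mode, as in the example above
    if first_page is not None:
        command += ["-f", str(first_page)]  # -f: first page to convert
    if last_page is not None:
        command += ["-l", str(last_page)]  # -l: last page to convert
    return command + [str(input_path), str(output_path)]


def convert_with_checks(input_path, output_path, **page_range):
    """Run pdftohtml and raise an exception instead of printing on failure."""
    if shutil.which("pdftohtml") is None:
        raise FileNotFoundError("pdftohtml was not found on PATH; install Poppler first.")
    # Assumption: pdftohtml does not create missing output folders, so do it here.
    Path(output_path).parent.mkdir(parents=True, exist_ok=True)
    # check=True raises subprocess.CalledProcessError on a non-zero exit code.
    subprocess.run(
        build_pdftohtml_command(input_path, output_path, **page_range),
        capture_output=True,
        text=True,
        check=True,
    )
```

Calling `convert_with_checks('my_document.pdf', 'output_folder/my_document.html', first_page=1, last_page=3)` would convert only the first three pages and leave error handling to the caller.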

Using PyMuPDF

Another option is PyMuPDF, a Python library that is imported as fitz. It is a better choice than pdftohtml when conversion quality and visual fidelity matter.

Step 1: Install `pymupdf`

The package is listed on PyPI as pymupdf. You can install it via pip:

pip install pymupdf

Step 2: Running the Conversion

In PyMuPDF, the conversion happens page by page. You can extract the content of a page directly as an HTML string using the .get_text("html") method.

import fitz  # PyMuPDF is imported as fitz


def simple_pdf_to_html(pdf_path, output_html):
    # 1. Open the PDF document
    doc = fitz.open(pdf_path)
    full_html = ''
    # 2. Iterate through each page and extract HTML
    for page_num in range(len(doc)):
        page = doc.load_page(page_num)
        page_html = page.get_text('html')
        full_html += page_html
    # 3. Save to a file
    with open(output_html, 'w', encoding='utf-8') as f:
        f.write(full_html)
    print(f'Conversion complete: {output_html}')
    doc.close()


simple_pdf_to_html('sample.pdf', 'output.html')
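Because the extraction happens page by page, the file written above is a run of per-page markup rather than a complete HTML document: it has no doctype, `<head>` or charset declaration. The following is a minimal sketch of wrapping the page output in a standalone document, assuming that is what your use case needs. The helper names and the default title are illustrative and not part of PyMuPDF.

```python
from html import escape


def build_html_document(page_fragments, title="Converted PDF"):
    """Wrap per-page HTML fragments in a minimal standalone document."""
    body = "\n".join(page_fragments)
    return (
        "<!DOCTYPE html>\n"
        '<html>\n<head>\n<meta charset="utf-8">\n'
        f"<title>{escape(title)}</title>\n</head>\n"
        f"<body>\n{body}\n</body>\n</html>\n"
    )


def pdf_to_html_document(pdf_path, output_html):
    """Extract every page with PyMuPDF and save one complete HTML file."""
    import fitz  # imported here so build_html_document works without PyMuPDF

    with fitz.open(pdf_path) as doc:  # the context manager closes the document
        fragments = [page.get_text("html") for page in doc]
    with open(output_html, "w", encoding="utf-8") as f:
        f.write(build_html_document(fragments, title=str(pdf_path)))
```

Keeping the wrapping step in its own function means it can be checked without a PDF to hand, and the template can be swapped for your own later.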

Convert PDF to HTML using BuildVu

If you’re looking for a true recreation of your documents in HTML with regard to layout, font and visual fidelity, then a commercial library might be better suited to your needs. We’ve developed BuildVu over the last 20 years with a focus on robust performance and enterprise reliability. This tutorial uses our open-source Python IDRCloudClient, which provides a simple Python wrapper around the REST API.

Prerequisites

Using pip, install the IDRCloudClient package with the following command:

pip install IDRCloudClient

Code Examples

Below is a basic code example for converting PDF files to HTML. Additional configuration options and advanced features are detailed below.

from IDRSolutions import IDRCloudClient
client = IDRCloudClient('https://cloud.idrsolutions.com/cloud/' + IDRCloudClient.BUILDVU)
try:
    result = client.convert(
        # token='Token', # Required only when connecting to the IDRsolutions trial and cloud subscription service
        input=IDRCloudClient.UPLOAD,
        file='/path/to/exampleFile.pdf'
    )
    outputURL = result['downloadUrl']
    client.downloadResult(result, 'path/to/output/dir')
    if outputURL is not None:
        print("Download URL: " + outputURL)
except Exception as error:
    print(error)

Return Result to a Callback URL

The BuildVu Microservice supports a callback URL to send the status of a conversion on completion. Using a callback URL eliminates the need to continually check the service for updates. You can provide the callback URL to the `convert` method as shown below:

result = client.convert(
    # token='Token', # Required only when connecting to the IDRsolutions trial and cloud subscription service
    input=IDRCloudClient.UPLOAD,
    callbackUrl='http://listener.url',
    file='/path/to/exampleFile.pdf'
)

Configuration Options

The BuildVu API allows for conversion customization using a stringified JSON object with key-value pair configuration options. Provide these settings to the convert method. A comprehensive list of options for converting PDF files to HTML is available in the BuildVu documentation.

settings='{"key":"value","key":"value"}'
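Writing the JSON string by hand gets error-prone once values contain quotes or nesting. One option is to build a dictionary and serialise it with `json.dumps`. The option names below are placeholders, not real BuildVu settings, so substitute keys from the BuildVu documentation.

```python
import json

# Placeholder option names: replace with real keys from the BuildVu documentation.
options = {
    "exampleOptionA": "value",
    "exampleOptionB": "value",
}

# json.dumps handles the quoting and escaping for the stringified JSON object.
settings = json.dumps(options)

# The resulting string is then passed to convert() as settings=settings.
```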

Upload by URL

In addition to uploading a local file, you can provide a URL for the BuildVu Microservice to download and convert. Simply replace the input and file values in the convert method with the following.

input=IDRCloudClient.DOWNLOAD
url='http://exampleURL/exampleFile.pdf'

Using Authentication

For deployments of your own BuildVu Microservice that require a username and password for PDF-to-HTML conversions, provide these credentials with each conversion. Pass a variable named auth to the convert method as demonstrated below.

auth=('username', 'password')
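The options described in the sections above are all keyword arguments to the same `convert` call. As a sketch of how they might be combined, the hypothetical helper below assembles them into one dictionary. It assumes `convert` accepts these arguments together, which the sections above suggest but do not state outright. The helper name and its parameters are not part of IDRCloudClient.

```python
def build_convert_kwargs(client_cls, source, settings=None, callback_url=None,
                         auth=None, token=None):
    """Assemble keyword arguments for client.convert() (hypothetical helper).

    client_cls is the IDRCloudClient class, passed in so that the UPLOAD and
    DOWNLOAD constants come from the installed package.
    """
    if source.startswith(("http://", "https://")):
        # Remote file: the service downloads it from the URL.
        kwargs = {"input": client_cls.DOWNLOAD, "url": source}
    else:
        # Local file: upload it from disk.
        kwargs = {"input": client_cls.UPLOAD, "file": source}
    optional = {"settings": settings, "callbackUrl": callback_url,
                "auth": auth, "token": token}
    # Only include the optional arguments that were actually supplied.
    kwargs.update({k: v for k, v in optional.items() if v is not None})
    return kwargs
```

A call would then look like `client.convert(**build_convert_kwargs(IDRCloudClient, '/path/to/exampleFile.pdf', auth=('username', 'password')))`.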

Which Method for PDF to HTML Conversion using Python?

If you’re a student or a solo developer looking to set up a quick PDF to HTML automation in Python, then PyMuPDF or the pdftohtml CLI tool might be the better option for you. However, if you’re looking to integrate a commercially viable solution, BuildVu would be the more sensible option, as it has a dedicated support team of developers (the same ones who originally created the library).

BuildVu allows you to:

View PDF files in a Web app

Convert PDF documents to HTML5

Parse PDF documents as HTML