Converting a PDF to HTML in Python involves a real trade-off between setup complexity and output quality, and the right choice depends on whether you’re building something throwaway or something that ships to clients.
The pdftohtml CLI tool (part of the Poppler suite) is the lowest-friction starting point: one system install, one terminal command, and a subprocess call to wire it into Python. PyMuPDF (fitz) skips the system dependency and gives you a proper Python API with per-page HTML extraction, which makes it a solid choice for internal tooling and solo projects. However, if you’re handling client-facing documents where layout and font fidelity actually matter, BuildVu is a reliable, battle-tested solution.
This tutorial covers all three. Each section includes a working code example, the exact install step, and the specific trade-offs that will affect your choice. The comparison table below maps each option against cost, integration complexity, and output quality so you can make the call in under a minute.
| Feature | pdftohtml (CLI) | PyMuPDF | BuildVu |
|---|---|---|---|
| Cost | Free / Open Source | Free / Open Source | Commercial / Paid |
| Installation | Requires Poppler system install | pip install pymupdf | pip install IDRCloudClient |
| Visual Fidelity | Basic | Good | High (layout, font & visual accuracy) |
| Python Integration | Via subprocess module | Native Python library | REST API wrapper |
| Best For | Quick local conversions | Solo developers / students | Enterprise / commercial workflows |
Using the pdftohtml CLI Tool
If you prefer to perform conversions locally using open-source tools, pdftohtml (part of the Poppler utility suite) is a widely used command-line interface (CLI) tool. It is a good option for quick conversions and can be easily triggered from within a Python script.
Step 1: Install Poppler
Before using the command, you need to install the Poppler library which contains the pdftohtml utility.
On Windows:
Download the latest binary from the Poppler for Windows releases and add the bin folder to your System PATH.
On macOS (via Homebrew):
brew install poppler
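Before moving on, it can help to confirm that the install worked and that the binary is visible on your PATH. The sketch below uses the standard library's shutil.which for that check; the helper name find_tool is only an illustration.

```python
import shutil

def find_tool(name='pdftohtml'):
    """Return the full path to an executable on PATH, or None if it is missing."""
    return shutil.which(name)

if __name__ == '__main__':
    path = find_tool()
    if path:
        print(f'pdftohtml found at: {path}')
    else:
        print('pdftohtml not found; check the Poppler install and your PATH.')
```

If the check reports that the tool is missing right after installing, opening a new terminal so the updated PATH is picked up is usually the first thing to try.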
Step 2: Basic Command Usage
Once installed, you can convert a PDF to HTML directly from your terminal.
The basic syntax is:
pdftohtml input.pdf output.html
Step 3: Automating pdftohtml with Python
To integrate this into a Python workflow, you can use the built-in subprocess module. This allows your Python application to execute terminal commands and handle the output.
import subprocess

def convert_pdf_to_html_cli(input_path, output_path):
    try:
        # Run the pdftohtml command with the -c (complex) flag
        command = ['pdftohtml', '-c', input_path, output_path]
        result = subprocess.run(command, capture_output=True, text=True)
        if result.returncode == 0:
            print(f'Success! Conversion saved to: {output_path}')
        else:
            print(f'Error during conversion: {result.stderr}')
    except FileNotFoundError:
        # Raised when the pdftohtml executable cannot be found
        print("Error: 'pdftohtml' is not installed or not in your PATH.")

# Example usage
convert_pdf_to_html_cli('my_document.pdf', 'output_folder/my_document.html')
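One practical detail with the example usage: the output path points inside output_folder, which may not exist yet. The sketch below prepares the folder and builds the same argument list. It assumes that pdftohtml does not create missing folders itself, which is worth verifying on your platform, and prepare_command is a hypothetical helper name.

```python
from pathlib import Path

def prepare_command(input_path, output_path, complex_mode=True):
    """Create the output folder if needed and return the pdftohtml argument list."""
    out = Path(output_path)
    # Assumption: pdftohtml will not create missing folders, so do it up front.
    out.parent.mkdir(parents=True, exist_ok=True)
    command = ['pdftohtml']
    if complex_mode:
        command.append('-c')  # the same -c flag used in the example above
    command += [str(input_path), str(out)]
    return command
```

The returned list can be passed straight to subprocess.run(command, capture_output=True, text=True), exactly as in the function above.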
Using PyMuPDF
Another option is PyMuPDF, a Python library that is imported as fitz. It is a better choice when the quality of the conversion and the visual fidelity of the output matter.
Step 1: Install pymupdf
The package is listed on PyPI as pymupdf. You can install it via pip:
pip install pymupdf
Step 2: Running the Conversion
In PyMuPDF, the conversion happens page by page. You can extract the content of a page directly as an HTML string using the .get_text("html") method.
import fitz  # PyMuPDF is imported as fitz

def simple_pdf_to_html(pdf_path, output_html):
    # 1. Open the PDF document
    doc = fitz.open(pdf_path)
    full_html = ''

    # 2. Iterate through each page and extract HTML
    for page_num in range(len(doc)):
        page = doc.load_page(page_num)
        page_html = page.get_text('html')
        full_html += page_html

    # 3. Save to a file
    with open(output_html, 'w', encoding='utf-8') as f:
        f.write(full_html)

    print(f'Conversion complete: {output_html}')
    doc.close()

simple_pdf_to_html('sample.pdf', 'output.html')
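The string returned by get_text('html') appears to be a fragment for a single page, so concatenating pages gives you markup without an enclosing html or body element. One possible way to produce a complete document is sketched below. The names wrap_pages and pdf_to_single_html are hypothetical, and the section wrapper with a data-page attribute is an illustrative choice, not something PyMuPDF requires.

```python
import html

def wrap_pages(page_fragments, title='Converted PDF'):
    """Join per-page HTML fragments into one standalone HTML document."""
    sections = [
        f'<section data-page="{number}">\n{fragment}\n</section>'
        for number, fragment in enumerate(page_fragments, start=1)
    ]
    return (
        '<!DOCTYPE html>\n<html>\n<head>\n'
        '<meta charset="utf-8">\n'
        f'<title>{html.escape(title)}</title>\n'
        '</head>\n<body>\n'
        + '\n'.join(sections)
        + '\n</body>\n</html>\n'
    )

def pdf_to_single_html(pdf_path, output_html):
    """Extract every page with PyMuPDF and write one wrapped HTML file."""
    import fitz  # imported here so wrap_pages can be used without PyMuPDF
    doc = fitz.open(pdf_path)
    try:
        fragments = [page.get_text('html') for page in doc]
    finally:
        doc.close()  # release the file even if extraction fails
    with open(output_html, 'w', encoding='utf-8') as f:
        f.write(wrap_pages(fragments, title=str(pdf_path)))
```

Keeping the wrapping step separate from the extraction step means the page markup can be post-processed, or the wrapper swapped for your own template, without touching the PyMuPDF code.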
Convert PDF to HTML using BuildVu
If you're looking for a true recreation of your documents in HTML with regard to layout, fonts, and visual fidelity, then a commercial library might be better suited to your needs. We've developed BuildVu over the last 20 years with a focus on robust performance and enterprise reliability. This tutorial uses our open-source Python IDRCloudClient, which provides a simple Python wrapper around the REST API.
Prerequisites
Using pip, install the IDRCloudClient package with the following command:
pip install IDRCloudClient
Code Examples
Below is a basic code example for converting PDF files to HTML. Additional configuration options and advanced features are detailed below.
from IDRSolutions import IDRCloudClient

client = IDRCloudClient('https://cloud.idrsolutions.com/cloud/' + IDRCloudClient.BUILDVU)

try:
    result = client.convert(
        # token='Token',  # Required only when connecting to the IDRsolutions trial and cloud subscription service
        input=IDRCloudClient.UPLOAD,
        file='/path/to/exampleFile.pdf'
    )

    outputURL = result['downloadUrl']

    client.downloadResult(result, 'path/to/output/dir')

    if outputURL is not None:
        print("Download URL: " + outputURL)

except Exception as error:
    print(error)
Return result to a callback URL
The BuildVu Microservice supports a callback URL to send the status of a conversion on completion. Using a callback URL eliminates the need to continually check the service for updates. You can provide the callback URL to the `convert` method as shown below:
result = client.convert(
    # token='Token',  # Required only when connecting to the IDRsolutions trial and cloud subscription service
    input=IDRCloudClient.UPLOAD,
    callbackUrl='http://listener.url',
    file='/path/to/exampleFile.pdf'
)
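On the receiving side, the callback URL needs something listening for the request. The sketch below is a minimal standard-library listener, written under the assumption that the service sends an HTTP POST with a JSON body; check the BuildVu documentation for the actual payload. The names CallbackHandler, make_server, and received are illustrative, and the default port is arbitrary.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

received = []  # parsed callback payloads, newest last

class CallbackHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read exactly the number of bytes the sender announced.
        length = int(self.headers.get('Content-Length', 0))
        body = self.rfile.read(length)
        try:
            payload = json.loads(body.decode('utf-8'))
        except ValueError:
            # Not JSON after all: keep the raw text so nothing is lost.
            payload = {'raw': body.decode('utf-8', errors='replace')}
        received.append(payload)
        self.send_response(200)
        self.send_header('Content-Length', '0')
        self.end_headers()

    def log_message(self, format, *args):
        pass  # silence the default per-request logging

def make_server(port=8000):
    """Create (but do not start) a listener bound to localhost."""
    return HTTPServer(('127.0.0.1', port), CallbackHandler)
```

Calling make_server().serve_forever() starts the listener. Binding to 127.0.0.1 is only suitable for local experiments; a real deployment needs an address the conversion service can reach.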
Configuration Options
The BuildVu API allows for conversion customization using a stringified JSON object with key-value pair configuration options. Provide these settings to the convert method. A comprehensive list of options for converting PDF files to HTML is available in the BuildVu documentation.
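Because the settings value is a JSON string, one option is to keep the options in a regular dict and serialize it with json.dumps, which avoids quoting mistakes. This is only a sketch: build_settings is a hypothetical helper, and the option name shown is a placeholder, not a real BuildVu key.

```python
import json

def build_settings(options):
    """Serialize a dict of conversion options into a compact JSON string."""
    # JSON objects need string keys, so fail early on anything else.
    for key in options:
        if not isinstance(key, str):
            raise TypeError(f'Option names must be strings, got {key!r}')
    return json.dumps(options, separators=(',', ':'))

# Placeholder option name; substitute keys from the BuildVu options list.
settings = build_settings({'exampleOption': 'value'})
print(settings)  # {"exampleOption":"value"}
```

The resulting string is then passed to the convert method as the settings argument, in the form: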
settings='{"key":"value","key":"value"}'
Upload by URL
In addition to uploading a local file, you can provide a URL for the BuildVu Microservice to download and convert. Simply replace the input and file values in the convert method with the following.
input=IDRCloudClient.DOWNLOAD,
url='http://exampleURL/exampleFile.pdf'
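If the same script has to handle both local files and remote PDFs, the choice between the two input modes can be made in one place. The helper below is a hypothetical sketch: build_input_args is not part of IDRCloudClient, and the two mode values are passed in by the caller so the function itself has no dependency on the package.

```python
from urllib.parse import urlparse

def build_input_args(source, upload_mode, download_mode):
    """Return the input-related keyword arguments for a convert call."""
    scheme = urlparse(source).scheme.lower()
    if scheme in ('http', 'https'):
        # Remote PDF: ask the service to download it from the URL.
        return {'input': download_mode, 'url': source}
    # Anything else is treated as a local path to upload.
    return {'input': upload_mode, 'file': source}

# Intended usage (requires the IDRCloudClient package and a client instance):
# args = build_input_args(src, IDRCloudClient.UPLOAD, IDRCloudClient.DOWNLOAD)
# result = client.convert(**args)
```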
Using Authentication
For deployments of your own BuildVu Microservice that require a username and password for PDF-to-HTML conversions, provide these credentials with each conversion. Pass a variable named auth to the convert method as demonstrated below.
auth=('username', 'password')
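Hardcoding credentials in a script is easy to get wrong, so a common alternative is to read them from environment variables. The sketch below shows one way to do that; load_auth and the variable names BUILDVU_USERNAME and BUILDVU_PASSWORD are assumptions made for illustration, not names the service defines.

```python
import os

def load_auth(env=None):
    """Return a (username, password) tuple from the environment, or None."""
    env = os.environ if env is None else env
    # Hypothetical variable names; use whatever your deployment defines.
    username = env.get('BUILDVU_USERNAME')
    password = env.get('BUILDVU_PASSWORD')
    if username and password:
        return (username, password)
    return None
```

When load_auth() returns a tuple, it can be passed to the convert method as the auth argument; when it returns None, the argument can simply be left out.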
Which Method for PDF to HTML Conversion using Python?
If you're a student or a solo developer looking to set up a quick PDF to HTML automation in Python, then PyMuPDF or the pdftohtml CLI tool might be the better option for you. However, if you're looking to integrate a commercially viable solution, BuildVu would be the more sensible option, as it has a dedicated support team of developers (the same ones who originally created the library).
BuildVu allows you to
| View PDF files in a Web app |
| Convert PDF documents to HTML5 |
| Parse PDF documents as HTML |
What is BuildVu?
BuildVu is a commercial SDK for converting PDF files into standalone HTML or SVG.
Why use BuildVu?
BuildVu allows you to integrate PDF into your HTML workflow effortlessly and securely by producing clean HTML that is easy for developers to work with.
What licenses are available?
We have 3 licenses available:
Cloud, for conversion using the shared IDRsolutions cloud server; Self-hosted Server, for your own cloud or on-premise servers; and Enterprise, for more demanding requirements.
How to use BuildVu?
If you want to learn more about BuildVu and how to use it, we have plenty of tutorials and guides to help you.