TL;DR
True PDF redaction in Java requires two things: hiding the text visually and removing it from the content stream. This tutorial shows how to do both with JPedal in under 20 lines of code.
Why remove text from a PDF file?
Removing text from a PDF in Java is a common requirement when dealing with sensitive information such as names, email addresses, phone numbers, and other personally identifiable information. Whether you are meeting GDPR redaction obligations, preparing documents for external sharing, or sanitising files before archiving, this tutorial explains how to do it using the JPedal PDF library.
What redaction actually means
Removing text from a PDF is a two-part problem. First, you find the text. Then you redact it, which itself has two layers:
- Hide the text visually, usually done by drawing an opaque box over it
- Remove it from the underlying content stream so it cannot be extracted by a PDF reader or copy-paste
Both steps are critical. Drawing a black box without editing the content stream is not true redaction. The text is still there, just invisible, and people will be able to copy and paste it. JPedal handles both steps, and together these are called redaction.
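To make the difference concrete, here is a minimal sketch in plain Java that models a page as an uncompressed content stream string. It is a toy illustration only, not how JPedal works internally: the class and method names are hypothetical, the "extractor" is a simple regex over the Tj (show text) operator, and real content streams are usually compressed and can encode text in several other ways.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy model (hypothetical names): compares covering text with removing it.
public class CoverUpVersusRedaction {

    // Matches the literal-string operand of a Tj operator, e.g. "(Secret) Tj".
    private static final Pattern SHOW_TEXT = Pattern.compile("\\(([^)]*)\\)\\s*Tj");

    // Naive stand-in for a text extractor: collects every string shown with Tj.
    static List<String> extractText(String contentStream) {
        List<String> found = new ArrayList<>();
        Matcher m = SHOW_TEXT.matcher(contentStream);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    // Cover-up only: append a filled black rectangle and leave the text operators alone.
    static String coverOnly(String contentStream) {
        return contentStream + "\n0 0 0 rg 70 695 120 16 re f";
    }

    // Redaction in this toy model: delete the operator that shows the target, then draw the box.
    static String redact(String contentStream, String target) {
        String stripped = contentStream.replace("(" + target + ") Tj", "");
        return coverOnly(stripped);
    }

    public static void main(String[] args) {
        String page = "BT /F1 12 Tf 72 700 Td (jane@example.com) Tj ET";
        System.out.println(extractText(coverOnly(page)));                 // prints [jane@example.com]
        System.out.println(extractText(redact(page, "jane@example.com"))); // prints []
    }
}
```

In the cover-only case the email address is still recoverable because the operator that draws it was never touched. Only the second variant removes it from the stream.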
Choosing a Java PDF library for text removal
Most developers reach for Apache PDFBox first, but programmatically removing text from a PDF in Java, rather than just drawing over it, requires direct access to the content stream. JPedal exposes this through a clean API, handling both the search and the redaction in a few lines of code without manual stream manipulation.
Find, delete and redact text from a PDF in Java using JPedal
Open the PDF, scan each page for the target text, redact every match, then write out the modified document. The key methods are findTextOnPage() to locate matches and redact() to remove them. pdf.apply() commits the redaction operations to the document before writing.
- Download the JPedal trial jar
- Create a File handle to the PDF file
- Supply a password if the file is password-protected
- Open the PDF file
- Scan the pages for text
- Redact each match
- Write the output and close
import java.io.File;
import org.jpedal.examples.text.FindTextInRectangle;
import org.jpedal.grouping.SearchType;
import org.jpedal.manipulator.PdfManipulator;

final File inputFile = new File("inputFile.pdf");

// One object searches the text, the other edits the document
final FindTextInRectangle extract = new FindTextInRectangle(inputFile);
final PdfManipulator pdf = new PdfManipulator();
pdf.loadDocument(inputFile);

if (extract.openPDFFile()) {
    final int pageCount = extract.getPageCount();
    // Pages are numbered from 1
    for (int page = 1; page <= pageCount; page++) {
        final float[] coords = extract.findTextOnPage(page, "the", SearchType.MUTLI_LINE_RESULTS);
        // Each match occupies 5 values: x1, y1, x2, y2 and one extra value
        for (int val = 0; val < coords.length; val = val + 5) {
            pdf.redact(page, new float[] {coords[val], coords[val + 1], coords[val + 2], coords[val + 3]});
        }
    }
}
extract.closePDFfile();

// Apply the queued redactions, then write out a new file
pdf.apply();
final File outputFile = new File("redactedFile.pdf");
pdf.writeDocument(outputFile);
pdf.closeDocument();
findTextOnPage() returns a flat float array with five values per match: the coordinates x1, y1, x2, y2, followed by a fifth value at index 4 (a magic number covered in the JPedal documentation). This is why the loop increments by 5.
The output is a new PDF with every instance of the search term permanently removed from both the visual layer and the content stream. The original file is not modified unless you overwrite it. Add try-catch blocks around the file operations and PDF calls for production use.
For other PDF text manipulation tasks in Java, such as extracting, searching, or modifying content programmatically, see the JPedal tutorials.
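The five-values-per-match layout can be unpacked in a small helper before anything is passed on for redaction. The sketch below is plain Java with no JPedal dependency. The class name, method name, and sample coordinates are made up for illustration, and it assumes the layout described above (x1, y1, x2, y2, then one extra value).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: splits a flat match array into 4-value rectangles.
public class MatchRectangles {

    // Assumed layout per match: x1, y1, x2, y2, then one extra value.
    private static final int VALUES_PER_MATCH = 5;

    static List<float[]> toRectangles(float[] coords) {
        List<float[]> rectangles = new ArrayList<>();
        if (coords == null) {
            return rectangles; // treat "no result" as "no matches"
        }
        // Stop before any incomplete trailing group so no index goes out of bounds.
        for (int i = 0; i + VALUES_PER_MATCH <= coords.length; i += VALUES_PER_MATCH) {
            rectangles.add(new float[] {coords[i], coords[i + 1], coords[i + 2], coords[i + 3]});
        }
        return rectangles;
    }

    public static void main(String[] args) {
        // Two made-up matches; the fifth value in each group is ignored here.
        float[] coords = {72f, 700f, 90f, 688f, 0f, 150f, 650f, 168f, 638f, 0f};
        for (float[] rect : toRectangles(coords)) {
            System.out.println(Arrays.toString(rect));
        }
    }
}
```

Each returned array has the four-value shape that the redact() call in the listing above receives. The bounds check in the loop is a defensive choice in case the array length is ever not a multiple of five.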
