PDF is one of the most common formats for sharing documents. PDF files are portable and universally supported, but they can also contain hidden content and functionality that pose security or privacy risks.
Malicious actors can use PDF files to deliver malware through embedded JavaScript or file attachments, and private information may be present in document metadata even after the visible content has been removed. To ensure your PDF is safe to share, or that a third-party PDF is safe to process, you should sanitize it.
In this post, we will cover three aspects of PDF sanitization and show how to handle each one using our Java PDF toolkit, JPedal.
Setup
For all of these examples, we will be using JPedal’s PDF Manipulator API.
You can get started with the following code snippet:
final PdfManipulator pdf = new PdfManipulator();
pdf.loadDocument(new File("inputFile.pdf"));
// Add each sanitization operation before calling apply
pdf.apply();
pdf.writeDocument(new File("outputFile.pdf"));
pdf.closeDocument();
pdf.reset();
Removing Embedded JavaScript
JavaScript (also known as ECMAScript) is often used in PDF files for form validation or dynamic behavior, but it can also be used maliciously. A web search for PDF JavaScript vulnerabilities turns up many past examples.
To remove JavaScript with JPedal, call the removeJavaScript method:
pdf.removeJavaScript();
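If you want a rough, independent check of whether a file still mentions JavaScript, one option is to scan its raw bytes for the /JS and /JavaScript names. The sketch below is not part of JPedal; the class and method names are made up for illustration. It also decodes #xx hex escapes, since a name can be written as /J#53 instead of /JS. It is only a heuristic: anything stored inside a compressed object stream is invisible to it, so a negative result is not proof that the file is clean.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not a JPedal class: a heuristic scan of raw PDF bytes.
class PdfJavaScriptScan {

    // Decodes #xx hex escapes so that "/J#53" becomes "/JS".
    // For simplicity this is applied to the whole text, not only to name tokens.
    static String decodeNameEscapes(String raw) {
        StringBuilder out = new StringBuilder(raw.length());
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c == '#' && i + 2 < raw.length()) {
                int hi = Character.digit(raw.charAt(i + 1), 16);
                int lo = Character.digit(raw.charAt(i + 2), 16);
                if (hi >= 0 && lo >= 0) {
                    out.append((char) (hi * 16 + lo));
                    i += 2; // skip the two hex digits just consumed
                    continue;
                }
            }
            out.append(c);
        }
        return out.toString();
    }

    // Whitespace, control characters and PDF delimiter characters end a name.
    static boolean isDelimiter(char c) {
        return c <= ' ' || "()<>[]{}/%".indexOf(c) >= 0;
    }

    // True if the name occurs as a complete token, so "/JS" does not match "/JSomething".
    static boolean containsName(String text, String name) {
        int from = 0;
        while (true) {
            int at = text.indexOf(name, from);
            if (at < 0) {
                return false;
            }
            int end = at + name.length();
            if (end == text.length() || isDelimiter(text.charAt(end))) {
                return true;
            }
            from = at + 1;
        }
    }

    // ISO-8859-1 maps every byte to one char, so binary data survives the conversion.
    static boolean looksLikeItHasJavaScript(byte[] pdfBytes) {
        String text = decodeNameEscapes(new String(pdfBytes, StandardCharsets.ISO_8859_1));
        return containsName(text, "/JS") || containsName(text, "/JavaScript");
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("Usage: java PdfJavaScriptScan <file.pdf>");
            return;
        }
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(looksLikeItHasJavaScript(bytes)
                ? "JavaScript markers found"
                : "No JavaScript markers found in uncompressed objects");
    }
}
```

Running it on the input and output files gives a quick before-and-after comparison, with the caveat about compressed objects in mind.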
Removing Embedded Files and Attachments
A PDF file can carry other files inside it, such as images, ZIP archives, or executables. While often legitimate, these attachments can be used to hide malware.
To remove embedded files with JPedal, call the removeEmbeddedFiles method:
pdf.removeEmbeddedFiles();
Removing Metadata
PDF metadata often contains hidden sensitive information such as author names, software versions, or even GPS location data if the document came from a camera or mobile device.
To remove metadata with JPedal, call the removeMetadata method:
pdf.removeMetadata();
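To get a feel for where this information lives, the sketch below looks through a file's raw bytes for a few common document information keys (/Author, /Creator, /Producer) and for the header of an XMP packet. It is a plain-JDK illustration with made-up names, not a JPedal feature, and it makes no assumption about which entries removeMetadata removes or rewrites. The match is a simple substring search, so it can report keys that belong to other dictionaries, and it cannot see inside compressed streams.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not a JPedal class: lists metadata markers visible in raw PDF bytes.
class PdfMetadataScan {

    // A few document information dictionary keys, plus the XMP packet header.
    private static final String[] MARKERS = {
        "/Author", "/Creator", "/Producer", "<?xpacket"
    };

    // Returns the markers that appear in the file, in the order listed above.
    static List<String> findMetadataMarkers(byte[] pdfBytes) {
        // ISO-8859-1 maps every byte to one char, so binary data survives the conversion.
        String text = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        List<String> found = new ArrayList<>();
        for (String marker : MARKERS) {
            if (text.contains(marker)) {
                found.add(marker);
            }
        }
        return found;
    }
}
```

A non-empty list on a file you are about to share is a hint to inspect it more closely rather than a definitive finding.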
Learn more about the PDF Manipulator API.
As developers who have been working with the PDF format for more than two decades, we can help you understand it better!