A PDF file can become corrupted (or broken) in several ways.
Common issues with corrupted PDF files
- Broken xref table
- Corrupted COS objects
- Content added to the start (which also breaks the xref table)
- Content added to the end of the file (the %%EOF end-of-file marker is expected to be within the last 1024 bytes of the file).
- Truncated file with content deleted at the end (often the critical Catalog).
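Several of these problems can be spotted by looking at the raw bytes. The following is a minimal sketch, not a full validator. The function name `diagnose_pdf` and the message strings are made up for illustration, and the checks only cover the header position, the %%EOF marker and the `startxref` keyword.

```python
def diagnose_pdf(data: bytes) -> list[str]:
    """Return a list of structural problems spotted in raw PDF bytes."""
    problems = []

    # The %PDF-x.y header should sit at byte 0; anything before it
    # shifts every object away from the offset recorded in the xref table.
    header_pos = data.find(b"%PDF-")
    if header_pos == -1:
        problems.append("no %PDF- header found")
    elif header_pos > 0:
        problems.append(f"{header_pos} bytes of content before the header")

    # The %%EOF marker is expected within the last 1024 bytes of the file.
    eof_pos = data.rfind(b"%%EOF")
    if eof_pos == -1:
        problems.append("no %%EOF marker (file may be truncated)")
    elif len(data) - eof_pos > 1024:
        problems.append("content added after the %%EOF marker")

    # The startxref keyword tells the parser where the xref table begins.
    if data.rfind(b"startxref") == -1:
        problems.append("no startxref keyword (file may be truncated)")

    return problems
```

An empty list only means that none of these particular checks failed. It does not prove the file is valid.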
Why is the xref table so important?
A PDF file contains a map at the end of the file (the xref table) that lists the byte offset of every COS object. This makes for very fast access, because the parser can jump straight to any object. But if these values are wrong, the PDF parser has to scan the whole file itself and try to work out where each object is.
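To make the lookup concrete, here is a rough sketch of reading a classic xref table and checking one of its offsets. It assumes a single plain-text xref section (no xref streams and no chain of incremental updates), and the names `read_xref_offsets` and `offset_is_valid` are hypothetical helpers, not part of any library.

```python
import re


def read_xref_offsets(data: bytes) -> dict[int, int]:
    """Read a classic (non-stream) xref table: object number -> byte offset."""
    # startxref, near the end of the file, holds the offset of the xref table.
    found = re.findall(rb"startxref\s+(\d+)", data[-1024:])
    if not found:
        raise ValueError("startxref not found")
    pos = int(found[-1])
    if data[pos:pos + 4] != b"xref":
        raise ValueError("startxref does not point at an xref table")

    offsets = {}
    lines = data[pos:].splitlines()
    i = 1  # skip the "xref" keyword line
    while i < len(lines) and not lines[i].startswith(b"trailer"):
        # Subsection header: first object number and number of entries.
        first, count = (int(v) for v in lines[i].split())
        for n in range(count):
            offset, _generation, kind = lines[i + 1 + n].split()
            if kind == b"n":  # "n" = in use, "f" = free
                offsets[first + n] = int(offset)
        i += 1 + count
    return offsets


def offset_is_valid(data: bytes, number: int, offset: int) -> bool:
    """True if a header such as '3 0 obj' really starts at the given offset."""
    header = rb"%d\s+\d+\s+obj" % number
    return re.match(header, data[offset:offset + 32]) is not None
```

If `offset_is_valid` returns `False` for an entry, the table no longer matches the file and the parser cannot trust it.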
How to break a PDF file?
The easiest way to break a PDF file is to open it in a text editor and resave it. The editor will typically change line endings or re-encode binary data, which shifts the byte offsets of the objects and breaks the xref table.
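A small illustration of the effect, assuming the editor rewrites LF line endings as CRLF on save (the byte strings here are a made-up fragment, not a complete PDF):

```python
original = b"%PDF-1.4\n1 0 obj\n<< >>\nendobj\n2 0 obj\n<< >>\nendobj\n"

# Simulate an editor that converts every LF line ending to CRLF on save.
resaved = original.replace(b"\n", b"\r\n")

offset_before = original.find(b"2 0 obj")
offset_after = resaved.find(b"2 0 obj")

print(offset_before)  # 30
print(offset_after)   # 34: one extra byte for each of the four earlier line endings
```

An xref table written for the original file would still say object 2 starts at byte 30, which now lands in the middle of other content.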
Can I still use a corrupt PDF file?
Many PDF parsers will attempt to handle corrupt PDF files. There are no standards on how to implement this.
Our PDF parser will try to work out a file's xref table by scanning the file if needed (which is much slower than just reading the xref table). We also make a lot of allowances for missing or additional content and wrong values.
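The general idea of rebuilding offsets by scanning can be sketched as follows. This is an illustration of the technique only, not how any particular parser implements it. The name `rebuild_offsets` is hypothetical, and a real implementation would also need to cope with object headers that happen to appear inside binary stream data.

```python
import re

# Matches an object header such as "12 0 obj" at the start of the data or of a line.
OBJ_HEADER = re.compile(rb"(?:^|[\r\n])(\d+)\s+(\d+)\s+obj\b")


def rebuild_offsets(data: bytes) -> dict[int, int]:
    """Scan the whole file and map object number -> byte offset of its header."""
    offsets = {}
    for match in OBJ_HEADER.finditer(data):
        # If an object number appears more than once, the later copy wins.
        offsets[int(match.group(1))] = match.start(1)
    return offsets
```

Because every byte of the file has to be examined, this is far slower than jumping to the offsets listed in an intact xref table.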
Ideally you should stick to the PDF file format specification.
How to repair a corrupt PDF file?
Adobe Acrobat will attempt to repair a broken PDF file (if possible) and allow you to resave the fixed version.