More than once, clients have come to us reporting that JPedal tells them their files are corrupt. This usually stems from a common misconception about the PDF format: unlike HTML, whitespace and special characters matter. You need to treat a PDF file as a binary object (blob), not as text.
Different platforms, by default, use different character encodings. Because of this, when Java reads in a file it believes to be text, it decodes the bytes into characters using one of those encodings, and any bytes that do not fit the encoding can be altered or replaced along the way. This is fantastic if you actually are dealing with text, but not so good if you're dealing with binary data.
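To see the effect without touching a file, here is a minimal sketch that pushes a few non-text bytes through a String and back. The class name, method names and sample bytes are ours, chosen purely for illustration. The two charsets stand in for "two platforms with different defaults".

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class EncodingDemo {
    // Four high-bit bytes of the kind binary data is full of (illustrative values).
    static byte[] sample() {
        return new byte[] {(byte) 0xE2, (byte) 0xE3, (byte) 0xCF, (byte) 0xD3};
    }

    // Decode the bytes as text, then encode that text again with the same charset.
    static byte[] roundTrip(byte[] data, Charset charset) {
        return new String(data, charset).getBytes(charset);
    }

    public static void main(String[] args) {
        byte[] original = sample();

        // The same bytes become different text depending on the encoding assumed.
        String asLatin1 = new String(original, StandardCharsets.ISO_8859_1);
        String asUtf8 = new String(original, StandardCharsets.UTF_8);
        System.out.println("Same text under both encodings? " + asLatin1.equals(asUtf8));

        // These bytes are not valid UTF-8, so decoding them as UTF-8 loses information.
        byte[] viaUtf8 = roundTrip(original, StandardCharsets.UTF_8);
        System.out.println("Bytes survive a UTF-8 round trip? " + Arrays.equals(original, viaUtf8));
    }
}
```

The point is only that decoding is not a neutral operation: once bytes have been interpreted as characters, there is no guarantee you can get the original bytes back.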
Because PDF files often contain raw image data, and the size of each section of the file is declared up front, if any characters are changed, added or removed, the declared sizes no longer match the data and JPedal ends up reading from the wrong place in the file. This is what causes the error message saying the file is corrupt: by the time it gets to JPedal, it is!
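The knock-on effect of a declared size can be sketched with a toy format. To be clear, this is not how PDF is laid out and it is not JPedal code; it is a made-up layout (one length byte, a payload, then the marker "END") that shows why changing a few bytes makes everything after them land in the wrong place.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class LengthShiftDemo {
    // Hypothetical layout: [1 length byte][payload of that length]["END"].
    static byte[] sample() {
        return new byte[] {4, (byte) 0xFF, (byte) 0xD8, (byte) 0xFF, (byte) 0xE0, 'E', 'N', 'D'};
    }

    // Trust the declared length, skip the payload, and check the marker is where it should be.
    static boolean markerFollowsPayload(byte[] data) {
        int declaredLength = data[0] & 0xFF;
        int markerStart = 1 + declaredLength;
        if (markerStart + 3 > data.length) {
            return false;
        }
        byte[] marker = Arrays.copyOfRange(data, markerStart, markerStart + 3);
        return Arrays.equals(marker, "END".getBytes(StandardCharsets.US_ASCII));
    }

    // Simulate the data having been read and written as text somewhere along the way.
    static byte[] roundTripAsText(byte[] data) {
        return new String(data, StandardCharsets.UTF_8).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] intact = sample();
        byte[] mangled = roundTripAsText(intact);
        System.out.println("Intact data parses:  " + markerFollowsPayload(intact));
        System.out.println("Mangled data parses: " + markerFollowsPayload(mangled));
    }
}
```

The length byte itself survives the round trip, which is exactly the problem: the reader still trusts it, but the payload it describes is no longer that size.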
Unfortunately there’s nothing we can do about this slightly counter-intuitive way of doing things – it doesn’t help that the Java class which converts characters is called FileReader, while the class which doesn’t is called FileInputStream. Not the clearest of names!
So, for the record, the correct way of reading a PDF file into a byte array is the following:
import java.io.File;
import java.io.FileInputStream;

//Set up stream
File file = new File(filename);
byte[] pdf = new byte[(int) file.length()];

//Read file into byte array, one raw byte at a time (no character conversion)
try (FileInputStream stream = new FileInputStream(file)) {
    int a;
    int count = 0;
    while ((a = stream.read()) != -1) {
        pdf[count] = (byte) a;
        count++;
    }
}
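If you are on Java 7 or later, the standard library can do the same job in one call. The sketch below uses Files.readAllBytes, which returns the file's bytes with no character conversion. The class and method names are our own, and the header check is only a rough sanity test (it assumes the file begins with the usual "%PDF-" signature), not a validity check.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

class PdfBytes {
    // Read the whole file as raw bytes; nothing is decoded as text.
    static byte[] readPdf(String filename) throws IOException {
        return Files.readAllBytes(Paths.get(filename));
    }

    // Rough sanity check: does the data begin with the "%PDF-" signature?
    static boolean looksLikePdf(byte[] data) {
        byte[] header = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        if (data.length < header.length) {
            return false;
        }
        for (int i = 0; i < header.length; i++) {
            if (data[i] != header[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        byte[] pdf = readPdf(args[0]);
        System.out.println(pdf.length + " bytes read, starts with %PDF-: " + looksLikePdf(pdf));
    }
}
```

Comparing the length of the array with the size of the file on disk is another quick way to confirm nothing was converted on the way in.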