Corrupt PDFs? Maybe this is your problem.

More than once, clients have come to us reporting that JPedal tells them their files are corrupt. This often comes down to a common misconception about the PDF format: unlike HTML, whitespace and special characters matter, because a PDF has to be treated as binary data and not as text.

Different platforms, by default, use different character encodings. Because of this, when Java reads in a file it believes to be text, it converts the bytes into characters using the default encoding. This is fantastic if you actually are dealing with text, but not so good if you’re dealing with binary data, where some byte values have no meaning in that encoding and get changed along the way.
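
To make that concrete, here is a minimal sketch of what character conversion does to binary data. It assumes UTF-8 as the encoding (your platform default may differ), and the class name, method name and sample bytes are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class CharsetRoundTrip {

    // Decode raw bytes as UTF-8 text, then encode that text back to bytes.
    // This mimics what happens when binary data passes through a text reader.
    static byte[] roundTripAsUtf8(byte[] raw) {
        String asText = new String(raw, StandardCharsets.UTF_8);
        return asText.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Plain ASCII, such as a PDF header, survives the trip unchanged.
        byte[] ascii = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        System.out.println(Arrays.equals(ascii, roundTripAsUtf8(ascii))); // prints true

        // Made-up binary bytes of the kind found in raw image data.
        // 0xFF is never valid in UTF-8, so it cannot come back out unchanged.
        byte[] binary = {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF, (byte) 0xE0};
        byte[] mangled = roundTripAsUtf8(binary);
        System.out.println(Arrays.equals(binary, mangled)); // prints false
        System.out.println(binary.length + " bytes in, " + mangled.length + " bytes out");
    }
}
```

The text parts of a PDF come through fine, which is why the problem is easy to miss until a file with embedded binary data turns up.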

Because PDFs often contain raw image data, and the size of each section of the file is specified at the start of that section, if any characters are changed or removed JPedal ends up reading from the wrong place in the file. This is what causes the error message saying the file is corrupt – because by the time it gets to JPedal, it is!
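
The effect of a declared size no longer matching the data can be sketched with a toy example. The layout below (a length, a newline, the payload, then an end marker) is invented for illustration and is much simpler than real PDF syntax. It is not how JPedal parses files, and all names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Toy illustration only: a made-up length-prefixed layout, not real PDF syntax.
class LengthDrift {

    static final byte[] END = "endstream".getBytes(StandardCharsets.US_ASCII);

    // Builds "<payload length>\n<payload>endstream".
    static byte[] buildSection(byte[] payload) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] header = (payload.length + "\n").getBytes(StandardCharsets.US_ASCII);
        out.write(header, 0, header.length);
        out.write(payload, 0, payload.length);
        out.write(END, 0, END.length);
        return out.toByteArray();
    }

    // Reads the declared length, skips that many bytes,
    // and checks that the end marker sits exactly where the length says it should.
    static boolean endMarkerWhereExpected(byte[] section) {
        int newline = 0;
        while (newline < section.length && section[newline] != '\n') {
            newline++;
        }
        if (newline == section.length) {
            return false; // no header line found
        }
        int declared;
        try {
            declared = Integer.parseInt(
                    new String(section, 0, newline, StandardCharsets.US_ASCII));
        } catch (NumberFormatException e) {
            return false; // header is not a number
        }
        int from = newline + 1 + declared;
        int to = from + END.length;
        if (declared < 0 || to > section.length) {
            return false; // declared length points past the end of the data
        }
        return Arrays.equals(Arrays.copyOfRange(section, from, to), END);
    }

    // Simulates the file having been read as UTF-8 text and written back out.
    static byte[] roundTripAsUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] payload = {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF, (byte) 0xE0};
        byte[] intact = buildSection(payload);
        byte[] mangled = roundTripAsUtf8(intact);

        System.out.println(endMarkerWhereExpected(intact));  // prints true
        System.out.println(endMarkerWhereExpected(mangled)); // prints false
    }
}
```

The header still says the payload is 4 bytes long, but the payload itself has changed size, so everything after it is in the wrong place.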

Unfortunately there’s nothing we can do about this slightly counter-intuitive way of doing things – it doesn’t help that the Java class which converts characters is called FileReader, while the class which doesn’t is called FileInputStream. Not the clearest of names!

So, for the record, the correct way of reading a PDF file into a byte array is the following:

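import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

//Set up the file and a byte array big enough to hold it
File file = new File(filename);
byte[] pdf = new byte[(int) file.length()];

//Read file into byte array
//FileInputStream reads raw bytes, with no character conversion
//try-with-resources closes the stream even if a read fails
try (FileInputStream stream = new FileInputStream(file)) {
    int count = 0;
    int read;
    //read() may return fewer bytes than requested, so loop until the array is full
    while (count < pdf.length
            && (read = stream.read(pdf, count, pdf.length - count)) != -1) {
        count += read;
    }
}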
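
If you are on Java 7 or later, the standard library can do the same job in one call. The sketch below is an alternative to the loop above, not a replacement for it. The class and method names are made up for illustration, and `survivesRoundTrip` only exists to show that binary bytes come back unchanged.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

class ReadPdfBytes {

    // Reads the whole file as raw bytes; no character conversion takes place.
    static byte[] readPdf(String filename) throws IOException {
        return Files.readAllBytes(Paths.get(filename));
    }

    // Writes the given bytes to a temporary file, reads them back with readPdf,
    // and reports whether they came back byte-for-byte identical.
    static boolean survivesRoundTrip(byte[] data) {
        try {
            Path temp = Files.createTempFile("sample", ".pdf");
            try {
                Files.write(temp, data);
                return Arrays.equals(data, readPdf(temp.toString()));
            } finally {
                Files.deleteIfExists(temp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Made-up binary bytes, including values that are not valid text.
        byte[] binary = {(byte) 0xFF, (byte) 0xD8, 0x00, 0x0D, 0x0A, (byte) 0xE0};
        System.out.println(survivesRoundTrip(binary)); // prints true
    }
}
```

Like the loop, this holds the entire file in memory, so it suits files that comfortably fit in the heap.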

This post is part of our “Understanding the PDF File Format” series. In each article, we discuss a PDF feature, bug, gotcha or tip. If you wish to learn more about PDF, we have 13 years’ worth of PDF knowledge and tips in our series index.

About Sam Howard

Sam is a developer at IDRsolutions who mostly specialises in font support and conversion. He’s also enjoyed working with Java 3D, Java FX and Swing. His other interests include music and game design.
