Check if a PDF is valid using the HTML5 File API

Using the HTML5 File API to validate PDF Files

Recently, in my own time, I have been familiarizing myself with the HTML5 File API. It allows you to create web applications that read files entirely on the client side, without needing to upload anything. This is useful if you want to read a local file and display its data (e.g. a picture slideshow or a spreadsheet application), or if you want to check that a file conforms to some standard before letting the user upload it to a server. This blog post is an example of the latter. It is geared towards PDF files since that’s what we specialize in, but the code can be applied to other file types and should be easy enough to follow.

Exposition

Below is the example situation we will be using:

We offer a cloud-based service for our PDF to HTML5 converter that lets you easily integrate it into your own application (be it a web app, desktop or server). This service lets you send a PDF to us and get the converted files back. But what if you unknowingly send us an invalid PDF? You won’t get a reply and will have wasted time and bandwidth on a duff file. Obviously you’d want to prevent this situation from happening.

Understanding what makes a valid PDF

First you need to know what a valid PDF is. We have a whole plethora of blog articles detailing the PDF specification and what a PDF actually is, but for now let’s start with something simple that we can use to check whether a file is a valid PDF: the file header. (If you really want to understand what makes a PDF valid, try the Making your own PDF file set of blog posts by Daniel.)

According to the PDF Specification (section 7.5.2):

The first line of a PDF file shall be a header consisting of the 5 characters %PDF- followed by a version number of the form 1.N, where N is a digit between 0 and 7.
A conforming reader shall accept files with any of the following headers:
%PDF-1.0
%PDF-1.1
%PDF-1.2
%PDF-1.3
%PDF-1.4
%PDF-1.5
%PDF-1.6
%PDF-1.7

This is simple enough to test for using the HTML5 File API, and we can expand upon the test later so that it catches other things required of a PDF.
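
As a minimal sketch of that header test, assuming the first few characters of the file are already available as a string, a helper could look like the following. The name looksLikePdfHeader is hypothetical and only used for illustration.

```javascript
// Hypothetical helper: checks whether the start of a file matches the
// header described in section 7.5.2 (%PDF-1.0 through %PDF-1.7).
function looksLikePdfHeader(firstChars) {
	// ^ anchors the match to the start of the string, and the dot is
	// escaped so that it only matches a literal "."
	return /^%PDF-1\.[0-7]/.test(firstChars);
}

console.log(looksLikePdfHeader("%PDF-1.4")); // true
console.log(looksLikePdfHeader("GIF89a"));   // false
```

How the string is obtained from a real file is covered by the FileReader code later in this post.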

My First File API

Before we begin diving into our JavaScript code for the given situation it might be worth explaining the HTML5 File API:

Each <input> element with the type “file” has an attribute called files which is a FileList object that contains a list of the selected files associated with the input.

The FileList object is simply an interface that represents an array-like list of File objects. For an <input> that accepts a single file, the FileList has only one entry.

A File object is read-only and contains the time the file was last modified, its name, its size (in bytes) and its type.
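
To make those properties concrete, here is a small sketch that builds a one-line summary from them. The helper name describeFile is hypothetical, and it assumes the lastModified property (a timestamp in milliseconds) is available on the File object.

```javascript
// Hypothetical helper: summarises the read-only properties of a File object.
function describeFile(file) {
	// lastModified is a timestamp in milliseconds since the Unix epoch
	var modified = new Date(file.lastModified).toISOString();
	// type can be an empty string when the browser cannot determine it
	var type = file.type || "unknown type";
	return file.name + " (" + file.size + " bytes, " + type + ", modified " + modified + ")";
}

// In a page this would be called with an entry from a FileList, e.g.
// describeFile(document.getElementById("fileUpload").files[0]);
```

The function only reads properties, so it works with anything shaped like a File.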

To read the contents of a file you need a FileReader object, which provides the methods and callbacks for doing so. Most of the methods read the file in a specific way (for example as text or as an ArrayBuffer).

The Mozilla Developer Network has pages for each of these objects in case you want to learn more about them (as a resource in itself it’s great).
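
The FileReader callbacks mentioned above can be sketched roughly as follows. This is an illustration rather than the code used later in this post: the helper name readAsTextWithCallbacks and its createReader parameter are assumptions of mine, the latter added only so the reader can be swapped out.

```javascript
// Hypothetical helper: reads a Blob or File as text and reports the outcome
// through callback(error, text). createReader is an assumption made for this
// sketch; when it is omitted a real FileReader is created.
function readAsTextWithCallbacks(blob, callback, createReader) {
	var reader = createReader ? createReader() : new FileReader();
	// onload fires when the read finished successfully; result holds the text
	reader.onload = function () {
		callback(null, reader.result);
	};
	// onerror fires when the read failed; error describes what went wrong
	reader.onerror = function () {
		callback(reader.error, null);
	};
	reader.readAsText(blob);
}

// In a page:
// readAsTextWithCallbacks(file, function (err, text) {
// 	if (err) { alert("Could not read the file"); return; }
// 	alert("Read " + text.length + " characters");
// });
```

The other read methods, such as readAsArrayBuffer and readAsDataURL, follow the same pattern with a different kind of result.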

The Example Code

Now that you are a bit more familiar with the HTML5 File API it’s time to start programming with it.

First we will create a simple HTML file to use the API with:

Very short and sweet. The majority of our code is within the JavaScript source file.

When using any HTML5 API it’s always important to check that the API is supported by the browser. In the case of the HTML5 File API we do this by checking that the various objects I mentioned in the last section have constructors:

/**
 * Returns true if the HTML5 File API is supported by the browser
 * @returns {boolean}
 */
function supportsFileAPI() {
	// !! turns the result into a real boolean instead of returning a constructor
	return !!(window.File && window.FileReader && window.FileList && window.Blob);
}

The Blob object is another part of the File API. It is basically like a file; in fact, a File object can be seen as a kind of Blob with extra data associated with it.

Now that we have a way of checking if the browser supports the HTML5 File API we should add some event listeners to the <input> element on our page. I’ve done this using the addEventListener and/or attachEvent methods like so:

/**
 * Used to attach events to an element or object in a browser independent way
 * @param element
 * @param event
 * @param callbackFunction
 */
function attachEvent(element, event, callbackFunction) {
	if(element.addEventListener) {
		element.addEventListener(event, callbackFunction, false);
	}
	else if(element.attachEvent)  {
		element.attachEvent('on' + event, callbackFunction);
	}
}
 
function pageLoaded() {
	var fileInput = document.getElementById("fileUpload");
	if(supportsFileAPI()) {
		attachEvent(fileInput, "change", preUpload);
	}
	else {
		alert("Your browser does not support the HTML5 File API.");
	}
 
}
 
attachEvent(window, "load", pageLoaded);

This method will work on most modern browsers, and does not clutter up the HTML file.

The preUpload function runs whenever the user changes the contents of the <input> box, and it is where we make use of the HTML5 File API.

Before showing you its code, let’s work through what it needs to do:

  1. First it needs to get a reference to the File object in question.
  2. Then it needs to create a new FileReader object and assign an event handler to be called when the FileReader has loaded the File object.
  3. Then it needs to make the FileReader read the File object.

The first step is easy: the event’s target attribute gives us the <input> element, and from there we can get its FileList object and treat it like an array.

function preUpload(event) {
 
	// The file API supports the ability to reference multiple files in one <input> tag
	var file = event.target.files[0];

The next step is a little harder as it contains the logic to check whether the PDF file is actually a valid PDF file. I’ve done this using an anonymous function (actually I’ve done it using two) that reads the first 8 characters of the PDF File and checks that they conform to what the Specification says.

	var reader = new FileReader();
	// Uses two anonymous functions so we can pass the File object to the on load anonymous function.
	attachEvent(reader, "load", (function(fileToCheck) {
		return function (evt) {
			var data = evt.target.result.substring(0, 8); // This gets the first 8 characters of the file
			// Anchored with ^ and with the dot escaped, so only a header at the very start matches
			var regex = /^%PDF-1\.[0-7]/;
			if(regex.test(data)) {
				alert(fileToCheck.name + " is a valid PDF File.");
			}
		};
	})(file));

The last step is a very simple method call, except I’ve added an extra condition to warn you if you’re uploading a large PDF file (in this case anything above 10MB).

	var MBSize = file.size / 1024 / 1024;
	if(MBSize > 10) {
		if(!confirm(file.name + " is " + MBSize + "MB big, and may cause your browser to stop responding while it parses it.\nContinue?")) {
			return;
		}
	}
	reader.readAsText(file); // For now we shall read the file as if it were a text file
}

All put together the code looks like this:

Something to note about the HTML5 File API is that the read methods work on an entire File, so a particularly large file can cause problems. One way of solving this is to read only part of the File: we create a Blob of only the first 8 bytes of the file and pass that to the FileReader instead. This makes the last section of the code look like this:

	var blob = file.slice(0, 8);
	reader.readAsText(blob);
}
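
Since the header is plain ASCII, another option is to skip text decoding altogether and compare the raw bytes. The sketch below is an alternative to the readAsText approach above, not a replacement for it, and the helper name bytesLookLikePdfHeader is hypothetical. It assumes the 8-byte Blob has been read with readAsArrayBuffer.

```javascript
// Hypothetical helper: checks raw bytes against the ASCII codes of "%PDF-1."
// followed by a single digit from 0 to 7.
function bytesLookLikePdfHeader(buffer) {
	var bytes = new Uint8Array(buffer);
	var prefix = "%PDF-1.";
	// The header needs the prefix plus one version digit
	if (bytes.length < prefix.length + 1) {
		return false;
	}
	for (var i = 0; i < prefix.length; i++) {
		if (bytes[i] !== prefix.charCodeAt(i)) {
			return false;
		}
	}
	var minor = bytes[prefix.length];
	return minor >= 0x30 && minor <= 0x37; // ASCII "0" to "7"
}

// In a page, the buffer would come from a FileReader, for example:
// reader.onload = function (evt) { alert(bytesLookLikePdfHeader(evt.target.result)); };
// reader.readAsArrayBuffer(file.slice(0, 8));
```

Working on bytes means the check does not depend on how the browser decodes the file as text.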

Closing

It’s important to note that a valid header doesn’t mean the PDF itself is valid. There are several other things required for a PDF to be valid, viewable and able to be converted by our PDF2HTML5 converter. For more information you can look at the Understanding PDFs blog posts we have.
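
As one example of a further inexpensive check (my own suggestion of what could be tested next, not a description of what our converter does), the PDF Specification also says that the last line of the file shall contain only the end-of-file marker %%EOF (section 7.5.5). A rough sketch, with the hypothetical helper name tailHasEofMarker, assuming the end of the file has been read as text:

```javascript
// Hypothetical helper: checks that the text read from the end of a file
// finishes with the %%EOF marker, allowing trailing whitespace or newlines.
function tailHasEofMarker(tailText) {
	return /%%EOF\s*$/.test(tailText);
}

// In a page, the tail could be read by slicing from near the end of the file.
// The 1024 byte window is an arbitrary choice for this sketch:
// reader.readAsText(file.slice(Math.max(0, file.size - 1024)));
```

Like the header test, this only rules out obviously broken files. Some real-world PDFs have extra bytes after the marker, so a failed check here is better treated as a warning than as proof that the file is invalid.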

It is also important to note that the HTML5 File API is not supported everywhere as of yet and that the slice method mentioned may require a vendor specific prefix in some web browsers.
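
A small wrapper can hide the prefix difference. This is a sketch under the assumption that the prefixed variants are named webkitSlice and mozSlice, as they were in some older browsers; the helper name sliceBlob is hypothetical.

```javascript
// Hypothetical helper: calls slice, or a vendor-prefixed variant when only
// that is available on the given Blob or File.
function sliceBlob(blob, start, end) {
	var slice = blob.slice || blob.webkitSlice || blob.mozSlice;
	if (!slice) {
		// Slicing is not supported; the caller can fall back to reading the whole file
		return null;
	}
	return slice.call(blob, start, end);
}

// In a page:
// var header = sliceBlob(file, 0, 8);
// reader.readAsText(header || file);
```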

Hopefully this article has given you some ideas on how to use the HTML5 File API to parse files client side before uploading them, and a bit of knowledge on PDFs.

This post is part of our “HTML5 Article index”. In these articles, we aim to help you understand the world of HTML5.

About Lyndon Armitage

Lyndon is a Developer at IDR Solutions. He currently focuses mostly on the JavaScript in the Viewer and PDF to HTML5 Converter and also the Android PDF Viewer. He gave a short talk at the GlassFish UnConference before JavaOne 2012. Outside of IDR Solutions he has a keen interest in AI and Games Programming and runs a blog that he periodically updates.
