Recently I have been looking at an issue for one of our potential clients. Text extraction was failing because an array out of bounds exception was thrown when certain PDF pages were encountered.
Unicode 3.0 allows any character to be defined as a 16 bit (2 byte) value, and this is the encoding Java uses internally. We want to ‘hide’ some data inside the text data, so we use a ‘marker’ character to flag it. Ideally we want this to be a single value, to reduce memory usage and keep things simple. The problem is: which value to use?
We used to use the value 65535, chosen because the chance of it being in genuine use was considered low.
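To make the idea concrete, here is a minimal sketch of embedding positional metadata in a char stream with a single marker character. The class, method, and record layout are hypothetical illustrations, not our actual code:

```java
public class MarkerEmbed {
    // U+FFFF == 65535, the sentinel value originally chosen
    static final char MARKER = '\uFFFF';

    // Encode one glyph with its x coordinate and width,
    // terminating each field with the marker character.
    static String encode(char glyph, int x, int width) {
        return "" + glyph + MARKER + x + MARKER + width + MARKER;
    }
}
```

Because the marker is a single 16-bit char, each field boundary costs only one code unit, which is what keeps the scheme cheap as long as no real glyph ever equals the marker.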
It soon became apparent that the issue was due to the text data getting out of sync while being copied into a format that can easily be read once extracted.
The text data is stored with the x coordinate and the width of each character, using a non-standard Unicode character value (such as 65535) to separate these values. This has worked on every file we have encountered until today. Now we have a file that uses this exact value on some of its pages. This threw our extraction off: the expected text value was never found, and the extraction code was knocked out of step.
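A simplified sketch of why a marker collision knocks decoding out of step (again, names and record layout are hypothetical; the real code is more involved):

```java
public class MarkerDecode {
    static final char MARKER = '\uFFFF';

    // Fields are separated by the marker and repeat in groups of
    // three: glyph, x coordinate, width. We keep every third field.
    static String extractText(String encoded) {
        String[] parts = encoded.split(String.valueOf(MARKER));
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < parts.length; i += 3) {
            text.append(parts[i]);
        }
        return text.toString();
    }
}
```

If a genuine U+FFFF glyph appears in the content, the split produces extra empty fields, the three-field stride lands on the wrong columns, and coordinate or width values leak into the extracted text instead of characters.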
Once identified, the problem was easily rectified: we simply changed the marker value, which allows text extraction to function again.
Now we just need to figure out a way of preventing this issue if any future files contain instances of our new marker value.
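One common fix is to escape the marker in the content rather than hope no file ever contains it. A sketch under that assumption, with arbitrarily chosen (hypothetical) marker and escape characters:

```java
public class MarkerEscape {
    static final char MARKER = '\uFFFE'; // hypothetical new marker
    static final char ESCAPE = '\uFFFD'; // hypothetical escape character

    // Prefix any occurrence of MARKER (or of ESCAPE itself) in the
    // content with ESCAPE, so real data can never mimic the delimiter.
    static String escape(String s) {
        StringBuilder out = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c == MARKER || c == ESCAPE) out.append(ESCAPE);
            out.append(c);
        }
        return out.toString();
    }

    // Reverse the transformation: an ESCAPE char means the next
    // char is literal content, not a delimiter.
    static String unescape(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == ESCAPE && i + 1 < s.length()) c = s.charAt(++i);
            out.append(c);
        }
        return out.toString();
    }
}
```

The cost is one extra char per escaped occurrence, but the decoder no longer depends on the marker value being globally unused.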
You have to be very careful if you use characters as a delimiter, as the effects can have a long-term impact years down the line. One company we used to work with created a database in the late 1980s for holding details of people (name, address, contact details, etc.) and looked for a very rare character to use as a delimiter, one which would never be used. They chose the @ character…