Whilst working on the JPedal Java PDF Library at IDR Solutions, I recently came across an interesting issue where Hebrew words could only be found in a PDF if they were spelt in reverse. This was an odd one, as we were only able to find a single example of the issue. After some searching we discovered the cause: when it comes to text, PDFs have no sense of direction.
Within a PDF it is possible to include text without a writing mode, in which case the default is left to right. Unfortunately, this means it is possible to include right to left languages and have them treated as left to right. In theory this should not be a problem: when we wish to perform a search, we enter the text in the search field and the characters are positioned correctly for the language, either left to right or right to left. Once we have the string we can just read the characters as they appear on screen from left to right. Except this did not appear to work. We had a PDF that would not find any search terms unless they were written in reverse.
When typing in a language that is written from right to left, Swing will detect the characters as belonging to such a language and add the characters in order right to left. So as you type the characters ה ,ד ,נ ,ד ,נ they will appear as נדנדה. This holds true when displaying the contents of a StringBuilder, StringBuffer or String.
Without knowing this, should you output a string to debug an issue, it would be easy to mistake what you are being shown for what the underlying data actually is. The characters are stored in the order they are typed. They are then displayed in the correct place by the text component.
In the PDF, however, the characters are actually stored in display order, starting from the left. Text drawn left to right therefore looks correct and its underlying order matches what is displayed, whereas genuine right to left input also looks correct but has its underlying order reversed.
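One way to avoid being misled by the display is to print the code points of a string rather than the string itself. The following is a minimal sketch, not code from JPedal: the class and method names are made up for illustration, and the Hebrew word is written with Unicode escapes so that the source file itself is not reordered by an editor.

```java
final class StoredOrderDemo {

    // Lists the code points of s in the order they are stored in memory,
    // which is independent of how a text component chooses to draw them.
    static String codePoints(String s) {
        StringBuilder out = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(String.format("U+%04X", cp));
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // The five Hebrew letters of the example word, in the order they are typed.
        String typed = "\u05E0\u05D3\u05E0\u05D3\u05D4";
        System.out.println(codePoints(typed));
    }
}
```

Because the output is plain ASCII, it is shown left to right in any console, so the stored order is unambiguous.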
So in a PDF we have the following characters defined in the order נ ,ד ,נ ,ד ,ה, to form the word נדנדה. As you can see, the characters are stored in the reverse of the order in which the word would be typed. Now if we search for this term, we have to type the characters in the order ה ,ד ,נ ,ד ,נ to form the word correctly in a text component. The search term is therefore entered and stored in the reverse of the order used in the PDF.
When we search for text we check characters one at a time, in order, through the PDF text and the search term. Here the issue should be obvious: we compare each character in the PDF against the first character of the search term and only check the following character of the term if the current one matches. As the writing modes do not match, the text may look identical on screen but underneath everything is backwards.
In order to resolve this we have had to alter how search terms are accepted, to allow for right to left characters being stored in a left to right order within the PDF, and reverse the search term's character order to match that used by the PDF.
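The mismatch can be reproduced in a few lines. This is only a sketch of a character-by-character comparison of the kind described above, not the actual JPedal search code. The `find` method and the class name are hypothetical, and we assume the PDF holds the word in exactly the reverse of the typed order.

```java
final class NaiveSearchDemo {

    // Returns the index of the first match of term in pdfText, or -1.
    // Characters are compared one at a time, in stored order.
    static int find(String pdfText, String term) {
        for (int start = 0; start + term.length() <= pdfText.length(); start++) {
            int i = 0;
            while (i < term.length() && pdfText.charAt(start + i) == term.charAt(i)) {
                i++;
            }
            if (i == term.length()) {
                return start;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Search term as typed by the user.
        String typed = "\u05E0\u05D3\u05E0\u05D3\u05D4";
        // Assumption: the PDF stores the same letters in the opposite order.
        String stored = new StringBuilder(typed).reverse().toString();

        System.out.println(find(stored, typed));  // no match
        System.out.println(find(stored, stored)); // matches at index 0
    }
}
```

Both strings would be drawn identically on screen, yet the first comparison already fails on the very first character.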
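As a rough illustration of that fix, the sketch below reverses a search term when it contains right to left characters. It is an assumption about how such a step could look rather than the JPedal implementation: `containsRtl` and `toPdfOrder` are hypothetical names, and reversing the whole term is a simplification that would not handle terms mixing Hebrew with digits or Latin text.

```java
final class RtlSearchTerm {

    // True if any code point in s has a strong right to left direction
    // (Hebrew, Arabic and similar scripts).
    static boolean containsRtl(String s) {
        return s.codePoints().anyMatch(cp -> {
            byte d = Character.getDirectionality(cp);
            return d == Character.DIRECTIONALITY_RIGHT_TO_LEFT
                    || d == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC;
        });
    }

    // Converts a typed search term into the order assumed to be used by the PDF:
    // right to left terms are reversed, everything else is left untouched.
    static String toPdfOrder(String term) {
        return containsRtl(term)
                ? new StringBuilder(term).reverse().toString()
                : term;
    }

    public static void main(String[] args) {
        String typed = "\u05E0\u05D3\u05E0\u05D3\u05D4";
        String forPdf = toPdfOrder(typed);
        // Print code points so the console cannot reorder the result.
        forPdf.codePoints().forEach(cp -> System.out.printf("U+%04X ", cp));
        System.out.println();
    }
}
```

`StringBuilder.reverse()` keeps surrogate pairs together, so the reversal is safe for characters outside the Basic Multilingual Plane. A more complete solution would need to apply the Unicode bidirectional algorithm, for example via `java.text.Bidi`, instead of a blanket reversal.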
Hopefully you have found this article useful. Let us know what you think.