I was sent an interesting PDF file to investigate this week. The issue was that spaces were appearing inside words when the file was converted to HTML5. Intrigued, I dived in to see what was going on…
The PDF file was created using Corel’s PDF Engine, which often has its own unique way of doing things (to put it politely). So I drilled down and found the word in question, which was appearing with a space in it. I copied it from Acrobat and the copied text also had a space in it! I looked at the internal PDF commands and the text was encoded as a single word with a space in the middle. But in the viewer (both our PDF viewer and Acrobat) no space is visible.
The reason for this is an operator in the PDF text commands called Tw, which sets the word spacing. This allows you to define an additional amount of space (positive or negative) to be added each time a space character is drawn. In this case, the amount is set to cancel out the width of the space exactly, so the space is there in the text but does not appear as a gap when the PDF is viewed. I have altered our code to ignore these cancelled-out spaces when converting the PDF to HTML5 (and when extracting text).
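To make the arithmetic concrete, here is a minimal sketch of how a space's advance can be worked out from the text state. It is not our actual code: the class name, the method names, the sample values and the 5% threshold are all assumptions made up for illustration. It uses the horizontal displacement formula from the PDF specification, leaving out the TJ adjustment term, and it assumes the character is the single-byte code 32, which is the only one Tw applies to.

```java
// Illustrative sketch only; names, values and threshold are hypothetical.
final class WordSpacingSketch {

    /**
     * Advance of a space glyph in unscaled text space units.
     * glyphWidth is the space's width in 1/1000 glyph units,
     * charSpacing is Tc, wordSpacing is Tw and
     * horizontalScalePercent is the Tz value (100 = normal).
     */
    static double spaceAdvance(double glyphWidth, double fontSize,
                               double charSpacing, double wordSpacing,
                               double horizontalScalePercent) {
        double w0 = glyphWidth / 1000.0;
        double th = horizontalScalePercent / 100.0;
        return (w0 * fontSize + charSpacing + wordSpacing) * th;
    }

    /**
     * Treats a space as cancelled out when its advance is tiny
     * compared with the font size (5% is an arbitrary assumption).
     */
    static boolean isCancelledSpace(double advance, double fontSize) {
        return Math.abs(advance) < 0.05 * fontSize;
    }

    public static void main(String[] args) {
        double fontSize = 12.0;
        // A 250-unit space at 12pt is 3 units wide; Tw = -3 cancels it.
        double advance = spaceAdvance(250, fontSize, 0, -3.0, 100);
        System.out.println("advance = " + advance);
        System.out.println("cancelled = " + isCancelledSpace(advance, fontSize));
    }
}
```

With a normal Tw of 0 the same space would advance 3 units and be kept, so a check along these lines only drops spaces that the viewer never shows as a gap.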
So if you are using our PDF to HTML5 converter and seeing odd spaces, try today’s release. The bigger mystery is why Corel needs to add spaces in the middle of words and then move the position back to hide them. Any ideas?
IDRsolutions develop a Java PDF library, a PDF forms to HTML5 converter, a PDF to HTML5 or SVG converter and a Java Image Library that doubles as an ImageIO replacement. On the blog, our team post about anything interesting they learn.