Space: The final frontier
These are the challenges of the IDRSolutions team
Its 12-year mission
To explore strange specification interpretations
To seek out new interpretations
To boldly go where no PDF has gone before…
Before we continue, I should apologise for the Star Trek references. I was incredibly tempted and thought, why not?
PDF files containing text will usually contain spaces between words, as you would expect, although this is not always the case. The three common ways I have found of representing spaces between words in PDF files are as follows.
1. The Space Character
The text actually contains a space character (ASCII character 32).
2. An “Empty” Character
The text contains a character other than the space character that has no visible glyph. This character has been set as the standard space for the font in the PDF.
3. No spaces
Instead of using a space character or an “empty” character, the PDF just begins drawing a new sequence of characters, leaving a gap where the space should be.
To the end user, these three examples all look correct and the spaces are in place. From a software viewpoint, each case has to be handled in a different way to ensure consistency. These three cases are the most common ones we have encountered, and they actually aren’t too difficult to handle.
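As a rough illustration of cases 1 and 2, an extractor can map any character the font draws as a blank glyph onto a real space. This is only a sketch: the function name `normalise_spaces`, the `blank_glyphs` parameter and the `"\x03"` code are made up for the example and do not come from any particular library or file.

```python
def normalise_spaces(chars, blank_glyphs):
    """Replace characters drawn as a blank glyph (case 2) with a real space.

    `blank_glyphs` is a hypothetical set of characters that the font is
    assumed to render with no visible glyph. Real spaces (case 1) pass
    through unchanged.
    """
    return "".join(" " if c in blank_glyphs else c for c in chars)


# Hypothetical font where "\x03" is the "empty" character used as a space.
text = normalise_spaces("Hello\x03World", blank_glyphs={"\x03"})
print(text)
```

Case 3 cannot be handled this way because there is no character to replace, so the gap itself has to be measured.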
The real issue comes when a user extracts the text and finds no space where there is a visible gap on screen. Sometimes the font defines a space with a width, and sometimes it does not. How large does a gap have to be before it counts as a space?
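One common way to answer that question is a threshold based on the font size. The sketch below assumes each glyph comes with a left edge, an advance width and a font size; the `Glyph` class, the `extract_line` function and the 0.25 ratio are all assumptions chosen for illustration, not values taken from the PDF specification or from our library.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float          # left edge of the glyph on the page
    width: float      # advance width of the glyph
    font_size: float


def extract_line(glyphs, gap_ratio=0.25):
    """Join glyphs on one line, inserting a space where the gap is large.

    A gap wider than gap_ratio * font_size is treated as a word break.
    The ratio is a tunable guess, not a standard value.
    """
    out = []
    prev = None
    for g in sorted(glyphs, key=lambda g: g.x):
        if prev is not None:
            gap = g.x - (prev.x + prev.width)
            # Skip the insertion if a real space is already on either side.
            if gap > gap_ratio * g.font_size and " " not in (prev.char, g.char):
                out.append(" ")
        out.append(g.char)
        prev = g
    return "".join(out)


line = [
    Glyph("H", 0, 6, 12), Glyph("i", 6, 3, 12),
    Glyph("y", 14, 6, 12), Glyph("o", 20, 6, 12),
]
print(extract_line(line))  # prints "Hi yo"
```

Set the ratio too low and tightly kerned words get split; set it too high and real word breaks are missed, which is why the question has no single right answer.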
Recently I came across a file that had an interesting font when it came to the space character…
The file was a mix of cases 1 and 3 above. All the characters, including the spaces, were positioned correctly on the page. We then started to notice issues in the extraction: no spaces were being extracted at all. After checking that the spaces did indeed exist on the page, I noticed that they had a width of 0. A space with no width effectively does not exist, and the position of the text after it was set by the coordinates supplied for the text that followed.
It turns out that the file relies on the viewer using the glyph coordinates to determine when a space is needed, instead of ensuring the space character actually has a width so that it separates the characters around it.
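One possible way to cope with a file like this is to give each zero-width space the width of the gap it sits in, measured from the glyphs on either side. The sketch below assumes glyphs arrive as `(char, x, width)` tuples in reading order; that layout and the name `resolve_zero_width_spaces` are hypothetical, and this is not necessarily how any particular extractor handles it.

```python
def resolve_zero_width_spaces(glyphs):
    """Give zero-width spaces the width of the gap around them.

    `glyphs` is a list of (char, x, width) tuples in reading order.
    A space with width 0 between two other glyphs is stretched to run
    from the end of the previous glyph to the start of the next one.
    """
    fixed = []
    for i, (char, x, width) in enumerate(glyphs):
        if char == " " and width == 0 and 0 < i < len(glyphs) - 1:
            prev_char, prev_x, prev_width = glyphs[i - 1]
            next_x = glyphs[i + 1][1]
            x = prev_x + prev_width          # space starts where the last glyph ends
            width = max(next_x - x, 0.0)     # never report a negative width
        fixed.append((char, x, width))
    return fixed


glyphs = [("a", 0, 5), (" ", 5, 0), ("b", 8, 5)]
fixed = resolve_zero_width_spaces(glyphs)
print(fixed[1])
```

After this step the space has a real width again, so later extraction code can treat it like an ordinary case 1 space.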
The more I look into files containing, for want of a better term, interesting text, the more I think to myself, “It’s a space, Jim, but not as we know it.”
This post is part of our “Understanding the PDF File Format” series. In each article, we discuss a PDF feature, bug, gotcha or tip.