Space: the final frontier… in PDF

Space: The final frontier
These are the challenges of the IDRSolutions team
Its 12 year mission
To explore strange specification interpretations
To seek out new interpretations
To boldly go where no PDF has gone before…

Before we continue, I should apologise for the Star Trek references. I was incredibly tempted and thought, why not!

PDF files containing text will at some point need a space between words, as you would expect. However, that space is not always stored as a space character. The three common ways I have found of representing spaces between words in PDF files are as follows.

1. The Space Character
The text actually contains a space character (ASCII character 32).

2. An “Empty” Character
The text contains a character other than the space character that has no visible glyph. This character has been set as the standard space for the font in the PDF.

3. No spaces
Instead of using a space character or an “empty” character, the PDF just begins drawing a new sequence of characters, leaving a gap where the space should be.

To the end user, these three examples all look correct and the spaces are in place. From a software viewpoint, each case has to be handled in a different way to ensure consistency. These three cases are the most common ones we have encountered and actually aren’t too difficult to handle.
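As a rough illustration, cases 1 and 2 can be folded into a single check on each decoded character. This is only a sketch in Python: the DecodedGlyph record, its has_outline flag and the font_space_code parameter are names I have made up for the example, not part of any real PDF library. Case 3 has no character to test at all, so it has to be dealt with by looking at positions instead.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DecodedGlyph:
    """Hypothetical record for one character taken from a text run."""
    code: int          # character code after decoding
    has_outline: bool  # False if the font draws nothing for this code


def is_word_space(glyph: DecodedGlyph, font_space_code: Optional[int] = None) -> bool:
    """Return True if this glyph should be treated as a space between words."""
    # Case 1: a genuine space character (ASCII 32).
    if glyph.code == 32:
        return True
    # Case 2: the code this font uses as its space, with no visible glyph.
    if font_space_code is not None and glyph.code == font_space_code:
        return not glyph.has_outline
    # Case 3 (no character at all) cannot be detected here.
    return False
```

For example, is_word_space(DecodedGlyph(3, False), font_space_code=3) treats code 3 as a space for a font that uses it that way.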

The real issue comes when a user extracts the text and finds there is no space when there is a gap onscreen. Sometimes the character set actually defines a space with a width (and sometimes it does not). How much of a gap should there be before it becomes a space?
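There is no single right answer to that question, but one plausible heuristic is to compare the gap with the font's own space width when it has a usable one, and fall back to a fraction of the font size when it does not. The sketch below is written in Python purely for illustration. The function name, its parameters and the two ratios (0.5 and 0.2) are my assumptions and would need tuning against real files.

```python
from typing import Optional


def gap_is_space(prev_end_x: float,
                 next_start_x: float,
                 font_size: float,
                 space_width: Optional[float] = None,
                 ratio: float = 0.5,
                 fallback_em: float = 0.2) -> bool:
    """Guess whether the gap between two glyphs should count as a space.

    prev_end_x   -- right-hand edge of the previous glyph
    next_start_x -- left-hand edge of the next glyph
    space_width  -- width the font gives its space, if it defines one
    """
    gap = next_start_x - prev_end_x
    if space_width is not None and space_width > 0:
        # The font defines a space with a real width, so measure against it.
        threshold = space_width * ratio
    else:
        # No space, or a zero-width one: fall back to the font size.
        threshold = font_size * fallback_em
    return gap >= threshold
```

Using a fraction of the space width rather than the full width leaves some tolerance for tightly set text, at the risk of the odd false space where letters are widely tracked.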

Recently I came across a file that had an interesting font when it came to the space character…
The font for the PDF was a mix of cases 1 and 3 above. The file had all the correct positioning for the characters on the page, including the spaces. We then started to notice issues in the extraction: no spaces were being extracted at all. After checking that the space characters did indeed exist on the page, I noticed that they had a width of 0. A space with no width takes up no room, so it effectively does not exist. The positioning of the text after it was handled by the coordinates given for the text that follows.
It turns out that the file relies on the viewer using the glyph coordinates to determine when a space is needed, instead of ensuring the space character actually has a width so as to separate the characters around it.
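One way to cope with a file like this is to ignore any space that has no width and let the glyph coordinates decide where the spaces go. The Python sketch below shows the idea on a single line of text. It is not our actual extraction code: the PlacedGlyph record, the extract_line function and the 0.2 threshold are all assumptions made for the example.

```python
from typing import List, NamedTuple


class PlacedGlyph(NamedTuple):
    """Hypothetical record for one glyph on a single line of text."""
    char: str
    x: float      # left-hand edge of the glyph
    width: float  # advance width of the glyph


def extract_line(glyphs: List[PlacedGlyph], font_size: float,
                 min_gap_em: float = 0.2) -> str:
    """Rebuild the text of one line, adding spaces where the coords show a gap."""
    out: List[str] = []
    prev_end = None
    for g in glyphs:
        if g.char == " " and g.width <= 0:
            # A zero-width space tells us nothing about layout, so skip it
            # and rely on the position of the next glyph instead.
            continue
        already_spaced = bool(out) and out[-1] == " "
        if prev_end is not None and g.char != " " and not already_spaced:
            if g.x - prev_end >= font_size * min_gap_em:
                out.append(" ")
        out.append(g.char)
        prev_end = g.x + g.width
    return "".join(out)
```

Spaces that do have a width are kept as they are, so a file that mixes both styles should not end up with doubled spaces.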

The more I look into files containing, for want of a better term, interesting text, the more I think to myself: “It’s a space, Jim, but not as we know it.”

This post is part of our “Understanding the PDF File Format” series. In each article, we discuss a PDF feature, bug, gotcha or tip. If you wish to learn more about PDF, we have 13 years’ worth of PDF knowledge and tips in our series index.

Kieran France is a programmer for IDRSolutions. He enjoys tinkering with most things including gadgets, code and electronics. He often has no idea what to write in his blog posts but tries his hardest to make them interesting and entertaining, he also makes no excuses for his odd sense of humor.