Kieran France is a programmer for IDRSolutions in charge of their internal test suite. In his spare time he enjoys tinkering with gadgets and code.

Space: the final frontier… in PDF


Space: The final frontier
These are the challenges of the IDRSolutions team
Its 12 year mission
To explore strange specification interpretations
To seek out new interpretations
To boldly go where no PDF has gone before…

Before we continue, I should apologise for the Star Trek references. I was incredibly tempted and thought, why not!

PDF files containing text will, as you would expect, usually show a space between words. However, that space is not always stored as a space character. Here are three common ways I have found of representing spaces between words in PDF files.

1. The Space Character
The text actually contains a space character (ASCII character 32).

2. An “Empty” Character
The text contains a character other than the space character that has no visible glyph. This character has been set as the standard space for the font in the PDF.

3. No spaces
Instead of using a space character or an “empty” character, the PDF just begins drawing a new sequence of characters, leaving a gap where the space should be.

To the end user, all three look correct and the spaces are in place. From a software viewpoint, each case has to be handled differently to ensure consistency. These are the most common cases we have encountered, and they actually aren’t too difficult to handle.
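
The three cases can be told apart with a few checks per glyph. The sketch below is only an illustration, not how JPedal does it: the `Glyph` record, its fields, and the `min_gap` parameter are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Glyph:
    char: str      # decoded character
    x: float       # left edge on the page
    width: float   # advance width
    visible: bool  # False when the font draws nothing for this glyph


def space_kind(prev: Glyph, cur: Glyph, min_gap: float) -> Optional[str]:
    """Name the way a space is represented at `cur`, or None if there is none."""
    if cur.char == " ":
        return "space character"   # case 1: ASCII 32 is in the text
    if not cur.visible:
        return "empty character"   # case 2: a glyph with nothing to draw
    if cur.x - (prev.x + prev.width) >= min_gap:
        return "gap"               # case 3: the next run starts further right
    return None
```

Whichever case is reported, an extractor can then emit the same single space, which is what keeps the output consistent.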

The real issue comes when a user extracts the text and finds there is no space when there is a gap onscreen. Sometimes the character set actually defines a space with a width (and sometimes it does not). How much of a gap should there be before it becomes a space?
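
One heuristic for that threshold, offered as an assumption rather than a rule from the PDF specification: take a fraction of the font's space width when it defines a usable one, and otherwise estimate from the average width of the other glyphs. The function name and the 0.5 and 0.25 factors are illustrative choices.

```python
from typing import Optional, Sequence


def space_threshold(space_width: Optional[float],
                    glyph_widths: Sequence[float]) -> float:
    """Smallest horizontal gap that should be treated as a space."""
    if space_width is not None and space_width > 0:
        # The font defines a space with a real width: accept gaps of at
        # least half that width, since justified or kerned text can shrink it.
        return 0.5 * space_width
    if not glyph_widths:
        raise ValueError("need a space width or at least one glyph width")
    # No usable space width: estimate one from the other glyphs.
    average = sum(glyph_widths) / len(glyph_widths)
    return 0.25 * average
```

The factors would need tuning against real files; tightly set text and widely tracked headings pull the right value in opposite directions.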

Recently I came across a file that had an interesting font when it came to the space character…
The font for the PDF was a mix of cases 1 and 3 above. The file had all the correct positioning for the characters on the page, including the spaces. We then started to notice issues in the extraction: no spaces were being extracted at all. After checking that the spaces did indeed exist on the page, I noticed that they had a width of 0. A space with no width effectively does not exist, and the positioning of the text after it was handled by the coordinates given for all the text that followed.
It turns out that the file relies on the viewer using the glyph coordinates to determine when a space is needed, instead of ensuring the space character actually has a width so that it separates the characters around it.
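
A hedged sketch of the coordinate-based approach: ignore the space character's own width and emit a space whenever the next visible glyph starts noticeably to the right of where the previous one ended. The `(char, x, width)` tuple layout and the `min_gap` value are assumptions for illustration.

```python
from typing import List, Tuple

Glyph = Tuple[str, float, float]  # (character, left edge, advance width)


def extract_text(glyphs: List[Glyph], min_gap: float) -> str:
    """Rebuild a line of text, deriving spaces from positions only."""
    out = []
    prev_end = None  # right edge of the last visible glyph
    for char, x, width in glyphs:
        if char == " ":
            # Skip the space glyph itself: its width may be 0, so it
            # cannot be trusted to separate its neighbours.
            continue
        if prev_end is not None and x - prev_end >= min_gap:
            out.append(" ")
        out.append(char)
        prev_end = x + width
    return "".join(out)


# "Hi there" where the space has zero width and the gap comes from positioning.
line = [("H", 0.0, 7.0), ("i", 7.0, 3.0), (" ", 10.0, 0.0),
        ("t", 14.0, 4.0), ("h", 18.0, 6.0), ("e", 24.0, 6.0),
        ("r", 30.0, 4.0), ("e", 34.0, 6.0)]
print(extract_text(line, min_gap=2.0))  # Hi there
```

Because the decision is made from positions alone, a space with a real width gives the same result: the gap it leaves is simply measured instead.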

It seems that the more I look into files containing, for want of a better term, interesting text, the more I think to myself: “It’s a space, Jim, but not as we know it.”

This post is part of our “Understanding the PDF File Format” series. In each article, we discuss a PDF feature, bug, gotcha or tip. If you wish to learn more about PDF, we have 13 years’ worth of PDF knowledge and tips, so click here to visit our series index!



3 Replies to “Space: the final frontier… in PDF”

  1. Hi Kieran!

    First and foremost, nice Star Trek joke xD. One question: when dealing with case 3, how do you detect individual whitespaces when the text is justified in the PDF?

    Nice blog post by the way.

  2. Henry,

    When extracting text in that case we cannot tell whether the text is justified or not, but we have an algorithm to sort the text fragments into larger blocks of text. When doing this we add a space character if the gap between text fragments resembles the width of the space character, if present, or otherwise a width calculated from the font as a whole.

    When extracting as plain text we add just a single space character for the output.
    When extracting content as XML we can output a tag to specify the number of spaces between the text.

    For justified text, the real trick is in how you detect that all the text belongs on the same line.
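
    As a rough illustration of the two output styles described above (not the algorithm used in JPedal), the sketch below joins fragments that are already known to sit on the same line. The `Fragment` record, the `<space count="..."/>` tag, and the half-width tolerance are all hypothetical.

```python
from dataclasses import dataclass
from typing import List
from xml.sax.saxutils import escape


@dataclass
class Fragment:
    text: str
    x: float      # left edge of the fragment
    width: float  # total width of the fragment


def join_fragments(fragments: List[Fragment], space_width: float,
                   as_xml: bool = False) -> str:
    """Join same-line fragments, adding spaces where the gaps call for them."""
    if space_width <= 0:
        raise ValueError("space_width must be positive")
    parts = []
    prev_end = None
    for frag in sorted(fragments, key=lambda f: f.x):
        if prev_end is not None:
            gap = frag.x - prev_end
            if gap >= 0.5 * space_width:
                # How many space widths fit in the gap (at least one).
                count = max(1, round(gap / space_width))
                parts.append(f'<space count="{count}"/>' if as_xml else " ")
        parts.append(escape(frag.text) if as_xml else frag.text)
        prev_end = frag.x + frag.width
    return "".join(parts)


words = [Fragment("Space:", 0.0, 30.0), Fragment("the", 33.0, 15.0),
         Fragment("final", 57.0, 20.0)]
print(join_fragments(words, space_width=3.0))
print(join_fragments(words, space_width=3.0, as_xml=True))
```

    The plain-text call collapses every gap to one space, while the XML call keeps the estimated count so a consumer can rebuild the original spacing if it needs to.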


IDRsolutions Ltd 2022. All rights reserved.