Did you know that PDF files can actually define spaces in several different ways?
Three common ways I have found of representing spaces between words in PDF files are as follows.
1. The Space Character
The text actually contains a space character (ASCII character 32), which might be mapped onto another value in the Encoding table.
2. An “Empty” Character
The text contains a character other than the space character that has no visible glyph. This character has been set as the standard space for the font in the PDF.
3. No Spaces
Instead of using a space character or an “empty” character, the PDF just begins drawing a new sequence of characters, leaving a gap where the space should be.
Many PDF files do not actually contain any space characters in their text. They contain gaps between letters, and the software has to guess whether a gap represents a space. Sometimes there are real spaces, but this is more typical of what you might see in the PDF text. Genuine space characters are rare (here they appear only inside "(k )" and "( )"), but there are lots of gaps between characters, and some of those gaps are actually spaces.
[ (S) -289.1 (o) -288.9 (k ) -3529.4 (B) -289.1 (e) -289.2 (z) -289.1 (e) -289.2 (i) -289.2 (c) -289.1 (h) -289.2 (n) -289.2 (u) -289.2 (n) -289.2 (g) -0.2 ( ) -3529.4 (E) -289.1 (U) -289.2 (R) ] TJ
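To make the guessing concrete, here is a minimal Python sketch that rebuilds the text from a TJ array like the one above. The function name extract_text and the threshold of 1000 (thousandths of a text space unit) are my own assumptions for illustration, not values taken from any particular library, and the parser assumes the strings contain no escaped or nested parentheses.

```python
import re

# Matches either a literal string "(...)" or a number inside a TJ array.
# Assumption: strings contain no escaped or nested parentheses.
TOKEN = re.compile(r"\(([^()]*)\)|(-?\d+(?:\.\d+)?)")


def extract_text(tj_array, gap_threshold=1000.0):
    """Rebuild text from a TJ array, adding a space where a gap is large.

    Numbers in a TJ array are expressed in thousandths of a text space
    unit and are subtracted from the current position, so a negative
    number widens the gap before the next string.
    """
    out = []
    for match in TOKEN.finditer(tj_array):
        text, number = match.group(1), match.group(2)
        if text is not None:
            out.append(text)
        elif -float(number) >= gap_threshold:
            # Large gap: treat it as a space unless one is already there.
            if out and not out[-1].endswith(" "):
                out.append(" ")
    return "".join(out)


example = ("[ (S) -289.1 (o) -288.9 (k ) -3529.4 (B) -289.1 (e) -289.2 (z) "
           "-289.1 (e) -289.2 (i) -289.2 (c) -289.1 (h) -289.2 (n) -289.2 (u) "
           "-289.2 (n) -289.2 (g) -0.2 ( ) -3529.4 (E) -289.1 (U) -289.2 (R) ] TJ")
print(extract_text(example))
```

The small gaps of roughly 289 units are treated as letter spacing and ignored, while a gap of 3529.4 units would count as a space if a space character were not already present. A sensible threshold in practice depends on the font, so treat 1000 as a starting point only.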
Further gaps between letters can be added by the Tc (character spacing) and Tw (word spacing) operators, while the distance between lines is set by TL (leading).
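As a rough illustration of how these values combine, the PDF specification gives the horizontal advance after each glyph as ((w0 - Tj/1000) * Tfs + Tc + Tw) * Th, where Tw is added only for the single-byte character code 32. The sketch below implements that formula; the function and parameter names are my own, chosen for readability.

```python
def horizontal_displacement(glyph_width, tj_adjust=0.0, font_size=1.0,
                            char_spacing=0.0, word_spacing=0.0,
                            horizontal_scale=1.0, is_space=False):
    """Return how far the text position advances after drawing one glyph.

    glyph_width      -- glyph width in thousandths of a text space unit
    tj_adjust        -- number from a TJ array applied to this glyph (0 if none)
    font_size        -- Tfs, set by the Tf operator
    char_spacing     -- Tc
    word_spacing     -- Tw, applied only when the glyph is character code 32
    horizontal_scale -- Th as a fraction (1.0 means 100%)
    """
    w0 = glyph_width / 1000.0
    tw = word_spacing if is_space else 0.0
    return ((w0 - tj_adjust / 1000.0) * font_size + char_spacing + tw) * horizontal_scale


# A 500-unit glyph at 10pt followed by the -289.1 adjustment seen earlier.
print(horizontal_displacement(500, tj_adjust=-289.1, font_size=10))
```

Because Tw only applies to character code 32, an “empty” character used as a space (case 2) does not pick up any word spacing, which is one more reason the three cases have to be handled differently.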
Real world impacts
To the end user these three examples all look correct and the spaces are in place. From a software viewpoint each case has to be handled in a different way to ensure consistency.
The real issue can come when a user extracts the text and finds there is no space when there is a gap onscreen. Sometimes the character set actually defines a space with a width (and sometimes it does not). How much of a gap should there be before it becomes a space?
Recently I came across a file that had an interesting font when it came to the space character…
The font for the PDF was a mix of cases 1 and 3 described above. The file had all the correct positioning for the characters on the page, including the spaces. We then started to notice issues in the extraction: no spaces were being extracted at all. After checking that the spaces did indeed exist on the page, I noticed that they had a width of 0. A space with no width effectively does not exist, and the positioning of the text after it was handled by explicit coordinates for all the text that followed.
It turns out that the file relies on the viewer using the glyph coordinates to determine when a space is needed, instead of ensuring that the space character actually has a width that separates the characters around it.
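One way to cope with a file like this is to ignore zero-width spaces and decide from the glyph positions alone. The sketch below is only an illustration under my own assumptions: glyphs arrive as (character, x, width) tuples in reading order on a single line, and a gap wider than a quarter of the font size counts as a space. Both the data layout and the gap_ratio value are hypothetical.

```python
def text_from_glyphs(glyphs, font_size, gap_ratio=0.25):
    """Rebuild a line of text from positioned glyphs.

    glyphs    -- iterable of (char, x, width) tuples in reading order
    font_size -- font size in the same units as x and width
    gap_ratio -- assumed fraction of font_size above which a gap is a space
    """
    out = []
    prev_end = None
    for char, x, width in glyphs:
        if char == " " and width == 0:
            # Zero-width space: it does not move the text position,
            # so rely on the glyph coordinates instead.
            continue
        if prev_end is not None and x - prev_end > gap_ratio * font_size:
            # Visible gap: add a space unless one is already present.
            if not out or out[-1] != " ":
                out.append(" ")
        out.append(char)
        prev_end = x + width
    return "".join(out)


# "a", a zero-width space, then "b" positioned 3 units further along.
line = [("a", 0, 5), (" ", 5, 0), ("b", 8, 5)]
print(text_from_glyphs(line, font_size=10))
```

A real extractor would also need to handle line breaks, rotated text and differing font sizes, so this only shows the core idea of trusting the coordinates over the space character.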