Did you know that PDF files can actually define spaces in several different ways?
Three common ways I have found of representing spaces between words in PDF files are as follows.
1. The Space Character
The text actually contains a space character (ASCII character 32), which might be mapped onto another value in the Encoding table.
2. An “Empty” Character
The text contains a character other than the space character that has no visible glyph. This character has been set as the standard space for the font in the PDF.
3. No spaces
Instead of using a space character or an “empty” character, the PDF just begins drawing a new sequence of characters, leaving a gap where the space should be.
Many PDF files do not actually contain any text spaces. They contain gaps between letters, and the software has to guess whether there is a space in the text. Sometimes there are real spaces, but the following is more typical of what you might see in the PDF text. It has a genuine space character in ( ), plus a trailing one in (k ), but lots of gaps between characters, and some of those gaps are actually spaces.
[ (S) -289.1 (o) -288.9 (k ) -3529.4 (B) -289.1 (e) -289.2 (z) -289.1 (e) -289.2 (i) -289.2 (c) -289.1 (h) -289.2 (n) -289.2 (u) -289.2 (n) -289.2 (g) -0.2 ( ) -3529.4 (E) -289.1 (U) -289.2 (R) ] TJ
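To make the guessing concrete, here is a minimal Python sketch that splits a TJ array like the one above into strings and adjustments, and inserts a space wherever the adjustment is large. The names parse_tj and extract_text and the threshold of 1000 (one em, in thousandths of a text space unit) are assumptions chosen for illustration, not how any particular library does it. The parser also ignores hex strings, nested parentheses and escape decoding.

```python
import re

# A literal string "(...)" or a plain decimal number. Escaped characters are
# kept as-is (not decoded); hex strings and nested parentheses are not handled.
TOKEN = re.compile(r"\((?P<text>(?:\\.|[^\\()])*)\)|(?P<num>-?\d+(?:\.\d+)?)")

def parse_tj(array_source):
    """Split the source of a TJ array into str and float items, in order."""
    items = []
    for match in TOKEN.finditer(array_source):
        if match.group("text") is not None:
            items.append(match.group("text"))
        else:
            items.append(float(match.group("num")))
    return items

def extract_text(items, gap_threshold=1000.0):
    """Join the strings, adding a space where a large gap has no space yet.

    A negative TJ number moves the next glyph to the right, so -item is the
    size of the gap. gap_threshold is a hypothetical cut-off, not a standard.
    """
    out = []
    for item in items:
        if isinstance(item, str):
            out.append(item)
        elif -item >= gap_threshold and out and not out[-1].endswith(" "):
            out.append(" ")
    return "".join(out)

sample = "[ (A) -289.1 (B) -3529.4 (C) -0.2 ( ) -3529.4 (D) ] TJ"
print(extract_text(parse_tj(sample)))  # prints "AB C D"
```

With a uniform letter spacing of about -289 and word gaps of about -3529, almost any threshold between the two would work here. Real files are rarely that clear-cut.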
Further gaps between letters can be added by the Tc (character spacing) and Tw (word spacing) operators, while TL sets the leading between lines.
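For reference, the PDF specification combines the glyph width, the TJ adjustment, Tc and Tw into a single horizontal displacement, tx = ((w0 - Tj/1000) * Tfs + Tc + Tw) * Th, where Tfs is the font size and Th the horizontal scaling set by Tz. The small Python sketch below just evaluates that formula. The function and parameter names are my own, and it assumes horizontal writing mode with glyph widths already expressed in text space units (so 0.5 rather than 500).

```python
def horizontal_advance(glyph_width, font_size, tj_adjust=0.0,
                       char_spacing=0.0, word_spacing=0.0,
                       horizontal_scale=1.0, is_space=False):
    """Evaluate tx = ((w0 - Tj/1000) * Tfs + Tc + Tw) * Th for one glyph.

    Tw (word_spacing) only applies to the single-byte character code 32,
    which the caller signals here with is_space=True.
    """
    tw = word_spacing if is_space else 0.0
    return ((glyph_width - tj_adjust / 1000.0) * font_size
            + char_spacing + tw) * horizontal_scale

# A zero-width glyph followed by a TJ adjustment of -3529.4 at 10pt
# still moves the text position by roughly 35 units.
print(horizontal_advance(0.0, 10.0, tj_adjust=-3529.4))
```

This is why a gap on the page tells you nothing by itself about which of the mechanisms produced it.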
Real world impacts
To the end user these three examples all look correct and the spaces are in place. From a software viewpoint each case has to be handled in a different way to ensure consistency.
The real issue can come when a user extracts the text and finds there is no space when there is a gap onscreen. Sometimes the character set actually defines a space with a width (and sometimes it does not). How much of a gap should there be before it becomes a space?
Recently I came across a file that had an interesting font when it came to the space character…
The font in the PDF was a mix of cases 1 and 3 above. The file had all the correct positioning for the characters on the page, including the spaces. We then started to notice issues in the extraction: no spaces were being extracted at all. After checking that the spaces did indeed exist on the page, I noticed that they had a width of 0. A space with no width effectively does not exist, and the positioning of the text after it was handled by the coordinates of the text that followed.
It turns out that the file relies on the viewer using the glyph coordinates to determine when a space is needed, instead of ensuring the space character actually has a width so that it separates the characters around it.
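One way to cope with a zero-width space like this is to stop trusting the declared width and fall back to an estimate taken from the rest of the font. The Python sketch below is only an illustration of that idea: the function name, the use of the mean glyph width and the 0.5 ratio are all assumptions of mine, not values from any specification.

```python
from statistics import mean

def reference_space_width(space_width, glyph_widths, fallback_ratio=0.5):
    """Return the width that gaps should be compared against.

    Uses the font's declared space width when it is positive. Otherwise
    falls back to a fraction (hypothetical: half) of the mean width of the
    font's non-zero glyphs, or 0.0 when the font offers nothing usable.
    """
    if space_width is not None and space_width > 0:
        return float(space_width)
    usable = [w for w in glyph_widths if w > 0]
    if not usable:
        return 0.0
    return mean(usable) * fallback_ratio

print(reference_space_width(0, [500, 700, 0]))    # prints 300.0
print(reference_space_width(250, [500, 700, 0]))  # prints 250.0
```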
Hi Kieran!
First and foremost, nice Star Trek joke xD. One question: when dealing with case 3, how do you detect individual whitespaces when the text is justified in the PDF?
Nice blog post by the way.
Cheers
Henry,
When extracting text in that case we cannot tell whether the text is justified or not, but we have an algorithm that sorts the text fragments into larger blocks of text. When doing this we add a space character when the gap between text fragments resembles the width of the space character, if present, or a width calculated from the font as a whole.
When extracting as plain text we add just a single space character for the output.
When extracting content as XML we can output a tag to specify the number of spaces between the text.
For justified text, the real trick is in how you detect that all the text belongs on the same line.
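A rough Python sketch of the gap-to-space idea described in this reply, for readers who want something concrete. It is not the library's actual algorithm: the Fragment class, join_line and the 0.5 tolerance are hypothetical, and it assumes the fragments have already been identified as belonging to the same line. Where the reply mentions an XML tag carrying the number of spaces, the sketch simply repeats the space character.

```python
from dataclasses import dataclass

@dataclass
class Fragment:
    text: str
    x: float      # left edge of the fragment on the page
    width: float  # rendered width of the fragment

def join_line(fragments, space_width, tolerance=0.5, count_spaces=False):
    """Join same-line fragments, adding spaces where the gap is big enough.

    A gap counts as a space when it is at least tolerance * space_width.
    With count_spaces=False a single space is emitted (plain-text style);
    with count_spaces=True the gap is rounded to a whole number of spaces.
    """
    out = []
    prev_end = None
    for frag in sorted(fragments, key=lambda f: f.x):
        if prev_end is not None and space_width > 0:
            gap = frag.x - prev_end
            if gap >= space_width * tolerance:
                n = max(1, round(gap / space_width)) if count_spaces else 1
                out.append(" " * n)
        out.append(frag.text)
        prev_end = frag.x + frag.width
    return "".join(out)

line = [Fragment("EUR", 100, 30), Fragment("Sok", 0, 30),
        Fragment("Bezeichnung", 45, 40)]
print(join_line(line, space_width=5))  # prints "Sok Bezeichnung EUR"
```

On a justified line the gaps stretch, so the single-space mode tends to be the safer default for plain text.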
Thank you Kieran!