Text spaces in PDF files

Many PDF files do not actually contain any text spaces. They contain gaps between letters, and the software has to guess whether each gap represents a space in the text. Sometimes there are real space characters, but what you will more often see in the PDF text looks like the example below. It has very few genuine space characters but lots of gaps between characters, and some of those gaps are actually word spaces.

[ (S) -289.1 (o) -288.9 (k ) -3529.4 (B) -289.1 (e) -289.2 (z) -289.1 (e) -289.2 (i) -289.2 (c) -289.1 (h) -289.2 (n) -289.2 (u) -289.2 (n) -289.2 (g) -0.2 ( ) -3529.4 (E) -289.1 (U) -289.2 (R) ]
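To make the structure of that array easier to see, here is a minimal Python sketch that splits it into text strings and numeric adjustments. In a TJ array the numbers are in thousandths of a text space unit and are subtracted from the current position, so for horizontal text a negative value widens the gap. The function name `parse_tj_array` and the regular expression are illustrative assumptions. A real parser would also need to handle escaped or nested parentheses, hex strings and other number forms.

```python
import re

# Matches either a parenthesised string or a decimal number.
# Simplified on purpose: no escaped/nested parentheses, no hex strings.
TOKEN = re.compile(r"\(([^()]*)\)|(-?\d+(?:\.\d+)?)")


def parse_tj_array(source):
    """Return a list mixing str (shown text) and float (adjustment) items."""
    items = []
    for text, number in TOKEN.findall(source):
        if number:
            items.append(float(number))
        else:
            items.append(text)
    return items


# The first few entries of the array shown above.
sample = "[ (S) -289.1 (o) -288.9 (k ) -3529.4 (B) ]"
items = parse_tj_array(sample)
print(items)  # ['S', -289.1, 'o', -288.9, 'k ', -3529.4, 'B']
```

Note how the only real space character here sits inside the string `(k )`, while the word break is mostly carried by the large `-3529.4` adjustment.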

So how do we decide which gaps are spaces? Sometimes the PDF font data specifies a width we can use directly. But this is not compulsory, so we often end up looking at the other width values in the font and using their mean or some other guess.
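As a rough illustration of that fallback, the sketch below picks a reference width from a table of glyph widths. It assumes that the "width we can use" is the width of the space glyph, and that widths arrive as a plain dictionary of character to width. Both are assumptions for the example, as is the function name `reference_width`, and this is not how any particular library exposes font data.

```python
from statistics import mean


def reference_width(glyph_widths):
    """Pick a width to compare gaps against.

    glyph_widths: hypothetical dict of character -> width (glyph units).
    """
    space = glyph_widths.get(" ")
    if space:  # present and non-zero, so use it directly
        return float(space)
    # Otherwise fall back to the mean of the other usable widths.
    others = [w for ch, w in glyph_widths.items() if ch != " " and w > 0]
    if not others:
        raise ValueError("no usable glyph widths")
    return float(mean(others))


print(reference_width({" ": 250, "a": 500}))   # 250.0
print(reference_width({"a": 400, "b": 600}))   # 500.0
```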

Over the years, we have found that treating any gap wider than 60% of the mean font width as a space works fairly well on most files. But some files work better with a slightly different threshold.
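The 60% rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the items are a hand-written list of strings and adjustments modelled on the array above (with the real space characters left out so the effect is visible), the mean glyph widths of 500 and 400 are invented values, and `extract_text` is a hypothetical name.

```python
def extract_text(items, mean_glyph_width, ratio=0.6):
    """Join strings, inserting a space where a gap exceeds the threshold."""
    threshold = mean_glyph_width * ratio
    pieces = []
    for item in items:
        if isinstance(item, str):
            pieces.append(item)
            continue
        gap = -item  # a negative adjustment widens the gap (horizontal text)
        already_spaced = bool(pieces) and pieces[-1].endswith(" ")
        # Only add a space after some text, and never double up.
        if gap > threshold and pieces and not already_spaced:
            pieces.append(" ")
    return "".join(pieces)


items = ["S", -289.1, "o", -288.9, "k", -3529.4,
         "B", -289.1, "e", -289.2, "z"]

print(extract_text(items, mean_glyph_width=500))  # Sok Bez
print(extract_text(items, mean_glyph_width=400))  # S o k B e z
```

With an assumed mean width of 500 the threshold is 300, so the letter gaps of about 289 are ignored and only the large gap becomes a space. Dropping the mean to 400 lowers the threshold to 240 and every letter gap turns into a space, which shows how sensitive the result is to the chosen ratio and width.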

So the lesson is that if text extraction matters, make sure you create the PDF with some care. If possible, use Marked Content (explained in The easy way to discover if a PDF file contains ‘structured content’), or at least be aware that some PDF creators produce text with true spaces, or with the data needed to calculate them reliably.

Do you have any tips on PDF text extraction?

This post is part of our “Understanding the PDF File Format” series. In each article, we discuss a PDF feature, bug, gotcha or tip. If you wish to learn more about PDF, we have 13 years' worth of PDF knowledge and tips in our series index.

Mark Stephens

System Architect and Lead Developer at IDRSolutions
Mark Stephens has been working with Java and PDF since 1999 and has diversified into HTML5, SVG and JavaFX. He also enjoys speaking at conferences and has been a Speaker at user groups, Business of Software, Seybold and JavaOne conferences. He has a very dry sense of humor and an MA in Medieval History for which he has not yet found a practical use.