PDF text co-ordinates

There are several ways to define text co-ordinates in a PDF. When you see text onscreen, you see the actual shape of each character. The outline of that shape is known as the ‘visible text box’: if you drew it onscreen, it would just touch the edges of the character.

When a font is designed, most characters are designed with some space around them, and their widths vary (for example, the letter i is much narrower than the letter m). The characters are all designed to fit within a maximum box (known as the fontBox), which includes space for letters that ‘drop down’, such as q, g and y, and space for ‘tall’ letters, such as l, h and b. All letters have some white space, and narrow letters have more white space on each side.
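To make the fontBox idea concrete, here is a minimal sketch of how a font's bounding box could be turned into a rectangle on the page. It assumes the common convention that the font's bounding box is given as four numbers (left, bottom, right, top) in units of 1/1000 of the font size, and it ignores rotation and horizontal scaling. The class name, method name and all the numbers are made up for illustration; this is not JPedal code.

```java
import java.awt.geom.Rectangle2D;

public class FontBoxSketch {

    /**
     * Scales a font bounding box (left, bottom, right, top, in 1/1000 units)
     * to page units and places it at the glyph origin on the baseline.
     * PDF co-ordinates have y increasing upwards, so (x, y) of the returned
     * rectangle is its lower-left corner.
     */
    static Rectangle2D.Double fontBoxOnPage(double[] fontBBox, double fontSize,
                                            double originX, double baselineY) {
        double scale = fontSize / 1000.0;
        double x = originX + fontBBox[0] * scale;          // left edge
        double y = baselineY + fontBBox[1] * scale;        // bottom edge (below the baseline for descenders)
        double w = (fontBBox[2] - fontBBox[0]) * scale;    // same width for every character in the font
        double h = (fontBBox[3] - fontBBox[1]) * scale;    // room for both 'tall' and 'drop down' letters
        return new Rectangle2D.Double(x, y, w, h);
    }

    public static void main(String[] args) {
        // Hypothetical font bounding box and a 12pt font drawn at (72, 700).
        double[] fontBBox = {-100, -200, 1000, 900};
        Rectangle2D.Double box = fontBoxOnPage(fontBBox, 12, 72, 700);
        System.out.println(box);
    }
}
```

Note that the box is the same size for an i as for an m, which is why narrow letters end up with more white space on each side.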

There is a third potential set of text co-ordinates within PDF. Text is set to fit within another rectangle (technically called the TRM), which is the gap to put the text in. So any letter will fit inside a box, the font bounding box (the fontBox described above), which in turn fits into a slot on the page (the TRM). That gives us THREE possible sets of co-ordinates we could use for the text.

When we wrote our newspaper extraction software (Storypad), we used the TRM to describe text locations. This worked fine for newspaper content, but in many PDFs the TRM boxes actually overlap, which causes major problems when grouping text. So JPedal uses the font bounding box, which avoids this problem.
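The overlap problem is easy to picture in code. The sketch below takes a list of text rectangles and reports which pairs overlap, which is the situation that makes grouping ambiguous. It uses only the standard java.awt.geom.Rectangle2D class; the class name, method name and co-ordinates are hypothetical and are not taken from Storypad or JPedal.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

public class OverlapSketch {

    /**
     * Returns the index pairs (i, j), with i < j, of rectangles whose
     * interiors overlap. Rectangles that only share an edge do not count.
     */
    static List<int[]> overlappingPairs(List<Rectangle2D> boxes) {
        List<int[]> pairs = new ArrayList<>();
        for (int i = 0; i < boxes.size(); i++) {
            for (int j = i + 1; j < boxes.size(); j++) {
                if (boxes.get(i).intersects(boxes.get(j))) {
                    pairs.add(new int[] {i, j});
                }
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Hypothetical text boxes: the first two overlap, the third is separate.
        List<Rectangle2D> boxes = new ArrayList<>();
        boxes.add(new Rectangle2D.Double(0, 0, 10, 10));
        boxes.add(new Rectangle2D.Double(8, 0, 10, 10));
        boxes.add(new Rectangle2D.Double(30, 0, 10, 10));

        for (int[] pair : overlappingPairs(boxes)) {
            System.out.println("Boxes " + pair[0] + " and " + pair[1] + " overlap");
        }
    }
}
```

A grouping routine that treats overlapping boxes as belonging together would merge text that should stay apart, so choosing a set of co-ordinates that avoids overlaps matters.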

There is an easy way to see the actual text outlines in JPedal: open the file in SimpleViewer and press CTRL-A to select all the text outlines.

This post is part of our “Understanding the PDF File Format” series. In each article, we aim to take a specific PDF feature and explain it in simple terms. If you wish to learn more about PDF, we have 13 years' worth of PDF knowledge and tips in our series index.

 


Mark Stephens

System Architect and Lead Developer at IDRSolutions
Mark Stephens has been working with Java and PDF since 1999 and has diversified into HTML5, SVG and JavaFX. He also enjoys speaking at conferences and has been a Speaker at user groups, Business of Software, Seybold and JavaOne conferences. He has a very dry sense of humor and an MA in Medieval History for which he has not yet found a practical use.