There are several ways to define text co-ordinates in PDF. When you see the text onscreen, you can see the actual character. Its outline is known as the ‘visible text box’: if you drew it onscreen, it would just touch the edges of the character.
When a font is designed, most characters are designed with some space around them, and their widths vary (for example, the letter i is much narrower than the letter m). The characters are all designed to fit a maximum box (known as the fontBox), which includes space for letters which ‘drop down’, such as q, g and y, and space for ‘tall’ letters (such as l, h and b). All letters have some white space, and narrow letters have more white space on each side.
There is a third potential set of text co-ordinates within PDF. Text is set to fit within another rectangle (technically called the TRM, from the text rendering matrix that positions it), which is the gap to put the text in. So any letter will fit inside a box, the font bounding box (the fontBox), which fits into a slot on the page (the TRM). So there are THREE possible sets of co-ordinates we could use for the text: the visible text box, the font bounding box and the TRM.
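As a rough illustration of how a font bounding box ends up in a slot on the page, the sketch below maps a box from glyph space into page co-ordinates. It assumes the usual glyph space of 1000 units per text-space unit and a text rendering matrix given as six numbers [a b c d e f]. The class name, method name and sample values are hypothetical and are not part of JPedal's API.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Rectangle2D;

class FontBoxSketch {

    /**
     * Maps a font bounding box given in glyph space (assumed to be 1/1000 of a
     * text-space unit) into page space using the six values of a text
     * rendering matrix [a b c d e f].
     */
    static Rectangle2D fontBoxOnPage(double[] fontBBox, double[] trm) {
        // Scale the glyph-space corners down to text space.
        double llx = fontBBox[0] / 1000.0;
        double lly = fontBBox[1] / 1000.0;
        double urx = fontBBox[2] / 1000.0;
        double ury = fontBBox[3] / 1000.0;
        Rectangle2D box = new Rectangle2D.Double(llx, lly, urx - llx, ury - lly);

        // AffineTransform takes its six values in the same a, b, c, d, e, f order.
        AffineTransform t =
                new AffineTransform(trm[0], trm[1], trm[2], trm[3], trm[4], trm[5]);

        // Taking the bounds keeps the result a rectangle even if the matrix rotates.
        return t.createTransformedShape(box).getBounds2D();
    }

    public static void main(String[] args) {
        double[] fontBBox = {-168, -218, 1000, 898}; // illustrative values only
        double[] trm = {12, 0, 0, 12, 72, 700};      // 12pt text placed at (72, 700)
        Rectangle2D r = fontBoxOnPage(fontBBox, trm);
        System.out.printf("x=%.3f y=%.3f w=%.3f h=%.3f%n",
                r.getX(), r.getY(), r.getWidth(), r.getHeight());
    }
}
```

Because the font bounding box covers every character in the font, the rectangle returned here is the same size for an i as for an m at a given position; only the visible text box changes from character to character.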
When we wrote our newspaper extraction software (Storypad), we used the TRM to describe text locations. This worked fine for newspaper content, but in many PDFs the TRM boxes actually overlap, which causes major problems when grouping text. So JPedal uses the font bounding box, which avoids this problem.
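Overlapping boxes of this kind can be spotted with a simple pairwise intersection test. This is a minimal sketch, assuming each text slot has already been reduced to a java.awt.geom.Rectangle2D in page co-ordinates. The class name, method name and rectangle values are made up for illustration and do not come from JPedal or Storypad.

```java
import java.awt.geom.Rectangle2D;
import java.util.Arrays;
import java.util.List;

class OverlapSketch {

    /** Counts the pairs of rectangles whose interiors intersect. */
    static int countOverlaps(List<Rectangle2D> boxes) {
        int count = 0;
        for (int i = 0; i < boxes.size(); i++) {
            for (int j = i + 1; j < boxes.size(); j++) {
                // intersects() is false when two boxes merely share an edge.
                if (boxes.get(i).intersects(boxes.get(j))) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Hypothetical text slots: the first two overlap vertically, the third is clear.
        List<Rectangle2D> slots = Arrays.asList(
                new Rectangle2D.Double(72, 700, 100, 14),
                new Rectangle2D.Double(72, 690, 100, 14),
                new Rectangle2D.Double(72, 660, 100, 14));
        System.out.println("Overlapping pairs: " + countOverlaps(slots));
    }
}
```

A grouping routine that assumes each slot belongs to exactly one line or block has to decide what to do with every overlapping pair, which is why overlaps make grouping harder.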
There is an easy way to see the actual text outlines in JPedal: open the file in SimpleViewer and press CTRL-A to select all the text outlines.
This post is part of our “Understanding the PDF File Format” series. In each article, we aim to take a specific PDF feature and explain it in simple terms.