There are several ways to define text coordinates in a PDF. When you see the text on the screen, you see the actual shape of each character. The outline of this shape is known as the ‘visible text box’: if you drew it onscreen, it would just touch the edges of the character.
When a font is designed, most characters are designed with some space around them, and characters differ in width (for example, the letter i is much narrower than the letter m). The characters are all designed to fit a maximum box (known as the fontBox), which includes space for letters which ‘drop down’ (such as q, g, y) and space for ‘tall’ letters (such as l, h, b). All letters have some white space, and narrow letters have more white space on each side.
There is a third potential set of text coordinates within PDF. Text is set to fit within another rectangle (technically called the TRM), which is the gap to put the text in. So any letter will fit inside a box, the font bounding box, which fits into a slot on the page (the TRM). So there are THREE possible sets of coordinates we could use for the text.
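To make the relationship between the three rectangles concrete, here is a minimal sketch in plain Java. It is not JPedal code: the class `TextBoxes`, its field names, and the coordinates are all made up for illustration, and it only uses `java.awt.geom.Rectangle2D` from the standard library. It checks the nesting described above (visible text box inside font bounding box inside TRM slot).

```java
import java.awt.geom.Rectangle2D;

// Hypothetical holder for the three rectangles; not a JPedal class.
class TextBoxes {
    final Rectangle2D visibleTextBox; // tight outline of the drawn character
    final Rectangle2D fontBox;        // maximum box every character in the font fits
    final Rectangle2D trmSlot;        // slot on the page the text is set into

    TextBoxes(Rectangle2D visibleTextBox, Rectangle2D fontBox, Rectangle2D trmSlot) {
        this.visibleTextBox = visibleTextBox;
        this.fontBox = fontBox;
        this.trmSlot = trmSlot;
    }

    // True when the character outline sits inside the font box,
    // and the font box sits inside the slot on the page.
    boolean isNested() {
        return trmSlot.contains(fontBox) && fontBox.contains(visibleTextBox);
    }

    public static void main(String[] args) {
        // Illustrative numbers for a narrow letter such as 'i'.
        TextBoxes letter = new TextBoxes(
                new Rectangle2D.Double(106, 705, 3, 10),   // visible text box
                new Rectangle2D.Double(102, 702, 12, 16),  // font bounding box
                new Rectangle2D.Double(100, 700, 20, 20)); // TRM slot

        System.out.println("nested: " + letter.isNested());
        // White space between the font box edge and the visible character.
        System.out.println("space on left: "
                + (letter.visibleTextBox.getX() - letter.fontBox.getX()));
    }
}
```

Whichever of the three rectangles you report as "the" text coordinates, the other two will generally differ from it, which is why extracted positions can disagree between tools.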
When we wrote our newspaper extraction software (Storypad), we used the TRM to describe text locations. This worked fine for newspaper content, but in many PDFs the TRM boxes actually overlap, which causes major problems when grouping text. So JPedal uses the font bounding box, which avoids this problem.
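The overlap problem can be illustrated with a small sketch. This is not how Storypad or JPedal groups text; `OverlapDemo`, `overlappingPairs`, and the coordinates are hypothetical, chosen so that two adjacent lines have overlapping TRM slots but separate font bounding boxes. It assumes a grouper that treats intersecting boxes as candidates for the same group.

```java
import java.awt.geom.Rectangle2D;

// Hypothetical demo class; the rectangles are invented for illustration.
class OverlapDemo {
    // Counts pairs of boxes that intersect. Each such pair is ambiguous
    // for a grouper that decides membership by box overlap.
    static int overlappingPairs(Rectangle2D[] boxes) {
        int count = 0;
        for (int a = 0; a < boxes.length; a++) {
            for (int b = a + 1; b < boxes.length; b++) {
                if (boxes[a].intersects(boxes[b])) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Two lines of text, one above the other.
        // Assumed TRM slots: tall enough that they share a vertical band.
        Rectangle2D[] trmSlots = {
            new Rectangle2D.Double(50, 700, 200, 16),
            new Rectangle2D.Double(50, 688, 200, 16)
        };
        // Assumed font bounding boxes for the same lines: they stay apart.
        Rectangle2D[] fontBoxes = {
            new Rectangle2D.Double(50, 702, 200, 11),
            new Rectangle2D.Double(50, 690, 200, 11)
        };

        System.out.println("TRM overlaps: " + overlappingPairs(trmSlots));
        System.out.println("font box overlaps: " + overlappingPairs(fontBoxes));
    }
}
```

With these numbers the TRM slots intersect while the font boxes do not, so a grouper working from the font boxes can keep the two lines apart.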
There is an easy way to see the actual text outlines in JPedal: open the file in SimpleViewer and press CTRL-A to select all the text outlines.