Understanding PDF text objects

Mark Stephens

14 years ago

Understanding PDF Text Objects

Inside a PDF is a Postscript stream of commands which describe the page – they draw the text, images or shapes. You can extract this stream and look at it directly. It looks like this -I have added comments in brackets after each command to explain.

BT (begin a block of text)

/F13 12 Tf (Choose Font F13 and set size to 12)

288 720 Td (move the location relative from where it now is

(ABC) Tj (Draw the Text ABC)

ET (End the text block)

So far so good, but this code is actually rather deceptive. Most people assume from looking at it that Tj take a String (ABC), but it does not. It actually contains a set of binary index values. These are then decoded using the Fonts inbuilt decoding – it can be one of the Standard Encodings (WIN, MAC, EXPERT, etc) which are defined in Appendix D of the PDFReference. For subsetted fonts (where only the characters used in the PDF are included) they could be any arbitary set of values – they will have no meaning until you look them up with the Fonts custom encoding table (the Differences Object).

The reason they look like text in the example above and those in the PDF Reference guide are because the vales for WIN encoding happen to be the same as the ASCII characters. So the binary value for A shows up as A if it is WIN encoded.

However, they are not actually text values and should not be treated as such unless you can guarantee that the only PDFs you look at will be WIN encoded. Otherwise you will get a very nasty surprise on some PDFs…