Mark Stephens Mark has been working with Java and PDF since 1999 and is a big NetBeans fan. He enjoys speaking at conferences. He has an MA in Medieval History and a passion for reading.

Understanding PDF text objects


Inside a PDF, each page is described by a content stream: a sequence of PostScript-like commands that draw the text, images and shapes. You can extract this stream and look at it directly. It looks like this (I have added comments in brackets after each command to explain):

BT (begin a block of text)

/F13 12 Tf (Choose Font F13 and set size to 12)

288 720 Td (Move the text position by 288, 720 relative to the start of the current line)

(ABC) Tj (Draw the Text ABC)

ET (End the text block)
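The operands-then-operator layout of these commands can be read mechanically. Here is a rough sketch in Python, not how any particular PDF library does it: `TOKEN` and `parse_operations` are names invented for this example, and it assumes a simplified stream with no nested or escaped brackets, hex strings, arrays or dictionaries.

```python
import re

# Simplified token pattern (assumption: only the token types used in the example above).
TOKEN = re.compile(
    rb"\((?P<string>[^()]*)\)"       # literal string operand, e.g. (ABC)
    rb"|(?P<name>/[^\s()/]+)"        # name operand, e.g. /F13
    rb"|(?P<number>[-+]?\d*\.?\d+)"  # numeric operand, e.g. 12 or 288
    rb"|(?P<op>[A-Za-z]+\*?)"        # operator, e.g. BT, Tf, Td, Tj, ET
)

def parse_operations(stream: bytes):
    """Group a content stream into (operator, operands) pairs."""
    operations, operands = [], []
    for match in TOKEN.finditer(stream):
        kind = match.lastgroup
        if kind == "op":
            # An operator consumes every operand collected since the last operator.
            operations.append((match.group("op").decode("ascii"), operands))
            operands = []
        elif kind == "string":
            operands.append(match.group("string"))  # kept as raw bytes, not text
        elif kind == "name":
            operands.append(match.group("name").decode("ascii"))
        else:
            operands.append(float(match.group("number")))
    return operations

stream = b"BT /F13 12 Tf 288 720 Td (ABC) Tj ET"
for operator, operands in parse_operations(stream):
    print(operator, operands)
```

Note that the Tj operand is deliberately left as bytes rather than converted to a Python string.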

So far so good, but this code is actually rather deceptive. Most people assume from looking at it that Tj takes a String (ABC), but it does not. The operand is actually a sequence of binary index values (character codes). These are then decoded using the font's built-in encoding. It can be one of the Standard Encodings (WIN, MAC, EXPERT, etc.), which are defined in Appendix D of the PDF Reference. For subsetted fonts (where only the characters used in the PDF are included), the values could be any arbitrary set. They have no meaning until you look them up in the font's custom encoding table (the Differences array in its Encoding object).
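To make the lookup concrete, here is a minimal sketch in Python of decoding Tj values through a Differences table. It rests on several assumptions: `GLYPH_TO_UNICODE` is a tiny hypothetical table (a real decoder would use the full Adobe Glyph List), glyph names are written without their leading slash, Python's `cp1252` codec stands in for WIN encoding (close, but not identical), and the function names are invented for this example.

```python
# Hypothetical, deliberately tiny glyph-name table for this example only.
GLYPH_TO_UNICODE = {"H": "H", "i": "i", "exclam": "!"}

def build_code_map(differences):
    """Turn a Differences array such as [65, 'H', 'i', 'exclam'] into {code: text}."""
    code_map, code = {}, None
    for item in differences:
        if isinstance(item, int):
            code = item  # a number sets the code for the next glyph name
        else:
            if code is None:
                raise ValueError("Differences array must start with a code")
            code_map[code] = GLYPH_TO_UNICODE[item]
            code += 1    # following names take consecutive codes
    return code_map

def decode_tj(raw: bytes, differences=()):
    """Decode Tj values: Differences entries first, then the base encoding."""
    overrides = build_code_map(differences)
    chars = []
    for value in raw:
        if value in overrides:
            chars.append(overrides[value])
        else:
            # Assumption: cp1252 approximates WIN encoding for this sketch.
            chars.append(bytes([value]).decode("cp1252", errors="replace"))
    return "".join(chars)

print(decode_tj(b"ABC"))                             # ABC
print(decode_tj(b"ABC", [65, "H", "i", "exclam"]))   # Hi!
```

The same three values that read as ABC under WIN encoding come out as Hi! once a subsetted font remaps codes 65 to 67.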

The reason they look like text in the example above, and in the examples in the PDF Reference, is that the values for WIN encoding happen to be the same as the ASCII codes for the basic printable characters. So the binary value for A shows up as A if it is WIN encoded.

However, they are not actually text values and should not be treated as such unless you can guarantee that the only PDFs you look at will be WIN encoded. Otherwise you will get a very nasty surprise on some PDFs…
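A small Python sketch shows how the surprise arises even between two of the Standard Encodings. It assumes Python's `cp1252` and `mac_roman` codecs as stand-ins for WIN and MAC encoding (they agree with the PDF tables for the values used here, but are not identical overall); `decode_as` is a helper name invented for this example.

```python
def decode_as(raw: bytes, codec: str) -> str:
    """Interpret the same Tj values under a given stand-in codec."""
    return raw.decode(codec)

ascii_values = b"ABC"               # values in the ASCII range
high_values = bytes([0x8E, 0xA5])   # values above 127

for label, codec in (("WIN", "cp1252"), ("MAC", "mac_roman")):
    print(label, decode_as(ascii_values, codec), decode_as(high_values, codec))
```

Both encodings agree on ABC, but the two high values come out as Ž¥ under the WIN stand-in and é• under the MAC stand-in, so the same bytes are different text depending on the font's encoding.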




2 Replies to “Understanding PDF text objects”

  1. I’m more and more getting into manipulating PDF files with a text editor but am still very much at the beginning. So your blog posts on PDF are just the thing!

    I thought Tj was a string in many cases. It was new to me that, when the font uses WIN encoding, it only looks like one.

    I actually found this blog searching for an answer to this question:
    When only using WIN encoded fonts, is it possible to somehow define a Tj value once (which only contains ASCII characters) and then reference it multiple times throughout different text streams in the document, each time using another font at another size?
