Understanding PDF Text Objects
Inside a PDF, each page has a content stream: a sequence of PostScript-like commands which describe the page by drawing its text, images and shapes. You can extract this stream and look at it directly. It looks like this (I have added comments in brackets after each command to explain).
BT (begin a block of text)
/F13 12 Tf (Choose Font F13 and set size to 12)
288 720 Td (Move the text position by 288 across and 720 up, relative to the start of the current line)
(ABC) Tj (Draw the Text ABC)
ET (End the text block)
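To make the structure of that snippet concrete, here is a minimal Python sketch that splits a content stream into operators and their operands. As in PostScript, the operands come first and the operator comes last. The name `parse_ops` and the regular expressions are illustrative assumptions, not part of any PDF library, and the tokenizer only copes with the simple cases in the sample above (no nested parentheses, hex strings, arrays or dictionaries).

```python
import re

# Literal string | name | any other run of regular characters (number or operator).
TOKEN = re.compile(rb"\((?:\\.|[^\\()])*\)|/[^\s/()<>\[\]]+|[^\s/()<>\[\]]+")
NUMBER = re.compile(rb"[+-]?\d*\.?\d+")

def parse_ops(stream):
    """Return a list of (operator, operands) pairs from a simple content stream."""
    ops, operands = [], []
    for tok in TOKEN.findall(stream):
        is_operand = tok[:1] in (b"(", b"/") or NUMBER.fullmatch(tok)
        if is_operand:
            operands.append(tok)  # operands are kept as raw bytes
        else:
            ops.append((tok.decode("ascii"), operands))
            operands = []  # start collecting operands for the next operator
    return ops

SAMPLE = b"BT\n/F13 12 Tf\n288 720 Td\n(ABC) Tj\nET\n"

for operator, operands in parse_ops(SAMPLE):
    print(operator, operands)
```

Note that the operand of Tj is deliberately left as bytes rather than converted to a Python string.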
So far so good, but this snippet of PDF commands is actually rather deceptive. Most people assume from looking at it that Tj takes a text string (ABC), but it does not. The operand actually contains a sequence of binary character codes. These are decoded using the font's built-in encoding, which can be one of the standard encodings (WIN, MAC, EXPERT, etc., formally WinAnsiEncoding, MacRomanEncoding and MacExpertEncoding) defined in Appendix D of the PDF Reference. For subsetted fonts (where only the characters used in the PDF are included) the codes could be any arbitrary set of values; they have no meaning until you look them up in the font's custom encoding table (the Differences array in the font's Encoding dictionary).
The reason the Tj values look like text in the BT ... ET example above, and in the examples in the PDF Reference, is that the values for WIN encoding happen to be the same as ASCII for the basic printable characters. So the binary value for A shows up as A if it is WIN encoded.
However, they are not actually text values and should not be treated as such unless you can guarantee that the only PDFs you look at will be WIN encoded. Otherwise you will get a very nasty surprise on some PDFs…
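A quick way to see the kind of surprise in question is to decode the same bytes under two different encodings. The Python sketch below uses the standard library codecs `cp1252` and `mac_roman` as approximations of WIN and MAC encoding. That is an assumption made for convenience: the PDF encodings are closely related to these code pages but are not guaranteed to match them in every position.

```python
# Printable ASCII range: codes 0x20 to 0x7E.
printable = bytes(range(0x20, 0x7F))

# Within this range both encodings agree with ASCII, so the bytes "look like text".
same_in_win = printable.decode("cp1252") == printable.decode("ascii")
same_in_mac = printable.decode("mac_roman") == printable.decode("ascii")

# Outside it they diverge: the same byte is a different character in each.
code = b"\x93"
as_win = code.decode("cp1252")
as_mac = code.decode("mac_roman")

print(same_in_win, same_in_mac, as_win != as_mac)  # prints: True True True
```

With a subsetted font and a custom encoding, even the printable ASCII range offers no such guarantee.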
I’m more and more getting into manipulating PDF files with a text editor but am still very much at the beginning. So your blog posts on PDF are just the thing!
I thought Tj took a string in many cases. That it only looks like one when the font uses WIN encoding was new to me.
I actually found this blog searching for an answer to this question:
When only using WIN encoded fonts, is it possible to somehow define a Tj value once (which only contains ASCII characters) and then reference it multiple times throughout different text streams in the document, each time using another font at another size?
You could create a Form which contains the content, but this cannot be done in a text editor and is probably not going to save much space.