Mark Stephens Mark has been working with Java and PDF since 1999 and is a big NetBeans fan. He enjoys speaking at conferences. He has an MA in Medieval History and a passion for reading.

Glyph names- what is in a name?


I came across a rather intriguing problem while debugging a font issue in a PDF file created by a tool called A-PDF.
The PDF specification has a great deal of flexibility, and one of the many tricks you can do is to redefine how character values are mapped onto glyphs. You can create a custom encoding to map values onto glyphs. This is really useful if you want to embed a font with just a few characters to save space. This is done with a Differences object and would look something like this if viewed in the raw file:
427 0 obj
<<
/Differences [ 2 /A/B/euro ]
So value 2 maps onto glyph ‘A’, value 3 maps onto ‘B’ and value 4 maps onto the euro character.
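As a rough illustration of that numbering rule, here is a minimal Python sketch. It assumes the Differences array has already been tokenised into a list of integers and glyph-name strings (without the leading slash); the function name parse_differences is hypothetical and not taken from any PDF library.

```python
def parse_differences(tokens):
    """Map character codes to glyph names from a tokenised Differences array.

    An integer token sets the next code to assign; each name token that
    follows takes the current code, which then advances by one.
    """
    mapping = {}
    code = None
    for token in tokens:
        if isinstance(token, int):
            code = token  # start of a new run of consecutive codes
        else:
            if code is None:
                raise ValueError("glyph name found before any starting code")
            mapping[code] = token
            code += 1
    return mapping


# The example above: /Differences [ 2 /A/B/euro ]
print(parse_differences([2, "A", "B", "euro"]))
```

An array may contain several runs, each introduced by its own starting number, which is why the sketch resets the code whenever it meets an integer.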
There is a list of all the standard glyph values, and the standard mappings used if you do not create your own. Appendix D of the PDF Reference lists the standard encodings and all the glyph names. Most of the time, you do not need to define your own values and can just use the already prepared tables – StandardEncoding, MacRomanEncoding, WinAnsiEncoding and so forth.
Where it gets slightly messy is that not only can you define your own Encoding, but you can also create your own glyph names. The glyph name is just a key used to look up font data in other tables. So long as you are consistent, any value should be possible. So you could have a Differences object along the lines of:
427 0 obj
<<
/Differences [ 2 /AnyName/SillyName/MyNewGlyph ]
Most of the time, this works fine, but what about this value, taken from the problem file?
427 0 obj
<<
/Differences [ 2 /#23#234CH2eb0c8ba15de4cce8fa3c169622f8e93 /#23#2346H7a539460a8268e5915c0973dbb05dce1
/;#2323#2323#2323#2323#2323#2323#2323#2323#2323#2323#2323#2323#2323#2323#2323#2323
/g47
Usually, the # character indicates that the next 2 characters are a two-digit hexadecimal value giving a character code. In theory, so long as we are consistent, it should not matter. But this is what it looks like in Acrobat.
 
So for the first two names, we need to strip out the number values after each # so that #23#234CH2eb0c8ba15de4cce8fa3c169622f8e93 becomes ##4CH2eb0c8ba15de4cce8fa3c169622f8e93.
But for the third name, we strip only the first 2. Why do we need to strip some of the values?
I can only guess that the presence of non-numeric values makes the numbers invalid, or that we strip the first 2 values, but I can’t find any clear rules. I am just guessing. So if you see the # character in a PDF name, be careful…
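For comparison, here is a minimal Python sketch of one possible reading: every # followed by two hexadecimal digits is replaced, in a single pass, by the character with that code, which is how the PDF Reference describes escapes in name objects. The function name decode_pdf_name is hypothetical, and this is not necessarily what Acrobat does internally.

```python
import re

# A '#' followed by exactly two hexadecimal digits.
_ESCAPE = re.compile(r"#([0-9A-Fa-f]{2})")


def decode_pdf_name(raw):
    """Replace each #xx escape with the character whose code is 0xXX.

    The substitution is a single left-to-right pass, so a '#' produced
    by decoding #23 is never treated as the start of another escape.
    """
    return _ESCAPE.sub(lambda match: chr(int(match.group(1), 16)), raw)


# Names taken from the problem file (the third one is shortened here).
for name in ("#23#234CH2eb0c8ba15de4cce8fa3c169622f8e93",
             ";#2323#2323",
             "g47"):
    print(name, "->", decode_pdf_name(name))
```

Under this reading the first problem name comes out as ##4CH2eb0c8ba15de4cce8fa3c169622f8e93, and the third decodes to a run of ;#23#23 and so on, because each #23 becomes # and the 23 that follows it is left as literal text.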
And if you know the exact rules which should be applied here, why not post and explain what’s really going on?



One Reply to “Glyph names- what is in a name?”

  1. Mark,

    I think that #23 is supposed to represent the Ascii character 0x23, which is “#”. It’s just confusing because the escape character is also #.

    If you preprocess the “#23#23CH2eb0” replacing every #nn with its ASCII equivalent, you’d get “##CH2eb0”. But somebody stuck the un-preprocessed string in the Differences dictionary.
