I came across a rather intriguing problem while debugging a font issue in a PDF created by a tool called A-PDF.
The PDF specification has a great deal of flexibility, and one of the many tricks you can do is to redefine how character values are mapped onto glyphs. You can create a custom encoding to map values onto glyph names. This is really useful if you want to embed a font with just a few characters to save space. This is done with a Differences object and would look something like this if viewed in the raw file:
427 0 obj
/Differences [ 2 /A/B/euro ]
So value 2 maps onto glyph ‘A’, value 3 maps onto ‘B’ and value 4 maps onto the euro character.
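The counting rule above can be sketched in a few lines of Python. This is only an illustration: parse_differences is a hypothetical helper, not part of any PDF library, and it assumes the array text contains nothing but integers and /Names (no comments or other PDF delimiters).

```python
import re

def parse_differences(array_text):
    """Map character codes to glyph names from the text of a /Differences array.

    Hypothetical helper for illustration; assumes only integers and /Names.
    """
    mapping = {}
    code = None
    # Each token is either an integer (a new starting code) or a /Name.
    for number, name in re.findall(r"(\d+)|/([^\s/\[\]]+)", array_text):
        if number:
            code = int(number)       # restart counting from this value
        elif code is not None:
            mapping[code] = name     # assign the name to the current code
            code += 1                # the next name gets the next code
    return mapping

print(parse_differences("[ 2 /A/B/euro ]"))
```

A number resets the counter and each following name takes the next value, which is why a single array can describe several separate runs of codes.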
There is a list of all the standard glyph values, and the standard mappings used if you do not create your own. Appendix D of the PDF Reference lists the standard encodings and all the glyph names. Most of the time, you do not need to define your own values and can just use the already prepared tables – StandardEncoding, MacRomanEncoding, WinAnsiEncoding and so forth.
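As a rough sketch of how a custom mapping sits on top of a prepared table: the Differences entries simply override the base encoding for the codes they mention. BASE_EXCERPT below is a made-up three-entry excerpt rather than a full standard table, and apply_differences is a hypothetical name.

```python
# Made-up excerpt of a base encoding (codes 65-67 follow ASCII).
BASE_EXCERPT = {65: "A", 66: "B", 67: "C"}

def apply_differences(base, differences):
    """Return a new code-to-glyph-name table with the differences applied."""
    encoding = dict(base)         # copy so the base table is left untouched
    encoding.update(differences)  # entries from Differences win
    return encoding

custom = apply_differences(BASE_EXCERPT, {2: "A", 3: "B", 4: "euro"})
print(custom[2], custom[65])
```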
Where it gets slightly messy is that not only can you define your own Encoding but you can create your own glyphs. The glyph name is just a key value used to look up font data in other tables. So long as you are consistent, any value should be possible. So you could have a Differences object along the lines of:
427 0 obj
/Differences [ 2 /AnyName/SillyName/MyNewGlyph ]
Most of the time, this works fine, but what about this value, taken from the problem file?
427 0 obj
/Differences [ 2 /#23#234CH2eb0c8ba15de4cce8fa3c169622f8e93 /#23#2346H7a539460a8268e5915c0973dbb05dce1
Usually, the # character indicates that the next 2 characters are a hexadecimal value. In theory, so long as we are consistent, it should not matter. But this is what it looks like in Acrobat.
So for the values 1 and 2, we need to strip out the number values after the # so that #23#234CH2eb0c8ba15de4cce8fa3c169622f8e93 becomes ##4CH2eb0c8ba15de4cce8fa3c169622f8e93
But for value 3, we strip the first 2. Why do we need to strip some of the values?
I can only guess that the presence of non-numeric values makes the numbers invalid or that we strip the first 2 values, but I can’t find any clear rules. I am just guessing. So if you see the # character in a PDF name, be careful…
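For comparison, the PDF Reference's description of name objects says that, from PDF 1.2 onward, # followed by two hexadecimal digits stands for the single character with that code, and 23 is the hex code for # itself. Here is a minimal Python sketch of that reading. decode_pdf_name is a hypothetical helper, and the sketch does not claim to settle what Acrobat displays for every value.

```python
import re

def decode_pdf_name(raw):
    """Replace each #xx (two hex digits) with the character it encodes.

    Hypothetical helper; a single pass, so a decoded '#' is never re-read
    as the start of another escape.
    """
    return re.sub(
        r"#([0-9A-Fa-f]{2})",
        lambda match: chr(int(match.group(1), 16)),
        raw,
    )

first = decode_pdf_name("#23#234CH2eb0c8ba15de4cce8fa3c169622f8e93")
second = decode_pdf_name("#23#2346H7a539460a8268e5915c0973dbb05dce1")
print(first)
print(second)
```

Under this reading the first name decodes to the ##4CH… form quoted above, and the second is handled by exactly the same rule.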
And if you know the exact rules that should be applied here, why not post and explain what’s really going on?
This post is part of our “Understanding the PDF File Format” series. In each article, we aim to take a specific PDF feature and explain it in simple terms.