This is one of those features that is broken, but it is now too late to fix.
Inside a PDF file, all text data is stored as numeric character codes, and each code is decoded into the actual glyph value (for example, the value 65 is converted into the text value ‘A’). Because the PDF file format is ‘multiplatform’, there are several possible Standard Encodings to use for this conversion (for example, WinAnsi for Windows and MacRoman for standard Mac values). This is because Windows and Mac originally evolved with different character sets and values. Most of the time the values are identical (A is value 65 in both Mac and Windows encoding), but certain accented characters and punctuation marks have different values. So value 132 is Ntilde (the letter N with a wavy line above) in Mac encoding but quotedblbase (double quotes at the bottom of the line) on Windows. So long as we know which translation table to use, this is not a problem, of course….
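As a small illustration of the two look-up tables, here is a sketch in Python. It assumes that Python's built-in `cp1252` and `mac_roman` codecs are close enough stand-ins for the PDF WinAnsi and MacRoman tables; they are not identical to the tables in the PDF specification, and the function name `decode_both` is made up for this example.

```python
# Python's built-in codecs stand in for the PDF tables here (an approximation,
# not the exact tables from the PDF specification).
WIN = "cp1252"      # close to PDF WinAnsi encoding
MAC = "mac_roman"   # close to PDF MacRoman encoding


def decode_both(value):
    """Return (Windows character, Mac character) for one byte value."""
    raw = bytes([value])
    return (raw.decode(WIN, errors="replace"),
            raw.decode(MAC, errors="replace"))


# 65 gives the same letter in both tables; 132 does not.
print(65, decode_both(65))
print(132, decode_both(132))
```

Value 65 comes back as ‘A’ from both tables, while 132 comes back as the low double quote from the Windows table and as the N with a tilde from the Mac table.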
The issue comes with embedded TrueType fonts, because the PDF file will always list them as Mac encoded (which is what the specification says they should be) even when they are actually Windows encoded. Using the wrong look-up table does not matter for most values (as the results are identical), but it does break certain letters.
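To get a feel for which values are at risk, the sketch below lists the byte values that decode differently under the two tables. Again, it assumes Python's `cp1252` and `mac_roman` codecs as approximations of the PDF tables, and `differing_values` is a hypothetical helper name.

```python
def differing_values():
    """List the byte values that decode differently in the two tables."""
    result = []
    for value in range(256):
        raw = bytes([value])
        # errors="replace" covers the few bytes cp1252 leaves undefined.
        win = raw.decode("cp1252", errors="replace")
        mac = raw.decode("mac_roman", errors="replace")
        if win != mac:
            result.append(value)
    return result


differing = differing_values()
print(len(differing), "values differ; the lowest is", min(differing))
```

With these two codecs, every value below 128 agrees, so plain English text is safe and only the upper half of the table can go wrong.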
So what you need to do is figure out for yourself whether the font is actually Windows or Mac encoded and ignore the setting in the PDF file. There is (of course) no documented way to do this, and several values can appear as different values in either…
What we did was develop some heuristics to work it out, which we continually test against known files and tweak as needed. They look at the actual font values present, see whether Windows or Mac encoding gives a ‘better fit’ and check certain key values. They also need to factor in the fact that the font may be subsetted, so only a selection of values will be present.
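The ‘better fit’ idea can be sketched roughly as follows. This is not our production heuristic, only a minimal illustration under stated assumptions: `codes_used` is a hypothetical list of character codes found in the PDF text, `chars_in_font` is a hypothetical set of the characters the (possibly subsetted) font actually contains, and Python's `cp1252` and `mac_roman` codecs stand in for the real tables.

```python
def guess_encoding(codes_used, chars_in_font):
    """Score each table by how many used codes land on a character
    that is really present in the font. Returns (guess, scores),
    where guess is "WIN", "MAC", or None when the scores tie."""
    scores = {}
    for name, codec in (("WIN", "cp1252"), ("MAC", "mac_roman")):
        hits = 0
        for code in codes_used:
            ch = bytes([code]).decode(codec, errors="replace")
            if ch in chars_in_font:
                hits += 1
        scores[name] = hits
    if scores["WIN"] == scores["MAC"]:
        return None, scores          # not enough evidence either way
    return max(scores, key=scores.get), scores


# The font holds a low double quote, so code 132 only fits the Windows table.
print(guess_encoding([65, 132], {"A", "\u201e"}))
# A subset containing only 'A' cannot tell the two tables apart.
print(guess_encoding([65], {"A"}))
```

The second call shows why subsetting makes life harder: if only the shared values are present, both tables fit equally well and the sketch has to give up, which is where checks on specific key values would come in.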
So if you get some odd characters when working with PDF files containing TrueType fonts, this may well be the reason. And if you come across a file displayed in our PDF viewer which has some odd characters, please do send us the file so we can continue to improve our code.
This post is part of our “Understanding the PDF File Format” series. In each article, we discuss a PDF feature, bug, gotcha or tip. If you wish to learn more about PDF, our series index collects 13 years’ worth of PDF knowledge and tips.