The devil is always in the detail with the PDF spec. I have been working on a PDF file where the hyphen character was not appearing in the converted HTML5 output. This was odd, as I have seen hyphens come through correctly on loads of other samples. So we drilled down to see what was going on…
There are several ways to map glyph indices onto the actual characters that are displayed. One of them uses a set of character mapping tables (Appendix D in the PDF spec if you want to look it up). There are then a whole load of exceptions to these tables, and I had not coded one of them correctly. The one missing was:
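The hyphen character is also encoded as 255 in WinAnsiEncoding. The meaning of this duplicate code is “soft hyphen,” but it is typographically the same as hyphen.
Note that the tables in Appendix D list character codes in octal, so the 255 here is octal 255 (0xAD, or 173 in decimal), not decimal 255.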
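To make the exception concrete, here is a minimal Python sketch of the idea. It is not the actual fix in our code: the function name `decode_win_ansi` is made up for illustration, and Python's built-in `cp1252` codec is only used as a rough stand-in for the WinAnsiEncoding table in Appendix D (the two differ on a few unused codes).

```python
# Illustrative sketch only: decode_win_ansi is a hypothetical helper, and
# cp1252 is used as an approximation of the WinAnsiEncoding table.

SOFT_HYPHEN_CODE = 0o255  # octal 255 == 0xAD == decimal 173
HYPHEN = "\u002d"         # the ordinary hyphen, normally at code 0x2D


def decode_win_ansi(code: int) -> str:
    """Map one WinAnsiEncoding byte value to the character to display."""
    if not 0 <= code <= 0xFF:
        raise ValueError(f"not a single-byte code: {code}")
    # The exception from the spec: the duplicate "soft hyphen" code is
    # typographically the same as hyphen, so show a visible hyphen.
    if code == SOFT_HYPHEN_CODE:
        return HYPHEN
    # Everything else falls through to the plain table lookup.
    return bytes([code]).decode("cp1252", errors="replace")


if __name__ == "__main__":
    for code in (0x2D, 0o255, 0x41):
        print(oct(code), repr(decode_win_ansi(code)))
```

Without the special case, the table lookup returns U+00AD (soft hyphen), which browsers usually do not draw, and that matches the missing hyphen we saw in the HTML5 output.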
A quick fix, a regression test, and a reset of the baseline on the regression tests to lock in the fix, and it is all resolved. But it is a really good example of the complexity of the PDF specification. Do you have any favourite gotchas in PDF?