The devil is always in the detail with the PDF spec. I have been working on a PDF file where the hyphen character was not appearing in the converted HTML5 output. This was odd, as I have seen hyphens come through correctly on loads of other samples. So we drilled down to see what was going on…
There are several ways to map glyph indices onto the characters that are actually displayed. One of them uses a set of character mapping tables (Appendix D in the PDF spec if you want to look it up). There are then a whole load of exceptions to these tables, and I had not coded one of them correctly. The missing one was:
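The hyphen character is also encoded as 255 in WinAnsiEncoding. The meaning of this duplicate code is “soft hyphen,” but it is typographically the same as hyphen.
Note that the codes in the Appendix D tables are octal, so 255 here is 173 in decimal, or 0xAD in hex.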
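To make the exception concrete, here is a minimal Java sketch of how it might be handled. It is not the actual code from our library: the class name WinAnsiHyphenSketch and its decode methods are made up for illustration, and falling back to Java's windows-1252 charset is an assumption standing in for the full Appendix D table. Left to itself, that charset decodes this byte to U+00AD (the Unicode soft hyphen), which a browser normally only shows at a line break, so the sketch overrides it with a plain hyphen.

```java
import java.nio.charset.Charset;

// Hypothetical helper, for illustration only.
final class WinAnsiHyphenSketch {

    // Appendix D lists codes in octal: 0255 octal is 173 decimal (0xAD).
    static final int SOFT_HYPHEN_CODE = 0255;

    // Assumption: windows-1252 stands in for the full WinAnsiEncoding table.
    private static final Charset CP1252 = Charset.forName("windows-1252");

    /** Decodes a single WinAnsiEncoding code, applying the hyphen exception. */
    static char decode(int code) {
        if (code == SOFT_HYPHEN_CODE) {
            // The duplicate code is typographically the same as hyphen.
            return '-';
        }
        return new String(new byte[] {(byte) code}, CP1252).charAt(0);
    }

    /** Decodes a run of single-byte WinAnsiEncoding codes. */
    static String decode(byte[] codes) {
        StringBuilder sb = new StringBuilder(codes.length);
        for (byte b : codes) {
            sb.append(decode(b & 0xFF)); // mask so the byte is treated as unsigned
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 'a', the duplicate hyphen code, 'b'
        System.out.println(decode(new byte[] {'a', (byte) 0255, 'b'})); // prints a-b
    }
}
```

Writing the constant as the octal literal 0255 keeps the code visibly in step with the way the spec tables are written.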
A quick fix, a regression test, and a reset of the baseline on the regression tests to lock in the fix, and it is all resolved. But it is a really good example of the complexity of the PDF specification. Do you have any favourite gotchas in PDF?
IDRsolutions develop a Java PDF library, a PDF forms to HTML5 converter, a PDF to HTML5 or SVG converter and a Java Image Library that doubles as an ImageIO replacement. On the blog, our team post about anything interesting they learn.