In a PDF file, each font has an Encoding value which defines exactly which glyph is used for each character code. There are some standard settings (MacRoman, WinAnsi, Standard), or you can build your own Encoding table. There is a standard set of glyph names (A, B, fl, fi, quote), but you can call your glyphs anything you like. The name is essentially just a unique ID used to map the values internally. If you do not use standard names, you might get garbage when the text is extracted, but the file will display perfectly, which is all most users look at.
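As a rough illustration of how a custom Encoding table overrides a standard one, here is a minimal Python sketch. It assumes the custom entries arrive in the shape of a PDF Differences array (a character code followed by one or more glyph names, each name taking the next code in turn). The function name `apply_differences` and the tiny `base` table are hypothetical and only there for illustration; a real base encoding has many more entries.

```python
def apply_differences(base, differences):
    """Return a copy of `base` (code -> glyph name) with a
    Differences-style array applied on top of it."""
    table = dict(base)
    code = None
    for item in differences:
        if isinstance(item, int):
            # An integer resets the code that the next name is assigned to.
            code = item
        else:
            if code is None:
                raise ValueError("Differences array must start with a code")
            # A name is assigned to the current code, which then advances.
            table[code] = item
            code += 1
    return table


# Hypothetical slice of a standard encoding, for illustration only.
base = {65: "A", 66: "B", 97: "a"}

# Hypothetical custom entries: three bespoke names from code 33, plus one override.
custom = apply_differences(base, [33, "a.sc", "b.sc", "c.sc", 66, "fi"])
```

The viewer only needs the name to find the right outline in the font, which is why any unique name works for display. Text extraction, on the other hand, has to guess what `a.sc` means.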
If you convert a PDF file to HTML5 or SVG, things become more complex. If you are mapping the glyphs onto actual text values, you need to take more care. Firstly, some browsers will reject certain ranges of characters, so these need to be remapped onto sensible values. It also starts to matter if you have used arbitrary values.
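One way to picture the remapping step is the sketch below. It is only an illustration under stated assumptions: it treats the C0 controls, DEL and the C1 controls as the ranges a browser may refuse to render, and it moves them into the Unicode Private Use Area starting at U+E000. Both the choice of ranges and the choice of target are assumptions, not a description of what any particular converter does, and the names `is_unsafe` and `remap_unsafe` are hypothetical.

```python
PUA_START = 0xE000  # first code point of the Unicode Private Use Area


def is_unsafe(cp):
    # Assumption: C0 controls, DEL and C1 controls are the problem ranges.
    return cp < 0x20 or 0x7F <= cp <= 0x9F


def remap_unsafe(codes):
    """Map every code to itself, except unsafe ones, which are given
    consecutive Private Use Area code points."""
    mapping = {}
    next_pua = PUA_START
    for cp in sorted(set(codes)):
        if is_unsafe(cp):
            mapping[cp] = next_pua
            next_pua += 1
        else:
            mapping[cp] = cp
    return mapping


# Hypothetical set of codes used by one font.
mapping = remap_unsafe([1, 65, 0x80])
```

The embedded font then has to be rewritten so that its glyphs sit at the new code points, otherwise the remapped text would no longer select the right shapes.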
Here is the data from a PDF I have been looking at. It includes some custom small caps characters, so the creating tool has generated some bespoke glyph names (a.sc for SMALL CAPS A, and so on).
So to fix this we would either write out text values 33-50 so that they map onto the embedded font, or move them. Because it is only a limited set of values, we could actually map them onto a or A and restructure the fonts. We would probably need a larger sample size to decide on the best approach. Alternatively, we could convert the text to shapes.
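The option of mapping the small caps onto a or A could look something like the following sketch. It assumes the bespoke names always follow the pattern of a single letter plus a `.sc` suffix, which is a guess based on the one example above, and it falls back to the raw character code for any other name. The function names and the `sample` table are hypothetical.

```python
def base_letter(glyph_name, upper=False):
    """Return the plain letter for a name like 'a.sc', or None if the
    name does not match the assumed small caps pattern."""
    stem, dot, suffix = glyph_name.partition(".")
    if dot and suffix == "sc" and len(stem) == 1 and stem.isalpha():
        return stem.upper() if upper else stem.lower()
    return None


def text_map(code_to_glyph, upper=False):
    """Build code -> text value, replacing small caps names with letters."""
    result = {}
    for code, name in code_to_glyph.items():
        letter = base_letter(name, upper)
        # Assumption: for any other name, the code itself is the text value.
        result[code] = letter if letter is not None else chr(code)
    return result


# Hypothetical encoding entries, for illustration only.
sample = {33: "a.sc", 34: "b.sc", 65: "A"}
lower = text_map(sample)
upper = text_map(sample, upper=True)
```

With this mapping, extracted or selected text reads as ordinary letters, but the small caps shapes and the normal letters would now compete for the same code points, which is why the fonts would need restructuring.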
But it is a good example of how PDF to HTML5 and SVG conversion is not always a straightforward process.