When you have a PDF file, each font has an Encoding value which defines exactly which glyph is used for each character code. There are some standard settings (MacRomanEncoding, WinAnsiEncoding, StandardEncoding) or you can build your own Encoding table. There is a standard set of glyph names (A, B, fl, fi, quotesingle) but you can call your glyphs anything you like. The name is essentially just a unique ID used to map the values internally. If you do not use standard names, you might get garbage when the text is extracted, but the page will still display perfectly, which is what most users look at.
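To make the mapping concrete, here is a minimal Python sketch of how a custom Encoding table can be built from a base table plus a PDF-style Differences list, and why non-standard names can come out as garbage on extraction. The two lookup tables are tiny illustrative slices rather than real encodings, and the function names (`apply_differences`, `extract_text`) are my own, not taken from any PDF library.

```python
from typing import Dict, List, Union

# Illustrative slice of a base encoding (character code -> glyph name).
# A real table such as WinAnsiEncoding covers up to 256 codes.
BASE_ENCODING: Dict[int, str] = {65: "A", 66: "B", 97: "a", 98: "b"}

# Illustrative slice of a glyph-name -> Unicode lookup for standard names.
GLYPH_TO_UNICODE: Dict[str, str] = {
    "A": "A", "B": "B", "a": "a", "b": "b",
    "fi": "\ufb01", "fl": "\ufb02",
}


def apply_differences(base: Dict[int, str],
                      differences: List[Union[int, str]]) -> Dict[int, str]:
    """Overlay a Differences list on a base encoding.

    The list alternates a starting code with one or more glyph names;
    each name is assigned to consecutive codes from that starting code.
    """
    encoding = dict(base)
    code = None
    for item in differences:
        if isinstance(item, int):
            code = item  # new starting code
        else:
            if code is None:
                raise ValueError("glyph name appears before any code")
            encoding[code] = item
            code += 1
    return encoding


def extract_text(codes: List[int], encoding: Dict[int, str]) -> str:
    """Turn character codes into text; unknown names become U+FFFD."""
    return "".join(
        GLYPH_TO_UNICODE.get(encoding.get(c), "\ufffd") for c in codes
    )


custom = apply_differences(BASE_ENCODING, [33, "a.sc", "b.sc", 65, "fi"])
text = extract_text([33, 66], custom)
```

Here code 33 resolves to the bespoke name `a.sc`, which the lookup does not know, so the extracted text starts with a replacement character even though the glyph itself would draw correctly.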
If you convert a PDF file to HTML5 or SVG, things become more complex. If you are mapping the glyphs onto actual text values, you need to take more care. Firstly, some browsers will reject certain ranges of characters, so they need to be remapped onto sensible values. It also starts to matter if you have used arbitrary values.
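The remapping step might look something like the sketch below. It assumes the problem values are the control ranges (below 0x20 and 0x7F to 0x9F) and moves them into the Unicode Private Use Area starting at U+E000. The exact ranges a given browser rejects are not listed here, so treat both choices, and the names `is_risky` and `build_remap`, as assumptions.

```python
from typing import Dict, Iterable


def is_risky(code: int) -> bool:
    """Assumed test for values a browser may refuse to render as text."""
    return code < 0x20 or 0x7F <= code <= 0x9F


def build_remap(codes: Iterable[int], pua_start: int = 0xE000) -> Dict[int, int]:
    """Map each used code to a safe value.

    Safe codes map to themselves; risky codes are given consecutive
    Private Use Area slots, in ascending order of the original code.
    """
    remap: Dict[int, int] = {}
    next_free = pua_start
    for code in sorted(set(codes)):
        if is_risky(code):
            remap[code] = next_free
            next_free += 1
        else:
            remap[code] = code
    return remap


remap = build_remap([65, 1, 2, 0x80])
```

The same table would then need to be applied twice: once to the text written into the HTML5 or SVG output, and once to the character map of the font that is written out alongside it, so the two stay in step.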
Here is the data from a PDF I have been looking at. It actually includes some custom small caps characters, so it has created some bespoke glyph names (a.sc for SMALL CAPS A, and so on).
So to fix this we would either write out text values 33-50 to map onto the embedded font or move them. Because it is only a limited set of values, we could actually map them onto a or A and restructure the fonts. We would probably need a larger sample size to decide the best approach. Or we could convert the text to shapes.
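As a rough illustration of the "map it onto a or A" option, the Python sketch below turns a bespoke small caps name back into a plain letter. It assumes every such name follows the `a.sc` pattern (a single letter plus a `.sc` suffix), which is only what this one file suggests; the function name `small_cap_to_text` is hypothetical.

```python
from typing import Optional


def small_cap_to_text(glyph_name: str, as_upper: bool = False) -> Optional[str]:
    """Return the plain letter for a name like 'a.sc', or None otherwise.

    as_upper selects between mapping onto 'A' or onto 'a'.
    """
    base, sep, suffix = glyph_name.partition(".")
    if sep and suffix == "sc" and len(base) == 1 and base.isalpha():
        return base.upper() if as_upper else base
    return None


lower = small_cap_to_text("a.sc")
upper = small_cap_to_text("a.sc", as_upper=True)
other = small_cap_to_text("fi")
```

Names that do not match the pattern return `None`, so they can fall through to one of the other approaches, such as keeping the original code or converting the text to shapes.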
But it is a good example of how PDF to HTML5 and SVG conversion is not always a straightforward process…