Why are CID fonts far more complicated than non-CID fonts?

We get lots of emails asking about fonts, so this article tries to explain some of the potential issues.

Loading fonts

All fonts used in PDF files need the following information:

1. A set of definitions describing how to draw each glyph.

2. A way to map the character value used onto the actual glyph details (character A is actually glyph 2). This information is stored in a font file which can either be already on the machine or embedded in the PDF.

The way we map the character values onto glyph numbers varies from font format to font format but often involves a look-up table called a CMAP. Sometimes this is in the font and sometimes it is in the PDF file.
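To make the look-up idea concrete, here is a minimal sketch in Java. It is only an illustration: the class name SimpleGlyphLookup and the table values are invented for this example and are not taken from any real font or library. Returning glyph 0 for an unmapped character follows the common convention that glyph 0 is the "missing glyph".

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only: not the API of any real PDF or font library.
class SimpleGlyphLookup {

    // Look-up table from character value to glyph number (values are made up).
    private final Map<Integer, Integer> codeToGlyph = new HashMap<>();

    // Register one entry of the table, e.g. character 'A' -> glyph 2.
    void addMapping(int charCode, int glyphIndex) {
        codeToGlyph.put(charCode, glyphIndex);
    }

    // Returns the glyph number for a character value,
    // or 0 (by convention the "missing glyph") if there is no entry.
    int glyphFor(int charCode) {
        return codeToGlyph.getOrDefault(charCode, 0);
    }
}
```

With the mapping from the list above registered via addMapping('A', 2), glyphFor('A') returns 2, and any character without an entry falls back to glyph 0.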

How do we find the font data?

One of the big issues with fonts is that technically only the 14 standard fonts can be assumed to exist (our old friends Courier, Helvetica, Times, etc) and in theory all the other fonts need to be embedded (ie you can assume the bits above are set up for Courier but need to ensure they are provided in the PDF for any other font). Including the font data in the PDF is called embedding the font and makes the document truly portable. The problem is that Adobe does not enforce this (and Adobe Acrobat is able to use any local fonts).

As an example, I have this really cute cat font called ‘Malinka the Cat’ on my computer which I could use in a PDF. If I used it without embedding it, you would need the font installed on your own machine to view my PDF correctly.

Adobe Acrobat itself downloads some CID fonts as language packs, which makes it much more complicated to show these PDF files in any other viewer. And Java adds an extra level of complexity because the virtual machine contains very few fonts (and they vary from platform to platform). We try to get around this to some extent by having our own font engine and looking in common font locations on the user's machine.
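As a rough idea of what "looking in common font locations" can mean, here is a small Java sketch. It is not our actual implementation: the class FontDirs and its methods are hypothetical, and the directory list is just the usual default locations on Windows, macOS and Linux (a real Windows install may live somewhere other than C:\Windows).

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, for illustration only.
class FontDirs {

    // Typical font directories for the given os.name and user home values.
    static List<String> candidateDirs(String osName, String userHome) {
        String os = osName.toLowerCase(Locale.ROOT);
        List<String> dirs = new ArrayList<>();
        if (os.contains("mac")) {
            dirs.add("/System/Library/Fonts");
            dirs.add("/Library/Fonts");
            dirs.add(userHome + "/Library/Fonts");
        } else if (os.contains("win")) {
            // Assumes the default Windows install location.
            dirs.add("C:\\Windows\\Fonts");
        } else {
            // Treat everything else as Linux/Unix-like.
            dirs.add("/usr/share/fonts");
            dirs.add("/usr/local/share/fonts");
            dirs.add(userHome + "/.fonts");
        }
        return dirs;
    }

    // The candidate directories that actually exist on this machine.
    static List<File> existingDirs() {
        List<File> found = new ArrayList<>();
        String os = System.getProperty("os.name", "");
        String home = System.getProperty("user.home", "");
        for (String path : candidateDirs(os, home)) {
            File dir = new File(path);
            if (dir.isDirectory()) {
                found.add(dir);
            }
        }
        return found;
    }
}
```

Calling FontDirs.existingDirs() gives the directories worth scanning for font files; matching a font name in the PDF to one of those files is a separate step not shown here.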

So why are CID fonts more complicated?

CID fonts are 16-bit versions of the TrueType and PostScript font technologies. What makes them especially complicated is the way the character value is converted into the glyph index. Not only can the CMAP be embedded in the font or the PDF file, but Adobe also provides a set of CID CMAPs (we provide these in the downloadable cid.jar). There are also several possible ways to do the conversion using different tables. This makes it very hard to display all non-embedded CID fonts in Java because you would need access to the CMAP files and also the matching font files (which are sometimes downloaded by Adobe Acrobat on demand and are copyrighted).
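The extra step can be sketched in Java as follows. This is a simplified illustration under stated assumptions, not a real decoder: it assumes every character is a fixed two-byte value (real CMAPs can use other code lengths), the class CidLookupSketch is hypothetical, and all table values are invented. It shows one possible chain: bytes in the PDF, to a character code, to a CID via the CMAP, to a glyph index via a second table.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only: a fixed two-byte, two-table conversion.
class CidLookupSketch {

    // First table (the CMAP): character code -> CID. Values are made up.
    final Map<Integer, Integer> codeToCid = new HashMap<>();

    // Second table: CID -> glyph index in the font file. Values are made up.
    final Map<Integer, Integer> cidToGlyph = new HashMap<>();

    // Converts raw text bytes into glyph indices, two bytes per character.
    // Unmapped codes or CIDs fall back to 0, assumed here to be the missing glyph.
    int[] glyphsFor(byte[] text) {
        int count = text.length / 2; // a trailing odd byte is ignored
        int[] glyphs = new int[count];
        for (int i = 0; i < count; i++) {
            // Combine high and low byte into one 16-bit character code.
            int code = ((text[2 * i] & 0xFF) << 8) | (text[2 * i + 1] & 0xFF);
            int cid = codeToCid.getOrDefault(code, 0);
            glyphs[i] = cidToGlyph.getOrDefault(cid, 0);
        }
        return glyphs;
    }
}
```

If either table is missing (for example the CMAP is one of the Adobe-provided ones and is not in the PDF), the chain breaks, which is the situation described above for non-embedded CID fonts.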

We have been looking at ways to support some of these non-embedded fonts but the complication is that you need access to the correct CMAP and the matching font file to make it work. If the correct font was on the computer, you might be able to use just the CMAP to get it working. A possible future enhancement is to code in the font mappings so that the user can download the file.

Hopefully you can now see why we recommend embedding the font!

This post is part of our “Fonts Articles Index”. In these articles we explore fonts.

Mark Stephens

System Architect and Lead Developer at IDRSolutions
Mark Stephens has been working with Java and PDF since 1999 and has diversified into HTML5, SVG and JavaFX. He also enjoys speaking at conferences and has been a Speaker at user groups, Business of Software, Seybold and JavaOne conferences. He has a very dry sense of humor and an MA in Medieval History for which he has not yet found a practical use.

2 thoughts on “Why are CID fonts far more complicated than non-CID fonts?”

  1. Jim Francis

    Very good article. I have a question though. I am running into an issue where I am trying to extract text and characters / capitalization properties are not being retained. For the most part it occurs when the font is a CID font; I am assuming that this is because the simple text extraction does not utilize the CMAP when copying and pasting. Is there any way around this? Would embedding the font fully versus subsetting make a difference?

    • It could also be the Unicode mappings. We have seen several files where Unicode returns the wrong values.
