Mark Stephens Mark has been working with Java and PDF since 1999 and is a big NetBeans fan. He enjoys speaking at conferences. He has an MA in Medieval History and a passion for reading.

Why are CID fonts far more complicated than non-CID fonts?


We get lots of emails asking about fonts, so this article tries to explain some of the potential issues.

Loading fonts

All fonts used in PDF files need the following information:

1. A set of definitions describing how to draw each glyph.

2. A way to map the value used onto the actual glyph details (character A is actually glyph 2). This information is stored in a font file, which can either already be on the machine or be embedded in the PDF.

The way we map the character values onto glyph numbers varies from font format to font format, but it often involves a look-up table called a CMAP. Sometimes this is in the font and sometimes it is in the PDF file.
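To make the idea of a look-up table concrete, here is a minimal sketch in Java. The class `GlyphLookup` and its methods are hypothetical names invented for this illustration (they are not part of any library), and a real CMAP is a far more compact structure than a plain map. The fallback to glyph 0 follows the usual convention that glyph 0 is the "missing glyph".

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only: a tiny character-value-to-glyph look-up table.
class GlyphLookup {

    // By convention, glyph 0 is the "missing glyph" (.notdef).
    static final int NOTDEF = 0;

    private final Map<Integer, Integer> codeToGlyph = new HashMap<>();

    // Record that a character value is drawn with the given glyph number.
    void map(int charCode, int glyphIndex) {
        codeToGlyph.put(charCode, glyphIndex);
    }

    // Return the glyph number for a character value, or .notdef if unmapped.
    int glyphFor(int charCode) {
        return codeToGlyph.getOrDefault(charCode, NOTDEF);
    }

    public static void main(String[] args) {
        GlyphLookup table = new GlyphLookup();
        table.map('A', 2); // as in the text: character A is actually glyph 2

        System.out.println(table.glyphFor('A')); // prints 2
        System.out.println(table.glyphFor('Z')); // prints 0 (no mapping, so .notdef)
    }
}
```

If the table is missing or does not match the font, every lookup still returns something, just the wrong glyph. That is why a PDF can display as garbage rather than failing outright.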

How do we find the font data?

One of the big issues with fonts is that technically Adobe guarantees that only the 14 standard fonts can be assumed to exist (our old friends Courier, Helvetica, Times, etc.), and in theory all other fonts need to be embedded (i.e. you can assume the bits above are set up for Courier but need to ensure they are provided in the PDF for any other font). This is called embedding the font, and it makes the document truly portable. The problem is that Adobe does not enforce this (and Adobe Acrobat is able to use any local fonts).

As an example, I have this really cute cat font called ‘Malinka the Cat’ on my computer which I could use in a PDF. If I used it in a PDF without embedding it, you would need the same font on your machine to view my PDF correctly.

Adobe Acrobat itself downloads some CID fonts as language packs, which makes it much more complicated to show these PDF files in any other viewer. Java adds an extra level of complexity because the virtual machine contains very few fonts (and they vary from platform to platform). We try to get around this to some extent by having our own engine and looking at common font locations on the user's machine.
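As a rough sketch of what "looking at common font locations" could involve, the Java below walks a list of directories and collects files whose names look like font files. It is not our engine's actual code: the class `FontFinder`, the directory list, and the set of file extensions are all assumptions made for illustration, and real systems keep fonts in other places too.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.stream.Stream;

// Hypothetical illustration only: search some typical font folders for font files.
class FontFinder {

    // Assumed typical locations; any that do not exist are simply skipped.
    static List<Path> commonFontDirs() {
        List<Path> dirs = new ArrayList<>();
        dirs.add(Paths.get("C:\\Windows\\Fonts"));
        dirs.add(Paths.get("/Library/Fonts"));
        dirs.add(Paths.get("/System/Library/Fonts"));
        dirs.add(Paths.get("/usr/share/fonts"));
        dirs.add(Paths.get(System.getProperty("user.home", ""), ".fonts"));
        return dirs;
    }

    // Judge by file extension only; the file contents are not inspected.
    static boolean looksLikeFont(Path file) {
        String name = file.getFileName().toString().toLowerCase(Locale.ROOT);
        return name.endsWith(".ttf") || name.endsWith(".otf")
            || name.endsWith(".ttc") || name.endsWith(".pfb");
    }

    // Walk each existing directory (including subfolders) and collect candidates.
    static List<Path> findFontFiles(List<Path> dirs) {
        List<Path> found = new ArrayList<>();
        for (Path dir : dirs) {
            if (!Files.isDirectory(dir)) {
                continue;
            }
            try (Stream<Path> walk = Files.walk(dir)) {
                walk.filter(Files::isRegularFile)
                    .filter(FontFinder::looksLikeFont)
                    .forEach(found::add);
            } catch (IOException | UncheckedIOException e) {
                // Unreadable folder: ignore it and carry on with the next one.
            }
        }
        return found;
    }

    public static void main(String[] args) {
        List<Path> fonts = findFontFiles(commonFontDirs());
        System.out.println("Found " + fonts.size() + " candidate font files");
    }
}
```

Finding a file is only the first step. The viewer still has to open each candidate and read the font name inside it to decide whether it matches the font the PDF asks for.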

So why are CID fonts more complicated?

CID fonts are 16-bit versions of the TrueType and PostScript font technologies. What makes them especially complicated is the way the character value is converted into the glyph index. Not only can the CMAP be embedded in the font or the PDF file, but Adobe also provides a set of CID CMAPs (we provide these in the downloadable cid.jar). There are also several possible ways to do the conversion using different tables. This makes it very hard to display all non-embedded CID fonts in Java, because you would need access to the CMAP files and also the matching font files (which are sometimes downloaded by Adobe Acrobat on demand and are copyrighted).
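The first half of that conversion can be sketched in Java as follows. This is a deliberately simplified illustration, not how any particular viewer does it: it assumes every character value is exactly two bytes (real CMAPs can mix code lengths), the class `CidDecoder` is a hypothetical name, and the range used in `main` is made up rather than taken from a real CMAP file. Each range entry works like a `cidrange` entry in a CMAP: a run of consecutive character values maps to a run of consecutive CIDs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: turn two-byte character values into CIDs.
class CidDecoder {

    // Values first..last map to consecutive CIDs starting at firstCid.
    private static final class Range {
        final int first;
        final int last;
        final int firstCid;

        Range(int first, int last, int firstCid) {
            this.first = first;
            this.last = last;
            this.firstCid = firstCid;
        }
    }

    private final List<Range> ranges = new ArrayList<>();

    void addRange(int first, int last, int firstCid) {
        ranges.add(new Range(first, last, firstCid));
    }

    // Look up one character value; CID 0 is used when nothing matches.
    int cidFor(int code) {
        for (Range r : ranges) {
            if (code >= r.first && code <= r.last) {
                return r.firstCid + (code - r.first);
            }
        }
        return 0;
    }

    // Read the text two bytes at a time (high byte first) and map each value.
    // A trailing odd byte, if any, is ignored in this sketch.
    int[] toCids(byte[] data) {
        int[] cids = new int[data.length / 2];
        for (int i = 0; i < cids.length; i++) {
            int code = ((data[2 * i] & 0xFF) << 8) | (data[2 * i + 1] & 0xFF);
            cids[i] = cidFor(code);
        }
        return cids;
    }

    public static void main(String[] args) {
        CidDecoder decoder = new CidDecoder();
        decoder.addRange(0x0020, 0x007E, 1); // made-up range for the example

        int[] cids = decoder.toCids(new byte[] {0x00, 0x41, 0x00, 0x42});
        System.out.println(cids[0]); // prints 34
        System.out.println(cids[1]); // prints 35
    }
}
```

The CID is still not the glyph index. A further table, which depends on the font format, maps the CID onto the glyph inside the font file. Both steps must use tables that match the actual font, or the wrong glyphs are drawn.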

We have been looking at ways to support some of these non-embedded fonts, but the complication is that you need access to the correct CMAP and the matching font file to make it work. If the correct font were already on the computer, you might be able to use just the CMAP to get it working. A possible future enhancement is to code in the font mappings so that the user can download the file.

Hopefully you can now see why we recommend embedding the font!




2 Replies to “Why are CID fonts far more complicated than non-CID…”

  1. Very good article. So I have a question though. I am running into an issue where I am trying to extract text and characters / capitalization properties are not being retained. For the most part it occurs when the font is a CID font, I am assuming that this is because the simple text extraction does not utilize the cmap when copying and pasting. Is there any way around this? Would embedding the font fully versus subsetting make a difference?
