Mark Stephens Mark has been working with Java and PDF since 1999 and is a big NetBeans fan. He enjoys speaking at conferences. He has an MA in Medieval History and a passion for reading.

Understanding the PDF file format – Embedded CMAP tables


Every glyph inside a PDF file can have a display value and a different extraction value. This is useful because you often need to display something on screen that differs from what you get when you search the PDF file or extract the text. For example, ligatures (fl, fi, etc.) look much better when displayed using a special glyph rather than an f followed by an l or i. But when you search for floor or fine, you want these to be found correctly. So the PDF file format allows you to define separate values for display and for the actual text.

One of the ways you can set up the values used for extraction is to use a CMAP table. It can be stored inside a PDF object (and is often compressed). Let's see what one looks like if we extract the actual data…

/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo <<
/Registry (F6+0) /Ordering (T1UV) /Supplement 0 >> def
/CMapName /F6+0 def
/CMapType 2 def
1 begincodespacerange <02> <b7> endcodespacerange
19 beginbfchar
<07> <03C0>
<09> <0061>
<0a> <006D>
<0b> <0070>
<1e> <02DA>
<20> <0020>
<22> <0022>
<3d> <003D>
<3f> <003F>
<59> <0059>
<5b> <005B>
<5d> <005D>
<5f> <005F>
<7d> <007D>
<84> <2014>
<85> <2013>
<90> <2019>
<b0> <00B0>
<b7> <00B7>
endbfchar
8 beginbfrange
<24> <25> <0024>
<27> <29> <0027>
<2b> <2e> <002B>
<30> <3b> <0030>
<41> <50> <0041>
<52> <57> <0052>
<61> <7b> <0061>
<8d> <8e> <201C>
endbfrange
6 beginbfrange
<02> <02> [<0066006C>]
<03> <03> [<00540068>]
<04> <04> [<00660069>]
<05> <05> [<00660074>]
<06> <06> [<00660066>]
<08> <08> [<006600660069>]
endbfrange
endcmap CMapName currentdict /CMap defineresource pop end end
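
If the CMAP is stored compressed, it has to be inflated before it looks like the listing above. Here is a minimal Python sketch, assuming the stream uses the common FlateDecode (zlib) filter with no predictor. The function name and the sample bytes are made up for illustration; in a real file you would pass in the bytes found between `stream` and `endstream`.

```python
import zlib

def inflate_cmap(raw_stream: bytes) -> str:
    """Inflate a FlateDecode stream and return the CMAP program as text."""
    # CMAP programs are plain ASCII; latin-1 never fails on a stray byte.
    return zlib.decompress(raw_stream).decode("latin-1")

# Stand-in for bytes read from a PDF: we compress a small fragment ourselves.
sample = zlib.compress(b"1 beginbfchar\n<07> <03C0>\nendbfchar\n")
print(inflate_cmap(sample))
```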

The first thing to note is that it is a readable text file (always much easier to follow!). The file has a header, but the really interesting part for us is the lines between the begin/end tags. Note that we can have multiple tables: in this file there are two beginbfrange sections, and we read both in turn. All the values are hex, with the Unicode values shown as at least 4 hex digits even if the high-order byte is zero.
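
Reading the sections in turn could be sketched like this in Python. It assumes every section is closed by a matching `endbfchar` or `endbfrange` operator, as the CMap format requires. The function name and the shortened sample are illustrative only.

```python
import re

# "<count> beginbfchar ... endbfchar" or "<count> beginbfrange ... endbfrange"
SECTION = re.compile(r"(\d+)\s+begin(bfchar|bfrange)\s+(.*?)\s*end\2", re.DOTALL)

def split_sections(cmap_text: str) -> list[tuple[str, int, list[str]]]:
    """Return (kind, declared_count, entry_lines) for every section, in order."""
    sections = []
    for count, kind, body in SECTION.findall(cmap_text):
        sections.append((kind, int(count), body.splitlines()))
    return sections

sample = """2 beginbfchar
<07> <03C0>
<09> <0061>
endbfchar
1 beginbfrange
<27> <29> <0027>
endbfrange
1 beginbfrange
<02> <02> [<0066006C>]
endbfrange"""

for kind, count, lines in split_sections(sample):
    print(kind, count, len(lines))
# bfchar 2 2
# bfrange 1 1
# bfrange 1 1
```

Comparing the declared count with the number of lines found is a cheap sanity check that a section was read completely.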

beginbfchar is the simpler of the two. It lists single character codes alongside their Unicode values. So character code 07 maps onto Unicode hex value 03C0 (π) for the purpose of search and extraction, character code 09 maps onto Unicode hex value 0061 (a), etc.
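
As a rough illustration, the entries of a beginbfchar section can be turned into a lookup table with a few lines of Python. This is a sketch, not a full CMap parser: it assumes the right-hand values are UTF-16BE (as they are in a ToUnicode CMap), and `parse_bfchar` is a name invented for this example.

```python
import re

PAIR = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")

def parse_bfchar(section: str) -> dict[int, str]:
    """Map each character code to its Unicode text value."""
    mapping = {}
    # Each entry is a pair of hex strings: <code> <unicode value>.
    for code_hex, text_hex in PAIR.findall(section):
        # Assumption: the value is UTF-16BE, hence the padding to 4 hex digits.
        mapping[int(code_hex, 16)] = bytes.fromhex(text_hex).decode("utf-16-be")
    return mapping

table = parse_bfchar("<07> <03C0>\n<09> <0061>\n<0a> <006D>")
print(table)  # {7: 'π', 9: 'a', 10: 'm'}
```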

The beginbfrange sections allow you to specify a range of character codes to fill, starting with a certain Unicode value (which is incremented for each one). This is what happens in the first beginbfrange section. So the line

<27> <29> <0027>

means that character code 27 is mapped to Unicode 0027, 28 to 0028 and 29 to 0029 (all values are hex).
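
The incrementing behaviour can be sketched in Python as follows. The helper name is hypothetical, and the sketch assumes the start value is a single 4-digit Unicode value, as in every line of the first beginbfrange section.

```python
def expand_bfrange(first_hex: str, last_hex: str, start_hex: str) -> dict[int, str]:
    """Expand '<first> <last> <start>' into one mapping per character code."""
    first, last = int(first_hex, 16), int(last_hex, 16)
    start = int(start_hex, 16)
    # The Unicode value goes up by one for each code in the range.
    return {code: chr(start + (code - first)) for code in range(first, last + 1)}

for code, ch in expand_bfrange("27", "29", "0027").items():
    print(f"{code:02x} -> {ord(ch):04X}")
# 27 -> 0027
# 28 -> 0028
# 29 -> 0029
```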

But we can also map a single character onto multiple values (allowing us to have a single glyph for fl or fi but still set the correct values for the actual text). This is what happens in the second beginbfrange section. So the line

<02> <02> [<0066006C>]

means that character code 02 actually represents 2 text characters (Unicode hex 0066 and 006C, which are f and l). We use these for the text value, but the single glyph defined in the font is used for display purposes.
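
To make the multi-character case concrete, here is a small Python sketch that decodes a few of the bracketed values from the listing. It assumes the hex string is UTF-16BE, so every 4 hex digits give one text character; `decode_text_value` is an illustrative name.

```python
def decode_text_value(text_hex: str) -> str:
    """Decode a hex string such as 0066006C into its text characters."""
    # Assumption: UTF-16BE, so each group of 4 hex digits is one character here.
    return bytes.fromhex(text_hex).decode("utf-16-be")

# Three of the single-code entries from the second beginbfrange section.
ligatures = {0x02: "0066006C", 0x04: "00660069", 0x08: "006600660069"}
for code, text_hex in ligatures.items():
    print(f"{code:02x} -> {decode_text_value(text_hex)}")
# 02 -> fl
# 04 -> fi
# 08 -> ffi
```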

So CMAP files are very useful because we can use them to provide very flexible options for what the text value of any character should be. Are you using them in your PDF files?
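
To see the payoff for searching, here is a hedged Python sketch that extracts text from a run of character codes. The `to_unicode` table is built by hand from a few entries in the listing, and the sketch assumes one-byte character codes (which matches the `<02> <b7>` codespace range). All names are illustrative.

```python
# Hand-built from the listing: two ligature entries plus the <61> <7b> <0061> range.
to_unicode = {0x02: "fl", 0x04: "fi"}
to_unicode.update({code: chr(code) for code in range(0x61, 0x7c)})

def extract_text(codes: bytes) -> str:
    """Replace each one-byte character code with its text value."""
    # Unmapped codes become U+FFFD so that gaps stay visible.
    return "".join(to_unicode.get(code, "\ufffd") for code in codes)

shown = bytes([0x02, 0x6f, 0x6f, 0x72])  # the fl ligature glyph, then o, o, r
text = extract_text(shown)
print(text)             # floor
print("floor" in text)  # True
```

The page shows a single ligature glyph, yet a search for floor still succeeds because it runs against the extracted text.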

This post is part of our “Understanding the PDF File Format” series. In each article, we aim to take a specific PDF feature and explain it in simple terms. If you wish to learn more about PDF, we have 13 years' worth of PDF knowledge and tips in our series index.





IDRsolutions Ltd 2020. All rights reserved.