Understanding the PDF file format – Embedded CMAP tables

Every glyph inside a PDF file can have a display value and a different extraction value. This is useful because you often need to display something on screen that differs from what you get when you search the PDF file or extract its text. For example, ligatures (fl, fi, etc.) look much better when displayed using a special glyph rather than an f followed by an l or i. But when you search for floor or fine, you want these to be found correctly. So the PDF file format allows you to define separate values for the display and for the actual text.

One of the ways you can set up the values used for extraction is to use a CMAP table (a font's ToUnicode CMap). It can be stored inside a PDF object (often compressed). Let's see what one looks like if we extract the actual data…

/CIDInit /ProcSet findresource begin 12 dict begin begincmap /CIDSystemInfo <<
/Registry (F6+0) /Ordering (T1UV) /Supplement 0 >> def
/CMapName /F6+0 def
/CMapType 2 def
1 begincodespacerange <02> <b7> endcodespacerange
19 beginbfchar
<07> <03C0>
<09> <0061>
<0a> <006D>
<0b> <0070>
<1e> <02DA>
<20> <0020>
<22> <0022>
<3d> <003D>
<3f> <003F>
<59> <0059>
<5b> <005B>
<5d> <005D>
<5f> <005F>
<7d> <007D>
<84> <2014>
<85> <2013>
<90> <2019>
<b0> <00B0>
<b7> <00B7>
endbfchar
8 beginbfrange
<24> <25> <0024>
<27> <29> <0027>
<2b> <2e> <002B>
<30> <3b> <0030>
<41> <50> <0041>
<52> <57> <0052>
<61> <7b> <0061>
<8d> <8e> <201C>
endbfrange
6 beginbfrange
<02> <02> [<0066006C>]
<03> <03> [<00540068>]
<04> <04> [<00660069>]
<05> <05> [<00660074>]
<06> <06> [<00660066>]
<08> <08> [<006600660069>]
endbfrange
endcmap CMapName currentdict /CMap defineresource pop end end

The first thing to note is that it is readable text (always much easier to follow!). The file has a header, but the really interesting part for us is the lines between the begin/end tags. Note that we can have multiple tables – in this file there are two beginbfrange sections, and we read both in turn. All the values are hex, with each Unicode value shown as 4 hex digits even if the high-order byte is zero.
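As a rough illustration of reading the sections in turn, here is a minimal Python sketch that finds every beginbfchar/beginbfrange section and reports its declared entry count. The names SECTION and list_sections are made up for this example, and it assumes one entry per line as in the listing above; a real CMap parser would need a proper tokenizer.

```python
import re

# Matches "<count> beginbfchar ... endbfchar" and the bfrange equivalent.
# Hypothetical helper for illustration, not part of any PDF library.
SECTION = re.compile(r"(\d+)\s+begin(bfchar|bfrange)\b(.*?)\bend\2\b", re.DOTALL)

def list_sections(cmap_text):
    """Return (kind, declared_count, entry_lines) for each section, in file order."""
    sections = []
    for count, kind, body in SECTION.findall(cmap_text):
        # Assumption: each mapping entry sits on its own line.
        lines = [ln.strip() for ln in body.splitlines() if ln.strip()]
        sections.append((kind, int(count), lines))
    return sections

sample = """
2 beginbfchar
<07> <03C0>
<09> <0061>
endbfchar
1 beginbfrange
<27> <29> <0027>
endbfrange
1 beginbfrange
<02> <02> [<0066006C>]
endbfrange
"""

for kind, declared, lines in list_sections(sample):
    print(kind, declared, len(lines))
```

The number in front of each begin tag should match the number of entries that follow, which gives a cheap sanity check while reading.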

beginbfchar is the simpler of the two. It lists single character codes with their Unicode values. So character code 07 maps onto Unicode hex value 03C0 for the purpose of search and extraction, character code 09 maps onto Unicode hex value 0061, etc.
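The bfchar lookup can be sketched in a few lines of Python. This is only an illustration: parse_bfchar and hex_to_text are hypothetical names, and the sketch assumes the destination values are UTF-16BE hex strings, as they are in the listing above.

```python
import re

# Hypothetical helpers for illustration only.
BFCHAR_BLOCK = re.compile(r"beginbfchar(.*?)endbfchar", re.DOTALL)
HEX_PAIR = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")

def hex_to_text(hex_digits):
    """Decode a destination such as '03C0' as UTF-16BE text."""
    return bytes.fromhex(hex_digits).decode("utf-16-be")

def parse_bfchar(cmap_text):
    """Build {character code: extracted text} from all bfchar sections."""
    mapping = {}
    for block in BFCHAR_BLOCK.findall(cmap_text):
        for src, dst in HEX_PAIR.findall(block):
            mapping[int(src, 16)] = hex_to_text(dst)
    return mapping

sample = """
3 beginbfchar
<07> <03C0>
<09> <0061>
<84> <2014>
endbfchar
"""

table = parse_bfchar(sample)
print(repr(table[0x09]))  # the letter a
```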

The beginbfrange sections allow you to specify a range of character codes to fill, starting with a certain Unicode value (which is incremented for each one). This is what happens in the first beginbfrange section. So the line

<27> <29> <0027>

means hex 27 is mapped to Unicode 0027, 28 is mapped to 0028, and 29 is mapped to 0029.
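The arithmetic is simple enough to show directly. The sketch below uses a made-up function name, expand_bfrange, and assumes the start value is a single Unicode code point, as in the first beginbfrange section of the listing.

```python
def expand_bfrange(first, last, start):
    """Map each code in first..last to consecutive Unicode values from start."""
    return {first + offset: chr(start + offset) for offset in range(last - first + 1)}

# The line <27> <29> <0027> from the listing.
for code, text in expand_bfrange(0x27, 0x29, 0x0027).items():
    print(f"<{code:02x}> -> U+{ord(text):04X} {text!r}")
```

The same call with (0x8d, 0x8e, 0x201C) yields the opening and closing curly double quotes, which is why one line is enough to cover both.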

But we can also map a single character onto multiple values (allowing us to have a single character for fl or fi but set the correct values for the actual text). So the line

<02> <02> [<0066006C>]

means that character 2 is actually 2 text characters (Unicode hex 0066 and 006C, the letters f and l). We use these for the text value, but the single glyph defined in the font for display purposes.
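Here is a small Python sketch of reading this array form. The name parse_bfrange_array is hypothetical, and the sketch assumes the line really is in the bracketed form with UTF-16BE destination strings; it does not handle the incrementing form described earlier.

```python
import re

HEX_STRING = re.compile(r"<([0-9A-Fa-f]+)>")

def parse_bfrange_array(line):
    """Parse '<first> <last> [<dst> ...]' into {character code: text}."""
    head, _, array = line.partition("[")
    # Assumption: exactly two hex strings (first, last) precede the bracket.
    first, last = (int(h, 16) for h in HEX_STRING.findall(head))
    destinations = HEX_STRING.findall(array)
    mapping = {}
    # One array element per code in the range first..last.
    for offset, dst in enumerate(destinations[: last - first + 1]):
        mapping[first + offset] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping

print(parse_bfrange_array("<02> <02> [<0066006C>]"))
print(parse_bfrange_array("<08> <08> [<006600660069>]"))
```

Each destination string can be any length, which is what lets one glyph stand for two or three text characters.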

So CMAP files are very useful because we can use them to provide very flexible options for what the text value of any character should be. Are you using them in your PDF files?
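To see the payoff for searching, here is a minimal Python sketch that turns a run of character codes into searchable text. TO_UNICODE is a hand-copied subset of the table above and extract_text is a made-up name; the sketch assumes single-byte codes and substitutes U+FFFD for anything unmapped.

```python
# Hand-copied subset of the listing: two ligatures plus the <61> <7b> <0061> range.
TO_UNICODE = {0x02: "fl", 0x04: "fi"}
TO_UNICODE.update({code: chr(code) for code in range(0x61, 0x7B + 1)})

def extract_text(codes, table):
    """Join the text value of each code; unmapped codes become U+FFFD."""
    return "".join(table.get(code, "\ufffd") for code in codes)

shown = bytes([0x02, 0x6F, 0x6F, 0x72])  # glyphs: fl ligature, o, o, r
text = extract_text(shown, TO_UNICODE)
print(text)             # floor
print("floor" in text)  # True
```

Three glyphs plus a ligature are drawn on the page, yet a search for the five-letter word still succeeds.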

This post is part of our “Understanding the PDF File Format” series. In each article, we aim to take a specific PDF feature and explain it in simple terms. If you wish to learn more about PDF, we have 13 years' worth of PDF knowledge and tips, so click here to visit our series index!

 

 

Mark Stephens

System Architect and Lead Developer at IDRSolutions
Mark Stephens has been working with Java and PDF since 1999 and has diversified into HTML5, SVG and JavaFX. He also enjoys speaking at conferences and has been a Speaker at user groups, Business of Software, Seybold and JavaOne conferences. He has a very dry sense of humor and an MA in Medieval History for which he has not yet found a practical use.