Mapping of Unicode characters

From Wikipedia, the free encyclopedia

Unicode’s Universal Character Set (UCS) has a potential capacity of over 1 million characters. Each UCS character is mapped to a code point, an integer between 0 and 1,114,111 that represents the character within the internal logic of text processing software. The code space therefore holds 1,114,112 code points (2²⁰ + 2¹⁶, or 17 × 2¹⁶, or hexadecimal 110000).
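The arithmetic above can be checked with a few lines of Python. This is only a minimal sketch; the constant and function names are chosen here for illustration and do not come from any standard API.

```python
# Size of the Unicode code space: 17 planes of 2**16 code points each.
PLANES = 17
PLANE_SIZE = 2 ** 16
CODE_SPACE_SIZE = PLANES * PLANE_SIZE   # 1,114,112, hexadecimal 110000
MAX_CODE_POINT = CODE_SPACE_SIZE - 1    # 1,114,111, hexadecimal 10FFFF


def is_valid_code_point(value: int) -> bool:
    """Return True if value lies in the code space 0..0x10FFFF."""
    return 0 <= value <= MAX_CODE_POINT
```

Python's built-in chr() accepts the same range: chr(0x10FFFF) succeeds, while chr(0x110000) raises ValueError.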

As of Unicode 5.0.0, 102,012 (9.2%) of these code points are assigned, with another 137,468 (12.3%) reserved for private use, 2,048 for surrogates, and 66 designated noncharacters, leaving 872,582 (78.3%) unassigned. The number of assigned code points is made up as follows:

(See the summary table for a more detailed breakdown).

Unicode characters can be categorized in many ways. Every character is assigned a script (though many are assigned to the Common or Inherited scripts, in which case they take on the script of an adjacent character). In Unicode, a script is a coherent writing system that includes letters and may also include script-specific punctuation, diacritics and other marks, numerals, and symbols. A single script supports one or more languages.

Characters are assigned in blocks. These blocks are usually groups of code points in some multiple of eight: many, for example, are grouped in blocks of 128 or 256 code points. Every character is also assigned a general category and subcategory. The general categories are: letter, mark, number, punctuation, symbol, separator, and other (control, formatting, and similar non-graphical characters).

The blocks of characters are in turn allocated within planes. Most characters are currently assigned to the first plane, the Basic Multilingual Plane. This helps ease the transition for legacy software, since the Basic Multilingual Plane is addressable with just two octets (bytes). The characters outside the first plane usually have very specialized or rare use.
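As a rough illustration of the plane structure, the sketch below computes the plane of a code point and shows that a Basic Multilingual Plane character fits in two octets when encoded as UTF-16, while a character from another plane needs four. The helper names are hypothetical, and UTF-16 is assumed as the two-octet encoding in question.

```python
def plane_of(code_point: int) -> int:
    """Return the plane number (0..16) containing code_point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    return code_point >> 16  # each plane holds 2**16 code points


def utf16_length_in_bytes(ch: str) -> int:
    """Number of octets needed for one character in UTF-16 (no BOM)."""
    return len(ch.encode("utf-16-be"))
```

For example, plane_of(0x0041) is 0 (the Basic Multilingual Plane) and plane_of(0x1D11E) is 1.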

The first 256 code points correspond with those of ISO 8859-1, the most popular 8-bit character encoding in the Western world. As a result, the first 128 characters are also identical to ASCII. Though Unicode refers to these as a Latin script block, these two blocks contain many characters that are commonly useful outside of the Latin script.
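The correspondence with ISO 8859-1 and ASCII can be observed through Python's built-in codecs. The following is a small sketch, assuming that Python's "latin-1" and "ascii" codecs are acceptable stand-ins for those two encodings; the function names are illustrative.

```python
def latin1_matches_unicode() -> bool:
    """Check that byte value N in ISO 8859-1 decodes to code point N."""
    all_bytes = bytes(range(256))
    expected = "".join(chr(i) for i in range(256))
    return all_bytes.decode("latin-1") == expected


def ascii_matches_unicode() -> bool:
    """Check that byte value N in ASCII decodes to code point N."""
    ascii_bytes = bytes(range(128))
    expected = "".join(chr(i) for i in range(128))
    return ascii_bytes.decode("ascii") == expected
```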

Planes

Graphical characters

Compatibility characters

Non-graphical characters

Other special-purpose characters

The latest Unicode repertoire codifies over a hundred thousand characters. Most of those represent graphemes for processing as linear text. Some, however, either do not represent graphemes, or, as graphemes, require exceptional treatment. Unlike the ASCII control characters and other characters included for legacy round-trip capabilities, these other special-purpose characters endow plain text with important semantics.

Some special characters can alter the layout of text, such as the zero-width joiner and zero-width non-joiner, while others do not affect text layout at all, but instead affect the way text strings are collated, matched or otherwise processed. Other special-purpose characters, such as the mathematical invisibles, generally have no effect on text rendering, though sophisticated text layout software may choose to subtly adjust spacing around them.

Unicode does not specify the division of labor between font and text layout software (or "engine") when rendering Unicode text. Because the more complex font formats, such as OpenType or Apple Advanced Typography, provide for contextual substitution and positioning of glyphs, a simple text layout engine might rely entirely on the font for all decisions of glyph choice and placement. In the same situation a more complex engine may combine information from the font with its own rules to achieve its own idea of best rendering. To implement all recommendations of the Unicode specification, a text engine must be prepared to work with fonts of any level of sophistication, since contextual substitution and positioning rules do not exist in some font formats and are optional in the rest. The fraction slash is an example: complex fonts may or may not supply positioning rules in the presence of the fraction slash character to create a fraction, while fonts in simple formats cannot.

Byte-order mark

The byte-order mark (U+FEFF) is used only at the beginning of a text document, where it indicates the byte order of the encoding. With this character, applications can distinguish between the Unicode transformation format encodings (UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32LE, or UTF-32BE), and a file starting with the encoded form of this character is a fairly reliable indication that the file is in one of these formats.
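A minimal sketch of byte-order-mark detection follows. It uses the BOM constants from Python's standard codecs module; the function name sniff_bom and the returned encoding labels are choices made for this example. The UTF-32 signatures are tested first because the UTF-32 little-endian BOM begins with the same two bytes as the UTF-16 little-endian BOM.

```python
import codecs

# Longer signatures come first: BOM_UTF32_LE starts with BOM_UTF16_LE.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no BOM is found."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return encoding, len(bom)
    return None, 0
```

This is a heuristic only: a UTF-16 little-endian file whose first character after the BOM is U+0000 would be reported as UTF-32, and a file without a BOM may still be in any of these encodings.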

Grapheme joiners and non-joiners

The zero-width joiner (U+200D) and zero-width non-joiner (U+200C) control the joining and ligation of glyphs. The joiner does not cause characters that would not otherwise join or ligate to do so, but when paired with the non-joiner these characters can be used to control the joining and ligating properties of the surrounding two joining or ligating characters. The Combining Grapheme Joiner (U+034F) is used to distinguish two base characters as one common base or digraph, mostly for underlying text processing, collation of strings, case folding and so on.

Word joiners and separators

The most common word separator is a space (U+0020). However, there are other word joiners and separators that also indicate a break between words and participate in line-breaking algorithms. The No-Break Space (U+00A0) also produces a baseline advance without a glyph but inhibits rather than enables a line-break. The Zero Width Space (U+200B) allows a line-break but provides no space, in a sense joining rather than separating two words. Finally, the Word Joiner (U+2060) inhibits line breaks and also involves none of the white space produced by a baseline advance.

Line-break behavior | Baseline Advance | No Baseline Advance
Allow Line-break (Separators) | Space U+0020 | Zero Width Space U+200B
Inhibit Line-break (Joiners) | No-Break Space U+00A0 | Word Joiner U+2060

Other Separators

  • Line Separator (U+2028)
  • Paragraph Separator (U+2029)

These provide Unicode with native paragraph and line separators independent of the legacy control characters such as line feed (U+000A), carriage return (U+000D), and Next Line (U+0085). Unicode does not provide for other ASCII formatting control characters, which presumably are therefore not part of the Unicode plain text processing model. These legacy formatting control characters include Tab (U+0009), Line Tabulation or Vertical Tab (U+000B), and Form Feed (U+000C), which is also thought of as a page break.
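As an illustration, Python's str.splitlines() treats the Unicode line and paragraph separators as line boundaries alongside the legacy controls. The wrapper below is a sketch with a hypothetical name; note that str.splitlines() also splits on vertical tab, form feed, and a few other control characters, so it is broader than the set of separators discussed here.

```python
LINE_SEPARATOR = "\u2028"
PARAGRAPH_SEPARATOR = "\u2029"


def split_lines(text: str):
    """Split text at Unicode and legacy line boundaries.

    Relies on str.splitlines(), which recognizes U+2028, U+2029,
    U+0085, CR, LF, CR LF and some other control characters.
    """
    return text.splitlines()
```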

Spaces

The space character (U+0020) typically input by the space bar on a keyboard serves semantically as a word separator in many languages. For legacy reasons, the UCS also includes spaces of varying sizes that are compatibility equivalents for the space character. While these spaces of varying width are important in typography, the Unicode processing model calls for such visual effects to be handled by rich text, markup and other such protocols. They are included in the Unicode repertoire primarily to handle lossless roundtrip transcoding from other character set encodings. These spaces include:

  1. En Quad (U+2000)
  2. Em Quad (U+2001)
  3. En Space (U+2002)
  4. Em Space (U+2003)
  5. Three-Per-Em Space (U+2004)
  6. Four-Per-Em Space (U+2005)
  7. Six-Per-Em Space (U+2006)
  8. Figure Space (U+2007)
  9. Punctuation Space (U+2008)
  10. Thin Space (U+2009)
  11. Hair Space (U+200A)
  12. Medium Mathematical Space (U+205F)

Aside from the original ASCII space, the other spaces are all compatibility characters. In this context this means that they effectively add no semantic content to the text, but instead provide styling control. Within Unicode, this non-semantic styling control is often referred to as rich text and is outside the thrust of Unicode’s goals. Rather than using different spaces in different contexts, this styling should instead be handled through intelligent text layout software.
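The compatibility relationship between these spaces and U+0020 can be observed with compatibility normalization. The sketch below assumes Python's unicodedata module and Normalization Form KC; the names TYPOGRAPHIC_SPACES and fold_spaces are made up for this example.

```python
import unicodedata

# U+2000 through U+200A, plus U+205F: the twelve spaces listed above.
TYPOGRAPHIC_SPACES = [chr(cp) for cp in range(0x2000, 0x200B)] + ["\u205F"]


def fold_spaces(text: str) -> str:
    """Apply NFKC, which maps compatibility spaces to U+0020."""
    return unicodedata.normalize("NFKC", text)
```

NFKC also rewrites many other compatibility characters (ligatures, for instance), so a real application that only wants to fold spaces would probably use a narrower mapping.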

Three other writing-system-specific word separators are:

  • Mongolian Vowel Separator (U+180E)
  • Ideographic Space (U+3000): behaves as an ideographic separator and is generally rendered as white space of the same width as an ideograph.
  • Ogham Space Mark (U+1680): this character is sometimes displayed with a glyph and other times as only white space.

Line-break control characters

Several characters are designed to help control line-breaks either by discouraging them (no-break characters) or suggesting line breaks such as the soft hyphen (U+00AD) (sometimes called the "shy hyphen"). Such characters, though designed for styling, are probably indispensable for the intricate types of line-breaking they make possible.

Break Inhibiting

  1. Non-breaking hyphen (U+2011)
  2. No-break space (U+00A0)
  3. Tibetan Mark Delimiter Tsheg Bstar (U+0F0C)
  4. Narrow no-break space (U+202F)

The break inhibiting characters are meant to be equivalent to a character sequence wrapped in the Word Joiner (U+2060). However, the Word Joiner may be placed before or after any character that would otherwise allow a line-break, in order to inhibit such line-breaking.

Break Enabling

  1. Soft hyphen (U+00AD)
  2. Tibetan Mark Intersyllabic Tsheg (U+0F0B)
  3. Zero-width space (U+200B)

Both the break inhibiting and break enabling characters participate with other punctuation and whitespace characters to enable text imaging systems to determine line breaks within the Unicode Line Breaking Algorithm.

Mathematical invisibles

Primarily for mathematics, the Invisible Separator (U+2063) provides a separator between characters where punctuation or space may be omitted such as in a two-dimensional index like i⁣j. Invisible Times (U+2062) and Function Application (U+2061) are useful in mathematics text where the multiplication of terms or the application of a function is implied without any glyph indicating the operation. Unicode 5.1 introduces the Mathematical Invisible Plus character as well (U+2064).

Fraction slash

Example of fraction slash use. This font (Apple Chancery) shows the synthesized common fraction on the left and the precomposed fraction glyph on the right, as a rendering of the plain text string “1 1⁄4 1¼”. Depending on the text environment, the single string “1 1⁄4” might yield either result, the one on the right through substitution of the fraction sequence with the single precomposed fraction glyph.
A more elaborate example of fraction slash usage: plain text “4 221⁄225” rendered in Apple Chancery. This font supplies the text layout software with instructions to synthesize the fraction according to the Unicode rule described in this section.

The fraction slash character (U+2044) has special behavior in the Unicode Standard (section 6.2):

The standard form of a fraction built using the fraction slash is defined as follows: any sequence of one or more decimal digits (General Category = Nd), followed by the fraction slash, followed by any sequence of one or more decimal digits. Such a fraction should be displayed as a unit, such as ¾. If the displaying software is incapable of mapping the fraction to a unit, then it can also be displayed as a simple linear sequence as a fallback (for example, 3/4).

By following this Unicode recommendation, text processing systems yield sophisticated symbols from plain text alone. Here the presence of the fraction slash character instructs the layout engine to synthesize a fraction from all consecutive digits preceding and following the slash. In practice, results vary because of the complicated interplay between fonts and layout engines. Simple text layout engines tend not to synthesize fractions at all, and instead draw the glyphs as a linear sequence as described in the Unicode fallback scheme.
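The standard form quoted above (one or more decimal digits, the fraction slash, one or more decimal digits) can be recognized with a regular expression. The following is a sketch of the detection step only, not of fraction rendering; it relies on the fact that, for str patterns in Python 3, \d matches any character whose General Category is Nd. The function name is hypothetical.

```python
import re

FRACTION_SLASH = "\u2044"

# One or more decimal digits (Nd), the fraction slash, one or more digits.
_FRACTION = re.compile(r"\d+" + FRACTION_SLASH + r"\d+")


def find_fractions(text: str):
    """Return every standard-form fraction sequence found in text."""
    return _FRACTION.findall(text)
```

Applied to the plain text “4 221⁄225”, this returns the single sequence “221⁄225”, leaving the whole number 4 outside the fraction. An ordinary solidus, as in "3/4", is not matched.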

More sophisticated layout engines face two practical choices: they can follow Unicode's recommendation, or they can rely on the font's own instructions for synthesizing fractions. By ignoring the font's instructions, the layout engine can guarantee Unicode's recommended behavior. By following the font's instructions, the layout engine can achieve better typography because placement and shaping of the digits will be tuned to that particular font at that particular size.

The problem with following the font's instructions is that the simpler font formats have no way to specify fraction synthesis behavior. Meanwhile the more complex formats do not require the font to specify fraction synthesis behavior and therefore many do not. Most fonts of complex formats can instruct the layout engine to replace a plain text sequence such as "1⁄2" with the precomposed "½" glyph. But because many of them will not issue instructions to synthesize fractions, a plain text string such as "221⁄225" may well render as 22½25 (with the ½ being the substituted precomposed fraction, rather than synthesized). In the face of problems like this, those who wish to rely on the recommended Unicode behavior should choose fonts known to synthesize fractions or text layout software known to produce Unicode's recommended behavior regardless of font.

Bidirectional Neutral Formatting

Most letter characters in the Unicode repertoire are rendered from left to right. However, some scripts, such as Hebrew and Arabic, are rendered from right to left. Still other characters, such as punctuation, are neutral and inherit their directionality from the adjacent characters. This allows such characters to be unified regardless of directionality and reused in either left-to-right or right-to-left scripts. Some of these characters have the bidi-mirrored property, which indicates they should be mirrored when used in right-to-left text.

Within multilingual documents it is sometimes ambiguous which script a punctuation mark belongs to, especially when it occurs on the boundary between directional changes. Unicode therefore includes two characters that have strong directionality but no associated glyph. They are not displayed and can be ignored by systems that do not process bidirectional text.

  • Left-to-right mark (U+200E)
  • Right-to-left mark (U+200F)

Surrounding a bidirectionally neutral character with left-to-right marks forces it to behave as a left-to-right character, while surrounding it with right-to-left marks forces it to behave as a right-to-left character. The behavior of these characters is detailed in Unicode's Bidirectional Algorithm.
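The directional properties mentioned in this section can be inspected with Python's unicodedata module. The sketch below only queries character properties; it does not implement the Bidirectional Algorithm, and the helper names are illustrative.

```python
import unicodedata

LEFT_TO_RIGHT_MARK = "\u200E"
RIGHT_TO_LEFT_MARK = "\u200F"


def bidi_class(ch: str) -> str:
    """Return the bidirectional class abbreviation, e.g. 'L', 'R', 'ON'."""
    return unicodedata.bidirectional(ch)


def is_mirrored(ch: str) -> bool:
    """Return True if the character has the bidi-mirrored property."""
    return bool(unicodedata.mirrored(ch))
```

The two marks report the strong classes 'L' and 'R', while a punctuation mark such as "!" reports 'ON' (other neutral), which is why its direction must be resolved from context.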

Bidirectional General Formatting

While Unicode is designed to handle multiple languages, multiple writing systems and even text that flows either left-to-right or right-to-left with minimal author intervention, there are special circumstances where the mix of bidirectional text can become intricate—requiring more author control. For these circumstances, Unicode includes five other characters to control the complex embedding of left-to-right text within right-to-left text and vice versa:

  • Left-to-right embedding (U+202A)
  • Right-to-left embedding (U+202B)
  • Pop directional formatting (U+202C)
  • Left-to-right override (U+202D)
  • Right-to-left override (U+202E)

Interlinear annotation characters

  • Interlinear Annotation Anchor (U+FFF9)
  • Interlinear Annotation Separator (U+FFFA)
  • Interlinear Annotation Terminator (U+FFFB)

Script specific

  • Prefixed format control
    • Arabic Number Sign (U+0600)
    • Arabic Sign Sanah (U+0601)
    • Arabic Footnote Marker (U+0602)
    • Arabic Sign Safha (U+0603)
    • Arabic End of Ayah (U+06DD)
    • Syriac Abbreviation Mark (U+070F)
  • Brahmi-derived script dead-character formation
    • Devanagari Sign Virama (U+094D)
    • Bengali Sign Virama (U+09CD)
    • Gurmukhi Sign Virama (U+0A4D)
    • Gujarati Sign Virama (U+0ACD)
    • Oriya Sign Virama (U+0B4D)
    • Tamil Sign Virama (U+0BCD)
    • Telugu Sign Virama (U+0C4D)
    • Kannada Sign Virama (U+0CCD)
    • Malayalam Sign Virama (U+0D4D)
    • Sinhala Sign Al-Lakuna (U+0DCA)
    • Thai Character Phinthu (U+0E3A)
    • Myanmar Sign Virama (U+1039)
    • Tagalog Sign Virama (U+1714)
    • Hanunoo Sign Pamudpod (U+1734)
    • Khmer Sign Coeng (U+17D2)
    • Balinese Adeg Adeg (U+1B44)
    • Syloti Nagri Sign Hasanta (U+A806)
    • Kharoshthi Virama (U+10A3F)
  • Historical Viramas with other functions
    • Tibetan Mark Halanta (U+0F84)
    • Limbu Sign Sa-I (U+193B)
  • Mongolian Variation Selectors
    • Mongolian Free Variation Selector One (U+180B)
    • Mongolian Free Variation Selector Two (U+180C)
    • Mongolian Free Variation Selector Three (U+180D)
    • Mongolian Vowel Separator (U+180E)
  • Ogham
    • Ogham Space Mark (U+1680)
  • Ideographic
    • Ideographic variation indicator (U+303E)
    • Ideographic Description (U+2FF0..U+2FFB)
  • Musical Format Control
    • Musical Symbol Begin Beam (U+1D173)
    • Musical Symbol End Beam (U+1D174)
    • Musical Symbol Begin Tie (U+1D175)
    • Musical Symbol End Tie (U+1D176)
    • Musical Symbol Begin Slur (U+1D177)
    • Musical Symbol End Slur (U+1D178)
    • Musical Symbol Begin Phrase (U+1D179)
    • Musical Symbol End Phrase (U+1D17A)

Others

  • Object Replacement Character (U+FFFC)
  • Replacement Character (U+FFFD)

Whitespace characters

Whitespace characters are not a separate group of characters. Instead, Unicode provides a list of characters it deems whitespace characters for interoperability support. Software implementations and other standards may use the term to denote a slightly different set of characters. Whitespace characters are typically designated for programming environments, where they often have no syntactic meaning and are ignored by the machine interpreters. Unicode designates the legacy control characters U+0009 through U+000D and U+0085 as whitespace characters, along with the line separator (U+2028), the paragraph separator (U+2029), and the core space character (U+0020). The White_Space property also covers the no-break space and the typographic spaces listed under Spaces, whereas the narrower Pattern_White_Space property, intended for programming-language syntax, excludes them.

Private use characters

The UCS includes 137,468 code points for private use. This means these code points can be assigned characters with specific properties by individuals, organizations and software vendors outside the ISO and Unicode Consortium. A Private Use Area (PUA) is one of several ranges which are reserved for private use. For this range, the Unicode standard does not specify any characters.

The Basic Multilingual Plane includes a PUA in the range from U+E000 to U+F8FF (57344–63743). Plane Fifteen (U+F0000 to U+FFFFD), and Plane Sixteen (U+100000 to U+10FFFD) are completely reserved for private use as well.
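The three ranges above account for the 137,468 private use code points. The following sketch tabulates them; the names PRIVATE_USE_RANGES, is_private_use, and private_use_count are illustrative only.

```python
# (first, last) code point of each private use range, inclusive.
PRIVATE_USE_RANGES = [
    (0xE000, 0xF8FF),      # Basic Multilingual Plane PUA: 6,400 code points
    (0xF0000, 0xFFFFD),    # Plane Fifteen: 65,534 code points
    (0x100000, 0x10FFFD),  # Plane Sixteen: 65,534 code points
]


def is_private_use(code_point: int) -> bool:
    """Return True if code_point falls in one of the private use ranges."""
    return any(lo <= code_point <= hi for lo, hi in PRIVATE_USE_RANGES)


def private_use_count() -> int:
    """Total number of private use code points across all ranges."""
    return sum(hi - lo + 1 for lo, hi in PRIVATE_USE_RANGES)
```

The last two code points of Planes Fifteen and Sixteen are excluded from the ranges because they are noncharacters.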

The use of the PUA was a concept inherited from certain Asian encoding systems. These systems had private use areas to encode Japanese Gaiji (rare personal name characters) in application-specific ways. One example of usage of the Private Use Area is Apple's usage of U+F8FF for the Apple logo.

Schemes and initiatives that use the PUA include:

Standardization Initiative Uses

Vendor Use

  • The Adobe Glyph List uses the PUA for some of its glyphs.
  • Apple lists a range of 2,274 characters in its developer documentation of 0xF400–0xF8FF within the PUA for Apple’s use.
  • WGL4 uses the PUA (U+F001 and U+F002) to encode two characters which are duplicates of the ligatures fi (U+FB01) and fl (U+FB02) [1].

Special code points

At the simplest level, each character in the UCS represents a code point and a particular semantic function: For graphical characters, the semantic function is often implied by its name, and the script or block it is included within. A graphical character may also have a recommended glyph that helps define the meaning of the character. Han characters, used in China, Japan, Korea, Vietnam and their respective diaspora, include many other rich properties that participate in defining the semantic role for a character.

However, the UCS and Unicode designate other code points for other purposes. Those code points may have no or few character properties associated with them.

Surrogates

The 2,048 surrogates are not characters, but are reserved for use in UTF-16 to specify code points outside the Basic Multilingual Plane. They are divided into "high surrogates" (D800–DBFF) and "low surrogates" (DC00–DFFF). In UTF-16, they must always appear in pairs, as a high surrogate followed by a low surrogate, thus using 32 bits to denote one code point.

A surrogate pair denotes the code point

10000₁₆ + (H − D800₁₆) × 400₁₆ + (L − DC00₁₆)

where H and L are the numeric values of the high and low surrogates respectively.

Since high surrogate values in the range DB80 to DBFF always produce values in the Private Use planes, the high surrogate range can be further divided into (normal) high surrogates (D800–DB7F) and "high private use surrogates" (DB80–DBFF).
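The surrogate pair formula can be written out directly. This is a minimal sketch with hypothetical function names: the decoder follows the formula above, and the encoder is its inverse.

```python
def decode_surrogate_pair(high: int, low: int) -> int:
    """Combine a high and a low surrogate into one code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("high surrogate out of range")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("low surrogate out of range")
    return 0x10000 + (high - 0xD800) * 0x400 + (low - 0xDC00)


def encode_surrogate_pair(code_point: int):
    """Split a code point outside the BMP into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000           # a 20-bit value
    high = 0xD800 + (offset >> 10)          # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)         # lower 10 bits
    return high, low
```

For example, U+1D11E corresponds to the pair D834, DD1E. The first "high private use surrogate", DB80, paired with DC00 yields U+F0000, the first code point of Plane Fifteen.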

Noncharacters

Unicode reserves sixty-six code points as noncharacters. These code points are guaranteed never to have a character assigned to them, so software implementations are free to use them internally. However, noncharacters should never be included in text interchanged between implementations. One inherently useful example of a noncharacter is the code point U+FFFE, which is the byte-swapped form of the byte-order mark (U+FEFF). If a stream of text contains this noncharacter, this is a good indication the text has been interpreted with the incorrect endianness.
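The sixty-six noncharacters are not enumerated in this section. The sketch below assumes the set defined by the Unicode Standard: the range U+FDD0 to U+FDEF, plus the last two code points of each of the seventeen planes. The function names are chosen for this example.

```python
def is_noncharacter(code_point: int) -> bool:
    """Return True for the 66 noncharacter code points (assumed set)."""
    if not 0 <= code_point <= 0x10FFFF:
        return False
    if 0xFDD0 <= code_point <= 0xFDEF:        # 32 code points in the BMP
        return True
    return (code_point & 0xFFFE) == 0xFFFE    # xxFFFE and xxFFFF in each plane


def looks_byte_swapped(text: str) -> bool:
    """Heuristic: text starting with U+FFFE was probably decoded
    with the wrong byte order."""
    return text.startswith("\uFFFE")
```

Decoding a little-endian BOM as big-endian UTF-16 produces exactly this signal: the bytes FF FE read in the wrong order yield U+FFFE.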

Character properties

Every character in Unicode is defined by a large and growing set of properties. The properties facilitate text processing including collation or sorting of text, identifying words, sentences and graphemes, rendering or imaging text and so on. Below is a list of some of the core properties. There are many others documented in the Unicode Character Database.

Property | Example | Details
Name | LATIN CAPITAL LETTER A | This is a permanent name assigned jointly by Unicode and the ISO UCS.
Code Point | U+0041 | The Unicode code point is a number permanently assigned along with the "Name" property and included in the companion UCS. The usual custom is to represent the code point as a hexadecimal number with the prefix "U+".
Representative Glyph | A | The representative glyphs are provided in code charts.
General Category | Uppercase_Letter | The general category is expressed as a two-letter sequence such as "Lu" for uppercase letter or "Nd" for decimal digit number.
Combining Class | Not_Reordered (0) | Since diacritics and other combining marks can be expressed with multiple characters in Unicode, the "Combining Class" property allows characters to be differentiated by the type of combining character they represent. The combining class can be expressed as an integer between 0 and 255 or as a named value. The integer values allow the combining marks to be reordered into a canonical order to make string comparison of identical strings possible.
Bidirectional Category | Left_To_Right | Indicates the type of character for applying the Unicode bidirectional algorithm.
Bidirectional Mirrored | no | Indicates the character's glyph must be reversed or mirrored within the bidirectional algorithm. Mirrored glyphs can be provided by font makers, extracted from other characters related through the “Bidirectional Mirroring Glyph” property, or synthesized by the text rendering system.
Bidirectional Mirroring Glyph | N/A | This property indicates the code point of another character whose glyph can serve as the mirrored glyph for the present character when mirroring within the bidirectional algorithm.
Decimal Digit Value | NaN | For numerals, these three properties (Decimal Digit Value, Digit Value, Numeric Value) indicate the numeric value of the character. Decimal digits have all three set to the same value. Presentational rich text compatibility characters and other Arabic-Indic non-decimal digits typically have only the latter two set to the numeric value of the character. Numerals unrelated to Arabic-Indic digits, such as Roman numerals or Hangzhou/Suzhou numerals, typically have only the "Numeric Value" indicated.
Digit Value | NaN | See Decimal Digit Value.
Numeric Value | NaN | See Decimal Digit Value.
Ideographic | False | Indicates the character is an ideograph.
Default Ignorable | False | Indicates the character is ignorable for implementations and that no glyph, last resort glyph, or replacement character need be displayed.
Deprecated | False | Unicode never removes characters from the repertoire, but on occasion Unicode has deprecated a small number of characters.
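The role of the combining class in canonical ordering can be illustrated with two spellings of the same text: a base letter followed by a circumflex above (class 230) and a grave accent below (class 220), in either order. The sketch assumes Python's unicodedata module and Normalization Form D; the helper names are hypothetical.

```python
import unicodedata


def canonical_order(text: str) -> str:
    """Decompose text and sort combining marks into canonical order (NFD)."""
    return unicodedata.normalize("NFD", text)


def combining_classes(text: str):
    """Return the combining class of each character in text."""
    return [unicodedata.combining(ch) for ch in text]


above_then_below = "a\u0302\u0316"   # a + circumflex (230) + grave below (220)
below_then_above = "a\u0316\u0302"   # a + grave below (220) + circumflex (230)
```

The two strings differ code point by code point but compare equal once both have been passed through canonical_order, because marks with a lower combining class are moved ahead of marks with a higher one. Marks that share the same combining class keep their relative order.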

Additional examples

Name | Code Point | Representative Glyph [2] | General Category | Combining Class | Bidirectional Category | Bidirectional Mirrored | Bidirectional Mirroring Glyph | Decimal Digit Value | Digit Value | Numeric Value
DIGIT FOUR | U+0034 | 4 | Decimal_Number (Nd) | Not_Reordered (0) | European_Number (EN) | no | n/a | 4 | 4 | 4
DEVANAGARI DIGIT FOUR | U+096A | ४ | Decimal_Number (Nd) | Not_Reordered (0) | Left_To_Right (L) | no | n/a | 4 | 4 | 4
CIRCLED DIGIT FOUR | U+2463 | ④ | Other_Number (No) | Not_Reordered (0) | Other_Neutral (ON) | no | n/a | n/a | 4 | 4
ROMAN NUMERAL FOUR | U+2163 | Ⅳ | Letter_Number (Nl) | Not_Reordered (0) | Left_To_Right (L) | no | n/a | n/a | n/a | 4
LEFT CURLY BRACKET | U+007B | { | Open_Punctuation (Ps) | Not_Reordered (0) | Other_Neutral (ON) | yes | “}” U+007D | NaN | NaN | NaN
COMBINING CIRCUMFLEX ACCENT | U+0302 | ◌̂ | Nonspacing_Mark (Mn) | Above (230) | Nonspacing_Mark (NSM) | no | n/a | NaN | NaN | NaN
COMBINING GRAVE ACCENT BELOW | U+0316 | ◌̖ | Nonspacing_Mark (Mn) | Below (220) | Nonspacing_Mark (NSM) | no | n/a | NaN | NaN | NaN
ARABIC LETTER BEH | U+0628 | ب | Other_Letter (Lo) | Not_Reordered (0) | Arabic_Letter (AL) | no | n/a | n/a | n/a | n/a
HEBREW LETTER BET | U+05D1 | ב | Other_Letter (Lo) | Not_Reordered (0) | Right_To_Left (R) | no | n/a | n/a | n/a | n/a
<none> (kDefinition = parapet; invisible) | U+4E0F | 丏 | Other_Letter (Lo) | Not_Reordered (0) | Left_To_Right (L) | no | n/a | n/a | n/a | n/a
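Several of the properties tabulated above can be read from the copy of the Unicode Character Database that ships with Python. The sketch below is one possible way to collect them; the describe function and its dictionary keys are inventions of this example, and the values returned depend on the Unicode version bundled with the interpreter.

```python
import unicodedata


def describe(ch: str) -> dict:
    """Collect a few core properties of a single character."""
    return {
        "name": unicodedata.name(ch, None),
        "general_category": unicodedata.category(ch),
        "combining_class": unicodedata.combining(ch),
        "bidi_class": unicodedata.bidirectional(ch),
        "mirrored": bool(unicodedata.mirrored(ch)),
        # The three numeric properties; None means "not set".
        "decimal": unicodedata.decimal(ch, None),
        "digit": unicodedata.digit(ch, None),
        "numeric": unicodedata.numeric(ch, None),
    }
```

For DIGIT FOUR all three numeric properties are set, for CIRCLED DIGIT FOUR only the digit and numeric values are set, and for ROMAN NUMERAL FOUR only the numeric value is set, matching the pattern in the table.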

Characters include many other properties. Some properties are strings, some are booleans, some are relations to other characters. For example cased letters include properties that map those characters to their upper case, lower case and title case equivalents (title case is only used for ligatures). Some characters (canonical and compatibility decomposable characters) include mappings to canonical and compatibility equivalents. Characters have many boolean properties to indicate whether they are included as white space, or used as pattern syntax within programming languages and more. Many of these properties are exposed through regular expressions to perform complex queries on text. These properties are also used in the many Unicode text processing algorithms and also might be used by text imaging and font technologies to display text (like the bidirectional algorithm).

Unicode provides an online database to interactively query the entire Unicode character repertoire by the various properties.

Summary table of UCS character assignments

See also

Tables

Unicode mapping tables
BMP SMP SIP SSP
0000—0FFF 8000—8FFF 10000—10FFF 20000—20FFF 28000—28FFF E0000—E0FFF
1000—1FFF 9000—9FFF   21000—21FFF 29000—29FFF
2000—2FFF A000—AFFF 12000—12FFF 22000—22FFF 2A000—2AFFF
3000—3FFF B000—BFFF   23000—23FFF  
4000—4FFF C000—CFFF   24000—24FFF 2F000—2FFFF
5000—5FFF D000—DFFF 1D000—1DFFF 25000—25FFF  
6000—6FFF E000—EFFF   26000—26FFF  
7000—7FFF F000—FFFF 1F000—1FFFF 27000—27FFF

External links

References

  1. See WGL4 Unicode Range U+2013 through U+FB02
  2. Not the official Unicode representative glyph, but merely a representative glyph. To see the official Unicode representative glyph, see the code charts.