UTF-8

From Wikipedia, the free encyclopedia

UTF-8 (8-bit UCS/Unicode Transformation Format) is a variable-length character encoding for Unicode. It can represent any character in the Unicode standard, and its encoding of the first 128 characters is identical to ASCII, making it backwards compatible with ASCII. For these reasons, it is steadily becoming the preferred encoding for e-mail, web pages,[1] and other places where characters are stored or streamed.

UTF-8 encodes each character (code point) in one to four octets (8-bit bytes), with the 1-byte encoding used for the 128 US-ASCII characters. See the Description section below for details.

The Internet Engineering Task Force (IETF) requires all Internet protocols to identify the encoding used for character data, and the supported character encodings must include UTF-8.[2] The Internet Mail Consortium (IMC) recommends that all email programs be able to display and create mail using UTF-8.[3]

History

By early 1992 the search was on for a good byte-stream encoding of multi-byte character sets. The draft ISO 10646 standard contained a non-required annex called UTF-1 that provided a byte-stream encoding of its 32-bit code points. This encoding was not satisfactory on performance grounds, but it did introduce the notion that bytes in the ASCII range of 0–127 represent themselves, thereby providing backward compatibility.

In July 1992, the X/Open committee XoJIG was looking for a better encoding. Dave Prosser of Unix System Laboratories submitted a proposal for one that had faster implementation characteristics and introduced the improvement that 7-bit ASCII characters would only represent themselves; all multibyte sequences would include only 8-bit characters, i.e., those where the high bit was set.

In August 1992, this proposal was circulated by an IBM X/Open representative to interested parties. Ken Thompson of the Plan 9 operating system group at Bell Labs then made a crucial modification to the encoding to allow it to be self-synchronizing, meaning that it is not necessary to read from the beginning of a string in order to find code point boundaries. Thompson's design was outlined on September 2, 1992, on a placemat in a New Jersey diner with Rob Pike. Over the following days, Pike and Thompson implemented it and updated Plan 9 to use it throughout, and then communicated their success back to X/Open.[4]

UTF-8 was first officially presented at the USENIX conference in San Diego, from January 25–29, 1993.

Description

The bits of a Unicode code point are distributed into the lower bit positions of the UTF-8 bytes, with the lowest bit going into the last bit of the last byte. In the following table, x represents the lowest 8 bits of the Unicode value, y represents the next higher 8 bits, and z represents the bits higher than that:

Code point range   Byte 1    Byte 2    Byte 3    Byte 4
U+0000–U+007F      0xxxxxxx
    Example: '$' U+0024 → 00100100 = 0x24
U+0080–U+07FF      110yyyxx  10xxxxxx
    Example: '¢' U+00A2 → 11000010 10100010 = 0xC2 0xA2
U+0800–U+FFFF      1110yyyy  10yyyyxx  10xxxxxx
    Example: '€' U+20AC → 11100010 10000010 10101100 = 0xE2 0x82 0xAC
U+10000–U+10FFFF   11110zzz  10zzyyyy  10yyyyxx  10xxxxxx
    Example: U+10ABCD → 11110100 10001010 10101111 10001101 = 0xF4 0x8A 0xAF 0x8D

So the first 128 characters (US-ASCII) need one byte. The next 1920 characters need two bytes to encode. This includes Latin letters with diacritics and characters from Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac and Tāna alphabets. Three bytes are needed for the rest of the Basic Multilingual Plane (which contains virtually all characters in common use). Four bytes are needed for characters in the other planes of Unicode, which are rarely used in practice.
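
The bit-distribution rules in the table map directly onto shifts and masks. The following Python sketch encodes a single code point by hand (an illustration only; it omits the surrogate check that a conformant encoder would also perform):

    def utf8_encode(cp):
        """Encode one Unicode code point per the table above."""
        if cp < 0x80:                      # 1 byte: 0xxxxxxx
            return bytes([cp])
        if cp < 0x800:                     # 2 bytes: 110xxxxx 10xxxxxx
            return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
        if cp < 0x10000:                   # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            return bytes([0xE0 | (cp >> 12),
                          0x80 | ((cp >> 6) & 0x3F),
                          0x80 | (cp & 0x3F)])
        if cp <= 0x10FFFF:                 # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            return bytes([0xF0 | (cp >> 18),
                          0x80 | ((cp >> 12) & 0x3F),
                          0x80 | ((cp >> 6) & 0x3F),
                          0x80 | (cp & 0x3F)])
        raise ValueError("code point out of Unicode range")

    assert utf8_encode(0x24) == b'\x24'                      # '$'
    assert utf8_encode(0xA2) == b'\xC2\xA2'                  # '¢'
    assert utf8_encode(0x20AC) == b'\xE2\x82\xAC'            # '€'
    assert utf8_encode(0x10ABCD) == b'\xF4\x8A\xAF\x8D'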

By continuing the pattern it is possible to encode much larger numbers: the original specification allowed for sequences of up to six bytes, covering numbers up to 31 bits (the original limit of the Universal Character Set). However, in November 2003 UTF-8 was restricted by RFC 3629 to the range covered by the formal Unicode definition, U+0000 to U+10FFFF.

With these restrictions, bytes in a UTF-8 sequence have the following meanings. Bytes 0xC0–0xC1 and 0xF5–0xFF can never appear in a legal UTF-8 sequence; bytes 0x00–0x7F stand alone as single-byte characters; bytes 0xC2–0xF4 may only appear as the first byte of a multi-byte sequence; and bytes 0x80–0xBF may only appear as the second or later byte of a multi-byte sequence:

Binary             Hex    Decimal   Notes
00000000-01111111  00-7F  0-127     US-ASCII (single byte)
10000000-10111111  80-BF  128-191   Second, third, or fourth byte of a multi-byte sequence
11000000-11000001  C0-C1  192-193   Overlong encoding: start of a 2-byte sequence, but would encode a code point ≤ 127
11000010-11011111  C2-DF  194-223   Start of a 2-byte sequence
11100000-11101111  E0-EF  224-239   Start of a 3-byte sequence
11110000-11110100  F0-F4  240-244   Start of a 4-byte sequence
11110101-11110111  F5-F7  245-247   Restricted by RFC 3629: start of a 4-byte sequence for a code point above U+10FFFF
11111000-11111011  F8-FB  248-251   Restricted by RFC 3629: start of a 5-byte sequence
11111100-11111101  FC-FD  252-253   Restricted by RFC 3629: start of a 6-byte sequence
11111110-11111111  FE-FF  254-255   Invalid: not defined by the original UTF-8 specification

Unicode also disallows the 2048 code points U+D800–U+DFFF, which UTF-16 reserves for surrogate pairs (see Table 3-7 in the Unicode 5.0 standard). The UTF-8 bit pattern can represent these values, but they are not valid Unicode scalar values, so their three-byte encodings are considered invalid sequences.
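
Practical codecs enforce this. In Python, for example, both encoding a lone surrogate and decoding its would-be three-byte pattern fail:

    # Encoding a lone surrogate code point is refused outright:
    try:
        '\ud800'.encode('utf-8')
    except UnicodeEncodeError as e:
        print(e)                    # surrogates are not valid scalar values

    # Decoding the three-byte pattern that would yield U+D800 also fails:
    try:
        b'\xED\xA0\x80'.decode('utf-8')
    except UnicodeDecodeError as e:
        print(e)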

Naming

The official name is "UTF-8", and it is used in all documents relating to the encoding. There are many instances, particularly for documents transmitted across the Internet, where the character set in which a document is encoded is declared by name near the start of the document. In this case, the correct name to use is "UTF-8". In addition, all standards conforming to the Internet Assigned Numbers Authority (IANA) list,[5] which includes CSS, HTML, XML, and HTTP headers,[6] may also use the name "utf-8", as the declaration is case insensitive. Despite this, alternative forms such as "utf8" or "UTF8" are often seen; while these are incorrect and should be avoided, most agents such as browsers understand them.

Rationale behind UTF-8's design

UTF-8 was designed to satisfy the following properties:

The ASCII characters are represented by themselves as single bytes that do not appear anywhere else
This makes UTF-8 work with the large number of existing APIs that take byte strings but only treat a small number of ASCII codes specially. It removes the need to write a UTF-8 version of such an API and, more importantly, the need to identify whether text is ASCII or UTF-8. This makes it much easier to convert existing systems to UTF-8 than to any other Unicode encoding.
No first byte can appear as a later byte in a character
This makes UTF-8 "self-synchronizing". If one or more complete bytes are lost due to error or corruption, one can always locate the beginning of the next character and thus limit the damage.
The first byte of a character determines the number of bytes
These two properties guarantee that no byte sequence of one character is contained within a longer byte sequence that is not that character. This ensures that byte-wise sub-string matching can be applied to search for words or phrases within a text; some older variable-length encodings (such as Shift JIS) did not have this property and thus made string-matching algorithms rather complicated.
The bytes 0xfe and 0xff do not appear
This means a UTF-8 stream never matches the UTF-16 byte-order mark and thus cannot be confused with it (this property may not have been intended but is quite useful in practice).

These properties add redundancy to UTF-8–encoded text: it is very unlikely that a random sequence of bytes will validate as UTF-8. Encodings that lack such a practical validity test, such as Shift JIS and ISO/IEC 8859-1, are prone to misinterpretation errors (mojibake), and UTF-16 requires a byte-order mark for reliable detection. The chance of a random sequence of bytes being valid UTF-8 and not pure ASCII is 3.9% for a two-byte sequence, 0.41% for a three-byte sequence and 0.026% for a four-byte sequence.[7] While natural languages encoded in traditional encodings are not random byte sequences, they are even less likely to pass a UTF-8 validity test and then be misinterpreted. For example, for ISO/IEC 8859-1 text to be mis-recognized as UTF-8, the only non-ASCII characters in it would have to be in sequences starting with either an accented letter or the multiplication symbol and ending with a symbol (pure ASCII text passes a UTF-8 validity test trivially, since it is valid UTF-8 by definition).
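
The cited percentages follow directly from the counts in footnote [7]; a quick Python check:

    # Counts of valid UTF-8 encodings per sequence length, from footnote [7].
    for length, valid in [(2, 1920), (3, 61406), (4, 1048544)]:
        total = 256**length - 128**length   # sequences that are not pure ASCII
        print(length, "bytes:", round(100 * valid / total, 3), "%")
    # 2 bytes: 3.906 %
    # 3 bytes: 0.418 %
    # 4 bytes: 0.026 %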

Redundancy means that UTF-8 text does not use memory as efficiently as possible. On average UTF-8 is more efficient than other Unicode encodings; details are listed in the advantages/disadvantages below. If minimum size is of importance, any modern data compression algorithm such as that used by gzip will compress Unicode text to far less than one byte per character (on average), making size arguments about Unicode encodings largely irrelevant.
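
A rough way to test this claim with Python's zlib, assuming a hypothetical sample file of natural-language text:

    import zlib

    text = open('sample.txt', encoding='utf-8').read()   # hypothetical input file
    for enc in ('utf-8', 'utf-16-le', 'utf-32-le'):
        raw = text.encode(enc)
        print(enc, len(raw), '->', len(zlib.compress(raw, 9)))
    # The raw sizes differ by a factor of two to four, while the
    # compressed sizes largely converge for typical text.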

UTF-8 derivations

The following encodings differ slightly from the UTF-8 specification and are incompatible with it.

CESU-8

Many pieces of software added UTF-8 conversions for UCS-2 data and did not alter their UTF-8 conversion when UCS-2 was replaced by the surrogate-pair-supporting UTF-16. The result is that each half of a UTF-16 surrogate pair is encoded as its own 3-byte UTF-8 sequence, producing 6 bytes rather than 4 for characters outside the Basic Multilingual Plane. Oracle databases use this encoding, as do Java and Tcl as described below, and probably a great deal of other software whose programmers were unaware of the complexities of UTF-16. Although most usage is accidental, a supposed benefit is that binary-sorting CESU-8 data preserves UTF-16 binary sorting order.
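
A minimal Python sketch of how the six-byte form arises (cesu8_encode_astral is a hypothetical helper, not a standard API): the code point is split into UTF-16 surrogates, and the plain three-byte UTF-8 pattern is applied to each half.

    def cesu8_encode_astral(cp):
        """CESU-8 encoding of a code point outside the BMP."""
        assert 0x10000 <= cp <= 0x10FFFF
        v = cp - 0x10000
        hi = 0xD800 | (v >> 10)              # high surrogate
        lo = 0xDC00 | (v & 0x3FF)            # low surrogate
        def three_byte(u):                   # 1110xxxx 10xxxxxx 10xxxxxx
            return bytes([0xE0 | (u >> 12),
                          0x80 | ((u >> 6) & 0x3F),
                          0x80 | (u & 0x3F)])
        return three_byte(hi) + three_byte(lo)

    # U+10400 is 4 bytes in UTF-8 (F0 90 90 80) but 6 bytes in CESU-8:
    print(cesu8_encode_astral(0x10400).hex(' '))   # ed a0 81 ed b0 80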

Modified UTF-8

In Modified UTF-8 the null character (U+0000) is encoded as 0xC0,0x80 rather than 0x00. (0xC0,0x80 is not valid UTF-8 because it is not the shortest possible representation.) This means that the encoding of a string containing the null character will contain no null byte, and thus will not be truncated if processed in a language such as C using traditional null-terminated string functions.
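
A sketch of the difference for BMP text (mutf8_encode_char is an illustrative helper, not a standard API; non-BMP characters would additionally use the CESU-8 form described above):

    def mutf8_encode_char(cp):
        """Modified UTF-8 for a BMP code point: U+0000 -> 0xC0 0x80."""
        if cp == 0:
            return b'\xC0\x80'          # overlong form, deliberately
        return chr(cp).encode('utf-8')  # otherwise as in standard UTF-8

    encoded = b''.join(mutf8_encode_char(ord(c)) for c in 'a\x00b')
    print(encoded)                      # b'a\xc0\x80b'
    print(b'\x00' in encoded)           # False: safe for C string functions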

All known Modified UTF-8 implementations also treat the surrogate pairs as in CESU-8.

In normal usage, the Java programming language supports standard UTF-8 when reading and writing strings through InputStreamReader and OutputStreamWriter. However, it uses Modified UTF-8 for object serialization, for the Java Native Interface, and for embedding constants in class files. Tcl uses the same Modified UTF-8 as Java for its internal representation of Unicode data.

Byte-order mark

Many Windows programs (including Windows Notepad) add the bytes 0xEF,0xBB,0xBF at the start of any document saved as UTF-8. This is the UTF-8 encoding of the Unicode byte-order mark. This causes interoperability problems with software that does not expect the BOM. In particular:

  • It removes the desirable feature that UTF-8 is identical to ASCII for ASCII-only text. For instance, a text editor that does not recognize UTF-8 will display "" at the start of the document, even if the UTF-8 contains only ASCII and would otherwise display correctly. It is, for example, possible to introduce Unicode into an existing computer language and compiler by adding an API for input/output and using an external UTF-8 text editor to edit text strings; however, such a compiler would not accept the BOM, which would have to be removed manually.
  • Programs that identify file types by special leading characters will fail to identify the UTF-8 files, even for file types that can otherwise contain UTF-8. A notable example is the Unix shebang syntax.

Some Windows software (including Notepad) will sometimes misidentify UTF-8 (and thus plain ASCII) documents as UTF-16LE if this BOM is missing, a bug commonly known as "Bush hid the facts" after a particular phrase that can trigger it.
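
Software that must interoperate typically strips the BOM on input. A minimal Python sketch; note that Python also ships a 'utf-8-sig' codec that consumes a leading BOM automatically:

    import codecs

    def strip_utf8_bom(data):
        """Drop a leading 0xEF 0xBB 0xBF, if present."""
        if data.startswith(codecs.BOM_UTF8):
            return data[len(codecs.BOM_UTF8):]
        return data

    print(strip_utf8_bom(b'\xEF\xBB\xBFhello'))      # b'hello'
    print(b'\xEF\xBB\xBFhello'.decode('utf-8-sig'))  # 'hello'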

Overlong forms, invalid input, and security considerations

The exact response required of a UTF-8 decoder on invalid input is not uniformly defined by the standards. In general, there are several ways a UTF-8 decoder might behave in the event of an invalid byte sequence:

  1. Not notice, and decode the bytes as if they were some similar piece of valid UTF-8.
  2. Replace the bytes with a replacement character (usually '?' or '�' (U+FFFD)).
  3. Ignore the bytes.
  4. Interpret the bytes according to another encoding (often ISO-8859-1 or CP1252).
  5. Translate each byte to purposely-invalid Unicode such as 0xdcxx.
  6. Act as if the string ends at that point and report an error.
  7. Undo (or avoid) any result of the already-decoded part and report an error.

Decoders may also differ in what bytes are part of the error. The sequence 0xF0,0x20,0x20,0x20 might be considered a single 4-byte error, or a 1-byte error followed by 3 space characters (the second interpretation is preferable).

It is possible for a decoder to behave in different ways for different types of invalid input.
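
Several of these behaviors can be observed directly with Python's error handlers, reusing the example sequence from the previous paragraph (option numbers refer to the list above):

    bad = b'A\xF0\x20\x20\x20B'     # the invalid sequence discussed above

    print(bad.decode('utf-8', errors='replace'))   # option 2: 'A\ufffd   B'
    print(bad.decode('utf-8', errors='ignore'))    # option 3: 'A   B'
    print(repr(bad.decode('utf-8', errors='surrogateescape')))  # option 5: 0xF0 -> U+DCF0
    try:
        bad.decode('utf-8')                        # default: stop and report an error
    except UnicodeDecodeError as e:
        print(e)

Note that Python treats the lone 0xF0 as a one-byte error followed by three spaces, the interpretation described above as preferable.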

RFC 3629 states that "Implementations of the decoding algorithm MUST protect against decoding invalid sequences."[8] The Unicode Standard requires a Unicode-compliant decoder to "…treat any ill-formed code unit sequence as an error condition. This guarantees that it will neither interpret nor emit an ill-formed code unit sequence." This implies that only the last option (reporting an error and avoiding any side effects from the already-decoded part) is allowed. That is quite impractical in actual usage; for instance, a string comparison function is much more usable if it returns false rather than throwing an error when one of the strings is invalid UTF-8. Therefore most practical uses of UTF-8 ignore the RFC and instead translate invalid sequences to something "safe".

Overlong forms are one of the most troublesome types of UTF-8 data. Many simpler decoders will happily decode them, producing an ASCII slash or NUL in unexpected places. Overlong forms have been used to bypass security validations in high-profile products, including Microsoft's IIS web server.
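
For example, 0xC0 0xAF is an overlong two-byte form of the slash '/' (U+002F). A minimal Python demonstration of the conformant and the naive behavior:

    overlong = b'\xC0\xAF'        # overlong encoding of '/'
    try:
        overlong.decode('utf-8')  # conformant decoder: error
    except UnicodeDecodeError as e:
        print(e)

    # What a naive decoder that skips the range check would compute:
    cp = ((0xC0 & 0x1F) << 6) | (0xAF & 0x3F)
    print(hex(cp))                # 0x2f, i.e. '/', the IIS-style bypass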

Advantages and disadvantages

General

Advantages

  • UTF-8 is a superset of ASCII. Since a plain ASCII string is also a valid UTF-8 string, no conversion needs to be done for existing ASCII text. Software designed for traditional code-page-specific character sets can generally be used with UTF-8 with few or no changes.
  • Sorting of UTF-8 strings using standard byte-oriented sorting routines will produce the same results as sorting them based on Unicode code points; see the sketch after this list. (This has limited usefulness, though, since it is unlikely to represent the culturally acceptable sort order of any particular language or locale.) For the sorting to work correctly, the bytes must be treated as unsigned values.
  • UTF-8 and UTF-16 are the standard encodings for XML documents. All other encodings must be specified explicitly, either externally or through a text declaration.
  • UTF-8 and UTF-16 are the standard encodings for having Unicode in HTML documents, with UTF-8 as the preferred and most used encoding.
  • Any byte oriented string searching algorithm can be used with UTF-8 data (as long as one ensures that the inputs only consist of complete UTF-8 characters). Care must be taken with regular expressions and other constructs that count characters, however.
  • UTF-8 strings can be fairly reliably recognized as such by a simple algorithm. That is, the probability that a string of characters in any other encoding appears as valid UTF-8 is low, diminishing with increasing string length. For instance, the octet values C0, C1, and F5 to FF never appear. For better reliability, regular expressions can be used to take into account illegal overlong and surrogate values (see the W3 FAQ: Multilingual Forms for a Perl regular expression to validate a UTF-8 string). This is an advantage that most other encodings do not have, causing errors (mojibake) if the encoding is not stated in the file and wrongly guessed.
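
The sorting property is easy to verify in Python, which compares str values by code point and bytes values as unsigned octets; a minimal check:

    words = ['z', 'é', 'a', '€', '\U0001d11e']    # mixed ASCII, BMP, non-BMP
    by_bytes = sorted(words, key=lambda w: w.encode('utf-8'))
    by_code_points = sorted(words)
    print(by_bytes == by_code_points)             # True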

Disadvantages

  • A badly-written (and not compliant with current versions of the standard) UTF-8 parser could accept a number of different pseudo-UTF-8 representations and convert them to the same Unicode output. This provides a way for information to leak past validation routines designed to process data in its eight-bit representation.

Compared to single-byte encodings

Advantages

  • UTF-8 can encode any Unicode character, avoiding the need to figure out and set a "code page" or otherwise indicate what character set is in use, and allowing output in multiple languages at the same time.

Disadvantages

  • UTF-8 encoded text is larger than the appropriate single-byte encoding for everything except plain ASCII characters. For languages that commonly used 8-bit character sets with non-Latin alphabets encoded in the upper half (such as most Cyrillic and Greek code pages), UTF-8 text is almost double the size of the same text in a single-byte encoding. For scripts of South and Southeast Asia such as Devanagari and Thai, the size is up to three times that of a traditional single-byte encoding. This has caused objections in India and other countries.
  • An encoding with a single byte per character makes string cutting easy even with simple-minded APIs; with UTF-8, careless cutting can split a multi-byte sequence.

Compared to other multi-byte encodings

Advantages

  • UTF-8 can encode any Unicode character, avoiding the need to figure out and set a "code page" or otherwise indicate what character set is in use, and allowing output in multiple languages at the same time.
  • Character boundaries are easily found from anywhere in an octet stream (scanning either forwards or backwards). This implies that if a stream of bytes is scanned starting in the middle of a multi-byte sequence, only the information represented by the partial sequence is lost and decoding can begin correctly on the next character. Similarly, if a number of bytes are corrupted or dropped, then correct decoding can resume on the next character boundary. Many multi-byte encodings are much harder to resynchronise.
  • A byte sequence for one character never occurs as part of a longer sequence for another character as it did in older variable-length encodings like Shift JIS (see the previous section on this). For instance, US-ASCII octet values do not appear otherwise in a UTF-8 encoded character stream. This provides compatibility with file systems or other software (e.g., the printf() function in C libraries) that parse based on US-ASCII values but are transparent to other values.
  • The first byte of a multi-byte sequence is enough to determine the length of the multi-byte sequence. This makes it extremely simple to extract a sub-string from a given string without elaborate parsing. This was often not the case in many multi-byte encodings.
  • Efficient to encode using simple bit operations. UTF-8 does not require slower mathematical operations such as multiplication or division (unlike the obsolete UTF-1 encoding).

Disadvantages

  • UTF-8 often takes more space than an encoding made for one or a few languages. Latin letters with diacritics and characters from other alphabetic scripts typically take one byte per character in the appropriate multi-byte encoding but take two in UTF-8. East Asian scripts generally have two bytes per character in their multi-byte encodings yet take three bytes per character in UTF-8.

Compared to UTF-7

Advantages

  • UTF-8 uses significantly fewer bytes per character for all non-ASCII characters.
  • UTF-8 encodes "+" as itself whereas UTF-7 encodes it as "+-".

Disadvantages

  • UTF-8 requires the transmission system to be 8-bit clean. In the case of e-mail, this means it must sometimes be further encoded using quoted-printable or base64; this extra stage of encoding carries a significant size penalty. The importance of this disadvantage has declined as mail transfer agents' support for 8-bit-clean transport and the 8BITMIME SMTP extension, specified in RFC 1869, has grown.

Compared to UTF-16

Advantages

  • Converting to UTF-16 while maintaining compatibility with existing programs (such as was done with Windows) requires every API and data structure that takes a string to be duplicated. Handling of invalid encodings in each API makes this much more difficult than it may first appear.
  • Conversion to UTF-8 of a string of arbitrary 16-bit values assumed to be UTF-16 is lossless, but invalid UTF-8 cannot be converted losslessly to UTF-16. This makes UTF-8 a safe way to hold data that might be text, which can be surprisingly important. For instance, an API to control filesystems with both UTF-8 and UTF-16 filenames, where either may contain names with invalid encodings, can be written using UTF-8 but not UTF-16.
  • In UTF-8, characters outside the basic multilingual plane are not a special case. UTF-16 is often mistaken to be constant-length, leading to code that works for most text but suddenly fails for non-BMP characters. Retrofitting code tends to be hard, so it's better to implement support for the entire range of Unicode from the start.
  • Text consisting mostly of diacritic-free Latin letters is around half the size in UTF-8 that it would be in UTF-16 or UCS-2. Text in any language using only code points below U+0800 (which includes all modern European languages) is smaller in UTF-8 than in UTF-16 because of the presence of spaces, newlines, numbers, and ASCII punctuation, all of which are encoded in one byte per character; see the size comparison after this list.
  • Most communication and storage protocols were designed for a stream of bytes. A UTF-16 string must use a pair of bytes for each code word, which introduces a couple of potential problems:
    • The order of those two bytes becomes an issue. One can say that UTF-16 has two variants when used for text files. A variety of mechanisms can be used to deal with this issue (for example, the byte-order mark), but they still present an added complication for software and protocol design.
    • If a byte is missing from a character in UTF-16, the whole rest of the string (or the entire text file) will be either invalid UTF-16 or meaningless text. In UTF-8, if part of a multi-byte character is removed, only that character is affected and not the rest of the text; that is, UTF-8 was made to be self-synchronizing, whereas UTF-16 was not.
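
The size comparison mentioned in the list above is straightforward to measure; a minimal Python check with arbitrary sample strings (byte counts exclude any byte-order mark):

    samples = {
        'English': 'The quick brown fox jumps over the lazy dog',
        'Greek':   'Ένα παράδειγμα κειμένου με κενά και σημεία στίξης.',
    }
    for name, s in samples.items():
        print(name, len(s.encode('utf-8')), 'vs', len(s.encode('utf-16-le')))
    # Latin text is about half the size in UTF-8; Greek is still slightly
    # smaller in UTF-8 because spaces and punctuation take one byte each.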

Disadvantages

  • Characters U+0800 through U+FFFF use three bytes in UTF-8, but only two in UTF-16. As a result, text in (for example) Chinese, Japanese or Hindi will take more space in UTF-8 if there are more of these characters than there are ASCII characters (ASCII includes spaces, numbers, newlines, some punctuation, and XML markup, so it is not unusual for ASCII characters to dominate).
  • Although both UTF-8 and UTF-16 suffer from the need to handle invalid sequences, as described above under general disadvantages, a simplistic parser for UTF-16 is unlikely to convert invalid sequences to ASCII. Since the dangerous characters in most situations are generally ASCII, a simplistic UTF-16 parser is much less dangerous than a simplistic UTF-8 parser.
  • In UCS-2 (but not UTF-16) Unicode code points are all the same size, making measurements of a fixed number of them easy. However, most people who think this is important are being confused by old documentation written for ASCII, where "character" was used as a synonym for "byte". If strings are measured in bytes or in 16-bit code units instead of "characters", most algorithms can be easily and efficiently adapted to UTF-8 or real UTF-16.

References

  1. ^ "Moving to Unicode 5.1". Official Google Blog. May 5, 2008. http://googleblog.blogspot.com/2008/05/moving-to-unicode-51.html. Retrieved on 2008-05-08. 
  2. ^ Alvestrand, H. (1998), "IETF Policy on Character Sets and Languages", RFC 2277, Internet Engineering Task Force 
  3. ^ "Using International Characters in Internet Mail". Internet Mail Consortium. August 1, 1998. http://www.imc.org/mail-i18n.html. Retrieved on 2007-11-08. 
  4. ^ Pike, Rob (2003-04-03). "UTF-8 history". http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt. 
  5. ^ Internet Assigned Numbers Authority Character Sets
  6. ^ W3C: Setting the HTTP charset parameter notes that the IANA list is used for HTTP
  7. ^ There are 256 × 256 − 128 × 128 not-pure-ASCII two-byte sequences, and of those, only 1920 encode valid UTF-8 characters (the range U+0080 to U+07FF), so the proportion of valid not-pure-ASCII two-byte sequences is 3.9%. Similarly, there are 256 × 256 × 256 − 128 × 128 × 128 not-pure-ASCII three-byte sequences, and 61,406 valid three-byte UTF-8 sequences (U+0800 to U+FFFF minus surrogate pairs and non-characters), so the proportion is 0.41%. Finally, there are 256⁴ − 128⁴ not-pure-ASCII four-byte sequences, and 1,048,544 valid four-byte UTF-8 sequences (U+10000 to U+10FFFF minus non-characters), so the proportion is 0.026%. Note that this assumes that control characters pass as ASCII; without the control characters, the proportions drop somewhat.
  8. ^ Yergeau, F. (2003), "UTF-8, a transformation format of ISO 10646", RFC 3629, Internet Engineering Task Force 

External links

There are several current definitions of UTF-8 in various standards documents:

  • RFC 3629 / STD 63 (2003), which establishes UTF-8 as a standard Internet protocol element
  • The Unicode Standard, Version 5.0, §3.9 D92, §3.10 D95 (2007)
  • The Unicode Standard, Version 4.0, §3.9–§3.10 (2003)
  • ISO/IEC 10646:2003 Annex D (2003)

They supersede the definitions given in the following obsolete works:

  • ISO/IEC 10646-1:1993 Amendment 2 / Annex R (1996)
  • The Unicode Standard, Version 2.0, Appendix A (1996)
  • RFC 2044 (1996)
  • RFC 2279 (1998)
  • The Unicode Standard, Version 3.0, §2.3 (2000) plus Corrigendum #1 : UTF-8 Shortest Form (2000)
  • Unicode Standard Annex #27: Unicode 3.1 (2001)

They are all the same in their general mechanics, with the main differences being on issues such as allowed range of code point values and safe handling of invalid input.
