Code page

From Wikipedia, the free encyclopedia

Code page is the traditional IBM term for a specific mapping of a set of characters to numerical code point values [1]. Its meaning differs slightly from that of the related terms encoding and character set. IBM and Microsoft often allocate a code page number to a character set even if that character set is better known by another name.

Whilst the term code page originated from IBM's EBCDIC-based mainframe systems, the term is most commonly associated with the IBM PC code pages. Microsoft, a maker of PC operating systems, refers to these code pages as OEM code pages, and supplements them with its own "ANSI" code pages.

Most well-known code pages, excluding those for the CJK languages and Vietnamese, fit all their code points into 8 bits and involve nothing more than mapping each code point to a single bitmap; techniques such as combining characters and complex script handling are not involved.
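
As a rough illustration of the point above, a single-byte code page can be viewed as a 256-entry lookup table, whereas a CJK code page needs more than one byte for many characters. The sketch below assumes that Python's bundled codecs `cp437` and `cp932` are acceptable stand-ins for the corresponding code pages; the helper name `build_table` is hypothetical.

```python
def build_table(codec_name):
    """Decode every byte value 0..255 once, giving a 256-character table.

    This only works for single-byte code pages in which every byte is
    assigned; Python's cp437 codec is one such case.
    """
    return bytes(range(256)).decode(codec_name)


# One character per byte value: the whole code page fits in 8 bits.
table_437 = build_table("cp437")

# A CJK code page is not a simple 256-entry table: each of these two
# kanji occupies two bytes in code page 932.
kanji_bytes = "日本".encode("cp932")

print(len(table_437), repr(table_437[0x41]), repr(table_437[0x82]))
print(len(kanji_bytes))
```

Indexing `table_437` with a byte value is all that decoding amounts to for such a code page, which is why no further text-processing machinery is needed.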

The text mode of standard (VGA-compatible) PC graphics hardware is built around an 8-bit code page, though it is possible to use two at once at the cost of some color depth, and up to eight may be stored in the display adapter for easy switching. A selection of code pages could be loaded into such hardware. However, it is now commonplace for operating system vendors to provide their own character encoding and rendering systems that run in a graphics mode and bypass this system entirely. The character encodings used by these graphical systems (particularly MS-Windows) are sometimes called code pages as well.

Relationship to ASCII

The basis of the IBM PC code pages is ASCII, a 7-bit code representing 128 characters and control codes. In the past, 8-bit implementations of ASCII either set the top bit to zero or used it as a parity bit in network data transmissions. When this bit was instead made available for representing character data, another 128 characters and control codes could be represented. IBM used this extended range to encode characters used by various languages. No formal standard existed for these ‘extended character sets’; IBM merely referred to the variants as code pages, as it had always done for variants of EBCDIC encodings.
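
The two historical uses of the top bit can be sketched as follows. This is a minimal illustration, not taken from any particular system: the function names `with_even_parity` and `strip_top_bit` are hypothetical, even parity is chosen arbitrarily, and Python's `cp437` codec is assumed to represent code page 437.

```python
def with_even_parity(code):
    """Return the 8-bit value whose top bit makes the number of 1 bits even."""
    if not 0 <= code < 0x80:
        raise ValueError("expected a 7-bit ASCII code")
    parity = bin(code).count("1") % 2
    return code | (parity << 7)


def strip_top_bit(byte_value):
    """Recover the 7-bit ASCII code by discarding the top bit."""
    return byte_value & 0x7F


# 'C' is 0x43 (three 1 bits), so even parity sets the top bit.
transmitted = with_even_parity(ord("C"))
print(hex(transmitted))                   # 0xc3
print(chr(strip_top_bit(transmitted)))    # C

# The same byte value read as character data in code page 437 is a
# box-drawing character rather than 'C' plus a parity bit.
print(bytes([transmitted]).decode("cp437"))
```

Once the eighth bit carries character data, a receiver can no longer treat it as parity, which is why the two conventions could not coexist on the same byte stream.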

IBM PC (OEM) code pages

These code pages are most often used under MS-DOS-like operating systems. They include a lot of box-drawing characters. Since the original IBM PC code page (number 437) was not really designed for international use, several incompatible variants emerged. Microsoft refers to these as the OEM code pages. Examples include:

437 — The original IBM PC code page
737 — Greek
775 — Estonian, Lithuanian and Latvian
850 — "Multilingual (Latin-1)" (Western European languages)
852 — "Slavic (Latin-2)" (Central and Eastern European languages)
855 — Cyrillic
857 — Turkish
858 — "Multilingual" with euro symbol
860 — Portuguese
861 — Icelandic
862 — Hebrew
863 — French Canadian
865 — Nordic
866 — Cyrillic
869 — Greek
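
To make the incompatibility between these variants concrete, the following sketch decodes the same bytes under two of the code pages listed above. It assumes Python's `cp437` and `cp866` codecs correspond to code pages 437 and 866; the helper `render` is a hypothetical name.

```python
def render(data, codec_name):
    """Decode raw bytes using the named single-byte code page."""
    return data.decode(codec_name)


# Bytes for the top edge of a double-line box in code page 437.
box_top = b"\xc9\xcd\xcd\xbb"
print(render(box_top, "cp437"))

# The same byte value means different letters in different code pages.
same_byte = b"\x80"
print(render(same_byte, "cp437"))   # Latin capital C with cedilla
print(render(same_byte, "cp866"))   # Cyrillic capital A
```

A file written under one OEM code page therefore has to be accompanied by knowledge of which code page was used; the bytes alone do not say.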

In modern applications, operating systems and programming languages, the IBM code pages have been rendered obsolete by newer and better international standards, such as Unicode.

Other code pages of note

The following code page numbers are specific to Microsoft Windows. IBM uses different numbers for these code pages.

10000 — Macintosh Roman encoding (followed by several other Mac character sets)
10007 — Macintosh Cyrillic encoding
10029 — Macintosh Central European encoding
932 — Supports Japanese
936 — GBK Supports Simplified Chinese
949 — Supports Korean
950 — Supports Traditional Chinese
1200 — UCS-2LE Unicode little-endian
1201 — UCS-2BE Unicode big-endian
65000 — UTF-7 Unicode
65001 — UTF-8 Unicode
ASMO449+ — Supports Arabic
MIK — Supports Bulgarian and Russian as well
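
Software that receives a Windows code page number usually has to translate it into whatever name its own conversion library uses. The table below is a sketch of such a translation for Python; the pairing of numbers with Python codec names is an assumption made for illustration, not an official correspondence, and `WINDOWS_TO_PYTHON` and `codec_for` are hypothetical names. The non-numeric entries (ASMO449+, MIK) are omitted.

```python
import codecs

# Assumed mapping from Windows code page numbers to Python codec names.
WINDOWS_TO_PYTHON = {
    10000: "mac_roman",
    10007: "mac_cyrillic",
    10029: "mac_latin2",
    932: "cp932",
    936: "gbk",
    949: "cp949",
    950: "cp950",
    1200: "utf-16-le",   # identical to UCS-2LE for characters in the BMP
    1201: "utf-16-be",   # identical to UCS-2BE for characters in the BMP
    65000: "utf-7",
    65001: "utf-8",
}


def codec_for(code_page):
    """Return Python's CodecInfo for a Windows code page number."""
    return codecs.lookup(WINDOWS_TO_PYTHON[code_page])


print("A".encode(codec_for(1200).name))   # little-endian: b'A\x00'
print("A".encode(codec_for(1201).name))   # big-endian:    b'\x00A'
```

`codecs.lookup` raises `LookupError` for an unknown name, so a mistyped entry in the table is detected as soon as it is used.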

Windows (ANSI) code pages

Microsoft defined a number of code pages known as the ANSI code pages (the first one, 1252, was based on an apocryphal ANSI draft of what became ISO 8859-1). Code page 1252 is built on ISO 8859-1 but uses the range 0x80-0x9F for extra printable characters rather than the C1 control codes used in ISO 8859-1. Some of the others are based in part on other parts of ISO 8859 but are often rearranged to bring them closer to 1252.

1250 — Central and East European Latin
1251 — Cyrillic
1252 — West European Latin
1253 — Greek
1254 — Turkish
1255 — Hebrew
1256 — Arabic
1257 — Baltic
1258 — Vietnamese
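
The relationship between code page 1252 and ISO 8859-1 can be checked mechanically by comparing how every byte value decodes under each. The sketch below assumes Python's `cp1252` and `latin-1` codecs; `differing_bytes` is a hypothetical helper. Python's `cp1252` leaves a few byte values unassigned, so `errors="replace"` is used to keep the comparison from raising.

```python
def differing_bytes(codec_a, codec_b):
    """List the byte values that decode differently under two code pages."""
    diffs = []
    for value in range(256):
        raw = bytes([value])
        # Unassigned bytes become U+FFFD instead of raising an error.
        a = raw.decode(codec_a, errors="replace")
        b = raw.decode(codec_b, errors="replace")
        if a != b:
            diffs.append(value)
    return diffs


diffs = differing_bytes("cp1252", "latin-1")
print(hex(diffs[0]), hex(diffs[-1]), len(diffs))

# One of the extra printable characters in code page 1252.
print(b"\x80".decode("cp1252"))
```

The same helper can be pointed at other pairs, such as a Windows code page and the ISO 8859 part it is loosely based on, to see how far the rearrangement goes.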

Criticism

Many code pages, with the exception of Unicode, suffer from several problems.

Some code page vendors insufficiently document the meaning of all code point values. This reduces the reliability of handling textual data consistently across different computer systems.
Some vendors add proprietary extensions to some code pages to add or change certain code point values. For example, byte \x5C in Shift-JIS can represent either a backslash or a yen currency symbol depending on the platform.
Text in multiple languages often cannot be handled in the same program, because each code page covers only a limited set of characters.

Due to Unicode's extensive documentation, vast repertoire of characters and character stability policy, these problems are rarely a concern for Unicode.
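
The multilingual limitation is easy to reproduce. The sketch below tries to encode Greek, Russian and mixed text under two Windows code pages and under UTF-8, assuming Python's `cp1251`, `cp1253` and `utf-8` codecs; `codecs_that_can_encode` is a hypothetical helper.

```python
def codecs_that_can_encode(text, candidates):
    """Return the candidate codecs able to represent every character of text."""
    usable = []
    for name in candidates:
        try:
            text.encode(name)
        except UnicodeEncodeError:
            continue  # at least one character is missing from this code page
        usable.append(name)
    return usable


CANDIDATES = ["cp1251", "cp1253", "utf-8"]
greek = "Ελληνικά"
russian = "Русский"

print(codecs_that_can_encode(greek, CANDIDATES))
print(codecs_that_can_encode(russian, CANDIDATES))
print(codecs_that_can_encode(greek + " " + russian, CANDIDATES))
```

Each 8-bit code page handles its own language, but only the Unicode encoding can hold both in one string.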

Applications may also mislabel text in Windows-1252 as ISO-8859-1, the default character set for HTML. Fortunately, the only difference between these code pages is that the code point values used by ISO-8859-1 for control characters are instead used as additional printable characters in Windows-1252. Since control characters have no function in HTML, web browsers tend to use Windows-1252 rather than ISO-8859-1.
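
A decoder can apply the same leniency by treating the ISO-8859-1 label as if it meant Windows-1252. The following is a minimal sketch of that idea, not a description of how any particular browser is implemented; `LABEL_OVERRIDES` and `decode_labeled` are hypothetical names, and Python's `cp1252` codec is assumed to stand for Windows-1252.

```python
# Labels that are decoded as Windows-1252 regardless of what they claim.
LABEL_OVERRIDES = {
    "iso-8859-1": "cp1252",
    "latin-1": "cp1252",
}


def decode_labeled(data, label):
    """Decode data using its declared label, widening ISO-8859-1 to cp1252."""
    codec_name = LABEL_OVERRIDES.get(label.strip().lower(), label)
    return data.decode(codec_name)


# Curly quotes at 0x93 and 0x94 exist only in Windows-1252.
mislabeled = b"\x93Hi\x94"
print(decode_labeled(mislabeled, "ISO-8859-1"))

# Bytes outside 0x80-0x9F decode the same way under either interpretation.
print(decode_labeled(b"caf\xe9", "ISO-8859-1"))
```

Decoding the same bytes strictly as ISO-8859-1 would yield two invisible C1 control characters in place of the quotation marks.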

Private code pages

When, early in the history of personal computers, users didn't find their character encoding requirements met, private or local code pages were created using Terminate and Stay Resident utilities or by re-programming BIOS EPROMs. In some cases, unofficial code page numbers were invented (e.g., cp895).

When more diverse character set support became available, most of those code pages fell into disuse, with some exceptions such as the Kamenický or KEYBCS2 encoding for the Czech and Slovak alphabets. Another such character set is the Iran System encoding standard, which was created by the Iran System corporation for Persian language support. This standard was used in Iran in DOS-based programs and became obsolete after the introduction of Microsoft code page 1256. However, some Windows and DOS programs using this encoding are still in use, and some Windows fonts with this encoding exist.
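
In software terms, a private code page of this kind amounts to taking an existing table and reassigning some of its upper-half byte values. The sketch below builds such a table on top of code page 437 using Python's `codecs.charmap_*` helpers. The two reassigned bytes are purely illustrative and are not claimed to match Kamenický, Iran System or any other real encoding; all names defined here are hypothetical.

```python
import codecs

# Start from code page 437 (assuming Python's cp437 codec).
base_table = bytes(range(256)).decode("cp437")

# Illustrative overrides only: not the actual table of any real code page.
OVERRIDES = {
    0x80: "\u010c",  # LATIN CAPITAL LETTER C WITH CARON
    0x83: "\u010f",  # LATIN SMALL LETTER D WITH CARON
}

decoding_table = "".join(OVERRIDES.get(i, ch) for i, ch in enumerate(base_table))
encoding_table = codecs.charmap_build(decoding_table)


def decode_private(data):
    """Decode bytes with the locally modified table."""
    return codecs.charmap_decode(data, "strict", decoding_table)[0]


def encode_private(text):
    """Encode text with the locally modified table."""
    return codecs.charmap_encode(text, "strict", encoding_table)[0]


print(decode_private(b"\x80aj"))
print(encode_private("\u010caj"))
```

Text written with such a table is only readable on machines that carry the same modification, which is the interoperability problem that led most private code pages to be abandoned.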

See also

Windows code page
Character encoding
CCSID: a more precise definition of how "code pages" are used at IBM.

References

[1] "IBM CDRA glossary". http://www.ibm.com/software/globalization/cdra/glossary.jsp#SPTGLCDPG.

External links

IBM code pages
IBM/ICU Charset Information
Microsoft code page identifiers (Microsoft's list contains only code pages actively used by normal apps on Windows. See also Torsten Mohrin's list for the full list of supported code pages)
Shorter Microsoft list containing only the ANSI and OEM code pages but with links to more detail on each
Character Sets And Code Pages At The Push Of A Button
Microsoft Chcp command: Display and set the console active code page
