- provides a unique number for every character - in all languages eventually -
- see also the UCS page
www.unicode.org Unicode FAQ Unicode Charts UTF-8 FAQ (Unicode for Unix/Linux)
Unicode is scrupulously kept compatible with ISO/IEC 10646 and its extensions. The consortium is also an important contributor to the ISO work to further develop ISO/IEC 10646.
Fundamentally, computers just deal with numbers. They store letters and other characters by assigning a number for each one. Before Unicode was invented, there were hundreds of different encoding systems for assigning these numbers. No single encoding could contain enough characters: for example, the European Union alone requires several different encodings to cover all its languages. Even for a single language like English no single encoding was adequate for all the letters, punctuation, and technical symbols in common use.
These encoding systems also conflict with one another. That is, two encodings can use the same number for two different characters, or use different numbers for the same character. Any given computer (especially servers) needs to support many different encodings; yet whenever data is passed between different encodings or platforms, that data always runs the risk of corruption.
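The conflict described above can be reproduced with a few legacy codecs. This is a minimal sketch in Python; the choice of codecs and of the byte 0xE4 is only an illustration and does not come from the original text.

```python
# One byte value read under three legacy encodings gives three characters.
raw = bytes([0xE4])
as_latin1 = raw.decode("latin-1")    # Western European: a with diaeresis
as_cp1251 = raw.decode("cp1251")     # Windows Cyrillic: Cyrillic letter de
as_greek = raw.decode("iso8859_7")   # Greek: small letter delta

# One character (the euro sign, U+20AC) gets two different numbers.
euro = "\u20ac"
euro_cp1252 = euro.encode("cp1252")      # Windows Western European
euro_latin9 = euro.encode("iso8859_15")  # ISO 8859-15 (Latin-9)

print(as_latin1, as_cp1251, as_greek)
print(euro_cp1252.hex(), euro_latin9.hex())
```

A program that guesses the wrong encoding for such data shows the wrong character, which is the kind of corruption the paragraph above refers to.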
Incorporating Unicode into client-server or multi-tiered applications and websites offers significant cost savings over the use of legacy character sets. Unicode enables a single software product or a single website to be targeted across multiple platforms, languages and countries without re-engineering. It allows data to be transported through many different systems without corruption.
Unicode Charts and Ranges
The easiest comparison to help with understanding Unicode is ASCII, which also assigns a number to each character. ASCII is a 7-bit code with 128 values (Hex 00-7F); Latin-1 (ISO 8859-1) extends it to 8 bits, and both store each character in a single byte (0x00 format). Unicode has two fixed-width flavors: UCS-2 (2 bytes per character) and UCS-4 (4 bytes per character). The Unicode "Basic Latin" numbers have the same numeric values as ASCII, except that UCS-2 stores each one in two bytes (in the format 0x0000) instead of one.
Therefore, an ASCII or Latin-1 file can be transformed into a UCS-2 file by simply inserting a 0x00 byte in front of every byte. This gives big-endian byte order; in little-endian order the 0x00 byte follows the original byte instead. For a UCS-4 file, which has 4 bytes per character, we insert three 0x00 bytes before every byte.
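The byte insertion described above can be sketched in Python as follows. The helper names `ascii_to_ucs2_be` and `ascii_to_ucs4_be` are hypothetical, and the sketch assumes big-endian output without a byte order mark.

```python
def ascii_to_ucs2_be(data: bytes) -> bytes:
    """Put one 0x00 byte in front of every ASCII byte (big-endian UCS-2)."""
    out = bytearray()
    for b in data:
        if b > 0x7F:
            raise ValueError("not an ASCII byte: 0x%02X" % b)
        out += bytes([0x00, b])
    return bytes(out)


def ascii_to_ucs4_be(data: bytes) -> bytes:
    """Put three 0x00 bytes in front of every ASCII byte (big-endian UCS-4)."""
    out = bytearray()
    for b in data:
        if b > 0x7F:
            raise ValueError("not an ASCII byte: 0x%02X" % b)
        out += bytes([0x00, 0x00, 0x00, b])
    return bytes(out)


text = b"Hi!"
ucs2 = ascii_to_ucs2_be(text)
ucs4 = ascii_to_ucs4_be(text)
print(ucs2.hex())  # 004800690021
print(ucs4.hex())  # 000000480000006900000021
```

For ASCII input the results match what Python's built-in `utf-16-be` and `utf-32-be` codecs produce, so in practice those codecs would be the usual choice.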
Each Unicode chart covers a specific range of numbers, usually written as four hexadecimal digits. For example, the ASCII character set was designed for English, which uses the Latin alphabet. In ASCII, the letter "b" (lower case) is Decimal 98, which is Hex 62. If you look in the Unicode chart for "Basic Latin" you will see that the Unicode designation of the letter "b" is "Hex 0062" (0x0062). So Unicode has a corresponding number for every ASCII character. It also has corresponding numbers for the characters of all major languages and character sets.
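The "b" example can be checked directly in Python, whose strings are sequences of Unicode numbers. This is a small sketch; the variable names are illustrative only.

```python
import unicodedata

ch = "b"
code = ord(ch)                       # the Unicode number of the character
ascii_byte = ch.encode("ascii")[0]   # the ASCII value of the same letter

print(code)               # 98
print("0x%04X" % code)    # 0x0062
print(ascii_byte == code)  # True: ASCII and Unicode agree on this letter

# The character name as recorded in Python's Unicode database.
name = unicodedata.name(ch)
print(name)               # LATIN SMALL LETTER B
```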
The Ranges - examples:
- Basic Latin (ASCII) Unicode numbers go from Hex 0000-007F (they all begin with 00).
- Arabic Unicode numbers go from Hex 0600-06FF (they all begin with 06).
- Thai Unicode numbers go from Hex 0E00-0E7F (they all begin with 0E).
This seems to indicate there is a maximum of 65,536 codes, since four hexadecimal digits run from 0000 to FFFF. But other charts go well above that. For example, the Mathematical Alphanumeric Symbols chart range is Hex 1D400-1D7FF. At the extreme high end, the Supplementary Private Use Area-B chart ranges from Hex 100000-10FFFD (the standard assigns no characters in this range; it is reserved for private use).
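The ranges mentioned above can be put into a small lookup table to see which chart a character falls in. This is only a sketch: `BLOCKS` holds just the five ranges named in this section, not the full Unicode chart list, and `block_of` is a hypothetical helper.

```python
import sys
import unicodedata

# (first, last, chart name) for the ranges named in this section only.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0600, 0x06FF, "Arabic"),
    (0x0E00, 0x0E7F, "Thai"),
    (0x1D400, 0x1D7FF, "Mathematical Alphanumeric Symbols"),
    (0x100000, 0x10FFFD, "Supplementary Private Use Area-B"),
]


def block_of(ch: str) -> str:
    """Return the chart name for a single character, if it is in BLOCKS."""
    cp = ord(ch)
    for first, last, name in BLOCKS:
        if first <= cp <= last:
            return name
    return "(not in this sample table)"


print(block_of("b"))            # Basic Latin
print(block_of("\u0e01"))       # Thai
print(block_of("\U0001D400"))   # Mathematical Alphanumeric Symbols

# Four hex digits stop at FFFF; Unicode numbers continue up to 10FFFF.
print(hex(sys.maxunicode))      # 0x10ffff

# "Co" is the general category for private-use numbers.
print(unicodedata.category(chr(0x100000)))  # Co
```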