Unicode

- provides a unique number for every character - in all languages eventually -

- see also the UCS page

www.unicode.org    Unicode FAQ    Unicode Charts    UTF-8 FAQ (Unicode for Unix/Linux)

Unicode is scrupulously kept compatible with ISO/IEC 10646 and its extensions. The consortium is also an important contributor to the ISO work to further develop ISO/IEC 10646.

Today's Unicode, v1.1, uses UCS-2 (2 bytes) Level 3 as its base.  It is identical to UCS-2 except that it adds a more precise specification of the bidirectional behavior of characters used in the Arabic and Hebrew scripts.

Tomorrow's Unicode, v2.0, will be based on UCS-4 (4 bytes).

Fundamentally, computers just deal with numbers. They store letters and other characters by assigning a number for each one. Before Unicode was invented, there were hundreds of different encoding systems for assigning these numbers. No single encoding could contain enough characters: for example, the European Union alone requires several different encodings to cover all its languages. Even for a single language like English no single encoding was adequate for all the letters, punctuation, and technical symbols in common use.

These encoding systems also conflict with one another. That is, two encodings can use the same number for two different characters, or use different numbers for the same character. Any given computer (especially servers) needs to support many different encodings; yet whenever data is passed between different encodings or platforms, that data always runs the risk of corruption.
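The conflict is easy to reproduce. The sketch below assumes Python's bundled `latin-1`, `cp1251` and `cp437` codecs as stand-ins for legacy encodings (they are not named in the text above). It decodes one byte value two ways, then encodes one character two ways:

```python
# One byte value, two legacy encodings, two different characters.
raw = bytes([0xE9])
as_latin1 = raw.decode("latin-1")   # Western European
as_cp1251 = raw.decode("cp1251")    # Windows Cyrillic

# One character, two legacy encodings, two different byte values.
e_acute = "\u00e9"                  # LATIN SMALL LETTER E WITH ACUTE
in_latin1 = e_acute.encode("latin-1")
in_cp437 = e_acute.encode("cp437")  # original IBM PC code page

print(f"byte E9 -> U+{ord(as_latin1):04X} or U+{ord(as_cp1251):04X}")
print(f"U+00E9  -> byte {in_latin1.hex()} or byte {in_cp437.hex()}")
```

A program that receives the byte 0xE9 with no indication of its encoding cannot tell which of the two characters was meant, and that is the corruption risk described above.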

Unicode is changing all that!

Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language. The Unicode Standard has been adopted by such industry leaders as Apple, HP, IBM, JustSystem, Microsoft, Oracle, SAP, Sun, Sybase, Unisys and many others. Unicode is required by modern standards such as XML, Java, ECMAScript (JavaScript), LDAP, CORBA 3.0, WML, etc., and is the official way to implement ISO/IEC 10646. It is supported in many operating systems, all modern browsers, and many other products. The emergence of the Unicode Standard, and the availability of tools supporting it, are among the most significant recent global software technology trends.

Incorporating Unicode into client-server or multi-tiered applications and websites offers significant cost savings over the use of legacy character sets. Unicode enables a single software product or a single website to be targeted across multiple platforms, languages and countries without re-engineering. It allows data to be transported through many different systems without corruption.

Unicode Charts and Ranges

The easiest comparison to help with understanding Unicode is ASCII, which also assigns a number to each character.  ASCII is a 7-bit code (0x00-0x7F), normally stored one character per byte; Latin-1 (ISO 8859-1) extends it to use all 8 bits of the byte.  Unicode has two main flavors - UCS-2 (2 bytes) and UCS-4 (4 bytes).  UCS-2 "Basic Latin" numbers have the same numeric value as ASCII - except that UCS-2 uses two bytes per character (in the format 0x0000) instead of 1 byte.

Therefore, an ASCII or Latin-1 file can be transformed into a UCS-2 file simply by inserting a 0x00 byte in front of every byte (this assumes big-endian byte order, with the high byte first). For a UCS-4 file, which has 4 bytes per character, we have to insert three 0x00 bytes before every byte.
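As a minimal sketch of that padding step, assuming big-endian byte order and no byte-order mark: the helper name `pad_big_endian` is made up for this illustration. Python's `utf-16-be` and `utf-32-be` codecs are used only as a cross-check, since they produce the same bytes as UCS-2 and UCS-4 for Latin-1 text.

```python
def pad_big_endian(data: bytes, width: int) -> bytes:
    """Widen each input byte to `width` bytes by prefixing 0x00 bytes."""
    padding = bytes(width - 1)  # width - 1 zero bytes
    return b"".join(padding + bytes([b]) for b in data)

latin1_text = b"caf\xe9"                  # "café" encoded as Latin-1
ucs2 = pad_big_endian(latin1_text, 2)     # 2 bytes per character
ucs4 = pad_big_endian(latin1_text, 4)     # 4 bytes per character

# Cross-check against Python's own encoders.
text = latin1_text.decode("latin-1")
print(ucs2 == text.encode("utf-16-be"))
print(ucs4 == text.encode("utf-32-be"))
print(ucs2.hex(" "))
```

The shortcut works only because Latin-1 byte values coincide with the first 256 Unicode numbers. A file in any other legacy encoding needs a real conversion table.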

Unicode Charts

Each Unicode chart covers a specific range of code numbers, usually written as four hexadecimal digits.  For example, the ASCII character set was designed for English, which uses the Latin alphabet.  In ASCII, the letter "b" (lower case) is Decimal 98, which is Hex 62.  If you look in the Unicode chart for "Basic Latin" you will see that the Unicode designation for the letter "b" is "Hex 0062" (0x0062).  So Unicode has a corresponding number for every ASCII character.  It also has corresponding numbers for all major languages and character sets.
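The "b" example can be checked directly. This short sketch uses Python's built-in `ord` and `chr`, which convert between a character and its Unicode number:

```python
letter = "b"
code = ord(letter)  # Unicode number of the character

print(code)               # decimal form
print(f"{code:02X}")      # two hex digits, as in an ASCII table
print(f"U+{code:04X}")    # four hex digits, as in the Unicode charts

# Every ASCII byte value decodes to the character with the same Unicode number.
ascii_matches = all(ord(bytes([n]).decode("ascii")) == n for n in range(128))
print(ascii_matches)
```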

The Ranges - examples:

  - Basic Latin (ASCII) Unicode numbers go from Hex 0000-007F (they all begin with 00).
  - Arabic Unicode numbers go from 0600-06FF (they all begin with 06)
  - Thai codes range from 0E00-0E7F
  - etc.
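A range lookup over the three example charts can be sketched as follows. The table holds only the ranges listed above, and the function name `chart_for` is invented for this illustration; a complete lookup would need the full list of charts.

```python
from typing import Optional

# (first, last, chart name) for the example ranges only.
CHARTS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0600, 0x06FF, "Arabic"),
    (0x0E00, 0x0E7F, "Thai"),
]

def chart_for(ch: str) -> Optional[str]:
    """Return the chart name for a single character, or None if not listed."""
    code = ord(ch)
    for first, last, name in CHARTS:
        if first <= code <= last:
            return name
    return None

print(chart_for("b"))        # Latin small letter b, U+0062
print(chart_for("\u0628"))   # Arabic letter beh, U+0628
print(chart_for("\u0e01"))   # Thai character ko kai, U+0E01
print(chart_for("\u00e9"))   # U+00E9 is outside the three example ranges
```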

The four-digit ranges above might suggest a maximum of 9999 codes, but the digits are hexadecimal, so four of them allow 65,536 codes (0000-FFFF).  Other charts go above even that.  For example, the Mathematical Alphanumeric Symbols chart range is Hex 1D400-1D7FF.  At the extreme high end, the Supplementary Private Use Area-B chart ranges from Hex 100000-10FFFD (no standard characters are assigned in this range; it is reserved for private use).
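These numbers can be inspected with Python's standard `unicodedata` module. This is a sketch only; the variable names are chosen for this example.

```python
import sys
import unicodedata

four_digit_codes = 16 ** 4           # codes expressible in four hex digits
first_math = chr(0x1D400)            # first code in the 1D400-1D7FF chart
first_private = chr(0x100000)        # first code in Private Use Area-B

math_name = unicodedata.name(first_math)
private_category = unicodedata.category(first_private)  # "Co" = private use

# A code above FFFF does not fit in one 16-bit unit:
utf16_length = len(first_math.encode("utf-16-be"))

print(four_digit_codes)
print(math_name)
print(private_category)
print(utf16_length)
print(f"{sys.maxunicode:X}")         # highest code Python accepts
```

The private-use character has no name in the database, so the example reads its general category instead of calling `unicodedata.name`, which would raise `ValueError` for it.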