Just a couple of notes: UTF-16 is not a 2-byte encoding per se. Currently defined Unicode characters can be accommodated in 2 bytes using UTF-16, but some of us are looking forward to the publishing of Plane 1 characters, etc. Any implementation of UTF-16 must include the capacity to correctly handle valid surrogate pairs. You can't restrict UTF-16 characters to 2 bytes.

ciao
Andj
andjc@ozemail.com.au

----- Original Message -----
From: Nuesser, Wilhelm
To: ; ; 'linux-utf8@humbolt.nl.linux.org' ;
Cc: Nuesser, Wilhelm ; Rohland, Hans-Christoph
Sent: Friday, 4 August 2000 22:25
Subject: Proposal for 2 Byte Unicode implementation in gcc and glibc

Hi everybody,

in the following we present a proposal for 2-byte Unicode support in gcc and libraries.

Motivation for our proposal: when comparing applications using 4-byte Unicode on Linux with similar applications using 2-byte Unicode on other platforms (e.g. Win...), Linux will always show the worse performance. Even otherwise superior OS performance cannot compensate for the additional requirements in memory bandwidth, CPU, disk space etc.

One simple example: for a typical database of about 100 GB used in medium-sized companies, we find a ratio of about 70 percent strings to 30 percent data. Relative to a single-byte encoding, the transition to 2-byte Unicode would increase the disk space to (2*70 + 30) % = 170 %. If we change to 4-byte Unicode, the same database would increase to (4*70 + 30) % = 310 %.

If we want Linux to become a major and globally usable platform, we strongly believe that we cannot sustain this inflation, i.e. we have to provide an additional - 2-byte - implementation of Unicode. At least the programmer must be free to choose which way they want their programs to work. Otherwise Linux will be on the wrong track.

Please see the following text for some detailed information and the attachment for our full proposal:

*****************************************************************************

The next version of our business applications will be offered with a Unicode option. The software is programmed with 2-byte Unicode characters. Because Unicode characters are today 4 bytes in the Linux world, there has been some effort to add 2-byte Unicode strings to gcc 2.95.2 and to implement the appropriate string handling routines. We would appreciate it if gcc and glibc were enhanced with our feature proposal. Of course, we are willing to contribute towards the implementation of 2-byte Unicode. We already have a patch for gcc to provide some 2-byte Unicode support.

Reasons for Using UTF-16 and UTF-32

What is UTF-16?

UTF-16 allows access to 63K characters as single Unicode 16-bit units. It can access an additional 1M characters by a mechanism known as surrogate pairs. Two ranges of Unicode code values are reserved for the high (first) and low (second) values of these pairs. Highs are from 0xD800 to 0xDBFF, and lows from 0xDC00 to 0xDFFF. In Unicode 3.0, there are no assigned surrogate pairs. Since the most common characters have already been encoded in the first 64K values, the characters requiring surrogate pairs will be relatively rare. (Taken from the Unicode FAQ, Copyright ©1998-2000 Unicode, Inc.)

What is UTF-32?

All characters represented in UTF-16, both those represented with 16 bits and those with a surrogate pair, can be represented as a single 32-bit unit in UTF-32. This single unit corresponds to the Unicode scalar value, which is the abstract number associated with a Unicode character. UTF-32 is a subset of the encoding mechanism called UCS-4 in ISO 10646. (Taken from the Unicode FAQ, Copyright ©1998-2000 Unicode, Inc.)
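To make the surrogate mechanism concrete, here is a minimal sketch in C of decoding one UTF-16 code unit (or surrogate pair) into its UTF-32 scalar value. The name utf16_decode and its signature are hypothetical, chosen only for this illustration:

    #include <stdint.h>

    /* Decode one UTF-16 sequence starting at s into the corresponding
     * Unicode scalar value (UTF-32).  Returns the number of 16-bit units
     * consumed: 1 for a BMP character, 2 for a surrogate pair, 0 on an
     * unpaired surrogate.  Assumes s holds at least two units whenever
     * the first unit is a high surrogate. */
    int utf16_decode(const uint16_t *s, uint32_t *out)
    {
        uint16_t hi = s[0];
        if (hi < 0xD800 || hi > 0xDFFF) {       /* ordinary BMP character */
            *out = hi;
            return 1;
        }
        if (hi <= 0xDBFF) {                     /* high surrogate 0xD800..0xDBFF */
            uint16_t lo = s[1];
            if (lo >= 0xDC00 && lo <= 0xDFFF) { /* low surrogate 0xDC00..0xDFFF */
                *out = 0x10000u
                     + (((uint32_t)(hi - 0xD800) << 10) | (uint32_t)(lo - 0xDC00));
                return 2;
            }
        }
        return 0;                               /* lone surrogate: invalid */
    }

Note that the pair arithmetic only ever runs for the rare characters outside the first 64K values; for everything else a single range check suffices.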
These are reasons to use UTF-16:

1. Performance
The UTF-16 representation of textual data needs only half the amount of memory that a 32-bit representation would need, provided that surrogate pairs occur only seldom, which will be the case. Memory itself may be cheap, but the size of the data that has to be handled for each user in a multiuser environment is a critical performance issue.

2. Portability
Software that uses wchar_t has restricted portability, since wchar_t sometimes has 32 bits but sometimes only 16 bits. A dedicated type for Unicode with platform-independent width allows one to write portable software (a sketch of such a type follows at the end of this message).

3. Interplatform Communication
When UTF-16 is used for communication between different platforms, at most an endian conversion may be necessary, which can be done in place (a sketch of this conversion likewise follows at the end). A conversion between UTF-16 and UTF-32 is more costly, and UTF-32 would imply unacceptably high data volumes when used for communication.

4. Embedding in existing IT infrastructures
UTF-16 Unicode implementations integrate better into existing IT landscapes, where we find products which use 16-bit Unicode, too. One example is the Java Native Interface (JNI): through it, C functions can access the UTF-16 representation of Java strings. UTF-16 support in C is therefore desirable. (Furthermore, in JDK 1.2, classes such as java.io.OutputStreamWriter and java.io.InputStreamReader support the conversion of surrogate pairs to single UTF-8 characters and back.)

5. Other commercial software uses 16-bit Unicode
The Oracle Call Interface (OCI) supports 16-bit Unicode (see Oracle8i National Language Support Guide, Release 8.1.5 and higher); in the database itself, UTF-8 is used. PeopleSoft PeopleTools 8 (C/C++ core) uses 16-bit Unicode, independently of the platform. Also, some office products support 16-bit Unicode.

6. Operations and representation of character strings
Although UTF-32 makes some operations on characters easier (e.g. indexing into strings), it leads to great overhead in other areas (searching, collating, displaying etc., where the whole string is involved).

The final point is that even UTF-16 without surrogates is already capable of supporting the most important scripts, like Arabic, Bengali etc.

For a full - and more technical - proposal please have a look at the attachment.

Hoping for a fruitful discussion ;-)

Kind regards
Willi Nuesser
SAP LinuxLab

PS: When UTF-8 is used, the complexity of variable-width characters shows up with almost every commonly used language except pure 7-bit ASCII. For a number of languages, the UTF-8 representation saves some storage when compared with UTF-16, but for Asian characters UTF-8 requires 50% more storage than UTF-16 (three bytes per character instead of two). We do not consider UTF-8 advantageous for text representation in memory. It may be well suited for files where access is sequential, but in general it is no universal solution.

Appendix: detailed proposal for 2 byte Unicode implementation <>
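As an illustration of the dedicated, platform-independent 16-bit type mentioned in point 2 above, a minimal sketch assuming C99's <stdint.h>; the names u16char_t and u16strlen are hypothetical and only stand in for whatever the final proposal would define:

    #include <stddef.h>
    #include <stdint.h>

    /* A dedicated 16-bit character type whose width does not vary
     * across platforms, unlike wchar_t. */
    typedef uint16_t u16char_t;

    /* Length of a zero-terminated UTF-16 string, counted in 16-bit
     * units; a surrogate pair counts as two units. */
    size_t u16strlen(const u16char_t *s)
    {
        const u16char_t *p = s;
        while (*p)
            p++;
        return (size_t)(p - s);
    }

Counting in code units rather than characters keeps the routine a plain linear scan, independent of surrogate handling; callers that need character counts can layer that on top.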
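And for the in-place endian conversion from point 3, an equally small sketch (again with a hypothetical name):

    #include <stddef.h>
    #include <stdint.h>

    /* Convert a UTF-16 buffer of n units between big- and little-endian
     * in place: each 16-bit unit has its two bytes swapped, and nothing
     * else changes, so no extra memory is needed. */
    void utf16_swap_endian(uint16_t *s, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            s[i] = (uint16_t)((s[i] << 8) | (s[i] >> 8));
    }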