Just a couple of notes: UTF-16 is not a 2-byte encoding per se. Currently defined Unicode characters can be accommodated in 2 bytes using UTF-16, but some of us are looking forward to the publishing of Plane 1 characters, etc. Any implementation of UTF-16 must include the capacity to correctly handle valid surrogate pairs. You can't restrict UTF-16 characters to 2 bytes.

ciao
Andj
andjc@ozemail.com.au

----- Original Message -----
From: Nuesser, Wilhelm
To: ; ; 'linux-utf8@humbolt.nl.linux.org' ;
Cc: Nuesser, Wilhelm ; Rohland, Hans-Christoph
Sent: Friday, 4 August 2000 22:25
Subject: Proposal for 2 Byte Unicode implementation in gcc and glibc

Hi everybody,

in the following we present a proposal for 2-byte Unicode support in gcc and libraries.

Motivation for our proposal: when comparing applications using 4-byte Unicode on Linux with similar applications using 2-byte Unicode on other platforms (e.g. Win...), Linux will always show the worse performance. Even otherwise superior OS performance cannot compensate for the additional requirements in memory bandwidth, CPU, disk space etc.

One simple example: for a typical database of about 100 GB used in medium-sized companies, we find a ratio of about 70 percent strings to 30 percent data. Relative to a single-byte encoding, the transition to 2-byte Unicode would increase the disk space to (2*70 + 30) % = 170 %. If we change to 4-byte Unicode, the same database would increase to (4*70 + 30) % = 310 %.

If we want Linux to become a major and globally usable platform, we strongly believe that we cannot sustain this inflation, i.e. we have to provide an additional - 2-byte - implementation of Unicode. At least the programmer must be free to choose which way they want their programs to work. Otherwise Linux will be on the wrong track.

Please see the following text for some detailed information and the attachment for our full proposal:

*****************************************************************************

The next version of our business applications will be offered with a Unicode option. The software is programmed with 2-byte Unicode characters. Because Unicode characters are today 4 bytes in the Linux world, there has been some effort to add 2-byte Unicode strings to gcc 2.95.2 and to implement the appropriate string handling routines. We would appreciate it if gcc and glibc were enhanced with our feature proposal. Of course, we are willing to contribute towards the implementation of 2-byte Unicode. We already have a patch for gcc to provide some 2-byte Unicode support.

Reasons for Using UTF-16 and UTF-32

What is UTF-16?

UTF-16 allows access to 63K characters as single Unicode 16-bit units. It can access an additional 1M characters by a mechanism known as surrogate pairs. Two ranges of Unicode code values are reserved for the high (first) and low (second) values of these pairs. Highs are from 0xD800 to 0xDBFF, and lows from 0xDC00 to 0xDFFF. In Unicode 3.0, there are no assigned surrogate pairs. Since the most common characters have already been encoded in the first 64K values, the characters requiring surrogate pairs will be relatively rare. (Taken from the Unicode FAQ, Copyright ©1998-2000 Unicode, Inc.)

What is UTF-32?

All characters represented in UTF-16, both those represented with 16 bits and those with a surrogate pair, can be represented as a single 32-bit unit in UTF-32. This single unit corresponds to the Unicode scalar value, which is the abstract number associated with a Unicode character. UTF-32 is a subset of the encoding mechanism called UCS-4 in ISO 10646. (Taken from the Unicode FAQ, Copyright ©1998-2000 Unicode, Inc.)
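To make the surrogate mechanism concrete, here is a minimal sketch in C of decoding one UTF-16 code unit (or surrogate pair) into its UTF-32 scalar value. The name utf16_decode and its signature are hypothetical, chosen only for this illustration:

    #include <stdint.h>

    /* Decode one UTF-16 sequence starting at s into the corresponding
     * Unicode scalar value (UTF-32).  Returns the number of 16-bit units
     * consumed: 1 for a BMP character, 2 for a surrogate pair, 0 on an
     * unpaired surrogate.  Assumes s holds at least two units whenever
     * the first unit is a high surrogate. */
    int utf16_decode(const uint16_t *s, uint32_t *out)
    {
        uint16_t hi = s[0];
        if (hi < 0xD800 || hi > 0xDFFF) {       /* ordinary BMP character */
            *out = hi;
            return 1;
        }
        if (hi <= 0xDBFF) {                     /* high surrogate 0xD800..0xDBFF */
            uint16_t lo = s[1];
            if (lo >= 0xDC00 && lo <= 0xDFFF) { /* low surrogate 0xDC00..0xDFFF */
                *out = 0x10000u
                     + (((uint32_t)(hi - 0xD800) << 10) | (uint32_t)(lo - 0xDC00));
                return 2;
            }
        }
        return 0;                               /* lone surrogate: invalid */
    }

Note that the pair arithmetic only ever runs for the rare characters outside the first 64K values; for everything else a single range check suffices.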
These are reasons to use UTF-16:

1. Performance
The UTF-16 representation of textual data needs only half the amount of memory that a 32-bit representation would need, provided that surrogate pairs occur only seldom, which will be the case. Memory itself may be cheap, but the size of the data that has to be handled for each user in a multiuser environment is a critical performance issue.

2. Portability
Software that uses wchar_t has restricted portability, since wchar_t sometimes has 32 bits but sometimes only 16 bits. A dedicated type for Unicode with platform-independent width allows one to write portable software (a sketch of such a type follows at the end of this message).

3. Interplatform Communication
When UTF-16 is used for communication between different platforms, at most an endian conversion may be necessary, which can be done in place (a sketch of this conversion likewise follows at the end). A conversion between UTF-16 and UTF-32 is more costly, and UTF-32 would imply unacceptably high data volumes when used for communication.

4. Embedding in existing IT infrastructures
UTF-16 Unicode implementations integrate better into existing IT landscapes, where we find products which use 16-bit Unicode, too. One example is the Java Native Interface (JNI): through it, C functions can access the UTF-16 representation of Java strings. UTF-16 support in C is therefore desirable. (Furthermore, in JDK 1.2, classes such as java.io.OutputStreamWriter and java.io.InputStreamReader support the conversion of surrogate pairs to single UTF-8 characters and back.)

5. Other commercial software uses 16-bit Unicode
The Oracle Call Interface (OCI) supports 16-bit Unicode (see Oracle8i National Language Support Guide, Release 8.1.5 and higher); in the database itself, UTF-8 is used. PeopleSoft PeopleTools 8 (C/C++ core) uses 16-bit Unicode, independently of the platform. Also, some office products support 16-bit Unicode.

6. Operations and representation of character strings
Although UTF-32 makes some operations on characters easier (e.g. indexing into strings), it leads to great overhead in other areas (searching, collating, displaying etc., where the whole string is involved).

The final point is that even UTF-16 without surrogates is already capable of supporting the most important scripts, like Arabic, Bengali etc.

For a full - and more technical - proposal please have a look at the attachment.

Hoping for a fruitful discussion ;-)

Kind regards
Willi Nuesser
SAP LinuxLab

PS: When UTF-8 is used, the complexity of variable-width characters shows up with almost every commonly used language except pure 7-bit ASCII. For a number of languages, the UTF-8 representation saves some storage when compared with UTF-16, but for Asian characters UTF-8 requires 50% more storage than UTF-16 (three bytes per character instead of two). We do not consider UTF-8 advantageous for text representation in memory. It may be well suited for files where access is sequential, but in general it is no universal solution.

Appendix: detailed proposal for 2 byte Unicode implementation <>
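As an illustration of the dedicated, platform-independent 16-bit type mentioned in point 2 above, a minimal sketch assuming C99's <stdint.h>; the names u16char_t and u16strlen are hypothetical and only stand in for whatever the final proposal would define:

    #include <stddef.h>
    #include <stdint.h>

    /* A dedicated 16-bit character type whose width does not vary
     * across platforms, unlike wchar_t. */
    typedef uint16_t u16char_t;

    /* Length of a zero-terminated UTF-16 string, counted in 16-bit
     * units; a surrogate pair counts as two units. */
    size_t u16strlen(const u16char_t *s)
    {
        const u16char_t *p = s;
        while (*p)
            p++;
        return (size_t)(p - s);
    }

Counting in code units rather than characters keeps the routine a plain linear scan, independent of surrogate handling; callers that need character counts can layer that on top.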
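And for the in-place endian conversion from point 3, an equally small sketch (again with a hypothetical name):

    #include <stddef.h>
    #include <stdint.h>

    /* Convert a UTF-16 buffer of n units between big- and little-endian
     * in place: each 16-bit unit has its two bytes swapped, and nothing
     * else changes, so no extra memory is needed. */
    void utf16_swap_endian(uint16_t *s, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            s[i] = (uint16_t)((s[i] << 8) | (s[i] >> 8));
    }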