public inbox for gcc@gcc.gnu.org
* Proposal for 2 Byte Unicode implementation in gcc and glibc
@ 2000-08-04  5:26 Nuesser, Wilhelm
  2000-08-04  5:56 ` Andrew Cunningham
                   ` (3 more replies)
  0 siblings, 4 replies; 10+ messages in thread
From: Nuesser, Wilhelm @ 2000-08-04  5:26 UTC (permalink / raw)
  To: 'sap-list@redhat.com', 'gcc@gcc.gnu.org',
	'linux-utf8@humbolt.nl.linux.org',
	'libc-hacker@sources.redhat.com'
  Cc: Nuesser, Wilhelm, Rohland, Hans-Christoph


Hi everybody,

In the following we present a proposal for 2-byte Unicode support in
gcc and the libraries.
 
 
Motivation for our proposal: 
 
When comparing applications that use 4-byte Unicode on Linux with
similar applications using 2-byte Unicode on other platforms
(e.g. Windows), Linux will always show the worse performance.

Even otherwise superior OS performance cannot compensate for the
additional demands on memory bandwidth, CPU, disk space etc.  One
simple example: in a typical database used in medium-sized companies,
of about 100 GB, we find a ratio of about 70 percent strings to 30
percent data. The transition to 2-byte Unicode would increase the
disk space needed to (2*70 + 30) % = 170 %. A change to 4-byte
Unicode would increase the same database to (4*70 + 30) % = 310 %.


If we want Linux to become a major, globally usable platform, we
strongly believe that we cannot sustain this inflation, i.e. we have
to provide an additional - 2-byte - implementation of Unicode. At the
least, programmers must be free to choose which way they want their
programs to work. Otherwise Linux will be on the wrong track.
  
 
 
Please see the following text for some detailed information and 
the attachment for our full proposal: 
 
****************************************************************************
 
The next version of our business applications will be offered with a
Unicode option. The software is programmed with 2-byte Unicode
characters.  Because Unicode characters are 4 bytes in today's Linux
world, there has been some effort to add 2-byte Unicode strings to gcc
2.95.2 and to implement the appropriate string-handling routines.  We
would appreciate it if gcc and glibc were enhanced with our proposed
feature.  Of course, we are willing to contribute towards the
implementation of 2-byte Unicode; we already have a patch for gcc that
provides some 2-byte Unicode support.
 
Reasons for Using UTF-16 and UTF-32 
 
What is UTF-16? 
 
UTF-16 allows access to 63K characters as single Unicode 16-bit
units. It can access an additional 1M characters by a mechanism known as
surrogate pairs. Two ranges of Unicode code values are reserved for the
high (first) and low (second) values of these pairs.  Highs are from
0xD800 to 0xDBFF, and lows from 0xDC00 to 0xDFFF. In Unicode 3.0, there
are no assigned surrogate pairs. Since the most common characters have
already been encoded in the first 64K values, the characters requiring
surrogate pairs will be relatively rare. (Taken from Unicode FAQ
Copyright ©1998-2000 Unicode, Inc.)
 
What is UTF-32? 
 
All characters represented in UTF-16, both those represented with 16
bits and those with a surrogate pair, can be represented as a single
32-bit unit in UTF-32. This single unit corresponds to the Unicode
scalar value, which is the abstract number associated with a Unicode
character. UTF-32 is a subset of the encoding mechanism called UCS-4 in
ISO 10646. (Taken from Unicode FAQ Copyright ©1998-2000 Unicode, Inc.)
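
As a concrete sketch of this mapping in C (utf16_t and utf32_t are the
types proposed in the attachment; here simply assumed to be 16-bit and
32-bit unsigned integers):

  #include <stdint.h>

  typedef uint16_t utf16_t;  /* proposed 16-bit Unicode unit */
  typedef uint32_t utf32_t;  /* proposed 32-bit Unicode unit */

  /* Split a scalar value >= 0x10000 into a surrogate pair. */
  static void utf32_to_surrogates(utf32_t c, utf16_t *hi, utf16_t *lo)
  {
      c -= 0x10000;                           /* 20 bits remain   */
      *hi = (utf16_t)(0xD800 + (c >> 10));    /* high surrogate   */
      *lo = (utf16_t)(0xDC00 + (c & 0x3FF));  /* low surrogate    */
  }

  /* Combine a surrogate pair back into the scalar value. */
  static utf32_t surrogates_to_utf32(utf16_t hi, utf16_t lo)
  {
      return 0x10000 + (((utf32_t)(hi - 0xD800)) << 10)
                     + (utf32_t)(lo - 0xDC00);
  }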
 
These are reasons to use UTF-16: 
 
    1. Performance

      The UTF-16 representation of textual data needs only half the
      memory that a 32-bit representation would need, provided that
      surrogate pairs occur only rarely, which will be the case.
      Memory itself may be cheap, but the size of the data that has
      to be handled for each user in a multiuser environment is a
      critical performance issue.

 
    2. Portability

      Software that uses wchar_t has restricted portability, since
      wchar_t sometimes has 32 bits, but sometimes only 16 bits. A
      dedicated Unicode type with a platform-independent size makes
      it possible to write portable software.
 
    3. Interplatform Communication

      When UTF-16 is used for communication between different
      platforms, at most an endianness conversion is necessary, and
      it can be done in place (see the byte-swap sketch after this
      list). A conversion between UTF-16 and UTF-32 is more costly,
      and UTF-32 would imply unacceptably high data volumes when used
      for communication.
 
    4. Embedding in existing IT infrastructures

      UTF-16 Unicode implementations integrate better into existing
      IT landscapes, where we already find products that use 16-bit
      Unicode. One example is the Java Native Interface (JNI),
      through which C functions can access the UTF-16 representation
      of Java strings; UTF-16 support in C is therefore desirable.
      (Furthermore, in JDK 1.2, classes such as
      java.io.OutputStreamWriter and java.io.InputStreamReader
      support the conversion of surrogate pairs to single UTF-8
      sequences and back.)
 
    5. Other commercial software uses 16-bit Unicode

      The Oracle Call Interface (OCI) supports 16-bit Unicode (see
      the Oracle8i National Language Support Guide, Release 8.1.5 and
      higher); in the database itself, UTF-8 is used. PeopleSoft
      PeopleTools 8 (C/C++ core) uses 16-bit Unicode, independently
      of the platform. Some office products also support 16-bit
      Unicode.
      
    6. Operations on and representation of character strings

      Although UTF-32 makes some operations on characters easier
      (e.g. indexing into strings), it leads to a large overhead in
      other areas (searching, collating, displaying etc., where the
      whole string is involved).  The final point is that even UTF-16
      without surrogates is already capable of supporting the most
      important scripts, such as Arabic, Bengali etc.
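
As a footnote to point 3 above, the in-place endianness conversion
really is a trivial loop; a minimal sketch:

  #include <stddef.h>
  #include <stdint.h>

  /* Swap the byte order of n UTF-16 units in place. */
  static void utf16_swap_endian(uint16_t *s, size_t n)
  {
      size_t i;
      for (i = 0; i < n; i++)
          s[i] = (uint16_t)((s[i] >> 8) | (s[i] << 8));
  }

Surrogate pairs need no special treatment here: each half is simply
byte-swapped as an ordinary 16-bit unit.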



For a full - and more technical - proposal please have a look at the
attachment.

Hoping for a fruitful discussion ;-)


Kind regards

Willi Nuesser
SAP LinuxLab




PS: When UTF-8 is used, the complexity of variable-width characters
shows up with almost every commonly used language except pure 7-bit
ASCII. For a number of languages, the UTF-8 representation saves some
storage compared with UTF-16, but for Asian characters UTF-8 requires
50% more storage than UTF-16. We do not consider UTF-8 advantageous
for text representation in memory. It may be well suited for files,
where access is sequential, but in general it is no universal
solution.
 
 

Appendix: 

detailed proposal for 2 byte Unicode implementation

  <<appendix.txt>> 


* Re: Proposal for 2 Byte Unicode implementation in gcc and glibc
  2000-08-04  5:26 Proposal for 2 Byte Unicode implementation in gcc and glibc Nuesser, Wilhelm
@ 2000-08-04  5:56 ` Andrew Cunningham
  2000-08-04  6:21   ` Jamie Lokier
  2000-08-04  6:31 ` Jamie Lokier
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 10+ messages in thread
From: Andrew Cunningham @ 2000-08-04  5:56 UTC (permalink / raw)
  To: linux-utf8, sap-list, gcc, libc-hacker
  Cc: Nuesser, Wilhelm, Rohland, Hans-Christoph


Just a couple of notes.

UTF-16 is not a 2-byte encoding per se. The currently defined Unicode
characters can be accommodated in 2 bytes using UTF-16, but some of us
are looking forward to the publishing of Plane 1 characters, etc.

Any implementation of UTF-16 must include the capacity to correctly
handle valid surrogate pairs. You can't restrict UTF-16 characters to
2 bytes.


ciao

Andj

andjc@ozemail.com.au


----- Original Message -----
From: Nuesser, Wilhelm <wilhelm.nuesser@sap.com>
To: <sap-list@redhat.com>; <gcc@gcc.gnu.org>;
'linux-utf8@humbolt.nl.linux.org' <linux-utf8@nl.linux.org>;
<libc-hacker@sources.redhat.com>
Cc: Nuesser, Wilhelm <wilhelm.nuesser@sap.com>; Rohland, Hans-Christoph
<hans-christoph.rohland@sap.com>
Sent: Friday, 4 August 2000 22:25
Subject: Proposal for 2 Byte Unicode implementation in gcc and glibc


[...]



* Re: Proposal for 2 Byte Unicode implementation in gcc and glibc
  2000-08-04  5:56 ` Andrew Cunningham
@ 2000-08-04  6:21   ` Jamie Lokier
  0 siblings, 0 replies; 10+ messages in thread
From: Jamie Lokier @ 2000-08-04  6:21 UTC (permalink / raw)
  To: Andrew Cunningham
  Cc: linux-utf8, sap-list, gcc, libc-hacker, Nuesser, Wilhelm,
	Rohland, Hans-Christoph


Andrew Cunningham wrote:
> Any implementation of UTF-16 must include the capacity to correctly
> handle valid surrogate pairs. You can't restrict UTF-16 characters to
> 2 bytes.

That's why conversion from UTF-16 to UTF-32 should be analogous to
conversion from UTF-8 to wchar_t, à la mbstowcs, etc.  The usual rules
about character-by-character processing apply.  You may wish to use
utf32_t for the intermediate characters, e.g. in a simple parser.
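
A rough sketch of such a character-level primitive (utf16towc is a
made-up name, not an existing libc function):

  #include <stddef.h>
  #include <stdint.h>

  /* Convert one UTF-16 character (1 or 2 units) to UTF-32,
     analogous to mbtowc.  Returns the number of 16-bit units
     consumed, or -1 on a broken surrogate. */
  static int utf16towc(uint32_t *pwc, const uint16_t *s, size_t n)
  {
      if (n == 0)
          return -1;
      if (s[0] < 0xD800 || s[0] > 0xDFFF) {   /* ordinary unit */
          *pwc = s[0];
          return 1;
      }
      if (s[0] <= 0xDBFF && n >= 2
          && s[1] >= 0xDC00 && s[1] <= 0xDFFF) {  /* valid pair */
          *pwc = 0x10000 + (((uint32_t)(s[0] - 0xD800)) << 10)
                         + (uint32_t)(s[1] - 0xDC00);
          return 2;
      }
      return -1;   /* lone or out-of-order surrogate */
  }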

-- Jamie


* Re: Proposal for 2 Byte Unicode implementation in gcc and glibc
  2000-08-04  5:26 Proposal for 2 Byte Unicode implementation in gcc and glibc Nuesser, Wilhelm
  2000-08-04  5:56 ` Andrew Cunningham
@ 2000-08-04  6:31 ` Jamie Lokier
  2000-08-09  4:49   ` Christoph Rohland
  2000-08-04  7:15 ` Markus Kuhn
  2000-08-04 11:02 ` Bruno Haible
  3 siblings, 1 reply; 10+ messages in thread
From: Jamie Lokier @ 2000-08-04  6:31 UTC (permalink / raw)
  To: Nuesser, Wilhelm
  Cc: 'sap-list@redhat.com', 'gcc@gcc.gnu.org',
	'linux-utf8@humbolt.nl.linux.org',
	'libc-hacker@sources.redhat.com',
	Rohland, Hans-Christoph

Nuesser, Wilhelm wrote:
> PS: When UTF-8 is used, the complexity of variable width characters
> shows up with almost every commonly used language except pure 7-Bit
> ASCII. For a number of languages, the UTF-8 representation saves some
> storage when compared with UTF-16, but for Asian characters UTF-8
> requires 50% more storage than UTF-16. We do not consider UTF-8 as
> advantageous for text representation in the memory. It may be well
> suited for files where access is sequential but in general it is no
> universal solution.

The *complexity* of variable-width characters shows up with UTF-16 too.
So although space concerns may be a good reason to choose UTF-16 for
external representations (on disk), within a program UTF-32 is simple
and UTF-8/UTF-16 are more complex.

UTF-8 has the advantage that there is no endianness ambiguity, and has
some other nice lexical properties.

This is why UTF-8 is the standard "unix" representation of large chars.
(Space is not a significant issue, provided you compress your text
files.  Compressed UTF-8 should take about the same space as
compressed UTF-16.)

Therefore, it is good to have conversion functions between UTF-8, UTF-16
and UTF-32.  It is perhaps a nice extension to have the compiler able to
parse UTF-16 and UTF-32 constant strings.

But I don't see the point in an extensive set of printfU16
etc. functions.  Standard unix text files use UTF-8 (or unfortunately
they are often ISO-8859-1).  Non-standard formats like databases may use
UTF-16, but databases don't use printf to write to the database.

Btw, I prefer "UTF16" or "utf16" instead of "U16" ;-)

have a nice day,
-- Jamie


* Re: Proposal for 2 Byte Unicode implementation in gcc and glibc
  2000-08-04  5:26 Proposal for 2 Byte Unicode implementation in gcc and glibc Nuesser, Wilhelm
  2000-08-04  5:56 ` Andrew Cunningham
  2000-08-04  6:31 ` Jamie Lokier
@ 2000-08-04  7:15 ` Markus Kuhn
  2000-08-04 11:02 ` Bruno Haible
  3 siblings, 0 replies; 10+ messages in thread
From: Markus Kuhn @ 2000-08-04  7:15 UTC (permalink / raw)
  To: Nuesser, Wilhelm
  Cc: 'sap-list@redhat.com', 'gcc@gcc.gnu.org',
	'libc-hacker@sources.redhat.com',
	Rohland, Hans-Christoph

"Nuesser, Wilhelm" wrote on 2000-08-04 12:25 UTC:
> The C standard (ISO/IEC 9899:1990) and its Addendum 1 (1994)

This is an obsolete document version. Please cite "The C standard
(ISO/IEC 9899:1999)" instead. It is available on paper or as a PDF for
on-line ordering at

  http://www.iso.ch/cate/d29237.html

Until you receive your official copy, have a look at the final draft at

  http://www.cl.cam.ac.uk/~mgk25/ISO-C-FDIS.1999-04.txt

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: < http://www.cl.cam.ac.uk/~mgk25/ >


* Re: Proposal for 2 Byte Unicode implementation in gcc and glibc
  2000-08-04  5:26 Proposal for 2 Byte Unicode implementation in gcc and glibc Nuesser, Wilhelm
                   ` (2 preceding siblings ...)
  2000-08-04  7:15 ` Markus Kuhn
@ 2000-08-04 11:02 ` Bruno Haible
  2000-08-04 18:52   ` Werner LEMBERG
  3 siblings, 1 reply; 10+ messages in thread
From: Bruno Haible @ 2000-08-04 11:02 UTC (permalink / raw)
  To: linux-utf8
  Cc: 'sap-list@redhat.com', 'gcc@gcc.gnu.org',
	Nuesser, Wilhelm, Rohland, Hans-Christoph

Wilhelm Nuesser writes:

> One simple example: for a typical database used in medium sized companies of
> about 100 GB, we find a ratio of about 70 percent strings to 30 percent
> data. The transition to 2 byte Unicode would increase the disk space to
> (2*70 + 30) % = 170 %. If we change to 4 byte Unicode the same database
> would increase by 310 %.

Application writers distinguish between the external representation
of a string (how it is stored on disk) and the internal representation
(how it is stored in memory most of the time).

About the external representation:

* No one uses UCS-4/UTF-32. It's just too wasteful.

* Many Windows applications use UCS-2 or UTF-16.

* Many Unix applications use UTF-8.

* The particular choice for your applications is up to you. Support
  for all of them is available in glibc-2.1.92, through iconv
  (explicit conversion; see the sketch below) or fopen/fgetwc/fputwc
  (implicit conversion).
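
For illustration, the explicit route through iconv looks roughly like
this (a sketch using the standard iconv(3) API; error handling
abbreviated):

  #include <iconv.h>

  /* Convert a UTF-8 buffer to UTF-16LE using iconv.  Returns the
     number of output bytes, or (size_t)-1 on error. */
  size_t utf8_to_utf16le(char *in, size_t inlen,
                         char *out, size_t outlen)
  {
      iconv_t cd = iconv_open("UTF-16LE", "UTF-8");
      size_t left = outlen;

      if (cd == (iconv_t)-1)
          return (size_t)-1;
      if (iconv(cd, &in, &inlen, &out, &left) == (size_t)-1) {
          iconv_close(cd);
          return (size_t)-1;
      }
      iconv_close(cd);
      return outlen - left;
  }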

About the internal representation:

* Many applications use UTF-8 as internal representation, because it
  does not waste a lot of memory for American and European languages.

* For some complicated tasks, like string pattern matching, temporary
  conversion to UCS-4 is performed, using mbsnrtowcs or equivalent.

* For some simpler tasks, like determining the width of a string,
  the conversion to UCS-4 is often performed on the fly, using
  mbrtowc, with no need for memory allocation (a sketch follows this
  list).

* The ISO C 99 standard and its glibc-2.2 implementation offer their
  entire printf/scanf/IO facilities in both the multibyte (possibly
  UTF-8) and wide (UCS-4 on glibc) flavours.

* Again, the choice is up to you. If you absolutely want the third
  flavour (UTF-16 as in-memory representation), libraries like ICU
  give it to you.
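
The on-the-fly width computation mentioned above might look like this
(a sketch; assumes setlocale() has selected the appropriate multibyte
locale, e.g. a UTF-8 one):

  #define _XOPEN_SOURCE 500   /* for wcwidth() */
  #include <string.h>
  #include <wchar.h>

  /* Display width of a multibyte string, converting one character
     at a time; no heap allocation is needed. */
  int string_width(const char *s)
  {
      mbstate_t st;
      wchar_t wc;
      size_t k, n = strlen(s);
      int width = 0, w;

      memset(&st, 0, sizeof st);
      while (n > 0 && (k = mbrtowc(&wc, s, n, &st)) != 0) {
          if (k == (size_t)-1 || k == (size_t)-2)
              return -1;       /* invalid or incomplete sequence */
          if ((w = wcwidth(wc)) > 0)   /* ignore non-printing chars */
              width += w;
          s += k;
          n -= k;
      }
      return width;
  }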

> These are reasons to use UTF-16: 
>  
>     1.Performance
>  
>       The UTF-16 representation of textual data needs only half the
>       amount of memory that a 32-bit representation would need, provided
>       that surrogate pairs occur only seldom, which will be the
>       case.

Given that most of the world's textual data is ISO-8859-*/KOI8-R,
encoding it with UTF-8 saves even more memory.

>     2.Portability 
>  
>       Software that uses wchar_t has restricted portability since
>       wchar_t sometimes has 32 bits, but sometimes only 16 bits. A
>       dedicated type for Unicode with platform-independent length allows
>       to write portable software.

Writing portable programs means realizing what is
implementation-dependent and what is not. Yes, sizeof(wchar_t) is
implementation-dependent.

If you don't like that, you are free to use a middleware library (like
ICU, again) which shields you from the operating system's types.

>     6.Operations and representation of character strings 
>       
>       Although UTF-32 makes some operations on characters easier
>       (e.g. indexing into strings) this implementation leads to a great
>       overhead in other areas (see searching, collating, displaying etc.
>       where the whole string is involved).

In any of these areas (searching, collating, displaying) you can
afford to temporarily convert from UTF-8 or UTF-16 to UCS-4, because
the actual work involved (canonical [de]composition, treatment of
combining characters, reordering of vowels, etc.) far outweighs the
conversion cost.

> For a number of languages, the UTF-8 representation saves some
> storage when compared with UTF-16, but for Asian characters UTF-8
> requires 50% more storage than UTF-16.

Yes, it does. And for English and German UTF-16 requires 100% more
storage than UTF-8.

> We do not consider UTF-8 as advantageous for text representation in
> the memory. It may be well suited for files where access is
> sequential but in general it is no uni-versal solution.

Whether the access is sequential or random is irrelevant here. When
doing random access into a UTF-16-encoded string, a program must not
process the second half of a surrogate pair before the first half, and
likewise it normally must not process a combining character before its
preceding base character. Therefore - whether in a UTF-32, UTF-16 or
UTF-8 world - random access into strings is done via substrings
(ranges of indices, not single indices), and then it no longer matters
whether the substrings are delimited by two "uint32_t *", two
"uint16_t *" or two "uint8_t *".

>    2.String and character literals 
>  
>       For utf16_t literals, we suggest the prefix u (similar to the
>       prefix L for the type wchar_t):
>  
>          utf16_t s[] = u"someText"; 
>          utf16_t c = u's'; 
>  
>       For utf32_t, we suggest the prefix U. This is similar to the
>       notation for universal character names in the C++ Standard: \u is
>       followed by four hexadecimal digits and \U is followed by eight
>       hexadecimal digits.

The need for the language extension that you propose here - namely,
being able to view and edit source code in non-Unicode text editors -
is already fulfilled by the ISO C 99 / ISO C++ "\uxxxx" and L"\uxxxx"
feature. The problem that wchar_t is not guaranteed to represent
Unicode is irrelevant, because such programs will work in a given
locale only anyway. For writing international software, I don't
recommend putting foreign strings in the code; put them into message
catalogs and use gettext().

                          Bruno


* Re: Proposal for 2 Byte Unicode implementation in gcc and glibc
  2000-08-04 11:02 ` Bruno Haible
@ 2000-08-04 18:52   ` Werner LEMBERG
  0 siblings, 0 replies; 10+ messages in thread
From: Werner LEMBERG @ 2000-08-04 18:52 UTC (permalink / raw)
  To: linux-utf8, haible; +Cc: sap-list, gcc, wilhelm.nuesser, hans-christoph.rohland

> > For a number of languages, the UTF-8 representation saves some
> > storage when compared with UTF-16, but for Asian characters UTF-8
> > requires 50% more storage than UTF-16.
> 
> Yes, it does. And for English and German UTF-16 requires 100% more
> storage than UTF-8.

You can use SCSU to compress your data.  It also works with short
strings (which is not true of generic compression algorithms like
LZW).  Unicode Technical Report #6
( http://www.unicode.org/unicode/reports/tr6/ ) gives the following
examples:

  UTF-16 German:     9 chars  (18 Bytes)   -> SCSU   9 Bytes
         Russian:    6 chars  (12 Bytes)   ->        7 Bytes
         Japanese: 116 chars (232 Bytes)   ->      178 Bytes


    Werner


* Re: Proposal for 2 Byte Unicode implementation in gcc and glibc
  2000-08-04  6:31 ` Jamie Lokier
@ 2000-08-09  4:49   ` Christoph Rohland
  0 siblings, 0 replies; 10+ messages in thread
From: Christoph Rohland @ 2000-08-09  4:49 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: sap-list, gcc, linux-utf8, libc-hacker, wilhelm.nuesser

Jamie Lokier <egcs@tantalophile.demon.co.uk> writes:

> Nuesser, Wilhelm wrote:
> > PS: When UTF-8 is used, the complexity of variable width characters
> > shows up with almost every commonly used language except pure 7-Bit
> > ASCII. For a number of languages, the UTF-8 representation saves some
> > storage when compared with UTF-16, but for Asian characters UTF-8
> > requires 50% more storage than UTF-16. We do not consider UTF-8 as
> > advantageous for text representation in the memory. It may be well
> > suited for files where access is sequential but in general it is no
> > universal solution.

[...]

> But I don't see the point in an extensive set of printfU16
> etc. functions.  Standard unix text files use UTF-8 (or unfortunately
> they are often ISO-8859-1).  Non-standard formats like databases may use
> UTF-16, but databases don't use printf to write to the database.

But if you have an application which needs to process huge amounts of
Unicode-enabled strings, UTF-16 is IMHO the best way:

You very seldom have surrogates in normal strings, and on the other
hand you can detect these pairs with little overhead. We really cannot
afford either to blow up our memory requirements to 300% of the ASCII
case (with UTF-32), or to always scan the whole string for multibyte
sequences (with UTF-8), which would increase the CPU overhead by
probably the same amount.
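
To make the "little overhead" concrete: counting the code points in
well-formed UTF-16 takes one range check per 16-bit unit (a sketch,
not an existing API; assumes well-formed input):

  #include <stddef.h>
  #include <stdint.h>

  /* Count code points in well-formed UTF-16: every unit that is
     not a high surrogate starts (or is) a character, so a pair
     is counted once, at its low surrogate. */
  size_t utf16_codepoints(const uint16_t *s, size_t n)
  {
      size_t i, count = 0;

      for (i = 0; i < n; i++)
          if (s[i] < 0xD800 || s[i] > 0xDBFF)
              count++;
      return count;
  }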


(It took me a long time to understand this, and I had the very same
point of view as you expressed, but our NLS experts really convinced
me.)

Greetings
		Christoph


* Re: Proposal for 2 Byte Unicode implementation in gcc and glibc
  2000-08-08  1:29 Brink, Ulrich
@ 2000-08-08 18:25 ` Jamie Lokier
  0 siblings, 0 replies; 10+ messages in thread
From: Jamie Lokier @ 2000-08-08 18:25 UTC (permalink / raw)
  To: Brink, Ulrich
  Cc: 'linux-utf8@nl.linux.org', 'sap-list@redhat.com',
	'gcc@gcc.gnu.org'

> Finally let me point out that the literals are more important for us
> than the library issue.

It would be nice to have UTF-8 string literals, but it is just a nice
thought.

-- Jamie


* Re: Proposal for 2 Byte Unicode implementation in gcc and glibc
@ 2000-08-08  1:29 Brink, Ulrich
  2000-08-08 18:25 ` Jamie Lokier
  0 siblings, 1 reply; 10+ messages in thread
From: Brink, Ulrich @ 2000-08-08  1:29 UTC (permalink / raw)
  To: 'linux-utf8@nl.linux.org'
  Cc: 'sap-list@redhat.com', 'gcc@gcc.gnu.org'

Bruno Haible writes:
> Wilhelm Nuesser writes:
>
>>    2.String and character literals
>>
>>       For utf16_t literals, we suggest the prefix u (similar to the
>>       prefix L for the type wchar_t):
>>
>>          utf16_t s[] = u"someText";
>>          utf16_t c = u's';
>>
>
> The need for the language extension that you propose here - namely,
> being able to view and edit source code in non-Unicode text editors -
> is already fulfilled by the ISO C 99 / ISO C++ "\uxxxx" and L"\uxxxx"
> feature. The problem that wchar_t is not guaranteed to represent
> Unicode is irrelevant, because such programs will work in a given
> locale only anyway. For writing international software, I don't
> recommend putting foreign strings in the code; put them into message
> catalogs and use gettext().

16-bit Unicode is being used in existing software. Java is 16-bit
Unicode.  On AIX and Windows NT, wchar_t has 16 bits.  The template
class basic_string in C++ is designed to be instantiated with various
types.  With our proposal, we leave it to the developer to decide
which Unicode representation best fits his needs.

There are libraries for platform-independent 16-bit Unicode support -
you mentioned ICU - but there are no literals. The programmer has to
write something like

  unsigned short s[] = {'H', 'e', 'l', 'l', 'o', 0 };
  myfunc( (unsigned short*)"H\000e\000l\000l\000o\000\000" );

Of course, in internationalized applications the texts that are
displayed to the users should be translated and should not be coded in
the C source. Nevertheless, literals are frequently used for various
internal purposes.

The latest gcc has an option that makes wchar_t 16 bits long.  However,
there is the danger of mixing up objects compiled with 16-bit wchar_t
and objects compiled with 32-bit wchar_t, and as far as I know it is
not planned to create a glibc with 16-bit wchar_t. So we would prefer
to work without the new option and to have a new type for 16-bit
characters.

In glibc, the char and wchar_t versions of some functions (e.g.
strtol(), strcoll()) are generated from the same source. It would not
be too difficult to generate a 16-bit version as well, and the result
would be more reliable than an independent library. In practice, one
of the problems we have is that we must migrate old C code to Unicode.

Finally, let me point out that the literals are more important for us
than the library issue.

Yours,
Ulli



Thread overview: 10+ messages
2000-08-04  5:26 Proposal for 2 Byte Unicode implementation in gcc and glibc Nuesser, Wilhelm
2000-08-04  5:56 ` Andrew Cunningham
2000-08-04  6:21   ` Jamie Lokier
2000-08-04  6:31 ` Jamie Lokier
2000-08-09  4:49   ` Christoph Rohland
2000-08-04  7:15 ` Markus Kuhn
2000-08-04 11:02 ` Bruno Haible
2000-08-04 18:52   ` Werner LEMBERG
2000-08-08  1:29 Brink, Ulrich
2000-08-08 18:25 ` Jamie Lokier
