From mboxrd@z Thu Jan 1 00:00:00 1970
From: Paul Eggert
To: martin@mira.isdn.cs.tu-berlin.de
Cc: bothner@cygnus.com, gcc2@gnu.org, egcs@cygnus.com
Subject: Re: thoughts on martin's proposed patch for GCC and UTF-8
Date: Thu, 17 Dec 1998 07:32:00 -0000
Message-id: <199812171531.HAA01746@shade.twinsun.com>
References: <199812100702.XAA26400@cygnus.com> <199812120323.TAA10442@shade.twinsun.com> <199812121018.LAA02558@mira.isdn.cs.tu-berlin.de> <199812160559.VAA01252@shade.twinsun.com> <199812160712.IAA00239@mira.isdn.cs.tu-berlin.de>
X-SW-Source: 1998-12/msg00635.html

   Date: Wed, 16 Dec 1998 08:12:52 +0100
   From: Martin von Loewis

> > But this would mean ... "\u00b5" would turn into a two-byte
> > multibyte character string, which is incorrect for the common ISO
> > 8859/1 encoding where it is represented by a single byte.
>
> Are you talking about source character set or process character set
> here?

I'm talking about the execution character set.  E.g. printf ("\u00b5")
should output a single byte in the Solaris 7 "de" locale, which uses
ISO 8859/1.

> There is no concept of locale character set in C++...

There's no such official term in C either, but I think C++ and C use
the same basic idea here, namely that a locale (in particular, the
LC_CTYPE part of the locale) specifies the rules for how multibyte
characters are converted to wide chars, which wide chars are
considered to be upper case, etc., etc.  These rules are defined by
the locale's character set and encoding.

> char hello[]="Hall\u00f6chen";
>
> you would get "Hall\303\266chen" at run time.

That's certainly not true in draft C9x, for non-UTF-8 locales.  In
draft C9x, if you want "Hall\303\266chen" at run time, you can write
"Hall\303\266chen" at compile time.  I also suspect that it's not true
for C++.  It's hard for me to believe that C++ requires UTF-8 encoding
for strings at run-time.

> wchar hello[]=L"Hall\u00f6chen";
>
> This should give the equivalent wide string at run-time.
If the implementation uses Unicode wide chars, this is equivalent to
"Hall\x00f6chen"; otherwise, it's equivalent to whatever binary
encoding they use.

It's possible for the locale to use Unicode wide strings even though
it uses a non-UTF-8 encoding for multibyte chars.  (I believe glibc
2.1 does this, but I haven't checked.)  But it's not required by the C
standard, and some systems use other wide encodings (e.g. JIS).

> > Yes.  But the current locale should affect the processing of \u
> > escapes, as well as the recognition of multibyte characters.
>
> After the discussion with RMS, I agree that we should copy bytes
> *unmodified* into output.

But your example above with `char hello' doesn't copy the bytes
unmodified!  It translates the 6 chars "\u00f6" to 2 bytes in your
locale's charset and encoding, which is the right thing to do; RMS
(reluctantly, I think :-) agreed that \u requires locale-dependent
translation.

I agree that multibyte chars should be copied unmodified into the
output.  However, as I mentioned earlier, they require locale-specific
processing to be *recognized*; otherwise they might be confused with
ASCII chars.

> If we get a \u escape (which the standard says clearly identifies
> ISO 10646 characters) we should also copy it as-is to the output.

Again, you seem to be contradicting your own example.  Though draft
C9x says that \u identifies ISO 10646 chars, it doesn't require that
the implementation use UTF-8 narrow strings, nor does it require that
the implementation use Unicode in wide strings.  It can use some other
encoding, e.g. Shift-JIS or ISO 8859/1 or even Ascii.  I assume C++ is
similar here.

Java is a different animal here; it requires Unicode at run-time.  But
we're talking about C (and C++), which make no such requirement.

> WCHAR DriverName[] = "\u1234\u5678";
>
> The C standard says you should get Unicode.
All draft C9x says is that you should get the appropriate chars, and
that the relationship between those chars and the Unicode chars is
implementation-defined.

> Microsoft says you should use Unicode in certain situations.

Absolutely.  In a locale that uses Unicode, you should get Unicode.

> Converting Unicode escapes to an encoding that uses illegal ASCII in
> assembler doesn't sound too smart to me.

Sorry, you've lost me.  ``illegal ASCII''??

> > * In general, assembly language files will not be text files.
>
> Define "text file".

A file that (among other things) uses a single encoding for its
characters.  Such files can be processed by standard text tools like
wc, iconv, and emacs.

You're proposing that assembler files use UTF-8 in some cases, and the
locale's multibyte encoding in other cases.  Such files can't be
processed by standard text tools.