From mboxrd@z Thu Jan 1 00:00:00 1970
From: Paul Eggert
To: martin@mira.isdn.cs.tu-berlin.de
Cc: bothner@cygnus.com, gcc2@gnu.org, egcs@cygnus.com
Subject: Re: thoughts on martin's proposed patch for GCC and UTF-8
Date: Thu, 17 Dec 1998 07:32:00 -0000
Message-id: <199812171531.HAA01746@shade.twinsun.com>
References: <199812100702.XAA26400@cygnus.com> <199812120323.TAA10442@shade.twinsun.com> <199812121018.LAA02558@mira.isdn.cs.tu-berlin.de> <199812160559.VAA01252@shade.twinsun.com> <199812160712.IAA00239@mira.isdn.cs.tu-berlin.de>
X-SW-Source: 1998-12/msg00635.html

   Date: Wed, 16 Dec 1998 08:12:52 +0100
   From: Martin von Loewis

> > But this would mean ... "\u00b5" would turn into a two-byte
> > multibyte character string, which is incorrect for the common ISO
> > 8859/1 encoding where it is represented by a single byte.
>
> Are you talking about source character set or process character set
> here?

I'm talking about the execution character set.  E.g. printf ("\u00b5")
should output a single byte in the Solaris 7 "de" locale, which uses
ISO 8859/1.

> There is no concept of locale character set in C++...

There's no such official term in C either, but I think C++ and C use
the same basic idea here, namely that a locale (in particular, the
LC_CTYPE part of the locale) specifies the rules for how multibyte
characters are converted to wide chars, which wide chars are
considered to be upper case, etc., etc.  These rules are defined by
the locale's character set and encoding.

> char hello[]="Hall\u00f6chen";
>
> you would get "Hall\303\266chen" at run time.

That's certainly not true in draft C9x, for non-UTF-8 locales.  In
draft C9x, if you want "Hall\303\266chen" at run time, you can write
"Hall\303\266chen" at compile time.  I also suspect that it's not true
for C++.  It's hard for me to believe that C++ requires UTF-8 encoding
for strings at run-time.

> wchar hello[]=L"Hall\u00f6chen";
>
> This should give the equivalent wide string at run-time.
If the implementation uses Unicode wide chars, this is equivalent to
"Hall\x00f6chen"; otherwise, it's equivalent to whatever binary
encoding they use.

It's possible for the locale to use Unicode wide strings even though
it uses a non-UTF-8 encoding for multibyte chars.  (I believe glibc
2.1 does this, but I haven't checked.)  But it's not required by the C
standard, and some systems use other wide encodings (e.g. JIS).

> > Yes.  But the current locale should affect the processing of \u
> > escapes, as well as the recognition of multibyte characters.
>
> After the discussion with RMS, I agree that we should copy bytes
> *unmodified* into output.

But your example above with `char hello' doesn't copy the bytes
unmodified!  It translates the 6 chars "\u00f6" to 2 bytes in your
locale's charset and encoding, which is the right thing to do; RMS
(reluctantly, I think :-) agreed that \u requires locale-dependent
translation.

I agree that multibyte chars should be copied unmodified into the
output.  However, as I mentioned earlier, they require locale-specific
processing to be *recognized*; otherwise they might be confused with
ASCII chars.

> If we get a \u escape (which the standard says clearly identifies
> ISO 10646 characters) we should also copy it as-is to the output.

Again, you seem to be contradicting your own example.  Though draft
C9x says that \u identifies ISO 10646 chars, it doesn't require that
the implementation use UTF-8 narrow strings, nor does it require that
the implementation use Unicode in wide strings.  It can use some other
encoding, e.g. Shift-JIS or ISO 8859/1 or even Ascii.  I assume C++ is
similar here.

Java is a different animal here; it requires Unicode at run-time.  But
we're talking about C (and C++), which make no such requirement.

> WCHAR DriverName[] = "\u1234\u5678";
>
> The C standard says you should get Unicode.
All draft C9x says is that you should get the appropriate chars, and
that the relationship between those chars and the Unicode chars is
implementation-defined.

> Microsoft says you should use Unicode in certain situations.

Absolutely.  In a locale that uses Unicode, you should get Unicode.

> Converting Unicode escapes to an encoding that uses illegal ASCII in
> assembler doesn't sound too smart to me.

Sorry, you've lost me.  ``illegal ASCII''??

> > * In general, assembly language files will not be text files.
>
> Define "text file".

A file that (among other things) uses a single encoding for its
characters.  Such files can be processed by standard text tools like
wc, iconv, and emacs.

You're proposing that assembler files use UTF-8 in some cases, and the
locale's multibyte encoding in other cases.  Such files can't be
processed by standard text tools.