"Martin v. Loewis" wrote on 2000-09-22 19:34 UTC: > > It seems that gcc ignores the locale and does not use glibc's multi-byte > > decoding functions to read in wide-string literals. :-( > > I believe that gcc rightfully ignores the locale. I strongly disagree for the reasons outlined below. > The C standard says > that input files are mapped to the source character set in an > implementation-defined way; nowhere it says that environment settings > of the user operating the compiler should be taken into account. If gcc runs on a POSIX system, then the POSIX spec also comes into play and POSIX applications should clearly determine the character encoding in all their input/output streams based on the locale setting, unless some other way (e.g., MIME headers, command-line options, implementation-defined source code pragmas for compilers, etc.) has been used to override the current locale. POSIX specifies already what the "implementation-defined way of determining the source character set" is that the C standard refers to. > It would be wrong to take such settings into account: the results of > invoking the compiler would not be reproducable anymore, and it would > not be possible to mix header files that are written in different > encodings - who says that header files on a system have an encoding > that necessarily matches the environment settings of some user? First of all: Encodings are trivially to convert into each other (simply use iconv, recode, etc.). Users on POSIX systems have to make an effort to keep all their files in the same encoding, namely the encoding specified in their locale. The rapid proliferation of UTF-8 will make this actually feasible in the near future, because UTF-8 can be very practically used in place of all other encodings. The fathers of Unix have already decided back in 1992 (Plan9) that this is the only real way to go and I hope the GNU/ Linux world will follow soon. I hope that one day in the not too far future I can simply place into /etc/profile the line export LANG=C.UTF-8 then convert all my plain text files on my installation to UTF-8, and from then on never have to worry about the restrictions of ASCII or the problems of switching between different encodings any more. Sounds like a promising idea to me, but it clearly requires also that gcc -- like any other POSIX application that has to know the file encodings -- will honor the locale setting. > I believe that characters outside the basic character set (i.e. ASCII) > should not be used in portable software. The authors of the C standard made it very clear that they want to support the ISO 10646 repertoire in source code, and I hope that this will soon become common practice. > If you absolutely have to > have non-ASCII characters in your source code, you should use > universal character names, i.e. > > wprintf(L"Sch\u00f6ne Gr\u00FC\u00DFe!\n"); Please not!!! If I run on a beautiful modern system with full UTF-8 support, then I definitely want to make full use of this encoding in my development environment. Hex escape sequences like the above one have soon to be seen as an emergency fallback mechanism for use in cases where archaic environments (such as gcc 2.95 ;-) have to be maintained. In such situations, a trivial recoding program can be used to convert the normal UTF-8 source code into an ugly and user-unfriendly emergency fallback such as L"Sch\u00f6ne Gr\u00FC\u00DFe!\n" when files are transmitted to the archaic system. 
You must not confuse the emergency hack (hex fallbacks) with the daily
usage on modern systems (UTF-8). Gettext() only makes sense if support
for multi-lingual messages is a requirement. If I am a Thai student
writing UTF-8 C source code for a Thai programming class, then I want
to use the Thai alphabet in variables, comments, and wide-string
literals, just like you use ASCII.

I am convinced that

a) people will use lots of non-ASCII text in C source code (even
   English-speaking people will find en/em-dashes, curly quotation
   marks, and mathematical symbols a highly desirable extension beyond
   ASCII)

b) people will prefer to have these characters UTF-8 encoded in their
   development environment, such that they see the actual characters in
   the text editor and not the hex fallback

c) people will find it trivial to use a 5-line Perl script to convert
   L"Schöne Grüße!\n" into L"Sch\u00f6ne Gr\u00FC\u00DFe!\n" in case
   they encounter a (hopefully soon very rare) environment that can't
   handle ISO 10646 characters. It's just like they already find it
   trivial to convert {[]}^~ into trigraphs when they encounter a
   (thank god already exceedingly rare) system that does not handle all
   ISO 646 IRV characters.

Please, please treat L"Sch\u00f6ne Gr\u00FC\u00DFe!\n" as something as
ugly and hopefully unnecessary as trigraphs, not as common or even
recommendable practice! Otherwise you will just reveal yourself as an
ASCII chauvinist, and I shall condemn you to years of maximum-portable
trigraph usage ... ;-)

Markus

P.S.: See also http://www.cl.cam.ac.uk/~mgk25/unicode.html#activate

--
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>