* Re: gcc ignores locale (no UTF-8 source code supported)
From: Markus Kuhn @ 2000-09-23 9:35 UTC
To: libc-alpha; +Cc: gcc

"Martin v. Loewis" wrote on 2000-09-22 19:34 UTC:
> > It seems that gcc ignores the locale and does not use glibc's multi-byte
> > decoding functions to read in wide-string literals. :-(
>
> I believe that gcc rightfully ignores the locale.

I strongly disagree for the reasons outlined below.

> The C standard says
> that input files are mapped to the source character set in an
> implementation-defined way; nowhere does it say that environment settings
> of the user operating the compiler should be taken into account.

If gcc runs on a POSIX system, then the POSIX spec also comes into play,
and POSIX applications should clearly determine the character encoding of
all their input/output streams based on the locale setting, unless some
other way (e.g., MIME headers, command-line options,
implementation-defined source code pragmas for compilers, etc.) has been
used to override the current locale. POSIX specifies already what the
"implementation-defined way of determining the source character set" is
that the C standard refers to.

> It would be wrong to take such settings into account: the results of
> invoking the compiler would not be reproducible anymore, and it would
> not be possible to mix header files that are written in different
> encodings - who says that header files on a system have an encoding
> that necessarily matches the environment settings of some user?

First of all: Encodings are trivial to convert into each other (simply
use iconv, recode, etc.).
Users on POSIX systems have to make an effort to keep all their files in
the same encoding, namely the encoding specified in their locale. The
rapid proliferation of UTF-8 will make this actually feasible in the near
future, because UTF-8 can practically be used in place of all other
encodings. The fathers of Unix have already decided back in 1992 (Plan9)
that this is the only real way to go and I hope the GNU/Linux world will
follow soon.

I hope that one day in the not too distant future I can simply place into
/etc/profile the line

  export LANG=C.UTF-8

then convert all the plain text files on my installation to UTF-8, and
from then on never have to worry about the restrictions of ASCII or the
problems of switching between different encodings any more. Sounds like a
promising idea to me, but it clearly also requires that gcc -- like any
other POSIX application that has to know the file encodings -- will honor
the locale setting.

> I believe that characters outside the basic character set (i.e. ASCII)
> should not be used in portable software.

The authors of the C standard made it very clear that they want to
support the ISO 10646 repertoire in source code, and I hope that this
will soon become common practice.

> If you absolutely have to
> have non-ASCII characters in your source code, you should use
> universal character names, i.e.
>
>   wprintf(L"Sch\u00f6ne Gr\u00FC\u00DFe!\n");

Please not!!! If I run on a beautiful modern system with full UTF-8
support, then I definitely want to make full use of this encoding in my
development environment. Hex escape sequences like the one above will
soon have to be seen as an emergency fallback mechanism for use in cases
where archaic environments (such as gcc 2.95 ;-) have to be maintained.
In such situations, a trivial recoding program can be used to convert the
normal UTF-8 source code into an ugly and user-unfriendly emergency
fallback such as L"Sch\u00f6ne Gr\u00FC\u00DFe!\n" when files are
transmitted to the archaic system.
You must not confuse the emergency hack (hex fallbacks) with the
daily usage on modern systems (UTF-8).

Gettext() makes only sense if support of multi-lingual messages is a
requirement. If I am a Thai student writing UTF-8 C source code for a
Thai programming class, then I want to use the Thai alphabet in
variables, comments, and wide-string literals just like you use ASCII.

I am convinced that

  a) people will use lots of non-ASCII text in C source code (even
     English-speaking people will find en/em-dashes, curly quotation marks
     and mathematical symbols a highly desirable extension beyond ASCII)

  b) people will prefer to have these characters UTF-8 encoded in their
     development environment such that they see in the text editor the
     actual characters and not the hex fallback

  c) people will find it trivial to use a 5-line Perl script to
     convert L"Schöne Grüße!\n" into L"Sch\u00f6ne Gr\u00FC\u00DFe!\n"
     in case they encounter a (hopefully soon very rare) environment
     that can't handle ISO 10646 characters. It's just like they already
     find it trivial to convert {[]}^~ into trigraphs when they encounter
     a (thank god already exceedingly rare) system that does not handle
     all ISO 646 IRV characters.

Please please treat L"Sch\u00f6ne Gr\u00FC\u00DFe!\n" as something as
ugly and hopefully unnecessary as trigraphs, not as common or even
recommendable practice!

Otherwise you will just reveal yourself as an ASCII chauvinist and I
shall condemn you to years of maximum-portable trigraph usage ... ;-)

Markus

P.S.: See also http://www.cl.cam.ac.uk/~mgk25/unicode.html#activate

--
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org, WWW: <http://www.cl.cam.ac.uk/~mgk25/>
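
Point (c) above refers to a short script that rewrites UTF-8 literals as
universal character names. The thread does not show that script; what
follows is only a minimal sketch of the same conversion in C. The function
names utf8_decode and utf8_to_ucn are hypothetical. The sketch converts
every non-ASCII character in the buffer regardless of where it appears,
rejects malformed UTF-8, and does not check the C standard's restrictions
on which code points a universal character name may designate.

```c
#include <stddef.h>
#include <stdio.h>

/* Decode one UTF-8 sequence starting at s (n bytes available).
 * Stores the code point in *cp and returns the number of bytes consumed,
 * or 0 if the sequence is malformed, overlong, or truncated. */
static size_t utf8_decode(const unsigned char *s, size_t n, unsigned long *cp)
{
    size_t len, i;
    unsigned long c, min;

    if (n == 0)
        return 0;
    if (s[0] < 0x80) {
        *cp = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {
        len = 2; c = s[0] & 0x1F; min = 0x80;
    } else if ((s[0] & 0xF0) == 0xE0) {
        len = 3; c = s[0] & 0x0F; min = 0x800;
    } else if ((s[0] & 0xF8) == 0xF0) {
        len = 4; c = s[0] & 0x07; min = 0x10000;
    } else {
        return 0;
    }
    if (n < len)
        return 0;
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;
        c = (c << 6) | (s[i] & 0x3F);
    }
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;
    *cp = c;
    return len;
}

/* Hypothetical helper: copy n bytes of UTF-8 from in to out (capacity cap),
 * replacing each non-ASCII character with \uXXXX or \UXXXXXXXX.
 * The result is NUL-terminated. Returns 0 on success, -1 on malformed
 * input or if out is too small. */
int utf8_to_ucn(const char *in, size_t n, char *out, size_t cap)
{
    const unsigned char *s = (const unsigned char *)in;
    size_t i = 0, o = 0;

    if (cap == 0)
        return -1;
    while (i < n) {
        unsigned long cp;
        size_t len = utf8_decode(s + i, n - i, &cp);

        if (len == 0)
            return -1;
        if (cp < 0x80) {
            /* ASCII passes through unchanged; keep room for the NUL. */
            if (o + 1 >= cap)
                return -1;
            out[o++] = (char)cp;
        } else {
            int w;

            if (cp <= 0xFFFF)
                w = snprintf(out + o, cap - o, "\\u%04lX", cp);
            else
                w = snprintf(out + o, cap - o, "\\U%08lX", cp);
            if (w < 0 || (size_t)w >= cap - o)
                return -1;
            o += (size_t)w;
        }
        i += len;
    }
    out[o] = '\0';
    return 0;
}
```

Given the UTF-8 bytes of Grüße, utf8_to_ucn writes Gr\u00FC\u00DFe into
the output buffer; this sketch always emits upper-case hex digits.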
* Re: gcc ignores locale (no UTF-8 source code supported)
From: Martin v. Loewis @ 2000-09-23 12:22 UTC
To: Markus.Kuhn; +Cc: libc-alpha, gcc

> POSIX specifies already what the "implementation-defined way of
> determining the source character set" is that the C standard refers
> to.

Can you please point to the exact chapter and verse of Posix that
specifies that the C compiler must consider environment variables when
reading source code?

> First of all: Encodings are trivial to convert into each other (simply
> use iconv, recode, etc.). Users on POSIX systems have to make an effort
> to keep all their files in the same encoding, namely the encoding
> specified in their locale.

Users may not have the administrative permissions to do so: Most users
cannot modify the files in /usr/include, for example.

> The fathers of Unix have already decided back in 1992 (Plan9) that
> this is the only real way to go and I hope the GNU/Linux world will
> follow soon.

I can easily imagine that gcc supports a -futf-8 option some day (or
-fencoding=utf8). I hope it will never consider LANG when reading
source code, though. That is evil.

> The authors of the C standard made it very clear that they want to
> support the ISO 10646 repertoire in source code, and I hope that this
> will soon become common practice.

The authors also made it pretty clear that any mechanism except for
universal-character-names will be implementation-defined, and cannot
hope to be portable across implementations. Therefore, authors of
portable software should not make use of such a mechanism.

> > wprintf(L"Sch\u00f6ne Gr\u00FC\u00DFe!\n");
>
> Please not!!! If I run on a beautiful modern system with full UTF-8
> support, then I definitely want to make full use of this encoding in my
> development environment.

I guess you don't type UTF-8 bytes byte-for-byte into the files;
instead, your editor is capable of producing them on a key
stroke. Just tell your beautiful modern system to produce
universal-character-names when you type the keys. So the line above
would *display* with umlauts, even though the file uses a MBCS
encoding (namely, \u escapes).

An advanced editor (such as Emacs) is capable of dealing with multiple
encodings, it certainly could associate C files (and C++ and Java and
Tcl) with an encoding unicode-escape or such. Maybe it is time to
further improve your system.

> You must not confuse the emergency hack (hex fallbacks) with the
> daily usage on modern systems (UTF-8).

Why is one multibyte encoding capable of expressing full Unicode
(UTF-8) more modern than another one (universal character names)?

> Gettext() makes only sense if support of multi-lingual messages is a
> requirement. If I am a Thai student writing UTF-8 C source code for a
> Thai programming class, then I want to use the Thai alphabet in
> variables, comments, and wide-string literals just like you use ASCII.

Sure. Just use the right text editor - not one that produces UTF-8,
but one that produces universal character names. That way, you can
have all the features you want, *and* your code will compile even if
you take it with you when hired by a German company.

> a) people will use lots of non-ASCII text in C source code (even
> English-speaking people will find en/em-dashes, curly quotation marks
> and mathematical symbols a highly desirable extension beyond ASCII)

Certainly, although the barrier is high even if the technical problem
were solved: Keywords in English don't mix well with non-English
identifiers, and corporate style may require all technical
documentation (including comments) to be in English.

> b) people will prefer to have these characters UTF-8 encoded in their
> development environment such that they see in the text editor the
> actual characters and not the hex fallback

People won't care about encodings as long as it works.

> c) people will find it trivial to use a 5-line Perl script to
> convert L"Schöne Grüße!\n" into L"Sch\u00f6ne Gr\u00FC\u00DFe!\n"
> in case they encounter a (hopefully soon very rare) environment
> that can't handle ISO 10646 characters.

The environment not supporting ISO 10646 characters won't support the
universal character names, either. They are just two encodings of ISO
10646 - and one of them happens to be mandated by the language
standards (ISO 9899 and ISO 14882), while the other is not. It will be
trivial to convert the source, yes - but it would be even better if
editors supported them in the first place.

> Please please treat L"Sch\u00f6ne Gr\u00FC\u00DFe!\n" as something as
> ugly and hopefully unnecessary as trigraphs, not as common or even
> recommendable practice!

Well, I want editors to support that. Until I give up on that, I'll
continue to recommend that - especially as more and more languages
support that as a means of putting Unicode into source code.

> Otherwise you will just reveal yourself as an ASCII chauvinist

Well, I guess I'm an ASCII chauvinist then...

> P.S.: See also http://www.cl.cam.ac.uk/~mgk25/unicode.html#activate

You write

> However, most maintainers of existing applications chose instead to
> do only soft conversion and do not use the libc wide character
> functions, either because they are not yet that widely implemented
> or because this would require too many changes in their software.

For gcc, the issue is both one of portability and performance: the
wide character routines are not available on supported hosts, and the
performance hit of calling mb* routines would be unacceptable. Only
recently, the preprocessor has been improved to go over each input
character only once, and the compilers will soon use tokenization as
produced by the preprocessor (instead of tokenizing themselves all
over). All these improvements would likely be taken back if we had to
call the C library every time.

Regards,
Martin
* Re: gcc ignores locale (no UTF-8 source code supported)
From: Joseph S. Myers @ 2000-09-23 12:33 UTC
To: Martin v. Loewis; +Cc: Markus.Kuhn, libc-alpha, gcc

On Sat, 23 Sep 2000, Martin v. Loewis wrote:

> > POSIX specifies already what the "implementation-defined way of
> > determining the source character set" is that the C standard refers
> > to.
>
> Can you please point to the exact chapter and verse of Posix that
> specifies that the C compiler must consider environment variables when
> reading source code?

ISO/IEC 9945-2:1993(E) A.1.5.3 page 690 lines 130-133 describes the
effect of LC_CTYPE on the c89 utility.

GCC doesn't provide a c89 wrapper; I think it would be useful if it did
have a --enable-cc-wrappers configure option that created a link cc to
gcc, a c89 wrapper (which presently could be

#! /bin/sh
exec gcc -ansi -pedantic "$@"

with the appropriate adjustments to get the right executable and any
portability kludge needed if "$@" isn't sufficiently portable) and a c99
wrapper (as specified by the Austin Group draft). If the default will
not be to follow LC_CTYPE, then these scripts would need additional
options in the gcc invocation.

--
Joseph S. Myers
jsm28@cam.ac.uk
* Re: gcc ignores locale (no UTF-8 source code supported)
From: Markus Kuhn @ 2000-09-23 13:31 UTC
To: Martin v. Loewis; +Cc: gcc

"Martin v. Loewis" wrote on 2000-09-23 19:17 UTC:
> > POSIX specifies already what the "implementation-defined way of
> > determining the source character set" is that the C standard refers
> > to.
>
> Can you please point to the exact chapter and verse of Posix that
> specifies that the C compiler must consider environment variables when
> reading source code?

POSIX.2 requires the interpretation of LANG for most of its own
applications and in this way sets an example of good implementation
practice that should be followed by other applications as well. [I can
provide holy words of IEEE when I'm back in our departmental library on
Monday to read the precious scripture. :]

> I can easily imagine that gcc supports a -futf-8 option some day (or
> -fencoding=utf8). I hope it will never consider LANG when reading
> source code, though. That is evil.

Your -futf-8 just adds yet another entry to my long list of
non-standard command-line options for telling an application to use
UTF-8, which you can find on

  http://www.cl.cam.ac.uk/~mgk25/unicode.html#activate

UTF-8 will never fly if switching from ASCII to UTF-8 requires users to
memorize and specify two dozen different special command-line options
from now on. That is what we have the locale environment variables for
and it works pretty beautifully. Unix was fundamentally built around the
idea that files and pipes are not typed, so every effort should be made
to move towards one single globally acceptable character encoding that
will hopefully soon be as ubiquitous as ASCII.

A far cleaner solution for your needs is to extend the existing option
-pedantic to issue a warning whenever it encounters a character outside
the portable character set (such as @ or ü) in the source code. This
way, you can easily check your code if you insist on bible-proof
character portability. [Feel free to add -tripedantic which adds
warnings whenever {[]}~^ and other characters that are not present in
all ISO 646 variants occur. (Oh yes, also -morsepedantic and
-baudotpedantic for guaranteed shortwave and telex compatibility of C
source code. You really can't trust the average telegraph operator with
any non-portable characters in your precious wired C source code.)]

> I guess you don't type UTF-8 bytes byte-for-byte into the files;

Usually not, but actually, sometimes (rarely, on old ASCII terminals) I
do indeed, and that is usually not even much less convenient than \u
escapes. Why should "\u00a1" be more readable than "<C2> <A1>" (what
less produces in ASCII mode if it sees UTF-8) or \302\241 (what emacs
says in ASCII mode)? All these hex forms are equally unfriendly and
only for emergency usage.

> instead, your editor is capable of producing them on a key
> stroke. Just tell your beautiful modern system to produce
> universal-character-names when you type the keys. So the line above
> would *display* with umlauts, even though the file uses a MBCS
> encoding (namely, \u escapes).

But this works ONLY if ALL the applications handling the source code
(editor, file viewer, CASE tools, diff, CVS web browser, etc. etc. etc.
etc. etc. etc.) are familiar with the C syntax and apply a rather
non-trivial and very language-specific tokenization process before they
can display Unicode characters adequately. UTF-8 on the other hand can
be safely and robustly interpreted in components as ignorant as the
terminal emulator without any of the programs in the processing pipeline
involved having to know the slightest bit about C's token syntax.

I want "cat test.c" still to work in a user-friendly way when non-ASCII
characters are present. UTF-8 allows this, \u certainly not.

> An advanced editor (such as Emacs) is capable of dealing with multiple
> encodings, it certainly could associate C files (and C++ and Java and
> Tcl) with an encoding unicode-escape or such. Maybe it is time to
> further improve your system.

But this would just restrict me to one single kitchen-sink tool such as
Emacs, and I would still find non-ASCII characters in my source code
being treated as third-class citizens by the many many many other tools
that I use besides Emacs (wdiff, gdb, tcl tools for cvs, etc. etc.
etc.). Paste a few lines of C code into your mailer and (unless you
operate completely within a single product such as Emacs) the \uXXXX
will show up again. It is far more likely that both your C editor and
your email editor can understand UTF-8 in the near future than that both
are identical to Emacs.

> > You must not confuse the emergency hack (hex fallbacks) with the
> > daily usage on modern systems (UTF-8).
>
> Why is one multibyte encoding capable of expressing full Unicode
> (UTF-8) more modern than another one (universal character names)?

Should be obvious: UTF-8 does not require a C scanner to be processed,
but the C universal character names do. UTF-8 can be easily and safely
integrated into such dumb things as terminal emulators and can be used
end-to-end in a processing pipeline in which any ASCII sequence can have
special semantics. Universal character names are C specific and not at
all universal. Fortran, Ada95, TeX and XML all have their own
independent "universal character name" equivalents, yet all of them
could process UTF-8 smoothly. If you look at it this way (namely
portability and interoperability of tools), then UTF-8 quickly becomes
the least common denominator that enables portable exchange of non-ASCII
content across tools and platforms. \u sequences remain a C specific
fallback hack.

We should make the use of UTF-8 as easy and natural as possible. As
natural as ASCII.

> Just use the right text editor - not one that produces UTF-8,
> but one that produces universal character names. That way, you can
> have all the features you want, *and* your code will compile even if
> you take it with you when hired by a German company.

Much more likely is the scenario in which the German company is already
using the same encoding as the Thai company anyway: UTF-8.

> > b) people will prefer to have these characters UTF-8 encoded in their
> > development environment such that they see in the text editor the
> > actual characters and not the hex fallback
>
> People won't care about encodings as long as it works.

The problem is that I don't see your "editor hides \u sequences from the
user" proposal working conveniently. It will remain as ugly as base64 as
soon as you leave the confines of your editor. Don't make the same
mistake again that the email folks made with their base64 mess. It
simply does not scale across all tools that you might want to use to
touch your source code.

Markus

--
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org, WWW: <http://www.cl.cam.ac.uk/~mgk25/>
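
The -pedantic extension proposed earlier in this message amounts to a scan
for bytes outside C's basic source character set. The thread shows no
implementation; the function below is only a sketch of such a check, under
the assumption that it runs on raw source bytes before any tokenization.
The name first_nonbasic is hypothetical and not a gcc interface.

```c
#include <stddef.h>
#include <string.h>

/* Return the index of the first byte in src[0..n) that is not a member of
 * C's basic source character set, or n if every byte is a member.
 * Note that '@', '$' and '`' are ASCII but not in the basic set. */
size_t first_nonbasic(const char *src, size_t n)
{
    static const char basic[] =
        "abcdefghijklmnopqrstuvwxyz"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "0123456789"
        "!\"#%&'()*+,-./:;<=>?[\\]^_{|}~"
        " \t\v\f\n";
    size_t i;

    for (i = 0; i < n; i++) {
        /* strchr would match the terminating NUL, so reject it first. */
        if (src[i] == '\0' || strchr(basic, src[i]) == NULL)
            return i;
    }
    return n;
}
```

A compiler option built on such a scan could report the offset of the
first offending byte, which covers both examples given above: '@' and the
bytes of a UTF-8 encoded ü.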
* Re: gcc ignores locale (no UTF-8 source code supported)
From: Joern Rennecke @ 2000-09-25 16:33 UTC
To: Martin v. Loewis; +Cc: Markus.Kuhn, libc-alpha, gcc

> For gcc, the issue is both one of portability and performance: the
> wide character routines are not available on supported hosts, and the
> performance hit of calling mb* routines would be unacceptable. Only

You could autoconf for the existence of the wide character routines.
And when starting the preprocessor, you can check which 8-bit raw
characters map to a full wide character, and build a lookup table; when
doing lexical scanning, you can then fast-track the 8 bit characters.
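
The lookup-table idea above could look roughly like the following sketch.
It assumes a host that provides the standard wide-character routines
btowc, mbsinit and mbrtowc; the names single_byte,
build_single_byte_table and next_char are made up for illustration and
are not taken from gcc or its preprocessor.

```c
#include <limits.h>
#include <stddef.h>
#include <wchar.h>

/* single_byte[b] is the wide character for byte b if b is a complete
 * character on its own in the current locale, and WEOF otherwise. */
static wint_t single_byte[UCHAR_MAX + 1];

/* Build the table once at startup, after the locale has been selected. */
void build_single_byte_table(void)
{
    int b;

    for (b = 0; b <= UCHAR_MAX; b++)
        single_byte[b] = btowc(b);
}

/* Decode the next character from s (n bytes available) into *wc.
 * Single-byte characters take the table fast path and return 1;
 * everything else falls back to mbrtowc and returns its result.
 * Returns 0 when n is 0. */
size_t next_char(const char *s, size_t n, mbstate_t *st, wchar_t *wc)
{
    if (n == 0)
        return 0;
    /* The table is only valid in the initial conversion state. */
    if (mbsinit(st)) {
        wint_t w = single_byte[(unsigned char)s[0]];

        if (w != WEOF) {
            *wc = (wchar_t)w;
            return 1;
        }
    }
    return mbrtowc(wc, s, n, st);
}
```

With this layout the C library is only called for bytes that start a
multibyte sequence, so plain ASCII source pays one table lookup per
character.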
* Re: gcc ignores locale (no UTF-8 source code supported)
From: Benjamin Kosnik @ 2000-09-26 9:01 UTC
To: gcc
> You could autoconf for the existence of the wide character routines.
FYI: libstdc++-v3 already has done this. Feel free to take the
relevant bits from libstdc++-v3/acinclude.m4:894
GLIBCPP_CHECK_WCHAR_T_SUPPORT
-benjamin