Re: gcc ignores locale (no UTF-8 source code supported)

* Re: gcc ignores locale (no UTF-8 source code supported)
       [not found] <200009221934.VAA00916@loewis.home.cs.tu-berlin.de>
@ 2000-09-23  9:35 ` Markus Kuhn
  2000-09-23 12:22   ` Martin v. Loewis
  0 siblings, 1 reply; 6+ messages in thread
From: Markus Kuhn @ 2000-09-23  9:35 UTC
  To: libc-alpha; +Cc: gcc

"Martin v. Loewis" wrote on 2000-09-22 19:34 UTC:
> > It seems that gcc ignores the locale and does not use glibc's multi-byte
> > decoding functions to read in wide-string literals. :-(
> 
> I believe that gcc rightfully ignores the locale.

I strongly disagree for the reasons outlined below.

> The C standard says
> that input files are mapped to the source character set in an
> implementation-defined way; nowhere does it say that environment settings
> of the user operating the compiler should be taken into account.

If gcc runs on a POSIX system, then the POSIX spec also comes into play
and POSIX applications should clearly determine the character encoding
in all their input/output streams based on the locale setting, unless
some other way (e.g., MIME headers, command-line options,
implementation-defined source code pragmas for compilers, etc.) has been
used to override the current locale. POSIX specifies already what the
"implementation-defined way of determining the source character set" is
that the C standard refers to.

> It would be wrong to take such settings into account: the results of
> invoking the compiler would not be reproducible anymore, and it would
> not be possible to mix header files that are written in different
> encodings - who says that header files on a system have an encoding
> that necessarily matches the environment settings of some user?

First of all: Encodings are trivial to convert into each other (simply
use iconv, recode, etc.). Users on POSIX systems have to make an effort
to keep all their files in the same encoding, namely the encoding
specified in their locale. The rapid proliferation of UTF-8 will make
this actually feasible in the near future, because UTF-8 can be very
practically used in place of all other encodings. The fathers of Unix
have already decided back in 1992 (Plan9) that this is the only real way
to go and I hope the GNU/Linux world will follow soon.

I hope that one day in the not too far future I can simply place into
/etc/profile the line

  export LANG=C.UTF-8

then convert all my plain text files on my installation to UTF-8, and
from then on never have to worry about the restrictions of ASCII or the
problems of switching between different encodings any more. Sounds like
a promising idea to me, but it clearly requires also that gcc -- like
any other POSIX application that has to know the file encodings -- will
honor the locale setting.
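
A minimal sketch of how a POSIX program can find out which encoding
the locale asks for, using setlocale() and nl_langinfo(CODESET). The
helper names locale_codeset and locale_is_utf8 are made up for this
illustration, and the exact spelling of the codeset string ("UTF-8"
here) is an assumption that may differ between systems:

```c
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/* Adopt the LC_CTYPE category from the environment (LC_ALL, LC_CTYPE,
   LANG) and return the name of the character encoding it selects.
   If setlocale() fails, the program stays in the "C" locale. */
const char *locale_codeset(void)
{
    setlocale(LC_CTYPE, "");
    return nl_langinfo(CODESET);
}

/* Return 1 if the locale's encoding is reported as "UTF-8", else 0.
   Assumption: the system spells the codeset exactly this way. */
int locale_is_utf8(void)
{
    return strcmp(locale_codeset(), "UTF-8") == 0;
}
```

A tool that reads text files could call locale_is_utf8() once at
startup and pick its decoder accordingly, unless an explicit option
overrides the locale.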

> I believe that characters outside the basic character set (i.e. ASCII)
> should not be used in portable software.

The authors of the C standard made it very clear that they want to
support the ISO 10646 repertoire in source code, and I hope that this
will soon become common practice.

> If you absolutely have to
> have non-ASCII characters in your source code, you should use
> universal character names, i.e.
> 
> wprintf(L"Sch\u00f6ne Gr\u00FC\u00DFe!\n");

Please not!!! If I run on a beautiful modern system with full UTF-8
support, then I definitely want to make full use of this encoding in my
development environment. Hex escape sequences like the one above should
soon be seen as an emergency fallback mechanism for use in cases
where archaic environments (such as gcc 2.95 ;-) have to be maintained.
In such situations, a trivial recoding program can be used to convert
the normal UTF-8 source code into an ugly and user-unfriendly emergency
fallback such as L"Sch\u00f6ne Gr\u00FC\u00DFe!\n" when files are
transmitted to the archaic system. You must not confuse the emergency
hack (hex fallbacks) with the daily usage on modern systems (UTF-8).
gettext() only makes sense if support of multi-lingual messages is a
requirement. If I am a Thai student writing UTF-8 C source code for a
Thai programming class, then I want to use the Thai alphabet in
variables, comments, and wide-string literals just like you use ASCII.

I am convinced that

  a) people will use lots of non-ASCII text in C source code (even
     English-speaking people will find en/em-dashes, curly quotation marks
     and mathematical symbols a highly desirable extension beyond ASCII)
  b) people will prefer to have these characters UTF-8 encoded in their
     development environment such that they see in the text editor the
     actual characters and not the hex fallback
  c) people will find it trivial to use a 5-line Perl script to
     convert L"Schöne Grüße!\n" into L"Sch\u00f6ne Gr\u00FC\u00DFe!\n"
     in case they encounter a (hopefully soon very rare) environment
     that can't handle ISO 10646 characters. It's just like they find it
     already trivial to convert {[]}^~ into trigraphs when they
     encounter a (thank god already exceedingly rare) system that does
     not handle all ISO 646 IRV characters.
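
The recoding step in (c) can be sketched in C rather than Perl. This
is only an illustration under stated assumptions: the function names
utf8_decode and utf8_to_ucn are hypothetical, the decoder does not
reject overlong forms or surrogates, and it does not check the C
standard's restrictions on which code points a universal character
name may denote. Hex digits are emitted in lower case.

```c
#include <stddef.h>
#include <stdio.h>

/* Decode one UTF-8 sequence at s into *cp. Returns the number of bytes
   consumed, or 0 if the lead byte or a continuation byte is malformed. */
static size_t utf8_decode(const unsigned char *s, unsigned long *cp)
{
    size_t n, i;
    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { *cp = s[0] & 0x1F; n = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { *cp = s[0] & 0x0F; n = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { *cp = s[0] & 0x07; n = 4; }
    else return 0;
    for (i = 1; i < n; i++) {
        /* This test also stops at a premature terminating NUL. */
        if ((s[i] & 0xC0) != 0x80) return 0;
        *cp = (*cp << 6) | (s[i] & 0x3F);
    }
    return n;
}

/* Copy the NUL-terminated UTF-8 string in to out, replacing every
   non-ASCII character with \uXXXX or \UXXXXXXXX. Returns 0 on success,
   -1 on malformed input or if out (outsz bytes) is too small. */
int utf8_to_ucn(const char *in, char *out, size_t outsz)
{
    const unsigned char *p = (const unsigned char *)in;
    size_t used = 0;
    if (outsz == 0) return -1;
    while (*p) {
        unsigned long cp;
        size_t n = utf8_decode(p, &cp);
        int w;
        if (n == 0) return -1;
        if (cp < 0x80)
            w = snprintf(out + used, outsz - used, "%c", (int)cp);
        else if (cp <= 0xFFFF)
            w = snprintf(out + used, outsz - used, "\\u%04lx", cp);
        else
            w = snprintf(out + used, outsz - used, "\\U%08lx", cp);
        if (w < 0 || (size_t)w >= outsz - used) return -1;
        used += (size_t)w;
        p += n;
    }
    out[used] = '\0';
    return 0;
}
```

Because every byte below 0x80 is copied unchanged, the ASCII parts of
a source file pass through untouched and only the non-ASCII characters
are rewritten.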

Please please treat L"Sch\u00f6ne Gr\u00FC\u00DFe!\n" as something as
ugly and hopefully unnecessary as trigraphs, not as common or even
recommendable practice! Otherwise you will just reveal yourself as an
ASCII chauvinist and I shall condemn you to years of maximum-portable
trigraph usage ... ;-)

Markus

P.S.: See also http://www.cl.cam.ac.uk/~mgk25/unicode.html#activate

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: < http://www.cl.cam.ac.uk/~mgk25/ >

* Re: gcc ignores locale (no UTF-8 source code supported)
  2000-09-23  9:35 ` gcc ignores locale (no UTF-8 source code supported) Markus Kuhn
@ 2000-09-23 12:22   ` Martin v. Loewis
  2000-09-23 12:33     ` Joseph S. Myers
                       ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Martin v. Loewis @ 2000-09-23 12:22 UTC
  To: Markus.Kuhn; +Cc: libc-alpha, gcc

> POSIX specifies already what the "implementation-defined way of
> determining the source character set" is that the C standard refers
> to.

Can you please point to the exact chapter and verse of Posix that
specifies that the C compiler must consider environment variables when
reading source code?

> First of all: Encodings are trivial to convert into each other (simply
> use iconv, recode, etc.). Users on POSIX systems have to make an effort
> to keep all their files in the same encoding, namely the encoding
> specified in their locale. 

Users may not have the administrative permissions to do so: Most users
can not modify the files in /usr/include, for example.

> The fathers of Unix have already decided back in 1992 (Plan9) that
> this is the only real way to go and I hope the GNU/Linux world will
> follow soon.

I can easily imagine that gcc supports a -futf-8 option some day (or
-fencoding=utf8). I hope it will never consider LANG when reading
source code, though. That is evil.

> The authors of the C standard made it very clear that they want to
> support the ISO 10646 repertoire in source code, and I hope that this
> will soon become common practice.

The authors also made it pretty clear that any mechanism except for
universal-character-names will be implementation-defined, and cannot
hope to be portable across implementations. Therefore, authors of
portable software should not make use of such a mechanism.

> > wprintf(L"Sch\u00f6ne Gr\u00FC\u00DFe!\n");
> 
> Please not!!! If I run on a beautiful modern system with full UTF-8
> support, then I definitely want to make full use of this encoding in my
> development environment. 

I guess you don't type UTF-8 bytes byte-for-byte into the files;
instead, your editor is capable of producing them on a key
stroke. Just tell your beautiful modern system to produce
universal-character-names when you type the keys. So the line above
would *display* with umlauts, even though the file uses a MBCS
encoding (namely, \u escapes).

An advanced editor (such as Emacs) is capable of dealing with multiple
encodings, it certainly could associate C files (and C++ and Java and
Tcl) with an encoding unicode-escape or such. Maybe it is time to
further improve your system.

> You must not confuse the emergency hack (hex fallbacks) with the
> daily usage on modern systems (UTF-8).

Why is one multibyte encoding capable of expressing full Unicode
(UTF-8) more modern than another one (universal character names)?

> gettext() only makes sense if support of multi-lingual messages is a
> requirement. If I am a Thai student writing UTF-8 C source code for a
> Thai programming class, then I want to use the Thai alphabet in
> variables, comments, and wide-string literals just like you use ASCII.

Sure. Just use the right text editor - not one that produces UTF-8,
but one that produces universal character names. That way, you can
have all the features you want, *and* your code will compile even if
you take it with you when hired by a German company.

>   a) people will use lots of non-ASCII text in C source code (even
>      English-speaking people will find en/em-dashes, curly quotation marks
>      and mathematical symbols a highly desirable extension beyond ASCII)

Certainly, although the barrier is high even if the technical problem
were solved: Keywords in English don't mix well with non-English
identifiers, and corporate style may require all technical
documentation (including comments) to be in English.

>   b) people will prefer to have these characters UTF-8 encoded in their
>      development environment such that they see in the text editor the
>      actual characters and not the hex fallback

People won't care about encodings as long as it works.

>   c) people will find it trivial to use a 5-line Perl script to
>      convert L"Schöne Grüße!\n" into L"Sch\u00f6ne Gr\u00FC\u00DFe!\n"
>      in case they encounter a (hopefully soon very rare) environment
>      that can't handle ISO 10646 characters. 

The environment not supporting ISO 10646 characters won't support the
universal character names, either. They are just two encodings of ISO
10646 - and one of them happens to be mandated by the language
standards (ISO 9899 and ISO 14882), while the other is not.

It will be trivial to convert the source, yes - but it would be even
better if editors supported them in the first place.

> Please please treat L"Sch\u00f6ne Gr\u00FC\u00DFe!\n" as something as
> ugly and hopefully unnecessary as trigraphs, not as common or even
> recommendable practice! 

Well, I want editors to support that. Until I give up on that, I'll
continue to recommend that - especially as more and more languages
support that as a means of putting Unicode into source code.

> Otherwise you will just reveal yourself as an ASCII chauvinist

Well, I guess I'm an ASCII chauvinist then...

> P.S.: See also http://www.cl.cam.ac.uk/~mgk25/unicode.html#activate

You write

> However, most maintainers of existing applications chose instead to
> do only soft conversion and do not use the libc wide character
> functions, either because they are not yet that widely implemented
> or because this would require too many changes in their software.

For gcc, the issue is both one of portability and performance: the
wide character routines are not available on all supported hosts, and the
performance hit of calling mb* routines would be unacceptable. Only
recently, the preprocessor has been improved to go over each input
character only once, and the compilers will soon use tokenization as
produced by the preprocessor (instead of tokenizing themselves all
over). All these improvements would likely be taken back if we had to
call the C library every time.

Regards,
Martin

* Re: gcc ignores locale (no UTF-8 source code supported)
  2000-09-23 12:22   ` Martin v. Loewis
@ 2000-09-23 12:33     ` Joseph S. Myers
  2000-09-23 13:31     ` Markus Kuhn
  2000-09-25 16:33     ` Joern Rennecke
  2 siblings, 0 replies; 6+ messages in thread
From: Joseph S. Myers @ 2000-09-23 12:33 UTC
  To: Martin v. Loewis; +Cc: Markus.Kuhn, libc-alpha, gcc

On Sat, 23 Sep 2000, Martin v. Loewis wrote:

> > POSIX specifies already what the "implementation-defined way of
> > determining the source character set" is that the C standard refers
> > to.
> 
> Can you please point to the exact chapter and verse of Posix that
> specifies that the C compiler must consider environment variables when
> reading source code?

ISO/IEC 9945-2:1993(E) A.1.5.3 page 690 lines 130-133 describes the effect
of LC_CTYPE on the c89 utility.

GCC doesn't provide a c89 wrapper; I think it would be useful if it did
have a --enable-cc-wrappers configure option that created a link cc to
gcc, a c89 wrapper (which presently could be

	#! /bin/sh
	exec gcc -ansi -pedantic "$@"

with the appropriate adjustments to get the right executable and any
portability kludge needed if "$@" isn't sufficiently portable) and a c99
wrapper (as specified by the Austin Group draft).  If the default will
not be to follow LC_CTYPE, then these scripts would need additional
options in the gcc invocation.

-- 
Joseph S. Myers
jsm28@cam.ac.uk

* Re: gcc ignores locale (no UTF-8 source code supported)
  2000-09-23 12:22   ` Martin v. Loewis
  2000-09-23 12:33     ` Joseph S. Myers
@ 2000-09-23 13:31     ` Markus Kuhn
  2000-09-25 16:33     ` Joern Rennecke
  2 siblings, 0 replies; 6+ messages in thread
From: Markus Kuhn @ 2000-09-23 13:31 UTC
  To: Martin v. Loewis; +Cc: gcc

"Martin v. Loewis" wrote on 2000-09-23 19:17 UTC:
> > POSIX specifies already what the "implementation-defined way of
> > determining the source character set" is that the C standard refers
> > to.
> 
> Can you please point to the exact chapter and verse of Posix that
> specifies that the C compiler must consider environment variables when
> reading source code?

POSIX.2 requires the interpretation of LANG for most of its own
applications and this way sets an example of good implementation
practice that should be followed by other applications as well. [I can
provide holy words of IEEE when I'm back in our departmental library
on Monday to read the precious scripture. :]

> I can easily imagine that gcc supports a -futf-8 option some day (or
> -fencoding=utf8). I hope it will never consider LANG when reading
> source code, though. That is evil.

Your -futf-8 just adds yet another entry to my long list of non-standard
command-line options for telling an application to use UTF-8, which you
can find on

  http://www.cl.cam.ac.uk/~mgk25/unicode.html#activate

UTF-8 will never fly if switching from ASCII to UTF-8 requires users to
memorize and specify two dozen different special command-line options
from now on. That is what we have the locale environment variables for
and it works pretty beautifully. Unix was fundamentally built around the
idea that files and pipes are not typed, so every effort should be made
to move towards one single globally acceptable character encoding that
will hopefully soon be as ubiquitous as ASCII.

A far cleaner solution for your needs is to extend the existing option
-pedantic to issue a warning whenever it encounters a character outside
the portable character set (such as @ or ü) in the source code. This
way, you can easily check your code if you insist on bible-proof
character portability. [Feel free to add -tripedantic which adds
warnings whenever {[]}~^ and other characters that are not present in
all ISO 646 variants occur. (Oh yes, also -morsepedantic and
-baudotpedantic for guaranteed shortwave and telex compatibility of C
source code. You really can't trust the average telegraph operator with
any non-portable characters in your precious wired C source code.)]
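
The check such a warning would need is small. The sketch below is an
illustration only, not gcc code: count_nonbasic is a hypothetical
name, and the table lists the C basic source character set (letters,
digits, 29 graphic characters, space, and the four control characters
for tab, vertical tab, form feed and newline).

```c
#include <stddef.h>
#include <string.h>

/* The basic source character set of C. Note that @, $ and ` are
   not members, and neither is any byte of a UTF-8 sequence. */
static const char basic_chars[] =
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789"
    "!\"#%&'()*+,-./:;<=>?[\\]^_{|}~"
    " \t\v\f\n";

/* Count the bytes in src that fall outside the basic source character
   set. A multibyte character counts once per byte, since this sketch
   works on raw bytes and knows nothing about encodings. */
size_t count_nonbasic(const char *src)
{
    size_t count = 0;
    for (; *src; src++)
        if (strchr(basic_chars, *src) == NULL)
            count++;
    return count;
}
```

A real warning would report a line and column for each offending
character instead of a total, but the membership test is the same.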

> I guess you don't type UTF-8 bytes byte-for-byte into the files;

Usually not, but actually, sometimes (rarely, on old ASCII terminals) I
do indeed and that is usually not even much less convenient than \u
escapes. Why should "\u00a1" be more readable than "<C2> <A1>" (what
less produces in ASCII mode if it sees UTF-8) or \302\241 (what emacs
says in ASCII mode)? All these hex forms are equally unfriendly and only
for emergency usage.

> instead, your editor is capable of producing them on a key
> stroke. Just tell your beautiful modern system to produce
> universal-character-names when you type the keys. So the line above
> would *display* with umlauts, even though the file uses a MBCS
> encoding (namely, \u escapes).

But this works ONLY if ALL the applications handling the source code
(editor, file viewer, CASE tools, diff, CVS web browser, etc. etc. etc.
etc. etc. etc.) are familiar with the C syntax and apply a rather
non-trivial and very language-specific tokenization process before they
can display Unicode characters adequately. UTF-8 on the other hand can
be safely and robustly interpreted in components as ignorant as the
terminal emulator without any of the programs in the processing pipeline
involved having to know the slightest bit about C's token syntax. I want
"cat test.c" still to work in a user-friendly way when non-ASCII
characters are present. UTF-8 allows this, \u certainly not.

> An advanced editor (such as Emacs) is capable of dealing with multiple
> encodings, it certainly could associate C files (and C++ and Java and
> Tcl) with an encoding unicode-escape or such. Maybe it is time to
> further improve your system.

But this would just restrict me to one single kitchen-sink tool such as
Emacs, and I would still find non-ASCII characters in my source code
being treated as third-class citizens by the many many many other tools
that I use besides Emacs (wdiff, gdb, tcl tools for cvs, etc. etc.
etc.). Paste a few lines of C code into your mailer and (unless you
operate completely within a single product such as Emacs) the \uXXXX
will show up again. It is far more likely that both your C editor and
your email editor can understand UTF-8 in the near future than that both
are identical to Emacs.

> > You must not confuse the emergency hack (hex fallbacks) with the
> > daily usage on modern systems (UTF-8).
> 
> Why is one multibyte encoding capable of expressing full Unicode
> (UTF-8) more modern than another one (universal character names)?

Should be obvious: UTF-8 does not require a C scanner to be processed,
but the C universal character names do. UTF-8 can be easily and safely
integrated into such dumb things as terminal emulators and can be used
end-to-end in a processing pipeline in which any ASCII sequence can have
special semantics. Universal character names are C specific and not at
all universal. Fortran, Ada95, TeX and XML all have their own
independent "universal character name" equivalents, yet all of them
could process UTF-8 smoothly.
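
To illustrate how little a tool has to know to handle UTF-8, here is a
sketch of a character counter that needs no tokenizer at all. The name
utf8_length is hypothetical, and the function assumes its input is
well-formed UTF-8:

```c
#include <stddef.h>

/* Count the characters in a UTF-8 string by counting every byte that
   is not a continuation byte (bit pattern 10xxxxxx). ASCII bytes never
   occur inside a multibyte sequence, so no language syntax is needed. */
size_t utf8_length(const char *s)
{
    size_t n = 0;
    for (; *s; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}
```

Doing the same for \u escapes would require recognizing backslashes,
string and character literals, and comments first.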

If you look at it this way (namely portability and interoperability of
tools), then UTF-8 quickly becomes the least common denominator that
enables portable exchange of non-ASCII content across tools and
platforms. \u sequences remain a C specific fallback hack. We should
make the use of UTF-8 as easy and natural as possible. As natural as
ASCII.

> Just use the right text editor - not one that produces UTF-8,
> but one that produces universal character names. That way, you can
> have all the features you want, *and* your code will compile even if
> you take it with you when hired by a German company.

Much more likely is the scenario in which the German company is already
anyway using the same encoding as the Thai company: UTF-8.

> >   b) people will prefer to have these characters UTF-8 encoded in their
> >      development environment such that they see in the text editor the
> >      actual characters and not the hex fallback
> 
> People won't care about encodings as long as it works.

The problem is that I don't see your "editor hides \u sequences from the
user" proposal working conveniently. It will remain as ugly as base64 as
soon as you leave the confines of your editor. Don't make the same
mistake again that the email folks made with their base64 mess. It
simply does not scale across all tools that you might want to use to
touch your source code.

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: < http://www.cl.cam.ac.uk/~mgk25/ >

* Re: gcc ignores locale (no UTF-8 source code supported)
  2000-09-23 12:22   ` Martin v. Loewis
  2000-09-23 12:33     ` Joseph S. Myers
  2000-09-23 13:31     ` Markus Kuhn
@ 2000-09-25 16:33     ` Joern Rennecke
  2 siblings, 0 replies; 6+ messages in thread
From: Joern Rennecke @ 2000-09-25 16:33 UTC
  To: Martin v. Loewis; +Cc: Markus.Kuhn, libc-alpha, gcc

> For gcc, the issue is both one of portability and performance: the
> wide character routines are not available on all supported hosts, and the
> performance hit of calling mb* routines would be unacceptable. Only

You could autoconf for the existence of the wide character routines.
And when starting the preprocessor, you can check which 8-bit raw
characters map to a full wide character, and build a lookup table;
when doing lexical scanning, you can then fast-track the 8 bit characters.
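
One way to build such a table is sketched below. It is not taken from
gcc; build_single_byte_table is a hypothetical name, and the sketch
assumes the standard btowc() function from <wchar.h> is available
(which is what the autoconf check would establish).

```c
#include <stdio.h>
#include <wchar.h>

/* Fill is_single[b] with 1 if byte value b on its own is a complete
   character in the current locale's encoding, and with 0 otherwise.
   btowc() returns WEOF for bytes that are not valid single-byte
   characters in the initial shift state. */
void build_single_byte_table(unsigned char is_single[256])
{
    int b;
    for (b = 0; b < 256; b++)
        is_single[b] = (btowc(b) != WEOF);
}
```

The scanner can then index the table with each input byte and fall
back to the mb* routines only for bytes marked 0, so pure ASCII source
never leaves the fast path.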

* Re: gcc ignores locale (no UTF-8 source code supported)
@ 2000-09-26  9:01 Benjamin Kosnik
  0 siblings, 0 replies; 6+ messages in thread
From: Benjamin Kosnik @ 2000-09-26  9:01 UTC
  To: gcc

> You could autoconf for the existence of the wide character routines.

FYI: libstdc++-v3 already has done this. Feel free to take the
relevant bits from libstdc++-v3/acinclude.m4:894

GLIBCPP_CHECK_WCHAR_T_SUPPORT

-benjamin
