From mboxrd@z Thu Jan 1 00:00:00 1970
From: Martin von Loewis
To: eggert@twinsun.com
Cc: brolley@cygnus.com, gcc2@gnu.org, egcs@cygnus.com
Subject: Re: thoughts on martin's proposed patch for GCC and UTF-8
Date: Wed, 09 Dec 1998 23:18:00 -0000
Message-id: <199812100712.IAA00283@mira.isdn.cs.tu-berlin.de>
References: <19981204032449.3033.qmail@comton.airs.com>
 <199812060519.VAA07309@shade.twinsun.com>
 <366C0645.61C48A38@cygnus.com>
 <199812080057.QAA00491@shade.twinsun.com>
 <366D460E.4FB0ECD0@cygnus.com>
 <199812092143.NAA04890@shade.twinsun.com>
 <199812092227.XAA12100@mira.isdn.cs.tu-berlin.de>
 <199812100145.RAA07906@shade.twinsun.com>
X-SW-Source: 1998-12/msg00361.html

> I see the need for mangling, but I don't see why TREE_UNIVERSAL_CHAR
> is needed. When outputting a name, you don't need to have a separate
> flag specifying whether the identifier contains \u; you can just
> inspect the identifier string directly. This would be
> ASM_OUTPUT_LABELREF's job.

TREE_UNIVERSAL_CHAR is an optimization to avoid inspecting the
string. Note that it is defined for the C++ front end only.

The encoding of Unicode has to be done in the front end for C++: the
length of a class name depends on the encoding, and that length has
to go into the mangling. Also, if the mangling of gxxint.texi is
used, Fo\u1234 becomes U7Fo_1234, where the U indicates that the
underscore is an escape. The back end knows nothing about this
convention.

> Also, I assume that once the patch is generalized to non-UTF-8
> locales, it won't be just the \u and \U escapes that require
> mangling.

There is no need to generalise that. Defining object files to use
Unicode is the right thing :-)

> If the object-code standard is to use UTF-8 names, then I suppose
> the assembler can convert to UTF-8.

No. The gas people made it very clear that they consider character
sets somebody else's problem (i.e. ours).

> Sorry, I don't understand this point. If you're saying that C++
> mangles non-ASCII identifiers into ASCII labels, but C doesn't, then
> I don't see why that should be: there's no reason in principle that
> C couldn't or shouldn't use the same sort of mangling.

Sure there is. Look at the example above, and see how you can't
provide that service for C linkage, where names must be emitted
unmangled.

> I've run into shells that use the top bit for their own purposes.

What system?

> And, even if such shells are discounted, it's a bit odd to use UTF-8
> in configure.in without labeling the file. My Emacs (20.3)
> misidentified the file as being ISO Latin 1.

So what? This tests whether the assembler can process a certain
sequence of uninterpreted bytes (well, whether they are interpreted
at all is up to the assembler). The test is there to exercise a
feature, not to look nice in Emacs. Please tell me how I can perform
the same test with ASCII-only shell commands, and I will happily
convert.

> Really? Suppose I write the preprocessor line
>
> #if X == 1
>
> where X is some Japanese identifier, but I make the understandable
> mistake of using a FULLWIDTH DIGIT ONE (code FF11) instead of an
> ASCII 1.

\uFF11 is not a letter in C++, so this is ill-formed and will be
rejected. The same holds for the Arabic-Indic digits. If you want to
write numbers in C++, use ASCII 0-9.

Regards,
Martin
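
P.S. For illustration, here is a minimal sketch of the escape-and-
prefix mangling discussed above, assuming the gxxint.texi convention
that each \uXXXX escape in an identifier is rewritten as _XXXX and
the encoded name is prefixed with U plus its length. The function
name mangle_universal and all details are illustrative, not g++'s
actual implementation.

#include <cstddef>
#include <iostream>
#include <string>

// Sketch of the mangling convention described in the message (an
// assumption of this example, not g++'s actual code): each "\uXXXX"
// escape becomes "_XXXX", and the encoded name is prefixed with 'U'
// plus its length, so a demangler knows the underscores are escapes
// rather than literal characters.
std::string mangle_universal(const std::string& id)
{
    std::string encoded;
    for (std::size_t i = 0; i < id.size(); ++i) {
        if (id[i] == '\\' && i + 1 < id.size()
            && (id[i + 1] == 'u' || id[i + 1] == 'U')) {
            std::size_t digits = (id[i + 1] == 'u') ? 4 : 8;
            encoded += '_';                       // escape marker
            encoded += id.substr(i + 2, digits);  // the hex digits
            i += 1 + digits;                      // skip "u"/"U" + digits
        } else {
            encoded += id[i];
        }
    }
    return "U" + std::to_string(encoded.size()) + encoded;
}

int main()
{
    // "Fo\u1234" mangles to "U7Fo_1234"; 7 is the length of "Fo_1234".
    std::cout << mangle_universal("Fo\\u1234") << '\n';
}

Running this on "Fo\u1234" prints U7Fo_1234, matching the example in
the message, and shows why the back end cannot do the job: the length
7 counts the already-encoded name, so the encoding must happen before
the mangled name is formed.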