public inbox for gcc@gcc.gnu.org
* thoughts on martin's proposed patch for GCC and UTF-8
       [not found]       ` <366D460E.4FB0ECD0@cygnus.com>
@ 1998-12-09 13:44         ` Paul Eggert
  1998-12-09 14:38           ` Martin von Loewis
  0 siblings, 1 reply; 81+ messages in thread
From: Paul Eggert @ 1998-12-09 13:44 UTC (permalink / raw)
  To: Martin von Loewis, brolley; +Cc: gcc2, egcs

I took a look at Martin's proposed patch for UTF-8 support in GCC, and
have the following thoughts and suggestions.

* GCC must unify the identifiers \u00b5, \u00B5, \U000000b5, and
  \U000000B5; but GCC should not always unify these four identifiers
  to the identifier with character code B5, as this is incorrect in
  non-UTF-8 locales.

  The latest EGCS and GCC2 code already contains support for non-UTF-8
  locales, and this support is incompatible with the proposed patch.
  To get started, perhaps the proposed patch could be modified to
  report an error if it encounters \u or \U in a non-UTF-8 locale,
  saying that this is not supported yet.
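For concreteness, all four spellings above denote U+00B5, whose UTF-8 form is the two bytes 0xC2 0xB5. A minimal UTF-8 encoder sketch (illustrative only, not code from the proposed patch) shows the byte sequence involved:

```c
/* Encode a Unicode scalar value (up to U+10FFFF) as UTF-8.
   Returns the number of bytes written to out[].  This is an
   illustrative sketch, not the patch's implementation. */
int utf8_encode(unsigned long c, unsigned char *out)
{
    if (c < 0x80) {                        /* 7-bit ASCII: one byte */
        out[0] = (unsigned char) c;
        return 1;
    }
    if (c < 0x800) {                       /* two-byte form */
        out[0] = (unsigned char) (0xC0 | (c >> 6));
        out[1] = (unsigned char) (0x80 | (c & 0x3F));
        return 2;
    }
    if (c < 0x10000) {                     /* three-byte form */
        out[0] = (unsigned char) (0xE0 | (c >> 12));
        out[1] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
        out[2] = (unsigned char) (0x80 | (c & 0x3F));
        return 3;
    }
    /* four-byte form */
    out[0] = (unsigned char) (0xF0 | (c >> 18));
    out[1] = (unsigned char) (0x80 | ((c >> 12) & 0x3F));
    out[2] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
    out[3] = (unsigned char) (0x80 | (c & 0x3F));
    return 4;
}
```

Since \u00b5, \u00B5, \U000000b5, and \U000000B5 all name the same scalar value, they all encode to the same bytes, which is why they must be unified.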

* GCC should represent non-ASCII identifiers using the locale's
  preferred multibyte encoding; e.g. it should use EUC-JIS if that's
  what the locale uses.  This is the best way to make GCC work well
  with other tools in that locale.  If the locale cannot represent a
  particular Unicode character, GCC should store it in a canonicalized
  escape form (e.g. the locale's encoding for \u with lowercase alpha
  digits if it fits in 16 bits, \U with lowercase alpha digits
  otherwise); this is along the lines of what draft C9x suggests.

  Proper support for \u in non-UTF-8 locales requires a
  locale-specific translation table from Unicode to the locale's
  encoding.  We'll also need a locale-specific table that specifies
  which characters are C letters and digits, but this can be derived
  from the other table automatically.

  One way to translate from Unicode to non-UTF-8 is to have GCC use
  the iconv function if available.  iconv will be supported by glibc
  2.1; it's also been supported by Solaris 2.x for some time.  GCC
  could supply its own substitute for iconv if that's needed by
  cross-compilers, but the native iconv is generally preferable.
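A minimal sketch of the iconv-based translation suggested here (the helper name is mine, and it assumes a glibc- or Solaris-style iconv that recognizes the encoding names "UTF-8" and "ISO-8859-1"):

```c
#include <iconv.h>
#include <string.h>

/* Hypothetical helper: convert a NUL-terminated UTF-8 string to the
   named target encoding.  Returns the number of output bytes, or -1
   on failure (including characters the target cannot represent). */
long utf8_to_charset(const char *charset, const char *in,
                     char *out, size_t outsize)
{
    iconv_t cd = iconv_open(charset, "UTF-8");
    if (cd == (iconv_t) -1)
        return -1;

    char *inp = (char *) in;          /* iconv wants non-const */
    size_t inleft = strlen(in);
    char *outp = out;
    size_t outleft = outsize;

    size_t r = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);
    if (r == (size_t) -1)
        return -1;                    /* EILSEQ/EINVAL/E2BIG */
    return (long) (outp - out);
}
```

For example, U+00B5 (UTF-8 bytes 0xC2 0xB5) converts to the single byte 0xB5 in ISO-8859-1; an unrepresentable character makes iconv fail, which is exactly the case where the canonical \u escape form above would be stored instead.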

* Given the above, I don't see the need for TREE_UNIVERSAL_CHAR.  The
  identifier should be stored using the locale's multibyte chars as
  suggested above (with canonical escapes if needed), and output
  as-is, just as identifiers are now.

* HAVE_GAS_UTF8 isn't needed and to some extent doesn't fit with GCC's
  current philosophy that the user knows what he or she is doing.
  People who use multibyte chars in identifiers will expect them to go
  through to the assembler; if the assembler doesn't support them,
  they'll understand the assembler's error message.  So GCC's behavior
  shouldn't depend on whether the assembler supports multibyte chars.

  There's precedent for this: GCC already doesn't care whether the
  assembler supports dollar signs in identifiers.  If the user writes
  a function named `a$b', and the assembler doesn't support that name,
  then the assembler will report the error.  That's preferable to
  having GCC second-guess the assembler.

  Also, the configure.in test for HAVE_GAS_UTF8 has UTF-8 in it.  This
  won't work with older shells that don't allow UTF-8.  It's simpler if
  we just remove HAVE_GAS_UTF8.

* I assume that cp/universal.c is supposed to support the constraints
  on identifiers required by ISO/IEC TR 10176?  If so, it should be
  commented that way.  The code needs to be fixed to have an
  is_universal_digit function, since letters and digits have distinct
  roles in identifiers.  You need to remove `,' before `}' in the
  code, for portability to older compilers.  The code currently dumps
  core if is_uni[h]==NULL.

* The universal-char code needs to be exported out to the main GCC
  level; it's not specific to C++.

* The C compiler and preprocessor also need to support \u and
  multibyte chars.  I'll take a look at doing this, taking inspiration
  from Martin's proposed patch.

* GAS should be extended to support locales with encodings other than
  UTF-8; in particular, this means that GAS should support \u, if it
  doesn't already, as \u is needed for characters that can't be
  represented in the locale's multibyte encoding.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: thoughts on martin's proposed patch for GCC and UTF-8
  1998-12-09 13:44         ` thoughts on martin's proposed patch for GCC and UTF-8 Paul Eggert
@ 1998-12-09 14:38           ` Martin von Loewis
  1998-12-09 14:56             ` Per Bothner
  1998-12-09 17:46             ` Paul Eggert
  0 siblings, 2 replies; 81+ messages in thread
From: Martin von Loewis @ 1998-12-09 14:38 UTC (permalink / raw)
  To: eggert; +Cc: brolley, gcc2, egcs

> * GCC must unify the identifiers \u00b5, \u00B5, \U000000b5, and
>   \U000000B5; but GCC should not always unify these four identifiers
>   to the identifier with character code B5, as this is incorrect in
>   non-UTF-8 locales.

No, it shouldn't. Instead, it should convert B5 into Unicode, and use
whatever it then gets (at least for identifiers, it also needs to
check that B5 is a letter in the given locale).

One question, though - what exactly is a UTF-8 locale? E.g. would the
"C" locale qualify?

> * GCC should represent non-ASCII identifiers using the locale's
>   preferred multibyte encoding; e.g. it should use EUC-JIS if that's
>   what the locale uses.

I assume you're talking about error messages here?

>   This is the best way to make GCC work well
>   with other tools in that locale.  If the locale cannot represent a
>   particular Unicode character, GCC should store it in a canonicalized
>   escape form (e.g. the locale's encoding for \u with lowercase alpha
>   digits if it fits in 16 bits, \U with lowercase alpha digits
>   otherwise); this is along the lines of what draft C9x suggests.

Can you elaborate? What is the locale's encoding for \u, if not "\u"
literally?

>   One way to translate from Unicode to non-UTF-8 is to have GCC use
>   the iconv function if available.  iconv will be supported by glibc
>   2.1; it's also been supported by Solaris 2.x for some time.  GCC
>   could supply its own substitute for iconv if that's needed by
>   cross-compilers, but the native iconv is generally preferable.

I agree. Still, I'd like to ask whether incorporation of gconv into
egcs would be acceptable to the egcs maintainers; if this is the case,
I will work towards such an incorporation. Having gconv available
would simplify cross-compilation.

There might be a problem with the native iconv: what if it doesn't
have conversion to Unicode, or what if we don't know what the name of
Unicode is in a particular iconv implementation? (We always know for
glibc iconv, i.e. gconv.)

> * Given the above, I don't see the need for TREE_UNIVERSAL_CHAR.  The
>   identifier should be stored using the locale's multibyte chars as
>   suggested above (with canonical escapes if needed), and output
>   as-is, just as identifiers are now.

Well, it is needed for name mangling. Name mangling (in object files)
needs to be independent of the user's locale; otherwise you can't
link libraries produced by somebody else.

Please note that jc1 already defines object files to use UTF-8. It
seems that jc1 integration is an objective for cc1plus, so we need
to keep that fixed.

> * HAVE_GAS_UTF8 isn't needed and to some extent doesn't fit with GCC's
>   current philosophy that the user knows what he or she is doing.

Please give other examples of that "current" philosophy :-)

g++ usually does not assume that users know what they are doing;
instead, there is a big emphasis on binary compatibility and
protecting users from linking non-matching things.

>   People who use multibyte chars in identifiers will expect them to go
>   through to the assembler; if the assembler doesn't support them,
>   they'll understand the assembler's error message.

This is certainly true for C, but it does not hold for C++. In C++, you
can produce output even if the assembler does not support non-ASCII in
labels. Again, this is what jc1 does, and it is the right thing.

>   Also, the configure.in test for HAVE_GAS_UTF8 has UTF-8 in it.  This
>   won't work with older shells that don't allow UTF-8.  It's simpler if
>   we just remove HAVE_GAS_UTF8.

No shell supports UTF-8 as such. What systems don't support 8-bit
characters as arguments to echo?

> * I assume that cp/universal.c is supposed to support the constraints
>   on identifiers required by ISO/IEC TR 10176?

Yes. This has to be rewritten to check a list of ranges, instead of
trying to be smart.
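The range-list check could look something like this sketch (the lookup name is mine, and the ranges shown are a tiny illustrative subset, not the actual TR 10176 table):

```c
/* Sketch of a range-table identifier check: binary-search a sorted
   table of inclusive [lo, hi] Unicode ranges.  The entries below are
   only a small illustrative subset of the real letter ranges. */
struct range { unsigned long lo, hi; };

static const struct range letter_ranges[] = {
    { 0x00C0, 0x00D6 },   /* Latin-1 letters (part) */
    { 0x00D8, 0x00F6 },   /* Latin-1 letters (part) */
    { 0x4E00, 0x9FA5 },   /* CJK unified ideographs */
};

int in_ranges(unsigned long c, const struct range *r, int n)
{
    int lo = 0, hi = n - 1;
    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        if (c < r[mid].lo)
            hi = mid - 1;
        else if (c > r[mid].hi)
            lo = mid + 1;
        else
            return 1;                 /* inside r[mid] */
    }
    return 0;
}
```

A table-driven check like this is easy to regenerate from the standard's published ranges, which is the point of not "trying to be smart".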

> The code needs to be fixed to have an is_universal_digit function,
> since letters and digits have distinct roles in identifiers.

No. C++ does not distinguish between non-ASCII digits and letters.

> * The universal-char code needs to be exported out to the main GCC
>   level; it's not specific to C++.

Yes. I'd like to have a module that knows what characters the
different standards accept (it would also have is_universal_digit if
C9X requires that). The compilers then would accept any characters
that are accepted in any of the languages, except when being pedantic.

> * GAS should be extended to support locales with encodings other than
>   UTF-8; in particular, this means that GAS should support \u, if it
>   doesn't already, as \u is needed for characters that can't be
>   represented in the locale's multibyte encoding.

The current (CVS?) gas supports arbitrary 8-bit characters in strings
and identifiers; the GAS maintainers think that encodings are not
something they need to care about.  I agree (now).

Regards,
Martin

* Re: thoughts on martin's proposed patch for GCC and UTF-8
  1998-12-09 14:38           ` Martin von Loewis
@ 1998-12-09 14:56             ` Per Bothner
  1998-12-09 22:57               ` Martin von Loewis
  1998-12-09 17:46             ` Paul Eggert
  1 sibling, 1 reply; 81+ messages in thread
From: Per Bothner @ 1998-12-09 14:56 UTC (permalink / raw)
  To: Martin von Loewis; +Cc: gcc2, egcs

> Please note that jc1 already defines object files to use UTF-8.

I wouldn't put it that strongly.  The Java language allows
non-ASCII Unicode characters in identifiers.  These have to be
mangled in some standard way.  Using UTF-8 seems like the cleanest
way, but that may require use of gas and also requires an
8-bit-clean ld.

I would prefer to mangle Unicode characters using UTF-8, as that
is the cleanest solution, but there are alternative mangling schemes
which have the advantage of working with older assemblers.  I don't
have a good handle on how important that is.

	--Per Bothner
Cygnus Solutions     bothner@cygnus.com     http://www.cygnus.com/~bothner

* Re: thoughts on martin's proposed patch for GCC and UTF-8
  1998-12-09 14:38           ` Martin von Loewis
  1998-12-09 14:56             ` Per Bothner
@ 1998-12-09 17:46             ` Paul Eggert
  1998-12-09 18:01               ` Tim Hollebeek
                                 ` (3 more replies)
  1 sibling, 4 replies; 81+ messages in thread
From: Paul Eggert @ 1998-12-09 17:46 UTC (permalink / raw)
  To: martin; +Cc: brolley, gcc2, egcs

   Date: Wed, 9 Dec 1998 23:27:17 +0100
   From: Martin von Loewis <martin@mira.isdn.cs.tu-berlin.de>

   > * GCC should represent non-ASCII identifiers using the locale's
   >   preferred multibyte encoding

   I assume you talk about error messages, here?

I'm talking about every place that GCC outputs identifiers.  This
includes error messages, assembler output, and other auxiliary text
output (e.g. -dM output).

   >   If the locale cannot represent a particular Unicode character,
   >   GCC should store it in a canonicalized escape form (e.g. the
   >   locale's encoding for \u with lowercase alpha digits if it fits
   >   in 16 bits, \U with lowercase alpha digits otherwise); this is
   >   along the lines of what draft C9x suggests.

   Can you elaborate? What is the locale's encoding for \u, if not "\u"
   literally?

If it were up to me, it would always be \u or \U as specified above.
We have to be a bit careful, though, as we can't just splice
e.g. "\u1234" into a multibyte sequence; we may need to surround the
"\u1234" with bytes that bring us to the initial shift state and back
again.
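For reference, the canonical spelling described above could be produced like this (a sketch; the function name is mine, and the shift-state bracketing would have to be added around it for stateful encodings):

```c
#include <stdio.h>
#include <string.h>

/* Canonical escape form as proposed: \u with four lowercase hex
   digits if the code point fits in 16 bits, \U with eight lowercase
   hex digits otherwise.  Sketch only; buf must hold 11+ bytes. */
void canonical_escape(unsigned long c, char *buf)
{
    if (c <= 0xFFFFUL)
        sprintf(buf, "\\u%04lx", c);
    else
        sprintf(buf, "\\U%08lx", c);
}
```

So U+00B5 is always stored as the seven characters \u00b5, regardless of whether the programmer wrote \u00B5 or \U000000B5, which gives a single canonical spelling to compare and to output.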

The platform may have its own idea of the canonicalized escape form.
ASM_OUTPUT_LABELREF could arrange to convert to the platform's format.

   what exactly is a UTF-8 locale?

A locale that uses UTF-8 encoding for multibyte characters, and
(obviously) that uses the Unicode character set.

   E.g. would the "C" locale qualify?

No, typically "C" uses plain ASCII with no multibyte chars, though it
can and sometimes does use something else (e.g. ASCII subsets).

   >   One way to translate from Unicode to non-UTF-8 is to have GCC use
   >   the iconv function if available.

   There might be a problem with the native iconv: what if it doesn't
   have conversion to Unicode, or what if we don't know what the name of
   Unicode is in a particular iconv implementation? (We always know for
   glibc iconv, i.e. gconv.)

Yes, but that's just one thing to configure.  And the default value,
"UTF-8", will probably work on most hosts.

There's also the problem of getting the name of the encoding for the
current locale.  The XPG4 way of doing this is `nl_langinfo (CODESET)'.
This is supported by glibc 2.1 and by recent Solaris versions.
Systems that don't support this will have to be supported ad hoc
(if at all).
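On systems that do support it, the XPG4 query is nearly a one-liner (a sketch in the glibc 2.1 / recent Solaris style; the wrapper name is mine):

```c
#include <langinfo.h>
#include <locale.h>

/* Return the name of the current locale's character encoding,
   e.g. "UTF-8", "EUC-JP", or (in the "C" locale on glibc)
   "ANSI_X3.4-1968".  Sketch for XPG4-conforming systems only. */
const char *locale_codeset(void)
{
    setlocale(LC_CTYPE, "");      /* adopt the user's locale */
    return nl_langinfo(CODESET);
}
```

The string this returns is what would be handed to iconv_open as the target encoding name.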

   I'd like to ask whether incorporation of gconv into egcs would be
   acceptable to the egcs maintainers; if this is the case, I will
   work towards such an incorporation. Having gconv available would
   simplify cross-compilation.

Since egcs already maintains its own idea of S-JIS etc., I suspect
that there'd be no objection to its also maintaining its own idea of
what the translation tables should be, for hosts that don't already
have them.

   > * Given the above, I don't see the need for TREE_UNIVERSAL_CHAR.  The
   >   identifier should be stored using the locale's multibyte chars as
   >   suggested above (with canonical escapes if needed), and output
   >   as-is, just as identifiers are now.

   Well, it is needed for name mangling. Name mangling (in object files)
   needs to be independent from the user's locale; otherwise you can't
   link libraries produced by somebody else.

I see the need for mangling, but I don't see why TREE_UNIVERSAL_CHAR
is needed.  When outputting a name, you don't need to have a separate
flag specifying whether the identifier contains \u; you can
just inspect the identifier string directly.  This would be
ASM_OUTPUT_LABELREF's job.

Also, I assume that once the patch is generalized to non-UTF-8
locales, it won't be just the \u and \U escapes that require mangling.
If the goal is to link libraries that were built in other locales,
then we'll also need to mangle the non-UTF-8 multibyte chars.  Again,
this doesn't sound like a high-level concept that needs to be in the
parse tree -- it's just a low-level thing that can be done on output.

Perhaps it's just an efficiency thing?  If so, then this should be
made a bit clearer, and the flag name changed to
TREE_NAME_NEEDS_ASCIIFYING or something like that, with the other
identifier names changed accordingly.

   jc1 already defines object files to use UTF-8. It seems that jc1
   integration is an objective for cc1plus, so we need to keep that fixed.

If the compilation locale uses, say, Shift-JIS, then the assembly
language text file should use shift-JIS, as this is what the
programmer's tools will expect.

If the object-code standard is to use UTF-8 names, then I suppose the
assembler can convert to UTF-8.  (Object code isn't text, so the usual
rules about text locales don't apply to it.)  But this would mean that
the assembler would have to understand character conversion, which is
an unwanted complication.

However, if jc1 is meant only to be used in UTF-8 locales (which seems
likely), then we needn't worry about this.  We just tell people that
they have to use a UTF-8 locale if they want to use jc1 with non-"C"
names, because jc1 object files must use UTF-8.  This would be an
understandable restriction, and it means we could avoid having to do
the translations in either the compiler or the assembler.

   >   People who use multibyte chars in identifiers will expect them to go
   >   through to the assembler; if the assembler doesn't support them,
   >   they'll understand the assembler's error message.

   This is certainly true for C, but it does not hold for C++. In C++, you
   can produce output even if the assembler does not support non-ASCII in
   labels.

Sorry, I don't understand this point.  If you're saying that C++
mangles non-ASCII identifiers into ASCII labels, but C doesn't, then I
don't see why that should be: there's no reason in principle that C
couldn't or shouldn't use the same sort of mangling.

If the assembler requires some form of mangling from non-ASCII
identifiers into ASCII labels, then shouldn't this be
ASM_OUTPUT_LABELREF's job, or something like that?  I don't see why
the issue is specific to C++; it sounds like it's general to all
languages with non-ASCII identifiers.

   >   Also, the configure.in test for HAVE_GAS_UTF8 has UTF-8 in it.  This
   >   won't work with older shells that don't allow UTF-8.  It's simpler if
   >   we just remove HAVE_GAS_UTF8.

   What systems don't support 8-bit characters as arguments to echo?

I've run into shells that use the top bit for their own purposes.

And, even if such shells are discounted, it's a bit odd to use UTF-8
in configure.in without labeling the file.  My Emacs (20.3)
misidentified the file as being ISO Latin 1.  It'd be better if you
rewrote the configure.in test in ASCII, so we didn't have to worry
about gotchas like this.  It should be fairly easy to do this with tr.

   C++ does not distinguish between non-ASCII digits and letters.

Really?  Suppose I write the preprocessor line

#if X == 1

where X is some Japanese identifier, but I make the understandable
mistake of using a FULLWIDTH DIGIT ONE (code FF11) instead of an ASCII 1.
What you're saying is that the preprocessor is obliged to treat this
line as if it were

#if X == 0

because undeclared preprocessor identifiers default to zero?

This seems to me to be asking for trouble; it's a common mistake in
Japanese text.  If C++ requires this, then I suggest that C++ by
default should warn about identifiers beginning with digits.  (It also
means yet another difference between the C and C++ preprocessors,
sigh.)
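To make the rule concrete, here is the behavior in question (an illustration of standard preprocessor semantics, not of any proposed change; the macro name is arbitrary): an identifier that is not defined as a macro evaluates as 0 inside #if.

```c
/* SOME_UNDEFINED_IDENTIFIER is never #defined, so in the #if
   expression it evaluates as 0, 0 == 1 is false, and the #else
   branch is the one that gets compiled. */
#if SOME_UNDEFINED_IDENTIFIER == 1
static const int branch_taken = 1;   /* not compiled */
#else
static const int branch_taken = 0;   /* this branch is compiled */
#endif
```

A \uFF11 mistyped for ASCII 1 would lex as (part of) an identifier, not a number, and would silently fall into exactly this undefined-is-zero case.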

* Re: thoughts on martin's proposed patch for GCC and UTF-8
  1998-12-09 17:46             ` Paul Eggert
@ 1998-12-09 18:01               ` Tim Hollebeek
  1998-12-10  5:58                 ` Craig Burley
  1998-12-09 23:03               ` Per Bothner
                                 ` (2 subsequent siblings)
  3 siblings, 1 reply; 81+ messages in thread
From: Tim Hollebeek @ 1998-12-09 18:01 UTC (permalink / raw)
  To: Paul Eggert; +Cc: martin, brolley, gcc2, egcs

Paul Eggert writes ...
> 
> Really?  Suppose I write the preprocessor line
> 
> #if X == 1
> 
> where X is some Japanese identifier, but I make the understandable
> mistake of using a FULLWIDTH DIGIT ONE (code FF11) instead of an ASCII 1.
> What you're saying is that the preprocessor is obliged to treat this
> line as if it were
> 
> #if X == 0
> 
> because undeclared preprocessor identifiers default to zero?

IMO, the problem here has nothing to do with Japanese, and everything
to do with the fact that this rule is error prone in general.  I've
been meaning to implement -Wpreprocessor-undeclared for a while now.

In fact, there is an instance of a typo in the SGI standard header
files which has never been caught by any ANSI C compiler because of
this "undeclared macro == 0" rule, despite the fact that the rule only
exists to support source files that predate the existence of #ifdef.

gcc really should complain "undeclared identifier 'x' in preprocessor
conditional expression has value 0" when -Wall is specified.  Only
BASIC and Perl programmers, and Fortran programmers who don't use
"implicit none", should have to worry about creating new variables
every time they make a typo.

---------------------------------------------------------------------------
Tim Hollebeek                           | "Everything above is a true
email: tim@wfn-shop.princeton.edu       |  statement, for sufficiently
URL: http://wfn-shop.princeton.edu/~tim |  false values of true."

* Re: thoughts on martin's proposed patch for GCC and UTF-8
  1998-12-09 14:56             ` Per Bothner
@ 1998-12-09 22:57               ` Martin von Loewis
  1998-12-09 23:16                 ` Per Bothner
  0 siblings, 1 reply; 81+ messages in thread
From: Martin von Loewis @ 1998-12-09 22:57 UTC (permalink / raw)
  To: bothner; +Cc: gcc2, egcs

> I would prefer to mangle Unicode characters using UTF-8, as that
> is the cleanest solution, but there are alternative mangling schemes
> which have the advantage of working with older assemblers.  I don't
> have a good handle on how important that is.

My mistake. jc1 does not mandate UTF-8; what it does mandate is
Unicode (in some form), right? Paul is proposing that assembler files
should be in the source character set; I think this is the wrong way.

Martin

* Re: thoughts on martin's proposed patch for GCC and UTF-8
  1998-12-09 17:46             ` Paul Eggert
  1998-12-09 18:01               ` Tim Hollebeek
@ 1998-12-09 23:03               ` Per Bothner
  1998-12-10  7:49                 ` Ian Lance Taylor
  1998-12-11 19:23                 ` Paul Eggert
  1998-12-09 23:18               ` Martin von Loewis
       [not found]               ` <199812100200.VAA06419.cygnus.egcs@wagner.Princeton.EDU>
  3 siblings, 2 replies; 81+ messages in thread
From: Per Bothner @ 1998-12-09 23:03 UTC (permalink / raw)
  To: Paul Eggert; +Cc: gcc2, egcs

> However, if jc1 is meant only to be used in UTF-8 locales (which seems
> likely), then we needn't worry about this.  We just tell people that
> they have to use a UTF-8 locale if they want to use jc1 with non-"C"
> names, because jc1 object files must use UTF-8.  This would be an
> understandable restriction, and it means we could avoid having to do
> the translations in either the compiler or the assembler.

I'm not sure about jc1, but gcj (the preferred user-level driver)
is not meant to be used only in UTF-8 locales.  Java only uses Unicode
*internally*, but we need to be able to read non-Unicode / non-UTF-8
*files*.  And Java defines a mechanism where you can specify an
encoding to use when translating external byte streams to/from
internal Unicode streams.  However, Java does not define the
external encoding of Java program files, but only that after
processing \u and \U escapes the input to the lexer is a stream
of Unicode characters.

This is a somewhat hypothetical problem, as we have no experience
with the extent, if any, to which people need to be able to use
non-ASCII characters in their source files.  But I assume they will
want to do that in their locale's text encoding - which need not be a
"UTF-8" locale.  In that case, jc1 (or a pre-processor for jc1)
has to translate the locale's character set into Unicode.

It is reasonable that the default locale for source files (i.e.
the one assumed if you don't override things) should be UTF-8.

The locale for assembler files should probably also be UTF-8.
I see no reason to support anything else.  What we might do
is have gasp (the gas pre-processor) provide a hook for
converting from other character sets.  But gas itself
should just assume UTF-8 - and generate ld symbols that
are also UTF-8.  (A simple implementation is for gas to
just recognize that bytes that have the high-order bit set
should be treated as (part of) letters.)
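The "simple implementation" mentioned above might look like this (a sketch; the predicate name is mine, and real gas differs in detail):

```c
/* Sketch of the simple rule: any byte with the high-order bit set is
   treated as (part of) an identifier letter, so UTF-8 multibyte
   sequences pass through gas into symbol names unmodified. */
int is_label_char(unsigned char c)
{
    return (c & 0x80) != 0                  /* any UTF-8 tail/lead byte */
        || (c >= 'a' && c <= 'z')
        || (c >= 'A' && c <= 'Z')
        || (c >= '0' && c <= '9')
        || c == '_' || c == '.' || c == '$';
}
```

With a rule like this the assembler never needs to decode UTF-8 at all; the bytes simply flow through into the symbol table.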

Similarly, gdb and ld should assume that labels are UTF-8.

	--Per Bothner
Cygnus Solutions     bothner@cygnus.com     http://www.cygnus.com/~bothner

* Re: thoughts on martin's proposed patch for GCC and UTF-8
  1998-12-09 22:57               ` Martin von Loewis
@ 1998-12-09 23:16                 ` Per Bothner
  1998-12-11 19:27                   ` Paul Eggert
  0 siblings, 1 reply; 81+ messages in thread
From: Per Bothner @ 1998-12-09 23:16 UTC (permalink / raw)
  To: Martin von Loewis; +Cc: gcc2, egcs

> Paul is proposing that assembler files
> should be in the source character set; I think this is the wrong way.

Well, it seems clear that symbols in .o files have to be in a
locale-independent encoding.  That to me seems to mandate UTF-8.

It is less clear what encoding we should use for assembler files,
but given that the assembler translates to UTF-8, that the
assembler is primarily used for compiler output files, and
that assembly files are traditionally low-level and close
to the .o files, that suggests to me that assembler files
should also be in UTF-8, at least for compiler-generated
.s files.  Human-written .s files will probably be in
the source locale, so we may need a pre-processor (possibly
gasp) to convert to UTF-8.

	--Per Bothner
Cygnus Solutions     bothner@cygnus.com     http://www.cygnus.com/~bothner

* Re: thoughts on martin's proposed patch for GCC and UTF-8
  1998-12-09 17:46             ` Paul Eggert
  1998-12-09 18:01               ` Tim Hollebeek
  1998-12-09 23:03               ` Per Bothner
@ 1998-12-09 23:18               ` Martin von Loewis
  1998-12-10  7:57                 ` Ian Lance Taylor
  1998-12-11 19:28                 ` Paul Eggert
       [not found]               ` <199812100200.VAA06419.cygnus.egcs@wagner.Princeton.EDU>
  3 siblings, 2 replies; 81+ messages in thread
From: Martin von Loewis @ 1998-12-09 23:18 UTC (permalink / raw)
  To: eggert; +Cc: brolley, gcc2, egcs

> I see the need for mangling, but I don't see why TREE_UNIVERSAL_CHAR
> is needed.  When outputting a name, you don't need to have a separate
> flag specifying whether whether the identifier contains \u; you can
> just inspect the identifier string directly.  This would be
> ASM_OUTPUT_LABELREF's job.

TREE_UNIVERSAL_CHAR is an optimization to avoid inspecting the string.
Note that it is defined for the C++ front-end only.

The encoding of Unicode has to be done in the front-end for C++; the
length of a class name depends on the encoding, and it has to get into
the mangling.

Also, if the mangling of gxxint.texi is used, Fo\u1234 becomes
U7Fo_1234, where the U indicates that the underscore is an escape.
The backend can't know about this concept.
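A sketch of that mangling, assuming the gxxint.texi scheme as described (the helper name, buffer sizes, and exact rules here are my guesses, given only the Fo\u1234 example):

```c
#include <stdio.h>
#include <string.h>

/* Sketch: mangle an ASCII prefix plus one \u escape into the
   U<length><name-with-_xxxx> form, where the leading 'U' marks the
   underscore as an escape and the count covers the escaped spelling.
   Illustrative only; real names can mix many escapes. */
void mangle_id(const char *ascii_part, unsigned uchar, char *out)
{
    char body[64];
    sprintf(body, "%s_%04x", ascii_part, uchar);     /* "Fo_1234" */
    sprintf(out, "U%u%s", (unsigned) strlen(body), body);
}
```

This shows why the length must be computed in the front end: the count 7 covers the escaped spelling "Fo_1234", not the two characters of the source identifier.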

> Also, I assume that once the patch is generalized to non-UTF-8
> locales, it won't be just the \u and \U escapes that require mangling.

There is no need to generalise that. Defining object files to use
Unicode is the right thing :-)

> If the object-code standard is to use UTF-8 names, then I suppose the
> assembler can convert to UTF-8.

No. The gas people made it very clear that they consider character sets
somebody else's problems (i.e. ours).

> Sorry, I don't understand this point.  If you're saying that C++
> mangles non-ASCII identifiers into ASCII labels, but C doesn't, then I
> don't see why that should be: there's no reason in principle that C
> couldn't or shouldn't use the same sort of mangling.

Sure there is. Look at the example above, and see how you can't do
that service for C linkage.

> I've run into shells that use the top bit for their own purposes.

What system?

> 
> And, even if such shells are discounted, it's a bit odd to use UTF-8
> in configure.in without labeling the file.  My Emacs (20.3)
> misidentified the file as being ISO Latin 1.

So what? This tests whether the assembler can process a certain
sequence of uninterpreted bytes (whether they end up being interpreted
is up to the assembler).  The test exists to test a feature, not to
look nice in Emacs.  Please tell me how I can perform the same test
with ASCII-only shell commands, and I'll happily convert.

> Really?  Suppose I write the preprocessor line
> 
> #if X == 1
> 
> where X is some Japanese identifier, but I make the understandable
> mistake of using a FULLWIDTH DIGIT ONE (code FF11) instead of an ASCII 1.

\uFF11 is not a letter in C++, so this is ill-formed and will be
rejected. The same holds for the Arabic digits. If you want to write
numbers in C++, use ASCII 0-9.

Regards,
Martin

* Re: thoughts on martin's proposed patch for GCC and UTF-8
  1998-12-09 18:01               ` Tim Hollebeek
@ 1998-12-10  5:58                 ` Craig Burley
  1998-12-10 10:21                   ` Tim Hollebeek
  1998-12-10 14:23                   ` Chip Salzenberg
  0 siblings, 2 replies; 81+ messages in thread
From: Craig Burley @ 1998-12-10  5:58 UTC (permalink / raw)
  To: tim; +Cc: burley

>gcc really should complain "undeclared identifier 'x' in preprocessor
>conditional expression has value 0" when -Wall is specified.  Only
>BASIC and perl programmers, and Fortran programmers who don't use
>"implicit none", should have to worry about creating new variables
>every time they make a typo.

Long a pet peeve of mine as well, though I wonder if fixing this might
cause gcc to warn about too many constructs like

  #if defined (FOO) && (FOO == 1)

and:

  #ifdef FOO
  #if FOO == 1

The warning could be made smart enough to avoid most, maybe all, such
spurious warnings.

If not all, the documentation should be pretty clear about what to
do, and what to not bother doing, to eliminate the warnings.

        tq vm, (burley)

* Re: thoughts on martin's proposed patch for GCC and UTF-8
  1998-12-09 23:03               ` Per Bothner
@ 1998-12-10  7:49                 ` Ian Lance Taylor
  1998-12-11 19:23                 ` Paul Eggert
  1 sibling, 0 replies; 81+ messages in thread
From: Ian Lance Taylor @ 1998-12-10  7:49 UTC (permalink / raw)
  To: bothner; +Cc: eggert, gcc2, egcs

   Date: Wed, 09 Dec 1998 23:02:40 -0800
   From: Per Bothner <bothner@cygnus.com>

   The locale for assembler files should probably also be UTF-8.
   I see no reason to support anything else.  What we might do
   is have gasp (the gas pre-processor) provide a hook for
   converting from other character sets.

As far as I'm concerned, gasp is dead.  It served a purpose for a
time, which was to provide a richer set of assembly language
operations.  However, gas itself now has all the interesting features
which were once found only in gasp (basically, macros).  I don't want
to see any plan that relies on using gasp.

   But gas itself
   should just assume UTF-8 - and generate ld symbols that
   are also UTF-8.  (A simple implementation is for gas to
   just recognize that bytes that have the high-order bit set
   should be treated as (part of) letters.)

This change has already been made in the development sources, and will
be in the next release.

Ian

* Re: thoughts on martin's proposed patch for GCC and UTF-8
  1998-12-09 23:18               ` Martin von Loewis
@ 1998-12-10  7:57                 ` Ian Lance Taylor
  1998-12-10 13:12                   ` Martin von Loewis
                                     ` (2 more replies)
  1998-12-11 19:28                 ` Paul Eggert
  1 sibling, 3 replies; 81+ messages in thread
From: Ian Lance Taylor @ 1998-12-10  7:57 UTC (permalink / raw)
  To: martin; +Cc: eggert, brolley, gcc2, egcs

   Date: Thu, 10 Dec 1998 08:12:20 +0100
   From: Martin von Loewis <martin@mira.isdn.cs.tu-berlin.de>

   > If the object-code standard is to use UTF-8 names, then I suppose the
   > assembler can convert to UTF-8.

   No. The gas people made it very clear that they consider character sets
   somebody else's problems (i.e. ours).

That is too strong.  For hand coded assembler, I can see that there
may be a need for gas to do some character set conversions.  Also, if
it is ever possible for an identifier name to include a byte value
which gas will consider to be an operator, then it is clearly
necessary for gas to permit quoting that byte value, and perhaps to do
more general character set conversions.

In general, though, if gcc needs to understand character set issues,
which appears to be the case, and if it can emit identifiers in a
manner which will not confuse gas, then I think it is reasonable for
gcc to emit identifiers as uninterpreted byte sequences, and for gas
to simply pass those identifiers straight through into the object
file.

I can't claim to understand many of the issues here, though.

Several people have mentioned the linker as an issue.  To the best of
my knowledge, the linker will permit any byte value except 0 to appear
in an identifier.  I don't see why the linker has to change at all for
any character set issues.

Ian

* Re: thoughts on martin's proposed patch for GCC and UTF-8
  1998-12-10  5:58                 ` Craig Burley
@ 1998-12-10 10:21                   ` Tim Hollebeek
  1998-12-10 11:50                     ` Craig Burley
  1998-12-10 14:23                   ` Chip Salzenberg
  1 sibling, 1 reply; 81+ messages in thread
From: Tim Hollebeek @ 1998-12-10 10:21 UTC (permalink / raw)
  To: Craig Burley; +Cc: eggert, martin, brolley, gcc2, egcs, burley

Craig Burley writes ...
> 
> >gcc really should complain "undeclared identifier 'x' in preprocessor
> >conditional expression has value 0" when -Wall is specified.  Only
> >BASIC and perl programmers, and Fortran programmers who don't use
> >"implicit none", should have to worry about creating new variables
> >every time they make a typo.
> 
> Long a pet peeve of mine as well, though I wonder if fixing this might
> cause gcc to warn about too many constructs like
> 
>   #if defined (FOO) && (FOO == 1)
> 
> and:
> 
>   #ifdef FOO
>   #if FOO == 1
> 
> The warning could be made smart enough to avoid most, maybe all, such
> spurious warnings.
> 
> If not all, the documentation should be pretty clear about what to
> do, and what to not bother doing, to eliminate the warnings.

Good points.  The second isn't a problem, though, since if FOO isn't
defined, we're skipping when we see #if, and don't need to parse the
expression.  In fact, if I remember correctly, ANSI forbids parsing of
the expression (other than recognizing pp-tokens).

The first case is more important, and one I hadn't thought of.
However, the boolean operators that short circuit are the only ones
that don't use one operand, so it seems consistent with C to not
evaluate (and hence not warn about) arguments that are short
circuited.  I believe this isn't ad hoc, and avoids all spurious
warnings in a consistent manner.

---------------------------------------------------------------------------
Tim Hollebeek                           | "Everything above is a true
email: tim@wfn-shop.princeton.edu       |  statement, for sufficiently
URL: http://wfn-shop.princeton.edu/~tim |  false values of true."

* Re: thoughts on martin's proposed patch for GCC and UTF-8
       [not found]               ` <199812100200.VAA06419.cygnus.egcs@wagner.Princeton.EDU>
@ 1998-12-10 11:31                 ` Jonathan Larmour
  0 siblings, 0 replies; 81+ messages in thread
From: Jonathan Larmour @ 1998-12-10 11:31 UTC (permalink / raw)
  To: tim; +Cc: gcc2, egcs

In article <199812100200.VAA06419.cygnus.egcs@wagner.Princeton.EDU> you write:
[ #if X == 1 is a problem if "1" is not ASCII 1 ]
>gcc really should complain "undeclared identifier 'x' in preprocessor
>conditional expression has value 0" when -Wall is specified.

It already does this check if you use -Wundef.  It's in the info page,
not the man page, unfortunately.

`-Wundef'
     Warn if an undefined identifier is evaluated in an `#if' directive.

Jifl
-- 
Cygnus Solutions, 35 Cambridge Place, Cambridge, UK.  Tel: +44 (1223) 728762
"Women marry hoping their husbands will change, men||Home e-mail: jifl @ 
marry hoping their wives never do. Both are rare." ||     jifvik.demon.co.uk
Help fight spam! http://spam.abuse.net/  These opinions are all my own fault

* Re: thoughts on martin's proposed patch for GCC and UTF-8
  1998-12-10 10:21                   ` Tim Hollebeek
@ 1998-12-10 11:50                     ` Craig Burley
  0 siblings, 0 replies; 81+ messages in thread
From: Craig Burley @ 1998-12-10 11:50 UTC (permalink / raw)
  To: tim; +Cc: burley

>>   #if defined (FOO) && (FOO == 1)
>> 
>> and:
>> 
>>   #ifdef FOO
>>   #if FOO == 1
>> 
>Good points.  The second isn't a problem, though, since if FOO isn't
>defined, we're skipping when we see #if, and don't need too parse the
>expression.  In fact, if I remember correctly, ANSI forbids parsing of
>the expression (other than recognizing pp-tokens).

That's good to hear, and I had hoped it would be the case, but thought
I should mention it anyway.

>The first case is more important, and one I hadn't thought of.
>However, the boolean operators that short circuit are the only ones
>that don't use one operand, so it seems consistent with C to not
>evaluate (and hence not warn about) arguments that are short
>circuited.  I believe this isn't ad hoc, and avoids all spurious
>warnings in a consistent manner.

It sounds like you're saying the only binary operators are && and
||, which doesn't sound quite right.  && and || are the logical AND
and OR operators, according to my 1998-12-07 draft copy of the ANSI
C standard, and I believe the &, |, and other bitwise operators,
plus the +, -, * and / integer operators are supported for preprocessor
directives as well.

But I'm not sure this changes what you're saying.

What might, though, is the implementation.  It might insist on
expanding macros beyond a short-circuit, and, if it does, you
can't just change it so that, when it replaces an undefined
macro with 0, it optionally warns.

        tq vm, (burley)

* Re: thoughts on martin's proposed patch for GCC and UTF-8
  1998-12-10  7:57                 ` Ian Lance Taylor
@ 1998-12-10 13:12                   ` Martin von Loewis
  1998-12-11 19:32                   ` Paul Eggert
  1998-12-11 19:34                   ` Ken Raeburn
  2 siblings, 0 replies; 81+ messages in thread
From: Martin von Loewis @ 1998-12-10 13:12 UTC (permalink / raw)
  To: ian; +Cc: eggert, brolley, gcc2, egcs

> Also, if it is ever possible for an identifier name to include a
> byte value which gas will consider to be an operator, then it is
> clearly necessary for gas to permit quoting that byte value, and
> perhaps to do more general character set conversions.

Fortunately, UTF-8 only uses byte values above 128, plus ASCII; so you
don't get special characters in identifiers, since those are already
banned by C.

> Several people have mentioned the linker as an issue.  To the best of
> my knowledge, the linker will permit any byte value except 0 to appear
> in an identifier.  I don't see why the linker has to change at all for
> any character set issues.

I've tried the binutils linker, and it is happy with any byte
sequence. Of course, there still might be linkers that do care about
characters above 128. Maybe we should perform some manual tests now,
or even have an autoconf test. OTOH, people will complain when the
linker complains...

Regards,
Martin

* Re: thoughts on martin's proposed patch for GCC and UTF-8
  1998-12-10  5:58                 ` Craig Burley
  1998-12-10 10:21                   ` Tim Hollebeek
@ 1998-12-10 14:23                   ` Chip Salzenberg
  1 sibling, 0 replies; 81+ messages in thread
From: Chip Salzenberg @ 1998-12-10 14:23 UTC (permalink / raw)
  To: Craig Burley; +Cc: tim, eggert, martin, brolley, gcc2, egcs

<advocacy subject="perl">

Tim writes:
> Only BASIC and perl programmers, and Fortran programmers who don't
> use "implicit none", should have to worry about creating new variables
> every time they make a typo.

That should be: "Perl programmers who don't 'use strict'".

Thanks.

</advocacy>
-- 
Chip Salzenberg      - a.k.a. -      <chip@perlsupport.com>
      "When do you work?"   "Whenever I'm not busy."

* Re: thoughts on martin's proposed patch for GCC and UTF-8
  1998-12-09 23:03               ` Per Bothner
  1998-12-10  7:49                 ` Ian Lance Taylor
@ 1998-12-11 19:23                 ` Paul Eggert
  1998-12-12  2:21                   ` Martin von Loewis
  1 sibling, 1 reply; 81+ messages in thread
From: Paul Eggert @ 1998-12-11 19:23 UTC (permalink / raw)
  To: bothner; +Cc: gcc2, egcs

   Date: Wed, 09 Dec 1998 23:02:40 -0800
   From: Per Bothner <bothner@cygnus.com>

   This is a somewhat hypothetical problem, as we have no experience
   with the extent, if any, to which people need to be able to use
   non-ASCII characters in their source files.

I have some experience; we sometimes use gcc that way here.

   But I assume they will want to do that in their locale's text
   encoding - which need not be a "UTF-8" locale.

Yes; this is already widespread practice for C strings.

   In that case, jc1 (or a pre-processor for jc1) has to translate the
   locale's character set into Unicode.

It could also be done by a postprocessor for jc1.

   The locale for assembler files should probably also be UTF-8.

This disagrees with existing practice with C strings.  I don't think
it's wise to commit now to UTF-8 for all assembler files.  Among other
things, it'd mean you couldn't look at the files with Emacs (as the
latest Emacs doesn't support UTF-8).

I have misgivings about having GCC support multiple locales
simultaneously.  Multilingual applications are the province of fancy
text editors like Emacs; simple translators like GCC shouldn't have to
worry about handling multiple locales in the same program execution.
I've dealt with programs like that, and they are a pain to configure
and maintain.  For GCC it's cleaner to add a separate pass to
translate the assembler input, if this is needed.

To some extent this is an ``after you, Alphonse'' situation.  The gas
people don't want to worry about translating codes, and I don't blame
them.  I don't want cpp to worry about it either.  Or cc1.

* Re: thoughts on martin's proposed patch for GCC and UTF-8
  1998-12-09 23:16                 ` Per Bothner
@ 1998-12-11 19:27                   ` Paul Eggert
  0 siblings, 0 replies; 81+ messages in thread
From: Paul Eggert @ 1998-12-11 19:27 UTC (permalink / raw)
  To: bothner; +Cc: martin, gcc2, egcs

   Date: Wed, 09 Dec 1998 23:15:43 -0800
   From: Per Bothner <bothner@cygnus.com>

   assembler files should also be in UTF-8, at least for
   compiler-generated .s files).  Humans-written .s files will
   probably be in the source locale, so we may need a pre-processor
   (possibly gasp) to convert to UTF-8.

This doesn't sound feasible to me.  It will be confusing to explain to
people that assembly-language files do not all smell the same, and
that you need to compile hand-generated files with one command, and
compiler-generated files with another.

Also, it doesn't match current GCC practice, which already puts
EUC-JIS strings into assembler files quite happily.

* Re: thoughts on martin's proposed patch for GCC and UTF-8
  1998-12-09 23:18               ` Martin von Loewis
  1998-12-10  7:57                 ` Ian Lance Taylor
@ 1998-12-11 19:28                 ` Paul Eggert
  1998-12-12  1:06                   ` Martin von Loewis
  1 sibling, 1 reply; 81+ messages in thread
From: Paul Eggert @ 1998-12-11 19:28 UTC (permalink / raw)
  To: martin; +Cc: brolley, gcc2, egcs

   Date: Thu, 10 Dec 1998 08:12:20 +0100
   From: Martin von Loewis <martin@mira.isdn.cs.tu-berlin.de>

   Also, if the mangling of gxxint.texi is used, Fo\u1234 becomes
   U7Fo_1234, where the U indicates that the underscore is an escape....

Sorry, I'm still lost.  If the identifier is the UTF-8 character MICRO
SIGN (code 00B5), do you generate the same UTF-8 character on output,
or do you mangle it as if the user had typed `\u00b5'?  If the latter,
then I don't understand why gas needs to be 8-bit clean; if the
former, then I don't understand your example with \u1234 as it seems
to me that it won't unify with the UTF-8 sequence that is equivalent
to \u1234.

   > I've run into shells that use the top bit for their own purposes.

   What system?

Older BSD systems.  The original Bourne shell used the top bit for its
own purposes.  A few years back, all major Unix suppliers went through
their shells and made them 8-bit clean, but a few bugs lurked for a
while and I wouldn't be surprised if some were still out there.

   Please tell me how I can perform the same test with ASCII-only
   shell commands, and I'll happily convert.

Something like this should do it:
echo ab | tr 'ab' '\302\265'
Or you could write a little C program, compile it, and run it.

   >	C++ does not distinguish between non-ASCII digits and letters.
   >
   > Really?  Suppose I write the preprocessor line
   > 
   > #if X == 1
   > 
   > where X is some Japanese identifier, but I make the understandable
   > mistake of using a FULLWIDTH DIGIT ONE (code FF11) instead of an ASCII 1.

   \uFF11 is not a letter in C++

OK, so then there's no problem: C++ _does_ distinguish between
non-ASCII digits and letters.

* Re: thoughts on martin's proposed patch for GCC and UTF-8
  1998-12-10  7:57                 ` Ian Lance Taylor
  1998-12-10 13:12                   ` Martin von Loewis
@ 1998-12-11 19:32                   ` Paul Eggert
  1998-12-11 19:34                   ` Ken Raeburn
  2 siblings, 0 replies; 81+ messages in thread
From: Paul Eggert @ 1998-12-11 19:32 UTC (permalink / raw)
  To: ian; +Cc: martin, brolley, gcc2, egcs

   Date: Thu, 10 Dec 1998 10:57:10 -0500
   From: Ian Lance Taylor <ian@cygnus.com>

   I think it is reasonable for gcc to emit identifiers as
   uninterpreted byte sequences, and for gas to simply pass those
   identifiers straight through into the object file.

Yes, that should work.

   Several people have mentioned the linker as an issue.  To the best of
   my knowledge, the linker will permit any byte value except 0 to appear
   in an identifier.  I don't see why the linker has to change at all for
   any character set issues.

Perhaps people are thinking that the user might want to link files
that were compiled in different locales.  E.g. one user compiles with
C-language function names in Shift-JIS, whereas another user compiles
with them encoded in EUC-JIS.

These scenarios are fanciful now, because nobody compiles with
non-ASCII names.

I see no particular reason why the linker (or the compiler or
assembler) would have to support such scenarios.  Nobody is doing
this sort of thing now, and I think few if any users will require this
behavior in the future.

* Re: thoughts on martin's proposed patch for GCC and UTF-8
  1998-12-10  7:57                 ` Ian Lance Taylor
  1998-12-10 13:12                   ` Martin von Loewis
  1998-12-11 19:32                   ` Paul Eggert
@ 1998-12-11 19:34                   ` Ken Raeburn
  1998-12-14 17:05                     ` Ian Lance Taylor
  2 siblings, 1 reply; 81+ messages in thread
From: Ken Raeburn @ 1998-12-11 19:34 UTC (permalink / raw)
  To: Ian Lance Taylor; +Cc: martin, eggert, brolley, gcc2, egcs

Ian Lance Taylor <ian@cygnus.com> writes:

> Several people have mentioned the linker as an issue.  To the best of
> my knowledge, the linker will permit any byte value except 0 to appear
> in an identifier.  I don't see why the linker has to change at all for
> any character set issues.

Linker script processing with odd symbol, section, or file names?

* Re: thoughts on martin's proposed patch for GCC and UTF-8
  1998-12-11 19:28                 ` Paul Eggert
@ 1998-12-12  1:06                   ` Martin von Loewis
  0 siblings, 0 replies; 81+ messages in thread
From: Martin von Loewis @ 1998-12-12  1:06 UTC (permalink / raw)
  To: eggert; +Cc: brolley, gcc2, egcs


> Sorry, I'm still lost.  If the identifier is the UTF-8 character MICRO
> SIGN (code 00B5), do you generate the same UTF-8 character on output,
> or do you mangle it as if the user had typed `\u00b5'?

Suppose I have a

class µ{
  µ();  //This should read MICRO SIGN
};

Then, the compiler tests at installation time whether the assembler on
the system is 8-bit-clean. If it is, the constructor is mangled as

__2\302\265v

If the assembler does not support 8-bit symbols, it is mangled as

__U5_00b5

This is what jc1 currently does.

> echo ab | tr 'ab' '\302\265'

Thanks, this looks good.

> OK, so then there's no problem: C++ _does_ distinguish between
> non-ASCII digits and letters.

Right. It just doesn't distinguish between non-ASCII digits and
non-ASCII non-alphanumerics :-) That's why no predicate function
was needed.

Regards,
Martin

* Re: thoughts on martin's proposed patch for GCC and UTF-8
  1998-12-11 19:23                 ` Paul Eggert
@ 1998-12-12  2:21                   ` Martin von Loewis
  1998-12-13  6:23                     ` Richard Stallman
  1998-12-15 22:00                     ` Paul Eggert
  0 siblings, 2 replies; 81+ messages in thread
From: Martin von Loewis @ 1998-12-12  2:21 UTC (permalink / raw)
  To: eggert; +Cc: bothner, gcc2, egcs

> I have misgivings about having GCC support multiple locales
> simultaneously.

So how about that: 

gcc/g++ process strictly-conforming input that is already in the base
character set (plus \u escapes), in the way that the standards
mandate.  Object files are then UTF-8 (or use U escapes for C++).

gcc/g++ also process input based on the current locale, and pass the
input unmodified to the output.

There is no interworking between the two (i.e., characters in the
current locale are not related to \u escapes at all).

This means that the compiler, in locale-aware mode, would not be
strictly conforming, but so what? People could ask their editors to
save files in C/C++ style encoding if they want portable source files,
or use filters.

If this sounds like a reasonable strategy, we only need to worry how
to combine the two algorithms, i.e. how we arrange processing of
identifiers and strings both using the C wchar functions, and
recognizing \u.

What do you think?

Martin

* Re: thoughts on martin's proposed patch for GCC and UTF-8
  1998-12-12  2:21                   ` Martin von Loewis
@ 1998-12-13  6:23                     ` Richard Stallman
  1998-12-13 12:27                       ` Martin von Loewis
  1998-12-15 22:00                     ` Paul Eggert
  1 sibling, 1 reply; 81+ messages in thread
From: Richard Stallman @ 1998-12-13  6:23 UTC (permalink / raw)
  To: martin; +Cc: eggert, bothner, gcc2, egcs

To make GCC depend on the current locale for correct compilation of a
program is error-prone.  If it is necessary for the handling of C code
to depend on the locale, we should have a way to specify the locale in
the source file itself--perhaps with a #-line.

But it would be much better for GCC to be independent of the locale,
as regards the behavior of the .o file at link time and at run time.
It is no great loss if debugging symbol tables depend on the locale,
but all other aspects of the generated .o file should be as close to
locale-independent as we can make them, to reduce the possibility for
things to go wrong.

Wouldn't it work for GCC to treat all byte values above 127 as part of
an identifier, and not worry about how they group into multibyte
characters?  Perhaps the current locale would have something to do
with how they look when printed in an error message, but no more
than that.

I have not been following the discussion until now--no time to study
all those messages carefully--so please forgive me if someone has
already explained a reason this cannot work.  But if it merely has
some possible inconvenience for the user, the advantage of being
locale-independent could easily outweigh that.  It is a very large
advantage.



* Re: thoughts on martin's proposed patch for GCC and UTF-8
  1998-12-13  6:23                     ` Richard Stallman
@ 1998-12-13 12:27                       ` Martin von Loewis
  1998-12-14  2:22                         ` Richard Stallman
  0 siblings, 1 reply; 81+ messages in thread
From: Martin von Loewis @ 1998-12-13 12:27 UTC (permalink / raw)
  To: rms; +Cc: gcc2, egcs

> Wouldn't it work for GCC to treat all byte values above 127 as part of
> an identifier, and not worry about how they group into multibyte
> characters?

That would work fine. It might or might not be what the user expects.

The only real drawback is standards compliance.  C++, Java, and C9X
all allow expressing Unicode in identifiers using \u escapes, like

void h\u00D6llo();

The standards go on to say that the actual source input might be in a
different character set, and the implementation defines how that
relates to Unicode. A sensible implementation would use translation
mechanisms.

Of course, the easiest thing would be to assume that we always get
non-ASCII in identifiers as Unicode escapes. The editor (e.g. Emacs)
would then need to convert the internal encoding to Unicode escapes
when saving. That would make the feature in the language truly useful.

Regards,
Martin

* Re: thoughts on martin's proposed patch for GCC and UTF-8
  1998-12-13 12:27                       ` Martin von Loewis
@ 1998-12-14  2:22                         ` Richard Stallman
  1998-12-15 10:47                           ` Paul Eggert
  0 siblings, 1 reply; 81+ messages in thread
From: Richard Stallman @ 1998-12-14  2:22 UTC (permalink / raw)
  To: martin; +Cc: gcc2, egcs

    > Wouldn't it work for GCC to treat all byte values above 127 as part of
    > an identifier, and not worry about how they group into multibyte
    > characters?

    That would work fine. It might or might not be what the user expects.

Passing along the multibyte sequence unchanged is the most natural
thing to do; it is what cat does, for example.  Why would any user be
surprised by it?

    The only real drawback is standards compliance. C++, Java, and C9X all
    allow expressing Unicode in identifiers using \u escapes, like

    void h\u00D6llo();

You are right that the handling of \u would have to depend on the
multibyte representation, and therefore on the locale.  That would be
unfortunate, but at least it would happen only when \u is used.

It remains desirable to make GCC handle multibyte input in a way that
is independent of the locale--is there any *specific* problem with that?

    Of course, the easiest thing would be to assume that we always get
    non-ASCII in identifiers as Unicode escapes. 

No, just the opposite.  If the non-ASCII characters are represented in
multibyte, then GCC can handle them properly in a locale-independent
way.  But \u cannot be handled in a locale-independent way.
Therefore, Emacs should save these characters in multibyte
representation (which, as it happens, is the more general feature, and
what we are going to work on anyway).

* Re: thoughts on martin's proposed patch for GCC and UTF-8
  1998-12-11 19:34                   ` Ken Raeburn
@ 1998-12-14 17:05                     ` Ian Lance Taylor
  0 siblings, 0 replies; 81+ messages in thread
From: Ian Lance Taylor @ 1998-12-14 17:05 UTC (permalink / raw)
  To: raeburn; +Cc: martin, eggert, brolley, gcc2, egcs

   From: Ken Raeburn <raeburn@cygnus.com>
   Date: 11 Dec 1998 22:35:51 -0500

   Ian Lance Taylor <ian@cygnus.com> writes:

   > Several people have mentioned the linker as an issue.  To the best of
   > my knowledge, the linker will permit any byte value except 0 to appear
   > in an identifier.  I don't see why the linker has to change at all for
   > any character set issues.

   Linker scripts processing with odd symbol, section or file names?

You're right, I hadn't considered those.  That can be a problem for
somebody to solve some day.

Ian

* Re: thoughts on martin's proposed patch for GCC and UTF-8
  1998-12-14  2:22                         ` Richard Stallman
@ 1998-12-15 10:47                           ` Paul Eggert
  1998-12-17 18:10                             ` Richard Stallman
  0 siblings, 1 reply; 81+ messages in thread
From: Paul Eggert @ 1998-12-15 10:47 UTC (permalink / raw)
  To: rms; +Cc: martin, gcc2, egcs

   Date: Mon, 14 Dec 1998 03:22:51 -0700 (MST)
   From: Richard Stallman <rms@gnu.org>

   It remains desirable to make GCC handle multibyte input in a way that
   is independent of the locale--is there any *specific* problem with that?

Yes.  Some widely used multibyte encodings use ordinary ASCII bytes to
encode multibyte characters.  Examples include Shift-JIS, BIG5, and
7-bit ISO-2022.  The ASCII bytes include printable bytes like "\", so
this is an issue for both strings and identifiers.

These encodings all use first bytes that cannot be confused with any
of the single-byte chars in the basic C character set, so they can be
supported by a C compiler.

   Passing along the multibyte sequence unchanged is the most natural
   thing to do

Yes, I tend to think this is the right thing to do in both identifiers
and strings.  Otherwise, the assembly language output might not be a
text file, as it might use different encodings in different regions
with no easy way to distinguish between them.

   that the handling of \u would have to depend on the multibyte
   representation, and therefore on the locale.  That would be
   unfortunate, but at least it would happen only when \u is used.

If GCC is to support encodings like Shift-JIS, it also needs to have a
locale-dependent way to determine the number of bytes in a multibyte
character.  It should copy these bytes straight through; but without a
way of being able to count them, it can't copy them.

* Re: thoughts on martin's proposed patch for GCC and UTF-8
  1998-12-12  2:21                   ` Martin von Loewis
  1998-12-13  6:23                     ` Richard Stallman
@ 1998-12-15 22:00                     ` Paul Eggert
  1998-12-15 23:17                       ` Martin von Loewis
  1998-12-16  0:18                       ` Per Bothner
  1 sibling, 2 replies; 81+ messages in thread
From: Paul Eggert @ 1998-12-15 22:00 UTC (permalink / raw)
  To: martin; +Cc: bothner, gcc2, egcs

   Date: Sat, 12 Dec 1998 11:18:00 +0100
   From: Martin von Loewis <martin@mira.isdn.cs.tu-berlin.de>

   > I have misgivings about having GCC support multiple locales
   > simultaneously.

   gcc/g++ process strictly-conforming input that is already in the base
   character set (plus \u escapes, in a way that the standards
   mandate. Object files are then UTF-8, (or U escapes for C++).

But this would mean that \u escapes wouldn't have their intended
effect in non-UTF-8 locales.  E.g. "\u00b5" would turn into a two-byte
multibyte character string, which is incorrect for the common ISO
8859/1 encoding where it is represented by a single byte.

   gcc/g++ also process input based on the current locale

Yes.  But the current locale should affect the processing of \u
escapes, as well as the recognition of multibyte characters.

   and pass the input unmodified to the output.

This is largely correct for multibyte characters (though their bytes
may need escaping to satisfy some assemblers).  I think \u will need
to be translated, though, if possible -- unless the assembler handles
\u, which is not true for gas at least.

   There is no interworking between the two (i.e: characters in the
   current locale are not at all related to \u escapes)

I'm not sure that this is a good idea, partly for the reasons
described above.  It would mean that \u escapes would turn into
gibberish in the vast majority of locales in practical use today.

   This means that the compiler, in locale-aware mode, would not be
   strictly conforming, but so what?

Actually, draft C9x allows the behavior that you propose, because it
says that the relationship between multibyte chars and \u is
implementation defined.  I lobbied for this design freedom; earlier
C9x drafts required closer conformance to Unicode (and my impression
is that C++ still requires it).  I was hoping that this freedom would
let GCC (or at least cpp :-) function in a locale-invariant way.  But
if we go this route, we have several problems:

* We won't handle \u the way that users will expect.

* We're limited to locales whose multibyte encodings never use ASCII
  bytes -- and this rules out several popular encodings.

* We'll have to disable the checking for identifier spellings in
  multibyte chars, since we won't know which multibyte chars are
  letters and/or digits.

* In general, assembly language files will not be text files.

* Re: thoughts on martin's proposed patch for GCC and UTF-8
  1998-12-15 22:00                     ` Paul Eggert
@ 1998-12-15 23:17                       ` Martin von Loewis
  1998-12-17  7:32                         ` Paul Eggert
  1998-12-16  0:18                       ` Per Bothner
  1 sibling, 1 reply; 81+ messages in thread
From: Martin von Loewis @ 1998-12-15 23:17 UTC (permalink / raw)
  To: eggert; +Cc: bothner, gcc2, egcs


> But this would mean that \u escapes wouldn't have their intended
> effect in non-UTF-8 locales.  E.g. "\u00b5" would turn into a two-byte
> multibyte character string, which is incorrect for the common ISO
> 8859/1 encoding where it is represented by a single byte.

Are you talking about source character set or process character set
here? There is no concept of locale character set in C++...
If you have

   char hello[]="Hall\u00f6chen";

you would get "Hall\303\266chen" at run time. This is why the user
shouldn't do that, instead she should write

   wchar_t hello[]=L"Hall\u00f6chen";

which would then give a wide string where hello[4] is 0xf6. How you
use the string at run-time is up to the application; who says the
application needs to consider the user's locale when processing that
string? Come on.

> Yes.  But the current locale should affect the processing of \u
> escapes, as well as the recognition of multibyte characters.

No way. After the discussion with RMS, I agree that we should copy
bytes *unmodified* into output. That is, if we have a Latin-1 ö in the
input, copy it literally to the output. If we get a \u escape (which
the standard says clearly identifies ISO 10646 characters) we should
also copy it as-is to the output.

Now, there are cases where we can't output the character unmodified.
In those cases (identifiers, narrow strings) we use UTF-8; this is how
we define the process character set.

> may need escaping to satisfy some assemblers).  I think \u will need
> to be translated, though, if possible -- unless the assembler handles
> \u, which is not true for gas at least.

No. gcc shall *not* perform character set conversions, at least for the
time being.

> I'm not sure that this is a good idea, partly for the reasons
> described above.  It would mean that \u escapes would turn into
> gibberish in the vast majority of locales in practical use today.

Rubbish (sorry).  You seem to know exactly how programmers use these
things.  Well, let me tell you.  In a Microsoft COM program, you want to write

	WCHAR DriverName[] = "\u1234\u5678";

The C standard says you should get Unicode. Microsoft says you should
use Unicode in certain situations. This is what you do: You use
Unicode, no matter what the system locale is on Windows NT.

> * We won't handle \u the way that users will expect.

I still don't see this problem. The user expects \u to be Unicode,
we give her Unicode. Why would this not be what the user expects?

> * We're limited to locales whose multibyte encodings never use ASCII
>   bytes -- and this rules out several popular encodings.

This is indeed a problem. But then, maybe it is not. What's most
important is that you can use these encodings in *strings*.

The primary reason why the original C restricted identifiers to ASCII
was the feeling that anything else "would not work".  The primary
reason why the standards now mandate Unicode and \u escapes is that
this approach might work, whereas other encodings still won't.

We can either accept that as a fact of life, or come up with something
smart. Converting Unicode escapes to an encoding that uses illegal
ASCII in assembler doesn't sound too smart to me.

> * We'll have to disable the checking for identifier spellings in
>   multibyte chars, since we won't know which multibyte chars are
>   letters and/or digits.

Well, I think there is agreement that we should process the input
unmodified.  Whether we should use the locale functions to define the
set of programs we accept is another question; maybe.  I'd prefer
command line options, but there is certainly a problem with that.

> * In general, assembly language files will not be text files.

Define "text file".

Regards,
Martin


* Re: thoughts on martin's proposed patch for GCC and UTF-8
  1998-12-15 22:00                     ` Paul Eggert
  1998-12-15 23:17                       ` Martin von Loewis
@ 1998-12-16  0:18                       ` Per Bothner
  1 sibling, 0 replies; 81+ messages in thread
From: Per Bothner @ 1998-12-16  0:18 UTC (permalink / raw)
  To: Paul Eggert; +Cc: gcc2, egcs

> But this would mean that \u escapes wouldn't have their intended
> effect in non-UTF-8 locales.  E.g. "\u00b5" would turn into a two-byte
> multibyte character string, which is incorrect for the common ISO
> 8859/1 encoding where it is represented by a single byte.

This is not the case for Java (nor, from what Martin says, for C).
For Java, the locale would specify how bytes in a disk file are
interpreted/translated into Unicode.  For example, if the locale
uses 8859/1 (ISO Latin-1, I believe) then the byte 0xb5 becomes
the character '\u00b5'; if the locale uses UTF-8 (which should
be the default, I think), the 0xb5 is part of a multi-byte
encoding.

However, the interpretation of \u00b5 is *not* locale-dependent.
That is, once the input processor sees the characters '\\', 'u',
'0', '0', 'b', '5', that sequence is interpreted as the Unicode
character '\u00b5' in *all* locales.

Whether the \u00b5 is passed through to the assembler for expansion
or not is an implementation detail.  (The Java "phases of translation"
specification requires \u-escapes to be expanded early, so having
the assembler handle it does not seem to be practical.)  In any case,
it is quite clear that "\u00b5" becomes a run-time String object
containing one 16-bit character, whatever the locales or
external character encoding.

	--Per Bothner
Cygnus Solutions     bothner@cygnus.com     http://www.cygnus.com/~bothner


* Re: thoughts on martin's proposed patch for GCC and UTF-8
  1998-12-15 23:17                       ` Martin von Loewis
@ 1998-12-17  7:32                         ` Paul Eggert
  1998-12-17 16:48                           ` Martin von Loewis
  1998-12-18 21:31                           ` Richard Stallman
  0 siblings, 2 replies; 81+ messages in thread
From: Paul Eggert @ 1998-12-17  7:32 UTC (permalink / raw)
  To: martin; +Cc: bothner, gcc2, egcs

   Date: Wed, 16 Dec 1998 08:12:52 +0100
   From: Martin von Loewis <martin@mira.isdn.cs.tu-berlin.de>

   > But this would mean ... "\u00b5" would turn into a two-byte
   > multibyte character string, which is incorrect for the common ISO
   > 8859/1 encoding where it is represented by a single byte.

   Are you talking about source character set or process character set
   here?

I'm talking about the execution character set.  E.g. printf ("\u00b5")
should output a single byte in the Solaris 7 "de" locale, which uses
ISO 8859/1.
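Paul's expected behaviour can be sketched the same way: in an ISO 8859/1 locale, any code point up to U+00FF is exactly one byte, and anything above it is simply not representable (hypothetical helper; a real compiler would go through iconv or a locale table):

```c
#include <assert.h>

/* Map a \u escape to the ISO 8859/1 execution character set.
 * Latin-1 coincides with the first 256 Unicode code points.
 * Returns 1 and stores the byte, or 0 if not representable. */
int ucn_to_latin1(unsigned long cp, unsigned char *out)
{
    if (cp <= 0xFF) {
        *out = (unsigned char)cp;   /* e.g. \u00b5 -> single byte 0xB5 */
        return 1;
    }
    return 0;   /* e.g. \u1234: caller must substitute or diagnose */
}
```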

   There is no concept of locale character set in C++...

There's no such official term in C either, but I think C++ and C use
the same basic idea here, namely that a locale (in particular, the
LC_CTYPE part of the locale) specifies the rules for how multibyte
characters are converted to wide chars, which wide chars are
considered to be upper case, etc., etc.  These rules are defined by
the locale's character set and encoding.

      char hello[]="Hall\u00f6chen";

   you would get "Hall\303\266chen" at run time.

That's certainly not true in draft C9x, for non-UTF-8 locales.
In draft C9x, if you want "Hall\303\266chen" at run time,
you can write "Hall\303\266chen" at compile time.

I also suspect that it's not true for C++.  It's hard for me to
believe that C++ requires UTF-8 encoding for strings at run-time.

      wchar hello[]=L"Hall\u00f6chen";

This should give the equivalent wide string at run-time.  If the
implementation uses Unicode wide chars, this is equivalent to
"Hall\x00f6chen"; otherwise, it's equivalent to whatever binary
encoding they use.
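The wide-string case, which both sides agree on, can be checked directly on an implementation whose wchar_t holds Unicode (true of glibc, but a property of the implementation, not a guarantee of the C standard):

```c
#include <assert.h>
#include <wchar.h>

/* With Unicode wide characters, \u00f6 in a wide literal is the
 * single wide character 0x00F6 -- hello[4] below, as Martin said. */
wchar_t hello[] = L"Hall\u00f6chen";
```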

It's possible for the locale to use Unicode wide strings even though
it uses a non-UTF-8 encoding for multibyte chars.  (I believe glibc
2.1 does this, but I haven't checked.)  But it's not required by the C
standard, and some systems use other wide encodings (e.g. JIS).

   > Yes.  But the current locale should affect the processing of \u
   > escapes, as well as the recognition of multibyte characters.

   After the discussion with RMS, I agree that we should copy
   bytes *unmodified* into output.

But your example above with `char hello' doesn't copy the bytes
unmodified!  It translates the 6 chars "\u00f6" to 2 bytes in your
locale's charset and encoding, which is the right thing to do; RMS
(reluctantly, I think :-) agreed that \u requires locale-dependent
translation.

I agree that multibyte chars should be copied unmodified into the
output.  However, as I mentioned earlier, they require locale-specific
processing to be *recognized*; otherwise they might be confused with
ASCII chars.

   If we get a \u escape (which the standard says clearly identifies
   ISO 10646 characters) we should also copy it as-is to the output.

Again, you seem to be contradicting your own example.

Though draft C9x says that \u identifies ISO 10646 chars, it doesn't
require that the implementation use UTF-8 narrow strings, nor does it
require that the implementation use Unicode in wide strings.  It can
use some other encoding, e.g. Shift-JIS or ISO 8859/1 or even Ascii.
I assume C++ is similar here.

Java is a different animal here; it requires Unicode at run-time.  But
we're talking about C (and C++), which make no such requirement.

	   WCHAR DriverName[] = "\u1234\u5678";

   The C standard says you should get Unicode.

All draft C9x says is that you should get the appropriate chars, and
that the relationship between those chars and the Unicode chars is
implementation-defined.

   Microsoft says you should use Unicode in certain situations.

Absolutely.  In a locale that uses Unicode, you should get Unicode.

   Converting Unicode escapes to an encoding that uses illegal
   ASCII in assembler doesn't sound too smart to me.

Sorry, you've lost me.  ``illegal ASCII''??

   > * In general, assembly language files will not be text files.

   Define "text file".

A file that (among other things) uses a single encoding for its
characters.  Such files can be processed by standard text tools like
wc, iconv, and emacs.

You're proposing that assembler files use UTF-8 in some cases, and the
locale's multibyte encoding in other cases.  Such files can't be
processed by standard text tools.


* Re: thoughts on martin's proposed patch for GCC and UTF-8
  1998-12-17  7:32                         ` Paul Eggert
@ 1998-12-17 16:48                           ` Martin von Loewis
  1998-12-17 22:10                             ` Paul Eggert
  1998-12-18 21:31                           ` Richard Stallman
  1 sibling, 1 reply; 81+ messages in thread
From: Martin von Loewis @ 1998-12-17 16:48 UTC (permalink / raw)
  To: eggert; +Cc: bothner, gcc2, egcs

> I'm talking about the execution character set.  E.g. printf ("\u00b5")
> should output a single byte in the Solaris 7 "de" locale, which uses
> ISO 8859/1.

Is this what you want to happen, or what some standard mandates?  If
the latter, what does the standard mandate for printf("\u1234")?

>       char hello[]="Hall\u00f6chen";
> 
>    you would get "Hall\303\266chen" at run time.
> 
> That's certainly not true in draft C9x, for non-UTF-8 locales.

I didn't mean that C++ mandates that. It is implementation defined
what happens, and I'm proposing that egcs defines it that way.

> In draft C9x, if you want "Hall\303\266chen" at run time,
> you can write "Hall\303\266chen" at compile time.

You've confused input and output here. The question is not how to
achieve a certain output, but how to process a certain input.

> I also suspect that it's not true for C++.  It's hard for me to
> believe that C++ requires UTF-8 encoding for strings at run-time.

It doesn't. It doesn't prohibit that, either.

> But your example above with `char hello' doesn't copy the bytes
> unmodified!  It translates the 6 chars "\u00f6" to 2 bytes in your
> locale's charset and encoding, which is the right thing to do

No, it is not *my* locale, it is how gcc is (or could be) defined.
The big difference is predictability. If gcc defines that translation
into multibyte characters always means UTF-8 for \u escapes, people
know what to expect.

If the output *at run time* depends on the setting of environment
variables *at compile time*, people will kill us.

>    If we get a \u escape (which the standard says clearly identifies
>    ISO 10646 characters) we should also copy it as-is to the output.
> 
> Again, you seem to be contradicting your own example.

Converting Unicode to UTF-8 is as close as you can get to 'as-is', if
you want to convert arbitrary Unicode to multibyte.

> Java is a different animal here; it requires Unicode at run-time.  But
> we're talking about C (and C++), which make no such requirement.

We also plan to combine C++ and Java.

> 
> 	   WCHAR DriverName[] = "\u1234\u5678";
[...]
>    Microsoft says you should use Unicode in certain situations.
> 
> Absolutely.  In a locale that uses Unicode, you should get Unicode.

Microsoft says you should get Unicode no matter what the locale is.

>    Converting Unicode escapes to an encoding that uses illegal
>    ASCII in assembler doesn't sound too smart to me.
> 
> Sorry, you've lost me.  ``illegal ASCII'??

Well, ASCII sequences that are not legal as identifiers.

> You're proposing that assembler files use UTF-8 in some cases, and the
> locale's multibyte encoding in other cases.  Such files can't be
> processed by standard text tools.

I don't want to process assembler files with standard text tools; I
want the assembler to process them.

Martin


* Re: thoughts on martin's proposed patch for GCC and UTF-8
  1998-12-15 10:47                           ` Paul Eggert
@ 1998-12-17 18:10                             ` Richard Stallman
  1998-12-17 21:41                               ` Paul Eggert
  1998-12-17 23:55                               ` Joern Rennecke
  0 siblings, 2 replies; 81+ messages in thread
From: Richard Stallman @ 1998-12-17 18:10 UTC (permalink / raw)
  To: eggert; +Cc: martin, gcc2, egcs

    Yes.  Some widely used multibyte encodings use ordinary ASCII bytes to
    encode multibyte characters.  Examples include Shift-JIS, BIG5, and
    7-bit ISO-2022.

This makes the situation more difficult.

I think the key to how to cope with it is to recognize that currently
GCC does not handle these encodings at all.  But it does handle (in
strings and comments) multibyte encodings that use only non-ASCII
bytes, and it does so reliably, regardless of the environment.  I'll
call these "clean multibyte encodings".

The default mode in GCC should continue to handle clean multibyte
encodings reliably, meaning that it should not try to understand the
encodings, just treat sequences of non-ASCII bytes in the usual way.
Perhaps it will handle them thus in identifiers, as well as in strings
and comments.

However, it would be ok to add an option which tells GCC to decode all
multibyte encodings encountered, according to the specified locale.
That mode would handle the unclean multibyte encodings.

In addition to that, handling of \u in a non-wide string has to
depend on the encoding.  So there may be two things for which GCC
needs to know the encoding, and there is certainly at least one.

It is unreliable to get the encoding from the environment.  So we
should provide other ways to specify the encoding, and encourage
people to use them.  One way is with an option, --locale=LOCALE.
Another way is with a directive such as

  #locale LOCALE

in the source code.

I think GCC should issue a warning if the source actually depends on
the choice of locale, and the locale has been obtained from the
environment.  The warning should encourage use of --locale or #locale
to specify the locale.

      I think \u will need
    to be translated, though, if possible -- unless the assembler handles
    \u, which is not true for gas at least.

Once we decide that either GCC or the assembler should translate \u
into a locale-specific multibyte encoding, it may as well be done in
GCC.  GCC is used with many different assemblers.

    * We'll have to disable the checking for identifier spellings in
      multibyte chars, since we won't know which multibyte chars are
      letters and/or digits.

Why would we want to check?  Why NOT simply define all non-ASCII
characters as being allowed in identifiers?  No non-ASCII characters
have any other meaning in C.

    * In general, assembly language files will not be text files.

When GCC wants to put certain non-ASCII bytes into a string or
identifier, that doesn't necessarily mean just outputting those bytes
into the .s file.

Non-ASCII bytes in strings can be output to the .s file using .byte,
so that the bytes themselves don't appear in the file.  This is a
reliable way to produce the same sequence of bytes in core when the
program runs.

Non-ASCII bytes in identifiers need to be encoded in some
way, since assemblers won't allow them in identifiers.

I suggest using `.' followed by the hex code of the byte.
That is allowed by most assemblers, and provides a unique
representation.  Of course, there should be a way to
specify a different handling for any given system,
in case the native compiler does something different.
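RMS's per-byte scheme is easy to sketch: pass ASCII through and spell each non-ASCII byte as `.' plus two hex digits (hypothetical helper; the actual convention would be target-specific):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Encode an identifier for the .s file: ASCII bytes unchanged,
 * each non-ASCII byte as '.' followed by its two-digit hex code.
 * Caller must supply a buffer of at least 3*strlen(id)+1 bytes. */
void mangle_for_asm(const unsigned char *id, char *out)
{
    for (; *id; id++) {
        if (*id < 0x80)
            *out++ = (char)*id;
        else
            out += sprintf(out, ".%02x", *id);
    }
    *out = '\0';
}
```

A UTF-8 identifier containing the bytes 0xC3 0xB6 (ö) would come out as `.c3.b6', unique and plain ASCII.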


* Re: thoughts on martin's proposed patch for GCC and UTF-8
  1998-12-17 18:10                             ` Richard Stallman
@ 1998-12-17 21:41                               ` Paul Eggert
  1998-12-18  1:23                                 ` Martin von Loewis
  1998-12-17 23:55                               ` Joern Rennecke
  1 sibling, 1 reply; 81+ messages in thread
From: Paul Eggert @ 1998-12-17 21:41 UTC (permalink / raw)
  To: rms; +Cc: martin, gcc2, egcs

   Date: Thu, 17 Dec 1998 19:10:24 -0700 (MST)
   From: Richard Stallman <rms@gnu.org>

   One way is with an option, --locale=LOCALE.
   Another way is with a directive such as `#locale LOCALE' in the source code.

Like your other suggestions, this one sounds reasonable to me.  It
might help to use more specific names, e.g. `--locale-ctype=LOCALE'
and `#locale LC_CTYPE LOCALE'.  Locales are a more general notion than
just their LC_CTYPE component, and we may find a use for specifying
other (non-LC_CTYPE) parts of the locale later.

   Why NOT simply define all non-ASCII characters as being allowed in
   identifiers?  No non-ASCII characters have any other meaning in C.

OK, you can talk me into not checking, at least by default.  It might
be useful to have an optional check, for people who want to port their
code to more restrictive compilers that check the restrictions of
draft C9x's Annex I.  This optional check would be locale-dependent.

   Non-ASCII bytes in identifiers need to be encoded in some
   way, since assemblers won't allow them in identifiers.
   I suggest using `.' followed by the hex code of the byte.

It may make sense to use this notation (or something like it) even
for assemblers like GAS that allow multibyte chars in identifiers.  This
would mean that assembly language files would be text, which is good.

However, a drawback of this approach is that the names will be mangled.
This will require that debuggers demangle the names.  For C++ this is
not such a big deal, as the names are already mangled, but for C it
is a bit inconvenient.

Here is another possibility.  For identifier chars that can be
expressed as multibyte chars in the locale's encoding, use those
chars; otherwise, use `.uxxxx' or `.Uxxxxxxxx' where xxxx (or
xxxxxxxx) are the Unicode position.  E.g. if the original identifier
was a\u1234b\U12345678c, and if \u1234 and \U12345678 cannot be
represented as multibyte chars, then represent this identifier as
a.u1234b.U12345678c in the assembler file.
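Paul's notation for a single universal character name, sketched (hypothetical helper; lowercase hex, along the lines of draft C9x's canonical escape form):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Spell a code point the locale cannot encode as ".uxxxx" for the
 * BMP or ".Uxxxxxxxx" above it, with lowercase hex digits. */
int mangle_ucn(unsigned long cp, char *out, size_t n)
{
    if (cp <= 0xFFFFUL)
        return snprintf(out, n, ".u%04lx", cp);
    return snprintf(out, n, ".U%08lx", cp);
}
```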

An advantage of this approach is that (for C, at least), it's upward
compatible with martin's proposal.  A locale that uses a UTF-8 charset
and encoding will simply use UTF-8 identifiers, which is what he
wants.  This will be a natural way to do things in the UTF-8 world --
the assembler file will be much easier to read as UTF-8 text than it
would be with .uxxxx or .xx.xx.xx escapes.  (Similarly for other
encodings like EUC-JIS.)

I don't know how this would affect C++ mangling, though.

   handling of \u in a non-wide string has to depend on the encoding.

A minor point: this is also true for \u in wide strings, as not all
systems use Unicode for wide chars.


* Re: thoughts on martin's proposed patch for GCC and UTF-8
  1998-12-17 16:48                           ` Martin von Loewis
@ 1998-12-17 22:10                             ` Paul Eggert
  0 siblings, 0 replies; 81+ messages in thread
From: Paul Eggert @ 1998-12-17 22:10 UTC (permalink / raw)
  To: martin; +Cc: bothner, gcc2, egcs

   Date: Fri, 18 Dec 1998 01:44:20 +0100
   From: Martin von Loewis <martin@mira.isdn.cs.tu-berlin.de>

   > E.g. printf ("\u00b5") should output a single byte in the Solaris 7
   > "de" locale, which uses ISO 8859/1.

   Is this what you want to happen, or what some standard mandates to happen?

The draft C9x standard mandates only that the implementation define
the relation between \u escapes and the locale's characters.  However,
the intent is that \u00b5 correspond to the ISO 10646-1 MICRO SIGN
character, and the ISO 8859/1 equivalent is the single byte with hex
code b5.

   what does the standard mandate for printf("\u1234");

Again, it's implementation defined.  If the implementation's encoding
can't represent a Unicode character, the implementation must
substitute some other char.  E.g. printf("\u1234") might print a
question mark in a locale that is limited to ISO 8859/1 chars.
   
   If gcc defines that translation into multibyte characters always
   means UTF-8 for \u escapes, people know what to expect.

It's true that this would be reproducible behavior, and it would also
conform to the letter of the standard; but it's undesirable (e.g. it
mixes encodings on output) and doesn't conform to the standard's intent.
It would make \u useless in non-UTF-8 locales.

   If the output *at run time* depends on the setting of environment
   variables *at compile time*, people will kill us.

I think you're right to be leery of environmental settings (as is
RMS), and I also think it wise to prefer explicit settings to
environmental ones.  But it's too strong to rule out the environment
entirely.  The runtime behavior already depends on the values of
compile-time environment variables (e.g. C_INCLUDE_PATH); having one
more such dependency won't kill us.

   > Java is a different animal here; it requires Unicode at run-time.  But
   > we're talking about C (and C++), which make no such requirement.

   We also plan to combine C++ and Java.

This means that the C++ side will most likely have to use UTF-8.
That's OK.  For UTF-8 locales I think we're pretty much in agreement.

   Microsoft says you should get Unicode no matter what the locale is.

GCC is used on many non-Microsoft platforms; it can't (and shouldn't
try to) impose Microsoft's rules on everybody else.

   I don't want to process assembler files by standard text tools

You may not need this capability, but other people do.  E.g. GCC's
maintainers need to look at the assembler output to debug GCC itself.
These needs make it desirable to have assembler files be text rather
than some encoding that's not human-readable.


* Re: thoughts on martin's proposed patch for GCC and UTF-8
  1998-12-17 18:10                             ` Richard Stallman
  1998-12-17 21:41                               ` Paul Eggert
@ 1998-12-17 23:55                               ` Joern Rennecke
  1998-12-19  5:13                                 ` Richard Stallman
  1 sibling, 1 reply; 81+ messages in thread
From: Joern Rennecke @ 1998-12-17 23:55 UTC (permalink / raw)
  To: rms; +Cc: eggert, martin, gcc2, egcs

>   #locale LOCALE

I think it should rather be

#pragma locale LOCALE


* Re: thoughts on martin's proposed patch for GCC and UTF-8
  1998-12-17 21:41                               ` Paul Eggert
@ 1998-12-18  1:23                                 ` Martin von Loewis
  0 siblings, 0 replies; 81+ messages in thread
From: Martin von Loewis @ 1998-12-18  1:23 UTC (permalink / raw)
  To: eggert; +Cc: rms, gcc2, egcs

> Here is another possibility.  For identifier chars that can be
> expressed as multibyte chars in the locale's encoding, use those
> chars; otherwise, use `.uxxxx' or `.Uxxxxxxxx' where xxxx (or
> xxxxxxxx) are the Unicode position.

[...]

> I don't know how this would affect C++ mangling, though.

This won't work for C++. Consider

class Foo{
        static int u1234;
};

This currently compiles into _3Foo.u1234. With your proposal,
_3Foo.u1234.u1234 could either be Foo\u1234::u1234, or
Foo::u1234\u1234.

If people don't like always converting Unicode identifiers to UTF-8, I
drop that proposal with regrets.  It would work on assemblers that
support 8-bit bytes in identifiers, it would work for C and C++, and it
would work independently of compile-time and runtime settings
(identifiers are *not* affected by the user's locale whatsoever).

Anyway, I drop that proposal. There is a proposed mangling for \u
escapes in C++ in gxxint.texi. It works for all cases and for all
assemblers, giving plain text in identifiers. It doesn't work for C,
but after this discussion, I guess I don't care about that anymore.
Somebody just tell me how it should work for C.

Kind regrets,
Martin


* Re: thoughts on martin's proposed patch for GCC and UTF-8
  1998-12-17  7:32                         ` Paul Eggert
  1998-12-17 16:48                           ` Martin von Loewis
@ 1998-12-18 21:31                           ` Richard Stallman
  1 sibling, 0 replies; 81+ messages in thread
From: Richard Stallman @ 1998-12-18 21:31 UTC (permalink / raw)
  To: eggert; +Cc: martin, bothner, gcc2, egcs

      It translates the 6 chars "\u00f6" to 2 bytes in your
    locale's charset and encoding, which is the right thing to do; RMS
    (reluctantly, I think :-) agreed that \u requires locale-dependent
    translation.

We have to do locale-dependent translation for \u in a non-wide
string, because the character meaning of a \u escape is
locale-independent, while the proper multibyte representation of any
given character in a non-wide string is locale-dependent.

It might be appropriate to do locale-dependent translation for \u
in a wide string, in case the locale's wide character representation
is not Unicode.  But maybe it is ok to say, "you lose in that case."


* Re: thoughts on martin's proposed patch for GCC and UTF-8
  1998-12-17 23:55                               ` Joern Rennecke
@ 1998-12-19  5:13                                 ` Richard Stallman
  1998-12-19 10:36                                   ` Paul Eggert
  0 siblings, 1 reply; 81+ messages in thread
From: Richard Stallman @ 1998-12-19  5:13 UTC (permalink / raw)
  To: amylaar; +Cc: eggert, martin, gcc2, egcs

    >   #locale LOCALE

    I think it should rather be

    #pragma locale LOCALE

No, definitely not.
It is not a good idea to use #pragma for anything
that affects the meaning of the program.

We should invent a new command for this,
so that it could later be adopted as a standard.


* Re: thoughts on martin's proposed patch for GCC and UTF-8
  1998-12-19  5:13                                 ` Richard Stallman
@ 1998-12-19 10:36                                   ` Paul Eggert
  1998-12-20 20:29                                     ` Richard Stallman
                                                       ` (2 more replies)
  0 siblings, 3 replies; 81+ messages in thread
From: Paul Eggert @ 1998-12-19 10:36 UTC (permalink / raw)
  To: rms; +Cc: amylaar, martin, gcc2, egcs

I thought of some drawbacks to #pragma LC_CTYPE "ja_JP.PCK"
(or to #locale LC_CTYPE "ja_JP.PCK", for that matter).

* If the program text is converted from one encoding to another, its
  #pragma will become incorrect.  This will make it a hassle to convert
  program text automatically (e.g. from Shift-JIS to UTF-8).

* Locale names aren't very portable.  E.g. Solaris uses "ja" for
  EUC-JIS whereas Unixware uses "ja_JP.EUC".

Perhaps it would be better for GCC to autodetect the character set and
encoding, much as Emacs already does.  GCC could even reuse the Emacs
code.  There would have to be a way to override the default (e.g. a
command-line option), but autodetection might be good enough in practice
so that overriding would be rarely needed.
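One practical heuristic behind such autodetection: valid UTF-8 is largely self-identifying, because Latin-N or EUC text rarely forms legal UTF-8 sequences.  A minimal validity check, sketched (overlong forms and surrogates are not rejected, for brevity):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return 1 if the n bytes at s parse as a sequence of well-formed
 * UTF-8 characters, 0 otherwise. */
int looks_like_utf8(const unsigned char *s, size_t n)
{
    size_t i = 0;
    while (i < n) {
        unsigned char c = s[i];
        size_t len = c < 0x80 ? 1                /* ASCII */
                   : (c & 0xE0) == 0xC0 ? 2      /* 110xxxxx */
                   : (c & 0xF0) == 0xE0 ? 3      /* 1110xxxx */
                   : (c & 0xF8) == 0xF0 ? 4      /* 11110xxx */
                   : 0;                          /* invalid lead byte */
        if (len == 0 || i + len > n)
            return 0;
        for (size_t k = 1; k < len; k++)         /* trail bytes: 10xxxxxx */
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;
        i += len;
    }
    return 1;
}
```

A UTF-8 "Hall\303\266chen" passes; the same word with a bare Latin-1 0xF6 fails, so the two encodings can be told apart without user input.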

   Date: Sat, 19 Dec 1998 06:12:48 -0700 (MST)
   From: Richard Stallman <rms@gnu.org>

   It is not a good idea to use #pragma for anything
   that affects the meaning of the program.

But all the pragmas required by draft C9x affect the meaning of the
program.  I think draft C9x's intent is that #pragma not affect the
meaning of the program ``much''.

For reference, here are the draft C9x pragmas and what they do.

	#pragma STDC FP_CONTRACT ON allows a floating-point expression to be
	``contracted'', i.e. evaluated as though it were an atomic operation,
	thereby omitting rounding errors.  (E.g. PowerPC multiply-add.)

	#pragma STDC FENV_ACCESS ON lets floating-point code test flags or run
	under non-default modes.

	#pragma STDC CX_LIMITED_RANGE ON lets the implementation evaluate
	complex multiply, divide, and absolute value efficiently without
	worrying about correct behavior because of undue overflow and
	underflow.

	The default state of these pragmas is implementation-defined.
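The first of these pragmas can be illustrated with a sketch (note that GCC itself has historically ignored `#pragma STDC FP_CONTRACT' with a warning, so this is how a conforming implementation would treat it, not necessarily how GCC does):

```c
#include <assert.h>

/* FP_CONTRACT controls whether a*x + y may be contracted into a
 * single fused multiply-add that rounds once; OFF forces two
 * separate roundings, one after * and one after +. */
#pragma STDC FP_CONTRACT OFF

double axpy(double a, double x, double y)
{
    return a * x + y;
}
```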


* Re: thoughts on martin's proposed patch for GCC and UTF-8
  1998-12-19 10:36                                   ` Paul Eggert
  1998-12-20 20:29                                     ` Richard Stallman
@ 1998-12-20 20:29                                     ` Richard Stallman
  1998-12-21  1:52                                       ` Andreas Schwab
  1998-12-21 12:25                                     ` Samuel Figueroa
  2 siblings, 1 reply; 81+ messages in thread
From: Richard Stallman @ 1998-12-20 20:29 UTC (permalink / raw)
  To: eggert; +Cc: amylaar, martin, gcc2, egcs

    For reference, here are the draft C9x pragmas and what they do.

These arithmetic parameters should not be done with #pragma, not with
any kind of #-command.  That is because macro expansions cannot
produce a #-command.  I told the committee about this problem ten
years ago, but it seems that the temptation of the #pragma idea
is too strong for mere logic to overcome.

We should design a cleaner syntax for these parameters, one that can
be produced by macro expansion, and we should deprecate the use of
pragmas for this purpose.  Actually I did design one ten years ago or
so.  Maybe the committee still has records of what it was.

If some of us are still on the committee, could those people please
forward the suggestion to the committee, before it is too late?  (I was a
member but left when they insisted that members pay.)


Locale specification is a different kind of operation.
It does not affect the meaning of expressions;
instead it says how to read lines of source code.
That is why a #-command is ok for locale specification,
even tho it is a bad interface for these floating point parameters.


* Re: thoughts on martin's proposed patch for GCC and UTF-8
  1998-12-19 10:36                                   ` Paul Eggert
@ 1998-12-20 20:29                                     ` Richard Stallman
  1998-12-21  7:00                                       ` Zack Weinberg
  1998-12-21 18:11                                       ` Paul Eggert
  1998-12-20 20:29                                     ` Richard Stallman
  1998-12-21 12:25                                     ` Samuel Figueroa
  2 siblings, 2 replies; 81+ messages in thread
From: Richard Stallman @ 1998-12-20 20:29 UTC (permalink / raw)
  To: eggert; +Cc: amylaar, martin, gcc2, egcs

    * If the program text is converted from one encoding to another, its
      #pragma will become incorrect.

Yes, the #locale will need to be changed in that case.

If GCC is going to depend on the locale, you will have to specify the
locale for your files.  Regardless of how you specify it, with #locale
or with --locale or with an envvar, in any case there is a risk you
might forget to change the specification along with the file.

This problem applies to ALL possible ways of specifying the locale.
So it is not a reason to prefer one method of specification to
another.

However, use of #locale avoids the danger that you will simply forget
to specify the right locale, even though the correct locale is the
same as it always was.

    * Locale names aren't very portable.  E.g. Solaris uses "ja" for
      EUC-JIS whereas Unixware uses "ja_JP.EUC".

This is a real issue.  I see three possible solutions.

1. Define our own system-independent names for (some) locales.
2. Allow specification of several locale names, and GCC will use
the first one that is meaningful on the system in use.
3. Allow specification of several locale names, each associated
with a host system type.

    Perhaps it would be better for GCC to autodetect the character set and
    encoding, much as Emacs already does.

Autodetection is limited; in Emacs it requires (in some cases)
preference information from the user.  For example, Emacs cannot
distinguish between Latin-1, Latin-2, Latin-3, Latin-4 and Latin-5.
There is no way to distinguish them automatically, because they use
the same set of valid bytes.

Adding something to the file to specify this preference information
is pretty much equivalent to alternative 1 above.

    But all the pragmas required by draft C9x affect the meaning of the
    program.

The committee is making a foolish decision.  Let's not follow suit.
If some day they define a specific #pragma to do this job, then we
should support it, but we should lead the way towards a cleaner
approach.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: thoughts on martin's proposed patch for GCC and UTF-8
  1998-12-20 20:29                                     ` Richard Stallman
@ 1998-12-21  1:52                                       ` Andreas Schwab
  1998-12-22  1:09                                         ` Richard Stallman
  0 siblings, 1 reply; 81+ messages in thread
From: Andreas Schwab @ 1998-12-21  1:52 UTC (permalink / raw)
  To: rms; +Cc: eggert, amylaar, martin, gcc2, egcs

Richard Stallman <rms@gnu.org> writes:

|>     For reference, here are the draft C9x pragmas and what they do.
|> 
|> These arithmetic parameters should not be done with #pragma, not with
|> any kind of #-command.  That is because macro expansions cannot
|> produce a #-command.

That's why C9x has the Pragma operator.

       6.10.9  Pragma operator

       Semantics

       [#1] A unary operator expression of the form:

               _Pragma ( string-literal  )

       is processed as follows: The string literal is  destringized
       by  deleting  the L prefix, if present, deleting the leading
       and trailing double-quotes, replacing each  escape  sequence
       \"  by a double-quote, and replacing each escape sequence \\
       by a single backslash.  The resulting sequence of characters
       is   processed   through  translation  phase  3  to  produce
       preprocessing tokens that are executed as if they  were  the
       pp-tokens   in   a  pragma  directive.   The  original  four
       preprocessing tokens in the unary  operator  expression  are
       removed.

-- 
Andreas Schwab                                      "And now for something
schwab@issan.cs.uni-dortmund.de                      completely different"
schwab@gnu.org

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: thoughts on martin's proposed patch for GCC and UTF-8
  1998-12-20 20:29                                     ` Richard Stallman
@ 1998-12-21  7:00                                       ` Zack Weinberg
  1998-12-21 18:58                                         ` Paul Eggert
  1998-12-21 18:11                                       ` Paul Eggert
  1 sibling, 1 reply; 81+ messages in thread
From: Zack Weinberg @ 1998-12-21  7:00 UTC (permalink / raw)
  To: rms; +Cc: amylaar, martin, gcc2, egcs

On Sun, 20 Dec 1998 21:29:41 -0700 (MST), Richard Stallman wrote:
>
>    * Locale names aren't very portable.  E.g. Solaris uses "ja" for
>      EUC-JIS whereas Unixware uses "ja_JP.EUC".
>
>This is a real issue.  I see three possible solutions.
>
>1. Define our own system-independent names for (some) locales.
>2. Allow specification of several locale names, and GCC will use
>the first one that is meaningful on the system in use.
>3. Allow specification of several locale names, each associated
>with a host system type.

GCC should only care about the character set, not the rest of the
locale.  Therefore, it makes sense to use the charset names from the
iconv library (part of glibc 2.1, also in Solaris and probably
elsewhere) which are the names standardized by the MIME RFCs.

zw

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: thoughts on martin's proposed patch for GCC and UTF-8
  1998-12-19 10:36                                   ` Paul Eggert
  1998-12-20 20:29                                     ` Richard Stallman
  1998-12-20 20:29                                     ` Richard Stallman
@ 1998-12-21 12:25                                     ` Samuel Figueroa
  2 siblings, 0 replies; 81+ messages in thread
From: Samuel Figueroa @ 1998-12-21 12:25 UTC (permalink / raw)
  To: rms; +Cc: gcc2, egcs

>These arithmetic parameters should not be done with #pragma, not with
>any kind of #-command.  That is because macro expansions cannot
>produce a #-command.  I told the committee about this problem ten
>years ago, but it seems that the temptation of the #pragma idea
>is too strong for mere logic to overcome.

I have never been a member of the committee, but I attended a meeting in  
which your objections to using #pragma for these arithmetic parameters were  
mentioned.  The attitude was that if no one present wished to champion a  
proposal or objection, it would not be considered at all.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: thoughts on martin's proposed patch for GCC and UTF-8
  1998-12-20 20:29                                     ` Richard Stallman
  1998-12-21  7:00                                       ` Zack Weinberg
@ 1998-12-21 18:11                                       ` Paul Eggert
  1998-12-21 18:46                                         ` Per Bothner
                                                           ` (3 more replies)
  1 sibling, 4 replies; 81+ messages in thread
From: Paul Eggert @ 1998-12-21 18:11 UTC (permalink / raw)
  To: rms; +Cc: amylaar, martin, gcc2, egcs

   Date: Sun, 20 Dec 1998 21:29:41 -0700 (MST)
   From: Richard Stallman <rms@gnu.org>

       Perhaps it would be better for GCC to autodetect the character set and
       encoding, much as Emacs already does.

   Autodetection is limited; in Emacs it requires (in some cases)
   preference information from the user.  For example, Emacs cannot
   distinguish between Latin-1, Latin-2, Latin-3, Latin-4 and Latin-5.
   There is no way to distinguish them automatically, because they use
   the same set of valid bytes.

There are two related issues here:

1. Can autodetection work well enough to support non-"C" characters in
   strings and identifiers?

2. Can autodetection work well enough to support \u escapes as well?

Your example suggests that the answer to (2) is ``no''.  But this
already follows from the fact that a C program could be written
entirely in the "C" character set with \u escapes, and autodetection
can't possibly determine the multibyte encoding of such a program.

I was thinking more about case (1), which I think will be more common
in practice.  I suspect that autodetection could work reasonably well
for \u-free programs.  For example, GCC needn't worry about the
distinction between Latin-1 and Latin-2 if there are no \u escapes,
since it needn't worry about whether the byte 0xb5 corresponds to
MICRO SIGN or to some other character.

       all the pragmas required by draft C9x affect the program's meaning

   The committee is making a foolish decision.

The committee is also requiring _Pragma("FOO") to have the same
meaning as #pragma FOO.  _Pragma("FOO") can be output by macros.
Does this overcome your objection to pragmas?

       * If the program text is converted from one encoding to another, its
	 #locale will become incorrect.

   If GCC is going to depend on the locale, you will have to specify the
   locale for your files.  Regardless of how you specify it, with #locale
   or with --locale or with an envvar, in any case there is a risk you
   might forget to change the specification along with the file.

I have a shorter and a longer answer to this.

The shorter answer:

	GCC currently doesn't have directives like this:

	#character-set ASCII
	#character-set EBCDIC

	because they're not needed; people who compile in EBCDIC environments
	already know about these issues, set things up appropriately, and
	would find those directives to be a pain to maintain.  GCC (and other
	compilers) have survived all these years without character-set
	directives, even though they solve roughly the same problems that
	#locale directives would solve.  This suggests that GCC doesn't
	need #locale directives either.

The longer answer:

	There is a risk to mis-specifying the locale, yes, but in practice my
	experience is that the risk is smaller if the locale is part of the
	environment.

	In our company, when we import files from other sources, we typically
	transliterate them to an encoding suitable for our preferred working
	locale.  This is the only plausible way to do things; otherwise, few
	of our text-processing tools would work.  Even Emacs supports only
	_some_ of the Japanese encodings that we import -- e.g. it doesn't
	support UTF-8 or DBCS.  Most other tools support only one character
	set and encoding at a time, and it is set from the locale environment
	variables in the usual way.

	If #locale were part of the source, we'd have more work to do, since
	we'd also have to munge the #locale directives of imported sources.
	This would be doable, but it would be a hassle, particularly when
	trading patches with our correspondents who use different encodings.
	I can easily see where people would screw this up.

	In contrast, if the locale is part of the build environment, we
	needn't worry about munging anything.  We must set up our build
	environment correctly, but that's OK -- we also must set up our
	PATH correctly, etc., and setting up the locale correctly is something
	that everyone versed in software internationalization and localization
	already knows how to do.


   use of #locale avoids the danger that you will simply forget
   to specify the right locale,

No, it doesn't avoid the danger.  You can specify the wrong locale
just as easily, if not more easily, with #locale -- e.g. see the
transliteration scenario in my longer answer above.

       * Locale names aren't very portable.  E.g. Solaris uses "ja" for
	 EUC-JIS whereas Unixware uses "ja_JP.EUC".

   This is a real issue.  I see three possible solutions.

   1. Define our own system-independent names for (some) locales.

I'd rather not do this -- it will be a maintenance hassle.  But if we
must do it, we should steal code from glibc rather than reinvent the
wheel (as is done in the current GCC2 and EGCS snapshots).

   2. Allow specification of several locale names, and GCC will use
   the first one that is meaningful on the system in use.
   3. Allow specification of several locale names, each associated
   with a host system type.

These are also maintenance hassles, for several reasons.  E.g. "ja"
means different things on different hosts.  I don't know which locale
names map to which encodings on which hosts, and keeping track of this
info will be tedious and quite error prone.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: thoughts on martin's proposed patch for GCC and UTF-8
  1998-12-21 18:11                                       ` Paul Eggert
@ 1998-12-21 18:46                                         ` Per Bothner
  1998-12-21 19:44                                           ` Paul Eggert
  1998-12-21 20:16                                           ` Paul Eggert
  1998-12-21 19:16                                         ` Per Bothner
                                                           ` (2 subsequent siblings)
  3 siblings, 2 replies; 81+ messages in thread
From: Per Bothner @ 1998-12-21 18:46 UTC (permalink / raw)
  To: Paul Eggert; +Cc: rms, amylaar, martin, gcc2, egcs

> 1. Can autodetection work well enough to support non-"C" characters in
>    strings and identifiers?

I don't know if it can work for C.  I do know that autodetection
cannot work for Java.  If a Java program is written in Latin-2,
then any non-ASCII characters have to be converted into the
corresponding Unicode at some stage in the translation process.
I.e. either the compiler or the assembler (or a pre-processor)
has to know the character set of the input file.

Yes, we could have auto-detection for C but not Java,
but that does seem rather clumsy.

In any case:  I think we want to support linking together
source files written in different locales.  E.g. libc
should be written in UTF-8, but an application may be written
in a local character set.  If we want these to be able to link,
either the linker has to be able to convert between character
encodings (which I think we agree we don't want), or
symbol names in .o files have to be in a common character set.
The only plausible contender for such a common character set
is UTF-8.

Given that symbols have to be in a common character encoding,
it follows that you cannot possibly do autodetection, at
least not for identifiers.

	--Per Bothner
Cygnus Solutions     bothner@cygnus.com     http://www.cygnus.com/~bothner

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: thoughts on martin's proposed patch for GCC and UTF-8
  1998-12-21  7:00                                       ` Zack Weinberg
@ 1998-12-21 18:58                                         ` Paul Eggert
  1998-12-21 19:07                                           ` Zack Weinberg
                                                             ` (2 more replies)
  0 siblings, 3 replies; 81+ messages in thread
From: Paul Eggert @ 1998-12-21 18:58 UTC (permalink / raw)
  To: zack; +Cc: rms, amylaar, martin, gcc2, egcs

   Date: Mon, 21 Dec 1998 10:00:13 -0500
   From: Zack Weinberg <zack@rabi.columbia.edu>

   GCC should only care about the character set, not the rest of the
   locale.  Therefore, it makes sense to use the charset names from the
   iconv library (part of glibc 2.1, also in Solaris and probably
   elsewhere) which are the names standardized by the MIME RFCs.

This is a good suggestion.  I assume you're saying that GCC should use
directives like `#charset "SJIS"' rather than directives like `#locale
"ja"', since the other attributes of "ja" are not important for GCC.

Unfortunately, this suggestion doesn't solve the problem of unportable
directives in practice, because the charset+encoding names are not
standardized well either.  E.g. for Shift-JIS, Solaris 7 has the
aliases "PCK" and "SJIS", glibc 2.0.108 has "SJIS", and MIME has
"Shift_JIS", "MS_Kanji", and "csShiftJIS".  It sounds like we might
slide through with "SJIS" for Shift-JIS, even though it's not in the
MIME standard; but for EUC-JIS, Solaris 7 has the name "eucJP", glibc
2.0.108 has "EUC-JP", and MIME has the aliases "EUC-JP",
"csEUCPkdFmtJapanese", and
"Extended_UNIX_Code_Packed_Format_for_Japanese"; so no single name
will do in practice for EUC-JIS.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: thoughts on martin's proposed patch for GCC and UTF-8
  1998-12-21 18:58                                         ` Paul Eggert
@ 1998-12-21 19:07                                           ` Zack Weinberg
  1998-12-21 19:28                                           ` Ulrich Drepper
  1998-12-23  0:36                                           ` Richard Stallman
  2 siblings, 0 replies; 81+ messages in thread
From: Zack Weinberg @ 1998-12-21 19:07 UTC (permalink / raw)
  To: Paul Eggert; +Cc: rms, amylaar, martin, gcc2, egcs

On Mon, 21 Dec 1998 18:57:01 -0800 (PST), Paul Eggert wrote:
>   Date: Mon, 21 Dec 1998 10:00:13 -0500
>   From: Zack Weinberg <zack@rabi.columbia.edu>
>
>   GCC should only care about the character set, not the rest of the
>   locale.  Therefore, it makes sense to use the charset names from the
>   iconv library (part of glibc 2.1, also in Solaris and probably
>   elsewhere) which are the names standardized by the MIME RFCs.
>
>This is a good suggestion.  I assume you're saying that GCC should use
>directives like `#charset "SJIS"' rather than directives like `#locale
>"ja"', since the other attributes of "ja" are not important for GCC.

Yes.

You have a point about this being something that belongs in the
environment.  I don't have any experience in this field and can't say
what makes the most sense.

>Unfortunately, this suggestion doesn't solve the problem of unportable
>directives in practice, because the charset+encoding names are not
>standardized well either.

The MIME RFCs try to standardize charset+encoding names.  It's the
closest to a proper standard there is, and all the iconv(3)
implementations I know of (all two of them :)  support all those
names.

I'm tempted to suggest that we use iconv to convert everything to
UTF-8 (Java seems to need this, and consistency is good) but only when
it comes with the system.  When it isn't available, we don't even try
to support extended charsets.  Trying to support all the different
incompatible encoding libraries out there would be a nightmare, and
importing glibc's iconv is impractical - it's >4megs of code.

zw

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: thoughts on martin's proposed patch for GCC and UTF-8
  1998-12-21 18:11                                       ` Paul Eggert
  1998-12-21 18:46                                         ` Per Bothner
@ 1998-12-21 19:16                                         ` Per Bothner
  1998-12-21 19:20                                           ` Per Bothner
  1998-12-23  0:35                                           ` Richard Stallman
  1998-12-22  3:09                                         ` Joern Rennecke
  1998-12-23  0:36                                         ` Richard Stallman
  3 siblings, 2 replies; 81+ messages in thread
From: Per Bothner @ 1998-12-21 19:16 UTC (permalink / raw)
  To: gcc2, egcs

I too am rather leery of using #pragma locale or any other in-band
indicator of the character set.

Paul mentions the problem of converting a set of text files from one
encoding to another.  Perhaps someone in Western Europe wants
to examine a program with its documentation, but both were written
in China.  It makes sense to convert it to the local character
set first.  If the original program contains #pragma locale statements,
these have to be translated also, but expecting a character-set
translation tool to understand C syntax seems a bit much.

If you *don't* do the translation, all your other tools (emacs,
less, grep, etc) need to understand the #pragma locale statement,
which again seems reasonable.

Another problem is that switching character encoding
in-band may be difficult.  Many libraries do not support it.
The Java FileReader class requires you to specify the encoding
at *open* time.  Of course there are various work-arounds.
For example, you can try opening the file in UTF-8 mode,
and if you see a #pragma locale statement, re-open it in the
appropriate mode.  Still, this is not something application
programmers should have to deal with.

The only general solution, I think, is for the *file system*
and/or input library to do the translation.  Preferably
each file should specify its encoding out-of-band,
just like MIME does.  As a back-up, the user should be
able to specify a default encoding (based on their locale),
and perhaps override it for individual files.

Still, while #pragma locale does have its problems, and
we must also support other ways of getting character
encoding information, it might still be a useful
*alternative* method for specifying the encoding.

One useful data point is that the XML specification provides
a command to specify the character encoding in use.
See: http://www.w3.org/TR/PR-xml-971208#NT-EncodingDecl
The XML spec also includes an appendix on auto-detection:
http://www.w3.org/TR/PR-xml-971208#sec-guessing

	--Per Bothner
Cygnus Solutions     bothner@cygnus.com     http://www.cygnus.com/~bothner

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: thoughts on martin's proposed patch for GCC and UTF-8
  1998-12-21 19:16                                         ` Per Bothner
@ 1998-12-21 19:20                                           ` Per Bothner
  1998-12-23  0:35                                           ` Richard Stallman
  1 sibling, 0 replies; 81+ messages in thread
From: Per Bothner @ 1998-12-21 19:20 UTC (permalink / raw)
  To: Per Bothner; +Cc: gcc2, egcs

> If you *don't* [convert your source files, but rely of #pragma locale
> or whatever], all your other tools (emacs,
> less, grep, etc) need to understand the #pragma locale statement,
> which again seems reasonable.

s/reasonable/unreasonable/
Sigh.

	--Per Bothner
Cygnus Solutions     bothner@cygnus.com     http://www.cygnus.com/~bothner

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: thoughts on martin's proposed patch for GCC and UTF-8
  1998-12-21 18:58                                         ` Paul Eggert
  1998-12-21 19:07                                           ` Zack Weinberg
@ 1998-12-21 19:28                                           ` Ulrich Drepper
  1998-12-23  0:36                                           ` Richard Stallman
  2 siblings, 0 replies; 81+ messages in thread
From: Ulrich Drepper @ 1998-12-21 19:28 UTC (permalink / raw)
  To: Paul Eggert; +Cc: zack, rms, amylaar, martin, gcc2, egcs

Paul Eggert <eggert@twinsun.com> writes:

> Unfortunately, this suggestion doesn't solve the problem of unportable
> directives in practice, because the charset+encoding names are not
> standardized well either.

Take a look at the gconv-modules file in glibc.  A similar file could
be part of gcc.  The user could extend it in whatever way s/he needs.

-- 
---------------.      drepper at gnu.org  ,-.   1325 Chesapeake Terrace
Ulrich Drepper  \    ,-------------------'   \  Sunnyvale, CA 94089 USA
Cygnus Solutions `--' drepper at cygnus.com   `------------------------

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: thoughts on martin's proposed patch for GCC and UTF-8
  1998-12-21 18:46                                         ` Per Bothner
@ 1998-12-21 19:44                                           ` Paul Eggert
  1998-12-21 20:30                                             ` Per Bothner
  1998-12-21 20:16                                           ` Paul Eggert
  1 sibling, 1 reply; 81+ messages in thread
From: Paul Eggert @ 1998-12-21 19:44 UTC (permalink / raw)
  To: bothner; +Cc: rms, amylaar, martin, gcc2, egcs

   Date: Mon, 21 Dec 1998 18:45:09 -0800
   From: Per Bothner <bothner@cygnus.com>

   Yes, we could have auto-detection for C but not Java,
   but that does seem rather clumsy.

It would be nice to use the same method for all languages, yes.
This is a good argument against autodetection.

   libc should be written in UTF-8, but an
   application may be written in a local character set.

libc's identifiers use only the "C" subset of ASCII, and therefore
libc will link to an application written in any locale, even if we use
the native multibyte encoding for identifiers.

   Given that [.o] symbols have to be in a common character encoding,
   it follows that you cannot possibly do autodetection, at least not
   for identifiers.

I don't see how this follows.  The compiler could use autodetection to
discover the input character set, and then translate the identifiers'
characters to UTF-8 when outputting assembly language.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: thoughts on martin's proposed patch for GCC and UTF-8
  1998-12-21 18:46                                         ` Per Bothner
  1998-12-21 19:44                                           ` Paul Eggert
@ 1998-12-21 20:16                                           ` Paul Eggert
  1998-12-21 20:28                                             ` Zack Weinberg
                                                               ` (2 more replies)
  1 sibling, 3 replies; 81+ messages in thread
From: Paul Eggert @ 1998-12-21 20:16 UTC (permalink / raw)
  To: bothner; +Cc: rms, amylaar, martin, gcc2, egcs

   Date: Mon, 21 Dec 1998 18:45:09 -0800
   From: Per Bothner <bothner@cygnus.com>

   we want to support linking together source files written in
   different locales...  The only plausible contender for such a common
   character set is UTF-8.

OK, how about this proposal?  I've tried to formulate it to address
everybody's concerns:

(1) The input character set is determined by a #pragma charset FOO
    (or _Pragma ("charset FOO")) directive, a compile-time option, or
    an environment variable (in that order).  For the last alternative,
    the default is to use setlocale (LC_CTYPE, ""); nl_langinfo
    (CODESET) if these two functions are available; failing that, the
    default is UTF-8.

(2) GCC uses the iconv function to translate from the input multibyte
    encoding to UTF-8 internally (for identifiers), and to determine
    character boundaries (in strings and comments).  If the
    implementation doesn't have iconv, GCC normally supports only
    UTF-8; however, if the installer wants to build a compiler that
    knows about other encodings (e.g. for cross-compilation), we
    supply an easy way to use glibc's iconv.  We can then remove the
    existing local_mblen function and friends, as they're no longer
    needed.

(3) GCC transliterates each \u escape in a string to the string's charset,
    which is specified as described in (1) above.

(4) After the translation in (3) (and after processing the other
    escapes like \n), GCC copies the contents of strings straight
    through to the assembler, if possible.  As is currently the case,
    characters like \ and " that need escaping are escaped.  However,
    a new feature is that if a string contains troublesome multibyte
    characters (e.g. the characters contain the bytes for ASCII \ or
    "), then those characters are output using octal escapes for each
    byte.  Similarly, if there is a string of multibyte characters not
    in the initial shift state that contains a \ or " byte, the entire
    string is output using octal escapes.

(5) GCC transliterates all identifiers to UTF-8 for the assembly
    language output.  If the input character set is a superset of UTF-8
    (e.g. ISO-2022-JP), then the extra information is lost.  If the
    assembler doesn't support UTF-8 identifiers, GCC transliterates
    identifiers to some ASCII escape sequence representing the UTF-8
    identifiers.

(6) GCC transliterates all identifiers to the working charset for all
    other output (e.g. diagnostics).

Here are some properties of this proposal:

* If the input file uses UTF-8, then the assembly language output
  uses UTF-8 as well.

* If the input file is a text file that does not use \u escapes, and
  does not use multibyte characters in identifiers, then the assembly
  language output is a text file that uses the same encoding.  This
  should accommodate existing practice reasonably well.

* The assembler needn't know about encodings.

* You can link together source files written in different locales,
  since all the identifiers are transliterated to some encoding
  of Unicode.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: thoughts on martin's proposed patch for GCC and UTF-8
  1998-12-21 20:16                                           ` Paul Eggert
@ 1998-12-21 20:28                                             ` Zack Weinberg
  1998-12-22  2:59                                               ` Paul Eggert
  1998-12-21 21:03                                             ` Per Bothner
  1998-12-25  0:05                                             ` Richard Stallman
  2 siblings, 1 reply; 81+ messages in thread
From: Zack Weinberg @ 1998-12-21 20:28 UTC (permalink / raw)
  To: Paul Eggert; +Cc: rms, amylaar, martin, gcc2, egcs

I like Paul's proposal in general but I have two nits relating to the
implementation.

>(1) The input character set is determined by #pragma charset FOO
>    (or _Pragma ("charset FOO")) directive, compile-time option, or
>    environment variable (in that order).

It is going to be extremely difficult to put _Pragma into the
preprocessor as it's specified in the current standard.  I'll talk
about this in another message.

>(2) GCC uses the iconv function to translate from the input multibyte
>    encoding to UTF-8 internally (for identifiers), and to determine
>    character boundaries (in strings and comments).

To do this we'd need to translate the entire file to UTF-8 in order to
know where identifiers begin and end, and then translate strings
back.  That can lose information - say strings are in ISO 2022-JP but
nothing else is.

zw

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: thoughts on martin's proposed patch for GCC and UTF-8
  1998-12-21 19:44                                           ` Paul Eggert
@ 1998-12-21 20:30                                             ` Per Bothner
  1998-12-23  0:35                                               ` Richard Stallman
  0 siblings, 1 reply; 81+ messages in thread
From: Per Bothner @ 1998-12-21 20:30 UTC (permalink / raw)
  To: Paul Eggert; +Cc: gcc2, egcs

> libc's identifiers use only the "C" subset of ASCII, and therefore
> libc will link to an application written in any locale, even if we use
> the native multibyte encoding for identifiers.

It was an example.  In practice, libraries meant for other-than-internal
use will probably stick to the C subset - but I don't want to
depend on that.

> I don't see how this follows.  The compiler could use autodetection to
> discover the input character set, and then translate the identifiers'
> characters to UTF-8 when outputting assembly language.

We've already established that the compiler cannot use autodetection
to discover the input character set except in very specific environments.
I guess I was responding to the idea of passing uninterpreted
bytes through, and pointing out that that is a bad idea for at least
external identifiers and for Java.

	--Per Bothner
Cygnus Solutions     bothner@cygnus.com     http://www.cygnus.com/~bothner

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: thoughts on martin's proposed patch for GCC and UTF-8
  1998-12-21 20:16                                           ` Paul Eggert
  1998-12-21 20:28                                             ` Zack Weinberg
@ 1998-12-21 21:03                                             ` Per Bothner
  1998-12-22  2:35                                               ` Paul Eggert
  1998-12-28  8:10                                               ` Martin von Loewis
  1998-12-25  0:05                                             ` Richard Stallman
  2 siblings, 2 replies; 81+ messages in thread
From: Per Bothner @ 1998-12-21 21:03 UTC (permalink / raw)
  To: Paul Eggert; +Cc: gcc2, egcs

> (2) GCC uses the iconv function to translate from the input multibyte
>    encoding to UTF-8 internally (for identifiers

and strings in Java.

> (3) GCC transliterates each \u escape in a string to the string's charset,
>     which is specified as described in (1) above.

Hm.  (1) above specifies the *file's* charset.  It does not follow
that the *string's* charset is the same.  Certainly for Java, it
would not be.

What happens to:
	wchar_t x = '\u1234';  /* or:  L'\u1234' */
are these different from:
	wchar_t x = (wchar_t) 0x1234;

I assume your proposal is that the string charset at least
by default should be the file charset except for Java where
the string charset is Unicode.  I don't know if that is
reasonable;  I guess so.

> If the input character set is a superset of UTF-8
> (e.g. ISO-2022-JP), then the extra information is lost.

I'm confused.  I thought that Unicode was specifically designed
so that distinct characters in existing Japanese character
standards were mapped into distinct Unicode characters.
Did I misunderstand, or is ISO-2022-JP not one of the "source"
character sets the Unicode designers used?

	--Per Bothner
Cygnus Solutions     bothner@cygnus.com     http://www.cygnus.com/~bothner

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: thoughts on martin's proposed patch for GCC and UTF-8
  1998-12-21  1:52                                       ` Andreas Schwab
@ 1998-12-22  1:09                                         ` Richard Stallman
  0 siblings, 0 replies; 81+ messages in thread
From: Richard Stallman @ 1998-12-22  1:09 UTC (permalink / raw)
  To: schwab; +Cc: eggert, amylaar, martin, gcc2, egcs

Amazing--they actually listened.

Thanks.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: thoughts on martin's proposed patch for GCC and UTF-8
  1998-12-21 21:03                                             ` Per Bothner
@ 1998-12-22  2:35                                               ` Paul Eggert
  1998-12-28  8:10                                               ` Martin von Loewis
  1 sibling, 0 replies; 81+ messages in thread
From: Paul Eggert @ 1998-12-22  2:35 UTC (permalink / raw)
  To: bothner; +Cc: gcc2, egcs

   Date: Mon, 21 Dec 1998 21:02:31 -0800
   From: Per Bothner <bothner@cygnus.com>

   > (3) GCC transliterates each \u escape in a string to the string's charset,
   >     which is specified as described in (1) above.

   Hm.  (1) above specifies the *file's* charset.  It does not follow
   that the *string's* charset is the same.  Certainly for Java, it
   would not be.

(1) also specifies the string's charset in C, because you can switch
charsets in the middle of a file e.g. with _Pragma ("charset Shift_JIS")
or whatever.

   What happens to:
	   wchar_t x = '\u1234';  /* or:  L'\u1234' */
   are these different from:
	   wchar_t x = (wchar_t) 0x1234;

Yes, e.g. the string's charset might specify JIS for wide characters.

   I assume your proposal is that the string charset at least
   by default should be the file charset except for Java where
   the string charset is Unicode.

Yes.

   > If the input character set is a superset of UTF-8
   > (e.g. ISO-2022-JP), then the extra information is lost.

   I'm confused.  I thought that Unicode was specifically designed
   so that distinct characters in existing Japanese character
   standards were mapped into distinct Unicode characters.
   Did I misunderstand, or is ISO-2022-JP not one of the "source"
   character sets the Unicode designers used?

You understood correctly.  To some extent, ISO-2022-JP and Unicode are
competing standards.  ISO-2022-JP distinguishes between (say) the
Japanese and Chinese forms of the same character, whereas Unicode does
not.

Right now, my impression is that ISO-2022-JP is used more often in
the Japanese world than Unicode is.  This is certainly true for email.
Microsoft is pushing Unicode mightily in the DOS and NT domains,
though.

There is little call for distinguishing Chinese from Japanese in
identifiers.  So it's OK if GCC supports only the Unicode ``subset''
of ISO-2022-JP in identifiers.

If there are ISO-2022-JP partisans who are disturbed by this part of
my proposal, then I have some reassurance for them.  Rumor has it that
ISO 10646 might be officially extended so that it will become a
functional superset of ISO-2022-JP.  (This is the ``plane-14''
language-tagging effort.)  This will require more than 16 bits per
character, so it won't be Unicode, and presumably Java char and string
won't support it (unless Java is also extended); but C and C++ will
support plane-14, because they already have \u escapes for 32-bit
characters, and allow UTF-8 implementations (which also support
32-bit chars).  If and when the plane-14 proposal becomes a standard,
then C and C++ could distinguish between Chinese and Japanese in
identifiers under my proposal.

Isn't internationalization fun?

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: thoughts on martin's proposed patch for GCC and UTF-8
  1998-12-21 20:28                                             ` Zack Weinberg
@ 1998-12-22  2:59                                               ` Paul Eggert
  1998-12-23 17:16                                                 ` Richard Stallman
  0 siblings, 1 reply; 81+ messages in thread
From: Paul Eggert @ 1998-12-22  2:59 UTC (permalink / raw)
  To: zack; +Cc: rms, amylaar, martin, gcc2, egcs

   Date: Mon, 21 Dec 1998 23:28:09 -0500
   From: Zack Weinberg <zack@rabi.columbia.edu>

   >(2) GCC uses the iconv function to translate from the input multibyte
   >    encoding to UTF-8 internally (for identifiers), and to determine
   >    character boundaries (in strings and comments).

   To do this we'd need to translate the entire file to UTF-8 in order to
   know where identifiers begin and end, and then translate strings
   back.

We can avoid this problem by using iconv in ``byte-at-a-time'' mode.
I.e. we can use iconv to discover the minimal nonempty sequence of
input bytes S such that S is a multibyte character string, and such
that the first char after S is a "C" char.  If we find such an S in a
string, we copy its value through unchanged; if we find it in an
identifier, we translate it to UTF-8.  This would mean we wouldn't
have to translate the entire file to UTF-8.

As an optimization, we don't need to call iconv at all in the common
case where the input file uses only the "C" subset of ASCII.  This is
because such files cannot contain multibyte chars.  We need to call
iconv only if the input contains non-"C" bytes (e.g. bytes with the
top bit on, or the ESC character).

It might be nice if there were something faster than invoking iconv in
byte-at-a-time mode, for files that contain lots of multibyte chars.
E.g. it might be nice if we could use an efficient primitive that acts
like iconv, except it stops translating when it finds a "C" char.  If
necessary to improve performance, we can add such a primitive to
glibc, and use it if it's available.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: thoughts on martin's proposed patch for GCC and UTF-8
  1998-12-21 18:11                                       ` Paul Eggert
  1998-12-21 18:46                                         ` Per Bothner
  1998-12-21 19:16                                         ` Per Bothner
@ 1998-12-22  3:09                                         ` Joern Rennecke
  1998-12-22 10:52                                           ` Paul Eggert
  1998-12-23  0:36                                         ` Richard Stallman
  3 siblings, 1 reply; 81+ messages in thread
From: Joern Rennecke @ 1998-12-22  3:09 UTC (permalink / raw)
  To: Paul Eggert; +Cc: rms, martin, gcc2, egcs

> 	In our company, when we import files from other sources, we typically
> 	transliterate them to an encoding suitable for our preferred working
> 	locale.  This is the only plausible way to do things; otherwise, few
> 	of our text-processing tools would work.  Even Emacs supports only
> 	_some_ of the Japanese encodings that we import -- e.g. it doesn't
> 	support UTF-8 or DBCS.  Most other tools support only one character
> 	set and encoding at a time, and it is set from the locale environment
> 	variables in the usual way.
> 
> 	If #locale were part of the source, we'd have more work to do, since
> 	we'd also have to munge the #locale directives of imported sources.
> 	This would be doable, but it would be a hassle, particularly when
> 	trading patches with our correspondents who use different encodings.
> 	I can easily see where people would screw this up.

Ok, how about not naming the locale, but describing it in such a way
that it gets automatically adjusted when you transliterate the file?

I.e. you pick a set of non-ASCII characters that are sufficient to
identify the locale, and for each of them, state their name, followed by
their encoding, followed by an ASCII delimiter that makes it possible to
detect where the end of a multibyte encoding is.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: thoughts on martin's proposed patch for GCC and UTF-8
  1998-12-22  3:09                                         ` Joern Rennecke
@ 1998-12-22 10:52                                           ` Paul Eggert
  0 siblings, 0 replies; 81+ messages in thread
From: Paul Eggert @ 1998-12-22 10:52 UTC (permalink / raw)
  To: amylaar; +Cc: rms, martin, gcc2, egcs

   From: Joern Rennecke <amylaar@cygnus.co.uk>
   Date: Tue, 22 Dec 1998 11:08:52 +0000 (GMT)

   pick a set of non-ASCII characters that are sufficient to identify
   the locale, and for each of them, state their name, followed by
   their encoding, followed by an ASCII delimiter that makes it
   possible to detect where the end of a multibyte encoding is.

I think that would be too brittle to work well in practice.

The magic cookie would be long and would be hard to explain to users.
For example, they couldn't just cut and paste the magic cookie's bytes
out of a recipe file; instead, they'd have to transliterate it to
their locale's character set, and they'd have to know what to do when
their locale can't represent all the characters.

Also, the set of characters would have to be large -- enough to
distinguish all the ISO 8859 variants, among other things.  Worse, the
set would have to change with time as new character sets were added to
GCC's set of supported charsets.  I'd hate to see a new GCC release
required because of the Euro!

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: thoughts on martin's proposed patch for GCC and UTF-8
  1998-12-21 19:16                                         ` Per Bothner
  1998-12-21 19:20                                           ` Per Bothner
@ 1998-12-23  0:35                                           ` Richard Stallman
  1 sibling, 0 replies; 81+ messages in thread
From: Richard Stallman @ 1998-12-23  0:35 UTC (permalink / raw)
  To: bothner; +Cc: gcc2, egcs

    If you *don't* do the translation, all your other tools (emacs,
    less, grep, etc) need to understand the #pragma locale statement,

On the contrary, most other programs have no need to understand it.
Most of these "tools" don't pay attention to the character encoding.
Only Emacs does--and it has its own way you can specify the encoding,
if it guesses wrong.

It is ok for Emacs to guess, since the results are shown to you
straightaway; if it guessed wrong, you will see that on the screen.

In many cases, you don't need to care.  If you visit a file, change
some text at the beginning and save it, and if some part later in the
file (which you did not look at) contained some Latin-N characters, it
makes no difference to you whether Emacs thought they were Latin-1 or
Latin-2.  All that matters is that they are saved the same as they
were before.

But if GCC gets this wrong, you will get errors or incorrect behavior
later on, and it may take some time for you to even notice, let alone
figure out the cause.

    Another problem is that switching character encoding
    in-band may be difficult.  Many libraries do not support it.
    The Java FileReader class requires you to specify the encoding
    at *open* time.

GCC is not written in Java and does not use this class,
so this limitation is not a factor for us.

      Preferably
    each file should specify its encoding out-of-band,
    just like MIME does.

I would not object to this sort of system, if users were happy with
it.  It would avoid depending on the environment.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: thoughts on martin's proposed patch for GCC and UTF-8
  1998-12-21 20:30                                             ` Per Bothner
@ 1998-12-23  0:35                                               ` Richard Stallman
  0 siblings, 0 replies; 81+ messages in thread
From: Richard Stallman @ 1998-12-23  0:35 UTC (permalink / raw)
  To: bothner; +Cc: eggert, gcc2, egcs

    I guess I was responding to the idea of passing uninterpreted
    bytes through, and pointing out that that is a bad idea for at least
    external identifiers and for Java.

For non-ASCII bytes in external identifiers, we can't simply
pass them through, because many assemblers won't accept them
as identifier characters.  Some sort of encoding is needed
in the .s file.  I proposed one, but others might be better.

Conversion to UTF-8 won't work, because the assembler probably won't
accept that either.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: thoughts on martin's proposed patch for GCC and UTF-8
  1998-12-21 18:11                                       ` Paul Eggert
                                                           ` (2 preceding siblings ...)
  1998-12-22  3:09                                         ` Joern Rennecke
@ 1998-12-23  0:36                                         ` Richard Stallman
  3 siblings, 0 replies; 81+ messages in thread
From: Richard Stallman @ 1998-12-23  0:36 UTC (permalink / raw)
  To: eggert; +Cc: amylaar, martin, gcc2, egcs

    The committee is also requiring _Pragma("FOO") to have the same
    meaning as #pragma FOO.  _Pragma("FOO") can be output by macros.
    Does this overcome your objection to pragmas?

Yes.  I am still not convinced that we should use #pragma for this
particular job, but _Pragma certainly solves that problem for pragmas
in general.

	    GCC currently doesn't have directives like this:

	    #character-set ASCII
	    #character-set EBCDIC

	    because they're not needed

GCC does not have any way to specify ASCII vs EBCDIC at run time.
The choice of ASCII vs EBCDIC is fixed, given your host platform.
So it does not shed any light on the question at hand.

    No, it doesn't avoid the danger.  You can specify the wrong locale
    just as easily, if not more easily, with #locale -- e.g. see the
    transliteration scenario in my longer answer above.

ANY method of specifying the locale leaves a chance you get it wrong
when you CHANGE the locale.  But at that time, you will be on the
alert for having made a mistake in operating on the file.

Using an environment variable makes lossage possible any time, if you
changed your environment for some reason--even when the same file
worked correctly yesterday and you have not changed it in weeks.

We cannot get rid of the former problem, so we have to accept it.
But we can get rid of the latter problem, and we should.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: thoughts on martin's proposed patch for GCC and UTF-8
  1998-12-21 18:58                                         ` Paul Eggert
  1998-12-21 19:07                                           ` Zack Weinberg
  1998-12-21 19:28                                           ` Ulrich Drepper
@ 1998-12-23  0:36                                           ` Richard Stallman
  2 siblings, 0 replies; 81+ messages in thread
From: Richard Stallman @ 1998-12-23  0:36 UTC (permalink / raw)
  To: eggert; +Cc: zack, amylaar, martin, gcc2, egcs

      It sounds like we might
    slide through with "SJIS" for Shift-JIS, even though it's not in the
    MIME standard; but for EUC-JIS, Solaris 7 has the name "eucJP", glibc
    2.0.108 has "EUC-JP", and MIME has the aliases "EUC-JP",
    "csEUCPkdFmtJapanese", and
    "Extended_UNIX_Code_Packed_Format_for_Japanese";

At worst, we may have to pick one of them.

						     so no single name
    will do in practice for EUC-JIS.

Any one of them will do in practice--we just have to document which
one to use.  Or we could support all three of them, as equivalent
aliases.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: thoughts on martin's proposed patch for GCC and UTF-8
  1998-12-22  2:59                                               ` Paul Eggert
@ 1998-12-23 17:16                                                 ` Richard Stallman
  1998-12-23 18:11                                                   ` Zack Weinberg
  1998-12-23 19:21                                                   ` Paul Eggert
  0 siblings, 2 replies; 81+ messages in thread
From: Richard Stallman @ 1998-12-23 17:16 UTC (permalink / raw)
  To: eggert; +Cc: zack, amylaar, martin, gcc2, egcs

The idea of translating everything into UTF-8 is not useful for C.

It is pointless and mistaken to translate symbols to UTF-8.  The
assembler won't accept them in UTF-8, and users who use other
encodings wouldn't want them in UTF-8 anyway.

It is pointless and buggy to translate strings to UTF-8 and then
translate them back.  As Handa pointed out, it's impossible to
translate them back.

If that translation is required for correct handling of Java, then
let's do it for Java.  But not for C.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: thoughts on martin's proposed patch for GCC and UTF-8
  1998-12-23 17:16                                                 ` Richard Stallman
@ 1998-12-23 18:11                                                   ` Zack Weinberg
  1998-12-25  0:05                                                     ` Richard Stallman
  1998-12-23 19:21                                                   ` Paul Eggert
  1 sibling, 1 reply; 81+ messages in thread
From: Zack Weinberg @ 1998-12-23 18:11 UTC (permalink / raw)
  To: rms; +Cc: zack, amylaar, martin, gcc2, egcs

On Wed, 23 Dec 1998 18:16:42 -0700 (MST), Richard Stallman wrote:
>The idea of translating everything into UTF-8 is not useful for C.
>
>It is pointless and mistaken to translate symbols to UTF-8.  The
>assembler won't accept them in UTF-8, and users who use other
>encodings wouldn't want them in UTF-8 anyway.

I think you may have missed a few things.  gas has no problem with
symbols in UTF-8 (I am told).  ASCII <-> UTF-8 is a no-op, and gcc
does not currently accept non-ASCII identifiers, so no existing code
will be broken by the change.  Converting all symbol names to UTF-8 is
desirable for all languages for two reasons.  First, Java requires
this and we want to be able to link modules written in Java with
modules written in any other language supported by gcc.  Second, we
want to be able to link modules written in encoding X with other
modules in encoding Y.  One way translation of all identifiers to UTF8
achieves this.

>It is pointless and buggy to translate strings to UTF-8 and then
>translate them back.  As Handa pointed out, it's impossible to
>translate them back.

No argument here.

zw

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: thoughts on martin's proposed patch for GCC and UTF-8
  1998-12-23 17:16                                                 ` Richard Stallman
  1998-12-23 18:11                                                   ` Zack Weinberg
@ 1998-12-23 19:21                                                   ` Paul Eggert
  1998-12-25  0:05                                                     ` Richard Stallman
  1998-12-25  0:05                                                     ` Richard Stallman
  1 sibling, 2 replies; 81+ messages in thread
From: Paul Eggert @ 1998-12-23 19:21 UTC (permalink / raw)
  To: rms; +Cc: zack, amylaar, martin, gcc2, egcs

   Date: Wed, 23 Dec 1998 18:16:42 -0700 (MST)
   From: Richard Stallman <rms@gnu.org>

   It is pointless and buggy to translate strings to UTF-8 and then
   translate them back.

I agree, and my proposal doesn't do that for C.  String bytes are
copied straight through.

   It is pointless and mistaken to translate symbols to UTF-8.  The
   assembler won't accept them in UTF-8, and users who use other
   encodings wouldn't want them in UTF-8 anyway.

For non-GNU platforms like Solaris, we'll have to follow the
platform's convention in this area, so that GCC-compiled code can link
to non-GCC-compiled code.  Most likely we'll need a way to configure
the method GCC uses to output non-ASCII identifiers in assembly
language, as there probably won't be a universally accepted standard
method.  Possibly, some platforms will require symbols to be
translated to a canonical form (allowing cross-locale linking) and
other platforms will just use the symbol bytes as-is (disallowing
cross-locale linking); GCC will just have to go with the flow.

For GNU platforms, my understanding is that GAS allows arbitrary bytes
in symbols, so it is plausible to use UTF-8 for the canonical symbol
encoding.  If we go this route, assembler files will be UTF-8.  In
general, GCC will have to use \x escapes in strings to represent the
bytes of non-ASCII characters, so that string bytes are copied
straight-through without loss of information -- but \x escapes will be
required no matter what solution is employed, since we want the
assembler to be locale-independent, so requiring \x escapes is not a
major loss.

Another possibility for GNU is to mangle symbols into some form of
ASCII.  To do this, we'll have to come up with a mangling method that
is compatible with existing C++ mangling, and which doesn't usurp
existing user identifier space.  You proposed a method, but someone
else found a problem with it (sorry, I don't recall the details).
Even if we solve the mangling problem, though, the ASCII-only
name-mangling method seems less useful than UTF-8 name mangling.
Neither mangling method allows an arbitrary native encoding
(e.g. Shift-JIS or ISO-2022-JP) to be used uniformly, but at least the
UTF-8 mangling method allows UTF-8 to be used uniformly.


By the way, even if we don't care about linking from different
locales, GCC must still translate symbols to a canonical form.  For
example, suppose `@' denotes the character MICRO SIGN (Unicode
character 00b5).  Then `@' (1 character) and `\u00b5' (6 characters)
are different spellings of the same symbol, and GCC must unify the two
spellings.  This is true no matter how the symbol is represented in
assembly language output.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: thoughts on martin's proposed patch for GCC and UTF-8
  1998-12-21 20:16                                           ` Paul Eggert
  1998-12-21 20:28                                             ` Zack Weinberg
  1998-12-21 21:03                                             ` Per Bothner
@ 1998-12-25  0:05                                             ` Richard Stallman
  1998-12-26  0:36                                               ` Paul Eggert
  2 siblings, 1 reply; 81+ messages in thread
From: Richard Stallman @ 1998-12-25  0:05 UTC (permalink / raw)
  To: eggert; +Cc: bothner, amylaar, martin, gcc2, egcs

    OK, how about this proposal?  I've tried to formulate it to address
    everybody's concerns:

You've designed this based on the assumption of converting everything
to UTF-8.  It's useful to offer that as one alternative, and your
design seems reasonable as a way to do it.  But there also needs to be
a mode which does not convert.

There are some demanding situations in which people would want to link
together the results of compiling files written in different
encodings, but that will be rare.  It will be much more common for
people to use one encoding.  So the default mode should be not to
convert, and in that case, GCC doesn't need to know what the encoding
is (unless \u is used).

I expect that all general-purpose libraries will limit themselves to
ASCII for symbol names for many years to come.  That is the wise
choice for anyone writing a general-purpose library, and the GNU
coding standards will call for that.  (Perhaps the situations will be
different in the future, if use of Unicode becomes universal, but that
will take years.)  If and when it happens, we can change the
standards.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: thoughts on martin's proposed patch for GCC and UTF-8
  1998-12-23 19:21                                                   ` Paul Eggert
@ 1998-12-25  0:05                                                     ` Richard Stallman
  1998-12-25  0:05                                                     ` Richard Stallman
  1 sibling, 0 replies; 81+ messages in thread
From: Richard Stallman @ 1998-12-25  0:05 UTC (permalink / raw)
  To: eggert; +Cc: zack, amylaar, martin, gcc2, egcs

    Even if we solve the mangling problem, though, the ASCII-only
    name-mangling method seems less useful than UTF-8 name mangling.
    Neither mangling method allows an arbitrary native encoding
    (e.g. Shift-JIS or ISO-2022-JP) to be used uniformly, 

ASCII-only name mangling ought to achieve that.  Could you
please explain why you think it will not?

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: thoughts on martin's proposed patch for GCC and UTF-8
  1998-12-23 19:21                                                   ` Paul Eggert
  1998-12-25  0:05                                                     ` Richard Stallman
@ 1998-12-25  0:05                                                     ` Richard Stallman
  1 sibling, 0 replies; 81+ messages in thread
From: Richard Stallman @ 1998-12-25  0:05 UTC (permalink / raw)
  To: eggert; +Cc: zack, amylaar, martin, gcc2, egcs

    By the way, even if we don't care about linking from different
    locales, GCC must still translate symbols to a canonical form.  For
    example, suppose `@' denotes the character MICRO SIGN (Unicode
    character 00b5).  Then `@' (1 character) and `\u00b5' (6 characters)
    are different spellings of the same symbol, and GCC must unify the two
    spellings.  This is true no matter how the symbol is represented in
    assembly language output.

That's right: GCC will have to convert \u00b5 into whatever is
the proper thing to output for @, and likewise for unicode characters
that have a multibyte representation in the encoding system it is using.

This is why GCC has to depend on the encoding, when \u is used.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: thoughts on martin's proposed patch for GCC and UTF-8
  1998-12-23 18:11                                                   ` Zack Weinberg
@ 1998-12-25  0:05                                                     ` Richard Stallman
  1998-12-28  5:55                                                       ` Martin von Loewis
  0 siblings, 1 reply; 81+ messages in thread
From: Richard Stallman @ 1998-12-25  0:05 UTC (permalink / raw)
  To: zack; +Cc: zack, amylaar, martin, gcc2, egcs

    I think you may have missed a few things.  gas has no problem with
    symbols in UTF-8 (I am told).

GCC works with many assemblers.  I doubt that they all support UTF-8,
and it would be hard even to check them all.  So I think we will have
to mangle non-ASCII byte values somehow in the .s files, whether the
encoding used is UTF-8 or not.

Anyway, there are other reasons not to always use UTF-8.

      ASCII <-> UTF-8 is a no-op, and gcc
    does not currently accept non-ASCII identifiers, so no existing code
    will be broken by the change.

This is true, but does not eliminate the problems.

      Second, we
    want to be able to link modules written in encoding X with other
    modules in encoding Y.

This is a useful feature.  However, not needing to specify what
encoding the file is in is also a useful feature.

These two features are inherently incompatible, so perhaps we should
give the user a choice, through an option.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: thoughts on martin's proposed patch for GCC and UTF-8
  1998-12-25  0:05                                             ` Richard Stallman
@ 1998-12-26  0:36                                               ` Paul Eggert
  1998-12-27 17:24                                                 ` Richard Stallman
  0 siblings, 1 reply; 81+ messages in thread
From: Paul Eggert @ 1998-12-26  0:36 UTC (permalink / raw)
  To: rms; +Cc: zack, bothner, amylaar, martin, gcc2, egcs

   Date: Fri, 25 Dec 1998 03:07:56 -0500
   From: Richard Stallman <rms@gnu.org>

       Even if we solve the mangling problem, though, the ASCII-only
       name-mangling method seems less useful than UTF-8 name mangling.
       Neither mangling method allows an arbitrary native encoding
       (e.g. Shift-JIS or ISO-2022-JP) to be used uniformly, 

   ASCII-only name mangling ought to achieve that.  Could you
   please explain why you think it will not?

Here's what I was thinking:

* Unsafe native encodings can't be used in assembly-language strings.
  The simplest way to handle this is to do what GCC currently does:
  escape non-ASCII bytes in assembly-language strings using notation
  like `\377'.

* Hence, if ASCII-only name mangling is also used, assembly language
  files will contain only ASCII, regardless of the input encoding.

* This will work, but it's unfriendly for non-English writers, because
  it means that assembly language uses ASCII instead of the native
  encoding -- i.e. the native encoding isn't being used uniformly in
  both source and assembly language output.  E.g. suppose we have the
  following code:

	const char message[] = "contents";

  except that the words `message' and `contents' are in Japanese.  A
  Japanese reader would naturally desire to see something like the
  following assembly language output:

	message:
		.asciz	"contents"

  except, of course, the words `message' and `contents' would be in
  Japanese.  Unfortunately, though, with ASCII name mangling, and with
  string mangling as described above, the Japanese reader will see
  something like the following instead:

	.x8c.x32.x9c.x41.x91.x32.xac.x90:
		.asciz "\200 \x309!\x240@\x201\\\x300\""

  which is painful to work with.

If GCC outputs bytes with the top bit on in assembly language
identifiers and strings, then at least safe encodings like UTF-8, ISO
8859, and EUC will yield the naturally desired assembly language
output.  (Shift-JIS and other unsafe encodings may still yield
undesirable escapes in output, but this is no worse than the escapes
they already get.)  I believe this is what is partly motivating
martin's proposed patch, and I'm sympathetic to this motivation.

   Date: Fri, 25 Dec 1998 03:09:25 -0500
   From: Richard Stallman <rms@gnu.org>

   the default mode should be not to convert, and in that case, GCC
   doesn't need to know what the encoding is (unless \u is used).

Even when not converting, GCC needs to know the input encoding if it's
an unsafe one like Shift-JIS or ISO-2022-JP (``unsafe'' meaning ``some
multibyte chars contain ASCII bytes'') -- otherwise GCC won't be able
to parse comments, strings, and identifiers correctly.  Much (if not
most) east Asian text currently uses unsafe encodings, so this is not
a minor point.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: thoughts on martin's proposed patch for GCC and UTF-8
  1998-12-26  0:36                                               ` Paul Eggert
@ 1998-12-27 17:24                                                 ` Richard Stallman
  0 siblings, 0 replies; 81+ messages in thread
From: Richard Stallman @ 1998-12-27 17:24 UTC (permalink / raw)
  To: eggert; +Cc: zack, bothner, amylaar, martin, gcc2, egcs

       ASCII-only name mangling ought to achieve that.  Could you
       please explain why you think it will not?

    * This will work, but it's unfriendly for non-English writers, 

Making .s files "friendly" for non-ASCII scripts
is a very low-priority goal.  There are, or have been, some
assemblers which required all strings to be expressed with .byte.
So what?  Above all, we have to do the right thing for compiler
*users*; making .s files look nice has to come second.

    Even when not converting, GCC needs to know the input encoding if it's
    an unsafe one like Shift-JIS or ISO-2022-JP (``unsafe'' meaning ``some
    multibyte chars contain ASCII bytes'')

Yes, that's true.  However, we can arrange for GCC to do the right
thing without knowing the actual encoding, when the real encoding is a
safe one--as long as GCC does not think you have specified an unsafe
encoding.


* Re: thoughts on martin's proposed patch for GCC and UTF-8
  1998-12-25  0:05                                                     ` Richard Stallman
@ 1998-12-28  5:55                                                       ` Martin von Loewis
  1998-12-30  5:19                                                         ` Richard Stallman
  0 siblings, 1 reply; 81+ messages in thread
From: Martin von Loewis @ 1998-12-28  5:55 UTC (permalink / raw)
  To: rms; +Cc: zack, zack, amylaar, gcc2, egcs

> GCC works with many assemblers.  I doubt that they all support UTF-8,
> and it would be hard even to check them all.  So I think we will have
> to mangle non-ASCII byte values somehow in the .s files, whether the
> encoding used is UTF-8 or not.

Alternatively, we could reject code that uses non-ASCII identifiers if
the assembler cannot reasonably represent them. That might mean that
the full feature set is only available on GNU systems. I could live
with that.

Regards,
Martin


* Re: thoughts on martin's proposed patch for GCC and UTF-8
  1998-12-21 21:03                                             ` Per Bothner
  1998-12-22  2:35                                               ` Paul Eggert
@ 1998-12-28  8:10                                               ` Martin von Loewis
  1998-12-28 11:00                                                 ` Per Bothner
  1 sibling, 1 reply; 81+ messages in thread
From: Martin von Loewis @ 1998-12-28  8:10 UTC (permalink / raw)
  To: bothner; +Cc: eggert, gcc2, egcs

> I'm confused.  I thought that Unicode was specifically designed
> so that distinct characters in existing Japanese character
> standards were mapped into distinct Unicode characters.

Paul already answered that, I'd like to add from a different angle.

ISO 2022 uses escape sequences to switch between different character
sets. ISO-2022-JP combines four different character sets in this way.
Now, there are potential overlaps between the character sets. In such
cases, Unicode typically unifies the overlapping characters, whereas
ISO 2022 leaves them as-is.

The argument is over which is the right thing. For example, there are four
encodings for "LATIN CAPITAL LETTER A": 
ESC ( B A         (ASCII)
ESC ( J A         (JIS X 0201)
ESC $ @ # A       (JIS X 0208-1978)
ESC $ B # A       (JIS X 0208-1983) (*)
Unicode has only one character here (U+0041). In other places, Unicode
probably was wrong to unify (Han Unification).

Not that I want to push a particular solution, but converted to Unicode
and encoded in UTF-8, we would get the following for all four encodings:
A

Regards,
Martin

(*) Somebody correct me if my tables are wrong. The three-byte
escape sequence can be omitted if the preceding characters are already
in this character set.
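[Editor's note: the unification Martin describes can be checked with
Python's iso2022_jp codec.  A sketch, showing only the two single-byte
prefixes from the table above; the JIS X 0208 forms are omitted because
their standard mapping tables target the fullwidth compatibility
characters rather than U+0041.]

```python
# Two of the four escape prefixes above, applied to the byte 0x41.
ascii_form = b'\x1b(BA'   # ESC ( B selects ASCII
jis_x_0201 = b'\x1b(JA'   # ESC ( J selects JIS X 0201 Roman
# Both decode to the single unified code point U+0041:
assert ascii_form.decode('iso2022_jp') == 'A'
assert jis_x_0201.decode('iso2022_jp') == 'A'
```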


* Re: thoughts on martin's proposed patch for GCC and UTF-8
  1998-12-28  8:10                                               ` Martin von Loewis
@ 1998-12-28 11:00                                                 ` Per Bothner
  0 siblings, 0 replies; 81+ messages in thread
From: Per Bothner @ 1998-12-28 11:00 UTC (permalink / raw)
  To: Martin von Loewis; +Cc: gcc2, egcs

> In other places, Unicode
> probably was wrong to unify (Han Unification).

Unicode may have been wrong politically (but had no choice given
a 16-bit limit).

I still haven't seen any plausible argument explaining why
Han Unification is wrong in theory or causes problems in
practice.  The problem with unification concerns high-quality
multi-lingual documents - which are rare, and which should be
using language or font attributes to disambiguate.  (I.e.
if you need to explicitly distinguish Chinese and Japanese,
you should be using styled text, not plain text.)

	--Per Bothner
Cygnus Solutions     bothner@cygnus.com     http://www.cygnus.com/~bothner


* Re: thoughts on martin's proposed patch for GCC and UTF-8
  1998-12-28  5:55                                                       ` Martin von Loewis
@ 1998-12-30  5:19                                                         ` Richard Stallman
  0 siblings, 0 replies; 81+ messages in thread
From: Richard Stallman @ 1998-12-30  5:19 UTC (permalink / raw)
  To: martin; +Cc: zack, zack, amylaar, gcc2, egcs

    Alternatively, we could reject code that uses non-ASCII identifiers if
    the assembler cannot reasonably represent them. That might mean that
    the full feature set is only available on GNU systems. I could live
    with that.

It would be much better to mangle the names.
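[Editor's note: a toy version of the mangling RMS suggests, sketched in
Python for illustration.  The scheme here (non-ASCII UTF-8 bytes become
_xx hex escapes) is purely hypothetical and is not what GCC emits; the
thread does not settle on a concrete scheme.]

```python
def mangle(identifier: str) -> str:
    """Mangle an identifier into pure ASCII for a .s file.

    ASCII letters, digits, and '_' pass through; every other byte of
    the UTF-8 encoding becomes _xx (two lowercase hex digits).  A real
    scheme would also escape '_' itself to stay unambiguous.
    """
    out = []
    for byte in identifier.encode('utf-8'):
        ch = chr(byte)
        if ch.isascii() and (ch.isalnum() or ch == '_'):
            out.append(ch)
        else:
            out.append('_%02x' % byte)
    return ''.join(out)

# The identifier µ (U+00B5, UTF-8 bytes C2 B5) from earlier in the
# thread mangles to an assembler-safe ASCII name:
assert mangle('µvalue') == '_c2_b5value'
```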


end of thread, other threads:[~1998-12-30  5:19 UTC | newest]

Thread overview: 81+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <19981204032449.3033.qmail@comton.airs.com>
     [not found] ` <199812060519.VAA07309@shade.twinsun.com>
     [not found]   ` <366C0645.61C48A38@cygnus.com>
     [not found]     ` <199812080057.QAA00491@shade.twinsun.com>
     [not found]       ` <366D460E.4FB0ECD0@cygnus.com>
1998-12-09 13:44         ` thoughts on martin's proposed patch for GCC and UTF-8 Paul Eggert
1998-12-09 14:38           ` Martin von Loewis
1998-12-09 14:56             ` Per Bothner
1998-12-09 22:57               ` Martin von Loewis
1998-12-09 23:16                 ` Per Bothner
1998-12-11 19:27                   ` Paul Eggert
1998-12-09 17:46             ` Paul Eggert
1998-12-09 18:01               ` Tim Hollebeek
1998-12-10  5:58                 ` Craig Burley
1998-12-10 10:21                   ` Tim Hollebeek
1998-12-10 11:50                     ` Craig Burley
1998-12-10 14:23                   ` Chip Salzenberg
1998-12-09 23:03               ` Per Bothner
1998-12-10  7:49                 ` Ian Lance Taylor
1998-12-11 19:23                 ` Paul Eggert
1998-12-12  2:21                   ` Martin von Loewis
1998-12-13  6:23                     ` Richard Stallman
1998-12-13 12:27                       ` Martin von Loewis
1998-12-14  2:22                         ` Richard Stallman
1998-12-15 10:47                           ` Paul Eggert
1998-12-17 18:10                             ` Richard Stallman
1998-12-17 21:41                               ` Paul Eggert
1998-12-18  1:23                                 ` Martin von Loewis
1998-12-17 23:55                               ` Joern Rennecke
1998-12-19  5:13                                 ` Richard Stallman
1998-12-19 10:36                                   ` Paul Eggert
1998-12-20 20:29                                     ` Richard Stallman
1998-12-21  7:00                                       ` Zack Weinberg
1998-12-21 18:58                                         ` Paul Eggert
1998-12-21 19:07                                           ` Zack Weinberg
1998-12-21 19:28                                           ` Ulrich Drepper
1998-12-23  0:36                                           ` Richard Stallman
1998-12-21 18:11                                       ` Paul Eggert
1998-12-21 18:46                                         ` Per Bothner
1998-12-21 19:44                                           ` Paul Eggert
1998-12-21 20:30                                             ` Per Bothner
1998-12-23  0:35                                               ` Richard Stallman
1998-12-21 20:16                                           ` Paul Eggert
1998-12-21 20:28                                             ` Zack Weinberg
1998-12-22  2:59                                               ` Paul Eggert
1998-12-23 17:16                                                 ` Richard Stallman
1998-12-23 18:11                                                   ` Zack Weinberg
1998-12-25  0:05                                                     ` Richard Stallman
1998-12-28  5:55                                                       ` Martin von Loewis
1998-12-30  5:19                                                         ` Richard Stallman
1998-12-23 19:21                                                   ` Paul Eggert
1998-12-25  0:05                                                     ` Richard Stallman
1998-12-25  0:05                                                     ` Richard Stallman
1998-12-21 21:03                                             ` Per Bothner
1998-12-22  2:35                                               ` Paul Eggert
1998-12-28  8:10                                               ` Martin von Loewis
1998-12-28 11:00                                                 ` Per Bothner
1998-12-25  0:05                                             ` Richard Stallman
1998-12-26  0:36                                               ` Paul Eggert
1998-12-27 17:24                                                 ` Richard Stallman
1998-12-21 19:16                                         ` Per Bothner
1998-12-21 19:20                                           ` Per Bothner
1998-12-23  0:35                                           ` Richard Stallman
1998-12-22  3:09                                         ` Joern Rennecke
1998-12-22 10:52                                           ` Paul Eggert
1998-12-23  0:36                                         ` Richard Stallman
1998-12-20 20:29                                     ` Richard Stallman
1998-12-21  1:52                                       ` Andreas Schwab
1998-12-22  1:09                                         ` Richard Stallman
1998-12-21 12:25                                     ` Samuel Figueroa
1998-12-15 22:00                     ` Paul Eggert
1998-12-15 23:17                       ` Martin von Loewis
1998-12-17  7:32                         ` Paul Eggert
1998-12-17 16:48                           ` Martin von Loewis
1998-12-17 22:10                             ` Paul Eggert
1998-12-18 21:31                           ` Richard Stallman
1998-12-16  0:18                       ` Per Bothner
1998-12-09 23:18               ` Martin von Loewis
1998-12-10  7:57                 ` Ian Lance Taylor
1998-12-10 13:12                   ` Martin von Loewis
1998-12-11 19:32                   ` Paul Eggert
1998-12-11 19:34                   ` Ken Raeburn
1998-12-14 17:05                     ` Ian Lance Taylor
1998-12-11 19:28                 ` Paul Eggert
1998-12-12  1:06                   ` Martin von Loewis
     [not found]               ` <199812100200.VAA06419.cygnus.egcs@wagner.Princeton.EDU>
1998-12-10 11:31                 ` Jonathan Larmour
