From: Paul Eggert <eggert@twinsun.com>
To: Martin von Loewis <martin@mira.isdn.cs.tu-berlin.de>, brolley@cygnus.com
Cc: gcc2@gnu.org, egcs@cygnus.com
Subject: thoughts on martin's proposed patch for GCC and UTF-8
Date: Wed, 09 Dec 1998 13:44:00 -0000
Message-ID: <199812092143.NAA04890@shade.twinsun.com>
In-Reply-To: <366D460E.4FB0ECD0@cygnus.com>
I took a look at Martin's proposed patch for UTF-8 support in GCC, and
have the following thoughts and suggestions.
* GCC must unify the identifiers \u00b5, \u00B5, \U000000b5, and
\U000000B5; but GCC should not always unify these four identifiers
to the identifier with character code B5, as this is incorrect in
non-UTF-8 locales.
The latest EGCS and GCC2 code already contains support for non-UTF-8
locales, and this support is incompatible with the proposed patch.
To get started, perhaps the proposed patch could be modified to
report an error if it encounters \u or \U in a non-UTF-8 locale,
saying that this is not supported yet.
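As a rough illustration of the unification point above, here is a minimal sketch in C. The names parse_ucn and same_ucn are hypothetical and do not come from the patch; the only idea shown is that every spelling of a universal character name reduces to one code point before any locale-specific handling takes place.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper, not from the patch: parse a universal
   character name (\uXXXX or \UXXXXXXXX) at S.  On success, store the
   code point in *CP and return the number of bytes consumed; return 0
   if S does not start with a well-formed name.  */
size_t
parse_ucn (const char *s, unsigned long *cp)
{
  size_t ndigits, i;
  unsigned long value = 0;

  if (s[0] != '\\')
    return 0;
  if (s[1] == 'u')
    ndigits = 4;
  else if (s[1] == 'U')
    ndigits = 8;
  else
    return 0;

  for (i = 0; i < ndigits; i++)
    {
      int c = (unsigned char) s[2 + i];
      if (!isxdigit (c))
        return 0;               /* also stops at the terminating NUL */
      /* Upper- and lowercase hex digits yield the same value.  */
      value = value * 16 + (isdigit (c) ? c - '0' : tolower (c) - 'a' + 10);
    }
  *cp = value;
  return 2 + ndigits;
}

/* Hypothetical helper: nonzero if A and B are well-formed universal
   character names that denote the same code point.  */
int
same_ucn (const char *a, const char *b)
{
  unsigned long ca, cb;
  return parse_ucn (a, &ca) != 0 && parse_ucn (b, &cb) != 0 && ca == cb;
}
```

With these, same_ucn ("\\u00b5", "\\U000000B5") is nonzero, so the four spellings in the example compare equal. What the unified identifier is then stored as remains a separate, locale-dependent question.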
* GCC should represent non-ASCII identifiers using the locale's
preferred multibyte encoding; e.g. it should use EUC-JIS if that's
what the locale uses. This is the best way to make GCC work well
with other tools in that locale. If the locale cannot represent a
particular Unicode character, GCC should store it in a canonicalized
escape form (e.g. the locale's encoding for \u followed by four
lowercase hex digits if the character fits in 16 bits, \U followed by
eight lowercase hex digits otherwise); this is along the lines of what
draft C9x suggests.
Proper support for \u in non-UTF-8 locales requires a
locale-specific translation table from Unicode to the locale's
encoding. We'll also need a locale-specific table that specifies
which characters are C letters and digits, but this can be derived
from the other table automatically.
One way to translate from Unicode to non-UTF-8 is to have GCC use
the iconv function if available. iconv will be supported by glibc
2.1; it's also been supported by Solaris 2.x for some time. GCC
could supply its own substitute for iconv if that's needed by
cross-compilers, but the native iconv is generally preferable.
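The iconv approach and the escape fallback described above could look roughly like this. It is a sketch under assumptions, not a proposed implementation: encode_for_locale and utf8_encode are hypothetical helper names, the target encoding is passed in explicitly (a real implementation might obtain it from nl_langinfo (CODESET) after calling setlocale), and any nonzero iconv result is treated as "not representable".

```c
#include <iconv.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: encode CP as UTF-8 into OUT (at least 4
   bytes).  Return the byte count, or 0 if CP is above U+10FFFF.  */
size_t
utf8_encode (unsigned long cp, char *out)
{
  if (cp < 0x80)
    {
      out[0] = (char) cp;
      return 1;
    }
  if (cp < 0x800)
    {
      out[0] = (char) (0xc0 | (cp >> 6));
      out[1] = (char) (0x80 | (cp & 0x3f));
      return 2;
    }
  if (cp < 0x10000)
    {
      out[0] = (char) (0xe0 | (cp >> 12));
      out[1] = (char) (0x80 | ((cp >> 6) & 0x3f));
      out[2] = (char) (0x80 | (cp & 0x3f));
      return 3;
    }
  if (cp < 0x110000)
    {
      out[0] = (char) (0xf0 | (cp >> 18));
      out[1] = (char) (0x80 | ((cp >> 12) & 0x3f));
      out[2] = (char) (0x80 | ((cp >> 6) & 0x3f));
      out[3] = (char) (0x80 | (cp & 0x3f));
      return 4;
    }
  return 0;
}

/* Hypothetical helper: store CP in BUF (BUFSIZE >= 11) using the
   encoding named TOCODE.  If the converter is unavailable or the
   character is not representable, store the canonical escape instead.
   Return the number of bytes stored, not counting the final NUL.
   Stateful encodings would also need a final iconv call with a null
   input to flush the shift state; that is omitted here.  */
size_t
encode_for_locale (unsigned long cp, const char *tocode,
                   char *buf, size_t bufsize)
{
  char utf8[4];
  char *in = utf8;
  char *out = buf;
  size_t inleft = utf8_encode (cp, utf8);
  size_t outleft = bufsize - 1;
  size_t r = (size_t) -1;
  iconv_t cd = iconv_open (tocode, "UTF-8");

  if (cd != (iconv_t) -1)
    {
      if (inleft != 0)
        r = iconv (cd, &in, &inleft, &out, &outleft);
      iconv_close (cd);
    }

  if (r == 0)
    {
      *out = '\0';
      return (size_t) (out - buf);
    }

  /* Fallback: \u with four lowercase hex digits if CP fits in 16
     bits, \U with eight lowercase hex digits otherwise.  */
  if (cp <= 0xffffUL)
    snprintf (buf, bufsize, "\\u%04lx", cp);
  else
    snprintf (buf, bufsize, "\\U%08lx", cp);
  return strlen (buf);
}
```

For example, encode_for_locale (0xb5, "ASCII", buf, sizeof buf) leaves the escape \u00b5 in buf, since ASCII has no micro sign; with an ISO-8859-1 target the same call would be expected to produce the single byte 0xB5 instead.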
* Given the above, I don't see the need for TREE_UNIVERSAL_CHAR. The
identifier should be stored using the locale's multibyte chars as
suggested above (with canonical escapes if needed), and output
as-is, just as identifiers are now.
* HAVE_GAS_UTF8 isn't needed and to some extent doesn't fit with GCC's
current philosophy that the user knows what he or she is doing.
People who use multibyte chars in identifiers will expect them to go
through to the assembler; if the assembler doesn't support them,
they'll understand the assembler's error message. So GCC's behavior
shouldn't depend on whether the assembler supports multibyte chars.
There's precedent for this: GCC already doesn't care whether the
assembler supports dollar signs in identifiers. If the user writes
a function named `a$b', and the assembler doesn't support that name,
then the assembler will report the error. That's preferable to
having GCC second-guess the assembler.
Also, the configure.in test for HAVE_GAS_UTF8 contains literal UTF-8
text. This won't work with older shells that don't allow non-ASCII
bytes in scripts. It's simpler if we just remove HAVE_GAS_UTF8.
* I assume that cp/universal.c is supposed to support the constraints
on identifiers required by ISO/IEC TR 10176? If so, it should be
commented that way. The code needs an is_universal_digit function,
since letters and digits have distinct roles in identifiers. You need
to remove the trailing `,' before `}' in the code, for portability to
older compilers. The code currently dumps core if is_uni[h]==NULL.
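To make the letter/digit distinction concrete, here is a minimal sketch in C of separate predicates over sorted range tables. Apart from is_universal_digit, which is the name suggested above, every name here (ucs_range, in_ranges, is_universal_letter, universal_char_ok) is hypothetical, and this is not how the patch's is_uni table is organized. The ranges are a tiny illustrative excerpt (Latin-1 letters, Arabic-Indic and Devanagari digits), not the full TR 10176 lists.

```c
#include <stddef.h>

struct ucs_range
{
  unsigned long first;
  unsigned long last;
};

/* Illustrative excerpt only, sorted by code point.  */
static const struct ucs_range letter_ranges[] = {
  { 0x00c0, 0x00d6 },
  { 0x00d8, 0x00f6 },
  { 0x00f8, 0x00ff }    /* no `,' before `}', for older compilers */
};

static const struct ucs_range digit_ranges[] = {
  { 0x0660, 0x0669 },   /* Arabic-Indic digits */
  { 0x0966, 0x096f }    /* Devanagari digits */
};

/* Binary search for CP in the N sorted, non-overlapping ranges at
   TAB.  A null or empty table simply yields 0.  */
static int
in_ranges (unsigned long cp, const struct ucs_range *tab, size_t n)
{
  size_t lo = 0, hi = n;

  if (tab == NULL)
    return 0;
  while (lo < hi)
    {
      size_t mid = lo + (hi - lo) / 2;
      if (cp < tab[mid].first)
        hi = mid;
      else if (cp > tab[mid].last)
        lo = mid + 1;
      else
        return 1;
    }
  return 0;
}

int
is_universal_letter (unsigned long cp)
{
  return in_ranges (cp, letter_ranges,
                    sizeof letter_ranges / sizeof letter_ranges[0]);
}

int
is_universal_digit (unsigned long cp)
{
  return in_ranges (cp, digit_ranges,
                    sizeof digit_ranges / sizeof digit_ranges[0]);
}

/* Letters may appear anywhere in an identifier; digits may appear
   anywhere except at the start.  */
int
universal_char_ok (unsigned long cp, int at_start)
{
  if (is_universal_letter (cp))
    return 1;
  return !at_start && is_universal_digit (cp);
}
```

A flat sorted table like this has no per-page pointers that could be null, which is one way to sidestep the kind of crash mentioned above; a two-level table with an explicit null check would serve equally well.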
* The universal-char code needs to be moved out to the main GCC
level; it's not specific to C++.
* The C compiler and preprocessor also need to support \u and
multibyte chars. I'll take a look at doing this, taking inspiration
from Martin's proposed patch.
* GAS should be extended to support locales with encodings other than
UTF-8; in particular, this means that GAS should support \u, if it
doesn't already, as \u is needed for characters that can't be
represented in the locale's multibyte encoding.