thoughts on martin's proposed patch for GCC and UTF-8

public inbox for gcc@gcc.gnu.org

From: Paul Eggert <eggert@twinsun.com>
To: Martin von Loewis <martin@mira.isdn.cs.tu-berlin.de>, brolley@cygnus.com
Cc: gcc2@gnu.org, egcs@cygnus.com
Subject: thoughts on martin's proposed patch for GCC and UTF-8
Date: Wed, 09 Dec 1998 13:44:00 -0000
Message-ID: <199812092143.NAA04890@shade.twinsun.com>
In-Reply-To: <366D460E.4FB0ECD0@cygnus.com>

I took a look at Martin's proposed patch for UTF-8 support in GCC, and
have the following thoughts and suggestions.

* GCC must unify the identifiers \u00b5, \u00B5, \U000000b5, and
  \U000000B5; but GCC should not always unify these four identifiers
  to the identifier with character code B5, as this is incorrect in
  non-UTF-8 locales.

  The latest EGCS and GCC2 code already contains support for non-UTF-8
  locales, and this support is incompatible with the proposed patch.
  To get started, perhaps the proposed patch could be modified to
  report an error if it encounters \u or \U in a non-UTF-8 locale,
  saying that this is not supported yet.
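
  As a rough sketch of the unification step (not code from the patch;
  parse_ucn and hex_value are hypothetical names), the four spellings
  can be reduced to a single code point before any locale-specific
  handling, so that digit case and \u versus \U no longer matter:

```c
#include <assert.h>
#include <stddef.h>

/* Value of one hexadecimal digit, or -1 if C is not a hex digit.
   Written without <ctype.h> so the result does not depend on the locale. */
static int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Parse a universal character name at S: \u plus 4 hex digits, or
   \U plus 8 hex digits.  On success store the code point in *CP and
   return the number of chars consumed (6 or 10); otherwise return 0. */
size_t parse_ucn(const char *s, unsigned long *cp)
{
    size_t ndigits, i;
    unsigned long value = 0;

    assert(s != NULL && cp != NULL);
    if (s[0] != '\\') return 0;
    if (s[1] == 'u') ndigits = 4;
    else if (s[1] == 'U') ndigits = 8;
    else return 0;

    for (i = 0; i < ndigits; i++) {
        int d = hex_value(s[2 + i]);
        if (d < 0) return 0;          /* also stops at a premature '\0' */
        value = value * 16 + (unsigned long) d;
    }
    *cp = value;
    return 2 + ndigits;
}
```

  With this, all four spellings above yield the code point 0xB5; how
  that code point is then stored is a separate, locale-dependent
  question.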

* GCC should represent non-ASCII identifiers using the locale's
  preferred multibyte encoding; e.g. it should use EUC-JIS if that's
  what the locale uses.  This is the best way to make GCC work well
  with other tools in that locale.  If the locale cannot represent a
  particular Unicode character, GCC should store it in a canonicalized
  escape form (e.g. the locale's encoding for \u followed by four
  lowercase hexadecimal digits if the character fits in 16 bits, or \U
  followed by eight lowercase hexadecimal digits otherwise); this is
  along the lines of what draft C9x suggests.

  Proper support for \u in non-UTF-8 locales requires a
  locale-specific translation table from Unicode to the locale's
  encoding.  We'll also need a locale-specific table that specifies
  which characters are C letters and digits, but this can be derived
  from the other table automatically.

  One way to translate from Unicode to non-UTF-8 is to have GCC use
  the iconv function if available.  iconv will be supported by glibc
  2.1; it's also been supported by Solaris 2.x for some time.  GCC
  could supply its own substitute for iconv if that's needed by
  cross-compilers, but the native iconv is generally preferable.
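
  To make the idea concrete, here is a minimal sketch, assuming a
  POSIX-style iconv that accepts "UTF-8" as a source name.  The
  function names (canonical_escape, utf8_encode,
  encode_identifier_char) are hypothetical and the charset is passed
  in explicitly rather than taken from the locale.  It tries the
  target encoding first and falls back to the canonical escape when
  the character cannot be represented exactly:

```c
#include <assert.h>
#include <iconv.h>
#include <stdio.h>
#include <string.h>

/* Canonical escape: \u plus 4 lowercase hex digits if CP fits in 16
   bits, else \U plus 8.  BUF must hold at least 11 bytes. */
static size_t canonical_escape(unsigned long cp, char *buf)
{
    assert(cp <= 0xFFFFFFFFul);
    if (cp <= 0xFFFF)
        return (size_t) sprintf(buf, "\\u%04lx", cp);
    return (size_t) sprintf(buf, "\\U%08lx", cp);
}

/* Encode CP as UTF-8 into BUF (at least 4 bytes).  Returns the length,
   or 0 if CP is beyond U+10FFFF.  Surrogates are not filtered here. */
static size_t utf8_encode(unsigned long cp, char *buf)
{
    if (cp < 0x80) { buf[0] = (char) cp; return 1; }
    if (cp < 0x800) {
        buf[0] = (char) (0xC0 | (cp >> 6));
        buf[1] = (char) (0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        buf[0] = (char) (0xE0 | (cp >> 12));
        buf[1] = (char) (0x80 | ((cp >> 6) & 0x3F));
        buf[2] = (char) (0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp < 0x110000) {
        buf[0] = (char) (0xF0 | (cp >> 18));
        buf[1] = (char) (0x80 | ((cp >> 12) & 0x3F));
        buf[2] = (char) (0x80 | ((cp >> 6) & 0x3F));
        buf[3] = (char) (0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Store CP in OUT (at least 16 bytes, NUL-terminated) using CHARSET if
   it can represent CP exactly, else as a canonical escape.  Returns the
   number of bytes written, not counting the NUL.  Stateful encodings
   that need a closing shift sequence are not handled in this sketch. */
size_t encode_identifier_char(const char *charset, unsigned long cp, char *out)
{
    char utf8[4];
    char *src = utf8;
    char *dst = out;
    size_t srcleft, dstleft = 15;      /* leave room for the final NUL */
    iconv_t cd;

    assert(charset != NULL && out != NULL);
    srcleft = utf8_encode(cp, utf8);
    cd = iconv_open(charset, "UTF-8");
    if (cd != (iconv_t) -1) {
        size_t r = (size_t) -1;
        if (srcleft != 0)
            r = iconv(cd, &src, &srcleft, &dst, &dstleft);
        iconv_close(cd);
        /* r == 0 means every character converted reversibly. */
        if (r == 0 && srcleft == 0) {
            *dst = '\0';
            return (size_t) (dst - out);
        }
    }
    return canonical_escape(cp, out);
}
```

  For example, with the charset "ASCII" the letter A is stored as
  itself, while U+00B5 comes back as the six characters \u00b5.  Where
  the system's iconv supports ISO-8859-1, U+00B5 would instead be
  stored as the single byte B5.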

* Given the above, I don't see the need for TREE_UNIVERSAL_CHAR.  The
  identifier should be stored using the locale's multibyte chars as
  suggested above (with canonical escapes if needed), and output
  as-is, just as identifiers are now.

* HAVE_GAS_UTF8 isn't needed and to some extent doesn't fit with GCC's
  current philosophy that the user knows what he or she is doing.
  People who use multibyte chars in identifiers will expect them to go
  through to the assembler; if the assembler doesn't support them,
  they'll understand the assembler's error message.  So GCC's behavior
  shouldn't depend on whether the assembler supports multibyte chars.

  There's precedent for this: GCC already doesn't care whether the
  assembler supports dollar signs in identifiers.  If the user writes
  a function named `a$b', and the assembler doesn't support that name,
  then the assembler will report the error.  That's preferable to
  having GCC second-guess the assembler.

  Also, the configure.in test for HAVE_GAS_UTF8 itself contains UTF-8
  text.  This won't work with older shells that don't accept UTF-8
  input.  It's simpler just to remove HAVE_GAS_UTF8.

* I assume that cp/universal.c is supposed to implement the constraints
  on identifiers required by ISO/IEC TR 10176?  If so, it should be
  commented that way.  The code needs an is_universal_digit function,
  since letters and digits have distinct roles in identifiers.  The
  `,' before `}' in the code should be removed, for portability to
  older compilers.  The code currently dumps core if is_uni[h]==NULL.
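
  Something along these lines is what I have in mind; this is only a
  sketch and not the code in cp/universal.c.  The two-level table
  layout is an assumption, the names other than is_universal_digit are
  hypothetical, and the table contents are a few sample code points
  chosen for illustration, not the TR 10176 lists:

```c
#include <assert.h>
#include <stddef.h>

/* Page 0x00: only U+00B5 is marked, as a sample letter. */
static const unsigned char letters_page00[32] = {
    [0xB5 / 8] = 1u << (0xB5 % 8)
};

/* Page 0x06: U+0660..U+0669 are marked, as sample digits. */
static const unsigned char digits_page06[32] = {
    [0x60 / 8] = 0xFF,                 /* U+0660..U+0667 */
    [0x68 / 8] = 0x03                  /* U+0668..U+0669 */
};

/* One 256-bit bitmap per high byte; NULL means the page has no members. */
static const unsigned char *const letter_pages[256] = {
    [0x00] = letters_page00
};
static const unsigned char *const digit_pages[256] = {
    [0x06] = digits_page06
};

/* Look C up in a two-level table, guarding against missing pages. */
static int in_table(const unsigned char *const *pages, unsigned long c)
{
    const unsigned char *page;

    assert(pages != NULL);
    if (c > 0xFFFF)
        return 0;                      /* sample tables cover 16 bits only */
    page = pages[c >> 8];
    if (page == NULL)
        return 0;                      /* the check whose absence dumps core */
    return (page[(c & 0xFF) / 8] >> (c % 8)) & 1;
}

int is_universal_letter(unsigned long c) { return in_table(letter_pages, c); }
int is_universal_digit(unsigned long c)  { return in_table(digit_pages, c); }

/* A universal digit may continue an identifier but may not start one. */
int may_start_identifier(unsigned long c)
{
    return is_universal_letter(c);
}

int may_continue_identifier(unsigned long c)
{
    return is_universal_letter(c) || is_universal_digit(c);
}
```

  Keeping the two predicates separate is what lets the lexer reject an
  identifier that begins with a universal digit while still accepting
  one that merely contains it.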

* The universal-char code needs to be moved out to the main GCC
  level; it's not specific to C++.

* The C compiler and preprocessor also need to support \u and
  multibyte chars.  I'll take a look at doing this, taking inspiration
  from Martin's proposed patch.

* GAS should be extended to support locales with encodings other than
  UTF-8; in particular, this means that GAS should support \u, if it
  doesn't already, as \u is needed for characters that can't be
  represented in the locale's multibyte encoding.
