public inbox for gcc-patches@gcc.gnu.org
From: loewis@informatik.hu-berlin.de (Martin v. Löwis)
To: Zack Weinberg <zack@codesourcery.com>
Cc: gcc-patches@gcc.gnu.org, java@gcc.gnu.org
Subject: Re: Implementing Universal Character Names in identifiers
Date: Tue, 29 Oct 2002 01:39:00 -0000
Message-ID: <j4y98h7fdo.fsf@informatik.hu-berlin.de>
In-Reply-To: <20021028183910.GC24090@codesourcery.com>

Zack Weinberg <zack@codesourcery.com> writes:

> http://gcc.gnu.org/projects/cpplib.html#charset contains some
> discussion of the plan - comments would be appreciated.

That all sounds good. For mangling, I'd take a step back, though:
if the assembler does not support UTF-8, you lose.

As for Java: gcj mangles, say, \u0388 as __U388_.
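
To illustrate the scheme just mentioned, here is a minimal sketch in C.
It assumes that ASCII letters, digits and '_' pass through unchanged and
that every other code point becomes "__U", its hex value, and "_". The
function name mangle_code_point, the lowercase hex digits and the
pass-through rule are assumptions made for this example; the real gcj
mangler has additional rules that are not modelled here.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: append the mangled form of one code point to buf.
   buf has capacity cap and currently holds *len characters.
   Returns 0 on success, -1 if the result would not fit. */
int mangle_code_point(unsigned long cp, char *buf, size_t cap, size_t *len)
{
    assert(buf != NULL && len != NULL && *len < cap);

    size_t room = cap - *len;
    int n;

    if (cp < 0x80 && (isalnum((int)cp) || cp == '_'))
        n = snprintf(buf + *len, room, "%c", (int)cp);   /* plain ASCII */
    else
        n = snprintf(buf + *len, room, "__U%lx_", cp);   /* assumed escape form */

    if (n < 0 || (size_t)n >= room)
        return -1;                                       /* truncated */
    *len += (size_t)n;
    return 0;
}
```

Called with the code point 0x0388 on an empty buffer, this produces the
string __U388_, matching the example above.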

> What you wrote in response to this is interesting but doesn't
> address the issue of Unicode normalization of identifiers.  It
> sounds more like an extended discussion of the previous point.  I'm
> talking about the process described in UAX 15
> (http://www.unicode.org/unicode/reports/tr15/) and in particular
> annex 7 of that document ("Programming Language Identifiers").

I see. For the characters allowed in C99, normalization is almost a
non-issue, since nearly every identifier will already be in NFC.

One exception is the 20 characters that have a different canonical
equivalent (e.g. U+1F71 or U+2126); that might be a defect in ISO TR
10176 (they should not be allowed in identifiers - even though TR#15
allows them as well).

The other exception is the 80 characters that have a non-zero combining
class; those might need to be reordered under a normalization form.

I agree with Joseph that we are not entitled to perform normalization
of the identifiers - there is nothing in the standard that says that
\u1f71 is the same identifier as \u03ac.

So *if* normalization matters, I think we should require the input to
be already normalized, and refuse compilation if it isn't. This could
be accomplished with a small database: we need to ban additional
characters, and we need to record the combining class for those 80
combining characters.
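
A minimal sketch in C of what such a check could look like, assuming the
identifier is already available as an array of code points. The tables
are tiny illustrative excerpts rather than the real 20 banned and 80
combining characters, and the combining class values shown should be
verified against UnicodeData.txt. All names (ccc_table, banned,
identifier_is_normalized) are hypothetical. The check rejects banned
characters and any combining mark that follows one of a higher
combining class; it does not attempt full NFC verification.

```c
#include <assert.h>
#include <stddef.h>

struct ccc_entry { unsigned long cp; unsigned char ccc; };

/* Illustrative excerpt only: code point and canonical combining class. */
static const struct ccc_entry ccc_table[] = {
    { 0x094D,   9 },   /* DEVANAGARI SIGN VIRAMA */
    { 0x0951, 230 },   /* DEVANAGARI STRESS SIGN UDATTA */
    { 0x0952, 220 },   /* DEVANAGARI STRESS SIGN ANUDATTA */
};

/* Illustrative excerpt only: characters with a different canonical equivalent. */
static const unsigned long banned[] = { 0x1F71, 0x2126 };

unsigned char combining_class(unsigned long cp)
{
    for (size_t i = 0; i < sizeof ccc_table / sizeof ccc_table[0]; i++)
        if (ccc_table[i].cp == cp)
            return ccc_table[i].ccc;
    return 0;                          /* not listed: treated as a starter */
}

int is_banned(unsigned long cp)
{
    for (size_t i = 0; i < sizeof banned / sizeof banned[0]; i++)
        if (banned[i] == cp)
            return 1;
    return 0;
}

/* Returns 1 if no banned character occurs and all combining marks are
   in canonical order, 0 otherwise. */
int identifier_is_normalized(const unsigned long *cps, size_t n)
{
    assert(cps != NULL || n == 0);
    unsigned char prev = 0;
    for (size_t i = 0; i < n; i++) {
        if (is_banned(cps[i]))
            return 0;
        unsigned char cur = combining_class(cps[i]);
        if (cur != 0 && prev > cur)
            return 0;                  /* marks out of canonical order */
        prev = cur;
    }
    return 1;
}
```

A front end using this approach would issue an error and refuse
compilation when the function returns 0, instead of silently reordering
or replacing characters.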

> Ugh.  IMO, this is a defect in both standards - they should simply
> reference UAX15a7 and be done with it.  It's been around since 1998,
> so they don't really have an excuse for not using it.

I completely disagree. The real problem with normalization (until
Unicode 3.1, 2001-05-16) is that it depends on the version of the
Unicode database: different implementations of normalization might
normalize the same string into different code point sequences. This is
really bad, and it has been fixed by setting Unicode 3.1 as the
composition version.

Even with that change, C++ would *still* have needed to pick a
particular version of the Unicode database, or else programs using
extended identifiers would not be portable across conforming
implementations.

So I think the approach of ISO TR 10176 is really sensible: it is
*better* than UAX15a7, as it gives a precise guideline, instead of a
wishy-washy one.

It is unfortunate that C++ would use a draft of that, but there is
really nobody to blame for that, I guess.

>  - GCC enforces the precise lists in C99 and C++98 only in -pedantic
>    mode.

I think GCC should be really careful when extending the language, and
there should be really good reasons for doing so. In the specific
case, I would require a real user with a real need before accepting
more characters than mandated by the language spec.

Merely saying that "C++ does not allow the FEMININE ORDINAL INDICATOR"
is not good enough. The user would need to say: I want the identifier
Internet\u00aa - which won't happen, because (to my understanding),
the FEMININE ORDINAL INDICATOR only applies to numbers, and it could
not be used there, anyway. I believe the same holds for many of the
other characters that would be allowed under UAX15a7 but are currently
banned - nobody would use them in identifiers even if they could.

So I very much doubt that the need for more characters than allowed in
C++98 will show up in the near future. When it does, the standards might
have been revised, so we just update the implementation. If some users
do have a real need, we can still reconsider, and maybe incorporate
the Unicode 4.5 database (in 2006).

> I am not sure about "few" some years down the road, when people start
> _using_ the ability to write identifiers in their own languages.  In
> any case, using "fwrite(ptr, 1, 1, file)" is just silly when
> "putc(*ptr, file)" will do.

Oops, yes. On the general remark: I'm not sure how to significantly
improve performance. I think cpplib should output \u00c0 if the input
was \u00c0 - and perhaps even if the input was in a native character
set. So direct copying to the output will not be appropriate.
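
A minimal sketch in C of the output side of that idea, assuming the
preprocessor already holds the character as a code point. The function
name format_ucn is hypothetical and this is not cpplib code. ASCII is
written as-is; anything else is spelled as a universal character name,
with \u and four hex digits up to U+FFFF and \U and eight hex digits
above that.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: write the spelling of code point cp into buf.
   Returns the number of characters written (excluding the terminating
   NUL), or -1 if buf is too small. */
int format_ucn(unsigned long cp, char *buf, size_t cap)
{
    assert(buf != NULL && cap > 0);

    int n;
    if (cp < 0x80)
        n = snprintf(buf, cap, "%c", (int)cp);       /* plain ASCII */
    else if (cp <= 0xFFFF)
        n = snprintf(buf, cap, "\\u%04lx", cp);      /* \uXXXX */
    else
        n = snprintf(buf, cap, "\\U%08lx", cp);      /* \UXXXXXXXX */

    return (n < 0 || (size_t)n >= cap) ? -1 : n;
}
```

Because every non-ASCII character takes the formatting branch, the
output can no longer be produced by copying input bytes straight
through, which is the cost mentioned above.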

Regards,
Martin
