From: Zack Weinberg
Date: Sun, 27 Oct 2002 23:51:00 -0000
To: Martin v. Löwis
Cc: gcc-patches@gcc.gnu.org
Subject: Re: Implementing Universal Character Names in identifiers
Message-ID: <20021028075111.GB1273@codesourcery.com>
References: <200210280715.g9S7FdI2003815@paros.informatik.hu-berlin.de>

On Mon, Oct 28, 2002 at 08:15:39AM +0100, Martin v. Löwis wrote:
> This patch implements UCNs in cpplib. It does so by converting the
> UCN to UTF-8, putting the UTF-8 bytes into the internal
> representation of the identifier.

This is the right general idea.  Thank you for the patch.

It would be worthwhile - as a separate patch, mind - to add support
for extended characters written in bare UTF-8 in identifiers.  My
plan for general extended-character-encoding support is to convert
to UTF-8 and process that representation; that plus iconv plus some
glue and heuristics will get us most of the way there.

You want to look closely at what is currently done for UCNs in wide
character constants and string literals.  I'm pretty sure it's
wrong, and I would appreciate suggestions.

> The back-ends will transparently output the UTF-8 identifiers into the
> assembler file. If GNU as is used (or any other assembler supporting
> non-ASCII identifiers), these UTF-8 strings will be copied transparently
> into the object file. If the assembler does not support UTF-8, it
> will produce a diagnostic.

I thought we had some sort of encoding scheme for assemblers that
don't support UTF-8?  How does this interact with the C++ ABI?

We should normalize identifiers before entering them in the symbol
table, and for output; otherwise there will be great confusion.
That needs to happen as part of the initial patch.

...

> + /* Returns nonzero if C is a universal-character-name. Give an error if it
> +    is not one which may appear in an identifier, as per [extendid].
> +
> +    Note that extended character support in identifiers has not yet been
> +    implemented. It is my personal opinion that this is not a desirable
> +    feature. Portable code cannot count on support for more than the basic
> +    identifier character set. */

(1) This routine belongs in libiberty, as part of the safe-ctype.h
interface.

(2) Isn't this comment now inaccurate?  You just did implement
extended characters in identifiers.

(3) The ranges need to be updated from the latest Unicode standard,
and the standard version noted in commentary.

Due to the size of this routine, and the concerns with the rest of
your change, please submit a patch that does just that, all by
itself; that will get in easily, and then we can iterate on the
rest of it.
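For concreteness, the shape I'd expect the safe-ctype version of
the routine to take is roughly this.  Untested sketch: the function
name is made up, and the handful of ranges shown are just a sample
from C99 Annex D - the real table should be generated from the
current Unicode data, with the version recorded next to it:

  #include <stddef.h>

  /* Sorted, disjoint table of code-point ranges permitted in
     identifiers.  Illustrative excerpt only; the generated table
     is much longer.  */
  struct ucn_range
  {
    unsigned int lo, hi;
  };

  static const struct ucn_range ident_ranges[] = {
    { 0x00c0, 0x00d6 }, { 0x00d8, 0x00f6 }, { 0x00f8, 0x01f5 },
    { 0x038e, 0x03a1 }, { 0x03a3, 0x03ce },
  };

  /* Return nonzero if code point C may appear in an identifier.
     Binary search keeps this logarithmic in the table size.  */
  int
  ucn_valid_in_identifier (unsigned int c)
  {
    size_t lo = 0;
    size_t hi = sizeof ident_ranges / sizeof ident_ranges[0];

    while (lo < hi)
      {
        size_t mid = lo + (hi - lo) / 2;
        if (c < ident_ranges[mid].lo)
          hi = mid;
        else if (c > ident_ranges[mid].hi)
          lo = mid + 1;
        else
          return 1;
      }
    return 0;
  }

That keeps the range data in one generated table, so moving to a
new Unicode revision is a regeneration, not a hand edit.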
> +  else if (*s < 0xc0)
> +    {
> +      /* Cannot occur as first byte */
> +      abort();
> +    }

Don't use abort in cpplib; use cpp_error (pfile, DL_ICE, ...).
Further, this can happen as a result of ill-formed user input,
can't it?  Therefore this should be a plain error, not an ICE.

>    /* Check for slow-path cases.  */
>    if (*cur == '?' || *cur == '\\' || *cur == '$')
> !    number->text = parse_slow (pfile, cur, 1 + leading_period,
> !                               &number->len, &ignored);

I don't think the UTF8 flag should be ignored at this point.
Consider what happens if we get

  asdf ## 12\u03F8

-- that is valid, and needs to turn into a single CPP_NAME token
with the UTF8 flag set.  It seems safe to me to carry around the
UTF8 bit on all CPP_NUMBER tokens.  Naturally, cpp_classify_number
should categorize such numbers as CPP_N_INVALID (allowing digits
outside the basic source character set strikes me as a bad idea).

>  spell_ident:
>    case SPELL_IDENT:
> !    if ((token->val.node->flags & NODE_UTF8) == 0)
> !      fwrite (NODE_NAME (token->val.node), 1, NODE_LEN (token->val.node), fp);
> !    else
> !      {
> !        const unsigned char *s = NODE_NAME (token->val.node);
> !        int len = NODE_LEN (token->val.node);
> !        for (; len; len--)
> !          {
> !            if (*s < 128)
> !              {
> !                fwrite (s, 1, 1, fp);
> !                s++;
> !                len--;
> !              }
> !            else
> !              {
> !                const unsigned char *old = s;
> !                cppchar_t code = utf8_to_char (&s);
> !                if (code < 0x10000)
> !                  fprintf (fp, "\\u%.4x", code);
> !                else
> !                  fprintf (fp, "\\U%.8x", code);
> !                len += s - old;
> !              }
> !          }
> !      }

Please find a more efficient way to accomplish this.  This code is
already *the* bottleneck for textual preprocessing.  (For instance,
if you implement support for raw UTF8 as input encoding, we can
just splat out the identifier as is.)  See the P.S. for the sort of
thing I have in mind.

zw
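P.S. The sort of rewrite I mean, as an untested sketch.  This slots
in for the NODE_UTF8 branch above, and assumes (as your patch seems
to) that utf8_to_char advances its argument past the sequence it
decodes:

  {
    const unsigned char *s = NODE_NAME (token->val.node);
    const unsigned char *limit = s + NODE_LEN (token->val.node);

    while (s < limit)
      {
        /* Emit the longest run of plain ASCII bytes with a single
           fwrite, instead of one call per byte.  */
        const unsigned char *run = s;
        while (s < limit && *s < 0x80)
          s++;
        if (s > run)
          fwrite (run, 1, s - run, fp);

        /* Fall back to \u/\U escapes for each extended character.  */
        if (s < limit)
          {
            cppchar_t code = utf8_to_char (&s);
            if (code < 0x10000)
              fprintf (fp, "\\u%.4x", code);
            else
              fprintf (fp, "\\U%.8x", code);
          }
      }
  }

Long ASCII runs then go out in one call apiece, and only the
extended characters take the slow path.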