From: Zack Weinberg
Date: Sun, 27 Oct 2002 23:51:00 -0000
To: Martin v. Löwis
Cc: gcc-patches@gcc.gnu.org
Subject: Re: Implementing Universal Character Names in identifiers
Message-ID: <20021028075111.GB1273@codesourcery.com>
References: <200210280715.g9S7FdI2003815@paros.informatik.hu-berlin.de>

On Mon, Oct 28, 2002 at 08:15:39AM +0100, Martin v. Löwis wrote:
> This patch implements UCNs in cpplib. It does so by converting the
> UCN to UTF-8, putting the UTF-8 bytes into the internal
> representation of the identifier.

This is the right general idea.  Thank you for the patch.

It would be worthwhile - as a separate patch, mind - to add support
for extended characters written in bare UTF-8 in identifiers.  My
plan for general extended-character-encoding support is to convert
to UTF-8 and process that representation; that plus iconv plus some
glue and heuristics will get us most of the way there.

You want to look closely at what is currently done for UCNs in wide
character constants and string literals.  I'm pretty sure it's
wrong, and I would appreciate suggestions.

> The back-ends will transparently output the UTF-8 identifiers into the
> assembler file. If GNU as is used (or any other assembler supporting
> non-ASCII identifiers), these UTF-8 strings will be copied transparently
> into the object file. If the assembler does not support UTF-8, it
> will produce a diagnostic.

I thought we had some sort of encoding scheme for assemblers that
don't support UTF-8?  How does this interact with the C++ ABI?

We should normalize identifiers before entering them in the symbol
table, and for output; otherwise there will be great confusion.
That needs to happen as part of the initial patch.

...

> + /* Returns nonzero if C is a universal-character-name. Give an error if it
> +    is not one which may appear in an identifier, as per [extendid].
> +
> +    Note that extended character support in identifiers has not yet been
> +    implemented. It is my personal opinion that this is not a desirable
> +    feature. Portable code cannot count on support for more than the basic
> +    identifier character set. */

(1) This routine belongs in libiberty, as part of the safe-ctype.h
interface.

(2) Isn't this comment now inaccurate?  You just did implement
extended characters in identifiers.

(3) The ranges need to be updated from the latest Unicode standard,
and the standard version noted in commentary.

Due to the size of this routine, and the concerns with the rest of
your change, please submit a patch that does just that, all by
itself; that will get in easily, and then we can iterate on the
rest of it.
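For concreteness, the shape I'd expect the safe-ctype version of
the routine to take is roughly this.  Untested sketch: the function
name is made up, and the handful of ranges shown are just a sample
from C99 Annex D - the real table should be generated from the
current Unicode data, with the version recorded next to it:

  #include <stddef.h>

  /* Sorted, disjoint table of code-point ranges permitted in
     identifiers.  Illustrative excerpt only; the generated table
     is much longer.  */
  struct ucn_range
  {
    unsigned int lo, hi;
  };

  static const struct ucn_range ident_ranges[] = {
    { 0x00c0, 0x00d6 }, { 0x00d8, 0x00f6 }, { 0x00f8, 0x01f5 },
    { 0x038e, 0x03a1 }, { 0x03a3, 0x03ce },
  };

  /* Return nonzero if code point C may appear in an identifier.
     Binary search keeps this logarithmic in the table size.  */
  int
  ucn_valid_in_identifier (unsigned int c)
  {
    size_t lo = 0;
    size_t hi = sizeof ident_ranges / sizeof ident_ranges[0];

    while (lo < hi)
      {
        size_t mid = lo + (hi - lo) / 2;
        if (c < ident_ranges[mid].lo)
          hi = mid;
        else if (c > ident_ranges[mid].hi)
          lo = mid + 1;
        else
          return 1;
      }
    return 0;
  }

That keeps the range data in one generated table, so moving to a
new Unicode revision is a regeneration, not a hand edit.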
> +  else if (*s < 0xc0)
> +    {
> +      /* Cannot occur as first byte */
> +      abort();
> +    }

Don't use abort in cpplib; use cpp_error (pfile, DL_ICE, ...).
Further, this can happen as a result of ill-formed user input,
can't it?  Therefore this should be a plain error, not an ICE.

>    /* Check for slow-path cases.  */
>    if (*cur == '?' || *cur == '\\' || *cur == '$')
> !    number->text = parse_slow (pfile, cur, 1 + leading_period,
> !                               &number->len, &ignored);

I don't think the UTF8 flag should be ignored at this point.
Consider what happens if we get

  asdf ## 12\u03F8

-- that is valid, and needs to turn into a single CPP_NAME token
with the UTF8 flag set.  It seems safe to me to carry around the
UTF8 bit on all CPP_NUMBER tokens.  Naturally, cpp_classify_number
should categorize such numbers as CPP_N_INVALID (allowing digits
outside the basic source character set strikes me as a bad idea).

>  spell_ident:
>    case SPELL_IDENT:
> !    if ((token->val.node->flags & NODE_UTF8) == 0)
> !      fwrite (NODE_NAME (token->val.node), 1, NODE_LEN (token->val.node), fp);
> !    else
> !      {
> !        const unsigned char *s = NODE_NAME (token->val.node);
> !        int len = NODE_LEN (token->val.node);
> !        for (; len; len--)
> !          {
> !            if (*s < 128)
> !              {
> !                fwrite (s, 1, 1, fp);
> !                s++;
> !                len--;
> !              }
> !            else
> !              {
> !                const unsigned char *old = s;
> !                cppchar_t code = utf8_to_char (&s);
> !                if (code < 0x10000)
> !                  fprintf (fp, "\\u%.4x", code);
> !                else
> !                  fprintf (fp, "\\U%.8x", code);
> !                len += s - old;
> !              }
> !          }
> !      }

Please find a more efficient way to accomplish this.  This code is
already *the* bottleneck for textual preprocessing.  (For instance,
if you implement support for raw UTF8 as input encoding, we can
just splat out the identifier as is.)  See the P.S. for the sort of
thing I have in mind.

zw
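P.S. The sort of rewrite I mean, as an untested sketch.  This slots
in for the NODE_UTF8 branch above, and assumes (as your patch seems
to) that utf8_to_char advances its argument past the sequence it
decodes:

  {
    const unsigned char *s = NODE_NAME (token->val.node);
    const unsigned char *limit = s + NODE_LEN (token->val.node);

    while (s < limit)
      {
        /* Emit the longest run of plain ASCII bytes with a single
           fwrite, instead of one call per byte.  */
        const unsigned char *run = s;
        while (s < limit && *s < 0x80)
          s++;
        if (s > run)
          fwrite (run, 1, s - run, fp);

        /* Fall back to \u/\U escapes for each extended character.  */
        if (s < limit)
          {
            cppchar_t code = utf8_to_char (&s);
            if (code < 0x10000)
              fprintf (fp, "\\u%.4x", code);
            else
              fprintf (fp, "\\U%.8x", code);
          }
      }
  }

Long ASCII runs then go out in one call apiece, and only the
extended characters take the slow path.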