From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugs-return-131803-listarch-gcc-bugs=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 2921 invoked by alias); 21 Feb 2005 14:15:26 -0000
Mailing-List: contact gcc-bugs-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Archive: <http://gcc.gnu.org/ml/gcc-bugs/>
List-Post: <mailto:gcc-bugs@gcc.gnu.org>
List-Help: <mailto:gcc-bugs-help@gcc.gnu.org>
Sender: gcc-bugs-owner@gcc.gnu.org
Received: (qmail 2779 invoked by uid 48); 21 Feb 2005 14:15:13 -0000
Date: Mon, 21 Feb 2005 21:34:00 -0000
Message-ID: <20050221141513.2778.qmail@sourceware.org>
From: "jsm28 at gcc dot gnu dot org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
In-Reply-To: <20030127145600.9449.rearnsha@arm.com>
References: <20030127145600.9449.rearnsha@arm.com>
Reply-To: gcc-bugzilla@gcc.gnu.org
Subject: [Bug preprocessor/9449] UCNs not recognized in identifiers (c++/c99)
X-Bugzilla-Reason: CC
X-SW-Source: 2005-02/txt/msg02483.txt.bz2
List-Id: <gcc-bugs.sourceware.org>


------- Additional Comments From jsm28 at gcc dot gnu dot org  2005-02-21 14:15 -------
The following checklist for implementation of extended identifiers has
been discussed with and prioritised by Zack.  No doubt Neil will point
out if there are any missing technical points.

External specifications
=======================

Reasonable efforts should be made to get the handling of extended
identifiers specified in the following documents: namely, that UCNs
and other non-ASCII characters in identifiers are encoded in UTF-8,
at least on platforms whose symbol names use ASCII in the first place.
Actually succeeding in doing so is not a blocker for getting an
implementation into GCC.

* ELF:
<http://www.thescogroup.com/developers/gabi/latest/ch4.symtab.html>,
which says "External C symbols have the same names in C and object
files' symbol tables.".  I have attempted to get such wording in; the
last version proposed was:

  Unless the operating system ABI specifies otherwise, it is
  recommended that characters in external C symbols, including
  characters outside the basic source character set whether or not
  designated in source files by universal character names, are encoded
  in UTF-8 in object files' symbol tables.

The discussions have been with ia64-abi@unix-os.sc.intel.com.

* C++ ABI: <http://www.codesourcery.com/cxx-abi/abi.html>.  The
appropriate change would be to add a statement that, once the ABI has
constructed a C symbol name which may contain UCNs, that name should
be encoded according to the underlying C ABI, following
<http://www.codesourcery.com/cxx-abi/cxx-closed.html#F8>.

The following specification already includes all the required text,
and GCC should implement it before a release is made supporting
extended identifiers:

* DWARF3: the DW_AT_use_UTF8 attribute should be set on the
compilation unit entry for each compilation unit with any UTF-8
identifiers (including ones such as structure element names which
appear in debug information but not otherwise in external
identifiers).  It may in fact be harmless to set it unconditionally.
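
As an illustration of the encoding these specifications refer to, here
is a minimal sketch in C of turning the code point designated by a UCN
into the UTF-8 bytes that would appear in a symbol name.  The name
utf8_encode is hypothetical and is not taken from GCC; the sketch only
shows the byte layout, and says nothing about which characters a given
standard permits in identifiers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not GCC code.  Encode code point CP as UTF-8
   into BUF, which must have room for 4 bytes.  Return the number of
   bytes written, or 0 if CP is a surrogate or is above U+10FFFF.  */
static size_t
utf8_encode (unsigned long cp, unsigned char *buf)
{
  assert (buf != NULL);

  if (cp < 0x80)
    {
      buf[0] = (unsigned char) cp;
      return 1;
    }
  if (cp < 0x800)
    {
      buf[0] = (unsigned char) (0xC0 | (cp >> 6));
      buf[1] = (unsigned char) (0x80 | (cp & 0x3F));
      return 2;
    }
  if (cp >= 0xD800 && cp <= 0xDFFF)
    return 0;                   /* Surrogates are not characters.  */
  if (cp < 0x10000)
    {
      buf[0] = (unsigned char) (0xE0 | (cp >> 12));
      buf[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
      buf[2] = (unsigned char) (0x80 | (cp & 0x3F));
      return 3;
    }
  if (cp <= 0x10FFFF)
    {
      buf[0] = (unsigned char) (0xF0 | (cp >> 18));
      buf[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
      buf[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
      buf[3] = (unsigned char) (0x80 | (cp & 0x3F));
      return 4;
    }
  return 0;                     /* Outside the Unicode range.  */
}
```

With this sketch, U+00C1 comes out as the two bytes 0xC3 0x81, which is
what an object file's symbol table would hold under the wording
proposed above.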

GCC implementation issues
=========================

The following specific issues should be dealt with in the GCC
implementation.  Everything implemented needs appropriate tests in the
testsuite to cover it, for both C and C++.

(a) Probably implemented already; if not, should be done before the
feature is turned on by default in mainline:

* The precise sets of characters permitted in identifiers in each
standard (C99 and C++03) should be followed.

* A UCN is equivalent to the character it denotes.  This should be
implemented initially for the case of $, but if we start accepting
other extended characters then it should be implemented for them as
well.

* The \U and \u UCNs for the same character, and UCNs differing in
upper or lower case for hex digits, are equivalent.

* The greedy algorithm applies for lexing UCNs: for example,
a\U0000000z is three preprocessing tokens {a}{\}{U0000000z} (and
shouldn't get a diagnostic on lexing, presuming macros are defined
such that the eventual token sequence is valid).

* The spelling of UCNs is preserved for the # and ## operators.

* UCNs must not be accepted in identifiers or preprocessing numbers in
strict C90 mode: what would be an identifier containing a UCN in C99
is multiple preprocessing tokens in C90, and if the identifier
fragments are defined appropriately as macros, this could occur in a
valid C90 program.

* I think the only reasonable interpretation of the lexing rules in
the context of forbidden characters is that identifiers are lexed
first (allowing any UCNs), and bad characters then yield an error
(rather than the identifier stopping before the bad character, with
that character treated as not a UCN).

* These rules apply to identifiers as preprocessing tokens at any
time, including before concatenation.  So it is not the case in C99
that splitting an identifier anywhere yields two valid preprocessing
tokens: the second half could begin with a UCN for a digit and not be
a valid identifier.  (Invalid identifiers in C99 don't require
diagnostics, but I don't think we want to use this laxity.)
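
To make the equivalence and greedy-lexing points in (a) concrete, here
is a minimal sketch in C of recognising a UCN spelling.  The name
scan_ucn is hypothetical and this is not cpplib's code; it assumes an
ASCII-compatible execution character set, and it deliberately does not
check which code points each standard permits in identifiers.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper, not cpplib code.  If S begins with a
   well-formed UCN spelling (backslash, then 'u' and 4 hex digits or
   'U' and 8 hex digits), store the code point in *VALUE and return
   the number of characters consumed.  Otherwise return 0 and leave
   *VALUE untouched.  */
static size_t
scan_ucn (const char *s, unsigned long *value)
{
  assert (s != NULL && value != NULL);

  if (s[0] != '\\')
    return 0;

  size_t ndigits;
  if (s[1] == 'u')
    ndigits = 4;
  else if (s[1] == 'U')
    ndigits = 8;
  else
    return 0;

  unsigned long v = 0;
  for (size_t i = 0; i < ndigits; i++)
    {
      unsigned char c = (unsigned char) s[2 + i];
      /* A terminating NUL is not a hex digit, so a truncated spelling
         is rejected here without reading past the end of S.  */
      if (!isxdigit (c))
        return 0;
      /* Upper-case and lower-case hex digits give the same value.  */
      v = v * 16 + (unsigned long) (isdigit (c) ? c - '0'
                                    : tolower (c) - 'a' + 10);
    }

  *value = v;
  return 2 + ndigits;
}
```

Under these assumptions \u00c1 and \U000000C1 both yield 0xC1, so a
lexer comparing code points rather than spellings treats them as the
same character.  For \U0000000z the function returns 0, so a lexer
built on it would fall back to treating the backslash as a token of
its own, as in the a\U0000000z example above.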

(b) Not done and needs to happen before the feature is turned on by
default in mainline:

* The GCC testsuite should include a test that an identifier spelled
with the same UCN links between C and an extern "C" C++ declaration.

* There should be a warning by default for all identifiers (as
preprocessing tokens at any stage, e.g. including both before and
after concatenation) not in NFKC, which may be disabled by -Wno-nfkc.

* Preprocessing numbers can contain UCNs (and extended characters,
such as $, that are considered equivalent to them); this needs
implementing.

(c) Should happen before a release is made containing this feature:

* All uses of identifiers and DECL_ASSEMBLER_NAME in the compiler
should be audited to determine what sort of identifier is appropriate
in each case.  All places where an identifier may appear in a
diagnostic must handle extended identifiers appropriately; if the
locale cannot handle all characters in the identifier, UCNs need to be
used in diagnostic output.  The %E diagnostic format could be made to
do this, but there are many places using %s / %qs for diagnostics
which need fixing.

* Testcases in the GCC testsuite should include all contexts of
identifiers such as macro names, external linkage, internal linkage
and no linkage.  There should be tests for debug information
generation for such cases.  It would be desirable, though not required
if the necessary support isn't already in GDB, to add corresponding
tests to the GDB testsuite and make sure extended identifiers can be
used with GDB, with both DWARF3 and stabs.

* C99 does not permit UCNs for digits at the start of identifiers, but
does permit them elsewhere in identifiers, while C++ does not have
such a restriction.  The restriction in C99 and its absence in C++
should be tested.

* If platforms with limited assemblers, linkers or debug formats come
up, it would be desirable to be able to use names with internal or no
linkage containing extended characters on those platforms, with
appropriate mangling, even if defining an ABI with mangling for
external names is felt inappropriate.

* The C++ requirement that extended source characters (including '$')
are translated to UCNs in translation phase 1 needs implementing.
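
For the diagnostics point in (c), here is a minimal sketch in C of the
fallback where the locale cannot represent an identifier: each
non-ASCII character is printed as a UCN instead.  The name ucn_escape
is hypothetical, and the sketch assumes identifiers are held
internally as UTF-8, which is an assumption of this example rather
than something settled above.  It rejects malformed sequences but
does not check for overlong encodings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper, not GCC code.  Copy the UTF-8 string SRC into
   DST (capacity CAP bytes, including the terminating NUL), spelling
   each non-ASCII character as a UCN with 4 or 8 hex digits.  Return 0
   on success, or -1 if SRC is malformed or DST is too small; on
   failure DST holds a NUL-terminated partial result.  */
static int
ucn_escape (const char *src, char *dst, size_t cap)
{
  assert (src != NULL && dst != NULL);

  if (cap == 0)
    return -1;
  dst[0] = '\0';

  const unsigned char *p = (const unsigned char *) src;
  size_t out = 0;
  while (*p != '\0')
    {
      unsigned long cp;
      int extra;                /* Continuation bytes still expected.  */

      if (*p < 0x80)
        { cp = *p; extra = 0; }
      else if ((*p & 0xE0) == 0xC0)
        { cp = *p & 0x1FUL; extra = 1; }
      else if ((*p & 0xF0) == 0xE0)
        { cp = *p & 0x0FUL; extra = 2; }
      else if ((*p & 0xF8) == 0xF0)
        { cp = *p & 0x07UL; extra = 3; }
      else
        return -1;
      p++;

      for (; extra > 0; extra--, p++)
        {
          /* A NUL fails this test too, so we never run past SRC.  */
          if ((*p & 0xC0) != 0x80)
            return -1;
          cp = (cp << 6) | (*p & 0x3FUL);
        }

      int n;
      if (cp < 0x80)
        n = snprintf (dst + out, cap - out, "%c", (int) cp);
      else if (cp <= 0xFFFF)
        n = snprintf (dst + out, cap - out, "\\u%04lx", cp);
      else
        n = snprintf (dst + out, cap - out, "\\U%08lx", cp);
      if (n < 0 || (size_t) n >= cap - out)
        return -1;
      out += (size_t) n;
    }
  return 0;
}
```

Given the UTF-8 bytes for "café", this sketch produces caf\u00e9.
Something along these lines could sit behind a diagnostic format such
as %E, but where exactly the conversion belongs is part of the audit
described in (c).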


-- 
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jsm28 at gcc dot gnu dot org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9449