From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (qmail 2921 invoked by alias); 21 Feb 2005 14:15:26 -0000
Mailing-List: contact gcc-bugs-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Archive:
List-Post:
List-Help:
Sender: gcc-bugs-owner@gcc.gnu.org
Received: (qmail 2779 invoked by uid 48); 21 Feb 2005 14:15:13 -0000
Date: Mon, 21 Feb 2005 21:34:00 -0000
Message-ID: <20050221141513.2778.qmail@sourceware.org>
From: "jsm28 at gcc dot gnu dot org"
To: gcc-bugs@gcc.gnu.org
In-Reply-To: <20030127145600.9449.rearnsha@arm.com>
References: <20030127145600.9449.rearnsha@arm.com>
Reply-To: gcc-bugzilla@gcc.gnu.org
Subject: [Bug preprocessor/9449] UCNs not recognized in identifiers (c++/c99)
X-Bugzilla-Reason: CC
X-SW-Source: 2005-02/txt/msg02483.txt.bz2
List-Id:

------- Additional Comments From jsm28 at gcc dot gnu dot org  2005-02-21 14:15 -------
The following checklist for implementation of extended identifiers has
been discussed with and prioritised by Zack.  No doubt Neil will point
out if there are any missing technical points.


External specifications
=======================

Reasonable efforts should be made to get specifications of handling of
extended identifiers (that UCNs and other non-ASCII characters in
identifiers are encoded in UTF-8, at least on platforms using ASCII in
the symbol names in the first place) into the following
specifications.  Actually succeeding in doing so is not a blocker for
getting an implementation into GCC.

* ELF: , where it says "External C symbols have the same names in C
  and object files' symbol tables.".  I have attempted to get such
  wording in, the last version proposed being:

    Unless the operating system ABI specifies otherwise, it is
    recommended that characters in external C symbols, including
    characters outside the basic source character set whether or not
    designated in source files by universal character names, are
    encoded in UTF-8 in object files' symbol tables.

  and discussions being with ia64-abi@unix-os.sc.intel.com.

* C++ ABI: .  The appropriate form would be to add a statement that
  once the ABI has constructed a C symbol name which may contain UCNs,
  such name should be encoded according to the underlying C ABI,
  following .

The following specification already includes all the required text,
and GCC should implement it before a release is made supporting
extended identifiers:

* DWARF3: the DW_AT_use_UTF8 attribute should be set on the
  compilation unit entry for each compilation unit with any UTF-8
  identifiers (including ones such as structure element names which
  appear in debug information but not otherwise in external
  identifiers).  It may in fact be harmless to set it unconditionally.


GCC implementation issues
=========================

The following specific issues should be dealt with in the GCC
implementation.  Everything implemented needs appropriate tests in the
testsuite to cover it, for both C and C++.

(a) Probably implemented already; if not, should be done before
feature is turned on by default in mainline:

* The precise sets of characters permitted in identifiers in each
  standard (C99 and C++03) should be followed.

* A UCN is equivalent to the character it denotes.  This should be
  implemented initially for the case of $, but if we start accepting
  other extended characters then it should be implemented for them as
  well.

* The \U and \u UCNs for the same character, and UCNs differing in
  upper or lower case for hex digits, are equivalent.

* The greedy algorithm applies for lexing UCNs: for example,
  a\U0000000z is three preprocessing tokens {a}{\}{U0000000z} (and
  shouldn't get a diagnostic on lexing, presuming macros are defined
  such that the eventual token sequence is valid).

* The spelling of UCNs is preserved for the # and ## operators.

* UCNs must not be accepted in identifiers or preprocessing numbers in
  strict C90 mode: what in C99 would be an identifier with a UCN in
  C90 is multiple preprocessing tokens and if the identifier fragments
  are defined appropriately as macros this could occur in a valid C90
  program.

* I think the only reasonable interpretation of the lexing rules in
  the context of forbidden characters is that first identifiers are
  lexed (allowing any UCNs) then bad characters yield an error (rather
  than stopping the identifier before the bad character and treating
  it as not a UCN).

* These rules apply to identifiers as preprocessing tokens at any
  time, including before concatenation.  So it is not the case in C99
  that splitting an identifier anywhere yields two valid preprocessing
  tokens: the second half could begin with a UCN for a digit and not
  be a valid identifier.  (Invalid identifiers in C99 don't require
  diagnostics, but I don't think we want to use this laxity.)

(b) Not done and needs to happen before the feature is turned on by
default in mainline:

* The GCC testsuite should include a test that the same UCN links
  between C and an extern "C" C++ identifier.

* There should be a warning by default for all identifiers (as
  preprocessing tokens at any stage, e.g. including both before and
  after concatenation) not in NFKC, which may be disabled by
  -Wno-nfkc.

* Preprocessing numbers can contain UCNs (and extended characters such
  as $ considered equivalent to them).

(c) Should happen before a release is made containing this feature:

* All uses of identifiers and DECL_ASSEMBLER_NAME in the compiler
  should be audited to determine what sort of identifier is
  appropriate in each case.  All places where an identifier may appear
  in a diagnostic must handle extended identifiers appropriately; if
  the locale cannot handle all characters in the identifier, UCNs need
  to be used in diagnostic output.
  The %E diagnostic format could be made to do this, but there are
  many places using %s / %qs for diagnostics which need fixing.

* Testcases in the GCC testsuite should include all contexts of
  identifiers such as macro names, external linkage, internal linkage
  and no linkage.  There should be tests for debug information
  generation for such cases.  It would be desirable, though not
  required if the necessary support isn't already in GDB, to add
  corresponding tests to the GDB testsuite and make sure extended
  identifiers can be used with GDB, with both DWARF3 and stabs.

* C99 does not permit UCNs for digits at the start of identifiers, but
  does permit them elsewhere in identifiers, while C++ does not have
  such a restriction.  The restriction in C99 and its absence in C++
  should be tested.

* If platforms with limited assemblers or linkers or debug formats
  come up, it would be desirable to be able to use names with internal
  or no linkage containing external characters on those platforms,
  with appropriate mangling, even if defining an ABI with mangling for
  external names is felt inappropriate.

* The C++ requirement that extended source characters (including '$')
  are translated to UCNs in translation phase 1 needs implementing.

-- 
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jsm28 at gcc dot gnu dot org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9449
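
To illustrate the equivalence rules under (a) and the UTF-8 recommendation under "External specifications", here is a minimal Python sketch. It is not GCC code: the helper names decode_ucns and symbol_bytes are hypothetical, and the sketch does not check the C99 or C++03 sets of permitted characters. It only shows that \u and \U spellings, in either hex case, denote the same character, and what UTF-8 bytes such a name would produce.

```python
import re

# \uXXXX takes exactly 4 hex digits, \UXXXXXXXX exactly 8.
_UCN = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def decode_ucns(identifier: str) -> str:
    """Replace every well-formed UCN spelling with the character it denotes.

    Sketch only: no check that the character is permitted in identifiers.
    """
    def repl(match):
        hex_digits = match.group(1) or match.group(2)
        return chr(int(hex_digits, 16))

    return _UCN.sub(repl, identifier)


def symbol_bytes(identifier: str) -> bytes:
    """Hypothetical helper: the UTF-8 bytes a symbol table entry might hold."""
    return decode_ucns(identifier).encode("utf-8")
```

With these definitions, the spellings caf\u00e9, caf\u00E9 and caf\U000000e9 all decode to the same string, while a\U0000000z is left unchanged because only seven hex digits follow the \U, so no UCN is present.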
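
The NFKC warning proposed under (b) comes down to a normalization check on each identifier. A minimal Python sketch of that check follows, using the standard unicodedata module. The function names are hypothetical and the warning text and option handling are left out; this only shows which identifiers would qualify.

```python
import unicodedata


def is_nfkc(identifier: str) -> bool:
    """True if the identifier is already in Unicode normalization form KC."""
    return unicodedata.normalize("NFKC", identifier) == identifier


def non_nfkc_identifiers(identifiers):
    """Hypothetical helper: the identifiers a -Wnfkc style check would flag."""
    return [name for name in identifiers if not is_nfkc(name)]
```

A precomposed e-acute (U+00E9) passes, whereas the same letter written as e followed by a combining acute accent (U+0301), or a name containing the fi ligature (U+FB01), would be flagged.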
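
For the diagnostics item under (c), one possible fallback is to print each character as-is when the locale's encoding can represent it and as a UCN otherwise. The Python sketch below assumes the encoding is passed in by name. The function identifier_for_diagnostic is hypothetical and does not describe how %E or any other GCC format specifier behaves.

```python
def identifier_for_diagnostic(identifier: str, encoding: str) -> str:
    """Spell characters the target encoding cannot represent as UCNs."""
    pieces = []
    for ch in identifier:
        try:
            ch.encode(encoding)
            pieces.append(ch)
        except UnicodeEncodeError:
            code_point = ord(ch)
            if code_point <= 0xFFFF:
                pieces.append(f"\\u{code_point:04x}")
            else:
                pieces.append(f"\\U{code_point:08x}")
    return "".join(pieces)
```

In an ASCII locale an identifier containing U+00E9 would be shown with \u00e9 in its place, while in a UTF-8 or Latin-1 locale it would be shown unchanged.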