From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (qmail 3697 invoked by alias); 16 Sep 2005 00:02:23 -0000
Mailing-List: contact gcc-bugs-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Archive:
List-Post:
List-Help:
Sender: gcc-bugs-owner@gcc.gnu.org
Received: (qmail 3639 invoked by alias); 16 Sep 2005 00:02:03 -0000
Date: Fri, 16 Sep 2005 00:02:00 -0000
Message-ID: <20050916000203.3638.qmail@sourceware.org>
From: "geoffk at geoffk dot org"
To: gcc-bugs@gcc.gnu.org
In-Reply-To: <20030127145600.9449.rearnsha@arm.com>
References: <20030127145600.9449.rearnsha@arm.com>
Reply-To: gcc-bugzilla@gcc.gnu.org
Subject: [Bug preprocessor/9449] UCNs not recognized in identifiers (c++/c99)
X-Bugzilla-Reason: CC
X-SW-Source: 2005-09/txt/msg01920.txt.bz2
List-Id:

------- Additional Comments From geoffk at geoffk dot org 2005-09-16 00:01 -------
Subject: Re: UCNs not recognized in identifiers (c++/c99)

On 15/09/2005, at 3:53 PM, joseph at codesourcery dot com wrote:

> Yes, "spelling" is meant in terms of the source code characters.
> The idea is to permit simple strcmp-like checking by the
> preprocessor.

Good, so that answers that question.

You raise a good point about GCC not having documentation for phase 1. I don't have time to write all of it, but I think I can write the last part, about UCNs, so maybe together we can get it all done. My proposed wording is:

@cite{The mapping between physical source file multibyte characters and the source character set in translation phase 1 (C90 and C99 5.1.1.2).}

[CR/NL/CR-NL are turned into EOL markers, spaces are deleted between backslash and the end of a line, it's converted to UTF-8 using iconv based on -finput-charset---and what else?]

Then, any character sequence which would form a UCN in an identifier in phase 3 of translation is converted into the corresponding UTF-8 sequence. Any backslash-newline combinations in the UCN are preserved and placed after the UTF-8 sequence.
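As a rough illustration of the UCN step in the proposed wording (a sketch only, not GCC's actual implementation), the following Python converts each UCN to the character it names and moves any backslash-newline pairs found inside the UCN to just after it. The names `convert_ucns`, `_try_ucn` and `_skip_splices` are hypothetical. The sketch also makes simplifying assumptions: it converts every UCN it sees rather than only those that would end up in an identifier in phase 3, and it does not check which code points C99 permits.

```python
SPLICE = "\\\n"  # a backslash immediately followed by a newline
HEX = set("0123456789abcdefABCDEF")


def _skip_splices(text, i):
    """Skip backslash-newline pairs starting at i; return (new_index, count)."""
    count = 0
    while text.startswith(SPLICE, i):
        i += len(SPLICE)
        count += 1
    return i, count


def _try_ucn(text, i):
    """If a UCN starts at i, return (char, splice_count, end_index), else None."""
    if text[i] != "\\":
        return None
    j, splices = _skip_splices(text, i + 1)
    if j >= len(text) or text[j] not in "uU":
        return None
    want = 4 if text[j] == "u" else 8  # \uXXXX or \UXXXXXXXX
    j += 1
    digits = ""
    while len(digits) < want:
        j, n = _skip_splices(text, j)
        splices += n
        if j >= len(text) or text[j] not in HEX:
            return None
        digits += text[j]
        j += 1
    code = int(digits, 16)
    if code > 0x10FFFF or 0xD800 <= code <= 0xDFFF:
        return None  # not a code point that UTF-8 can encode
    return chr(code), splices, j


def convert_ucns(text):
    """Replace UCNs with the characters they name, splices placed after."""
    out = []
    i = 0
    while i < len(text):
        found = _try_ucn(text, i)
        if found is None:
            out.append(text[i])
            i += 1
        else:
            char, splices, i = found
            out.append(char + SPLICE * splices)
    return "".join(out)
```

Encoding the result with `convert_ucns(source).encode("utf-8")` gives the UTF-8 byte sequence. A UCN that is split by a backslash-newline, such as `\u00` followed by backslash-newline and `c1`, comes out as the two UTF-8 bytes for U+00C1 followed by the backslash-newline.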
[note that there's no way for a user to tell whether a backslash-newline combination is placed before, in the middle of, or after, the UTF-8 sequence.]

...

@cite{Which additional multibyte characters may appear in identifiers and their correspondence to universal character names (C99 6.4.2).}

UTF-8 character sequences may appear in identifiers, and they correspond to the UCN that specifies that character. A UTF-8 sequence may appear only if the UCN that it corresponds to would be permitted in the identifier at that point. At present, only those UTF-8 sequences which were produced by the mapping from UCNs to UTF-8 sequences in translation phase 1 are permitted, but this is likely to change in the future.

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9449
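The correspondence between a UTF-8 sequence in an identifier and the UCN that specifies the same character can be sketched in Python as below. This is an illustration, not GCC code: the function name `to_ucn_spelling` is hypothetical, and the choice of uppercase hex digits and of the short `\u` form for code points up to U+FFFF is an assumption of the sketch, since the wording above does not pick a canonical spelling. Whether a given UCN is permitted at that point in an identifier is not modelled.

```python
def to_ucn_spelling(identifier: bytes) -> str:
    """Spell each non-ASCII character of a UTF-8 identifier as a UCN."""
    parts = []
    for ch in identifier.decode("utf-8"):
        code = ord(ch)
        if code < 0x80:
            parts.append(ch)                  # basic characters stay as they are
        elif code <= 0xFFFF:
            parts.append("\\u%04X" % code)    # four hex digits
        else:
            parts.append("\\U%08X" % code)    # eight hex digits
    return "".join(parts)
```

For example, the bytes `b"\xc3\x81b"` (U+00C1 followed by `b`) map to the six-character UCN `\u00C1` followed by `b`.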