Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm
Date: Tue, 10 Sep 2019 23:47:00 -0000
From: Joseph Myers
To: Lewis Hyatt
Subject: Re: Patch to support extended characters in C/C++ identifiers
In-Reply-To: <20190812220121.GA9251@ldh.local>
References: <20190812220121.GA9251@ldh.local>

On Mon, 12 Aug 2019, Lewis Hyatt wrote:

> Hello-
>
> The attached patch for libcpp adds support for extended characters (e.g. UTF-8)
> in identifiers. A preliminary version of the patch was posted on PR c/67224 as
> Comment 26 (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67224#c26) and
> discussed with Joseph Myers. Here is an updated patch incorporating all
> feedback received so far. I hope it is suitable now; please let me know if I
> can do anything else to make it ready for you to apply. I am happy to work on
> it further, whatever is needed. I can't easily test on anything other than
> x86_64-linux though. I did bootstrap all languages and run all tests on that
> platform, everything was good.
>
> The (relatively short) changes to libcpp are included inline here. I attached
> the test cases as a gzipped patch to avoid any problems with the encoding (the
> test cases contain some invalid UTF-8 and also other encodings such as latin-1
> as part of the testing).
>
> Thanks for taking a look at it!

Thanks, I think this is OK with a few updates to the documentation.
Specifically:

cpp.texi says:

  In the 1999 C standard, identifiers may contain letters which are not
  part of the ``basic source character set'', at the implementation's
  discretion (such as accented Latin letters, Greek letters, or Chinese
  ideograms).  This may be done with an extended character set, or the
  @samp{\u} and @samp{\U} escape sequences.  GCC only accepts such
  characters in the @samp{\u} and @samp{\U} forms.

and it's no longer accurate to say that only the \u and \U forms are
accepted.

cpp.texi, section "Implementation-defined behavior", discusses
implementation-defined characters in identifiers.  It should say that GCC
accepts exactly those multibyte characters that correspond to UCNs for
characters permitted by the chosen version of the C or C++ standard.

cppopts.texi documents -fextended-identifiers as "Accept universal
character names in identifiers.".
That needs to say the characters are also accepted directly in the
identifiers.

I should also note that a few of the tests added by the patch are testing
things that are properties of the implementation that might arguably be
bugs, rather than standard features, and so perhaps should at least have
comments added saying they are testing those implementation properties.

gcc/testsuite/gcc.dg/cpp/ucnid-7-utf8.c, testing invalid UTF-8, is relying
on GCC, in its default -finput-charset=utf-8 mode, not actually checking
that the input is valid UTF-8.  It's clear that avoiding such a check makes
sense in strings and comments, both as a matter of efficiency and because
it's likely to do the right thing for a lot of user programs that use
non-UTF-8 character sets in those places and just need the bytes in the
strings to be passed through to the compiler output (rather than requiring
users to specify -finput-charset and -fexec-charset for those programs).
Outside those contexts it's less obvious what's the best way to behave
(this sort of test, where the stray non-UTF-8 bytes are in text that
disappears as a result of macro expansion, is certainly a corner case).

gcc/testsuite/g++.dg/cpp/ucnid-2-utf8.C and
gcc/testsuite/g++.dg/cpp/ucnid-3-utf8.C are testing double stringizing in
C++, where strictly the results they expect show that GCC does not conform
to the C++ standard requirement to convert all extended characters to UCNs
(because C++ does not have the special C rule making it
implementation-defined whether the \ of a UCN in a string literal is
doubled when stringizing).

-- 
Joseph S. Myers
joseph@codesourcery.com