From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <8af7ac06-02a0-40a6-5fd7-56a4e40cccee@redhat.com>
Date: Wed, 15 Feb 2023 13:39:30 -0500
Subject: Re: Ping^3: [PATCH] libcpp: Handle extended characters in user-defined literal suffix [PR103902]
To: Lewis Hyatt, gcc-patches@gcc.gnu.org
References: <20220614212649.GA58025@ldh-imac.local> <20220615190616.GA70682@ldh-imac.local> <20220926222725.GA19652@ldh-imac.local>
From: Jason Merrill
In-Reply-To: <20220926222725.GA19652@ldh-imac.local>

On 9/26/22 15:27, Lewis Hyatt wrote:
> On Wed, Jun 15, 2022 at 03:06:16PM -0400, Lewis Hyatt wrote:
>> On Tue, Jun 14, 2022 at 05:26:49PM -0400, Lewis Hyatt wrote:
>>> Hello-
>>>
>>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103902
>>>
>>> The attached patch resolves PR preprocessor/103902 as described in the patch
>>> message inline below. bootstrap + regtest all languages was successful on
>>> x86-64 Linux, with no new failures:
>>>
>>> FAIL         103    103
>>> PASS         542338 542371
>>> UNSUPPORTED  15247  15250
>>> UNTESTED     136    136
>>> XFAIL        4166   4166
>>> XPASS        17     17
>>>
>>> Please let me know if it looks OK?
>>>
>>> A few questions I have:
>>>
>>> - A difference introduced with this patch is that after lexing something
>>> like `operator ""_abc', then `_abc' is added to the identifier hash map,
>>> whereas previously it was not. I feel like this must be OK because with the
>>> optional space as in `operator "" _abc', it would be added with or without the
>>> patch.
>>>
>>> - The behavior of `#pragma GCC poison' is not consistent (including prior to
>>> my patch). I tried to make it more so but there is still one thing I want to
>>> ask about. Leaving aside extended characters for now, the inconsistency is
>>> that currently the poison is only checked, when the suffix appears as a
>>> standalone token.
>>>
>>> #pragma GCC poison _X
>>> bool operator ""_X (unsigned long long);   //accepted before the patch,
>>>                                            //rejected after it
>>> bool operator "" _X (unsigned long long);  //rejected either before or after
>>> const char * operator ""_X (const char *, unsigned long); //accepted before,
>>>                                                           //rejected after
>>> const char * operator "" _X (const char *, unsigned long); //rejected either
>>>
>>> const char * s = ""_X; //accepted before the patch, rejected after it
>>> const bool b = 1_X; //accepted before or after ****
>>>
>>> I feel like after the patch, the behavior is the expected behavior for all
>>> cases but the last one. Here, we allow the poisoned identifier because it's
>>> not lexed as an identifier, it's lexed as part of a pp-number. Does it seem OK
>>> like this or does it need to be addressed?
>>
>> Sorry, that version actually did not handle the case of -Wc++11-compat in
>> c++98 mode correctly. This updated version fixes that and adds the missing
>> test coverage for that, if you could please review this one instead?
>>
>> By the way, the pipermail archive seems to permanently mangle UTF-8 in inline
>> attachments. I attached the patch also gzipped to address that for the
>> archive, since the new testcases do use non-ASCII characters.
>>
>> Thanks for taking a look!
>
> Hello-
>
> May I please ping this patch again? Joseph suggested that it would be best if
> a C++ maintainer has a look at it. This is one of just a few places left where
> we don't handle UTF-8 properly in libcpp, it would be really nice to get them
> fixed up if there is time to review this patch. Thanks!
>
> https://gcc.gnu.org/pipermail/gcc-patches/2022-June/596704.html
>
> I re-attached it here as it required some trivial rebasing on top of recently
> pushed changes. As before, I also attached the gzipped version so that the
> UTF-8 testcases show up OK in the online archive, in case that's still an
> issue. Thanks for taking a look!

Thank you for the patch, sorry it slipped off my radar.

> This patch fixes it by adding a new function scan_cur_identifier() that can be
> used to lex an identifier while in the middle of lexing another token. It is
> somewhat duplicative of the code in lex_identifier(), which handles the normal
> case, but I think there's no good way to avoid that without pessimizing the
> usual case, since lex_identifier() takes advantage of the fact that the first
> character of the identifier has already been analyzed.

So could you analyze the first character and then call lex_identifier?

> With scan_cur_identifier(), we do also correctly warn about bidi and
> normalization issues in the extended identifiers comprising the suffix, and we
> check for poisoned identifiers there as well.

Hmm, I don't think we want the check for poisoned identifiers; a suffix
is not a name.  That goes for the other diagnostics in
identifier_diagnostics_on_lex, as well.

At the meeting last week the committee decided to deprecate the
declaration with a space to clarify this distinction.

> +      if (!accum.accum)
> +        create_literal2 (pfile, token, base,
> +                         suffix_begin - base,
> +                         NODE_NAME (sr.node),
> +                         NODE_LEN (sr.node),
> +                         type);
> +      else
> +        {
> +          accum.create_literal2 (pfile, token, base,
> +                                 suffix_begin - base,
> +                                 NODE_NAME (sr.node),
> +                                 NODE_LEN (sr.node),
> +                                 type);
> +          _cpp_release_buff (pfile, accum.first);
> +        }

How about always calling accum.create_literal2?

Jason
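
For reference, a minimal sketch of the "a suffix is not a name" point, assuming two made-up ASCII suffixes, `_twice` and `_len`, that do not appear in the patch or its testcases. In the form without a space, the suffix is lexed as part of the literal token (a user-defined string literal or a pp-number) rather than as a standalone identifier. That is the case the poison question above turns on.

```cpp
#include <cstddef>

// Hypothetical suffixes for illustration only; not taken from the patch.

// Declared without a space: `""_twice' is a single user-defined string
// literal token, so `_twice' never appears as a standalone identifier.
constexpr unsigned long long operator ""_twice (unsigned long long v)
{
  return 2 * v;
}

// String form: receives the literal's characters and its length
// (the length excludes the terminating NUL).
constexpr std::size_t operator ""_len (const char *, std::size_t n)
{
  return n;
}

// At the point of use, `21_twice' is one pp-number and `"abc"_len' is one
// user-defined string literal; neither contains a separate identifier token.
static_assert (21_twice == 42, "integer literal with suffix");
static_assert ("abc"_len == 3, "string literal with suffix");
```

The spaced declaration (`operator "" _twice') is left out of the sketch on purpose: there the suffix really is lexed as a separate identifier, which is why the two spellings have behaved differently under `#pragma GCC poison'.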