From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=k5kX=6L=gmail.com=lhyatt@sourceware.org>
Received: from mail-lf1-x131.google.com (mail-lf1-x131.google.com [IPv6:2a00:1450:4864:20::131])
	by sourceware.org (Postfix) with ESMTPS id 8F17C3858425
	for <gcc-patches@gcc.gnu.org>; Wed, 15 Feb 2023 23:19:02 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 8F17C3858425
Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=gmail.com
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=gmail.com
Received: by mail-lf1-x131.google.com with SMTP id w11so594415lfu.11
        for <gcc-patches@gcc.gnu.org>; Wed, 15 Feb 2023 15:19:02 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20210112;
        h=cc:to:subject:message-id:date:from:in-reply-to:references
         :mime-version:from:to:cc:subject:date:message-id:reply-to;
        bh=XpQ34Q95yGEzvPNhVMOoi+EvA4ep2QxQkfjfXS39ngc=;
        b=X405DmSrwBKKtzQ5Y+bGfcW0h+6dd5O3qFxv1kZEb9rQVIlhF2Dcp5G2ts1ZuTLNmV
         1IhMNOXZZlOR5WRHbudU8X2gPyQgYe1W5sIk48aQSDmQo8YWUVwEG2Tc/FHPf86Gcwui
         5TaH3Jb/UbJWvSb+EKEEekG3iKAz1o9GQcUhAp/KAuAfSfbVCUnLW6BZBb8BduCprkPg
         Ejl5RIcCQXViR65punbpCsNXdudD3U2LwDL7wvXK6//4SPU8Q86K6i08TlJjdl4K2WIz
         7baOflST3pbp3v01HdZflSl+qBpsD2qZ68KdwFQvOoqAPVSB2k/ZBL9GcOcGuXbN+4Ij
         uGrQ==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20210112;
        h=cc:to:subject:message-id:date:from:in-reply-to:references
         :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id
         :reply-to;
        bh=XpQ34Q95yGEzvPNhVMOoi+EvA4ep2QxQkfjfXS39ngc=;
        b=AolcYSCL2TE6BSsVeN4IB9hXH1mJKhOvvOPcvjcc94merjQI3vpDe5Clx+NyKCSbd+
         zgcwnPT1N1uatX9nrY4miJghSJ9XBY2ikRLFBEGLzbkQRahz3acthgTcoFOds21ydol0
         mVDDhonC8vguv0cdxIahvCe0THP9dsgOOFg1XSLU3VieZ1TrnZrSFdK3/CBODSQrP9NR
         FkgkVJGynEGNThagD4+oW0LSHxBgvu0DQqY1NjV58xTtXWPFgjp25lrv8d0TtANKikmC
         UoEVk8cloXL8O3treM75uglj3pcyHL4i8DSSwgcvMPjv2M7qcl3sZVWkBDBVO76hEClJ
         dTlg==
X-Gm-Message-State: AO0yUKX7evEMWmRjdoV0n08qLyACBwLuiP/pXfhsqt9rfheLaq3vxGoA
	/DMKM84OeY9BZKQLTEFkmAg0vURQ5yIwkq2365VeiY8o
X-Google-Smtp-Source: AK7set93R7Sv9MPfccecJlrlOFmmz3lhCogbfAmPTgTWknTHBIT6IvZm6dAaqOkWodhKXutn3p54CJ5uJ7Wy7MoCDZ8=
X-Received: by 2002:ac2:43b9:0:b0:4d5:ca32:68a3 with SMTP id
 t25-20020ac243b9000000b004d5ca3268a3mr1030804lfl.8.1676503140116; Wed, 15 Feb
 2023 15:19:00 -0800 (PST)
MIME-Version: 1.0
References: <20220614212649.GA58025@ldh-imac.local> <20220615190616.GA70682@ldh-imac.local>
 <20220926222725.GA19652@ldh-imac.local> <8af7ac06-02a0-40a6-5fd7-56a4e40cccee@redhat.com>
In-Reply-To: <8af7ac06-02a0-40a6-5fd7-56a4e40cccee@redhat.com>
From: Lewis Hyatt <lhyatt@gmail.com>
Date: Wed, 15 Feb 2023 18:18:48 -0500
Message-ID: <CAA_5UQ7-Sz4OAEB_qqAyswt6JGCOpiRxACCUjWoGPpwG4dEdVQ@mail.gmail.com>
Subject: Re: Ping^3: [PATCH] libcpp: Handle extended characters in
 user-defined literal suffix [PR103902]
To: Jason Merrill <jason@redhat.com>
Cc: gcc-patches@gcc.gnu.org
Content-Type: text/plain; charset="UTF-8"
X-Spam-Status: No, score=-3029.2 required=5.0 tests=BAYES_00,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,FREEMAIL_FROM,KAM_SHORT,RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_PASS,TXREP autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org
List-Id: <gcc-patches.gcc.gnu.org>

On Wed, Feb 15, 2023 at 1:39 PM Jason Merrill <jason@redhat.com> wrote:
>
> On 9/26/22 15:27, Lewis Hyatt wrote:
> > On Wed, Jun 15, 2022 at 03:06:16PM -0400, Lewis Hyatt wrote:
> >> On Tue, Jun 14, 2022 at 05:26:49PM -0400, Lewis Hyatt wrote:
> >>> Hello-
> >>>
> >>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103902
> >>>
> >>> The attached patch resolves PR preprocessor/103902 as described in the patch
> >>> message inline below. bootstrap + regtest all languages was successful on
> >>> x86-64 Linux, with no new failures:
> >>>
> >>> FAIL 103 103
> >>> PASS 542338 542371
> >>> UNSUPPORTED 15247 15250
> >>> UNTESTED 136 136
> >>> XFAIL 4166 4166
> >>> XPASS 17 17
> >>>
> >>> Please let me know if it looks OK?
> >>>
> >>> A few questions I have:
> >>>
> >>> - A difference introduced with this patch is that after lexing something
> >>> like `operator ""_abc', then `_abc' is added to the identifier hash map,
> >>> whereas previously it was not. I feel like this must be OK because with the
> >>> optional space as in `operator "" _abc', it would be added with or without the
> >>> patch.
> >>>
> >>> - The behavior of `#pragma GCC poison' is not consistent (including prior to
> >>>    my patch). I tried to make it more so but there is still one thing I want to
> >>>    ask about. Leaving aside extended characters for now, the inconsistency is
> >>>    that currently the poison is only checked, when the suffix appears as a
> >>>    standalone token.
> >>>
> >>>    #pragma GCC poison _X
> >>>    bool operator ""_X (unsigned long long);   //accepted before the patch,
> >>>                                               //rejected after it
> >>>    bool operator "" _X (unsigned long long);  //rejected either before or after
> >>>    const char * operator ""_X (const char *, unsigned long); //accepted before,
> >>>                                                              //rejected after
> >>>    const char * operator "" _X (const char *, unsigned long); //rejected either
> >>>
> >>>    const char * s = ""_X; //accepted before the patch, rejected after it
> >>>    const bool b = 1_X; //accepted before or after ****
> >>>
> >>> I feel like after the patch, the behavior is the expected behavior for all
> >>> cases but the last one. Here, we allow the poisoned identifier because it's
> >>> not lexed as an identifier, it's lexed as part of a pp-number. Does it seem OK
> >>> like this or does it need to be addressed?
> >>
> >> Sorry, that version actually did not handle the case of -Wc++11-compat in
> >> c++98 mode correctly. This updated version fixes that and adds the missing
> >> test coverage for that, if you could please review this one instead?
> >>
> >> By the way, the pipermail archive seems to permanently mangle UTF-8 in inline
> >> attachments. I attached the patch also gzipped to address that for the
> >> archive, since the new testcases do use non-ASCII characters.
> >>
> >> Thanks for taking a look!
> >
> > Hello-
> >
> > May I please ping this patch again? Joseph suggested that it would be best if
> > a C++ maintainer has a look at it. This is one of just a few places left where
> > we don't handle UTF-8 properly in libcpp, it would be really nice to get them
> > fixed up if there is time to review this patch. Thanks!
> >
> > https://gcc.gnu.org/pipermail/gcc-patches/2022-June/596704.html
> >
> > I re-attached it here as it required some trivial rebasing on top of recently
> > pushed changes. As before, I also attached the gzipped version so that the
> > UTF-8 testcases show up OK in the online archive, in case that's still an
> > issue. Thanks for taking a look!
>
> Thank you for the patch, sorry it slipped off my radar.
>

Thanks for taking a look at it. It's certainly an edge case that is
not bothering anyone too much, so no rush with it.

> > This patch fixes it by adding a new function scan_cur_identifier() that can be
> > used to lex an identifier while in the middle of lexing another token. It is
> > somewhat duplicative of the code in lex_identifier(), which handles the normal
> > case, but I think there's no good way to avoid that without pessimizing the
> > usual case, since lex_identifier() takes advantage of the fact that the first
> > character of the identifier has already been analyzed.
>
> So could you analyze the first character and then call lex_identifier?
>

Yes, it can be done this way. lex_identifier may need some adaptations
though, since it does some other work like tracking the original
spelling of the identifier. Plus per your comments below, it would
need to avoid the poison and other checks too.
I think it's pretty straightforward to refactor a bit so that it works
out. I kinda thought it may not be desirable to touch lex_identifier,
which is called on everything, just to handle this rare case, however
I am happy to do it this way after confirming it won't hurt
performance.

> > With scan_cur_identifier(), we do also correctly warn about bidi and
> > normalization issues in the extended identifiers comprising the suffix, and we
> > check for poisoned identifiers there as well.
>
> Hmm, I don't think we want the check for poisoned identifiers; a suffix
> is not a name.  That goes for the other diagnostics in
> identifier_diagnostics_on_lex, as well.  At the meeting last week the
> committee decided to deprecate the declaration with a space to clarify
> this distinction.
>

OK thanks, interesting.

> > +           if (!accum.accum)
> > +             create_literal2 (pfile, token, base,
> > +                              suffix_begin - base,
> > +                              NODE_NAME (sr.node),
> > +                              NODE_LEN (sr.node),
> > +                              type);
> > +           else
> > +             {
> > +               accum.create_literal2 (pfile, token, base,
> > +                                      suffix_begin - base,
> > +                                      NODE_NAME (sr.node),
> > +                                      NODE_LEN (sr.node),
> > +                                      type);
> > +               _cpp_release_buff (pfile, accum.first);
> > +             }
>
> How about always calling accum.create_literal2?

Yes it should be equivalent and better that way.

Sounds like I should try a version that attempts to reuse the logic in
lex_identifier and cleans up your other points, so I'll look into that
next. Thanks!


-Lewis