Re: Ping^3: [PATCH] libcpp: Handle extended characters in user-defined literal suffix [PR103902]

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

From: Lewis Hyatt <lhyatt@gmail.com>
To: Jason Merrill <jason@redhat.com>
Cc: gcc-patches@gcc.gnu.org
Subject: Re: Ping^3: [PATCH] libcpp: Handle extended characters in user-defined literal suffix [PR103902]
Date: Wed, 15 Feb 2023 18:18:48 -0500	[thread overview]
Message-ID: <CAA_5UQ7-Sz4OAEB_qqAyswt6JGCOpiRxACCUjWoGPpwG4dEdVQ@mail.gmail.com> (raw)
In-Reply-To: <8af7ac06-02a0-40a6-5fd7-56a4e40cccee@redhat.com>

On Wed, Feb 15, 2023 at 1:39 PM Jason Merrill <jason@redhat.com> wrote:
>
> On 9/26/22 15:27, Lewis Hyatt wrote:
> > On Wed, Jun 15, 2022 at 03:06:16PM -0400, Lewis Hyatt wrote:
> >> On Tue, Jun 14, 2022 at 05:26:49PM -0400, Lewis Hyatt wrote:
> >>> Hello-
> >>>
> >>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103902
> >>>
> >>> The attached patch resolves PR preprocessor/103902 as described in the patch
> >>> message inline below. bootstrap + regtest all languages was successful on
> >>> x86-64 Linux, with no new failures:
> >>>
> >>> FAIL 103 103
> >>> PASS 542338 542371
> >>> UNSUPPORTED 15247 15250
> >>> UNTESTED 136 136
> >>> XFAIL 4166 4166
> >>> XPASS 17 17
> >>>
> >>> Please let me know if it looks OK?
> >>>
> >>> A few questions I have:
> >>>
> >>> - A difference introduced with this patch is that after lexing something
> >>> like `operator ""_abc', then `_abc' is added to the identifier hash map,
> >>> whereas previously it was not. I feel like this must be OK because with the
> >>> optional space as in `operator "" _abc', it would be added with or without the
> >>> patch.
> >>>
> >>> - The behavior of `#pragma GCC poison' is not consistent (including prior to
> >>>    my patch). I tried to make it more so but there is still one thing I want to
> >>>    ask about. Leaving aside extended characters for now, the inconsistency is
> >>>    that currently the poison is only checked, when the suffix appears as a
> >>>    standalone token.
> >>>
> >>>    #pragma GCC poison _X
> >>>    bool operator ""_X (unsigned long long);   //accepted before the patch,
> >>>                                               //rejected after it
> >>>    bool operator "" _X (unsigned long long);  //rejected either before or after
> >>>    const char * operator ""_X (const char *, unsigned long); //accepted before,
> >>>                                                              //rejected after
> >>>    const char * operator "" _X (const char *, unsigned long); //rejected either
> >>>
> >>>    const char * s = ""_X; //accepted before the patch, rejected after it
> >>>    const bool b = 1_X; //accepted before or after ****
> >>>
> >>> I feel like after the patch, the behavior is the expected behavior for all
> >>> cases but the last one. Here, we allow the poisoned identifier because it's
> >>> not lexed as an identifier, it's lexed as part of a pp-number. Does it seem OK
> >>> like this or does it need to be addressed?
> >>
> >> Sorry, that version actually did not handle the case of -Wc++11-compat in
> >> c++98 mode correctly. This updated version fixes that and adds the missing
> >> test coverage for that, if you could please review this one instead?
> >>
> >> By the way, the pipermail archive seems to permanently mangle UTF-8 in inline
> >> attachments. I attached the patch also gzipped to address that for the
> >> archive, since the new testcases do use non-ASCII characters.
> >>
> >> Thanks for taking a look!
> >
> > Hello-
> >
> > May I please ping this patch again? Joseph suggested that it would be best if
> > a C++ maintainer has a look at it. This is one of just a few places left where
> > we don't handle UTF-8 properly in libcpp, it would be really nice to get them
> > fixed up if there is time to review this patch. Thanks!
> >
> > https://gcc.gnu.org/pipermail/gcc-patches/2022-June/596704.html
> >
> > I re-attached it here as it required some trivial rebasing on top of recently
> > pushed changes. As before, I also attached the gzipped version so that the
> > UTF-8 testcases show up OK in the online archive, in case that's still an
> > issue. Thanks for taking a look!
>
> Thank you for the patch, sorry it slipped off my radar.
>

Thanks for taking a look at it. It's certainly an edge case that is
not bothering anyone too much, so no rush with it.

> > This patch fixes it by adding a new function scan_cur_identifier() that can be
> > used to lex an identifier while in the middle of lexing another token. It is
> > somewhat duplicative of the code in lex_identifier(), which handles the normal
> > case, but I think there's no good way to avoid that without pessimizing the
> > usual case, since lex_identifier() takes advantage of the fact that the first
> > character of the identifier has already been analyzed.
>
> So could you analyze the first character and then call lex_identifier?
>

Yes, it can be done this way. lex_identifier may need some adaptations
though, since it does some other work like tracking the original
spelling of the identifier. Plus per your comments below, it would
need to avoid the poison and other checks too.
I think it's pretty straightforward to refactor a bit so that it works
out. I kinda thought it may not be desirable to touch lex_identifier,
which is called on everything, just to handle this rare case, however
I am happy to do it this way after confirming it won't hurt
performance.

> > With scan_cur_identifier(), we do also correctly warn about bidi and
> > normalization issues in the extended identifiers comprising the suffix, and we
> > check for poisoned identifiers there as well.
>
> Hmm, I don't think we want the check for poisoned identifiers; a suffix
> is not a name.  That goes for the other diagnostics in
> identifier_diagnostics_on_lex, as well.  At the meeting last week the
> committee decided to deprecate the declaration with a space to clarify
> this distinction.
>

OK thanks, interesting.

> > +           if (!accum.accum)
> > +             create_literal2 (pfile, token, base,
> > +                              suffix_begin - base,
> > +                              NODE_NAME (sr.node),
> > +                              NODE_LEN (sr.node),
> > +                              type);
> > +           else
> > +             {
> > +               accum.create_literal2 (pfile, token, base,
> > +                                      suffix_begin - base,
> > +                                      NODE_NAME (sr.node),
> > +                                      NODE_LEN (sr.node),
> > +                                      type);
> > +               _cpp_release_buff (pfile, accum.first);
> > +             }
>
> How about always calling accum.create_literal2?

Yes it should be equivalent and better that way.

Sounds like I should try a version that attempts to reuse the logic in
lex_identifier and cleans up your other points, so I'll look into that
next. Thanks!


-Lewis

next prev parent reply	other threads:[~2023-02-15 23:19 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-06-14 21:26 Lewis Hyatt
2022-06-15 19:06 ` Lewis Hyatt
2022-09-26 22:27   ` Ping^3: " Lewis Hyatt
2023-02-10 16:30     ` Jakub Jelinek
2023-02-10 16:58       ` Lewis Hyatt
2023-02-10 17:10         ` Jakub Jelinek
2023-02-15 18:39     ` Jason Merrill
2023-02-15 23:18       ` Lewis Hyatt [this message]
2023-03-02 23:07         ` [PATCH v2] " Lewis Hyatt
2023-07-18 20:33           ` Jason Merrill

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAA_5UQ7-Sz4OAEB_qqAyswt6JGCOpiRxACCUjWoGPpwG4dEdVQ@mail.gmail.com \
    --to=lhyatt@gmail.com \
    --cc=gcc-patches@gcc.gnu.org \
    --cc=jason@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).