From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=r/2I=C5=gmail.com=lhyatt@sourceware.org>
Received: from mail-lf1-x130.google.com (mail-lf1-x130.google.com [IPv6:2a00:1450:4864:20::130])
	by sourceware.org (Postfix) with ESMTPS id 3C5EB3858D1E
	for <gcc-patches@gcc.gnu.org>; Tue, 11 Jul 2023 23:30:50 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 3C5EB3858D1E
Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=gmail.com
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=gmail.com
Received: by mail-lf1-x130.google.com with SMTP id 2adb3069b0e04-4f95bf5c493so9546723e87.3
        for <gcc-patches@gcc.gnu.org>; Tue, 11 Jul 2023 16:30:50 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20221208; t=1689118248; x=1691710248;
        h=content-transfer-encoding:cc:to:subject:message-id:date:from
         :in-reply-to:references:mime-version:from:to:cc:subject:date
         :message-id:reply-to;
        bh=d2/+AefewqHEhg9J1yMtnVc27+xbkz0nsUsZGqc2LXQ=;
        b=Oivtqv8swoV4VMc5ikKMHSyrJ6iTuZHw38PIhmHFj5rSA1nRMVARmD2ExCCKngdHKu
         riyBbd74Udo7UB5Vx6ZPTdgveehgWiNqzJDA9KBaw6wMhRAXe8XOGz5do9NLkhENVL61
         8u4eNfAwG3lS0PHZ2+TU7QHnBSjjnyHKFzn8l1AjKeJ4bncqDhCKWfVav/SuH0aSJlUS
         ptmfyl2eiJcz5GU9tXoYKUmq7bPFEr70sOom6dzn8A/1grB+PJHDJxDBj248oN93f8Mx
         D+zzXxpSWJ+W19GnLqLJgIwaifGKXIKqUciIHlQOC3bVPpRh5P2UEKJhgSVsH8TWivge
         bq0Q==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20221208; t=1689118248; x=1691710248;
        h=content-transfer-encoding:cc:to:subject:message-id:date:from
         :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc
         :subject:date:message-id:reply-to;
        bh=d2/+AefewqHEhg9J1yMtnVc27+xbkz0nsUsZGqc2LXQ=;
        b=OozJ1LlIVmYd4vpgFeA+FveTW9Bjzj9/cGQV0a1glH+hX7CZHzjBZadrUuvz27Vhui
         Oltl6XTWn0JI7aB8kGDOugl2LA+rr+l9b7MMOftlyyd9+s6HxLU/hTc4pF5C/wE836QS
         ZobOw68GMzCCzIAtodR/P6BALIcnocSLYEHrMjZ8xJz5DqK3St9iZ61iwXFgcF7FVcyr
         bPkJbUfXHNeJqcKFZngLKoAJtW9GnvI7wXz5eRY7M7meJ3s9x3Ekea9wwp3Dt/O4fdjO
         vX/l8y9D2gfbpeieDZjjniX+yqOV6olRrtfUahSZU/ZKh68WSQa22BthBz1suHmK7I/g
         EVgg==
X-Gm-Message-State: ABy/qLZZio3FNzKIl2Omo2N1IzYHBAlgvGkcJC8QfH0COIbscbK0dkcX
	IOO77ctTNujYiwPTlAsGLs97hQgoO0+TkagsAsTPCGw7
X-Google-Smtp-Source: APBJJlF1q2SklFM2nolmMTmGQfVQBQ2YdSDuuuPRPHRqUQvsIdYD1lH51z0srX6drf5+70/LIV72KnOIDbgJ18mb5IU=
X-Received: by 2002:a19:7114:0:b0:4f9:56aa:26c5 with SMTP id
 m20-20020a197114000000b004f956aa26c5mr13366732lfc.25.1689118247821; Tue, 11
 Jul 2023 16:30:47 -0700 (PDT)
MIME-Version: 1.0
References: <CAA_5UQ4F6u5M++sRiF5zKVMACGokTqmBXe=W-b32qGBPeS0uSg@mail.gmail.com>
 <CAA_5UQ5dcrDukoMv=HQP6ypVs8-3HVFbaOi0PSauFyDRDKFr2w@mail.gmail.com>
In-Reply-To: <CAA_5UQ5dcrDukoMv=HQP6ypVs8-3HVFbaOi0PSauFyDRDKFr2w@mail.gmail.com>
From: Lewis Hyatt <lhyatt@gmail.com>
Date: Tue, 11 Jul 2023 19:30:36 -0400
Message-ID: <CAA_5UQ412NKMeuTm0EvV8KAUNv57wnEcAhWRYrtNBEJo0fJPvA@mail.gmail.com>
Subject: Ping: [PATCH v2] libcpp: Handle extended characters in user-defined
 literal suffix [PR103902]
To: gcc-patches List <gcc-patches@gcc.gnu.org>
Cc: Jason Merrill <jason@redhat.com>, Jakub Jelinek <jakub@redhat.com>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Spam-Status: No, score=-3035.5 required=5.0 tests=BAYES_00,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,FREEMAIL_FROM,GIT_PATCH_0,KAM_SHORT,RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_PASS,TXREP,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org
List-Id: <gcc-patches.gcc.gnu.org>

May I please ping this patch again? I think it would be worthwhile to
close this gap in the support for UTF-8 sources. Thanks!
https://gcc.gnu.org/pipermail/gcc-patches/2023-March/613247.html
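
For anyone new to the thread, here is a rough stand-alone sketch of the
gap. It is not the libcpp code: every name below is made up for this
note, and the "byte >= 0x80" test is only a crude stand-in for the real
UTF-8/UCN validation that forms_identifier_p performs.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helpers for illustration only; none of this is libcpp.

// ASCII letters and underscore (roughly ISIDST, ignoring '$').
constexpr bool is_ascii_id_start (unsigned char c)
{
  return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_';
}

// ASCII letters, digits and underscore (roughly ISIDNUM).
constexpr bool is_ascii_id_cont (unsigned char c)
{
  return is_ascii_id_start (c) || (c >= '0' && c <= '9');
}

// ASCII-only scan, in the spirit of the old one-off loop.  Returns the
// number of suffix bytes recognized starting at POS.
std::size_t scan_ascii_suffix (const std::string &s, std::size_t pos)
{
  std::size_t i = pos;
  if (i < s.size () && is_ascii_id_start (s[i]))
    {
      ++i;
      while (i < s.size () && is_ascii_id_cont (s[i]))
        ++i;
    }
  return i - pos;
}

// Extended scan: also accepts any byte >= 0x80 (assumption: the input
// is well-formed UTF-8; no UCN handling, no Unicode property checks).
std::size_t scan_extended_suffix (const std::string &s, std::size_t pos)
{
  auto ext = [] (unsigned char c) { return c >= 0x80; };
  std::size_t i = pos;
  if (i < s.size () && (is_ascii_id_start (s[i]) || ext (s[i])))
    {
      ++i;
      while (i < s.size () && (is_ascii_id_cont (s[i]) || ext (s[i])))
        ++i;
    }
  return i - pos;
}
```

On the three suffix bytes '_', 0xCF, 0x80 (the UTF-8 spelling of _π),
the ASCII-only scan stops after the underscore, while the extended scan
takes all three bytes.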

-Lewis

On Fri, Jun 2, 2023 at 9:45=E2=80=AFAM Lewis Hyatt <lhyatt@gmail.com> wrote=
:
>
> Hello-
>
> Ping please? Thanks.
> https://gcc.gnu.org/pipermail/gcc-patches/2023-March/613247.html
>
> -Lewis
>
> On Tue, May 2, 2023 at 9:27=E2=80=AFAM Lewis Hyatt <lhyatt@gmail.com> wro=
te:
> >
> > May I please ping this one? Thanks...
> > https://gcc.gnu.org/pipermail/gcc-patches/2023-March/613247.html
> >
> > On Thu, Mar 2, 2023 at 6:21=E2=80=AFPM Lewis Hyatt <lhyatt@gmail.com> w=
rote:
> > >
> > > The PR complains that we do not handle UTF-8 in the suffix for a user=
-defined
> > > literal, such as:
> > >
> > > bool operator ""_=CF=80 (unsigned long long);
> > >
> > > In fact we don't handle any extended identifier characters there, whe=
ther
> > > UTF-8, UCNs, or the $ sign. We do handle it fine if the optional spac=
e after
> > > the "" tokens is included, since then the identifier is lexed in the =
"normal"
> > > way as its own token. But when it is lexed as part of the string toke=
n, this
> > > is handled in lex_string() with a one-off loop that is not aware of e=
xtended
> > > characters.
> > >
> > > This patch fixes it by adding a new function scan_cur_identifier() th=
at can be
> > > used to lex an identifier while in the middle of lexing another token=
.
> > >
> > > BTW, the other place that has been mis-lexing identifiers is
> > > lex_identifier_intern(), which is used to implement #pragma push_macr=
o
> > > and #pragma pop_macro. This does not support extended characters eith=
er.
> > > I will add that in a subsequent patch, because it can't directly reus=
e the
> > > new function, but rather needs to lex from a string instead of a cpp_=
buffer.
> > >
> > > With scan_cur_identifier(), we do also correctly warn about bidi and
> > > normalization issues in the extended identifiers comprising the suffi=
x.
> > >
> > > libcpp/ChangeLog:
> > >
> > >         PR preprocessor/103902
> > >         * lex.cc (identifier_diagnostics_on_lex): New function refact=
oring
> > >         some common code.
> > >         (lex_identifier_intern): Use the new function.
> > >         (lex_identifier): Don't run identifier diagnostics here, rath=
er let
> > >         the call site do it when needed.
> > >         (_cpp_lex_direct): Adjust the call sites of lex_identifier ()
> > >         accordingly.
> > >         (struct scan_id_result): New struct.
> > >         (scan_cur_identifier): New function.
> > >         (create_literal2): New function.
> > >         (lit_accum::create_literal2): New function.
> > >         (is_macro): Folded into new function...
> > >         (maybe_ignore_udl_macro_suffix): ...here.
> > >         (is_macro_not_literal_suffix): Folded likewise.
> > >         (lex_raw_string): Handle UTF-8 in UDL suffix via scan_cur_ide=
ntifier ().
> > >         (lex_string): Likewise.
> > >
> > > gcc/testsuite/ChangeLog:
> > >
> > >         PR preprocessor/103902
> > >         * g++.dg/cpp0x/udlit-extended-id-1.C: New test.
> > >         * g++.dg/cpp0x/udlit-extended-id-2.C: New test.
> > >         * g++.dg/cpp0x/udlit-extended-id-3.C: New test.
> > >         * g++.dg/cpp0x/udlit-extended-id-4.C: New test.
> > > ---
> > >
> > > Notes:
> > >     Hello-
> > >
> > >     This is the updated version of the patch, incorporating feedback =
from Jakub
> > >     and Jason, most recently discussed here:
> > >
> > >     https://gcc.gnu.org/pipermail/gcc-patches/2023-February/612073.ht=
ml
> > >
> > >     Please let me know how it looks. It is simpler than before with t=
he new
> > >     approach. Thanks!
> > >
> > >     One thing to note: as Jason clarified for me, in a usage like
> > >     this:
> > >
> > >     #pragma GCC poison _x
> > >     const char * operator "" _x (const char *, unsigned long);
> > >
> > >     the space between the "" and the _x is currently allowed but will=
 be
> > >     deprecated in C++23. GCC currently will complain about the poison=
ed use of
> > >     _x in this case, and this patch, which is just focused on handlin=
g UTF-8
> > >     properly, does not change this. But it seems that it would be cor=
rect
> > >     not to apply poison in this case. I can try to follow up with a p=
atch to do
> > >     so, if it seems worthwhile? Given the syntax is deprecated, maybe=
 it's not
> > >     worth it...
> > >
> > >     For the time being, this patch does add a testcase for the above =
and xfails
> > >     it. For the case where no space is present, which is the part tou=
ched by the
> > >     present patch, existing behavior is preserved correctly and no di=
agnostics
> > >     such as poison are issued for the UDL suffix. (Contrary to v1 of =
this
> > >     patch.)
> > >
> > >     Thanks! Bootstrapped and regtested all languages on x86-64 Linux
> > >     with no regressions.
> > >
> > >     -Lewis
> > >
> > >  .../g++.dg/cpp0x/udlit-extended-id-1.C        |  68 ++++
> > >  .../g++.dg/cpp0x/udlit-extended-id-2.C        |   6 +
> > >  .../g++.dg/cpp0x/udlit-extended-id-3.C        |  15 +
> > >  .../g++.dg/cpp0x/udlit-extended-id-4.C        |  14 +
> > >  libcpp/lex.cc                                 | 382 ++++++++++------=
--
> > >  5 files changed, 317 insertions(+), 168 deletions(-)
> > >  create mode 100644 gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-1.C
> > >  create mode 100644 gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-2.C
> > >  create mode 100644 gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-3.C
> > >  create mode 100644 gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-4.C
> > >
> > > diff --git a/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-1.C b/gcc/t=
estsuite/g++.dg/cpp0x/udlit-extended-id-1.C
> > > new file mode 100644
> > > index 00000000000..411d4fdd0ba
> > > --- /dev/null
> > > +++ b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-1.C
> > > @@ -0,0 +1,68 @@
> > > +// { dg-do run { target c++11 } }
> > > +// { dg-additional-options "-Wno-error=3Dnormalized" }
> > > +#include <cstring>
> > > +using namespace std;
> > > +
> > > +constexpr unsigned long long operator "" _=CF=80 (unsigned long long=
 x)
> > > +{
> > > +  return 3 * x;
> > > +}
> > > +
> > > +/* Historically we didn't parse properly as part of the "" token, so=
 check that
> > > +   as well.  */
> > > +constexpr unsigned long long operator ""_=CE=A02 (unsigned long long=
 x)
> > > +{
> > > +  return 4 * x;
> > > +}
> > > +
> > > +char x1[1_=CF=80];
> > > +char x2[2_=CE=A02];
> > > +
> > > +static_assert (sizeof x1 =3D=3D 3, "test1");
> > > +static_assert (sizeof x2 =3D=3D 8, "test2");
> > > +
> > > +const char * operator "" _1=CF=83 (const char *s, unsigned long)
> > > +{
> > > +  return s + 1;
> > > +}
> > > +
> > > +const char * operator ""_=CE=A32 (const char *s, unsigned long)
> > > +{
> > > +  return s + 2;
> > > +}
> > > +
> > > +const char * operator "" _\U000000e61 (const char *s, unsigned long)
> > > +{
> > > +  return "ae";
> > > +}
> > > +
> > > +const char* operator ""_\u01532 (const char *s, unsigned long)
> > > +{
> > > +  return "oe";
> > > +}
> > > +
> > > +bool operator "" _\u0BC7\u0BBE (unsigned long long); // { dg-warning=
 "not in NFC" }
> > > +bool operator ""_\u0B47\U00000B3E (unsigned long long); // { dg-warn=
ing "not in NFC" }
> > > +
> > > +#define x=CF=84y
> > > +const char * str =3D ""x=CF=84y; // { dg-warning "invalid suffix on =
literal" }
> > > +
> > > +int main()
> > > +{
> > > +  if (3_=CF=80 !=3D 9)
> > > +    __builtin_abort ();
> > > +  if (4_=CE=A02 !=3D 16)
> > > +    __builtin_abort ();
> > > +  if (strcmp ("abc"_1=CF=83, "bc"))
> > > +    __builtin_abort ();
> > > +  if (strcmp ("abcd"_=CE=A32, "cd"))
> > > +    __builtin_abort ();
> > > +  if (strcmp (R"(abcdef)"_1=CF=83, "bcdef"))
> > > +    __builtin_abort ();
> > > +  if (strcmp (R"(abcdef)"_=CE=A32, "cdef"))
> > > +    __builtin_abort ();
> > > +  if (strcmp ("xyz"_=C3=A61, "ae"))
> > > +    __builtin_abort ();
> > > +  if (strcmp ("xyz"_=C5=932, "oe"))
> > > +    __builtin_abort ();
> > > +}
> > > diff --git a/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-2.C b/gcc/t=
estsuite/g++.dg/cpp0x/udlit-extended-id-2.C
> > > new file mode 100644
> > > index 00000000000..05a2804a463
> > > --- /dev/null
> > > +++ b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-2.C
> > > @@ -0,0 +1,6 @@
> > > +// { dg-do compile { target c++11 } }
> > > +// { dg-additional-options "-Wbidi-chars=3Dany,ucn" }
> > > +bool operator ""_d\u202ae\u202cf (unsigned long long); // { dg-line =
line1 }
> > > +// { dg-error "universal character \\\\u202a is not valid in an iden=
tifier" "test1" { target *-*-* } line1 }
> > > +// { dg-error "universal character \\\\u202c is not valid in an iden=
tifier" "test2" { target *-*-* } line1 }
> > > +// { dg-warning "found problematic Unicode character" "test3" { targ=
et *-*-* } line1 }
> > > diff --git a/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-3.C b/gcc/t=
estsuite/g++.dg/cpp0x/udlit-extended-id-3.C
> > > new file mode 100644
> > > index 00000000000..11292e476e3
> > > --- /dev/null
> > > +++ b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-3.C
> > > @@ -0,0 +1,15 @@
> > > +// { dg-do compile { target c++11 } }
> > > +
> > > +// Check that we do not look for poisoned identifier when it is a su=
ffix.
> > > +int _=C4=A7;
> > > +#pragma GCC poison _=C4=A7
> > > +const char * operator ""_=C4=A7 (const char *, unsigned long); // { =
dg-bogus "poisoned" }
> > > +bool operator ""_=C4=A7 (unsigned long long x); // { dg-bogus "poiso=
ned" }
> > > +bool b =3D 1_=C4=A7; // { dg-bogus "poisoned" }
> > > +const char *x =3D "hbar"_=C4=A7; // { dg-bogus "poisoned" }
> > > +
> > > +/* Ideally, we should not warn here either, but this is not implemen=
ted yet.  This
> > > +   syntax has been deprecated for C++23.  */
> > > +#pragma GCC poison _=C4=A72
> > > +const char * operator "" _=C4=A72 (const char *, unsigned long); // =
{ dg-bogus "poisoned" "" { xfail *-*-*} }
> > > +const char *x2 =3D "hbar2"_=C4=A72; // { dg-bogus "poisoned" }
> > > diff --git a/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-4.C b/gcc/t=
estsuite/g++.dg/cpp0x/udlit-extended-id-4.C
> > > new file mode 100644
> > > index 00000000000..d1683c4d892
> > > --- /dev/null
> > > +++ b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-4.C
> > > @@ -0,0 +1,14 @@
> > > +// { dg-options "-std=3Dc++98 -Wc++11-compat" }
> > > +#define END ;
> > > +#define =CE=B5ND ;
> > > +#define E=CE=B7D ;
> > > +#define EN\u0394 ;
> > > +
> > > +const char *s1 =3D "s1"END // { dg-warning "requires a space between=
 string literal and macro" }
> > > +const char *s2 =3D "s2"=CE=B5ND // { dg-warning "requires a space be=
tween string literal and macro" }
> > > +const char *s3 =3D "s3"E=CE=B7D // { dg-warning "requires a space be=
tween string literal and macro" }
> > > +const char *s4 =3D "s4"EN=CE=94 // { dg-warning "requires a space be=
tween string literal and macro" }
> > > +
> > > +/* Make sure we did not skip the token also in the case that it wasn=
't found to
> > > +   be a macro; compilation should fail here.  */
> > > +const char *s5 =3D "s5"N=C3=98T_A_MACRO; // { dg-error "expected ','=
 or ';' before" }
> > > diff --git a/libcpp/lex.cc b/libcpp/lex.cc
> > > index 45ea16a91bc..062935e2371 100644
> > > --- a/libcpp/lex.cc
> > > +++ b/libcpp/lex.cc
> > > @@ -2057,8 +2057,11 @@ warn_about_normalization (cpp_reader *pfile,
> > >      }
> > >  }
> > >
> > > -/* Returns TRUE if the sequence starting at buffer->cur is valid in
> > > -   an identifier.  FIRST is TRUE if this starts an identifier.  */
> > > +/* Returns TRUE if the byte sequence starting at buffer->cur is a va=
lid
> > > +   extended character in an identifier.  If FIRST is TRUE, then the =
character
> > > +   must be valid at the beginning of an identifier as well.  If the =
return
> > > +   value is TRUE, then pfile->buffer->cur has been moved to point to=
 the next
> > > +   byte after the extended character.  */
> > >
> > >  static bool
> > >  forms_identifier_p (cpp_reader *pfile, int first,
> > > @@ -2154,6 +2157,47 @@ maybe_va_opt_error (cpp_reader *pfile)
> > >      }
> > >  }
> > >
> > > +/* Helper function to perform diagnostics that are needed (rarely)
> > > +   when an identifier is lexed.  */
> > > +static void
> > > +identifier_diagnostics_on_lex (cpp_reader *pfile, cpp_hashnode *node=
)
> > > +{
> > > +  if (__builtin_expect (!(node->flags & NODE_DIAGNOSTIC)
> > > +                       || pfile->state.skipping, 1))
> > > +    return;
> > > +
> > > +  /* It is allowed to poison the same identifier twice.  */
> > > +  if ((node->flags & NODE_POISONED) && !pfile->state.poisoned_ok)
> > > +    cpp_error (pfile, CPP_DL_ERROR, "attempt to use poisoned \"%s\""=
,
> > > +              NODE_NAME (node));
> > > +
> > > +  /* Constraint 6.10.3.5: __VA_ARGS__ should only appear in the
> > > +     replacement list of a variadic macro.  */
> > > +  if (node =3D=3D pfile->spec_nodes.n__VA_ARGS__
> > > +      && !pfile->state.va_args_ok)
> > > +    {
> > > +      if (CPP_OPTION (pfile, cplusplus))
> > > +       cpp_error (pfile, CPP_DL_PEDWARN,
> > > +                  "__VA_ARGS__ can only appear in the expansion"
> > > +                  " of a C++11 variadic macro");
> > > +      else
> > > +       cpp_error (pfile, CPP_DL_PEDWARN,
> > > +                  "__VA_ARGS__ can only appear in the expansion"
> > > +                  " of a C99 variadic macro");
> > > +    }
> > > +
> > > +  /* __VA_OPT__ should only appear in the replacement list of a
> > > +     variadic macro.  */
> > > +  if (node =3D=3D pfile->spec_nodes.n__VA_OPT__)
> > > +    maybe_va_opt_error (pfile);
> > > +
> > > +  /* For -Wc++-compat, warn about use of C++ named operators.  */
> > > +  if (node->flags & NODE_WARN_OPERATOR)
> > > +    cpp_warning (pfile, CPP_W_CXX_OPERATOR_NAMES,
> > > +                "identifier \"%s\" is a special operator name in C++=
",
> > > +                NODE_NAME (node));
> > > +}
> > > +
> > >  /* Helper function to get the cpp_hashnode of the identifier BASE.  =
*/
> > >  static cpp_hashnode *
> > >  lex_identifier_intern (cpp_reader *pfile, const uchar *base)
> > > @@ -2173,41 +2217,7 @@ lex_identifier_intern (cpp_reader *pfile, cons=
t uchar *base)
> > >    hash =3D HT_HASHFINISH (hash, len);
> > >    result =3D CPP_HASHNODE (ht_lookup_with_hash (pfile->hash_table,
> > >                                               base, len, hash, HT_ALL=
OC));
> > > -
> > > -  /* Rarely, identifiers require diagnostics when lexed.  */
> > > -  if (__builtin_expect ((result->flags & NODE_DIAGNOSTIC)
> > > -                       && !pfile->state.skipping, 0))
> > > -    {
> > > -      /* It is allowed to poison the same identifier twice.  */
> > > -      if ((result->flags & NODE_POISONED) && !pfile->state.poisoned_=
ok)
> > > -       cpp_error (pfile, CPP_DL_ERROR, "attempt to use poisoned \"%s=
\"",
> > > -                  NODE_NAME (result));
> > > -
> > > -      /* Constraint 6.10.3.5: __VA_ARGS__ should only appear in the
> > > -        replacement list of a variadic macro.  */
> > > -      if (result =3D=3D pfile->spec_nodes.n__VA_ARGS__
> > > -         && !pfile->state.va_args_ok)
> > > -       {
> > > -         if (CPP_OPTION (pfile, cplusplus))
> > > -           cpp_error (pfile, CPP_DL_PEDWARN,
> > > -                      "__VA_ARGS__ can only appear in the expansion"
> > > -                      " of a C++11 variadic macro");
> > > -         else
> > > -           cpp_error (pfile, CPP_DL_PEDWARN,
> > > -                      "__VA_ARGS__ can only appear in the expansion"
> > > -                      " of a C99 variadic macro");
> > > -       }
> > > -
> > > -      if (result =3D=3D pfile->spec_nodes.n__VA_OPT__)
> > > -       maybe_va_opt_error (pfile);
> > > -
> > > -      /* For -Wc++-compat, warn about use of C++ named operators.  *=
/
> > > -      if (result->flags & NODE_WARN_OPERATOR)
> > > -       cpp_warning (pfile, CPP_W_CXX_OPERATOR_NAMES,
> > > -                    "identifier \"%s\" is a special operator name in=
 C++",
> > > -                    NODE_NAME (result));
> > > -    }
> > > -
> > > +  identifier_diagnostics_on_lex (pfile, result);
> > >    return result;
> > >  }
> > >
> > > @@ -2221,7 +2231,9 @@ _cpp_lex_identifier (cpp_reader *pfile, const c=
har *name)
> > >    return result;
> > >  }
> > >
> > > -/* Lex an identifier starting at BUFFER->CUR - 1.  */
> > > +/* Lex an identifier starting at BASE.  BUFFER->CUR is expected to p=
oint
> > > +   one past the first character at BASE, which may be a (possibly mu=
lti-byte)
> > > +   character if STARTS_UCN is true.  */
> > >  static cpp_hashnode *
> > >  lex_identifier (cpp_reader *pfile, const uchar *base, bool starts_uc=
n,
> > >                 struct normalize_state *nst, cpp_hashnode **spelling)
> > > @@ -2270,42 +2282,51 @@ lex_identifier (cpp_reader *pfile, const ucha=
r *base, bool starts_ucn,
> > >        *spelling =3D result;
> > >      }
> > >
> > > -  /* Rarely, identifiers require diagnostics when lexed.  */
> > > -  if (__builtin_expect ((result->flags & NODE_DIAGNOSTIC)
> > > -                       && !pfile->state.skipping, 0))
> > > -    {
> > > -      /* It is allowed to poison the same identifier twice.  */
> > > -      if ((result->flags & NODE_POISONED) && !pfile->state.poisoned_=
ok)
> > > -       cpp_error (pfile, CPP_DL_ERROR, "attempt to use poisoned \"%s=
\"",
> > > -                  NODE_NAME (result));
> > > -
> > > -      /* Constraint 6.10.3.5: __VA_ARGS__ should only appear in the
> > > -        replacement list of a variadic macro.  */
> > > -      if (result =3D=3D pfile->spec_nodes.n__VA_ARGS__
> > > -         && !pfile->state.va_args_ok)
> > > -       {
> > > -         if (CPP_OPTION (pfile, cplusplus))
> > > -           cpp_error (pfile, CPP_DL_PEDWARN,
> > > -                      "__VA_ARGS__ can only appear in the expansion"
> > > -                      " of a C++11 variadic macro");
> > > -         else
> > > -           cpp_error (pfile, CPP_DL_PEDWARN,
> > > -                      "__VA_ARGS__ can only appear in the expansion"
> > > -                      " of a C99 variadic macro");
> > > -       }
> > > +  return result;
> > > +}
> > >
> > > -      /* __VA_OPT__ should only appear in the replacement list of a
> > > -        variadic macro.  */
> > > -      if (result =3D=3D pfile->spec_nodes.n__VA_OPT__)
> > > -       maybe_va_opt_error (pfile);
> > > -
> > > -      /* For -Wc++-compat, warn about use of C++ named operators.  *=
/
> > > -      if (result->flags & NODE_WARN_OPERATOR)
> > > -       cpp_warning (pfile, CPP_W_CXX_OPERATOR_NAMES,
> > > -                    "identifier \"%s\" is a special operator name in=
 C++",
> > > -                    NODE_NAME (result));
> > > -    }
> > > +/* Struct to hold the return value of the scan_cur_identifier () hel=
per
> > > +   function below.  */
> > >
> > > +struct scan_id_result
> > > +{
> > > +  cpp_hashnode *node;
> > > +  normalize_state nst;
> > > +
> > > +  scan_id_result ()
> > > +    : node (nullptr)
> > > +  {
> > > +    nst =3D INITIAL_NORMALIZE_STATE;
> > > +  }
> > > +
> > > +  explicit operator bool () const { return node; }
> > > +};
> > > +
> > > +/* Helper function to scan an entire identifier beginning at
> > > +   pfile->buffer->cur, and possibly containing extended characters (=
UCNs
> > > +   and/or UTF-8).  Returns the cpp_hashnode for the identifier on su=
ccess, or
> > > +   else nullptr, as well as a normalize_state so that normalization =
warnings
> > > +   may be issued once the token lexing is complete.  */
> > > +
> > > +static scan_id_result
> > > +scan_cur_identifier (cpp_reader *pfile)
> > > +{
> > > +  const auto buffer =3D pfile->buffer;
> > > +  const auto begin =3D buffer->cur;
> > > +  scan_id_result result;
> > > +  if (ISIDST (*buffer->cur))
> > > +    {
> > > +      ++buffer->cur;
> > > +      cpp_hashnode *ignore;
> > > +      result.node =3D lex_identifier (pfile, begin, false, &result.n=
st, &ignore);
> > > +    }
> > > +  else if (forms_identifier_p (pfile, true, &result.nst))
> > > +    {
> > > +      /* buffer->cur has been moved already by the call
> > > +        to forms_identifier_p.  */
> > > +      cpp_hashnode *ignore;
> > > +      result.node =3D lex_identifier (pfile, begin, true, &result.ns=
t, &ignore);
> > > +    }
> > >    return result;
> > >  }
> > >
> > > @@ -2365,6 +2386,24 @@ create_literal (cpp_reader *pfile, cpp_token *=
token, const uchar *base,
> > >    token->val.str.text =3D cpp_alloc_token_string (pfile, base, len);
> > >  }
> > >
> > > +/* Like create_literal(), but construct it from two separate strings
> > > +   which are concatenated.  LEN2 may be 0 if no second string is
> > > +   required.  */
> > > +static void
> > > +create_literal2 (cpp_reader *pfile, cpp_token *token, const uchar *b=
ase1,
> > > +                unsigned int len1, const uchar *base2, unsigned int =
len2,
> > > +                enum cpp_ttype type)
> > > +{
> > > +  token->type =3D type;
> > > +  token->val.str.len =3D len1 + len2;
> > > +  uchar *const dest =3D _cpp_unaligned_alloc (pfile, len1 + len2 + 1=
);
> > > +  memcpy (dest, base1, len1);
> > > +  if (len2)
> > > +    memcpy (dest+len1, base2, len2);
> > > +  dest[len1 + len2] =3D 0;
> > > +  token->val.str.text =3D dest;
> > > +}
> > > +
> > >  const uchar *
> > >  cpp_alloc_token_string (cpp_reader *pfile,
> > >                         const unsigned char *ptr, unsigned len)
> > > @@ -2403,6 +2442,11 @@ struct lit_accum {
> > >        rpos =3D NULL;
> > >      return c;
> > >    }
> > > +
> > > +  void create_literal2 (cpp_reader *pfile, cpp_token *token,
> > > +                       const uchar *base1, unsigned int len1,
> > > +                       const uchar *base2, unsigned int len2,
> > > +                       enum cpp_ttype type);
> > >  };
> > >
> > >  /* Subroutine of lex_raw_string: Append LEN chars from BASE to the b=
uffer
> > > @@ -2445,45 +2489,57 @@ lit_accum::read_begin (cpp_reader *pfile)
> > >    rpos =3D BUFF_FRONT (last);
> > >  }
> > >
> > > -/* Returns true if a macro has been defined.
> > > -   This might not work if compile with -save-temps,
> > > -   or preprocess separately from compilation.  */
> > > +/* Helper function to check if a string format macro, say from intty=
pes.h, is
> > > +   placed touching a string literal, in which case it could be parse=
d as a C++11
> > > +   user-defined string literal thus breaking the program.  User-defi=
ned literals
> > > +   outside of namespace std must start with a single underscore, so =
assume
> > > +   anything of that form really is a UDL suffix.  We don't need to w=
orry about
> > > +   UDLs defined inside namespace std because their names are reserve=
d, so cannot
> > > +   be used as macro names in valid programs.  Return TRUE if the UDL=
 should be
> > > +   ignored for now and preserved for potential macro expansion.  */
> > >
> > >  static bool
> > > -is_macro(cpp_reader *pfile, const uchar *base)
> > > +maybe_ignore_udl_macro_suffix (cpp_reader *pfile, location_t src_loc=
,
> > > +                              const uchar *suffix_begin, cpp_hashnod=
e *node)
> > >  {
> > > -  const uchar *cur =3D base;
> > > -  if (! ISIDST (*cur))
> > > +  if ((suffix_begin[0] =3D=3D '_' && suffix_begin[1] !=3D '_')
> > > +      || !cpp_macro_p (node))
> > >      return false;
> > > -  unsigned int hash =3D HT_HASHSTEP (0, *cur);
> > > -  ++cur;
> > > -  while (ISIDNUM (*cur))
> > > -    {
> > > -      hash =3D HT_HASHSTEP (hash, *cur);
> > > -      ++cur;
> > > -    }
> > > -  hash =3D HT_HASHFINISH (hash, cur - base);
> > >
> > > -  cpp_hashnode *result =3D CPP_HASHNODE (ht_lookup_with_hash (pfile-=
>hash_table,
> > > -                                       base, cur - base, hash, HT_NO=
_INSERT));
> > > -
> > > -  return result && cpp_macro_p (result);
> > > +  /* Maybe raise a warning here; caller should arrange not to consum=
e
> > > +     the tokens.  */
> > > +  if (CPP_OPTION (pfile, warn_literal_suffix) && !pfile->state.skipp=
ing)
> > > +    cpp_warning_with_line (pfile, CPP_W_LITERAL_SUFFIX, src_loc, 0,
> > > +                          "invalid suffix on literal; C++11 requires=
 a space "
> > > +                          "between literal and string macro");
> > > +  return true;
> > >  }
> > >
> > > -/* Returns true if a literal suffix does not have the expected form
> > > -   and is defined as a macro.  */
> > > -
> > > -static bool
> > > -is_macro_not_literal_suffix(cpp_reader *pfile, const uchar *base)
> > > +/* Like create_literal2(), but also prepend all the accumulated data=
 from
> > > +   the lit_accum struct.  */
> > > +void
> > > +lit_accum::create_literal2 (cpp_reader *pfile, cpp_token *token,
> > > +                           const uchar *base1, unsigned int len1,
> > > +                           const uchar *base2, unsigned int len2,
> > > +                           enum cpp_ttype type)
> > >  {
> > > -  /* User-defined literals outside of namespace std must start with =
a single
> > > -     underscore, so assume anything of that form really is a UDL suf=
fix.
> > > -     We don't need to worry about UDLs defined inside namespace std =
because
> > > -     their names are reserved, so cannot be used as macro names in v=
alid
> > > -     programs.  */
> > > -  if (base[0] =3D=3D '_' && base[1] !=3D '_')
> > > -    return false;
> > > -  return is_macro (pfile, base);
> > > +  const unsigned int tot_len =3D accum + len1 + len2;
> > > +  uchar *dest =3D _cpp_unaligned_alloc (pfile, tot_len + 1);
> > > +  token->type =3D type;
> > > +  token->val.str.len =3D tot_len;
> > > +  token->val.str.text =3D dest;
> > > +  for (_cpp_buff *buf =3D first; buf; buf =3D buf->next)
> > > +    {
> > > +      size_t len =3D BUFF_FRONT (buf) - buf->base;
> > > +      memcpy (dest, buf->base, len);
> > > +      dest +=3D len;
> > > +    }
> > > +  memcpy (dest, base1, len1);
> > > +  dest +=3D len1;
> > > +  if (len2)
> > > +    memcpy (dest, base2, len2);
> > > +  dest +=3D len2;
> > > +  *dest =3D '\0';
> > >  }
> > >
> > >  /* Lexes a raw string.  The stored string contains the spelling,
> > > @@ -2758,26 +2814,25 @@ lex_raw_string (cpp_reader *pfile, cpp_token =
*token, const uchar *base)
> > >
> > >    if (CPP_OPTION (pfile, user_literals))
> > >      {
> > > -      /* If a string format macro, say from inttypes.h, is placed to=
uching
> > > -        a string literal it could be parsed as a C++11 user-defined =
string
> > > -        literal thus breaking the program.  */
> > > -      if (is_macro_not_literal_suffix (pfile, pos))
> > > -       {
> > > -         /* Raise a warning, but do not consume subsequent tokens.  =
*/
> > > -         if (CPP_OPTION (pfile, warn_literal_suffix) && !pfile->stat=
e.skipping)
> > > -           cpp_warning_with_line (pfile, CPP_W_LITERAL_SUFFIX,
> > > -                                  token->src_loc, 0,
> > > -                                  "invalid suffix on literal; C++11 =
requires "
> > > -                                  "a space between literal and strin=
g macro");
> > > -       }
> > > -      /* Grab user defined literal suffix.  */
> > > -      else if (ISIDST (*pos))
> > > -       {
> > > -         type =3D cpp_userdef_string_add_type (type);
> > > -         ++pos;
> > > +      const uchar *const suffix_begin =3D pos;
> > > +      pfile->buffer->cur =3D pos;
> > >
> > > -         while (ISIDNUM (*pos))
> > > -           ++pos;
> > > +      if (const auto sr =3D scan_cur_identifier (pfile))
> > > +       {
> > > +         if (maybe_ignore_udl_macro_suffix (pfile, token->src_loc,
> > > +                                            suffix_begin, sr.node))
> > > +             pfile->buffer->cur =3D suffix_begin;
> > > +         else
> > > +           {
> > > +             type =3D cpp_userdef_string_add_type (type);
> > > +             accum.create_literal2 (pfile, token, base, suffix_begin=
 - base,
> > > +                                    NODE_NAME (sr.node), NODE_LEN (s=
r.node),
> > > +                                    type);
> > > +             if (accum.first)
> > > +               _cpp_release_buff (pfile, accum.first);
> > > +             warn_about_normalization (pfile, token, &sr.nst, true);
> > > +             return;
> > > +           }
> > >         }
> > >      }
> > >
> > > @@ -2787,21 +2842,8 @@ lex_raw_string (cpp_reader *pfile, cpp_token *token, const uchar *base)
> > >      create_literal (pfile, token, base, pos - base, type);
> > >    else
> > >      {
> > > -      size_t extra_len = pos - base;
> > > -      uchar *dest = _cpp_unaligned_alloc (pfile, accum.accum + extra_len + 1);
> > > -
> > > -      token->type = type;
> > > -      token->val.str.len = accum.accum + extra_len;
> > > -      token->val.str.text = dest;
> > > -      for (_cpp_buff *buf = accum.first; buf; buf = buf->next)
> > > -       {
> > > -         size_t len = BUFF_FRONT (buf) - buf->base;
> > > -         memcpy (dest, buf->base, len);
> > > -         dest += len;
> > > -       }
> > > +      accum.create_literal2 (pfile, token, base, pos - base, nullptr, 0, type);
> > >        _cpp_release_buff (pfile, accum.first);
> > > -      memcpy (dest, base, extra_len);
> > > -      dest[extra_len] = '\0';
> > >      }
> > >  }
> > >
> > > @@ -2908,39 +2950,40 @@ lex_string (cpp_reader *pfile, cpp_token *token, const uchar *base)
> > >      cpp_error (pfile, CPP_DL_PEDWARN, "missing terminating %c character",
> > >                (int) terminator);
> > >
> > > +  pfile->buffer->cur = cur;
> > > +  const uchar *const suffix_begin = cur;
> > > +
> > >    if (CPP_OPTION (pfile, user_literals))
> > >      {
> > > -      /* If a string format macro, say from inttypes.h, is placed touching
> > > -        a string literal it could be parsed as a C++11 user-defined string
> > > -        literal thus breaking the program.  */
> > > -      if (is_macro_not_literal_suffix (pfile, cur))
> > > -       {
> > > -         /* Raise a warning, but do not consume subsequent tokens.  */
> > > -         if (CPP_OPTION (pfile, warn_literal_suffix) && !pfile->state.skipping)
> > > -           cpp_warning_with_line (pfile, CPP_W_LITERAL_SUFFIX,
> > > -                                  token->src_loc, 0,
> > > -                                  "invalid suffix on literal; C++11 requires "
> > > -                                  "a space between literal and string macro");
> > > -       }
> > > -      /* Grab user defined literal suffix.  */
> > > -      else if (ISIDST (*cur))
> > > +      if (const auto sr = scan_cur_identifier (pfile))
> > >         {
> > > -         type = cpp_userdef_char_add_type (type);
> > > -         type = cpp_userdef_string_add_type (type);
> > > -          ++cur;
> > > -
> > > -         while (ISIDNUM (*cur))
> > > -           ++cur;
> > > +         if (maybe_ignore_udl_macro_suffix (pfile, token->src_loc,
> > > +                                            suffix_begin, sr.node))
> > > +           pfile->buffer->cur = suffix_begin;
> > > +         else
> > > +           {
> > > +             /* Grab user defined literal suffix.  */
> > > +             type = cpp_userdef_char_add_type (type);
> > > +             type = cpp_userdef_string_add_type (type);
> > > +             create_literal2 (pfile, token, base, suffix_begin - base,
> > > +                              NODE_NAME (sr.node), NODE_LEN (sr.node), type);
> > > +             warn_about_normalization (pfile, token, &sr.nst, true);
> > > +             return;
> > > +           }
> > >         }
> > >      }
> > >    else if (CPP_OPTION (pfile, cpp_warn_cxx11_compat)
> > > -          && is_macro (pfile, cur)
> > >            && !pfile->state.skipping)
> > > -    cpp_warning_with_line (pfile, CPP_W_CXX11_COMPAT,
> > > -                          token->src_loc, 0, "C++11 requires a space "
> > > -                          "between string literal and macro");
> > > +    {
> > > +      const auto sr = scan_cur_identifier (pfile);
> > > +      /* Maybe raise a warning, but do not consume the tokens.  */
> > > +      pfile->buffer->cur = suffix_begin;
> > > +      if (sr && cpp_macro_p (sr.node))
> > > +       cpp_warning_with_line (pfile, CPP_W_CXX11_COMPAT,
> > > +                              token->src_loc, 0, "C++11 requires a space "
> > > +                              "between string literal and macro");
> > > +    }
> > >
> > > -  pfile->buffer->cur = cur;
> > >    create_literal (pfile, token, base, cur - base, type);
> > >  }
> > >
> > > @@ -3915,9 +3958,10 @@ _cpp_lex_direct (cpp_reader *pfile)
> > >        result->type = CPP_NAME;
> > >        {
> > >         struct normalize_state nst = INITIAL_NORMALIZE_STATE;
> > > -       result->val.node.node = lex_identifier (pfile, buffer->cur - 1, false,
> > > -                                               &nst,
> > > -                                               &result->val.node.spelling);
> > > +       const auto node = lex_identifier (pfile, buffer->cur - 1, false, &nst,
> > > +                                         &result->val.node.spelling);
> > > +       result->val.node.node = node;
> > > +       identifier_diagnostics_on_lex (pfile, node);
> > >         warn_about_normalization (pfile, result, &nst, true);
> > >        }
> > >
> > > @@ -4220,8 +4264,10 @@ _cpp_lex_direct (cpp_reader *pfile)
> > >         if (forms_identifier_p (pfile, true, &nst))
> > >           {
> > >             result->type = CPP_NAME;
> > > -           result->val.node.node = lex_identifier (pfile, base, true, &nst,
> > > -                                                   &result->val.node.spelling);
> > > +           const auto node = lex_identifier (pfile, base, true, &nst,
> > > +                                             &result->val.node.spelling);
> > > +           result->val.node.node = node;
> > > +           identifier_diagnostics_on_lex (pfile, node);
> > >             warn_about_normalization (pfile, result, &nst, true);
> > >             break;
> > >           }
> > > @@ -4353,7 +4399,7 @@ cpp_digraph2name (enum cpp_ttype type)
> > >  }
> > >
> > >  /* Write the spelling of an identifier IDENT, using UCNs, to BUFFER.
> > > -   The buffer must already contain the enough space to hold the
> > > +   The buffer must already contain enough space to hold the
> > >     token's spelling.  Returns a pointer to the character after the
> > >     last character written.  */
> > >  unsigned char *
> > > @@ -4375,7 +4421,7 @@ _cpp_spell_ident_ucns (unsigned char *buffer, cpp_hashnode *ident)
> > >  }
> > >
> > >  /* Write the spelling of a token TOKEN to BUFFER.  The buffer must
> > > -   already contain the enough space to hold the token's spelling.
> > > +   already contain enough space to hold the token's spelling.
> > >     Returns a pointer to the character after the last character written.
> > >     FORSTRING is true if this is to be the spelling after translation
> > >     phase 1 (with the original spelling of extended identifiers), false