From: Lewis Hyatt
Date: Tue, 11 Jul 2023 19:30:36 -0400
Subject: Ping: [PATCH v2] libcpp: Handle extended characters in user-defined literal suffix [PR103902]
To: gcc-patches List
Cc: Jason Merrill, Jakub Jelinek

May I please ping this patch again? I think it would be worthwhile to close
this gap in the support for UTF-8 sources. Thanks!

https://gcc.gnu.org/pipermail/gcc-patches/2023-March/613247.html

-Lewis

On Fri, Jun 2, 2023 at 9:45 AM Lewis Hyatt wrote:
>
> Hello-
>
> Ping please? Thanks.
> https://gcc.gnu.org/pipermail/gcc-patches/2023-March/613247.html
>
> -Lewis
>
> On Tue, May 2, 2023 at 9:27 AM Lewis Hyatt wrote:
> >
> > May I please ping this one? Thanks...
> > https://gcc.gnu.org/pipermail/gcc-patches/2023-March/613247.html
> >
> > On Thu, Mar 2, 2023 at 6:21 PM Lewis Hyatt wrote:
> > >
> > > The PR complains that we do not handle UTF-8 in the suffix for a user-defined
> > > literal, such as:
> > >
> > >     bool operator ""_π (unsigned long long);
> > >
> > > In fact we don't handle any extended identifier characters there, whether
> > > UTF-8, UCNs, or the $ sign. We do handle it fine if the optional space after
> > > the "" tokens is included, since then the identifier is lexed in the "normal"
> > > way as its own token. But when it is lexed as part of the string token, this
> > > is handled in lex_string() with a one-off loop that is not aware of extended
> > > characters.
> > >
> > > This patch fixes it by adding a new function scan_cur_identifier() that can be
> > > used to lex an identifier while in the middle of lexing another token.
> > >
> > > BTW, the other place that has been mis-lexing identifiers is
> > > lex_identifier_intern(), which is used to implement #pragma push_macro
> > > and #pragma pop_macro. This does not support extended characters either.
> > > I will add that in a subsequent patch, because it can't directly reuse the
> > > new function, but rather needs to lex from a string instead of a cpp_buffer.
> > >
> > > With scan_cur_identifier(), we do also correctly warn about bidi and
> > > normalization issues in the extended identifiers comprising the suffix.
> > >
> > > libcpp/ChangeLog:
> > >
> > >         PR preprocessor/103902
> > >         * lex.cc (identifier_diagnostics_on_lex): New function refactoring
> > >         some common code.
> > >         (lex_identifier_intern): Use the new function.
> > >         (lex_identifier): Don't run identifier diagnostics here, rather let
> > >         the call site do it when needed.
> > >         (_cpp_lex_direct): Adjust the call sites of lex_identifier ()
> > >         accordingly.
> > >         (struct scan_id_result): New struct.
> > >         (scan_cur_identifier): New function.
> > >         (create_literal2): New function.
> > >         (lit_accum::create_literal2): New function.
> > >         (is_macro): Folded into new function...
> > >         (maybe_ignore_udl_macro_suffix): ...here.
> > >         (is_macro_not_literal_suffix): Folded likewise.
> > >         (lex_raw_string): Handle UTF-8 in UDL suffix via scan_cur_identifier ().
> > >         (lex_string): Likewise.
> > >
> > > gcc/testsuite/ChangeLog:
> > >
> > >         PR preprocessor/103902
> > >         * g++.dg/cpp0x/udlit-extended-id-1.C: New test.
> > >         * g++.dg/cpp0x/udlit-extended-id-2.C: New test.
> > >         * g++.dg/cpp0x/udlit-extended-id-3.C: New test.
> > >         * g++.dg/cpp0x/udlit-extended-id-4.C: New test.
> > > ---
> > >
> > > Notes:
> > >     Hello-
> > >
> > >     This is the updated version of the patch, incorporating feedback from Jakub
> > >     and Jason, most recently discussed here:
> > >
> > >     https://gcc.gnu.org/pipermail/gcc-patches/2023-February/612073.html
> > >
> > >     Please let me know how it looks? It is simpler than before with the new
> > >     approach. Thanks!
> > >
> > >     One thing to note. As Jason clarified for me, a usage like this:
> > >
> > >      #pragma GCC poison _x
> > >      const char * operator "" _x (const char *, unsigned long);
> > >
> > >     The space between the "" and the _x is currently allowed but will be
> > >     deprecated in C++23. GCC currently will complain about the poisoned use of
> > >     _x in this case, and this patch, which is just focused on handling UTF-8
> > >     properly, does not change this. But it seems that it would be correct
> > >     not to apply poison in this case. I can try to follow up with a patch to do
> > >     so, if it seems worthwhile? Given the syntax is deprecated, maybe it's not
> > >     worth it...
> > >
> > >     For the time being, this patch does add a testcase for the above and xfails
> > >     it. For the case where no space is present, which is the part touched by the
> > >     present patch, existing behavior is preserved correctly and no diagnostics
> > >     such as poison are issued for the UDL suffix. (Contrary to v1 of this
> > >     patch.)
> > >
> > >     Thanks! bootstrap + regtested all languages on x86-64 Linux with
> > >     no regressions.
> > >
> > >     -Lewis
> > >
> > >  .../g++.dg/cpp0x/udlit-extended-id-1.C        |  68 ++++
> > >  .../g++.dg/cpp0x/udlit-extended-id-2.C        |   6 +
> > >  .../g++.dg/cpp0x/udlit-extended-id-3.C        |  15 +
> > >  .../g++.dg/cpp0x/udlit-extended-id-4.C        |  14 +
> > >  libcpp/lex.cc                                 | 382 ++++++++++--------
> > >  5 files changed, 317 insertions(+), 168 deletions(-)
> > >  create mode 100644 gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-1.C
> > >  create mode 100644 gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-2.C
> > >  create mode 100644 gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-3.C
> > >  create mode 100644 gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-4.C
> > >
> > > diff --git a/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-1.C b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-1.C
> > > new file mode 100644
> > > index 00000000000..411d4fdd0ba
> > > --- /dev/null
> > > +++ b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-1.C
> > > @@ -0,0 +1,68 @@
> > > +// { dg-do run { target c++11 } }
> > > +// { dg-additional-options "-Wno-error=normalized" }
> > > +#include
> > > +using namespace std;
> > > +
> > > +constexpr unsigned long long operator "" _π (unsigned long long x)
> > > +{
> > > +  return 3 * x;
> > > +}
> > > +
> > > +/* Historically we didn't parse properly as part of the "" token, so check that
> > > +   as well.  */
> > > +constexpr unsigned long long operator ""_Π2 (unsigned long long x)
> > > +{
> > > +  return 4 * x;
> > > +}
> > > +
> > > +char x1[1_π];
> > > +char x2[2_Π2];
> > > +
> > > +static_assert (sizeof x1 == 3, "test1");
> > > +static_assert (sizeof x2 == 8, "test2");
> > > +
> > > +const char * operator "" _1σ (const char *s, unsigned long)
> > > +{
> > > +  return s + 1;
> > > +}
> > > +
> > > +const char * operator ""_Σ2 (const char *s, unsigned long)
> > > +{
> > > +  return s + 2;
> > > +}
> > > +
> > > +const char * operator "" _\U000000e61 (const char *s, unsigned long)
> > > +{
> > > +  return "ae";
> > > +}
> > > +
> > > +const char* operator ""_\u01532 (const char *s, unsigned long)
> > > +{
> > > +  return "oe";
> > > +}
> > > +
> > > +bool operator "" _\u0BC7\u0BBE (unsigned long long); // { dg-warning "not in NFC" }
> > > +bool operator ""_\u0B47\U00000B3E (unsigned long long); // { dg-warning "not in NFC" }
> > > +
> > > +#define xτy
> > > +const char * str = ""xτy; // { dg-warning "invalid suffix on literal" }
> > > +
> > > +int main()
> > > +{
> > > +  if (3_π != 9)
> > > +    __builtin_abort ();
> > > +  if (4_Π2 != 16)
> > > +    __builtin_abort ();
> > > +  if (strcmp ("abc"_1σ, "bc"))
> > > +    __builtin_abort ();
> > > +  if (strcmp ("abcd"_Σ2, "cd"))
> > > +    __builtin_abort ();
> > > +  if (strcmp (R"(abcdef)"_1σ, "bcdef"))
> > > +    __builtin_abort ();
> > > +  if (strcmp (R"(abcdef)"_Σ2, "cdef"))
> > > +    __builtin_abort ();
> > > +  if (strcmp ("xyz"_æ1, "ae"))
> > > +    __builtin_abort ();
> > > +  if (strcmp ("xyz"_œ2, "oe"))
> > > +    __builtin_abort ();
> > > +}
> > > diff --git a/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-2.C b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-2.C
> > > new file mode 100644
> > > index 00000000000..05a2804a463
> > > --- /dev/null
> > > +++ b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-2.C
> > > @@ -0,0 +1,6 @@
> > > +// { dg-do compile { target c++11 } }
> > > +// { dg-additional-options "-Wbidi-chars=any,ucn" }
> > > +bool operator ""_d\u202ae\u202cf (unsigned long long); // { dg-line line1 }
> > > +// { dg-error "universal character \\\\u202a is not valid in an identifier" "test1" { target *-*-* } line1 }
> > > +// { dg-error "universal character \\\\u202c is not valid in an identifier" "test2" { target *-*-* } line1 }
> > > +// { dg-warning "found problematic Unicode character" "test3" { target *-*-* } line1 }
> > > diff --git a/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-3.C b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-3.C
> > > new file mode 100644
> > > index 00000000000..11292e476e3
> > > --- /dev/null
> > > +++ b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-3.C
> > > @@ -0,0 +1,15 @@
> > > +// { dg-do compile { target c++11 } }
> > > +
> > > +// Check that we do not look for poisoned identifier when it is a suffix.
> > > +int _ħ;
> > > +#pragma GCC poison _ħ
> > > +const char * operator ""_ħ (const char *, unsigned long); // { dg-bogus "poisoned" }
> > > +bool operator ""_ħ (unsigned long long x); // { dg-bogus "poisoned" }
> > > +bool b = 1_ħ; // { dg-bogus "poisoned" }
> > > +const char *x = "hbar"_ħ; // { dg-bogus "poisoned" }
> > > +
> > > +/* Ideally, we should not warn here either, but this is not implemented yet.  This
> > > +   syntax has been deprecated for C++23.  */
> > > +#pragma GCC poison _ħ2
> > > +const char * operator "" _ħ2 (const char *, unsigned long); // { dg-bogus "poisoned" "" { xfail *-*-*} }
> > > +const char *x2 = "hbar2"_ħ2; // { dg-bogus "poisoned" }
> > > diff --git a/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-4.C b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-4.C
> > > new file mode 100644
> > > index 00000000000..d1683c4d892
> > > --- /dev/null
> > > +++ b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-4.C
> > > @@ -0,0 +1,14 @@
> > > +// { dg-options "-std=c++98 -Wc++11-compat" }
> > > +#define END ;
> > > +#define εND ;
> > > +#define EηD ;
> > > +#define EN\u0394 ;
> > > +
> > > +const char *s1 = "s1"END // { dg-warning "requires a space between string literal and macro" }
> > > +const char *s2 = "s2"εND // { dg-warning "requires a space between string literal and macro" }
> > > +const char *s3 = "s3"EηD // { dg-warning "requires a space between string literal and macro" }
> > > +const char *s4 = "s4"ENΔ // { dg-warning "requires a space between string literal and macro" }
> > > +
> > > +/* Make sure we did not skip the token also in the case that it wasn't found to
> > > +   be a macro; compilation should fail here.  */
> > > +const char *s5 = "s5"NØT_A_MACRO; // { dg-error "expected ',' or ';' before" }
> > > diff --git a/libcpp/lex.cc b/libcpp/lex.cc
> > > index 45ea16a91bc..062935e2371 100644
> > > --- a/libcpp/lex.cc
> > > +++ b/libcpp/lex.cc
> > > @@ -2057,8 +2057,11 @@ warn_about_normalization (cpp_reader *pfile,
> > >      }
> > >  }
> > >
> > > -/* Returns TRUE if the sequence starting at buffer->cur is valid in
> > > -   an identifier.  FIRST is TRUE if this starts an identifier.  */
> > > +/* Returns TRUE if the byte sequence starting at buffer->cur is a valid
> > > +   extended character in an identifier.  If FIRST is TRUE, then the character
> > > +   must be valid at the beginning of an identifier as well.  If the return
> > > +   value is TRUE, then pfile->buffer->cur has been moved to point to the next
> > > +   byte after the extended character.  */
> > >
> > >  static bool
> > >  forms_identifier_p (cpp_reader *pfile, int first,
> > > @@ -2154,6 +2157,47 @@ maybe_va_opt_error (cpp_reader *pfile)
> > >      }
> > >  }
> > >
> > > +/* Helper function to perform diagnostics that are needed (rarely)
> > > +   when an identifier is lexed.  */
> > > +static void
> > > +identifier_diagnostics_on_lex (cpp_reader *pfile, cpp_hashnode *node)
> > > +{
> > > +  if (__builtin_expect (!(node->flags & NODE_DIAGNOSTIC)
> > > +                        || pfile->state.skipping, 1))
> > > +    return;
> > > +
> > > +  /* It is allowed to poison the same identifier twice.  */
> > > +  if ((node->flags & NODE_POISONED) && !pfile->state.poisoned_ok)
> > > +    cpp_error (pfile, CPP_DL_ERROR, "attempt to use poisoned \"%s\"",
> > > +               NODE_NAME (node));
> > > +
> > > +  /* Constraint 6.10.3.5: __VA_ARGS__ should only appear in the
> > > +     replacement list of a variadic macro.  */
> > > +  if (node == pfile->spec_nodes.n__VA_ARGS__
> > > +      && !pfile->state.va_args_ok)
> > > +    {
> > > +      if (CPP_OPTION (pfile, cplusplus))
> > > +        cpp_error (pfile, CPP_DL_PEDWARN,
> > > +                   "__VA_ARGS__ can only appear in the expansion"
> > > +                   " of a C++11 variadic macro");
> > > +      else
> > > +        cpp_error (pfile, CPP_DL_PEDWARN,
> > > +                   "__VA_ARGS__ can only appear in the expansion"
> > > +                   " of a C99 variadic macro");
> > > +    }
> > > +
> > > +  /* __VA_OPT__ should only appear in the replacement list of a
> > > +     variadic macro.  */
> > > +  if (node == pfile->spec_nodes.n__VA_OPT__)
> > > +    maybe_va_opt_error (pfile);
> > > +
> > > +  /* For -Wc++-compat, warn about use of C++ named operators.  */
> > > +  if (node->flags & NODE_WARN_OPERATOR)
> > > +    cpp_warning (pfile, CPP_W_CXX_OPERATOR_NAMES,
> > > +                 "identifier \"%s\" is a special operator name in C++",
> > > +                 NODE_NAME (node));
> > > +}
> > > +
> > >  /* Helper function to get the cpp_hashnode of the identifier BASE.  */
> > >  static cpp_hashnode *
> > >  lex_identifier_intern (cpp_reader *pfile, const uchar *base)
> > > @@ -2173,41 +2217,7 @@ lex_identifier_intern (cpp_reader *pfile, const uchar *base)
> > >    hash = HT_HASHFINISH (hash, len);
> > >    result = CPP_HASHNODE (ht_lookup_with_hash (pfile->hash_table,
> > >                                                base, len, hash, HT_ALLOC));
> > > -
> > > -  /* Rarely, identifiers require diagnostics when lexed.  */
> > > -  if (__builtin_expect ((result->flags & NODE_DIAGNOSTIC)
> > > -                        && !pfile->state.skipping, 0))
> > > -    {
> > > -      /* It is allowed to poison the same identifier twice.  */
> > > -      if ((result->flags & NODE_POISONED) && !pfile->state.poisoned_ok)
> > > -        cpp_error (pfile, CPP_DL_ERROR, "attempt to use poisoned \"%s\"",
> > > -                   NODE_NAME (result));
> > > -
> > > -      /* Constraint 6.10.3.5: __VA_ARGS__ should only appear in the
> > > -         replacement list of a variadic macro.  */
> > > -      if (result == pfile->spec_nodes.n__VA_ARGS__
> > > -          && !pfile->state.va_args_ok)
> > > -        {
> > > -          if (CPP_OPTION (pfile, cplusplus))
> > > -            cpp_error (pfile, CPP_DL_PEDWARN,
> > > -                       "__VA_ARGS__ can only appear in the expansion"
> > > -                       " of a C++11 variadic macro");
> > > -          else
> > > -            cpp_error (pfile, CPP_DL_PEDWARN,
> > > -                       "__VA_ARGS__ can only appear in the expansion"
> > > -                       " of a C99 variadic macro");
> > > -        }
> > > -
> > > -      if (result == pfile->spec_nodes.n__VA_OPT__)
> > > -        maybe_va_opt_error (pfile);
> > > -
> > > -      /* For -Wc++-compat, warn about use of C++ named operators.  */
> > > -      if (result->flags & NODE_WARN_OPERATOR)
> > > -        cpp_warning (pfile, CPP_W_CXX_OPERATOR_NAMES,
> > > -                     "identifier \"%s\" is a special operator name in C++",
> > > -                     NODE_NAME (result));
> > > -    }
> > > -
> > > +  identifier_diagnostics_on_lex (pfile, result);
> > >    return result;
> > >  }
> > >
> > > @@ -2221,7 +2231,9 @@ _cpp_lex_identifier (cpp_reader *pfile, const char *name)
> > >    return result;
> > >  }
> > >
> > > -/* Lex an identifier starting at BUFFER->CUR - 1.  */
> > > +/* Lex an identifier starting at BASE.  BUFFER->CUR is expected to point
> > > +   one past the first character at BASE, which may be a (possibly multi-byte)
> > > +   character if STARTS_UCN is true.  */
> > >  static cpp_hashnode *
> > >  lex_identifier (cpp_reader *pfile, const uchar *base, bool starts_ucn,
> > >                  struct normalize_state *nst, cpp_hashnode **spelling)
> > > @@ -2270,42 +2282,51 @@ lex_identifier (cpp_reader *pfile, const uchar *base, bool starts_ucn,
> > >        *spelling = result;
> > >      }
> > >
> > > -  /* Rarely, identifiers require diagnostics when lexed.  */
> > > -  if (__builtin_expect ((result->flags & NODE_DIAGNOSTIC)
> > > -                        && !pfile->state.skipping, 0))
> > > -    {
> > > -      /* It is allowed to poison the same identifier twice.  */
> > > -      if ((result->flags & NODE_POISONED) && !pfile->state.poisoned_ok)
> > > -        cpp_error (pfile, CPP_DL_ERROR, "attempt to use poisoned \"%s\"",
> > > -                   NODE_NAME (result));
> > > -
> > > -      /* Constraint 6.10.3.5: __VA_ARGS__ should only appear in the
> > > -         replacement list of a variadic macro.  */
> > > -      if (result == pfile->spec_nodes.n__VA_ARGS__
> > > -          && !pfile->state.va_args_ok)
> > > -        {
> > > -          if (CPP_OPTION (pfile, cplusplus))
> > > -            cpp_error (pfile, CPP_DL_PEDWARN,
> > > -                       "__VA_ARGS__ can only appear in the expansion"
> > > -                       " of a C++11 variadic macro");
> > > -          else
> > > -            cpp_error (pfile, CPP_DL_PEDWARN,
> > > -                       "__VA_ARGS__ can only appear in the expansion"
> > > -                       " of a C99 variadic macro");
> > > -        }
> > > +  return result;
> > > +}
> > >
> > > -      /* __VA_OPT__ should only appear in the replacement list of a
> > > -         variadic macro.  */
> > > -      if (result == pfile->spec_nodes.n__VA_OPT__)
> > > -        maybe_va_opt_error (pfile);
> > > -
> > > -      /* For -Wc++-compat, warn about use of C++ named operators.  */
> > > -      if (result->flags & NODE_WARN_OPERATOR)
> > > -        cpp_warning (pfile, CPP_W_CXX_OPERATOR_NAMES,
> > > -                     "identifier \"%s\" is a special operator name in C++",
> > > -                     NODE_NAME (result));
> > > -    }
> > > +/* Struct to hold the return value of the scan_cur_identifier () helper
> > > +   function below.  */
> > >
> > > +struct scan_id_result
> > > +{
> > > +  cpp_hashnode *node;
> > > +  normalize_state nst;
> > > +
> > > +  scan_id_result ()
> > > +    : node (nullptr)
> > > +  {
> > > +    nst = INITIAL_NORMALIZE_STATE;
> > > +  }
> > > +
> > > +  explicit operator bool () const { return node; }
> > > +};
> > > +
> > > +/* Helper function to scan an entire identifier beginning at
> > > +   pfile->buffer->cur, and possibly containing extended characters (UCNs
> > > +   and/or UTF-8).  Returns the cpp_hashnode for the identifier on success, or
> > > +   else nullptr, as well as a normalize_state so that normalization warnings
> > > +   may be issued once the token lexing is complete.  */
> > > +
> > > +static scan_id_result
> > > +scan_cur_identifier (cpp_reader *pfile)
> > > +{
> > > +  const auto buffer = pfile->buffer;
> > > +  const auto begin = buffer->cur;
> > > +  scan_id_result result;
> > > +  if (ISIDST (*buffer->cur))
> > > +    {
> > > +      ++buffer->cur;
> > > +      cpp_hashnode *ignore;
> > > +      result.node = lex_identifier (pfile, begin, false, &result.nst, &ignore);
> > > +    }
> > > +  else if (forms_identifier_p (pfile, true, &result.nst))
> > > +    {
> > > +      /* buffer->cur has been moved already by the call
> > > +         to forms_identifier_p.  */
> > > +      cpp_hashnode *ignore;
> > > +      result.node = lex_identifier (pfile, begin, true, &result.nst, &ignore);
> > > +    }
> > >    return result;
> > >  }
> > >
> > > @@ -2365,6 +2386,24 @@ create_literal (cpp_reader *pfile, cpp_token *token, const uchar *base,
> > >    token->val.str.text = cpp_alloc_token_string (pfile, base, len);
> > >  }
> > >
> > > +/* Like create_literal(), but construct it from two separate strings
> > > +   which are concatenated.  LEN2 may be 0 if no second string is
> > > +   required.  */
> > > +static void
> > > +create_literal2 (cpp_reader *pfile, cpp_token *token, const uchar *base1,
> > > +                 unsigned int len1, const uchar *base2, unsigned int len2,
> > > +                 enum cpp_ttype type)
> > > +{
> > > +  token->type = type;
> > > +  token->val.str.len = len1 + len2;
> > > +  uchar *const dest = _cpp_unaligned_alloc (pfile, len1 + len2 + 1);
> > > +  memcpy (dest, base1, len1);
> > > +  if (len2)
> > > +    memcpy (dest+len1, base2, len2);
> > > +  dest[len1 + len2] = 0;
> > > +  token->val.str.text = dest;
> > > +}
> > > +
> > >  const uchar *
> > >  cpp_alloc_token_string (cpp_reader *pfile,
> > >                          const unsigned char *ptr, unsigned len)
> > > @@ -2403,6 +2442,11 @@ struct lit_accum {
> > >      rpos = NULL;
> > >      return c;
> > >    }
> > > +
> > > +  void create_literal2 (cpp_reader *pfile, cpp_token *token,
> > > +                        const uchar *base1, unsigned int len1,
> > > +                        const uchar *base2, unsigned int len2,
> > > +                        enum cpp_ttype type);
> > >  };
> > >
> > >  /* Subroutine of lex_raw_string: Append LEN chars from BASE to the buffer
> > > @@ -2445,45 +2489,57 @@ lit_accum::read_begin (cpp_reader *pfile)
> > >    rpos = BUFF_FRONT (last);
> > >  }
> > >
> > > -/* Returns true if a macro has been defined.
> > > -   This might not work if compile with -save-temps,
> > > -   or preprocess separately from compilation.  */
> > > +/* Helper function to check if a string format macro, say from inttypes.h, is
> > > +   placed touching a string literal, in which case it could be parsed as a C++11
> > > +   user-defined string literal thus breaking the program.  User-defined literals
> > > +   outside of namespace std must start with a single underscore, so assume
> > > +   anything of that form really is a UDL suffix.  We don't need to worry about
> > > +   UDLs defined inside namespace std because their names are reserved, so cannot
> > > +   be used as macro names in valid programs.  Return TRUE if the UDL should be
> > > +   ignored for now and preserved for potential macro expansion.  */
> > >
> > >  static bool
> > > -is_macro(cpp_reader *pfile, const uchar *base)
> > > +maybe_ignore_udl_macro_suffix (cpp_reader *pfile, location_t src_loc,
> > > +                               const uchar *suffix_begin, cpp_hashnode *node)
> > >  {
> > > -  const uchar *cur = base;
> > > -  if (! ISIDST (*cur))
> > > +  if ((suffix_begin[0] == '_' && suffix_begin[1] != '_')
> > > +      || !cpp_macro_p (node))
> > >      return false;
> > > -  unsigned int hash = HT_HASHSTEP (0, *cur);
> > > -  ++cur;
> > > -  while (ISIDNUM (*cur))
> > > -    {
> > > -      hash = HT_HASHSTEP (hash, *cur);
> > > -      ++cur;
> > > -    }
> > > -  hash = HT_HASHFINISH (hash, cur - base);
> > >
> > > -  cpp_hashnode *result = CPP_HASHNODE (ht_lookup_with_hash (pfile->hash_table,
> > > -                                        base, cur - base, hash, HT_NO_INSERT));
> > > -
> > > -  return result && cpp_macro_p (result);
> > > +  /* Maybe raise a warning here; caller should arrange not to consume
> > > +     the tokens.  */
> > > +  if (CPP_OPTION (pfile, warn_literal_suffix) && !pfile->state.skipping)
> > > +    cpp_warning_with_line (pfile, CPP_W_LITERAL_SUFFIX, src_loc, 0,
> > > +                           "invalid suffix on literal; C++11 requires a space "
> > > +                           "between literal and string macro");
> > > +  return true;
> > >  }
> > >
> > > -/* Returns true if a literal suffix does not have the expected form
> > > -   and is defined as a macro.  */
> > > -
> > > -static bool
> > > -is_macro_not_literal_suffix(cpp_reader *pfile, const uchar *base)
> > > +/* Like create_literal2(), but also prepend all the accumulated data from
> > > +   the lit_accum struct.  */
> > > +void
> > > +lit_accum::create_literal2 (cpp_reader *pfile, cpp_token *token,
> > > +                            const uchar *base1, unsigned int len1,
> > > +                            const uchar *base2, unsigned int len2,
> > > +                            enum cpp_ttype type)
> > >  {
> > > -  /* User-defined literals outside of namespace std must start with a single
> > > -     underscore, so assume anything of that form really is a UDL suffix.
> > > -     We don't need to worry about UDLs defined inside namespace std because
> > > -     their names are reserved, so cannot be used as macro names in valid
> > > -     programs.  */
> > > -  if (base[0] == '_' && base[1] != '_')
> > > -    return false;
> > > -  return is_macro (pfile, base);
> > > +  const unsigned int tot_len = accum + len1 + len2;
> > > +  uchar *dest = _cpp_unaligned_alloc (pfile, tot_len + 1);
> > > +  token->type = type;
> > > +  token->val.str.len = tot_len;
> > > +  token->val.str.text = dest;
> > > +  for (_cpp_buff *buf = first; buf; buf = buf->next)
> > > +    {
> > > +      size_t len = BUFF_FRONT (buf) - buf->base;
> > > +      memcpy (dest, buf->base, len);
> > > +      dest += len;
> > > +    }
> > > +  memcpy (dest, base1, len1);
> > > +  dest += len1;
> > > +  if (len2)
> > > +    memcpy (dest, base2, len2);
> > > +  dest += len2;
> > > +  *dest = '\0';
> > >  }
> > >
> > >  /* Lexes a raw string.  The stored string contains the spelling,
> > > @@ -2758,26 +2814,25 @@ lex_raw_string (cpp_reader *pfile, cpp_token *token, const uchar *base)
> > >
> > >    if (CPP_OPTION (pfile, user_literals))
> > >      {
> > > -      /* If a string format macro, say from inttypes.h, is placed touching
> > > -         a string literal it could be parsed as a C++11 user-defined string
> > > -         literal thus breaking the program.  */
> > > -      if (is_macro_not_literal_suffix (pfile, pos))
> > > -        {
> > > -          /* Raise a warning, but do not consume subsequent tokens.  */
> > > -          if (CPP_OPTION (pfile, warn_literal_suffix) && !pfile->state.skipping)
> > > -            cpp_warning_with_line (pfile, CPP_W_LITERAL_SUFFIX,
> > > -                                   token->src_loc, 0,
> > > -                                   "invalid suffix on literal; C++11 requires "
> > > -                                   "a space between literal and string macro");
> > > -        }
> > > -      /* Grab user defined literal suffix.  */
> > > -      else if (ISIDST (*pos))
> > > -        {
> > > -          type = cpp_userdef_string_add_type (type);
> > > -          ++pos;
> > > +      const uchar *const suffix_begin = pos;
> > > +      pfile->buffer->cur = pos;
> > >
> > > -          while (ISIDNUM (*pos))
> > > -            ++pos;
> > > +      if (const auto sr = scan_cur_identifier (pfile))
> > > +        {
> > > +          if (maybe_ignore_udl_macro_suffix (pfile, token->src_loc,
> > > +                                             suffix_begin, sr.node))
> > > +            pfile->buffer->cur = suffix_begin;
> > > +          else
> > > +            {
> > > +              type = cpp_userdef_string_add_type (type);
> > > +              accum.create_literal2 (pfile, token, base, suffix_begin - base,
> > > +                                     NODE_NAME (sr.node), NODE_LEN (sr.node),
> > > +                                     type);
> > > +              if (accum.first)
> > > +                _cpp_release_buff (pfile, accum.first);
> > > +              warn_about_normalization (pfile, token, &sr.nst, true);
> > > +              return;
> > > +            }
> > >          }
> > >      }
> > >
> > > @@ -2787,21 +2842,8 @@ lex_raw_string (cpp_reader *pfile, cpp_token *token, const uchar *base)
> > >      create_literal (pfile, token, base, pos - base, type);
> > >    else
> > >      {
> > > -      size_t extra_len = pos - base;
> > > -      uchar *dest = _cpp_unaligned_alloc (pfile, accum.accum + extra_len + 1);
> > > -
> > > -      token->type = type;
> > > -      token->val.str.len = accum.accum + extra_len;
> > > -      token->val.str.text = dest;
> > > -      for (_cpp_buff *buf = accum.first; buf; buf = buf->next)
> > > -        {
> > > -          size_t len = BUFF_FRONT (buf) - buf->base;
> > > -          memcpy (dest, buf->base, len);
> > > -          dest += len;
> > > -        }
> > > +      accum.create_literal2 (pfile, token, base, pos - base, nullptr, 0, type);
> > >        _cpp_release_buff (pfile, accum.first);
> > > -      memcpy (dest, base, extra_len);
> > > -      dest[extra_len] = '\0';
> > >      }
> > >  }
> > >
> > > @@ -2908,39 +2950,40 @@ lex_string (cpp_reader *pfile, cpp_token *token, const uchar *base)
> > >      cpp_error (pfile, CPP_DL_PEDWARN, "missing terminating %c character",
> > >                 (int) terminator);
> > >
> > > +  pfile->buffer->cur = cur;
> > > +  const uchar *const suffix_begin = cur;
> > > +
> > >    if (CPP_OPTION (pfile, user_literals))
> > >      {
> > > -      /* If a string format macro, say from inttypes.h, is placed touching
> > > -         a string literal it could be parsed as a C++11 user-defined string
> > > -         literal thus breaking the program.  */
> > > -      if (is_macro_not_literal_suffix (pfile, cur))
> > > -        {
> > > -          /* Raise a warning, but do not consume subsequent tokens.  */
> > > -          if (CPP_OPTION (pfile, warn_literal_suffix) && !pfile->state.skipping)
> > > -            cpp_warning_with_line (pfile, CPP_W_LITERAL_SUFFIX,
> > > -                                   token->src_loc, 0,
> > > -                                   "invalid suffix on literal; C++11 requires "
> > > -                                   "a space between literal and string macro");
> > > -        }
> > > -      /* Grab user defined literal suffix.  */
> > > -      else if (ISIDST (*cur))
> > > +      if (const auto sr = scan_cur_identifier (pfile))
> > >          {
> > > -          type = cpp_userdef_char_add_type (type);
> > > -          type = cpp_userdef_string_add_type (type);
> > > -          ++cur;
> > > -
> > > -          while (ISIDNUM (*cur))
> > > -            ++cur;
> > > +          if (maybe_ignore_udl_macro_suffix (pfile, token->src_loc,
> > > +                                             suffix_begin, sr.node))
> > > +            pfile->buffer->cur = suffix_begin;
> > > +          else
> > > +            {
> > > +              /* Grab user defined literal suffix.  */
> > > +              type = cpp_userdef_char_add_type (type);
> > > +              type = cpp_userdef_string_add_type (type);
> > > +              create_literal2 (pfile, token, base, suffix_begin - base,
> > > +                               NODE_NAME (sr.node), NODE_LEN (sr.node), type);
> > > +              warn_about_normalization (pfile, token, &sr.nst, true);
> > > +              return;
> > > +            }
> > >          }
> > >      }
> > >    else if (CPP_OPTION (pfile, cpp_warn_cxx11_compat)
> > > -           && is_macro (pfile, cur)
> > >             && !pfile->state.skipping)
> > > -    cpp_warning_with_line (pfile, CPP_W_CXX11_COMPAT,
> > > -                           token->src_loc, 0, "C++11 requires a space "
> > > -                           "between string literal and macro");
> > > +    {
> > > +      const auto sr = scan_cur_identifier (pfile);
> > > +      /* Maybe raise a warning, but do not consume the tokens.  */
> > > +      pfile->buffer->cur = suffix_begin;
> > > +      if (sr && cpp_macro_p (sr.node))
> > > +        cpp_warning_with_line (pfile, CPP_W_CXX11_COMPAT,
> > > +                               token->src_loc, 0, "C++11 requires a space "
> > > +                               "between string literal and macro");
> > > +    }
> > >
> > > -  pfile->buffer->cur = cur;
> > >    create_literal (pfile, token, base, cur - base, type);
> > >  }
> > >
> > > @@ -3915,9 +3958,10 @@ _cpp_lex_direct (cpp_reader *pfile)
> > >        result->type = CPP_NAME;
> > >        {
> > >          struct normalize_state nst = INITIAL_NORMALIZE_STATE;
> > > -        result->val.node.node = lex_identifier (pfile, buffer->cur - 1, false,
> > > -                                                &nst,
> > > -                                                &result->val.node.spelling);
> > > +        const auto node = lex_identifier (pfile, buffer->cur - 1, false, &nst,
> > > +                                          &result->val.node.spelling);
> > > +        result->val.node.node = node;
> > > +        identifier_diagnostics_on_lex (pfile, node);
> > >          warn_about_normalization (pfile, result, &nst, true);
> > >        }
> > >
> > > @@ -4220,8 +4264,10 @@ _cpp_lex_direct (cpp_reader *pfile)
> > >          if (forms_identifier_p (pfile, true, &nst))
> > >            {
> > >              result->type = CPP_NAME;
> > > -            result->val.node.node = lex_identifier (pfile, base, true, &nst,
> > > -                                                    &result->val.node.spelling);
> > > +            const auto node = lex_identifier (pfile, base, true, &nst,
> > > +                                              &result->val.node.spelling);
> > > +            result->val.node.node = node;
> > > +            identifier_diagnostics_on_lex (pfile, node);
> > >              warn_about_normalization (pfile, result, &nst, true);
> > >              break;
> > >            }
> > > @@ -4353,7 +4399,7 @@ cpp_digraph2name (enum cpp_ttype type)
> > >  }
> > >
> > >  /* Write the spelling of an identifier IDENT, using UCNs, to BUFFER.
> > > -   The buffer must already contain the enough space to hold the
> > > +   The buffer must already contain enough space to hold the
> > >     token's spelling.  Returns a pointer to the character after the
> > >     last character written.  */
> > >  unsigned char *
> > > @@ -4375,7 +4421,7 @@ _cpp_spell_ident_ucns (unsigned char *buffer, cpp_hashnode *ident)
> > >  }
> > >
> > >  /* Write the spelling of a token TOKEN to BUFFER.  The buffer must
> > > -   already contain the enough space to hold the token's spelling.
> > > +   already contain enough space to hold the token's spelling.
> > >     Returns a pointer to the character after the last character written.
> > >     FORSTRING is true if this is to be the spelling after translation
> > >     phase 1 (with the original spelling of extended identifiers), false