* [PATCH] c++: Implement C++26 P2558R2 - Add @, $, and ` to the basic character set [PR110343]
@ 2024-07-17 22:04 Jakub Jelinek
2024-07-25 18:35 ` Jason Merrill
2024-07-26 15:43 ` Jason Merrill
0 siblings, 2 replies; 5+ messages in thread
From: Jakub Jelinek @ 2024-07-17 22:04 UTC (permalink / raw)
To: Jason Merrill; +Cc: gcc-patches
Hi!
The following patch implements the easy parts of the paper.
Once @, $ and ` are added to the basic character set, R"@$`()@$`"
should now be valid (while at it I noticed that most of the raw string
tests were run solely with -std=c++11 or -std=gnu++11, and I've tried
to change that). On the other side, even where $ is allowed in
identifiers as an extension, \u0024, \U00000024 or \u{24} should not
be, similarly to how \u0041 is not allowed.
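To make the raw string part concrete, here is a minimal standalone sketch of the delimiter rule. It is not the libcpp code: valid_delim_char and valid_raw_delimiter are hypothetical names, the accepted set follows the standard's d-char rule (basic character set minus space, parentheses, backslash and the whitespace controls) rather than the exact libcpp condition, and an ASCII-compatible character set is assumed.

```cpp
// Standalone sketch, NOT the libcpp implementation.
// valid_delim_char / valid_raw_delimiter are hypothetical helper names.
// Assumes an ASCII-compatible character set.

// True if c may appear in a raw string delimiter in the given mode.
constexpr bool valid_delim_char(char c, bool cxx26)
{
  if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
      || (c >= '0' && c <= '9'))
    return true;
  // Punctuation accepted in every C++ mode with raw strings.
  const char punct[] = "_{}#[]<>%:;.?*+-/^&|~!=,\"'";
  for (const char *p = punct; *p; ++p)
    if (*p == c)
      return true;
  // Only part of the basic character set since C++26 (P2558R2).
  return cxx26 && (c == '$' || c == '@' || c == '`');
}

// True if d is usable as a delimiter: at most 16 characters, all valid.
constexpr bool valid_raw_delimiter(const char *d, bool cxx26)
{
  for (int len = 0; d[len]; ++len)
    if (len >= 16 || !valid_delim_char(d[len], cxx26))
      return false;
  return true;
}

// The delimiter from the paragraph above flips from invalid to valid.
static_assert(!valid_raw_delimiter("@$`", false), "rejected before C++26");
static_assert(valid_raw_delimiter("@$`", true), "accepted in C++26");
```

The cxx26 flag stands in for the CPP_OPTION (pfile, lang) > CLK_CXX23 test in the patch; everything else is independent of the language mode.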
Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?
The paper in 3.1 claims though that
#include <stdio.h>
#define STR(x) #x
int main()
{
printf("%s", STR(\u0060)); // U+0060 is ` GRAVE ACCENT
}
should have been accepted before this paper (and rejected after it),
but g++ rejects it.
I've tried to understand it, but am confused on what is the right
behavior and why.
Consider
#define STR(x) #x
const char *a = "\u00b7";
const char *b = STR(\u00b7);
const char *c = "\u0041";
const char *d = STR(\u0041);
const char *e = STR(a\u00b7);
const char *f = STR(a\u0041);
const char *g = STR(a \u00b7);
const char *h = STR(a \u0041);
const char *i = "\u066d";
const char *j = STR(\u066d);
const char *k = "\u0040";
const char *l = STR(\u0040);
const char *m = STR(a\u066d);
const char *n = STR(a\u0040);
const char *o = STR(a \u066d);
const char *p = STR(a \u0040);
Neither clang nor gcc emits any diagnostics on the a, c, i and k
initializers; those are certainly valid (c is invalid in C23 though). With
-pedantic-errors, g++ emits errors on all the others, while clang++ does so
only on the ones with STR involving \u0041, \u0040 and a\u066d. The chosen
values are \u0040 '@' as something being changed by this paper, \u0041 'A'
as a basic character set char valid in identifiers before/after, \u00b7 as
an example of a character which is pedantically valid in identifiers if not
at the start, and \u066d as something pedantically not valid in identifiers.
Now, https://eel.is/c++draft/lex.charset#6 says that a UCN used outside of a
string/character literal which corresponds to a basic character set character
(or control character) is ill-formed; that would make the d, f, h cases
invalid for C++ and the l, n, p cases invalid for C++26.
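The lex.charset rule alone can be sketched as follows. This is only an illustration under my reading of the rule; in_basic_charset, is_control and ucn_ok_outside_literal are hypothetical names, and passing this check says nothing about whether the character is valid in an identifier.

```cpp
// Sketch of the lex.charset rule only; all names here are hypothetical.

// Basic character set: tab, VT, FF, LF, space and printable ASCII,
// where $, @ and ` are members only from C++26 on (P2558R2).
constexpr bool in_basic_charset(char32_t c, bool cxx26)
{
  if (c == U'\t' || c == U'\v' || c == U'\f' || c == U'\n')
    return true;
  if (c < 0x20 || c > 0x7e)
    return false;
  if (c == U'$' || c == U'@' || c == U'`')
    return cxx26;
  return true;
}

// C0 and C1 control characters.
constexpr bool is_control(char32_t c)
{
  return c <= 0x1f || (c >= 0x7f && c <= 0x9f);
}

// A UCN outside a string/character literal is ill-formed if it names a
// control character or a basic character set character.
constexpr bool ucn_ok_outside_literal(char32_t c, bool cxx26)
{
  return !is_control(c) && !in_basic_charset(c, cxx26);
}

// \u0041 'A': rejected in every mode (the d, f, h cases).
static_assert(!ucn_ok_outside_literal(0x41, false), "");
static_assert(!ucn_ok_outside_literal(0x41, true), "");
// \u0040 '@': passes this rule before C++26, rejected from C++26 on
// (the l, n, p cases).
static_assert(ucn_ok_outside_literal(0x40, false), "");
static_assert(!ucn_ok_outside_literal(0x40, true), "");
// \u00b7 and \u066d are not affected by this rule.
static_assert(ucn_ok_outside_literal(0xb7, true), "");
static_assert(ucn_ok_outside_literal(0x66d, true), "");
```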
https://eel.is/c++draft/lex.name states which characters can appear at the
start of the identifier and which can appear after the start. And
https://eel.is/c++draft/lex.pptoken states that preprocessing-token is
either identifier, or tons of other things, or "each non-whitespace
character that cannot be one of the above"
Then https://eel.is/c++draft/lex.pptoken#1 says that this last category is
invalid if the preprocessing token is being converted into token.
And https://eel.is/c++draft/lex.pptoken#2 includes "If any character not in
the basic character set matches the last category, the program is
ill-formed."
Now, e.g. for the C++23 STR(\u0040) case, \u0040 is not in the basic
character set there, so it is valid outside of literals (not the case anymore
in C++26), but it isn't a nondigit and doesn't have the XID_Start property,
so IMHO it isn't an identifier and must be the "each non-whitespace character
that cannot be one of the above" case. Why doesn't the above mentioned
https://eel.is/c++draft/lex.pptoken#2 sentence make that invalid? Ignoring
that, I'd say it would then be stringized, and that feels like what clang++
is doing. Now, e.g. for the STR(a\u066d) case, I wonder why that isn't lexed
as an identifier followed by a \u066d "each non-whitespace character that
cannot be one of the above" token and stringized similarly; clang++ rejects
that.
What GCC libcpp seems to be doing is that forms_identifier_p calls
_cpp_valid_utf8 or _cpp_valid_ucn with an argument which tells whether the
character is first or second+ in the identifier, and e.g. _cpp_valid_ucn
then, for UCNs valid in string literals, does
else if (identifier_pos)
{
int validity = ucn_valid_in_identifier (pfile, result, nst);
if (validity == 0)
cpp_error (pfile, CPP_DL_ERROR,
"universal character %.*s is not valid in an identifier",
(int) (str - base), base);
else if (validity == 2 && identifier_pos == 1)
cpp_error (pfile, CPP_DL_ERROR,
"universal character %.*s is not valid at the start of an identifier",
(int) (str - base), base);
}
so basically all those invalid-in-identifiers cases emit an error and then
pretend to be valid in identifiers. That differs from what e.g.
_cpp_valid_utf8 does (for C but not for C++, and only for the chars
completely invalid in identifiers rather than those valid in identifiers
but not at the start):
/* In C++, this is an error for invalid character in an identifier
because logically, the UTF-8 was converted to a UCN during
translation phase 1 (even though we don't physically do it that
way). In C, this byte rather becomes grammatically a separate
token. */
if (CPP_OPTION (pfile, cplusplus))
cpp_error (pfile, CPP_DL_ERROR,
"extended character %.*s is not valid in an identifier",
(int) (*pstr - base), base);
else
{
*pstr = base;
return false;
}
The comment doesn't really match what is done in recent C++ versions because
there UCNs are translated to characters and not the other way around.
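The difference between the two strategies can be shown with a toy scanner. This is a sketch under stated assumptions, not libcpp code: classify and scan_identifier are hypothetical, the classification table is a tiny stand-in for ucn_valid_in_identifier, and the meaning of return value 1 (valid anywhere) is my assumption based on the 0 and 2 cases in the snippet above.

```cpp
#include <cstddef>
#include <string>

// Toy model, NOT libcpp. Values mirror the snippet above: 0 = not valid
// in an identifier, 2 = valid but not at the start; 1 = valid anywhere
// is an assumption.
enum Validity { invalid = 0, valid = 1, valid_not_at_start = 2 };

// Hypothetical stand-in for ucn_valid_in_identifier.
Validity classify(char32_t c)
{
  if ((c >= U'a' && c <= U'z') || (c >= U'A' && c <= U'Z') || c == U'_')
    return valid;
  if ((c >= U'0' && c <= U'9') || c == 0xb7)
    return valid_not_at_start;
  return invalid;            // e.g. U+066D
}

struct Result
{
  std::size_t consumed;      // characters taken into the identifier
  int errors;                // diagnostics that would be emitted
};

// split_on_invalid == false: diagnose and keep the character in the
// identifier (the "error and pretend to be valid" behavior).
// split_on_invalid == true: stop before a completely invalid character
// so that it becomes a separate token (the C UTF-8 behavior).
Result scan_identifier(const std::u32string &s, bool split_on_invalid)
{
  Result r{0, 0};
  for (std::size_t i = 0; i < s.size(); ++i)
    {
      if (s[i] == U' ')
        break;               // whitespace always ends the token
      Validity v = classify(s[i]);
      if (v == invalid)
        {
          if (split_on_invalid)
            break;           // rewind: leave the character unconsumed
          ++r.errors;
        }
      else if (v == valid_not_at_start && i == 0)
        ++r.errors;          // diagnosed in both strategies
      r.consumed = i + 1;
    }
  return r;
}
```

With this model, scanning U"a\u066d" with split_on_invalid false consumes both characters and records one error, while with split_on_invalid true it consumes only the a and records none, leaving \u066d to be lexed as its own token.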
2024-07-17 Jakub Jelinek <jakub@redhat.com>
PR c++/110343
libcpp/
* lex.cc: C++26 P2558R2 - Add @, $, and ` to the basic character set.
(lex_raw_string): For C++26 allow $@` characters in prefix.
* charset.cc (_cpp_valid_ucn): For C++26 reject \u0024 in identifiers.
gcc/testsuite/
* c-c++-common/raw-string-1.c: Use { c || c++11 } effective target,
remove c++ specific dg-options.
* c-c++-common/raw-string-2.c: Likewise.
* c-c++-common/raw-string-4.c: Likewise.
* c-c++-common/raw-string-5.c: Likewise. Expect some diagnostics
only for non-c++26, for c++26 expect different.
* c-c++-common/raw-string-6.c: Use { c || c++11 } effective target,
remove c++ specific dg-options.
* c-c++-common/raw-string-11.c: Likewise.
* c-c++-common/raw-string-13.c: Likewise.
* c-c++-common/raw-string-14.c: Likewise.
* c-c++-common/raw-string-15.c: Use { c || c++11 } effective target,
change c++ specific dg-options to just -Wtrigraphs.
* c-c++-common/raw-string-16.c: Likewise.
* c-c++-common/raw-string-17.c: Use { c || c++11 } effective target,
remove c++ specific dg-options.
* c-c++-common/raw-string-18.c: Use { c || c++11 } effective target,
remove -std=c++11 from c++ specific dg-options.
* c-c++-common/raw-string-19.c: Likewise.
* g++.dg/cpp26/raw-string1.C: New test.
* g++.dg/cpp26/raw-string2.C: New test.
--- libcpp/lex.cc.jj 2024-07-17 11:36:49.897873247 +0200
+++ libcpp/lex.cc 2024-07-17 20:04:43.936793506 +0200
@@ -2718,7 +2718,10 @@ lex_raw_string (cpp_reader *pfile, cpp_t
|| c == '*' || c == '+' || c == '-' || c == '/'
|| c == '^' || c == '&' || c == '|' || c == '~'
|| c == '!' || c == '=' || c == ','
- || c == '"' || c == '\''))
+ || c == '"' || c == '\''
+ || ((c == '$' || c == '@' || c == '`')
+ && CPP_OPTION (pfile, cplusplus)
+ && CPP_OPTION (pfile, lang) > CLK_CXX23)))
prefix[prefix_len++] = c;
else
{
--- libcpp/charset.cc.jj 2024-01-05 08:35:13.696827331 +0100
+++ libcpp/charset.cc 2024-07-17 20:18:13.665467035 +0200
@@ -1808,7 +1808,12 @@ _cpp_valid_ucn (cpp_reader *pfile, const
result = 1;
}
else if (identifier_pos && result == 0x24
- && CPP_OPTION (pfile, dollars_in_ident))
+ && CPP_OPTION (pfile, dollars_in_ident)
+ /* In C++26 when dollars are allowed in identifiers,
+ we should still reject \u0024 as $ is part of the basic
+ character set. */
+ && !(CPP_OPTION (pfile, cplusplus)
+ && CPP_OPTION (pfile, lang) > CLK_CXX23))
{
if (CPP_OPTION (pfile, warn_dollars) && !pfile->state.skipping)
{
--- gcc/testsuite/c-c++-common/raw-string-1.c.jj 2020-01-12 11:54:37.022404221 +0100
+++ gcc/testsuite/c-c++-common/raw-string-1.c 2024-07-17 20:31:02.272652757 +0200
@@ -1,7 +1,6 @@
-// { dg-do run }
+// { dg-do run { target { c || c++11 } } }
// { dg-require-effective-target wchar }
// { dg-options "-std=gnu99 -Wno-c++-compat -trigraphs" { target c } }
-// { dg-options "-std=c++0x" { target c++ } }
#ifndef __cplusplus
#include <wchar.h>
--- gcc/testsuite/c-c++-common/raw-string-2.c.jj 2020-01-12 11:54:37.023404206 +0100
+++ gcc/testsuite/c-c++-common/raw-string-2.c 2024-07-17 20:31:18.415446546 +0200
@@ -1,7 +1,6 @@
-// { dg-do run }
+// { dg-do run { target { c || c++11 } } }
// { dg-require-effective-target wchar }
// { dg-options "-std=gnu99 -Wno-c++-compat -trigraphs" { target c } }
-// { dg-options "-std=c++0x" { target c++ } }
#ifndef __cplusplus
#include <wchar.h>
--- gcc/testsuite/c-c++-common/raw-string-4.c.jj 2020-01-12 11:54:37.023404206 +0100
+++ gcc/testsuite/c-c++-common/raw-string-4.c 2024-07-17 20:31:51.590022777 +0200
@@ -1,7 +1,6 @@
// R is not applicable for character literals.
-// { dg-do compile }
+// { dg-do compile { target { c || c++11 } } }
// { dg-options "-std=gnu99" { target c } }
-// { dg-options "-std=c++0x" { target c++ } }
const int i0 = R'a'; // { dg-error "was not declared|undeclared" "undeclared" }
// { dg-error "expected ',' or ';'" "expected" { target c } .-1 }
--- gcc/testsuite/c-c++-common/raw-string-5.c.jj 2020-07-28 15:39:09.992756448 +0200
+++ gcc/testsuite/c-c++-common/raw-string-5.c 2024-07-17 20:56:46.522822013 +0200
@@ -1,6 +1,5 @@
-// { dg-do compile }
+// { dg-do compile { target { c || c++11 } } }
// { dg-options "-std=gnu99" { target c } }
-// { dg-options "-std=c++0x" { target c++ } }
const void *s0 = R"0123456789abcdefg()0123456789abcdefg" 0;
// { dg-error "raw string delimiter longer" "longer" { target *-*-* } .-1 }
@@ -15,12 +14,18 @@ const void *s3 = R")())" 0;
// { dg-error "invalid character" "invalid" { target *-*-* } .-1 }
// { dg-error "stray" "stray" { target *-*-* } .-2 }
const void *s4 = R"@()@" 0;
- // { dg-error "invalid character" "invalid" { target *-*-* } .-1 }
- // { dg-error "stray" "stray" { target *-*-* } .-2 }
+ // { dg-error "invalid character" "invalid" { target { c || c++23_down } } .-1 }
+ // { dg-error "stray" "stray" { target { c || c++23_down } } .-2 }
+ // { dg-error "before numeric constant" "numeric" { target c++26 } .-3 }
const void *s5 = R"$()$" 0;
- // { dg-error "invalid character" "invalid" { target *-*-* } .-1 }
- // { dg-error "stray" "stray" { target *-*-* } .-2 }
-const void *s6 = R"\u0040()\u0040" 0;
+ // { dg-error "invalid character" "invalid" { target { c || c++23_down } } .-1 }
+ // { dg-error "stray" "stray" { target { c || c++23_down } } .-2 }
+ // { dg-error "before numeric constant" "numeric" { target c++26 } .-3 }
+const void *s6 = R"`()`" 0;
+ // { dg-error "invalid character" "invalid" { target { c || c++23_down } } .-1 }
+ // { dg-error "stray" "stray" { target { c || c++23_down } } .-2 }
+ // { dg-error "before numeric constant" "numeric" { target c++26 } .-3 }
+const void *s7 = R"\u0040()\u0040" 0;
// { dg-error "invalid character" "invalid" { target *-*-* } .-1 }
// { dg-error "stray" "stray" { target *-*-* } .-2 }
--- gcc/testsuite/c-c++-common/raw-string-6.c.jj 2020-12-28 12:27:32.500752614 +0100
+++ gcc/testsuite/c-c++-common/raw-string-6.c 2024-07-17 20:32:26.193580759 +0200
@@ -1,6 +1,5 @@
-// { dg-do compile }
+// { dg-do compile { target { c || c++11 } } }
// { dg-options "-std=gnu99" { target c } }
-// { dg-options "-std=c++0x" { target c++ } }
const void *s0 = R"ouch()ouCh"; // { dg-error "unterminated raw string" "unterminated" }
// { dg-error "at end of input" "end" { target *-*-* } .-1 }
--- gcc/testsuite/c-c++-common/raw-string-11.c.jj 2020-01-12 11:54:37.022404221 +0100
+++ gcc/testsuite/c-c++-common/raw-string-11.c 2024-07-17 20:33:54.236456112 +0200
@@ -1,7 +1,7 @@
// PR preprocessor/48740
+// { dg-do run { target { c || c++11 } } }
// { dg-options "-std=gnu99 -trigraphs -save-temps" { target c } }
-// { dg-options "-std=c++0x -save-temps" { target c++ } }
-// { dg-do run }
+// { dg-options "-save-temps" { target c++ } }
int main ()
{
@@ -9,4 +9,3 @@ int main ()
"foo%sbar%sfred%sbob?""?""?""?""?",
sizeof ("foo%sbar%sfred%sbob?""?""?""?""?"));
}
-
--- gcc/testsuite/c-c++-common/raw-string-13.c.jj 2020-01-12 11:54:37.022404221 +0100
+++ gcc/testsuite/c-c++-common/raw-string-13.c 2024-07-17 20:34:23.669080145 +0200
@@ -1,8 +1,7 @@
// PR preprocessor/57620
-// { dg-do run }
+// { dg-do run { target { c || c++11 } } }
// { dg-require-effective-target wchar }
// { dg-options "-std=gnu99 -Wno-c++-compat -trigraphs" { target c } }
-// { dg-options "-std=c++11" { target c++ } }
#ifndef __cplusplus
#include <wchar.h>
--- gcc/testsuite/c-c++-common/raw-string-14.c.jj 2020-07-28 15:39:09.992756448 +0200
+++ gcc/testsuite/c-c++-common/raw-string-14.c 2024-07-17 20:34:43.507826727 +0200
@@ -1,7 +1,6 @@
// PR preprocessor/57620
-// { dg-do compile }
+// { dg-do compile { target { c || c++11 } } }
// { dg-options "-std=gnu99 -trigraphs" { target c } }
-// { dg-options "-std=c++11" { target c++ } }
const void *s0 = R"abc\
def()abcdef" 0;
--- gcc/testsuite/c-c++-common/raw-string-15.c.jj 2020-01-12 11:54:37.022404221 +0100
+++ gcc/testsuite/c-c++-common/raw-string-15.c 2024-07-17 20:34:58.994628892 +0200
@@ -1,8 +1,8 @@
// PR preprocessor/57620
-// { dg-do run }
+// { dg-do run { target { c || c++11 } } }
// { dg-require-effective-target wchar }
// { dg-options "-std=gnu99 -Wno-c++-compat -Wtrigraphs" { target c } }
-// { dg-options "-std=gnu++11 -Wtrigraphs" { target c++ } }
+// { dg-options "-Wtrigraphs" { target c++ } }
#ifndef __cplusplus
#include <wchar.h>
--- gcc/testsuite/c-c++-common/raw-string-16.c.jj 2020-07-28 15:39:09.992756448 +0200
+++ gcc/testsuite/c-c++-common/raw-string-16.c 2024-07-17 20:35:22.387330085 +0200
@@ -1,7 +1,7 @@
// PR preprocessor/57620
-// { dg-do compile }
+// { dg-do compile { target { c || c++11 } } }
// { dg-options "-std=gnu99 -Wtrigraphs" { target c } }
-// { dg-options "-std=gnu++11 -Wtrigraphs" { target c++ } }
+// { dg-options "-Wtrigraphs" { target c++ } }
const void *s0 = R"abc\
def()abcdef" 0;
--- gcc/testsuite/c-c++-common/raw-string-17.c.jj 2020-01-12 11:54:37.022404221 +0100
+++ gcc/testsuite/c-c++-common/raw-string-17.c 2024-07-17 20:35:36.497149845 +0200
@@ -1,7 +1,6 @@
/* PR preprocessor/57824 */
-/* { dg-do run } */
+/* { dg-do run { target { c || c++11 } } } */
/* { dg-options "-std=gnu99" { target c } } */
-/* { dg-options "-std=c++11" { target c++ } } */
#define S(s) s
#define T(s) s "\n"
--- gcc/testsuite/c-c++-common/raw-string-18.c.jj 2020-01-12 11:54:37.022404221 +0100
+++ gcc/testsuite/c-c++-common/raw-string-18.c 2024-07-17 20:35:55.151911555 +0200
@@ -1,7 +1,7 @@
/* PR preprocessor/57824 */
-/* { dg-do compile } */
+/* { dg-do compile { target { c || c++11 } } } */
/* { dg-options "-std=gnu99 -fdump-tree-optimized-lineno" { target c } } */
-/* { dg-options "-std=c++11 -fdump-tree-optimized-lineno" { target c++ } } */
+/* { dg-options "-fdump-tree-optimized-lineno" { target c++ } } */
const char x[] = R"(
abc
--- gcc/testsuite/c-c++-common/raw-string-19.c.jj 2020-01-12 11:54:37.022404221 +0100
+++ gcc/testsuite/c-c++-common/raw-string-19.c 2024-07-17 20:36:25.445524589 +0200
@@ -1,7 +1,7 @@
/* PR preprocessor/57824 */
-/* { dg-do compile } */
+// { dg-do compile { target { c || c++11 } } }
/* { dg-options "-std=gnu99 -fdump-tree-optimized-lineno -save-temps" { target c } } */
-/* { dg-options "-std=c++11 -fdump-tree-optimized-lineno -save-temps" { target c++ } } */
+/* { dg-options "-fdump-tree-optimized-lineno -save-temps" { target c++ } } */
const char x[] = R"(
abc
--- gcc/testsuite/g++.dg/cpp26/raw-string1.C.jj 2024-07-17 20:46:06.878052479 +0200
+++ gcc/testsuite/g++.dg/cpp26/raw-string1.C 2024-07-17 20:47:50.761715122 +0200
@@ -0,0 +1,4 @@
+// C++26 P2558R2 - Add @, $, and ` to the basic character set
+// { dg-do compile { target c++26 } }
+
+const char *s0 = R"`@$$@`@`$()`@$$@`@`$";
--- gcc/testsuite/g++.dg/cpp26/raw-string2.C.jj 2024-07-17 20:54:53.478273235 +0200
+++ gcc/testsuite/g++.dg/cpp26/raw-string2.C 2024-07-17 20:58:46.177289931 +0200
@@ -0,0 +1,7 @@
+// C++26 P2558R2 - Add @, $, and ` to the basic character set
+// { dg-do compile { target { ! { avr*-*-* mmix*-*-* *-*-aix* } } } }
+// { dg-options "" }
+
+int a$b;
+int a\u0024c; // { dg-error "universal character \\\\u0024 is not valid in an identifier" "" { target c++26 } }
+int a\U00000024d; // { dg-error "universal character \\\\U00000024 is not valid in an identifier" "" { target c++26 } }
Jakub
* Re: [PATCH] c++: Implement C++26 P2558R2 - Add @, $, and ` to the basic character set [PR110343]
2024-07-17 22:04 [PATCH] c++: Implement C++26 P2558R2 - Add @, $, and ` to the basic character set [PR110343] Jakub Jelinek
@ 2024-07-25 18:35 ` Jason Merrill
2024-07-26 15:43 ` Jason Merrill
1 sibling, 0 replies; 5+ messages in thread
From: Jason Merrill @ 2024-07-25 18:35 UTC (permalink / raw)
To: Jakub Jelinek; +Cc: gcc-patches
On 7/17/24 6:04 PM, Jakub Jelinek wrote:
> Hi!
>
> The following patch implements the easy parts of the paper.
> When @$` are added to the basic character set, it means that
> R"@$`()@$`" should now be valid (here I've noticed most of the
> raw string tests were tested solely with -std=c++11 or -std=gnu++11
> and I've tried to change that), and on the other side even if
> by extension $ is allowed in identifiers, \u0024 or \U00000024
> or \u{24} should not be, similarly how \u0041 is not allowed.
>
> Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?
>
> The paper in 3.1 claims though that
> #include <stdio.h>
>
> #define STR(x) #x
>
> int main()
> {
> printf("%s", STR(\u0060)); // U+0060 is ` GRAVE ACCENT
> }
> should have been accepted before this paper (and rejected after it),
> but g++ rejects it.
>
> I've tried to understand it, but am confused on what is the right
> behavior and why.
>
> Consider
> #define STR(x) #x
> const char *a = "\u00b7";
> const char *b = STR(\u00b7);
> const char *c = "\u0041";
> const char *d = STR(\u0041);
> const char *e = STR(a\u00b7);
> const char *f = STR(a\u0041);
> const char *g = STR(a \u00b7);
> const char *h = STR(a \u0041);
> const char *i = "\u066d";
> const char *j = STR(\u066d);
> const char *k = "\u0040";
> const char *l = STR(\u0040);
> const char *m = STR(a\u066d);
> const char *n = STR(a\u0040);
> const char *o = STR(a \u066d);
> const char *p = STR(a \u0040);
>
> Neither clang nor gcc emit any diagnostics on the a, c, i and k
> initializers, those are certainly valid (c is invalid in C23 though). g++
> emits with -pedantic-errors errors on all the others, while clang++ on the
> ones with STR involving \u0041, \u0040 and a\u0066d. The chosen values are
> \u0040 '@' as something being changed by this paper, \u0041 'A' as basic
> character set char valid in identifiers before/after, \u00b7 as an example
> of character which is pedantically valid in identifiers if not at the start
> and \u066d s something pedantically not valid in identifiers.
>
> Now, https://eel.is/c++draft/lex.charset#6 says that UCN used outside of a
> string/character literal which corresponds to basic character set character
> (or control character) is ill-formed, that would make d, f, h cases invalid
> for C++ and l, n, p cases invalid for C++26.
>
> https://eel.is/c++draft/lex.name states which characters can appear at the
> start of the identifier and which can appear after the start. And
> https://eel.is/c++draft/lex.pptoken states that preprocessing-token is
> either identifier, or tons of other things, or "each non-whitespace
> character that cannot be one of the above"
>
> Then https://eel.is/c++draft/lex.pptoken#1 says that this last category is
> invalid if the preprocessing token is being converted into token.
>
> And https://eel.is/c++draft/lex.pptoken#2 includes "If any character not in
> the basic character set matches the last category, the program is
> ill-formed."
>
> Now, e.g. for the C++23 STR(\u0040) case, \u0040 is there not in the basic
> character set, so valid outside of the literals (not the case anymore in
> C++26), but it isn't nondigit and doesn't have XID_Start property, so it
> isn't IMHO an identifier and so must be the "each non-whitespace character
> that cannot be one of the above" case. Why doesn't the above mentioned
> https://eel.is/c++draft/lex.pptoken#2 sentence make that invalid?
Your argument makes sense to me, though...
> Ignoring
> that, I'd say it would be then stringized and that feels like it is what
> clang++ is doing. Now, e.g. for the STR(a\u066d) case, I wonder why that
> isn't lexed as a identifier followed by \u066d "each non-whitespace
> character that cannot be one of the above" token and stringified similarly,
> clang++ rejects that.
>
> What GCC libcpp seems to be doing is that if that forms_identifier_p calls
> _cpp_valid_utf8 or _cpp_valid_ucn with an argument which tells it is first
> or second+ in identifier, and e.g. _cpp_valid_ucn then for UCNs valid in
> string literals calls
> else if (identifier_pos)
> {
> int validity = ucn_valid_in_identifier (pfile, result, nst);
>
> if (validity == 0)
> cpp_error (pfile, CPP_DL_ERROR,
> "universal character %.*s is not valid in an identifier",
> (int) (str - base), base);
> else if (validity == 2 && identifier_pos == 1)
> cpp_error (pfile, CPP_DL_ERROR,
> "universal character %.*s is not valid at the start of an identifier",
> (int) (str - base), base);
> }
> so basically all those invalid in identifiers cases emit an error and
> pretend to be valid in identifiers, rather than what e.g. _cpp_valid_utf8
> does for C but not for C++ and only for the chars completely invalid in
> identifiers rather than just valid in identifiers but not at the start:
> /* In C++, this is an error for invalid character in an identifier
> because logically, the UTF-8 was converted to a UCN during
> translation phase 1 (even though we don't physically do it that
> way). In C, this byte rather becomes grammatically a separate
> token. */
>
> if (CPP_OPTION (pfile, cplusplus))
> cpp_error (pfile, CPP_DL_ERROR,
> "extended character %.*s is not valid in an identifier",
> (int) (*pstr - base), base);
> else
> {
> *pstr = base;
> return false;
> }
> The comment doesn't really match what is done in recent C++ versions because
> there UCNs are translated to characters and not the other way around.
...it seems wrong that calling forms_identifier_p gives an error and
returns true for characters that can't be part of an identifier, which I
would expect to produce a false result. If we want to complain about
the pptoken#2 issue, that seems like it should happen in the CPP_OTHER
section of _cpp_lex_direct.
Our diagnostic for STR(\u0041) is similarly unhelpful, saying just "not
valid in an identifier" rather than anything about the basic character
set or that it should be spelled "A".
But if we're going to give an error either way, fixing this seems a low
priority.
> 2024-07-17 Jakub Jelinek <jakub@redhat.com>
>
> PR c++/110343
> libcpp/
> * lex.cc: C++26 P2558R2 - Add @, $, and ` to the basic character set.
> (lex_raw_string): For C++26 allow $@` characters in prefix.
> * charset.cc (_cpp_valid_ucn): For C++26 reject \u0024 in identifiers.
> gcc/testsuite/
> * c-c++-common/raw-string-1.c: Use { c || c++11 } effective target,
> remove c++ specific dg-options.
> * c-c++-common/raw-string-2.c: Likewise.
> * c-c++-common/raw-string-4.c: Likewise.
> * c-c++-common/raw-string-5.c: Likewise. Expect some diagnostics
> only for non-c++26, for c++26 expect different.
> * c-c++-common/raw-string-6.c: Use { c || c++11 } effective target,
> remove c++ specific dg-options.
> * c-c++-common/raw-string-11.c: Likewise.
> * c-c++-common/raw-string-13.c: Likewise.
> * c-c++-common/raw-string-14.c: Likewise.
> * c-c++-common/raw-string-15.c: Use { c || c++11 } effective target,
> change c++ specific dg-options to just -Wtrigraphs.
> * c-c++-common/raw-string-16.c: Likewise.
> * c-c++-common/raw-string-17.c: Use { c || c++11 } effective target,
> remove c++ specific dg-options.
> * c-c++-common/raw-string-18.c: Use { c || c++11 } effective target,
> remove -std=c++11 from c++ specific dg-options.
> * c-c++-common/raw-string-19.c: Likewise.
> * g++.dg/cpp26/raw-string1.C: New test.
> * g++.dg/cpp26/raw-string2.C: New test.
>
> --- libcpp/lex.cc.jj 2024-07-17 11:36:49.897873247 +0200
> +++ libcpp/lex.cc 2024-07-17 20:04:43.936793506 +0200
> @@ -2718,7 +2718,10 @@ lex_raw_string (cpp_reader *pfile, cpp_t
> || c == '*' || c == '+' || c == '-' || c == '/'
> || c == '^' || c == '&' || c == '|' || c == '~'
> || c == '!' || c == '=' || c == ','
> - || c == '"' || c == '\''))
> + || c == '"' || c == '\''
> + || ((c == '$' || c == '@' || c == '`')
> + && CPP_OPTION (pfile, cplusplus)
> + && CPP_OPTION (pfile, lang) > CLK_CXX23)))
> prefix[prefix_len++] = c;
> else
> {
> --- libcpp/charset.cc.jj 2024-01-05 08:35:13.696827331 +0100
> +++ libcpp/charset.cc 2024-07-17 20:18:13.665467035 +0200
> @@ -1808,7 +1808,12 @@ _cpp_valid_ucn (cpp_reader *pfile, const
> result = 1;
> }
> else if (identifier_pos && result == 0x24
> - && CPP_OPTION (pfile, dollars_in_ident))
> + && CPP_OPTION (pfile, dollars_in_ident)
> + /* In C++26 when dollars are allowed in identifiers,
> + we should still reject \u0024 as $ is part of the basic
> + character set. */
> + && !(CPP_OPTION (pfile, cplusplus)
> + && CPP_OPTION (pfile, lang) > CLK_CXX23))
I wonder about moving $ handling into the next else, so we don't need to
worry about the basic charset here?
But the patch is OK.
Jason
* Re: [PATCH] c++: Implement C++26 P2558R2 - Add @, $, and ` to the basic character set [PR110343]
2024-07-17 22:04 [PATCH] c++: Implement C++26 P2558R2 - Add @, $, and ` to the basic character set [PR110343] Jakub Jelinek
2024-07-25 18:35 ` Jason Merrill
@ 2024-07-26 15:43 ` Jason Merrill
2024-07-26 15:55 ` Jakub Jelinek
1 sibling, 1 reply; 5+ messages in thread
From: Jason Merrill @ 2024-07-26 15:43 UTC (permalink / raw)
To: Jakub Jelinek; +Cc: gcc-patches
I'm now seeing a -std=c++26 failure on g++.dg/cpp/ucn-1.C.
Jason
* Re: [PATCH] c++: Implement C++26 P2558R2 - Add @, $, and ` to the basic character set [PR110343]
From: Jakub Jelinek @ 2024-07-26 15:55 UTC (permalink / raw)
To: Jason Merrill; +Cc: gcc-patches
On Fri, Jul 26, 2024 at 11:43:13AM -0400, Jason Merrill wrote:
> I'm now seeing a -std=c++26 failure on g++.dg/cpp/ucn-1.C.
I don't remember seeing it when I wrote the patch, but today I see it as
well.
The following patch seems to fix that, tested on i686-linux, ok for trunk?
2024-07-26 Jakub Jelinek <jakub@redhat.com>
* g++.dg/cpp/ucn-1.C (main): Expect error on c\u0024c identifier also
for C++26.
--- gcc/testsuite/g++.dg/cpp/ucn-1.C.jj 2020-01-14 20:02:46.702611047 +0100
+++ gcc/testsuite/g++.dg/cpp/ucn-1.C 2024-07-26 17:52:33.881518790 +0200
@@ -9,7 +9,7 @@ int main()
int c\u0041c; // { dg-error "not valid in an identifier" }
// $ is OK on most targets; not part of basic source char set
- int c\u0024c; // { dg-error "not valid in an identifier" "" { target { powerpc-ibm-aix* } } }
+ int c\u0024c; // { dg-error "not valid in an identifier" "" { target { { powerpc-ibm-aix* } || c++26 } } }
U"\uD800"; // { dg-error "not a valid universal character" }
Jakub
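
For readers following the test change above, here is a minimal sketch of the rule it relies on, assuming only what the thread states: P2558R2 adds @, $ and ` to the basic character set in C++26, and a UCN naming a basic-character-set member (such as \u0041, and now \u0024) is not accepted in an identifier. The helper name in_basic_character_set and its cxx26 flag are hypothetical, invented for illustration; this is not libcpp code and does not mirror how _cpp_valid_ucn is written.

```cpp
// Hypothetical helper (not from libcpp): reports whether code point C is a
// member of the basic character set.  CXX26 selects the set as extended by
// P2558R2, which adds '$', '@' and '`'.
constexpr bool
in_basic_character_set (char32_t c, bool cxx26)
{
  // The four control characters in the set: HT, VT, FF and LF.
  if (c == U'\t' || c == U'\v' || c == U'\f' || c == U'\n')
    return true;
  // Everything else in the set is printable ASCII (U+0020 .. U+007E).
  if (c < 0x20 || c > 0x7e)
    return false;
  // These three printable ASCII characters only join the set in C++26.
  if (c == U'$' || c == U'@' || c == U'`')
    return cxx26;
  return true;
}

// \u0041 ('A') is a basic character in every mode, so c\u0041c is diagnosed.
static_assert (in_basic_character_set (U'A', false));
// \u0024 ('$') only becomes one with C++26, matching the new c++26 target
// selector on the c\u0024c line of ucn-1.C.
static_assert (!in_basic_character_set (U'$', false));
static_assert (in_basic_character_set (U'$', true));
```

The powerpc-ibm-aix* half of the selector is a separate, target-specific matter that this sketch does not model.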
* Re: [PATCH] c++: Implement C++26 P2558R2 - Add @, $, and ` to the basic character set [PR110343]
From: Jason Merrill @ 2024-07-26 17:25 UTC (permalink / raw)
To: Jakub Jelinek; +Cc: gcc-patches
On 7/26/24 11:55 AM, Jakub Jelinek wrote:
> On Fri, Jul 26, 2024 at 11:43:13AM -0400, Jason Merrill wrote:
>> I'm now seeing a -std=c++26 failure on g++.dg/cpp/ucn-1.C.
>
> I don't remember seeing it when I wrote the patch, but today I see it as
> well.
>
> The following patch seems to fix that, tested on i686-linux, ok for trunk?
OK.
> 2024-07-26 Jakub Jelinek <jakub@redhat.com>
>
> * g++.dg/cpp/ucn-1.C (main): Expect error on c\u0024c identifier also
> for C++26.
>
> --- gcc/testsuite/g++.dg/cpp/ucn-1.C.jj 2020-01-14 20:02:46.702611047 +0100
> +++ gcc/testsuite/g++.dg/cpp/ucn-1.C 2024-07-26 17:52:33.881518790 +0200
> @@ -9,7 +9,7 @@ int main()
>
> int c\u0041c; // { dg-error "not valid in an identifier" }
> // $ is OK on most targets; not part of basic source char set
> - int c\u0024c; // { dg-error "not valid in an identifier" "" { target { powerpc-ibm-aix* } } }
> + int c\u0024c; // { dg-error "not valid in an identifier" "" { target { { powerpc-ibm-aix* } || c++26 } } }
>
> U"\uD800"; // { dg-error "not a valid universal character" }
>
>
>
> Jakub
>