[PATCH] libcpp: Handle extended characters in user-defined literal suffix [PR103902]

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

* [PATCH] libcpp: Handle extended characters in user-defined literal suffix [PR103902]
@ 2022-06-14 21:26 Lewis Hyatt
  2022-06-15 19:06 ` Lewis Hyatt
  0 siblings, 1 reply; 10+ messages in thread
From: Lewis Hyatt @ 2022-06-14 21:26 UTC (permalink / raw)
  To: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 1962 bytes --]

Hello-

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103902

The attached patch resolves PR preprocessor/103902 as described in the patch
message inline below. bootstrap + regtest all languages was successful on
x86-64 Linux, with no new failures:

FAIL 103 103
PASS 542338 542371
UNSUPPORTED 15247 15250
UNTESTED 136 136
XFAIL 4166 4166
XPASS 17 17

Please let me know if it looks OK?

A few questions I have:

- A difference introduced with this patch is that after lexing something
like `operator ""_abc', then `_abc' is added to the identifier hash map,
whereas previously it was not. I feel like this must be OK because with the
optional space as in `operator "" _abc', it would be added with or without the
patch.

- The behavior of `#pragma GCC poison' is not consistent (including prior to
  my patch). I tried to make it more so but there is still one thing I want to
  ask about. Leaving aside extended characters for now, the inconsistency is
  that currently the poison is only checked, when the suffix appears as a
  standalone token.

  #pragma GCC poison _X
  bool operator ""_X (unsigned long long);   //accepted before the patch,
                                             //rejected after it
  bool operator "" _X (unsigned long long);  //rejected either before or after
  const char * operator ""_X (const char *, unsigned long); //accepted before,
                                                            //rejected after
  const char * operator "" _X (const char *, unsigned long); //rejected either

  const char * s = ""_X; //accepted before the patch, rejected after it
  const bool b = 1_X; //accepted before or after ****

I feel like after the patch, the behavior is the expected behavior for all
cases but the last one. Here, we allow the poisoned identifier because it's
not lexed as an identifier, it's lexed as part of a pp-number. Does it seem OK
like this or does it need to be addressed?

Thanks for taking a look!

-Lewis

[-- Attachment #2: udlit_ext1.txt --]
[-- Type: text/plain, Size: 24363 bytes --]

Subject: [PATCH] libcpp: Handle extended characters in user-defined literal suffix [PR103902]

The PR complains that we do not handle UTF-8 in the suffix for a user-defined
literal, such as:

bool operator ""_π (unsigned long long);

In fact we don't handle any extended identifier characters there, whether
UTF-8, UCNs, or the $ sign. We do handle it fine if the optional space after
the "" tokens is included, since then the identifier is lexed in the "normal"
way as its own token. But when it is lexed as part of the string token, this
is handled in lex_string() with a one-off loop that is not aware of extended
characters.

This patch fixes it by adding a new function scan_cur_identifier() that can be
used to lex an identifier while in the middle of lexing another token. It is
somewhat duplicative of the code in lex_identifier(), which handles the normal
case, but I think there's no good way to avoid that without pessimizing the
usual case, since lex_identifier() takes advantage of the fact that the first
character of the identifier has already been analyzed. The code duplication is
somewhat offset by factoring out the identifier lexing diagnostics (e.g. for
poisoned identifiers), which were formerly duplicated in two places, and have
been factored into their own function that's used in (now) 3 places.

BTW, the other place that was lexing identifiers is lex_identifier_intern(),
which is used to implement #pragma push_macro and #pragma pop_macro. This does
not support extended characters either. I will add that in a subsequent patch,
because it can't directly reuse the new function, but rather needs to lex from
a string instead of a cpp_buffer.

With scan_cur_identifier(), we do also correctly warn about bidi and
normalization issues in the extended identifiers comprising the suffix, and we
check for poisoned identifiers there as well.

PR preprocessor/103902

libcpp/ChangeLog:

	* lex.cc (identifier_diagnostics_on_lex): New function refactors
	common code from...
	(lex_identifier_intern): ...here, and...
	(lex_identifier): ...here.
	(struct scan_id_result): New struct to hold the result of...
	(scan_cur_identifier): ...new function.
	(create_literal2): New function.
	(is_macro): Removed function that is now handled directly in
	lex_string() and lex_raw_string().
	(is_macro_not_literal_suffix): Likewise.
	(lit_accum::create_literal2): New function.
	(lex_raw_string): Make use of new function scan_cur_identifier().
	(lex_string): Likewise.

gcc/testsuite/ChangeLog:

	* g++.dg/cpp0x/udlit-extended-id-1.C: New test.
	* g++.dg/cpp0x/udlit-extended-id-2.C: New test.
	* g++.dg/cpp0x/udlit-extended-id-3.C: New test.

diff --git a/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-1.C b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-1.C
new file mode 100644
index 00000000000..411d4fdd0ba
--- /dev/null
+++ b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-1.C
@@ -0,0 +1,68 @@
+// { dg-do run { target c++11 } }
+// { dg-additional-options "-Wno-error=normalized" }
+#include <cstring>
+using namespace std;
+
+constexpr unsigned long long operator "" _π (unsigned long long x)
+{
+  return 3 * x;
+}
+
+/* Historically we didn't parse properly as part of the "" token, so check that
+   as well.  */
+constexpr unsigned long long operator ""_Π2 (unsigned long long x)
+{
+  return 4 * x;
+}
+
+char x1[1_π];
+char x2[2_Π2];
+
+static_assert (sizeof x1 == 3, "test1");
+static_assert (sizeof x2 == 8, "test2");
+
+const char * operator "" _1σ (const char *s, unsigned long)
+{
+  return s + 1;
+}
+
+const char * operator ""_Σ2 (const char *s, unsigned long)
+{
+  return s + 2;
+}
+
+const char * operator "" _\U000000e61 (const char *s, unsigned long)
+{
+  return "ae";
+}
+
+const char* operator ""_\u01532 (const char *s, unsigned long)
+{
+  return "oe";
+}
+
+bool operator "" _\u0BC7\u0BBE (unsigned long long); // { dg-warning "not in NFC" }
+bool operator ""_\u0B47\U00000B3E (unsigned long long); // { dg-warning "not in NFC" }
+
+#define xτy
+const char * str = ""xτy; // { dg-warning "invalid suffix on literal" }
+
+int main()
+{
+  if (3_π != 9)
+    __builtin_abort ();
+  if (4_Π2 != 16)
+    __builtin_abort ();
+  if (strcmp ("abc"_1σ, "bc"))
+    __builtin_abort ();
+  if (strcmp ("abcd"_Σ2, "cd"))
+    __builtin_abort ();
+  if (strcmp (R"(abcdef)"_1σ, "bcdef"))
+    __builtin_abort ();
+  if (strcmp (R"(abcdef)"_Σ2, "cdef"))
+    __builtin_abort ();
+  if (strcmp ("xyz"_æ1, "ae"))
+    __builtin_abort ();
+  if (strcmp ("xyz"_œ2, "oe"))
+    __builtin_abort ();
+}
diff --git a/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-2.C b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-2.C
new file mode 100644
index 00000000000..05a2804a463
--- /dev/null
+++ b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-2.C
@@ -0,0 +1,6 @@
+// { dg-do compile { target c++11 } }
+// { dg-additional-options "-Wbidi-chars=any,ucn" }
+bool operator ""_d\u202ae\u202cf (unsigned long long); // { dg-line line1 }
+// { dg-error "universal character \\\\u202a is not valid in an identifier" "test1" { target *-*-* } line1 }
+// { dg-error "universal character \\\\u202c is not valid in an identifier" "test2" { target *-*-* } line1 }
+// { dg-warning "found problematic Unicode character" "test3" { target *-*-* } line1 }
diff --git a/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-3.C b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-3.C
new file mode 100644
index 00000000000..6db729c3432
--- /dev/null
+++ b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-3.C
@@ -0,0 +1,7 @@
+// { dg-do compile { target c++11 } }
+int _ħ;
+const char * operator ""_ħ (const char *, unsigned long);
+bool operator ""_ħ (unsigned long long x);
+#pragma GCC poison _ħ
+bool b = 1_ħ; // This currently is allowed, is that intended?
+const char *x = "hbar"_ħ; // { dg-error "attempt to use poisoned" }
diff --git a/libcpp/lex.cc b/libcpp/lex.cc
index f891d3e17df..481f01bc5c3 100644
--- a/libcpp/lex.cc
+++ b/libcpp/lex.cc
@@ -1854,8 +1854,11 @@ warn_about_normalization (cpp_reader *pfile,
 
 static const cppchar_t utf8_signifier = 0xC0;
 
-/* Returns TRUE if the sequence starting at buffer->cur is valid in
-   an identifier.  FIRST is TRUE if this starts an identifier.  */
+/* Returns TRUE if the byte sequence starting at buffer->cur is a valid
+   extended character in an identifier.  If FIRST is TRUE, then the character
+   must be valid at the beginning of an identifier as well.  If the return
+   value is TRUE, then pfile->buffer->cur has been moved to point to the next
+   byte after the extended character.  */
 
 static bool
 forms_identifier_p (cpp_reader *pfile, int first,
@@ -1941,6 +1944,122 @@ maybe_va_opt_error (cpp_reader *pfile)
     }
 }
 
+/* Helper function to perform diagnostics that are needed (rarely)
+   when an identifier is lexed.  */
+static void identifier_diagnostics_on_lex (cpp_reader *pfile,
+					   cpp_hashnode *node)
+{
+  if (__builtin_expect (!(node->flags & NODE_DIAGNOSTIC)
+			|| pfile->state.skipping, 1))
+    return;
+
+  /* It is allowed to poison the same identifier twice.  */
+  if ((node->flags & NODE_POISONED) && !pfile->state.poisoned_ok)
+    cpp_error (pfile, CPP_DL_ERROR, "attempt to use poisoned \"%s\"",
+	       NODE_NAME (node));
+
+  /* Constraint 6.10.3.5: __VA_ARGS__ should only appear in the
+     replacement list of a variadic macro.  */
+  if (node == pfile->spec_nodes.n__VA_ARGS__
+      && !pfile->state.va_args_ok)
+    {
+      if (CPP_OPTION (pfile, cplusplus))
+	cpp_error (pfile, CPP_DL_PEDWARN,
+		   "__VA_ARGS__ can only appear in the expansion"
+		   " of a C++11 variadic macro");
+      else
+	cpp_error (pfile, CPP_DL_PEDWARN,
+		   "__VA_ARGS__ can only appear in the expansion"
+		   " of a C99 variadic macro");
+    }
+
+  if (node == pfile->spec_nodes.n__VA_OPT__)
+    maybe_va_opt_error (pfile);
+
+  /* For -Wc++-compat, warn about use of C++ named operators.  */
+  if (node->flags & NODE_WARN_OPERATOR)
+    cpp_warning (pfile, CPP_W_CXX_OPERATOR_NAMES,
+		 "identifier \"%s\" is a special operator name in C++",
+		 NODE_NAME (node));
+}
+
+/* Helper function to scan an entire identifier beginning at
+   pfile->buffer->cur, and possibly containing extended characters (UCNs
+   and/or UTF-8).  Returns the cpp_hashnode for the identifier on success, or
+   else nullptr, as well as a normalize_state so that normalization warnings
+   may be issued once the token lexing is complete.  */
+
+struct scan_id_result
+{
+  cpp_hashnode *node;
+  normalize_state nst;
+
+  scan_id_result ()
+    : node (nullptr)
+  {
+    nst = INITIAL_NORMALIZE_STATE;
+  }
+
+  explicit operator bool () const { return node; }
+};
+
+static scan_id_result
+scan_cur_identifier (cpp_reader *pfile)
+{
+  cpp_buffer *const buffer = pfile->buffer;
+  const uchar *const begin = buffer->cur;
+  scan_id_result result;
+  bool need_extended;
+  unsigned int hash = 0;
+  if (ISIDST (*buffer->cur))
+    {
+      hash = HT_HASHSTEP (0, *buffer->cur);
+      ++buffer->cur;
+      while (ISIDNUM (*buffer->cur))
+	{
+	  hash = HT_HASHSTEP (hash, *buffer->cur);
+	  ++buffer->cur;
+	}
+      NORMALIZE_STATE_UPDATE_IDNUM (&result.nst, buffer->cur[-1]);
+      need_extended = forms_identifier_p (pfile, false, &result.nst);
+    }
+  else
+    {
+      if (!forms_identifier_p (pfile, true, &result.nst))
+	return result;
+      need_extended = true;
+    }
+
+  if (need_extended)
+    {
+      do {
+	while (ISIDNUM (*buffer->cur))
+	  {
+	    NORMALIZE_STATE_UPDATE_IDNUM (&result.nst, *buffer->cur);
+	    ++buffer->cur;
+	  }
+      } while (forms_identifier_p (pfile, false, &result.nst));
+
+      if (pfile->warn_bidi_p ())
+	maybe_warn_bidi_on_close (pfile, buffer->cur);
+
+      result.node = _cpp_interpret_identifier (pfile, begin,
+					       buffer->cur - begin);
+    }
+  else
+    {
+      const size_t len = buffer->cur - begin;
+      hash = HT_HASHFINISH (hash, len);
+      result.node = CPP_HASHNODE (ht_lookup_with_hash (pfile->hash_table,
+						       begin, len,
+						       hash, HT_ALLOC));
+    }
+
+  identifier_diagnostics_on_lex (pfile, result.node);
+  return result;
+}
+
+
 /* Helper function to get the cpp_hashnode of the identifier BASE.  */
 static cpp_hashnode *
 lex_identifier_intern (cpp_reader *pfile, const uchar *base)
@@ -1960,41 +2079,7 @@ lex_identifier_intern (cpp_reader *pfile, const uchar *base)
   hash = HT_HASHFINISH (hash, len);
   result = CPP_HASHNODE (ht_lookup_with_hash (pfile->hash_table,
 					      base, len, hash, HT_ALLOC));
-
-  /* Rarely, identifiers require diagnostics when lexed.  */
-  if (__builtin_expect ((result->flags & NODE_DIAGNOSTIC)
-			&& !pfile->state.skipping, 0))
-    {
-      /* It is allowed to poison the same identifier twice.  */
-      if ((result->flags & NODE_POISONED) && !pfile->state.poisoned_ok)
-	cpp_error (pfile, CPP_DL_ERROR, "attempt to use poisoned \"%s\"",
-		   NODE_NAME (result));
-
-      /* Constraint 6.10.3.5: __VA_ARGS__ should only appear in the
-	 replacement list of a variadic macro.  */
-      if (result == pfile->spec_nodes.n__VA_ARGS__
-	  && !pfile->state.va_args_ok)
-	{
-	  if (CPP_OPTION (pfile, cplusplus))
-	    cpp_error (pfile, CPP_DL_PEDWARN,
-		       "__VA_ARGS__ can only appear in the expansion"
-		       " of a C++11 variadic macro");
-	  else
-	    cpp_error (pfile, CPP_DL_PEDWARN,
-		       "__VA_ARGS__ can only appear in the expansion"
-		       " of a C99 variadic macro");
-	}
-
-      if (result == pfile->spec_nodes.n__VA_OPT__)
-	maybe_va_opt_error (pfile);
-
-      /* For -Wc++-compat, warn about use of C++ named operators.  */
-      if (result->flags & NODE_WARN_OPERATOR)
-	cpp_warning (pfile, CPP_W_CXX_OPERATOR_NAMES,
-		     "identifier \"%s\" is a special operator name in C++",
-		     NODE_NAME (result));
-    }
-
+  identifier_diagnostics_on_lex (pfile, result);
   return result;
 }
 
@@ -2057,42 +2142,7 @@ lex_identifier (cpp_reader *pfile, const uchar *base, bool starts_ucn,
       *spelling = result;
     }
 
-  /* Rarely, identifiers require diagnostics when lexed.  */
-  if (__builtin_expect ((result->flags & NODE_DIAGNOSTIC)
-			&& !pfile->state.skipping, 0))
-    {
-      /* It is allowed to poison the same identifier twice.  */
-      if ((result->flags & NODE_POISONED) && !pfile->state.poisoned_ok)
-	cpp_error (pfile, CPP_DL_ERROR, "attempt to use poisoned \"%s\"",
-		   NODE_NAME (result));
-
-      /* Constraint 6.10.3.5: __VA_ARGS__ should only appear in the
-	 replacement list of a variadic macro.  */
-      if (result == pfile->spec_nodes.n__VA_ARGS__
-	  && !pfile->state.va_args_ok)
-	{
-	  if (CPP_OPTION (pfile, cplusplus))
-	    cpp_error (pfile, CPP_DL_PEDWARN,
-		       "__VA_ARGS__ can only appear in the expansion"
-		       " of a C++11 variadic macro");
-	  else
-	    cpp_error (pfile, CPP_DL_PEDWARN,
-		       "__VA_ARGS__ can only appear in the expansion"
-		       " of a C99 variadic macro");
-	}
-
-      /* __VA_OPT__ should only appear in the replacement list of a
-	 variadic macro.  */
-      if (result == pfile->spec_nodes.n__VA_OPT__)
-	maybe_va_opt_error (pfile);
-
-      /* For -Wc++-compat, warn about use of C++ named operators.  */
-      if (result->flags & NODE_WARN_OPERATOR)
-	cpp_warning (pfile, CPP_W_CXX_OPERATOR_NAMES,
-		     "identifier \"%s\" is a special operator name in C++",
-		     NODE_NAME (result));
-    }
-
+  identifier_diagnostics_on_lex (pfile, result);
   return result;
 }
 
@@ -2152,6 +2202,24 @@ create_literal (cpp_reader *pfile, cpp_token *token, const uchar *base,
   token->val.str.text = cpp_alloc_token_string (pfile, base, len);
 }
 
+/* Like create_literal(), but construct it from two separate strings
+   which are concatenated.  LEN2 may be 0 if no second string is
+   required.  */
+static void
+create_literal2 (cpp_reader *pfile, cpp_token *token, const uchar *base1,
+		 unsigned int len1, const uchar *base2, unsigned int len2,
+		 enum cpp_ttype type)
+{
+  token->type = type;
+  token->val.str.len = len1 + len2;
+  uchar *const dest = _cpp_unaligned_alloc (pfile, len1 + len2 + 1);
+  memcpy (dest, base1, len1);
+  if (len2)
+    memcpy (dest+len1, base2, len2);
+  dest[len1 + len2] = 0;
+  token->val.str.text = dest;
+}
+
 const uchar *
 cpp_alloc_token_string (cpp_reader *pfile,
 			const unsigned char *ptr, unsigned len)
@@ -2190,6 +2258,11 @@ struct lit_accum {
       rpos = NULL;
     return c;
   }
+
+  void create_literal2 (cpp_reader *pfile, cpp_token *token,
+			const uchar *base1, unsigned int len1,
+			const uchar *base2, unsigned int len2,
+			enum cpp_ttype type);
 };
 
 /* Subroutine of lex_raw_string: Append LEN chars from BASE to the buffer
@@ -2232,45 +2305,31 @@ lit_accum::read_begin (cpp_reader *pfile)
   rpos = BUFF_FRONT (last);
 }
 
-/* Returns true if a macro has been defined.
-   This might not work if compile with -save-temps,
-   or preprocess separately from compilation.  */
-
-static bool
-is_macro(cpp_reader *pfile, const uchar *base)
+/* Like create_literal2(), but also prepend all the accumulated data from
+   the lit_accum struct.  */
+void
+lit_accum::create_literal2 (cpp_reader *pfile, cpp_token *token,
+			    const uchar *base1, unsigned int len1,
+			    const uchar *base2, unsigned int len2,
+			    enum cpp_ttype type)
 {
-  const uchar *cur = base;
-  if (! ISIDST (*cur))
-    return false;
-  unsigned int hash = HT_HASHSTEP (0, *cur);
-  ++cur;
-  while (ISIDNUM (*cur))
+  const unsigned int tot_len = accum + len1 + len2;
+  uchar *dest = _cpp_unaligned_alloc (pfile, tot_len + 1);
+  token->type = type;
+  token->val.str.len = tot_len;
+  token->val.str.text = dest;
+  for (_cpp_buff *buf = first; buf; buf = buf->next)
     {
-      hash = HT_HASHSTEP (hash, *cur);
-      ++cur;
+      size_t len = BUFF_FRONT (buf) - buf->base;
+      memcpy (dest, buf->base, len);
+      dest += len;
     }
-  hash = HT_HASHFINISH (hash, cur - base);
-
-  cpp_hashnode *result = CPP_HASHNODE (ht_lookup_with_hash (pfile->hash_table,
-					base, cur - base, hash, HT_NO_INSERT));
-
-  return result && cpp_macro_p (result);
-}
-
-/* Returns true if a literal suffix does not have the expected form
-   and is defined as a macro.  */
-
-static bool
-is_macro_not_literal_suffix(cpp_reader *pfile, const uchar *base)
-{
-  /* User-defined literals outside of namespace std must start with a single
-     underscore, so assume anything of that form really is a UDL suffix.
-     We don't need to worry about UDLs defined inside namespace std because
-     their names are reserved, so cannot be used as macro names in valid
-     programs.  */
-  if (base[0] == '_' && base[1] != '_')
-    return false;
-  return is_macro (pfile, base);
+  memcpy (dest, base1, len1);
+  dest += len1;
+  if (len2)
+    memcpy (dest, base2, len2);
+  dest += len2;
+  *dest = '\0';
 }
 
 /* Lexes a raw string.  The stored string contains the spelling,
@@ -2540,26 +2599,53 @@ lex_raw_string (cpp_reader *pfile, cpp_token *token, const uchar *base)
 
   if (CPP_OPTION (pfile, user_literals))
     {
-      /* If a string format macro, say from inttypes.h, is placed touching
-	 a string literal it could be parsed as a C++11 user-defined string
-	 literal thus breaking the program.  */
-      if (is_macro_not_literal_suffix (pfile, pos))
-	{
-	  /* Raise a warning, but do not consume subsequent tokens.  */
-	  if (CPP_OPTION (pfile, warn_literal_suffix) && !pfile->state.skipping)
-	    cpp_warning_with_line (pfile, CPP_W_LITERAL_SUFFIX,
-				   token->src_loc, 0,
-				   "invalid suffix on literal; C++11 requires "
-				   "a space between literal and string macro");
-	}
-      /* Grab user defined literal suffix.  */
-      else if (ISIDST (*pos))
-	{
-	  type = cpp_userdef_string_add_type (type);
-	  ++pos;
+      const uchar *const suffix_begin = pos;
+      pfile->buffer->cur = pos;
 
-	  while (ISIDNUM (*pos))
-	    ++pos;
+      if (const auto sr = scan_cur_identifier (pfile))
+	{
+	  /* If a string format macro, say from inttypes.h, is placed touching
+	     a string literal it could be parsed as a C++11 user-defined
+	     string literal thus breaking the program.  User-defined literals
+	     outside of namespace std must start with a single underscore, so
+	     assume anything of that form really is a UDL suffix.  We don't
+	     need to worry about UDLs defined inside namespace std because
+	     their names are reserved, so cannot be used as macro names in
+	     valid programs.  */
+	  if ((suffix_begin[0] != '_' || suffix_begin[1] == '_')
+	      && cpp_macro_p (sr.node))
+	    {
+	      /* Maybe raise a warning, but do not consume the tokens.  */
+	      pfile->buffer->cur = suffix_begin;
+	      if (CPP_OPTION (pfile, warn_literal_suffix)
+		  && !pfile->state.skipping)
+		cpp_warning_with_line
+		  (pfile, CPP_W_LITERAL_SUFFIX,
+		   token->src_loc, 0,
+		   "invalid suffix on literal; C++11 requires "
+		   "a space between literal and string macro");
+	    }
+	  else
+	    {
+	      type = cpp_userdef_string_add_type (type);
+	      if (!accum.accum)
+		create_literal2 (pfile, token, base,
+				 suffix_begin - base,
+				 NODE_NAME (sr.node),
+				 NODE_LEN (sr.node),
+				 type);
+	      else
+		{
+		  accum.create_literal2 (pfile, token, base,
+					 suffix_begin - base,
+					 NODE_NAME (sr.node),
+					 NODE_LEN (sr.node),
+					 type);
+		  _cpp_release_buff (pfile, accum.first);
+		}
+	      warn_about_normalization (pfile, token, &sr.nst);
+	      return;
+	    }
 	}
     }
 
@@ -2569,21 +2655,8 @@ lex_raw_string (cpp_reader *pfile, cpp_token *token, const uchar *base)
     create_literal (pfile, token, base, pos - base, type);
   else
     {
-      size_t extra_len = pos - base;
-      uchar *dest = _cpp_unaligned_alloc (pfile, accum.accum + extra_len + 1);
-
-      token->type = type;
-      token->val.str.len = accum.accum + extra_len;
-      token->val.str.text = dest;
-      for (_cpp_buff *buf = accum.first; buf; buf = buf->next)
-	{
-	  size_t len = BUFF_FRONT (buf) - buf->base;
-	  memcpy (dest, buf->base, len);
-	  dest += len;
-	}
+      accum.create_literal2 (pfile, token, base, pos - base, nullptr, 0, type);
       _cpp_release_buff (pfile, accum.first);
-      memcpy (dest, base, extra_len);
-      dest[extra_len] = '\0';
     }
 }
 
@@ -2687,39 +2760,57 @@ lex_string (cpp_reader *pfile, cpp_token *token, const uchar *base)
     cpp_error (pfile, CPP_DL_PEDWARN, "missing terminating %c character",
 	       (int) terminator);
 
+  pfile->buffer->cur = cur;
+
   if (CPP_OPTION (pfile, user_literals))
     {
-      /* If a string format macro, say from inttypes.h, is placed touching
-	 a string literal it could be parsed as a C++11 user-defined string
-	 literal thus breaking the program.  */
-      if (is_macro_not_literal_suffix (pfile, cur))
-	{
-	  /* Raise a warning, but do not consume subsequent tokens.  */
-	  if (CPP_OPTION (pfile, warn_literal_suffix) && !pfile->state.skipping)
-	    cpp_warning_with_line (pfile, CPP_W_LITERAL_SUFFIX,
-				   token->src_loc, 0,
-				   "invalid suffix on literal; C++11 requires "
-				   "a space between literal and string macro");
-	}
-      /* Grab user defined literal suffix.  */
-      else if (ISIDST (*cur))
-	{
-	  type = cpp_userdef_char_add_type (type);
-	  type = cpp_userdef_string_add_type (type);
-          ++cur;
+      const uchar *const suffix_begin = cur;
 
-	  while (ISIDNUM (*cur))
-	    ++cur;
+      if (const auto sr = scan_cur_identifier (pfile))
+	{
+	  /* If a string format macro, say from inttypes.h, is placed touching
+	     a string literal it could be parsed as a C++11 user-defined
+	     string literal thus breaking the program.  User-defined literals
+	     outside of namespace std must start with a single underscore, so
+	     assume anything of that form really is a UDL suffix.  We don't
+	     need to worry about UDLs defined inside namespace std because
+	     their names are reserved, so cannot be used as macro names in
+	     valid programs.  */
+	  if ((suffix_begin[0] != '_' || suffix_begin[1] == '_')
+	      && cpp_macro_p (sr.node))
+	    {
+	      /* Maybe raise a warning, but do not consume the tokens.  */
+	      pfile->buffer->cur = suffix_begin;
+	      if (CPP_OPTION (pfile, warn_literal_suffix)
+		  && !pfile->state.skipping)
+		cpp_warning_with_line
+		  (pfile, CPP_W_LITERAL_SUFFIX,
+		   token->src_loc, 0,
+		   "invalid suffix on literal; C++11 requires "
+		   "a space between literal and string macro");
+	    }
+	  else
+	    {
+	      /* Grab user defined literal suffix.  */
+	      type = cpp_userdef_char_add_type (type);
+	      type = cpp_userdef_string_add_type (type);
+	      create_literal2 (pfile, token, base, suffix_begin - base,
+			       NODE_NAME (sr.node), NODE_LEN (sr.node), type);
+	      warn_about_normalization (pfile, token, &sr.nst);
+	      return;
+	    }
 	}
     }
   else if (CPP_OPTION (pfile, cpp_warn_cxx11_compat)
-	   && is_macro (pfile, cur)
 	   && !pfile->state.skipping)
-    cpp_warning_with_line (pfile, CPP_W_CXX11_COMPAT,
-			   token->src_loc, 0, "C++11 requires a space "
-			   "between string literal and macro");
+    {
+      const auto sr = scan_cur_identifier (pfile);
+      if (sr && cpp_macro_p (sr.node))
+	cpp_warning_with_line (pfile, CPP_W_CXX11_COMPAT,
+			       token->src_loc, 0, "C++11 requires a space "
+			       "between string literal and macro");
+    }
 
-  pfile->buffer->cur = cur;
   create_literal (pfile, token, base, cur - base, type);
 }
 
@@ -4091,7 +4182,7 @@ cpp_digraph2name (enum cpp_ttype type)
 }
 
 /* Write the spelling of an identifier IDENT, using UCNs, to BUFFER.
-   The buffer must already contain the enough space to hold the
+   The buffer must already contain enough space to hold the
    token's spelling.  Returns a pointer to the character after the
    last character written.  */
 unsigned char *
@@ -4113,7 +4204,7 @@ _cpp_spell_ident_ucns (unsigned char *buffer, cpp_hashnode *ident)
 }
 
 /* Write the spelling of a token TOKEN to BUFFER.  The buffer must
-   already contain the enough space to hold the token's spelling.
+   already contain enough space to hold the token's spelling.
    Returns a pointer to the character after the last character written.
    FORSTRING is true if this is to be the spelling after translation
    phase 1 (with the original spelling of extended identifiers), false

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] libcpp: Handle extended characters in user-defined literal suffix [PR103902]
  2022-06-14 21:26 [PATCH] libcpp: Handle extended characters in user-defined literal suffix [PR103902] Lewis Hyatt
@ 2022-06-15 19:06 ` Lewis Hyatt
  2022-09-26 22:27   ` Ping^3: " Lewis Hyatt
  0 siblings, 1 reply; 10+ messages in thread
From: Lewis Hyatt @ 2022-06-15 19:06 UTC (permalink / raw)
  To: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 2546 bytes --]

On Tue, Jun 14, 2022 at 05:26:49PM -0400, Lewis Hyatt wrote:
> Hello-
> 
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103902
> 
> The attached patch resolves PR preprocessor/103902 as described in the patch
> message inline below. bootstrap + regtest all languages was successful on
> x86-64 Linux, with no new failures:
> 
> FAIL 103 103
> PASS 542338 542371
> UNSUPPORTED 15247 15250
> UNTESTED 136 136
> XFAIL 4166 4166
> XPASS 17 17
> 
> Please let me know if it looks OK?
> 
> A few questions I have:
> 
> - A difference introduced with this patch is that after lexing something
> like `operator ""_abc', then `_abc' is added to the identifier hash map,
> whereas previously it was not. I feel like this must be OK because with the
> optional space as in `operator "" _abc', it would be added with or without the
> patch.
> 
> - The behavior of `#pragma GCC poison' is not consistent (including prior to
>   my patch). I tried to make it more so but there is still one thing I want to
>   ask about. Leaving aside extended characters for now, the inconsistency is
>   that currently the poison is only checked, when the suffix appears as a
>   standalone token.
> 
>   #pragma GCC poison _X
>   bool operator ""_X (unsigned long long);   //accepted before the patch,
>                                              //rejected after it
>   bool operator "" _X (unsigned long long);  //rejected either before or after
>   const char * operator ""_X (const char *, unsigned long); //accepted before,
>                                                             //rejected after
>   const char * operator "" _X (const char *, unsigned long); //rejected either
> 
>   const char * s = ""_X; //accepted before the patch, rejected after it
>   const bool b = 1_X; //accepted before or after ****
> 
> I feel like after the patch, the behavior is the expected behavior for all
> cases but the last one. Here, we allow the poisoned identifier because it's
> not lexed as an identifier, it's lexed as part of a pp-number. Does it seem OK
> like this or does it need to be addressed?

Sorry, that version actually did not handle the case of -Wc++11-compat in
c++98 mode correctly. This updated version fixes that and adds the missing
test coverage for that, if you could please review this one instead?

By the way, the pipermail archive seems to permanently mangle UTF-8 in inline
attachments. I attached the patch also gzipped to address that for the
archive, since the new testcases do use non-ASCII characters.

Thanks for taking a look!

-Lewis

[-- Attachment #2: udlit_ext2.txt --]
[-- Type: text/plain, Size: 25484 bytes --]

Subject: [PATCH] libcpp: Handle extended characters in user-defined literal suffix [PR103902]

The PR complains that we do not handle UTF-8 in the suffix for a user-defined
literal, such as:

bool operator ""_π (unsigned long long);

In fact we don't handle any extended identifier characters there, whether
UTF-8, UCNs, or the $ sign. We do handle it fine if the optional space after
the "" tokens is included, since then the identifier is lexed in the "normal"
way as its own token. But when it is lexed as part of the string token, this
is handled in lex_string() with a one-off loop that is not aware of extended
characters.

This patch fixes it by adding a new function scan_cur_identifier() that can be
used to lex an identifier while in the middle of lexing another token. It is
somewhat duplicative of the code in lex_identifier(), which handles the normal
case, but I think there's no good way to avoid that without pessimizing the
usual case, since lex_identifier() takes advantage of the fact that the first
character of the identifier has already been analyzed. The code duplication is
somewhat offset by factoring out the identifier lexing diagnostics (e.g. for
poisoned identifiers), which were formerly duplicated in two places, and have
been factored into their own function that's used in (now) 3 places.

BTW, the other place that was lexing identifiers is lex_identifier_intern(),
which is used to implement #pragma push_macro and #pragma pop_macro. This does
not support extended characters either. I will add that in a subsequent patch,
because it can't directly reuse the new function, but rather needs to lex from
a string instead of a cpp_buffer.

With scan_cur_identifier(), we do also correctly warn about bidi and
normalization issues in the extended identifiers comprising the suffix, and we
check for poisoned identifiers there as well.

PR preprocessor/103902

libcpp/ChangeLog:

	* lex.cc (identifier_diagnostics_on_lex): New function refactors
	common code from...
	(lex_identifier_intern): ...here, and...
	(lex_identifier): ...here.
	(struct scan_id_result): New struct to hold the result of...
	(scan_cur_identifier): ...new function.
	(create_literal2): New function.
	(is_macro): Removed function that is now handled directly in
	lex_string() and lex_raw_string().
	(is_macro_not_literal_suffix): Likewise.
	(lit_accum::create_literal2): New function.
	(lex_raw_string): Make use of new function scan_cur_identifier().
	(lex_string): Likewise.

gcc/testsuite/ChangeLog:

	* g++.dg/cpp0x/udlit-extended-id-1.C: New test.
	* g++.dg/cpp0x/udlit-extended-id-2.C: New test.
	* g++.dg/cpp0x/udlit-extended-id-3.C: New test.
	* g++.dg/cpp0x/udlit-extended-id-4.C: New test.

diff --git a/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-1.C b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-1.C
new file mode 100644
index 00000000000..411d4fdd0ba
--- /dev/null
+++ b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-1.C
@@ -0,0 +1,68 @@
+// { dg-do run { target c++11 } }
+// { dg-additional-options "-Wno-error=normalized" }
+#include <cstring>
+using namespace std;
+
+constexpr unsigned long long operator "" _π (unsigned long long x)
+{
+  return 3 * x;
+}
+
+/* Historically we didn't parse properly as part of the "" token, so check that
+   as well.  */
+constexpr unsigned long long operator ""_Π2 (unsigned long long x)
+{
+  return 4 * x;
+}
+
+char x1[1_π];
+char x2[2_Π2];
+
+static_assert (sizeof x1 == 3, "test1");
+static_assert (sizeof x2 == 8, "test2");
+
+const char * operator "" _1σ (const char *s, unsigned long)
+{
+  return s + 1;
+}
+
+const char * operator ""_Σ2 (const char *s, unsigned long)
+{
+  return s + 2;
+}
+
+const char * operator "" _\U000000e61 (const char *s, unsigned long)
+{
+  return "ae";
+}
+
+const char* operator ""_\u01532 (const char *s, unsigned long)
+{
+  return "oe";
+}
+
+bool operator "" _\u0BC7\u0BBE (unsigned long long); // { dg-warning "not in NFC" }
+bool operator ""_\u0B47\U00000B3E (unsigned long long); // { dg-warning "not in NFC" }
+
+#define xτy
+const char * str = ""xτy; // { dg-warning "invalid suffix on literal" }
+
+int main()
+{
+  if (3_π != 9)
+    __builtin_abort ();
+  if (4_Π2 != 16)
+    __builtin_abort ();
+  if (strcmp ("abc"_1σ, "bc"))
+    __builtin_abort ();
+  if (strcmp ("abcd"_Σ2, "cd"))
+    __builtin_abort ();
+  if (strcmp (R"(abcdef)"_1σ, "bcdef"))
+    __builtin_abort ();
+  if (strcmp (R"(abcdef)"_Σ2, "cdef"))
+    __builtin_abort ();
+  if (strcmp ("xyz"_æ1, "ae"))
+    __builtin_abort ();
+  if (strcmp ("xyz"_œ2, "oe"))
+    __builtin_abort ();
+}
diff --git a/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-2.C b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-2.C
new file mode 100644
index 00000000000..05a2804a463
--- /dev/null
+++ b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-2.C
@@ -0,0 +1,6 @@
+// { dg-do compile { target c++11 } }
+// { dg-additional-options "-Wbidi-chars=any,ucn" }
+bool operator ""_d\u202ae\u202cf (unsigned long long); // { dg-line line1 }
+// { dg-error "universal character \\\\u202a is not valid in an identifier" "test1" { target *-*-* } line1 }
+// { dg-error "universal character \\\\u202c is not valid in an identifier" "test2" { target *-*-* } line1 }
+// { dg-warning "found problematic Unicode character" "test3" { target *-*-* } line1 }
diff --git a/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-3.C b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-3.C
new file mode 100644
index 00000000000..6db729c3432
--- /dev/null
+++ b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-3.C
@@ -0,0 +1,7 @@
+// { dg-do compile { target c++11 } }
+int _ħ;
+const char * operator ""_ħ (const char *, unsigned long);
+bool operator ""_ħ (unsigned long long x);
+#pragma GCC poison _ħ
+bool b = 1_ħ; // This currently is allowed, is that intended?
+const char *x = "hbar"_ħ; // { dg-error "attempt to use poisoned" }
diff --git a/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-4.C b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-4.C
new file mode 100644
index 00000000000..a356eba4a3c
--- /dev/null
+++ b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-4.C
@@ -0,0 +1,15 @@
+// { dg-options "-std=c++98 -Wc++11-compat" }
+#define END ;
+#define εND ;
+#define EηD ;
+#define EN\u0394 ;
+
+const char *s1 = "s1"END // { dg-warning "requires a space between string literal and macro" }
+const char *s2 = "s2"εND // { dg-warning "requires a space between string literal and macro" }
+const char *s3 = "s3"EηD // { dg-warning "requires a space between string literal and macro" }
+const char *s4 = "s4"ENΔ // { dg-warning "requires a space between string literal and macro" }
+
+/* Make sure we did not skip the token also in the case that it wasn't found to
+   be a macro; compilation should fail here.  */
+const char *s5 = "s5"NØT_A_MACRO; // { dg-error "expected ',' or ';' before" }
+
diff --git a/libcpp/lex.cc b/libcpp/lex.cc
index f891d3e17df..b41292f641b 100644
--- a/libcpp/lex.cc
+++ b/libcpp/lex.cc
@@ -1854,8 +1854,11 @@ warn_about_normalization (cpp_reader *pfile,
 
 static const cppchar_t utf8_signifier = 0xC0;
 
-/* Returns TRUE if the sequence starting at buffer->cur is valid in
-   an identifier.  FIRST is TRUE if this starts an identifier.  */
+/* Returns TRUE if the byte sequence starting at buffer->cur is a valid
+   extended character in an identifier.  If FIRST is TRUE, then the character
+   must be valid at the beginning of an identifier as well.  If the return
+   value is TRUE, then pfile->buffer->cur has been moved to point to the next
+   byte after the extended character.  */
 
 static bool
 forms_identifier_p (cpp_reader *pfile, int first,
@@ -1941,6 +1944,122 @@ maybe_va_opt_error (cpp_reader *pfile)
     }
 }
 
+/* Helper function to perform diagnostics that are needed (rarely)
+   when an identifier is lexed.  */
+static void identifier_diagnostics_on_lex (cpp_reader *pfile,
+					   cpp_hashnode *node)
+{
+  if (__builtin_expect (!(node->flags & NODE_DIAGNOSTIC)
+			|| pfile->state.skipping, 1))
+    return;
+
+  /* It is allowed to poison the same identifier twice.  */
+  if ((node->flags & NODE_POISONED) && !pfile->state.poisoned_ok)
+    cpp_error (pfile, CPP_DL_ERROR, "attempt to use poisoned \"%s\"",
+	       NODE_NAME (node));
+
+  /* Constraint 6.10.3.5: __VA_ARGS__ should only appear in the
+     replacement list of a variadic macro.  */
+  if (node == pfile->spec_nodes.n__VA_ARGS__
+      && !pfile->state.va_args_ok)
+    {
+      if (CPP_OPTION (pfile, cplusplus))
+	cpp_error (pfile, CPP_DL_PEDWARN,
+		   "__VA_ARGS__ can only appear in the expansion"
+		   " of a C++11 variadic macro");
+      else
+	cpp_error (pfile, CPP_DL_PEDWARN,
+		   "__VA_ARGS__ can only appear in the expansion"
+		   " of a C99 variadic macro");
+    }
+
+  if (node == pfile->spec_nodes.n__VA_OPT__)
+    maybe_va_opt_error (pfile);
+
+  /* For -Wc++-compat, warn about use of C++ named operators.  */
+  if (node->flags & NODE_WARN_OPERATOR)
+    cpp_warning (pfile, CPP_W_CXX_OPERATOR_NAMES,
+		 "identifier \"%s\" is a special operator name in C++",
+		 NODE_NAME (node));
+}
+
+/* Helper function to scan an entire identifier beginning at
+   pfile->buffer->cur, and possibly containing extended characters (UCNs
+   and/or UTF-8).  Returns the cpp_hashnode for the identifier on success, or
+   else nullptr, as well as a normalize_state so that normalization warnings
+   may be issued once the token lexing is complete.  */
+
+struct scan_id_result
+{
+  cpp_hashnode *node;
+  normalize_state nst;
+
+  scan_id_result ()
+    : node (nullptr)
+  {
+    nst = INITIAL_NORMALIZE_STATE;
+  }
+
+  explicit operator bool () const { return node; }
+};
+
+static scan_id_result
+scan_cur_identifier (cpp_reader *pfile)
+{
+  cpp_buffer *const buffer = pfile->buffer;
+  const uchar *const begin = buffer->cur;
+  scan_id_result result;
+  bool need_extended;
+  unsigned int hash = 0;
+  if (ISIDST (*buffer->cur))
+    {
+      hash = HT_HASHSTEP (0, *buffer->cur);
+      ++buffer->cur;
+      while (ISIDNUM (*buffer->cur))
+	{
+	  hash = HT_HASHSTEP (hash, *buffer->cur);
+	  ++buffer->cur;
+	}
+      NORMALIZE_STATE_UPDATE_IDNUM (&result.nst, buffer->cur[-1]);
+      need_extended = forms_identifier_p (pfile, false, &result.nst);
+    }
+  else
+    {
+      if (!forms_identifier_p (pfile, true, &result.nst))
+	return result;
+      need_extended = true;
+    }
+
+  if (need_extended)
+    {
+      do {
+	while (ISIDNUM (*buffer->cur))
+	  {
+	    NORMALIZE_STATE_UPDATE_IDNUM (&result.nst, *buffer->cur);
+	    ++buffer->cur;
+	  }
+      } while (forms_identifier_p (pfile, false, &result.nst));
+
+      if (pfile->warn_bidi_p ())
+	maybe_warn_bidi_on_close (pfile, buffer->cur);
+
+      result.node = _cpp_interpret_identifier (pfile, begin,
+					       buffer->cur - begin);
+    }
+  else
+    {
+      const size_t len = buffer->cur - begin;
+      hash = HT_HASHFINISH (hash, len);
+      result.node = CPP_HASHNODE (ht_lookup_with_hash (pfile->hash_table,
+						       begin, len,
+						       hash, HT_ALLOC));
+    }
+
+  identifier_diagnostics_on_lex (pfile, result.node);
+  return result;
+}
+
+
 /* Helper function to get the cpp_hashnode of the identifier BASE.  */
 static cpp_hashnode *
 lex_identifier_intern (cpp_reader *pfile, const uchar *base)
@@ -1960,41 +2079,7 @@ lex_identifier_intern (cpp_reader *pfile, const uchar *base)
   hash = HT_HASHFINISH (hash, len);
   result = CPP_HASHNODE (ht_lookup_with_hash (pfile->hash_table,
 					      base, len, hash, HT_ALLOC));
-
-  /* Rarely, identifiers require diagnostics when lexed.  */
-  if (__builtin_expect ((result->flags & NODE_DIAGNOSTIC)
-			&& !pfile->state.skipping, 0))
-    {
-      /* It is allowed to poison the same identifier twice.  */
-      if ((result->flags & NODE_POISONED) && !pfile->state.poisoned_ok)
-	cpp_error (pfile, CPP_DL_ERROR, "attempt to use poisoned \"%s\"",
-		   NODE_NAME (result));
-
-      /* Constraint 6.10.3.5: __VA_ARGS__ should only appear in the
-	 replacement list of a variadic macro.  */
-      if (result == pfile->spec_nodes.n__VA_ARGS__
-	  && !pfile->state.va_args_ok)
-	{
-	  if (CPP_OPTION (pfile, cplusplus))
-	    cpp_error (pfile, CPP_DL_PEDWARN,
-		       "__VA_ARGS__ can only appear in the expansion"
-		       " of a C++11 variadic macro");
-	  else
-	    cpp_error (pfile, CPP_DL_PEDWARN,
-		       "__VA_ARGS__ can only appear in the expansion"
-		       " of a C99 variadic macro");
-	}
-
-      if (result == pfile->spec_nodes.n__VA_OPT__)
-	maybe_va_opt_error (pfile);
-
-      /* For -Wc++-compat, warn about use of C++ named operators.  */
-      if (result->flags & NODE_WARN_OPERATOR)
-	cpp_warning (pfile, CPP_W_CXX_OPERATOR_NAMES,
-		     "identifier \"%s\" is a special operator name in C++",
-		     NODE_NAME (result));
-    }
-
+  identifier_diagnostics_on_lex (pfile, result);
   return result;
 }
 
@@ -2057,42 +2142,7 @@ lex_identifier (cpp_reader *pfile, const uchar *base, bool starts_ucn,
       *spelling = result;
     }
 
-  /* Rarely, identifiers require diagnostics when lexed.  */
-  if (__builtin_expect ((result->flags & NODE_DIAGNOSTIC)
-			&& !pfile->state.skipping, 0))
-    {
-      /* It is allowed to poison the same identifier twice.  */
-      if ((result->flags & NODE_POISONED) && !pfile->state.poisoned_ok)
-	cpp_error (pfile, CPP_DL_ERROR, "attempt to use poisoned \"%s\"",
-		   NODE_NAME (result));
-
-      /* Constraint 6.10.3.5: __VA_ARGS__ should only appear in the
-	 replacement list of a variadic macro.  */
-      if (result == pfile->spec_nodes.n__VA_ARGS__
-	  && !pfile->state.va_args_ok)
-	{
-	  if (CPP_OPTION (pfile, cplusplus))
-	    cpp_error (pfile, CPP_DL_PEDWARN,
-		       "__VA_ARGS__ can only appear in the expansion"
-		       " of a C++11 variadic macro");
-	  else
-	    cpp_error (pfile, CPP_DL_PEDWARN,
-		       "__VA_ARGS__ can only appear in the expansion"
-		       " of a C99 variadic macro");
-	}
-
-      /* __VA_OPT__ should only appear in the replacement list of a
-	 variadic macro.  */
-      if (result == pfile->spec_nodes.n__VA_OPT__)
-	maybe_va_opt_error (pfile);
-
-      /* For -Wc++-compat, warn about use of C++ named operators.  */
-      if (result->flags & NODE_WARN_OPERATOR)
-	cpp_warning (pfile, CPP_W_CXX_OPERATOR_NAMES,
-		     "identifier \"%s\" is a special operator name in C++",
-		     NODE_NAME (result));
-    }
-
+  identifier_diagnostics_on_lex (pfile, result);
   return result;
 }
 
@@ -2152,6 +2202,24 @@ create_literal (cpp_reader *pfile, cpp_token *token, const uchar *base,
   token->val.str.text = cpp_alloc_token_string (pfile, base, len);
 }
 
+/* Like create_literal(), but construct it from two separate strings
+   which are concatenated.  LEN2 may be 0 if no second string is
+   required.  */
+static void
+create_literal2 (cpp_reader *pfile, cpp_token *token, const uchar *base1,
+		 unsigned int len1, const uchar *base2, unsigned int len2,
+		 enum cpp_ttype type)
+{
+  token->type = type;
+  token->val.str.len = len1 + len2;
+  uchar *const dest = _cpp_unaligned_alloc (pfile, len1 + len2 + 1);
+  memcpy (dest, base1, len1);
+  if (len2)
+    memcpy (dest+len1, base2, len2);
+  dest[len1 + len2] = 0;
+  token->val.str.text = dest;
+}
+
 const uchar *
 cpp_alloc_token_string (cpp_reader *pfile,
 			const unsigned char *ptr, unsigned len)
@@ -2190,6 +2258,11 @@ struct lit_accum {
       rpos = NULL;
     return c;
   }
+
+  void create_literal2 (cpp_reader *pfile, cpp_token *token,
+			const uchar *base1, unsigned int len1,
+			const uchar *base2, unsigned int len2,
+			enum cpp_ttype type);
 };
 
 /* Subroutine of lex_raw_string: Append LEN chars from BASE to the buffer
@@ -2232,45 +2305,31 @@ lit_accum::read_begin (cpp_reader *pfile)
   rpos = BUFF_FRONT (last);
 }
 
-/* Returns true if a macro has been defined.
-   This might not work if compile with -save-temps,
-   or preprocess separately from compilation.  */
-
-static bool
-is_macro(cpp_reader *pfile, const uchar *base)
+/* Like create_literal2(), but also prepend all the accumulated data from
+   the lit_accum struct.  */
+void
+lit_accum::create_literal2 (cpp_reader *pfile, cpp_token *token,
+			    const uchar *base1, unsigned int len1,
+			    const uchar *base2, unsigned int len2,
+			    enum cpp_ttype type)
 {
-  const uchar *cur = base;
-  if (! ISIDST (*cur))
-    return false;
-  unsigned int hash = HT_HASHSTEP (0, *cur);
-  ++cur;
-  while (ISIDNUM (*cur))
+  const unsigned int tot_len = accum + len1 + len2;
+  uchar *dest = _cpp_unaligned_alloc (pfile, tot_len + 1);
+  token->type = type;
+  token->val.str.len = tot_len;
+  token->val.str.text = dest;
+  for (_cpp_buff *buf = first; buf; buf = buf->next)
     {
-      hash = HT_HASHSTEP (hash, *cur);
-      ++cur;
+      size_t len = BUFF_FRONT (buf) - buf->base;
+      memcpy (dest, buf->base, len);
+      dest += len;
     }
-  hash = HT_HASHFINISH (hash, cur - base);
-
-  cpp_hashnode *result = CPP_HASHNODE (ht_lookup_with_hash (pfile->hash_table,
-					base, cur - base, hash, HT_NO_INSERT));
-
-  return result && cpp_macro_p (result);
-}
-
-/* Returns true if a literal suffix does not have the expected form
-   and is defined as a macro.  */
-
-static bool
-is_macro_not_literal_suffix(cpp_reader *pfile, const uchar *base)
-{
-  /* User-defined literals outside of namespace std must start with a single
-     underscore, so assume anything of that form really is a UDL suffix.
-     We don't need to worry about UDLs defined inside namespace std because
-     their names are reserved, so cannot be used as macro names in valid
-     programs.  */
-  if (base[0] == '_' && base[1] != '_')
-    return false;
-  return is_macro (pfile, base);
+  memcpy (dest, base1, len1);
+  dest += len1;
+  if (len2)
+    memcpy (dest, base2, len2);
+  dest += len2;
+  *dest = '\0';
 }
 
 /* Lexes a raw string.  The stored string contains the spelling,
@@ -2540,26 +2599,53 @@ lex_raw_string (cpp_reader *pfile, cpp_token *token, const uchar *base)
 
   if (CPP_OPTION (pfile, user_literals))
     {
-      /* If a string format macro, say from inttypes.h, is placed touching
-	 a string literal it could be parsed as a C++11 user-defined string
-	 literal thus breaking the program.  */
-      if (is_macro_not_literal_suffix (pfile, pos))
-	{
-	  /* Raise a warning, but do not consume subsequent tokens.  */
-	  if (CPP_OPTION (pfile, warn_literal_suffix) && !pfile->state.skipping)
-	    cpp_warning_with_line (pfile, CPP_W_LITERAL_SUFFIX,
-				   token->src_loc, 0,
-				   "invalid suffix on literal; C++11 requires "
-				   "a space between literal and string macro");
-	}
-      /* Grab user defined literal suffix.  */
-      else if (ISIDST (*pos))
-	{
-	  type = cpp_userdef_string_add_type (type);
-	  ++pos;
+      const uchar *const suffix_begin = pos;
+      pfile->buffer->cur = pos;
 
-	  while (ISIDNUM (*pos))
-	    ++pos;
+      if (const auto sr = scan_cur_identifier (pfile))
+	{
+	  /* If a string format macro, say from inttypes.h, is placed touching
+	     a string literal it could be parsed as a C++11 user-defined
+	     string literal thus breaking the program.  User-defined literals
+	     outside of namespace std must start with a single underscore, so
+	     assume anything of that form really is a UDL suffix.  We don't
+	     need to worry about UDLs defined inside namespace std because
+	     their names are reserved, so cannot be used as macro names in
+	     valid programs.  */
+	  if ((suffix_begin[0] != '_' || suffix_begin[1] == '_')
+	      && cpp_macro_p (sr.node))
+	    {
+	      /* Maybe raise a warning, but do not consume the tokens.  */
+	      pfile->buffer->cur = suffix_begin;
+	      if (CPP_OPTION (pfile, warn_literal_suffix)
+		  && !pfile->state.skipping)
+		cpp_warning_with_line
+		  (pfile, CPP_W_LITERAL_SUFFIX,
+		   token->src_loc, 0,
+		   "invalid suffix on literal; C++11 requires "
+		   "a space between literal and string macro");
+	    }
+	  else
+	    {
+	      type = cpp_userdef_string_add_type (type);
+	      if (!accum.accum)
+		create_literal2 (pfile, token, base,
+				 suffix_begin - base,
+				 NODE_NAME (sr.node),
+				 NODE_LEN (sr.node),
+				 type);
+	      else
+		{
+		  accum.create_literal2 (pfile, token, base,
+					 suffix_begin - base,
+					 NODE_NAME (sr.node),
+					 NODE_LEN (sr.node),
+					 type);
+		  _cpp_release_buff (pfile, accum.first);
+		}
+	      warn_about_normalization (pfile, token, &sr.nst);
+	      return;
+	    }
 	}
     }
 
@@ -2569,21 +2655,8 @@ lex_raw_string (cpp_reader *pfile, cpp_token *token, const uchar *base)
     create_literal (pfile, token, base, pos - base, type);
   else
     {
-      size_t extra_len = pos - base;
-      uchar *dest = _cpp_unaligned_alloc (pfile, accum.accum + extra_len + 1);
-
-      token->type = type;
-      token->val.str.len = accum.accum + extra_len;
-      token->val.str.text = dest;
-      for (_cpp_buff *buf = accum.first; buf; buf = buf->next)
-	{
-	  size_t len = BUFF_FRONT (buf) - buf->base;
-	  memcpy (dest, buf->base, len);
-	  dest += len;
-	}
+      accum.create_literal2 (pfile, token, base, pos - base, nullptr, 0, type);
       _cpp_release_buff (pfile, accum.first);
-      memcpy (dest, base, extra_len);
-      dest[extra_len] = '\0';
     }
 }
 
@@ -2687,39 +2760,58 @@ lex_string (cpp_reader *pfile, cpp_token *token, const uchar *base)
     cpp_error (pfile, CPP_DL_PEDWARN, "missing terminating %c character",
 	       (int) terminator);
 
+  pfile->buffer->cur = cur;
+  const uchar *const suffix_begin = cur;
+
   if (CPP_OPTION (pfile, user_literals))
     {
-      /* If a string format macro, say from inttypes.h, is placed touching
-	 a string literal it could be parsed as a C++11 user-defined string
-	 literal thus breaking the program.  */
-      if (is_macro_not_literal_suffix (pfile, cur))
-	{
-	  /* Raise a warning, but do not consume subsequent tokens.  */
-	  if (CPP_OPTION (pfile, warn_literal_suffix) && !pfile->state.skipping)
-	    cpp_warning_with_line (pfile, CPP_W_LITERAL_SUFFIX,
-				   token->src_loc, 0,
-				   "invalid suffix on literal; C++11 requires "
-				   "a space between literal and string macro");
-	}
-      /* Grab user defined literal suffix.  */
-      else if (ISIDST (*cur))
+      if (const auto sr = scan_cur_identifier (pfile))
 	{
-	  type = cpp_userdef_char_add_type (type);
-	  type = cpp_userdef_string_add_type (type);
-          ++cur;
-
-	  while (ISIDNUM (*cur))
-	    ++cur;
+	  /* If a string format macro, say from inttypes.h, is placed touching
+	     a string literal it could be parsed as a C++11 user-defined
+	     string literal thus breaking the program.  User-defined literals
+	     outside of namespace std must start with a single underscore, so
+	     assume anything of that form really is a UDL suffix.  We don't
+	     need to worry about UDLs defined inside namespace std because
+	     their names are reserved, so cannot be used as macro names in
+	     valid programs.  */
+	  if ((suffix_begin[0] != '_' || suffix_begin[1] == '_')
+	      && cpp_macro_p (sr.node))
+	    {
+	      /* Maybe raise a warning, but do not consume the tokens.  */
+	      pfile->buffer->cur = suffix_begin;
+	      if (CPP_OPTION (pfile, warn_literal_suffix)
+		  && !pfile->state.skipping)
+		cpp_warning_with_line
+		  (pfile, CPP_W_LITERAL_SUFFIX,
+		   token->src_loc, 0,
+		   "invalid suffix on literal; C++11 requires "
+		   "a space between literal and string macro");
+	    }
+	  else
+	    {
+	      /* Grab user defined literal suffix.  */
+	      type = cpp_userdef_char_add_type (type);
+	      type = cpp_userdef_string_add_type (type);
+	      create_literal2 (pfile, token, base, suffix_begin - base,
+			       NODE_NAME (sr.node), NODE_LEN (sr.node), type);
+	      warn_about_normalization (pfile, token, &sr.nst);
+	      return;
+	    }
 	}
     }
   else if (CPP_OPTION (pfile, cpp_warn_cxx11_compat)
-	   && is_macro (pfile, cur)
 	   && !pfile->state.skipping)
-    cpp_warning_with_line (pfile, CPP_W_CXX11_COMPAT,
-			   token->src_loc, 0, "C++11 requires a space "
-			   "between string literal and macro");
+    {
+      const auto sr = scan_cur_identifier (pfile);
+      /* Maybe raise a warning, but do not consume the tokens.  */
+      pfile->buffer->cur = suffix_begin;
+      if (sr && cpp_macro_p (sr.node))
+	cpp_warning_with_line (pfile, CPP_W_CXX11_COMPAT,
+			       token->src_loc, 0, "C++11 requires a space "
+			       "between string literal and macro");
+    }
 
-  pfile->buffer->cur = cur;
   create_literal (pfile, token, base, cur - base, type);
 }
 
@@ -4091,7 +4183,7 @@ cpp_digraph2name (enum cpp_ttype type)
 }
 
 /* Write the spelling of an identifier IDENT, using UCNs, to BUFFER.
-   The buffer must already contain the enough space to hold the
+   The buffer must already contain enough space to hold the
    token's spelling.  Returns a pointer to the character after the
    last character written.  */
 unsigned char *
@@ -4113,7 +4205,7 @@ _cpp_spell_ident_ucns (unsigned char *buffer, cpp_hashnode *ident)
 }
 
 /* Write the spelling of a token TOKEN to BUFFER.  The buffer must
-   already contain the enough space to hold the token's spelling.
+   already contain enough space to hold the token's spelling.
    Returns a pointer to the character after the last character written.
    FORSTRING is true if this is to be the spelling after translation
    phase 1 (with the original spelling of extended identifiers), false

[-- Attachment #3: udlit_ext2.txt.gz --]
[-- Type: application/x-gunzip, Size: 6499 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Ping^3: [PATCH] libcpp: Handle extended characters in user-defined literal suffix [PR103902]
  2022-06-15 19:06 ` Lewis Hyatt
@ 2022-09-26 22:27   ` Lewis Hyatt
  2023-02-10 16:30     ` Jakub Jelinek
  2023-02-15 18:39     ` Jason Merrill
  0 siblings, 2 replies; 10+ messages in thread
From: Lewis Hyatt @ 2022-09-26 22:27 UTC (permalink / raw)
  To: gcc-patches; +Cc: jason

[-- Attachment #1: Type: text/plain, Size: 3347 bytes --]

On Wed, Jun 15, 2022 at 03:06:16PM -0400, Lewis Hyatt wrote:
> On Tue, Jun 14, 2022 at 05:26:49PM -0400, Lewis Hyatt wrote:
> > Hello-
> > 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103902
> > 
> > The attached patch resolves PR preprocessor/103902 as described in the patch
> > message inline below. bootstrap + regtest all languages was successful on
> > x86-64 Linux, with no new failures:
> > 
> > FAIL 103 103
> > PASS 542338 542371
> > UNSUPPORTED 15247 15250
> > UNTESTED 136 136
> > XFAIL 4166 4166
> > XPASS 17 17
> > 
> > Please let me know if it looks OK?
> > 
> > A few questions I have:
> > 
> > - A difference introduced with this patch is that after lexing something
> > like `operator ""_abc', then `_abc' is added to the identifier hash map,
> > whereas previously it was not. I feel like this must be OK because with the
> > optional space as in `operator "" _abc', it would be added with or without the
> > patch.
> > 
> > - The behavior of `#pragma GCC poison' is not consistent (including prior to
> >   my patch). I tried to make it more so but there is still one thing I want to
> >   ask about. Leaving aside extended characters for now, the inconsistency is
> >   that currently the poison is only checked, when the suffix appears as a
> >   standalone token.
> > 
> >   #pragma GCC poison _X
> >   bool operator ""_X (unsigned long long);   //accepted before the patch,
> >                                              //rejected after it
> >   bool operator "" _X (unsigned long long);  //rejected either before or after
> >   const char * operator ""_X (const char *, unsigned long); //accepted before,
> >                                                             //rejected after
> >   const char * operator "" _X (const char *, unsigned long); //rejected either
> > 
> >   const char * s = ""_X; //accepted before the patch, rejected after it
> >   const bool b = 1_X; //accepted before or after ****
> > 
> > I feel like after the patch, the behavior is the expected behavior for all
> > cases but the last one. Here, we allow the poisoned identifier because it's
> > not lexed as an identifier, it's lexed as part of a pp-number. Does it seem OK
> > like this or does it need to be addressed?
> 
> Sorry, that version actually did not handle the case of -Wc++11-compat in
> c++98 mode correctly. This updated version fixes that and adds the missing
> test coverage for that, if you could please review this one instead?
> 
> By the way, the pipermail archive seems to permanently mangle UTF-8 in inline
> attachments. I attached the patch also gzipped to address that for the
> archive, since the new testcases do use non-ASCII characters.
> 
> Thanks for taking a look!

Hello-

May I please ping this patch again? Joseph suggested that it would be best if
a C++ maintainer has a look at it. This is one of just a few places left where
we don't handle UTF-8 properly in libcpp, it would be really nice to get them
fixed up if there is time to review this patch. Thanks!

https://gcc.gnu.org/pipermail/gcc-patches/2022-June/596704.html

I re-attached it here as it required some trivial rebasing on top of recently
pushed changes. As before, I also attached the gzipped version so that the
UTF-8 testcases show up OK in the online archive, in case that's still an
issue. Thanks for taking a look!

-Lewis

[-- Attachment #2: udlit_ext3.txt --]
[-- Type: text/plain, Size: 25460 bytes --]

[PATCH] libcpp: Handle extended characters in user-defined literal suffix [PR103902]

The PR complains that we do not handle UTF-8 in the suffix for a user-defined
literal, such as:

bool operator ""_π (unsigned long long);

In fact we don't handle any extended identifier characters there, whether
UTF-8, UCNs, or the $ sign. We do handle it fine if the optional space after
the "" tokens is included, since then the identifier is lexed in the "normal"
way as its own token. But when it is lexed as part of the string token, this
is handled in lex_string() with a one-off loop that is not aware of extended
characters.

This patch fixes it by adding a new function scan_cur_identifier() that can be
used to lex an identifier while in the middle of lexing another token. It is
somewhat duplicative of the code in lex_identifier(), which handles the normal
case, but I think there's no good way to avoid that without pessimizing the
usual case, since lex_identifier() takes advantage of the fact that the first
character of the identifier has already been analyzed. The code duplication is
somewhat offset by factoring out the identifier lexing diagnostics (e.g. for
poisoned identifiers), which were formerly duplicated in two places, and have
been factored into their own function that's used in (now) 3 places.

BTW, the other place that was lexing identifiers is lex_identifier_intern(),
which is used to implement #pragma push_macro and #pragma pop_macro. This does
not support extended characters either. I will add that in a subsequent patch,
because it can't directly reuse the new function, but rather needs to lex from
a string instead of a cpp_buffer.

With scan_cur_identifier(), we do also correctly warn about bidi and
normalization issues in the extended identifiers comprising the suffix, and we
check for poisoned identifiers there as well.

libcpp/ChangeLog:

	PR preprocessor/103902
	* lex.cc (identifier_diagnostics_on_lex): New function refactors
	common code from...
	(lex_identifier_intern): ...here, and...
	(lex_identifier): ...here.
	(struct scan_id_result): New struct to hold the result of...
	(scan_cur_identifier): ...new function.
	(create_literal2): New function.
	(is_macro): Removed function that is now handled directly in
	lex_string() and lex_raw_string().
	(is_macro_not_literal_suffix): Likewise.
	(lit_accum::create_literal2): New function.
	(lex_raw_string): Make use of new function scan_cur_identifier().
	(lex_string): Likewise.

gcc/testsuite/ChangeLog:

	PR preprocessor/103902
	* g++.dg/cpp0x/udlit-extended-id-1.C: New test.
	* g++.dg/cpp0x/udlit-extended-id-2.C: New test.
	* g++.dg/cpp0x/udlit-extended-id-3.C: New test.
	* g++.dg/cpp0x/udlit-extended-id-4.C: New test.

diff --git a/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-1.C b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-1.C
new file mode 100644
index 00000000000..411d4fdd0ba
--- /dev/null
+++ b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-1.C
@@ -0,0 +1,68 @@
+// { dg-do run { target c++11 } }
+// { dg-additional-options "-Wno-error=normalized" }
+#include <cstring>
+using namespace std;
+
+constexpr unsigned long long operator "" _π (unsigned long long x)
+{
+  return 3 * x;
+}
+
+/* Historically we didn't parse properly as part of the "" token, so check that
+   as well.  */
+constexpr unsigned long long operator ""_Π2 (unsigned long long x)
+{
+  return 4 * x;
+}
+
+char x1[1_π];
+char x2[2_Π2];
+
+static_assert (sizeof x1 == 3, "test1");
+static_assert (sizeof x2 == 8, "test2");
+
+const char * operator "" _1σ (const char *s, unsigned long)
+{
+  return s + 1;
+}
+
+const char * operator ""_Σ2 (const char *s, unsigned long)
+{
+  return s + 2;
+}
+
+const char * operator "" _\U000000e61 (const char *s, unsigned long)
+{
+  return "ae";
+}
+
+const char* operator ""_\u01532 (const char *s, unsigned long)
+{
+  return "oe";
+}
+
+bool operator "" _\u0BC7\u0BBE (unsigned long long); // { dg-warning "not in NFC" }
+bool operator ""_\u0B47\U00000B3E (unsigned long long); // { dg-warning "not in NFC" }
+
+#define xτy
+const char * str = ""xτy; // { dg-warning "invalid suffix on literal" }
+
+int main()
+{
+  if (3_π != 9)
+    __builtin_abort ();
+  if (4_Π2 != 16)
+    __builtin_abort ();
+  if (strcmp ("abc"_1σ, "bc"))
+    __builtin_abort ();
+  if (strcmp ("abcd"_Σ2, "cd"))
+    __builtin_abort ();
+  if (strcmp (R"(abcdef)"_1σ, "bcdef"))
+    __builtin_abort ();
+  if (strcmp (R"(abcdef)"_Σ2, "cdef"))
+    __builtin_abort ();
+  if (strcmp ("xyz"_æ1, "ae"))
+    __builtin_abort ();
+  if (strcmp ("xyz"_œ2, "oe"))
+    __builtin_abort ();
+}
diff --git a/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-2.C b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-2.C
new file mode 100644
index 00000000000..05a2804a463
--- /dev/null
+++ b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-2.C
@@ -0,0 +1,6 @@
+// { dg-do compile { target c++11 } }
+// { dg-additional-options "-Wbidi-chars=any,ucn" }
+bool operator ""_d\u202ae\u202cf (unsigned long long); // { dg-line line1 }
+// { dg-error "universal character \\\\u202a is not valid in an identifier" "test1" { target *-*-* } line1 }
+// { dg-error "universal character \\\\u202c is not valid in an identifier" "test2" { target *-*-* } line1 }
+// { dg-warning "found problematic Unicode character" "test3" { target *-*-* } line1 }
diff --git a/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-3.C b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-3.C
new file mode 100644
index 00000000000..6db729c3432
--- /dev/null
+++ b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-3.C
@@ -0,0 +1,7 @@
+// { dg-do compile { target c++11 } }
+int _ħ;
+const char * operator ""_ħ (const char *, unsigned long);
+bool operator ""_ħ (unsigned long long x);
+#pragma GCC poison _ħ
+bool b = 1_ħ; // This currently is allowed, is that intended?
+const char *x = "hbar"_ħ; // { dg-error "attempt to use poisoned" }
diff --git a/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-4.C b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-4.C
new file mode 100644
index 00000000000..a356eba4a3c
--- /dev/null
+++ b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-4.C
@@ -0,0 +1,15 @@
+// { dg-options "-std=c++98 -Wc++11-compat" }
+#define END ;
+#define εND ;
+#define EηD ;
+#define EN\u0394 ;
+
+const char *s1 = "s1"END // { dg-warning "requires a space between string literal and macro" }
+const char *s2 = "s2"εND // { dg-warning "requires a space between string literal and macro" }
+const char *s3 = "s3"EηD // { dg-warning "requires a space between string literal and macro" }
+const char *s4 = "s4"ENΔ // { dg-warning "requires a space between string literal and macro" }
+
+/* Make sure we did not skip the token also in the case that it wasn't found to
+   be a macro; compilation should fail here.  */
+const char *s5 = "s5"NØT_A_MACRO; // { dg-error "expected ',' or ';' before" }
+
diff --git a/libcpp/lex.cc b/libcpp/lex.cc
index 41f905dea16..f93a883acce 100644
--- a/libcpp/lex.cc
+++ b/libcpp/lex.cc
@@ -2052,8 +2052,11 @@ warn_about_normalization (cpp_reader *pfile,
     }
 }
 
-/* Returns TRUE if the sequence starting at buffer->cur is valid in
-   an identifier.  FIRST is TRUE if this starts an identifier.  */
+/* Returns TRUE if the byte sequence starting at buffer->cur is a valid
+   extended character in an identifier.  If FIRST is TRUE, then the character
+   must be valid at the beginning of an identifier as well.  If the return
+   value is TRUE, then pfile->buffer->cur has been moved to point to the next
+   byte after the extended character.  */
 
 static bool
 forms_identifier_p (cpp_reader *pfile, int first,
@@ -2143,6 +2146,122 @@ maybe_va_opt_error (cpp_reader *pfile)
     }
 }
 
+/* Helper function to perform diagnostics that are needed (rarely)
+   when an identifier is lexed.  */
+static void identifier_diagnostics_on_lex (cpp_reader *pfile,
+					   cpp_hashnode *node)
+{
+  if (__builtin_expect (!(node->flags & NODE_DIAGNOSTIC)
+			|| pfile->state.skipping, 1))
+    return;
+
+  /* It is allowed to poison the same identifier twice.  */
+  if ((node->flags & NODE_POISONED) && !pfile->state.poisoned_ok)
+    cpp_error (pfile, CPP_DL_ERROR, "attempt to use poisoned \"%s\"",
+	       NODE_NAME (node));
+
+  /* Constraint 6.10.3.5: __VA_ARGS__ should only appear in the
+     replacement list of a variadic macro.  */
+  if (node == pfile->spec_nodes.n__VA_ARGS__
+      && !pfile->state.va_args_ok)
+    {
+      if (CPP_OPTION (pfile, cplusplus))
+	cpp_error (pfile, CPP_DL_PEDWARN,
+		   "__VA_ARGS__ can only appear in the expansion"
+		   " of a C++11 variadic macro");
+      else
+	cpp_error (pfile, CPP_DL_PEDWARN,
+		   "__VA_ARGS__ can only appear in the expansion"
+		   " of a C99 variadic macro");
+    }
+
+  if (node == pfile->spec_nodes.n__VA_OPT__)
+    maybe_va_opt_error (pfile);
+
+  /* For -Wc++-compat, warn about use of C++ named operators.  */
+  if (node->flags & NODE_WARN_OPERATOR)
+    cpp_warning (pfile, CPP_W_CXX_OPERATOR_NAMES,
+		 "identifier \"%s\" is a special operator name in C++",
+		 NODE_NAME (node));
+}
+
+/* Helper function to scan an entire identifier beginning at
+   pfile->buffer->cur, and possibly containing extended characters (UCNs
+   and/or UTF-8).  Returns the cpp_hashnode for the identifier on success, or
+   else nullptr, as well as a normalize_state so that normalization warnings
+   may be issued once the token lexing is complete.  */
+
+struct scan_id_result
+{
+  cpp_hashnode *node;
+  normalize_state nst;
+
+  scan_id_result ()
+    : node (nullptr)
+  {
+    nst = INITIAL_NORMALIZE_STATE;
+  }
+
+  explicit operator bool () const { return node; }
+};
+
+static scan_id_result
+scan_cur_identifier (cpp_reader *pfile)
+{
+  cpp_buffer *const buffer = pfile->buffer;
+  const uchar *const begin = buffer->cur;
+  scan_id_result result;
+  bool need_extended;
+  unsigned int hash = 0;
+  if (ISIDST (*buffer->cur))
+    {
+      hash = HT_HASHSTEP (0, *buffer->cur);
+      ++buffer->cur;
+      while (ISIDNUM (*buffer->cur))
+	{
+	  hash = HT_HASHSTEP (hash, *buffer->cur);
+	  ++buffer->cur;
+	}
+      NORMALIZE_STATE_UPDATE_IDNUM (&result.nst, buffer->cur[-1]);
+      need_extended = forms_identifier_p (pfile, false, &result.nst);
+    }
+  else
+    {
+      if (!forms_identifier_p (pfile, true, &result.nst))
+	return result;
+      need_extended = true;
+    }
+
+  if (need_extended)
+    {
+      do {
+	while (ISIDNUM (*buffer->cur))
+	  {
+	    NORMALIZE_STATE_UPDATE_IDNUM (&result.nst, *buffer->cur);
+	    ++buffer->cur;
+	  }
+      } while (forms_identifier_p (pfile, false, &result.nst));
+
+      if (pfile->warn_bidi_p ())
+	maybe_warn_bidi_on_close (pfile, buffer->cur);
+
+      result.node = _cpp_interpret_identifier (pfile, begin,
+					       buffer->cur - begin);
+    }
+  else
+    {
+      const size_t len = buffer->cur - begin;
+      hash = HT_HASHFINISH (hash, len);
+      result.node = CPP_HASHNODE (ht_lookup_with_hash (pfile->hash_table,
+						       begin, len,
+						       hash, HT_ALLOC));
+    }
+
+  identifier_diagnostics_on_lex (pfile, result.node);
+  return result;
+}
+
+
 /* Helper function to get the cpp_hashnode of the identifier BASE.  */
 static cpp_hashnode *
 lex_identifier_intern (cpp_reader *pfile, const uchar *base)
@@ -2162,41 +2281,7 @@ lex_identifier_intern (cpp_reader *pfile, const uchar *base)
   hash = HT_HASHFINISH (hash, len);
   result = CPP_HASHNODE (ht_lookup_with_hash (pfile->hash_table,
 					      base, len, hash, HT_ALLOC));
-
-  /* Rarely, identifiers require diagnostics when lexed.  */
-  if (__builtin_expect ((result->flags & NODE_DIAGNOSTIC)
-			&& !pfile->state.skipping, 0))
-    {
-      /* It is allowed to poison the same identifier twice.  */
-      if ((result->flags & NODE_POISONED) && !pfile->state.poisoned_ok)
-	cpp_error (pfile, CPP_DL_ERROR, "attempt to use poisoned \"%s\"",
-		   NODE_NAME (result));
-
-      /* Constraint 6.10.3.5: __VA_ARGS__ should only appear in the
-	 replacement list of a variadic macro.  */
-      if (result == pfile->spec_nodes.n__VA_ARGS__
-	  && !pfile->state.va_args_ok)
-	{
-	  if (CPP_OPTION (pfile, cplusplus))
-	    cpp_error (pfile, CPP_DL_PEDWARN,
-		       "__VA_ARGS__ can only appear in the expansion"
-		       " of a C++11 variadic macro");
-	  else
-	    cpp_error (pfile, CPP_DL_PEDWARN,
-		       "__VA_ARGS__ can only appear in the expansion"
-		       " of a C99 variadic macro");
-	}
-
-      if (result == pfile->spec_nodes.n__VA_OPT__)
-	maybe_va_opt_error (pfile);
-
-      /* For -Wc++-compat, warn about use of C++ named operators.  */
-      if (result->flags & NODE_WARN_OPERATOR)
-	cpp_warning (pfile, CPP_W_CXX_OPERATOR_NAMES,
-		     "identifier \"%s\" is a special operator name in C++",
-		     NODE_NAME (result));
-    }
-
+  identifier_diagnostics_on_lex (pfile, result);
   return result;
 }
 
@@ -2259,42 +2344,7 @@ lex_identifier (cpp_reader *pfile, const uchar *base, bool starts_ucn,
       *spelling = result;
     }
 
-  /* Rarely, identifiers require diagnostics when lexed.  */
-  if (__builtin_expect ((result->flags & NODE_DIAGNOSTIC)
-			&& !pfile->state.skipping, 0))
-    {
-      /* It is allowed to poison the same identifier twice.  */
-      if ((result->flags & NODE_POISONED) && !pfile->state.poisoned_ok)
-	cpp_error (pfile, CPP_DL_ERROR, "attempt to use poisoned \"%s\"",
-		   NODE_NAME (result));
-
-      /* Constraint 6.10.3.5: __VA_ARGS__ should only appear in the
-	 replacement list of a variadic macro.  */
-      if (result == pfile->spec_nodes.n__VA_ARGS__
-	  && !pfile->state.va_args_ok)
-	{
-	  if (CPP_OPTION (pfile, cplusplus))
-	    cpp_error (pfile, CPP_DL_PEDWARN,
-		       "__VA_ARGS__ can only appear in the expansion"
-		       " of a C++11 variadic macro");
-	  else
-	    cpp_error (pfile, CPP_DL_PEDWARN,
-		       "__VA_ARGS__ can only appear in the expansion"
-		       " of a C99 variadic macro");
-	}
-
-      /* __VA_OPT__ should only appear in the replacement list of a
-	 variadic macro.  */
-      if (result == pfile->spec_nodes.n__VA_OPT__)
-	maybe_va_opt_error (pfile);
-
-      /* For -Wc++-compat, warn about use of C++ named operators.  */
-      if (result->flags & NODE_WARN_OPERATOR)
-	cpp_warning (pfile, CPP_W_CXX_OPERATOR_NAMES,
-		     "identifier \"%s\" is a special operator name in C++",
-		     NODE_NAME (result));
-    }
-
+  identifier_diagnostics_on_lex (pfile, result);
   return result;
 }
 
@@ -2354,6 +2404,24 @@ create_literal (cpp_reader *pfile, cpp_token *token, const uchar *base,
   token->val.str.text = cpp_alloc_token_string (pfile, base, len);
 }
 
+/* Like create_literal(), but construct it from two separate strings
+   which are concatenated.  LEN2 may be 0 if no second string is
+   required.  */
+static void
+create_literal2 (cpp_reader *pfile, cpp_token *token, const uchar *base1,
+		 unsigned int len1, const uchar *base2, unsigned int len2,
+		 enum cpp_ttype type)
+{
+  token->type = type;
+  token->val.str.len = len1 + len2;
+  uchar *const dest = _cpp_unaligned_alloc (pfile, len1 + len2 + 1);
+  memcpy (dest, base1, len1);
+  if (len2)
+    memcpy (dest+len1, base2, len2);
+  dest[len1 + len2] = 0;
+  token->val.str.text = dest;
+}
+
 const uchar *
 cpp_alloc_token_string (cpp_reader *pfile,
 			const unsigned char *ptr, unsigned len)
@@ -2392,6 +2460,11 @@ struct lit_accum {
       rpos = NULL;
     return c;
   }
+
+  void create_literal2 (cpp_reader *pfile, cpp_token *token,
+			const uchar *base1, unsigned int len1,
+			const uchar *base2, unsigned int len2,
+			enum cpp_ttype type);
 };
 
 /* Subroutine of lex_raw_string: Append LEN chars from BASE to the buffer
@@ -2434,45 +2507,31 @@ lit_accum::read_begin (cpp_reader *pfile)
   rpos = BUFF_FRONT (last);
 }
 
-/* Returns true if a macro has been defined.
-   This might not work if compile with -save-temps,
-   or preprocess separately from compilation.  */
-
-static bool
-is_macro(cpp_reader *pfile, const uchar *base)
+/* Like create_literal2(), but also prepend all the accumulated data from
+   the lit_accum struct.  */
+void
+lit_accum::create_literal2 (cpp_reader *pfile, cpp_token *token,
+			    const uchar *base1, unsigned int len1,
+			    const uchar *base2, unsigned int len2,
+			    enum cpp_ttype type)
 {
-  const uchar *cur = base;
-  if (! ISIDST (*cur))
-    return false;
-  unsigned int hash = HT_HASHSTEP (0, *cur);
-  ++cur;
-  while (ISIDNUM (*cur))
+  const unsigned int tot_len = accum + len1 + len2;
+  uchar *dest = _cpp_unaligned_alloc (pfile, tot_len + 1);
+  token->type = type;
+  token->val.str.len = tot_len;
+  token->val.str.text = dest;
+  for (_cpp_buff *buf = first; buf; buf = buf->next)
     {
-      hash = HT_HASHSTEP (hash, *cur);
-      ++cur;
+      size_t len = BUFF_FRONT (buf) - buf->base;
+      memcpy (dest, buf->base, len);
+      dest += len;
     }
-  hash = HT_HASHFINISH (hash, cur - base);
-
-  cpp_hashnode *result = CPP_HASHNODE (ht_lookup_with_hash (pfile->hash_table,
-					base, cur - base, hash, HT_NO_INSERT));
-
-  return result && cpp_macro_p (result);
-}
-
-/* Returns true if a literal suffix does not have the expected form
-   and is defined as a macro.  */
-
-static bool
-is_macro_not_literal_suffix(cpp_reader *pfile, const uchar *base)
-{
-  /* User-defined literals outside of namespace std must start with a single
-     underscore, so assume anything of that form really is a UDL suffix.
-     We don't need to worry about UDLs defined inside namespace std because
-     their names are reserved, so cannot be used as macro names in valid
-     programs.  */
-  if (base[0] == '_' && base[1] != '_')
-    return false;
-  return is_macro (pfile, base);
+  memcpy (dest, base1, len1);
+  dest += len1;
+  if (len2)
+    memcpy (dest, base2, len2);
+  dest += len2;
+  *dest = '\0';
 }
 
 /* Lexes a raw string.  The stored string contains the spelling,
@@ -2741,26 +2800,53 @@ lex_raw_string (cpp_reader *pfile, cpp_token *token, const uchar *base)
 
   if (CPP_OPTION (pfile, user_literals))
     {
-      /* If a string format macro, say from inttypes.h, is placed touching
-	 a string literal it could be parsed as a C++11 user-defined string
-	 literal thus breaking the program.  */
-      if (is_macro_not_literal_suffix (pfile, pos))
-	{
-	  /* Raise a warning, but do not consume subsequent tokens.  */
-	  if (CPP_OPTION (pfile, warn_literal_suffix) && !pfile->state.skipping)
-	    cpp_warning_with_line (pfile, CPP_W_LITERAL_SUFFIX,
-				   token->src_loc, 0,
-				   "invalid suffix on literal; C++11 requires "
-				   "a space between literal and string macro");
-	}
-      /* Grab user defined literal suffix.  */
-      else if (ISIDST (*pos))
-	{
-	  type = cpp_userdef_string_add_type (type);
-	  ++pos;
+      const uchar *const suffix_begin = pos;
+      pfile->buffer->cur = pos;
 
-	  while (ISIDNUM (*pos))
-	    ++pos;
+      if (const auto sr = scan_cur_identifier (pfile))
+	{
+	  /* If a string format macro, say from inttypes.h, is placed touching
+	     a string literal it could be parsed as a C++11 user-defined
+	     string literal thus breaking the program.  User-defined literals
+	     outside of namespace std must start with a single underscore, so
+	     assume anything of that form really is a UDL suffix.  We don't
+	     need to worry about UDLs defined inside namespace std because
+	     their names are reserved, so cannot be used as macro names in
+	     valid programs.  */
+	  if ((suffix_begin[0] != '_' || suffix_begin[1] == '_')
+	      && cpp_macro_p (sr.node))
+	    {
+	      /* Maybe raise a warning, but do not consume the tokens.  */
+	      pfile->buffer->cur = suffix_begin;
+	      if (CPP_OPTION (pfile, warn_literal_suffix)
+		  && !pfile->state.skipping)
+		cpp_warning_with_line
+		  (pfile, CPP_W_LITERAL_SUFFIX,
+		   token->src_loc, 0,
+		   "invalid suffix on literal; C++11 requires "
+		   "a space between literal and string macro");
+	    }
+	  else
+	    {
+	      type = cpp_userdef_string_add_type (type);
+	      if (!accum.accum)
+		create_literal2 (pfile, token, base,
+				 suffix_begin - base,
+				 NODE_NAME (sr.node),
+				 NODE_LEN (sr.node),
+				 type);
+	      else
+		{
+		  accum.create_literal2 (pfile, token, base,
+					 suffix_begin - base,
+					 NODE_NAME (sr.node),
+					 NODE_LEN (sr.node),
+					 type);
+		  _cpp_release_buff (pfile, accum.first);
+		}
+	      warn_about_normalization (pfile, token, &sr.nst);
+	      return;
+	    }
 	}
     }
 
@@ -2770,21 +2856,8 @@ lex_raw_string (cpp_reader *pfile, cpp_token *token, const uchar *base)
     create_literal (pfile, token, base, pos - base, type);
   else
     {
-      size_t extra_len = pos - base;
-      uchar *dest = _cpp_unaligned_alloc (pfile, accum.accum + extra_len + 1);
-
-      token->type = type;
-      token->val.str.len = accum.accum + extra_len;
-      token->val.str.text = dest;
-      for (_cpp_buff *buf = accum.first; buf; buf = buf->next)
-	{
-	  size_t len = BUFF_FRONT (buf) - buf->base;
-	  memcpy (dest, buf->base, len);
-	  dest += len;
-	}
+      accum.create_literal2 (pfile, token, base, pos - base, nullptr, 0, type);
       _cpp_release_buff (pfile, accum.first);
-      memcpy (dest, base, extra_len);
-      dest[extra_len] = '\0';
     }
 }
 
@@ -2891,39 +2964,58 @@ lex_string (cpp_reader *pfile, cpp_token *token, const uchar *base)
     cpp_error (pfile, CPP_DL_PEDWARN, "missing terminating %c character",
 	       (int) terminator);
 
+  pfile->buffer->cur = cur;
+  const uchar *const suffix_begin = cur;
+
   if (CPP_OPTION (pfile, user_literals))
     {
-      /* If a string format macro, say from inttypes.h, is placed touching
-	 a string literal it could be parsed as a C++11 user-defined string
-	 literal thus breaking the program.  */
-      if (is_macro_not_literal_suffix (pfile, cur))
+      if (const auto sr = scan_cur_identifier (pfile))
 	{
-	  /* Raise a warning, but do not consume subsequent tokens.  */
-	  if (CPP_OPTION (pfile, warn_literal_suffix) && !pfile->state.skipping)
-	    cpp_warning_with_line (pfile, CPP_W_LITERAL_SUFFIX,
-				   token->src_loc, 0,
-				   "invalid suffix on literal; C++11 requires "
-				   "a space between literal and string macro");
-	}
-      /* Grab user defined literal suffix.  */
-      else if (ISIDST (*cur))
-	{
-	  type = cpp_userdef_char_add_type (type);
-	  type = cpp_userdef_string_add_type (type);
-          ++cur;
-
-	  while (ISIDNUM (*cur))
-	    ++cur;
+	  /* If a string format macro, say from inttypes.h, is placed touching
+	     a string literal it could be parsed as a C++11 user-defined
+	     string literal thus breaking the program.  User-defined literals
+	     outside of namespace std must start with a single underscore, so
+	     assume anything of that form really is a UDL suffix.  We don't
+	     need to worry about UDLs defined inside namespace std because
+	     their names are reserved, so cannot be used as macro names in
+	     valid programs.  */
+	  if ((suffix_begin[0] != '_' || suffix_begin[1] == '_')
+	      && cpp_macro_p (sr.node))
+	    {
+	      /* Maybe raise a warning, but do not consume the tokens.  */
+	      pfile->buffer->cur = suffix_begin;
+	      if (CPP_OPTION (pfile, warn_literal_suffix)
+		  && !pfile->state.skipping)
+		cpp_warning_with_line
+		  (pfile, CPP_W_LITERAL_SUFFIX,
+		   token->src_loc, 0,
+		   "invalid suffix on literal; C++11 requires "
+		   "a space between literal and string macro");
+	    }
+	  else
+	    {
+	      /* Grab user defined literal suffix.  */
+	      type = cpp_userdef_char_add_type (type);
+	      type = cpp_userdef_string_add_type (type);
+	      create_literal2 (pfile, token, base, suffix_begin - base,
+			       NODE_NAME (sr.node), NODE_LEN (sr.node), type);
+	      warn_about_normalization (pfile, token, &sr.nst);
+	      return;
+	    }
 	}
     }
   else if (CPP_OPTION (pfile, cpp_warn_cxx11_compat)
-	   && is_macro (pfile, cur)
 	   && !pfile->state.skipping)
-    cpp_warning_with_line (pfile, CPP_W_CXX11_COMPAT,
-			   token->src_loc, 0, "C++11 requires a space "
-			   "between string literal and macro");
+    {
+      const auto sr = scan_cur_identifier (pfile);
+      /* Maybe raise a warning, but do not consume the tokens.  */
+      pfile->buffer->cur = suffix_begin;
+      if (sr && cpp_macro_p (sr.node))
+	cpp_warning_with_line (pfile, CPP_W_CXX11_COMPAT,
+			       token->src_loc, 0, "C++11 requires a space "
+			       "between string literal and macro");
+    }
 
-  pfile->buffer->cur = cur;
   create_literal (pfile, token, base, cur - base, type);
 }
 
@@ -4322,7 +4414,7 @@ cpp_digraph2name (enum cpp_ttype type)
 }
 
 /* Write the spelling of an identifier IDENT, using UCNs, to BUFFER.
-   The buffer must already contain the enough space to hold the
+   The buffer must already contain enough space to hold the
    token's spelling.  Returns a pointer to the character after the
    last character written.  */
 unsigned char *
@@ -4344,7 +4436,7 @@ _cpp_spell_ident_ucns (unsigned char *buffer, cpp_hashnode *ident)
 }
 
 /* Write the spelling of a token TOKEN to BUFFER.  The buffer must
-   already contain the enough space to hold the token's spelling.
+   already contain enough space to hold the token's spelling.
    Returns a pointer to the character after the last character written.
    FORSTRING is true if this is to be the spelling after translation
    phase 1 (with the original spelling of extended identifiers), false

[-- Attachment #3: udlit_ext3.txt.gz --]
[-- Type: application/x-gunzip, Size: 6461 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Ping^3: [PATCH] libcpp: Handle extended characters in user-defined literal suffix [PR103902]
  2022-09-26 22:27   ` Ping^3: " Lewis Hyatt
@ 2023-02-10 16:30     ` Jakub Jelinek
  2023-02-10 16:58       ` Lewis Hyatt
  2023-02-15 18:39     ` Jason Merrill
  1 sibling, 1 reply; 10+ messages in thread
From: Jakub Jelinek @ 2023-02-10 16:30 UTC (permalink / raw)
  To: Jason Merrill, Nathan Sidwell, Lewis Hyatt; +Cc: gcc-patches

On Mon, Sep 26, 2022 at 06:27:25PM -0400, Lewis Hyatt via Gcc-patches wrote:
> May I please ping this patch again? Joseph suggested that it would be best if
> a C++ maintainer has a look at it. This is one of just a few places left where
> we don't handle UTF-8 properly in libcpp, it would be really nice to get them
> fixed up if there is time to review this patch. Thanks!

CCing them.

Just some nits from me, but I agree C++ maintainers are the best reviewers
for this.

> libcpp/ChangeLog:
> 
> 	PR preprocessor/103902
> 	* lex.cc (identifier_diagnostics_on_lex): New function refactors
> 	common code from...
> 	(lex_identifier_intern): ...here, and...
> 	(lex_identifier): ...here.
> 	(struct scan_id_result): New struct to hold the result of...

I'd just write
	(struct scan_id_result): New type.
or similar, no need to explain what it will be used for.

> 	(scan_cur_identifier): ...new function.

So just New function here too.

> 	(create_literal2): New function.
> 	(is_macro): Removed function that is now handled directly in
> 	lex_string() and lex_raw_string().
> 	(is_macro_not_literal_suffix): Likewise.
> 	(lit_accum::create_literal2): New function.
> 	(lex_raw_string): Make use of new function scan_cur_identifier().
> 	(lex_string): Likewise.

> +/* Helper function to perform diagnostics that are needed (rarely)
> +   when an identifier is lexed.  */
> +static void identifier_diagnostics_on_lex (cpp_reader *pfile,
> +					   cpp_hashnode *node)

Formatting, function name should be at the start of line, so
static void
identifier_diagnostics_on_lex (cpp_reader *pfile, cpp_hashnode *node)

> +{
> +  if (__builtin_expect (!(node->flags & NODE_DIAGNOSTIC)
> +			|| pfile->state.skipping, 1))
> +    return;
> +
> +  /* It is allowed to poison the same identifier twice.  */
> +  if ((node->flags & NODE_POISONED) && !pfile->state.poisoned_ok)
> +    cpp_error (pfile, CPP_DL_ERROR, "attempt to use poisoned \"%s\"",
> +	       NODE_NAME (node));
> +
> +  /* Constraint 6.10.3.5: __VA_ARGS__; should only appear in the
> +     replacement list of a variadic macro.  */
> +  if (node == pfile->spec_nodes.n__VA_ARGS__
> +      && !pfile->state.va_args_ok)
> +    {
> +      if (CPP_OPTION (pfile, cplusplus))
> +	cpp_error (pfile, CPP_DL_PEDWARN,
> +		   "__VA_ARGS__ can only appear in the expansion"
> +		   " of a C++11 variadic macro");
> +      else
> +	cpp_error (pfile, CPP_DL_PEDWARN,
> +		   "__VA_ARGS__ can only appear in the expansion"
> +		   " of a C99 variadic macro");
> +    }
> +

Perhaps add here the:
+      /* __VA_OPT__ should only appear in the replacement list of a
+	 variadic macro.  */
comment that used to be present only in the second occurrence of
all this.

> +  if (node == pfile->spec_nodes.n__VA_OPT__)
> +    maybe_va_opt_error (pfile);
> +
> +  /* For -Wc++-compat, warn about use of C++ named operators.  */
> +  if (node->flags & NODE_WARN_OPERATOR)
> +    cpp_warning (pfile, CPP_W_CXX_OPERATOR_NAMES,
> +		 "identifier \"%s\" is a special operator name in C++",
> +		 NODE_NAME (node));
> +}
> +
> +/* Helper function to scan an entire identifier beginning at
> +   pfile->buffer->cur, and possibly containing extended characters (UCNs
> +   and/or UTF-8).  Returns the cpp_hashnode for the identifier on success, or
> +   else nullptr, as well as a normalize_state so that normalization warnings
> +   may be issued once the token lexing is complete.  */

This looks like a function comment that should be immediately above
scan_cur_identifier, there might be another comment above struct
scan_id_result which would just explain the purpose of the class.

> +
> +struct scan_id_result
> +{
> +  cpp_hashnode *node;
> +  normalize_state nst;
> +
> +  scan_id_result ()
> +    : node (nullptr)
> +  {
> +    nst = INITIAL_NORMALIZE_STATE;
> +  }
> +
> +  explicit operator bool () const { return node; }
> +};
> +
> +static scan_id_result
> +scan_cur_identifier (cpp_reader *pfile)
> +{

> @@ -2741,26 +2800,53 @@ lex_raw_string (cpp_reader *pfile, cpp_token *token, const uchar *base)
>  
>    if (CPP_OPTION (pfile, user_literals))
>      {
> -      /* If a string format macro, say from inttypes.h, is placed touching
> -	 a string literal it could be parsed as a C++11 user-defined string
> -	 literal thus breaking the program.  */
> -      if (is_macro_not_literal_suffix (pfile, pos))
> -	{
> -	  /* Raise a warning, but do not consume subsequent tokens.  */
> -	  if (CPP_OPTION (pfile, warn_literal_suffix) && !pfile->state.skipping)
> -	    cpp_warning_with_line (pfile, CPP_W_LITERAL_SUFFIX,
> -				   token->src_loc, 0,
> -				   "invalid suffix on literal; C++11 requires "
> -				   "a space between literal and string macro");
> -	}
> -      /* Grab user defined literal suffix.  */
> -      else if (ISIDST (*pos))
> -	{
> -	  type = cpp_userdef_string_add_type (type);
> -	  ++pos;
> +      const uchar *const suffix_begin = pos;
> +      pfile->buffer->cur = pos;
>  
> -	  while (ISIDNUM (*pos))
> -	    ++pos;
> +      if (const auto sr = scan_cur_identifier (pfile))
> +	{
> +	  /* If a string format macro, say from inttypes.h, is placed touching
> +	     a string literal it could be parsed as a C++11 user-defined
> +	     string literal thus breaking the program.  User-defined literals
> +	     outside of namespace std must start with a single underscore, so
> +	     assume anything of that form really is a UDL suffix.  We don't
> +	     need to worry about UDLs defined inside namespace std because
> +	     their names are reserved, so cannot be used as macro names in
> +	     valid programs.  */
> +	  if ((suffix_begin[0] != '_' || suffix_begin[1] == '_')
> +	      && cpp_macro_p (sr.node))

What is the advantage of dropping is_macro_not_literal_suffix and
hand-inlining it in two different spots?
Couldn't even the actual warning be moved into an inline function?

> +	    {
> +	      /* Maybe raise a warning, but do not consume the tokens.  */
> +	      pfile->buffer->cur = suffix_begin;
> +	      if (CPP_OPTION (pfile, warn_literal_suffix)
> +		  && !pfile->state.skipping)
> +		cpp_warning_with_line
> +		  (pfile, CPP_W_LITERAL_SUFFIX,
> +		   token->src_loc, 0,
> +		   "invalid suffix on literal; C++11 requires "
> +		   "a space between literal and string macro");

The ( on a call on a different line is quite ugly, so if it can be avoided,
it should.
		cpp_warning_with_line (pfile, CPP_W_LITERAL_SUFFIX,
				       token->src_loc, 0,
				       "invalid suffix on literal; C++11 "
				       "requires a space between literal "
				       "and string macro");
is more readable and same number of lines.

Otherwise it looks reasonable to me, but I'd still prefer Jason or Nathan
to review this.

	Jakub


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Ping^3: [PATCH] libcpp: Handle extended characters in user-defined literal suffix [PR103902]
  2023-02-10 16:30     ` Jakub Jelinek
@ 2023-02-10 16:58       ` Lewis Hyatt
  2023-02-10 17:10         ` Jakub Jelinek
  0 siblings, 1 reply; 10+ messages in thread
From: Lewis Hyatt @ 2023-02-10 16:58 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Jason Merrill, Nathan Sidwell, gcc-patches

On Fri, Feb 10, 2023 at 11:30 AM Jakub Jelinek <jakub@redhat.com> wrote:
>
> On Mon, Sep 26, 2022 at 06:27:25PM -0400, Lewis Hyatt via Gcc-patches wrote:
> > May I please ping this patch again? Joseph suggested that it would be best if
> > a C++ maintainer has a look at it. This is one of just a few places left where
> > we don't handle UTF-8 properly in libcpp, it would be really nice to get them
> > fixed up if there is time to review this patch. Thanks!
>
> CCing them.
>
> Just some nits from me, but I agree C++ maintainers are the best reviewers
> for this.

Thanks so much for looking it over, I really appreciate it. I'll be
sure to incorporate all your feedback along with those from the full
review.

Is this for stage 1 at this point BTW?

One note, the patch as-is doesn't quite apply to master branch
nowadays, it just needs a small tweak since warn_about_normalization()
has acquired a new argument in the meantime. If it's helpful I can
resend it with this addressed, as well as the rest of your comments?

Finally one comment here:

> > +      if (const auto sr = scan_cur_identifier (pfile))
> > +     {
> > +       /* If a string format macro, say from inttypes.h, is placed touching
> > +          a string literal it could be parsed as a C++11 user-defined
> > +          string literal thus breaking the program.  User-defined literals
> > +          outside of namespace std must start with a single underscore, so
> > +          assume anything of that form really is a UDL suffix.  We don't
> > +          need to worry about UDLs defined inside namespace std because
> > +          their names are reserved, so cannot be used as macro names in
> > +          valid programs.  */
> > +       if ((suffix_begin[0] != '_' || suffix_begin[1] == '_')
> > +           && cpp_macro_p (sr.node))
>
> What is the advantage of dropping is_macro_not_literal_suffix and
> hand-inlining it in two different spots?
> Couldn't even the actual warning be moved into an inline function?

The is_macro() function was doing two jobs, first lexing the
identifier and looking it up in the hash table, and then calling
cpp_macro_p(). This was a bit duplicative because the identifier was
then immediately lexed again after the check. Since lexing it became
more complicated with UTF-8 support, I changed it not to duplicate
that effort and instead scan_cur_identifer() does the job once. With
that done, all that's left for is_macro() to do is just the one line
check so I got rid of it. However, I agree that the check about
suffix_begin is not really trivial and so factoring this out into one
place instead of two makes sense. I'll try to move the whole warning
into its own function in the next iteration.

> Otherwise it looks reasonable to me, but I'd still prefer Jason or Nathan
> to review this.
>
>         Jakub
>

Thanks again.

-Lewis

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Ping^3: [PATCH] libcpp: Handle extended characters in user-defined literal suffix [PR103902]
  2023-02-10 16:58       ` Lewis Hyatt
@ 2023-02-10 17:10         ` Jakub Jelinek
  0 siblings, 0 replies; 10+ messages in thread
From: Jakub Jelinek @ 2023-02-10 17:10 UTC (permalink / raw)
  To: Lewis Hyatt; +Cc: Jason Merrill, Nathan Sidwell, gcc-patches

On Fri, Feb 10, 2023 at 11:58:03AM -0500, Lewis Hyatt wrote:
> On Fri, Feb 10, 2023 at 11:30 AM Jakub Jelinek <jakub@redhat.com> wrote:
> >
> > On Mon, Sep 26, 2022 at 06:27:25PM -0400, Lewis Hyatt via Gcc-patches wrote:
> > > May I please ping this patch again? Joseph suggested that it would be best if
> > > a C++ maintainer has a look at it. This is one of just a few places left where
> > > we don't handle UTF-8 properly in libcpp, it would be really nice to get them
> > > fixed up if there is time to review this patch. Thanks!
> >
> > CCing them.
> >
> > Just some nits from me, but I agree C++ maintainers are the best reviewers
> > for this.
> 
> Thanks so much for looking it over, I really appreciate it. I'll be
> sure to incorporate all your feedback along with those from the full
> review.
> 
> Is this for stage 1 at this point BTW?

It has been posted during stage 1, on the other side we are already almost a
month into stage 4, and fixes an important bug which isn't a regression
though.  I'd defer to the reviewers to decide that.

I've noticed this when seeing PR108717 being marked as dup of PR103902.
> The is_macro() function was doing two jobs, first lexing the
> identifier and looking it up in the hash table, and then calling
> cpp_macro_p(). This was a bit duplicative because the identifier was
> then immediately lexed again after the check. Since lexing it became
> more complicated with UTF-8 support, I changed it not to duplicate
> that effort and instead scan_cur_identifer() does the job once. With
> that done, all that's left for is_macro() to do is just the one line
> check so I got rid of it. However, I agree that the check about
> suffix_begin is not really trivial and so factoring this out into one
> place instead of two makes sense. I'll try to move the whole warning
> into its own function in the next iteration.

I agree, I just wanted to mention that if both of the callers need the same
large comment and roughly or completely the same code that guards the
cpp_warning, then it is a good candidate for a new helper, exactly like
you've added for the large duplication in the first new function.

	Jakub


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Ping^3: [PATCH] libcpp: Handle extended characters in user-defined literal suffix [PR103902]
  2022-09-26 22:27   ` Ping^3: " Lewis Hyatt
  2023-02-10 16:30     ` Jakub Jelinek
@ 2023-02-15 18:39     ` Jason Merrill
  2023-02-15 23:18       ` Lewis Hyatt
  1 sibling, 1 reply; 10+ messages in thread
From: Jason Merrill @ 2023-02-15 18:39 UTC (permalink / raw)
  To: Lewis Hyatt, gcc-patches

On 9/26/22 15:27, Lewis Hyatt wrote:
> On Wed, Jun 15, 2022 at 03:06:16PM -0400, Lewis Hyatt wrote:
>> On Tue, Jun 14, 2022 at 05:26:49PM -0400, Lewis Hyatt wrote:
>>> Hello-
>>>
>>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103902
>>>
>>> The attached patch resolves PR preprocessor/103902 as described in the patch
>>> message inline below. bootstrap + regtest all languages was successful on
>>> x86-64 Linux, with no new failures:
>>>
>>> FAIL 103 103
>>> PASS 542338 542371
>>> UNSUPPORTED 15247 15250
>>> UNTESTED 136 136
>>> XFAIL 4166 4166
>>> XPASS 17 17
>>>
>>> Please let me know if it looks OK?
>>>
>>> A few questions I have:
>>>
>>> - A difference introduced with this patch is that after lexing something
>>> like `operator ""_abc', then `_abc' is added to the identifier hash map,
>>> whereas previously it was not. I feel like this must be OK because with the
>>> optional space as in `operator "" _abc', it would be added with or without the
>>> patch.
>>>
>>> - The behavior of `#pragma GCC poison' is not consistent (including prior to
>>>    my patch). I tried to make it more so but there is still one thing I want to
>>>    ask about. Leaving aside extended characters for now, the inconsistency is
>>>    that currently the poison is only checked, when the suffix appears as a
>>>    standalone token.
>>>
>>>    #pragma GCC poison _X
>>>    bool operator ""_X (unsigned long long);   //accepted before the patch,
>>>                                               //rejected after it
>>>    bool operator "" _X (unsigned long long);  //rejected either before or after
>>>    const char * operator ""_X (const char *, unsigned long); //accepted before,
>>>                                                              //rejected after
>>>    const char * operator "" _X (const char *, unsigned long); //rejected either
>>>
>>>    const char * s = ""_X; //accepted before the patch, rejected after it
>>>    const bool b = 1_X; //accepted before or after ****
>>>
>>> I feel like after the patch, the behavior is the expected behavior for all
>>> cases but the last one. Here, we allow the poisoned identifier because it's
>>> not lexed as an identifier, it's lexed as part of a pp-number. Does it seem OK
>>> like this or does it need to be addressed?
>>
>> Sorry, that version actually did not handle the case of -Wc++11-compat in
>> c++98 mode correctly. This updated version fixes that and adds the missing
>> test coverage for that, if you could please review this one instead?
>>
>> By the way, the pipermail archive seems to permanently mangle UTF-8 in inline
>> attachments. I attached the patch also gzipped to address that for the
>> archive, since the new testcases do use non-ASCII characters.
>>
>> Thanks for taking a look!
> 
> Hello-
> 
> May I please ping this patch again? Joseph suggested that it would be best if
> a C++ maintainer has a look at it. This is one of just a few places left where
> we don't handle UTF-8 properly in libcpp, it would be really nice to get them
> fixed up if there is time to review this patch. Thanks!
> 
> https://gcc.gnu.org/pipermail/gcc-patches/2022-June/596704.html
> 
> I re-attached it here as it required some trivial rebasing on top of recently
> pushed changes. As before, I also attached the gzipped version so that the
> UTF-8 testcases show up OK in the online archive, in case that's still an
> issue. Thanks for taking a look!

Thank you for the patch, sorry it slipped off my radar.

> This patch fixes it by adding a new function scan_cur_identifier() that can be
> used to lex an identifier while in the middle of lexing another token. It is
> somewhat duplicative of the code in lex_identifier(), which handles the normal
> case, but I think there's no good way to avoid that without pessimizing the
> usual case, since lex_identifier() takes advantage of the fact that the first
> character of the identifier has already been analyzed.

So could you analyze the first character and then call lex_identifier?

> With scan_cur_identifier(), we do also correctly warn about bidi and
> normalization issues in the extended identifiers comprising the suffix, and we
> check for poisoned identifiers there as well.

Hmm, I don't think we want the check for poisoned identifiers; a suffix 
is not a name.  That goes for the other diagnostics in 
identifier_diagnostics_on_lex, as well.  At the meeting last week the 
committee decided to deprecate the declaration with a space to clarify 
this distinction.

> +	      if (!accum.accum)
> +		create_literal2 (pfile, token, base,
> +				 suffix_begin - base,
> +				 NODE_NAME (sr.node),
> +				 NODE_LEN (sr.node),
> +				 type);
> +	      else
> +		{
> +		  accum.create_literal2 (pfile, token, base,
> +					 suffix_begin - base,
> +					 NODE_NAME (sr.node),
> +					 NODE_LEN (sr.node),
> +					 type);
> +		  _cpp_release_buff (pfile, accum.first);
> +		}

How about always calling accum.create_literal2?

Jason


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Ping^3: [PATCH] libcpp: Handle extended characters in user-defined literal suffix [PR103902]
  2023-02-15 18:39     ` Jason Merrill
@ 2023-02-15 23:18       ` Lewis Hyatt
  2023-03-02 23:07         ` [PATCH v2] " Lewis Hyatt
  0 siblings, 1 reply; 10+ messages in thread
From: Lewis Hyatt @ 2023-02-15 23:18 UTC (permalink / raw)
  To: Jason Merrill; +Cc: gcc-patches

On Wed, Feb 15, 2023 at 1:39 PM Jason Merrill <jason@redhat.com> wrote:
>
> On 9/26/22 15:27, Lewis Hyatt wrote:
> > On Wed, Jun 15, 2022 at 03:06:16PM -0400, Lewis Hyatt wrote:
> >> On Tue, Jun 14, 2022 at 05:26:49PM -0400, Lewis Hyatt wrote:
> >>> Hello-
> >>>
> >>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103902
> >>>
> >>> The attached patch resolves PR preprocessor/103902 as described in the patch
> >>> message inline below. bootstrap + regtest all languages was successful on
> >>> x86-64 Linux, with no new failures:
> >>>
> >>> FAIL 103 103
> >>> PASS 542338 542371
> >>> UNSUPPORTED 15247 15250
> >>> UNTESTED 136 136
> >>> XFAIL 4166 4166
> >>> XPASS 17 17
> >>>
> >>> Please let me know if it looks OK?
> >>>
> >>> A few questions I have:
> >>>
> >>> - A difference introduced with this patch is that after lexing something
> >>> like `operator ""_abc', then `_abc' is added to the identifier hash map,
> >>> whereas previously it was not. I feel like this must be OK because with the
> >>> optional space as in `operator "" _abc', it would be added with or without the
> >>> patch.
> >>>
> >>> - The behavior of `#pragma GCC poison' is not consistent (including prior to
> >>>    my patch). I tried to make it more so but there is still one thing I want to
> >>>    ask about. Leaving aside extended characters for now, the inconsistency is
> >>>    that currently the poison is only checked, when the suffix appears as a
> >>>    standalone token.
> >>>
> >>>    #pragma GCC poison _X
> >>>    bool operator ""_X (unsigned long long);   //accepted before the patch,
> >>>                                               //rejected after it
> >>>    bool operator "" _X (unsigned long long);  //rejected either before or after
> >>>    const char * operator ""_X (const char *, unsigned long); //accepted before,
> >>>                                                              //rejected after
> >>>    const char * operator "" _X (const char *, unsigned long); //rejected either
> >>>
> >>>    const char * s = ""_X; //accepted before the patch, rejected after it
> >>>    const bool b = 1_X; //accepted before or after ****
> >>>
> >>> I feel like after the patch, the behavior is the expected behavior for all
> >>> cases but the last one. Here, we allow the poisoned identifier because it's
> >>> not lexed as an identifier, it's lexed as part of a pp-number. Does it seem OK
> >>> like this or does it need to be addressed?
> >>
> >> Sorry, that version actually did not handle the case of -Wc++11-compat in
> >> c++98 mode correctly. This updated version fixes that and adds the missing
> >> test coverage for that, if you could please review this one instead?
> >>
> >> By the way, the pipermail archive seems to permanently mangle UTF-8 in inline
> >> attachments. I attached the patch also gzipped to address that for the
> >> archive, since the new testcases do use non-ASCII characters.
> >>
> >> Thanks for taking a look!
> >
> > Hello-
> >
> > May I please ping this patch again? Joseph suggested that it would be best if
> > a C++ maintainer has a look at it. This is one of just a few places left where
> > we don't handle UTF-8 properly in libcpp, it would be really nice to get them
> > fixed up if there is time to review this patch. Thanks!
> >
> > https://gcc.gnu.org/pipermail/gcc-patches/2022-June/596704.html
> >
> > I re-attached it here as it required some trivial rebasing on top of recently
> > pushed changes. As before, I also attached the gzipped version so that the
> > UTF-8 testcases show up OK in the online archive, in case that's still an
> > issue. Thanks for taking a look!
>
> Thank you for the patch, sorry it slipped off my radar.
>

Thanks for taking a look at it. It's certainly an edge case that is
not bothering anyone too much, so no rush with it.

> > This patch fixes it by adding a new function scan_cur_identifier() that can be
> > used to lex an identifier while in the middle of lexing another token. It is
> > somewhat duplicative of the code in lex_identifier(), which handles the normal
> > case, but I think there's no good way to avoid that without pessimizing the
> > usual case, since lex_identifier() takes advantage of the fact that the first
> > character of the identifier has already been analyzed.
>
> So could you analyze the first character and then call lex_identifier?
>

Yes, it can be done this way. lex_identifier may need some adaptations
though, since it does some other work like tracking the original
spelling of the identifier. Plus per your comments below, it would
need to avoid the poison and other checks too.
I think it's pretty straightforward to refactor a bit so that it works
out. I kinda thought it may not be desirable to touch lex_identifier,
which is called on everything, just to handle this rare case, however
I am happy to do it this way after confirming it won't hurt
performance.

> > With scan_cur_identifier(), we do also correctly warn about bidi and
> > normalization issues in the extended identifiers comprising the suffix, and we
> > check for poisoned identifiers there as well.
>
> Hmm, I don't think we want the check for poisoned identifiers; a suffix
> is not a name.  That goes for the other diagnostics in
> identifier_diagnostics_on_lex, as well.  At the meeting last week the
> committee decided to deprecate the declaration with a space to clarify
> this distinction.
>

OK thanks, interesting.

> > +           if (!accum.accum)
> > +             create_literal2 (pfile, token, base,
> > +                              suffix_begin - base,
> > +                              NODE_NAME (sr.node),
> > +                              NODE_LEN (sr.node),
> > +                              type);
> > +           else
> > +             {
> > +               accum.create_literal2 (pfile, token, base,
> > +                                      suffix_begin - base,
> > +                                      NODE_NAME (sr.node),
> > +                                      NODE_LEN (sr.node),
> > +                                      type);
> > +               _cpp_release_buff (pfile, accum.first);
> > +             }
>
> How about always calling accum.create_literal2?

Yes it should be equivalent and better that way.

Sounds like I should try a version that attempts to reuse the logic in
lex_identifier and cleans up your other points, so I'll look into that
next. Thanks!


-Lewis

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH v2] libcpp: Handle extended characters in user-defined literal suffix [PR103902]
  2023-02-15 23:18       ` Lewis Hyatt
@ 2023-03-02 23:07         ` Lewis Hyatt
  2023-07-18 20:33           ` Jason Merrill
  0 siblings, 1 reply; 10+ messages in thread
From: Lewis Hyatt @ 2023-03-02 23:07 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jason Merrill, Lewis Hyatt

The PR complains that we do not handle UTF-8 in the suffix for a user-defined
literal, such as:

bool operator ""_π (unsigned long long);

In fact we don't handle any extended identifier characters there, whether
UTF-8, UCNs, or the $ sign. We do handle it fine if the optional space after
the "" tokens is included, since then the identifier is lexed in the "normal"
way as its own token. But when it is lexed as part of the string token, this
is handled in lex_string() with a one-off loop that is not aware of extended
characters.

This patch fixes it by adding a new function scan_cur_identifier() that can be
used to lex an identifier while in the middle of lexing another token.

BTW, the other place that has been mis-lexing identifiers is
lex_identifier_intern(), which is used to implement #pragma push_macro
and #pragma pop_macro. This does not support extended characters either.
I will add that in a subsequent patch, because it can't directly reuse the
new function, but rather needs to lex from a string instead of a cpp_buffer.

With scan_cur_identifier(), we do also correctly warn about bidi and
normalization issues in the extended identifiers comprising the suffix.

libcpp/ChangeLog:

	PR preprocessor/103902
	* lex.cc (identifier_diagnostics_on_lex): New function refactoring
	some common code.
	(lex_identifier_intern): Use the new function.
	(lex_identifier): Don't run identifier diagnostics here, rather let
	the call site do it when needed.
	(_cpp_lex_direct): Adjust the call sites of lex_identifier ()
	acccordingly.
	(struct scan_id_result): New struct.
	(scan_cur_identifier): New function.
	(create_literal2): New function.
	(lit_accum::create_literal2): New function.
	(is_macro): Folded into new function...
	(maybe_ignore_udl_macro_suffix): ...here.
	(is_macro_not_literal_suffix): Folded likewise.
	(lex_raw_string): Handle UTF-8 in UDL suffix via scan_cur_identifier ().
	(lex_string): Likewise.

gcc/testsuite/ChangeLog:

	PR preprocessor/103902
	* g++.dg/cpp0x/udlit-extended-id-1.C: New test.
	* g++.dg/cpp0x/udlit-extended-id-2.C: New test.
	* g++.dg/cpp0x/udlit-extended-id-3.C: New test.
	* g++.dg/cpp0x/udlit-extended-id-4.C: New test.
---

Notes:
    Hello-
    
    This is the updated version of the patch, incorporating feedback from Jakub
    and Jason, most recently discussed here:
    
    https://gcc.gnu.org/pipermail/gcc-patches/2023-February/612073.html
    
    Please let me know how it looks? It is simpler than before with the new
    approach. Thanks!
    
    One thing to note. As Jason clarified for me, a usage like this:
    
     #pragma GCC poison _x
    const char * operator "" _x (const char *, unsigned long);
    
    The space between the "" and the _x is currently allowed but will be
    deprecated in C++23. GCC currently will complain about the poisoned use of
    _x in this case, and this patch, which is just focused on handling UTF-8
    properly, does not change this. But it seems that it would be correct
    not to apply poison in this case. I can try to follow up with a patch to do
    so, if it seems worthwhile? Given the syntax is deprecated, maybe it's not
    worth it...
    
    For the time being, this patch does add a testcase for the above and xfails
    it. For the case where no space is present, which is the part touched by the
    present patch, existing behavior is preserved correctly and no diagnostics
    such as poison are issued for the UDL suffix. (Contrary to v1 of this
    patch.)
    
    Thanks! bootstrap + regtested all languages on x86-64 Linux with
    no regressions.
    
    -Lewis

 .../g++.dg/cpp0x/udlit-extended-id-1.C        |  68 ++++
 .../g++.dg/cpp0x/udlit-extended-id-2.C        |   6 +
 .../g++.dg/cpp0x/udlit-extended-id-3.C        |  15 +
 .../g++.dg/cpp0x/udlit-extended-id-4.C        |  14 +
 libcpp/lex.cc                                 | 382 ++++++++++--------
 5 files changed, 317 insertions(+), 168 deletions(-)
 create mode 100644 gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-1.C
 create mode 100644 gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-2.C
 create mode 100644 gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-3.C
 create mode 100644 gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-4.C

diff --git a/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-1.C b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-1.C
new file mode 100644
index 00000000000..411d4fdd0ba
--- /dev/null
+++ b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-1.C
@@ -0,0 +1,68 @@
+// { dg-do run { target c++11 } }
+// { dg-additional-options "-Wno-error=normalized" }
+#include <cstring>
+using namespace std;
+
+constexpr unsigned long long operator "" _π (unsigned long long x)
+{
+  return 3 * x;
+}
+
+/* Historically we didn't parse properly as part of the "" token, so check that
+   as well.  */
+constexpr unsigned long long operator ""_Π2 (unsigned long long x)
+{
+  return 4 * x;
+}
+
+char x1[1_π];
+char x2[2_Π2];
+
+static_assert (sizeof x1 == 3, "test1");
+static_assert (sizeof x2 == 8, "test2");
+
+const char * operator "" _1σ (const char *s, unsigned long)
+{
+  return s + 1;
+}
+
+const char * operator ""_Σ2 (const char *s, unsigned long)
+{
+  return s + 2;
+}
+
+const char * operator "" _\U000000e61 (const char *s, unsigned long)
+{
+  return "ae";
+}
+
+const char* operator ""_\u01532 (const char *s, unsigned long)
+{
+  return "oe";
+}
+
+bool operator "" _\u0BC7\u0BBE (unsigned long long); // { dg-warning "not in NFC" }
+bool operator ""_\u0B47\U00000B3E (unsigned long long); // { dg-warning "not in NFC" }
+
+#define xτy
+const char * str = ""xτy; // { dg-warning "invalid suffix on literal" }
+
+int main()
+{
+  if (3_π != 9)
+    __builtin_abort ();
+  if (4_Π2 != 16)
+    __builtin_abort ();
+  if (strcmp ("abc"_1σ, "bc"))
+    __builtin_abort ();
+  if (strcmp ("abcd"_Σ2, "cd"))
+    __builtin_abort ();
+  if (strcmp (R"(abcdef)"_1σ, "bcdef"))
+    __builtin_abort ();
+  if (strcmp (R"(abcdef)"_Σ2, "cdef"))
+    __builtin_abort ();
+  if (strcmp ("xyz"_æ1, "ae"))
+    __builtin_abort ();
+  if (strcmp ("xyz"_œ2, "oe"))
+    __builtin_abort ();
+}
diff --git a/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-2.C b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-2.C
new file mode 100644
index 00000000000..05a2804a463
--- /dev/null
+++ b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-2.C
@@ -0,0 +1,6 @@
+// { dg-do compile { target c++11 } }
+// { dg-additional-options "-Wbidi-chars=any,ucn" }
+bool operator ""_d\u202ae\u202cf (unsigned long long); // { dg-line line1 }
+// { dg-error "universal character \\\\u202a is not valid in an identifier" "test1" { target *-*-* } line1 }
+// { dg-error "universal character \\\\u202c is not valid in an identifier" "test2" { target *-*-* } line1 }
+// { dg-warning "found problematic Unicode character" "test3" { target *-*-* } line1 }
diff --git a/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-3.C b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-3.C
new file mode 100644
index 00000000000..11292e476e3
--- /dev/null
+++ b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-3.C
@@ -0,0 +1,15 @@
+// { dg-do compile { target c++11 } }
+
+// Check that we do not look for poisoned identifier when it is a suffix.
+int _ħ;
+#pragma GCC poison _ħ
+const char * operator ""_ħ (const char *, unsigned long); // { dg-bogus "poisoned" }
+bool operator ""_ħ (unsigned long long x); // { dg-bogus "poisoned" }
+bool b = 1_ħ; // { dg-bogus "poisoned" }
+const char *x = "hbar"_ħ; // { dg-bogus "poisoned" }
+
+/* Ideally, we should not warn here either, but this is not implemented yet.  This
+   syntax has been deprecated for C++23.  */
+#pragma GCC poison _ħ2
+const char * operator "" _ħ2 (const char *, unsigned long); // { dg-bogus "poisoned" "" { xfail *-*-*} }
+const char *x2 = "hbar2"_ħ2; // { dg-bogus "poisoned" }
diff --git a/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-4.C b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-4.C
new file mode 100644
index 00000000000..d1683c4d892
--- /dev/null
+++ b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-4.C
@@ -0,0 +1,14 @@
+// { dg-options "-std=c++98 -Wc++11-compat" }
+#define END ;
+#define εND ;
+#define EηD ;
+#define EN\u0394 ;
+
+const char *s1 = "s1"END // { dg-warning "requires a space between string literal and macro" }
+const char *s2 = "s2"εND // { dg-warning "requires a space between string literal and macro" }
+const char *s3 = "s3"EηD // { dg-warning "requires a space between string literal and macro" }
+const char *s4 = "s4"ENΔ // { dg-warning "requires a space between string literal and macro" }
+
+/* Make sure we did not skip the token also in the case that it wasn't found to
+   be a macro; compilation should fail here.  */
+const char *s5 = "s5"NØT_A_MACRO; // { dg-error "expected ',' or ';' before" }
diff --git a/libcpp/lex.cc b/libcpp/lex.cc
index 45ea16a91bc..062935e2371 100644
--- a/libcpp/lex.cc
+++ b/libcpp/lex.cc
@@ -2057,8 +2057,11 @@ warn_about_normalization (cpp_reader *pfile,
     }
 }
 
-/* Returns TRUE if the sequence starting at buffer->cur is valid in
-   an identifier.  FIRST is TRUE if this starts an identifier.  */
+/* Returns TRUE if the byte sequence starting at buffer->cur is a valid
+   extended character in an identifier.  If FIRST is TRUE, then the character
+   must be valid at the beginning of an identifier as well.  If the return
+   value is TRUE, then pfile->buffer->cur has been moved to point to the next
+   byte after the extended character.  */
 
 static bool
 forms_identifier_p (cpp_reader *pfile, int first,
@@ -2154,6 +2157,47 @@ maybe_va_opt_error (cpp_reader *pfile)
     }
 }
 
+/* Helper function to perform diagnostics that are needed (rarely)
+   when an identifier is lexed.  */
+static void
+identifier_diagnostics_on_lex (cpp_reader *pfile, cpp_hashnode *node)
+{
+  if (__builtin_expect (!(node->flags & NODE_DIAGNOSTIC)
+			|| pfile->state.skipping, 1))
+    return;
+
+  /* It is allowed to poison the same identifier twice.  */
+  if ((node->flags & NODE_POISONED) && !pfile->state.poisoned_ok)
+    cpp_error (pfile, CPP_DL_ERROR, "attempt to use poisoned \"%s\"",
+	       NODE_NAME (node));
+
+  /* Constraint 6.10.3.5: __VA_ARGS__ should only appear in the
+     replacement list of a variadic macro.  */
+  if (node == pfile->spec_nodes.n__VA_ARGS__
+      && !pfile->state.va_args_ok)
+    {
+      if (CPP_OPTION (pfile, cplusplus))
+	cpp_error (pfile, CPP_DL_PEDWARN,
+		   "__VA_ARGS__ can only appear in the expansion"
+		   " of a C++11 variadic macro");
+      else
+	cpp_error (pfile, CPP_DL_PEDWARN,
+		   "__VA_ARGS__ can only appear in the expansion"
+		   " of a C99 variadic macro");
+    }
+
+  /* __VA_OPT__ should only appear in the replacement list of a
+     variadic macro.  */
+  if (node == pfile->spec_nodes.n__VA_OPT__)
+    maybe_va_opt_error (pfile);
+
+  /* For -Wc++-compat, warn about use of C++ named operators.  */
+  if (node->flags & NODE_WARN_OPERATOR)
+    cpp_warning (pfile, CPP_W_CXX_OPERATOR_NAMES,
+		 "identifier \"%s\" is a special operator name in C++",
+		 NODE_NAME (node));
+}
+
 /* Helper function to get the cpp_hashnode of the identifier BASE.  */
 static cpp_hashnode *
 lex_identifier_intern (cpp_reader *pfile, const uchar *base)
@@ -2173,41 +2217,7 @@ lex_identifier_intern (cpp_reader *pfile, const uchar *base)
   hash = HT_HASHFINISH (hash, len);
   result = CPP_HASHNODE (ht_lookup_with_hash (pfile->hash_table,
 					      base, len, hash, HT_ALLOC));
-
-  /* Rarely, identifiers require diagnostics when lexed.  */
-  if (__builtin_expect ((result->flags & NODE_DIAGNOSTIC)
-			&& !pfile->state.skipping, 0))
-    {
-      /* It is allowed to poison the same identifier twice.  */
-      if ((result->flags & NODE_POISONED) && !pfile->state.poisoned_ok)
-	cpp_error (pfile, CPP_DL_ERROR, "attempt to use poisoned \"%s\"",
-		   NODE_NAME (result));
-
-      /* Constraint 6.10.3.5: __VA_ARGS__ should only appear in the
-	 replacement list of a variadic macro.  */
-      if (result == pfile->spec_nodes.n__VA_ARGS__
-	  && !pfile->state.va_args_ok)
-	{
-	  if (CPP_OPTION (pfile, cplusplus))
-	    cpp_error (pfile, CPP_DL_PEDWARN,
-		       "__VA_ARGS__ can only appear in the expansion"
-		       " of a C++11 variadic macro");
-	  else
-	    cpp_error (pfile, CPP_DL_PEDWARN,
-		       "__VA_ARGS__ can only appear in the expansion"
-		       " of a C99 variadic macro");
-	}
-
-      if (result == pfile->spec_nodes.n__VA_OPT__)
-	maybe_va_opt_error (pfile);
-
-      /* For -Wc++-compat, warn about use of C++ named operators.  */
-      if (result->flags & NODE_WARN_OPERATOR)
-	cpp_warning (pfile, CPP_W_CXX_OPERATOR_NAMES,
-		     "identifier \"%s\" is a special operator name in C++",
-		     NODE_NAME (result));
-    }
-
+  identifier_diagnostics_on_lex (pfile, result);
   return result;
 }
 
@@ -2221,7 +2231,9 @@ _cpp_lex_identifier (cpp_reader *pfile, const char *name)
   return result;
 }
 
-/* Lex an identifier starting at BUFFER->CUR - 1.  */
+/* Lex an identifier starting at BASE.  BUFFER->CUR is expected to point
+   one past the first character at BASE, which may be a (possibly multi-byte)
+   character if STARTS_UCN is true.  */
 static cpp_hashnode *
 lex_identifier (cpp_reader *pfile, const uchar *base, bool starts_ucn,
 		struct normalize_state *nst, cpp_hashnode **spelling)
@@ -2270,42 +2282,51 @@ lex_identifier (cpp_reader *pfile, const uchar *base, bool starts_ucn,
       *spelling = result;
     }
 
-  /* Rarely, identifiers require diagnostics when lexed.  */
-  if (__builtin_expect ((result->flags & NODE_DIAGNOSTIC)
-			&& !pfile->state.skipping, 0))
-    {
-      /* It is allowed to poison the same identifier twice.  */
-      if ((result->flags & NODE_POISONED) && !pfile->state.poisoned_ok)
-	cpp_error (pfile, CPP_DL_ERROR, "attempt to use poisoned \"%s\"",
-		   NODE_NAME (result));
-
-      /* Constraint 6.10.3.5: __VA_ARGS__ should only appear in the
-	 replacement list of a variadic macro.  */
-      if (result == pfile->spec_nodes.n__VA_ARGS__
-	  && !pfile->state.va_args_ok)
-	{
-	  if (CPP_OPTION (pfile, cplusplus))
-	    cpp_error (pfile, CPP_DL_PEDWARN,
-		       "__VA_ARGS__ can only appear in the expansion"
-		       " of a C++11 variadic macro");
-	  else
-	    cpp_error (pfile, CPP_DL_PEDWARN,
-		       "__VA_ARGS__ can only appear in the expansion"
-		       " of a C99 variadic macro");
-	}
+  return result;
+}
 
-      /* __VA_OPT__ should only appear in the replacement list of a
-	 variadic macro.  */
-      if (result == pfile->spec_nodes.n__VA_OPT__)
-	maybe_va_opt_error (pfile);
-
-      /* For -Wc++-compat, warn about use of C++ named operators.  */
-      if (result->flags & NODE_WARN_OPERATOR)
-	cpp_warning (pfile, CPP_W_CXX_OPERATOR_NAMES,
-		     "identifier \"%s\" is a special operator name in C++",
-		     NODE_NAME (result));
-    }
+/* Struct to hold the return value of the scan_cur_identifier () helper
+   function below.  */
 
+struct scan_id_result
+{
+  cpp_hashnode *node;
+  normalize_state nst;
+
+  scan_id_result ()
+    : node (nullptr)
+  {
+    nst = INITIAL_NORMALIZE_STATE;
+  }
+
+  explicit operator bool () const { return node; }
+};
+
+/* Helper function to scan an entire identifier beginning at
+   pfile->buffer->cur, and possibly containing extended characters (UCNs
+   and/or UTF-8).  Returns the cpp_hashnode for the identifier on success, or
+   else nullptr, as well as a normalize_state so that normalization warnings
+   may be issued once the token lexing is complete.  */
+
+static scan_id_result
+scan_cur_identifier (cpp_reader *pfile)
+{
+  const auto buffer = pfile->buffer;
+  const auto begin = buffer->cur;
+  scan_id_result result;
+  if (ISIDST (*buffer->cur))
+    {
+      ++buffer->cur;
+      cpp_hashnode *ignore;
+      result.node = lex_identifier (pfile, begin, false, &result.nst, &ignore);
+    }
+  else if (forms_identifier_p (pfile, true, &result.nst))
+    {
+      /* buffer->cur has been moved already by the call
+	 to forms_identifier_p.  */
+      cpp_hashnode *ignore;
+      result.node = lex_identifier (pfile, begin, true, &result.nst, &ignore);
+    }
   return result;
 }
 
@@ -2365,6 +2386,24 @@ create_literal (cpp_reader *pfile, cpp_token *token, const uchar *base,
   token->val.str.text = cpp_alloc_token_string (pfile, base, len);
 }
 
+/* Like create_literal(), but construct it from two separate strings
+   which are concatenated.  LEN2 may be 0 if no second string is
+   required.  */
+static void
+create_literal2 (cpp_reader *pfile, cpp_token *token, const uchar *base1,
+		 unsigned int len1, const uchar *base2, unsigned int len2,
+		 enum cpp_ttype type)
+{
+  token->type = type;
+  token->val.str.len = len1 + len2;
+  uchar *const dest = _cpp_unaligned_alloc (pfile, len1 + len2 + 1);
+  memcpy (dest, base1, len1);
+  if (len2)
+    memcpy (dest+len1, base2, len2);
+  dest[len1 + len2] = 0;
+  token->val.str.text = dest;
+}
+
 const uchar *
 cpp_alloc_token_string (cpp_reader *pfile,
 			const unsigned char *ptr, unsigned len)
@@ -2403,6 +2442,11 @@ struct lit_accum {
       rpos = NULL;
     return c;
   }
+
+  void create_literal2 (cpp_reader *pfile, cpp_token *token,
+			const uchar *base1, unsigned int len1,
+			const uchar *base2, unsigned int len2,
+			enum cpp_ttype type);
 };
 
 /* Subroutine of lex_raw_string: Append LEN chars from BASE to the buffer
@@ -2445,45 +2489,57 @@ lit_accum::read_begin (cpp_reader *pfile)
   rpos = BUFF_FRONT (last);
 }
 
-/* Returns true if a macro has been defined.
-   This might not work if compile with -save-temps,
-   or preprocess separately from compilation.  */
+/* Helper function to check if a string format macro, say from inttypes.h, is
+   placed touching a string literal, in which case it could be parsed as a C++11
+   user-defined string literal thus breaking the program.  User-defined literals
+   outside of namespace std must start with a single underscore, so assume
+   anything of that form really is a UDL suffix.  We don't need to worry about
+   UDLs defined inside namespace std because their names are reserved, so cannot
+   be used as macro names in valid programs.  Return TRUE if the UDL should be
+   ignored for now and preserved for potential macro expansion.  */
 
 static bool
-is_macro(cpp_reader *pfile, const uchar *base)
+maybe_ignore_udl_macro_suffix (cpp_reader *pfile, location_t src_loc,
+			       const uchar *suffix_begin, cpp_hashnode *node)
 {
-  const uchar *cur = base;
-  if (! ISIDST (*cur))
+  if ((suffix_begin[0] == '_' && suffix_begin[1] != '_')
+      || !cpp_macro_p (node))
     return false;
-  unsigned int hash = HT_HASHSTEP (0, *cur);
-  ++cur;
-  while (ISIDNUM (*cur))
-    {
-      hash = HT_HASHSTEP (hash, *cur);
-      ++cur;
-    }
-  hash = HT_HASHFINISH (hash, cur - base);
 
-  cpp_hashnode *result = CPP_HASHNODE (ht_lookup_with_hash (pfile->hash_table,
-					base, cur - base, hash, HT_NO_INSERT));
-
-  return result && cpp_macro_p (result);
+  /* Maybe raise a warning here; caller should arrange not to consume
+     the tokens.  */
+  if (CPP_OPTION (pfile, warn_literal_suffix) && !pfile->state.skipping)
+    cpp_warning_with_line (pfile, CPP_W_LITERAL_SUFFIX, src_loc, 0,
+			   "invalid suffix on literal; C++11 requires a space "
+			   "between literal and string macro");
+  return true;
 }
 
-/* Returns true if a literal suffix does not have the expected form
-   and is defined as a macro.  */
-
-static bool
-is_macro_not_literal_suffix(cpp_reader *pfile, const uchar *base)
+/* Like create_literal2(), but also prepend all the accumulated data from
+   the lit_accum struct.  */
+void
+lit_accum::create_literal2 (cpp_reader *pfile, cpp_token *token,
+			    const uchar *base1, unsigned int len1,
+			    const uchar *base2, unsigned int len2,
+			    enum cpp_ttype type)
 {
-  /* User-defined literals outside of namespace std must start with a single
-     underscore, so assume anything of that form really is a UDL suffix.
-     We don't need to worry about UDLs defined inside namespace std because
-     their names are reserved, so cannot be used as macro names in valid
-     programs.  */
-  if (base[0] == '_' && base[1] != '_')
-    return false;
-  return is_macro (pfile, base);
+  const unsigned int tot_len = accum + len1 + len2;
+  uchar *dest = _cpp_unaligned_alloc (pfile, tot_len + 1);
+  token->type = type;
+  token->val.str.len = tot_len;
+  token->val.str.text = dest;
+  for (_cpp_buff *buf = first; buf; buf = buf->next)
+    {
+      size_t len = BUFF_FRONT (buf) - buf->base;
+      memcpy (dest, buf->base, len);
+      dest += len;
+    }
+  memcpy (dest, base1, len1);
+  dest += len1;
+  if (len2)
+    memcpy (dest, base2, len2);
+  dest += len2;
+  *dest = '\0';
 }
 
 /* Lexes a raw string.  The stored string contains the spelling,
@@ -2758,26 +2814,25 @@ lex_raw_string (cpp_reader *pfile, cpp_token *token, const uchar *base)
 
   if (CPP_OPTION (pfile, user_literals))
     {
-      /* If a string format macro, say from inttypes.h, is placed touching
-	 a string literal it could be parsed as a C++11 user-defined string
-	 literal thus breaking the program.  */
-      if (is_macro_not_literal_suffix (pfile, pos))
-	{
-	  /* Raise a warning, but do not consume subsequent tokens.  */
-	  if (CPP_OPTION (pfile, warn_literal_suffix) && !pfile->state.skipping)
-	    cpp_warning_with_line (pfile, CPP_W_LITERAL_SUFFIX,
-				   token->src_loc, 0,
-				   "invalid suffix on literal; C++11 requires "
-				   "a space between literal and string macro");
-	}
-      /* Grab user defined literal suffix.  */
-      else if (ISIDST (*pos))
-	{
-	  type = cpp_userdef_string_add_type (type);
-	  ++pos;
+      const uchar *const suffix_begin = pos;
+      pfile->buffer->cur = pos;
 
-	  while (ISIDNUM (*pos))
-	    ++pos;
+      if (const auto sr = scan_cur_identifier (pfile))
+	{
+	  if (maybe_ignore_udl_macro_suffix (pfile, token->src_loc,
+					     suffix_begin, sr.node))
+	      pfile->buffer->cur = suffix_begin;
+	  else
+	    {
+	      type = cpp_userdef_string_add_type (type);
+	      accum.create_literal2 (pfile, token, base, suffix_begin - base,
+				     NODE_NAME (sr.node), NODE_LEN (sr.node),
+				     type);
+	      if (accum.first)
+		_cpp_release_buff (pfile, accum.first);
+	      warn_about_normalization (pfile, token, &sr.nst, true);
+	      return;
+	    }
 	}
     }
 
@@ -2787,21 +2842,8 @@ lex_raw_string (cpp_reader *pfile, cpp_token *token, const uchar *base)
     create_literal (pfile, token, base, pos - base, type);
   else
     {
-      size_t extra_len = pos - base;
-      uchar *dest = _cpp_unaligned_alloc (pfile, accum.accum + extra_len + 1);
-
-      token->type = type;
-      token->val.str.len = accum.accum + extra_len;
-      token->val.str.text = dest;
-      for (_cpp_buff *buf = accum.first; buf; buf = buf->next)
-	{
-	  size_t len = BUFF_FRONT (buf) - buf->base;
-	  memcpy (dest, buf->base, len);
-	  dest += len;
-	}
+      accum.create_literal2 (pfile, token, base, pos - base, nullptr, 0, type);
       _cpp_release_buff (pfile, accum.first);
-      memcpy (dest, base, extra_len);
-      dest[extra_len] = '\0';
     }
 }
 
@@ -2908,39 +2950,40 @@ lex_string (cpp_reader *pfile, cpp_token *token, const uchar *base)
     cpp_error (pfile, CPP_DL_PEDWARN, "missing terminating %c character",
 	       (int) terminator);
 
+  pfile->buffer->cur = cur;
+  const uchar *const suffix_begin = cur;
+
   if (CPP_OPTION (pfile, user_literals))
     {
-      /* If a string format macro, say from inttypes.h, is placed touching
-	 a string literal it could be parsed as a C++11 user-defined string
-	 literal thus breaking the program.  */
-      if (is_macro_not_literal_suffix (pfile, cur))
-	{
-	  /* Raise a warning, but do not consume subsequent tokens.  */
-	  if (CPP_OPTION (pfile, warn_literal_suffix) && !pfile->state.skipping)
-	    cpp_warning_with_line (pfile, CPP_W_LITERAL_SUFFIX,
-				   token->src_loc, 0,
-				   "invalid suffix on literal; C++11 requires "
-				   "a space between literal and string macro");
-	}
-      /* Grab user defined literal suffix.  */
-      else if (ISIDST (*cur))
+      if (const auto sr = scan_cur_identifier (pfile))
 	{
-	  type = cpp_userdef_char_add_type (type);
-	  type = cpp_userdef_string_add_type (type);
-          ++cur;
-
-	  while (ISIDNUM (*cur))
-	    ++cur;
+	  if (maybe_ignore_udl_macro_suffix (pfile, token->src_loc,
+					     suffix_begin, sr.node))
+	    pfile->buffer->cur = suffix_begin;
+	  else
+	    {
+	      /* Grab user defined literal suffix.  */
+	      type = cpp_userdef_char_add_type (type);
+	      type = cpp_userdef_string_add_type (type);
+	      create_literal2 (pfile, token, base, suffix_begin - base,
+			       NODE_NAME (sr.node), NODE_LEN (sr.node), type);
+	      warn_about_normalization (pfile, token, &sr.nst, true);
+	      return;
+	    }
 	}
     }
   else if (CPP_OPTION (pfile, cpp_warn_cxx11_compat)
-	   && is_macro (pfile, cur)
 	   && !pfile->state.skipping)
-    cpp_warning_with_line (pfile, CPP_W_CXX11_COMPAT,
-			   token->src_loc, 0, "C++11 requires a space "
-			   "between string literal and macro");
+    {
+      const auto sr = scan_cur_identifier (pfile);
+      /* Maybe raise a warning, but do not consume the tokens.  */
+      pfile->buffer->cur = suffix_begin;
+      if (sr && cpp_macro_p (sr.node))
+	cpp_warning_with_line (pfile, CPP_W_CXX11_COMPAT,
+			       token->src_loc, 0, "C++11 requires a space "
+			       "between string literal and macro");
+    }
 
-  pfile->buffer->cur = cur;
   create_literal (pfile, token, base, cur - base, type);
 }
 
@@ -3915,9 +3958,10 @@ _cpp_lex_direct (cpp_reader *pfile)
       result->type = CPP_NAME;
       {
 	struct normalize_state nst = INITIAL_NORMALIZE_STATE;
-	result->val.node.node = lex_identifier (pfile, buffer->cur - 1, false,
-						&nst,
-						&result->val.node.spelling);
+	const auto node = lex_identifier (pfile, buffer->cur - 1, false, &nst,
+					  &result->val.node.spelling);
+	result->val.node.node = node;
+	identifier_diagnostics_on_lex (pfile, node);
 	warn_about_normalization (pfile, result, &nst, true);
       }
 
@@ -4220,8 +4264,10 @@ _cpp_lex_direct (cpp_reader *pfile)
 	if (forms_identifier_p (pfile, true, &nst))
 	  {
 	    result->type = CPP_NAME;
-	    result->val.node.node = lex_identifier (pfile, base, true, &nst,
-						    &result->val.node.spelling);
+	    const auto node = lex_identifier (pfile, base, true, &nst,
+					      &result->val.node.spelling);
+	    result->val.node.node = node;
+	    identifier_diagnostics_on_lex (pfile, node);
 	    warn_about_normalization (pfile, result, &nst, true);
 	    break;
 	  }
@@ -4353,7 +4399,7 @@ cpp_digraph2name (enum cpp_ttype type)
 }
 
 /* Write the spelling of an identifier IDENT, using UCNs, to BUFFER.
-   The buffer must already contain the enough space to hold the
+   The buffer must already contain enough space to hold the
    token's spelling.  Returns a pointer to the character after the
    last character written.  */
 unsigned char *
@@ -4375,7 +4421,7 @@ _cpp_spell_ident_ucns (unsigned char *buffer, cpp_hashnode *ident)
 }
 
 /* Write the spelling of a token TOKEN to BUFFER.  The buffer must
-   already contain the enough space to hold the token's spelling.
+   already contain enough space to hold the token's spelling.
    Returns a pointer to the character after the last character written.
    FORSTRING is true if this is to be the spelling after translation
    phase 1 (with the original spelling of extended identifiers), false

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2] libcpp: Handle extended characters in user-defined literal suffix [PR103902]
  2023-03-02 23:07         ` [PATCH v2] " Lewis Hyatt
@ 2023-07-18 20:33           ` Jason Merrill
  0 siblings, 0 replies; 10+ messages in thread
From: Jason Merrill @ 2023-07-18 20:33 UTC (permalink / raw)
  To: Lewis Hyatt, gcc-patches

On 3/2/23 18:07, Lewis Hyatt wrote:
> The PR complains that we do not handle UTF-8 in the suffix for a user-defined
> literal, such as:
> 
> bool operator ""_π (unsigned long long);
> 
> In fact we don't handle any extended identifier characters there, whether
> UTF-8, UCNs, or the $ sign. We do handle it fine if the optional space after
> the "" tokens is included, since then the identifier is lexed in the "normal"
> way as its own token. But when it is lexed as part of the string token, this
> is handled in lex_string() with a one-off loop that is not aware of extended
> characters.
> 
> This patch fixes it by adding a new function scan_cur_identifier() that can be
> used to lex an identifier while in the middle of lexing another token.
> 
> BTW, the other place that has been mis-lexing identifiers is
> lex_identifier_intern(), which is used to implement #pragma push_macro
> and #pragma pop_macro. This does not support extended characters either.
> I will add that in a subsequent patch, because it can't directly reuse the
> new function, but rather needs to lex from a string instead of a cpp_buffer.
> 
> With scan_cur_identifier(), we do also correctly warn about bidi and
> normalization issues in the extended identifiers comprising the suffix.
> 
> libcpp/ChangeLog:
> 
> 	PR preprocessor/103902
> 	* lex.cc (identifier_diagnostics_on_lex): New function refactoring
> 	some common code.
> 	(lex_identifier_intern): Use the new function.
> 	(lex_identifier): Don't run identifier diagnostics here, rather let
> 	the call site do it when needed.
> 	(_cpp_lex_direct): Adjust the call sites of lex_identifier ()
> 	acccordingly.
> 	(struct scan_id_result): New struct.
> 	(scan_cur_identifier): New function.
> 	(create_literal2): New function.
> 	(lit_accum::create_literal2): New function.
> 	(is_macro): Folded into new function...
> 	(maybe_ignore_udl_macro_suffix): ...here.
> 	(is_macro_not_literal_suffix): Folded likewise.
> 	(lex_raw_string): Handle UTF-8 in UDL suffix via scan_cur_identifier ().
> 	(lex_string): Likewise.
> 
> gcc/testsuite/ChangeLog:
> 
> 	PR preprocessor/103902
> 	* g++.dg/cpp0x/udlit-extended-id-1.C: New test.
> 	* g++.dg/cpp0x/udlit-extended-id-2.C: New test.
> 	* g++.dg/cpp0x/udlit-extended-id-3.C: New test.
> 	* g++.dg/cpp0x/udlit-extended-id-4.C: New test.
> ---
> 
> Notes:
>      Hello-
>      
>      This is the updated version of the patch, incorporating feedback from Jakub
>      and Jason, most recently discussed here:
>      
>      https://gcc.gnu.org/pipermail/gcc-patches/2023-February/612073.html
>      
>      Please let me know how it looks? It is simpler than before with the new
>      approach. Thanks!
>      
>      One thing to note. As Jason clarified for me, a usage like this:
>      
>       #pragma GCC poison _x
>      const char * operator "" _x (const char *, unsigned long);
>      
>      The space between the "" and the _x is currently allowed but will be
>      deprecated in C++23. GCC currently will complain about the poisoned use of
>      _x in this case, and this patch, which is just focused on handling UTF-8
>      properly, does not change this. But it seems that it would be correct
>      not to apply poison in this case. I can try to follow up with a patch to do
>      so, if it seems worthwhile? Given the syntax is deprecated, maybe it's not
>      worth it...

Right, the deprecation is intended to avoid this problem; it's fine for 
us to complain.

>      For the time being, this patch does add a testcase for the above and xfails
>      it. For the case where no space is present, which is the part touched by the
>      present patch, existing behavior is preserved correctly and no diagnostics
>      such as poison are issued for the UDL suffix. (Contrary to v1 of this
>      patch.)
>      
>      Thanks! bootstrap + regtested all languages on x86-64 Linux with
>      no regressions.
>      

> -/* Returns true if a macro has been defined.
> -   This might not work if compile with -save-temps,
> -   or preprocess separately from compilation.  */
> +/* Helper function to check if a string format macro, say from inttypes.h, is
> +   placed touching a string literal, in which case it could be parsed as a C++11
> +   user-defined string literal thus breaking the program.

>     User-defined literals
> +   outside of namespace std must start with a single underscore, so assume
> +   anything of that form really is a UDL suffix.  We don't need to worry about
> +   UDLs defined inside namespace std because their names are reserved, so cannot
> +   be used as macro names in valid programs.

I'd prefer to leave this rationale comment on the _[^_] check rather 
than hoist it out of the function; OK with that change.  Thank you very 
much for your persistence.

>     Return TRUE if the UDL should be
> +   ignored for now and preserved for potential macro expansion.  */
>   
>   static bool
> -is_macro(cpp_reader *pfile, const uchar *base)
> +maybe_ignore_udl_macro_suffix (cpp_reader *pfile, location_t src_loc,
> +			       const uchar *suffix_begin, cpp_hashnode *node)
>   {
> -  const uchar *cur = base;
> -  if (! ISIDST (*cur))
> +  if ((suffix_begin[0] == '_' && suffix_begin[1] != '_')
> +      || !cpp_macro_p (node))
>       return false;
> -  unsigned int hash = HT_HASHSTEP (0, *cur);
> -  ++cur;
> -  while (ISIDNUM (*cur))
> -    {
> -      hash = HT_HASHSTEP (hash, *cur);
> -      ++cur;
> -    }
> -  hash = HT_HASHFINISH (hash, cur - base);
>   
> -  cpp_hashnode *result = CPP_HASHNODE (ht_lookup_with_hash (pfile->hash_table,
> -					base, cur - base, hash, HT_NO_INSERT));
> -
> -  return result && cpp_macro_p (result);
> +  /* Maybe raise a warning here; caller should arrange not to consume
> +     the tokens.  */
> +  if (CPP_OPTION (pfile, warn_literal_suffix) && !pfile->state.skipping)
> +    cpp_warning_with_line (pfile, CPP_W_LITERAL_SUFFIX, src_loc, 0,
> +			   "invalid suffix on literal; C++11 requires a space "
> +			   "between literal and string macro");
> +  return true;
>   }
>   
> -/* Returns true if a literal suffix does not have the expected form
> -   and is defined as a macro.  */
> -
> -static bool
> -is_macro_not_literal_suffix(cpp_reader *pfile, const uchar *base)
>   {
> -  /* User-defined literals outside of namespace std must start with a single
> -     underscore, so assume anything of that form really is a UDL suffix.
> -     We don't need to worry about UDLs defined inside namespace std because
> -     their names are reserved, so cannot be used as macro names in valid
> -     programs.  */
> -  if (base[0] == '_' && base[1] != '_')
> -    return false;
> -  return is_macro (pfile, base);


^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2023-07-18 20:33 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-06-14 21:26 [PATCH] libcpp: Handle extended characters in user-defined literal suffix [PR103902] Lewis Hyatt
2022-06-15 19:06 ` Lewis Hyatt
2022-09-26 22:27   ` Ping^3: " Lewis Hyatt
2023-02-10 16:30     ` Jakub Jelinek
2023-02-10 16:58       ` Lewis Hyatt
2023-02-10 17:10         ` Jakub Jelinek
2023-02-15 18:39     ` Jason Merrill
2023-02-15 23:18       ` Lewis Hyatt
2023-03-02 23:07         ` [PATCH v2] " Lewis Hyatt
2023-07-18 20:33           ` Jason Merrill

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).