On Wed, Aug 24, 2022 at 04:22:17PM -0400, Jason Merrill wrote:
> > Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?
> 
> Does the copyright 2005-2022 mean that this code is partly derived from some
> other?

Yes, I was lazy and started by copying over makeucnid.cc, which also
parses UnicodeData.txt.
In the end, according to diff -upd -U10000 make{ucnid,uname2c}.cc, there are
~180 lines in common (out of ~530 lines of makeucnid.cc), of which
~80 lines are in the two copyright notices.  Most of the rest are just
empty lines or lines with { or } alone; beyond that,
 #include <stdio.h>
 #include <string.h>
 #include <ctype.h>
 #include <stdbool.h>
 #include <stdlib.h>
 
 
 #define NUM_CODE_POINTS 0x110000
 #define MAX_CODE_POINT 0x10ffff
and
 /* Read UnicodeData.txt and fill in the 'decomp' table to be the
    decompositions of characters for which both the character
    decomposed and all the code points in the decomposition are valid
    for some supported language version, and the 'all_decomp' table to
    be the decompositions of all characters without those
    constraints.  */
 
 static void
 {
   if (!f)
   for (;;)
     {
       char line[256];
       char *l;
 
       if (!fgets (line, sizeof (line), f))
        break;
       codepoint = strtoul (line, &l, 16);
       if (l == line || *l != ';')
       if (codepoint > MAX_CODE_POINT)
 
       do {
       } while (*l != ';');
        {
        }
     }
   if (ferror (f))
   fclose (f);
 }
are the common lines close to each other (plus the whole
write_copyright function).  I don't know whether that is little enough
to use just a 2022 copyright.

> > +			  /* We don't know what the next letter will be.
> > +			     It could be ISALNUM, then we are supposed
> > +			     to omit it, or it could be a space and then
> > +			     we should not omit it and need to compare it.
> > +			     Fortunately the only 3 names with hyphen
> > +			     followed by non-letter are
> > +			     U+0F0A TIBETAN MARK BKA- SHOG YIG MGO
> > +			     U+0FD0 TIBETAN MARK BKA- SHOG GI MGO RGYAN
> > +			     U+0FD0 TIBETAN MARK BSKA- SHOG GI MGO RGYAN
> > +			     and makeuname2c.cc verifies this.
> > +			     Furthermore, prefixes of NR2 generated
> > +			     ranges all end with a hyphen, but the generated
> > +			     part is then followed by alpha-numeric.
> > +			     So, let's just assume that - at the end of
> > +			     key is always followed by alphanumeric and
> > +			     so should be omitted.  */
> 
> Let's mention that makeuname2c.cc verifies this property.

I had "and makeuname2c.cc verifies this." there already a few lines before,
but I agree it is better to move that to the end.

> > +		      for (j = start; j < end; j++)
> > +			{
> > +			  /* Actually strlen, but we know strlen () <= 3.  */
> 
> Is this comment saying that you're using a loop instead of calling strlen
> because you know the result will be small?  That seems an odd choice.

Yes, but it is perhaps only a micro-optimization, and the Korean characters
may be used so rarely that it isn't worth it.
Our optimizers certainly aren't able to figure out that calling the library
function isn't the best idea when strlen is called on an array element of
size 4.  The string lengths are 0 in 3%, 1 in 44%,
2 in 47% and 3 in 6% of cases.
At least on x86_64, when I just use this_len = strlen (hangul_syllables[j]);
it calls the library routine.
Still, changed to this_len = strlen (hangul_syllables[j]);
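
For reference, the two variants can be compared in isolation.  This is
only a minimal sketch: short_names is a hypothetical stand-in for the
generated table (entries of at most 3 characters in a 4-byte array), and
the helper names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the generated table: every entry is a
   NUL-terminated string of at most 3 characters in a 4-byte array.  */
static const char short_names[][4] = { "", "G", "GG", "YEO", "A", "NG" };

/* Hand-written length loop, in the spirit of the original patch.  */
static size_t
short_len_loop (const char *s)
{
  size_t len = 0;
  while (s[len])
    {
      ++len;
      /* Entries are 4 bytes wide, so the terminator is within reach.  */
      assert (len < sizeof (short_names[0]));
    }
  return len;
}

/* Plain library call, as in the revised patch.  */
static size_t
short_len_strlen (const char *s)
{
  return strlen (s);
}
```

Both helpers return the same value for every entry; the only question is
whether avoiding the call is worth the extra lines.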

> > +		      /* Try to do a loose name lookup according to
> > +			 Unicode loose matching rule UAX44-LM2.
> 
> Maybe factor the loose lookup into a separate function?

Good idea.
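
To illustrate just the normalization step of UAX44-LM2 outside of libcpp,
here is a rough standalone sketch.  The name loose_normalize and its
signature are made up for this example, it uses the <ctype.h> functions
rather than the libcpp ISALNUM/TOUPPER macros, and it does not do the
table lookup or the U+1180 special case.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Copy NAME into OUT, dropping spaces, underscores and medial hyphens
   and converting letters to upper case.  OUT is NUL-terminated.
   Returns the number of characters written, or (size_t) -1 if OUT
   (OUT_SIZE bytes) is too small.  */
static size_t
loose_normalize (const char *name, char *out, size_t out_size)
{
  size_t len = strlen (name);
  size_t n = 0;

  assert (out_size > 0);
  for (size_t i = 0; i < len; i++)
    {
      unsigned char c = (unsigned char) name[i];

      if (c == ' ' || c == '_')
	continue;
      /* A hyphen is medial when it sits directly between two
	 alphanumeric characters of the original name.  */
      if (c == '-' && i > 0 && i + 1 < len
	  && isalnum ((unsigned char) name[i - 1])
	  && isalnum ((unsigned char) name[i + 1]))
	continue;
      if (n + 1 >= out_size)
	return (size_t) -1;
      out[n++] = (char) toupper (c);
    }
  out[n] = '\0';
  return n;
}
```

Note that a hyphen followed by a space, as in the TIBETAN MARK BKA- names
or in "O- E", is not medial in this sense and is kept.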

> > +	      bidi::kind kind;
> > +	      if (buffer->cur[-1] == 'N')
> > +		kind = get_bidi_named (pfile, buffer->cur, &loc);
> > +	      else
> > +		kind = get_bidi_ucn (pfile, buffer->cur,
> > +				     buffer->cur[-1] == 'U', &loc);
> 
> Hmm, I'm surprised that we're doing bidi checking before replacing escape
> characters with elements of the translation character set.  So now we need
> to check it three different ways.

It is unfortunate, but I'm afraid it is intentional.
After replacing the escape characters we lose the distinction
between characters written as UTF-8 in the source and escape sequences.
The former need to be treated differently because they are more dangerous
than the latter: bidi characters written as UTF-8 can already mislead the
reader about what the source contains in (some) text editors or however
else the user looks at the source code, while written as UCNs
(\u, \u{}, \U, \N{}) they can be dangerous
only when the program emits them at runtime unpaired.
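
As a rough illustration of why the distinction has to be made on the
unconverted source, the sketch below looks only at raw bytes.  It is not
how libcpp does it: the function names are hypothetical and only
U+202A..U+202E and U+2066..U+2069 are recognized here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return nonzero if the bytes at P are the UTF-8 encoding of one of
   U+202A..U+202E (E2 80 AA..AE) or U+2066..U+2069 (E2 81 A6..A9).  */
static int
raw_bidi_control_p (const unsigned char *p, size_t remaining)
{
  assert (p != NULL);
  if (remaining < 3 || p[0] != 0xE2)
    return 0;
  if (p[1] == 0x80)
    return p[2] >= 0xAA && p[2] <= 0xAE;
  if (p[1] == 0x81)
    return p[2] >= 0xA6 && p[2] <= 0xA9;
  return 0;
}

/* Count bidi control characters that appear literally (as UTF-8) in
   SRC.  The same characters spelled as UCNs are plain ASCII in the
   source and therefore are not counted.  */
static size_t
count_raw_bidi (const char *src)
{
  size_t len = strlen (src);
  size_t count = 0;

  for (size_t i = 0; i < len; i++)
    if (raw_bidi_control_p ((const unsigned char *) src + i, len - i))
      count++;
  return count;
}
```

A literal U+202E in the buffer is counted, while the six ASCII characters
\u202E are not; after conversion both would be the same code point.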

Here is the incremental diff and the full patch (with the huge generated
uname2c.h header removed so that it fits within the ml limits; that one
hasn't changed).

So far tested with
GXX_TESTSUITE_STDS=98,11,14,17,20,2b make check-gcc check-g++ RUNTESTFLAGS="dg.exp='named-uni* Wbidi*' cpp.exp=named-uni*"
which should cover it, but of course I will do a full bootstrap/regtest
later.

--- libcpp/charset.cc	2022-08-20 23:29:12.817996729 +0200
+++ libcpp/charset.cc	2022-08-25 10:34:16.652212078 +0200
@@ -1037,13 +1037,13 @@
 			     U+0F0A TIBETAN MARK BKA- SHOG YIG MGO
 			     U+0FD0 TIBETAN MARK BKA- SHOG GI MGO RGYAN
 			     U+0FD0 TIBETAN MARK BSKA- SHOG GI MGO RGYAN
-			     and makeuname2c.cc verifies this.
 			     Furthermore, prefixes of NR2 generated
 			     ranges all end with a hyphen, but the generated
 			     part is then followed by alpha-numeric.
 			     So, let's just assume that - at the end of
 			     key is always followed by alphanumeric and
-			     so should be omitted.  */
+			     so should be omitted.
+			     makeuname2c.cc verifies that this is true.  */
 			  ++q;
 			  continue;
 			}
@@ -1090,10 +1090,7 @@
 		      winner[i] = -1;
 		      for (j = start; j < end; j++)
 			{
-			  /* Actually strlen, but we know strlen () <= 3.  */
-			  for (this_len = 0; hangul_syllables[j][this_len];
-			       ++this_len)
-			    ;
+			  this_len = strlen (hangul_syllables[j]);
 			  if (len >= (size_t) this_len
 			      && this_len > max_len
 			      && memcmp (name, hangul_syllables[j],
@@ -1196,6 +1193,69 @@
   return -1;
 }
 
+/* Try to do a loose name lookup according to Unicode loose matching rule
+   UAX44-LM2.  First ignore medial hyphens, whitespace, underscore
+   characters and convert to upper case.  */
+
+static cppchar_t
+_cpp_uname2c_uax44_lm2 (const char *name, size_t len, char *canon_name)
+{
+  char name_after_uax44_lm2[uname2c_max_name_len];
+  char *q = name_after_uax44_lm2;
+  const char *p;
+
+  for (p = name; p < name + len; p++)
+    if (*p == '_' || *p == ' ')
+      continue;
+    else if (*p == '-' && p != name && ISALNUM (p[-1]) && ISALNUM (p[1]))
+      continue;
+    else if (q == name_after_uax44_lm2 + uname2c_max_name_len)
+      return -1;
+    else if (ISLOWER (*p))
+      *q++ = TOUPPER (*p);
+    else
+      *q++ = *p;
+
+  struct uname2c_data data;
+  data.canon_name = canon_name;
+  data.prev_char = ' ';
+  /* Hangul Jungseong O- E after UAX44-LM2 should be HANGULJUNGSEONGO-E
+     and so should match U+1180.  */
+  if (q - name_after_uax44_lm2 == sizeof ("HANGULJUNGSEONGO-E") - 1
+      && memcmp (name_after_uax44_lm2, "HANGULJUNGSEONGO-E",
+		 sizeof ("HANGULJUNGSEONGO-E") - 1) == 0)
+    {
+      name_after_uax44_lm2[sizeof ("HANGULJUNGSEONGO") - 1] = 'E';
+      --q;
+    }
+  cppchar_t result
+    = _cpp_uname2c (name_after_uax44_lm2, q - name_after_uax44_lm2,
+		    uname2c_tree, &data);
+
+  /* Unicode UAX44-LM2 exception:
+     U+116C HANGUL JUNGSEONG OE
+     U+1180 HANGUL JUNGSEONG O-E
+     We remove all medial hyphens when we shouldn't remove the U+1180 one.
+     The U+1180 entry sorts before U+116C lexicographically, so we get U+1180
+     in both cases.  Thus, if result is U+1180, check if user's name doesn't
+     have a hyphen there and adjust.  */
+  if (result == 0x1180)
+    {
+      while (p[-1] == ' ' || p[-1] == '_')
+	--p;
+      gcc_assert (TOUPPER (p[-1]) == 'E');
+      --p;
+      while (p[-1] == ' ' || p[-1] == '_')
+	--p;
+      if (p[-1] != '-')
+	{
+	  result = 0x116c;
+	  memcpy (canon_name + sizeof ("HANGUL JUNGSEONG O") - 1, "E", 2);
+	}
+    }
+  return result;
+}
+
 
 /* Returns 1 if C is valid in an identifier, 2 if C is valid except at
    the start of an identifier, and 0 if C is not valid in an
@@ -1448,96 +1508,26 @@
 		{
 		  /* If the name is longer than maximum length of a Unicode
 		     name, it can't be strictly valid.  */
-		  if ((size_t) (str - name) > uname2c_max_name_len)
-		    strict = false;
-
-		  if (!strict)
+		  if ((size_t) (str - name) > uname2c_max_name_len || !strict)
 		    result = -1;
 		  else
 		    result = _cpp_uname2c ((const char *) name, str - name,
 					   uname2c_tree, NULL);
 		  if (result == (cppchar_t) -1)
 		    {
-		      char name_after_uax44_lm2[uname2c_max_name_len];
-
 		      cpp_error (pfile, CPP_DL_ERROR,
 				 "\\N{%.*s} is not a valid universal "
 				 "character", (int) (str - name), name);
 
 		      /* Try to do a loose name lookup according to
-			 Unicode loose matching rule UAX44-LM2.
-			 First ignore medial hyphens, whitespace, underscore
-			 characters and convert to upper case.  */
-		      char *q = name_after_uax44_lm2;
-		      const uchar *p;
-		      for (p = name; p < str; p++)
-			if (*p == '_' || *p == ' ')
-			  continue;
-			else if (*p == '-'
-				 && p != name
-				 && ISALNUM (p[-1])
-				 && ISALNUM (p[1]))
-			  continue;
-			else if (q == name_after_uax44_lm2 + uname2c_max_name_len)
-			  break;
-			else if (ISLOWER (*p))
-			  *q++ = TOUPPER (*p);
-			else
-			  *q++ = *p;
-		      if (p == str)
-			{
-			  char canon_name[uname2c_max_name_len + 1];
-			  struct uname2c_data data;
-			  data.canon_name = canon_name;
-			  data.prev_char = ' ';
-			  /* Hangul Jungseong O- E
-			     after UAX44-LM2 should be HANGULJUNGSEONGO-E
-			     and so should match U+1180.  */
-			  if (q - name_after_uax44_lm2
-			      == sizeof ("HANGULJUNGSEONGO-E") - 1
-			      && memcmp (name_after_uax44_lm2,
-					 "HANGULJUNGSEONGO-E",
-					 sizeof ("HANGULJUNGSEONGO-E")
-					 - 1) == 0)
-			    {
-			      name_after_uax44_lm2[sizeof ("HANGULJUNGSEONGO")
-						   - 1] = 'E';
-			      --q;
-			    }
-			  result = _cpp_uname2c (name_after_uax44_lm2,
-						 q - name_after_uax44_lm2,
-						 uname2c_tree, &data);
-			  /* Unicode UAX44-LM2 exception:
-			     U+116C HANGUL JUNGSEONG OE
-			     U+1180 HANGUL JUNGSEONG O-E
-			     We remove all medial hyphens when we shouldn't
-			     remote the U+1180 one.  The U+1180 entry sorts
-			     before U+116C lexicographilly, so we get U+1180
-			     in both cases.  Thus, if result is U+1180,
-			     check if user's name doesn't have a hyphen there
-			     and adjust.  */
-			  if (result == 0x1180)
-			    {
-			      while (p[-1] == ' ' || p[-1] == '_')
-				--p;
-			      gcc_assert (TOUPPER (p[-1]) == 'E');
-			      --p;
-			      while (p[-1] == ' ' || p[-1] == '_')
-				--p;
-			      if (p[-1] != '-')
-				{
-				  result = 0x116c;
-				  memcpy (canon_name
-					  + sizeof ("HANGUL JUNGSEONG O") - 1,
-					  "E", 2);
-				}
-			    }
-			  if (result != (cppchar_t) -1)
-			    cpp_error (pfile, CPP_DL_NOTE,
-				       "did you mean \\N{%s}?",
-				       canon_name);
-			}
-		      if (result == (cppchar_t) -1)
+			 Unicode loose matching rule UAX44-LM2.  */
+		      char canon_name[uname2c_max_name_len + 1];
+		      result = _cpp_uname2c_uax44_lm2 ((const char *) name,
+						       str - name, canon_name);
+		      if (result != (cppchar_t) -1)
+			cpp_error (pfile, CPP_DL_NOTE,
+				   "did you mean \\N{%s}?", canon_name);
+		      else
 			result = 0x40;
 		    }
 		}


	Jakub