From mboxrd@z Thu Jan  1 00:00:00 1970
From: Tom Tromey <tom@tromey.com>
To: gdb-patches@sourceware.org
Cc: Tom Tromey <tom@tromey.com>
Subject: [PATCH] Allow non-ASCII characters in Rust identifiers
Date: Wed, 26 Jan 2022 16:15:01 -0700
Message-Id: <20220126231501.1031201-1-tom@tromey.com>
X-Mailer: git-send-email 2.31.1
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
X-BeenThere: gdb-patches@sourceware.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gdb-patches mailing list <gdb-patches.sourceware.org>
List-Unsubscribe: <https://sourceware.org/mailman/options/gdb-patches>,
 <mailto:gdb-patches-request@sourceware.org?subject=unsubscribe>
List-Archive: <https://sourceware.org/pipermail/gdb-patches/>
List-Post: <mailto:gdb-patches@sourceware.org>
List-Help: <mailto:gdb-patches-request@sourceware.org?subject=help>
List-Subscribe: <https://sourceware.org/mailman/listinfo/gdb-patches>,
 <mailto:gdb-patches-request@sourceware.org?subject=subscribe>
X-List-Received-Date: Wed, 26 Jan 2022 23:17:38 -0000

Rust 1.53 stabilized ("ungated") support for non-ASCII identifiers
quite a while ago now, but gdb's Rust expression parser did not accept
them.  This is PR rust/20166.

This patch fixes the problem by treating any byte with the high bit
set as an identifier component.  It seemed simplest to pass such bytes
through unchanged; checking them against Rust's Unicode identifier
rules didn't seem worthwhile.

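The new test also verifies that such characters are accepted in string
and character literals.  Character literals required a bit of work in
the lexer: the bytes up to the closing quote are now converted from
the host charset to UTF-32, and a literal that converts to more than
one character is rejected as overlong.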

Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=20166
---
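Note for reviewers (not part of the commit message): the identifier
rule described above can be tried outside gdb with the small
standalone sketch below.  The names ident_start_p, ident_continue_p
and ident_length are hypothetical stand-ins, not functions from
rust-parse.c, and the '$' handling for gdb convenience variables is
left out for brevity.  It only illustrates that a byte with the high
bit set is accepted without any UTF-8 validation.

```cpp
#include <cassert>
#include <cstddef>

/* Hypothetical stand-in for rust_identifier_start_p.  Any byte with
   the high bit set is accepted; no UTF-8 validation is attempted.  */
static bool
ident_start_p (char c)
{
  return ((c >= 'a' && c <= 'z')
          || (c >= 'A' && c <= 'Z')
          || c == '_'
          || (c & 0x80) != 0);
}

/* Bytes allowed after the first one: the above, plus ASCII digits.  */
static bool
ident_continue_p (char c)
{
  return ident_start_p (c) || (c >= '0' && c <= '9');
}

/* Return the number of bytes in the identifier at the start of TEXT,
   or 0 if TEXT does not start with an identifier.  */
static std::size_t
ident_length (const char *text)
{
  if (!ident_start_p (text[0]))
    return 0;
  std::size_t len = 1;
  while (ident_continue_p (text[len]))
    ++len;
  return len;
}

/* A few spot checks; the byte strings are the UTF-8 encodings of the
   identifiers used in unicode.rs.  */
static void
check_ident_examples ()
{
  /* "cç" is three bytes: 'c' followed by the two-byte encoding of ç.  */
  assert (ident_length ("c\xc3\xa7 + 1") == 3);
  /* U+1D56F is a single four-byte sequence.  */
  assert (ident_length ("\xf0\x9d\x95\xaf") == 4);
  /* A leading digit still does not start an identifier.  */
  assert (ident_length ("9x") == 0);
}
```

Because the check is per byte, a malformed UTF-8 sequence is passed
through as well; the symbol lookup that follows would then simply fail
to find a match.
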
 gdb/rust-parse.c                   | 70 ++++++++++++++++++++++--------
 gdb/testsuite/gdb.rust/unicode.exp | 51 ++++++++++++++++++++++
 gdb/testsuite/gdb.rust/unicode.rs  | 26 +++++++++++
 3 files changed, 129 insertions(+), 18 deletions(-)
 create mode 100644 gdb/testsuite/gdb.rust/unicode.exp
 create mode 100644 gdb/testsuite/gdb.rust/unicode.rs
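
Note for reviewers: the character-literal handling can be sketched
standalone as follows.  This is only an approximation of the patch.
It assumes the host charset is UTF-8 and decodes by hand, whereas the
patch calls convert_between_encodings and checks the size of the
UTF-32 result.  The names decode_utf8, lit_status and lex_char_literal
are hypothetical, and backslash escapes are not handled here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

/* Decode one UTF-8 sequence of at most AVAIL bytes starting at S.
   Store the code point in *CP and return the sequence length, or 0 if
   the bytes are malformed.  Overlong encodings and surrogates are not
   rejected in this sketch.  */
static int
decode_utf8 (const unsigned char *s, std::size_t avail, std::uint32_t *cp)
{
  if (avail == 0)
    return 0;

  int len;
  std::uint32_t value;
  if (s[0] < 0x80)
    { len = 1; value = s[0]; }
  else if ((s[0] & 0xe0) == 0xc0)
    { len = 2; value = s[0] & 0x1f; }
  else if ((s[0] & 0xf0) == 0xe0)
    { len = 3; value = s[0] & 0x0f; }
  else if ((s[0] & 0xf8) == 0xf0)
    { len = 4; value = s[0] & 0x07; }
  else
    return 0;

  if ((std::size_t) len > avail)
    return 0;
  for (int i = 1; i < len; ++i)
    {
      if ((s[i] & 0xc0) != 0x80)
        return 0;
      value = (value << 6) | (s[i] & 0x3f);
    }
  *cp = value;
  return len;
}

enum lit_status { LIT_OK, LIT_UNTERMINATED, LIT_EMPTY, LIT_MALFORMED,
                  LIT_TOO_LONG };

/* TEXT points just past the opening quote.  Find the closing quote,
   then require that the bytes before it form exactly one character.  */
static lit_status
lex_char_literal (const char *text, std::uint32_t *value)
{
  std::size_t quote = 0;
  while (text[quote] != '\0' && text[quote] != '\'')
    ++quote;
  if (text[quote] == '\0')
    return LIT_UNTERMINATED;
  if (quote == 0)
    return LIT_EMPTY;

  int len = decode_utf8 ((const unsigned char *) text, quote, value);
  if (len == 0)
    return LIT_MALFORMED;
  if ((std::size_t) len != quote)
    return LIT_TOO_LONG;
  return LIT_OK;
}

/* Spot checks mirroring the cases in unicode.exp.  */
static void
check_char_literal_examples ()
{
  std::uint32_t v = 0;
  /* U+1D56F, the value 120175 expected by the new test.  */
  assert (lex_char_literal ("\xf0\x9d\x95\xaf'", &v) == LIT_OK);
  assert (v == 120175);
  /* 'çc' holds two characters, like the "overlong" test case.  */
  assert (lex_char_literal ("\xc3\xa7" "c'", &v) == LIT_TOO_LONG);
}
```

Going through iconv, as the patch does, has the advantage of working
for any host charset rather than only UTF-8.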

diff --git a/gdb/rust-parse.c b/gdb/rust-parse.c
index 31a1ee3b38f..aa215f9cf2a 100644
--- a/gdb/rust-parse.c
+++ b/gdb/rust-parse.c
@@ -33,6 +33,12 @@
 
 using namespace expr;
 
+#if WORDS_BIGENDIAN
+#define UTF32 "UTF-32BE"
+#else
+#define UTF32 "UTF-32LE"
+#endif
+
 /* A regular expression for matching Rust numbers.  This is split up
    since it is very long and this gives us a way to comment the
    sections.  */
@@ -577,6 +583,35 @@ rust_parser::lex_escape (int is_byte)
   return result;
 }
 
+/* A helper for lex_character.  Search forward for the closing single
+   quote, then convert the bytes from the host charset to UTF-32.  */
+
+static uint32_t
+lex_multibyte_char (const char *text, int *len)
+{
+  /* Scan forward to the closing quote or the end of the input.  The
+     size of the converted result is checked below.  */
+  int quote;
+  gdb_assert (text[0] != '\'');
+  for (quote = 1; text[quote] != '\0' && text[quote] != '\''; ++quote)
+    ;
+  *len = quote;
+  /* The caller will issue an error.  */
+  if (text[quote] == '\0')
+    return 0;
+
+  auto_obstack result;
+  convert_between_encodings (host_charset (), UTF32, (const gdb_byte *) text,
+			     quote, 1, &result, translit_none);
+
+  int size = obstack_object_size (&result);
+  if (size > 4)
+    error (_("overlong character literal"));
+  uint32_t value;
+  memcpy (&value, obstack_finish (&result), size);
+  return value;
+}
+
 /* Lex a character constant.  */
 
 int
@@ -592,13 +627,15 @@ rust_parser::lex_character ()
     }
   gdb_assert (pstate->lexptr[0] == '\'');
   ++pstate->lexptr;
-  /* This should handle UTF-8 here.  */
-  if (pstate->lexptr[0] == '\\')
+  if (pstate->lexptr[0] == '\'')
+    error (_("empty character literal"));
+  else if (pstate->lexptr[0] == '\\')
     value = lex_escape (is_byte);
   else
     {
-      value = pstate->lexptr[0] & 0xff;
-      ++pstate->lexptr;
+      int len;
+      value = lex_multibyte_char (&pstate->lexptr[0], &len);
+      pstate->lexptr += len;
     }
 
   if (pstate->lexptr[0] != '\'')
@@ -695,16 +732,9 @@ rust_parser::lex_string ()
 	  if (is_byte)
 	    obstack_1grow (&obstack, value);
 	  else
-	    {
-#if WORDS_BIGENDIAN
-#define UTF32 "UTF-32BE"
-#else
-#define UTF32 "UTF-32LE"
-#endif
-	      convert_between_encodings (UTF32, "UTF-8", (gdb_byte *) &value,
-					 sizeof (value), sizeof (value),
-					 &obstack, translit_none);
-	    }
+	    convert_between_encodings (UTF32, "UTF-8", (gdb_byte *) &value,
+				       sizeof (value), sizeof (value),
+				       &obstack, translit_none);
 	}
       else if (pstate->lexptr[0] == '\0')
 	error (_("Unexpected EOF in string"));
@@ -746,7 +776,10 @@ rust_identifier_start_p (char c)
   return ((c >= 'a' && c <= 'z')
 	  || (c >= 'A' && c <= 'Z')
 	  || c == '_'
-	  || c == '$');
+	  || c == '$'
+	  /* Allow any non-ASCII byte to start an identifier.  There
+	     doesn't seem to be a need to be picky about this.  */
+	  || (c & 0x80) != 0);
 }
 
 /* Lex an identifier.  */
@@ -772,13 +805,14 @@ rust_parser::lex_identifier ()
 
   ++pstate->lexptr;
 
-  /* For the time being this doesn't handle Unicode rules.  Non-ASCII
-     identifiers are gated anyway.  */
+  /* Allow any non-ASCII character here.  This "handles" UTF-8 by
+     passing it through.  */
   while ((pstate->lexptr[0] >= 'a' && pstate->lexptr[0] <= 'z')
 	 || (pstate->lexptr[0] >= 'A' && pstate->lexptr[0] <= 'Z')
 	 || pstate->lexptr[0] == '_'
 	 || (is_gdb_var && pstate->lexptr[0] == '$')
-	 || (pstate->lexptr[0] >= '0' && pstate->lexptr[0] <= '9'))
+	 || (pstate->lexptr[0] >= '0' && pstate->lexptr[0] <= '9')
+	 || (pstate->lexptr[0] & 0x80) != 0)
     ++pstate->lexptr;
 
 
diff --git a/gdb/testsuite/gdb.rust/unicode.exp b/gdb/testsuite/gdb.rust/unicode.exp
new file mode 100644
index 00000000000..9de0a0e724f
--- /dev/null
+++ b/gdb/testsuite/gdb.rust/unicode.exp
@@ -0,0 +1,51 @@
+# Copyright (C) 2022 Free Software Foundation, Inc.
+
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 3 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+
+# Test non-ASCII identifiers.
+
+load_lib rust-support.exp
+if {[skip_rust_tests]} {
+    continue
+}
+
+# Non-ASCII identifiers were allowed starting in 1.53.
+set v [split [rust_compiler_version] .]
+if {[lindex $v 0] == 1 && [lindex $v 1] < 53} {
+    untested "this test requires rust 1.53 or greater"
+    return -1
+}
+
+# Enable basic use of UTF-8.  LC_ALL gets reset for each testfile.
+setenv LC_ALL C.UTF-8
+
+standard_testfile .rs
+if {[prepare_for_testing "failed to prepare" $testfile $srcfile {debug rust}]} {
+    return -1
+}
+
+set line [gdb_get_line_number "set breakpoint here"]
+if {![runto ${srcfile}:$line]} {
+    untested "could not run to breakpoint"
+    return -1
+}
+
+gdb_test "print 𝕯" " = 98" "print D"
+gdb_test "print \"𝕯\"" " = \"𝕯\"" "print D in string"
+# This output is maybe not ideal, but it also isn't incorrect.
+gdb_test "print '𝕯'" " = 120175 '\\\\u\\\{01d56f\\\}'" \
+    "print D as char"
+gdb_test "print cç" " = 97" "print cc"
+
+gdb_test "print 'çc'" "overlong character literal" "print cc as char"
diff --git a/gdb/testsuite/gdb.rust/unicode.rs b/gdb/testsuite/gdb.rust/unicode.rs
new file mode 100644
index 00000000000..c6ca90e6450
--- /dev/null
+++ b/gdb/testsuite/gdb.rust/unicode.rs
@@ -0,0 +1,26 @@
+// Copyright (C) 2022 Free Software Foundation, Inc.
+
+// This program is free software; you can redistribute it and/or modify
+// it under the terms of the GNU General Public License as published by
+// the Free Software Foundation; either version 3 of the License, or
+// (at your option) any later version.
+//
+// This program is distributed in the hope that it will be useful,
+// but WITHOUT ANY WARRANTY; without even the implied warranty of
+// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+// GNU General Public License for more details.
+//
+// You should have received a copy of the GNU General Public License
+// along with this program.  If not, see <http://www.gnu.org/licenses/>.
+
+#![allow(dead_code)]
+#![allow(unused_variables)]
+#![allow(unused_assignments)]
+#![allow(uncommon_codepoints)]
+#![allow(non_snake_case)]
+
+fn main() {
+    let 𝕯 = 98;
+    let cç = 97;
+    println!("{}, {}", 𝕯, cç);        // set breakpoint here
+}
-- 
2.31.1