support C/C++ identifiers named with non-ASCII characters

* support C/C++ identifiers named with non-ASCII characters
@ 2018-05-21  9:54 張俊芝
  2018-05-21 14:21 ` Simon Marchi
  0 siblings, 1 reply; 20+ messages in thread
From: 張俊芝 @ 2018-05-21  9:54 UTC (permalink / raw)
  To: gdb-patches

[-- Attachment #1: Type: text/plain, Size: 1128 bytes --]

Hello, team.

This patch fixes the bug at
https://sourceware.org/bugzilla/show_bug.cgi?id=22973 .

Here is how to test the patch:

Step 1. If you are using Clang or any other C compiler that has implemented
         support for Unicode identifiers, then create a C file with the
         following content:

int main(int 參量, char* 參[])
{
   struct 集
   {
     int 數[3];
   } 集 = {100, 200, 300};
   int 序 = 2;
   return 0;
}

Or, if you are using GCC, create a C file with the following content as a
workaround (GCC still doesn't actually support Unicode identifiers in 2018,
which is a pity):

int main(int \u53C3\u91CF, char* \u53C3[])
{
   struct \u96C6
   {
     int \u6578[3];
   } \u96C6 = {100, 200, 300};
   int \u5E8F = 2;
   return 0;
}

Step 2. Compile the C file.

Step 3. Run GDB on the compiled executable and set a breakpoint on "return 0".

Step 4. Run until the breakpoint.

Step 5. Test the following commands to see if they work:
         p 參量
         p 參
         p 集
         p 集.數
         p 集.數[序]

Thanks for your review.

[-- Attachment #2: ChangeLog --]
[-- Type: text/plain, Size: 230 bytes --]

2018-05-20  張俊芝  <zjz@zjz.name>

	* gdb/c-exp.y (is_identifier_separator): New function.
	(lex_one_token): Now recognizes C and C++ Unicode identifiers by using
	is_identifier_separator to determine the boundary of a token.

[-- Attachment #3: diff --]
[-- Type: text/plain, Size: 2948 bytes --]

diff --git a/gdb/c-exp.y b/gdb/c-exp.y
index 5e10d2a3b4..b0dd6c7caf 100644
--- a/gdb/c-exp.y
+++ b/gdb/c-exp.y
@@ -73,6 +73,8 @@ void yyerror (const char *);
 
 static int type_aggregate_p (struct type *);
 
+static bool is_identifier_separator (char);
+
 %}
 
 /* Although the yacc "value" of an expression is not used,
@@ -1718,6 +1720,53 @@ type_aggregate_p (struct type *type)
 	      && TYPE_DECLARED_CLASS (type)));
 }
 
+/* While iterating all the characters in an identifier, an identifier separator
+   is a boundary where we know the iteration is done. */
+
+static bool
+is_identifier_separator (char c)
+{
+  switch (c)
+    {
+    case ' ':
+    case '\t':
+    case '\n':
+    case '\0':
+    case '\'':
+    case '"':
+    case '\\':
+    case '(':
+    case ')':
+    case ',':
+    case '.':
+    case '+':
+    case '-':
+    case '*':
+    case '/':
+    case '|':
+    case '&':
+    case '^':
+    case '~':
+    case '!':
+    case '@':
+    case '[':
+    case ']':
+    /* '<' should not be a token separator, because it can be an open angle
+       bracket followed by a nested template identifier in C++. */
+    case '>':
+    case '?':
+    case ':':
+    case '=':
+    case '{':
+    case '}':
+    case ';':
+      return true;
+    default:
+      break;
+    }
+  return false;
+}
+
 /* Validate a parameter typelist.  */
 
 static void
@@ -1920,7 +1969,7 @@ parse_number (struct parser_state *par_state,
 	 FIXME: This check is wrong; for example it doesn't find overflow
 	 on 0x123456789 when LONGEST is 32 bits.  */
       if (c != 'l' && c != 'u' && n != 0)
-	{	
+	{
 	  if ((unsigned_p && (ULONGEST) prevn >= (ULONGEST) n))
 	    error (_("Numeric constant too large."));
 	}
@@ -2741,16 +2790,13 @@ lex_one_token (struct parser_state *par_state, bool *is_quoted_name)
       }
     }
 
-  if (!(c == '_' || c == '$'
-	|| (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')))
+  if (is_identifier_separator(c))
     /* We must have come across a bad character (e.g. ';').  */
     error (_("Invalid character '%c' in expression."), c);
 
   /* It's a name.  See how long it is.  */
   namelen = 0;
-  for (c = tokstart[namelen];
-       (c == '_' || c == '$' || (c >= '0' && c <= '9')
-	|| (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '<');)
+  for (c = tokstart[namelen]; !is_identifier_separator(c);)
     {
       /* Template parameter lists are part of the name.
 	 FIXME: This mishandles `print $a<4&&$a>3'.  */
@@ -2932,7 +2978,7 @@ classify_name (struct parser_state *par_state, const struct block *block,
 	 filename.  However, if the name was quoted, then it is better
 	 to check for a filename or a block, since this is the only
 	 way the user has of requiring the extension to be used.  */
-      if ((is_a_field_of_this.type == NULL && !is_after_structop) 
+      if ((is_a_field_of_this.type == NULL && !is_after_structop)
 	  || is_quoted_name)
 	{
 	  /* See if it's a file name. */

* Re: support C/C++ identifiers named with non-ASCII characters
  2018-05-21  9:54 support C/C++ identifiers named with non-ASCII characters 張俊芝
@ 2018-05-21 14:21 ` Simon Marchi
  2018-05-21 15:27   ` Paul.Koning
       [not found]   ` <1b915196-3e97-4892-7426-be4211fe7889@zjz.name>
  0 siblings, 2 replies; 20+ messages in thread
From: Simon Marchi @ 2018-05-21 14:21 UTC (permalink / raw)
  To: 張俊芝, gdb-patches

On 2018-05-21 04:52 AM, 張俊芝 wrote:
> Hello, team.
> 
> This patch fixes the bug at
> https://sourceware.org/bugzilla/show_bug.cgi?id=22973 .
> 
> Here is how to test the patch:
> 
> Step 1. If you are using Clang or any other C compiler that has implemented
>          support for Unicode identifiers, then create a C file with the
>          following content:
> 
> int main(int 參量, char* 參[])
> {
>    struct 集
>    {
>      int 數[3];
>    } 集 = {100, 200, 300};
>    int 序 = 2;
>    return 0;
> }
> 
> Or, if you are using GCC, create a C file with the following content as a
> workaround (GCC still doesn't actually support Unicode identifiers in 2018,
> which is a pity):
> 
> int main(int \u53C3\u91CF, char* \u53C3[])
> {
>    struct \u96C6
>    {
>      int \u6578[3];
>    } \u96C6 = {100, 200, 300};
>    int \u5E8F = 2;
>    return 0;
> }
> 
> Step 2. Compile the C file.
> 
> Step 3. Run GDB on the compiled executable and set a breakpoint on "return 0".
> 
> Step 4. Run until the breakpoint.
> 
> Step 5. Test the following commands to see if they work:
>          p 參量
>          p 參
>          p 集
>          p 集.數
>          p 集.數[序]
> 
> Thanks for your review.
> 

Hi Zhang,

Thanks for the patch, I tested it quickly, it seems to work as expected.

Could you please write a small test case in testsuite/gdb.base with the example
you gave, so we make sure this doesn't get broken later?  If you can write it
in such a way that both clang and gcc understand it would be better, because
most people run the testsuite using gcc to compile test programs.

I am not a specialist in lexing and parsing C, so can you explain quickly why
you think this is a good solution?  Quickly, I understand that you change the
identifier recognition algorithm to a blacklist of characters rather than
a whitelist, so bytes that are not recognized (such as those that compose
the utf-8 encoded characters) are not rejected.
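
To make the whitelist/blacklist difference concrete, here is a minimal
sketch in C.  It is not GDB's code: whitelist_accepts and blacklist_accepts
are hypothetical names, the separator list is abbreviated, and UTF-8 input
is assumed.  The lead byte of 集 (0xE9 in UTF-8) fails the whitelist test
but passes the blacklist test:

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for the old check: accept only ASCII
   letters, digits, '_' and '$'.  */
static bool
whitelist_accepts (unsigned char c)
{
  return (c == '_' || c == '$'
	  || (c >= '0' && c <= '9')
	  || (c >= 'a' && c <= 'z')
	  || (c >= 'A' && c <= 'Z'));
}

/* Hypothetical stand-in for the new check: reject only a fixed set
   of ASCII separators (abbreviated here) and accept everything
   else, including every byte above 0x7F.  */
static bool
blacklist_accepts (unsigned char c)
{
  return c != '\0' && strchr (" \t\n.,()[]+-*/", c) == NULL;
}
```

Both functions agree on plain ASCII names; they differ only on bytes
outside the ASCII range, which is where UTF-8 multibyte sequences live.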

Given unlimited time, would the right solution be to use a lib to parse the
string as utf-8, and reject strings that are not valid utf-8?

Here are some nits and formatting comments:

> +static bool is_identifier_separator (char);

You don't have to forward declare the function if it's not necessary.

> +    /* '<' should not be a token separator, because it can be an open angle
> +       bracket followed by a nested template identifier in C++. */

Please use two spaces after the final period (...C++.  */).

> +  if (is_identifier_separator(c))

Please use a space before the parentheses:

  is_identifier_separator (c)

> +  for (c = tokstart[namelen]; !is_identifier_separator(c);)

Here too.

Thanks!

Simon

* Re: support C/C++ identifiers named with non-ASCII characters
  2018-05-21 14:21 ` Simon Marchi
@ 2018-05-21 15:27   ` Paul.Koning
  2018-05-21 16:16     ` Eli Zaretskii
       [not found]   ` <1b915196-3e97-4892-7426-be4211fe7889@zjz.name>
  1 sibling, 1 reply; 20+ messages in thread
From: Paul.Koning @ 2018-05-21 15:27 UTC (permalink / raw)
  To: simark; +Cc: zjz, gdb-patches



> On May 21, 2018, at 10:03 AM, Simon Marchi <simark@simark.ca> wrote:
> 
> ...
> I am not a specialist in lexing and parsing C, so can you explain quickly why
> you think this is a good solution?  Quickly, I understand that you change the
> identifier recognition algorithm to a blacklist of characters rather than
> a whitelist, so bytes that are not recognized (such as those that compose
> the utf-8 encoded characters) are not rejected.
> 
> Given unlimited time, would the right solution be to use a lib to parse the
> string as utf-8, and reject strings that are not valid utf-8?

This sounds like a scenario where "stringprep" is helpful (or necessary).  It validates strings to be valid utf-8, can check that they obey certain rules (such as "word elements only" which rejects punctuation and the like), and can convert them to a canonical form so equal strings match whether they are encoded the same or not.

	paul

* Re: support C/C++ identifiers named with non-ASCII characters
  2018-05-21 15:27   ` Paul.Koning
@ 2018-05-21 16:16     ` Eli Zaretskii
  2018-05-21 18:34       ` Paul.Koning
  0 siblings, 1 reply; 20+ messages in thread
From: Eli Zaretskii @ 2018-05-21 16:16 UTC (permalink / raw)
  To: Paul.Koning; +Cc: simark, zjz, gdb-patches

> From: <Paul.Koning@dell.com>
> CC: <zjz@zjz.name>, <gdb-patches@sourceware.org>
> Date: Mon, 21 May 2018 14:12:12 +0000
> 
> > Given unlimited time, would the right solution be to use a lib to parse the
> > string as utf-8, and reject strings that are not valid utf-8?
> 
> This sounds like a scenario where "stringprep" is helpful (or necessary).  It validates strings to be valid utf-8, can check that they obey certain rules (such as "word elements only" which rejects punctuation and the like), and can convert them to a canonical form so equal strings match whether they are encoded the same or not.

Is it a fact that non-ASCII identifiers must be encoded in UTF-8, and
can not include invalid UTF-8 sequences?

* Re: support C/C++ identifiers named with non-ASCII characters
       [not found]   ` <1b915196-3e97-4892-7426-be4211fe7889@zjz.name>
@ 2018-05-21 18:00     ` 張俊芝
  2018-05-21 18:03     ` 張俊芝
  1 sibling, 0 replies; 20+ messages in thread
From: 張俊芝 @ 2018-05-21 18:00 UTC (permalink / raw)
  To: gdb-patches

[-- Attachment #1: Type: text/plain, Size: 2879 bytes --]

Simon Marchi wrote on 2018/5/21 at 10:03 PM:
> On 2018-05-21 04:52 AM, 張俊芝 wrote:
> 
> Hi Zhang,
> 
> Thanks for the patch, I tested it quickly, it seems to work as expected.
> 
> Could you please write a small test case in testsuite/gdb.base with the example
> you gave, so we make sure this doesn't get broken later?  If you can write it
> in such a way that both clang and gcc understand it would be better, because
> most people run the testsuite using gcc to compile test programs.
> 
> I am not a specialist in lexing and parsing C, so can you explain quickly why
> you think this is a good solution?  Quickly, I understand that you change the
> identifier recognition algorithm to a blacklist of characters rather than
> a whitelist, so bytes that are not recognized (such as those that compose
> the utf-8 encoded characters) are not rejected.
> 
> Given unlimited time, would the right solution be to use a lib to parse the
> string as utf-8, and reject strings that are not valid utf-8?
> 
> Here are some nits and formatting comments:
> 
>> +static bool is_identifier_separator (char);
> 
> You don't have to forward declare the function if it's not necessary.
> 
>> +    /* '<' should not be a token separator, because it can be an open angle
>> +       bracket followed by a nested template identifier in C++. */
> 
> Please use two spaces after the final period (...C++.  */).
> 
>> +  if (is_identifier_separator(c))
> 
> Please use a space before the parentheses:
> 
>    is_identifier_separator (c)
> 
> 
>> +  for (c = tokstart[namelen]; !is_identifier_separator(c);)
> 
> Here too.
> 
> Thanks!
> 
> Simon
> 

Thank you for the reply, Simon.

This new diff addresses all the code style issues you mentioned.

Yes, you are right in that the blacklist is limited. Actually,
`is_identifier_separator` only blacklists the ASCII characters that are
invalid in an identifier, leaving all invalid non-ASCII characters unchecked.

So it seems it would be better if non-ASCII characters were also checked.
However, unfortunately, GDB is aware of neither the encoding of the
terminal input nor the encoding of the symbolic information generated
in the executable. So the blacklist is restricted to the invalid
ASCII characters in order to support all ASCII-compatible encodings.

Having said that, I find that it does no harm to the user if we only
check the ASCII characters. If the user tries to print an identifier
that includes an invalid non-ASCII character, say, p 測】, where 】 is
invalid, they will get an error message:

No symbol "測】" in current context.

which doesn't seem any worse than an error message like:

Invalid character "】" in expression.

Perhaps the former is even more intuitive.

So personally, I think it might be safe enough to use this limited
blacklist method.
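
A rough sketch of how such a scan behaves on UTF-8 input.  The separator
list is abbreviated, and is_separator and name_length are hypothetical
stand-ins for illustration, not the functions in c-exp.y:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Abbreviated, hypothetical separator test: NUL plus a few ASCII
   separators.  Bytes above 0x7F never match.  */
static bool
is_separator (char c)
{
  return c == '\0' || strchr (" \t\n.,()[]+-*/=;", c) != NULL;
}

/* Return the number of bytes in the name starting at TOKSTART,
   scanning until the first separator.  */
static size_t
name_length (const char *tokstart)
{
  size_t namelen = 0;

  while (!is_separator (tokstart[namelen]))
    namelen++;
  return namelen;
}
```

Given the UTF-8 bytes of 集.數, the scan stops at the '.', so the name is
the three bytes of 集.  Given the bytes of 測】, all six bytes are taken as
one name, and it is the later symbol lookup that reports the failure.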

[-- Attachment #2: diff --]
[-- Type: text/plain, Size: 2741 bytes --]

diff --git a/gdb/c-exp.y b/gdb/c-exp.y
index 5e10d2a3b4..e10b6b474d 100644
--- a/gdb/c-exp.y
+++ b/gdb/c-exp.y
@@ -1718,6 +1718,53 @@ type_aggregate_p (struct type *type)
 	      && TYPE_DECLARED_CLASS (type)));
 }
 
+/* While iterating all the characters in an identifier, an identifier separator
+   is a boundary where we know the iteration is done. */
+
+static bool
+is_identifier_separator (char c)
+{
+  switch (c)
+    {
+    case ' ':
+    case '\t':
+    case '\n':
+    case '\0':
+    case '\'':
+    case '"':
+    case '\\':
+    case '(':
+    case ')':
+    case ',':
+    case '.':
+    case '+':
+    case '-':
+    case '*':
+    case '/':
+    case '|':
+    case '&':
+    case '^':
+    case '~':
+    case '!':
+    case '@':
+    case '[':
+    case ']':
+    /* '<' should not be a token separator, because it can be an open angle
+       bracket followed by a nested template identifier in C++.  */
+    case '>':
+    case '?':
+    case ':':
+    case '=':
+    case '{':
+    case '}':
+    case ';':
+      return true;
+    default:
+      break;
+    }
+  return false;
+}
+
 /* Validate a parameter typelist.  */
 
 static void
@@ -1920,7 +1967,7 @@ parse_number (struct parser_state *par_state,
 	 FIXME: This check is wrong; for example it doesn't find overflow
 	 on 0x123456789 when LONGEST is 32 bits.  */
       if (c != 'l' && c != 'u' && n != 0)
-	{	
+	{
 	  if ((unsigned_p && (ULONGEST) prevn >= (ULONGEST) n))
 	    error (_("Numeric constant too large."));
 	}
@@ -2741,16 +2788,13 @@ lex_one_token (struct parser_state *par_state, bool *is_quoted_name)
       }
     }
 
-  if (!(c == '_' || c == '$'
-	|| (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')))
+  if (is_identifier_separator (c))
     /* We must have come across a bad character (e.g. ';').  */
     error (_("Invalid character '%c' in expression."), c);
 
   /* It's a name.  See how long it is.  */
   namelen = 0;
-  for (c = tokstart[namelen];
-       (c == '_' || c == '$' || (c >= '0' && c <= '9')
-	|| (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '<');)
+  for (c = tokstart[namelen]; !is_identifier_separator (c);)
     {
       /* Template parameter lists are part of the name.
 	 FIXME: This mishandles `print $a<4&&$a>3'.  */
@@ -2932,7 +2976,7 @@ classify_name (struct parser_state *par_state, const struct block *block,
 	 filename.  However, if the name was quoted, then it is better
 	 to check for a filename or a block, since this is the only
 	 way the user has of requiring the extension to be used.  */
-      if ((is_a_field_of_this.type == NULL && !is_after_structop) 
+      if ((is_a_field_of_this.type == NULL && !is_after_structop)
 	  || is_quoted_name)
 	{
 	  /* See if it's a file name. */

* Re: support C/C++ identifiers named with non-ASCII characters
       [not found]   ` <1b915196-3e97-4892-7426-be4211fe7889@zjz.name>
  2018-05-21 18:00     ` 張俊芝
@ 2018-05-21 18:03     ` 張俊芝
  2018-05-21 18:14       ` Matt Rice
  2018-05-22 14:39       ` Pedro Alves
  1 sibling, 2 replies; 20+ messages in thread
From: 張俊芝 @ 2018-05-21 18:03 UTC (permalink / raw)
  To: Simon Marchi, gdb-patches

> 
> 
> Simon Marchi wrote on 2018/5/21 at 10:03 PM:
>> Could you please write a small test case in testsuite/gdb.base with the
>> example you gave, so we make sure this doesn't get broken later?  If you
>> can write it in such a way that both clang and gcc understand it would be
>> better, because most people run the testsuite using gcc to compile test
>> programs.
>>
Oops, sorry, Simon, I forgot the test part in the second upload.

Clang is compatible with the GCC workaround \uXXXX. So I will write the 
test case in that format.

But it's late here, I will do it tomorrow.

* Re: support C/C++ identifiers named with non-ASCII characters
  2018-05-21 18:03     ` 張俊芝
@ 2018-05-21 18:14       ` Matt Rice
  2018-05-22  7:06         ` 張俊芝
  2018-05-22 14:39       ` Pedro Alves
  1 sibling, 1 reply; 20+ messages in thread
From: Matt Rice @ 2018-05-21 18:14 UTC (permalink / raw)
  To: 張俊芝; +Cc: Simon Marchi, gdb-patches

On Mon, May 21, 2018 at 10:45 AM, 張俊芝 <zjz@zjz.name> wrote:
>>
>>
>> Simon Marchi wrote on 2018/5/21 at 10:03 PM:
>>>
>>> Could you please write a small test case in testsuite/gdb.base with the
>>> example you gave, so we make sure this doesn't get broken later?  If you
>>> can write it in such a way that both clang and gcc understand it would be
>>> better, because most people run the testsuite using gcc to compile test
>>> programs.
>>>
> Oops, sorry, Simon, I forgot the test part in the second upload.
>
> Clang is compatible with the GCC workaround \uXXXX. So I will write the test
> case in that format.
>
> But it's late here, I will do it tomorrow.

Just FYI, there is another bug in this area, which I noticed when
trying to tab-complete symbols using GCC's \uXXXX.  It seems like an
issue in another place where gdb is not aware of the encoding.

https://sourceware.org/bugzilla/show_bug.cgi?id=18226

* Re: support C/C++ identifiers named with non-ASCII characters
  2018-05-21 16:16     ` Eli Zaretskii
@ 2018-05-21 18:34       ` Paul.Koning
  2018-05-21 19:05         ` Eli Zaretskii
                           ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Paul.Koning @ 2018-05-21 18:34 UTC (permalink / raw)
  To: eliz; +Cc: simark, zjz, gdb-patches

> On May 21, 2018, at 12:12 PM, Eli Zaretskii <eliz@gnu.org> wrote:
> 
>> From: <Paul.Koning@dell.com>
>> CC: <zjz@zjz.name>, <gdb-patches@sourceware.org>
>> Date: Mon, 21 May 2018 14:12:12 +0000
>> 
>>> Given unlimited time, would the right solution be to use a lib to parse the
>>> string as utf-8, and reject strings that are not valid utf-8?
>> 
>> This sounds like a scenario where "stringprep" is helpful (or necessary).  It validates strings to be valid utf-8, can check that they obey certain rules (such as "word elements only" which rejects punctuation and the like), and can convert them to a canonical form so equal strings match whether they are encoded the same or not.
> 
> Is it a fact that non-ASCII identifiers must be encoded in UTF-8, and
> can not include invalid UTF-8 sequences?

Encoding is an I/O question.  "UTF-8" and "Unicode" are often mixed up, but they are distinct.  Unicode is a character set, in which each character has a numeric identification.  For example, 張 is Unicode character number 24373 (0x5f35).

UTF-8 is one of several ways to encode Unicode characters as a byte stream.  The UTF-8 encoding of 張 is e5 bc b5.
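
The byte sequence can be checked with a small encoder.  This is a sketch
for illustration only; utf8_encode is a hypothetical helper, not part of
GDB or of any library mentioned in this thread:

```c
#include <stddef.h>
#include <stdint.h>

/* Encode the code point CP as UTF-8 into OUT, which must have room
   for 4 bytes.  Return the number of bytes written, or 0 if CP is a
   surrogate or lies above U+10FFFF.  */
static size_t
utf8_encode (uint32_t cp, unsigned char *out)
{
  if (cp < 0x80)
    {
      out[0] = (unsigned char) cp;
      return 1;
    }
  if (cp < 0x800)
    {
      out[0] = (unsigned char) (0xC0 | (cp >> 6));
      out[1] = (unsigned char) (0x80 | (cp & 0x3F));
      return 2;
    }
  if (cp < 0x10000)
    {
      if (cp >= 0xD800 && cp <= 0xDFFF)
	return 0;
      out[0] = (unsigned char) (0xE0 | (cp >> 12));
      out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
      out[2] = (unsigned char) (0x80 | (cp & 0x3F));
      return 3;
    }
  if (cp <= 0x10FFFF)
    {
      out[0] = (unsigned char) (0xF0 | (cp >> 18));
      out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
      out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
      out[3] = (unsigned char) (0x80 | (cp & 0x3F));
      return 4;
    }
  return 0;
}
```

Calling utf8_encode (0x5F35, buf) returns 3 and leaves e5 bc b5 in buf,
matching the bytes given above.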

I don't know what the C/C++ standards say about non-ASCII identifiers.  I assume they are stated to be Unicode, and presumably specific Unicode character classes.  So there are some sequences of Unicode characters that are valid identifiers, while others are not -- exactly as "abc" is a valid ASCII identifier while "a@bc" is not.

A separate question is the encoding of files.  The encoding rule could be that UTF-8 is required -- or that the encoding is selectable.  There also has to be an encoding in output files (debug data for example).  And when strings are entered at the GDB user interface, they arrive in some encoding.  For all these, UTF-8 is a logical answer.

Not all byte strings are valid UTF-8 strings.  When a byte string is delivered from the outside, it makes sense to validate if it's a valid encoding before it is used.  Or you can assume that inputs are valid and rely on "symbol not found" as the general way to handle anything that doesn't match.  For gdb, that may be good enough.
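
As a sketch of what such validation involves (a library would normally do
this; utf8_valid is a hypothetical helper written for this example), a
strict well-formedness check might look like:

```c
#include <stdbool.h>
#include <stddef.h>

/* Return true if the LEN bytes at S are well-formed UTF-8: no stray
   continuation bytes, no truncated or overlong sequences, no
   surrogates, and nothing above U+10FFFF.  */
static bool
utf8_valid (const unsigned char *s, size_t len)
{
  size_t i = 0;

  while (i < len)
    {
      unsigned char c = s[i];
      size_t need;		/* Continuation bytes expected.  */
      unsigned long cp, min;	/* Decoded value and its lower bound.  */

      if (c < 0x80)
	{
	  i++;
	  continue;
	}
      else if ((c & 0xE0) == 0xC0)
	{ need = 1; cp = c & 0x1F; min = 0x80; }
      else if ((c & 0xF0) == 0xE0)
	{ need = 2; cp = c & 0x0F; min = 0x800; }
      else if ((c & 0xF8) == 0xF0)
	{ need = 3; cp = c & 0x07; min = 0x10000; }
      else
	return false;		/* Stray continuation or invalid lead.  */

      if (len - i - 1 < need)
	return false;		/* Sequence is cut off.  */
      for (size_t k = 1; k <= need; k++)
	{
	  if ((s[i + k] & 0xC0) != 0x80)
	    return false;
	  cp = (cp << 6) | (s[i + k] & 0x3F);
	}
      if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
	return false;
      i += need + 1;
    }
  return true;
}
```

The alternative described above skips this step entirely and lets the
symbol lookup fail on anything that does not match.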

Yet another issue: for many characters, there are multiple ways to represent them in Unicode.  For example, ü (latin small letter u with dieresis) can be coded as the single Unicode character 0xfc, or as the pair 0x0308 0x75 (combining dieresis, latin small letter u).  These are supposed to be synonymous; when doing string matches, you'd want them to be taken as equivalent.  The stringprep library helps with this by offering a conversion to a standard form, at which point memcmp will give the correct answer.

	paul

* Re: support C/C++ identifiers named with non-ASCII characters
  2018-05-21 18:34       ` Paul.Koning
@ 2018-05-21 19:05         ` Eli Zaretskii
  2018-05-21 19:25           ` Paul.Koning
  2018-05-21 20:43         ` Joseph Myers
  2018-05-22  8:34         ` 張俊芝
  2 siblings, 1 reply; 20+ messages in thread
From: Eli Zaretskii @ 2018-05-21 19:05 UTC (permalink / raw)
  To: Paul.Koning; +Cc: simark, zjz, gdb-patches

> From: <Paul.Koning@dell.com>
> CC: <simark@simark.ca>, <zjz@zjz.name>, <gdb-patches@sourceware.org>
> Date: Mon, 21 May 2018 18:03:17 +0000
> 
> > Is it a fact that non-ASCII identifiers must be encoded in UTF-8, and
> > can not include invalid UTF-8 sequences?
> 
> Encoding is an I/O question.

Not necessarily.

I asked that question because scanning a string for certain ASCII
characters using a 'char *' pointer will only work reliably if the
string is in UTF-8 or in some single-byte encoding.  Otherwise, we
might find false hits for the delimiters, which are actually parts of
multibyte sequences.
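
A small sketch of such a false hit, assuming UTF-16LE as the other
encoding; contains_byte and the two arrays are made up for this example.
U+4E2D (中) is E4 B8 AD in UTF-8, where every byte is above 0x7F, but
2D 4E in UTF-16LE, where the first byte is the ASCII character '-':

```c
#include <stdbool.h>
#include <stddef.h>

/* Return true if any of the LEN bytes at BUF equals the ASCII
   character SEP.  */
static bool
contains_byte (const unsigned char *buf, size_t len, char sep)
{
  for (size_t i = 0; i < len; i++)
    if (buf[i] == (unsigned char) sep)
      return true;
  return false;
}

/* U+4E2D in two encodings.  */
static const unsigned char zhong_utf8[] = { 0xE4, 0xB8, 0xAD };
static const unsigned char zhong_utf16le[] = { 0x2D, 0x4E };
```

A byte-wise scan for '-' finds nothing in zhong_utf8 but reports a hit in
zhong_utf16le, even though no minus sign was typed.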

* Re: support C/C++ identifiers named with non-ASCII characters
  2018-05-21 19:05         ` Eli Zaretskii
@ 2018-05-21 19:25           ` Paul.Koning
  0 siblings, 0 replies; 20+ messages in thread
From: Paul.Koning @ 2018-05-21 19:25 UTC (permalink / raw)
  To: eliz; +Cc: simark, zjz, gdb-patches

> On May 21, 2018, at 2:14 PM, Eli Zaretskii <eliz@gnu.org> wrote:
> 
>> From: <Paul.Koning@dell.com>
>> CC: <simark@simark.ca>, <zjz@zjz.name>, <gdb-patches@sourceware.org>
>> Date: Mon, 21 May 2018 18:03:17 +0000
>> 
>>> Is it a fact that non-ASCII identifiers must be encoded in UTF-8, and
>>> can not include invalid UTF-8 sequences?
>> 
>> Encoding is an I/O question.
> 
> Not necessarily.
> 
> I asked that question because scanning a string for certain ASCII
> characters using a 'char *' pointer will only work reliably if the
> string is in UTF-8 or in some single-byte encoding.  Otherwise, we
> might find false hits for the delimiters, which are actually parts of
> multibyte sequences.

I see your point.

The I/O encoding ties to the internal encoding.  UTF-8 can be read into char[] and processed using C string primitives.  Other encodings cannot.  For example, if you have UTF-16 or UTF-32, you'd have to read it into a wchar_t string of the correct character width and use the wchar string functions.

So there are two questions:

1. What are the valid characters?  (Unicode question, independent of encoding)
2. What encoding do we expect in I/O (UTF question), and from that, which processing functions do we need?

	paul

* Re: support C/C++ identifiers named with non-ASCII characters
  2018-05-21 18:34       ` Paul.Koning
  2018-05-21 19:05         ` Eli Zaretskii
@ 2018-05-21 20:43         ` Joseph Myers
  2018-05-22 10:31           ` 張俊芝
  2018-05-22  8:34         ` 張俊芝
  2 siblings, 1 reply; 20+ messages in thread
From: Joseph Myers @ 2018-05-21 20:43 UTC (permalink / raw)
  To: Paul.Koning; +Cc: eliz, simark, zjz, gdb-patches

[-- Attachment #1: Type: text/plain, Size: 1619 bytes --]

On Mon, 21 May 2018, Paul.Koning@dell.com wrote:

> I don't know what the C/C++ standards say about non-ASCII identifiers.  
> I assume they are stated to be Unicode, and presumably specific Unicode 

They are defined in terms of ISO 10646 (and so concepts from Unicode that 
don't appear in ISO 10646, such as normalization forms, are not relevant 
to them).

See C11 Annex D, which is also aligned with C++11 and later (older 
standard versions had generally more restrictive sets of allowed 
characters, different for C and C++, based on TR 10176).

> Yet another issue: for many characters, there are multiple ways to 
> represent them in Unicode.  For example, ü (latin small letter u with 
> dieresis) can be coded as the single Unicode character 0xfc, or as the 
> pair 0x0308 0x75 (combining dieresis, latin small letter u).  These are 

(The letter goes before the combining mark in that case, not after.  Thus 
such combining marks are not generally permitted at the start of 
identifiers.)

> supposed to be synonymous; when doing string matches, you'd want them to 

They are *not* synonymous in C or C++.  (GCC has -Wnormalized= options to 
warn about identifiers not in an appropriate normalization form, with 
-Wnormalized=nfc as the default.)
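
A sketch of what that means for a lookup that compares identifier bytes
without normalizing them.  The names below are hypothetical, and the
UTF-8 byte values are written out explicitly:

```c
#include <string.h>

/* "ü" as the single precomposed code point U+00FC, in UTF-8.  */
static const char u_precomposed[] = "\xC3\xBC";

/* "ü" as U+0075 (u) followed by U+0308 (combining diaeresis), in
   UTF-8.  */
static const char u_decomposed[] = "u\xCC\x88";

/* Return nonzero if A and B are the same identifier under a plain
   byte comparison, with no normalization applied.  */
static int
same_identifier (const char *a, const char *b)
{
  return strcmp (a, b) == 0;
}
```

Under this comparison the two spellings are different identifiers, even
though they look the same when displayed.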

GCC always generates UTF-8 in its .s output for such identifiers, which 
gas then transfers straight through to its output (thus, UTF-8 ELF 
symbols).  The generic ELF ABI is silent on the encoding of such symbols 
(it just says "External C symbols have the same names in C and object 
files' symbol tables.").

-- 
Joseph S. Myers
joseph@codesourcery.com

* Re: support C/C++ identifiers named with non-ASCII characters
  2018-05-21 18:14       ` Matt Rice
@ 2018-05-22  7:06         ` 張俊芝
  0 siblings, 0 replies; 20+ messages in thread
From: 張俊芝 @ 2018-05-22  7:06 UTC (permalink / raw)
  To: Matt Rice, gdb-patches



Matt Rice wrote on 2018/5/22 at 2:00 AM:
> On Mon, May 21, 2018 at 10:45 AM, 張俊芝 <zjz@zjz.name> wrote:

> Just FYI, there is another bug in this area, which I noticed when
> trying to tab-complete symbols using GCC's \uXXXX.  It seems like an
> issue in another place where gdb is not aware of the encoding.
> 
> https://sourceware.org/bugzilla/show_bug.cgi?id=18226


Thank you for your helpful information, Matt.

* Re: support C/C++ identifiers named with non-ASCII characters
  2018-05-21 18:34       ` Paul.Koning
  2018-05-21 19:05         ` Eli Zaretskii
  2018-05-21 20:43         ` Joseph Myers
@ 2018-05-22  8:34         ` 張俊芝
  2 siblings, 0 replies; 20+ messages in thread
From: 張俊芝 @ 2018-05-22  8:34 UTC (permalink / raw)
  To: Paul.Koning, gdb-patches

Paul.Koning@dell.com wrote on 2018/5/22 at 2:03 AM:
> 
> Not all byte strings are valid UTF-8 strings.  When a byte string is delivered from the outside, it makes sense to validate if it's a valid encoding before it is used.  Or you can assume that inputs are valid and rely on "symbol not found" as the general way to handle anything that doesn't match.  For gdb, that may be good enough.

I prefer the latter (i.e., assume all non-ASCII characters are valid
and rely on "symbol not found"), and it's actually what the patch does.
Although a compiler has to be strict about the validity of non-ASCII
characters, for GDB the latter solution is good enough: checking only
ASCII characters makes GDB work well with all ASCII-compatible
encodings.

* Re: support C/C++ identifiers named with non-ASCII characters
  2018-05-21 20:43         ` Joseph Myers
@ 2018-05-22 10:31           ` 張俊芝
  0 siblings, 0 replies; 20+ messages in thread
From: 張俊芝 @ 2018-05-22 10:31 UTC (permalink / raw)
  To: Joseph Myers, gdb-patches



Joseph Myers wrote on 2018/5/22 at 4:26 AM:

> The generic ELF ABI is silent on the encoding of such symbols
> (it just says "External C symbols have the same names in C and object
> files' symbol tables.").

So this is another reason why it may be better to check only ASCII
delimiter characters. That may be good enough for GDB.

* Re: support C/C++ identifiers named with non-ASCII characters
  2018-05-22 14:39       ` Pedro Alves
@ 2018-05-22 14:39         ` 張俊芝
  2018-05-22 15:17           ` Pedro Alves
  0 siblings, 1 reply; 20+ messages in thread
From: 張俊芝 @ 2018-05-22 14:39 UTC (permalink / raw)
  To: Pedro Alves, gdb-patches



Pedro Alves wrote on 2018/5/22 at 10:15 PM:
> 
> I actually already started writing a patch for this a few months
> back, including a C testcase, after these discussions:
> 
>    https://sourceware.org/ml/gdb-patches/2017-11/msg00428.html
>    https://sourceware.org/ml/gdb/2017-11/msg00022.html
> 
> Let me try to find it.  I don't recall exactly where I left off,
> but I think I had something working.
> 
> Thanks,
> Pedro Alves
> 

I just started writing a test case when I saw your letter.

Could you shed light on how you delimit identifiers in your patch, 
Pedro? Does it check all invalid non-ASCII characters, is it dedicated 
to some encoding such as UTF-8, or to any encodings?

* Re: support C/C++ identifiers named with non-ASCII characters
  2018-05-21 18:03     ` 張俊芝
  2018-05-21 18:14       ` Matt Rice
@ 2018-05-22 14:39       ` Pedro Alves
  2018-05-22 14:39         ` 張俊芝
  1 sibling, 1 reply; 20+ messages in thread
From: Pedro Alves @ 2018-05-22 14:39 UTC (permalink / raw)
  To: 張俊芝, Simon Marchi, gdb-patches

On 05/21/2018 06:45 PM, 張俊芝 wrote:
>>
>>
>> Simon Marchi wrote on 2018/5/21 at 10:03 PM:
>>> Could you please write a small test case in testsuite/gdb.base with the example
>>> you gave, so we make sure this doesn't get broken later?  If you can write it
>>> in such a way that both clang and gcc understand it would be better, because
>>> most people run the testsuite using gcc to compile test programs.
>>>
> Oops, sorry, Simon, I forgot the test part in the second upload.
> 
> Clang is compatible with the GCC workaround \uXXXX. So I will write the test case in that format.
> 
> But it's late here, I will do it tomorrow.

I actually already started writing a patch for this a few months
back, including a C testcase, after these discussions:

  https://sourceware.org/ml/gdb-patches/2017-11/msg00428.html
  https://sourceware.org/ml/gdb/2017-11/msg00022.html

Let me try to find it.  I don't recall exactly where I left off,
but I think I had something working.

Thanks,
Pedro Alves

* Re: support C/C++ identifiers named with non-ASCII characters
  2018-05-22 14:39         ` 張俊芝
@ 2018-05-22 15:17           ` Pedro Alves
  2018-05-22 16:42             ` Pedro Alves
  0 siblings, 1 reply; 20+ messages in thread
From: Pedro Alves @ 2018-05-22 15:17 UTC (permalink / raw)
  To: 張俊芝, gdb-patches

On 05/22/2018 03:32 PM, 張俊芝 wrote:
> 
> Pedro Alves wrote on 2018/5/22 at 10:15 PM:
>>
>> I actually already started writing a patch for this a few months
>> back, including a C testcase, after these discussions:
>>
>>    https://sourceware.org/ml/gdb-patches/2017-11/msg00428.html
>>    https://sourceware.org/ml/gdb/2017-11/msg00022.html
>>
>> Let me try to find it.  I don't recall exactly where I left off,
>> but I think I had something working.
> 
> I just started writing a test case when I saw your letter.
> 
> Could you shed light on how you delimit identifiers in your patch, Pedro? Does it check all invalid non-ASCII characters, is it dedicated to some encoding such as UTF-8, or to any encodings?

I found the patch.  Let me rebase it and send it / post it.  It'll
be easier to just look at the patch.

Thanks,
Pedro Alves

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: support C/C++ identifiers named with non-ASCII characters
  2018-05-22 15:17           ` Pedro Alves
@ 2018-05-22 16:42             ` Pedro Alves
  2018-05-22 17:31               ` 張俊芝
  0 siblings, 1 reply; 20+ messages in thread
From: Pedro Alves @ 2018-05-22 16:42 UTC (permalink / raw)
  To: 張俊芝, gdb-patches

On 05/22/2018 03:50 PM, Pedro Alves wrote:
> On 05/22/2018 03:32 PM, 張俊芝 wrote:
>>
>> Pedro Alves wrote on 2018/5/22 at 10:15 PM:
>>>
>>> I actually already started writing a patch for this a few months
>>> back, including a C testcase, after these discussions:
>>>
>>>    https://sourceware.org/ml/gdb-patches/2017-11/msg00428.html
>>>    https://sourceware.org/ml/gdb/2017-11/msg00022.html
>>>
>>> Let me try to find it.  I don't recall exactly where I left off,
>>> but I think I had something working.
>>
>> I just started writing a test case when I saw your letter.
>>
>> Could you shed light on how you delimit identifiers in your patch, Pedro? Does it check all invalid non-ASCII characters, is it dedicated to some encoding such as UTF-8, or to any encodings?
> 
> I found the patch.  Let me rebase it and send it / post it.  It'll
> be easier to just look at the patch.

Here it is.  So this is reusing the same logic added to
cp-name-parser.y, in the C/C++ expression parser as well.

The testcase passes cleanly, except for the test that does
"b fun<tab><tab>".  That finds two functions that start with
"fun", but GDB/readline displays them in an odd way, with no
space in between the matches:

 (gdb) b função[tab]
 função1função2
 (gdb) b função

I suspect the issue is in our readline replacements in
gdb/completer.c, like gdb_fnwidth.  HANDLE_MULTIBYTE
isn't defined for me, for example.
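
For illustration only, and not a diagnosis of the actual bug: byte
length and character count diverge for names like these, which is the
kind of mismatch that width code without multibyte handling can run
into.  The sketch below assumes valid UTF-8 input; utf8_char_count is
a hypothetical helper name, not a function in GDB or readline.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count the UTF-8 code points in S by skipping
   continuation bytes (those of the form 10xxxxxx).  Assumes S is
   valid UTF-8; it does not measure terminal display width.  */
static size_t
utf8_char_count (const char *s)
{
  size_t n = 0;

  for (; *s != '\0'; ++s)
    if (((unsigned char) *s & 0xC0) != 0x80)
      ++n;
  return n;
}
```

For the UTF-8 encoding of "função1", strlen reports 9 bytes while
utf8_char_count reports 7 characters, because "ç" and "ã" each take
two bytes.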

From ea6eafba4e32b760afdd1e00a5847772b30a2cbd Mon Sep 17 00:00:00 2001
From: Pedro Alves <palves@redhat.com>
Date: Tue, 22 May 2018 15:35:21 +0100
Subject: [PATCH] Support UTF-8 identifiers in C/C++

Factor out cp_ident_is_alpha/cp_ident_is_alnum out of cp-name-parser.y
and use it in the C/C++ expression parser too.

New test included.

gdb/ChangeLog:
yyyy-mm-dd  Pedro Alves  <palves@redhat.com>

	* c-exp.y: Include "c-support.h".
	(parse_number, c_parse_escape, lex_one_token): Use TOLOWER instead
	of tolower.  Use c_ident_is_alpha to scan names.
	* c-lang.c: Include "c-support.h".
	(convert_ucn, convert_octal, convert_hex, convert_escape): Use
	ISXDIGIT instead of isxdigit and ISDIGIT instead of isdigit.
	* c-support.h: New file, with bits factored out from ...
	* cp-name-parser.y: ... this file.
	Include "c-support.h".
	(cp_ident_is_alpha, cp_ident_is_alnum): Deleted, moved to
	c-support.h and renamed.
	(symbol_end, yylex): Adjust.

gdb/testsuite/ChangeLog:
yyyy-mm-dd  Pedro Alves  <palves@redhat.com>

	* gdb.base/utf8-identifiers.c: New file.
	* gdb.base/utf8-identifiers.exp: New file.
---
 gdb/c-exp.y                                 | 27 +++++-----
 gdb/c-lang.c                                | 11 +++--
 gdb/c-support.h                             | 46 +++++++++++++++++
 gdb/cp-name-parser.y                        | 29 ++---------
 gdb/testsuite/gdb.base/utf8-identifiers.c   | 71 ++++++++++++++++++++++++++
 gdb/testsuite/gdb.base/utf8-identifiers.exp | 77 +++++++++++++++++++++++++++++
 6 files changed, 217 insertions(+), 44 deletions(-)
 create mode 100644 gdb/c-support.h
 create mode 100644 gdb/testsuite/gdb.base/utf8-identifiers.c
 create mode 100644 gdb/testsuite/gdb.base/utf8-identifiers.exp

diff --git a/gdb/c-exp.y b/gdb/c-exp.y
index 5e10d2a3b4..ae31af52df 100644
--- a/gdb/c-exp.y
+++ b/gdb/c-exp.y
@@ -42,6 +42,7 @@
 #include "parser-defs.h"
 #include "language.h"
 #include "c-lang.h"
+#include "c-support.h"
 #include "bfd.h" /* Required by objfiles.h.  */
 #include "symfile.h" /* Required by objfiles.h.  */
 #include "objfiles.h" /* For have_full_symbols and have_partial_symbols */
@@ -1806,13 +1807,13 @@ parse_number (struct parser_state *par_state,
 	  len -= 2;
 	}
       /* Handle suffixes: 'f' for float, 'l' for long double.  */
-      else if (len >= 1 && tolower (p[len - 1]) == 'f')
+      else if (len >= 1 && TOLOWER (p[len - 1]) == 'f')
 	{
 	  putithere->typed_val_float.type
 	    = parse_type (par_state)->builtin_float;
 	  len -= 1;
 	}
-      else if (len >= 1 && tolower (p[len - 1]) == 'l')
+      else if (len >= 1 && TOLOWER (p[len - 1]) == 'l')
 	{
 	  putithere->typed_val_float.type
 	    = parse_type (par_state)->builtin_long_double;
@@ -2023,9 +2024,9 @@ c_parse_escape (const char **ptr, struct obstack *output)
       if (output)
 	obstack_grow_str (output, "\\x");
       ++tokptr;
-      if (!isxdigit (*tokptr))
+      if (!ISXDIGIT (*tokptr))
 	error (_("\\x escape without a following hex digit"));
-      while (isxdigit (*tokptr))
+      while (ISXDIGIT (*tokptr))
 	{
 	  if (output)
 	    obstack_1grow (output, *tokptr);
@@ -2048,7 +2049,7 @@ c_parse_escape (const char **ptr, struct obstack *output)
 	if (output)
 	  obstack_grow_str (output, "\\");
 	for (i = 0;
-	     i < 3 && isdigit (*tokptr) && *tokptr != '8' && *tokptr != '9';
+	     i < 3 && ISDIGIT (*tokptr) && *tokptr != '8' && *tokptr != '9';
 	     ++i)
 	  {
 	    if (output)
@@ -2073,9 +2074,9 @@ c_parse_escape (const char **ptr, struct obstack *output)
 	    obstack_1grow (output, *tokptr);
 	  }
 	++tokptr;
-	if (!isxdigit (*tokptr))
+	if (!ISXDIGIT (*tokptr))
 	  error (_("\\%c escape without a following hex digit"), c);
-	for (i = 0; i < len && isxdigit (*tokptr); ++i)
+	for (i = 0; i < len && ISXDIGIT (*tokptr); ++i)
 	  {
 	    if (output)
 	      obstack_1grow (output, *tokptr);
@@ -2668,7 +2669,7 @@ lex_one_token (struct parser_state *par_state, bool *is_quoted_name)
 	    size_t len = strlen ("selector");
 
 	    if (strncmp (p, "selector", len) == 0
-		&& (p[len] == '\0' || isspace (p[len])))
+		&& (p[len] == '\0' || ISSPACE (p[len])))
 	      {
 		lexptr = p + len;
 		return SELECTOR;
@@ -2677,9 +2678,9 @@ lex_one_token (struct parser_state *par_state, bool *is_quoted_name)
 	      goto parse_string;
 	  }
 
-	while (isspace (*p))
+	while (ISSPACE (*p))
 	  p++;
-	if (strncmp (p, "entry", len) == 0 && !isalnum (p[len])
+	if (strncmp (p, "entry", len) == 0 && !c_ident_is_alnum (p[len])
 	    && p[len] != '_')
 	  {
 	    lexptr = &p[len];
@@ -2741,16 +2742,14 @@ lex_one_token (struct parser_state *par_state, bool *is_quoted_name)
       }
     }
 
-  if (!(c == '_' || c == '$'
-	|| (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')))
+  if (!(c == '_' || c == '$' || c_ident_is_alpha (c)))
     /* We must have come across a bad character (e.g. ';').  */
     error (_("Invalid character '%c' in expression."), c);
 
   /* It's a name.  See how long it is.  */
   namelen = 0;
   for (c = tokstart[namelen];
-       (c == '_' || c == '$' || (c >= '0' && c <= '9')
-	|| (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '<');)
+       (c == '_' || c == '$' || c_ident_is_alnum (c) || c == '<');)
     {
       /* Template parameter lists are part of the name.
 	 FIXME: This mishandles `print $a<4&&$a>3'.  */
diff --git a/gdb/c-lang.c b/gdb/c-lang.c
index 15e633f8c8..6bbb470957 100644
--- a/gdb/c-lang.c
+++ b/gdb/c-lang.c
@@ -25,6 +25,7 @@
 #include "language.h"
 #include "varobj.h"
 #include "c-lang.h"
+#include "c-support.h"
 #include "valprint.h"
 #include "macroscope.h"
 #include "charset.h"
@@ -382,7 +383,7 @@ convert_ucn (char *p, char *limit, const char *dest_charset,
   gdb_byte data[4];
   int i;
 
-  for (i = 0; i < length && p < limit && isxdigit (*p); ++i, ++p)
+  for (i = 0; i < length && p < limit && ISXDIGIT (*p); ++i, ++p)
     result = (result << 4) + host_hex_value (*p);
 
   for (i = 3; i >= 0; --i)
@@ -424,7 +425,7 @@ convert_octal (struct type *type, char *p,
   unsigned long value = 0;
 
   for (i = 0;
-       i < 3 && p < limit && isdigit (*p) && *p != '8' && *p != '9';
+       i < 3 && p < limit && ISDIGIT (*p) && *p != '8' && *p != '9';
        ++i)
     {
       value = 8 * value + host_hex_value (*p);
@@ -447,7 +448,7 @@ convert_hex (struct type *type, char *p,
 {
   unsigned long value = 0;
 
-  while (p < limit && isxdigit (*p))
+  while (p < limit && ISXDIGIT (*p))
     {
       value = 16 * value + host_hex_value (*p);
       ++p;
@@ -488,7 +489,7 @@ convert_escape (struct type *type, const char *dest_charset,
 
     case 'x':
       ADVANCE;
-      if (!isxdigit (*p))
+      if (!ISXDIGIT (*p))
 	error (_("\\x used with no following hex digits."));
       p = convert_hex (type, p, limit, output);
       break;
@@ -510,7 +511,7 @@ convert_escape (struct type *type, const char *dest_charset,
 	int length = *p == 'u' ? 4 : 8;
 
 	ADVANCE;
-	if (!isxdigit (*p))
+	if (!ISXDIGIT (*p))
 	  error (_("\\u used with no following hex digits"));
 	p = convert_ucn (p, limit, dest_charset, output, length);
       }
diff --git a/gdb/c-support.h b/gdb/c-support.h
new file mode 100644
index 0000000000..669db60cd6
--- /dev/null
+++ b/gdb/c-support.h
@@ -0,0 +1,46 @@
+/* Helper routines for C support in GDB.
+   Copyright (C) 2017 Free Software Foundation, Inc.
+
+   This file is part of GDB.
+
+   This program is free software; you can redistribute it and/or modify
+   it under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3 of the License, or
+   (at your option) any later version.
+
+   This program is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with this program.  If not, see <http://www.gnu.org/licenses/>.  */
+
+#ifndef C_SUPPORT_H
+#define C_SUPPORT_H
+
+#include "safe-ctype.h"
+
+/* Like ISALPHA, but also returns true for the union of all UTF-8
+   multi-byte sequence bytes and non-ASCII characters in
+   extended-ASCII charsets (e.g., Latin1).  I.e., returns true if the
+   high bit is set.  Note that not all UTF-8 ranges are allowed in C++
+   identifiers, but we don't need to be pedantic so for simplicity we
+   ignore that here.  Plus this avoids the complication of actually
+   knowing what was the right encoding.  */
+
+static inline bool
+c_ident_is_alpha (unsigned char ch)
+{
+  return ISALPHA (ch) || ch >= 0x80;
+}
+
+/* Similarly, but Like ISALNUM.  */
+
+static inline bool
+c_ident_is_alnum (unsigned char ch)
+{
+  return ISALNUM (ch) || ch >= 0x80;
+}
+
+#endif /* C_SUPPORT_H */
diff --git a/gdb/cp-name-parser.y b/gdb/cp-name-parser.y
index f522e46419..ebae56261b 100644
--- a/gdb/cp-name-parser.y
+++ b/gdb/cp-name-parser.y
@@ -35,6 +35,7 @@
 #include "safe-ctype.h"
 #include "demangle.h"
 #include "cp-support.h"
+#include "c-support.h"
 
 /* Bison does not make it easy to create a parser without global
    state, unfortunately.  Here are all the global variables used
@@ -1304,28 +1305,6 @@ d_binary (const char *name, struct demangle_component *lhs, struct demangle_comp
 		      fill_comp (DEMANGLE_COMPONENT_BINARY_ARGS, lhs, rhs));
 }
 
-/* Like ISALPHA, but also returns true for the union of all UTF-8
-   multi-byte sequence bytes and non-ASCII characters in
-   extended-ASCII charsets (e.g., Latin1).  I.e., returns true if the
-   high bit is set.  Note that not all UTF-8 ranges are allowed in C++
-   identifiers, but we don't need to be pedantic so for simplicity we
-   ignore that here.  Plus this avoids the complication of actually
-   knowing what was the right encoding.  */
-
-static inline bool
-cp_ident_is_alpha (unsigned char ch)
-{
-  return ISALPHA (ch) || ch >= 0x80;
-}
-
-/* Similarly, but Like ISALNUM.  */
-
-static inline bool
-cp_ident_is_alnum (unsigned char ch)
-{
-  return ISALNUM (ch) || ch >= 0x80;
-}
-
 /* Find the end of a symbol name starting at LEXPTR.  */
 
 static const char *
@@ -1333,7 +1312,7 @@ symbol_end (const char *lexptr)
 {
   const char *p = lexptr;
 
-  while (*p && (cp_ident_is_alnum (*p) || *p == '_' || *p == '$' || *p == '.'))
+  while (*p && (c_ident_is_alnum (*p) || *p == '_' || *p == '$' || *p == '.'))
     p++;
 
   return p;
@@ -1813,7 +1792,7 @@ yylex (void)
       return ERROR;
     }
 
-  if (!(c == '_' || c == '$' || cp_ident_is_alpha (c)))
+  if (!(c == '_' || c == '$' || c_ident_is_alpha (c)))
     {
       /* We must have come across a bad character (e.g. ';').  */
       yyerror (_("invalid character"));
@@ -1824,7 +1803,7 @@ yylex (void)
   namelen = 0;
   do
     c = tokstart[++namelen];
-  while (cp_ident_is_alnum (c) || c == '_' || c == '$');
+  while (c_ident_is_alnum (c) || c == '_' || c == '$');
 
   lexptr += namelen;
 
diff --git a/gdb/testsuite/gdb.base/utf8-identifiers.c b/gdb/testsuite/gdb.base/utf8-identifiers.c
new file mode 100644
index 0000000000..c80b42a03d
--- /dev/null
+++ b/gdb/testsuite/gdb.base/utf8-identifiers.c
@@ -0,0 +1,71 @@
+/* -*- coding: utf-8 -*- */
+
+/* This testcase is part of GDB, the GNU debugger.
+
+   Copyright 2017-2018 Free Software Foundation, Inc.
+
+   This program is free software; you can redistribute it and/or modify
+   it under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3 of the License, or
+   (at your option) any later version.
+
+   This program is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with this program.  If not, see <http://www.gnu.org/licenses/>.
+*/
+
+/* UTF-8 "função1".  */
+#define FUNCAO1 fun\u00e7\u00e3o1
+
+/* UTF-8 "função2".  */
+#define FUNCAO2 fun\u00e7\u00e3o2
+
+/* UTF-8 "my_função".  */
+#define MY_FUNCAO my_fun\u00e7\u00e3o
+
+/* UTF-8 "num_€".  */
+#define NUM_EUROS num_\u20ac
+
+struct S
+{
+  int NUM_EUROS;
+} g_s;
+
+void
+FUNCAO1 (void)
+{
+  g_s.NUM_EUROS = 1000;
+}
+
+void
+FUNCAO2 (void)
+{
+  g_s.NUM_EUROS = 1000;
+}
+
+void
+MY_FUNCAO (void)
+{
+}
+
+int NUM_EUROS = 2000;
+
+static void
+done ()
+{
+}
+
+int
+main ()
+{
+  FUNCAO1 ();
+  done ();
+  FUNCAO2 ();
+  MY_FUNCAO ();
+
+  return 0;
+}
diff --git a/gdb/testsuite/gdb.base/utf8-identifiers.exp b/gdb/testsuite/gdb.base/utf8-identifiers.exp
new file mode 100644
index 0000000000..9e91cc3659
--- /dev/null
+++ b/gdb/testsuite/gdb.base/utf8-identifiers.exp
@@ -0,0 +1,77 @@
+# -*- coding: utf-8 -*- */
+
+# This testcase is part of GDB, the GNU debugger.
+
+# Copyright 2017-2018 Free Software Foundation, Inc.
+
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 3 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+
+# Test GDB's support for UTF-8 C/C++ identifiers.
+
+load_lib completion-support.exp
+
+standard_testfile
+
+# Enable basic use of UTF-8.  LC_ALL gets reset for each testfile.
+setenv LC_ALL C.UTF-8
+
+if { [prepare_for_testing "failed to prepare" ${testfile} [list $srcfile]] } {
+    return -1
+}
+
+if ![runto done] {
+    fail "couldn't run to done"
+    return
+}
+
+# Test expressions.
+gdb_test "print g_s.num_€" " = 1000"
+gdb_test "print num_€" " = 2000"
+
+# Test linespecs/breakpoints.
+gdb_test "break função2" "Breakpoint $decimal at .*$srcfile.*"
+
+set test "info breakpoints"
+gdb_test_multiple $test $test {
+    -re "in função2 at .*$srcfile.*$gdb_prompt $" {
+	pass $test
+    }
+}
+
+gdb_test "continue" \
+    "Breakpoint $decimal, função2 \\(\\) at .*$srcfile.*"
+
+# Unload symbols from shared libraries to avoid random symbol and file
+# names getting in the way of completion.
+gdb_test_no_output "nosharedlibrary"
+
+# Test linespec completion.
+
+# A unique completion.
+test_gdb_complete_unique "break my_fun" "break my_função"
+
+# A multiple-matches completion:
+
+# kfailed because gdb/readline display the completion match list like
+# this, with no separating space:
+#
+#  (gdb) break função[TAB]
+#  função1função2
+#
+# ... which is bogus.
+setup_kfail "gdb/NNNN" "*-*-*"
+test_gdb_complete_multiple "break " "fun" "ção" {"função1" "função2"}
+
+# Test expression completion.
+test_gdb_complete_unique "print g_s.num" "print g_s.num_€"
-- 
2.14.3

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: support C/C++ identifiers named with non-ASCII characters
  2018-05-22 16:42             ` Pedro Alves
@ 2018-05-22 17:31               ` 張俊芝
  2018-05-22 17:38                 ` Pedro Alves
  0 siblings, 1 reply; 20+ messages in thread
From: 張俊芝 @ 2018-05-22 17:31 UTC (permalink / raw)
  To: Pedro Alves, gdb-patches

Pedro Alves wrote on 2018/5/22 at 11:17 PM:

>>
>> I found the patch.  Let me rebase it and send it / post it.  It'll
>> be easier to just look at the patch.
> 
> Here it is.  So this is reusing the same logic added to
> cp-name-parser.y, in the C/C++ expression parser as well.
> 

I read through your code. If I understand it correctly, you keep all the 
valid ASCII characters, and treat all non-ASCII characters "as valid".

This is pretty much the same thing as I did. The only difference is that 
I blacklist invalid ASCII characters and you whitelist valid ASCII 
characters. But we both "validate" all the non-ASCII characters.

But I think your code seems better than mine because it updates and 
reuses some common code. So I think I can abandon my patch and the 
unfinished test case.
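
A minimal sketch of the whitelist approach described above, assuming
plain <ctype.h> in place of GDB's safe-ctype.h macros.  The names
ident_is_alnum and ident_len are hypothetical, chosen for this
illustration; they are not the functions in either patch.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Whitelist: ASCII letters and digits, plus any byte with the high
   bit set (a byte of a UTF-8 multi-byte sequence, or a non-ASCII
   character in an extended-ASCII charset).  The high-bit check comes
   first so that isalnum only ever sees ASCII values.  */
static bool
ident_is_alnum (unsigned char ch)
{
  return ch >= 0x80 || isalnum (ch);
}

/* Return the length in bytes of the identifier-like run at the start
   of S.  For brevity this does not reject a leading digit.  */
static size_t
ident_len (const char *s)
{
  size_t n = 0;

  while (ident_is_alnum ((unsigned char) s[n])
	 || s[n] == '_' || s[n] == '$')
    ++n;
  return n;
}
```

Because every byte of a UTF-8 multi-byte sequence has the high bit
set, the scan never stops in the middle of a character: for the UTF-8
bytes of "num_€ + 1" it stops at the space after the three-byte euro
sign, and for "g_s.num" it stops at the dot.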

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: support C/C++ identifiers named with non-ASCII characters
  2018-05-22 17:31               ` 張俊芝
@ 2018-05-22 17:38                 ` Pedro Alves
  0 siblings, 0 replies; 20+ messages in thread
From: Pedro Alves @ 2018-05-22 17:38 UTC (permalink / raw)
  To: 張俊芝, gdb-patches

On 05/22/2018 04:48 PM, 張俊芝 wrote:

> I read through your code. If I understand it correctly, you keep all the valid ASCII characters, and treat all non-ASCII characters "as valid".
> 
> This is pretty much the same thing as I did. The only difference is that I blacklist invalid ASCII characters and you whitelist valid ASCII characters. But we both "validate" all the non-ASCII characters.

Right.  Non-7-bit/base ASCII characters must either be part of the identifier,
or invalid, but we don't need to be pedantic here, as you've also expressed
elsewhere in the thread, I believe.

> 
> But I think your code seems better than mine because it updates and reuses some common code. So I think I can abandon my patch and the unfinished test case.

Alright, I'm pushing this in then, as below.  I've added your name
to the ChangeLog too.  And fixed the Copyright years to include 2018.
Sorry that I didn't say I had a patch mostly written in the PR!

Note I've filed gdb/23211 for the completion issue.  I'm not working
on it right now.  Let me know if you'd like to take a look at that one.

From b1b60145aedb8adcb0b9dcf43a5ae735c2f03b51 Mon Sep 17 00:00:00 2001
From: Pedro Alves <palves@redhat.com>
Date: Tue, 22 May 2018 17:35:38 +0100
Subject: [PATCH] Support UTF-8 identifiers in C/C++ expressions (PR gdb/22973)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Factor out cp_ident_is_alpha/cp_ident_is_alnum out of
gdb/cp-name-parser.y and use it in the C/C++ expression parser too.

New test included.

gdb/ChangeLog:
2018-05-22  Pedro Alves  <palves@redhat.com>
	    張俊芝  <zjz@zjz.name>

	PR gdb/22973
	* c-exp.y: Include "c-support.h".
	(parse_number, c_parse_escape, lex_one_token): Use TOLOWER instead
	of tolower.  Use c_ident_is_alpha to scan names.
	* c-lang.c: Include "c-support.h".
	(convert_ucn, convert_octal, convert_hex, convert_escape): Use
	ISXDIGIT instead of isxdigit and ISDIGIT instead of isdigit.
	* c-support.h: New file, with bits factored out from ...
	* cp-name-parser.y: ... this file.
	Include "c-support.h".
	(cp_ident_is_alpha, cp_ident_is_alnum): Deleted, moved to
	c-support.h and renamed.
	(symbol_end, yylex): Adjust.

gdb/testsuite/ChangeLog:
2018-05-22  Pedro Alves  <palves@redhat.com>

	PR gdb/22973
	* gdb.base/utf8-identifiers.c: New file.
	* gdb.base/utf8-identifiers.exp: New file.
---
 gdb/ChangeLog                               | 17 +++++++
 gdb/testsuite/ChangeLog                     |  6 +++
 gdb/c-exp.y                                 | 27 +++++-----
 gdb/c-lang.c                                | 11 +++--
 gdb/c-support.h                             | 46 +++++++++++++++++
 gdb/cp-name-parser.y                        | 29 ++---------
 gdb/testsuite/gdb.base/utf8-identifiers.c   | 71 ++++++++++++++++++++++++++
 gdb/testsuite/gdb.base/utf8-identifiers.exp | 77 +++++++++++++++++++++++++++++
 8 files changed, 240 insertions(+), 44 deletions(-)
 create mode 100644 gdb/c-support.h
 create mode 100644 gdb/testsuite/gdb.base/utf8-identifiers.c
 create mode 100644 gdb/testsuite/gdb.base/utf8-identifiers.exp

diff --git a/gdb/ChangeLog b/gdb/ChangeLog
index ff5e0a2cbe..b34fa7b74e 100644
--- a/gdb/ChangeLog
+++ b/gdb/ChangeLog
@@ -1,3 +1,20 @@
+2018-05-22  Pedro Alves  <palves@redhat.com>
+	    張俊芝  <zjz@zjz.name>
+
+	PR gdb/22973
+	* c-exp.y: Include "c-support.h".
+	(parse_number, c_parse_escape, lex_one_token): Use TOLOWER instead
+	of tolower.  Use c_ident_is_alpha to scan names.
+	* c-lang.c: Include "c-support.h".
+	(convert_ucn, convert_octal, convert_hex, convert_escape): Use
+	ISXDIGIT instead of isxdigit and ISDIGIT instead of isdigit.
+	* c-support.h: New file, with bits factored out from ...
+	* cp-name-parser.y: ... this file.
+	Include "c-support.h".
+	(cp_ident_is_alpha, cp_ident_is_alnum): Deleted, moved to
+	c-support.h and renamed.
+	(symbol_end, yylex): Adjust.
+
 2018-05-22  Pedro Franco de Carvalho  <pedromfc@linux.vnet.ibm.com>
 
 	* arch/ppc-linux-common.c (ppc_linux_has_isa205): Change the
diff --git a/gdb/testsuite/ChangeLog b/gdb/testsuite/ChangeLog
index 208939bf82..393ab8884a 100644
--- a/gdb/testsuite/ChangeLog
+++ b/gdb/testsuite/ChangeLog
@@ -1,3 +1,9 @@
+2018-05-22  Pedro Alves  <palves@redhat.com>
+
+	PR gdb/22973
+	* gdb.base/utf8-identifiers.c: New file.
+	* gdb.base/utf8-identifiers.exp: New file.
+
 2018-05-22  Pedro Franco de Carvalho  <pedromfc@linux.vnet.ibm.com>
 
 	* gdb.arch/powerpc-fpscr-gcore.exp: New file.
diff --git a/gdb/c-exp.y b/gdb/c-exp.y
index 5e10d2a3b4..ae31af52df 100644
--- a/gdb/c-exp.y
+++ b/gdb/c-exp.y
@@ -42,6 +42,7 @@
 #include "parser-defs.h"
 #include "language.h"
 #include "c-lang.h"
+#include "c-support.h"
 #include "bfd.h" /* Required by objfiles.h.  */
 #include "symfile.h" /* Required by objfiles.h.  */
 #include "objfiles.h" /* For have_full_symbols and have_partial_symbols */
@@ -1806,13 +1807,13 @@ parse_number (struct parser_state *par_state,
 	  len -= 2;
 	}
       /* Handle suffixes: 'f' for float, 'l' for long double.  */
-      else if (len >= 1 && tolower (p[len - 1]) == 'f')
+      else if (len >= 1 && TOLOWER (p[len - 1]) == 'f')
 	{
 	  putithere->typed_val_float.type
 	    = parse_type (par_state)->builtin_float;
 	  len -= 1;
 	}
-      else if (len >= 1 && tolower (p[len - 1]) == 'l')
+      else if (len >= 1 && TOLOWER (p[len - 1]) == 'l')
 	{
 	  putithere->typed_val_float.type
 	    = parse_type (par_state)->builtin_long_double;
@@ -2023,9 +2024,9 @@ c_parse_escape (const char **ptr, struct obstack *output)
       if (output)
 	obstack_grow_str (output, "\\x");
       ++tokptr;
-      if (!isxdigit (*tokptr))
+      if (!ISXDIGIT (*tokptr))
 	error (_("\\x escape without a following hex digit"));
-      while (isxdigit (*tokptr))
+      while (ISXDIGIT (*tokptr))
 	{
 	  if (output)
 	    obstack_1grow (output, *tokptr);
@@ -2048,7 +2049,7 @@ c_parse_escape (const char **ptr, struct obstack *output)
 	if (output)
 	  obstack_grow_str (output, "\\");
 	for (i = 0;
-	     i < 3 && isdigit (*tokptr) && *tokptr != '8' && *tokptr != '9';
+	     i < 3 && ISDIGIT (*tokptr) && *tokptr != '8' && *tokptr != '9';
 	     ++i)
 	  {
 	    if (output)
@@ -2073,9 +2074,9 @@ c_parse_escape (const char **ptr, struct obstack *output)
 	    obstack_1grow (output, *tokptr);
 	  }
 	++tokptr;
-	if (!isxdigit (*tokptr))
+	if (!ISXDIGIT (*tokptr))
 	  error (_("\\%c escape without a following hex digit"), c);
-	for (i = 0; i < len && isxdigit (*tokptr); ++i)
+	for (i = 0; i < len && ISXDIGIT (*tokptr); ++i)
 	  {
 	    if (output)
 	      obstack_1grow (output, *tokptr);
@@ -2668,7 +2669,7 @@ lex_one_token (struct parser_state *par_state, bool *is_quoted_name)
 	    size_t len = strlen ("selector");
 
 	    if (strncmp (p, "selector", len) == 0
-		&& (p[len] == '\0' || isspace (p[len])))
+		&& (p[len] == '\0' || ISSPACE (p[len])))
 	      {
 		lexptr = p + len;
 		return SELECTOR;
@@ -2677,9 +2678,9 @@ lex_one_token (struct parser_state *par_state, bool *is_quoted_name)
 	      goto parse_string;
 	  }
 
-	while (isspace (*p))
+	while (ISSPACE (*p))
 	  p++;
-	if (strncmp (p, "entry", len) == 0 && !isalnum (p[len])
+	if (strncmp (p, "entry", len) == 0 && !c_ident_is_alnum (p[len])
 	    && p[len] != '_')
 	  {
 	    lexptr = &p[len];
@@ -2741,16 +2742,14 @@ lex_one_token (struct parser_state *par_state, bool *is_quoted_name)
       }
     }
 
-  if (!(c == '_' || c == '$'
-	|| (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')))
+  if (!(c == '_' || c == '$' || c_ident_is_alpha (c)))
     /* We must have come across a bad character (e.g. ';').  */
     error (_("Invalid character '%c' in expression."), c);
 
   /* It's a name.  See how long it is.  */
   namelen = 0;
   for (c = tokstart[namelen];
-       (c == '_' || c == '$' || (c >= '0' && c <= '9')
-	|| (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '<');)
+       (c == '_' || c == '$' || c_ident_is_alnum (c) || c == '<');)
     {
       /* Template parameter lists are part of the name.
 	 FIXME: This mishandles `print $a<4&&$a>3'.  */
diff --git a/gdb/c-lang.c b/gdb/c-lang.c
index 15e633f8c8..6bbb470957 100644
--- a/gdb/c-lang.c
+++ b/gdb/c-lang.c
@@ -25,6 +25,7 @@
 #include "language.h"
 #include "varobj.h"
 #include "c-lang.h"
+#include "c-support.h"
 #include "valprint.h"
 #include "macroscope.h"
 #include "charset.h"
@@ -382,7 +383,7 @@ convert_ucn (char *p, char *limit, const char *dest_charset,
   gdb_byte data[4];
   int i;
 
-  for (i = 0; i < length && p < limit && isxdigit (*p); ++i, ++p)
+  for (i = 0; i < length && p < limit && ISXDIGIT (*p); ++i, ++p)
     result = (result << 4) + host_hex_value (*p);
 
   for (i = 3; i >= 0; --i)
@@ -424,7 +425,7 @@ convert_octal (struct type *type, char *p,
   unsigned long value = 0;
 
   for (i = 0;
-       i < 3 && p < limit && isdigit (*p) && *p != '8' && *p != '9';
+       i < 3 && p < limit && ISDIGIT (*p) && *p != '8' && *p != '9';
        ++i)
     {
       value = 8 * value + host_hex_value (*p);
@@ -447,7 +448,7 @@ convert_hex (struct type *type, char *p,
 {
   unsigned long value = 0;
 
-  while (p < limit && isxdigit (*p))
+  while (p < limit && ISXDIGIT (*p))
     {
       value = 16 * value + host_hex_value (*p);
       ++p;
@@ -488,7 +489,7 @@ convert_escape (struct type *type, const char *dest_charset,
 
     case 'x':
       ADVANCE;
-      if (!isxdigit (*p))
+      if (!ISXDIGIT (*p))
 	error (_("\\x used with no following hex digits."));
       p = convert_hex (type, p, limit, output);
       break;
@@ -510,7 +511,7 @@ convert_escape (struct type *type, const char *dest_charset,
 	int length = *p == 'u' ? 4 : 8;
 
 	ADVANCE;
-	if (!isxdigit (*p))
+	if (!ISXDIGIT (*p))
 	  error (_("\\u used with no following hex digits"));
 	p = convert_ucn (p, limit, dest_charset, output, length);
       }
diff --git a/gdb/c-support.h b/gdb/c-support.h
new file mode 100644
index 0000000000..3641d6f534
--- /dev/null
+++ b/gdb/c-support.h
@@ -0,0 +1,46 @@
+/* Helper routines for C support in GDB.
+   Copyright (C) 2017-2018 Free Software Foundation, Inc.
+
+   This file is part of GDB.
+
+   This program is free software; you can redistribute it and/or modify
+   it under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3 of the License, or
+   (at your option) any later version.
+
+   This program is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with this program.  If not, see <http://www.gnu.org/licenses/>.  */
+
+#ifndef C_SUPPORT_H
+#define C_SUPPORT_H
+
+#include "safe-ctype.h"
+
+/* Like ISALPHA, but also returns true for the union of all UTF-8
+   multi-byte sequence bytes and non-ASCII characters in
+   extended-ASCII charsets (e.g., Latin1).  I.e., returns true if the
+   high bit is set.  Note that not all UTF-8 ranges are allowed in C++
+   identifiers, but we don't need to be pedantic so for simplicity we
+   ignore that here.  Plus this avoids the complication of actually
+   knowing what was the right encoding.  */
+
+static inline bool
+c_ident_is_alpha (unsigned char ch)
+{
+  return ISALPHA (ch) || ch >= 0x80;
+}
+
+/* Similarly, but Like ISALNUM.  */
+
+static inline bool
+c_ident_is_alnum (unsigned char ch)
+{
+  return ISALNUM (ch) || ch >= 0x80;
+}
+
+#endif /* C_SUPPORT_H */
diff --git a/gdb/cp-name-parser.y b/gdb/cp-name-parser.y
index f522e46419..ebae56261b 100644
--- a/gdb/cp-name-parser.y
+++ b/gdb/cp-name-parser.y
@@ -35,6 +35,7 @@
 #include "safe-ctype.h"
 #include "demangle.h"
 #include "cp-support.h"
+#include "c-support.h"
 
 /* Bison does not make it easy to create a parser without global
    state, unfortunately.  Here are all the global variables used
@@ -1304,28 +1305,6 @@ d_binary (const char *name, struct demangle_component *lhs, struct demangle_comp
 		      fill_comp (DEMANGLE_COMPONENT_BINARY_ARGS, lhs, rhs));
 }
 
-/* Like ISALPHA, but also returns true for the union of all UTF-8
-   multi-byte sequence bytes and non-ASCII characters in
-   extended-ASCII charsets (e.g., Latin1).  I.e., returns true if the
-   high bit is set.  Note that not all UTF-8 ranges are allowed in C++
-   identifiers, but we don't need to be pedantic so for simplicity we
-   ignore that here.  Plus this avoids the complication of actually
-   knowing what was the right encoding.  */
-
-static inline bool
-cp_ident_is_alpha (unsigned char ch)
-{
-  return ISALPHA (ch) || ch >= 0x80;
-}
-
-/* Similarly, but Like ISALNUM.  */
-
-static inline bool
-cp_ident_is_alnum (unsigned char ch)
-{
-  return ISALNUM (ch) || ch >= 0x80;
-}
-
 /* Find the end of a symbol name starting at LEXPTR.  */
 
 static const char *
@@ -1333,7 +1312,7 @@ symbol_end (const char *lexptr)
 {
   const char *p = lexptr;
 
-  while (*p && (cp_ident_is_alnum (*p) || *p == '_' || *p == '$' || *p == '.'))
+  while (*p && (c_ident_is_alnum (*p) || *p == '_' || *p == '$' || *p == '.'))
     p++;
 
   return p;
@@ -1813,7 +1792,7 @@ yylex (void)
       return ERROR;
     }
 
-  if (!(c == '_' || c == '$' || cp_ident_is_alpha (c)))
+  if (!(c == '_' || c == '$' || c_ident_is_alpha (c)))
     {
       /* We must have come across a bad character (e.g. ';').  */
       yyerror (_("invalid character"));
@@ -1824,7 +1803,7 @@ yylex (void)
   namelen = 0;
   do
     c = tokstart[++namelen];
-  while (cp_ident_is_alnum (c) || c == '_' || c == '$');
+  while (c_ident_is_alnum (c) || c == '_' || c == '$');
 
   lexptr += namelen;
 
diff --git a/gdb/testsuite/gdb.base/utf8-identifiers.c b/gdb/testsuite/gdb.base/utf8-identifiers.c
new file mode 100644
index 0000000000..c80b42a03d
--- /dev/null
+++ b/gdb/testsuite/gdb.base/utf8-identifiers.c
@@ -0,0 +1,71 @@
+/* -*- coding: utf-8 -*- */
+
+/* This testcase is part of GDB, the GNU debugger.
+
+   Copyright 2017-2018 Free Software Foundation, Inc.
+
+   This program is free software; you can redistribute it and/or modify
+   it under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3 of the License, or
+   (at your option) any later version.
+
+   This program is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with this program.  If not, see <http://www.gnu.org/licenses/>.
+*/
+
+/* UTF-8 "função1".  */
+#define FUNCAO1 fun\u00e7\u00e3o1
+
+/* UTF-8 "função2".  */
+#define FUNCAO2 fun\u00e7\u00e3o2
+
+/* UTF-8 "my_função".  */
+#define MY_FUNCAO my_fun\u00e7\u00e3o
+
+/* UTF-8 "num_€".  */
+#define NUM_EUROS num_\u20ac
+
+struct S
+{
+  int NUM_EUROS;
+} g_s;
+
+void
+FUNCAO1 (void)
+{
+  g_s.NUM_EUROS = 1000;
+}
+
+void
+FUNCAO2 (void)
+{
+  g_s.NUM_EUROS = 1000;
+}
+
+void
+MY_FUNCAO (void)
+{
+}
+
+int NUM_EUROS = 2000;
+
+static void
+done ()
+{
+}
+
+int
+main ()
+{
+  FUNCAO1 ();
+  done ();
+  FUNCAO2 ();
+  MY_FUNCAO ();
+
+  return 0;
+}
diff --git a/gdb/testsuite/gdb.base/utf8-identifiers.exp b/gdb/testsuite/gdb.base/utf8-identifiers.exp
new file mode 100644
index 0000000000..12fe3768e2
--- /dev/null
+++ b/gdb/testsuite/gdb.base/utf8-identifiers.exp
@@ -0,0 +1,77 @@
+# -*- coding: utf-8 -*-
+
+# This testcase is part of GDB, the GNU debugger.
+
+# Copyright 2017-2018 Free Software Foundation, Inc.
+
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 3 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+
+# Test GDB's support for UTF-8 C/C++ identifiers.
+
+load_lib completion-support.exp
+
+standard_testfile
+
+# Enable basic use of UTF-8.  LC_ALL gets reset for each testfile.
+setenv LC_ALL C.UTF-8
+
+if { [prepare_for_testing "failed to prepare" ${testfile} [list $srcfile]] } {
+    return -1
+}
+
+if ![runto done] {
+    fail "couldn't run to done"
+    return
+}
+
+# Test expressions.
+gdb_test "print g_s.num_€" " = 1000"
+gdb_test "print num_€" " = 2000"
+
+# Test linespecs/breakpoints.
+gdb_test "break função2" "Breakpoint $decimal at .*$srcfile.*"
+
+set test "info breakpoints"
+gdb_test_multiple $test $test {
+    -re "in função2 at .*$srcfile.*$gdb_prompt $" {
+	pass $test
+    }
+}
+
+gdb_test "continue" \
+    "Breakpoint $decimal, função2 \\(\\) at .*$srcfile.*"
+
+# Unload symbols from shared libraries to avoid random symbol and file
+# names getting in the way of completion.
+gdb_test_no_output "nosharedlibrary"
+
+# Test linespec completion.
+
+# A unique completion.
+test_gdb_complete_unique "break my_fun" "break my_função"
+
+# A multiple-matches completion:
+
+# kfailed because gdb/readline display the completion match list like
+# this, with no separating space:
+#
+#  (gdb) break função[TAB]
+#  função1função2
+#
+# ... which is bogus.
+setup_kfail "gdb/23211" "*-*-*"
+test_gdb_complete_multiple "break " "fun" "ção" {"função1" "função2"}
+
+# Test expression completion.
+test_gdb_complete_unique "print g_s.num" "print g_s.num_€"
-- 
2.14.3
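
For readers who want to see the effect of the high-bit test in isolation,
here is a minimal standalone sketch in C.  It is not part of the patch.  The
sketch_* names are hypothetical, and plain <ctype.h> isalpha/isalnum stand in
for the locale-independent ISALPHA/ISALNUM macros from safe-ctype.h (an
assumption made only so the example builds without libiberty).  The scanning
loop follows the shape of the yylex hunk above: one start character, then any
run of alphanumerics, '_' or '$'.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for c_ident_is_alpha: an ASCII letter, or any
   byte with the high bit set (UTF-8 lead/continuation bytes, Latin1).  */
static inline bool
sketch_ident_is_alpha (unsigned char ch)
{
  return isalpha (ch) || ch >= 0x80;
}

/* Hypothetical stand-in for c_ident_is_alnum.  */
static inline bool
sketch_ident_is_alnum (unsigned char ch)
{
  return isalnum (ch) || ch >= 0x80;
}

/* Return the length in bytes of the identifier at the start of S, or
   0 if S does not start with an identifier character.  No encoding
   is decoded; bytes >= 0x80 are simply accepted.  */
static size_t
sketch_ident_length (const char *s)
{
  assert (s != NULL);

  unsigned char c = (unsigned char) s[0];
  if (!(c == '_' || c == '$' || sketch_ident_is_alpha (c)))
    return 0;

  size_t len = 1;
  for (;;)
    {
      c = (unsigned char) s[len];
      if (!(sketch_ident_is_alnum (c) || c == '_' || c == '$'))
	break;
      len++;
    }
  return len;
}
```

Because the length is counted in bytes, each two-byte UTF-8 character in
"função2" counts twice, and the scan stops at the first ASCII byte that is
not an identifier character (a space, '(' or the terminating NUL).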

Thread overview: 20+ messages
2018-05-21  9:54 support C/C++ identifiers named with non-ASCII characters 張俊芝
2018-05-21 14:21 ` Simon Marchi
2018-05-21 15:27   ` Paul.Koning
2018-05-21 16:16     ` Eli Zaretskii
2018-05-21 18:34       ` Paul.Koning
2018-05-21 19:05         ` Eli Zaretskii
2018-05-21 19:25           ` Paul.Koning
2018-05-21 20:43         ` Joseph Myers
2018-05-22 10:31           ` 張俊芝
2018-05-22  8:34         ` 張俊芝
     [not found]   ` <1b915196-3e97-4892-7426-be4211fe7889@zjz.name>
2018-05-21 18:00     ` 張俊芝
2018-05-21 18:03     ` 張俊芝
2018-05-21 18:14       ` Matt Rice
2018-05-22  7:06         ` 張俊芝
2018-05-22 14:39       ` Pedro Alves
2018-05-22 14:39         ` 張俊芝
2018-05-22 15:17           ` Pedro Alves
2018-05-22 16:42             ` Pedro Alves
2018-05-22 17:31               ` 張俊芝
2018-05-22 17:38                 ` Pedro Alves
