charset.c problem with non-en_US locales

public inbox for gdb@sourceware.org

* charset.c problem with non-en_US locales
@ 2003-04-22 19:59 Elena Zannoni
  2003-04-23  9:45 ` Eli Zaretskii
  0 siblings, 1 reply; 7+ messages in thread
From: Elena Zannoni @ 2003-04-22 19:59 UTC (permalink / raw)
  To: gdb, jimb

I got a bug report against the gdb in RH Linux which I found
interesting, and since it occurs in FSF gdb as well...

Problem: If you set the locale to Turkish, gdb errors out, complaining
that it cannot find the ISO-8859-1 charset.

[ezannoni@localhost gdb]$ LC_ALL=tr_TR.UTF-8 ./gdb
GDB doesn't know of any character set named `ISO-8859-1'.
No display number 0.
Disabling display 0 to avoid infinite recursion.
[ezannoni@localhost gdb]$ 

[ezannoni@localhost gdb]$ LC_ALL=tr_TR.ISO-8859-9 ./gdb
GDB doesn't know of any character set named `ISO-8859-1'.
No display number 0.
Disabling display 0 to avoid infinite recursion.
[ezannoni@localhost gdb]$

The real problem is in the use of the tolower() function in charset.c
to do a case insensitive comparison between the two strings
"ISO-8859-1" and "iso-8859-1". 

/* Character set names are always compared ignoring case.  */
static int
strcmp_case_insensitive (const char *p, const char *q)
{
  while (*p && *q && tolower (*p) == tolower (*q))
    p++, q++;

  return tolower (*p) - tolower (*q);
}

When the locale is set to Turkish (or any other locale with different
case rules), the tolower/toupper functions don't work as they would in
English.  The lowercase version of 'I' is not 'i', for instance, but
some other character ('i' w/o the dot).  Indeed, the man pages for
tolower/toupper warn about this.  strcasecmp() also has the problem.
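
For illustration only, here is a minimal sketch of how the result of
that comparison depends on the current locale.  It is not part of the
patch: names_match_in_locale is a hypothetical helper, and the
(unsigned char) casts are an addition so that tolower is never handed
a negative value.

```c
#include <ctype.h>
#include <locale.h>

/* Same loop as in charset.c, with casts added for safety.  */
static int
strcmp_case_insensitive (const char *p, const char *q)
{
  while (*p && *q
         && tolower ((unsigned char) *p) == tolower ((unsigned char) *q))
    p++, q++;

  return tolower ((unsigned char) *p) - tolower ((unsigned char) *q);
}

/* Hypothetical helper: returns 1 if the two spellings compare equal
   under LOCALE_NAME, 0 if they do not, and -1 if the locale could
   not be selected.  The locale is reset to "C" afterwards.  */
static int
names_match_in_locale (const char *locale_name)
{
  int match;

  if (setlocale (LC_ALL, locale_name) == NULL)
    return -1;

  match = strcmp_case_insensitive ("ISO-8859-1", "iso-8859-1") == 0;
  setlocale (LC_ALL, "C");
  return match;
}
```

names_match_in_locale ("C") returns 1.  Going by the report above, one
would expect names_match_in_locale ("tr_TR.ISO-8859-9") to return 0 on
a system where that locale is installed; where it cannot be selected,
the helper returns -1.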

So, I think the whole case-insensitive approach for the names of the
charsets and the translation tables should probably be removed.  What
was the reason behind it? Was it that the user could type upper/lower
case charset names at the command line? After all the official name is
'ISO' not 'iso'.

This patch works, but I am not confident that it's enough.

elena

Index: charset.c
===================================================================
RCS file: /cvs/uberbaum/gdb/charset.c,v
retrieving revision 1.3
diff -u -p -r1.3 charset.c
--- charset.c	14 Jan 2003 00:49:03 -0000	1.3
+++ charset.c	22 Apr 2003 19:56:53 -0000
@@ -160,10 +160,13 @@ struct translation {
 static int
 strcmp_case_insensitive (const char *p, const char *q)
 {
-  while (*p && *q && tolower (*p) == tolower (*q))
+#if 0 
+ while (*p && *q && tolower (*p) == tolower (*q))
     p++, q++;

   return tolower (*p) - tolower (*q);
+#endif
+  return strcmp (p, q);
 }

@@ -1207,24 +1210,24 @@ _initialize_charset (void)
   register_charset (simple_charset ("ascii", 1,
                                     ascii_print_literally, 0,
                                     ascii_to_control, 0));
-  register_charset (iso_8859_family_charset ("iso-8859-1"));
+  register_charset (iso_8859_family_charset ("ISO-8859-1"));
   register_charset (ebcdic_family_charset ("ebcdic-us"));
   register_charset (ebcdic_family_charset ("ibm1047"));
   register_iconv_charsets ();

   {
     struct { char *from; char *to; int *table; } tlist[] = {
-      { "ascii",      "iso-8859-1", ascii_to_iso_8859_1_table },
+      { "ascii",      "ISO-8859-1", ascii_to_iso_8859_1_table },
       { "ascii",      "ebcdic-us",  ascii_to_ebcdic_us_table },
       { "ascii",      "ibm1047",    ascii_to_ibm1047_table },
-      { "iso-8859-1", "ascii",      iso_8859_1_to_ascii_table },
-      { "iso-8859-1", "ebcdic-us",  iso_8859_1_to_ebcdic_us_table },
-      { "iso-8859-1", "ibm1047",    iso_8859_1_to_ibm1047_table },
+      { "ISO-8859-1", "ascii",      iso_8859_1_to_ascii_table },
+      { "ISO-8859-1", "ebcdic-us",  iso_8859_1_to_ebcdic_us_table },
+      { "ISO-8859-1", "ibm1047",    iso_8859_1_to_ibm1047_table },
       { "ebcdic-us",  "ascii",      ebcdic_us_to_ascii_table },
-      { "ebcdic-us",  "iso-8859-1", ebcdic_us_to_iso_8859_1_table },
+      { "ebcdic-us",  "ISO-8859-1", ebcdic_us_to_iso_8859_1_table },
       { "ebcdic-us",  "ibm1047",    ebcdic_us_to_ibm1047_table },
       { "ibm1047",    "ascii",      ibm1047_to_ascii_table },
-      { "ibm1047",    "iso-8859-1", ibm1047_to_iso_8859_1_table },
+      { "ibm1047",    "ISO-8859-1", ibm1047_to_iso_8859_1_table },
       { "ibm1047",    "ebcdic-us",  ibm1047_to_ebcdic_us_table }
     };


* Re: charset.c problem with non-en_US locales
  2003-04-22 19:59 charset.c problem with non-en_US locales Elena Zannoni
@ 2003-04-23  9:45 ` Eli Zaretskii
  2003-04-23 17:00   ` H. J. Lu
  2003-04-23 19:45   ` Andrew Cagney
  0 siblings, 2 replies; 7+ messages in thread
From: Eli Zaretskii @ 2003-04-23  9:45 UTC (permalink / raw)
  To: ezannoni; +Cc: gdb, jimb

> From: Elena Zannoni <ezannoni@redhat.com>
> Date: Tue, 22 Apr 2003 16:04:03 -0400
> 
> When the locale is set to Turkish (or any other locale with different
> case rules), the tolower/toupper functions don't work as they would in
> English.  The lowercase version of 'I' is not 'i', for instance, but
> some other character ('i' w/o the dot).

Right, that's one peculiarity of the Turkish language.

> So, I think the whole case-insensitive approach for the names of the
> charsets and the translation tables should probably be removed.

I'm not sure.

> What was the reason behind it? Was it that the user could type
> upper/lower case charset names at the command line?

Yes, that's the reason.

> This patch works, but I am not confident that it's enough.

How about having our own clang_tolower function, which modifies only
7-bit ASCII characters in its argument?  Wouldn't this be a better
solution than requesting the user to type in a certain letter-case?
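
A minimal sketch of what such a function might look like, assuming an
ASCII-compatible host character set (so that 'A'..'Z' are contiguous).
The names ascii_tolower and strcmp_ascii_case_insensitive are
hypothetical; they are not existing GDB or libiberty functions.

```c
/* Lowercase only the 26 ASCII capitals.  Every other value, including
   bytes above 127, is returned unchanged, so the result does not
   depend on setlocale().  */
static int
ascii_tolower (int c)
{
  return (c >= 'A' && c <= 'Z') ? c + ('a' - 'A') : c;
}

/* Compare two charset names, ignoring ASCII case only.  */
static int
strcmp_ascii_case_insensitive (const char *p, const char *q)
{
  while (*p && *q
         && (ascii_tolower ((unsigned char) *p)
             == ascii_tolower ((unsigned char) *q)))
    p++, q++;

  return (ascii_tolower ((unsigned char) *p)
          - ascii_tolower ((unsigned char) *q));
}
```

With this, "ISO-8859-1" and "iso-8859-1" compare equal whatever locale
the user has selected, because no locale-aware function is called.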


* Re: charset.c problem with non-en_US locales
  2003-04-23  9:45 ` Eli Zaretskii
@ 2003-04-23 17:00   ` H. J. Lu
  2003-04-23 19:45   ` Andrew Cagney
  1 sibling, 0 replies; 7+ messages in thread
From: H. J. Lu @ 2003-04-23 17:00 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: ezannoni, gdb, jimb

On Wed, Apr 23, 2003 at 12:41:56PM +0200, Eli Zaretskii wrote:
> > From: Elena Zannoni <ezannoni@redhat.com>
> > Date: Tue, 22 Apr 2003 16:04:03 -0400
> > 
> > When the locale is set to Turkish (or any other locale with different
> > case rules), the tolower/toupper functions don't work as they would in
> > English.  The lowercase version of 'I' is not 'i', for instance, but
> > some other character ('i' w/o the dot).
> 
> Right, that's one peculiarity of the Turkish language.
> 
> > So, I think the whole case-insensitive approach for the names of the
> > charsets and the translation tables should probably be removed.
> 
> I'm not sure.
> 
> > What was the reason behind it? Was it that the user could type
> > upper/lower case charset names at the command line?
> 
> Yes, that's the reason.
> 
> > This patch works, but I am not confident that it's enough.
> 
> How about having our own clang_tolower function, which modifies only
> 7-bit ASCII characters in its argument?  Wouldn't this be a better
> solution than requesting the user to type in a certain letter-case?

Isn't that what "safe-ctype.h" in libiberty is for?


H.J.


* Re: charset.c problem with non-en_US locales
  2003-04-23  9:45 ` Eli Zaretskii
  2003-04-23 17:00   ` H. J. Lu
@ 2003-04-23 19:45   ` Andrew Cagney
  2003-04-24  2:35     ` Jim Blandy
  1 sibling, 1 reply; 7+ messages in thread
From: Andrew Cagney @ 2003-04-23 19:45 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: ezannoni, gdb, jimb

>> From: Elena Zannoni <ezannoni@redhat.com>
>> Date: Tue, 22 Apr 2003 16:04:03 -0400
>> 
>> When the locale is set to Turkish (or any other locale with different
>> case rules), the tolower/toupper functions don't work as they would in
>> English.  The lowercase version of 'I' is not 'i', for instance, but
>> some other character ('i' w/o the dot).
> 
> 
> Right, that's one peculiarity of the Turkish language.
> 
> 
>> So, I think the whole case-insensitive approach for the names of the
>> charsets and the translation tables should probably be removed.
> 
> 
> I'm not sure.
> 
> 
>> What was the reason behind it? Was it that the user could type
>> upper/lower case charset names at the command line?
> 
> 
> Yes, that's the reason.
> 
> 
>> This patch works, but I am not confident that it's enough.
> 
> 
> How about having our own clang_tolower function, which modifies only
> 7-bit ASCII characters in its argument?  Wouldn't this be a better
> solution than requesting the user to type in a certain letter-case?

Hmm,

	(gdb) set charset <tab>

doesn't work.  If that was fixed (using GDB's enum cli method), the 
command would become case sensitive.  Since GDB's CLI is case sensitive 
in general that would make sense.

The alternative would be to add a case-insensitive version of the enum.
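
As a rough sketch of the enum idea (this is not GDB's actual enum
API, whose details are not shown in this thread): keep the valid names
in one NULL-terminated table, match them exactly, and derive the
completion candidates from the same table.  charset_names,
lookup_charset_name and count_completions are hypothetical names; the
table entries are the names registered in the patch earlier in the
thread.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical table of the accepted charset names.  */
static const char *const charset_names[] = {
  "ascii", "ISO-8859-1", "ebcdic-us", "ibm1047", NULL
};

/* Case-sensitive exact match: returns the table entry or NULL.  */
static const char *
lookup_charset_name (const char *name)
{
  for (size_t i = 0; charset_names[i] != NULL; i++)
    if (strcmp (charset_names[i], name) == 0)
      return charset_names[i];
  return NULL;
}

/* Number of table entries that start with PREFIX, i.e. how many
   candidates a <tab> after PREFIX would have to offer.  */
static size_t
count_completions (const char *prefix)
{
  size_t n = 0;
  size_t len = strlen (prefix);

  for (size_t i = 0; charset_names[i] != NULL; i++)
    if (strncmp (charset_names[i], prefix, len) == 0)
      n++;
  return n;
}
```

Because only strcmp and strncmp are used, the lookup is independent of
the locale, at the cost of rejecting "iso-8859-1" when the table says
"ISO-8859-1".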

Andrew



* Re: charset.c problem with non-en_US locales
  2003-04-23 19:45   ` Andrew Cagney
@ 2003-04-24  2:35     ` Jim Blandy
  2003-04-24 20:07       ` Elena Zannoni
  2003-04-24 20:26       ` Andrew Cagney
  0 siblings, 2 replies; 7+ messages in thread
From: Jim Blandy @ 2003-04-24  2:35 UTC (permalink / raw)
  To: Andrew Cagney; +Cc: Eli Zaretskii, ezannoni, gdb


Andrew Cagney <ac131313@redhat.com> writes:
> Hmm,
> 
> 	(gdb) set charset <tab>
> 
> doesn't work.  If that was fixed (using GDB's enum cli method), the
> command would become case sensitive.  Since GDB's CLI is case
> sensitive in general that would make sense.

Actually, the CLI is inconsistent:

    zenia:jimb$ gdb
    GNU gdb 2003-04-17-cvs
    Copyright 2003 Free Software Foundation, Inc.
    ...
    Type "show copying" to see the conditions.
    There is absolutely no warranty for GDB.  Type "show warranty" for details.
    This GDB was configured as "i686-pc-linux-gnu".
    (gdb) print 0
    $1 = 0
    (gdb) PRINT 0
    $2 = 0
    (gdb) PrInT 0
    $3 = 0
    (gdb)

I would assume that the following code in cli-decode.c:lookup_cmd_1
would have the same problems as the charset stuff in a Turkish locale:

  /* 
     ** We didn't find the command in the entered case, so lower case it
     ** and search again.
   */
  if (!found || nfound == 0)
    {
      for (tmp = 0; tmp < len; tmp++)
	{
	  char x = command[tmp];
	  command[tmp] = isupper (x) ? tolower (x) : x;
	}
      found = find_cmd (command, len, clist, ignore_help_classes, &nfound);
    }

It seems to me that:
- cli-decode.c should use safe-ctype.h,
- the enum code in cli-setshow.c should be changed to be
  case-insensitive, using safe-ctype.h, and
- charset.c should be changed to use an enum, for completion's sake.


* Re: charset.c problem with non-en_US locales
  2003-04-24  2:35     ` Jim Blandy
@ 2003-04-24 20:07       ` Elena Zannoni
  2003-04-24 20:26       ` Andrew Cagney
  1 sibling, 0 replies; 7+ messages in thread
From: Elena Zannoni @ 2003-04-24 20:07 UTC (permalink / raw)
  To: Jim Blandy; +Cc: Andrew Cagney, Eli Zaretskii, ezannoni, gdb

Jim Blandy writes:
 > 

All the uses of tolower, toupper, isalpha, etc. should be audited.

 > - charset.c should be changed to use an enum, for completion's sake.

/me doing this now. Patch coming soon.

Actually I don't understand why the charset names need to be case
insensitive.  After all they are acronyms. IBM, ISO, ASCII is the way
you see them normally, not ibm, iso, ascii. They are proper names.

elena


* Re: charset.c problem with non-en_US locales
  2003-04-24  2:35     ` Jim Blandy
  2003-04-24 20:07       ` Elena Zannoni
@ 2003-04-24 20:26       ` Andrew Cagney
  1 sibling, 0 replies; 7+ messages in thread
From: Andrew Cagney @ 2003-04-24 20:26 UTC (permalink / raw)
  To: Jim Blandy; +Cc: Eli Zaretskii, ezannoni, gdb

> Andrew Cagney <ac131313@redhat.com> writes:
> 
>> Hmm,
>> 
>> 	(gdb) set charset <tab>
>> 
>> doesn't work.  If that was fixed (using GDB's enum cli method), the
>> command would become case sensitive.  Since GDB's CLI is case
>> sensitive in general that would make sense.
> 
> 
> Actually, the CLI is inconsistent:
> 
>     zenia:jimb$ gdb
>     GNU gdb 2003-04-17-cvs
>     Copyright 2003 Free Software Foundation, Inc.
>     ...
>     Type "show copying" to see the conditions.
>     There is absolutely no warranty for GDB.  Type "show warranty" for details.
>     This GDB was configured as "i686-pc-linux-gnu".
>     (gdb) print 0
>     $1 = 0
>     (gdb) PRINT 0
>     $2 = 0
>     (gdb) PrInT 0
>     $3 = 0
>     (gdb)

Dig, dig.  Ulgh!  That was added as part of the HP merge ..., part of 
xdb compatibility?   As best I can tell, prior to Dec '98, GDB was 
strictly case sensitive.  It's now kind of both:

- does a case sensitive compare (typically already lowercase against 
lowercase, so pretty immune to i18n)
- does a tolower compare
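
The two passes can be sketched roughly as follows.  This is a
simplified illustration, not the code in cli-decode.c: it matches whole
names instead of prefixes, it lowercases ASCII letters only rather than
calling tolower, and the command table and function names are
hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical command table.  */
static const char *const commands[] = { "print", "P-packet", NULL };

/* Pass 1: case-sensitive exact match.  */
static const char *
find_exact (const char *name)
{
  for (size_t i = 0; commands[i] != NULL; i++)
    if (strcmp (commands[i], name) == 0)
      return commands[i];
  return NULL;
}

/* Try the name as typed; if that fails, lowercase a copy and retry.  */
static const char *
lookup_two_pass (const char *name)
{
  char lowered[64];
  size_t len = strlen (name);
  const char *hit = find_exact (name);

  if (hit != NULL || len >= sizeof lowered)
    return hit;

  /* Pass 2: copy NAME (including the terminating NUL), lowercasing
     only 'A'..'Z'.  */
  for (size_t i = 0; i <= len; i++)
    {
      char c = name[i];
      lowered[i] = (c >= 'A' && c <= 'Z') ? (char) (c + ('a' - 'A')) : c;
    }
  return find_exact (lowered);
}
```

Here "PrInT" is found via the second pass, "P-packet" via the first,
and "p-packet" is not found at all, since lowercasing never produces
the capital 'P' that the table entry needs.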

I think there is sufficient case sensitivity left in GDB for the 
common user (me? :-) to just assume it is so.

Looking at charsets, it has a lowercase table ("iso-8859-1") whereas 
the system (and ISO) specify upper case names 
(GDB_DEFAULT_TARGET_CHARSET "ISO-8859-1") causing the failure.

> It seems to me that:
> - cli-decode.c should use safe-ctype.h,
> - the enum code in cli-setshow.c should be changed to be
>   case-insensitive, using safe-ctype.h, and

I think that would be ill advised.  GDB relies on case sensitivity, viz.:

	(gdb) set remote P-packet off

While at present there isn't an equivalent ``p-packet'' command, there 
will be.  Having ``set remote p-packet'' one day map to ``P-packet'' and 
then, the next, map to ``p-packet'' would be confusing.

> - charset.c should be changed to use an enum, for completion's sake.

Wonder if there is a way of automatically testing <tab> on new commands.

Andrew


Thread overview: 7+ messages
2003-04-22 19:59 charset.c problem with non-en_US locales Elena Zannoni
2003-04-23  9:45 ` Eli Zaretskii
2003-04-23 17:00   ` H. J. Lu
2003-04-23 19:45   ` Andrew Cagney
2003-04-24  2:35     ` Jim Blandy
2003-04-24 20:07       ` Elena Zannoni
2003-04-24 20:26       ` Andrew Cagney
