public inbox for gdb@sourceware.org
* printing wchar_t*
@ 2006-04-13 17:07 Vladimir Prus
  2006-04-13 17:25 ` Eli Zaretskii
  2006-04-13 18:06 ` Jim Blandy
  0 siblings, 2 replies; 53+ messages in thread
From: Vladimir Prus @ 2006-04-13 17:07 UTC (permalink / raw)
  To: gdb


Hi,
at the moment, gdb seems to provide no support for printing wchar_t* values.
It prints them like this:

   (gdb) print p15
   print p15
   $486 = (wchar_t *) 0x80489f8

Is there any "standard" way to make gdb automatically traverse a wchar_t*,
printing values and stopping at the '0' terminator? I don't care much how
it's actually printed; for example, printing raw hex values would work:

   0x56, 0x1456

or using \u escapes:

   'test\u1234'

or whatever.

I have a user-defined command that can produce the output I want, but is
defining a custom command the right approach? 

- Volodya

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: printing wchar_t*
  2006-04-13 17:07 printing wchar_t* Vladimir Prus
@ 2006-04-13 17:25 ` Eli Zaretskii
  2006-04-14  7:29   ` Vladimir Prus
  2006-04-13 18:06 ` Jim Blandy
  1 sibling, 1 reply; 53+ messages in thread
From: Eli Zaretskii @ 2006-04-13 17:25 UTC (permalink / raw)
  To: Vladimir Prus; +Cc: gdb

> From:  Vladimir Prus <ghost@cs.msu.su>
> Date:  Thu, 13 Apr 2006 20:04:32 +0400
> 
> at the moment, gdb seem to provide no support for printing wchar_t* values.
> It prints them like this:
> 
>    (gdb) print p15
>    print p15
>    $486 = (wchar_t *) 0x80489f8
> 
> Is there any "standard" way to make gdb automatically traverse wchar_t*,
> printing values, and stopping at '0' value.

What character set is used by the wide characters in the wchar_t
arrays?  GDB has some support for a few single-byte character sets,
see the node "Character Sets" in the manual.

> I have a user-defined command that can produce the output I want, but is
> defining a custom command the right approach? 

It's one possibility, the other one being to call a function in the
debuggee to produce the string.  Yet another possibility is to do the
conversion in your GUI front end.


* Re: printing wchar_t*
  2006-04-13 17:07 printing wchar_t* Vladimir Prus
  2006-04-13 17:25 ` Eli Zaretskii
@ 2006-04-13 18:06 ` Jim Blandy
  2006-04-13 21:18   ` Eli Zaretskii
  2006-04-14  7:58   ` Vladimir Prus
  1 sibling, 2 replies; 53+ messages in thread
From: Jim Blandy @ 2006-04-13 18:06 UTC (permalink / raw)
  To: Vladimir Prus; +Cc: gdb

On 4/13/06, Vladimir Prus <ghost@cs.msu.su> wrote:
> I have a user-defined command that can produce the output I want, but is
> defining a custom command the right approach?

Well, you'd like wide strings to be printed properly when they appear
in structures, as arguments to functions, and so on, right?  So a
user-defined command isn't ideal.

The best approach would be to extend charset.[ch] to handle wide
character sets as well, and then add code to the language-specific
printing routines to use the charset functions.  (This is fortunately
much simpler than adding support for multibyte characters.)


* Re: printing wchar_t*
  2006-04-13 18:06 ` Jim Blandy
@ 2006-04-13 21:18   ` Eli Zaretskii
  2006-04-14  6:02     ` Jim Blandy
  2006-04-14  7:58   ` Vladimir Prus
  1 sibling, 1 reply; 53+ messages in thread
From: Eli Zaretskii @ 2006-04-13 21:18 UTC (permalink / raw)
  To: Jim Blandy; +Cc: ghost, gdb

> Date: Thu, 13 Apr 2006 10:31:18 -0700
> From: "Jim Blandy" <jimb@red-bean.com>
> Cc: gdb@sources.redhat.com
> 
> The best approach would be to extend charset.[ch] to handle wide
> character sets as well, and then add code to the language-specific
> printing routines to use the charset functions.  (This is fortunately
> much simpler than adding support for multibyte characters.)

Can you tell why you think it's much simpler?


* Re: printing wchar_t*
  2006-04-13 21:18   ` Eli Zaretskii
@ 2006-04-14  6:02     ` Jim Blandy
  2006-04-14  8:43       ` Eli Zaretskii
  0 siblings, 1 reply; 53+ messages in thread
From: Jim Blandy @ 2006-04-14  6:02 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: ghost, gdb

On 4/13/06, Eli Zaretskii <eliz@gnu.org> wrote:
> > Date: Thu, 13 Apr 2006 10:31:18 -0700
> > From: "Jim Blandy" <jimb@red-bean.com>
> > Cc: gdb@sources.redhat.com
> >
> > The best approach would be to extend charset.[ch] to handle wide
> > character sets as well, and then add code to the language-specific
> > printing routines to use the charset functions.  (This is fortunately
> > much simpler than adding support for multibyte characters.)
>
> Can you tell why you think it's much simpler?

Okay --- just to be clear, this is about multi-byte characters, not
wide characters, which is what Volodya was asking about.

- The code for limiting how much of a string GDB will print, and for
detecting repetitions, seemed like it would be hard to adapt to
multibyte encodings.  Remember that you've got to be completely
agnostic about the encoding; there are stateful encodings out there in
widespread use, etc.

- I don't think GDB should use off-the-shelf conversion stuff like
iconv.  For example, if you're looking at ISO-2022 text with the
character set switching escape codes in there, I'd argue it'd be wrong
for GDB to display those strings without showing the escape codes. 
It's a debugger, so people are looking at strings and corresponding
indexes into those strings, and they need to be able to see exactly
what's in there.  iconv handles the escape codes silently.

- Most programs can just print an error message and die if they see
ill-formed multi-byte sequences: you gave them junk; fix it.  GDB
needs to do something more useful; its job is to be helpful exactly
when your program is misbehaving and you don't know why.


* Re: printing wchar_t*
  2006-04-13 17:25 ` Eli Zaretskii
@ 2006-04-14  7:29   ` Vladimir Prus
  2006-04-14  8:47     ` Eli Zaretskii
  0 siblings, 1 reply; 53+ messages in thread
From: Vladimir Prus @ 2006-04-14  7:29 UTC (permalink / raw)
  To: gdb

Eli Zaretskii wrote:

>> From:  Vladimir Prus <ghost@cs.msu.su>
>> Date:  Thu, 13 Apr 2006 20:04:32 +0400
>> 
>> at the moment, gdb seem to provide no support for printing wchar_t*
>> values. It prints them like this:
>> 
>>    (gdb) print p15
>>    print p15
>>    $486 = (wchar_t *) 0x80489f8
>> 
>> Is there any "standard" way to make gdb automatically traverse wchar_t*,
>> printing values, and stopping at '0' value.
> 
> What character set is used by the wide characters in the wchar_t
> arrays?  GDB has some support for a few single-byte character sets,
> see the node "Character Sets" in the manual.

A relatively safe bet would be to assume it's some zero-terminated
character set. I plan to assume it's either UTF-16 or UTF-32 in the GUI
(the conversion code is the same for both encodings), but gdb can just
print raw values.

>> I have a user-defined command that can produce the output I want, but is
>> defining a custom command the right approach?
> 
> It's one possibility, the other one being to call a function in the
> debuggee to produce the string. 

And what would such a function return? A char* in the local 8-bit encoding?
In that case, not all wchar_t* variables can be printed.

> Yet another possibility is to do the 
> conversion in your GUI front end.

That's what I'm going to do, but first I need to get raw data, preferably
without issuing an MI command for every single character.

- Volodya




* Re: printing wchar_t*
  2006-04-13 18:06 ` Jim Blandy
  2006-04-13 21:18   ` Eli Zaretskii
@ 2006-04-14  7:58   ` Vladimir Prus
  2006-04-14  8:07     ` Jim Blandy
  2006-04-14  8:57     ` Eli Zaretskii
  1 sibling, 2 replies; 53+ messages in thread
From: Vladimir Prus @ 2006-04-14  7:58 UTC (permalink / raw)
  To: gdb

Jim Blandy wrote:

> On 4/13/06, Vladimir Prus <ghost@cs.msu.su> wrote:
>> I have a user-defined command that can produce the output I want, but is
>> defining a custom command the right approach?
> 
> Well, you'd like wide strings to be printed properly when they appear
> in structures, as arguments to functions, and so on, right?  So a
> user-defined command isn't ideal.

I think I'll still need to do some processing for wchar_t* on the frontend
side. The problem is that I don't see any way gdb can print wchar_t that
does not require post-processing. It can print it as UTF-8, but then for
printing char* gdb should use the local 8-bit encoding, which is likely
*not* to be UTF-8. Gdb could probably use some extra markers for values,
like:

   "foo"  for a string in the local 8-bit encoding
   L"foo" for a string in UTF-8 encoding.

It's also possible to use "\u" escapes.

But then there's a problem:

   - Do we assume that wchar_t is always UTF-16 or UTF-32?
   - If not:
     - how can the user select this?
     - how will a user-specified encoding be handled?
 
> The best approach would be to extend charset.[ch] to handle wide
> character sets as well, and then add code to the language-specific
> printing routines to use the charset functions.  (This is fortunately
> much simpler than adding support for multibyte characters.)

So, for each wchar_t element the language-specific code will call
'target_wchar_t_to_host', which will output a specific representation of
that wchar_t. Hmm, the interface there seems to assume there's a 1<->1
mapping between target and host characters.  This makes the L"UTF8" format
and the ASCII-string-with-\u-escapes format impossible, it seems.

- Volodya


* Re: printing wchar_t*
  2006-04-14  7:58   ` Vladimir Prus
@ 2006-04-14  8:07     ` Jim Blandy
  2006-04-14  8:30       ` Vladimir Prus
  2006-04-14  8:57     ` Eli Zaretskii
  1 sibling, 1 reply; 53+ messages in thread
From: Jim Blandy @ 2006-04-14  8:07 UTC (permalink / raw)
  To: Vladimir Prus; +Cc: gdb

On 4/13/06, Vladimir Prus <ghost@cs.msu.su> wrote:
> Jim Blandy wrote:
>
> > On 4/13/06, Vladimir Prus <ghost@cs.msu.su> wrote:
> >> I have a user-defined command that can produce the output I want, but is
> >> defining a custom command the right approach?
> >
> > Well, you'd like wide strings to be printed properly when they appear
> > in structures, as arguments to functions, and so on, right?  So a
> > user-defined command isn't ideal.
>
> I think I'll still need to do some processing for wchar_t* on frontend side.
> The problem is that I don't see any way how gdb can print wchar_t in a way
> that does not require post-processing. It can print it as UTF8, but then
> for printing char* gdb should use local 8 bit encoding, which is likely to
> be *not* UTF8. Gdb can probably use some extra markers for values: like:
>
>    "foo"  for string in local 8-bit encoding
>    L"foo" for string in UTF8 encoding.
>
> It's also possible to use "\u" escapes.
>
> But then there's a problem:
>
>    - Do we assume that wchar_t is always UTF-16 or UTF-32?
>    - If not:
>      - how user can select this?
>      - how user-specified encoding will be handled

You can't hard-code assumptions about the character set into GDB.  Nor
can you hard-code the assumption that the host and target character
sets are the same.  GDB needs to do explicit conversions between the
two as needed, and handle mismatches in some reasonable way.

GDB already has the commands 'set host-charset' and 'set
target-charset', so you can assume that you have accurate information
about the character sets at hand.  They fall back to ASCII.

> > The best approach would be to extend charset.[ch] to handle wide
> > character sets as well, and then add code to the language-specific
> > printing routines to use the charset functions.  (This is fortunately
> > much simpler than adding support for multibyte characters.)
>
> So, for each wchar_t element the language-specific code will call
> 'target_wchar_t_to_host', which will output a specific representation of
> that wchar_t. Hmm, the interface there seems to assume there's a 1<->1
> mapping between target and host characters.  This makes the L"UTF8"
> format and the ASCII-string-with-\u-escapes format impossible, it seems.

Not at all.  The current character and string printing code uses those
routines, and it handles unprintable and invalid characters just fine.
See, for example, host_char_print_literally and
c_target_char_has_backslash_escape.

GDB tries to print characters and strings as they would appear in
source code.  C doesn't assume that the source and execution character
sets are the same; by using numeric escapes, you can write programs
for any execution character set in any source character set.  You just
need enough information to manage the overlap.

As far as 1-to-1 mappings are concerned, the only necessary property
is that host_char_to_target and target_char_to_host be inverses, and
return zero for characters that can't make a round trip.  The existing
string-printing code will automatically use numeric escapes for
characters that target_char_to_host won't translate.


* Re: printing wchar_t*
  2006-04-14  8:07     ` Jim Blandy
@ 2006-04-14  8:30       ` Vladimir Prus
  0 siblings, 0 replies; 53+ messages in thread
From: Vladimir Prus @ 2006-04-14  8:30 UTC (permalink / raw)
  To: Jim Blandy; +Cc: gdb

On Friday 14 April 2006 11:29, Jim Blandy wrote:
> On 4/13/06, Vladimir Prus <ghost@cs.msu.su> wrote:
> > Jim Blandy wrote:
> > > On 4/13/06, Vladimir Prus <ghost@cs.msu.su> wrote:
> > >> I have a user-defined command that can produce the output I want, but
> > >> is defining a custom command the right approach?
> > >
> > > Well, you'd like wide strings to be printed properly when they appear
> > > in structures, as arguments to functions, and so on, right?  So a
> > > user-defined command isn't ideal.
> >
> > I think I'll still need to do some processing for wchar_t* on frontend
> > side. The problem is that I don't see any way how gdb can print wchar_t
> > in a way that does not require post-processing. It can print it as UTF8,
> > but then for printing char* gdb should use local 8 bit encoding, which is
> > likely to be *not* UTF8. Gdb can probably use some extra markers for
> > values: like:
> >
> >    "foo"  for string in local 8-bit encoding
> >    L"foo" for string in UTF8 encoding.
> >
> > It's also possible to use "\u" escapes.
> >
> > But then there's a problem:
> >
> >    - Do we assume that wchar_t is always UTF-16 or UTF-32?
> >    - If not:
> >      - how user can select this?
> >      - how user-specified encoding will be handled
>
> You can't hard-code assumptions about the character set into GDB.  Nor
> can you hard-code the assumption that the host and target character
> sets are the same.  GDB needs to do explicit conversions between the
> two as needed, and handle mismatches in some reasonable way.
>
> GDB already has the commands 'set host-charset' and 'set
> target-charset', so you can assume that you have accurate information
> about the character sets at hand.  They fall back to ASCII.

Good, but you need to separately set host-charset for char* and for wchar_t*.
The first can be KOI8-R and the second can be UTF-32 in the same program at 
the same time.

> > > The best approach would be to extend charset.[ch] to handle wide
> > > character sets as well, and then add code to the language-specific
> > > printing routines to use the charset functions.  (This is fortunately
> > > much simpler than adding support for multibyte characters.)
> >
> > So, for each wchar_t element the language-specific code will call
> > 'target_wchar_t_to_host', which will output a specific representation
> > of that wchar_t. Hmm, the interface there seems to assume there's a
> > 1<->1 mapping between target and host characters.  This makes the
> > L"UTF8" format and the ASCII-string-with-\u-escapes format impossible,
> > it seems.
>
> Not at all.  The current character and string printing code uses those
> routines, and it handles unprintable and invalid characters just fine.
>  See, for example, host_char_print_literally, and
> c_target_char_has_backslash_escape.

Can this code produce output in the UTF-8 encoding? Consider this code from c-lang.c:

static void
c_emit_char (int c, struct ui_file *stream, int quoter)
{
  const char *escape;
  int host_char;

  c &= 0xFF;			/* Avoid sign bit follies */

  escape = c_target_char_has_backslash_escape (c);
  if (escape)
    {
      if (quoter == '"' && strcmp (escape, "0") == 0)
	/* Print nulls embedded in double quoted strings as \000 to
	   prevent ambiguity.  */
	fprintf_filtered (stream, "\\000");
      else
	fprintf_filtered (stream, "\\%s", escape);
    }
  else if (target_char_to_host (c, &host_char)
           && host_char_print_literally (host_char))
    {
      if (host_char == '\\' || host_char == quoter)
        fputs_filtered ("\\", stream);
      fprintf_filtered (stream, "%c", host_char);
    }
  else
    fprintf_filtered (stream, "\\%.3o", (unsigned int) c);
}

With a UTF-8 host encoding, we'd want up to 6 host bytes to be output for a
single wchar_t, without escaping. 'host_char' here is a single 4-byte int,
so there's no way for 'target_char_to_host' to produce 6 bytes.

> As far as 1-to-1 mappings are concerned, the only necessary property
> is that host_char_to_target and target_char_to_host be inverses, and
> return zero for characters that can't make a round trip.  The existing
> string-printing code will automatically use numeric escapes for
> characters that target_char_to_host won't translate.

So, assuming numeric escapes are fine with me, I'd need to:

  1. Add a way to specify the encoding of wchar_t* values.
  2. Write a version of c_printstr that will handle wchar_t*. The current
     version just accesses the i-th element of the string, so won't work
     with UTF-16.
  3. Make the new c_printstr call LA_PRINT_CHAR just as the current one
     does; that will handle escapes automatically.
  4. Make sure the new version of c_printstr is invoked for wchar_t* values.

Is that about right?

- Volodya



* Re: printing wchar_t*
  2006-04-14  6:02     ` Jim Blandy
@ 2006-04-14  8:43       ` Eli Zaretskii
  0 siblings, 0 replies; 53+ messages in thread
From: Eli Zaretskii @ 2006-04-14  8:43 UTC (permalink / raw)
  To: Jim Blandy; +Cc: ghost, gdb

> Date: Thu, 13 Apr 2006 11:06:12 -0700
> From: "Jim Blandy" <jimb@red-bean.com>
> Cc: ghost@cs.msu.su, gdb@sources.redhat.com
> 
> On 4/13/06, Eli Zaretskii <eliz@gnu.org> wrote:
> > > Date: Thu, 13 Apr 2006 10:31:18 -0700
> > > From: "Jim Blandy" <jimb@red-bean.com>
> > > Cc: gdb@sources.redhat.com
> > >
> > > The best approach would be to extend charset.[ch] to handle wide
> > > character sets as well, and then add code to the language-specific
> > > printing routines to use the charset functions.  (This is fortunately
> > > much simpler than adding support for multibyte characters.)
> >
> > Can you tell why you think it's much simpler?
> 
> Okay --- just to be clear, this is about multi-byte characters, not
> wide characters, which is what Volodya was asking about.

It's both, as far as I'm concerned: I was asking you to explain why you
think supporting wide characters is much easier than supporting
multi-byte characters.

> - I don't think GDB should use off-the-shelf conversion stuff like
> iconv.  For example, if you're looking at ISO-2022 text with the
> character set switching escape codes in there, I'd argue it'd be wrong
> for GDB to display those strings without showing the escape codes. 
> It's a debugger, so people are looking at strings and corresponding
> indexes into those strings, and they need to be able to see exactly
> what's in there.  iconv handles the escape codes silently.

If we add such support, we should probably have GDB print both the raw
and the printable representation of non-ASCII strings.  We already do
something similar with the char data type.


* Re: printing wchar_t*
  2006-04-14  7:29   ` Vladimir Prus
@ 2006-04-14  8:47     ` Eli Zaretskii
  2006-04-14 12:47       ` Vladimir Prus
  2006-04-14 14:08       ` Paul Koning
  0 siblings, 2 replies; 53+ messages in thread
From: Eli Zaretskii @ 2006-04-14  8:47 UTC (permalink / raw)
  To: Vladimir Prus; +Cc: gdb

> From:  Vladimir Prus <ghost@cs.msu.su>
> Date:  Fri, 14 Apr 2006 10:01:57 +0400
> 
> > What character set is used by the wide characters in the wchar_t
> > arrays?  GDB has some support for a few single-byte character sets,
> > see the node "Character Sets" in the manual.
> 
> Relatively safe bet would be to assume it's some zero-terminated character
> set. I plan to assume it's either UTF-16 or UTF-32 in the GUI (the
> conversion code is the same for both encodings), but gdb can just print raw
> values.

We should get our terminology right: UTF-16 is not a character set,
it's an encoding (and a multibyte encoding, btw).  As for UTF-32, I
don't think such a beast exists at all.

I think you meant 16-bit Unicode characters (a.k.a. the BMP) and
32-bit Unicode characters, respectively.

> > It's one possibility, the other one being to call a function in the
> > debuggee to produce the string. 
> 
> And what would such a function return? A char* in the local 8-bit
> encoding? In that case, not all wchar_t* variables can be printed.

If you want to display non-ASCII strings, it means you already have
some way of displaying such characters.  The function I mentioned
would not return anything, it would actually _display_ the string.

For example, in command-line version of GDB, if the terminal supports
UTF-8 encoded characters, that function would output a UTF-8 encoding
of the non-ASCII string, and then the terminal will display them with
the correct glyphs.

> > Yet another possibility is to do the 
> > conversion in your GUI front end.
> 
> That's what I'm going to do, but first I need to get raw data, preferably
> without issuing an MI command for every single character.

A wchar_t string is just an array, and GDB already has a feature to
produce N elements of an array.  In CLI, you say "print *array@20" to
print the first 20 elements of the named array.


* Re: printing wchar_t*
  2006-04-14  7:58   ` Vladimir Prus
  2006-04-14  8:07     ` Jim Blandy
@ 2006-04-14  8:57     ` Eli Zaretskii
  2006-04-14 12:52       ` Vladimir Prus
  1 sibling, 1 reply; 53+ messages in thread
From: Eli Zaretskii @ 2006-04-14  8:57 UTC (permalink / raw)
  To: Vladimir Prus; +Cc: gdb

> From:  Vladimir Prus <ghost@cs.msu.su>
> Date:  Fri, 14 Apr 2006 10:10:19 +0400
> 
> The problem is that I don't see any way how gdb can print wchar_t in a way
> that does not require post-processing. It can print it as UTF8, but then
> for printing char* gdb should use local 8 bit encoding, which is likely to
> be *not* UTF8.

You are talking about a GUI front-end, aren't you?  In that case, you
will need to code a routine that accepts a wchar_t string, and then
_displays_ it using the appropriate font.  It is wrong to talk about
``printing'' it and about ``local 8-bit encoding'', because you don't
want to encode it, you want to display it using the appropriate font.

In particular, if the original wchar_t uses Unicode codepoints, then
presumably there should be some GUI API call, specific to your
windowing system, that would accept such a wchar_t string and display
it using a Unicode font.

So if you are going to do this in the front-end, I think all you need
is ask GDB to supply the wchar_t string using the array notation; the
rest will have to be done inside the front-end.  Am I missing
something?

> Gdb can probably use some extra markers for values: like:
> 
>    "foo"  for string in local 8-bit encoding
>    L"foo" for string in UTF8 encoding.
> 
> It's also possible to use "\u" escapes.

Why do you need any of these?  16-bit Unicode characters are just
integers, so ask GDB to send them as integers.  That should be all you
need, since displaying them is something your FE will need to do
itself, no?

> But then there's a problem:
> 
>    - Do we assume that wchar_t is always UTF-16 or UTF-32?

You don't need to assume, you can ask the application.  Wouldn't
"sizeof(wchar_t)" do the trick?

>      - how user-specified encoding will be handled

wchar_t is not an encoding, it's the characters' codes themselves.
Encoded characters are (in general) multibyte character strings, not
wchar_t.  See, for example, the description of library functions
mbsinit, mbrlen, mbrtowc, etc., for more about this distinction.


* Re: printing wchar_t*
  2006-04-14  8:47     ` Eli Zaretskii
@ 2006-04-14 12:47       ` Vladimir Prus
  2006-04-14 13:05         ` Eli Zaretskii
  2006-04-14 14:08       ` Paul Koning
  1 sibling, 1 reply; 53+ messages in thread
From: Vladimir Prus @ 2006-04-14 12:47 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: gdb

On Friday 14 April 2006 12:30, Eli Zaretskii wrote:

> > Relatively safe bet would be to assume it's some zero-terminated
> > character set. I plan to assume it's either UTF-16 or UTF-32 in the GUI
> > (the conversion code is the same for both encodings), but gdb can just
> > print raw values.
>
> We should get our terminology right: UTF-16 is not a character set,
> it's an encoding (and a multibyte encoding, btw).  As for UTF-32, I
> don't think such a beast exists at all.
>
> I think you meant 16-bit Unicode characters (a.k.a. the BMP) and
> 32-bit Unicode characters, respectively.

No, I meant UTF-16 encoding (the one with surrogate pairs), and UTF-32
encoding (which does exist, in the Unicode standard).

> > > It's one possibility, the other one being to call a function in the
> > > debuggee to produce the string.
> >
> > And what would such a function return? A char* in the local 8-bit
> > encoding? In that case, not all wchar_t* variables can be printed.
>
> If you want to display non-ASCII strings, it means you already have
> some way of displaying such characters.  The function I mentioned
> would not return anything, it would actually _display_ the string.
>
> For example, in command-line version of GDB, if the terminal supports
> UTF-8 encoded characters, that function would output a UTF-8 encoding
> of the non-ASCII string, and then the terminal will display them with
> the correct glyphs.

This is a non-starter: I can't have the debuggee send data to KDevelop widgets.

> > > Yet another possibility is to do the
> > > conversion in your GUI front end.
> >
> > That's what I'm going to do, but first I need to get raw data,
> > preferably without issuing an MI command for every single character.
>
> A wchar_t string is just an array, and GDB already has a feature to
> produce N elements of an array.  In CLI, you say "print *array@20" to
> print the first 20 elements of the named array.

I don't know how many elements there are, as a wchar_t* string is
zero-terminated, so I'd like gdb to compute the length automatically.

- Volodya



* Re: printing wchar_t*
  2006-04-14  8:57     ` Eli Zaretskii
@ 2006-04-14 12:52       ` Vladimir Prus
  2006-04-14 13:07         ` Daniel Jacobowitz
  2006-04-14 14:16         ` Eli Zaretskii
  0 siblings, 2 replies; 53+ messages in thread
From: Vladimir Prus @ 2006-04-14 12:52 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: gdb

On Friday 14 April 2006 12:43, Eli Zaretskii wrote:
> > From:  Vladimir Prus <ghost@cs.msu.su>
> > Date:  Fri, 14 Apr 2006 10:10:19 +0400
> >
> > The problem is that I don't see any way how gdb can print wchar_t in a
> > way that does not require post-processing. It can print it as UTF8, but
> > then for printing char* gdb should use local 8 bit encoding, which is
> > likely to be *not* UTF8.
>
> You are talking about a GUI front-end, aren't you?  In that case, you
> will need to code a routine that accepts a wchar_t string, and then
> _displays_ it using the appropriate font.  It is wrong to talk about
> ``printing'' it and about ``local 8-bit encoding'', because you don't
> want to encode it, you want to display it using the appropriate font.
>
> In particular, if the original wchar_t uses Unicode codepoints, then
> presumably there should be some GUI API call, specific to your
> windowing system, that would accept such a wchar_t string and display
> it using a Unicode font.

Sure, I know how to display a Unicode string. The question is how to pass
raw Unicode data from gdb to the frontend in the form most suitable for me
and most reasonable for other users of gdb. As I said, I already have a
user-defined command to do this, but it won't benefit other users of gdb.

> So if you are going to do this in the front-end, I think all you need
> is ask GDB to supply the wchar_t string using the array notation; the
> rest will have to be done inside the front-end.  Am I missing
> something?

Yes, I'll need to know the length of the string. I can do this either using
a user-defined gdb command (which again will solve *my* problem, but is a
local solution), or by looking at each character until I see zero, in which
case I'd need a command for each character.

>
> > Gdb can probably use some extra markers for values: like:
> >
> >    "foo"  for string in local 8-bit encoding
> >    L"foo" for string in UTF8 encoding.
> >
> > It's also possible to use "\u" escapes.
>
> Why do you need any of these?  16-bit Unicode characters are just
> integers, so ask GDB to send them as integers.  That should be all you
> need, since displaying them is something your FE will need to do
> itself, no?

In my original post, I asked whether gdb could print wchar_t* just as a raw
sequence of values, like this:

    0x56, 0x1456

"foo" and L"foo" are other alternatives, which might be handier for general
users of gdb.

> > But then there's a problem:
> >
> >    - Do we assume that wchar_t is always UTF-16 or UTF-32?
>
> You don't need to assume, you can ask the application.  Wouldn't
> "sizeof(wchar_t)" do the trick?

Deciding if it's UTF-16 or UTF-32 is not the problem. In fact, exactly the
same code will handle both encodings just fine. The question is whether we
allow encodings which are not UTF-16 or UTF-32. I don't know of any such
encodings, but I'm not an i18n expert.

> >      - how user-specified encoding will be handled
>
> wchar_t is not an encoding, it's the characters' codes themselves.

I don't understand what you say here, sorry. Do you mean that each wchar_t
is in general a code point, not a complete abstract character? Yes, true,
and so what? If wchar_t* literals can use an encoding other than UTF-16 or
UTF-32, you need code to handle that encoding, and the question arises
where you'll get that code: will it be iconv or something else?

> Encoded characters are (in general) multibyte character strings, not
> wchar_t.  See, for example, the description of library functions
> mbsinit, mbrlen, mbrtowc, etc., for more about this distinction.

I know about this distinction.

- Volodya




* Re: printing wchar_t*
  2006-04-14 12:47       ` Vladimir Prus
@ 2006-04-14 13:05         ` Eli Zaretskii
  2006-04-14 13:06           ` Vladimir Prus
  2006-04-14 13:17           ` Daniel Jacobowitz
  0 siblings, 2 replies; 53+ messages in thread
From: Eli Zaretskii @ 2006-04-14 13:05 UTC (permalink / raw)
  To: Vladimir Prus; +Cc: gdb

> From: Vladimir Prus <ghost@cs.msu.su>
> Date: Fri, 14 Apr 2006 12:46:57 +0400
> Cc: gdb@sources.redhat.com
> 
> No, I meant UTF-16 encoding (the one with surrogate pairs), and UTF-32 
> encoding (which does exists, in the Unicode standard).

What software uses that?

Anyway, UTF-16 is a variable-length encoding, so wchar_t is not it.

> > For example, in command-line version of GDB, if the terminal supports
> > UTF-8 encoded characters, that function would output a UTF-8 encoding
> > of the non-ASCII string, and then the terminal will display them with
> > the correct glyphs.
> 
> This is a non-starter. I can't have the debuggee send data to KDevelop widgets.

That was just an example.  I know it's irrelevant to your case (and,
in fact, to any GUI front-end).

> > A wchar_t string is just an array, and GDB already has a feature to
> > produce N elements of an array.  In CLI, you say "print *array@20" to
> > print the first 20 elements of the named array.
> 
> I don't know how many elements there are, as wchar_t* is zero terminated, so 
> I'd like gdb to compute the length automatically.

That's easy.  Assuming that is done, is it all you need?
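
As a hypothetical sketch (not gdb's actual implementation) of what "compute
the length automatically" amounts to: scan the raw target bytes in units of
sizeof(wchar_t) until an all-zero element, with a cap so a missing terminator
cannot make the scan run away.

```cpp
#include <cstddef>
#include <cstring>

// Find the element count of a zero-terminated "wide" string given only its
// raw bytes and the target's element size (2 or 4 in practice; anything up
// to 8 is accepted here).  Returns max_elems if no terminator is found.
size_t wide_string_length(const unsigned char *mem, size_t elem_size,
                          size_t max_elems) {
    static const unsigned char zeros[8] = {0};  // assumes elem_size <= 8
    for (size_t n = 0; n < max_elems; ++n)
        if (std::memcmp(mem + n * elem_size, zeros, elem_size) == 0)
            return n;
    return max_elems;
}
```

Working on raw bytes keeps the sketch independent of the host's wchar_t width,
which differs from the debuggee's in cross-debugging.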

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: printing wchar_t*
  2006-04-14 13:05         ` Eli Zaretskii
@ 2006-04-14 13:06           ` Vladimir Prus
  2006-04-14 13:15             ` Robert Dewar
  2006-04-14 13:17           ` Daniel Jacobowitz
  1 sibling, 1 reply; 53+ messages in thread
From: Vladimir Prus @ 2006-04-14 13:06 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: gdb

On Friday 14 April 2006 16:55, Eli Zaretskii wrote:
> > From: Vladimir Prus <ghost@cs.msu.su>
> > Date: Fri, 14 Apr 2006 12:46:57 +0400
> > Cc: gdb@sources.redhat.com
> >
> > No, I meant UTF-16 encoding (the one with surrogate pairs), and UTF-32
> > encoding (which does exists, in the Unicode standard).
>
> What software uses that?

I'd say, any software using std::wstring on Linux.

> Anyway, UTF-16 is a variable-length encoding, so wchar_t is not it.

Since the C++ standard says nothing about the encoding of wchar_t, a specific 
application can do anything it likes. In particular, I believe that on 
Windows, wchar_t* is assumed to be in UTF-16 encoding.

> > > A wchar_t string is just an array, and GDB already has a feature to
> > > produce N elements of an array.  In CLI, you say "print *array@20" to
> > > print the first 20 elements of the named array.
> >
> > I don't know how many elements there are, as wchar_t* is zero terminated,
> > so I'd like gdb to compute the length automatically.
>
> That's easy.  Assuming that is done, is it all you need?

Yes, that would be sufficient for me.

- Volodya

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: printing wchar_t*
  2006-04-14 12:52       ` Vladimir Prus
@ 2006-04-14 13:07         ` Daniel Jacobowitz
  2006-04-14 14:23           ` Eli Zaretskii
  2006-04-14 14:16         ` Eli Zaretskii
  1 sibling, 1 reply; 53+ messages in thread
From: Daniel Jacobowitz @ 2006-04-14 13:07 UTC (permalink / raw)
  To: Vladimir Prus; +Cc: Eli Zaretskii, gdb

On Fri, Apr 14, 2006 at 12:57:41PM +0400, Vladimir Prus wrote:
> > So if you are going to do this in the front-end, I think all you need
> > is ask GDB to supply the wchar_t string using the array notation; the
> > rest will have to be done inside the front-end.  Am I missing
> > something?
> 
> Yes, I'll need to know the length of the string. I can do this either using 
> a user-defined gdb command (which again will solve *my* problem, but be a 
> local solution), or by looking at each character until I see zero, in which 
> case I'd need one command for each character.

Going away from GDB support for wide characters for a moment, and back to
this: we have a "print N elements" notation; should we extend it to a
"print all non-zero elements" notation?

Alternatively, we could do it specially by recognizing wchar_t, but
I think the general solution might be more useful.

A user-defined command for this isn't all that bad, though.  You can
hopefully define the user command from your frontend.  I haven't tested
this much, but I don't see a reason why it shouldn't work.  If you use
define through -interpreter-exec you get CLI prompts back; ugh, that's
nasty.  And if you try this:

  -interpreter-exec console "define foo\nend"

it gets treated as junk.

Should we make multi-line strings work in -interpreter-exec?

> Deciding if it's UTF-16 or UTF-32 is not the problem. In fact, exactly the 
> same code will handle both encodings just fine. The question if we allow 
> encodings which are not UTF-16 or UTF-32. I don't know about any such 
> encodings, but I'm not an i18n expert.

Eli'd know better than me, but I think that expecting wchar_t to be
Unicode is not reliable.  The glibc manual suggests that it's valid to
use other encodings for wchar_t, although ISO 10646 is typical.

-- 
Daniel Jacobowitz
CodeSourcery

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: printing wchar_t*
  2006-04-14 13:06           ` Vladimir Prus
@ 2006-04-14 13:15             ` Robert Dewar
  0 siblings, 0 replies; 53+ messages in thread
From: Robert Dewar @ 2006-04-14 13:15 UTC (permalink / raw)
  To: Vladimir Prus; +Cc: Eli Zaretskii, gdb

Vladimir Prus wrote:
> On Friday 14 April 2006 16:55, Eli Zaretskii wrote:
>>> From: Vladimir Prus <ghost@cs.msu.su>
>>> Date: Fri, 14 Apr 2006 12:46:57 +0400
>>> Cc: gdb@sources.redhat.com
>>>
>>> No, I meant UTF-16 encoding (the one with surrogate pairs), and UTF-32
>>> encoding (which does exists, in the Unicode standard).
>> What software uses that?
> 
> I'd say, any software using std::wstring on Linux.
> 
>> Anyway, UTF-16 is a variable-length encoding, so wchar_t is not it.
> 
> Since C++ standard says nothing about encoding of wchar_t, specific 
> application can do anything it likes. In particular, I believe that on 
> Windows, wchar_t* is assumed to be in UTF-16 encoding.

It only makes sense to talk about UTF-16 encoding in the context
of wchar_t if wchar_t is 16 bits; otherwise, as noted above, UTF-16
is a variable-length encoding, not suitable for wchar_t.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: printing wchar_t*
  2006-04-14 13:05         ` Eli Zaretskii
  2006-04-14 13:06           ` Vladimir Prus
@ 2006-04-14 13:17           ` Daniel Jacobowitz
  2006-04-14 13:59             ` Robert Dewar
  2006-04-14 14:37             ` Eli Zaretskii
  1 sibling, 2 replies; 53+ messages in thread
From: Daniel Jacobowitz @ 2006-04-14 13:17 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Vladimir Prus, gdb

On Fri, Apr 14, 2006 at 03:55:49PM +0300, Eli Zaretskii wrote:
> Anyway, UTF-16 is a variable-length encoding, so wchar_t is not it.

There's a rant about this in the glibc manual I was just reading...

In fact, on many platforms, wchar_t is only 16-bit.  How exactly you
handle UTF-8 or UCS-4 input in this case, I don't really understand.

-- 
Daniel Jacobowitz
CodeSourcery

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: printing wchar_t*
  2006-04-14 13:17           ` Daniel Jacobowitz
@ 2006-04-14 13:59             ` Robert Dewar
  2006-04-14 14:37             ` Eli Zaretskii
  1 sibling, 0 replies; 53+ messages in thread
From: Robert Dewar @ 2006-04-14 13:59 UTC (permalink / raw)
  To: Eli Zaretskii, Vladimir Prus, gdb

Daniel Jacobowitz wrote:
> On Fri, Apr 14, 2006 at 03:55:49PM +0300, Eli Zaretskii wrote:
>> Anyway, UTF-16 is a variable-length encoding, so wchar_t is not it.
> 
> There's a rant about this in the glibc manual I was just reading...
> 
> In fact, on many platforms, wchar_t is only 16-bit.  How exactly you
> handle UTF-8 or UCS-4 input in this case, I don't really understand.

Seems clear, you can only represent a limited range of codes if you
only have 16 bits!

UTF-8 is a variable length encoding that can represent any character
in the 32-bit range. Obviously if you have to construct wchar_t
values from UTF-8 input, then you will not be able to represent
characters whose codes exceed 65535. Same with UCS-4.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: printing wchar_t*
  2006-04-14  8:47     ` Eli Zaretskii
  2006-04-14 12:47       ` Vladimir Prus
@ 2006-04-14 14:08       ` Paul Koning
  2006-04-14 14:47         ` Eli Zaretskii
  1 sibling, 1 reply; 53+ messages in thread
From: Paul Koning @ 2006-04-14 14:08 UTC (permalink / raw)
  To: eliz; +Cc: ghost, gdb

>>>>> "Eli" == Eli Zaretskii <eliz@gnu.org> writes:

 >> From: Vladimir Prus <ghost@cs.msu.su> Date: Fri, 14 Apr 2006
 >> 10:01:57 +0400
 >> 
 >> > What character set is used by the wide characters in the wchar_t
 >> > arrays?  GDB has some support for a few single-byte character
 >> sets, > see the node "Character Sets" in the manual.
 >> 
 >> Relatively safe bet would be to assume it's some zero-terminated
 >> character set. I plan to assume it's either UTF-16 or UTF-32 in
 >> the GUI (the conversion code is the same for both encodings), but
 >> gdb can just print raw values.

 Eli> We should get our terminology right: UTF-16 is not a character
 Eli> set, it's an encoding (and a multibyte encoding, btw).  As for
 Eli> UTF-32, I don't think such a beast exists at all.

I seem to remember seeing it mentioned.  It certainly makes sense.

 Eli> I think you meant 16-bit Unicode characters (a.k.a. the BMP) and
 Eli> 32-bit Unicode characters, respectively.

If you have 16 bit wide chars, it seems possible that those might
contain UTF-16 encoding of full (beyond BMP) Unicode characters.

	paul

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: printing wchar_t*
  2006-04-14 12:52       ` Vladimir Prus
  2006-04-14 13:07         ` Daniel Jacobowitz
@ 2006-04-14 14:16         ` Eli Zaretskii
  2006-04-14 14:50           ` Vladimir Prus
  1 sibling, 1 reply; 53+ messages in thread
From: Eli Zaretskii @ 2006-04-14 14:16 UTC (permalink / raw)
  To: Vladimir Prus; +Cc: gdb

> From: Vladimir Prus <ghost@cs.msu.su>
> Date: Fri, 14 Apr 2006 12:57:41 +0400
> Cc: gdb@sources.redhat.com
> 
> > In particular, if the original wchar_t uses Unicode codepoints, then
> > presumably there should be some GUI API call, specific to your
> > windowing system, that would accept such a wchar_t string and display
> > it using a Unicode font.
> 
> Sure, I know how to display Unicode strings. The question is how to pass 
> raw Unicode data from gdb to the frontend in the form suitable for me and 
> most reasonable to other users of gdb.

I suggested to use array features for that.

> In an original post, I've asked if gdb can print wchar_t just as a raw 
> sequence of values, like this:
> 
>     0x56, 0x1456

The answer is YES.  Use array notation, and add a feature to report
the length of a wchar_t array.

> "foo" and L"foo" are other alternatives which might be more handy for general 
> users of gdb.

L"foo" will not help you here, because the characters in question are
not printable.  If GDB outputs L"foo" where every character is not
printable, you will have the same problem as you have now.

> > > But then there's a problem:
> > >
> > >    - Do we assume that wchar_t is always UTF-16 or UTF-32?
> >
> > You don't need to assume, you can ask the application.  Wouldn't
> > "sizeof(wchar_t)" do the trick?
> 
> Deciding if it's UTF-16 or UTF-32 is not the problem.

Well, you did ask about the distinction.

> In fact, exactly the same code will handle both encodings just fine.

Again, please don't say "encoding" when you mean a character's codepoint.
It's confusing, and risks obfuscating the problem.  See below.

> The question if we allow encodings which are not UTF-16 or UTF-32. I
> don't know about any such encodings, but I'm not an i18n expert.

There are a myriad of encodings, but the only ones that could ever
qualify as wchar_t are single-byte (8-bit) encodings that are
generally used for Latin languages (and for several others, like
Cyrillic and Hebrew).

What you need is a way to tell GDB how the strings are represented in
the debuggee's wchar_t, and then GDB should convert that
representation into something your FE can display.  Assuming your FE
will be able to display Unicode characters, GDB should convert to
Unicode, if the debuggee's wchar_t is not Unicode already.

There's no universal way for GDB to know what is held in wchar_t by
the debuggee, so I think the only reasonable way is for the user to
tell that.  A reasonable default would be 16-bit Unicode codepoints
from the BMP, or 32-bit Unicode codepoints from the entire range of
Unicode characters.  (I think glibc uses the latter.)
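
As a sketch of that default policy, assuming only the debuggee's
sizeof(wchar_t) is known (all names here are hypothetical, not gdb API):

```cpp
#include <cstddef>

// Hypothetical default interpretation of the debuggee's wchar_t, chosen
// from its size as suggested in the thread; a user setting would override.
enum WideCharSet { UCS2_BMP, UCS4_FULL, UNKNOWN_WIDE };

WideCharSet default_wide_charset(size_t sizeof_wchar) {
    switch (sizeof_wchar) {
    case 2:  return UCS2_BMP;   // 16-bit: BMP codepoints (or UTF-16 units)
    case 4:  return UCS4_FULL;  // 32-bit: full Unicode range (glibc's choice)
    default: return UNKNOWN_WIDE;
    }
}
```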

> > >      - how user-specified encoding will be handled
> >
> > wchar_t is not an encoding, it's the characters' codes themselves.
> 
> I don't understand what you say here, sorry. Do you mean that each wchar_t is 
> in general code point, not a complete abstract character. Yes, true, and 
> what? If wchar_t* literals can use encoding other then UTF-16 and UTF-32, you 
> need the code to handle that encoding, and the question arises where you'll 
> get that code, will it be iconv or something else.
> 
> > Encoded characters are (in general) multibyte character strings, not
> > wchar_t.  See, for example, the description of library functions
> > mbsinit, mbrlen, mbrtowc, etc., for more about this distinction.
> 
> I know about this distinction.

If you know about this distinction, then you should have no trouble
understanding what I said about wchar_t NOT being an encoding.  UTF-8
and UTF-16 are multibyte variable-length _encodings_ of Unicode
character's _codepoints_.  For example, the Cyrillic letter ``small
a'' has Unicode codepoint 0x0430, but its UTF-8 encoding is a two-byte
sequence 0xD0 0xB0.  The codepoint is something you will find in a
wchar_t array, while the UTF-8 encoding is something you will find in
a multibyte string.
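
The arithmetic behind those two bytes can be checked with a small encoder
(a sketch covering BMP codepoints only, which is enough for this example):

```cpp
#include <cstdint>
#include <string>

// Encode one Unicode codepoint (BMP range, U+0000..U+FFFF) as UTF-8.
std::string utf8_encode(uint32_t cp) {
    std::string out;
    if (cp < 0x80) {                          // 1 byte: 0xxxxxxx
        out += char(cp);
    } else if (cp < 0x800) {                  // 2 bytes: 110xxxxx 10xxxxxx
        out += char(0xC0 | (cp >> 6));
        out += char(0x80 | (cp & 0x3F));
    } else {                                  // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += char(0xE0 | (cp >> 12));
        out += char(0x80 | ((cp >> 6) & 0x3F));
        out += char(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For Cyrillic small a: 0x0430 >> 6 is 0x10, so the lead byte is 0xC0 | 0x10 =
0xD0, and 0x0430 & 0x3F is 0x30, so the trail byte is 0x80 | 0x30 = 0xB0,
matching the 0xD0 0xB0 sequence above.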

Now, the same letter ``small a'' can be encoded in several other ways:
for example, its ISO-2022-7bit encoding is 0x1B 0x24 0x2C 0x31 0x28
0x50, its KOI8-r encoding is 0xC1, its ISO-8859-5 encoding is 0xD0,
etc.  It should be obvious that, of all the encodings, only the
fixed-length ones can be used in a wchar_t array (because wchar_t
arrays are stateless, while multibyte encodings produce stateful
strings, where the beginning of each encoded character cannot be
decided without processing all the characters before it).  It should
also be obvious that using wchar_t for single-byte encodings is not
useful (you waste storage).  Thus, the only practical use of wchar_t
is for character sets that do not fit into a single byte, and for
those, all the encodings I know of are variable-length multibyte
encodings, which are not suitable for wchar_t, as mentioned above.

This is why I said that wchar_t is not used for an encoding (such as
ISO-8859-5 or UTF-8 or UTF-16), but for characters' codepoints.  It is
nowadays almost universally accepted that wchar_t is a Unicode
codepoint, the only difference between applications being whether only
the first 64K characters (the so-called BMP) are supported by 16-bit
wchar_t, or the entire 21-bit range is supported by a 32-bit wchar_t.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: printing wchar_t*
  2006-04-14 13:07         ` Daniel Jacobowitz
@ 2006-04-14 14:23           ` Eli Zaretskii
  2006-04-14 14:29             ` Daniel Jacobowitz
  0 siblings, 1 reply; 53+ messages in thread
From: Eli Zaretskii @ 2006-04-14 14:23 UTC (permalink / raw)
  To: Vladimir Prus, gdb

> Date: Fri, 14 Apr 2006 09:05:27 -0400
> From: Daniel Jacobowitz <drow@false.org>
> Cc: Eli Zaretskii <eliz@gnu.org>, gdb@sources.redhat.com
> 
> Going away from GDB support for wide characters for a moment, and back to
> this; we have a "print N elements" notation; should we extend it to a
> "print all non-zero elements" notation?

How about "print elements until you find X", where X is any 8-bit
code, including zero?  That would be useful in some situations, I think.

We will probably need some user-settable limit for the max number of
elements, to avoid running amok in case there's no X.

> Alternatively, we could do it specially by recognizing wchar_t, but
> I think the general solution might be more useful.

I agree.

> Eli'd know better than me, but I think that expecting wchar_t to be
> Unicode is not reliable.

I think we cannot assume Unicode is the only character set, but we can
make Unicode the default and let the user say otherwise if not.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: printing wchar_t*
  2006-04-14 14:23           ` Eli Zaretskii
@ 2006-04-14 14:29             ` Daniel Jacobowitz
  2006-04-14 14:53               ` Eli Zaretskii
  2006-04-14 17:55               ` Jim Blandy
  0 siblings, 2 replies; 53+ messages in thread
From: Daniel Jacobowitz @ 2006-04-14 14:29 UTC (permalink / raw)
  To: gdb

On Fri, Apr 14, 2006 at 05:08:17PM +0300, Eli Zaretskii wrote:
> > Date: Fri, 14 Apr 2006 09:05:27 -0400
> > From: Daniel Jacobowitz <drow@false.org>
> > Cc: Eli Zaretskii <eliz@gnu.org>, gdb@sources.redhat.com
> > 
> > Going away from GDB support for wide characters for a moment, and back to
> > this; we have a "print N elements" notation; should we extend it to a
> > "print all non-zero elements" notation?
> 
> How about "print elements until you find X", where X is any 8-bit
> code, including zero?  That would useful in situations, I think.

Well, I suppose.  But in the general case, there's always user-defined
functions, and hopefully better scripting languages in the future;
is this something that will be frequently useful direct from the
command line?

It'll involve another extension to the language expression parsers, you
see.  We ought to minimize such extensions; e.g. the set of operators
available is fairly limited.

I was thinking "print *ptr@@", by analogy to "print *ptr@5".  Or we
could use the existing @ N syntax.  Right now we issue errors for
anything less than one; so how about "print *ptr@0" for "print *ptr
until you encounter a zero"?

> We will probably need some user-settable limit for the max number of
> elements, to avoid running amok in case there's no X.

We can just use the "set print elements" limit for that.  Although,
it's always bugged me that we use the same setting for "number of
members of an array" and "number of characters in a string"; I usually
want only a few elements of an array, but much more of a string.  Maybe
someday we should separate them.

> I think we cannot assume Unicode is the only character set, but we can
> make Unicode the default and let the user say otherwise if not.

Seems reasonable to me.

-- 
Daniel Jacobowitz
CodeSourcery

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: printing wchar_t*
  2006-04-14 13:17           ` Daniel Jacobowitz
  2006-04-14 13:59             ` Robert Dewar
@ 2006-04-14 14:37             ` Eli Zaretskii
  1 sibling, 0 replies; 53+ messages in thread
From: Eli Zaretskii @ 2006-04-14 14:37 UTC (permalink / raw)
  To: Vladimir Prus, gdb

> Date: Fri, 14 Apr 2006 09:07:29 -0400
> From: Daniel Jacobowitz <drow@false.org>
> Cc: Vladimir Prus <ghost@cs.msu.su>, gdb@sources.redhat.com
> 
> On Fri, Apr 14, 2006 at 03:55:49PM +0300, Eli Zaretskii wrote:
> > Anyway, UTF-16 is a variable-length encoding, so wchar_t is not it.
> 
> There's a rant about this in the glibc manual I was just reading...
> 
> In fact, on many platforms, wchar_t is only 16-bit.  How exactly you
> handle UTF-8 or UCS-4 input in this case, I don't really understand.

Robert answered to that, and I agree with his response.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: printing wchar_t*
  2006-04-14 14:08       ` Paul Koning
@ 2006-04-14 14:47         ` Eli Zaretskii
  2006-04-14 15:00           ` Vladimir Prus
  0 siblings, 1 reply; 53+ messages in thread
From: Eli Zaretskii @ 2006-04-14 14:47 UTC (permalink / raw)
  To: Paul Koning; +Cc: ghost, gdb

> Date: Fri, 14 Apr 2006 09:43:01 -0400
> From: Paul Koning <pkoning@equallogic.com>
> Cc: ghost@cs.msu.su, gdb@sources.redhat.com
> 
> If you have 16 bit wide chars, it seems possible that those might
> contain UTF-16 encoding of full (beyond BMP) Unicode characters.

You could use wchar_t arrays for that, but then not every array
element will be a full character, and you will not be able to access
individual characters by their positional index.

In other words, in this case each element of the wchar_t array is no
longer a ``wide character'', but one of a few shorts that encode a
character.

If we want to support wchar_t arrays that store UTF-16, we will need
to add a feature to GDB to convert UTF-16 to the full UCS-4
codepoints, and output those.  Alternatively, the FE will have to
support display of UTF-16 encoded characters.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: printing wchar_t*
  2006-04-14 14:16         ` Eli Zaretskii
@ 2006-04-14 14:50           ` Vladimir Prus
  2006-04-14 17:18             ` Eli Zaretskii
  0 siblings, 1 reply; 53+ messages in thread
From: Vladimir Prus @ 2006-04-14 14:50 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: gdb

On Friday 14 April 2006 17:59, Eli Zaretskii wrote:

> > In an original post, I've asked if gdb can print wchar_t just as a raw
> > sequence of values, like this:
> >
> >     0x56, 0x1456
>
> The answer is YES.  Use array notation, and add a feature to report
> the length of a wchar_t array.

Ok.

> Now, the same letter ``small a'' can be encoded in several other ways:
> for example, its ISO-2022-7bit encoding is 0x1B 0x24 0x2C 0x31 0x28
> 0x50, its KOI8-r encoding is 0xC1, its ISO-8859-5 encoding is 0xD0,
> etc.  It should be obvious that, of all the encodings, only the
> fixed-length ones can be used in a wchar_t array (because wchar_t
> arrays are stateless, 

I don't think this statement is backed up by anything.

> This is why I said that wchar_t is not used for an encoding (such as
> ISO-8859-5 or UTF-8 or UTF-16), but for characters' codepoints.  It is
> nowadays almost universally accepted that wchar_t is a Unicode
> codepoint, 

Again, can you provide any specific pointers to support that view?

> the only difference between applications being whether only 
> the first 64K characters (the so-called BMP) are supported by 16-bit
> wchar_t, or the entire 23-bit range is supported by a 32-bit wchar_t.

I believe that on Windows:

- wchar_t is 16-bit
- wchar_t* values are supposed to be in UTF-16 encoding (see
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/unicode_9i79.asp)

Do you disagree with any of the above statements? If not, then it directly 
follows that a given wchar_t is not a Unicode code point, but a code unit in 
a specific representation (UTF-16), and a given code point takes either one 
or two code units, that is, either one or two wchar_t. This is contrary to 
your statement that wchar_t is a single code point.

Anyway, this is quickly getting off-topic for gdb list, so maybe we should 
bring this somewhere else.

- Volodya

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: printing wchar_t*
  2006-04-14 14:29             ` Daniel Jacobowitz
@ 2006-04-14 14:53               ` Eli Zaretskii
  2006-04-14 17:10                 ` Daniel Jacobowitz
  2006-04-14 17:55               ` Jim Blandy
  1 sibling, 1 reply; 53+ messages in thread
From: Eli Zaretskii @ 2006-04-14 14:53 UTC (permalink / raw)
  To: gdb

> Date: Fri, 14 Apr 2006 10:16:40 -0400
> From: Daniel Jacobowitz <drow@false.org>
> 
> > How about "print elements until you find X", where X is any 8-bit
> > code, including zero?  That would useful in situations, I think.
> 
> Well, I suppose.  But in the general case, there's always user-defined
> functions, and hopefully better scripting languages in the future;
> is this something that will be frequently useful direct from the
> command line?
> 
> It'll involve another extension to the language expression parsers, you
> see.  We ought to minimize such extensions; e.g. the set of operators
> available is fairly limited.

No, that's not what I had in mind.  I thought about a command which
will set the value of the delimiter, with zero being the default.
Then just use the same syntax as what you had in mind for
zero-delimited arrays.

Does this make sense?

> I was thinking "print *ptr@@", by analogy to "print *ptr@5".

Looks good to me.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: printing wchar_t*
  2006-04-14 14:47         ` Eli Zaretskii
@ 2006-04-14 15:00           ` Vladimir Prus
  2006-04-14 17:53             ` Eli Zaretskii
  0 siblings, 1 reply; 53+ messages in thread
From: Vladimir Prus @ 2006-04-14 15:00 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Paul Koning, gdb

On Friday 14 April 2006 18:29, Eli Zaretskii wrote:
> > Date: Fri, 14 Apr 2006 09:43:01 -0400
> > From: Paul Koning <pkoning@equallogic.com>
> > Cc: ghost@cs.msu.su, gdb@sources.redhat.com
> >
> > If you have 16 bit wide chars, it seems possible that those might
> > contain UTF-16 encoding of full (beyond BMP) Unicode characters.
>
> You could use wchar_t arrays for that, but then not every array
> element will be a full character, and you will not be able to access
> individual characters by their positional index.

And what? Even if wchar_t is 32-bit, the element at position 'i' can be a 
combining character modifying another character, and be of little use by itself.

> In other words, in this case each element of the wchar_t array is no
> longer a ``wide character'', but one of the few shorts that encode a
> character.
>
> If we want to support wchar_t arrays that store UTF-16, we will need
> to add a feature to GDB to convert UTF-16 to the full UCS-4
> codepoints, and output those.  

That's what I mentioned in a reply to Jim -- since the current string 
printing code operates "one wchar_t at a time", it's not suitable for 
outputting UTF-16-encoded wchar_t values to the user.

> Alternatively, the FE will have to 
> support display of UTF-16 encoded characters.

Speaking about the FE, handling UTF-16 is trivial, so printing just the 
wchar_t values will be sufficient. Only if we want to properly show UTF-16 
strings to a user of console gdb may some work be necessary.

- Volodya


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: printing wchar_t*
  2006-04-14 14:53               ` Eli Zaretskii
@ 2006-04-14 17:10                 ` Daniel Jacobowitz
  0 siblings, 0 replies; 53+ messages in thread
From: Daniel Jacobowitz @ 2006-04-14 17:10 UTC (permalink / raw)
  To: gdb

On Fri, Apr 14, 2006 at 05:47:06PM +0300, Eli Zaretskii wrote:
> No, that's not what I had in mind.  I thought about a command which
> will set the value of the delimiter, with zero being the default.
> Then just use the same syntax as what you had in mind for
> zero-delimited arrays.
> 
> Does this make sense?

It seems like something which would be more useful in the expression
than as a global state, but on the other hand, I already made the point
that this wouldn't be frequently used.  I wouldn't object to such a
variable (although I probably wouldn't implement it, either).

> > I was thinking "print *ptr@@", by analogy to "print *ptr@5".
> 
> Looks good to me.

I was going to suggest *ptr@0 again, but I've remembered that these
actually take expressions, not just integers.  So @@ sounds good
to me, unless anyone knows a language where we can get away with using
@ for artificial arrays, but can't steal @@ also.

-- 
Daniel Jacobowitz
CodeSourcery

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: printing wchar_t*
  2006-04-14 14:50           ` Vladimir Prus
@ 2006-04-14 17:18             ` Eli Zaretskii
  2006-04-14 18:03               ` Jim Blandy
  0 siblings, 1 reply; 53+ messages in thread
From: Eli Zaretskii @ 2006-04-14 17:18 UTC (permalink / raw)
  To: Vladimir Prus; +Cc: gdb

> From: Vladimir Prus <ghost@cs.msu.su>
> Date: Fri, 14 Apr 2006 18:37:25 +0400
> Cc: gdb@sources.redhat.com
> 
> > Now, the same letter ``small a'' can be encoded in several other ways:
> > for example, its ISO-2022-7bit encoding is 0x1B 0x24 0x2C 0x31 0x28
> > 0x50, its KOI8-r encoding is 0xC1, its ISO-8859-5 encoding is 0xD0,
> > etc.  It should be obvious that, of all the encodings, only the
> > fixed-length ones can be used in a wchar_t array (because wchar_t
> > arrays are stateless, 
> 
> I don't think this statement is backed up by anything.
> 
> > This is why I said that wchar_t is not used for an encoding (such as
> > ISO-8859-5 or UTF-8 or UTF-16), but for characters' codepoints.  It is
> > nowadays almost universally accepted that wchar_t is a Unicode
> > codepoint, 
> 
> Again, can you provide any specific pointers to support that view?

I think Robert and myself already explained that in later messages.
Feel free to ask specific questions if something is still unclear.

> I believe that on Windows:
> 
> - wchar_t is 16-bit
> - wchar_t* values are supposed to be in UTF-16 encoding
> (see    
> http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/unicode_9i79.asp
> 
> Do you disagree with any of the above statements?

wchar_t is just an integer type.  You can stuff _anything_ into an
integer array, but if you put UTF-16 there, each element is no longer
a character, it is one of a few 16-bit integers that encode a
character.  In other words, it's a variant of multibyte strings,
except that each element is 16-bit wide.

Now, I know that Windows holds 16-bit UTF-16 encodings in wchar_t
arrays, but that is not the L"foo" strings of wide characters.  In the
L"foo" notation, each of the 3 string characters _always_ occupies
exactly one wchar_t element, and L"foo"[1] is _always_ the second
character of the string.  This is not true for UTF-16, as I hope is
clear from this discussion.  In UTF-16, array[1] is the second 16-bit
value that encodes a character, and that character's encoding could
need more than 1 16-bit value.
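
A small illustration of this point, assuming UTF-16 code units in a 16-bit
array (the helper name is hypothetical): for an ASCII string like L"foo",
index and character coincide, but as soon as a non-BMP character appears,
element [1] is merely the trailing half of a surrogate pair.

```cpp
#include <cstddef>
#include <cstdint>

// Count complete characters (codepoints) in a zero-terminated UTF-16
// array: a surrogate pair counts as one character but two array slots,
// so the element count and the character count can differ.
size_t utf16_char_count(const uint16_t *s) {
    size_t chars = 0;
    for (size_t i = 0; s[i] != 0; ++i)
        if (!(s[i] >= 0xDC00 && s[i] <= 0xDFFF))  // skip trailing halves
            ++chars;
    return chars;
}
```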

> If not, then it directly 
> follows that a given wchar_t is not a Unicode code point, but a code unit in 
> specific representation (UTF-16), and a given code points takes either one or 
> two code units, that is either one or two wchar_t. This is contrary to your 
> statement that wchar_t is a single code point.

My statement was based on the assumption that you are coding for a
system where wchar_t is used for complete characters, not for UTF-16
strings.  Only in that case, you can talk about ``wide characters''
and about wchar_t being a character.  In UTF-16, an arbitrary element
of the array might not be a complete character.

> Anyway, this is quickly getting off-topic for gdb list, so maybe we should 
> bring this somewhere else.

It _is_ on topic, IMHO, as long as we discuss features to be added to
GDB.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: printing wchar_t*
  2006-04-14 15:00           ` Vladimir Prus
@ 2006-04-14 17:53             ` Eli Zaretskii
  2006-04-17  7:05               ` Vladimir Prus
  0 siblings, 1 reply; 53+ messages in thread
From: Eli Zaretskii @ 2006-04-14 17:53 UTC (permalink / raw)
  To: Vladimir Prus; +Cc: pkoning, gdb

> From: Vladimir Prus <ghost@cs.msu.su>
> Date: Fri, 14 Apr 2006 18:50:07 +0400
> Cc: Paul Koning <pkoning@equallogic.com>,  gdb@sources.redhat.com
> 
> > You could use wchar_t arrays for that, but then not every array
> > element will be a full character, and you will not be able to access
> > individual characters by their positional index.
> 
> And what? Even if wchar_t is 32 bit then element at position 'i' can be 
> combining character modifying another character, and be of little use itself.

You are introducing into the argument yet another face of a character:
how it is displayed.  It's true that some characters, when they are
adjacent to each other, are displayed in some special way (the ff
ligature is one simple example of that), but that is something for the
rendering engine to take care of, it has nothing to do with the
string's content.  As far as any software, except the rendering
engine, is concerned, the combining character is, in fact, part of the
string.  For example, if the user wants to search for such a
character, the program must find it.

So, for the purposes of processing the wchar_t strings, it is very
important to know whether they are fixed-size wide characters or
variable-size encoding.  If you just copy the string verbatim to and
fro, then it doesn't matter, but for anything more complex the
difference is very large.

> > If we want to support wchar_t arrays that store UTF-16, we will need
> > to add a feature to GDB to convert UTF-16 to the full UCS-4
> > codepoints, and output those.  
> 
> That's what I mentioned in a reply to Jim -- since the current string printing 
> code operates "one wchar_t at a time", it's not suitable for outputting UTF-16 
> encoded wchar_t values to the user.

I don't understand: if the wchar_t array holds a UTF-16 encoding, then
when you receive the entire string, you have a UTF-16 encoding of what
you want to display, and you yourself said that displaying a UTF-16
encoded string is easy for you.  So where is the problem?  Is it only
that you cannot know the length of the UTF-16 encoded string, or is
there something else missing?

> > Alternatively, the FE will have to 
> > support display of UTF-16 encoded characters.
> 
> Speaking about FE, handling UTF-16 is trivial

Maybe in your environment and windowing system, but not in all cases,
AFAIK.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: printing wchar_t*
  2006-04-14 14:29             ` Daniel Jacobowitz
  2006-04-14 14:53               ` Eli Zaretskii
@ 2006-04-14 17:55               ` Jim Blandy
  2006-04-14 18:27                 ` Eli Zaretskii
  1 sibling, 1 reply; 53+ messages in thread
From: Jim Blandy @ 2006-04-14 17:55 UTC (permalink / raw)
  To: gdb

On 4/14/06, Daniel Jacobowitz <drow@false.org> wrote:
> I was thinking "print *ptr@@", by analogy to "print *ptr@5".  Or we
> could use the existing @ N syntax.  Right now we issue errors for
> anything less than one; so how about "print *ptr@0" for "print *ptr
> until you encounter a zero"?

I much prefer LVAL@@ to LVAL@0.

I don't think it's worth complicating the syntax for searching for a
zero terminator in order to allow one to search for an arbitrary
terminator.  I think that will require more typing in the much more
common case, and there are other ways to serve the need to search for
arbitrary terminators.
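The search the proposed LVAL@@ would perform amounts to something like the following sketch (the function name and the explicit bound are invented for illustration; GDB itself would read target memory rather than a local array):

```c
#include <assert.h>
#include <stddef.h>

/* Count elements up to, but not including, the first zero element.
   The max bound keeps a missing terminator from running away. */
static size_t count_until_zero(const int *elts, size_t max)
{
    size_t n = 0;
    while (n < max && elts[n] != 0)
        n++;
    return n;
}
```

Once the count is known, printing reduces to the existing `print *ptr@N` machinery.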

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: printing wchar_t*
  2006-04-14 17:18             ` Eli Zaretskii
@ 2006-04-14 18:03               ` Jim Blandy
  2006-04-14 19:16                 ` Eli Zaretskii
  2006-04-14 19:53                 ` Mark Kettenis
  0 siblings, 2 replies; 53+ messages in thread
From: Jim Blandy @ 2006-04-14 18:03 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Vladimir Prus, gdb

I think folks are seeing difficult problems where there aren't any. 
Even if the host character set (that is, the character set GDB is
using to communicate with its user, or in its MI communications) is
plain, old ASCII, GDB can, without any loss of information, convey the
contents of a wide string using an arbitrary target character set via
MI to a GUI, using code the GUI must already have.

Suppose we have a wide string where wchar_t values are Unicode code
points.  Suppose our host character set is plain ASCII.  Suppose the
user's program has a string containing the digits '123', followed by
some funky Tibetan characters U+0F04 U+0FCC, followed by the letters
'xyz'.  When asked to print that string, GDB should print the
following twenty-one ASCII characters:

L"123\x0f04\x0fccxyz"

Since this is a valid way to write that string in a source program, a
user at the GDB command line should understand it.  Since consumers of
MI information must contain parsers for C values already, they can
reliably find the contents of the string.
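Under the same assumptions (wchar_t values are Unicode code points, host character set is plain ASCII), the proposed output could be produced by something like this sketch; the function name and buffer handling are invented for illustration:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Emit a 0-terminated wide string as an ASCII C wide-string literal:
   printable ASCII characters pass through, everything else becomes a
   \xNNNN escape. */
static void escape_wide(const uint32_t *ws, char *out, size_t outsz)
{
    char *p = out;
    p += snprintf(p, outsz, "L\"");
    for (; *ws; ws++) {
        if (*ws >= 0x20 && *ws < 0x7F && *ws != '"' && *ws != '\\')
            p += snprintf(p, outsz - (p - out), "%c", (char)*ws);
        else
            p += snprintf(p, outsz - (p - out), "\\x%04x", (unsigned)*ws);
    }
    snprintf(p, outsz - (p - out), "\"");
}
```

Feeding it the example string {'1','2','3',U+0F04,U+0FCC,'x','y','z'} yields exactly the twenty-one characters L"123\x0f04\x0fccxyz".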

Note that this gets a GUI the contents of the string in the *target*
character set.  The GUI itself should be responsible for converting
target characters to whatever character set it wants to use to present
data to its user.  Here, GDB's 'host' character set is just the
character set used to carry information from GDB to the GUI; it should
probably be set to ASCII, just to avoid needless variation.  But
either way, it's just acting as a medium for values in C source code
syntax, and has no bearing on either the character set the target
program is using, or the character set the GUI will use to present
data to its user.

Unicode technical report #17 lays out the terminology the Unicode
folks use for all this stuff, with good explanations:
http://www.unicode.org/reports/tr17/

According to the ISO C standard, the coding character set used by
wchar_t must be a superset of that used by char for members of the
basic character set.  See ISO/IEC 9899:1999 (E) section 7.17,
paragraph 2.  So I think it's sufficient for the user to specify the
coding character set used by wide characters; that fixes the ccs used
for char values.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: printing wchar_t*
  2006-04-14 17:55               ` Jim Blandy
@ 2006-04-14 18:27                 ` Eli Zaretskii
  2006-04-14 18:30                   ` Jim Blandy
  0 siblings, 1 reply; 53+ messages in thread
From: Eli Zaretskii @ 2006-04-14 18:27 UTC (permalink / raw)
  To: Jim Blandy; +Cc: gdb

> Date: Fri, 14 Apr 2006 10:18:10 -0700
> From: "Jim Blandy" <jimb@red-bean.com>
> 
> I much prefer LVAL@@ to LVAL@0.

Agreed.

> I don't think it's worth complicating the syntax for searching for a
> zero terminator in order to allow one to search for an arbitrary
> terminator.

Then how will you find the zero terminator?  With wcslen?  That is
only good for wchar_t strings, not for arbitrary integer arrays.  And
I thought Daniel was suggesting something more general than just
wchar_t arrays.

> I think that will require more typing in the much more common case

??? What typing?  I suggested an additional command that will set the
terminator; after that, it's the same typing as with zero.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: printing wchar_t*
  2006-04-14 18:27                 ` Eli Zaretskii
@ 2006-04-14 18:30                   ` Jim Blandy
  2006-04-14 19:19                     ` Eli Zaretskii
  0 siblings, 1 reply; 53+ messages in thread
From: Jim Blandy @ 2006-04-14 18:30 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: gdb

On 4/14/06, Eli Zaretskii <eliz@gnu.org> wrote:
> > Date: Fri, 14 Apr 2006 10:18:10 -0700
> > From: "Jim Blandy" <jimb@red-bean.com>
> >
> > I much prefer LVAL@@ to LVAL@0.
>
> Agreed.
>
> > I don't think it's worth complicating the syntax for searching for a
> > zero terminator in order to allow one to search for an arbitrary
> > terminator.
>
> Then how will you find the zero terminator?  With wcslen?  That is
> only good for wchar_t strings, not for arbitrary integer arrays.  And
> I thought Daniel was suggesting something more general than just
> wchar_t arrays.

He is.  I am, too.  Just search for elements equal to zero.  If LVAL's
type can't be compared with zero, then you can't use @@ on it.

> > I think that will require more typing in the much more common case
>
> ??? What typing?  I suggested an additional command that will set the
> terminator; after that, it's the same typing as with zero.

Yes.  I said, "I don't think it's worth complicating the syntax for
searching for a zero terminator...".  Providing an additional command
to set the terminator doesn't complicate the syntax.  You're assuming
I was speaking directly to your suggestion, when I was instead simply
stating the requirements I think we should meet.

That said, I don't even think we should have a separate command for
setting the terminating value for @@.  I think we should wait until
someone has a need for it arising out of a real-life use case, not a
design conversation.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: printing wchar_t*
  2006-04-14 18:03               ` Jim Blandy
@ 2006-04-14 19:16                 ` Eli Zaretskii
  2006-04-14 19:22                   ` Jim Blandy
  2006-04-14 19:53                 ` Mark Kettenis
  1 sibling, 1 reply; 53+ messages in thread
From: Eli Zaretskii @ 2006-04-14 19:16 UTC (permalink / raw)
  To: Jim Blandy; +Cc: ghost, gdb

> Date: Fri, 14 Apr 2006 10:53:44 -0700
> From: "Jim Blandy" <jimb@red-bean.com>
> Cc: "Vladimir Prus" <ghost@cs.msu.su>, gdb@sources.redhat.com
> 
> I think folks are seeing difficult problems where there aren't any. 

What difficulties?  There _are_ no difficulties ;-)

> Suppose we have a wide string where wchar_t values are Unicode code
> points.  Suppose our host character set is plain ASCII.  Suppose the
> user's program has a string containing the digits '123', followed by
> some funky Tibetan characters U+0F04 U+0FCC, followed by the letters
> 'xyz'.  When asked to print that string, GDB should print the
> following twenty-one ASCII characters:
> 
> L"123\x0f04\x0fccxyz"

This will work, if we accept your assumptions (which are by no means
universally correct, e.g. parts of our discussion were around whether
the string contains U+XXXX Unicode codepoints or their UTF-16
encodings).  But all you did is invent an encoding (and a
variable-size encoding at that).  Something in the GUI FE still has to
interpret that encoding, i.e. convert it back to binary representation
of the characters, because your encoding cannot be displayed by any
known GUI API.

Compare this with the facility that we already have today:

 (gdb) print *warray@8
  {0x0031, 0x0032, 0x0033, 0x0F04, 0x0FCC, 0x0078, 0x0079, 0x007A}

Except for using up 60-odd characters where you used 21, this is IMHO
better, since it doesn't require any code on the FE side: just convert
the strings to integers, and you've got Unicode, ready to be used for
whatever purposes.

> Since this is a valid way to write that string in a source program, a
> user at the GDB command line should understand it.  Since consumers of
> MI information must contain parsers for C values already, they can
> reliably find the contents of the string.

I only partly agree with the first sentence, and not at all with the
second.

For the interactive user, understanding non-ASCII strings in the
suggested ASCII encoding might not be easy at all.  For example, for
all my knowledge of Hebrew, if someone shows me \x05D2, I will have
a hard time recognizing the letter Gimel.

As for the second sentence, ``reliably find the contents of the
string'' there obviously doesn't consider the complexities of handling
wide characters.  In my experience, for any non-trivial string
processing, working with variable-size encoding is much harder than
with fixed-size wchar_t arrays, because you need to interpret the
bytes as you go, even if all you need is to find the n-th character.
Even the simple task of computing the number of characters in the
string becomes complicated.
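That complication can be illustrated for the UTF-16-in-16-bit-wchar_t case (function name hypothetical): the number of characters differs from the number of array elements, because surrogate pairs use two elements per character.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count characters (code points) in a 0-terminated UTF-16 string. */
static size_t utf16_char_count(const uint16_t *s)
{
    size_t chars = 0;
    while (*s) {
        /* A high surrogate followed by a low surrogate is one character. */
        if (s[0] >= 0xD800 && s[0] <= 0xDBFF &&
            s[1] >= 0xDC00 && s[1] <= 0xDFFF)
            s += 2;
        else
            s += 1;
        chars++;
    }
    return chars;
}
```

With fixed-size wide characters the same count is just the element count, with no per-element inspection.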

> Note that this gets a GUI the contents of the string in the *target*
> character set.  The GUI itself should be responsible for converting
> target characters to whatever character set it wants to use to present
> data to its user.  Here, GDB's 'host' character set is just the
> character set used to carry information from GDB to the GUI; it should
> probably be set to ASCII, just to avoid needless variation.  But
> either way, it's just acting as a medium for values in C source code
> syntax, and has no bearing on either the character set the target
> program is using, or the character set the GUI will use to present
> data to its user.

What you are suggesting is simple for GDB, but IMHO leaves too much
complexity to the FE.  I think GDB could do better.  In particular, if
I'm sitting at a UTF-8 enabled xterm, I'd be grateful if GDB would
show me Unicode characters in their normal glyphs, which would require
GDB to output the characters in their UTF-8 encoding (which the
terminal will then display in human-readable form).  Your suggestion
doesn't allow such a feature, AFAICS, at least not for CLI users.

That said, if someone volunteers to do the job of adding your
suggestions to GDB, I won't object to accepting the patches, because
whoever does the job gets to choose the tools.

> Unicode technical report #17 lays out the terminology the Unicode
> folks use for all this stuff, with good explanations:
> http://www.unicode.org/reports/tr17/

Yes, that's a good background reading for related stuff.

> According to the ISO C standard, the coding character set used by
> wchar_t must be a superset of that used by char for members of the
> basic character set.  See ISO/IEC 9899:1999 (E) section 7.17,
> paragraph 2.  So I think it's sufficient for the user to specify the
> coding character set used by wide characters; that fixes the ccs used
> for char values.

If wchar_t uses fixed-size characters, not their variable-size
encodings, then specifying the CCS will do.  Encodings are another
matter; as I wrote earlier, there could be many different encodings of
the same CCS, and I suppose some weirdo software somewhere could stuff
such an encoding into a wchar_t.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: printing wchar_t*
  2006-04-14 18:30                   ` Jim Blandy
@ 2006-04-14 19:19                     ` Eli Zaretskii
  2006-04-14 19:48                       ` Jim Blandy
  0 siblings, 1 reply; 53+ messages in thread
From: Eli Zaretskii @ 2006-04-14 19:19 UTC (permalink / raw)
  To: Jim Blandy; +Cc: gdb

> Date: Fri, 14 Apr 2006 11:03:38 -0700
> From: "Jim Blandy" <jimb@red-bean.com>
> Cc: gdb@sourceware.org
> 
> > > I don't think it's worth complicating the syntax for searching for a
> > > zero terminator in order to allow one to search for an arbitrary
> > > terminator.
> >
> > Then how will you find the zero terminator?  With wcslen?  That is
> > only good for wchar_t strings, not for arbitrary integer arrays.  And
> > I thought Daniel was suggesting something more general than just
> > wchar_t arrays.
> 
> He is.  I am, too.  Just search for elements equal to zero.

How is this different or more complex than searching for elements that
are equal to some other constant value?

> That said, I don't even think we should have a separate command for
> setting the terminating value for @@.  I think we should wait until
> someone has a need for it arising out of a real-life use case, not a
> design conversation.

What Daniel suggested didn't come from a clear-cut real-life use-case,
either.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: printing wchar_t*
  2006-04-14 19:16                 ` Eli Zaretskii
@ 2006-04-14 19:22                   ` Jim Blandy
  2006-04-14 22:18                     ` Daniel Jacobowitz
  2006-04-15  7:14                     ` Eli Zaretskii
  0 siblings, 2 replies; 53+ messages in thread
From: Jim Blandy @ 2006-04-14 19:22 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: ghost, gdb

On 4/14/06, Eli Zaretskii <eliz@gnu.org> wrote:
> > Suppose we have a wide string where wchar_t values are Unicode code
> > points.  Suppose our host character set is plain ASCII.  Suppose the
> > user's program has a string containing the digits '123', followed by
> > some funky Tibetan characters U+0F04 U+0FCC, followed by the letters
> > 'xyz'.  When asked to print that string, GDB should print the
> > following twenty-one ASCII characters:
> >
> > L"123\x0f04\x0fccxyz"
>
> This will work, if we accept your assumptions (which are by no means
> universally correct, e.g. parts of our discussion were around whether
> the string contains U+XXXX Unicode codepoints or their UTF-16
> encodings).  But all you did is invent an encoding (and a
> variable-size encoding at that).  Something in the GUI FE still has to
> interpret that encoding, i.e. convert it back to binary representation
> of the characters, because your encoding cannot be displayed by any
> known GUI API.

The command line and MI already use the ISO C syntax for conveying
values to the user/consumer.  I'm just saying we should expand our use
of the syntax we already use.

I posited that the target character set was Unicode, but the same
mechanism will work no matter what character set and encoding the
target uses.  No matter what string appears on the target, there is
always a source-language representation for that target.  According to
ISO C, the \x escapes specify char or wchar_t values in the target
character set.  So you can always write whatever you've got.

> Compare this with the facility that we already have today:
>
>  (gdb) print *warray@8
>   {0x0031, 0x0032, 0x0033, 0x0F04, 0x0FCC, 0x0078, 0x0079, 0x007A}
>
> Except for using up 60-odd characters where you used 21, this is IMHO
> better, since it doesn't require any code on the FE side: just convert
> the strings to integers, and you've got Unicode, ready to be used for
> whatever purposes.

If you're printing an expression that evaluates to a string, sure. 
But what if you're printing a value of type struct { wchar_t *key;
wchar_t *value; }?  What if you're using -stack-list-arguments to show
values in a stack frame?

My point is, MI consumers are already parsing ISO C strings.  They
just need to parse more of them.

> > Since this is a valid way to write that string in a source program, a
> > user at the GDB command line should understand it.  Since consumers of
> > MI information must contain parsers for C values already, they can
> > reliably find the contents of the string.
>
> I only partly agree with the first sentence, and not at all with the
> second.
>
> For the interactive user, understanding non-ASCII strings in the
> suggested ASCII encoding might not be easy at all.  For example, for
> all my knowledge of Hebrew, if someone shows me \x05D2, I will have
> a hard time recognizing the letter Gimel.

If the host character set includes Gimel, then GDB won't print it with
a hex escape.

> As for the second sentence, ``reliably find the contents of the
> string'' there obviously doesn't consider the complexities of handling
> wide characters.  In my experience, for any non-trivial string
> processing, working with variable-size encoding is much harder than
> with fixed-size wchar_t arrays, because you need to interpret the
> bytes as you go, even if all you need is to find the n-th character.
> Even the simple task of computing the number of characters in the
> string becomes complicated.

I don't understand what you mean.  The rules for parsing ISO C string
literals into arrays of chars and wide string literals into arrays of
wide characters are straightforward.

> What you are suggesting is simple for GDB, but IMHo leaves too much
> complexity to the FE.  I think GDB could do better.  In particular, if
> I'm sitting at a UTF-8 enabled xterm, I'd be grateful if GDB would
> show me Unicode characters in their normal glyphs, which would require
> GDB to output the characters in their UTF-8 encoding (which the
> terminal will then display in human-readable form).  Your suggestion
> doesn't allow such a feature, AFAICS, at least not for CLI users.

When the host character set contains a character, there's no need for
GDB to use an escape to show it.

> If wchar_t uses fixed-size characters, not their variable-size
> encodings, then specifying the CCS will do.

There is no provision in ISO C for variable-size wchar_t encodings. 
The portion of the standard I referred to says that wchar_t "...is an
integer type whose range of values can represent distinct codes for
all members of the largest extended character set specified among the
supported locales".

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: printing wchar_t*
  2006-04-14 19:19                     ` Eli Zaretskii
@ 2006-04-14 19:48                       ` Jim Blandy
  0 siblings, 0 replies; 53+ messages in thread
From: Jim Blandy @ 2006-04-14 19:48 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: gdb

On 4/14/06, Eli Zaretskii <eliz@gnu.org> wrote:
> > Date: Fri, 14 Apr 2006 11:03:38 -0700
> > From: "Jim Blandy" <jimb@red-bean.com>
> > Cc: gdb@sourceware.org
> >
> > > > I don't think it's worth complicating the syntax for searching for a
> > > > zero terminator in order to allow one to search for an arbitrary
> > > > terminator.
> > >
> > > Then how will you find the zero terminator?  With wcslen?  That is
> > > only good for wchar_t strings, not for arbitrary integer arrays.  And
> > > I thought Daniel was suggesting something more general than just
> > > wchar_t arrays.
> >
> > He is.  I am, too.  Just search for elements equal to zero.
>
> How is this different or more complex than searching for elements that
> are equal to some other constant value?

It's not hard; it's trivial.  I just think we shouldn't add the option
until there's a real-life use case showing someone who wants it.

> > That said, I don't even think we should have a separate command for
> > setting the terminating value for @@.  I think we should wait until
> > someone has a need for it arising out of a real-life use case, not a
> > design conversation.
>
> What Daniel suggested didn't come from a clear-cut real-life use-case,
> either.

It came from Volodya's original request.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: printing wchar_t*
  2006-04-14 18:03               ` Jim Blandy
  2006-04-14 19:16                 ` Eli Zaretskii
@ 2006-04-14 19:53                 ` Mark Kettenis
  1 sibling, 0 replies; 53+ messages in thread
From: Mark Kettenis @ 2006-04-14 19:53 UTC (permalink / raw)
  To: jimb; +Cc: eliz, ghost, gdb

> Date: Fri, 14 Apr 2006 10:53:44 -0700
> From: "Jim Blandy" <jimb@red-bean.com>
> 
> I think folks are seeing difficult problems where there aren't any. 
> Even if the host character set (that is, the character set GDB is
> using to communicate with its user, or in its MI communications) is
> plain, old ASCII, GDB can, without any loss of information, convey the
> contents of a wide string using an arbitrary target character set via
> MI to a GUI, using code the GUI must already have.
> 
> Suppose we have a wide string where wchar_t values are Unicode code
> points.  Suppose our host character set is plain ASCII.  Suppose the
> user's program has a string containing the digits '123', followed by
> some funky Tibetan characters U+0F04 U+0FCC, followed by the letters
> 'xyz'.  When asked to print that string, GDB should print the
> following twenty-one ASCII characters:
> 
> L"123\x0f04\x0fccxyz"
> 
> Since this is a valid way to write that string in a source program, a
> user at the GDB command line should understand it.  Since consumers of
> MI information must contain parsers for C values already, they can
> reliably find the contents of the string.

I think this makes an awful lot of sense.

Mark

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: printing wchar_t*
  2006-04-14 19:22                   ` Jim Blandy
@ 2006-04-14 22:18                     ` Daniel Jacobowitz
  2006-04-16 11:39                       ` Jim Blandy
  2006-04-15  7:14                     ` Eli Zaretskii
  1 sibling, 1 reply; 53+ messages in thread
From: Daniel Jacobowitz @ 2006-04-14 22:18 UTC (permalink / raw)
  To: Jim Blandy; +Cc: Eli Zaretskii, ghost, gdb

On Fri, Apr 14, 2006 at 12:16:36PM -0700, Jim Blandy wrote:
> The command line and MI already use the ISO C syntax for conveying
> values to the user/consumer.  I'm just saying we should expand our use
> of the syntax we already use.

I don't agree.

Saying "we use ISO C syntax for conveying data" is fairly inaccurate. 
We are inconsistent.  Some things are escaped in a C-like fashion. 
Other things are escaped in other fashions, with their own quoting
rules.  This is true in both directions, for user input and for output.

Let's consider strings in particular.  Strings are printed using
LA_PRINT_STRING.  As the name implies, the quoting done is adjusted
to match the source language convention.  Asking an FE to grok that
is just impractical.  In data intended for CLI users, we can
prettyprint things any way we want; in data intended for anything
more machine-like, I recommend we define a syntax and stick with it.

Personally, I'd just use UTF-8.  If you want GDB's output, expect it to
be UTF-8.  The MI layer is a "transport", and can add its own necessary
escaping (of quote marks, mostly).  Alternatively, make GDB output in
the current locale's character set.

So, if we print a wchar_t string as a string, and the user has conveyed
to us that their wchar_t strings are Unicode code points, then we can
convert that to the appropriate multibyte string on output using the
host character set.
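That conversion, for a UTF-8 host side, is a few lines per code point; a minimal sketch, assuming wchar_t values are Unicode code points no larger than U+10FFFF (the helper name is invented for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-8; returns bytes written
   (1 to 4) into out. */
static int utf8_encode(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {                      /* ASCII: 1 byte */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 2 bytes */
        out[0] = 0xC0 | (cp >> 6);
        out[1] = 0x80 | (cp & 0x3F);
        return 2;
    }
    if (cp < 0x10000) {                   /* 3 bytes */
        out[0] = 0xE0 | (cp >> 12);
        out[1] = 0x80 | ((cp >> 6) & 0x3F);
        out[2] = 0x80 | (cp & 0x3F);
        return 3;
    }
    out[0] = 0xF0 | (cp >> 18);           /* 4 bytes */
    out[1] = 0x80 | ((cp >> 12) & 0x3F);
    out[2] = 0x80 | ((cp >> 6) & 0x3F);
    out[3] = 0x80 | (cp & 0x3F);
    return 4;
}
```

For instance, the Gimel mentioned earlier (U+05D2) comes out as the two bytes 0xD7 0x92, which a UTF-8 xterm renders as the letter itself.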

Picked a host character set that can't represent some target characters?
The CLI should fall back to pretty escape sequences; I don't know what
the MI should do, but the answer is probably unimportant.

> My point is, MI consumers are already parsing ISO C strings.  They
> just need to parse more of them.

IMO, we need to make them parse less of them.

Everywhere the MI consumer needs to parse something which originated
as GDB CLI output, things go bad.  For instance, MI consumers may get
confused by the automatic limits on "set print elements", which
truncates strings.

After "set print elements 2":

(gdb) interpreter-exec mi "-var-create - * \"(char *)&__libc_version\""
^done,name="var1",numchild="1",type="char *"
(gdb) 
(gdb) interpreter-exec mi "-var-evaluate-expression var1"
^done,value="0x102a80 \"2.\"..."
(gdb) 

Not very nice of us, was that?

> There is no provision in ISO C for variable-size wchar_t encodings. 
> The portion of the standard I referred to says that wchar_t "...is an
> integer type whose range of values can represent distinct codes for
> all members of the largest extended character set specified among the
> supported locales".

(A) GDB supports languages other than C.

(B) While I am inclined to agree with you about the language of ISO C,
we don't get to ignore the reality of platforms with a 16-bit wchar_t
which store UTF-16 in it.

-- 
Daniel Jacobowitz
CodeSourcery

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: printing wchar_t*
  2006-04-14 19:22                   ` Jim Blandy
  2006-04-14 22:18                     ` Daniel Jacobowitz
@ 2006-04-15  7:14                     ` Eli Zaretskii
  2006-04-17  7:16                       ` Vladimir Prus
  1 sibling, 1 reply; 53+ messages in thread
From: Eli Zaretskii @ 2006-04-15  7:14 UTC (permalink / raw)
  To: Jim Blandy; +Cc: ghost, gdb

> Date: Fri, 14 Apr 2006 12:16:36 -0700
> From: "Jim Blandy" <jimb@red-bean.com>
> Cc: ghost@cs.msu.su, gdb@sources.redhat.com
> 
> >  (gdb) print *warray@8
> >   {0x0031, 0x0032, 0x0033, 0x0F04, 0x0FCC, 0x0078, 0x0079, 0x007A}
> >
> > Except for using up 60-odd characters where you used 21, this is IMHO
> > better, since it doesn't require any code on the FE side: just convert
> > the strings to integers, and you've got Unicode, ready to be used for
> > whatever purposes.
> 
> If you're printing an expression that evaluates to a string, sure. 
> But what if you're printing a value of type struct { wchar_t *key;
> wchar_t *value; }?  What if you're using -stack-list-arguments to show
> values in a stack frame?

Sorry, I don't see the difference.  Perhaps I'm too dense.  Are you
talking about the amount of ASCII characters, or something else?

> My point is, MI consumers are already parsing ISO C strings.  They
> just need to parse more of them.

This ``more parsing'' is not magic.  It's a lot of work, in general.

> > For the interactive user, understanding non-ASCII strings in the
> > suggested ASCII encoding might not be easy at all.  For example, for
> > all my knowledge of Hebrew, if someone shows me \x05D2, I will have
> > a hard time recognizing the letter Gimel.
> 
> If the host character set includes Gimel, then GDB won't print it with
> a hex escape.

The host character set has nothing to do, in general, with what
characters can be displayed.  The same host character set can be
displayed on an appropriately localized xterm, but not on a bare-bones
character terminal.  Not every system that runs in the Hebrew locale
has Hebrew-enabled xterm.  Some characters may be missing from a
particular font, especially a Unicode-based font (because there are so
many Unicode characters).  Etc., etc.

Even if I do have a Hebrew-enabled xterm, chances are that it cannot
display characters sent as 16-bit Unicode codepoints; it will want
some byte-oriented encoding, like UTF-8 or maybe ISO 8859-8.

GDB will generally know nothing about these complications, unless we
teach it.  For example, to display Hebrew letters on a UTF-8 enabled
xterm, we (i.e. the user, through appropriate GDB commands) will have
to tell GDB that wchar_t strings should be encoded in UTF-8 by the CLI
output routines.  Sometimes these settings can be gleaned from the
environment variables, but Emacs's experience shows how very
unreliable and error-prone this is.

> > As for the second sentence, ``reliably find the contents of the
> > string'' there obviously doesn't consider the complexities of handling
> > wide characters.  In my experience, for any non-trivial string
> > processing, working with variable-size encoding is much harder than
> > with fixed-size wchar_t arrays, because you need to interpret the
> > bytes as you go, even if all you need is to find the n-th character.
> > Even the simple task of computing the number of characters in the
> > string becomes complicated.
> 
> I don't understand what you mean.  The rules for parsing ISO C string
> literals into arrays of chars and wide string literals into arrays of
> wide characters are straightforward.

You seem to assume here that the target and the front-end's character
sets and their notion of wchar_t are identical.  Otherwise, what was a
valid array of wide characters on the target side will be gibberish on
the host side, and will certainly not display as anything legible.
Unlike GDB core, which just wants to pass the bytes from here to
there, the UI needs to be able to display the string, and for that it
needs to understand how it is encoded, how many glyphs it will produce
on the screen, where it can be broken into several lines if it is too
long, etc.  This is all trivial with 7-bit ASCII (every byte produces
a single glyph, except a few non-printables, whitespace characters
signal possible locations to break the line, etc.), but can get very
complex with other character sets.

GDB cannot be asked to know about all of those complications, but I
think it should at least provide a few simple translation services so
that a front end will not have to work too hard to handle and display
strings as mostly readable text.  Passing the characters as fixed-size
codepoints expressed as ASCII hex strings leaves the front-end with
only very simple job.  What's more, it uses an existing feature: array
printing.

> > What you are suggesting is simple for GDB, but IMHo leaves too much
> > complexity to the FE.  I think GDB could do better.  In particular, if
> > I'm sitting at a UTF-8 enabled xterm, I'd be grateful if GDB would
> > show me Unicode characters in their normal glyphs, which would require
> > GDB to output the characters in their UTF-8 encoding (which the
> > terminal will then display in human-readable form).  Your suggestion
> > doesn't allow such a feature, AFAICS, at least not for CLI users.
> 
> When the host character set contains a character, there's no need for
> GDB to use an escape to show it.

Whose host character set? GDB's?  But GDB is not displaying the
strings, the front end is.  And as I wrote above, there are no
guarantees that the host character set can be transparently displayed
on the screen.  This only works for ASCII and some simple single-byte
encodings, mostly Latin ones.  But it doesn't work in general.

And why are you talking about host character set?  The
L"123\x0f04\x0fccxyz" string came from the target, GDB simply
converted it to 7-bit ASCII.  These are characters from the target
character set.  And the target doesn't necessarily talk in the host
locale's character set and language; you could be debugging a program
which talks Farsi with a GDB that runs in a German locale.

> > If wchar_t uses fixed-size characters, not their variable-size
> > encodings, then specifying the CCS will do.
> 
> There is no provision in ISO C for variable-size wchar_t encodings. 
> The portion of the standard I referred to says that wchar_t "...is an
> integer type whose range of values can represent distinct codes for
> all members of the largest extended character set specified among the
> supported locales".

I agree, but Windows and who knows what else violates that.  Of
course, for the BMP, UTF-16 is indistinguishable from Unicode
codepoints, so in practice this might not matter too much.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: printing wchar_t*
  2006-04-14 22:18                     ` Daniel Jacobowitz
@ 2006-04-16 11:39                       ` Jim Blandy
  2006-04-16 15:07                         ` Eli Zaretskii
  0 siblings, 1 reply; 53+ messages in thread
From: Jim Blandy @ 2006-04-16 11:39 UTC (permalink / raw)
  To: Jim Blandy, Eli Zaretskii, ghost, gdb

As far as conveying strings accurately to GUI's via MI is concerned:

It's fine to improve the way MI conveys data to the front end.  It
seems to me we still need to do things like repetition elimination and
length limiting, but that syntax should certainly be designed to make
the front ends' life easier.

I'm not so sure about GDB doing character set conversion.  I think I'd
rather see GDB concentrate on accurately and safely conveying target
code points to the front end, and make the front end responsible for
displaying it.  If the front end hasn't asked GDB to "print" the value
in GDB's own way, then the front end has accepted responsibility for
presentation, it seems to me.


* Re: printing wchar_t*
  2006-04-16 11:39                       ` Jim Blandy
@ 2006-04-16 15:07                         ` Eli Zaretskii
  0 siblings, 0 replies; 53+ messages in thread
From: Eli Zaretskii @ 2006-04-16 15:07 UTC (permalink / raw)
  To: Jim Blandy; +Cc: ghost, gdb

> Date: Fri, 14 Apr 2006 15:18:50 -0700
> From: "Jim Blandy" <jimb@red-bean.com>
> 
> As far as conveying strings accurately to GUI's via MI is concerned:
> 
> It's fine to improve the way MI conveys data to the front end.  It
> seems to me we still need to do things like repetition elimination and
> length limiting, but that syntax should certainly be designed to make
> the front ends' life easier.

Do you agree that the array@@ feature suggested by Daniel is a step in
the right direction?

> I'm not so sure about GDB doing character set conversion.  I think I'd
> rather see GDB concentrate on accurately and safely conveying target
> code points to the front end, and make the front end responsible for
> displaying it.

I'd rather see GDB offering something in this area as well, but until
we have a volunteer for this job, this disagreement is academic.


* Re: printing wchar_t*
  2006-04-14 17:53             ` Eli Zaretskii
@ 2006-04-17  7:05               ` Vladimir Prus
  2006-04-17  8:35                 ` Eli Zaretskii
  0 siblings, 1 reply; 53+ messages in thread
From: Vladimir Prus @ 2006-04-17  7:05 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: pkoning, gdb

On Friday 14 April 2006 21:10, Eli Zaretskii wrote:

> > > If we want to support wchar_t arrays that store UTF-16, we will need
> > > to add a feature to GDB to convert UTF-16 to the full UCS-4
> > > codepoints, and output those.
> >
> > That's what I mentioned in a reply to Jim -- since the current string
> > printing code operated "one wchar_t at a time", it's not suitable for
> > outputing UTF-16 encoded wchar_t values to the user.
>
> I don't understand: if the wchar_t array holds a UTF-16 encoding, then
> when you receive the entire string, you have a UTF-16 encoding of what
> you want to display, and you yourself said that displaying a UTF-16
> encoded string is easy for you.  So where is the problem? is that only
> that you cannot know the length of the UTF-16 encoded string? or is
> there something else missing?

For my frontend -- there's no problem, I can handle UTF-16 myself. However, if
gdb is ever to produce UTF-8 output that is readable on the console, then it
should handle surrogate pairs itself. Taking the first and second elements of
a surrogate pair and converting each to UTF-8 individually won't work, for
obvious reasons.
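For illustration, a minimal sketch (hypothetical helper, not GDB code) of why the two halves of a surrogate pair must be combined into one codepoint before any UTF-8 conversion:

```python
def utf16_units_to_codepoints(units):
    """Combine UTF-16 code units (e.g. from a wchar_t array) into codepoints."""
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF and i + 1 < len(units) \
                and 0xDC00 <= units[i + 1] <= 0xDFFF:
            # A surrogate pair encodes a single codepoint above the BMP;
            # neither half is a valid character on its own.
            out.append(0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        else:
            out.append(u)
            i += 1
    return out

# U+1D11E (MUSICAL SYMBOL G CLEF) is stored as the pair 0xD834, 0xDD1E.
print([hex(c) for c in utf16_units_to_codepoints([0x74, 0xD834, 0xDD1E])])
# ['0x74', '0x1d11e']
```

Only the combined value 0x1D11E has a UTF-8 encoding; 0xD834 and 0xDD1E individually do not.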

- Volodya


* Re: printing wchar_t*
  2006-04-15  7:14                     ` Eli Zaretskii
@ 2006-04-17  7:16                       ` Vladimir Prus
  2006-04-17  8:58                         ` Eli Zaretskii
  0 siblings, 1 reply; 53+ messages in thread
From: Vladimir Prus @ 2006-04-17  7:16 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Jim Blandy, gdb

On Saturday 15 April 2006 01:37, Eli Zaretskii wrote:

> > My point is, MI consumers are already parsing ISO C strings.  They
> > just need to parse more of them.
>
> This ``more parsing'' is not magic.  It's a lot of work, in general.

I don't quite get it. Say that the frontend and gdb somehow agree on the 8-bit
encoding used by gdb to print the strings. Then the frontend can look at the
string and:
  
  - If it sees \x, look at the following hex digits and convert them to either
    a code point or a code unit
  - If it sees anything else, convert it from the local 8-bit encoding to
    Unicode

The only question here is whether \x encodes a code unit or a code point. If it
encodes a code unit, the frontend needs extra processing (for me, that's easy).
If it encodes a code point, then further changes in gdb are needed.

Note that because the charset function interface uses 'int', you can't use
UTF-8 for the encoding passed to the frontend, but using ASCII + \x is still
feasible.
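A sketch of the decoding loop such a frontend would need, assuming GDB emits 7-bit ASCII with fixed-width \xNNNN escapes (the helper name and the four-hex-digit width are assumptions for illustration, not GDB's actual output format):

```python
import re

# One token is either a \x escape of up to four hex digits or a literal char.
_TOKEN = re.compile(r'\\x([0-9a-fA-F]{1,4})|(.)', re.S)

def decode_gdb_string(s):
    """Return the list of integer code units/points encoded in the string."""
    units = []
    for m in _TOKEN.finditer(s):
        if m.group(1):
            units.append(int(m.group(1), 16))   # a \x escape
        else:
            units.append(ord(m.group(2)))       # a literal ASCII character
    return units

# Jim's example, L"123\x0f04\x0fccxyz":
print(decode_gdb_string(r'123\x0f04\x0fccxyz'))
# [49, 50, 51, 3844, 4044, 120, 121, 122]
```

Whether 3844 and 4044 are then treated as code points or as code units of some encoding is exactly the question discussed above; the parsing step itself is the same either way.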

There's one nice thing about this approach. If there's a new 'print array
until XX' syntax, I indeed need to special-case the processing of values in
several contexts -- most notably arguments in stack traces. With "\x" escapes
I'd need to write the code to handle them only once. In fact, I can add this
code right to the MI parser (which already operates using the Unicode-enabled
QString class). That will be more convenient than invoking 'print array' for
every wchar_t* I ever see.

> > > For the interactive user, understanding non-ASCII strings in the
> > > suggested ASCII encoding might not be easy at all.  For example, for
> > > all my knowledge of Hebrew, if someone shows me \x05D2, I will have
> > > hard time recognizing the letter Gimel.
> >
> > If the host character set includes Gimel, then GDB won't print it with
> > a hex escape.
>
> The host character set has nothing to do, in general, with what
> characters can be displayed.  The same host character set can be
> displayed on an appropriately localized xterm, but not on a bare-bones
> character terminal.  Not every system that runs in the Hebrew locale
> has Hebrew-enabled xterm.  Some characters may be missing from a
> particular font, especially a Unicode-based font (because there so
> many Unicode characters).  Etc., etc.
>
> Even if I do have a Hebrew enabled xterm, chances are that it cannot
> display characters sent in 16-bit Unicode codepoints, it will want
> some byte-oriented encoding, like UTF-8 or maybe ISO 8859-8.
>
> GDB will generally know nothing about these complications, unless we
> teach it.  For example, to display Hebrew letters on a UTF-8 enabled
> xterm, we (i.e. the user, through appropriate GDB commands) will have
> to tell GDB that wchar_t strings should be encoded in UTF-8 by the CLI
> output routines.  Sometimes these settings can be gleaned from the
> environment variables, but Emacs's experience shows how very
> unreliable and error-prone this is.

I don't quite get it. First you say you want \x05D2 to be displayed using a
Unicode font on the console, now you say it's very hard. If you want a Unicode
display for \x05D2, there should be some method of telling gdb that your
console can display Unicode, and once the user has said that Unicode is
supported, what are the problems?

> how many glyphs will it produce 
> on the screen, where it can be broken into several lines if it is too
> long, etc.  This is all trivial with 7-bit ASCII (every byte produces
> a single glyph, except a few non-printables, whitespace characters
> signal possible locations to break the line, etc.), but can get very
> complex with other character sets.

Isn't this completely outside of GDB? In fact, this is also outside of the
frontend -- the GUI toolkit will handle this transparently (and if it doesn't,
it's broken).

> GDB cannot be asked to know about all of those complications, but I
> think it should at least provide a few simple translation services so
> that a front end will not have to work too hard to handle and display
> strings as mostly readable text.  Passing the characters as fixed-size
> codepoints expressed as ASCII hex strings leaves the front-end with
> only very simple job.  What's more, it uses an existing feature: array
> printing.

Using \x escapes, provided they encode *code units*, leaves the frontend with
the same simple job. Really, using strings with \x escapes differs from array
printing in just one point: some characters are printed not as hex values,
but as characters in the local 8-bit encoding. Why do you think this is a
problem? I can't see what's wrong with that.

> > > What you are suggesting is simple for GDB, but IMHo leaves too much
> > > complexity to the FE.  I think GDB could do better.  In particular, if
> > > I'm sitting at a UTF-8 enabled xterm, I'd be grateful if GDB would
> > > show me Unicode characters in their normal glyphs, which would require
> > > GDB to output the characters in their UTF-8 encoding (which the
> > > terminal will then display in human-readable form).  Your suggestion
> > > doesn't allow such a feature, AFAICS, at least not for CLI users.
> >
> > When the host character set contains a character, there's no need for
> > GDB to use an escape to show it.
>
> Whose host character set? GDB's?  But GDB is not displaying the
> strings, the front end is.  And as I wrote above, there's no
> guarantees that the host character set can be transparently displayed
> on the screen.  This only works for ASCII and some simple single-byte
> encodings, mostly Latin ones.  But it doesn't work in general.
>
> And why are you talking about host character set?  The
> L"123\x0f04\x0fccxyz" string came from the target, GDB simply
> converted it to 7-bit ASCII.  These are characters from the target
> character set.  And the target doesn't necessarily talk in the host
> locale's character set and language, you could be debugging a program
> which talks Farsi with GDB that runs in a German locale.

So, characters that happen to exist in the German locale are printed as
literal chars. Other characters are printed using \x. The FE reads the string,
and when it sees a literal char, it converts it from the German locale to the
Unicode used internally. Where's the problem?

- Volodya


* Re: printing wchar_t*
  2006-04-17  7:05               ` Vladimir Prus
@ 2006-04-17  8:35                 ` Eli Zaretskii
  0 siblings, 0 replies; 53+ messages in thread
From: Eli Zaretskii @ 2006-04-17  8:35 UTC (permalink / raw)
  To: Vladimir Prus; +Cc: pkoning, gdb

> From: Vladimir Prus <ghost@cs.msu.su>
> Date: Mon, 17 Apr 2006 10:17:40 +0400
> Cc: pkoning@equallogic.com,
>  gdb@sources.redhat.com
> 
> On Friday 14 April 2006 21:10, Eli Zaretskii wrote:
> 
> > > > If we want to support wchar_t arrays that store UTF-16, we will need
> > > > to add a feature to GDB to convert UTF-16 to the full UCS-4
> > > > codepoints, and output those.
> > >
> > > That's what I mentioned in a reply to Jim -- since the current string
> > > printing code operated "one wchar_t at a time", it's not suitable for
> > > outputing UTF-16 encoded wchar_t values to the user.
> >
> > I don't understand: if the wchar_t array holds a UTF-16 encoding, then
> > when you receive the entire string, you have a UTF-16 encoding of what
> > you want to display, and you yourself said that displaying a UTF-16
> > encoded string is easy for you.  So where is the problem? is that only
> > that you cannot know the length of the UTF-16 encoded string? or is
> > there something else missing?
> 
> For my frontend -- there's no problem, I can handle UTF-16 myself. However, if
> gdb is to ever produce output in UTF-8

We were talking about wchar_t and wide character strings, which UTF-8
isn't.  Let's not confuse ourselves more than we already have.  Adding
support to GDB for converting arbitrarily encoded text into UTF-8 would
be a giant job.

> then it should handle surrogate pairs itself. Taking first and 
> second element of surrogate pair and converting both to UTF-8, individually, 
> won't work, for obvious reasons.

I don't think it's quite as ``obvious'' as you imply.  Handling
surrogates is generally a job for a display engine, so a UTF-8 enabled
terminal could very well do it itself.  I don't know if they actually
do that, though.  But anyway, this is a different issue.


* Re: printing wchar_t*
  2006-04-17  7:16                       ` Vladimir Prus
@ 2006-04-17  8:58                         ` Eli Zaretskii
  2006-04-17 10:35                           ` Vladimir Prus
  0 siblings, 1 reply; 53+ messages in thread
From: Eli Zaretskii @ 2006-04-17  8:58 UTC (permalink / raw)
  To: Vladimir Prus; +Cc: jimb, gdb

> From: Vladimir Prus <ghost@cs.msu.su>
> Date: Mon, 17 Apr 2006 10:36:47 +0400
> Cc: "Jim Blandy" <jimb@red-bean.com>,
>  gdb@sources.redhat.com
> 
> On Saturday 15 April 2006 01:37, Eli Zaretskii wrote:
> 
> > > My point is, MI consumers are already parsing ISO C strings.  They
> > > just need to parse more of them.
> >
> > This ``more parsing'' is not magic.  It's a lot of work, in general.
> 
> I don't quite get it. Say that frontend and gdb somehow agree on the 8-bit 
> encoding using by gdb to print the strings. Then frontend can look at the 
> string and:
>   
>   - If it sees \x, look at the following hex digits and convert it to either
>     code point or code unit
>   - If it sees anything else, convert it from local 8-bit to Unicode

That's what Jim was saying.  He thought (or so it seemed to me) that,
once the ASCII-encoded string was read by the front end and converted
back to the integer values, the job is done.  That is, in Jim's
example with L"123\x0f04\x0fccxyz", the character `1' is converted to
its code 49 decimal, \x0f04 is converted to the 16-bit code 3844
decimal, `x' is converted to 120 decimal, etc.

What I was saying is that this conversion is indeed easy, but it's not
even close to doing what the front end generally would like to do with
the string.  You want to _process_ the string, which means you want to
know its length in characters (not bytes), you want to know what
character set they encode, you want to be able to find the n-th
character in the string, etc.  The encoding suggested by Jim makes
these tasks very hard, much harder than if we send the string as an
array of fixed-length wide characters.
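To make the contrast concrete (a hypothetical sketch): with an array of fixed-length codepoints, the character count and the n-th character are immediate, while the escaped string must be parsed before any of that is possible:

```python
# The same wide string as an MI-style array of codepoints vs. as an
# ASCII string with embedded \x escapes.
as_array = [0x31, 0x32, 0x33, 0x0f04, 0x0fcc, 0x78, 0x79, 0x7a]
as_escaped = r'123\x0f04\x0fccxyz'

print(len(as_array))      # 8  -- length in characters, for free
print(hex(as_array[3]))   # 0xf04 -- n-th character, for free
print(len(as_escaped))    # 18 -- length in bytes, not characters
```

With the escaped form, even counting characters requires running the escape parser first.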

> Note that due to charset function interface using 'int', you can't use UTF-8 
> for encoding passed to frontend, but using ASCII + \x is still feasible.

I don't understand why UTF-8 cannot be used (an int can hold an 8-bit
byte just fine), nor can I see why this is an issue.  We are not
discussing addition of UTF-8 encoding to GDB, we are discussing how to
pass to a front end wide-character strings held within the debuggee.
Or at least that's what I thought you were trying to solve.

> There's one nice thing about this approach. If there's new 'print array until 
> XX" syntax, I indeed need to special-case processing of values in several 
> contexts -- most notably arguments in stack trace. With "\x" escapes I'd need 
> to write a code to handle them once. In fact, I can add this code right to MI 
> parser (which operates using Unicode-enabled QString class already). That 
> will be more convenient than invoking 'print array' for any wchar_t* I ever 
> see.

I don't think we should optimize GDB for one specific toolkit, even if
that toolkit is Qt.

> I don't quite get. First you say you want \x05D2 to display using Unicode font 
> on console, now you say it's very hard.

No, I said that a GUI front end will be able to display the _binary_
_code_ 0x05D2 with a suitable Unicode font.  Jim suggested that seeing
the _string_ "\x05D2" in GDB's output will allow me to read the text,
to which I replied that it will not be easy at all, since humans
generally don't remember Unicode codepoints by heart, even for their
native languages.

> Now, if you want Unicode display for 
> \x05D2, there should be some method to tell gdb that your console can display 
> Unicode, and if user told that Unicode is supported, what are the problems?

Please read my other messages: the program being debugged might talk
Hebrew in Unicode codepoints, but the locale where we are running GDB
might not support Hebrew on the console.  So, as long as we are
talking about console output (which is different from a GUI front
end), just sending Unicode to the display is not enough.

I suggest not to mix issues relevant for GUI front ends and text-mode
front ends, including the CLI ``front end'' built into GDB itself.
These are different issues, each one with its own set of complexities.

Jim's L"123\x0f04\x0fccxyz" proposal was (I think) more oriented to
text terminals and the CLI, so the discussion wandered off in that
direction.  I don't think your original problem is related to that.

> > how many glyphs will it produce 
> > on the screen, where it can be broken into several lines if it is too
> > long, etc.  This is all trivial with 7-bit ASCII (every byte produces
> > a single glyph, except a few non-printables, whitespace characters
> > signal possible locations to break the line, etc.), but can get very
> > complex with other character sets.
> 
> Isn't this completely outside of GDB?

No, not completely: the ui_output routines do this for the console
output.  Again, this part was about text-mode output, and the CLI in
particular.

> > GDB cannot be asked to know about all of those complications, but I
> > think it should at least provide a few simple translation services so
> > that a front end will not have to work too hard to handle and display
> > strings as mostly readable text.  Passing the characters as fixed-size
> > codepoints expressed as ASCII hex strings leaves the front-end with
> > only very simple job.  What's more, it uses an existing feature: array
> > printing.
> 
> Using \x escapes, provided they encode *code units*, leaves frontend with the 
> same simple job.

Yes, but GDB will need to generate the code units first, e.g. convert
fixed-size Unicode wide characters into UTF-8.  That's an extra job for
GDB.  (Again, we were originally talking about wchar_t, not multibyte
strings.)

> Really, using strings with \x escapes differs from array 
> printing in just one point: some characters are printed not as hex values, 
> but as characters in local 8-bit encoding. Why do you think this is a 
> problem?

Because knowing what is the ``local 8-bit encoding'' is in itself a
huge problem.  Emacs has been trying to solve it since 1996, and it still
hasn't got all the details right in some marginal cases, although we
have people on the Emacs development team who understand more about
i18n than I ever will.  In short, there's no reliable method of
finding out what is the correct 8-bit encoding in which to talk to any
given text-mode display.

And you certainly do NOT want any local 8-bit encodings when you are
going to display the string on a GUI, because that would require that
the front end do the extra work of converting the encoded text back
to what it needs to communicate with the text widgets.

> > And why are you talking about host character set?  The
> > L"123\x0f04\x0fccxyz" string came from the target, GDB simply
> > converted it to 7-bit ASCII.  These are characters from the target
> > character set.  And the target doesn't necessarily talk in the host
> > locale's character set and language, you could be debugging a program
> > which talks Farsi with GDB that runs in a German locale.
> 
> So, characters that happen to exist in German locale are printed as literal 
> chars. Other characters are printed using \x. FE reads the string, and when 
> it sees literal char, it converts it from German locale to Unicode used 
> internally. Where's the problem?

If this conversion is lossless, it's redundant.  It is easier to just
send everything as hex escapes, since no human will see them, only the
FE.  This saves the needless conversion (and potential problems with
incorrect notion of the current locale and encoding).

But some conversions to ``literal characters'' (i.e. to 8-bit binary
codes) are lossy, because the underlying converter needs state
information to correctly interpret the byte stream.  This state
information is thrown away once the conversion is done, and so the
opposite conversion fails to reconstruct the original codepoints.
This is usually the case with ISO-2022 encodings.
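As a concrete illustration of such stateful encodings (a sketch using Python's bundled codecs, unrelated to GDB's own code): ISO-2022-JP switches character sets with escape sequences, so the same bytes mean different things depending on the state established earlier in the stream.

```python
# 'A', HIRAGANA LETTER A (U+3042), 'B'
s = 'A\u3042B'
encoded = s.encode('iso2022_jp')
print(encoded)
# b'A\x1b$B$"\x1b(BB' -- ESC $ B enters JIS X 0208, ESC ( B returns to ASCII.

# The two bytes 0x24 0x22 ('$"') decode to U+3042 only *inside* the
# ESC $ B state; stripped of that state they are just plain ASCII.
print(b'$"'.decode('ascii'))   # $"
```

Once the shift state is discarded, the byte pair alone no longer identifies the original codepoint, which is exactly the lossiness described above.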

So I think on balance it's better to send the original wide characters
as hex, the only downside being that it uses more bytes per character.
(Again, this is about GUI front ends, not about GDB's own CLI output
routines.)


* Re: printing wchar_t*
  2006-04-17  8:58                         ` Eli Zaretskii
@ 2006-04-17 10:35                           ` Vladimir Prus
  2006-04-17 12:26                             ` Eli Zaretskii
  0 siblings, 1 reply; 53+ messages in thread
From: Vladimir Prus @ 2006-04-17 10:35 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: jimb, gdb

On Monday 17 April 2006 12:35, Eli Zaretskii wrote:

> >   - If it sees \x, look at the following hex digits and convert it to
> > either code point or code unit
> >   - If it sees anything else, convert it from local 8-bit to Unicode
>
> That's what Jim was saying.  He thought (or so it seemed to me) that,
> once the ASCII-encoded string was read by the front end and converted
> back to the integer values, the job is done.  That is, in Jim's
> example with L"123\x0f04\x0fccxyz", the character `1' is converted to
> its code 49 decimal, \x0f04 is converted to the 16-bit code 3844
> decimal, `x' is converted to 120 decimal, etc.
>
> What I was saying that indeed this conversion is easy, but it's not
> even close to doing what the front end generally would like to do with
> the string.  You want to _process_ the string, which means you want to
> know its length in characters (not bytes), you want to know what
> character set they encode, you want to be able to find the n-th
> character in the string, etc.  The encoding suggested by Jim makes
> these tasks very hard, much harder than if we send the string as an
> array of fixed-length wide characters.

That's a *completely* different topic. First, the frontend needs to get the
data, in whatever form. Using \x escapes is just as suitable as using a list
of hex values -- the two approaches are isomorphic. Second, the frontend needs
to display the data, but it will operate using its own data structures, and
it does not matter whether \x escapes were used or not. No frontend will ever
work directly on a string containing embedded "\x" escapes.

> > Note that due to charset function interface using 'int', you can't use
> > UTF-8 for encoding passed to frontend, but using ASCII + \x is still
> > feasible.
>
> I don't understand why UTF-8 cannot be used (an int can hold an 8-bit
> byte just fine), 

An int can't hold 6 bytes, at least on common machines. And the interface in
charset.h requires that the result of converting one host character to one
target character fit into an int. Anyway, I don't think charset.h was designed
with Unicode in mind, so we should probably stop discussing it.

> > There's one nice thing about this approach. If there's new 'print array
> > until XX" syntax, I indeed need to special-case processing of values in
> > several contexts -- most notably arguments in stack trace. With "\x"
> > escapes I'd need to write a code to handle them once. In fact, I can add
> > this code right to MI parser (which operates using Unicode-enabled
> > QString class already). That will be more convenient than invoking 'print
> > array' for any wchar_t* I ever see.
>
> I don't think we should optimize GDB for one specific toolkit, even if
> that toolkit is Qt.

Replace QString with Gtkmm::ustring and the same argument holds. Whenever a
string type is used inside the frontend to represent Unicode strings, you can
perform the conversion from \x escapes to that string class in one place,
instead of doing it separately inside the variable display widget, the stack
display widget, and so on.

> > I don't quite get. First you say you want \x05D2 to display using Unicode
> > font on console, now you say it's very hard.
>
> No, I said that a GUI front end will be able to display the _binary_
> _code_ 0x05D2 with a suitable Unicode font.  Jim suggested that seeing
> the _string_ "\x05D2" in GDB's output will allow me to read the text,
> to which I replied that it will not be easy at all, since humans
> generally don't remember Unicode codepoints by heart, even for their
> native languages.

Ok, seeing the string "\x05D2" will be sufficient for the frontend.

> > > GDB cannot be asked to know about all of those complications, but I
> > > think it should at least provide a few simple translation services so
> > > that a front end will not have to work too hard to handle and display
> > > strings as mostly readable text.  Passing the characters as fixed-size
> > > codepoints expressed as ASCII hex strings leaves the front-end with
> > > only very simple job.  What's more, it uses an existing feature: array
> > > printing.
> >
> > Using \x escapes, provided they encode *code units*, leaves frontend with
> > the same simple job.
>
> Yes, but GDB will need to generate the code units first, e.g. convert
> fixed-size Unicode wide characters into UTF-8.  

Sorry, where does that UTF-8 come from? If you generate ASCII + \x escapes,
you don't need UTF-8.

> That's extra job for 
> GDB.  (Again, we were originally talking about wchar_t, not multibyte
> strings.)

I don't understand what this extra job is. This is as simple as:

   for c in wchar_t* literal:
       if c is representable in host encoding:
            output_literal
       else
            output_hex_escape
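A runnable version of that loop, using Python only as executable pseudocode (the codec-based representability check is a stand-in for whatever test GDB's charset layer would actually perform):

```python
def escape_wide_string(codepoints, host_codec='ascii'):
    """Render a wide string as host-charset literals plus \\x escapes."""
    out = []
    for c in codepoints:
        try:
            ch = chr(c)
            ch.encode(host_codec)          # representable in host encoding?
            out.append(ch)                 # yes: output the literal
        except (UnicodeEncodeError, ValueError):
            out.append('\\x%04x' % c)      # no: output a hex escape
    return ''.join(out)

print(escape_wide_string([0x31, 0x32, 0x33, 0x0f04, 0x0fcc, 0x78, 0x79, 0x7a]))
# 123\x0f04\x0fccxyz
```

As Eli notes below, the hard part hides inside the representability check and the literal conversion, which a real codec table has to supply for each host encoding.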

> > Really, using strings with \x escapes differs from array
> > printing in just one point: some characters are printed not as hex
> > values, but as characters in local 8-bit encoding. Why do you think this
> > is a problem?
>
> Because knowing what is the ``local 8-bit encoding'' is in itself a
> huge problem.  Emacs is trying to solve it since 1996, and it still
> haven't got all the details right in some marginal cases, although we
> have people on the Emacs development team who understand more about
> i18n than I ever will.  In short, there's no reliable method of
> finding out what is the correct 8-bit encoding in which to talk to any
> given text-mode display.

I trust you on that, but nothing prevents the user/frontend from explicitly
specifying the encoding.

> And you certainly do NOT want any local 8-bit encodings when you are
> going to display the string on a GUI, because that would require that
> the front end does some extra job of converting the encoded text back
> to what it needs to communicate with the text widgets.

I would expect that any GUI toolkit that pretends to support Unicode *has* to
support conversion from local 8-bit encodings. Otherwise, such a toolkit is of
no use in the real world.

By the way, unless your target encoding is ASCII, the frontend has to be aware
of the local 8-bit encoding anyway. If I wrote a program using KOI8-R and the
frontend shows its char* (not wchar_t*) strings as ASCII, the frontend is
already broken.


> > > And why are you talking about host character set?  The
> > > L"123\x0f04\x0fccxyz" string came from the target, GDB simply
> > > converted it to 7-bit ASCII.  These are characters from the target
> > > character set.  And the target doesn't necessarily talk in the host
> > > locale's character set and language, you could be debugging a program
> > > which talks Farsi with GDB that runs in a German locale.
> >
> > So, characters that happen to exist in German locale are printed as
> > literal chars. Other characters are printed using \x. FE reads the
> > string, and when it sees literal char, it converts it from German locale
> > to Unicode used internally. Where's the problem?
>
> If this conversion is lossless, it's redundant.  It is easier to just
> send everything as hex escapes, since no human will see them, only the
> FE.  This saves the needless conversion (and potential problems with
> incorrect notion of the current locale and encoding).

Well, using a string with just hex escapes is fine for the frontend. It might
not be as fine for the user.

> But some conversions to ``literal characters'' (i.e. to 8-bit binary
> codes) are lossy, because the underlying converter needs state
> information to correctly interpret the byte stream.  This state
> information is thrown away once the conversion is done, and so the
> opposite conversion fails to reconstruct the original codepoints.
> This is usually the case with ISO-2022 encodings.
>
> So I think on balance it's better to send the original wide characters
> as hex, the only downside being that it uses more bytes per character.
> (Again, this is about GUI front ends, not about GDB's own CLI output
> routines.)

Well, I'd prefer to address one problem at a time:

1. Gdb should be modified to print wchar_t* literals. It should use the same
logic as for char* to decide if a value is representable in the host charset,
and use \x escapes otherwise.

2. If you believe that using literals is not suitable for MI, that can be a 
separate change.

- Volodya


* Re: printing wchar_t*
  2006-04-17 10:35                           ` Vladimir Prus
@ 2006-04-17 12:26                             ` Eli Zaretskii
  2006-04-17 13:56                               ` Vladimir Prus
  0 siblings, 1 reply; 53+ messages in thread
From: Eli Zaretskii @ 2006-04-17 12:26 UTC (permalink / raw)
  To: Vladimir Prus; +Cc: jimb, gdb

> From: Vladimir Prus <ghost@cs.msu.su>
> Date: Mon, 17 Apr 2006 13:01:58 +0400
> Cc: jimb@red-bean.com,
>  gdb@sources.redhat.com
> 
> > What I was saying that indeed this conversion is easy, but it's not
> > even close to doing what the front end generally would like to do with
> > the string.  You want to _process_ the string, which means you want to
> > know its length in characters (not bytes), you want to know what
> > character set they encode, you want to be able to find the n-th
> > character in the string, etc.  The encoding suggested by Jim makes
> > these tasks very hard, much harder than if we send the string as an
> > array of fixed-length wide characters.
> 
> That's a *completely* different topic.

Yes, it is.  But we must keep it in mind, because the front ends want
strings in order to do something with them.

> Second, frontend needs to display the data, however it will operate
> using its own data structures, and it does not matter if \x escapes
> were used or not. No frontend will ever work on a string containing
> embedded "\x" escapes.

I was saying that the ASCII encoding suggested by Jim makes it harder
to convert the text into wide characters, that's all.

> > > Using \x escapes, provided they encode *code units*, leaves frontend with
> > > the same simple job.
> >
> > Yes, but GDB will need to generate the code units first, e.g. convert
> > fixed-size Unicode wide characters into UTF-8.  
> 
> Sorry, where does that UTF-8 come from?

UTF-8 was an example, the general point being that code units are
present only in encodings, not in fixed-length wide characters.

> > That's extra job for 
> > GDB.  (Again, we were originally talking about wchar_t, not multibyte
> > strings.)
> 
> I don't understand what this extra job is. This is as simple as:
> 
>    for c in wchar_t* literal:
>        if c is representable in host encoding:
>             output_literal
>        else
>             output_hex_escape

That might sound simple to you, but it isn't, in general.  The
``representable in host encoding'' part is very non-trivial; for
example, how do you tell whether the Unicode codepoints 0x05C3 and
0x05C4 can be represented in the Windows codepage 1255 (the former
can, the latter cannot)?  This is generally impossible without using
very complicated algorithms and/or large databases.

The other complex part is ``output_literal'': again, there's no simple
algorithm to map Unicode's 0x05C3 into cp1255's 0xD3.  You need tables
again, and you need separate tables for each possible encoding (Hebrew
has at least 3 widely used ones, Russian has at least 5, etc.).
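To make the example above concrete: an iconv-style conversion library carries exactly these tables. A sketch using Python's bundled cp1255 codec (standing in for libiconv's tables):

```python
# U+05C3 (HEBREW PUNCTUATION SOF PASUQ) has a slot in codepage 1255...
assert "\u05c3".encode("cp1255") == b"\xd3"

# ...but U+05C4 (HEBREW MARK UPPER DOT) does not, so a converter must
# detect the failure and fall back to an escape of some kind.
try:
    "\u05c4".encode("cp1255")
except UnicodeEncodeError:
    print("U+05C4 is not representable in cp1255")
```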

> > > Really, using strings with \x escapes differs from array
> > > printing in just one point: some characters are printed not as hex
> > > values, but as characters in local 8-bit encoding. Why do you think this
> > > is a problem?
> >
> > Because knowing what is the ``local 8-bit encoding'' is in itself a
> > huge problem.
> [...]
> I trust you on that, but nothing prevents the user/frontend from explicitly
> specifying the encoding.

What makes you think the user and/or front end will know what to
specify?  Experience shows they generally don't.

> > And you certainly do NOT want any local 8-bit encodings when you are
> > going to display the string on a GUI, because that would require that
> > the front end does some extra job of converting the encoded text back
> > to what it needs to communicate with the text widgets.
> 
> I would expect that any GUI toolkit that pretends to support Unicode *has* to
> support conversion from local 8-bit encodings. Otherwise, such a toolkit is
> of no use in the real world.

Then most of them are ``of no use''.  You can rely on most of the
modern GUI toolkits to support conversion from UTF-8 to Unicode, but
that's about it.  For anything more complex, your best bet is to link
against libiconv or similar.
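For what it's worth, the one conversion described above as reliable, UTF-8 to fixed-width code points, is a single decode step (sketched here in Python; a C front end would use libiconv or mbrtowc for the same thing):

```python
raw = b"test\xe1\x88\xb4"                # "test" followed by U+1234 in UTF-8
codepoints = [ord(c) for c in raw.decode("utf-8")]
print([hex(cp) for cp in codepoints])    # ['0x74', '0x65', '0x73', '0x74', '0x1234']
```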

> By the way, unless your target encoding is ASCII, the frontend has to be
> aware of the local 8-bit encoding anyway. If I wrote a program using KOI8-R
> and the frontend showed the char* (not wchar_t*) strings as ASCII, that
> frontend would be broken already.

This only works as long as you use the encoding that matches your
default fonts.  Once it doesn't match, or the encoded characters come
from a program written for different locale conventions, you are out
of luck.

It is important to realize that programs don't know anything about
characters, all they see is integer code values.  To display those
codes in human-readable form, a program needs to know what display API
to call and which font to request.  This kind of information is absent
from simple text files that hold encoded non-ASCII text, so programs
generally need additional info to DTRT.  The same holds for arbitrary
strings GDB spills on you from some address in the debuggee.

> 1. GDB should be modified to print wchar_t* values as literals.

``Print'' is ambiguous in this context.  I believe you mean ``send to
the front end'', since this was your original problem.  If the front
end is charged with displaying the wchar_t strings, GDB does not need
to print anything by itself.  Am I right?

> It should use the same 
> logic as for char* to decide whether the value is representable in the host charset, 

I hope I explained above why this part is highly non-trivial.  That is
why I think GDB should use hex notation for all characters, and leave
it for the FE to deal with their display.


* Re: printing wchar_t*
  2006-04-17 12:26                             ` Eli Zaretskii
@ 2006-04-17 13:56                               ` Vladimir Prus
  2006-04-18  5:31                                 ` Eli Zaretskii
  0 siblings, 1 reply; 53+ messages in thread
From: Vladimir Prus @ 2006-04-17 13:56 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: jimb, gdb

On Monday 17 April 2006 15:21, Eli Zaretskii wrote:

> > > What I was saying that indeed this conversion is easy, but it's not
> > > even close to doing what the front end generally would like to do with
> > > the string.  You want to _process_ the string, which means you want to
> > > know its length in characters (not bytes), you want to know what
> > > character set they encode, you want to be able to find the n-th
> > > character in the string, etc.  The encoding suggested by Jim makes
> > > these tasks very hard, much harder than if we send the string as an
> > > array of fixed-length wide characters.
> >
> > That's a *completely* different topic.
>
> Yes, it is.  But we must keep it in mind because the front ends want
> strings to do something with them.

Eli, I think we're running in circles. I'd like to reiterate what I ideally 
want from gdb:

  1. For any wchar_t* value, be it the value of a variable, a function
     parameter three levels up the stack, or a member of a structure, I want
     gdb to print that value in a specific format that's easy for the
     frontend to use. A string with escapes is fine.
  2. I want that formatting to take effect both for MI commands and for the
     'print' command, since the user can issue the 'print' command manually.
  3. I don't mind having this behaviour only when --interpreter=mi is
     specified.

I think the two questions we did not agree on are:

  1. When talking to the FE, should literals be used at all, or should the
     string consist of just \x escapes?
  2. When talking to the user, should we use string literals, or just \x
     escapes?

I hope you'll agree that using \x escapes when talking to the user is not 
acceptable. And since gdb right now assumes an ASCII charset for output, I 
don't think there will be any problems if ASCII characters are output as-is, 
without escaping.

> > Second, the frontend needs to display the data; however, it will operate
> > using its own data structures, and it does not matter whether \x escapes
> > were used or not. No frontend will ever work on a string containing
> > embedded "\x" escapes.
>
> I was saying that the ASCII encoding suggested by Jim makes it harder
> to convert the text into wide characters, that's all.

I don't see why that is so, but never mind.
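For the record, the conversion in question, turning a GDB string containing \x escapes back into wide characters on the front-end side, might look like this (a hypothetical Python sketch; note that C-style \x escapes greedily consume hex digits, which is one source of the ambiguity being debated here):

```python
import re

def parse_escaped(s):
    """Hypothetical front-end side: turn a GDB-style string with \\x
    escapes back into a list of wide-character code points."""
    codepoints = []
    # Either a \x escape with greedy hex digits, or any single character.
    token = re.compile(r"\\x([0-9a-fA-F]+)|(.)", re.S)
    for m in token.finditer(s):
        if m.group(1):
            codepoints.append(int(m.group(1), 16))   # hex escape
        else:
            codepoints.append(ord(m.group(2)))       # literal character
    return codepoints

print(parse_escaped("test\\x1234"))  # -> [116, 101, 115, 116, 4660]
```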

> > > That's extra job for
> > > GDB.  (Again, we were originally talking about wchar_t, not multibyte
> > > strings.)
> >
> > I don't understand what this extra job is. This is as simple as:
> >
> >    for c in wchar_t* literal:
> >        if c is representable in host encoding:
> >             output_literal
> >        else
> >             output_hex_escape
>
> That might sound simple to you, but it isn't, in general.  The
> ``representable in host encoding'' part is very non-trivial; for
> example, how do you tell whether the Unicode codepoints 0x05C3 and
> 0x05C4 can be represented in the Windows codepage 1255 (the former
> can, the latter cannot)?  This is generally impossible without using
> very complicated algorithms and/or large databases.
>
> The other complex part is ``output_literal'': again, there's no simple
> algorithm to map Unicode's 0x05C3 into cp1255's 0xD3.  You need tables
> again, and you need separate tables for each possible encoding (Hebrew
> has at least 3 widely used ones, Russian has at least 5, etc.).

iconv has those tables. You see problems where there are none.

> > > > Really, using strings with \x escapes differs from array
> > > > printing in just one point: some characters are printed not as hex
> > > > values, but as characters in local 8-bit encoding. Why do you think
> > > > this is a problem?
> > >
> > > Because knowing what is the ``local 8-bit encoding'' is in itself a
> > > huge problem.
> >
> > [...]
> > I trust you on that, but nothing prevents the user/frontend from
> > explicitly specifying the encoding.
>
> What makes you think the user and/or front end will know what to
> specify?  Experience shows they generally don't.

First you say it's not possible to detect the encoding from the environment. 
Then you say you can't trust the user/frontend. Together, that sounds like 
the problem of making gdb print char* literals reliably is unsolvable. Is 
that what you're trying to say? 

> > 1. GDB should be modified to print wchar_t* values as literals.
>
> ``Print'' is ambiguous in this context.  I believe you mean ``send to
> the front end'', since this was your original problem.  If the front
> end is charged with displaying the wchar_t strings, GDB does not need
> to print anything by itself.  Am I right?
>
> > It should use the same
> > logic as for char* to decide whether the value is representable in the
> > host charset,
>
> I hope I explained above why this part is highly non-trivial.  

Using the existing logic is in fact absolutely trivial -- that logic already 
*exists*; you don't need to do anything. 

> That is 
> why I think GDB should use hex notation for all characters, and leave
> it for the FE to deal with their display.

I disagree, for the simple reason that for char* values the existing logic 
did not cause any problems. Also, while I can take a stab at wchar_t* output, 
I would not be comfortable with special-casing wchar_t* output to the 
frontend.

- Volodya


* Re: printing wchar_t*
  2006-04-17 13:56                               ` Vladimir Prus
@ 2006-04-18  5:31                                 ` Eli Zaretskii
  0 siblings, 0 replies; 53+ messages in thread
From: Eli Zaretskii @ 2006-04-18  5:31 UTC (permalink / raw)
  To: Vladimir Prus; +Cc: jimb, gdb

> From: Vladimir Prus <ghost@cs.msu.su>
> Date: Mon, 17 Apr 2006 16:16:26 +0400
> Cc: jimb@red-bean.com,
>  gdb@sources.redhat.com
> 
> Eli, I think we're running in circles.

Fine, then I'll just stop responding.  This is my last (and hopefully
short) contribution to this thread.

>   1. For any wchar_t* value, be it the value of a variable, a function
>      parameter three levels up the stack, or a member of a structure, I want
>      gdb to print that value in a specific format that's easy for the
>      frontend to use. A string with escapes is fine.

A noble goal.  If you (or someone else) submits patches, I'll be happy
to review them.

>   2. I want that formatting to take effect both for MI commands and for the
>      'print' command, since the user can issue the 'print' command manually.

I think CLI and MI are two different cases, and thus simple solutions
that are appropriate for MI (because it doesn't display) will not be
good enough for CLI.

>   3. I don't mind having this behaviour only when --interpreter=mi is
>      specified.

I don't think `print' should behave differently depending on the
interpreter, but whatever.

> First you say it's not possible to detect the encoding from the environment. 
> Then you say you can't trust the user/frontend. Together, that sounds like 
> the problem of making gdb print char* literals reliably is unsolvable. Is 
> that what you're trying to say? 

I'm trying to say that it would be absurd to add all that complexity
to GDB.


end of thread, other threads:[~2006-04-17 13:56 UTC | newest]

Thread overview: 53+ messages (download: mbox.gz / follow: Atom feed)
2006-04-13 17:07 printing wchar_t* Vladimir Prus
2006-04-13 17:25 ` Eli Zaretskii
2006-04-14  7:29   ` Vladimir Prus
2006-04-14  8:47     ` Eli Zaretskii
2006-04-14 12:47       ` Vladimir Prus
2006-04-14 13:05         ` Eli Zaretskii
2006-04-14 13:06           ` Vladimir Prus
2006-04-14 13:15             ` Robert Dewar
2006-04-14 13:17           ` Daniel Jacobowitz
2006-04-14 13:59             ` Robert Dewar
2006-04-14 14:37             ` Eli Zaretskii
2006-04-14 14:08       ` Paul Koning
2006-04-14 14:47         ` Eli Zaretskii
2006-04-14 15:00           ` Vladimir Prus
2006-04-14 17:53             ` Eli Zaretskii
2006-04-17  7:05               ` Vladimir Prus
2006-04-17  8:35                 ` Eli Zaretskii
2006-04-13 18:06 ` Jim Blandy
2006-04-13 21:18   ` Eli Zaretskii
2006-04-14  6:02     ` Jim Blandy
2006-04-14  8:43       ` Eli Zaretskii
2006-04-14  7:58   ` Vladimir Prus
2006-04-14  8:07     ` Jim Blandy
2006-04-14  8:30       ` Vladimir Prus
2006-04-14  8:57     ` Eli Zaretskii
2006-04-14 12:52       ` Vladimir Prus
2006-04-14 13:07         ` Daniel Jacobowitz
2006-04-14 14:23           ` Eli Zaretskii
2006-04-14 14:29             ` Daniel Jacobowitz
2006-04-14 14:53               ` Eli Zaretskii
2006-04-14 17:10                 ` Daniel Jacobowitz
2006-04-14 17:55               ` Jim Blandy
2006-04-14 18:27                 ` Eli Zaretskii
2006-04-14 18:30                   ` Jim Blandy
2006-04-14 19:19                     ` Eli Zaretskii
2006-04-14 19:48                       ` Jim Blandy
2006-04-14 14:16         ` Eli Zaretskii
2006-04-14 14:50           ` Vladimir Prus
2006-04-14 17:18             ` Eli Zaretskii
2006-04-14 18:03               ` Jim Blandy
2006-04-14 19:16                 ` Eli Zaretskii
2006-04-14 19:22                   ` Jim Blandy
2006-04-14 22:18                     ` Daniel Jacobowitz
2006-04-16 11:39                       ` Jim Blandy
2006-04-16 15:07                         ` Eli Zaretskii
2006-04-15  7:14                     ` Eli Zaretskii
2006-04-17  7:16                       ` Vladimir Prus
2006-04-17  8:58                         ` Eli Zaretskii
2006-04-17 10:35                           ` Vladimir Prus
2006-04-17 12:26                             ` Eli Zaretskii
2006-04-17 13:56                               ` Vladimir Prus
2006-04-18  5:31                                 ` Eli Zaretskii
2006-04-14 19:53                 ` Mark Kettenis
