public inbox for gdb@sourceware.org
* Using UTF-8 as host charset
@ 2012-03-03 16:34 Mathias Kunter
  2012-03-05 16:13 ` Paul_Koning
  2012-03-05 16:41 ` Tom Tromey
  0 siblings, 2 replies; 12+ messages in thread
From: Mathias Kunter @ 2012-03-03 16:34 UTC (permalink / raw)
  To: gdb

Dear members of the gdb mailing list,

I'm working on a patch for Eclipse which adds full charset support to 
the CDT debugger. We're setting gdb's host-charset to UTF-8 to achieve 
this. This was already discussed back in 2010 here on 
the gdb mailing list. Tom Tromey said back then - quoted from 
http://sourceware.org/ml/gdb/2010-08/msg00129.html

 > It is an oddity that currently an MI consumer must check gdb's
 > host charset in order to know how to decode its output.  I would
 > recommend that the client force it to be UTF-8, but I think this
 > currently may not work with PHONY_ICONV.

So the question is: is it actually a good idea to simply always set 
gdb's host charset to UTF-8? Which hosts use the phony iconv, and is 
UTF-8 as the host charset indeed a problem for them?

Note that we're only talking about gdb 7.0 or later. We don't plan to 
support this feature for gdb < 7.0 within CDT.

Thanks for any hints!
Mathias


PS: Just for reference: the corresponding Eclipse bug report can be 
found at https://bugs.eclipse.org/bugs/show_bug.cgi?id=307311


* RE: Using UTF-8 as host charset
  2012-03-03 16:34 Using UTF-8 as host charset Mathias Kunter
@ 2012-03-05 16:13 ` Paul_Koning
  2012-03-05 18:09   ` Tom Tromey
  2012-03-05 16:41 ` Tom Tromey
  1 sibling, 1 reply; 12+ messages in thread
From: Paul_Koning @ 2012-03-05 16:13 UTC (permalink / raw)
  To: mathiaskunter, gdb

While NetBSD doesn't use the phony iconv, some other questions have come up about this in the past.  NetBSD (and possibly other systems) has an iconv implementation that doesn't provide the "wchar_t" encoding that GDB assumes every iconv will have.  I remember trying to do something about this and running into concerns that wchar_t, formally speaking, is not the same as UCS-2, even though for practical purposes the two are interchangeable.
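
The formal difference between wchar_t and UCS-2 can be illustrated with a small Python sketch. It only inspects the platform's wchar_t width through ctypes and encodes one code point outside the Basic Multilingual Plane; it says nothing about NetBSD's iconv itself, and the sample character is chosen purely for illustration.

```python
import ctypes

# Width of the platform's C wchar_t, as seen through ctypes.
wchar_bytes = ctypes.sizeof(ctypes.c_wchar)

# A code point outside the Basic Multilingual Plane (illustrative choice).
astral = "\U0001F600"

# UCS-2 is a fixed 2-byte encoding, so it cannot represent this code point;
# UTF-16 needs a surrogate pair, i.e. 4 bytes.
utf16_len = len(astral.encode("utf-16-le"))

print(wchar_bytes)  # platform dependent (commonly 4 with glibc, 2 on Windows)
print(utf16_len)
```

Where wchar_t is wider than 2 bytes, or where text contains such code points, treating "wchar_t" as a synonym for UCS-2 is only an approximation.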

	paul


* Re: Using UTF-8 as host charset
  2012-03-03 16:34 Using UTF-8 as host charset Mathias Kunter
  2012-03-05 16:13 ` Paul_Koning
@ 2012-03-05 16:41 ` Tom Tromey
  2012-03-05 16:44   ` Pedro Alves
  2012-03-05 20:51   ` Mathias Kunter
  1 sibling, 2 replies; 12+ messages in thread
From: Tom Tromey @ 2012-03-05 16:41 UTC (permalink / raw)
  To: Mathias Kunter; +Cc: gdb

>>>>> "Mathias" == Mathias Kunter <mathiaskunter@gmail.com> writes:

Mathias> Dear members of the gdb mailing list,
Mathias> I'm working on a patch for Eclipse which adds full charset support to
Mathias> the CDT debugger. We're setting gdb's host-charset to UTF-8 to achieve
Mathias> this. There already had been discussion about this back in 2010 here
Mathias> on the gdb mailing list. Tom Tromey said back then - quoted from
Mathias> http://sourceware.org/ml/gdb/2010-08/msg00129.html

Tom> It is an oddity that currently an MI consumer must check gdb's
Tom> host charset in order to know how to decode its output.  I would
Tom> recommend that the client force it to be UTF-8, but I think this
Tom> currently may not work with PHONY_ICONV.

Mathias> So the question is, is it actually a good idea to simply always set
Mathias> gdb's host charset to UTF-8? Which hosts do use the phony iconv, and
Mathias> is it indeed a problem for them if the host charset is UTF-8?

I think it probably isn't really safe to just set host-charset.
Instead you should arrange to run gdb in a UTF-8 locale.
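
A minimal sketch of what "run gdb in a UTF-8 locale" could look like from a front end, written in Python. The helper name `utf8_gdb_launch` and the locale name `en_US.UTF-8` are assumptions for illustration; the locale must actually be installed on the host, which nothing in this thread guarantees.

```python
import os

def utf8_gdb_launch(base_env=None, locale_name="en_US.UTF-8"):
    """Return (argv, env) for starting gdb's MI interpreter in a UTF-8 locale.

    Hypothetical helper: locale_name is an assumption and must exist on the host.
    """
    env = dict(os.environ if base_env is None else base_env)
    # LC_ALL overrides LANG and the individual LC_* categories.
    env["LC_ALL"] = locale_name
    argv = ["gdb", "--interpreter=mi2"]
    return argv, env

argv, env = utf8_gdb_launch({"LANG": "C"})
# A front end would hand these to subprocess.Popen(argv, env=env, ...).
```

The sketch only prepares the environment; it does not start gdb, so it cannot tell whether the chosen locale is really available.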

I'm not sure exactly what might break though.

This area is somewhat of a mess.  I wouldn't mind fixing MI.  However, I
don't know exactly what would be most useful.  Also, because some hosts
have bad iconv implementations, you are at the mercy of whoever built
gdb.  IMNSHO, for non-Linux hosts, everybody ought to build against GNU
libiconv; but I am not positive that this is universally done.

Tom


* Re: Using UTF-8 as host charset
  2012-03-05 16:41 ` Tom Tromey
@ 2012-03-05 16:44   ` Pedro Alves
  2012-03-05 20:51   ` Mathias Kunter
  1 sibling, 0 replies; 12+ messages in thread
From: Pedro Alves @ 2012-03-05 16:44 UTC (permalink / raw)
  To: Tom Tromey; +Cc: Mathias Kunter, gdb

On 03/05/2012 04:40 PM, Tom Tromey wrote:

> IMNSHO, for non-Linux hosts, everybody ought to build against GNU
> libiconv; but I am not positive that this is universally done.


Pedantically, I think you mean for non-glibc hosts.

-- 
Pedro Alves


* Re: Using UTF-8 as host charset
  2012-03-05 16:13 ` Paul_Koning
@ 2012-03-05 18:09   ` Tom Tromey
  2012-03-05 18:11     ` Paul_Koning
  0 siblings, 1 reply; 12+ messages in thread
From: Tom Tromey @ 2012-03-05 18:09 UTC (permalink / raw)
  To: Paul_Koning; +Cc: mathiaskunter, gdb

>>>>> "Paul" ==   <Paul_Koning@Dell.com> writes:

Paul> While it doesn't use phony iconv, there are some other questions that
Paul> have come up on this in the past.  NetBSD (and possibly others) have
Paul> an iconv implementation that doesn't provide the "wchar_t" encoding
Paul> GDB assumes every iconv will have.  I remember trying to do something
Paul> about this and running into concerns that wchar_t, formally speaking,
Paul> is not the same as UCS-2 even though for practical purposes the two
Paul> are interchangeable.

I guess NetBSD should use libiconv.

We could in theory write a portable "phony libiconv" that uses the
standard C wide/multi-byte conversion functions.  But... libiconv
already did this, so it seemed simpler to just reuse it rather than try
to write our own.

Tom


* RE: Using UTF-8 as host charset
  2012-03-05 18:09   ` Tom Tromey
@ 2012-03-05 18:11     ` Paul_Koning
  2012-03-05 21:04       ` Tom Tromey
  0 siblings, 1 reply; 12+ messages in thread
From: Paul_Koning @ 2012-03-05 18:11 UTC (permalink / raw)
  To: tromey; +Cc: mathiaskunter, gdb

The issue here is that NetBSD has a fully functional iconv, except that it doesn't include the wchar_t "character set".  I think it has something to do with the notion that wchar_t is not the same as UCS-2, at least not in some corner cases.  I'm not particularly convinced, especially since GNU libiconv does make that exact equivalence.

	paul


* Re: Using UTF-8 as host charset
  2012-03-05 16:41 ` Tom Tromey
  2012-03-05 16:44   ` Pedro Alves
@ 2012-03-05 20:51   ` Mathias Kunter
  2012-03-05 21:13     ` Tom Tromey
  1 sibling, 1 reply; 12+ messages in thread
From: Mathias Kunter @ 2012-03-05 20:51 UTC (permalink / raw)
  To: gdb

> I think it probably isn't really safe to just set host-charset.
> Instead you should arrange to run gdb in a UTF-8 locale.

This unfortunately isn't generally possible for a cross-platform IDE 
like Eclipse CDT.


> I'm not sure exactly what might break though.

As long as GDB doesn't crash and still prints ASCII strings correctly, 
it won't concern us. I mean, we didn't support any non-ASCII characters 
at all until now within the CDT debugger. And I assume that ASCII 
strings will always be printed correctly by GDB, even if the host 
charset is set to UTF-8, since ASCII is just a subset of UTF-8.
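
The subset claim is easy to check. The following Python sketch decodes all 128 ASCII byte values under three encodings and compares the results; ISO-8859-1 is included only because it comes up elsewhere in this thread.

```python
# Every 7-bit ASCII code unit, 0x00 through 0x7F.
ascii_bytes = bytes(range(128))

as_ascii = ascii_bytes.decode("ascii")
as_utf8 = ascii_bytes.decode("utf-8")
as_latin1 = ascii_bytes.decode("iso-8859-1")

# Pure-ASCII data reads the same regardless of which of these is assumed.
same_everywhere = as_ascii == as_utf8 == as_latin1
print(same_everywhere)
```

This covers decoding on the front-end side only; it does not show how gdb decides which characters are printable.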

What we're trying to find out is whether we should enable the debugger 
charset support by default within future releases of Eclipse CDT or not. 
If GDB is generally stable with using UTF-8 as host charset, we'd do so. 
Can you give a recommendation?

Thanks,
Mathias


* Re: Using UTF-8 as host charset
  2012-03-05 18:11     ` Paul_Koning
@ 2012-03-05 21:04       ` Tom Tromey
  0 siblings, 0 replies; 12+ messages in thread
From: Tom Tromey @ 2012-03-05 21:04 UTC (permalink / raw)
  To: Paul_Koning; +Cc: mathiaskunter, gdb

>>>>> "Paul" ==   <Paul_Koning@Dell.com> writes:

Paul> The issue here is that NetBSD has a fully functional iconv, except
Paul> that it doesn't include the wchar_t "character set".  I think it has
Paul> something to do with the notion that wchar_t is not the same as ucs-2,
Paul> at least not in some corner cases.  I'm not particularly convinced,
Paul> especially since GNU libiconv does make that exact equivalence.

GNU iconv uses mbrtowc and friends to do the conversion in a portable
way.

I'm open to other ways to solve the problem as well.
I just don't know of any.

Tom


* Re: Using UTF-8 as host charset
  2012-03-05 20:51   ` Mathias Kunter
@ 2012-03-05 21:13     ` Tom Tromey
  2012-03-05 21:34       ` Mathias Kunter
  0 siblings, 1 reply; 12+ messages in thread
From: Tom Tromey @ 2012-03-05 21:13 UTC (permalink / raw)
  To: Mathias Kunter; +Cc: gdb

>>>>> "Mathias" == Mathias Kunter <mathiaskunter@gmail.com> writes:

Mathias> As long as GDB doesn't crash and still prints ASCII strings correctly,
Mathias> it won't concern us.

I think the failure mode would be something more like gdb making the
wrong decision about whether a character is "printable".  That is, I
think you could conceivably see weird output in some scenario.

I don't have an example at hand.  Maybe the concern is even just
theoretical.

Mathias> If GDB is generally stable with using UTF-8 as host charset,
Mathias> we'd do so.

I have run it that way every day for multiple years.

One problem is that if you have a gdb version that uses the phony iconv,
then it cannot handle UTF-8.  You can detect this at startup, though.
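
The message does not say how the detection would work. One plausible approach, which is an assumption here and not something the thread confirms, is that such a build rejects the request to set the host charset to UTF-8, so an MI front end would see an `^error` result record instead of `^done`. The Python sketch below only classifies the reply lines; the function name and the sample replies are hypothetical.

```python
import re

# GDB/MI result records have the form: [token]^result-class[,results]
_RESULT_RE = re.compile(r"^(\d*)\^(done|running|connected|error|exit)\b")

def host_charset_accepted(mi_reply_lines):
    """Return True if the first result record in the reply is ^done.

    Intended for the reply to `-gdb-set host-charset UTF-8` (hypothetical usage).
    """
    for line in mi_reply_lines:
        match = _RESULT_RE.match(line)
        if match:
            return match.group(2) == "done"
    # No result record seen: treat UTF-8 as unsupported.
    return False

# Illustrative replies; the error text is a placeholder, not gdb's wording.
ok_reply = ["^done", "(gdb) "]
bad_reply = ['^error,msg="placeholder error text"', "(gdb) "]
```

On a negative answer the front end could keep decoding gdb's output with whatever charset gdb reports instead of forcing UTF-8.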

Tom


* Re: Using UTF-8 as host charset
  2012-03-05 21:13     ` Tom Tromey
@ 2012-03-05 21:34       ` Mathias Kunter
  2012-03-05 21:41         ` Tom Tromey
  0 siblings, 1 reply; 12+ messages in thread
From: Mathias Kunter @ 2012-03-05 21:34 UTC (permalink / raw)
  To: gdb

> One problem is that if you have a gdb version that uses the phony iconv,
> then it cannot handle UTF-8.

Okay, but even then the worst thing that should happen is that GDB might 
scramble Unicode characters when printing, right?


* Re: Using UTF-8 as host charset
  2012-03-05 21:34       ` Mathias Kunter
@ 2012-03-05 21:41         ` Tom Tromey
  2012-03-05 22:23           ` Mathias Kunter
  0 siblings, 1 reply; 12+ messages in thread
From: Tom Tromey @ 2012-03-05 21:41 UTC (permalink / raw)
  To: Mathias Kunter; +Cc: gdb

>>>>> "Mathias" == Mathias Kunter <mathiaskunter@gmail.com> writes:

>> One problem is that if you have a gdb version that uses the phony iconv,
>> then it cannot handle UTF-8.

Mathias> Okay, but even then the worst thing that should happen is that GDB
Mathias> might scramble Unicode characters when printing, right?

I think so; but you can make a test case easily enough by setting gdb's
host-charset to ISO-8859-1 but Eclipse's notion to UTF-8, and then
trying things.
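
The mismatch behind this test can be previewed without gdb. The Python sketch below encodes a sample string as ISO-8859-1 and decodes it as UTF-8, which is roughly what a front end assuming UTF-8 would do with such bytes. It models only the byte-level mismatch, not gdb's own escaping of characters, and the sample text is an arbitrary choice.

```python
text = "naïve café"

# Bytes as they would look if the sender used ISO-8859-1.
wire_bytes = text.encode("iso-8859-1")

# What a receiver that assumes UTF-8 makes of those bytes.
seen_by_frontend = wire_bytes.decode("utf-8", errors="replace")

# The ASCII letters survive; each non-ASCII byte turns into U+FFFD.
print(seen_by_frontend)
```

Only the non-ASCII characters are damaged, which matches the expectation in this thread that the worst case is scrambled non-ASCII output.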

Tom


* Re: Using UTF-8 as host charset
  2012-03-05 21:41         ` Tom Tromey
@ 2012-03-05 22:23           ` Mathias Kunter
  0 siblings, 0 replies; 12+ messages in thread
From: Mathias Kunter @ 2012-03-05 22:23 UTC (permalink / raw)
  To: gdb

> I think so; but you can make a test case easily enough by setting gdb's
> host-charset to ISO-8859-1 but Eclipse's notion to UTF-8, and then
> trying things.

I ran several such tests with different versions of GDB on Linux, Mac 
OS and Windows, and didn't experience any problems. I just wasn't sure 
whether any of the GDB builds I tested used the phony iconv, so I thought 
I'd ask about this edge case here on the mailing list.

Since you say that at worst GDB would probably only print some strings 
incorrectly even when the phony iconv is used, I think we're going to 
enable this feature by default. Thanks for your help!

Mathias
