From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <patrick@monnerat.net>
Received: from jupiter.monnerat.net (jupiter.monnerat.net [46.226.111.226])
 by sourceware.org (Postfix) with ESMTPS id F0E453858CDB
 for <gdb-patches@sourceware.org>; Sun,  9 Oct 2022 00:47:25 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org F0E453858CDB
Received: from [192.168.0.128] ([192.168.0.128])
 by jupiter.monnerat.net (8.14.8/8.14.8) with ESMTP id 2990lIYe012650
 (version=TLSv1/SSLv3 cipher=AES256-GCM-SHA384 bits=256 verify=OK);
 Sun, 9 Oct 2022 02:47:24 +0200
DKIM-Filter: OpenDKIM Filter v2.10.3 jupiter.monnerat.net 2990lIYe012650
Message-ID: <2f10efe4-1095-b620-ea1c-08cc047c45c4@monnerat.net>
Date: Sun, 9 Oct 2022 02:47:18 +0200
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101
 Thunderbird/102.2.1
Subject: Re: [PATCH] gdb: add UTF16/UTF32 target charsets in phony_iconv
Content-Language: en-US
To: Tom Tromey <tom@tromey.com>
Cc: Patrick Monnerat via Gdb-patches <gdb-patches@sourceware.org>
References: <20221002140010.106238-1-patrick@monnerat.net>
 <87k05bs8c5.fsf@tromey.com>
 <0a978271-3085-8bf3-f5fd-6a0b3f9f3ea2@monnerat.net>
 <874jwejgbb.fsf@tromey.com>
From: Patrick Monnerat <patrick@monnerat.net>
In-Reply-To: <874jwejgbb.fsf@tromey.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
X-Spam-Status: No, score=-5.0 required=5.0 tests=BAYES_00, DKIM_SIGNED,
 DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, JMQ_SPF_NEUTRAL, NICE_REPLY_A,
 SPF_HELO_NONE, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on
 server2.sourceware.org
X-BeenThere: gdb-patches@sourceware.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gdb-patches mailing list <gdb-patches.sourceware.org>
List-Unsubscribe: <https://sourceware.org/mailman/options/gdb-patches>,
 <mailto:gdb-patches-request@sourceware.org?subject=unsubscribe>
List-Archive: <https://sourceware.org/pipermail/gdb-patches/>
List-Post: <mailto:gdb-patches@sourceware.org>
List-Help: <mailto:gdb-patches-request@sourceware.org?subject=help>
List-Subscribe: <https://sourceware.org/mailman/listinfo/gdb-patches>,
 <mailto:gdb-patches-request@sourceware.org?subject=subscribe>
X-List-Received-Date: Sun, 09 Oct 2022 00:47:47 -0000


On 10/8/22 20:55, Tom Tromey wrote:
>
> The comments at the top of gdb_wchar.h describe the situation somewhat,
> though they don't really explain what was wrong with Solaris.  My
> recollection, though, is that the Solaris wchar_t doesn't have any
> ordinary encoding but is instead a weird hybrid thing, and furthermore
> that the Solaris iconv doesn't accept "wchar_t" as an encoding name.
> So, on Solaris, there's no convenient way to do the conversions (it's
> possible to convert wchar_t to/from the locale's multi-byte encoding,
> but I didn't implement that since it seemed like a pain).

Thanks for the additional explanation.

This describes the particular case of Solaris. Are there other OSes with 
similar implementations?

In any case, this tends to show that wchar_t is not reliable.

>
> All of this is based on the idea that it's convenient to work in a wide
> character representation at some points in the code.  At the time, I
> figured relying on wchar_t would be good for this because (presumably)
> hosts would support that reasonably well and we wouldn't have to do
> extra work in gdb.
>
> However, it seems to me that it doesn't really have to be done this way.
> We could use UTF-32 instead, by making our own tables (along the lines
> of ada-unicode.py) for "isdigit" and "isprint".

Totally agreed: we need something more "predictable". UTF-32 seems a 
good choice, but the endianness question still has to be settled: 
should the byte order be fixed (UTF-32[BL]E) or machine dependent? Both 
have pros and cons. We could have a class implementing those characters 
plus their ctype-like methods, and even a basic_string instantiation 
(or a subclass of one) supporting conversions.
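
To make the idea a bit more concrete, here is a minimal sketch of what I 
have in mind. All names are hypothetical (nothing like this exists in 
gdb), and the ctype-like methods are ASCII-only placeholders standing in 
for the generated Unicode tables you mention:

```cpp
#include <array>
#include <cstdint>
#include <string>

// Hypothetical: explicit byte order for serialized UTF-32.
enum class utf32_order { big, little };

// Serialize one code point to 4 bytes in the requested byte order,
// independently of the host endianness.
inline std::array<std::uint8_t, 4>
utf32_encode (char32_t c, utf32_order order)
{
  std::array<std::uint8_t, 4> out {};
  for (int i = 0; i < 4; ++i)
    {
      int shift = (order == utf32_order::big) ? 8 * (3 - i) : 8 * i;
      out[i] = static_cast<std::uint8_t> ((c >> shift) & 0xff);
    }
  return out;
}

// Inverse of utf32_encode.
inline char32_t
utf32_decode (const std::array<std::uint8_t, 4> &in, utf32_order order)
{
  char32_t c = 0;
  for (int i = 0; i < 4; ++i)
    {
      int shift = (order == utf32_order::big) ? 8 * (3 - i) : 8 * i;
      c |= static_cast<char32_t> (in[i]) << shift;
    }
  return c;
}

// Hypothetical character wrapper with ctype-like methods.
struct gdb_uchar
{
  char32_t value;

  // A Unicode scalar value: in range and not a surrogate.
  bool is_valid () const
  {
    return value <= 0x10FFFF && !(value >= 0xD800 && value <= 0xDFFF);
  }

  // Placeholder: ASCII digits only.  A real version would consult
  // a generated table.
  bool is_ascii_digit () const
  {
    return value >= U'0' && value <= U'9';
  }
};

// The "basic_string instantiation": std::u32string is
// std::basic_string<char32_t>.
using gdb_ustring = std::u32string;
```

With something like this, the in-memory form stays host-order char32_t 
and the fixed byte order only matters at the conversion boundary.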

That said, I have no idea how much work this change would require.

> In addition to this, I suppose we could simply require iconv.  Probably
> any host that has iconv will support UTF-32 (if not, what good is it
> really).  And libiconv exists and can even be conveniently dropped into
> the source tree if there are any hosts that don't have it.  This may not
> be a good plan if there are active host platforms where this would be a
> pain to deal with.

IMO, only old platforms (more than ~15 years old) have an iconv without 
UTF support. Do we have to support them?
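
If it helps to survey hosts, a probe along these lines could tell 
whether a given iconv knows a charset. This is only a sketch: the helper 
name is hypothetical, and the charset names accepted by iconv_open are 
implementation-defined, so "UTF-32LE" is an assumption about the host:

```cpp
#include <iconv.h>

// Hypothetical helper: true if this host's iconv can open a
// conversion descriptor from FROM to TO.
inline bool
iconv_can_convert (const char *to, const char *from)
{
  iconv_t cd = iconv_open (to, from);
  if (cd == (iconv_t) -1)
    return false;   // Unsupported pair (errno is EINVAL).
  iconv_close (cd);
  return true;
}
```

For example, iconv_can_convert ("UTF-32LE", "UTF-8") would be the 
interesting check on the platforms in question.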

For the particular case of Solaris, have things changed since then, and 
how far back should old versions be supported?

>
> Anyway, what do you think of this plan?

Overall and in the long term, I fully agree. Requiring iconv, using 
UTF-32 instead of wchar_t and dropping phony_iconv looks like the best 
solution. But again, I can't estimate the amount of rework involved.

As I'm mainly the Insight maintainer (since 2014) and only a minor, 
recent committer to gdb, I don't want to start a revolution in the 
latter, just make it clean and usable from Insight when called in my 
test contexts (currently linux OK, cygwin OK, mingw BAD). That said, I'm 
not against a reasonable contribution that benefits plain gdb too, 
within the limits of my knowledge and understanding of its code.

In the short and medium term, I think the current patch is still 
useful: it immediately (and dirtily!) solves the problem introduced by 
Ada support and will allow a smooth and gentle UTF-32 transition until 
we reach a situation where phony_iconv can be dropped.


The questions I ask above are meant more to highlight important 
strategic points than to require an immediate answer, I guess!

Patrick