Message-ID: <2f10efe4-1095-b620-ea1c-08cc047c45c4@monnerat.net>
Date: Sun, 9 Oct 2022 02:47:18 +0200
Subject: Re: [PATCH] gdb: add UTF16/UTF32 target charsets in phony_iconv
To: Tom Tromey
Cc: Patrick Monnerat via Gdb-patches
References: <20221002140010.106238-1-patrick@monnerat.net>
 <87k05bs8c5.fsf@tromey.com>
 <0a978271-3085-8bf3-f5fd-6a0b3f9f3ea2@monnerat.net>
 <874jwejgbb.fsf@tromey.com>
From: Patrick Monnerat
In-Reply-To: <874jwejgbb.fsf@tromey.com>

On 10/8/22 20:55, Tom Tromey wrote:
>
> The comments at the top of gdb_wchar.h describe the situation somewhat,
> though they don't really explain what was wrong with Solaris.
> My recollection, though, is that the Solaris wchar_t doesn't have any
> ordinary encoding but is instead a weird hybrid thing, and furthermore
> that the Solaris iconv doesn't accept "wchar_t" as an encoding name.
> So, on Solaris, there's no convenient way to do the conversions (it's
> possible to convert wchar_t to/from the locale's multi-byte encoding,
> but I didn't implement that since it seemed like a pain).

Thanks for the additional explanation. This describes the particular
case of Solaris. Are there other OSes with similar implementations? In
any case, this tends to demonstrate that wchar_t is not reliable.

>
> All of this is based on the idea that it's convenient to work in a wide
> character representation at some points in the code. At the time, I
> figured relying on wchar_t would be good for this because (presumably)
> hosts would support that reasonably well and we wouldn't have to do
> extra work in gdb.
>
> However, it seems to me that it doesn't really have to be done this way.
> We could use UTF-32 instead, by making our own tables (along the lines
> of ada-unicode.py) for "isdigit" and "isprint".

Totally agreed: we need to have something more "predictable". UTF-32
seems a good choice, but the endianness question still has to be
settled: should it be fixed (UTF-32[BL]E) or machine-dependent? Both
have pros and cons.

We could have a class implementing those chars plus their ctype-like
methods, and even a subclass of a basic_string instantiation supporting
conversions.

I nevertheless have no idea how much work would be required to change
this.

> In addition to this, I suppose we could simply require iconv. Probably
> any host that has iconv will support UTF-32 (if not, what good is it
> really). And libiconv exists and can even be conveniently dropped into
> the source tree if there are any hosts that don't have it. This may not
> be a good plan if there are active host platforms where this would be a
> pain to deal with.

IMO, only old platforms (older than ~15 years) have an iconv that does
not support UTF. Do we have to support them? For the particular case of
Solaris, have things changed nowadays, and how old a version should be
supported?

>
> Anyway, what do you think of this plan?

Globally and in the long term, I fully agree. Requiring iconv, using
UTF-32 instead of wchar_t and dropping phony_iconv looks like the best
solution. But again, I cannot estimate the amount of rework implied.

As I'm mainly the Insight maintainer (since 2014) and a very small and
recent committer to gdb, I don't want to start a revolution in the
latter, just make it clean and usable from Insight in my test contexts
(currently: linux OK, cygwin OK, mingw BAD). That said, I'm not against
a reasonable contribution that benefits bare gdb too, within the limits
of my knowledge and understanding of its code.

In the short and medium term, I think the current patch is still useful:
it immediately (and dirtily!) solves the problem introduced by Ada
support and will allow a smooth and gentle UTF-32 transition until we
reach a situation where phony_iconv can be dropped.

The questions I ask above are meant more to emphasize important
strategic points than to require an immediate answer, I guess!

Patrick
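
For illustration only (this is not gdb code and not part of the patch):
the two open points above, fixed versus machine-dependent UTF-32 byte
order and home-made "isdigit"/"isprint" tables, can be sketched in a few
lines of Python. All function names below are hypothetical, and the
choice of Unicode general categories for "digit" and "printable" is an
assumption, not something settled in this thread.

```python
import sys
import unicodedata


def encode_utf32(text, byteorder="native"):
    """Encode TEXT as BOM-less UTF-32.

    BYTEORDER is "little", "big", or "native" (the host's byte order).
    """
    if byteorder == "native":
        byteorder = sys.byteorder  # "little" or "big"
    codec = "utf-32-le" if byteorder == "little" else "utf-32-be"
    return text.encode(codec)


def is_digit(cp):
    """Assumed rule: a code point is a digit if its category is Nd."""
    return unicodedata.category(chr(cp)) == "Nd"


def is_print(cp):
    """Assumed rule: controls, format characters, surrogates, private-use,
    unassigned code points and line/paragraph separators are not printable."""
    return unicodedata.category(chr(cp)) not in {
        "Cc", "Cf", "Cs", "Co", "Cn", "Zl", "Zp"}


def build_ranges(predicate, limit=0x110000):
    """Collapse the code points below LIMIT that satisfy PREDICATE into
    inclusive (first, last) ranges, the shape a generated table could take."""
    ranges = []
    start = None
    for cp in range(limit):
        if predicate(cp):
            if start is None:
                start = cp
        elif start is not None:
            ranges.append((start, cp - 1))
            start = None
    if start is not None:
        ranges.append((start, limit - 1))
    return ranges
```

With a fixed byte order, encode_utf32("A", "little") and
encode_utf32("A", "big") give the same four bytes in opposite order on
every host, while "native" follows the machine. The output of
build_ranges could then be emitted as a static table by a generator
script, so that the lookups no longer depend on the host's wchar_t.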