From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <tom@tromey.com>
Received: from outbound-ss-820.bluehost.com (outbound-ss-820.bluehost.com
 [69.89.24.241])
 by sourceware.org (Postfix) with ESMTPS id 713493858D28
 for <gdb-patches@sourceware.org>; Wed, 23 Feb 2022 22:28:56 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 713493858D28
Authentication-Results: sourceware.org;
 dmarc=none (p=none dis=none) header.from=tromey.com
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=tromey.com
Received: from cmgw11.mail.unifiedlayer.com (unknown [10.0.90.126])
 by progateway2.mail.pro1.eigbox.com (Postfix) with ESMTP id 31B0410048178
 for <gdb-patches@sourceware.org>; Wed, 23 Feb 2022 22:28:55 +0000 (UTC)
Received: from box5379.bluehost.com ([162.241.216.53]) by cmsmtp with ESMTP
 id N07ynuTQkwm8iN07ynvf6j; Wed, 23 Feb 2022 22:28:55 +0000
X-Authority-Reason: nr=8
X-Authority-Analysis: v=2.4 cv=DpSTREz+ c=1 sm=1 tr=0 ts=6216b527
 a=ApxJNpeYhEAb1aAlGBBbmA==:117 a=ApxJNpeYhEAb1aAlGBBbmA==:17
 a=dLZJa+xiwSxG16/P+YVxDGlgEgI=:19 a=oGFeUVbbRNcA:10:nop_rcvd_month_year
 a=Qbun_eYptAEA:10:endurance_base64_authed_username_1 a=20KFwNOVAAAA:8
 a=wjwl_FfooIJmtL9bYiEA:9
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=tromey.com; 
 s=default;
 h=Content-Type:MIME-Version:Message-ID:In-Reply-To:Date:References
 :Subject:Cc:To:From:Sender:Reply-To:Content-Transfer-Encoding:Content-ID:
 Content-Description:Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc
 :Resent-Message-ID:List-Id:List-Help:List-Unsubscribe:List-Subscribe:
 List-Post:List-Owner:List-Archive;
 bh=dzB9zVWapkmX9KxVqqOnDK8yq6K6OF2vVMXpbLUU/vY=; b=NEr39OCTxnAmcDkMGz7LUUeU7b
 vyNYVdrXMeSYUgcmExel4HRut5lhLfAClOQExb8iLs9uObBDEXjboyx6MO2N3fLy8Cfirocj6FRtQ
 cb1ZBnzoRWNbRxjicufkg0IAy;
Received: from 75-166-146-214.hlrn.qwest.net ([75.166.146.214]:35432
 helo=prentzel) by box5379.bluehost.com with esmtpsa (TLS1.2) tls
 TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2)
 (envelope-from <tom@tromey.com>)
 id 1nN07y-000ay8-3X; Wed, 23 Feb 2022 15:28:54 -0700
From: Tom Tromey <tom@tromey.com>
To: Andrew Burgess <aburgess@redhat.com>
Cc: Tom Tromey <tom@tromey.com>,  gdb-patches@sourceware.org
Subject: Re: [PATCH v2 09/18] Include \0 in printable wide characters
References: <20220217220547.3874030-1-tom@tromey.com>
 <20220217220547.3874030-10-tom@tromey.com>
 <20220223134930.GT2571@redhat.com>
X-Attribution: Tom
Date: Wed, 23 Feb 2022 15:28:53 -0700
In-Reply-To: <20220223134930.GT2571@redhat.com> (Andrew Burgess's message of
 "Wed, 23 Feb 2022 13:49:30 +0000")
Message-ID: <87czjddxp6.fsf@tromey.com>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/27.2 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain
X-AntiAbuse: This header was added to track abuse,
 please include it with any abuse report
X-AntiAbuse: Primary Hostname - box5379.bluehost.com
X-AntiAbuse: Original Domain - sourceware.org
X-AntiAbuse: Originator/Caller UID/GID - [47 12] / [47 12]
X-AntiAbuse: Sender Address Domain - tromey.com
X-BWhitelist: no
X-Source-IP: 75.166.146.214
X-Source-L: No
X-Exim-ID: 1nN07y-000ay8-3X
X-Source: 
X-Source-Args: 
X-Source-Dir: 
X-Source-Sender: 75-166-146-214.hlrn.qwest.net (prentzel)
 [75.166.146.214]:35432
X-Source-Auth: tom+tromey.com
X-Email-Count: 2
X-Source-Cap: ZWx5bnJvYmk7ZWx5bnJvYmk7Ym94NTM3OS5ibHVlaG9zdC5jb20=
X-Local-Domain: yes
X-Spam-Status: No, score=-3025.3 required=5.0 tests=BAYES_00, DKIM_SIGNED,
 DKIM_VALID, JMQ_SPF_NEUTRAL, RCVD_IN_DNSWL_NONE, SPF_HELO_NONE, SPF_PASS,
 TXREP, T_SCC_BODY_TEXT_LINE autolearn=no autolearn_force=no version=3.4.4
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on
 server2.sourceware.org
X-BeenThere: gdb-patches@sourceware.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gdb-patches mailing list <gdb-patches.sourceware.org>
List-Unsubscribe: <https://sourceware.org/mailman/options/gdb-patches>,
 <mailto:gdb-patches-request@sourceware.org?subject=unsubscribe>
List-Archive: <https://sourceware.org/pipermail/gdb-patches/>
List-Post: <mailto:gdb-patches@sourceware.org>
List-Help: <mailto:gdb-patches-request@sourceware.org?subject=help>
List-Subscribe: <https://sourceware.org/mailman/listinfo/gdb-patches>,
 <mailto:gdb-patches-request@sourceware.org?subject=subscribe>
X-List-Received-Date: Wed, 23 Feb 2022 22:28:58 -0000

>>>>> "Andrew" == Andrew Burgess <aburgess@redhat.com> writes:

Andrew> My confusion here is that I initially thought; if we have multiple
Andrew> characters, some that are printable, and some that are not, then
Andrew> surely, we would want to print the initial printable ones for real,
Andrew> and only later switch to escape sequences, right?

Andrew> Except, that's not what we do.

Andrew> And the reason (probably obvious to quicker minds than mine) is that
Andrew> characters might have different widths, so we can't "just" print the
Andrew> initial characters, and then print the unprintable as escape
Andrew> sequences, as we wouldn't know where in BUF the unprintable character
Andrew> actually starts.

Yeah, that's my understanding as well.

Andrew> OK, so my idea of removing wchar_printable is clearly a bad idea, but
Andrew> how does this relate to your change?

Andrew> Well, prior to this patch, if we had 3 characters, the first two are
Andrew> printable, and the third was \0, we would spot the non-printable \0,
Andrew> and so print the whole buffer, all 3 characters, as escape sequences.

Andrew> With this patch, all 3 characters will appear to be printable.  So now
Andrew> we will print the first character, just fine.  Then print the second
Andrew> character just fine.  Now for the third character, the \0, we call to
Andrew> print_wchar.  The \0 is not handled by anything but the 'default' case
Andrew> of the switch.

Andrew> In the default case, the \0 is non-printable, so we end up in the
Andrew> escape sequence printing code, which then tries to load bytes starting
Andrew> from BUF - which isn't going to be correct.

I think the idea behind this is that only a real \0 in the input will
ever turn into an L'\0' in the wchar_t form.  It seems to me that an
L'\0' pretty much has to correspond exactly to a target \0, because C
is pervasive and an encoding where stray \0 bytes can appear would
break everything.
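
To make the "no stray \0" point concrete, here is a small Python sketch.
It is only an illustration, not GDB code, and it uses UTF-8 as a stand-in
for a typical target narrow encoding (that choice is my assumption).  The
helper names are made up for the example.  It checks that U+0000 is the
only code point whose UTF-8 encoding contains a zero byte, and contrasts
that with UTF-16, which has zero bytes all over the place and so can't be
used for C narrow strings.

```python
def utf8_has_nul(code_point):
    """Return True if the UTF-8 encoding of code_point contains a zero byte."""
    return 0 in chr(code_point).encode("utf-8")


def code_points_with_nul(limit=0x110000):
    """List the code points below `limit` whose UTF-8 form contains a zero byte."""
    found = []
    for cp in range(limit):
        # Lone surrogates are not valid scalar values and cannot be
        # encoded as UTF-8, so skip them.
        if 0xD800 <= cp <= 0xDFFF:
            continue
        if utf8_has_nul(cp):
            found.append(cp)
    return found


# Only U+0000 itself produces a zero byte in UTF-8.
nul_producers = code_points_with_nul()

# By contrast, plain ASCII text encoded as UTF-16 is full of zero bytes.
utf16_a = "A".encode("utf-16-le")
```

Here nul_producers ends up holding just code point 0, while utf16_a is
the two bytes 0x41 0x00.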

Andrew> Now, this is where things are a bit weird.  The code in
Andrew> generic_emit_char is clearly written to handle multiple characters,
Andrew> but, I've only ever seen it print 1 character, which is why, I claim,
Andrew> your above change to wchar_printable works.

That's most likely because you are trying this on Linux.  Linux uses
UTF-32 for wchar_t, and so there aren't target characters that can't be
converted to a single wchar_t -- because UTF-32 is pretty much designed
to round-trip everything else.  So, on Linux hosts, I think some of
these loops aren't really needed.

However, Windows uses UTF-16 for wchar_t, so a single target character
can be converted to two wchar_t, via a surrogate pair.
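
As a rough illustration of the difference (a Python sketch, not anything
from GDB; the function names are invented for the example), the code
below counts how many UTF-16 and UTF-32 code units a single character
needs, and computes the surrogate pair by hand using the standard UTF-16
arithmetic.

```python
def utf16_units(ch):
    """Number of 16-bit code units needed to encode ch in UTF-16."""
    return len(ch.encode("utf-16-le")) // 2


def utf32_units(ch):
    """Number of 32-bit code units needed to encode ch in UTF-32."""
    return len(ch.encode("utf-32-le")) // 4


def surrogate_pair(code_point):
    """Split a code point above U+FFFF into its (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    value = code_point - 0x10000
    high = 0xD800 + (value >> 10)    # top 10 bits
    low = 0xDC00 + (value & 0x3FF)   # bottom 10 bits
    return high, low


bmp_char = "\u00e9"         # U+00E9, inside the Basic Multilingual Plane
astral_char = "\U0001F600"  # U+1F600, outside it

pair = surrogate_pair(ord(astral_char))
```

With UTF-32 both characters take one unit.  With UTF-16 the first takes
one unit and the second takes two, namely 0xD83D followed by 0xDE00.
That second case is the one where a loop over several wchar_t for a
single target character actually matters.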

On Solaris and (IIRC) NetBSD, wchar_t is even weirder, though I don't
recall whether it is a variable-length encoding.

Anyway the \0 case is only really here for Rust.  So maybe another idea
is to handle it exactly there, somehow.  The Rust printer can assume the
use of UTF-32 on the target, so that would all work out fine.

Tom