From: Tom Tromey
To: Andrew Burgess
Cc: Tom Tromey, gdb-patches@sourceware.org
Subject: Re: [PATCH v2 09/18] Include \0 in printable wide characters
Date: Wed, 23 Feb 2022 15:28:53 -0700
Message-ID: <87czjddxp6.fsf@tromey.com>
In-Reply-To: <20220223134930.GT2571@redhat.com>
References: <20220217220547.3874030-1-tom@tromey.com> <20220217220547.3874030-10-tom@tromey.com> <20220223134930.GT2571@redhat.com>

>>>>> "Andrew" == Andrew Burgess writes:

Andrew> My confusion here is that I initially thought; if we have multiple
Andrew> characters, some that are printable, and some that are not, then
Andrew> surely, we would want to print the initial printable ones for real,
Andrew> and only later switch to escape sequences, right?

Andrew> Except, that's not what we do.

Andrew> And the reason (probably obvious to quicker minds than mine) is that
Andrew> characters might have different widths, so we can't "just" print the
Andrew> initial characters, and then print the unprintable as escape
Andrew> sequences, as we wouldn't know where in BUF the unprintable character
Andrew> actually starts.

Yeah, that's my understanding as well.

Andrew> OK, so my idea of removing wchar_printable is clearly a bad idea, but
Andrew> how does this relate to your change?

Andrew> Well, prior to this patch, if we had 3 characters, the first two are
Andrew> printable, and the third was \0, we would spot the non-printable \0,
Andrew> and so print the whole buffer, all 3 characters, as escape sequences.

Andrew> With this patch, all 3 characters will appear to be printable. So now
Andrew> we will print the first character, just fine. Then print the second
Andrew> character just fine. Now for the third character, the \0, we call to
Andrew> print_wchar. The \0 is not handled by anything but the 'default' case
Andrew> of the switch.

Andrew> In the default case, the \0 is non-printable, so we end up in the
Andrew> escape sequence printing code, which then tries to load bytes starting
Andrew> from BUF - which isn't going to be correct.

I think the idea behind this is that only a real \0 in the input will
really ever turn into a L'\0' in the wchar_t form.

It seems to me that an L'\0' pretty much has to correspond exactly to a
target \0, just because C is pervasive and an encoding where stray \0
bytes can appear would break everything.

Andrew> Now, this is where things are a bit weird. The code in
Andrew> generic_emit_char is clearly written to handle multiple characters,
Andrew> but, I've only ever seen it print 1 character, which is why, I claim,
Andrew> your above change to wchar_printable works.

That's most likely because you are trying this on Linux.

Linux uses UTF-32 for wchar_t, and so there aren't target characters
that can't be converted to a single wchar_t -- because UTF-32 is pretty
much designed to round-trip everything else. So, on Linux hosts, I
think some of these loops aren't really needed.

However, Windows uses UTF-16, and a single target character can be
converted to two wchar_t, via surrogate pairs.

On Solaris and (IIRC) NetBSD, wchar_t is even weirder, though I don't
recall whether it is a variable-length encoding.

Anyway, the \0 case is only really here for Rust. So maybe another idea
is to handle it exactly there, somehow. The Rust printer can assume the
use of UTF-32 on the target, so that would all work out fine.

Tom
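
To make the UTF-32 versus UTF-16 point concrete, here is a small Python
sketch. It is not GDB code, and the helper names (code_units,
utf32_units, utf16_units, surrogate_pair) are made up for illustration.
It assumes a 4-byte wchar_t for the UTF-32 case and a 2-byte wchar_t
for the UTF-16 case, and counts how many such units one character needs.

```python
def code_units(ch: str, encoding: str, unit_size: int) -> int:
    """Return how many fixed-size code units `ch` occupies in `encoding`."""
    encoded = ch.encode(encoding)
    # Both encodings used below always produce a whole number of units.
    assert len(encoded) % unit_size == 0
    return len(encoded) // unit_size


def utf32_units(ch: str) -> int:
    """Units needed when wchar_t is assumed to be 4 bytes of UTF-32."""
    return code_units(ch, "utf-32-le", 4)


def utf16_units(ch: str) -> int:
    """Units needed when wchar_t is assumed to be 2 bytes of UTF-16."""
    return code_units(ch, "utf-16-le", 2)


def surrogate_pair(ch: str) -> tuple[int, int]:
    """Split a character outside the BMP into its UTF-16 surrogates."""
    offset = ord(ch) - 0x10000
    if offset < 0:
        raise ValueError("character fits in a single UTF-16 unit")
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low


if __name__ == "__main__":
    for ch in ("A", "\0", "\U0001F600"):
        print(f"U+{ord(ch):04X}: utf32={utf32_units(ch)} "
              f"utf16={utf16_units(ch)}")
```

Under these assumptions every character, including \0, is exactly one
unit in UTF-32, while a character outside the Basic Multilingual Plane
takes two units in UTF-16. That is the situation in which a loop over
several wchar_t values per target character is actually needed.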