From mboxrd@z Thu Jan 1 00:00:00 1970
From: Tom Tromey
To: Tom Tromey
Cc: Andrew Burgess, gdb-patches@sourceware.org
Subject: Re: [PATCH v2 09/18] Include \0 in printable wide characters
References: <20220217220547.3874030-1-tom@tromey.com>
 <20220217220547.3874030-10-tom@tromey.com>
 <20220223134930.GT2571@redhat.com>
 <87czjddxp6.fsf@tromey.com>
X-Attribution: Tom
Date: Wed, 23 Feb 2022 16:59:02 -0700
In-Reply-To: <87czjddxp6.fsf@tromey.com> (Tom Tromey's message of "Wed, 23 Feb 2022 15:28:53 -0700")
Message-ID: <878ru1dtix.fsf@tromey.com>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/27.2 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain
List-Id: Gdb-patches mailing list
X-List-Received-Date: Thu, 24 Feb 2022
 00:00:08 -0000

Tom> I think the idea behind this is that only a real \0 in the input will
Tom> really ever turn into a L'\0' in the wchar_t form.  It seems to me that
Tom> an L'\0' pretty much has to correspond exactly to a target \0, just
Tom> because C is pervasive and an encoding where stray \0 bytes can appear
Tom> would break everything.

I went for a short walk and, naturally, realized this is only half
right.  An L'\0' can come from a non-zero target encoding in the Java
flavor of UTF-8, which exists precisely to smuggle a wide '\0' through a
multi-byte encoding.

https://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8

So while I still believe that a target '\0' will always map to a L'\0',
it's not the case that an L'\0' necessarily came from one such, with the
Java-style 0xc0 0x80 being a counter-example.

In this case it's not 100% clear what is the best thing to do.  Possibly
iconv will just give an encoding error, as that is an overlong sequence.
Anyway maybe the right thing to do is print \xc0\x80 or the like, to
make it clear that something unusual is going on.

Tom> That's most likely because you are trying this on Linux.  Linux uses
Tom> UTF-32 for wchar_t, and so there aren't target characters that can't be
Tom> converted to a single wchar_t

I now wonder if this is true as well, because you might see a "CESU-8"
encoding:

https://en.wikipedia.org/wiki/CESU-8

... where surrogate pairs are represented as two UTF-8 sequences.  This
could show up as a target program decision to use this encoding,
combined with using UTF-8 in gdb.  I didn't experiment to see what iconv
does for this sort of thing.

I'll look into a Rust-specific fix and just drop this patch.

Tom
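
As a rough illustration of the two byte sequences mentioned above, here is
a small sketch using Python's built-in UTF-8 codec.  It says nothing about
what iconv or gdb actually do with these inputs, which the message leaves
untested.  The helper names are made up for this example, and U+1F600 is
just an arbitrary character outside the BMP chosen for illustration.

```python
# Illustration only: this exercises Python's UTF-8 codec, not iconv or gdb.

# Modified UTF-8 (Java style) encodes U+0000 as the overlong pair 0xC0 0x80.
MODIFIED_UTF8_NUL = b"\xc0\x80"

# CESU-8 encodes U+1F600 as two 3-byte sequences, one per UTF-16 surrogate
# (0xD83D and 0xDE00).
CESU8_U1F600 = b"\xed\xa0\xbd\xed\xb8\x80"


def strict_utf8_accepts(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def decode_cesu8_pair(data: bytes) -> str:
    """Decode CESU-8 bytes, joining surrogate pairs into single characters."""
    # "surrogatepass" lets the individual surrogates through the decoder...
    surrogates = data.decode("utf-8", errors="surrogatepass")
    # ...and a UTF-16 round trip joins each pair into one code point.
    utf16 = surrogates.encode("utf-16-le", errors="surrogatepass")
    return utf16.decode("utf-16-le")
```

With these definitions, strict_utf8_accepts returns False for both byte
strings, while decode_cesu8_pair(CESU8_U1F600) yields the single character
U+1F600.  A converter that passes the surrogates through unjoined would
instead hand back two code units for one target character.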