From: Tom Tromey
To: Andrew Burgess
Cc: Tom Tromey, gdb-patches@sourceware.org
Subject: Re: [PATCH] Allow non-ASCII characters in Rust identifiers
Date: Mon, 04 Apr 2022 03:48:48 -0600

Andrew> So I put this into a text file 'unicode.tcl':
Andrew>   puts "print 𝕯"
Andrew> (just in case that gets mangled in transit, that's the same unicode
Andrew> character as is used in the gdb.rust/unicode.exp test)

Andrew> I'm currently running tcl 8.6.  I wonder if you could compare this to
Andrew> the behaviour of your tclsh.

This works for me.  I'm using the system tclsh on Fedora 34.
tclsh doesn't seem to support --version, but:

    prentzel. rpm -q tcl
    tcl-8.6.10-5.fc34.x86_64

Andrew> Compared to the original, the first '0xf0' changes to '0xc3 0xb0', while
Andrew> all the subsequent bytes get a 0xc2 byte before them.

Unicode U+00f0 is represented as 0xc3 0xb0 in UTF-8.

So one idea is that tclsh thinks the input is Latin-1.  Latin-1 code
points map identically to the Unicode code points with the same values,
so this is the conversion you would see when converting from the file
encoding to UTF-8.  That is, tclsh reads 0xf0 but, thinking it is
reading a Latin-1 character, converts it to the corresponding Unicode
character U+00f0, and from there to the bytes that are seen.  The same
thing would happen to each of the following bytes, which is where the
extra 0xc2 bytes come from.

I have LANG=en_US.UTF-8, which may explain why my default encoding is
UTF-8.  Perhaps the locale is the problem: if you are running in a
Latin-1 locale, then the .exp files will be recoded incorrectly as they
are read by the interpreter.

I don't know if there's a way to set the file encoding in a way that Tcl
will recognize.  We could maybe try a UTF-8 BOM.

Tom
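
For reference, the byte transformation described above can be reproduced outside of Tcl. The following is a minimal Python sketch, not a test of tclsh itself: it assumes the interpreter decodes each input byte as one Latin-1 character and then writes the text back out as UTF-8, which is only the hypothesis from this message. The variable names are made up for illustration.

```python
# UTF-8 encoding of U+1D56F, the character in the 'unicode.tcl' example.
original = b"\xf0\x9d\x95\xaf"

# Hypothesized misreading: treat every byte as a separate Latin-1
# character, then encode the resulting four characters as UTF-8.
recoded = original.decode("latin-1").encode("utf-8")

# Show both byte sequences in hex for comparison.
print("original:", original.hex(" "))
print("recoded: ", recoded.hex(" "))
```

If the hypothesis is right, the recoded bytes should show 0xf0 turning into 0xc3 0xb0 and each later byte gaining a 0xc2 prefix, matching what Andrew reported.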