From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from tarta.nabijaczleweli.xyz (unknown [139.28.40.42])
 by sourceware.org (Postfix) with ESMTP id 3268E3858D20
 for ; Fri, 2 Dec 2022 18:42:05 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 3268E3858D20
Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=nabijaczleweli.xyz
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=nabijaczleweli.xyz
Received: from tarta.nabijaczleweli.xyz (unknown [192.168.1.250])
 by tarta.nabijaczleweli.xyz (Postfix) with ESMTPSA id 74809A2C;
 Fri, 2 Dec 2022 19:42:01 +0100 (CET)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=nabijaczleweli.xyz;
 s=202211; t=1670006521;
 bh=6JcXDGST51KFrmqMpi8ckLLBvhRPGPPq1XyGyQnc5U0=;
 h=Date:From:To:Cc:Subject:References:In-Reply-To:From;
 b=Vb8Ka/DdSCR8N7hP3zGIRWhQCBtk8XsWjxRCQ1YDN+oM7BmE8V0H51QCd80FBo99Y
 LNegYG7mCz36yyIG07QqpPqC2rrqz/uC35AaxJcx1yQ9pZVGyhEjZppZNhLgDez2C8
 cDG0Tt4Zb6bA59N+SIuh3vGQbVytrpA6s51qbK4eKP5xw9hcVJs54VJXpKFq0HnCHD
 wudw1m6JsSa9FwXpxOMQR29PE8yhCJrjRROObOFBo4VoFCGWg19GowmeZmqFQJz7kE
 ZLGSoBOza1oeXHzl+7+SEDpl4KNJ4m8QKBw+dWr4p56gtHDE48VXlOlU2z1M8I5DgG
 wATmPlpb8Us5A==
Date: Fri, 2 Dec 2022 19:42:00 +0100
From: =?utf-8?B?0L3QsNCx?=
To: Florian Weimer
Cc: libc-alpha@sourceware.org, Victor Stinner
Subject: Re: [PATCH v6 2/2] POSIX locale covers every byte [BZ# 29511]
Message-ID: <20221202184200.gfcnjwcfnc75wqqi@tarta.nabijaczleweli.xyz>
References: <969aa82c8d5904c1d2040bba87abe2f17a0dc647.1667409408.git.nabijaczleweli@nabijaczleweli.xyz>
 <874jv8dxat.fsf@oldenburg.str.redhat.com>
 <87wn8344by.fsf@oldenburg.str.redhat.com>
 <20221128162453.supbaps5ftl2mg3s@tarta.nabijaczleweli.xyz>
 <87mt85pv1h.fsf@oldenburg.str.redhat.com>
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha512;
 protocol="application/pgp-signature"; boundary="d3knoeizpge5g76w"
Content-Disposition: inline
In-Reply-To: <87mt85pv1h.fsf@oldenburg.str.redhat.com>
User-Agent: NeoMutt/20220429
X-Spam-Status: No, score=-2.9 required=5.0 tests=BAYES_00,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,FROM_SUSPICIOUS_NTLD,KAM_INFOUSMEBIZ,PDS_RDNS_DYNAMIC_FP,RDNS_DYNAMIC,SPF_HELO_PASS,SPF_PASS,TXREP,T_PDS_OTHER_BAD_TLD autolearn=no autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org
List-Id:

--d3knoeizpge5g76w
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

Hi!

On Fri, Dec 02, 2022 at 06:36:26PM +0100, Florian Weimer wrote:
> * наб:
> > On Thu, Nov 10, 2022 at 09:10:57AM +0100, Florian Weimer wrote:
> >> Raised on the musl list here:
> >> Choice of wchar_t mapping for non-ASCII bytes in the POSIX locale
> >>
> >
> > That thread seems to've been exhausted (at least I don't see anything
> > fresh in the archive) ‒ should I just resend with the comments for v7
> > applied, or do you have a mapping range you'd rather see given those
> > givens?
>
> I still can't make up my mind. I think the options are:
>
> * Some sort of custom encoding (like you posted).
For which there's Prior Art in both other libcs and implementations of
similar mechanisms in unrelated software, making it just about what
users expect, and the lowest-energy conversion to what POSIX has
mandated for close to 8 years now.

> * Latin-1
Sorry, what? The Latin-1 that's so poorly defined that W3C requires,
per spec, that Latin-1 charset specs be ignored? The Latin-1 they got
wrong so badly they made two subsequent standards, only one of which is
compatible? The Latin-1 that has some random subset of Germanic and
maybe like French if you squint and that's apparently fine?
The iron curtain has fallen, for better or for worse, since the 80s.
If that's the "solution", then leave "C" 7-bit.
I'm gonna assume that's a joke.

(Also, people would then try to "use" it, and then
 (a) you've lost, but
 (b) the collation sequence is gonna be wrong always because it no
     longer represents any given language
     (though apparently it's "fine" if they collate in any random
      order, so it's legal per spec to make it just Spanish, I think;
      this is somehow even worse, and I may be misunderstanding
      POSIX 7.3.2.6 because it'd mean that other parts of the standard
      that use and recommend "LC_ALL=C utility ..." to process bytes,
      like for sort, are also wrong?).)

> * UTF-8 with surrogate escape encoding (and encouraging POSIX to change again)
Well it's not gonna ‒ at least I don't think it is ‒ given that
I don't think it /changed/ anything actually? Issue 7-2008 7.2 says
> The tables in Locale Definition describe the characteristics and
> behavior of the POSIX locale for data consisting entirely of
> characters from the portable character set and the control character
> set. For other characters, the behavior is unspecified.
And just TC2 specified what had been unspecified behaviour I think?

Implementations had freedom to do whatever, including UTF-8, until 2016.
Naturally, as we're seeing now, not one has exercised that freedom.
If glibc /did/ do POSIX=C=C.UTF-8 before then, then maybe we'd see
a different result, but it hadn't, so we didn't.

> What argues in favor of the last point is that many, many people are
> using C.UTF-8 nowadays.
Great! They can continue to use C.UTF-8. They have had to opt in to
their preferred encoding like everyone else, and they will continue.
0 changes observed here.

> And effectively disabling wide/multibyte
> conversion until you call setlocale does not seem particularly useful.
"Mangling input data until explicitly disabled" is worse than
"input data is data, and you can make it characters".
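
To make the difference between the three options concrete, here is a minimal Python sketch of what each does to input that is not UTF-8. It is an illustration only and not code from the patch: the `BASE` offset and the helpers `bytes_to_wide` / `wide_to_bytes` are hypothetical names chosen here, and the range actually used by the patch may differ. Python's `surrogateescape` error handler stands in for "UTF-8 with surrogate escape".

```python
# Hypothetical byte-transparent mapping: ASCII maps to itself and each
# high byte maps to exactly one code point at BASE + byte.
# BASE is an assumption for illustration, not the patch's range.
BASE = 0xDC00


def bytes_to_wide(data: bytes) -> list:
    """Map every byte to one wide value, never failing."""
    return [b if b < 0x80 else BASE + b for b in data]


def wide_to_bytes(wide: list) -> bytes:
    """Inverse of bytes_to_wide for values it produced."""
    return bytes(w if w < 0x80 else w - BASE for w in wide)


# Sample input in a legacy 8-bit encoding (KOI8-R), i.e. not valid UTF-8.
koi8 = "привет".encode("koi8_r")

# Option 1 (custom encoding): every byte round-trips unchanged.
assert wide_to_bytes(bytes_to_wide(koi8)) == koi8
assert all(
    wide_to_bytes(bytes_to_wide(bytes([b]))) == bytes([b]) for b in range(256)
)

# Option 2 (Latin-1): also round-trips, but it assigns every high byte a
# Western European letter, so the text silently becomes something else.
latin1_view = koi8.decode("latin-1")
assert latin1_view.encode("latin-1") == koi8
assert latin1_view != "привет"

# Option 3 (UTF-8): a strict decoder rejects the input outright.
try:
    koi8.decode("utf-8")
    strict_utf8_ok = True
except UnicodeDecodeError:
    strict_utf8_ok = False
assert strict_utf8_ok is False

# With surrogate escapes the invalid bytes survive as lone surrogates,
# but only because the decoder has already tried to parse them as UTF-8.
escaped = koi8.decode("utf-8", "surrogateescape")
assert escaped.encode("utf-8", "surrogateescape") == koi8
```

With this particular `BASE` and this particular input (no byte pair happens to form valid UTF-8), the custom mapping and the surrogate-escape decoding give the same code points. The two diverge as soon as some bytes do form valid UTF-8, because the UTF-8 path then combines them into one character while the byte-transparent mapping still yields one value per byte.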

Don't take me for not-a-UTF-8-maximalist, but, y'know,
it will never, unfortunately, be all I see, and being able to completely
opt out of additional input processing would be nice;
we're kinda close now with the current hard-7-bit ASCII,
and making it, essentially, I Can't Believe It's Not Just Bytes!,
per pt. 1, would eliminate even more headaches IME.

Putting proverbial KOI-8 or [your grandma's favourite encoding]
through the UTF-8 grinder is much worse than just degrading to strcmp(),
but at this point I think I'm rambling, and have spilled enough ink;
your call to make, at the end of the day.

наб

--d3knoeizpge5g76w
Content-Type: application/pgp-signature; name="signature.asc"

-----BEGIN PGP SIGNATURE-----

iQIzBAABCgAdFiEEfWlHToQCjFzAxEFjvP0LAY0mWPEFAmOKRvUACgkQvP0LAY0m
WPGQKQ/8DREy0fjHETj/p8dWZ1mBHdr1+fkOYnvXcSGdrq1Hq5Qy3s1wdxBlB0ZG
3eNzLlCQevTTJy60/Wr31Rnwwq602n63yPDJpNmEvcvVnahckEQHiu+r6McLg697
LYfrY4SWRzby6YuF8+K2BnLwoHQjUS44BQ1hHBbyTIbnacmebbH7VlQzbKvEznqG
o4dOq5uJw7qttnwI4yllpKqUuDwrW3sKVGX46HUnAeL86o1NVa9OiyvYvK+prmC1
wImpY7UH8egc0U/1bg02MKwVqEeXqUngjLKZAxD10B2seZw/kErW1cFd/8RX+ZjT
SHyWAeZ1NOQSlPDZEOsDgYUZDUPuzsJ/kecjS07JBUQXMuyOdW+61ZqGsczz3yQw
SV62kmcBBt4ulIEPdCgn9b1c9tNGDBdpxi3eZPPIc9kvb7r/c0ZweqwXr5VdaZo3
LIzcOZsEJk8lVXNSeU97+3j9l1gfYFxAD22VIVLOC60xF31RhRVQH60OfNXZqf/M
Zyu41MpxOmU6H5gawXEuZDu3x6JWHl10nLa1It9zZDOsbG1TGzpXEgPs3F8yfHvc
CK0pVuqbbFmXZLRXEHNHd2mZ2FX/qj1hb6T/t6lSLENj6ni+6dY3LVT31Br+cpsS
/En2ShHR0RDu/NZAWf93N7GfrLcKq7u2N0SSk9JxR0G+EwlcTfQ=
=Z//v
-----END PGP SIGNATURE-----

--d3knoeizpge5g76w--