Date: Wed, 26 Apr 2023 20:54:12 +0200
From: наб
To: Florian Weimer
Cc: libc-alpha@sourceware.org, Victor Stinner
Subject: Re: [PATCH v9] POSIX locale covers every byte [BZ# 29511]

Hi!

Long time, apologies.

On Mon, Feb 13, 2023 at 03:52:06PM +0100, Florian Weimer wrote:
> > This largely duplicates the ASCII code with the error path changed
> >
> > There are two user-facing changes:
> >   * nl_langinfo(CODESET) is "POSIX" instead of "ANSI_X3.4-1968"
> >   * mbrtowc() and friends return b if b <= 0x7F else +b
> >
> > Since Issue 7 TC 2/Issue 8, the C/POSIX locale, effectively,
> >   (a) is 1-byte, stateless, and contains 256 characters
> >   (b) they collate in byte order
> >   (c) the first 128 characters are equivalent to ASCII (like previous)
> > cf. https://www.austingroupbugs.net/view.php?id=663 for a summary of
> > changes to the standard;
> > in short, this means that mbrtowc() must never fail and must return
> >   b if b <= 0x7F else ab+c for all bytes b
> >   where c is some constant >=0x80
> >     and a is a positive integer constant
> >
> > By strategically picking c= we land at the tail-end of the
> > Unicode Low Surrogate Area at DC00-DFFF, described as
> > > Isolated surrogate code points have no interpretation;
> > > consequently, no character code charts or names lists
> > > are provided for this range.
> > and match musl
>
> I've thought about this some more, and I don't think this is the
> direction we should be going in.
>
> * Add a UTF-8SE charset to glibc: it's UTF-8 with surrogate encoding (in
>   the Python style).  It should have the property that it can encode
>   every byte string as a string of wchar_t characters, and convert the
>   result back.
>   It's not entirely trivial because we need to handle
>   partial UTF-8 sequences at the end of the buffer carefully.  There
>   might be some warts regarding EILSEQ handling lurking there.  Like the
>   Python approach, it is somewhat imperfect because it's not preserving
>   identity under string concatenation, i.e. f(x) || f(y) is not always
>   equal to f(x || y), but that's just unavoidable.
>
> * Switch the charset for the default C locale to UTF-8SE.  This matches
>   the POSIX requirement that every byte can be encoded.

The main point of LC_CTYPE=POSIX as specified is that it allows you to
process paths (which are sequences of bytes, not characters) in a sane
way - part of that is that collation needs to be correct, so maybe, as
a smoke test, "[a, b, c] < [a, b, c+1] for all a,b,c".

    >>> b'\xc4\xbf'.decode('UTF-8', errors='surrogateescape')
    'Ŀ'
    >>> b'\xc4\xc0'.decode('UTF-8', errors='surrogateescape')
    '\udcc4\udcc0'
    >>>
    >>> [*map(ord, b'\xc4\xbf'.decode('UTF-8', errors='surrogateescape'))]
    [319]
    >>> [*map(ord, b'\xc4\xc0'.decode('UTF-8', errors='surrogateescape'))]
    [56516, 56512]

which, I mean, sure, maybe that's sensible (I wouldn't say so), but

    >>> b'\xef\xbf\xbf'.decode('UTF-8', errors='surrogateescape')
    '\uffff'
    >>> b'\xef\xbf\xc0'.decode('UTF-8', errors='surrogateescape')
    '\udcef\udcbf\udcc0'
    >>>
    >>> [*map(ord, b'\xef\xbf\xbf'.decode('UTF-8', errors='surrogateescape'))]
    [65535]
    >>> [*map(ord, b'\xef\xbf\xc0'.decode('UTF-8', errors='surrogateescape'))]
    [56559, 56511, 56512]

So the second byte string sorts after the first as bytes but before it
once decoded (56559 < 65535), and one character turns into three.
Which means you can't process arbitrary data (pathnames) in a way that
makes sense.

In my opinion this would be /worse/ than the current behaviour,
behaving erratically in the presence of Some Data instead of
simply not supporting it.

> * Work with POSIX to drop the requirement that the C locale needs to be
>   a single-byte locale.

That's not going to happen because it's the /only/ way to process paths.
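To make the smoke test above concrete, here's a rough sketch. The names are made up for illustration, BASE = 0xDC00 is an assumption (it's the offset Python's surrogateescape uses, not necessarily the constant from the patch), and Python's UTF-8 + surrogateescape is only standing in for the proposed UTF-8SE:

```python
# Sketch only: BASE is an assumed offset (the one Python's surrogateescape
# uses), not necessarily the constant picked by the patch.
BASE = 0xDC00


def decode_bytewise(data):
    """One code point per byte: ASCII as-is, high bytes shifted by BASE."""
    return [b if b <= 0x7F else BASE + b for b in data]


def decode_utf8se(data):
    """Python's UTF-8 with surrogateescape, standing in for UTF-8SE."""
    return [ord(ch) for ch in data.decode("utf-8", errors="surrogateescape")]


def first_violation(decode, a, b):
    """Smallest c with decode([a, b, c]) >= decode([a, b, c + 1]), else None."""
    for c in range(0xFF):
        lower = decode(bytes([a, b, c]))
        upper = decode(bytes([a, b, c + 1]))
        if not lower < upper:
            return c
    return None


if __name__ == "__main__":
    # Check the prefix from the sample above under both decodings.
    for name, decode in (("bytewise", decode_bytewise),
                         ("utf8se", decode_utf8se)):
        print(name, first_violation(decode, 0xEF, 0xBF))
```

For the prefix EF BF the bytewise mapping finds no violation, while the surrogateescape stand-in trips at c = 0xBF, which is exactly the pair from the sample above.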
Indeed, XBD 8.2 puts it nicely:
  Users may use the following environment variables to announce specific
  localization requirements to applications.

As a user, I want to be able to announce "each byte is a character,
in natural ordering". This is what LC_CTYPE=C lets me do.
I hope you'll agree this is a good feature to support.

POSIX, also, explicitly says that (XBD 8.2):
5499  1. If the LC_ALL environment variable is defined and is not null, the value of LC_ALL shall
5500     be used.
5501  2. If the LC_* environment variable (LC_COLLATE, LC_CTYPE, LC_MESSAGES,
5502     LC_MONETARY, LC_NUMERIC, LC_TIME) is defined and is not null, the value of the
5503     environment variable shall be used to initialize the category that corresponds to the
5504     environment variable.
5505  3. If the LANG environment variable is defined and is not null, the value of the LANG
5506     environment variable shall be used.
5507  4. If the LANG environment variable is not set or is set to the empty string, the
5508     implementation-defined default locale shall be used.
and XBD 7.2:
3643  All implementations shall define a locale as the default locale, to be invoked when no
3644  environment variables are set, or set to the empty string. This default locale can be the POSIX
3645  locale or any other implementation-defined locale. Some implementations may provide facilities
3646  for local installation administrators to set the default locale, customizing it for each location.
3647  POSIX.1-202x does not require such a facility.

To that end, how's about:
  * invent UTF-8SE encoding as you say
  * invent POSIX encoding like in this patch
    (but move the area to match UTF-8SE probably, it's a good precedent)
  * hook up the POSIX locale to the POSIX encoding as in here
  * change the implementation-defined default locale to POSIX-but-UTF-8SE
  * (maybe) change the default locale on entry to main() to POSIX-but-UTF-8SE

POSIX requires that LC_ALL=POSIX is the default on entry to main().
That said, I wouldn't mind violating /that/, since anything we do with
it is backwards-compatible. Maybe it makes sense to do that for
programs that don't call setlocale() at all, and they'll behave better
when used internationally. Or not.

Logically, this translates to:
  * if the user has their native locale selected, use that
  * if the user has explicitly selected the bytewise locale, use that
  * if the user hasn't configured their locales at all,
    assume they want UTF-8 but degrade sensibly
  * (maybe) if the program hasn't been written with locales in mind,
    assume the user will be using it with UTF-8 input but degrade sensibly

I think this leaves the wolf full and the sheep alive - the default
behaviour is UTF-8(ish), and can be overridden to full UTF-8 or bytes,
per the user's requirements.

Existing users will thus gain the ability to:
  * process data that's UTF-8 but skip over/retain illegal/otherwise-encoded
    bytes losslessly (this makes the sample above a killer feature instead
    of non-sensible, so long as it's an encoding in its own right)
  * correctly process arbitrarily-encoded data as bytes

Thoughts?
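For what it's worth, the lookup order from XBD 8.2 combined with the proposed fallback could be sketched roughly like this. resolve_category() is a made-up helper, and "POSIX-but-UTF-8SE" is only a placeholder for the proposed default (no locale by that name exists); this is not how setlocale() is implemented:

```python
# Placeholder name for the proposed implementation-defined default locale;
# hypothetical, no locale with this name exists.
PROPOSED_DEFAULT = "POSIX-but-UTF-8SE"


def resolve_category(env, category="LC_CTYPE", default=PROPOSED_DEFAULT):
    """Pick the locale name for one category, following XBD 8.2:
    LC_ALL first, then the category's own variable, then LANG,
    then the implementation-defined default."""
    for var in ("LC_ALL", category, "LANG"):
        value = env.get(var)
        if value:  # "defined and is not null"
            return value
    return default


if __name__ == "__main__":
    print(resolve_category({"LANG": "pl_PL.UTF-8"}))                     # native locale
    print(resolve_category({"LANG": "pl_PL.UTF-8", "LC_CTYPE": "C"}))    # explicit bytewise
    print(resolve_category({}))                                          # nothing configured
```

Only the last case changes under the proposal: a user who sets LC_CTYPE=C (or LC_ALL=C) still gets bytes, and a user with a native locale still gets that.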
наб