From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Tue, 6 Sep 2022 20:06:57 +0200
From: =?utf-8?B?0L3QsNCx?=
To: Florian Weimer
Cc: libc-alpha@sourceware.org
Subject: Re: [PATCH] POSIX locale covers every byte [BZ# 29511]
Message-ID: <20220906180657.eo53xia2yqqiv4ri@tarta.nabijaczleweli.xyz>
References: <20220830181932.oggrz6f6itrpyi6g@tarta.nabijaczleweli.xyz>
 <87a67c60x6.fsf@oldenburg.str.redhat.com>
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha512;
	protocol="application/pgp-signature";
	boundary="ujbyoq4cyzan5j56"
Content-Disposition: inline
In-Reply-To: <87a67c60x6.fsf@oldenburg.str.redhat.com>
User-Agent: NeoMutt/20220429

--ujbyoq4cyzan5j56
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

Hi!

On Tue, Sep 06, 2022 at 04:19:01PM +0200, Florian Weimer wrote:
> * наб via Libc-alpha:
>
> > This is a trivial patch, largely duplicating the extant ASCII code
> >
> > There are two user-facing changes:
> > * nl_langinfo(CODESET) is "POSIX" instead of "ANSI_X3.4-1968"
> > * mbrtowc() and friends return b if b <= 0x7F else +b
> >
> > Since Issue 7 TC 2/Issue 8, the C/POSIX locale, effectively:
> > (a) is 1-byte, stateless, and contains 256 characters
> > (b) which collate in byte order
> > (c) the first 128 characters are equivalent to ASCII (like previous)
> > cf. https://www.austingroupbugs.net/view.php?id=663 for a summary of
> > changes to the standard;
> > in short, this means that mbrtowc() must never fail and must return
> > b if b <= 0x7F else ab+c for all bytes b
> > where c is some constant >=0x80
> > and a is a positive integer constant
> >
> > By strategically picking c= we land at the tail-end of the
> > Unicode Low Surrogate Area at DC00-DFFF, described as
> > > Isolated surrogate code points have no interpretation;
> > > consequently, no character code charts or names lists
> > > are provided for this range.
> > and match musl
>
> We don't match Python and its surrogateescape encoding (PEP 838).
404?

> It
> maps invalid bytes in the 0x80…0xff range to U+DC80…U+DCFF.
(The same as musl.)

> It may make
> more sense to align with that.
With a=1 and c=, assuming it's as you say, we very much do?
  $ printf '\x80\xff' | output/elf/ld.so --library-path output/ output/iconv/iconv_prog -fPOSIX -tUCS4 | hd
  00000000  00 00 df 80 00 00 df ff                           |........|
  00000008

> Anyway, regarding mechanics, we'll need a new localedata/charmaps/POSIX
> charmap, I think. This charmap then can be tested against the gconv
> converter.
Hm, the problem with that is tst-tables -> tst-table -> tst-table-from
(and -to) convert by constructing a UTF-8 sequence. The problem with this
approach is that glibc rejects unpaired surrogates. The output for
tst-table-from UTF-8 is:
  ...
  0xED9FBE 0xD7FE
  0xED9FBF 0xD7FF
  0xEE8080 0xE000
  0xEE8081 0xE001
  ...
i.e. there's a gap for the surrogates; and, indeed, the charmap reads
  /xed/x9f/xbb HANGUL JONGSEONG PHIEUPH-THIEUTH
  % /xed/xa0/x80
  % /xed/xad/xbf
  % /xed/xae/x80
  % /xed/xaf/xbf
  % /xed/xb0/x80
  % /xed/xbf/xbf
  .. /xee/x80/x80
with the surrogate range commented-out; this dates back to the inclusion
of UTF-8 generator scripts in 2015
(4a4839c94a4c93ffc0d5b95c69a08b02a57007f2), these exclusions are
deliberate (grep for surrog in localedata/unicode-gen/utf8_gen.py).

Given this limitation, expanding the charmap to ANSI_X3.4-1968 + ..
doesn't actually test much: having them as separate codepoints will
always fail tests, and dot-notation lines are ignored when generating the
comparison tables, so this particular type of test just proves that POSIX
is the same as ANSI_X3.4-1968 for the first 128 characters.

There's already an exhaustive iconv_prog-based testsuite
(cf. additions to iconv/tst-iconv_prog.sh), though.

> You should put the new converters into a separate file (not
> iconv/gconv_simple.c), then the s390x version will use that
> automatically.
Oh, of course! Moved to iconv/gconv_posix.c.

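For reference, the byte mapping rule "b if b <= 0x7F else ab+c" with a=1
can be sketched in C as below. This is only an illustration, not code
from the patch: map_byte and both macro names are hypothetical. The
0xDF00 offset is an assumption inferred from the hd output (0x80 comes
out as 00 00 df 80), and the 0xDC00 offset is the one implied by the
quoted U+DC80…U+DCFF range.

```c
#include <stdint.h>

/* Assumption: offset inferred from the hd output shown earlier in this
   message (0x80 -> 0x0000DF80), not copied from the patch itself.  */
#define POSIX_HIGH_OFFSET 0xDF00u

/* Assumption: offset implied by the quoted U+DC80...U+DCFF range.  */
#define SURROGATEESCAPE_OFFSET 0xDC00u

/* Hypothetical helper: bytes up to 0x7F map to themselves, every other
   byte maps to b + offset (the a = 1 case of "ab + c").  */
static uint32_t
map_byte (unsigned char b, uint32_t offset)
{
  return b <= 0x7F ? (uint32_t) b : (uint32_t) b + offset;
}
```

Under these assumptions, map_byte (0x80, POSIX_HIGH_OFFSET) gives 0xDF80
and map_byte (0xFF, POSIX_HIGH_OFFSET) gives 0xDFFF, while the same bytes
with SURROGATEESCAPE_OFFSET give 0xDC80 and 0xDCFF.
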
> > diff --git a/localedata/locales/POSIX b/localedata/locales/POSIX
> > index 7ec7f1c577..fc34a6abc1 100644
> > --- a/localedata/locales/POSIX
> > +++ b/localedata/locales/POSIX
> > @@ -97,6 +97,20 @@ END LC_CTYPE
> >  LC_COLLATE
> >  % This is the POSIX Locale definition for the LC_COLLATE category.
>
> Isn't this just the C locale?
Yes, C is defined to be POSIX.

> We don't have a separate file for that.
Yes, we very obviously do, seeing as this patch edits it?
Nothing consumes it AFAICT, but.

> > diff --git a/wcsmbs/wcsmbsload.c b/wcsmbs/wcsmbsload.c
> > index 0f0f55f9ed..f87099bcf5 100644
> > --- a/wcsmbs/wcsmbsload.c
> > +++ b/wcsmbs/wcsmbsload.c
> > @@ -33,10 +33,10 @@ static const struct __gconv_step to_wc =
> >    .__shlib_handle = NULL,
> >    .__modname = NULL,
> >    .__counter = INT_MAX,
> > -  .__from_name = (char *) "ANSI_X3.4-1968//TRANSLIT",
> > +  .__from_name = (char *) "POSIX",
> >    .__to_name = (char *) "INTERNAL",
> > -  .__fct = __gconv_transform_ascii_internal,
> > -  .__btowc_fct = __gconv_btwoc_ascii,
> > +  .__fct = __gconv_transform_posix_internal,
> > +  .__btowc_fct = __gconv_btwoc_posix,
> >    .__init_fct = NULL,
> >    .__end_fct = NULL,
> >    .__min_needed_from = 1,
> > @@ -53,8 +53,8 @@ static const struct __gconv_step to_mb =
> >    .__modname = NULL,
> >    .__counter = INT_MAX,
> >    .__from_name = (char *) "INTERNAL",
> > -  .__to_name = (char *) "ANSI_X3.4-1968//TRANSLIT",
> > -  .__fct = __gconv_transform_internal_ascii,
> > +  .__to_name = (char *) "POSIX",
> > +  .__fct = __gconv_transform_internal_posix,
> >    .__btowc_fct = NULL,
> >    .__init_fct = NULL,
> >    .__end_fct = NULL,
>
> This makes the comment on __wcsmbs_gconv_fcts_c in the same file
> obsolete.
Comment fixed.

> Thanks,
> Florian
New patchset in followup.

Best,
наб

--ujbyoq4cyzan5j56
Content-Type: application/pgp-signature; name="signature.asc"

-----BEGIN PGP SIGNATURE-----

iQIzBAABCgAdFiEEfWlHToQCjFzAxEFjvP0LAY0mWPEFAmMXjEAACgkQvP0LAY0m
WPGkCg/9GX6ehwPT1ZXVbX6U3kGKBUc8rEEAj+R7fVnJJnCfPcWWARCyb565Ht1j
F66+3XXnuG6Ye40hjU6QYFjqqShW5BpVGQTibPNUW+UaB6YK/sHVIB6Agnc+f5BS
mvU95vddiv6A+h9RK6s2PvG6XWxzSmUbsN/Kgsv9JezrCB62jicP8sL/twVG1wJV
VWn5xJmdraDWG6hxqygjVjS+aVPyG+nACf/bKGLyWNtKnqmYtVeLkHf/98/MoTDn
1you7bc8PqjW7LuP0SufHD0VIAH5OI6XqCtDn3YGbI4k3wQd6fiVYFCe0Tnh/oYH
c/CtxTIaRQU0P8Cz0YZ26MG4EjXc+81zqtpfJG9vsAa1ZZWBCAr5vSzHrx0jZb3Y
1ylf0YVHTyG2+BuXvFomdQqmoS3olNQMrQs6om3QsdGq1runBLH8jL2b/HrNq2Ae
D9/U3Bgf9S8GiePbBtre3ypR69MkQUlvW1vKsEoiY23QN62Cld+uyp4VgW7GQp+f
YfM+4BupQ8Lx84nhJCeoeGZM8E2DyQ+gK+ESBHjiZM0yaSFlVhSIAGbSBx6tbG1L
GDt9ibhNRbsOfscxznBVSamrhla2Syna3oWlTmccjyDvLDjPwQWghPemOpTwXxNf
/CkBKjHjwH+TSMIcVNIJeosuh+iXfn9JOYtOXInLYn4ff9cjxvA=
=+Rni
-----END PGP SIGNATURE-----

--ujbyoq4cyzan5j56--