From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by sourceware.org (Postfix) with ESMTPS id 36CF33858031 for ; Wed, 9 Nov 2022 14:20:33 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 36CF33858031 Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=redhat.com Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=redhat.com DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1668003632; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=H1pVavBq2WmfeQLvWU6mmldHd+6EKi6D0NNFItUc790=; b=bW5Y+YAUrZkWnrVBIh8yTsrGmOfKLCmy897Cr2Hu023T0DrDV1PpoeEfAhXNpAHGo/bKJT +AMFKhQlqafXUKNQpqbsmGtxmIaSEV2R9dCErYm5/OCcftquu1wI5F4CpwK0ce1hdIl5/R IjnBwfLlJknrsIBRYO6Ah7HP8zDmdFc= Received: from mimecast-mx02.redhat.com (mx3-rdu2.redhat.com [66.187.233.73]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id us-mta-433-mWoJBe-pOnO7rC5uBC1q6A-1; Wed, 09 Nov 2022 09:20:29 -0500 X-MC-Unique: mWoJBe-pOnO7rC5uBC1q6A-1 Received: from smtp.corp.redhat.com (int-mx04.intmail.prod.int.rdu2.redhat.com [10.11.54.4]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id 68ADF280D728; Wed, 9 Nov 2022 14:20:29 +0000 (UTC) Received: from oldenburg.str.redhat.com (unknown [10.2.16.36]) by smtp.corp.redhat.com (Postfix) with ESMTPS id 74B95202903F; Wed, 9 Nov 2022 14:20:28 +0000 (UTC) From: Florian Weimer To: =?utf-8?B?0L3QsNCx?= Cc: libc-alpha@sourceware.org, Victor Stinner Subject: Re: [PATCH v6 2/2] POSIX locale covers every byte [BZ# 29511] References: <969aa82c8d5904c1d2040bba87abe2f17a0dc647.1667409408.git.nabijaczleweli@nabijaczleweli.xyz> Date: Wed, 09 Nov 2022 15:20:26 +0100 In-Reply-To: <969aa82c8d5904c1d2040bba87abe2f17a0dc647.1667409408.git.nabijaczleweli@nabijaczleweli.xyz> (=?utf-8?B?ItC90LDQsSIncw==?= message of "Wed, 2 Nov 2022 18:17:17 +0100") Message-ID: <874jv8dxat.fsf@oldenburg.str.redhat.com> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/27.2 (gnu/linux) MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.1 on 10.11.54.4 X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: redhat.com Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-Spam-Status: No, score=-11.0 required=5.0 tests=BAYES_00,DKIMWL_WL_HIGH,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,GIT_PATCH_0,RCVD_IN_DNSWL_NONE,RCVD_IN_MSPIKE_H2,SPF_HELO_NONE,SPF_NONE,TXREP autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org List-Id: * =D0=BD=D0=B0=D0=B1: > This is a logistically trivial patch, > largely duplicating the extant ASCII code with the error path changed I wouldn't say it's trivial in the commit message. 8-) > There are two user-facing changes: > * nl_langinfo(CODESET) is "POSIX" instead of "ANSI_X3.4-1968" > * mbrtowc() and friends return b if b <=3D 0x7F else +b > > Since Issue 7 TC 2/Issue 8, the C/POSIX locale, effectively, > (a) is 1-byte, stateless, and contains 256 characters > (b) they collate in byte order > (c) the first 128 characters are equivalent to ASCII (like previous) > cf. https://www.austingroupbugs.net/view.php?id=3D663 for a summary of > changes to the standard; > in short, this means that mbrtowc() must never fail and must return > b if b <=3D 0x7F else ab+c for all bytes b > where c is some constant >=3D0x80 > and a is a positive integer constant > > By strategically picking c=3D we land at the tail-end of the > Unicode Low Surrogate Area at DC00-DFFF, described as > > Isolated surrogate code points have no interpretation; > > consequently, no character code charts or names lists > > are provided for this range. > and match musl Sadly this doesn't match Python and PEP 540: >>> b'\x80'.decode('UTF-8', errors=3D'surrogateescape') '\udc80' I believe the implementation translates this to 0xDF80 instead. Not sure what is more important here, musl compatibility or Python compatibility. Cc:ing Victor in case he as comments. I should probably ask on the musl list as well as how this divergence came to pass. This change definitely needs a NEWS entry. The mechanics of the patch look okay to me, just a few nits below. > diff --git a/iconv/gconv_int.h b/iconv/gconv_int.h > index 1c6745043e..45ab1edfad 100644 > --- a/iconv/gconv_int.h > +++ b/iconv/gconv_int.h > @@ -281,6 +281,8 @@ extern int __gconv_compare_alias (const char *name1, = const char *name2) > =20 > __BUILTIN_TRANSFORM (__gconv_transform_ascii_internal); > __BUILTIN_TRANSFORM (__gconv_transform_internal_ascii); > +__BUILTIN_TRANSFORM (__gconv_transform_posix_internal); > +__BUILTIN_TRANSFORM (__gconv_transform_internal_posix); > __BUILTIN_TRANSFORM (__gconv_transform_utf8_internal); > __BUILTIN_TRANSFORM (__gconv_transform_internal_utf8); > __BUILTIN_TRANSFORM (__gconv_transform_ucs2_internal); > @@ -299,6 +301,11 @@ __BUILTIN_TRANSFORM (__gconv_transform_utf16_interna= l); > only ASCII characters. */ > extern wint_t __gconv_btwoc_ascii (struct __gconv_step *step, unsigned c= har c); > =20 > +/* Specialized conversion function for a single byte to INTERNAL, > + identity-mapping bytes [0, 0x7F], and moving [0x80, 0xFF] into the en= d > + of the Low Surrogate Area at [U+DF80, U+DFFF]. */ > +extern wint_t __gconv_btwoc_posix (struct __gconv_step *step, unsigned c= har c); > + > #endif Missing attribute_hidden. Yes, it's also missing from __gconv_btwoc_ascii. The linker probably papers over it. > =20 > __END_DECLS > diff --git a/iconv/gconv_posix.c b/iconv/gconv_posix.c > new file mode 100644 > index 0000000000..dcb13fbb43 > --- /dev/null > +++ b/iconv/gconv_posix.c > @@ -0,0 +1,96 @@ > +/* Simple transformations functions. I think this line should say something about surrogate-escape encoding for the POSIX locale. > +#define MIN_NEEDED_INPUT=09MIN_NEEDED_FROM > +#define MIN_NEEDED_OUTPUT=09MIN_NEEDED_TO > +#define LOOPFCT=09=09=09FROM_LOOP > +#define BODY \ > + {=09=09=09=09=09=09=09=09=09 \ > + uint32_t val =3D *((const uint32_t *) inptr);=09=09=09=09 \ > + if (__glibc_unlikely ((val > 0x7f && val < 0xdf80) || val > 0xdfff))= \ > + {=09=09=09=09=09=09=09=09=09 \ > +=09UNICODE_TAG_HANDLER (val, 4);=09=09=09=09=09 \ > +=09STANDARD_TO_LOOP_ERR_HANDLER (4);=09=09=09=09 \ > + }=09=09=09=09=09=09=09=09=09 \ > + else=09=09=09=09=09=09=09=09 \ > + {=09=09=09=09=09=09=09=09=09 \ > +=09if (__glibc_unlikely (val > 0x7f))=09=09=09=09 \ > +=09 val -=3D 0xdf00;=09=09=09=09=09=09 \ > +=09*outptr++ =3D val;=09=09=09=09=09=09 \ > +=09inptr +=3D sizeof (uint32_t);=09=09=09=09=09 \ > + }=09=09=09=09=09=09=09=09=09 \ > + } I suggest to drop the last __glibc_unlikely here because it's input-dependent. > +#define LOOP_NEED_FLAGS > +#include > +#include > diff --git a/iconv/tst-iconv_prog.sh b/iconv/tst-iconv_prog.sh > index b3d8bf5110..a24d8d2207 100644 > --- a/iconv/tst-iconv_prog.sh > +++ b/iconv/tst-iconv_prog.sh > @@ -285,3 +285,46 @@ for errorcommand in "${errorarray[@]}"; do > execute_test > check_errtest_result > done > + > +allbytes () > +{ > + for (( i =3D 0; i <=3D 255; i++ )); do > + printf '\'"$(printf "%o" "$i")" > + done > +} > + > +allucs4be () > +{ > + for (( i =3D 0; i <=3D 127; i++ )); do > + printf '\0\0\0\'"$(printf "%o" "$i")" > + done > + for (( i =3D 128; i <=3D 255; i++ )); do > + printf '\0\0\xdf\'"$(printf "%o" "$i")" > + done > +} > + > +check_posix_result () > +{ > + if [ $? -eq 0 ]; then > + result=3DPASS > + else > + result=3DFAIL > + fi > + > + echo "$result: from \"$1\", to: \"$2\"" > + > + if [ "$result" !=3D "PASS" ]; then > + exit 1 > + fi > +} > + > +check_posix_encoding () > +{ > + eval PROG=3D\"$ICONV\" > + allbytes | $PROG -f POSIX -t UCS-4BE | cmp -s - <(allucs4be) > + check_posix_result POSIX UCS-4BE > + allucs4be | $PROG -f UCS-4BE -t POSIX | cmp -s - <(allbytes) > + check_posix_result UCS-4BE POSIX > +} > + > +check_posix_encoding > diff --git a/iconvdata/tst-tables.sh b/iconvdata/tst-tables.sh > index 4207b44175..33a02158ac 100755 > --- a/iconvdata/tst-tables.sh > +++ b/iconvdata/tst-tables.sh > @@ -31,6 +31,7 @@ cat < # Keep this list in the same order as gconv-modules. > # > # charset name table name comment > + POSIX > ASCII ANSI_X3.4-1968 > ISO646-GB BS_4730 > ISO646-CA CSA_Z243.4-1985-1 > diff --git a/inet/tst-idna_name_classify.c b/inet/tst-idna_name_classify.= c > index bfd34eee31..b379481844 100644 > --- a/inet/tst-idna_name_classify.c > +++ b/inet/tst-idna_name_classify.c > @@ -37,11 +37,11 @@ do_test (void) > puts ("info: C locale tests"); > locale_insensitive_tests (); > TEST_COMPARE (__idna_name_classify ("abc\200def"), > - idna_name_encoding_error); > + idna_name_nonascii); > TEST_COMPARE (__idna_name_classify ("abc\200\\def"), > - idna_name_encoding_error); > + idna_name_nonascii_backslash); > TEST_COMPARE (__idna_name_classify ("abc\377def"), > - idna_name_encoding_error); > + idna_name_nonascii); > =20 > puts ("info: en_US.ISO-8859-1 locale tests"); > if (setlocale (LC_CTYPE, "en_US.ISO-8859-1") =3D=3D 0) This seems to be okay, there is further test coverage for idna_name_encoding_error. > diff --git a/locale/tst-C-locale.c b/locale/tst-C-locale.c > index 6bd0367069..f30396ae12 100644 > --- a/locale/tst-C-locale.c > +++ b/locale/tst-C-locale.c > @@ -229,6 +229,75 @@ run_test (const char *locname) > STRTEST (YESSTR, ""); > STRTEST (NOSTR, ""); > =20 > +#define CONVTEST(b, v) \ > + {=09=09=09=09=09=09=09=09=09 \ > + unsigned char bs[] =3D {b, 0};=09=09=09=09=09 \ > + mbstate_t ctx =3D {};=09=09=09=09=09=09=09 \ > + wchar_t wc =3D -1;=09=09=09=09=09=09=09 \ > + size_t sz =3D mbrtowc(&wc, (char *) bs, 1, &ctx);=09=09=09 \ Missing space before '(' (also in other cases below). Not sure if the macros are needed, maybe write one loop for each direction with a condition in it? > diff --git a/stdio-common/tst-printf-bz25691.c b/stdio-common/tst-printf-= bz25691.c > index 44844e71c3..e66242b58f 100644 > --- a/stdio-common/tst-printf-bz25691.c > +++ b/stdio-common/tst-printf-bz25691.c > @@ -30,6 +30,8 @@ > static int > do_test (void) > { > + setlocale(LC_CTYPE, "C.UTF-8"); > + > mtrace (); > =20 > /* For 's' conversion specifier with 'l' modifier the array must be What's the rationale for this change? If it is really required, you must also update stdio-common/Makefile with a new dependency on $(gen-locales). Thanks, Florian