From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <fweimer@redhat.com>
Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124])
	by sourceware.org (Postfix) with ESMTPS id 36CF33858031
	for <libc-alpha@sourceware.org>; Wed,  9 Nov 2022 14:20:33 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 36CF33858031
Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=redhat.com
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=redhat.com
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com;
	s=mimecast20190719; t=1668003632;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:cc:mime-version:mime-version:content-type:content-type:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=H1pVavBq2WmfeQLvWU6mmldHd+6EKi6D0NNFItUc790=;
	b=bW5Y+YAUrZkWnrVBIh8yTsrGmOfKLCmy897Cr2Hu023T0DrDV1PpoeEfAhXNpAHGo/bKJT
	+AMFKhQlqafXUKNQpqbsmGtxmIaSEV2R9dCErYm5/OCcftquu1wI5F4CpwK0ce1hdIl5/R
	IjnBwfLlJknrsIBRYO6Ah7HP8zDmdFc=
Received: from mimecast-mx02.redhat.com (mx3-rdu2.redhat.com
 [66.187.233.73]) by relay.mimecast.com with ESMTP with STARTTLS
 (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id
 us-mta-433-mWoJBe-pOnO7rC5uBC1q6A-1; Wed, 09 Nov 2022 09:20:29 -0500
X-MC-Unique: mWoJBe-pOnO7rC5uBC1q6A-1
Received: from smtp.corp.redhat.com (int-mx04.intmail.prod.int.rdu2.redhat.com [10.11.54.4])
	(using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits))
	(No client certificate requested)
	by mimecast-mx02.redhat.com (Postfix) with ESMTPS id 68ADF280D728;
	Wed,  9 Nov 2022 14:20:29 +0000 (UTC)
Received: from oldenburg.str.redhat.com (unknown [10.2.16.36])
	by smtp.corp.redhat.com (Postfix) with ESMTPS id 74B95202903F;
	Wed,  9 Nov 2022 14:20:28 +0000 (UTC)
From: Florian Weimer <fweimer@redhat.com>
To: =?utf-8?B?0L3QsNCx?= <nabijaczleweli@nabijaczleweli.xyz>
Cc: libc-alpha@sourceware.org, Victor Stinner <vstinner@redhat.com>
Subject: Re: [PATCH v6 2/2] POSIX locale covers every byte [BZ# 29511]
References: <ad6720f44981d53ce50804d4ea3696ca1b7cd0b7.1663768863.git.nabijaczleweli@nabijaczleweli.xyz>
	<969aa82c8d5904c1d2040bba87abe2f17a0dc647.1667409408.git.nabijaczleweli@nabijaczleweli.xyz>
Date: Wed, 09 Nov 2022 15:20:26 +0100
In-Reply-To: <969aa82c8d5904c1d2040bba87abe2f17a0dc647.1667409408.git.nabijaczleweli@nabijaczleweli.xyz>
	(=?utf-8?B?ItC90LDQsSIncw==?= message of "Wed, 2 Nov 2022 18:17:17 +0100")
Message-ID: <874jv8dxat.fsf@oldenburg.str.redhat.com>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/27.2 (gnu/linux)
MIME-Version: 1.0
X-Scanned-By: MIMEDefang 3.1 on 10.11.54.4
X-Mimecast-Spam-Score: 0
X-Mimecast-Originator: redhat.com
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
X-Spam-Status: No, score=-11.0 required=5.0 tests=BAYES_00,DKIMWL_WL_HIGH,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,GIT_PATCH_0,RCVD_IN_DNSWL_NONE,RCVD_IN_MSPIKE_H2,SPF_HELO_NONE,SPF_NONE,TXREP autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org
List-Id: <libc-alpha.sourceware.org>

* =D0=BD=D0=B0=D0=B1:

> This is a logistically trivial patch,
> largely duplicating the extant ASCII code with the error path changed

I wouldn't say it's trivial in the commit message. 8-)

> There are two user-facing changes:
>   * nl_langinfo(CODESET) is "POSIX" instead of "ANSI_X3.4-1968"
>   * mbrtowc() and friends return b if b <=3D 0x7F else <UDF00>+b
>
> Since Issue 7 TC 2/Issue 8, the C/POSIX locale, effectively,
>   (a) is 1-byte, stateless, and contains 256 characters
>   (b) they collate in byte order
>   (c) the first 128 characters are equivalent to ASCII (like previous)
> cf. https://www.austingroupbugs.net/view.php?id=3D663 for a summary of
> changes to the standard;
> in short, this means that mbrtowc() must never fail and must return
>   b if b <=3D 0x7F else ab+c for all bytes b
>   where c is some constant >=3D0x80
>     and a is a positive integer constant
>
> By strategically picking c=3D<UDF00> we land at the tail-end of the
> Unicode Low Surrogate Area at DC00-DFFF, described as
>   > Isolated surrogate code points have no interpretation;
>   > consequently, no character code charts or names lists
>   > are provided for this range.
> and match musl

Sadly this doesn't match Python and PEP 540:

>>> b'\x80'.decode('UTF-8', errors=3D'surrogateescape')
'\udc80'

I believe the implementation translates this to 0xDF80 instead.

Not sure what is more important here, musl compatibility or Python
compatibility.  Cc:ing Victor in case he as comments.  I should probably
ask on the musl list as well as how this divergence came to pass.

This change definitely needs a NEWS entry.

The mechanics of the patch look okay to me, just a few nits below.

> diff --git a/iconv/gconv_int.h b/iconv/gconv_int.h
> index 1c6745043e..45ab1edfad 100644
> --- a/iconv/gconv_int.h
> +++ b/iconv/gconv_int.h
> @@ -281,6 +281,8 @@ extern int __gconv_compare_alias (const char *name1, =
const char *name2)
> =20
>  __BUILTIN_TRANSFORM (__gconv_transform_ascii_internal);
>  __BUILTIN_TRANSFORM (__gconv_transform_internal_ascii);
> +__BUILTIN_TRANSFORM (__gconv_transform_posix_internal);
> +__BUILTIN_TRANSFORM (__gconv_transform_internal_posix);
>  __BUILTIN_TRANSFORM (__gconv_transform_utf8_internal);
>  __BUILTIN_TRANSFORM (__gconv_transform_internal_utf8);
>  __BUILTIN_TRANSFORM (__gconv_transform_ucs2_internal);
> @@ -299,6 +301,11 @@ __BUILTIN_TRANSFORM (__gconv_transform_utf16_interna=
l);
>     only ASCII characters.  */
>  extern wint_t __gconv_btwoc_ascii (struct __gconv_step *step, unsigned c=
har c);
> =20
> +/* Specialized conversion function for a single byte to INTERNAL,
> +   identity-mapping bytes [0, 0x7F], and moving [0x80, 0xFF] into the en=
d
> +   of the Low Surrogate Area at [U+DF80, U+DFFF].  */
> +extern wint_t __gconv_btwoc_posix (struct __gconv_step *step, unsigned c=
har c);
> +
>  #endif

Missing attribute_hidden.  Yes, it's also missing from
__gconv_btwoc_ascii.  The linker probably papers over it.

> =20
>  __END_DECLS
> diff --git a/iconv/gconv_posix.c b/iconv/gconv_posix.c
> new file mode 100644
> index 0000000000..dcb13fbb43
> --- /dev/null
> +++ b/iconv/gconv_posix.c
> @@ -0,0 +1,96 @@
> +/* Simple transformations functions.

I think this line should say something about surrogate-escape encoding
for the POSIX locale.

> +#define MIN_NEEDED_INPUT=09MIN_NEEDED_FROM
> +#define MIN_NEEDED_OUTPUT=09MIN_NEEDED_TO
> +#define LOOPFCT=09=09=09FROM_LOOP
> +#define BODY \
> +  {=09=09=09=09=09=09=09=09=09      \
> +    uint32_t val =3D *((const uint32_t *) inptr);=09=09=09=09      \
> +    if (__glibc_unlikely ((val > 0x7f && val < 0xdf80) || val > 0xdfff))=
      \
> +      {=09=09=09=09=09=09=09=09=09      \
> +=09UNICODE_TAG_HANDLER (val, 4);=09=09=09=09=09      \
> +=09STANDARD_TO_LOOP_ERR_HANDLER (4);=09=09=09=09      \
> +      }=09=09=09=09=09=09=09=09=09      \
> +    else=09=09=09=09=09=09=09=09      \
> +      {=09=09=09=09=09=09=09=09=09      \
> +=09if (__glibc_unlikely (val > 0x7f))=09=09=09=09      \
> +=09  val -=3D 0xdf00;=09=09=09=09=09=09      \
> +=09*outptr++ =3D val;=09=09=09=09=09=09      \
> +=09inptr +=3D sizeof (uint32_t);=09=09=09=09=09      \
> +      }=09=09=09=09=09=09=09=09=09      \
> +  }

I suggest to drop the last __glibc_unlikely here because it's
input-dependent.

> +#define LOOP_NEED_FLAGS
> +#include <iconv/loop.c>
> +#include <iconv/skeleton.c>
> diff --git a/iconv/tst-iconv_prog.sh b/iconv/tst-iconv_prog.sh
> index b3d8bf5110..a24d8d2207 100644
> --- a/iconv/tst-iconv_prog.sh
> +++ b/iconv/tst-iconv_prog.sh
> @@ -285,3 +285,46 @@ for errorcommand in "${errorarray[@]}"; do
>    execute_test
>    check_errtest_result
>  done
> +
> +allbytes ()
> +{
> +  for (( i =3D 0; i <=3D 255; i++ )); do
> +    printf '\'"$(printf "%o" "$i")"
> +  done
> +}
> +
> +allucs4be ()
> +{
> +  for (( i =3D 0; i <=3D 127; i++ )); do
> +    printf '\0\0\0\'"$(printf "%o" "$i")"
> +  done
> +  for (( i =3D 128; i <=3D 255; i++ )); do
> +    printf '\0\0\xdf\'"$(printf "%o" "$i")"
> +  done
> +}
> +
> +check_posix_result ()
> +{
> +  if [ $? -eq 0 ]; then
> +    result=3DPASS
> +  else
> +    result=3DFAIL
> +  fi
> +
> +  echo "$result: from \"$1\", to: \"$2\""
> +
> +  if [ "$result" !=3D "PASS" ]; then
> +    exit 1
> +  fi
> +}
> +
> +check_posix_encoding ()
> +{
> +  eval PROG=3D\"$ICONV\"
> +  allbytes  | $PROG -f POSIX -t UCS-4BE | cmp -s - <(allucs4be)
> +  check_posix_result POSIX UCS-4BE
> +  allucs4be | $PROG -f UCS-4BE -t POSIX | cmp -s - <(allbytes)
> +  check_posix_result UCS-4BE POSIX
> +}
> +
> +check_posix_encoding
> diff --git a/iconvdata/tst-tables.sh b/iconvdata/tst-tables.sh
> index 4207b44175..33a02158ac 100755
> --- a/iconvdata/tst-tables.sh
> +++ b/iconvdata/tst-tables.sh
> @@ -31,6 +31,7 @@ cat <<EOF |
>    # Keep this list in the same order as gconv-modules.
>    #
>    # charset name    table name          comment
> +  POSIX
>    ASCII             ANSI_X3.4-1968
>    ISO646-GB         BS_4730
>    ISO646-CA         CSA_Z243.4-1985-1
> diff --git a/inet/tst-idna_name_classify.c b/inet/tst-idna_name_classify.=
c
> index bfd34eee31..b379481844 100644
> --- a/inet/tst-idna_name_classify.c
> +++ b/inet/tst-idna_name_classify.c
> @@ -37,11 +37,11 @@ do_test (void)
>    puts ("info: C locale tests");
>    locale_insensitive_tests ();
>    TEST_COMPARE (__idna_name_classify ("abc\200def"),
> -                idna_name_encoding_error);
> +                idna_name_nonascii);
>    TEST_COMPARE (__idna_name_classify ("abc\200\\def"),
> -                idna_name_encoding_error);
> +                idna_name_nonascii_backslash);
>    TEST_COMPARE (__idna_name_classify ("abc\377def"),
> -                idna_name_encoding_error);
> +                idna_name_nonascii);
> =20
>    puts ("info: en_US.ISO-8859-1 locale tests");
>    if (setlocale (LC_CTYPE, "en_US.ISO-8859-1") =3D=3D 0)

This seems to be okay, there is further test coverage for
idna_name_encoding_error.

> diff --git a/locale/tst-C-locale.c b/locale/tst-C-locale.c
> index 6bd0367069..f30396ae12 100644
> --- a/locale/tst-C-locale.c
> +++ b/locale/tst-C-locale.c
> @@ -229,6 +229,75 @@ run_test (const char *locname)
>    STRTEST (YESSTR, "");
>    STRTEST (NOSTR, "");
> =20
> +#define CONVTEST(b, v) \
> +  {=09=09=09=09=09=09=09=09=09      \
> +    unsigned char bs[] =3D {b, 0};=09=09=09=09=09      \
> +    mbstate_t ctx =3D {};=09=09=09=09=09=09=09      \
> +    wchar_t wc =3D -1;=09=09=09=09=09=09=09      \
> +    size_t sz =3D mbrtowc(&wc, (char *) bs, 1, &ctx);=09=09=09      \

Missing space before '(' (also in other cases below).

Not sure if the macros are needed, maybe write one loop for each
direction with a condition in it?

> diff --git a/stdio-common/tst-printf-bz25691.c b/stdio-common/tst-printf-=
bz25691.c
> index 44844e71c3..e66242b58f 100644
> --- a/stdio-common/tst-printf-bz25691.c
> +++ b/stdio-common/tst-printf-bz25691.c
> @@ -30,6 +30,8 @@
>  static int
>  do_test (void)
>  {
> +  setlocale(LC_CTYPE, "C.UTF-8");
> +
>    mtrace ();
> =20
>    /* For 's' conversion specifier with 'l' modifier the array must be

What's the rationale for this change?  If it is really required, you
must also update stdio-common/Makefile with a new dependency on
$(gen-locales).

Thanks,
Florian