From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <fweimer@redhat.com>
Received: from us-smtp-1.mimecast.com (us-smtp-delivery-1.mimecast.com
 [205.139.110.120])
 by sourceware.org (Postfix) with ESMTP id 51D4C388F040
 for <libc-alpha@sourceware.org>; Mon, 29 Jun 2020 09:42:53 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.3.2 sourceware.org 51D4C388F040
Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com
 [209.132.183.4]) (Using TLS) by relay.mimecast.com with ESMTP id
 us-mta-454-rdgjAxlBM8i16hxcRr0tkw-1; Mon, 29 Jun 2020 05:42:51 -0400
X-MC-Unique: rdgjAxlBM8i16hxcRr0tkw-1
Received: from smtp.corp.redhat.com (int-mx02.intmail.prod.int.phx2.redhat.com
 [10.5.11.12])
 (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by mimecast-mx01.redhat.com (Postfix) with ESMTPS id 54804464
 for <libc-alpha@sourceware.org>; Mon, 29 Jun 2020 09:42:50 +0000 (UTC)
Received: from oldenburg2.str.redhat.com (ovpn-112-196.ams2.redhat.com
 [10.36.112.196])
 by smtp.corp.redhat.com (Postfix) with ESMTPS id A173B60BF3;
 Mon, 29 Jun 2020 09:42:46 +0000 (UTC)
From: Florian Weimer <fweimer@redhat.com>
To: Carlos O'Donell via Libc-alpha <libc-alpha@sourceware.org>
Subject: Re: [PATCH 2/2] Add new C.UTF-8 locale (Bug 17318)
References: <75d21bd8-2698-2e25-969c-4e086c90abd9@redhat.com>
 <f33fa33f-6206-8825-a870-c2a6a5eff609@redhat.com>
Date: Mon, 29 Jun 2020 11:42:45 +0200
In-Reply-To: <f33fa33f-6206-8825-a870-c2a6a5eff609@redhat.com> (Carlos
 O'Donell via Libc-alpha's message of "Mon, 29 Jun 2020 00:22:48
 -0400")
Message-ID: <877dvq1a3u.fsf@oldenburg2.str.redhat.com>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.3 (gnu/linux)
MIME-Version: 1.0
X-Scanned-By: MIMEDefang 2.79 on 10.5.11.12
X-Mimecast-Spam-Score: 0
X-Mimecast-Originator: redhat.com
Content-Type: text/plain
X-Spam-Status: No, score=-12.8 required=5.0 tests=BAYES_00, DKIMWL_WL_HIGH,
 DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, GIT_PATCH_0,
 RCVD_IN_DNSWL_NONE, RCVD_IN_MSPIKE_H3, RCVD_IN_MSPIKE_WL, SPF_HELO_NONE,
 SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.2
X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on
 server2.sourceware.org
X-BeenThere: libc-alpha@sourceware.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Libc-alpha mailing list <libc-alpha.sourceware.org>
List-Unsubscribe: <http://sourceware.org/mailman/options/libc-alpha>,
 <mailto:libc-alpha-request@sourceware.org?subject=unsubscribe>
List-Archive: <https://sourceware.org/pipermail/libc-alpha/>
List-Post: <mailto:libc-alpha@sourceware.org>
List-Help: <mailto:libc-alpha-request@sourceware.org?subject=help>
List-Subscribe: <http://sourceware.org/mailman/listinfo/libc-alpha>,
 <mailto:libc-alpha-request@sourceware.org?subject=subscribe>
X-List-Received-Date: Mon, 29 Jun 2020 09:42:54 -0000

* Carlos O'Donell via Libc-alpha:

> diff --git a/locale/programs/charmap.c b/locale/programs/charmap.c
> index c23e50944f..d89d788a9b 100644
> --- a/locale/programs/charmap.c
> +++ b/locale/programs/charmap.c
> @@ -49,7 +49,7 @@ static void new_width (struct linereader *cmfile, struct charmap_t *result,

> @@ -285,6 +285,27 @@ parse_charmap (struct linereader *cmfile, int verbose, int be_quiet)
>    enum token_t ellipsis = 0;
>    int step = 1;
>  
> +  /* POSIX explicitly requires that ellipsis processing do the
> +     following: "Bytes shall be treated as unsigned octets, and carry
> +     shall be propagated between the bytes as necessary to represent the
> +     range."  It then goes on to say that such a declaration should
> +     never be specified because it creates NULL bytes.  Therefore we
> +     error on this condition (see charmap_new_char).  However this still
> +     leaves a problem for encodings which use less than the full 8-bits,
> +     like UTF-8, and in such encodings you can use an ellipsis to
> +     silently and accidentally create invalid ranges.  In UTF-8 you have
> +     only the first 6-bits of the first byte and if your ellipsis covers
> +     a code point range larger than this 64 code point block the output
> +     is going to be an invalid non-UTF-8 multi-byte sequence.  Thus for
> +     UTF-8 we add a speical ellipsis handling loop that can increment
> +     UTF-8 multi-byte output effectively and for UTF-8 we allow larger
> +     ellipsis ranges without error.  There may still be other encodings
> +     for which the ellipsis will still generate invalid multi-byte
> +     output, but not for UTF-8.  The only alternative would be to call
> +     gconv for each Unicode code point in the loop to convert it to the
> +     appropriate multi-byte output, but that would be slow.  */

Typo: speical
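
For illustration only (this is not code from the patch): a minimal
sketch of why the POSIX byte-wise carry described in the comment above
can produce invalid UTF-8, and what direct encoding of the code point
does instead.  The names sketch_utf8_encode and sketch_posix_increment
are made up for this example and are not glibc functions; in
particular, sketch_utf8_encode is an assumption about what a helper
like output_utf8_bytes might do, not its actual implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not the patch's output_utf8_bytes): encode one
   Unicode code point as UTF-8.  Returns the number of bytes written
   (1 to 4), or 0 for surrogates and values above U+10FFFF.  */
static int
sketch_utf8_encode (uint32_t cp, unsigned char buf[4])
{
  assert (buf != NULL);
  if (cp < 0x80)
    {
      buf[0] = (unsigned char) cp;
      return 1;
    }
  if (cp < 0x800)
    {
      buf[0] = (unsigned char) (0xC0 | (cp >> 6));
      buf[1] = (unsigned char) (0x80 | (cp & 0x3F));
      return 2;
    }
  if (cp >= 0xD800 && cp <= 0xDFFF)
    return 0;			/* Surrogates are not encodable.  */
  if (cp < 0x10000)
    {
      buf[0] = (unsigned char) (0xE0 | (cp >> 12));
      buf[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
      buf[2] = (unsigned char) (0x80 | (cp & 0x3F));
      return 3;
    }
  if (cp <= 0x10FFFF)
    {
      buf[0] = (unsigned char) (0xF0 | (cp >> 18));
      buf[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
      buf[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
      buf[3] = (unsigned char) (0x80 | (cp & 0x3F));
      return 4;
    }
  return 0;
}

/* Hypothetical helper: increment a byte sequence as unsigned octets,
   propagating carry from the last byte towards the first, in the
   spirit of the POSIX ellipsis rule quoted in the comment.  */
static void
sketch_posix_increment (unsigned char *bytes, int n)
{
  assert (bytes != NULL);
  for (int i = n - 1; i >= 0; --i)
    if (++bytes[i] != 0)
      break;			/* No carry out of this byte.  */
}
```

With these helpers, U+00BF encodes as C2 BF.  Incrementing that
byte-wise gives C2 C0, which is not valid UTF-8 because C0 is not a
continuation byte, whereas encoding U+00C0 directly gives C3 80.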


> @@ -1039,11 +1134,52 @@ hexadecimal range format should use only capital characters"));
>    for (cnt = from_nr; cnt <= to_nr; cnt += step)
>      {
>        char *name_end;
> +      unsigned char ubytes[4] = { '\0', '\0', '\0', '\0' };
>        obstack_printf (ob, decimal_ellipsis ? "%.*s%0*d" : "%.*s%0*X",
>  		      prefix_len, from, len1 - prefix_len, cnt);
>        obstack_1grow (ob, '\0');
>        name_end = obstack_finish (ob);
>  
> +      /* Either we have a UTF-8 charmap, and we compute the bytes (see comment
> +	 above), or we have a non-UTF-8 charmap and we follow POSIX rules as
> +	 further below for incrementing the bytes in an ellipsis.  */
> +      if (is_utf8)
> +	{
> +	  int nubytes;
> +
> +	  /* Direclty convert the code point to the UTF-8 encoded bytes.  */
> +	  nubytes = output_utf8_bytes (cnt, 4, ubytes);

Typo: Direclty

There are some overlong lines here, please fix.

> diff --git a/localedata/C.UTF-8.in b/localedata/C.UTF-8.in
> new file mode 100644
> index 0000000000..70ab2bbac7
> --- /dev/null
> +++ b/localedata/C.UTF-8.in
> @@ -0,0 +1,852388 @@

I do not think it's a good idea to check in this file.  It's large (the
hunk above adds 852,388 lines) and it's dormant during regular builds.

Thanks,
Florian