From mboxrd@z Thu Jan 1 00:00:00 1970
From: Florian Weimer
To: Carlos O'Donell via Libc-alpha
Subject: Re: [PATCH 2/2] Add new C.UTF-8 locale (Bug 17318)
References: <75d21bd8-2698-2e25-969c-4e086c90abd9@redhat.com>
Date: Mon, 29 Jun 2020 11:42:45 +0200
In-Reply-To: (Carlos O'Donell via Libc-alpha's message of "Mon, 29 Jun 2020 00:22:48 -0400")
Message-ID: <877dvq1a3u.fsf@oldenburg2.str.redhat.com>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.3 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain
List-Id: Libc-alpha mailing list

* Carlos O'Donell via Libc-alpha:

> diff --git a/locale/programs/charmap.c b/locale/programs/charmap.c
> index c23e50944f..d89d788a9b 100644
> --- a/locale/programs/charmap.c
> +++ b/locale/programs/charmap.c
> @@ -49,7 +49,7 @@ static void new_width (struct linereader *cmfile, struct charmap_t *result,
> @@ -285,6 +285,27 @@ parse_charmap (struct linereader *cmfile, int verbose, int be_quiet)
>    enum token_t ellipsis = 0;
>    int step = 1;
>
> +  /* POSIX explicitly requires that ellipsis processing do the
> +     following: "Bytes shall be treated as unsigned octets, and carry
> +     shall be propagated between the bytes as necessary to represent the
> +     range." It then goes on to say that such a declaration should
> +     never be specified because it creates NULL bytes. Therefore we
> +     error on this condition (see charmap_new_char). However this still
> +     leaves a problem for encodings which use less than the full 8-bits,
> +     like UTF-8, and in such encodings you can use an ellipsis to
> +     silently and accidentally create invalid ranges. In UTF-8 you have
> +     only the first 6-bits of the first byte and if your ellipsis covers
> +     a code point range larger than this 64 code point block the output
> +     is going to be an invalid non-UTF-8 multi-byte sequence. Thus for
> +     UTF-8 we add a speical ellipsis handling loop that can increment
> +     UTF-8 multi-byte output effectively and for UTF-8 we allow larger
> +     ellipsis ranges without error. There may still be other encodings
> +     for which the ellipsis will still generate invalid multi-byte
> +     output, but not for UTF-8. The only alternative would be to call
> +     gconv for each Unicode code point in the loop to convert it to the
> +     appropriate multi-byte output, but that would be slow. */

Typo: speical

> @@ -1039,11 +1134,52 @@ hexadecimal range format should use only capital characters"));
>    for (cnt = from_nr; cnt <= to_nr; cnt += step)
>      {
>        char *name_end;
> +      unsigned char ubytes[4] = { '\0', '\0', '\0', '\0' };
>        obstack_printf (ob, decimal_ellipsis ? "%.*s%0*d" : "%.*s%0*X",
>                        prefix_len, from, len1 - prefix_len, cnt);
>        obstack_1grow (ob, '\0');
>        name_end = obstack_finish (ob);
>
> +      /* Either we have a UTF-8 charmap, and we compute the bytes (see comment
> +         above), or we have a non-UTF-8 charmap and we follow POSIX rules as
> +         further below for incrementing the bytes in an ellipsis. */
> +      if (is_utf8)
> +        {
> +          int nubytes;
> +
> +          /* Direclty convert the code point to the UTF-8 encoded bytes. */
> +          nubytes = output_utf8_bytes (cnt, 4, ubytes);

Typo: Direclty

There are some overlong lines here, please fix.

> diff --git a/localedata/C.UTF-8.in b/localedata/C.UTF-8.in
> new file mode 100644
> index 0000000000..70ab2bbac7
> --- /dev/null
> +++ b/localedata/C.UTF-8.in
> @@ -0,0 +1,852388 @@

I do not think it's a good idea to check in this file. It's large and
it's dormant during regular builds.

Thanks,
Florian
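
As a side illustration of the ellipsis problem described in the quoted
comment, here is a minimal sketch in C. It is not the patch's
output_utf8_bytes, whose implementation is not shown in this message; the
names encode_utf8 and increment_with_carry and their signatures are
hypothetical and exist only for this example. The first function encodes a
code point directly, the second performs the POSIX-style byte increment
with carry.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (an assumption, not glibc code): encode the
   code point CP as UTF-8 into OUT, which must have room for 4 bytes.
   Returns the number of bytes written, or 0 if CP is a surrogate or
   lies beyond U+10FFFF.  */
static size_t
encode_utf8 (uint32_t cp, unsigned char *out)
{
  assert (out != NULL);
  if (cp < 0x80)
    {
      out[0] = (unsigned char) cp;
      return 1;
    }
  if (cp < 0x800)
    {
      out[0] = (unsigned char) (0xC0 | (cp >> 6));
      out[1] = (unsigned char) (0x80 | (cp & 0x3F));
      return 2;
    }
  if (cp >= 0xD800 && cp <= 0xDFFF)
    return 0;
  if (cp < 0x10000)
    {
      out[0] = (unsigned char) (0xE0 | (cp >> 12));
      out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
      out[2] = (unsigned char) (0x80 | (cp & 0x3F));
      return 3;
    }
  if (cp <= 0x10FFFF)
    {
      out[0] = (unsigned char) (0xF0 | (cp >> 18));
      out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
      out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
      out[3] = (unsigned char) (0x80 | (cp & 0x3F));
      return 4;
    }
  return 0;
}

/* Hypothetical helper: the POSIX-style ellipsis step.  Treat BYTES as
   unsigned octets and add one to the last byte, propagating any carry
   toward the first byte.  */
static void
increment_with_carry (unsigned char *bytes, size_t len)
{
  for (size_t i = len; i-- > 0; )
    if (++bytes[i] != 0)
      break;
}
```

Starting from U+00BF (bytes C2 BF), increment_with_carry yields C2 C0,
which is not valid UTF-8 because C0 is not a continuation byte, while
encode_utf8 (0xC0, ...) yields the correct C3 80. That mismatch is the
64-code-point block boundary the quoted comment refers to.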