From: Florian Weimer
To: Carlos O'Donell
Cc: libc-alpha@sourceware.org
Subject: Re: [PATCH v4 4/4] Add generic C.UTF-8 locale (Bug 17318)
Date: Thu, 29 Apr 2021 16:13:52 +0200
Message-ID: <87eeetfaan.fsf@oldenburg.str.redhat.com>
In-Reply-To: <20210428130033.3196848-5-carlos@redhat.com> (Carlos O'Donell's message of "Wed, 28 Apr 2021 09:00:33 -0400")
References: <20210428130033.3196848-1-carlos@redhat.com> <20210428130033.3196848-5-carlos@redhat.com>

* Carlos O'Donell:

> We add a new C.UTF-8 locale. This locale is not builtin to glibc, but
> is provided as a distinct locale. The locale provides full support
> for UTF-8 and this includes full code point sorting via collation
> (excludes surrogates). Unfortunately, given the present implementation
> in glibc this results in 28MiB of LC_COLLATE data for all possible
> Unicode code points. Future improvements may reduce this size. Such
> improvements likely require a shortcut for the collation data that
> relies on C.UTF-8 single-byte sorting being equivalent to strcmp.
>
> The new locale is NOT added to SUPPORTED. Minimal test data for
> specific code points (minus those not supported by collate-test) is
> provided in C.UTF-8.in, and this verifies code point sorting is
> working reasonably across the range.
>
> The next step is to reduce LC_COLLATE to a manageable size before we
> enable the locale in SUPPORTED. Fully testing C.UTF-8 collation can
> add ~5-7 minutes to the locale testing (collate-test, and xfrm-test
> twice) so we don't enable full testing of all code points until we can
> parallelize the sort-test test. Testing sort-test with C.UTF-8 minimal
> test data passes cleanly.

Can you compare this locale with what is in Fedora and Debian, for the
non-collation/CTYPE aspects? Are there other distributions which ship
a downstream C.UTF-8 locale?

Thanks,
Florian