From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <libc-alpha-return-94746-listarch-libc-alpha=sources.redhat.com@sourceware.org>
Received: (qmail 47331 invoked by alias); 25 Jul 2018 21:35:17 -0000
Mailing-List: contact libc-alpha-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id: <libc-alpha.sourceware.org>
List-Subscribe: <mailto:libc-alpha-subscribe@sourceware.org>
List-Archive: <http://sourceware.org/ml/libc-alpha/>
List-Post: <mailto:libc-alpha@sourceware.org>
List-Help: <mailto:libc-alpha-help@sourceware.org>, <http://sourceware.org/ml/#faqs>
Sender: libc-alpha-owner@sourceware.org
Received: (qmail 47313 invoked by uid 89); 25 Jul 2018 21:35:16 -0000
Authentication-Results: sourceware.org; auth=none
X-Spam-SWARE-Status: No, score=-11.7 required=5.0 tests=BAYES_00,GIT_PATCH_2,GIT_PATCH_3,KAM_MANYTO,KAM_SHORT,RCVD_IN_DNSWL_NONE autolearn=ham version=3.3.2 spammy=speak, Hx-languages-length:4554
X-HELO: mail-qk0-f196.google.com
Return-Path: <carlos@redhat.com>
Subject: Re: [PATCH] Keep expected behaviour for [a-z] and [A-z] (Bug 23393).
From: Carlos O'Donell <carlos@redhat.com>
To: GNU C Library <libc-alpha@sourceware.org>,
 Florian Weimer <fweimer@redhat.com>, Rich Felker <dalias@aerifal.cx>,
 Mike Fabian <mfabian@redhat.com>, Zorro Lang <zlang@redhat.com>,
 "Joseph S. Myers" <joseph@codesourcery.com>
References: <9d6f47ec-f9eb-ead0-889c-3b9aae66551c@redhat.com>
Message-ID: <4de6a552-8b4c-ffe0-caf2-0a2d07a908f4@redhat.com>
Date: Wed, 25 Jul 2018 21:35:00 -0000
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101
 Thunderbird/52.8.0
MIME-Version: 1.0
In-Reply-To: <9d6f47ec-f9eb-ead0-889c-3b9aae66551c@redhat.com>
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-SW-Source: 2018-07/txt/msg00861.txt.bz2

On 07/19/2018 03:43 PM, Carlos O'Donell wrote:
> In commit 9479b6d5e08eacce06c6ab60abc9b2f4eb8b71e4 we updated all of
> the collation data to harmonize with the new version of ISO 14651
> which is derived from Unicode 9.0.0.  This collation update brought
> with it some changes to locales which some users found undesirable;
> in particular it altered the meaning of the locale-dependent range
> expressions [a-z] and [A-Z], and for en_US it caused uppercase
> letters to be matched by [a-z] for the first time.  The matching of
> uppercase letters by [a-z] is already familiar to users of other
> locales which have this property, but this change could cause
> significant problems for en_US and other similar locales that had
> never behaved this way before.
> Whether this behaviour is desirable is contentious.  GNU Awk has
> this to say on the topic:
> https://www.gnu.org/software/gawk/manual/html_node/Ranges-and-Locales.html
> The POSIX standard, in the rationale for "RE Bracket Expression",
> says further:
> http://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xbd_chap09.html
> "The current standard leaves unspecified the behavior of a range
> expression outside the POSIX locale. ... As noted above, efforts were
> made to resolve the differences, but no solution has been found that
> would be specific enough to allow for portable software while not
> invalidating existing implementations."
> In glibc we implement the requirement of ISO POSIX-2:1993 and use
> collation element order (CEO) to construct the range expression; the
> internal API is __collseq_table_lookup().  The fact that we use CEO
> and also have 4-level weights on each collation rule means that we can
> in practice reorder the collation rules in iso14651_t1_common (the new
> data) to provide consistent range expression resolution *and* the
> weights should maintain the expected total order.  Therefore this
> patch does three things:
> 
> * Reorder the collation rules for the LATIN script in
>   iso14651_t1_common to deinterlace uppercase and lowercase letters in
>   the collation element orders.
> 
> * Add new test data en_US.UTF-8.in for sort-test.sh which exercises
>   strcoll* and strxfrm* and ensures the ISO 14651 collation remains.
> 
> * Add back tests to tst-fnmatch.input and tst-regexloc.c which
>   exercise that [a-z] does not match A or Z.
> 
> The reordering of the ISO 14651 data is done in an entirely mechanical
> fashion using the following program attached to the bug:
> https://sourceware.org/bugzilla/show_bug.cgi?id=23393#c28
> 
> It is up for discussion whether the iso14651_t1_common data should be
> refined further to have 3 very tight collation element ranges that
> include only a-z, A-Z, and 0-9, which would implement the solution
> sought in:
> https://sourceware.org/bugzilla/show_bug.cgi?id=23393#c12
> 
> No regressions on x86_64.
> Verified that removal of the iso14651_t1_common change causes tst-fnmatch
> to regress with:
> 422: fnmatch ("[a-z]", "A", 0) = 0 (FAIL, expected FNM_NOMATCH) ***
> ...
> 425: fnmatch ("[A-Z]", "z", 0) = 0 (FAIL, expected FNM_NOMATCH) ***
> ---
>  ChangeLog                             |   11 +
>  localedata/Makefile                   |    1 +
>  localedata/en_US.UTF-8.in             | 2159 +++++++++++++++++++++++++++++++++
>  localedata/locales/iso14651_t1_common | 1928 ++++++++++++++---------------
>  posix/tst-fnmatch.input               |  125 +-
>  posix/tst-regexloc.c                  |    8 +-
>  6 files changed, 3224 insertions(+), 1008 deletions(-)
>  create mode 100644 localedata/en_US.UTF-8.in
> 
> I'm suggesting this change immediately for 2.28 to avoid further
> problems with users' expectations and sorting with [a-z] and [A-Z]
> until a clearer consensus can be reached for a final solution.
> 
> File attached as .tar.gz to get past spam detectors. There is a lot
> of UTF-8 data in en_US.UTF-8 (every possible character in the LATIN
> set that can be sorted with the existing test case infrastructure).
> 

I have committed only the most conservative fix for this issue, which is
to deinterlace the lower and upper case ranges.

I think it is too late to commit rational ranges for 2.28; we can do that
in 2.29 when it opens. Right now I want to remove the blocker that is
causing regressions for en_US.UTF-8 scripts that use [a-z] and [A-Z].

We have consensus that this is the right direction for a solution. If
anyone objects, please speak up before I cut the branch on August 1st
(if we can still achieve that and get good machine coverage).

Cheers,
Carlos.