Subject: Re: [PATCH] Keep expected behaviour for [a-z] and [A-z] (Bug 23393).
From: Carlos O'Donell
To: GNU C Library, Florian Weimer, Rich Felker, Mike Fabian, Zorro Lang, "Joseph S. Myers"
References: <9d6f47ec-f9eb-ead0-889c-3b9aae66551c@redhat.com>
Message-ID: <4de6a552-8b4c-ffe0-caf2-0a2d07a908f4@redhat.com>
In-Reply-To: <9d6f47ec-f9eb-ead0-889c-3b9aae66551c@redhat.com>
Date: Wed, 25 Jul 2018 21:35:00 -0000

On 07/19/2018 03:43 PM, Carlos O'Donell wrote:
> In commit 9479b6d5e08eacce06c6ab60abc9b2f4eb8b71e4 we updated all of
> the collation data to harmonize with the new version of ISO 14651
> which is derived from Unicode 9.0.0. This collation update brought
> with it some changes to locales which were not desirable by some
> users, in particular it altered the meaning of the
> locale-dependent-range regular expression, namely [a-z] and [A-Z], and
> for en_US it caused uppercase letters to be matched by [a-z] for the
> first time.
> The matching of uppercase letters by [a-z] is something
> which is already known to users of other locales which have this
> property, but this change could cause significant problems to en_US
> and other similar locales that had never had this change before.
> Whether this behaviour is desirable or not is contentious and GNU Awk
> has this to say on the topic:
> https://www.gnu.org/software/gawk/manual/html_node/Ranges-and-Locales.html
> While the POSIX standard also has this further to say: "RE Bracket
> Expression":
> http://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xbd_chap09.html
> "The current standard leaves unspecified the behavior of a range
> expression outside the POSIX locale. ... As noted above, efforts were
> made to resolve the differences, but no solution has been found that
> would be specific enough to allow for portable software while not
> invalidating existing implementations."
> In glibc we implement the requirement of ISO POSIX-2:1993 and use
> collation element order (CEO) to construct the range expression, the
> API internally is __collseq_table_lookup(). The fact that we use CEO
> and also have 4-level weights on each collation rule means that we can
> in practice reorder the collation rules in iso14651_t1_common (the new
> data) to provide consistent range expression resolution *and* the
> weights should maintain the expected total order. Therefore this
> patch does three things:
>
> * Reorder the collation rules for the LATIN script in
>   iso14651_t1_common to deinterlace uppercase and lowercase letters in
>   the collation element orders.
>
> * Adds new test data en_US.UTF-8.in for sort-test.sh which exercises
>   strcoll* and strxfrm* and ensures the ISO 14651 collation remains.
>
> * Add back tests to tst-fnmatch.input and tst-regexloc.c which
>   exercise that [a-z] does not match A or Z.
>
> The reordering of the ISO 14651 data is done in an entirely mechanical
> fashion using the following program attached to the bug:
> https://sourceware.org/bugzilla/show_bug.cgi?id=23393#c28
>
> It is up for discussion if the iso14651_t1_common data should be
> refined further to have 3 very tight collation element ranges that
> include only a-z, A-Z, and 0-9, which would implement the solution
> sought after in:
> https://sourceware.org/bugzilla/show_bug.cgi?id=23393#c12
>
> No regressions on x86_64.
> Verified that removal of the iso14651_t1_common change causes tst-fnmatch
> to regress with:
> 422: fnmatch ("[a-z]", "A", 0) = 0 (FAIL, expected FNM_NOMATCH) ***
> ...
> 425: fnmatch ("[A-Z]", "z", 0) = 0 (FAIL, expected FNM_NOMATCH) ***
> ---
>  ChangeLog                             |   11 +
>  localedata/Makefile                   |    1 +
>  localedata/en_US.UTF-8.in             | 2159 +++++++++++++++++++++++++++++++++
>  localedata/locales/iso14651_t1_common | 1928 ++++++++++++++---------------
>  posix/tst-fnmatch.input               |  125 +-
>  posix/tst-regexloc.c                  |    8 +-
>  6 files changed, 3224 insertions(+), 1008 deletions(-)
>  create mode 100644 localedata/en_US.UTF-8.in
>
> I'm suggesting this change immediately for 2.28 to avoid further
> problems with users expectations and sorting with [a-z] and [A-Z] until
> a clearer consensus can be reached for a final solution.
>
> File attached as .tar.gz to get past spam detectors. There is a lot
> of UTF-8 data in en_US.UTF-8 (every possible character in the LATIN
> set that can be sorted with the existing test case infrastructure).
>

I have committed only the most conservative fix for this issue, which
is to deinterlace the lower and upper case ranges. I think we are too
late to commit rational ranges, and we can do that in 2.29 when it
opens. Right now I want to remove the blocker that is causing
regressions for en_US.UTF-8 scripts that use [a-z], and [A-Z].
We have consensus that this is the right direction for a solution. If
anyone objects, please speak up before I cut the branch on August 1st
(if we can still achieve that and get good machine coverage).

Cheers,
Carlos.