From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <libc-alpha-return-94754-listarch-libc-alpha=sources.redhat.com@sourceware.org>
Received: (qmail 74475 invoked by alias); 26 Jul 2018 01:20:19 -0000
Mailing-List: contact libc-alpha-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id: <libc-alpha.sourceware.org>
List-Subscribe: <mailto:libc-alpha-subscribe@sourceware.org>
List-Archive: <http://sourceware.org/ml/libc-alpha/>
List-Post: <mailto:libc-alpha@sourceware.org>
List-Help: <mailto:libc-alpha-help@sourceware.org>, <http://sourceware.org/ml/#faqs>
Sender: libc-alpha-owner@sourceware.org
Received: (qmail 74461 invoked by uid 89); 26 Jul 2018 01:20:19 -0000
Authentication-Results: sourceware.org; auth=none
X-Spam-SWARE-Status: No, score=-1.7 required=5.0 tests=BAYES_00,KAM_MANYTO,RCVD_IN_DNSWL_NONE autolearn=no version=3.3.2 spammy=territory, Hx-languages-length:2135
X-HELO: mail-qt0-f195.google.com
Return-Path: <carlos@redhat.com>
Subject: Re: [PATCH] Keep expected behaviour for [a-z] and [A-z] (Bug 23393).
To: Florian Weimer <fweimer@redhat.com>,
 GNU C Library <libc-alpha@sourceware.org>, Rich Felker <dalias@aerifal.cx>,
 Mike Fabian <mfabian@redhat.com>, Zorro Lang <zlang@redhat.com>,
 "Joseph S. Myers" <joseph@codesourcery.com>
References: <9d6f47ec-f9eb-ead0-889c-3b9aae66551c@redhat.com>
 <4de6a552-8b4c-ffe0-caf2-0a2d07a908f4@redhat.com>
 <646a94c8-3b25-b65e-7fc7-0637e58cacc1@redhat.com>
From: Carlos O'Donell <carlos@redhat.com>
Message-ID: <1313f0d2-8c64-8ec0-ef09-cd39bd6d4416@redhat.com>
Date: Thu, 26 Jul 2018 01:20:00 -0000
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101
 Thunderbird/52.8.0
MIME-Version: 1.0
In-Reply-To: <646a94c8-3b25-b65e-7fc7-0637e58cacc1@redhat.com>
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-SW-Source: 2018-07/txt/msg00869.txt.bz2

On 07/25/2018 06:50 PM, Florian Weimer wrote:
> On 07/25/2018 11:35 PM, Carlos O'Donell wrote:
>> I have committed only the most conservative fix for this issue,
>> which is to deinterlace the lower and upper case ranges.
>> 
>> I think we are too late to commit rational ranges, and we can do
>> that in 2.29 when it opens. Right now I want to remove the blocker
>> that is causing regressions for en_US.UTF-8 scripts that use [a-z],
>> and [A-Z].
> 
> How is this the most conservative fix, relative to glibc 2.27
> upstream?

We have two solutions to fix the regression:

* Revert the entire ISO 14651 update.
  - This is 13 commits for just the update.
  - Several more commits for Rafal and Mike's work on locales on top of that.

* Fix the key issue of a-z interleaving with A-Z.

My opinion is that it is most conservative to fix the interleaving.

In 2.27 we accepted 297 characters between A-Z.

In 2.28 we accept 2280 characters between A-Z as part of the ISO 14651 update.
 
> [a-z] still matches lots of non-ASCII characters, which it did not
> before.

This is not true; we were already matching 297 characters between A-Z
in 2.27. It has always been the case that we accepted non-ASCII characters
in the range. With the ISO 14651 update the *key* issue was that lowercase
and uppercase were now mixed in collation element ordering, resulting in
surprising matches and failures like the reported xfs test failure where
[a-z] matched "Makefile" and broke their test infrastructure.
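For anyone who wants to check the behaviour on their own build, here is a
minimal sketch using fnmatch(3). The helper name matches_in_locale is
hypothetical and not part of glibc or its test suite. It assumes the
requested locale is installed and reports -1 when it is not. In the C locale
the range is plain ASCII, so "[a-z]*" should reject "Makefile"; the result
for en_US.UTF-8 depends on the glibc version and the collation data in use.

```c
#include <fnmatch.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper, for illustration only.
   Returns 1 if NAME matches PATTERN under LOCALE_NAME, 0 if it does not,
   and -1 if the locale cannot be selected or fnmatch reports an error.
   Note: this resets the process locale to "C" before returning.  */
int
matches_in_locale (const char *locale_name, const char *pattern,
                   const char *name)
{
  /* Range expressions are interpreted under the active locale, so the
     locale has to be selected before calling fnmatch.  */
  if (setlocale (LC_ALL, locale_name) == NULL)
    return -1;

  int rc = fnmatch (pattern, name, 0);

  /* Restore a known locale so later calls are not affected.  */
  setlocale (LC_ALL, "C");

  if (rc == 0)
    return 1;
  if (rc == FNM_NOMATCH)
    return 0;
  return -1;
}
```

Calling matches_in_locale ("en_US.UTF-8", "[a-z]*", "Makefile") and comparing
the result against the "C" locale is one way to see whether a given build
shows the interleaving described above.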
 
> When I meant that we left regression-fixing territory, I was talking
> about the locales which had iso14651_t1_common customizations.

OK, so to be clear you think we *should* go forward with rational ranges?

I don't think it's too late. We could commit it tomorrow, and it should not
impact machine testing in any way.

My v4 fixes all of the locales that either have customizations on
iso14651_t1_common or have their own custom locales. No more locales
remain to be fixed; I tested all of them with tst-fnmatch.input additions
to catch the ones that needed fixing.

Cheers,
Carlos.