Re: [WIPv5] Expected behaviour for a-z, A-Z, and 0-9 (Bug 23393).

public inbox for libc-alpha@sourceware.org
 help / color / mirror / Atom feed

From: Carlos O'Donell <carlos@redhat.com>
To: Florian Weimer <fweimer@redhat.com>, libc-alpha@sourceware.org
Subject: Re: [WIPv5] Expected behaviour for a-z, A-Z, and 0-9 (Bug 23393).
Date: Mon, 30 Jul 2018 18:26:00 -0000	[thread overview]
Message-ID: <627cf58a-1cec-a962-2ebb-a97726ac6a1b@redhat.com> (raw)
In-Reply-To: <e4704376-29ec-b65f-0ae8-c73b69173fe9@redhat.com>

On 07/30/2018 01:54 PM, Florian Weimer wrote:
> On 07/30/2018 07:45 PM, Carlos O'Donell wrote:
>> On 07/30/2018 01:39 PM, Florian Weimer wrote:
>>> On 07/28/2018 03:12 AM, Carlos O'Donell wrote:
>>>> On 07/26/2018 10:50 AM, Florian Weimer wrote:
>>>>> shs_CA: U+0000E6 matches /[a-z]/ unexpectedly
>>>>> shs_CA: U+0000C6 matches /[A-Z]/ unexpectedly
>>>>> shs_CA.utf8: U+0000E6 matches /[a-z]/ unexpectedly
>>>>> shs_CA.utf8: U+0000C6 matches /[A-Z]/ unexpectedly
>>>
>>>> This is a WIP, because the number of tests now is too big
>>>> to simply add them to tst-fnmatch.input, and so I'm writing
>>>> a new tester tst-rational-ranges.c. I'm parsing SUPPORTED,
>>>> expecting all of the locales to be built for testing, and
>>>> then running through all the rational ranges to test
>>>> inclusion of the required datums.
>>>
>>> Let me repeat my suggestion that we should initially fix the locales
>>> with the common collation order, where glibc 2.28 regresses.
>>
>> I do not think it is appropriate to release rational range support on
>> only a subset of the SUPPORTED set of locales. Either we support it on
>> all SUPPORTED locales or we work until we are ready.
>>
>> At present glibc 2.28 does not regress because of commit
>> 7cd7d36f1feb3ccacf476e909b115b45cdd46e77 to deinterlace lower and
>> uppercase.
>>
>> In glibc 2.28 we simply have ~2500 characters in the range of a-z,
>> and in 2.27 we had ~250, it's still a large set of non-ASCII characters
>> accepted by the range, all because we caught up to Unicode 9.0.0 with
>> the ISO 14651 collation update (and will soon updated to Unicode 10.0.0
>> with the next release, and probably always lagging a bit).
> 
> Ahh.Â  So it's more complex and a regression longer in the making.

I'm worried I don't quite follow your statement of "longer in the making,"
but let me summarize what I think you wrote, and tell me if I have
it right.

The regression, from the perspective of en_US, is that [a-z] in master
accepts uppercase ASCII characters, and this breaks user expectations.

This is the only regression I'm considering serious enough to block the
release for and we've fixed it for now.

The regression which you say is "longer in the making" is that at some
point in the past the collation data for en_US contained only ASCII
ranges for a-z, A-Z, and 0-9. Then at some point in the past the ranges,
particularly those from a-z, and A-Z began accepting non-ASCII characters.

Thus the regression, from your perspective, happened far in the past.

As far as I can tell the regression has existed since the first import
for en_US which copied LC_COLLATE from en_DK (showing en_DK):
~~~
f5f52655ceb (Ulrich Drepper 1997-03-05 00:35:19 +0000  967) <A> <A>;<NONE>;<CAPITAL>;IGNORE
f5f52655ceb (Ulrich Drepper 1997-03-05 00:35:19 +0000  968) <a> <A>;<NONE>;<SMALL>;IGNORE
...
f5f52655ceb (Ulrich Drepper 1997-03-05 00:35:19 +0000 1546) <Z> <Z>;<NONE>;<CAPITAL>;IGNORE
f5f52655ceb (Ulrich Drepper 1997-03-05 00:35:19 +0000 1547) <z> <Z>;<NONE>;<SMALL>;IGNORE
f5f52655ceb (Ulrich Drepper 1997-03-05 00:35:19 +0000 1548) <Z'>        <Z>;<ACUTE>;<CAPITAL>;IGNORE
~~~
Is this what you mean by "longer in the making?"

I expect that en_US at some point along the way is switched to use the
iso14651_t1 data, and so gains non-interleaved a-z/A-Z CEO, but it's hard
to tell exactly if CEO was fully functional, if fnmatch worked as expected,
etc.

Either way this is all a poorly understood and structured solution at this
point, and I hope that in 1 or 2 releases we go from "unusable interface" to
"rational ranges (data)" to "full rational ranges (code point ranges)" and
end up with a sensible portable solution.

>> I don't see an urgent need to get rational range support into 2.28.
>> I was happy to get it in earlier, but now with deeper testing showing
>> that not all locales are working correctly, I'm not happy to see this
>> go out the door. I think it will be ready very shortly, and we can check
>> it in immediately into 2.29, and then continue our work on code point
>> ranges as the next step, which will require even more testing, and
>> internal API cleanup.
> 
> Sounds reasonable.

That sounds great. I will continue to update this patch set and get some
independent checking from your scripts, and my own testing. I also need
to add collation tests for all the locales I touch to ensure that the
reordering is just that, and that it doesn't materially change the collation
sequence (if it does it's a bug). This all adds more coverage to the
SUPPORTED set of languages which is a positive thing.

-- 
Cheers,
Carlos.

next prev parent reply	other threads:[~2018-07-30 18:26 UTC|newest]

Thread overview: 42+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-07-19 19:43 [PATCH] Keep expected behaviour for [a-z] and [A-z] " Carlos O'Donell
2018-07-19 20:39 ` Florian Weimer
2018-07-20 18:49   ` Carlos O'Donell
2018-07-20 19:02     ` Rich Felker
2018-07-20 19:19     ` Florian Weimer
2018-07-20 21:56       ` Carlos O'Donell
2018-07-23 15:11         ` Florian Weimer
2018-07-23 18:09           ` Rational Ranges - Rafal and Mike's opinion? " Carlos O'Donell
2018-07-24 20:45             ` Rafal Luzynski
2018-07-24 20:53               ` Carlos O'Donell
2018-07-24 20:59               ` Carlos O'Donell
2018-07-25 15:44             ` Mike FABIAN
2018-07-25 15:54           ` [PATCHv3] Expected behaviour for a-z, A-Z, and 0-9 " Carlos O'Donell
2018-07-25 20:19             ` Florian Weimer
2018-07-25 20:25               ` Carlos O'Donell
2018-07-25 20:31                 ` Florian Weimer
2018-07-25 20:57                   ` [PATCHv4] " Carlos O'Donell
2018-07-26  2:34                     ` [PATCHv4a] " Carlos O'Donell
2018-07-26 14:51                       ` Florian Weimer
2018-07-26 14:59                         ` Carlos O'Donell
2018-07-28  1:12                         ` [WIPv5] " Carlos O'Donell
2018-07-30 17:40                           ` Florian Weimer
2018-07-30 17:45                             ` Carlos O'Donell
2018-07-30 17:54                               ` Florian Weimer
2018-07-30 18:26                                 ` Carlos O'Donell [this message]
2018-07-30 18:34                                   ` Florian Weimer
2018-07-31  2:18                             ` Carlos O'Donell
2018-07-25 21:06                 ` [PATCHv3] " Rafal Luzynski
2018-07-25 21:12                   ` Carlos O'Donell
2018-07-25 21:35 ` [PATCH] Keep expected behaviour for [a-z] and [A-z] " Carlos O'Donell
2018-07-25 22:50   ` Florian Weimer
2018-07-26  1:20     ` Carlos O'Donell
2018-07-26  8:09       ` Andreas Schwab
2018-07-26  9:16         ` Florian Weimer
2018-07-26  1:33 ` Jonathan Nieder
2018-07-26  1:49   ` Carlos O'Donell
2018-07-26  2:16     ` Jonathan Nieder
2018-07-26  3:48       ` Carlos O'Donell
2018-07-26  7:42       ` Florian Weimer
2018-07-26  8:18         ` Andreas Schwab
2018-07-26  9:15           ` Florian Weimer
2018-07-26 13:25           ` Carlos O'Donell

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=627cf58a-1cec-a962-2ebb-a97726ac6a1b@redhat.com \
    --to=carlos@redhat.com \
    --cc=fweimer@redhat.com \
    --cc=libc-alpha@sourceware.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).