public inbox for libc-alpha@sourceware.org
From: Carlos O'Donell <carlos@redhat.com>
To: Jonathan Nieder <jrnieder@gmail.com>
Cc: GNU C Library <libc-alpha@sourceware.org>,
	Florian Weimer <fweimer@redhat.com>,
	Rich Felker <dalias@aerifal.cx>, Mike Fabian <mfabian@redhat.com>,
	Zorro Lang <zlang@redhat.com>,
	"Joseph S. Myers" <joseph@codesourcery.com>
Subject: Re: [PATCH] Keep expected behaviour for [a-z] and [A-z] (Bug 23393).
Date: Thu, 26 Jul 2018 01:49:00 -0000
Message-ID: <02c54107-d38d-885c-2f5e-656315667d19@redhat.com>
In-Reply-To: <20180726013351.GC217613@aiede.svl.corp.google.com>

On 07/25/2018 09:33 PM, Jonathan Nieder wrote:
> Hi,
> 
> Carlos O'Donell wrote:
> 
>> In commit 9479b6d5e08eacce06c6ab60abc9b2f4eb8b71e4 we updated all of
>> the collation data to harmonize with the new version of ISO 14651
>> which is derived from Unicode 9.0.0.  This collation update brought
>> with it some changes to locales which were not desirable by some
>> users, in particular it altered the meaning of the
>> locale-dependent-range regular expression, namely [a-z] and [A-Z], and
>> for en_US it caused uppercase letters to be matched by [a-z] for the
>> first time.
> 
> The Debian system where it is most convenient for me to test has
> Debian's libc6 package, version 2.24-12.  [a-z] matches uppercase
> letters.  I've always considered that undesirable but I'm confused
> about the described regression.  Did one of Debian's patches to
> localedata cause it to pick up the regression early (by which I mean,
> more than 5 years ago)?

It depends entirely on the locale you use. Some locales already have
[a-z] matching uppercase and have had it for years. The problem is that
this is new for en_US.UTF-8.

Which locale did you use? en_US.UTF-8? If so, then yes, Debian must have
done something different with iso14651_t1_common to change this, or added
something else. I took a quick look at the Debian patches for 2.24-12 and
didn't see anything that would materially change this for en_US.

>> In glibc we implement the requirement of ISO POSIX-2:1993 and use
>> collation element order (CEO) to construct the range expression, the
>> API internally is __collseq_table_lookup().  The fact that we use CEO
>> and also have 4-level weights on each collation rule means that we can
>> in practice reorder the collation rules in iso14651_t1_common (the new
>> data) to provide consistent range expression resolution *and* the
>> weights should maintain the expected total order.
> [...]
>> * Adds new test data en_US.UTF-8.in for sort-test.sh which exercises
>>   strcoll* and strxfrm* and ensures the ISO 14651 collation remains.
> 
> Cool!  Checking my understanding: does this mean that if I have files
> 
> 	lll
> 	MMM
> 	nnn
> 
> that with this patch,
> 
> 	echo [a-z]*
> 
> would no longer match MMM, and

Correct.

> 
> 	ls | sort
> 
> would continue to sort in the order lll < MMM < nnn?

Yes.

> 
> I wish we had done it 10 years ago. ;-)  Thanks for getting it done.

The rational ranges follow code point order.

The sorting follows collation sequence.
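
To make the distinction concrete, here is a toy sketch in Python. It is
not glibc code: fnmatch.fnmatchcase() stands in for a range that follows
code point order, and toy_collation_key() is a made-up stand-in for
strxfrm() that only approximates a case-insensitive collation.

```python
import fnmatch

# The three file names from the example above, deliberately unsorted.
names = ["nnn", "MMM", "lll"]

def in_rational_range(name, pattern="[a-z]*"):
    # fnmatchcase() compares by code point and never consults the locale,
    # so [a-z] covers only U+0061..U+007A here.
    return fnmatch.fnmatchcase(name, pattern)

def toy_collation_key(name):
    # Hypothetical stand-in for strxfrm(): compare case-insensitively
    # first, then put lowercase before uppercase as a tie-breaker.
    return (name.casefold(), name.swapcase())

matched = [n for n in names if in_rational_range(n)]
ordered = sorted(names, key=toy_collation_key)

print(matched)  # ['nnn', 'lll']
print(ordered)  # ['lll', 'MMM', 'nnn']
```

MMM drops out of the range match but keeps its place between lll and nnn
when sorted, which is the combination the patch aims for.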

I think this was never an issue because most locales following ISO 14651
were using an older data set that never exhibited it. However, thanks to
Mike Fabian's hard work (and no good deed goes unpunished) we have updated
the collation data to the Unicode 9.0.0 era and so encountered this problem.

Cheers,
Carlos.

