public inbox for libc-locales@sourceware.org
* [Bug localedata/20664] New: Unexpected collation in en_US.UTF-8, different to ICU CLDR
@ 2016-10-03 22:56 meta at pobox dot com
  2016-10-03 23:34 ` [Bug localedata/20664] " carlos at redhat dot com
                   ` (8 more replies)
  0 siblings, 9 replies; 11+ messages in thread
From: meta at pobox dot com @ 2016-10-03 22:56 UTC
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=20664

            Bug ID: 20664
           Summary: Unexpected collation in en_US.UTF-8, different to ICU
                    CLDR
           Product: glibc
           Version: 2.23
            Status: UNCONFIRMED
          Severity: normal
          Priority: P2
         Component: localedata
          Assignee: unassigned at sourceware dot org
          Reporter: meta at pobox dot com
                CC: libc-locales at sourceware dot org
  Target Milestone: ---

On Fedora 24 with glibc-2.23.1 I get the following interesting sort behavior:

% echo -e "+00\n-0c\n+02\n-02" | LC_ALL=en_US.UTF-8 sort
+00
-02
+02
-0c

On Mac OS X 10.11 I get less surprising behavior:

% echo -e "+00\n-0c\n+02\n-02" | LC_ALL=en_US.UTF-8 sort
+00
+02
-02
-0c

I've tried to reproduce the first result using
<http://demo.icu-project.org/icu-bin/collation.html> but have not managed to
find a set of options that will do so.

So I'm not sure if it is technically a bug, but I would say that it's at least
unexpected and apparently diverges from ICU & CLDR.

-- 
You are receiving this mail because:
You are on the CC list for the bug.


* [Bug localedata/20664] Unexpected collation in en_US.UTF-8, different to ICU CLDR
From: carlos at redhat dot com @ 2016-10-03 23:34 UTC
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=20664

Carlos O'Donell <carlos at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |WAITING
   Last reconfirmed|                            |2016-10-03
                 CC|                            |carlos at redhat dot com
     Ever confirmed|0                           |1

--- Comment #1 from Carlos O'Donell <carlos at redhat dot com> ---
Going forward we want glibc to track CLDR more closely. Therefore, if you can
find a glibc version that exhibits a meaningful difference from CLDR, then
please file a report, like this one.

However, you have too many moving pieces for us to validate this. For example,
sort is not a good test case because it might not itself use glibc's collation
tables for sorting.

Can you construct a test case with strcoll that exhibits this problem?


* [Bug localedata/20664] Unexpected collation in en_US.UTF-8, different to ICU CLDR
From: meta at pobox dot com @ 2016-10-04  0:50 UTC
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=20664

--- Comment #2 from mathew <meta at pobox dot com> ---
I originally filed a bug against GNU coreutils, and was told that this is the
behavior of strcoll from glibc, which coreutils uses for collation. See:

<http://debbugs.gnu.org/cgi/bugreport.cgi?bug=24601>


* [Bug localedata/20664] Unexpected collation in en_US.UTF-8, different to ICU CLDR
From: meta at pobox dot com @ 2016-10-04 19:08 UTC
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=20664

--- Comment #3 from mathew <meta at pobox dot com> ---
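#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <locale.h>

/* Bubble-sort four strings with strcoll() under en_US.UTF-8. */
int main(void) {
  char *str[4], *temp;
  int i, j, n, c;

  setlocale(LC_ALL, "en_US.UTF-8");

  str[0] = "+00";
  str[1] = "-0c";
  str[2] = "+02";
  str[3] = "-02";

  n = 4;
  for (i = 0; i < n; i++) {
    for (j = 0; j < n - 1; j++) {
      /* c is 1 when str[j] collates after str[j + 1], otherwise 0. */
      c = strcoll(str[j], str[j + 1]) > 0;
      /* Note: this logs str[i] and str[j], not the pair actually compared
         (str[j] and str[j + 1]), so the strings shown in the trace do not
         match the comparison; only the final sorted list is reliable. */
      printf("i = %d j = %d strcoll %s %s = %d\n", i, j, str[i], str[j], c);
      if (c > 0) {
        /* Swap the adjacent pair that is out of collation order. */
        temp = str[j];
        str[j] = str[j+1];
        str[j+1] = temp;
      }
    }
  }

  printf("\nSorted List:\n");
  for (i = 0; i < n; i++) {
    puts(str[i]);
  }

  return (0);
}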

% ./a.out 
i = 0 j = 0 strcoll +00 +00 = 0
i = 0 j = 1 strcoll +00 -0c = 1
i = 0 j = 2 strcoll +00 -0c = 1
i = 1 j = 0 strcoll +02 +00 = 0
i = 1 j = 1 strcoll +02 +02 = 1
i = 1 j = 2 strcoll -02 +02 = 0
i = 2 j = 0 strcoll +02 +00 = 0
i = 2 j = 1 strcoll +02 -02 = 0
i = 2 j = 2 strcoll +02 +02 = 0
i = 3 j = 0 strcoll -0c +00 = 0
i = 3 j = 1 strcoll -0c -02 = 0
i = 3 j = 2 strcoll -0c +02 = 0

Sorted List:
+00
-02
+02
-0c


* [Bug localedata/20664] Unexpected collation in en_US.UTF-8, different to ICU CLDR
From: carlos at redhat dot com @ 2016-12-20 14:07 UTC
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=20664

Carlos O'Donell <carlos at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|WAITING                     |NEW


* [Bug localedata/20664] Unexpected collation in en_US.UTF-8, different to ICU CLDR
From: keld at keldix dot com @ 2016-12-20 16:04 UTC
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=20664

--- Comment #4 from keld at keldix dot com <keld at keldix dot com> ---
On Mon, Oct 03, 2016 at 11:10:56PM +0000, carlos at redhat dot com wrote:
> https://sourceware.org/bugzilla/show_bug.cgi?id=20664
> 
> Carlos O'Donell <carlos at redhat dot com> changed:
> 
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>              Status|UNCONFIRMED                 |WAITING
>    Last reconfirmed|                            |2016-10-03
>                  CC|                            |carlos at redhat dot com
>      Ever confirmed|0                           |1
> 
> --- Comment #1 from Carlos O'Donell <carlos at redhat dot com> ---
> Going forward we want glibc to track CLDR more closely. Therefore, if you can
> find a glibc version that exhibits a meaningful difference from CLDR, then
> please file a report, like this one.
> 
> However, you have too many moving pieces for us to validate this. For example,
> sort is not a good test case because it might not itself use glibc's collation
> tables for sorting.
> 
> Can you construct a test case with strcoll that exhibits this problem?

I do not think we should aim at following CLDR closely, but we should minimize
differences. I actually think we should get CLDR to follow us more closely :-)

Best regards
keld


* [Bug localedata/20664] Unexpected collation in en_US.UTF-8, different to ICU CLDR
From: carlos at redhat dot com @ 2016-12-21 21:08 UTC
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=20664

--- Comment #5 from Carlos O'Donell <carlos at redhat dot com> ---
(In reply to keld@keldix.com from comment #4)
> On Mon, Oct 03, 2016 at 11:10:56PM +0000, carlos at redhat dot com wrote:
> > https://sourceware.org/bugzilla/show_bug.cgi?id=20664
> > Can you construct a test case with strcoll that exhibits this problem?
> 
> I do not think we should aim at following CLDR closely, but we should
> minimize differences. I actually think we should get CLDR to follow us more
> closely :-)

I certainly agree that harmonization between both projects would be a great
goal. Having the best of both projects would be great. When I say "following
CLDR", what I mean is probably more accurately stated as "harmonized with CLDR."
So I will endeavour to use such language in the future.


* [Bug localedata/20664] Unexpected collation in en_US.UTF-8, different to ICU CLDR
From: kirelagin at gmail dot com @ 2021-10-11 20:18 UTC
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=20664

Kirill Elagin <kirelagin at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |kirelagin at gmail dot com

--- Comment #6 from Kirill Elagin <kirelagin at gmail dot com> ---
I am getting collation results as expected (meaning, no difference between
en_US.UTF-8 and POSIX) for the example strings with glibc 2.32.

Is this issue safe to close?


* [Bug localedata/20664] Unexpected collation in en_US.UTF-8, different to ICU CLDR
From: carlos at redhat dot com @ 2021-10-11 20:51 UTC
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=20664

Carlos O'Donell <carlos at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |FIXED
   Target Milestone|---                         |2.33

--- Comment #7 from Carlos O'Donell <carlos at redhat dot com> ---
(In reply to Kirill Elagin from comment #6)
> I am getting collation results as expected (meaning, no difference between
> en_US.UTF-8 and POSIX) for the example strings with glibc 2.32.
> 
> Is this issue safe to close?

In glibc 2.32 we upgraded to Unicode 13.0.0, and glibc 2.35 (Feb 2, 2022) will
include Unicode 14.0.0 support. Neither of these updates substantially changed
collation (which sort relies on). However, I agree with you: on Fedora 34 with
glibc 2.33 we get matching results:

echo -e "+00\n-0c\n+02\n-02" | LC_ALL=en_US.UTF-8 sort
+00
+02
-02
-0c

The collation data always had <U002B> < <U002D>, which results in + < -. I'm
marking this as RESOLVED/FIXED in glibc 2.33. We can reopen if we run into this
again to determine the root cause of the original mis-ordering in 2.32.


* [Bug localedata/20664] Unexpected collation in en_US.UTF-8, different to ICU CLDR
From: kirelagin at gmail dot com @ 2021-10-11 21:00 UTC
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=20664

--- Comment #8 from Kirill Elagin <kirelagin at gmail dot com> ---
Just FTR, the original issue was reported against 2.23 (not 2.32).


