public inbox for libc-locales@sourceware.org
* [Bug localedata/21091] New: Unexpected collation in ja_JP.UTF-8 probably due to unsupported blocks
@ 2017-01-29  1:14 mh-sourceware at glandium dot org
  2017-01-29  1:31 ` [Bug localedata/21091] " mh-sourceware at glandium dot org
                   ` (3 more replies)
  0 siblings, 4 replies; 5+ messages in thread
From: mh-sourceware at glandium dot org @ 2017-01-29  1:14 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=21091

            Bug ID: 21091
           Summary: Unexpected collation in ja_JP.UTF-8 probably due to
                    unsupported blocks
           Product: glibc
           Version: 2.24
            Status: UNCONFIRMED
          Severity: normal
          Priority: P2
         Component: localedata
          Assignee: unassigned at sourceware dot org
          Reporter: mh-sourceware at glandium dot org
                CC: libc-locales at sourceware dot org
  Target Milestone: ---

I was doing some scripting around some subset of the data in
https://github.com/cjkvi/cjkvi-ids/blob/master/ids.txt. I ended up doing things
like sort | uniq -d, both of which use strcoll.

My system locale is ja_JP.UTF-8, and that led to surprising results. I realize
that my intent was actually not to follow collation rules, but that still left
me wondering whether collation was right in glibc.
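
For a script that only needs a stable order and does not care about language
rules, code point order is enough. Here is a minimal Python sketch contrasting
the two; the function names are illustrative and not part of the original
report. Python's locale.strcoll goes through the C library's collation, so the
second function is subject to the behaviour described in this report.

```python
import locale
from functools import cmp_to_key


def sort_by_code_point(lines):
    """Sort in plain code point order.

    For valid UTF-8 input this matches byte order, i.e. what a sort run
    under the C locale would produce.
    """
    return sorted(lines)


def sort_by_locale(lines):
    """Sort using the collation rules of the current LC_COLLATE locale."""
    return sorted(lines, key=cmp_to_key(locale.strcoll))
```

Before sort_by_locale reflects a specific locale, the program has to select it,
for example with locale.setlocale(locale.LC_COLLATE, "ja_JP.UTF-8"), which
fails with locale.Error if that locale is not installed.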

So I created an LD_PRELOAD library that redirects strcoll to ICU's
ucol_strcoll and compared the outputs. They were very different.
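
The LD_PRELOAD shim itself is not shown in this report. As a rough
illustration of the comparison step only, the following Python sketch takes two
comparison functions and lists the pairs on which they disagree. The names
sign and disagreements are hypothetical, and the reference comparator (ICU in
the report) is left as a parameter rather than wired to any real binding.

```python
def sign(n):
    """Reduce a comparison result to -1, 0 or 1."""
    return (n > 0) - (n < 0)


def disagreements(pairs, cmp_a, cmp_b):
    """Yield (x, y, a, b) for each pair where the comparators disagree.

    cmp_a and cmp_b are strcoll-style functions returning a negative,
    zero or positive number; only the sign of the result is compared.
    """
    for x, y in pairs:
        a, b = sign(cmp_a(x, y)), sign(cmp_b(x, y))
        if a != b:
            yield x, y, a, b
```

Passing locale.strcoll as cmp_a and a reference implementation as cmp_b would
give a list of pairs worth looking at more closely.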

So I dug further, and found that:

All characters in CJK Unified Ideographs Extension A are considered equal.
All characters in CJK Unified Ideographs Extension B are considered equal.
All characters in CJK Unified Ideographs Extension C are considered equal.
All characters in CJK Unified Ideographs Extension D are considered equal.
All characters in CJK Unified Ideographs Extension E are considered equal.
All characters in CJK Radicals Supplement are considered equal.
All characters in Kangxi Radicals are considered equal.
All characters in CJK Strokes are considered equal.
All characters in Enclosed CJK Letters and Months are considered equal.
All characters in CJK Compatibility are considered equal.
All characters in CJK Compatibility Ideographs are considered equal.
All characters in CJK Compatibility Forms are considered equal.
All characters in Enclosed Ideographic Supplement are considered equal.
All characters in CJK Compatibility Ideographs Supplement are considered equal.

More than that, all the characters in the blocks above with code points below
0x10000 are considered equal to each other, and all the characters in the
blocks above with code points above 0x10000 are considered equal to each other.
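
One way to reproduce this kind of check is sketched below in Python. It is not
the script used for this report. The block ranges are the inclusive ranges from
the Unicode block list for a few of the blocks named above (they include code
points that may be unassigned), and the names BLOCKS, all_equal and report are
made up for the example.

```python
import locale

# Inclusive block ranges; only a few of the blocks listed above.
BLOCKS = {
    "CJK Unified Ideographs Extension A": (0x3400, 0x4DBF),
    "Kangxi Radicals": (0x2F00, 0x2FDF),
    "CJK Strokes": (0x31C0, 0x31EF),
    "CJK Unified Ideographs Extension B": (0x20000, 0x2A6DF),
}


def all_equal(first, last, coll=locale.strcoll):
    """Return True if every code point in [first, last] collates equal
    to the first one according to the comparison function `coll`."""
    ref = chr(first)
    return all(coll(ref, chr(cp)) == 0 for cp in range(first + 1, last + 1))


def report(locale_name="ja_JP.UTF-8"):
    """Print, per block, whether all its code points collate equal."""
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        print(f"locale {locale_name} is not installed; skipping")
        return
    for name, (first, last) in BLOCKS.items():
        print(f"{name}: all equal = {all_equal(first, last)}")
```

Calling report() on a system showing this problem should print True for each
block. To test the cross-block claim, compare the first code point of one block
with the first code point of another using locale.strcoll directly.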

All in all, it would seem that all unsupported characters in the BMP are equal,
and all unsupported characters in other Unicode planes are equal.

With new Unicode versions adding new characters, it seems to me it would be
better if unsupported characters were considered different as a general rule.
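
As an illustration of what "considered different" could mean for a caller, here
is a small Python sketch of a comparison that falls back to code point order
whenever the locale's collation reports a tie between two distinct strings.
This is an application-level workaround under that assumption, not a
description of how glibc behaves or of any fix to it; the function name is
hypothetical.

```python
import locale


def strcoll_with_tiebreak(a, b, coll=locale.strcoll):
    """Compare with `coll`; if it reports equality for two distinct
    strings, fall back to code point order so they never tie."""
    result = coll(a, b)
    if result != 0 or a == b:
        return result
    # Distinct strings that collate equal: order them by code point.
    return (a > b) - (a < b)
```

Used with functools.cmp_to_key, this gives a sort in which characters without
collation data still end up in a deterministic, distinct order.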

Obviously, it would be better if the above blocks were supported.

(This is with libc 2.24-9 from Debian)

-- 
You are receiving this mail because:
You are on the CC list for the bug.


* [Bug localedata/21091] Unexpected collation in ja_JP.UTF-8 probably due to unsupported blocks
  2017-01-29  1:14 [Bug localedata/21091] New: Unexpected collation in ja_JP.UTF-8 probably due to unsupported blocks mh-sourceware at glandium dot org
@ 2017-01-29  1:31 ` mh-sourceware at glandium dot org
  2017-09-03 20:38 ` [Bug localedata/21091] ja_JP.UTF8: unexpected collation " vapier at gentoo dot org
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 5+ messages in thread
From: mh-sourceware at glandium dot org @ 2017-01-29  1:31 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=21091

--- Comment #1 from Mike Hommey <mh-sourceware at glandium dot org> ---
Hmm, in fact, it looks like some characters in the CJK Unified Ideographs block
also compare equal to those unsupported characters in other BMP blocks.


* [Bug localedata/21091] ja_JP.UTF8: unexpected collation probably due to unsupported blocks
  2017-01-29  1:14 [Bug localedata/21091] New: Unexpected collation in ja_JP.UTF-8 probably due to unsupported blocks mh-sourceware at glandium dot org
  2017-01-29  1:31 ` [Bug localedata/21091] " mh-sourceware at glandium dot org
@ 2017-09-03 20:38 ` vapier at gentoo dot org
  2017-10-21  8:20 ` maiku.fabian at gmail dot com
  2018-05-08  9:28 ` maiku.fabian at gmail dot com
  3 siblings, 0 replies; 5+ messages in thread
From: vapier at gentoo dot org @ 2017-09-03 20:38 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=21091

Mike Frysinger <vapier at gentoo dot org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|Unexpected collation in     |ja_JP.UTF8: unexpected
                   |ja_JP.UTF-8 probably due to |collation probably due to
                   |unsupported blocks          |unsupported blocks


* [Bug localedata/21091] ja_JP.UTF8: unexpected collation probably due to unsupported blocks
  2017-01-29  1:14 [Bug localedata/21091] New: Unexpected collation in ja_JP.UTF-8 probably due to unsupported blocks mh-sourceware at glandium dot org
  2017-01-29  1:31 ` [Bug localedata/21091] " mh-sourceware at glandium dot org
  2017-09-03 20:38 ` [Bug localedata/21091] ja_JP.UTF8: unexpected collation " vapier at gentoo dot org
@ 2017-10-21  8:20 ` maiku.fabian at gmail dot com
  2018-05-08  9:28 ` maiku.fabian at gmail dot com
  3 siblings, 0 replies; 5+ messages in thread
From: maiku.fabian at gmail dot com @ 2017-10-21  8:20 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=21091

Mike FABIAN <maiku.fabian at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |maiku.fabian at gmail dot com


* [Bug localedata/21091] ja_JP.UTF8: unexpected collation probably due to unsupported blocks
  2017-01-29  1:14 [Bug localedata/21091] New: Unexpected collation in ja_JP.UTF-8 probably due to unsupported blocks mh-sourceware at glandium dot org
                   ` (2 preceding siblings ...)
  2017-10-21  8:20 ` maiku.fabian at gmail dot com
@ 2018-05-08  9:28 ` maiku.fabian at gmail dot com
  3 siblings, 0 replies; 5+ messages in thread
From: maiku.fabian at gmail dot com @ 2018-05-08  9:28 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=21091

Mike FABIAN <maiku.fabian at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Assignee|unassigned at sourceware dot org   |maiku.fabian at gmail dot com

