* [Bug localedata/21091] New: Unexpected collation in ja_JP.UTF-8 probably due to unsupported blocks
From: mh-sourceware at glandium dot org @ 2017-01-29 1:14 UTC
To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=21091

            Bug ID: 21091
           Summary: Unexpected collation in ja_JP.UTF-8 probably due to
                    unsupported blocks
           Product: glibc
           Version: 2.24
            Status: UNCONFIRMED
          Severity: normal
          Priority: P2
         Component: localedata
          Assignee: unassigned at sourceware dot org
          Reporter: mh-sourceware at glandium dot org
                CC: libc-locales at sourceware dot org
  Target Milestone: ---

I was doing some scripting around some subset of the data in
https://github.com/cjkvi/cjkvi-ids/blob/master/ids.txt. I ended up doing things
like sort | uniq -d, both of which use strcoll. My system locale is ja_JP.UTF-8,
and that led to surprising results.

I realize that my intent was actually not to follow collation rules, but that
still left me wondering if collation was right in glibc. So I created a
LD_PRELOAD library that redirects strcoll to ICU's ucol_strcoll and compared the
outputs. They were very different.

So I dug further, and found that:

All characters in CJK Unified Ideographs Extension A are considered equal.
All characters in CJK Unified Ideographs Extension B are considered equal.
All characters in CJK Unified Ideographs Extension C are considered equal.
All characters in CJK Unified Ideographs Extension D are considered equal.
All characters in CJK Unified Ideographs Extension E are considered equal.
All characters in CJK Radicals Supplement are considered equal.
All characters in Kangxi Radicals are considered equal.
All characters in CJK Strokes are considered equal.
All characters in Enclosed CJK Letters and Months are considered equal.
All characters in CJK Compatibility are considered equal.
All characters in CJK Compatibility Ideographs are considered equal.
All characters in CJK Compatibility Forms are considered equal.
All characters in Enclosed Ideographic Supplement are considered equal.
All characters in CJK Compatibility Ideographs Supplement are considered equal.

More than that, all the characters in the blocks above with codepoints below
0x10000 are considered equal to one another, and all the characters in the
blocks above with codepoints above 0x10000 are considered equal to one another.

All in all, it would seem all unsupported characters in the BMP are equal, and
all unsupported characters in other Unicode planes are equal.

With new Unicode versions adding new characters, it seems to me it would be
better if unsupported characters were considered different as a general rule.

Obviously, it would be better if the above blocks were supported.

(This is with libc 2.24-9 from Debian)

-- 
You are receiving this mail because:
You are on the CC list for the bug.
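The observation above can be checked with a short script. What follows is a minimal sketch, not the reporter's actual method (which used an LD_PRELOAD shim around ICU's ucol_strcoll): it assumes Python 3 on a glibc system, relies on Python's locale.strcoll deferring to the C library's collation, and uses a hypothetical helper name, collate, and a hand-picked set of sample codepoints. A result of 0 for a pair means the C library treats the two characters as equal under that locale; results will depend on the installed glibc version and locale data.

```python
import locale

def collate(a, b, name="ja_JP.UTF-8"):
    """Return locale.strcoll(a, b) under locale `name`.

    Returns None if the locale is not installed. The previous
    LC_COLLATE setting is restored before returning.
    """
    saved = locale.setlocale(locale.LC_COLLATE)  # query current setting
    try:
        locale.setlocale(locale.LC_COLLATE, name)
    except locale.Error:
        return None
    try:
        return locale.strcoll(a, b)
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)

# Sample pairs chosen for illustration (an assumption, not from the report):
# distinct characters from the blocks the report lists.
SAMPLES = [
    ("Extension A vs Extension A", "\u3400", "\u3401"),
    ("CJK Radicals Supplement vs Kangxi Radicals", "\u2e80", "\u2f00"),
    ("Extension B vs Extension C", "\U00020000", "\U0002a700"),
]

if __name__ == "__main__":
    for label, a, b in SAMPLES:
        result = collate(a, b)
        if result is None:
            print(f"{label}: ja_JP.UTF-8 locale not installed")
        else:
            # 0 means the two distinct characters compare equal
            print(f"{label}: strcoll = {result}")
```

For scripts where codepoint order is what is actually wanted, as in the sort | uniq -d case, passing "C" as the locale name to the helper gives a plain codepoint comparison.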
* [Bug localedata/21091] Unexpected collation in ja_JP.UTF-8 probably due to unsupported blocks
From: mh-sourceware at glandium dot org @ 2017-01-29 1:31 UTC
To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=21091

--- Comment #1 from Mike Hommey <mh-sourceware at glandium dot org> ---
Mmmm, in fact, it looks like some characters in the CJK Unified Ideographs block
are equal to those unsupported characters in other BMP blocks.
* [Bug localedata/21091] ja_JP.UTF8: unexpected collation probably due to unsupported blocks
From: vapier at gentoo dot org @ 2017-09-03 20:38 UTC
To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=21091

Mike Frysinger <vapier at gentoo dot org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|Unexpected collation in     |ja_JP.UTF8: unexpected
                   |ja_JP.UTF-8 probably due to |collation probably due to
                   |unsupported blocks          |unsupported blocks
* [Bug localedata/21091] ja_JP.UTF8: unexpected collation probably due to unsupported blocks
From: maiku.fabian at gmail dot com @ 2017-10-21 8:20 UTC
To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=21091

Mike FABIAN <maiku.fabian at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |maiku.fabian at gmail dot com
* [Bug localedata/21091] ja_JP.UTF8: unexpected collation probably due to unsupported blocks
From: maiku.fabian at gmail dot com @ 2018-05-08 9:28 UTC
To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=21091

Mike FABIAN <maiku.fabian at gmail dot com> changed:

           What    |Removed                          |Added
----------------------------------------------------------------------------
           Assignee|unassigned at sourceware dot org |maiku.fabian at gmail dot com