* [Bug locale/30149] en_GB thinks "㐬" and "㒼" are the same character, other locales are fine
2023-02-20 21:59 [Bug locale/30149] New: en_GB thinks "㐬" and "㒼" are the same character, other locales are fine infinity0 at pwned dot gg
@ 2023-03-28 11:04 ` schwab@linux-m68k.org
2023-03-28 11:06 ` [Bug localedata/30149] " schwab@linux-m68k.org
` (4 subsequent siblings)
5 siblings, 0 replies; 7+ messages in thread
From: schwab@linux-m68k.org @ 2023-03-28 11:04 UTC (permalink / raw)
To: glibc-bugs
https://sourceware.org/bugzilla/show_bug.cgi?id=30149
Andreas Schwab <schwab@linux-m68k.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |maiku.fabian at gmail dot com
--- Comment #1 from Andreas Schwab <schwab@linux-m68k.org> ---
According to the comments in iso14651_t1_common the CJK Ideograph characters
should get weights algorithmically based on their Unicode codepoints, but that
is not implemented in glibc. zh_CN uses the table from iso14651_t1_pinyin,
which puts a subset of the CJK Ideographs in a specific order.
Not sure why en_US.UTF-8 works differently for you; it should show the same
behaviour as all other non-zh_CN locales (C.UTF-8 is special as it uses strict
Unicode codepoint order).
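For illustration only, one known way to derive weights from code points is the
implicit-weight scheme of the Unicode Collation Algorithm (UTS #10). The Python
sketch below is not glibc code: the function name is hypothetical, and the
block ranges are a reduced assumption (only the core CJK block and extensions
A and B are distinguished, and unassigned code points inside those ranges are
not filtered out).

```python
def implicit_weights(ch):
    """Return the (AAAA, BBBB) primary weight pair for a single character,
    following the implicit-weight formula in UTS #10."""
    cp = ord(ch)
    if 0x4E00 <= cp <= 0x9FFF:
        base = 0xFB40  # core CJK Unified Ideographs block
    elif 0x3400 <= cp <= 0x4DBF or 0x20000 <= cp <= 0x2A6DF:
        base = 0xFB80  # extensions A and B (later extensions omitted here)
    else:
        base = 0xFBC0  # any other code point lacking an explicit weight
    # High bits go into the first weight, low 15 bits into the second.
    return (base + (cp >> 15), (cp & 0x7FFF) | 0x8000)


if __name__ == "__main__":
    for ch in sorted(["㒼", "㐬"], key=implicit_weights):
        first, second = implicit_weights(ch)
        print(f"U+{ord(ch):04X} -> {first:04X} {second:04X}")
```

Under such a scheme two different ideographs can never share a weight, because
the pair encodes the full code point.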
--
You are receiving this mail because:
You are on the CC list for the bug.
* [Bug localedata/30149] en_GB thinks "㐬" and "㒼" are the same character, other locales are fine
2023-02-20 21:59 [Bug locale/30149] New: en_GB thinks "㐬" and "㒼" are the same character, other locales are fine infinity0 at pwned dot gg
2023-03-28 11:04 ` [Bug locale/30149] " schwab@linux-m68k.org
@ 2023-03-28 11:06 ` schwab@linux-m68k.org
2023-04-14 13:46 ` infinity0 at pwned dot gg
` (3 subsequent siblings)
5 siblings, 0 replies; 7+ messages in thread
From: schwab@linux-m68k.org @ 2023-03-28 11:06 UTC (permalink / raw)
To: glibc-bugs
https://sourceware.org/bugzilla/show_bug.cgi?id=30149
Andreas Schwab <schwab@linux-m68k.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Component|locale |localedata
CC| |libc-locales at sourceware dot org
* [Bug localedata/30149] en_GB thinks "㐬" and "㒼" are the same character, other locales are fine
2023-02-20 21:59 [Bug locale/30149] New: en_GB thinks "㐬" and "㒼" are the same character, other locales are fine infinity0 at pwned dot gg
2023-03-28 11:04 ` [Bug locale/30149] " schwab@linux-m68k.org
2023-03-28 11:06 ` [Bug localedata/30149] " schwab@linux-m68k.org
@ 2023-04-14 13:46 ` infinity0 at pwned dot gg
2023-04-14 13:56 ` infinity0 at pwned dot gg
` (2 subsequent siblings)
5 siblings, 0 replies; 7+ messages in thread
From: infinity0 at pwned dot gg @ 2023-04-14 13:46 UTC (permalink / raw)
To: glibc-bugs
https://sourceware.org/bugzilla/show_bug.cgi?id=30149
--- Comment #2 from infinity0 at pwned dot gg ---
Thanks for the investigation & explanation. However I can confirm again that
en_US.UTF-8 does indeed work as above for me, giving the correct answer unlike
en_GB which gives the incorrect answer as reported. In fact every other locale
works correctly, except en_GB and surprisingly zh_TW as well:
$ for l in /usr/share/locale/*; do echo $(LC_COLLATE=$(basename $l).UTF-8 sort -u test2.txt | wc -l) $l; done | grep -v ^2
1 /usr/share/locale/en_GB
1 /usr/share/locale/zh_TW
Furthermore, other characters I tried are fine, giving the expected "2" even in
en_GB and zh_TW.
Granted, 㐬 and 㒼 are rare characters mostly used for linguistic study purposes and
not everyday use, however they are still certainly distinct characters and the
locale and/or sorting is buggy.
Are you able to replicate this? Feel free to ask me to run stuff on my computer
to debug further, in case my results are different from yours.
* [Bug localedata/30149] en_GB thinks "㐬" and "㒼" are the same character, other locales are fine
2023-02-20 21:59 [Bug locale/30149] New: en_GB thinks "㐬" and "㒼" are the same character, other locales are fine infinity0 at pwned dot gg
` (2 preceding siblings ...)
2023-04-14 13:46 ` infinity0 at pwned dot gg
@ 2023-04-14 13:56 ` infinity0 at pwned dot gg
2023-04-14 16:34 ` carlos at redhat dot com
2023-04-14 20:51 ` infinity0 at pwned dot gg
5 siblings, 0 replies; 7+ messages in thread
From: infinity0 at pwned dot gg @ 2023-04-14 13:56 UTC (permalink / raw)
To: glibc-bugs
https://sourceware.org/bugzilla/show_bug.cgi?id=30149
--- Comment #3 from infinity0 at pwned dot gg ---
Oh, looks like I am only generating certain locales in /etc/locale.gen.
Uncommenting en_US gives the result 1 as you were saying.
However I'd like to reiterate that even for the broken locales, only a few
characters are affected, which is surprising behaviour and hides the
brokenness.
* [Bug localedata/30149] en_GB thinks "㐬" and "㒼" are the same character, other locales are fine
2023-02-20 21:59 [Bug locale/30149] New: en_GB thinks "㐬" and "㒼" are the same character, other locales are fine infinity0 at pwned dot gg
` (3 preceding siblings ...)
2023-04-14 13:56 ` infinity0 at pwned dot gg
@ 2023-04-14 16:34 ` carlos at redhat dot com
2023-04-14 20:51 ` infinity0 at pwned dot gg
5 siblings, 0 replies; 7+ messages in thread
From: carlos at redhat dot com @ 2023-04-14 16:34 UTC (permalink / raw)
To: glibc-bugs
https://sourceware.org/bugzilla/show_bug.cgi?id=30149
Carlos O'Donell <carlos at redhat dot com> changed:
What |Removed |Added
----------------------------------------------------------------------------
Resolution|--- |NOTABUG
CC| |carlos at redhat dot com
Status|UNCONFIRMED |RESOLVED
--- Comment #4 from Carlos O'Donell <carlos at redhat dot com> ---
(In reply to infinity0 from comment #3)
> However I'd like to reiterate that even for the broken locales, only a few
> characters are affected, which is surprising behaviour and hides the
> brokenness.
Just for clarity, the locales aren't broken.
Collation is going to vary by locale.
The collation of CJK ideographs in en_GB and en_US is not defined since the
collation entries do not map the code points to a given weight.
This lack of a weight means that the characters with no weights map to the same
weight and are considered the same for the purposes of collation. This is how
we have always handled collation for characters with unknown ordering. They
collate the same, and so a "unique" removal based on collation will collapse
all such characters.
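The collapsing effect described above can be modelled with a toy example. This
Python sketch does not use glibc or real locale data: the WEIGHTS table, the
shared UNDEFINED weight, and both function names are assumptions made for
illustration, and which of the equal lines survives is an arbitrary choice of
the sketch.

```python
# Hypothetical, heavily reduced weight table; real locale data is far larger.
WEIGHTS = {"a": 1, "b": 2, "c": 3}
UNDEFINED = 0  # assumption: every character missing from the table shares this


def collation_key(s):
    """Map each character to its weight; unlisted characters all look alike."""
    return tuple(WEIGHTS.get(ch, UNDEFINED) for ch in s)


def sort_unique(lines):
    """Rough model of `sort -u`: keep one line per distinct collation key."""
    seen = {}
    for line in lines:
        seen.setdefault(collation_key(line), line)
    return [seen[key] for key in sorted(seen)]


if __name__ == "__main__":
    print(len(sort_unique(["a", "b"])))    # both characters have weights
    print(len(sort_unique(["㐬", "㒼"])))  # neither character has a weight
```

In this model the two ideographs produce identical keys, so the unique step
keeps only one of them, while characters present in the table stay distinct.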
It would be an RFE to treat all characters without weights (we'd have to
identify them specially) as if they were sorted by the internal code point
(like in C.UTF-8). This would be special and give them a consistent ordering,
but it is certainly a deviation from what we've been doing and would need
careful review.
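A minimal sketch of what such a fallback might look like, again as a toy Python
model rather than glibc code. The WEIGHTS table and function name are
hypothetical, and placing unweighted characters after all weighted ones is an
assumption of this sketch, not something stated above.

```python
# Hypothetical, heavily reduced weight table standing in for locale data.
WEIGHTS = {"a": 1, "b": 2, "c": 3}


def collation_key(s):
    """Use the table weight when present, otherwise fall back to the code point.

    The leading 0/1 tag keeps the two kinds of weight from being compared
    with each other: tagged-1 (fallback) entries sort after tagged-0 ones.
    """
    return tuple(
        (0, WEIGHTS[ch]) if ch in WEIGHTS else (1, ord(ch)) for ch in s
    )


if __name__ == "__main__":
    print(sorted(["㒼", "b", "㐬", "a"], key=collation_key))
    print(collation_key("㐬") == collation_key("㒼"))
```

With this key, characters without table weights remain distinct and get a
stable order, so a unique step based on the key would no longer merge them.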
I'm marking this as RESOLVED/NOTABUG.
--
You are receiving this mail because:
You are on the CC list for the bug.
^ permalink raw reply [flat|nested] 7+ messages in thread
* [Bug localedata/30149] en_GB thinks "㐬" and "㒼" are the same character, other locales are fine
2023-02-20 21:59 [Bug locale/30149] New: en_GB thinks "㐬" and "㒼" are the same character, other locales are fine infinity0 at pwned dot gg
` (4 preceding siblings ...)
2023-04-14 16:34 ` carlos at redhat dot com
@ 2023-04-14 20:51 ` infinity0 at pwned dot gg
5 siblings, 0 replies; 7+ messages in thread
From: infinity0 at pwned dot gg @ 2023-04-14 20:51 UTC (permalink / raw)
To: glibc-bugs
https://sourceware.org/bugzilla/show_bug.cgi?id=30149
infinity0 at pwned dot gg changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|RESOLVED |UNCONFIRMED
Resolution|NOTABUG |---
--- Comment #5 from infinity0 at pwned dot gg ---
> [..] characters with unknown ordering. They collate the same, and so a "unique" removal based on collation will collapse all such characters.
These characters don't have "unknown ordering", they (quoting above) "should
get weights algorithmically based on their Unicode codepoints, but that is not
implemented in glibc."
In other words, it's a bug that needs to be fixed. What you mentioned,
"characters with unknown ordering", is a separate issue.