[Bug locale/30149] New: en_GB thinks "㐬" and "㒼" are the same character, other locales are fine

public inbox for glibc-bugs@sourceware.org

* [Bug locale/30149] New: en_GB thinks "㐬" and "㒼" are the same character, other locales are fine
From: infinity0 at pwned dot gg @ 2023-02-20 21:59 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=30149

            Bug ID: 30149
           Summary: en_GB thinks "㐬" and "㒼" are the same character, other
                    locales are fine
           Product: glibc
           Version: 2.36
            Status: UNCONFIRMED
          Severity: normal
          Priority: P2
         Component: locale
          Assignee: unassigned at sourceware dot org
          Reporter: infinity0 at pwned dot gg
  Target Milestone: ---

$ cat test2.txt
㐬
㒼

$ for l in C.UTF-8 en_GB.UTF-8 en_US.UTF-8 zh_CN.UTF-8; do echo $(LC_COLLATE=$l sort -u test2.txt | wc -l) $l; done
2 C.UTF-8
1 en_GB.UTF-8
2 en_US.UTF-8
2 zh_CN.UTF-8

Not sure why en_GB thinks it should be treating Chinese characters specially.

Other characters may be affected; I didn't do a thorough check across all
combinations of Unicode characters and locales.

-- 
You are receiving this mail because:
You are on the CC list for the bug.


* [Bug locale/30149] en_GB thinks "㐬" and "㒼" are the same character, other locales are fine
From: schwab@linux-m68k.org @ 2023-03-28 11:04 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=30149

Andreas Schwab <schwab@linux-m68k.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |maiku.fabian at gmail dot com

--- Comment #1 from Andreas Schwab <schwab@linux-m68k.org> ---
According to the comments in iso14651_t1_common the CJK Ideograph characters
should get weights algorithmically based on their Unicode codepoints, but that
is not implemented in glibc.  zh_CN uses the table from iso14651_t1_pinyin,
which puts a subset of the CJK Ideographs in a specific order.

Not sure why en_US.UTF-8 works differently for you; it should show the same
behaviour as all other non-zh_CN locales (C.UTF-8 is special as it uses strict
Unicode codepoint order).
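
To illustrate what "weights algorithmically based on their Unicode codepoints"
could look like, here is a minimal Python sketch of the implicit-weight
derivation from the Unicode Collation Algorithm (UTS #10). It assumes that this
is the scheme the iso14651_t1_common comments refer to; it is not glibc code,
the function name is made up for illustration, and only the two ideograph
ranges relevant here are special-cased.

```python
# Simplified sketch of UCA-style implicit weights (UTS #10).
# Assumption: only two CJK ranges are distinguished; every other code
# point gets the generic base, which is a simplification.

def implicit_weights(ch):
    """Return the (AAAA, BBBB) weight pair derived from the code point."""
    cp = ord(ch)
    if 0x4E00 <= cp <= 0x9FFF:      # CJK Unified Ideographs block
        base = 0xFB40
    elif 0x3400 <= cp <= 0x4DBF:    # CJK Unified Ideographs Extension A
        base = 0xFB80
    else:                           # everything else (simplified)
        base = 0xFBC0
    aaaa = base + (cp >> 15)
    bbbb = (cp & 0x7FFF) | 0x8000
    return (aaaa, bbbb)

for ch in "㐬㒼":
    aaaa, bbbb = implicit_weights(ch)
    print(f"U+{ord(ch):04X} -> [{aaaa:04X}] [{bbbb:04X}]")
```

Because BBBB carries the low bits of the code point, two distinct ideographs
always end up with distinct weight pairs under this scheme.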


* [Bug localedata/30149] en_GB thinks "㐬" and "㒼" are the same character, other locales are fine
From: schwab@linux-m68k.org @ 2023-03-28 11:06 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=30149

Andreas Schwab <schwab@linux-m68k.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|locale                      |localedata
                 CC|                            |libc-locales at sourceware dot org


* [Bug localedata/30149] en_GB thinks "㐬" and "㒼" are the same character, other locales are fine
From: infinity0 at pwned dot gg @ 2023-04-14 13:46 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=30149

--- Comment #2 from infinity0 at pwned dot gg ---
Thanks for the investigation & explanation. However I can confirm again that
en_US.UTF-8 does indeed work as above for me, giving the correct answer unlike
en_GB which gives the incorrect answer as reported. In fact every other locale
works correctly, except en_GB and surprisingly zh_TW as well:

$ for l in /usr/share/locale/*; do echo $(LC_COLLATE=$(basename $l).UTF-8 sort -u test2.txt | wc -l) $l; done | grep -v ^2
1 /usr/share/locale/en_GB
1 /usr/share/locale/zh_TW

Furthermore, other characters I tried are fine, giving the expected "2" even in
en_GB and zh_TW.

Granted, 㐬 and 㒼 are rare characters mostly used for linguistic study and not
everyday use; however, they are still certainly distinct characters, and the
locale and/or sorting is buggy.

Are you able to replicate this? Feel free to ask me to run stuff on my computer
to debug further, in case my results are different from yours.


* [Bug localedata/30149] en_GB thinks "㐬" and "㒼" are the same character, other locales are fine
From: infinity0 at pwned dot gg @ 2023-04-14 13:56 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=30149

--- Comment #3 from infinity0 at pwned dot gg ---
Oh, looks like I am only generating certain locales in /etc/locale.gen.
Uncommenting en_US gives the result 1 as you were saying.
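
The earlier "2" for en_US is consistent with an ungenerated locale quietly
behaving like the default C locale. A rough Python sketch of how to check that
a locale is actually installed before trusting a collation result follows; the
helper names are hypothetical, and it relies only on the standard `locale`
module.

```python
import locale

def collate_locale_available(name):
    """Return True if LC_COLLATE can actually be switched to `name`."""
    saved = locale.setlocale(locale.LC_COLLATE)      # query current setting
    try:
        locale.setlocale(locale.LC_COLLATE, name)
        return True
    except locale.Error:                             # locale not installed
        return False
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)   # always restore

def collate_equal(name, a, b):
    """True/False if a and b collate as equal in `name`; None if unavailable."""
    if not collate_locale_available(name):
        return None
    saved = locale.setlocale(locale.LC_COLLATE)
    try:
        locale.setlocale(locale.LC_COLLATE, name)
        return locale.strcoll(a, b) == 0
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)

if __name__ == "__main__":
    for name in ("C.UTF-8", "en_GB.UTF-8", "en_US.UTF-8", "zh_CN.UTF-8"):
        print(name, collate_equal(name, "㐬", "㒼"))
```

A result of None means the locale is missing on that machine, so no
conclusion about its collation rules can be drawn from the run.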

However I'd like to reiterate that even for the broken locales, only a few
characters are affected, which is surprising behaviour and hides the
brokenness.


* [Bug localedata/30149] en_GB thinks "㐬" and "㒼" are the same character, other locales are fine
From: carlos at redhat dot com @ 2023-04-14 16:34 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=30149

Carlos O'Donell <carlos at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |NOTABUG
                 CC|                            |carlos at redhat dot com
             Status|UNCONFIRMED                 |RESOLVED

--- Comment #4 from Carlos O'Donell <carlos at redhat dot com> ---
(In reply to infinity0 from comment #3)
> However I'd like to reiterate that even for the broken locales, only a few
> characters are affected, which is surprising behaviour and hides the
> brokenness.

Just for clarity, the locales aren't broken.

Collation is going to vary by locale.

The collation of CJK ideographs in en_GB and en_US is not defined since the
collation entries do not map the code points to a given weight.

This lack of a weight means that the characters with no weights map to the same
weight and are considered the same for the purposes of collation. This is how
we have always handled collation for characters with unknown ordering. They
collate the same, and so a "unique" removal based on collation will collapse
all such characters.
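
The collapsing effect described above can be modelled with a toy table in
Python. This is only a sketch: the table, the shared UNDEFINED weight, and the
function names are invented for illustration and do not reflect glibc's
internal data structures.

```python
# Toy model: a collation table that only knows a few Latin letters.
WEIGHTS = {"a": 1, "b": 2, "c": 3}
UNDEFINED = 0   # hypothetical shared weight for every unmapped character

def collation_key(s):
    """Map each character to its weight; unmapped ones all share UNDEFINED."""
    return tuple(WEIGHTS.get(ch, UNDEFINED) for ch in s)

def sort_unique(lines):
    """Mimic `sort -u`: keep one line per distinct collation key."""
    seen = {}
    for line in lines:
        seen.setdefault(collation_key(line), line)
    return [seen[k] for k in sorted(seen)]

print(sort_unique(["㐬", "㒼"]))   # one line survives
print(sort_unique(["a", "b"]))     # both lines survive
```

Both ideographs produce the same key in this model, so a uniqueness pass that
compares keys keeps only the first of them.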

It would be an RFE to treat all characters without weights (we'd have to
identify them specially) as if they were sorted by the internal code point
(like in C.UTF-8). This would be special and would give them a consistent
ordering, but it is certainly a deviation from what we've been doing and would
need careful review.
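
As a rough Python sketch of what such an RFE might amount to: characters with
a table entry keep their weight, and characters without one fall back to their
code point. The table is a toy, and placing unweighted characters after all
weighted ones is an assumption made here, not something decided in this bug.

```python
WEIGHTS = {"a": 1, "b": 2, "c": 3}   # toy table, not glibc data

def key_with_fallback(s):
    """Weighted characters order by weight; unweighted ones by code point."""
    return tuple(
        (0, WEIGHTS[ch]) if ch in WEIGHTS else (1, ord(ch))
        for ch in s
    )

lines = ["㒼", "b", "㐬", "a"]
print(sorted(set(lines), key=key_with_fallback))
```

With this key, distinct unweighted characters never compare equal, so a
uniqueness pass based on it would keep both ideographs.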

I'm marking this as RESOLVED/NOTABUG.


* [Bug localedata/30149] en_GB thinks "㐬" and "㒼" are the same character, other locales are fine
From: infinity0 at pwned dot gg @ 2023-04-14 20:51 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=30149

infinity0 at pwned dot gg changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |UNCONFIRMED
         Resolution|NOTABUG                     |---

--- Comment #5 from infinity0 at pwned dot gg ---
> [..] characters with unknown ordering. They collate the same, and so a "unique" removal based on collation will collapse all such characters.

These characters don't have "unknown ordering", they (quoting above) "should
get weights algorithmically based on their Unicode codepoints, but that is not
implemented in glibc."

In other words, it's a bug that needs to be fixed. What you mentioned,
"characters with unknown ordering", is a separate issue.
