[Bug localedata/30149] en_GB thinks "㐬" and "㒼" are the same character, other locales are fine

public inbox for libc-locales@sourceware.org

* [Bug localedata/30149] en_GB thinks "㐬" and "㒼" are the same character, other locales are fine
       [not found] <bug-30149-716@http.sourceware.org/bugzilla/>
@ 2023-03-28 11:06 ` schwab@linux-m68k.org
  2023-04-14 13:46 ` infinity0 at pwned dot gg
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 5+ messages in thread
From: schwab@linux-m68k.org @ 2023-03-28 11:06 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=30149

Andreas Schwab <schwab@linux-m68k.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|locale                      |localedata
                 CC|                            |libc-locales at sourceware dot org

-- 
You are receiving this mail because:
You are on the CC list for the bug.


* [Bug localedata/30149] en_GB thinks "㐬" and "㒼" are the same character, other locales are fine
       [not found] <bug-30149-716@http.sourceware.org/bugzilla/>
  2023-03-28 11:06 ` [Bug localedata/30149] en_GB thinks "㐬" and "㒼" are the same character, other locales are fine schwab@linux-m68k.org
@ 2023-04-14 13:46 ` infinity0 at pwned dot gg
  2023-04-14 13:56 ` infinity0 at pwned dot gg
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 5+ messages in thread
From: infinity0 at pwned dot gg @ 2023-04-14 13:46 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=30149

--- Comment #2 from infinity0 at pwned dot gg ---
Thanks for the investigation & explanation. However I can confirm again that
en_US.UTF-8 does indeed work as above for me, giving the correct answer unlike
en_GB which gives the incorrect answer as reported. In fact every other locale
works correctly, except en_GB and surprisingly zh_TW as well:

$ for l in /usr/share/locale/*; do echo $(LC_COLLATE=$(basename $l).UTF-8 sort -u test2.txt | wc -l) $l; done | grep -v ^2
1 /usr/share/locale/en_GB
1 /usr/share/locale/zh_TW

Furthermore, other characters I tried are fine, giving the expected "2" even in
en_GB and zh_TW.

Granted, 㐬 and 㒼 are rare characters, mostly used for linguistic study rather
than everyday use; however, they are still certainly distinct characters, and
the locale and/or sorting is buggy.

Are you able to replicate this? Feel free to ask me to run stuff on my computer
to debug further, in case my results are different from yours.


* [Bug localedata/30149] en_GB thinks "㐬" and "㒼" are the same character, other locales are fine
       [not found] <bug-30149-716@http.sourceware.org/bugzilla/>
  2023-03-28 11:06 ` [Bug localedata/30149] en_GB thinks "㐬" and "㒼" are the same character, other locales are fine schwab@linux-m68k.org
  2023-04-14 13:46 ` infinity0 at pwned dot gg
@ 2023-04-14 13:56 ` infinity0 at pwned dot gg
  2023-04-14 16:34 ` carlos at redhat dot com
  2023-04-14 20:51 ` infinity0 at pwned dot gg
  4 siblings, 0 replies; 5+ messages in thread
From: infinity0 at pwned dot gg @ 2023-04-14 13:56 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=30149

--- Comment #3 from infinity0 at pwned dot gg ---
Oh, looks like I am only generating certain locales in /etc/locale.gen.
Uncommenting en_US gives the result 1 as you were saying.

However I'd like to reiterate that even for the broken locales, only a few
characters are affected, which is surprising behaviour and hides the
brokenness.
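
The difference between the results in comments #2 and #3 is consistent with
how locale selection fails: if the requested locale has not been generated,
setlocale() fails and sort stays in the C locale, which compares bytes and so
reports 2. Below is a minimal sketch of a more reliable check. It assumes a
glibc system where `locale -a` lists the generated locales, and that test2.txt
holds the two characters one per line (the thread does not show the file). The
helper name count_unique is hypothetical.

```shell
#!/bin/sh
# Assumption: test2.txt holds the two characters from the report, one per line.
printf '%s\n' '㐬' '㒼' > test2.txt

# Hypothetical helper: number of lines that survive `sort -u` under locale $1.
# LC_ALL is used so that an LC_ALL already set in the environment cannot
# override the locale under test.
count_unique() {
    LC_ALL="$1" sort -u test2.txt | wc -l
}

# Iterate over locales the C library can actually load, not over
# /usr/share/locale, whose entries need not correspond to generated locales.
for loc in $(locale -a 2>/dev/null | grep -iE 'utf-?8$'); do
    echo "$(count_unique "$loc") $loc"
done

# Baseline: the C locale compares bytes, so both lines survive.
echo "$(count_unique C) C"
```

A locale that prints 1 here treats the two lines as equal for collation
purposes; which locales do so will depend on the installed glibc version and
locale data.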


* [Bug localedata/30149] en_GB thinks "㐬" and "㒼" are the same character, other locales are fine
       [not found] <bug-30149-716@http.sourceware.org/bugzilla/>
                   ` (2 preceding siblings ...)
  2023-04-14 13:56 ` infinity0 at pwned dot gg
@ 2023-04-14 16:34 ` carlos at redhat dot com
  2023-04-14 20:51 ` infinity0 at pwned dot gg
  4 siblings, 0 replies; 5+ messages in thread
From: carlos at redhat dot com @ 2023-04-14 16:34 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=30149

Carlos O'Donell <carlos at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |NOTABUG
                 CC|                            |carlos at redhat dot com
             Status|UNCONFIRMED                 |RESOLVED

--- Comment #4 from Carlos O'Donell <carlos at redhat dot com> ---
(In reply to infinity0 from comment #3)
> However I'd like to reiterate that even for the broken locales, only a few
> characters are affected, which is surprising behaviour and hides the
> brokenness.

Just for clarity, the locales aren't broken.

Collation is going to vary by locale.

The collation of CJK ideographs in en_GB and en_US is not defined since the
collation entries do not map the code points to a given weight.

This lack of a weight means that the characters with no weights map to the same
weight and are considered the same for the purposes of collation. This is how
we have always handled collation for characters with unknown ordering. They
collate the same, and so a "unique" removal based on collation will collapse
all such characters.

It would be an RFE to treat all characters without weights (we'd have to
identify them specially) as if they were sorted by the internal code point
(like in C.UTF-8). This would be special and give them a consistent ordering,
but it is certainly a deviation from what we've been doing and would need
careful review.
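
As an illustration only (this is not the RFE itself, and not something proposed
in this thread), a caller who needs byte-distinct lines to survive can separate
deduplication from ordering at the command line. The sketch below assumes
UTF-8 input; the file name test3.txt and the function name dedupe_then_sort
are hypothetical.

```shell
#!/bin/sh
# Hypothetical sample input: one line appears twice, the other once.
printf '%s\n' '㒼' '㐬' '㒼' > test3.txt

# Step 1: deduplicate in the C locale, where lines are dropped only when
#         they are byte-for-byte identical.
# Step 2: order the survivors using whatever locale is in effect.
dedupe_then_sort() {
    LC_ALL=C sort -u "$1" | sort
}

dedupe_then_sort test3.txt
```

Because the deduplication step never consults collation weights, two lines
survive regardless of the locale; only their relative order is left to the
locale's collation rules.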

I'm marking this as RESOLVED/NOTABUG.


* [Bug localedata/30149] en_GB thinks "㐬" and "㒼" are the same character, other locales are fine
       [not found] <bug-30149-716@http.sourceware.org/bugzilla/>
                   ` (3 preceding siblings ...)
  2023-04-14 16:34 ` carlos at redhat dot com
@ 2023-04-14 20:51 ` infinity0 at pwned dot gg
  4 siblings, 0 replies; 5+ messages in thread
From: infinity0 at pwned dot gg @ 2023-04-14 20:51 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=30149

infinity0 at pwned dot gg changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |UNCONFIRMED
         Resolution|NOTABUG                     |---

--- Comment #5 from infinity0 at pwned dot gg ---
> [..] characters with unknown ordering. They collate the same, and so a "unique" removal based on collation will collapse all such characters.

These characters don't have "unknown ordering", they (quoting above) "should
get weights algorithmically based on their Unicode codepoints, but that is not
implemented in glibc."

In other words, it's a bug that needs to be fixed. What you mentioned,
"characters with unknown ordering", is a separate issue.

