public inbox for libc-locales@sourceware.org
* [Bug localedata/25206] New: strcoll sort result incorrect locale lv_LV.UTF-8
From: carlos at redhat dot com @ 2019-11-19 14:13 UTC
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=25206

            Bug ID: 25206
           Summary: strcoll sort result incorrect locale lv_LV.UTF-8
           Product: glibc
           Version: 2.31
            Status: NEW
          Severity: normal
          Priority: P2
         Component: localedata
          Assignee: unassigned at sourceware dot org
          Reporter: carlos at redhat dot com
                CC: libc-locales at sourceware dot org
  Target Milestone: ---

Created attachment 12082
  --> https://sourceware.org/bugzilla/attachment.cgi?id=12082&action=edit
lv_LV.UTF-8.in_sorted

In the downstream bug report:
https://bugzilla.redhat.com/show_bug.cgi?id=1696729

The claim is made that lv_LV sorting is incorrect.

I have suggested that the following be sorted correctly:
https://sourceware.org/git/?p=glibc.git;a=blob;f=localedata/lv_LV.UTF-8.in;h=db7e83c77e83183ee88eb9769f82a66c4cb758ab;hb=HEAD

Then we can use this as a reference for discussion.
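
A sorted reference file of this kind can be checked mechanically against
strcoll.  The following is only a minimal sketch, not part of the glibc test
suite: the helper names are hypothetical, and it assumes the lv_LV.UTF-8
locale is installed (otherwise the current collation stays in effect).

```python
import locale

def use_locale(name="lv_LV.UTF-8"):
    """Try to switch LC_COLLATE to `name`; return True on success."""
    try:
        locale.setlocale(locale.LC_COLLATE, name)
        return True
    except locale.Error:
        # Locale not installed: the current collation stays active.
        return False

def first_out_of_order(lines, compare=locale.strcoll):
    """Return the index of the first line that sorts before its
    predecessor according to `compare`, or None if none does."""
    for i in range(1, len(lines)):
        if compare(lines[i - 1], lines[i]) > 0:
            return i
    return None
```

After a successful use_locale(), passing the lines of the reference file to
first_out_of_order() returns None when strcoll agrees with the file, and
otherwise the index of the first line that strcoll would have sorted earlier.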

-- 
You are receiving this mail because:
You are on the CC list for the bug.

* [Bug localedata/25206] strcoll sort result incorrect locale lv_LV.UTF-8
From: carlos at redhat dot com @ 2019-11-19 14:14 UTC
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=25206

Carlos O'Donell <carlos at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           See Also|                            |https://bugzilla.redhat.com
                   |                            |/show_bug.cgi?id=1696729

* [Bug localedata/25206] strcoll sort result incorrect locale lv_LV.UTF-8
From: carlos at redhat dot com @ 2019-11-19 14:14 UTC
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=25206

--- Comment #1 from Carlos O'Donell <carlos at redhat dot com> ---
Created attachment 12083
  --> https://sourceware.org/bugzilla/attachment.cgi?id=12083&action=edit
lv_LV.UTF-8_with_more_chars_and_removed.in

Additional sorted file.

* [Bug localedata/25206] strcoll sort result incorrect locale lv_LV.UTF-8
From: digitalfreak at lingonborough dot com @ 2019-11-25 11:20 UTC
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=25206

Rafal Luzynski <digitalfreak at lingonborough dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |digitalfreak@lingonborough.
                   |                            |com

--- Comment #2 from Rafal Luzynski <digitalfreak at lingonborough dot com> ---
This should probably be addressed to Agris; I hope he is able to read this
comment.

(In reply to Carlos O'Donell from comment #0)
> Created attachment 12082 [details]
> lv_LV.UTF-8.in_sorted
> 
> In the downstream bug report:
> https://bugzilla.redhat.com/show_bug.cgi?id=1696729
> [...]

These details look suspicious to me for the following reasons.

1. The quoted rule says that "the string with capital letter is preferred".
What does "preferred" mean?  To me it seems to mean that it should be sorted
first.  This rule would be difficult for us to implement because we would
probably have to reorder all letters.  I'm not sure this is worth the effort.
But then the sample sort file lists uppercase letters second, which is fine
with me but contradicts the rule.

2. My understanding of the rule quoted above is that it should be applied only
when the words differ solely in letter case: the letters should first be
compared ignoring case.  But the sample file sorts all uppercase words after
all lowercase words, for example:

a
ab
abc
ad
...
az
azzzxxyz
A
Abc
Az
AB

Is this really what we want?  I think this rule would be very inconvenient for
users because they would have to keep it in mind all the time and always
search for uppercase words after the corresponding lowercase ones.

3. I totally understand and agree with one point.  The letters 'a' and 'ā'
should be separated.  For example, now we have:

a
ā
ab
ācc
add
āfe
ah

but if I understand correctly this should be:

a
ab
add
ah
ā
ācc
āfe

This is understandable and easy to implement, but it means that we were all
wrong, and by "all" I mean including CLDR.  Before concluding that CLDR was
wrong I would like at least to see a ticket filed against CLDR and no
objection on their side.

4. Please note that the current sorting rules for lv_LV distinguish letters 'c'
vs 'č', 'g' vs 'ģ', 'k' vs 'ķ' and several more.  Do I understand correctly
that those letters work fine and the same rule should be applied to 'a' vs 'ā'
and several more characters?
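
The orderings discussed in points 2 and 3 can be reproduced with plain sort
keys.  The sketch below is an illustration only and does not use glibc
collation at all: the key function names are hypothetical, and the alphabet
string assumes the standard 33-letter Latvian alphabet.

```python
import unicodedata

# Assumption: the standard 33-letter Latvian alphabet as the tailored order.
LATVIAN = unicodedata.normalize("NFC", "aābcčdeēfgģhiījkķlļmnņoprsštuūvzž")
RANK = {ch: i for i, ch in enumerate(LATVIAN)}

def ranks(word):
    # Position of each letter in the Latvian alphabet.
    return [RANK[ch] for ch in unicodedata.normalize("NFC", word)]

def strip_marks(word):
    # Decompose and drop combining marks, so 'ā' becomes 'a'.
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def case_as_tiebreak(word):
    # Point 2 as read above: compare ignoring case first,
    # then let lowercase win among otherwise equal words.
    return (word.lower(), [ch.isupper() for ch in word])

def lowercase_letters_first(word):
    # What the sample file appears to do: every lowercase letter
    # sorts before every uppercase letter.
    return [(ch.isupper(), ch.lower()) for ch in word]

def marks_as_tiebreak(word):
    # Current behaviour in point 3: 'ā' differs from 'a' only secondarily.
    return (strip_marks(word), ranks(word))

def separate_letters(word):
    # Proposed behaviour: 'ā' is a letter of its own, after all 'a...' words.
    return ranks(word)

case_words = ["AB", "a", "Az", "abc", "A", "ab", "Abc", "ad", "az"]
mark_words = ["ah", "āfe", "a", "ācc", "ā", "add", "ab"]

print(sorted(case_words, key=case_as_tiebreak))
print(sorted(case_words, key=lowercase_letters_first))
print(sorted(mark_words, key=marks_as_tiebreak))
print(sorted(mark_words, key=separate_letters))
```

The second call reproduces the order of the sample file quoted in point 2,
and the last two calls reproduce the "now we have" and "this should be" lists
from point 3.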

(In reply to Carlos O'Donell from comment #1)
> Created attachment 12083 [details]
> lv_LV.UTF-8_with_more_chars_and_removed.in
> 
> Additional sorted file.

5. This does not look good to me, either.  Adding more test characters is a
good idea, but removing other test characters just because they are foreign to
Latvian is not.  The sorting rules must somehow deal with even the most exotic
characters.  This is why we aim to start the collation rules with
“copy "iso14651_t1"”, which aims to cover all Unicode characters, and only
then add the rules specific to the language in question.
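
The idea in point 5 (start from a shared default order, then tailor it) can be
modelled roughly as follows.  This is a toy sketch under loud assumptions: the
base table is just the ASCII lowercase letters and is NOT the real content of
iso14651_t1, the tailoring pairs are hypothetical, and the code point fallback
is only a stand-in for a complete default table.

```python
import string

# Stand-in for the shared default table (NOT the real iso14651_t1 data).
BASE_ORDER = list(string.ascii_lowercase)

# Hypothetical tailoring: place the first letter right after the second.
TAILORING = [("ā", "a"), ("č", "c"), ("ē", "e"), ("ģ", "g"), ("ī", "i"),
             ("ķ", "k"), ("ļ", "l"), ("ņ", "n"), ("š", "s"), ("ū", "u"),
             ("ž", "z")]

def build_order(base, tailoring):
    # Copy the base table, then insert each tailored letter after its anchor.
    order = list(base)
    for letter, anchor in tailoring:
        order.insert(order.index(anchor) + 1, letter)
    return order

TAILORED_ORDER = build_order(BASE_ORDER, TAILORING)
TAILORED_RANK = {ch: i for i, ch in enumerate(TAILORED_ORDER)}

def collation_key(word):
    # Letters in the table use their rank; any other character still gets
    # a well-defined position after them, ordered by code point.
    return [(0, TAILORED_RANK[ch]) if ch in TAILORED_RANK else (1, ord(ch))
            for ch in word.lower()]
```

With this key, letters foreign to Latvian such as 'x' or 'ω' still sort
deterministically instead of being rejected, which is why test characters
outside the Latvian alphabet remain meaningful.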


TL;DR: If adding a rule to distinguish 'a' vs 'ā' plus several more similar
characters is sufficient, then we can easily implement this, but the attached
test cases need to be fixed.  Otherwise we'll have to verify whether the
required rules are correct.

* [Bug localedata/25206] strcoll sort result incorrect locale lv_LV.UTF-8
From: maiku.fabian at gmail dot com @ 2024-02-07  8:45 UTC
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=25206

Mike FABIAN <maiku.fabian at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |DUPLICATE
                 CC|                            |maiku.fabian at gmail dot com

--- Comment #3 from Mike FABIAN <maiku.fabian at gmail dot com> ---
I will sync with CLDR again.

*** This bug has been marked as a duplicate of bug 23774 ***
