public inbox for libc-locales@sourceware.org
* [Bug localedata/23774] New: lv_LV collates Y/y incorrectly
@ 2018-10-14  8:58 danko at very dot lv
  2018-10-15 12:11 ` [Bug localedata/23774] " maiku.fabian at gmail dot com
                   ` (10 more replies)
  0 siblings, 11 replies; 12+ messages in thread
From: danko at very dot lv @ 2018-10-14  8:58 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=23774

            Bug ID: 23774
           Summary: lv_LV collates Y/y incorrectly
           Product: glibc
           Version: unspecified
            Status: UNCONFIRMED
          Severity: minor
          Priority: P2
         Component: localedata
          Assignee: unassigned at sourceware dot org
          Reporter: danko at very dot lv
                CC: libc-locales at sourceware dot org, maiku.fabian at gmail dot com
  Target Milestone: ---

Commit 159738548130d5ac4fe6178977e940ed5f8cfdc4 introduced this change in the
lv_LV locale:

-<U0079> <i>;<PCL>;<MIN>;IGNORE % y
-<U0059> <i>;<PCL>;<CAP>;IGNORE % Y
+<U0079> <S0069>;<LOWLINE>;<MIN>;IGNORE % y
+<U0059> <S0069>;<LOWLINE>;<CAP>;IGNORE % Y

I don't know what "PCL" meant and whether "Y" was supposed to be "BASE" in the
first place, but "LOWLINE" certainly looks like a bug.

Letter Y is not present in the Latvian alphabet, however it is present in
Latgalian and is located after I, which is what the CLDR rule seems to suggest:

&I<<y<<<Y

I found this by accident while investigating the result of this command on my
system (with LANG being lv_LV.UTF-8)

$ echo abcxyz | grep -Eo '[a-z]+'
abcx
z

I'm sorry if I misunderstood something as I've never worked with either glibc
or CLDR locales directly before.

-- 
You are receiving this mail because:
You are on the CC list for the bug.


* [Bug localedata/23774] lv_LV collates Y/y incorrectly
  2018-10-14  8:58 [Bug localedata/23774] New: lv_LV collates Y/y incorrectly danko at very dot lv
@ 2018-10-15 12:11 ` maiku.fabian at gmail dot com
  2018-10-15 12:34 ` maiku.fabian at gmail dot com
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: maiku.fabian at gmail dot com @ 2018-10-15 12:11 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=23774

--- Comment #1 from Mike FABIAN <maiku.fabian at gmail dot com> ---
(In reply to Danko Alexeyev from comment #0)
> Commit 159738548130d5ac4fe6178977e940ed5f8cfdc4 introduced this change in
> the lv_LV locale:
> 
> -<U0079> <i>;<PCL>;<MIN>;IGNORE % y
> -<U0059> <i>;<PCL>;<CAP>;IGNORE % Y
> +<U0079> <S0069>;<LOWLINE>;<MIN>;IGNORE % y
> +<U0059> <S0069>;<LOWLINE>;<CAP>;IGNORE % Y
> 
> I don't know what "PCL" meant and whether "Y" was supposed to be "BASE" in
> the first place, but "LOWLINE" certainly looks like a bug.

PCL was an old collation symbol which was used in an older version of
the glibc/localedata/locales/iso14651_t1_common file.  It was a second
level collation symbol.

To get the right sort order, replacing it with any existing secondary
collation symbol other than "BASE" works fine here.

The current glibc/localedata/locales/iso14651_t1_common contains:

    % Second-level collating symbols

    collating-symbol <BASE>
    collating-symbol <LOWLINE>  % COMBINING LOW LINE
    collating-symbol <PSILI>  % COMBINING COMMA ABOVE
    collating-symbol <DASIA>  % COMBINING REVERSED COMMA ABOVE
    collating-symbol <AIGUT>  % COMBINING ACUTE ACCENT
    ...

<BASE> means base letter; all the following collation symbols can be used
to indicate secondary differences from base letters. As there is nothing
particularly appropriate for the difference between i and y, it doesn’t
really matter which one is used, so I chose the first one, LOWLINE.

> Letter Y is not present in the Latvian alphabet, however it is present in
> Latgalian and is located after I, which is what the CLDR rule seems to
> suggest:
> 
> &I<<y<<<Y

This rule means that y is sorted after I, *but* only as a secondary difference
("<" is a primary difference, "<<" is a secondary difference, "<<<" is a
tertiary difference).
Secondary differences are "accent" differences, i.e. y is treated here
not as a really different letter from I (that would be a primary difference),
but as an "accent" variation of I. Tertiary differences are often used
for upper/lower case differences, which is the case here, i.e. the difference
between y and Y is an upper/lower case difference.

If you look at the sorting test file for lv_LV.UTF-8:

glibc/localedata/lv_LV.UTF-8.in

you will find that it contains:

i
I
ī
Ī
y
Y
ia
Ia
īa
Īa
ya
Ya
ib
Ib
īb
Īb
yb
Yb

If y differed from i at the primary level, ya would be sorted *after* ib.
But as it is only a secondary difference, the primary difference between
a and b decides the order for the strings ya and ib.
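
The effect of a secondary-only difference can be sketched with a toy
multi-level sort key in Python. The WEIGHTS table and its numbers are made
up for illustration and are not the weights glibc uses; only the comparison
order (all primary weights first, then all secondary, then all tertiary)
follows the explanation above.

```python
# Toy model of three-level collation. The weights are invented for
# illustration; they are NOT taken from iso14651_t1_common.
WEIGHTS = {
    # char: (primary, secondary, tertiary)
    "a": (1, 0, 0),
    "b": (2, 0, 0),
    "i": (9, 0, 0),
    "I": (9, 0, 1),
    "ī": (9, 1, 0),
    "Ī": (9, 1, 1),
    "y": (9, 2, 0),  # same primary weight as i: only a secondary difference
    "Y": (9, 2, 1),
}


def sort_key(word, weights=WEIGHTS):
    """Build a key that compares primary weights of the whole word first,
    then the secondary weights, then the tertiary weights."""
    levels = [weights[ch] for ch in word]
    return tuple(tuple(w[level] for w in levels) for level in range(3))


words = ["yb", "Ib", "ya", "i", "Y", "īa", "ib", "Ia", "y", "ia",
         "Ya", "Īb", "I", "īb", "ī", "Yb", "Īa", "Ī"]
print(sorted(words, key=sort_key))

# Hypothetical variant: y gets its own primary weight (a separate letter).
PRIMARY_Y = dict(WEIGHTS, y=(10, 0, 0), Y=(10, 0, 1))
print(sorted(["ya", "ib"], key=lambda w: sort_key(w, PRIMARY_Y)))
```

With the first table, ya sorts before ib because the a/b primary difference
is decided before the i/y secondary difference is looked at. With the
hypothetical PRIMARY_Y table, the i/y difference is itself primary, so ib
comes first.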

> I found this by accident while investigating the result of this command on
> my system (with LANG being lv_LV.UTF-8)
> 
> $ echo abcxyz | grep -Eo '[a-z]+'
> abcx
> z
> 
> I'm sorry if I misunderstood something as I've never worked with either
> glibc or CLDR locales directly before.

This fails for other reasons, not because of the use of LOWLINE.


* [Bug localedata/23774] lv_LV collates Y/y incorrectly
  2018-10-14  8:58 [Bug localedata/23774] New: lv_LV collates Y/y incorrectly danko at very dot lv
  2018-10-15 12:11 ` [Bug localedata/23774] " maiku.fabian at gmail dot com
  2018-10-15 12:34 ` maiku.fabian at gmail dot com
@ 2018-10-15 12:34 ` maiku.fabian at gmail dot com
  2018-12-22 19:56 ` rei4dan at gmail dot com
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: maiku.fabian at gmail dot com @ 2018-10-15 12:34 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=23774

--- Comment #2 from Mike FABIAN <maiku.fabian at gmail dot com> ---
See also

https://bugzilla.redhat.com/show_bug.cgi?id=1631472#c3

for a similar case in Swedish.


* [Bug localedata/23774] lv_LV collates Y/y incorrectly
  2018-10-14  8:58 [Bug localedata/23774] New: lv_LV collates Y/y incorrectly danko at very dot lv
  2018-10-15 12:11 ` [Bug localedata/23774] " maiku.fabian at gmail dot com
@ 2018-10-15 12:34 ` maiku.fabian at gmail dot com
  2018-10-15 12:34 ` maiku.fabian at gmail dot com
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: maiku.fabian at gmail dot com @ 2018-10-15 12:34 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=23774

Mike FABIAN <maiku.fabian at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |codonell at redhat dot com


* [Bug localedata/23774] lv_LV collates Y/y incorrectly
  2018-10-14  8:58 [Bug localedata/23774] New: lv_LV collates Y/y incorrectly danko at very dot lv
                   ` (2 preceding siblings ...)
  2018-10-15 12:34 ` maiku.fabian at gmail dot com
@ 2018-12-22 19:56 ` rei4dan at gmail dot com
  2021-09-08 15:12 ` carlos at redhat dot com
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: rei4dan at gmail dot com @ 2018-12-22 19:56 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=23774

Reinis Danne <rei4dan at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rei4dan at gmail dot com

--- Comment #3 from Reinis Danne <rei4dan at gmail dot com> ---
sed-4.6 and grep-3.3 seem to have resolved this particular issue by
implementing rational range interpretation, but [a-ž] and [A-Ž] are buggy.

The former de-interleaves the capital letters for unaccented characters, but
accented capitals are left among the small letters.
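
A rough way to see this effect, assuming that rational range interpretation
amounts to comparing code points: Python's re module also compares range
endpoints by code point, so it can stand in as a model. This is only a
sketch of the behaviour, not the grep or sed implementation, and in_range
is a helper name invented here.

```python
import re

# With code-point ranges, [a-z] is U+0061..U+007A, so y is always inside.
print(re.findall(r"[a-z]+", "abcxyz"))

# [a-ž] is U+0061..U+017E: A-Z (U+0041..U+005A) fall outside the range,
# but accented capitals such as Ā (U+0100) and Ž (U+017D) fall inside it.
small_to_z_caron = re.compile(r"[a-ž]")


def in_range(ch):
    """Return True if the single character ch matches [a-ž] by code point."""
    return small_to_z_caron.fullmatch(ch) is not None


for ch in ["a", "A", "ā", "Ā", "ž", "Ž"]:
    print(ch, in_range(ch))
```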

Does glibc (2.28) offer alternative collations (or does grep do it)?
As far as I could tell the collation sequence is as specified in the locale:
Using LC_COLLATE=lv_LV.UTF-8
char    strxfrm
i       c2b7010201020101e29b96
I       c2b7010201070101e2afb7
ī       c2b70102140102020101e29bb7
Ī       c2b70102140107020101e2b096
y       c2b701030102
Y       c2b701030107
j       c382010201020101e29c96
J       c382010201070101e2b0a4
Using LC_COLLATE=C.UTF-8
char    strxfrm
i       6b
I       4b
ī       c4ad
Ī       c4ac
y       7b
Y       5b
j       6c
J       4c
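
A comparable check can be sketched in Python with locale.strxfrm. Two
caveats: the helper name collation_order is made up here, and Python's
strxfrm returns a str rather than the raw bytes dumped above, so this
sketch prints only the resulting order, not the hex keys. The result
depends on the installed glibc version and on whether lv_LV.UTF-8 has been
generated on the system.

```python
import locale


def collation_order(words, locale_name):
    """Sort words using the C library's collation for locale_name.

    Returns None if that locale is not available on this system.
    """
    saved = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None
    try:
        return sorted(words, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)  # restore previous locale


letters = ["j", "y", "ī", "i", "J", "Y", "Ī", "I"]
result = collation_order(letters, "lv_LV.UTF-8")
print(result if result is not None else "lv_LV.UTF-8 is not installed")
```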


* [Bug localedata/23774] lv_LV collates Y/y incorrectly
  2018-10-14  8:58 [Bug localedata/23774] New: lv_LV collates Y/y incorrectly danko at very dot lv
                   ` (3 preceding siblings ...)
  2018-12-22 19:56 ` rei4dan at gmail dot com
@ 2021-09-08 15:12 ` carlos at redhat dot com
  2023-12-08 21:19 ` rudolfs.mazurs at gmail dot com
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: carlos at redhat dot com @ 2021-09-08 15:12 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=23774

Carlos O'Donell <carlos at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |carlos at redhat dot com
         Resolution|---                         |NOTABUG
             Status|UNCONFIRMED                 |RESOLVED

--- Comment #4 from Carlos O'Donell <carlos at redhat dot com> ---
The notes from Mike indicate that this is not a bug in the glibc locale data
for lv_LV and that we are harmonized with CLDR. I haven't seen further comments
from Danko Alexeyev to refute that. I'm marking this RESOLVED NOTABUG.

To answer Reinis Danne's question: yes, you can make alternative collations,
but they must be alternative locales, e.g. lv_LV@alt1.utf8, where "@alt1"
defines a suffix for an alternative locale (e.g. lv_LV@alt1) that has a
distinct collation from lv_LV.


* [Bug localedata/23774] lv_LV collates Y/y incorrectly
  2018-10-14  8:58 [Bug localedata/23774] New: lv_LV collates Y/y incorrectly danko at very dot lv
                   ` (4 preceding siblings ...)
  2021-09-08 15:12 ` carlos at redhat dot com
@ 2023-12-08 21:19 ` rudolfs.mazurs at gmail dot com
  2024-02-01 22:10 ` rudolfs.mazurs at gmail dot com
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: rudolfs.mazurs at gmail dot com @ 2023-12-08 21:19 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=23774

Rudolfs Mazurs <rudolfs.mazurs at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rudolfs.mazurs at gmail dot com

--- Comment #5 from Rudolfs Mazurs <rudolfs.mazurs at gmail dot com> ---
This bug is still relevant. Collation for the lv_LV locale should be: i, y, ī,
not i, ī, y.

The faulty behaviour was introduced by the patch for Bug 15537,
0001-lv_LV-locale-fix-collation-BZ-15537.patch

The correct ordering is described in the LVS 24:1993 standard; the relevant
section was quoted in the https://unicode-org.atlassian.net/browse/CLDR-6475
bug report.


* [Bug localedata/23774] lv_LV collates Y/y incorrectly
  2018-10-14  8:58 [Bug localedata/23774] New: lv_LV collates Y/y incorrectly danko at very dot lv
                   ` (5 preceding siblings ...)
  2023-12-08 21:19 ` rudolfs.mazurs at gmail dot com
@ 2024-02-01 22:10 ` rudolfs.mazurs at gmail dot com
  2024-02-07  8:44 ` maiku.fabian at gmail dot com
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: rudolfs.mazurs at gmail dot com @ 2024-02-01 22:10 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=23774

--- Comment #6 from Rudolfs Mazurs <rudolfs.mazurs at gmail dot com> ---
This issue was fixed in another CLDR issue:
https://unicode-org.atlassian.net/browse/CLDR-11982

The new collation rules are in the file
https://github.com/unicode-org/cldr/blob/main/common/collation/lv.xml and are
set to be released in CLDR version 45.

Should I attempt to make a git patch?


* [Bug localedata/23774] lv_LV collates Y/y incorrectly
  2018-10-14  8:58 [Bug localedata/23774] New: lv_LV collates Y/y incorrectly danko at very dot lv
                   ` (6 preceding siblings ...)
  2024-02-01 22:10 ` rudolfs.mazurs at gmail dot com
@ 2024-02-07  8:44 ` maiku.fabian at gmail dot com
  2024-02-07  8:45 ` maiku.fabian at gmail dot com
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: maiku.fabian at gmail dot com @ 2024-02-07  8:44 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=23774

Mike FABIAN <maiku.fabian at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |REOPENED
   Last reconfirmed|                            |2024-02-07
     Ever confirmed|0                           |1
         Resolution|NOTABUG                     |---

--- Comment #7 from Mike FABIAN <maiku.fabian at gmail dot com> ---
Reopen.


* [Bug localedata/23774] lv_LV collates Y/y incorrectly
  2018-10-14  8:58 [Bug localedata/23774] New: lv_LV collates Y/y incorrectly danko at very dot lv
                   ` (7 preceding siblings ...)
  2024-02-07  8:44 ` maiku.fabian at gmail dot com
@ 2024-02-07  8:45 ` maiku.fabian at gmail dot com
  2024-02-07 15:03 ` maiku.fabian at gmail dot com
  2024-02-08  7:38 ` maiku.fabian at gmail dot com
  10 siblings, 0 replies; 12+ messages in thread
From: maiku.fabian at gmail dot com @ 2024-02-07  8:45 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=23774

--- Comment #8 from Mike FABIAN <maiku.fabian at gmail dot com> ---
*** Bug 25206 has been marked as a duplicate of this bug. ***


* [Bug localedata/23774] lv_LV collates Y/y incorrectly
  2018-10-14  8:58 [Bug localedata/23774] New: lv_LV collates Y/y incorrectly danko at very dot lv
                   ` (8 preceding siblings ...)
  2024-02-07  8:45 ` maiku.fabian at gmail dot com
@ 2024-02-07 15:03 ` maiku.fabian at gmail dot com
  2024-02-08  7:38 ` maiku.fabian at gmail dot com
  10 siblings, 0 replies; 12+ messages in thread
From: maiku.fabian at gmail dot com @ 2024-02-07 15:03 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=23774

Mike FABIAN <maiku.fabian at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Assignee|unassigned at sourceware dot org   |maiku.fabian at gmail dot com


* [Bug localedata/23774] lv_LV collates Y/y incorrectly
  2018-10-14  8:58 [Bug localedata/23774] New: lv_LV collates Y/y incorrectly danko at very dot lv
                   ` (9 preceding siblings ...)
  2024-02-07 15:03 ` maiku.fabian at gmail dot com
@ 2024-02-08  7:38 ` maiku.fabian at gmail dot com
  10 siblings, 0 replies; 12+ messages in thread
From: maiku.fabian at gmail dot com @ 2024-02-08  7:38 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=23774

Mike FABIAN <maiku.fabian at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|---                         |2.40
             Status|REOPENED                    |RESOLVED
         Resolution|---                         |FIXED

--- Comment #9 from Mike FABIAN <maiku.fabian at gmail dot com> ---
Fixed in glibc master:
https://sourceware.org/git/?p=glibc.git;a=commitdiff;h=30a61b1dd98dacbbdcba960e247400b6b2abd8f9
