public inbox for glibc-bugs@sourceware.org
* [Bug libc/374] New: The rules in LC_COLLATE are random and sometimes clearly wrong
From: munzirtaha at newhorizons dot com dot sa @ 2004-09-08 23:53 UTC (permalink / raw)
To: glibc-bugs
[root@localhost home]# LC_COLLATE=en_US ls -- 0 a A -a a- aa "a a" a-a "a z"
0 a -a a- A aa a a a-a a z
[root@localhost home]# LC_COLLATE=en_CA ls -- 0 a A -a a- aa "a a" a-a "a z"
0 A a -a a- aa a a a-a a z
[root@localhost home]# LC_COLLATE=da ls -- 0 a A -a a- aa "a a" a-a "a z"
-a 0 A a a a a z a- a-a aa
[root@localhost home]# LC_COLLATE=ar_SA ls -- 0 a A -a a- aa "a a" a-a "a z"
0 A a a a a z aa a- a-a -a
da: (the character "-" has a 1st order sorting value, coming before letters and
numbers; on most other locales "-" is ignored in sorting)
ar_SA: (note how ar_SA handles "-" as a collatable element coming after "z")
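The listings above can be reproduced without `ls`. The following is a minimal sketch, assuming Python's standard `locale` module and assuming the named locales are installed on the test machine (the helper name `sort_in_locale` is illustrative, not part of glibc or of this report). It sorts the same nine names with `strcoll` under a given LC_COLLATE setting and reports when a locale name cannot be set.

```python
import functools
import locale

# The nine file names used in the report above.
NAMES = ["0", "a", "A", "-a", "a-", "aa", "a a", "a-a", "a z"]


def sort_in_locale(names, locale_name):
    """Sort names with strcoll under locale_name.

    Returns None when the locale cannot be set (for example, not installed).
    """
    saved = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None
    try:
        return sorted(names, key=functools.cmp_to_key(locale.strcoll))
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)  # always restore


if __name__ == "__main__":
    # The ".UTF-8" suffixes are an assumption; adjust to what `locale -a` lists.
    for name in ("C", "en_US.UTF-8", "en_CA.UTF-8", "da_DK.UTF-8", "ar_SA.UTF-8"):
        result = sort_in_locale(NAMES, name)
        print(name, result if result is not None else "(locale not installed)")
```

Under the "C" locale this yields plain code-point order: -a 0 A a "a a" "a z" a- a-a aa. That is the same order as the `LC_COLLATE=da` listing above, which would be consistent with `da` not being an installed locale name, although the thread does not confirm this.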
--
Summary: The rules in LC_COLLATE are random and sometimes clearly
wrong
Product: glibc
Version: 2.3.3
Status: NEW
Severity: critical
Priority: P2
Component: libc
AssignedTo: gotom at debian dot or dot jp
ReportedBy: munzirtaha at newhorizons dot com dot sa
CC: glibc-bugs at sources dot redhat dot com,pablo at
mandrakesoft dot com
http://sources.redhat.com/bugzilla/show_bug.cgi?id=374
------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.
* [Bug libc/374] The rules in LC_COLLATE are random and sometimes clearly wrong
From: gotom at debian dot or dot jp @ 2004-09-09 16:27 UTC (permalink / raw)
To: glibc-bugs
------- Additional Comments From gotom at debian dot or dot jp 2004-09-09 16:27 -------
Please describe what the problem is. ISO/IEC at least defines the
collation for some locales (like en_US), which says that capital and
small letters are interleaved: a A b B ... and so on.
BTW, what is locale "da"?
Execute "locale -a" and check whether "da" is available or not.
--
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |WAITING
http://sources.redhat.com/bugzilla/show_bug.cgi?id=374
* [Bug libc/374] The rules in LC_COLLATE are random and sometimes clearly wrong
From: munzirtaha at newhorizons dot com dot sa @ 2004-09-12 18:38 UTC (permalink / raw)
To: glibc-bugs
------- Additional Comments From munzirtaha at newhorizons dot com dot sa 2004-09-12 18:38 -------
Some hints:
1. There should be no difference between en_US and en_CA.
2. de (sorry not da) sorting is very odd. (the character "-" has a 1st order
sorting value, coming before letters and numbers; on most other locales "-" is
ignored in sorting)
3. ar_SA handles "-" as a collatable element coming after "z". ar_SA defines
LC_COLLATE using an old syntax (with only one level of collating weight), so
maybe this special weight for "-" wasn't intended to be like that and is just a
side effect. Maybe the LC_COLLATE section should be redefined to use the
default one and only redefine (if needed) the sorting of Arabic script
letters.
Thanks to Mr. Pablo of Mandrake for discussing the issue with me. I borrowed
some of his comments.
--
http://sources.redhat.com/bugzilla/show_bug.cgi?id=374
* [Bug localedata/374] The rules in LC_COLLATE are random and sometimes clearly wrong
From: drepper at redhat dot com @ 2004-09-26 9:33 UTC (permalink / raw)
To: glibc-bugs
------- Additional Comments From drepper at redhat dot com 2004-09-26 09:33 -------
This is not a valid argument.
The rules stem from data worked out by a group of experts on the topic and I
trust them more than any random reporter who thinks s/he knows something.
Either you specify *exactly* which rules in what locale you consider wrong and
you back it up by providing supporting evidence (e.g., from national standards)
or you can go away since nothing will ever be changed without following these
procedures.
--
What |Removed |Added
----------------------------------------------------------------------------
AssignedTo|gotom at debian dot or dot |pere at hungry dot com
|jp |
Severity|critical |minor
Status|WAITING |NEW
Component|libc |localedata
http://sources.redhat.com/bugzilla/show_bug.cgi?id=374
* [Bug localedata/374] The rules in LC_COLLATE are random and sometimes clearly wrong
From: munzirtaha at newhorizons dot com dot sa @ 2004-09-30 20:26 UTC (permalink / raw)
To: glibc-bugs
------- Additional Comments From munzirtaha at newhorizons dot com dot sa 2004-09-30 20:25 -------
First, I am sorry that you felt as if I was pretending to "know something".
Actually, I am not an expert at all in those issues and hence you need to help
me report it in a better way if this is still not enough.
Second, I am an Arabic native speaker (ar). I am also living in Saudi Arabia
(SA). Also, we don't have our own English and we don't have "national
standards" for English. We follow the known English standards available.
The bug I am going to report here is concerned with locale ar_SA.
If I have a file named "aa" and another named "a z", I would expect the
command "ls" to display them with "aa" before "a z", as happens when the
locale is en_US, en_CA, en_GB, ..., which is not the case now.
--
http://sources.redhat.com/bugzilla/show_bug.cgi?id=374
* [Bug localedata/374] The rules in LC_COLLATE are random and sometimes clearly wrong
From: pablo at mandrakesoft dot com @ 2004-09-30 21:07 UTC (permalink / raw)
To: glibc-bugs
------- Additional Comments From pablo at mandrakesoft dot com 2004-09-30 21:07 -------
I think some LC_COLLATE definitions are indeed wrong; it looks like they haven't
been rewritten/updated to benefit from the new (glibc > 2.2) possibilities.
When you look at ar_SA, the LC_COLLATE is defined with lines like:
order_start forward; forward
<U0020> <U0020>
...
<U0030> <U0030>
<U0031> <U0031>
<U0032> <U0032>
....
<U0041> <U0041>;<U0041>
<U0061> <U0041>;<U0061>
...
If you compare with iso14651_t1 (used, and possibly extended, by most other
locales), you see things like this instead:
<U0020> IGNORE;IGNORE;IGNORE;<U0020> # 32 <SP>
...
<U0030> <0>;<BAS>;<MIN>;IGNORE # 171 0
<U0031> <1>;<BAS>;<MIN>;IGNORE # 172 1
<U0032> <2>;<BAS>;<MIN>;IGNORE # 173 2
...
<U0061> <a>;<BAS>;<MIN>;IGNORE # 198 a
...
<U0041> <a>;<BAS>;<CAP>;IGNORE # 319 A
...
While ar_SA gives only one or, in some cases, two information tokens for each
element, the more modern LC_COLLATE definitions have 4.
You can also see that while in ar_SA the space (<U0020>) is treated the same
as the digits, in the more modern LC_COLLATE definition it is not; in fact the
space is defined as sorting neutral.
The Latin letters carry information telling whether they are uppercase or
lowercase in the modern LC_COLLATE; that information is missing in the ar_SA
definition.
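To make the difference between the two styles concrete, here is a toy model of level-by-level comparison. It is only a sketch under stated assumptions and not glibc's implementation: the numeric weights are hypothetical stand-ins for the symbolic ones in the excerpts, the entry for "z" is added by analogy, every level is compared "forward", and a weight left out in the ar_SA-style lines is assumed to default to the character itself.

```python
IGNORE = None  # the character contributes nothing at this level

# Hypothetical numeric stand-ins for <BAS>, <MIN>, <CAP> from the excerpt.
BAS = 1
MIN, CAP = 1, 2

# Four-level entries in the style of the iso14651_t1 excerpt (tiny subset).
FOUR_LEVEL = {
    " ": (IGNORE, IGNORE, IGNORE, 0x20),
    "a": (10, BAS, MIN, IGNORE),
    "A": (10, BAS, CAP, IGNORE),
    "z": (35, BAS, MIN, IGNORE),
}

# Two-level entries in the style of the ar_SA excerpt. The second weight for
# the space is an assumption (missing weight = the character itself).
TWO_LEVEL = {
    " ": (0x20, 0x20),
    "A": (0x41, 0x41),
    "a": (0x41, 0x61),
    "z": (0x5A, 0x7A),
}


def collation_key(text, table):
    """Build one weight sequence per level, skipping ignored characters.

    Tuples compare lexicographically, so level 1 decides first and later
    levels only break ties.
    """
    levels = len(next(iter(table.values())))
    return tuple(
        tuple(table[ch][level] for ch in text if table[ch][level] is not IGNORE)
        for level in range(levels)
    )


if __name__ == "__main__":
    pair = ["aa", "a z"]
    print(sorted(pair, key=lambda s: collation_key(s, FOUR_LEVEL)))
    print(sorted(pair, key=lambda s: collation_key(s, TWO_LEVEL)))
```

In this model the four-level table places "aa" before "a z", because the space is ignored at the first level, while the two-level table places "a z" first, because the space carries a first-level weight lower than any letter. That matches the en_US and ar_SA listings in the original report.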
da_DK is a bit stranger: it uses a modern LC_COLLATE definition, but
redefines everything itself (instead of including iso14651_t1 and only
redefining what differs). Spaces and blanks have a 1st order sorting weight,
which seems very strange to me. Even if the Danish language sorts spaces in
such a peculiar way, it is still strange to sort the space (0020) and the
non-breaking space (00A0) differently; semantically they are the same thing,
and the difference is only typographical.
While the sorting of letters is correct (at least for the letters used by a
given language; ar_SA, for example, happily ignores any Latin letter outside of
ASCII: ar_EG sorts "agrave" together with "a", whereas ar_SA puts
"agrave" after the last Arabic letter...), the handling of punctuation and
other special symbols should be reviewed, imho.
Also, all locales should include iso14651_t1 so that there can be an acceptable
sorting for alphabetic symbols outside the range of the alphabet of the given
locale. In a UTF-8 world you will likely see such things; I get, for example,
mail from people with names containing cacute, ccaron, lstroke, eogonek, etc.
In my language none of those exist, but I expect them to be sorted with
"c", "c", "l", "e" respectively, and not after "z".
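The expectation in the last paragraph can be checked mechanically. The sketch below assumes Python's standard `locale` module; the helper name `sorts_between` and the choice of test letters are illustrative only. It asks whether a letter collates from its base letter up to, but not including, the next letter of the alphabet.

```python
import locale


def sorts_between(letter, low, high, locale_name):
    """True if low <= letter < high under locale_name's LC_COLLATE.

    Returns None when the locale cannot be set (for example, not installed).
    """
    saved = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None
    try:
        return (locale.strcoll(low, letter) <= 0
                and locale.strcoll(letter, high) < 0)
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)  # always restore


# cacute, ccaron, lstroke, eogonek, each with its base letter and the next one.
CASES = [
    ("\u0107", "c", "d"),
    ("\u010d", "c", "d"),
    ("\u0142", "l", "m"),
    ("\u0119", "e", "f"),
]

if __name__ == "__main__":
    # Locale names are assumptions; adjust to what `locale -a` lists.
    for name in ("C", "en_US.UTF-8", "ar_SA.UTF-8"):
        print(name, [sorts_between(ch, lo, hi, name) for ch, lo, hi in CASES])
```

In the "C" locale every case is False, since comparison there is by code point and all four letters lie beyond "z". A locale that sorts them with their base letters, as the comment above asks for, would report True.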
--
http://sources.redhat.com/bugzilla/show_bug.cgi?id=374
* [Bug localedata/374] The rules in LC_COLLATE are random and sometimes clearly wrong
From: munzirtaha at newhorizons dot com dot sa @ 2004-10-01 15:57 UTC (permalink / raw)
To: glibc-bugs
------- Additional Comments From munzirtaha at newhorizons dot com dot sa 2004-10-01 15:57 -------
Sigh! At last an expert came to the rescue ;)
--
http://sources.redhat.com/bugzilla/show_bug.cgi?id=374
* [Bug localedata/374] The rules in LC_COLLATE are random and sometimes clearly wrong
From: barbier at linuxfr dot org @ 2005-01-17 21:42 UTC (permalink / raw)
To: glibc-bugs
------- Additional Comments From barbier at linuxfr dot org 2005-01-17 21:42 -------
Created an attachment (id=370)
--> (http://sources.redhat.com/bugzilla/attachment.cgi?id=370&action=view)
C source file for the tst-strcoll program
This program can only process files composed of lines of 2 UTF-8
characters; some modifications are needed to accept any input.
--
http://sources.redhat.com/bugzilla/show_bug.cgi?id=374
* [Bug localedata/374] The rules in LC_COLLATE are random and sometimes clearly wrong
From: barbier at linuxfr dot org @ 2005-01-17 21:43 UTC (permalink / raw)
To: glibc-bugs
------- Additional Comments From barbier at linuxfr dot org 2005-01-17 21:43 -------
Created an attachment (id=371)
--> (http://sources.redhat.com/bugzilla/attachment.cgi?id=371&action=view)
C source file for the tst-wcscoll program
This program can only process files composed of lines of 2 UTF-8
characters; some modifications are needed to accept any input.
--
http://sources.redhat.com/bugzilla/show_bug.cgi?id=374
* [Bug localedata/374] The rules in LC_COLLATE are random and sometimes clearly wrong
From: barbier at linuxfr dot org @ 2005-01-17 22:29 UTC (permalink / raw)
To: glibc-bugs
------- Additional Comments From barbier at linuxfr dot org 2005-01-17 22:29 -------
(From update of attachment 370)
Oops, this patch was for BZ#368.
--
What |Removed |Added
----------------------------------------------------------------------------
Attachment #370 is|0 |1
obsolete| |
http://sources.redhat.com/bugzilla/show_bug.cgi?id=374
* [Bug localedata/374] The rules in LC_COLLATE are random and sometimes clearly wrong
From: drepper at redhat dot com @ 2005-10-14 23:02 UTC (permalink / raw)
To: glibc-bugs
------- Additional Comments From drepper at redhat dot com 2005-10-14 23:02 -------
If any locale definition should change, send a patch with justification. Just
saying "I don't like it" achieves *nothing*. I'm closing this bug since there
is absolutely no substance here. Locales are only updated if somebody who cares
does the work.
--
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |RESOLVED
Resolution| |WONTFIX
http://sourceware.org/bugzilla/show_bug.cgi?id=374
* [Bug localedata/374] The rules in LC_COLLATE are random and sometimes clearly wrong
From: mfabian at suse dot de @ 2006-04-10 16:56 UTC (permalink / raw)
To: glibc-bugs
--
What |Removed |Added
----------------------------------------------------------------------------
CC| |mfabian at suse dot de
http://sourceware.org/bugzilla/show_bug.cgi?id=374