public inbox for glibc-bugs@sourceware.org
* [Bug libc/374] New: The rules in LC_COLLATE are random and sometimes clearly wrong
@ 2004-09-08 23:53 munzirtaha at newhorizons dot com dot sa
  2004-09-09 16:27 ` [Bug libc/374] " gotom at debian dot or dot jp
                   ` (10 more replies)
  0 siblings, 11 replies; 12+ messages in thread
From: munzirtaha at newhorizons dot com dot sa @ 2004-09-08 23:53 UTC
  To: glibc-bugs

[root@localhost home]# LC_COLLATE=en_US ls -- 0 a A -a a- aa "a a" a-a "a z"
0  a  -a  a-  A  aa  a a  a-a  a z
[root@localhost home]# LC_COLLATE=en_CA ls -- 0 a A -a a- aa "a a" a-a "a z"
0  A  a  -a  a-  aa  a a  a-a  a z
[root@localhost home]# LC_COLLATE=da ls -- 0 a A -a a- aa "a a" a-a "a z"
-a  0  A  a  a a  a z  a-  a-a  aa
[root@localhost home]# LC_COLLATE=ar_SA ls -- 0 a A -a a- aa "a a" a-a "a z"
0  A  a  a a  a z  aa  a-  a-a  -a

da: (the character "-" has a 1st order sorting value, coming before letters and
numbers; on most other locales "-" is ignored in sorting)
ar_SA: (note how ar_SA handles "-" as a collatable element coming after "z")
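
The same comparison can be reproduced without ls. A minimal sketch in Python, assuming the locales named are installed on the system (the helper name sort_in_locale is made up for illustration). locale.strcoll hands the comparison to the C library, so the result depends on the installed locale data:

```python
import functools
import locale

# The nine file names used in the ls commands above.
NAMES = ["0", "a", "A", "-a", "a-", "aa", "a a", "a-a", "a z"]


def sort_in_locale(names, locale_name):
    """Sort names with the LC_COLLATE rules of locale_name.

    Returns None when the C library rejects the locale name
    (hypothetical helper, not part of glibc or coreutils).
    """
    saved = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None
    try:
        # strcoll compares two strings according to LC_COLLATE.
        return sorted(names, key=functools.cmp_to_key(locale.strcoll))
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)  # restore the old setting


if __name__ == "__main__":
    for name in ("C", "en_US", "en_CA", "ar_SA"):
        print(name, sort_in_locale(NAMES, name))
```

Locale names may need an encoding suffix (for example a ".UTF-8" variant) depending on what "locale -a" lists on the machine.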

-- 
           Summary: The rules in LC_COLLATE are random and sometimes clearly
                    wrong
           Product: glibc
           Version: 2.3.3
            Status: NEW
          Severity: critical
          Priority: P2
         Component: libc
        AssignedTo: gotom at debian dot or dot jp
        ReportedBy: munzirtaha at newhorizons dot com dot sa
                CC: glibc-bugs at sources dot redhat dot com,pablo at
                    mandrakesoft dot com


http://sources.redhat.com/bugzilla/show_bug.cgi?id=374

------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.



* [Bug libc/374] The rules in LC_COLLATE are random and sometimes clearly wrong
  2004-09-08 23:53 [Bug libc/374] New: The rules in LC_COLLATE are random and sometimes clearly wrong munzirtaha at newhorizons dot com dot sa
@ 2004-09-09 16:27 ` gotom at debian dot or dot jp
  2004-09-12 18:38 ` munzirtaha at newhorizons dot com dot sa
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: gotom at debian dot or dot jp @ 2004-09-09 16:27 UTC
  To: glibc-bugs


------- Additional Comments From gotom at debian dot or dot jp  2004-09-09 16:27 -------
Please describe what the problem is.  At least for some locales
(like en_US), ISO/IEC defines a collation in which capital and
small letters are interleaved: a A b B ... and so on.

BTW, what is locale "da"?
Execute "locale -a" and check whether "da" is available or not.
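
The same check can be made from a program. A minimal sketch in Python (the helper name collate_locale_available is hypothetical). When a requested locale cannot be set, a program stays in the "C" locale, which collates by byte value. The last lines show that plain byte order gives exactly the sequence printed for LC_COLLATE=da in the report; that would be consistent with "da" not being installed there, though it does not prove it:

```python
import locale


def collate_locale_available(name):
    """Return True if the C library accepts name for LC_COLLATE."""
    saved = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, name)
        return True
    except locale.Error:
        return False
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)  # restore the old setting


# The nine file names from the report, sorted by code point,
# which for ASCII names is the same as byte order.
NAMES = ["0", "a", "A", "-a", "a-", "aa", "a a", "a-a", "a z"]
byte_order = sorted(NAMES)

if __name__ == "__main__":
    print("da available:", collate_locale_available("da"))
    print(byte_order)
    # -> ['-a', '0', 'A', 'a', 'a a', 'a z', 'a-', 'a-a', 'aa']
```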

-- 
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |WAITING


http://sources.redhat.com/bugzilla/show_bug.cgi?id=374


* [Bug libc/374] The rules in LC_COLLATE are random and sometimes clearly wrong
  2004-09-08 23:53 [Bug libc/374] New: The rules in LC_COLLATE are random and sometimes clearly wrong munzirtaha at newhorizons dot com dot sa
  2004-09-09 16:27 ` [Bug libc/374] " gotom at debian dot or dot jp
@ 2004-09-12 18:38 ` munzirtaha at newhorizons dot com dot sa
  2004-09-26  9:33 ` [Bug localedata/374] " drepper at redhat dot com
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: munzirtaha at newhorizons dot com dot sa @ 2004-09-12 18:38 UTC
  To: glibc-bugs


------- Additional Comments From munzirtaha at newhorizons dot com dot sa  2004-09-12 18:38 -------
Some hints: 
1. There should be no difference between en_US and en_CA. 
2. de (sorry not da) sorting is very odd. (the character "-" has a 1st order 
sorting value, coming before letters and numbers; on most other locales "-" is 
ignored in sorting) 
3. ar_SA handles "-" as a collatable element coming after "z". ar_SA defines
LC_COLLATE using an old syntax (with only one level of collating weight), so
maybe this special weight for "-" wasn't intended; it may be just a
side effect. Maybe the LC_COLLATE section should be rewritten to use the
default one and redefine (if needed) only the sorting of Arabic script
letters.
 
Thanks to Mr. Pablo of Mandrake for discussing the issue with me. I borrowed 
some of his comments. 

-- 


http://sources.redhat.com/bugzilla/show_bug.cgi?id=374


* [Bug localedata/374] The rules in LC_COLLATE are random and sometimes clearly wrong
  2004-09-08 23:53 [Bug libc/374] New: The rules in LC_COLLATE are random and sometimes clearly wrong munzirtaha at newhorizons dot com dot sa
  2004-09-09 16:27 ` [Bug libc/374] " gotom at debian dot or dot jp
  2004-09-12 18:38 ` munzirtaha at newhorizons dot com dot sa
@ 2004-09-26  9:33 ` drepper at redhat dot com
  2004-09-30 20:26 ` munzirtaha at newhorizons dot com dot sa
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: drepper at redhat dot com @ 2004-09-26  9:33 UTC
  To: glibc-bugs


------- Additional Comments From drepper at redhat dot com  2004-09-26 09:33 -------
This is not a valid argument.

The rules stem from data worked out by a group of experts on the topic and I
trust them more than any random reporter who thinks s/he knows something.

Either you specify *exactly* which rules in what locale you consider wrong and
you back it up by providing supporting evidence (e.g., from national standards)
or you can go away since nothing will ever be changed without following these
procedures.

-- 
           What    |Removed                     |Added
----------------------------------------------------------------------------
         AssignedTo|gotom at debian dot or dot  |pere at hungry dot com
                   |jp                          |
           Severity|critical                    |minor
             Status|WAITING                     |NEW
          Component|libc                        |localedata


http://sources.redhat.com/bugzilla/show_bug.cgi?id=374


* [Bug localedata/374] The rules in LC_COLLATE are random and sometimes clearly wrong
  2004-09-08 23:53 [Bug libc/374] New: The rules in LC_COLLATE are random and sometimes clearly wrong munzirtaha at newhorizons dot com dot sa
                   ` (2 preceding siblings ...)
  2004-09-26  9:33 ` [Bug localedata/374] " drepper at redhat dot com
@ 2004-09-30 20:26 ` munzirtaha at newhorizons dot com dot sa
  2004-09-30 21:07 ` pablo at mandrakesoft dot com
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: munzirtaha at newhorizons dot com dot sa @ 2004-09-30 20:26 UTC
  To: glibc-bugs


------- Additional Comments From munzirtaha at newhorizons dot com dot sa  2004-09-30 20:25 -------
First, I am sorry that you felt as if I was pretending to "know something". 
Actually, I am not an expert at all in those issues and hence you need to help 
me report it in a better way if this is still not enough. 
 
Second, I am an Arabic native speaker (ar). I am also living in Saudi Arabia 
(SA). Also, we don't have our own English and we don't have "national 
standards" for English. We follow the known English standards available. 
 
The bug I am reporting here concerns the ar_SA locale.
If I have a file named "aa" and another named "a z", I would expect the
command "ls" to list "aa" before "a z", as happens when the locale is
en_US, en_CA, en_GB, ..., which is not the case now.
 

-- 


http://sources.redhat.com/bugzilla/show_bug.cgi?id=374


* [Bug localedata/374] The rules in LC_COLLATE are random and sometimes clearly wrong
  2004-09-08 23:53 [Bug libc/374] New: The rules in LC_COLLATE are random and sometimes clearly wrong munzirtaha at newhorizons dot com dot sa
                   ` (3 preceding siblings ...)
  2004-09-30 20:26 ` munzirtaha at newhorizons dot com dot sa
@ 2004-09-30 21:07 ` pablo at mandrakesoft dot com
  2004-10-01 15:57 ` munzirtaha at newhorizons dot com dot sa
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: pablo at mandrakesoft dot com @ 2004-09-30 21:07 UTC
  To: glibc-bugs


------- Additional Comments From pablo at mandrakesoft dot com  2004-09-30 21:07 -------
I think some LC_COLLATE definitions are indeed wrong; it looks like they
haven't been rewritten/updated to take advantage of the new (glibc > 2.2)
possibilities.

When you look at ar_SA, the LC_COLLATE is defined with lines like:

order_start             forward; forward
<U0020> <U0020>
...
<U0030> <U0030>
<U0031> <U0031>
<U0032> <U0032>
....
<U0041> <U0041>;<U0041>
<U0061> <U0041>;<U0061>
...

If you compare with iso14651_t1 (used, possibly with additions, by most other
locales), you see things like this instead:
<U0020> IGNORE;IGNORE;IGNORE;<U0020> # 32 <SP>
...
<U0030> <0>;<BAS>;<MIN>;IGNORE # 171 0
<U0031> <1>;<BAS>;<MIN>;IGNORE # 172 1
<U0032> <2>;<BAS>;<MIN>;IGNORE # 173 2
...
<U0061> <a>;<BAS>;<MIN>;IGNORE # 198 a
...
<U0041> <a>;<BAS>;<CAP>;IGNORE # 319 A
...

While ar_SA gives only one or, in some cases, two information tokens for each
element, the more modern LC_COLLATE definitions have 4.
You can also see that while in ar_SA the space (<U0020>) is treated the same
as the digits, in the more modern LC_COLLATE definition it is not; in fact the
space is defined as sorting-neutral.
In the modern LC_COLLATE, the Latin letters carry information telling whether
they are uppercase or lowercase; that information is missing from the ar_SA
definition.

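To make the difference concrete, here is a toy model in Python of how multi-level weights are compared. It is only a sketch: the numeric weights, the table names and the sort_key helper are invented for illustration, the tables cover just four characters, and real glibc collation has more detail than this. IGNORE is modelled as None, and the ar_SA-style entries for the space and for "z" are assumed to follow the same pattern as the lines quoted above.

```python
IGNORE = None

# Invented numeric stand-ins for the symbolic weights <a>, <z>, <BAS>, <MIN>, <CAP>.
W_A, W_Z = 10, 35
BAS, MIN, CAP = 1, 1, 2

# Four-level weights in the style of the iso14651_t1 excerpt.
MODERN = {
    " ": (IGNORE, IGNORE, IGNORE, 0x20),
    "a": (W_A, BAS, MIN, IGNORE),
    "A": (W_A, BAS, CAP, IGNORE),
    "z": (W_Z, BAS, MIN, IGNORE),
}

# Two-level weights in the style of the ar_SA excerpt (code points as weights).
# Assumption: the space repeats its own value at level 2, and "z" mirrors "a".
LEGACY = {
    " ": (0x20, 0x20),
    "A": (0x41, 0x41),
    "a": (0x41, 0x61),
    "z": (0x5A, 0x7A),
}


def sort_key(text, table):
    """Build one weight list per level, dropping IGNOREd characters.

    Keys compare level by level: level 2 only matters when level 1 ties,
    and so on.
    """
    levels = len(next(iter(table.values())))
    return tuple(
        [table[ch][level] for ch in text if table[ch][level] is not IGNORE]
        for level in range(levels)
    )


def sort_with(names, table):
    return sorted(names, key=lambda name: sort_key(name, table))


if __name__ == "__main__":
    print(sort_with(["a z", "aa"], MODERN))  # -> ['aa', 'a z']
    print(sort_with(["a z", "aa"], LEGACY))  # -> ['a z', 'aa']
    print(sort_with(["A", "a"], MODERN))     # -> ['a', 'A']
    print(sort_with(["a", "A"], LEGACY))     # -> ['A', 'a']
```

Both results agree with the ls output in the original report: en_US lists "aa" before "a z" and "a" before "A", while ar_SA does the opposite in both cases.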
da_DK is a bit stranger: it uses a modern LC_COLLATE definition, but
redefines everything itself (instead of including iso14651_t1 and only
redefining what differs). Spaces and blanks have 1st order sorting weight, which
seems very strange to me. Even if the Danish language sorts spaces in such a
peculiar way, it is still strange to sort the space (0020) and the
non-breaking space (00A0) differently; semantically they are the same thing, and
the difference is only typographical.

While the sorting of letters is correct (at least for the letters used by a
given language; ar_SA, for example, happily ignores any Latin letter outside of
ASCII: while ar_EG sorts "agrave" together with "a", ar_SA puts
"agrave" after the last Arabic letter...), the handling of punctuation and
other special symbols should be reviewed imho.
Also, all locales should include iso14651_t1 so that there is an acceptable
sorting for alphabetic symbols outside the alphabet of the given locale.
In a UTF-8 world you will likely see such things: I get, for example,
mail from people with names containing cacute, ccaron, lstroke, eogonek, etc.
None of those exist in my language, but I expect them to be sorted with
"c", "c", "l", "e" respectively, and not after "z".

-- 


http://sources.redhat.com/bugzilla/show_bug.cgi?id=374


* [Bug localedata/374] The rules in LC_COLLATE are random and sometimes clearly wrong
  2004-09-08 23:53 [Bug libc/374] New: The rules in LC_COLLATE are random and sometimes clearly wrong munzirtaha at newhorizons dot com dot sa
                   ` (4 preceding siblings ...)
  2004-09-30 21:07 ` pablo at mandrakesoft dot com
@ 2004-10-01 15:57 ` munzirtaha at newhorizons dot com dot sa
  2005-01-17 21:42 ` barbier at linuxfr dot org
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: munzirtaha at newhorizons dot com dot sa @ 2004-10-01 15:57 UTC
  To: glibc-bugs


------- Additional Comments From munzirtaha at newhorizons dot com dot sa  2004-10-01 15:57 -------
Sigh! At last an expert came to the rescue ;) 

-- 


http://sources.redhat.com/bugzilla/show_bug.cgi?id=374


* [Bug localedata/374] The rules in LC_COLLATE are random and sometimes clearly wrong
  2004-09-08 23:53 [Bug libc/374] New: The rules in LC_COLLATE are random and sometimes clearly wrong munzirtaha at newhorizons dot com dot sa
                   ` (5 preceding siblings ...)
  2004-10-01 15:57 ` munzirtaha at newhorizons dot com dot sa
@ 2005-01-17 21:42 ` barbier at linuxfr dot org
  2005-01-17 21:43 ` barbier at linuxfr dot org
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: barbier at linuxfr dot org @ 2005-01-17 21:42 UTC
  To: glibc-bugs


------- Additional Comments From barbier at linuxfr dot org  2005-01-17 21:42 -------
Created an attachment (id=370)
 --> (http://sources.redhat.com/bugzilla/attachment.cgi?id=370&action=view)
C source file for the tst-strcoll program

This program can only process files composed of lines of 2 UTF-8
characters; some modifications are needed to accept any input.


-- 


http://sources.redhat.com/bugzilla/show_bug.cgi?id=374


* [Bug localedata/374] The rules in LC_COLLATE are random and sometimes clearly wrong
  2004-09-08 23:53 [Bug libc/374] New: The rules in LC_COLLATE are random and sometimes clearly wrong munzirtaha at newhorizons dot com dot sa
                   ` (6 preceding siblings ...)
  2005-01-17 21:42 ` barbier at linuxfr dot org
@ 2005-01-17 21:43 ` barbier at linuxfr dot org
  2005-01-17 22:29 ` barbier at linuxfr dot org
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: barbier at linuxfr dot org @ 2005-01-17 21:43 UTC
  To: glibc-bugs


------- Additional Comments From barbier at linuxfr dot org  2005-01-17 21:43 -------
Created an attachment (id=371)
 --> (http://sources.redhat.com/bugzilla/attachment.cgi?id=371&action=view)
C source file for the tst-wcscoll program

This program can only process files composed of lines of 2 UTF-8
characters; some modifications are needed to accept any input.


-- 


http://sources.redhat.com/bugzilla/show_bug.cgi?id=374


* [Bug localedata/374] The rules in LC_COLLATE are random and sometimes clearly wrong
  2004-09-08 23:53 [Bug libc/374] New: The rules in LC_COLLATE are random and sometimes clearly wrong munzirtaha at newhorizons dot com dot sa
                   ` (7 preceding siblings ...)
  2005-01-17 21:43 ` barbier at linuxfr dot org
@ 2005-01-17 22:29 ` barbier at linuxfr dot org
  2005-10-14 23:02 ` drepper at redhat dot com
  2006-04-10 16:56 ` mfabian at suse dot de
  10 siblings, 0 replies; 12+ messages in thread
From: barbier at linuxfr dot org @ 2005-01-17 22:29 UTC
  To: glibc-bugs


------- Additional Comments From barbier at linuxfr dot org  2005-01-17 22:29 -------
(From update of attachment 370)
Oops, this patch was for BZ#368.


-- 
           What    |Removed                     |Added
----------------------------------------------------------------------------
 Attachment #370 is|0                           |1
           obsolete|                            |


http://sources.redhat.com/bugzilla/show_bug.cgi?id=374


* [Bug localedata/374] The rules in LC_COLLATE are random and sometimes clearly wrong
  2004-09-08 23:53 [Bug libc/374] New: The rules in LC_COLLATE are random and sometimes clearly wrong munzirtaha at newhorizons dot com dot sa
                   ` (8 preceding siblings ...)
  2005-01-17 22:29 ` barbier at linuxfr dot org
@ 2005-10-14 23:02 ` drepper at redhat dot com
  2006-04-10 16:56 ` mfabian at suse dot de
  10 siblings, 0 replies; 12+ messages in thread
From: drepper at redhat dot com @ 2005-10-14 23:02 UTC
  To: glibc-bugs


------- Additional Comments From drepper at redhat dot com  2005-10-14 23:02 -------
If any locale definition should change, send a patch with justification.  Just
saying "I don't like it" achieves *nothing*.  I'm closing this bug since there
is absolutely no substance here.  Locales are only updated if somebody who cares
does the work.

-- 
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |WONTFIX


http://sourceware.org/bugzilla/show_bug.cgi?id=374


* [Bug localedata/374] The rules in LC_COLLATE are random and sometimes clearly wrong
  2004-09-08 23:53 [Bug libc/374] New: The rules in LC_COLLATE are random and sometimes clearly wrong munzirtaha at newhorizons dot com dot sa
                   ` (9 preceding siblings ...)
  2005-10-14 23:02 ` drepper at redhat dot com
@ 2006-04-10 16:56 ` mfabian at suse dot de
  10 siblings, 0 replies; 12+ messages in thread
From: mfabian at suse dot de @ 2006-04-10 16:56 UTC
  To: glibc-bugs



-- 
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |mfabian at suse dot de


http://sourceware.org/bugzilla/show_bug.cgi?id=374

