public inbox for glibc-bugs@sourceware.org
From: "pablo at mandrakesoft dot com" <sourceware-bugzilla@sources.redhat.com>
To: glibc-bugs@sources.redhat.com
Subject: [Bug localedata/374] The rules in LC_COLLATE are random and sometimes clearly wrong
Date: Thu, 30 Sep 2004 21:07:00 -0000
Message-ID: <20040930210746.27693.qmail@sourceware.org>
In-Reply-To: <20040908235331.374.munzirtaha@newhorizons.com.sa>


------- Additional Comments From pablo at mandrakesoft dot com  2004-09-30 21:07 -------
I think some LC_COLLATE definitions are indeed wrong; it looks like they have
never been rewritten or updated to take advantage of the new (glibc > 2.2)
possibilities.

When you look at ar_SA, the LC_COLLATE is defined with lines like:

order_start             forward; forward
<U0020> <U0020>
...
<U0030> <U0030>
<U0031> <U0031>
<U0032> <U0032>
....
<U0041> <U0041>;<U0041>
<U0061> <U0041>;<U0061>
...

If you compare with iso14651_t1 (used, possibly with additions, by most other
locales), you see things like this instead:
<U0020> IGNORE;IGNORE;IGNORE;<U0020> # 32 <SP>
...
<U0030> <0>;<BAS>;<MIN>;IGNORE # 171 0
<U0031> <1>;<BAS>;<MIN>;IGNORE # 172 1
<U0032> <2>;<BAS>;<MIN>;IGNORE # 173 2
...
<U0061> <a>;<BAS>;<MIN>;IGNORE # 198 a
...
<U0041> <a>;<BAS>;<CAP>;IGNORE # 319 A
...

While ar_SA gives only one, or in some cases two, information tokens for each
element, the more modern LC_COLLATE definitions have four.
You can also see that while ar_SA treats the space (<U0020>) the same way as
the digits, the more modern LC_COLLATE definition does not; there the space is
defined as sorting-neutral (IGNORE at the first three levels).
In the modern LC_COLLATE the Latin letters carry information telling whether
they are uppercase or lowercase (<CAP> or <MIN>); that information is missing
from the definition in ar_SA.
da_DK is a bit stranger: it uses a modern LC_COLLATE definition, but
redefines everything itself (instead of including iso14651_t1 and only
redefining what differs). Spaces and blanks have a first-level sorting weight,
which seems very strange to me. Even if the Danish language does sort spaces in
such a peculiar way, it is still strange to sort the space (0020) and the
non-breaking space (00A0) differently; semantically they are the same thing,
and the difference is only typographical.

While the sorting of letters is correct (at least for the letters used by a
given language; ar_SA, for example, happily ignores any Latin letter outside of
ASCII: whereas ar_EG sorts "agrave" together with "a", ar_SA puts
"agrave" after the last Arabic letter...), the handling of punctuation and
other special symbols should be reviewed imho.
Also, all locales should include iso14651_t1 so that there is an acceptable
sorting for alphabetic symbols outside the alphabet of the given locale. In a
UTF-8 world you will likely see such things: I get, for example, mail from
people whose names contain cacute, ccaron, lstroke, eogonek, etc. None of those
exist in my language, but I expect them to be sorted with
"c", "c", "l", "e" respectively, and not after "z".
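
As a rough illustration of the ordering I expect, the Python sketch below
groups accented letters with their base letter. It is not how iso14651_t1
does it (that file assigns explicit weights per character); canonical
decomposition (NFD) is swapped in here as a stand-in. The names EXTRA_BASE,
base_letter() and fallback_key() are hypothetical, and lstroke needs a
hand-written entry because it has no canonical decomposition.

```python
import unicodedata

# ASSUMPTION: hand-written base letters for characters that NFD cannot split.
EXTRA_BASE = {"\u0142": "l", "\u0141": "L"}  # lstroke, Lstroke


def base_letter(ch):
    """Return the base letter of ch, dropping any combining marks."""
    if ch in EXTRA_BASE:
        return EXTRA_BASE[ch]
    # NFD puts the base character first, followed by its combining marks.
    return unicodedata.normalize("NFD", ch)[0]


def fallback_key(word):
    """Primary key: base letters only; the original word breaks ties."""
    primary = "".join(base_letter(ch) for ch in word).lower()
    return (primary, word)


letters = [
    "z",
    "\u0107",  # cacute
    "c",
    "\u010d",  # ccaron
    "\u0142",  # lstroke
    "l",
    "\u0119",  # eogonek
    "e",
]

by_code_point = sorted(letters)
by_base_letter = sorted(letters, key=fallback_key)
print(by_code_point)   # all accented letters end up after "z"
print(by_base_letter)  # each accented letter follows its base letter
```

Sorting by plain code point gives c, e, l, z first and then every accented
letter, which is the "after z" result I do not want; the fallback key puts
cacute and ccaron right after "c", eogonek after "e" and lstroke after "l".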

-- 


http://sources.redhat.com/bugzilla/show_bug.cgi?id=374

------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.



Thread overview: 12+ messages
2004-09-08 23:53 [Bug libc/374] New: " munzirtaha at newhorizons dot com dot sa
2004-09-09 16:27 ` [Bug libc/374] " gotom at debian dot or dot jp
2004-09-12 18:38 ` munzirtaha at newhorizons dot com dot sa
2004-09-26  9:33 ` [Bug localedata/374] " drepper at redhat dot com
2004-09-30 20:26 ` munzirtaha at newhorizons dot com dot sa
2004-09-30 21:07 ` pablo at mandrakesoft dot com [this message]
2004-10-01 15:57 ` munzirtaha at newhorizons dot com dot sa
2005-01-17 21:42 ` barbier at linuxfr dot org
2005-01-17 21:43 ` barbier at linuxfr dot org
2005-01-17 22:29 ` barbier at linuxfr dot org
2005-10-14 23:02 ` drepper at redhat dot com
2006-04-10 16:56 ` mfabian at suse dot de
