public inbox for glibc-bugs@sourceware.org
From: "pablo at mandrakesoft dot com" <sourceware-bugzilla@sources.redhat.com>
To: glibc-bugs@sources.redhat.com
Subject: [Bug localedata/374] The rules in LC_COLLATE are random and sometimes clearly wrong
Date: Thu, 30 Sep 2004 21:07:00 -0000
Message-ID: <20040930210746.27693.qmail@sourceware.org>
In-Reply-To: <20040908235331.374.munzirtaha@newhorizons.com.sa>

------- Additional Comments From pablo at mandrakesoft dot com 2004-09-30 21:07 -------
I think some LC_COLLATE definitions are indeed wrong, as if they had not been rewritten or updated to take advantage of the new (glibc > 2.2) possibilities.

When you look at ar_SA, LC_COLLATE is defined with lines like:

    order_start forward; forward
    <U0020> <U0020>
    ...
    <U0030> <U0030>
    <U0031> <U0031>
    <U0032> <U0032>
    ....
    <U0041> <U0041>;<U0041>
    <U0061> <U0041>;<U0061>
    ...

If you compare with iso14651_t1 (used, possibly with additions, by most other locales) you see things like this instead:

    <U0020> IGNORE;IGNORE;IGNORE;<U0020> # 32 <SP>
    ...
    <U0030> <0>;<BAS>;<MIN>;IGNORE # 171 0
    <U0031> <1>;<BAS>;<MIN>;IGNORE # 172 1
    <U0032> <2>;<BAS>;<MIN>;IGNORE # 173 2
    ...
    <U0061> <a>;<BAS>;<MIN>;IGNORE # 198 a
    ...
    <U0041> <a>;<BAS>;<CAP>;IGNORE # 319 A
    ...

While ar_SA gives only one, or in some cases two, information tokens for each element, the more modern LC_COLLATE definitions have four. You can also see that in ar_SA the space (<U0020>) is treated the same way as the digits, whereas in the more modern LC_COLLATE definition it is not; in fact the space is defined as neutral for sorting.
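To make the four-level scheme concrete, here is a toy Python sketch of a level-by-level comparison. The weight table copies only the iso14651_t1 entries quoted above; the numeric ranks assigned to the symbols (`<0>`, `<a>`, `<MIN>`, `<CAP>`, ...) and the helper name `sort_key` are assumptions made for illustration. This is not glibc's actual algorithm or data.

```python
# Toy model of multi-level collation weights (NOT glibc's implementation).
# None stands for IGNORE: the character contributes nothing at that level.
# The ranks below are made up; only their relative order matters.
RANK = {
    "<0>": 1, "<1>": 2, "<2>": 3, "<a>": 4,   # level 1: base character
    "<BAS>": 1,                               # level 2: accent class
    "<MIN>": 1, "<CAP>": 2,                   # level 3: lowercase before uppercase
    "<U0020>": 1,                             # level 4: the character itself
}

# Entries copied from the iso14651_t1 lines quoted in the message.
WEIGHTS = {
    " ": (None, None, None, "<U0020>"),
    "0": ("<0>", "<BAS>", "<MIN>", None),
    "1": ("<1>", "<BAS>", "<MIN>", None),
    "2": ("<2>", "<BAS>", "<MIN>", None),
    "a": ("<a>", "<BAS>", "<MIN>", None),
    "A": ("<a>", "<BAS>", "<CAP>", None),
}


def sort_key(text):
    """Return one weight list per level; Python compares them in order."""
    levels = ([], [], [], [])
    for ch in text:
        for level, symbol in zip(levels, WEIGHTS[ch]):
            if symbol is not None:
                level.append(RANK[symbol])
    return levels


words = ["A 1", "a1", "a 0", "A0", "a0"]
print(sorted(words, key=sort_key))
# → ['a0', 'a 0', 'A0', 'a1', 'A 1']
```

In this sketch the space only breaks ties at the fourth level, and case only matters once the first two levels are equal, which is the behaviour the single-weight ar_SA lines cannot express.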
In the modern LC_COLLATE, the Latin letters carry information telling whether they are uppercase or lowercase; that information is missing from the definition in ar_SA.

da_DK is a bit stranger: it uses a modern LC_COLLATE definition but redefines everything itself (instead of including iso14651_t1 and redefining only what differs). Spaces and blanks have a first-order sorting weight, which seems very strange to me; but even if the Danish language sorts spaces in such a peculiar way, it is still strange to sort the space (0020) and the non-breaking space (00A0) differently. Semantically they are the same thing; the difference is only typographical.

While the sorting of letters is correct (at least for the letters used by a given language; ar_SA, for example, happily ignores any Latin letter outside ASCII: while ar_EG sorts "agrave" together with "a", ar_SA puts "agrave" after the last Arabic letter...), the handling of punctuation and other special symbols should be reviewed, imho.

Also, all locales should include iso14651_t1 so that there is an acceptable sort order for alphabetic symbols outside the alphabet of the given locale. In a UTF-8 world you will likely see such things: I get, for example, mail from people whose names contain cacute, ccaron, lstroke, eogonek, etc. None of those exist in my language, but I expect them to be sorted with "c", "c", "l", "e" respectively, and not after "z".

-- 
http://sources.redhat.com/bugzilla/show_bug.cgi?id=374
------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.
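The "include a base table and override only what differs" argument can be sketched in Python as follows. Everything here is hypothetical: the tables `BASE`, `OVERRIDES` and `LOCALE_ONLY`, the helper `first_level_key`, and the sample words are invented for illustration, and only a first-level weight is modelled. It is not real locale data.

```python
# Toy illustration of falling back to a shared base table (hypothetical data).
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

# Stand-in for a shared base table such as iso14651_t1: accented letters get
# the first-level weight of their base letter.
BASE = {ch: ord(ch) for ch in ALPHABET}
BASE.update({"ć": BASE["c"], "č": BASE["c"], "ł": BASE["l"], "ę": BASE["e"]})

# A locale that defines everything itself and knows only its own alphabet.
LOCALE_ONLY = {ch: ord(ch) for ch in ALPHABET}

# A locale that includes the base table and overrides only what differs
# (nothing, in this toy case).
OVERRIDES = {}
LOCALE_WITH_BASE = {**BASE, **OVERRIDES}

UNKNOWN = ord("z") + 1  # characters missing from the table sort after "z"


def first_level_key(table):
    """Build a sort key function from a first-level weight table."""
    return lambda word: [table.get(ch, UNKNOWN) for ch in word.lower()]


names = ["dan", "ćma", "łan", "cel", "man"]
print(sorted(names, key=first_level_key(LOCALE_ONLY)))
# → ['cel', 'dan', 'man', 'łan', 'ćma']
print(sorted(names, key=first_level_key(LOCALE_WITH_BASE)))
# → ['cel', 'ćma', 'dan', 'łan', 'man']
```

With the stand-alone table the unknown letters land after "z"; with the base table included they sort next to "c" and "l", which is the behaviour asked for in the message.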
Thread overview: 12+ messages
2004-09-08 23:53 [Bug libc/374] New: " munzirtaha at newhorizons dot com dot sa
2004-09-09 16:27 ` [Bug libc/374] " gotom at debian dot or dot jp
2004-09-12 18:38 ` munzirtaha at newhorizons dot com dot sa
2004-09-26  9:33 ` [Bug localedata/374] " drepper at redhat dot com
2004-09-30 20:26 ` munzirtaha at newhorizons dot com dot sa
2004-09-30 21:07 ` pablo at mandrakesoft dot com [this message]
2004-10-01 15:57 ` munzirtaha at newhorizons dot com dot sa
2005-01-17 21:42 ` barbier at linuxfr dot org
2005-01-17 21:43 ` barbier at linuxfr dot org
2005-01-17 22:29 ` barbier at linuxfr dot org
2005-10-14 23:02 ` drepper at redhat dot com
2006-04-10 16:56 ` mfabian at suse dot de