public inbox for glibc-bugs@sourceware.org
* [Bug localedata/12051] New: CEO has confusing differences across locales
@ 2010-09-24 12:48 bonzini at gnu dot org
  2010-09-24 12:51 ` [Bug localedata/12051] " bonzini at gnu dot org
  2010-10-04  2:42 ` drepper dot fsp at gmail dot com
  0 siblings, 2 replies; 3+ messages in thread
From: bonzini at gnu dot org @ 2010-09-24 12:48 UTC (permalink / raw)
  To: glibc-bugs


According to POSIX 2008, there was a requirement in older POSIX that range
expressions be treated as CEO (collating element order) for all locales.  POSIX
mentions some disadvantages of CEO, but one in particular is omitted---and glibc
has it: even when only considering ASCII characters and a single implementation,
the behavior with respect to case varies across locales: in some locales,
"[a-e]" may match either 'A' or 'E', while in others it will match none.
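One way to observe the difference described above is to compile the same bracket expression under several locales and probe it with single characters. The following is a minimal sketch, not part of the original report: the helper names `range_matches` and `report` are hypothetical, and the result for any locale other than "C" depends on the locale data installed on the system.

```c
#include <assert.h>
#include <locale.h>
#include <regex.h>
#include <stdio.h>

/* Returns 1 if `pattern` matches the one-character string made of `c`
   under `locale_name`, 0 if it does not, and -1 if the locale cannot be
   selected (for example, not installed) or the pattern fails to compile.
   The process locale is reset to "C" before returning on those paths
   that changed it. */
static int range_matches(const char *locale_name, const char *pattern, char c)
{
    assert(locale_name != NULL && pattern != NULL);

    /* setlocale() is process-global; the pattern is compiled while the
       requested locale is in effect. */
    if (setlocale(LC_ALL, locale_name) == NULL)
        return -1;

    regex_t re;
    if (regcomp(&re, pattern, REG_NOSUB) != 0) {
        setlocale(LC_ALL, "C");
        return -1;
    }

    char subject[2] = { c, '\0' };
    int rc = regexec(&re, subject, 0, NULL, 0);

    regfree(&re);
    setlocale(LC_ALL, "C");
    return rc == 0 ? 1 : 0;
}

/* Prints how "[a-e]" treats a few ASCII letters in one locale. */
static void report(const char *locale_name)
{
    const char probes[] = { 'a', 'e', 'A', 'E' };
    for (size_t i = 0; i < sizeof probes; i++) {
        int r = range_matches(locale_name, "[a-e]", probes[i]);
        printf("%-14s [a-e] vs '%c': %s\n", locale_name, probes[i],
               r < 0 ? "locale unavailable" : r ? "match" : "no match");
    }
}
```

Calling `report()` from `main` with "C", one of the locales listed below, and one locale that is not in the list should make any difference visible. The exact locale name to pass (for instance, whether a codeset suffix is needed) is an assumption that depends on the system.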

CEO in glibc is inconsistent for these locales:

  ar_SA cs_CZ hr_HR hsb_DE is_IS km_KH lo_LA lt_LT lv_LV or_IN pl_PL sk_SK
  sl_SI th_TH tr_CY tr_TR

which are the only ones following this model (from cs_CZ):

  <U0041> <U0041>;<NONE>;<CAPITAL>;<U0041>    # A
  <U0061> <U0041>;<NONE>;<SMALL>;<U0041>    # a
  <U00AA> <U0041>;<NONE>;<U00AA>;<U0041>    # ª
  <U00C1> <U0041>;<ACUTE>;<CAPITAL>;<U0041>    # Á
  <U00E1> <U0041>;<ACUTE>;<SMALL>;<U0041>    # á
  ...
  <U005A> <U005A>;<NONE>;<CAPITAL>;<U005A>    # Z
  <U007A> <U005A>;<NONE>;<SMALL>;<U005A>    # z

rather than the one in localedata/locales/iso14651_t1_common:

  <U0061> <a>;<BAS>;<MIN>;IGNORE # 198 a    start lowercase
  <U00AA> <a>;<PCL>;<EMI>;IGNORE # 199 ª
  <U00E1> <a>;<ACA>;<MIN>;IGNORE # 200 á
  ...
  <U007A> <z>;<BAS>;<MIN>;IGNORE # 507 z
  ...
  <U00FE> <th>;<BAS>;<MIN>;IGNORE # 516 þ     end lowercase
  <U0041> <a>;<BAS>;<CAP>;IGNORE # 517 A    start uppercase
  <U00C1> <a>;<ACA>;<CAP>;IGNORE # 518 Á
  ...
  <U005A> <z>;<BAS>;<CAP>;IGNORE # 813 Z
  ...
  <U00DE> <th>;<BAS>;<CAP>;IGNORE # 824 Þ    end uppercase
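
To see why the two layouts above lead to different range behavior, it may help to treat a range as "every element whose position in the collation sequence lies between the two endpoints". The sketch below is a deliberately simplified model, not glibc's implementation: the two order strings are hypothetical ASCII-only stand-ins (the first assumes every letter follows the capital-then-small pattern shown for A and Z, the second assumes all lowercase letters precede all uppercase ones), and `in_range_by_order` is an illustrative helper.

```c
#include <string.h>

/* Hypothetical stand-in for the cs_CZ-style layout: capital, then small,
   letter by letter.  Accented and non-ASCII entries are left out. */
static const char INTERLEAVED[] =
    "AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz";

/* Hypothetical stand-in for the iso14651_t1_common-style layout:
   all lowercase first, then all uppercase. */
static const char LOWER_FIRST[] =
    "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

/* Returns 1 if `c` sits between `lo` and `hi` (inclusive) in `order`,
   0 otherwise, including when any of the three is absent from `order`. */
static int in_range_by_order(const char *order, char lo, char hi, char c)
{
    if (lo == '\0' || hi == '\0' || c == '\0')
        return 0;   /* strchr() would otherwise find the terminator */

    const char *plo = strchr(order, lo);
    const char *phi = strchr(order, hi);
    const char *pc  = strchr(order, c);

    if (plo == NULL || phi == NULL || pc == NULL)
        return 0;
    return plo <= pc && pc <= phi;
}
```

Under this model, the range from 'a' to 'e' in the interleaved order takes in 'B' through 'E' but not 'A', because 'A' sorts before 'a'. In the lowercase-first order the same range contains no uppercase letter at all.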

As an aside, the CEO requirement was specifically relaxed in POSIX 2001, so
glibc is insisting on CEO ordering because of a version of POSIX two editions
ago (without documenting it).  At the same time, other glibc interfaces no
longer comply with the stricter requirements in older POSIX that have since been
relaxed (for example, whether getopt() must include an error message with
"illegal" in the string).  So, there is no reason to tie regex to the older
standard's CEO ordering.

-- 
           Summary: CEO has confusing differences across locales
           Product: glibc
           Version: 2.12
            Status: NEW
          Severity: normal
          Priority: P2
         Component: localedata
        AssignedTo: libc-locales at sources dot redhat dot com
        ReportedBy: bonzini at gnu dot org
                CC: glibc-bugs at sources dot redhat dot com


http://sourceware.org/bugzilla/show_bug.cgi?id=12051

------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.



* [Bug localedata/12051] CEO has confusing differences across locales
  2010-09-24 12:48 [Bug localedata/12051] New: CEO has confusing differences across locales bonzini at gnu dot org
@ 2010-09-24 12:51 ` bonzini at gnu dot org
  2010-10-04  2:42 ` drepper dot fsp at gmail dot com
  1 sibling, 0 replies; 3+ messages in thread
From: bonzini at gnu dot org @ 2010-09-24 12:51 UTC (permalink / raw)
  To: glibc-bugs



-- 
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |eblake at redhat dot com


http://sourceware.org/bugzilla/show_bug.cgi?id=12051




* [Bug localedata/12051] CEO has confusing differences across locales
  2010-09-24 12:48 [Bug localedata/12051] New: CEO has confusing differences across locales bonzini at gnu dot org
  2010-09-24 12:51 ` [Bug localedata/12051] " bonzini at gnu dot org
@ 2010-10-04  2:42 ` drepper dot fsp at gmail dot com
  1 sibling, 0 replies; 3+ messages in thread
From: drepper dot fsp at gmail dot com @ 2010-10-04  2:42 UTC (permalink / raw)
  To: glibc-bugs


------- Additional Comments From drepper dot fsp at gmail dot com  2010-10-04 02:42 -------
This stays as it is.  If individual locale maintainers think the current behavior
is unintentional, they can change it.  But in general this is the
long-implemented behavior and won't be changed.  Collating elements are really
only useful in the POSIX locale or when the locale is guaranteed to stay
the same.

-- 
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |WONTFIX


http://sourceware.org/bugzilla/show_bug.cgi?id=12051



