public inbox for glibc-bugs-regex@sourceware.org
help / color / mirror / Atom feed
* [Bug manual/12045] regex range semantics outside of POSIX should be documented
       [not found] <bug-12045-132@http.sourceware.org/bugzilla/>
@ 2012-12-19 10:48 ` schwab@linux-m68k.org
  2014-06-30  8:01 ` fweimer at redhat dot com
  1 sibling, 0 replies; 3+ messages in thread
From: schwab@linux-m68k.org @ 2012-12-19 10:48 UTC (permalink / raw)
  To: glibc-bugs-regex

http://sourceware.org/bugzilla/show_bug.cgi?id=12045

Andreas Schwab <schwab@linux-m68k.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         AssignedTo|drepper.fsp at gmail dot    |unassigned at sourceware
                   |com                         |dot org

-- 
Configure bugmail: http://sourceware.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.


^ permalink raw reply	[flat|nested] 3+ messages in thread

* [Bug manual/12045] regex range semantics outside of POSIX should be documented
       [not found] <bug-12045-132@http.sourceware.org/bugzilla/>
  2012-12-19 10:48 ` [Bug manual/12045] regex range semantics outside of POSIX should be documented schwab@linux-m68k.org
@ 2014-06-30  8:01 ` fweimer at redhat dot com
  1 sibling, 0 replies; 3+ messages in thread
From: fweimer at redhat dot com @ 2014-06-30  8:01 UTC (permalink / raw)
  To: glibc-bugs-regex

https://sourceware.org/bugzilla/show_bug.cgi?id=12045

Florian Weimer <fweimer at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
              Flags|                            |security-

-- 
You are receiving this mail because:
You are on the CC list for the bug.


^ permalink raw reply	[flat|nested] 3+ messages in thread

* [Bug manual/12045] regex range semantics outside of POSIX should be documented
  2010-09-21 15:25 [Bug regex/12045] New: regex range semantics outside of POSIX should be documented and consistent eblake at redhat dot com
@ 2010-09-24 12:35 ` bonzini at gnu dot org
  0 siblings, 0 replies; 3+ messages in thread
From: bonzini at gnu dot org @ 2010-09-24 12:35 UTC (permalink / raw)
  To: glibc-bugs-regex

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain, Size: 2679 bytes --]


------- Additional Comments From bonzini at gnu dot org  2010-09-24 12:35 -------
It turns out that regex range semantics for glibc are "CEO".  They _are_
consistent, it's the locale definition files that are not consistent.

I created a file with the 52 uppercase and lowercase letters and did a "sed -n
/[A-Z]/p" on this file.  The results I get are either

this      26   A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
or this   51   AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZ

here are the "51" locales:

ar_SA cs_CZ hr_HR hsb_DE is_IS km_KH lo_LA lt_LT lv_LV or_IN pl_PL sk_SK
sl_SI th_TH tr_CY tr_TR

These return 51 for both $l and $l.utf8.  Every other locale returns 26 for both
unibyte and multibyte variants.

Locales using glibc's localedata/locales/iso14651_t1_common template return 26.
 This template defines the collation like this:

  <U0061> <a>;<BAS>;<MIN>;IGNORE # 198 a    start lowercase
  <U00AA> <a>;<PCL>;<EMI>;IGNORE # 199 ª
  <U00E1> <a>;<ACA>;<MIN>;IGNORE # 200 á
  ...
  <U007A> <z>;<BAS>;<MIN>;IGNORE # 507 z
  ...
  <U00FE> <th>;<BAS>;<MIN>;IGNORE # 516 Þ     end lowercase
  <U0041> <a>;<BAS>;<CAP>;IGNORE # 517 A    start uppercase
  <U00C1> <a>;<ACA>;<CAP>;IGNORE # 518 Á
  ...
  <U005A> <z>;<BAS>;<CAP>;IGNORE # 813 Z
  ...
  <U00DE> <th>;<BAS>;<CAP>;IGNORE # 824 þ    end uppercase

(There's no end to surprises: [a-z] comes _before_ [A-Z], which is why [A-z]
fails but [a-Z] works).

Instead, the "special" locales above use different sequence, for example in cs_CZ:

  <U0041> <U0041>;<NONE>;<CAPITAL>;<U0041>    # A
  <U0061> <U0041>;<NONE>;<SMALL>;<U0041>    # a
  <U00AA> <U0041>;<NONE>;<U00AA>;<U0041>    # ª
  <U00C1> <U0041>;<ACUTE>;<CAPITAL>;<U0041>    # Á
  <U00E1> <U0041>;<ACUTE>;<SMALL>;<U0041>    # á
  ...
  <U005A> <U005A>;<NONE>;<CAPITAL>;<U005A>    # Z
  <U007A> <U005A>;<NONE>;<SMALL>;<U005A>    # z

So, it looks like __collseq_table_lookup is what the POSIX rationale document
calls "CEO".  I'll open a bug on the inconsistencies caused by using CEO.  In
the meanwhile, this bug remains open for the documentation part.


-- 
           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|regex                       |manual
            Summary|regex range semantics       |regex range semantics
                   |outside of POSIX should be  |outside of POSIX should be
                   |documented and consistent   |documented


http://sourceware.org/bugzilla/show_bug.cgi?id=12045

------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.


^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2014-06-30  8:01 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <bug-12045-132@http.sourceware.org/bugzilla/>
2012-12-19 10:48 ` [Bug manual/12045] regex range semantics outside of POSIX should be documented schwab@linux-m68k.org
2014-06-30  8:01 ` fweimer at redhat dot com
2010-09-21 15:25 [Bug regex/12045] New: regex range semantics outside of POSIX should be documented and consistent eblake at redhat dot com
2010-09-24 12:35 ` [Bug manual/12045] regex range semantics outside of POSIX should be documented bonzini at gnu dot org

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).