public inbox for glibc-bugs-regex@sourceware.org
From: "bonzini at gnu dot org" <sourceware-bugzilla@sourceware.org>
To: glibc-bugs-regex@sources.redhat.com
Subject: [Bug manual/12045] regex range semantics outside of POSIX should be documented
Date: Fri, 24 Sep 2010 12:35:00 -0000	[thread overview]
Message-ID: <20100924123502.20877.qmail@sourceware.org> (raw)
In-Reply-To: <20100921152445.12045.eblake@redhat.com>



------- Additional Comments From bonzini at gnu dot org  2010-09-24 12:35 -------
It turns out that regex range semantics for glibc are "CEO".  They _are_
consistent; it is the locale definition files that are not consistent.

I created a file with the 52 uppercase and lowercase letters and did a "sed -n
/[A-Z]/p" on this file.  The results I get are either

this      26   A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
or this   51   AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZ

here are the "51" locales:

ar_SA cs_CZ hr_HR hsb_DE is_IS km_KH lo_LA lt_LT lv_LV or_IN pl_PL sk_SK
sl_SI th_TH tr_CY tr_TR

These return 51 for both $l and $l.utf8.  Every other locale returns 26 for both
unibyte and multibyte variants.
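
A minimal sketch of the experiment in Python, for anyone who wants to repeat
it.  It assumes the input file had one letter per line and that sed is on the
PATH; the helper names make_letters and matching_letters are made up for
illustration.  The result depends on which locales are installed and on the
glibc version: if the requested locale is missing, sed most likely stays in
the C locale, so a count of 26 may only mean the locale is not available.

```python
import os
import string
import subprocess


def make_letters():
    """Return the 52 ASCII letters, uppercase first (hypothetical helper)."""
    return list(string.ascii_uppercase + string.ascii_lowercase)


def matching_letters(locale_name, pattern="[A-Z]"):
    """Run `sed -n /PATTERN/p` under LC_ALL=locale_name on one letter per line.

    Returns the list of letters that sed printed.
    """
    env = dict(os.environ, LC_ALL=locale_name)
    data = "\n".join(make_letters()) + "\n"
    result = subprocess.run(
        ["sed", "-n", f"/{pattern}/p"],
        input=data,
        capture_output=True,
        text=True,
        env=env,
        check=True,
    )
    return result.stdout.split()
```

Calling len(matching_letters("cs_CZ.utf8")) and
len(matching_letters("en_US.utf8")) is one way to compare a "51" locale with a
"26" locale, assuming both are installed.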

Locales using glibc's localedata/locales/iso14651_t1_common template return 26.
This template defines the collation like this:

  <U0061> <a>;<BAS>;<MIN>;IGNORE # 198 a    start lowercase
  <U00AA> <a>;<PCL>;<EMI>;IGNORE # 199 ª
  <U00E1> <a>;<ACA>;<MIN>;IGNORE # 200 á
  ...
  <U007A> <z>;<BAS>;<MIN>;IGNORE # 507 z
  ...
  <U00FE> <th>;<BAS>;<MIN>;IGNORE # 516 þ     end lowercase
  <U0041> <a>;<BAS>;<CAP>;IGNORE # 517 A    start uppercase
  <U00C1> <a>;<ACA>;<CAP>;IGNORE # 518 Á
  ...
  <U005A> <z>;<BAS>;<CAP>;IGNORE # 813 Z
  ...
  <U00DE> <th>;<BAS>;<CAP>;IGNORE # 824 Þ    end uppercase

(There's no end to surprises: in this template the lowercase letters come
_before_ the uppercase ones, which is why [A-z] fails, since its start point
collates after its end point, but [a-Z] works).
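
The effect of this ordering on ranges can be sketched with a toy model in
Python.  It is not glibc's code: it assumes a range matches every character
whose position in the collation sequence lies between the two end points, and
it keeps only the 52 ASCII letters of the template (lowercase block first, then
uppercase).  The names ISO14651_LIKE and range_members are hypothetical.

```python
import string

# Assumption: simplified sequence with the lowercase block before the
# uppercase block, as in the template excerpt (ASCII letters only).
ISO14651_LIKE = string.ascii_lowercase + string.ascii_uppercase


def range_members(order, first, last):
    """Return the characters of `order` lying between `first` and `last`.

    Raises ValueError when `first` collates after `last`, which models an
    invalid range such as [A-z] under this ordering.
    """
    pos = {ch: i for i, ch in enumerate(order)}
    if pos[first] > pos[last]:
        raise ValueError(f"invalid range [{first}-{last}]")
    return order[pos[first]:pos[last] + 1]


if __name__ == "__main__":
    print(len(range_members(ISO14651_LIKE, "A", "Z")))  # uppercase block only
    print(len(range_members(ISO14651_LIKE, "a", "Z")))  # spans both blocks
```

Under this model [A-Z] covers just the uppercase block, [a-Z] spans both
blocks, and [A-z] is rejected because A sits after z in the sequence.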

Instead, the "special" locales above use a different sequence, for example in cs_CZ:

  <U0041> <U0041>;<NONE>;<CAPITAL>;<U0041>    # A
  <U0061> <U0041>;<NONE>;<SMALL>;<U0041>    # a
  <U00AA> <U0041>;<NONE>;<U00AA>;<U0041>    # ª
  <U00C1> <U0041>;<ACUTE>;<CAPITAL>;<U0041>    # Á
  <U00E1> <U0041>;<ACUTE>;<SMALL>;<U0041>    # á
  ...
  <U005A> <U005A>;<NONE>;<CAPITAL>;<U005A>    # Z
  <U007A> <U005A>;<NONE>;<SMALL>;<U005A>    # z
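
A toy model in Python shows how this interleaved order leads to 51 matches.
It is only a sketch: it assumes a range matches every character positioned
between its end points in the collation sequence, and it keeps just the ASCII
letters from the cs_CZ excerpt.  CS_CZ_LIKE and matched_by_range are
hypothetical names.

```python
import string

# Assumption: simplified cs_CZ-like sequence "AaBb...Zz", where each
# uppercase letter is immediately followed by its lowercase form.
CS_CZ_LIKE = "".join(
    upper + lower
    for upper, lower in zip(string.ascii_uppercase, string.ascii_lowercase)
)


def matched_by_range(order, first, last):
    """Return the slice of `order` from `first` to `last`, inclusive.

    Returns an empty string if `first` comes after `last`.
    """
    start, end = order.index(first), order.index(last)
    return order[start:end + 1] if start <= end else ""


if __name__ == "__main__":
    matched = matched_by_range(CS_CZ_LIKE, "A", "Z")
    print(len(matched), matched)
```

In this model everything from A up to Z is inside [A-Z], including the
lowercase letters a through y; only z, which follows Z, is left out.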

So, it looks like __collseq_table_lookup implements what the POSIX rationale
document calls "CEO".  I'll open a bug on the inconsistencies caused by using
CEO.  In the meantime, this bug remains open for the documentation part.


-- 
           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|regex                       |manual
            Summary|regex range semantics       |regex range semantics
                   |outside of POSIX should be  |outside of POSIX should be
                   |documented and consistent   |documented


http://sourceware.org/bugzilla/show_bug.cgi?id=12045


