public inbox for glibc-bugs-regex@sourceware.org
From: "bonzini at gnu dot org" <sourceware-bugzilla@sourceware.org>
To: glibc-bugs-regex@sources.redhat.com
Subject: [Bug manual/12045] regex range semantics outside of POSIX should be documented
Date: Fri, 24 Sep 2010 12:35:00 -0000
Message-ID: <20100924123502.20877.qmail@sourceware.org>
In-Reply-To: <20100921152445.12045.eblake@redhat.com>

------- Additional Comments From bonzini at gnu dot org  2010-09-24 12:35 -------
It turns out that regex range semantics for glibc are "CEO".  They _are_
consistent; it's the locale definition files that are not consistent.

I created a file with the 52 uppercase and lowercase letters and did a
"sed -n /[A-Z]/p" on this file.  The results I get are either this

    26 A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

or this

    51 AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZ

Here are the "51" locales:

    ar_SA cs_CZ hr_HR hsb_DE is_IS km_KH lo_LA lt_LT
    lv_LV or_IN pl_PL sk_SK sl_SI th_TH tr_CY tr_TR

These return 51 for both $l and $l.utf8.  Every other locale returns 26
for both unibyte and multibyte variants.

Locales using glibc's localedata/locales/iso14651_t1_common template
return 26.  This template defines the collation like this:

    <U0061> <a>;<BAS>;<MIN>;IGNORE # 198 a    start lowercase
    <U00AA> <a>;<PCL>;<EMI>;IGNORE # 199 ª
    <U00E1> <a>;<ACA>;<MIN>;IGNORE # 200 á
    ...
    <U007A> <z>;<BAS>;<MIN>;IGNORE # 507 z
    ...
    <U00FE> <th>;<BAS>;<MIN>;IGNORE # 516 þ   end lowercase
    <U0041> <a>;<BAS>;<CAP>;IGNORE # 517 A    start uppercase
    <U00C1> <a>;<ACA>;<CAP>;IGNORE # 518 Á
    ...
    <U005A> <z>;<BAS>;<CAP>;IGNORE # 813 Z
    ...
    <U00DE> <th>;<BAS>;<CAP>;IGNORE # 824 Þ   end uppercase

(There's no end to surprises: [a-z] comes _before_ [A-Z], which is why
[A-z] fails but [a-Z] works).

Instead, the "special" locales above use a different sequence, for
example in cs_CZ:

    <U0041> <U0041>;<NONE>;<CAPITAL>;<U0041> # A
    <U0061> <U0041>;<NONE>;<SMALL>;<U0041> # a
    <U00AA> <U0041>;<NONE>;<U00AA>;<U0041> # ª
    <U00C1> <U0041>;<ACUTE>;<CAPITAL>;<U0041> # Á
    <U00E1> <U0041>;<ACUTE>;<SMALL>;<U0041> # á
    ...
    <U005A> <U005A>;<NONE>;<CAPITAL>;<U005A> # Z
    <U007A> <U005A>;<NONE>;<SMALL>;<U005A> # z

So, it looks like __collseq_table_lookup is what the POSIX rationale
document calls "CEO".

I'll open a bug on the inconsistencies caused by using CEO.  In the
meantime, this bug remains open for the documentation part.

-- 
           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|regex                       |manual
            Summary|regex range semantics       |regex range semantics
                   |outside of POSIX should be  |outside of POSIX should be
                   |documented and consistent   |documented

http://sourceware.org/bugzilla/show_bug.cgi?id=12045

------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.
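
The two outcomes reported in the message (26 versus 51 matches for [A-Z]) can be illustrated with a small model of range matching by position in a collating sequence. This is only a sketch under simplifying assumptions, not glibc's implementation: the sequences contain just the 52 ASCII letters, whereas the real locale tables also hold accented letters and other symbols, and the names BLOCK_ORDER, INTERLEAVED_ORDER and range_matches are hypothetical.

```python
import string

# Simplified collating sequences restricted to the 52 ASCII letters.
# Assumption: real locale tables interleave many more characters.

# iso14651_t1_common style: all lowercase letters, then all uppercase letters.
BLOCK_ORDER = string.ascii_lowercase + string.ascii_uppercase

# cs_CZ style: each uppercase letter is followed by its lowercase form.
INTERLEAVED_ORDER = "".join(
    upper + lower
    for upper, lower in zip(string.ascii_uppercase, string.ascii_lowercase)
)


def range_matches(order, lo, hi):
    """Return the letters whose position in `order` lies between lo and hi.

    Models a bracket range [lo-hi] evaluated by sequence position rather
    than by code point. Raises ValueError when lo sorts after hi.
    """
    pos = {ch: i for i, ch in enumerate(order)}
    if pos[lo] > pos[hi]:
        raise ValueError(f"invalid range [{lo}-{hi}]: start sorts after end")
    return order[pos[lo]:pos[hi] + 1]


if __name__ == "__main__":
    for name, order in (("block", BLOCK_ORDER),
                        ("interleaved", INTERLEAVED_ORDER)):
        matched = range_matches(order, "A", "Z")
        print(name, len(matched), matched)
```

In this model, [A-Z] over the block sequence selects only the uppercase letters, while over the interleaved sequence it selects every letter except z. The block sequence also rejects [A-z] (A sorts after z) but accepts [a-Z], which is the surprise noted in the message.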
Thread overview: 7+ messages
2010-09-21 15:25 [Bug regex/12045] New: regex range semantics outside of POSIX should be documented and consistent eblake at redhat dot com
2010-09-21 15:26 ` [Bug regex/12045] " eblake at redhat dot com
2010-09-21 15:49 ` bonzini at gnu dot org
2010-09-21 22:18 ` eblake at redhat dot com
2010-09-24 12:35 ` bonzini at gnu dot org [this message]
     [not found] <bug-12045-132@http.sourceware.org/bugzilla/>
2012-12-19 10:48 ` [Bug manual/12045] regex range semantics outside of POSIX should be documented schwab@linux-m68k.org
2014-06-30  8:01 ` fweimer at redhat dot com