From mboxrd@z Thu Jan 1 00:00:00 1970
Received: (qmail 22489 invoked by alias); 24 Sep 2010 12:35:18 -0000
Received: (qmail 20878 invoked by uid 48); 24 Sep 2010 12:35:02 -0000
Date: Fri, 24 Sep 2010 12:35:00 -0000
Message-ID: <20100924123502.20877.qmail@sourceware.org>
From: "bonzini at gnu dot org"
To: glibc-bugs-regex@sources.redhat.com
In-Reply-To: <20100921152445.12045.eblake@redhat.com>
References: <20100921152445.12045.eblake@redhat.com>
Reply-To: sourceware-bugzilla@sourceware.org
Subject: [Bug manual/12045] regex range semantics outside of POSIX should be documented
X-Bugzilla-Reason: CC
Mailing-List: contact glibc-bugs-regex-help@sourceware.org; run by ezmlm
Precedence: bulk
Sender: glibc-bugs-regex-owner@sourceware.org
X-SW-Source: 2010-09/txt/msg00009.txt.bz2

------- Additional Comments From bonzini at gnu dot org  2010-09-24 12:35 -------
It turns out that regex range semantics for glibc are "CEO". They _are_
consistent, it's the locale definition files that are not consistent.

I created a file with the 52 uppercase and lowercase letters and did a
"sed -n /[A-Z]/p" on this file. The results I get are either this

  26 A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

or this

  51 AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZ

here are the "51" locales:

  ar_SA cs_CZ hr_HR hsb_DE is_IS km_KH lo_LA lt_LT lv_LV or_IN pl_PL
  sk_SK sl_SI th_TH tr_CY tr_TR

These return 51 for both $l and $l.utf8. Every other locale returns 26 for
both unibyte and multibyte variants.

Locales using glibc's localedata/locales/iso14651_t1_common template return
26. This template defines the collation like this:

  ;;;IGNORE # 198 a        start lowercase
  ;;;IGNORE # 199 ª
  ;;;IGNORE # 200 á
  ...
  ;;;IGNORE # 507 z
  ...
  ;;;IGNORE # 516 Þ        end lowercase
  ;;;IGNORE # 517 A        start uppercase
  ;;;IGNORE # 518 Á
  ...
  ;;;IGNORE # 813 Z
  ...
  ;;;IGNORE # 824 þ        end uppercase

(There's no end to surprises: [a-z] comes _before_ [A-Z], which is why
[A-z] fails but [a-Z] works.)

Instead, the "special" locales above use a different sequence, for example
in cs_CZ:

  ;;; # A
  ;;; # a
  ;;; # ª
  ;;; # Á
  ;;; # á
  ...
  ;;; # Z
  ;;; # z

So, it looks like __collseq_table_lookup implements what the POSIX
rationale document calls "CEO". I'll open a bug on the inconsistencies
caused by using CEO. In the meantime, this bug remains open for the
documentation part.

-- 
           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|regex                       |manual
            Summary|regex range semantics       |regex range semantics
                   |outside of POSIX should be  |outside of POSIX should be
                   |documented and consistent   |documented

http://sourceware.org/bugzilla/show_bug.cgi?id=12045
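
For the documentation part, the range behaviour described in this comment can be modelled with a short Python sketch. It is an illustration only, not glibc's code: the two sequences are simplified stand-ins covering just the 52 ASCII letters, and the names LOWER_THEN_UPPER, INTERLEAVED and range_matches are made up for this example.

```python
import string

# Assumption: only the 52 ASCII letters are modelled. The real collation
# tables contain many more collating elements (accented letters etc.).

# Order in the style of the iso14651_t1_common template: a..z, then A..Z.
LOWER_THEN_UPPER = string.ascii_lowercase + string.ascii_uppercase

# Order in the style of the cs_CZ excerpt: A a B b ... Z z.
INTERLEAVED = "".join(
    upper + lower
    for upper, lower in zip(string.ascii_uppercase, string.ascii_lowercase)
)


def range_matches(lo, hi, order):
    """Return the letters whose position in `order` lies in [lo, hi].

    Hypothetical helper: it treats a bracket range [lo-hi] as "every
    element between lo and hi in the collation sequence", and rejects
    a range whose start sorts after its end.
    """
    pos = {ch: i for i, ch in enumerate(order)}
    if pos[lo] > pos[hi]:
        raise ValueError(f"invalid range end: [{lo}-{hi}]")
    # Walk the test letters in A a B b ... order and keep those in range.
    return [ch for ch in INTERLEAVED if pos[lo] <= pos[ch] <= pos[hi]]


if __name__ == "__main__":
    print(len(range_matches("A", "Z", LOWER_THEN_UPPER)))
    print(len(range_matches("A", "Z", INTERLEAVED)))
    print(len(range_matches("a", "Z", LOWER_THEN_UPPER)))
    try:
        range_matches("A", "z", LOWER_THEN_UPPER)
    except ValueError as err:
        print(err)
```

Under these assumptions, [A-Z] selects 26 letters in the lowercase-first order and 51 in the interleaved order, because z is the only letter that sorts after Z there. In the lowercase-first order, [a-Z] spans all 52 letters, while [A-z] is rejected because A sorts after z.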