------- Additional Comments From bonzini at gnu dot org 2010-09-24 12:35 -------
It turns out that regex range semantics for glibc are "CEO". They _are_
consistent; it's the locale definition files that are not consistent.

I created a file with the 52 uppercase and lowercase letters and did a
"sed -n /[A-Z]/p" on this file. The results I get are either this

  26 A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

or this

  51 AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZ

Here are the "51" locales:

  ar_SA cs_CZ hr_HR hsb_DE is_IS km_KH lo_LA lt_LT
  lv_LV or_IN pl_PL sk_SK sl_SI th_TH tr_CY tr_TR

These return 51 for both $l and $l.utf8. Every other locale returns 26
for both unibyte and multibyte variants.

Locales using glibc's localedata/locales/iso14651_t1_common template
return 26. This template defines the collation like this:

  ;;;IGNORE # 198 a      start lowercase
  ;;;IGNORE # 199 ª
  ;;;IGNORE # 200 á
  ...
  ;;;IGNORE # 507 z
  ...
  ;;;IGNORE # 516 Þ      end lowercase
  ;;;IGNORE # 517 A      start uppercase
  ;;;IGNORE # 518 Á
  ...
  ;;;IGNORE # 813 Z
  ...
  ;;;IGNORE # 824 þ      end uppercase

(There's no end to surprises: [a-z] comes _before_ [A-Z], which is why
[A-z] fails but [a-Z] works.)

The "special" locales above instead use a different sequence, for
example in cs_CZ:

  ;;; # A
  ;;; # a
  ;;; # ª
  ;;; # Á
  ;;; # á
  ...
  ;;; # Z
  ;;; # z

So, it looks like __collseq_table_lookup is what the POSIX rationale
document calls "CEO". I'll open a bug on the inconsistencies caused by
using CEO. In the meantime, this bug remains open for the documentation
part.

-- 
           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|regex                       |manual
            Summary|regex range semantics       |regex range semantics
                   |outside of POSIX should be  |outside of POSIX should be
                   |documented and consistent   |documented

http://sourceware.org/bugzilla/show_bug.cgi?id=12045
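
To illustrate the two outcomes reported in this comment, here is a minimal Python sketch of how a range such as [A-Z] behaves when membership is decided by position in a collation sequence. It is only a model, not glibc's implementation: the two sequences are simplified ASCII-only stand-ins (accented letters and all other symbols are left out), and the names LOWER_FIRST, INTERLEAVED and range_members are hypothetical, introduced just for this example.

```python
import string

# Assumption: ASCII-only stand-ins for the two orderings described in the
# comment. Real locale files also contain accented letters and symbols.
# Lowercase block first, then uppercase (iso14651_t1_common style).
LOWER_FIRST = string.ascii_lowercase + string.ascii_uppercase
# Upper/lower pairs interleaved: A a B b ... Z z (cs_CZ style).
INTERLEAVED = "".join(
    upper + lower
    for upper, lower in zip(string.ascii_uppercase, string.ascii_lowercase)
)


def range_members(start, end, sequence):
    """Return the characters that a range [start-end] would cover.

    A character is covered when its position in `sequence` lies between
    the positions of the two endpoints, inclusive. A range whose start
    sorts after its end is rejected.
    """
    lo, hi = sequence.index(start), sequence.index(end)
    if lo > hi:
        raise ValueError(f"invalid range [{start}-{end}]: start sorts after end")
    return sequence[lo:hi + 1]


if __name__ == "__main__":
    for name, seq in (("lowercase-first", LOWER_FIRST), ("interleaved", INTERLEAVED)):
        matched = range_members("A", "Z", seq)
        print(name, len(matched), matched)
```

In this model the lowercase-first sequence gives 26 matches for [A-Z] and the interleaved one gives 51 (everything except "z"), which mirrors the two results above. It also reproduces the aside about [A-z] and [a-Z]: with the lowercase-first sequence, range_members("A", "z", LOWER_FIRST) raises because "A" sorts after "z", while range_members("a", "Z", LOWER_FIRST) covers all 52 letters.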