From: "eblake at redhat dot com" <sourceware-bugzilla@sourceware.org>
To: glibc-bugs-regex@sources.redhat.com
Subject: [Bug regex/12045] New: regex range semantics outside of POSIX should be documented and consistent
Date: Tue, 21 Sep 2010 15:25:00 -0000
Message-ID: <20100921152445.12045.eblake@redhat.com>

This stems from https://bugzilla.redhat.com/show_bug.cgi?id=583011.

POSIX 2008 states (http://www.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html section 9.3.5 bullet 7):

"In the POSIX locale, a range expression represents the set of collating elements that fall between two elements in the collation sequence, inclusive. In other locales, a range expression has unspecified behavior: strictly conforming applications shall not rely on whether the range expression is valid, or on the set of collating elements matched."

The behavior of [A-z] in en_US.UTF-8 is "unspecified", but _not_ "undefined". A compliant app cannot guarantee what the behavior will be, but the behavior should at least be explainable, and as a QoI point, glibc should document and define this behavior as an extension to POSIX, so that apps relying on glibc can take advantage of this extension for known behavior.

However, I was unable to find any documentation of the current glibc rules for how a range expression is interpreted, and what's more, the current implementation is inconsistent with both the POSIX locale and with strcoll.

Since POSIX states that the behavior is unspecified, we are entirely at liberty to choose a _sane_ set of rules, rather than a set of rules that is inconsistent with everything else collation-based. In fact, there's _nothing_ in POSIX that requires [A-Z] to match all collation elements that collate between A and Z when outside the POSIX locale, so it would be _just as equally valid_ for [A-Z] to have the same meaning in both POSIX and en_US.UTF-8.
In fact, it would be _more_ useful to users, given the number of "bug" reports against bash, sed, grep, gawk, ..., which all boil down to complaints of people using range expressions outside the POSIX locale but expecting POSIX-locale semantics.

However, even if you insist that glibc will continue to represent range expressions as a sequence of collation elements that fall between the beginning and end collation element, across all locales, then for QoI you should also fix things to use the same locale collation sequencing as strcoll.

This set of sample programs shows the inconsistency in the current regex implementation, where strcoll and re_compile_pattern collate differently:

p1:
---
#include <stdio.h>
#include <string.h>
#include <locale.h>
int main(int argc, char *argv[])
{
    setlocale(LC_ALL, "");
    printf("%d\n", strcoll(argv[1], argv[2]));
    return 0;
}

p2:
---
#define _GNU_SOURCE 1
#include <stdio.h>
#include <string.h>
#include <regex.h>
#include <locale.h>
int main(int argc, char *argv[])
{
    struct re_pattern_buffer buf = {0};
    const char *err;
    setlocale(LC_ALL, "");
    re_set_syntax(RE_NO_EMPTY_RANGES);
    if ((err = re_compile_pattern(argv[1], strlen(argv[1]), &buf)))
        printf("%s\n", err);
    return 0;
}

$ LC_ALL=en_US.UTF-8 ./p1 A b
-1
$ LC_ALL=en_US.UTF-8 ./p2 '[A-b]'
Invalid range end
$ LC_ALL=cs_CZ.UTF-8 ./p1 A b
-1
$ LC_ALL=cs_CZ.UTF-8 ./p2 '[A-b]'
$

That is, since both en_US.UTF-8 and cs_CZ.UTF-8 collate 'A' before 'b' in strcoll(), they should both behave the same when handling the range expression [A-b] in a regex. And that's true whether you go with my desire of treating the range expression the same as the POSIX locale, or stick with the less-intuitive but equally consistent definition of all elements that collate between 'A' and 'b'. Since we have proof that glibc is doing neither behavior, I for one would love to either see glibc documentation explaining why the current behavior is deemed acceptable, or see glibc behavior changed.
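The comparison made by p1 and p2 can also be folded into a single check. The following is a minimal sketch, not part of the original report: it uses the POSIX regcomp()/regexec() interface rather than re_compile_pattern(), looks only at single-byte ASCII characters, and assumes both range endpoints are plain letters. The helper names in_range_by_strcoll and count_range_mismatches are hypothetical and chosen for illustration.

```c
#include <locale.h>
#include <regex.h>
#include <stdio.h>
#include <string.h>

/* Returns 1 if c collates between lo and hi (inclusive) according to
   strcoll() in the current locale, 0 otherwise. */
int in_range_by_strcoll(char lo, char hi, char c)
{
    char l[2] = { lo, '\0' };
    char h[2] = { hi, '\0' };
    char s[2] = { c, '\0' };
    return strcoll(l, s) <= 0 && strcoll(s, h) <= 0;
}

/* Counts the ASCII characters (1..127) for which the bracket expression
   [lo-hi] and strcoll() disagree in the current locale.
   Returns -1 if regcomp() rejects the range outright.
   Assumption: lo and hi are ordinary letters, so the generated pattern
   contains no characters that are special inside a bracket expression. */
int count_range_mismatches(char lo, char hi)
{
    char pattern[8];
    regex_t re;
    int mismatches = 0;

    snprintf(pattern, sizeof pattern, "[%c-%c]", lo, hi);
    if (regcomp(&re, pattern, REG_NOSUB) != 0)
        return -1;

    for (int c = 1; c < 128; c++) {
        char s[2] = { (char)c, '\0' };
        int by_regex = regexec(&re, s, 0, NULL, 0) == 0;
        int by_strcoll = in_range_by_strcoll(lo, hi, (char)c);
        if (by_regex != by_strcoll)
            mismatches++;
    }
    regfree(&re);
    return mismatches;
}
```

To try it, call setlocale(LC_ALL, "") and then count_range_mismatches('A', 'b') under different LC_ALL settings. A result of 0 means the range and strcoll() agree on every ASCII character, a positive result counts the disagreements, and -1 corresponds to the case where the range itself is rejected, as with "Invalid range end" above. What a given glibc version and locale actually returns is exactly the open question of this report, so no particular result is claimed here.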
As a parting note, it was recently suggested on the grep list that maybe glibc should consider documenting the following behavior:

[A-Z] - the same range as would be selected in the POSIX locale, for all locales
[[.A.]-[.Z.]] - the range of collation elements that fall between A and Z for the given locale

That way, users would be able to select between which of two sane interpretations they would like for non-POSIX locale range expressions, while at the same time aiding the large number of scripts that mistakenly used range expressions outside the POSIX locale while assuming POSIX locale semantics.

-- 
           Summary: regex range semantics outside of POSIX should be documented and consistent
           Product: glibc
           Version: 2.12
            Status: NEW
          Severity: normal
          Priority: P2
         Component: regex
        AssignedTo: drepper dot fsp at gmail dot com
        ReportedBy: eblake at redhat dot com
                CC: glibc-bugs-regex at sources dot redhat dot com, glibc-bugs at sources dot redhat dot com

http://sourceware.org/bugzilla/show_bug.cgi?id=12045

------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.

Thread overview: 5+ messages
2010-09-21 15:25 eblake at redhat dot com [this message]
2010-09-21 15:26 ` [Bug regex/12045] " eblake at redhat dot com
2010-09-21 15:49 ` bonzini at gnu dot org
2010-09-21 22:18 ` eblake at redhat dot com
2010-09-24 12:35 ` [Bug manual/12045] regex range semantics outside of POSIX should be documented bonzini at gnu dot org