public inbox for glibc-bugs-regex@sourceware.org
From: "eblake at redhat dot com" <sourceware-bugzilla@sourceware.org>
To: glibc-bugs-regex@sources.redhat.com
Subject: [Bug regex/12045] regex range semantics outside of POSIX should be documented and consistent
Date: Tue, 21 Sep 2010 22:18:00 -0000	[thread overview]
Message-ID: <20100921221759.32339.qmail@sourceware.org> (raw)
In-Reply-To: <20100921152445.12045.eblake@redhat.com>


------- Additional Comments From eblake at redhat dot com  2010-09-21 22:17 -------
Actually, according to POSIX 2008, older POSIX required that range expressions
be treated in CEO (collating element order) for all locales, but this was
specifically relaxed in POSIX 2001.  If glibc is going to insist on CEO because
of a version of POSIX two editions ago, it would be nice to see that
documented.  Then again, other glibc interfaces no longer comply with the
stricter requirements in older POSIX that have since been relaxed (for example,
whether getopt() must include an error message with "illegal" in the string),
so I see no reason to tie regex to the older standard's CEO either.

Quoting XRAT A.9.3.5:
http://www.opengroup.org/onlinepubs/9699919799/xrat/V4_xbd_chap09.html

Historical implementations used native character order to interpret range
expressions. The ISO POSIX-2:1993 standard instead required collating element
order (CEO): the order that collating elements were specified between the
order_start and order_end keywords in the LC_COLLATE category of the current
locale. CEO had some advantages in portability over the native character order,
but it also had some disadvantages:

    * CEO could not feasibly be mimicked in user code, leading to
      inconsistencies between POSIX matchers and matchers in popular user
      programs like Emacs, ksh, and Perl.
    * CEO caused range expressions to match accented and capitalized letters
      contrary to many users' expectations. For example, "[a-e]" typically
      matched both 'E' and 'á' but neither 'A' nor 'é'.
    * CEO was not consistent across implementations. In practice, CEO was often
      less portable than native character order. For example, it was common for
      the CEOs of two implementation-supplied locales to disagree, even if both
      locales were named "da_DK".

Because of these problems, some implementations of regular expressions continued
to use native character order. Others used the collation sequence, which is more
consistent with sorting than either CEO or native order, but which departs
further from the traditional POSIX semantics because it generally requires
"[a-e]" to match either 'A' or 'E' but not both. As a result of this kind of
implementation variation, programmers who wanted to write portable regular
expressions could not rely on the ISO POSIX-2:1993 standard guarantees in practice.
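
To make two of the orderings named above concrete, here is a minimal C sketch
of a range membership test under native character order and under the
collation sequence.  The helper names in_range_native() and
in_range_collation() are hypothetical, the sketch handles single-byte
characters only, and it is not how glibc implements ranges.  CEO is left out
because, as the text above notes, it cannot feasibly be mimicked in user code.

```c
#include <locale.h>
#include <stdbool.h>
#include <string.h>

/* Native character order: compare the raw character codes.
   Hypothetical helper, single-byte characters only. */
static bool in_range_native(char lo, char hi, char c)
{
    unsigned char l = (unsigned char)lo;
    unsigned char h = (unsigned char)hi;
    unsigned char x = (unsigned char)c;
    return l <= x && x <= h;
}

/* Collation sequence: compare one-character strings with strcoll(),
   which follows the LC_COLLATE category of the current locale.
   Hypothetical helper; multi-character collating elements are ignored. */
static bool in_range_collation(char lo, char hi, char c)
{
    char l[2] = { lo, '\0' };
    char h[2] = { hi, '\0' };
    char x[2] = { c, '\0' };
    return strcoll(l, x) <= 0 && strcoll(x, h) <= 0;
}
```

After setlocale(LC_ALL, "C") the two helpers should agree, since strcoll()
then compares like strcmp().  In other locales in_range_collation('a', 'e', c)
may accept some uppercase or accented letters, and which ones depends on the
locale data installed on the system.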

While revising the standard, lengthy consideration was given to proposals to
attack this problem by adding an API for querying the CEO to allow user-mode
matchers, but none of these proposals had implementation experience and none
achieved consensus. Leaving the standard alone was also considered, but rejected
due to the problems described above.

The current standard leaves unspecified the behavior of a range expression
outside the POSIX locale. This makes it clearer that conforming applications
should avoid range expressions outside the POSIX locale, and it allows
implementations and compatible user-mode matchers to interpret range expressions
using native order, CEO, collation sequence, or other, more advanced techniques.
The concerns which led to this change were raised in IEEE PASC interpretation
1003.2 #43 and others, and related to ambiguities in the specification of how
multi-character collating elements should be handled in range expressions. These
ambiguities had led to multiple interpretations of the specification, in
conflicting ways, which led to varying implementations. As noted above, efforts
were made to resolve the differences, but no solution has been found that would
be specific enough to allow for portable software while not invalidating
existing implementations.
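
For an application that wants to follow the advice above, one possible
approach is sketched below in C: either pin the POSIX locale before compiling
the pattern, or spell the range out as an explicit list such as "[abcde]",
which does not depend on any ordering.  The helpers regex_matches() and
use_posix_locale() are hypothetical names for this sketch; only regcomp(),
regexec(), regfree() and setlocale() are standard interfaces.

```c
#include <locale.h>
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if `text` matches the POSIX extended
   regular expression `pattern` under the current locale. */
static bool regex_matches(const char *pattern, const char *text)
{
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return false;   /* this sketch treats a compile failure as "no match" */
    bool ok = regexec(&re, text, 0, NULL, 0) == 0;
    regfree(&re);
    return ok;
}

/* Hypothetical helper: switch the whole process to the POSIX ("C")
   locale, the only locale in which range behavior is specified. */
static bool use_posix_locale(void)
{
    return setlocale(LC_ALL, "C") != NULL;
}
```

With the POSIX locale in effect, regex_matches("^[a-e]$", "c") is expected to
succeed and regex_matches("^[a-e]$", "E") to fail.  The explicit form
"^[abcde]$" should give those same answers in any locale, at the cost of a
longer pattern.  Note that setlocale() affects the whole process, so a real
program may prefer the explicit list over changing the locale.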


-- 


http://sourceware.org/bugzilla/show_bug.cgi?id=12045


Thread overview: 5+ messages
2010-09-21 15:25 [Bug regex/12045] New: " eblake at redhat dot com
2010-09-21 15:26 ` [Bug regex/12045] " eblake at redhat dot com
2010-09-21 15:49 ` bonzini at gnu dot org
2010-09-21 22:18 ` eblake at redhat dot com [this message]
2010-09-24 12:35 ` [Bug manual/12045] regex range semantics outside of POSIX should be documented bonzini at gnu dot org
