[Bug regex/12045] New: regex range semantics outside of POSIX should be documented and consistent

public inbox for glibc-bugs-regex@sourceware.org

* [Bug regex/12045] New: regex range semantics outside of POSIX should be documented and consistent
@ 2010-09-21 15:25 eblake at redhat dot com
  2010-09-21 15:26 ` [Bug regex/12045] " eblake at redhat dot com
                   ` (3 more replies)
  0 siblings, 4 replies; 5+ messages in thread
From: eblake at redhat dot com @ 2010-09-21 15:25 UTC (permalink / raw)
  To: glibc-bugs-regex

This stems from https://bugzilla.redhat.com/show_bug.cgi?id=583011.

POSIX 2008 states:
(http://www.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html section
9.3.5 bullet 7)

"In the POSIX locale, a range expression represents the set of collating
elements that fall between two elements in the collation sequence, inclusive.
In other locales, a range expression has unspecified behavior: strictly
conforming applications shall not rely on whether the range expression is
valid, or on the set of collating elements matched."

The behavior of [A-z] in en_US.UTF-8 is "unspecified", but _not_ "undefined". 
A compliant app cannot guarantee what the behavior will be, but the behavior
should at least be explainable, and as a QoI point, glibc should document and
define this behavior as an extension to POSIX, so that apps relying on glibc
can take advantage of this extension for known behavior.  However, I was unable
to find any documentation of the current glibc rules for how a range expression
is interpreted, and what's more, the current implementation is inconsistent with
both the POSIX locale and with strcoll.

Since POSIX states that the behavior is unspecified, we are entirely at liberty
to choose a _sane_ set of rules, rather than a set of rules that is inconsistent
with everything else collation-based.  There is _nothing_ in POSIX that
requires [A-Z] to match all collating elements that collate between A and Z
outside the POSIX locale, so it would be _equally valid_ for [A-Z] to have the
same meaning in both POSIX and en_US.UTF-8.  In fact, it would be _more_ useful
to users, given the number of "bug" reports against bash, sed, grep, gawk, ...,
which all boil down to people using range expressions outside the POSIX locale
but expecting POSIX-locale semantics.

However, even if you insist that glibc will continue to represent range
expressions as a sequence of collation elements that fall between the beginning
and end collation element, across all locales, then for QoI you should also fix
things to use the same locale collation sequencing as strcoll.

This set of sample programs shows the inconsistency in the current regex
implementation, where strcoll and re_compile_pattern collate differently:

p1:
---
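#include <locale.h>
#include <stdio.h>
#include <string.h>

/* Print the strcoll() ordering of two strings in the current locale.  */
int main(int argc, char *argv[])
{
  if (argc != 3) {
    fprintf(stderr, "usage: %s string1 string2\n", argv[0]);
    return 2;
  }
  setlocale(LC_ALL, "");
  printf("%d\n", strcoll(argv[1], argv[2]));
  return 0;
}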

p2:
---
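#define _GNU_SOURCE 1
#include <locale.h>
#include <regex.h>
#include <stdio.h>
#include <string.h>

/* Compile a pattern with the GNU regex API and print any compile error.  */
int main(int argc, char *argv[])
{
  struct re_pattern_buffer buf = {0};
  const char *err;

  if (argc != 2) {
    fprintf(stderr, "usage: %s pattern\n", argv[0]);
    return 2;
  }
  setlocale(LC_ALL, "");
  /* Treat a range whose end sorts before its start as an error.  */
  re_set_syntax(RE_NO_EMPTY_RANGES);
  if ((err = re_compile_pattern(argv[1], strlen(argv[1]), &buf)))
    printf("%s\n", err);

  return 0;
}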

$ LC_ALL=en_US.UTF-8 ./p1 A b
-1
$ LC_ALL=en_US.UTF-8 ./p2 '[A-b]'
Invalid range end
$ LC_ALL=cs_CZ.UTF-8 ./p1 A b
-1
$ LC_ALL=cs_CZ.UTF-8 ./p2 '[A-b]'
$

That is, since both en_US.UTF-8 and cs_CZ.UTF-8 collate 'A' before 'b' in
strcoll(), they should both behave the same when handling the range expression
[A-b] in a regex.  That holds whether you go with my preference of treating the
range expression the same as in the POSIX locale, or stick with the less
intuitive but equally consistent definition of all elements that collate
between 'A' and 'b'.  Since we have proof that glibc does neither, I for one
would love to see either glibc documentation explaining why the current
behavior is deemed acceptable, or the glibc behavior changed.
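
To make the "all elements that collate between 'A' and 'b'" reading concrete, here is a minimal sketch of a range test built directly on strcoll().  The helper name in_collation_range is invented for illustration and is not part of glibc; it only handles single-character strings and ignores multi-character collating elements.

```c
#include <string.h>

/* Hypothetical helper, not a glibc interface: returns 1 if the string C
   collates between LO and HI (inclusive) according to strcoll() in the
   current locale, and 0 otherwise.  Intended for single-character strings;
   multi-character collating elements are not handled.  */
static int
in_collation_range (const char *lo, const char *c, const char *hi)
{
  return strcoll (lo, c) <= 0 && strcoll (c, hi) <= 0;
}
```

Without a setlocale() call the "C" locale is in effect and strcoll() behaves like strcmp(), so in_collation_range ("A", "Z", "b") is true and in_collation_range ("A", "c", "b") is false.  After setlocale (LC_ALL, ""), the answer depends entirely on the locale's collation data, which is the consistency with p1 that this report asks for.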

As a parting note, it was recently suggested on the grep list that maybe glibc
should consider documenting the following behavior:
  [A-Z] - the same range as would be selected in the POSIX locale, for all
    locales
  [[.A.]-[.Z.]] - the range of collating elements that fall between A and Z
    for the given locale

That way, users would be able to choose which of two sane interpretations they
want for range expressions in non-POSIX locales, while at the same time helping
the large number of scripts that mistakenly use range expressions outside the
POSIX locale while assuming POSIX-locale semantics.
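
Both spellings are already valid POSIX bracket expressions, so the suggestion can be explored with plain regcomp().  The sketch below is only an illustration (the helper names try_match and forms_agree are made up); it checks whether the two forms give the same answer for a given string in whatever locale is current.  In the POSIX locale they are expected to agree; the proposal is about letting them differ elsewhere.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: 1 if PATTERN matches TEXT, 0 if it does not,
   -1 if PATTERN fails to compile in the current locale.  */
static int
try_match (const char *pattern, const char *text)
{
  regex_t re;
  int rc;

  if (regcomp (&re, pattern, REG_NOSUB) != 0)
    return -1;
  rc = regexec (&re, text, 0, NULL, 0);
  regfree (&re);
  return rc == 0;
}

/* Hypothetical helper: 1 if the plain range and the collating-symbol
   range both compile and give the same result for TEXT, 0 otherwise.  */
static int
forms_agree (const char *text)
{
  int plain = try_match ("[A-Z]", text);
  int symbolic = try_match ("[[.A.]-[.Z.]]", text);

  return plain >= 0 && plain == symbolic;
}
```

Calling forms_agree () on each letter after setlocale (LC_ALL, "") would show whether a given locale currently distinguishes the two forms at all.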

-- 
           Summary: regex range semantics outside of POSIX should be
                    documented and consistent
           Product: glibc
           Version: 2.12
            Status: NEW
          Severity: normal
          Priority: P2
         Component: regex
        AssignedTo: drepper dot fsp at gmail dot com
        ReportedBy: eblake at redhat dot com
                CC: glibc-bugs-regex at sources dot redhat dot com,glibc-
                    bugs at sources dot redhat dot com

http://sourceware.org/bugzilla/show_bug.cgi?id=12045

------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.


* [Bug regex/12045] regex range semantics outside of POSIX should be documented and consistent
  2010-09-21 15:25 [Bug regex/12045] New: regex range semantics outside of POSIX should be documented and consistent eblake at redhat dot com
@ 2010-09-21 15:26 ` eblake at redhat dot com
  2010-09-21 15:49 ` bonzini at gnu dot org
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 5+ messages in thread
From: eblake at redhat dot com @ 2010-09-21 15:26 UTC (permalink / raw)
  To: glibc-bugs-regex


------- Additional Comments From eblake at redhat dot com  2010-09-21 15:26 -------
Possibly related to the resolution of
http://sources.redhat.com/bugzilla/show_bug.cgi?id=10290

-- 


http://sourceware.org/bugzilla/show_bug.cgi?id=12045


* [Bug regex/12045] regex range semantics outside of POSIX should be documented and consistent
  2010-09-21 15:25 [Bug regex/12045] New: regex range semantics outside of POSIX should be documented and consistent eblake at redhat dot com
  2010-09-21 15:26 ` [Bug regex/12045] " eblake at redhat dot com
@ 2010-09-21 15:49 ` bonzini at gnu dot org
  2010-09-21 22:18 ` eblake at redhat dot com
  2010-09-24 12:35 ` [Bug manual/12045] regex range semantics outside of POSIX should be documented bonzini at gnu dot org
  3 siblings, 0 replies; 5+ messages in thread
From: bonzini at gnu dot org @ 2010-09-21 15:49 UTC (permalink / raw)
  To: glibc-bugs-regex


------- Additional Comments From bonzini at gnu dot org  2010-09-21 15:48 -------
Another similarly confusing example:

$ echo 'ach' | LANG=cs_CZ.UTF-8 sed -n '/a[d-p]/p'
ach
$ echo 'ach' | LANG=cs_CZ.UTF-8 sed -n '/a[^d-p]/p'
ach
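
The same check can be sketched in C without sed.  The function name range_and_complement is invented for this example; it compiles the range and its complement under a caller-supplied locale and reports which of the two match.  Whether the cs_CZ.UTF-8 locale is installed on a given system is an assumption the caller has to verify, which is why a failing setlocale() is reported as -1.

```c
#include <locale.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: switches to LOCALE_NAME, then returns a bit mask
   with bit 0 set if "a[d-p]" matches TEXT and bit 1 set if "a[^d-p]"
   matches TEXT.  Returns -1 if the locale is unavailable or a pattern
   fails to compile.  */
static int
range_and_complement (const char *locale_name, const char *text)
{
  regex_t range, complement;
  int mask = 0;

  if (setlocale (LC_ALL, locale_name) == NULL)
    return -1;
  if (regcomp (&range, "a[d-p]", REG_NOSUB) != 0)
    return -1;
  if (regcomp (&complement, "a[^d-p]", REG_NOSUB) != 0)
    {
      regfree (&range);
      return -1;
    }
  if (regexec (&range, text, 0, NULL, 0) == 0)
    mask |= 1;
  if (regexec (&complement, text, 0, NULL, 0) == 0)
    mask |= 2;
  regfree (&range);
  regfree (&complement);
  return mask;
}
```

In the "C" locale only the complement matches "ach", so the result is 2.  A result of 3 under another locale would correspond to the confusing case shown above, where both the range and its complement match the same input.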



-- 


http://sourceware.org/bugzilla/show_bug.cgi?id=12045


* [Bug regex/12045] regex range semantics outside of POSIX should be documented and consistent
  2010-09-21 15:25 [Bug regex/12045] New: regex range semantics outside of POSIX should be documented and consistent eblake at redhat dot com
  2010-09-21 15:26 ` [Bug regex/12045] " eblake at redhat dot com
  2010-09-21 15:49 ` bonzini at gnu dot org
@ 2010-09-21 22:18 ` eblake at redhat dot com
  2010-09-24 12:35 ` [Bug manual/12045] regex range semantics outside of POSIX should be documented bonzini at gnu dot org
  3 siblings, 0 replies; 5+ messages in thread
From: eblake at redhat dot com @ 2010-09-21 22:18 UTC (permalink / raw)
  To: glibc-bugs-regex


------- Additional Comments From eblake at redhat dot com  2010-09-21 22:17 -------
Actually, according to POSIX 2008, there was a requirement in older POSIX that
range expressions be treated as CEO (collating element order) for all locales,
but this was specifically relaxed in POSIX 2001.  If glibc is going to insist on
CEO ordering because of a version of POSIX two editions ago, it would be nice to
see that documented.  Then again, other glibc interfaces no longer comply with
the stricter requirements in older POSIX that have since been relaxed (for
example, whether getopt() must include an error message with "illegal" in the
string), so I see no reason to tie regex to the older standard's CEO ordering
either.

XRAT A.9.3.5, quoted from
http://www.opengroup.org/onlinepubs/9699919799/xrat/V4_xbd_chap09.html:

Historical implementations used native character order to interpret range
expressions. The ISO POSIX-2:1993 standard instead required collating element
order (CEO): the order that collating elements were specified between the
order_start and order_end keywords in the LC_COLLATE category of the current
locale. CEO had some advantages in portability over the native character order,
but it also had some disadvantages:

    * CEO could not feasibly be mimicked in user code, leading to
      inconsistencies between POSIX matchers and matchers in popular user
      programs like Emacs, ksh, and Perl.
    * CEO caused range expressions to match accented and capitalized letters
      contrary to many users' expectations. For example, "[a-e]" typically
      matched both 'E' and 'á' but neither 'A' nor 'é'.
    * CEO was not consistent across implementations. In practice, CEO was
      often less portable than native character order. For example, it was
      common for the CEOs of two implementation-supplied locales to disagree,
      even if both locales were named "da_DK".

Because of these problems, some implementations of regular expressions continued
to use native character order. Others used the collation sequence, which is more
consistent with sorting than either CEO or native order, but which departs
further from the traditional POSIX semantics because it generally requires
"[a-e]" to match either 'A' or 'E' but not both. As a result of this kind of
implementation variation, programmers who wanted to write portable regular
expressions could not rely on the ISO POSIX-2:1993 standard guarantees in practice.

While revising the standard, lengthy consideration was given to proposals to
attack this problem by adding an API for querying the CEO to allow user-mode
matchers, but none of these proposals had implementation experience and none
achieved consensus. Leaving the standard alone was also considered, but rejected
due to the problems described above.

The current standard leaves unspecified the behavior of a range expression
outside the POSIX locale. This makes it clearer that conforming applications
should avoid range expressions outside the POSIX locale, and it allows
implementations and compatible user-mode matchers to interpret range expressions
using native order, CEO, collation sequence, or other, more advanced techniques.
The concerns which led to this change were raised in IEEE PASC interpretation
1003.2 #43 and others, and related to ambiguities in the specification of how
multi-character collating elements should be handled in range expressions. These
ambiguities had led to multiple interpretations of the specification, in
conflicting ways, which led to varying implementations. As noted above, efforts
were made to resolve the differences, but no solution has been found that would
be specific enough to allow for portable software while not invalidating
existing implementations.
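
As a rough illustration of the "native character order" interpretation mentioned in the rationale, a range test reduces to comparing character codes.  The name in_native_range is made up for this sketch, and it assumes that wchar_t values are Unicode code points, as they are on glibc.

```c
#include <wchar.h>

/* Hypothetical helper: native character order, i.e. a plain comparison of
   character codes with no reference to the locale's collation data.
   Assumes wchar_t holds Unicode code points (true on glibc).  */
static int
in_native_range (wchar_t lo, wchar_t c, wchar_t hi)
{
  return lo <= c && c <= hi;
}
```

Under this interpretation "[a-e]" covers exactly a, b, c, d and e: in_native_range (L'a', L'E', L'e') and in_native_range (L'a', 0x00E1, L'e') (U+00E1 is 'á') are both false, in contrast to the CEO behavior described in the second bullet above.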

-- 

http://sourceware.org/bugzilla/show_bug.cgi?id=12045


* [Bug manual/12045] regex range semantics outside of POSIX should be documented
  2010-09-21 15:25 [Bug regex/12045] New: regex range semantics outside of POSIX should be documented and consistent eblake at redhat dot com
                   ` (2 preceding siblings ...)
  2010-09-21 22:18 ` eblake at redhat dot com
@ 2010-09-24 12:35 ` bonzini at gnu dot org
  3 siblings, 0 replies; 5+ messages in thread
From: bonzini at gnu dot org @ 2010-09-24 12:35 UTC (permalink / raw)
  To: glibc-bugs-regex



------- Additional Comments From bonzini at gnu dot org  2010-09-24 12:35 -------
It turns out that regex range semantics for glibc are "CEO".  They _are_
consistent; it is the locale definition files that are not.

I created a file with the 52 uppercase and lowercase letters and did a "sed -n
/[A-Z]/p" on this file.  The results I get are either

this      26   A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
or this   51   AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZ

Here are the "51" locales:

ar_SA cs_CZ hr_HR hsb_DE is_IS km_KH lo_LA lt_LT lv_LV or_IN pl_PL sk_SK
sl_SI th_TH tr_CY tr_TR

These return 51 for both $l and $l.utf8.  Every other locale returns 26 for both
unibyte and multibyte variants.
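
The counting experiment can also be approximated in C instead of with sed and a data file.  This is a sketch only: count_letter_matches is a name invented here, it uses POSIX regcomp() rather than sed's own code path, and it tests the 52 ASCII letters one at a time in whatever locale is current.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: counts how many of the 52 ASCII letters are matched
   by PATTERN in the current locale.  Returns -1 if PATTERN fails to
   compile.  */
static int
count_letter_matches (const char *pattern)
{
  static const char letters[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
  regex_t re;
  char text[2] = { 0, 0 };
  int count = 0;

  if (regcomp (&re, pattern, REG_NOSUB) != 0)
    return -1;
  for (size_t i = 0; letters[i] != '\0'; i++)
    {
      text[0] = letters[i];     /* one-letter subject string */
      if (regexec (&re, text, 0, NULL, 0) == 0)
        count++;
    }
  regfree (&re);
  return count;
}
```

In the "C" locale count_letter_matches ("[A-Z]") gives 26.  Calling it after setlocale (LC_ALL, "") under each installed locale would be one way to repeat the 26-versus-51 survey described above.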

Locales using glibc's localedata/locales/iso14651_t1_common template return 26.
This template defines the collation like this:

  <U0061> <a>;<BAS>;<MIN>;IGNORE # 198 a    start lowercase
  <U00AA> <a>;<PCL>;<EMI>;IGNORE # 199 ª
  <U00E1> <a>;<ACA>;<MIN>;IGNORE # 200 á
  ...
  <U007A> <z>;<BAS>;<MIN>;IGNORE # 507 z
  ...
  <U00FE> <th>;<BAS>;<MIN>;IGNORE # 516 þ    end lowercase
  <U0041> <a>;<BAS>;<CAP>;IGNORE # 517 A    start uppercase
  <U00C1> <a>;<ACA>;<CAP>;IGNORE # 518 Á
  ...
  <U005A> <z>;<BAS>;<CAP>;IGNORE # 813 Z
  ...
  <U00DE> <th>;<BAS>;<CAP>;IGNORE # 824 Þ    end uppercase

(There's no end to surprises: [a-z] comes _before_ [A-Z], which is why [A-z]
fails but [a-Z] works).

Instead, the "special" locales above use a different sequence, for example in cs_CZ:

  <U0041> <U0041>;<NONE>;<CAPITAL>;<U0041>    # A
  <U0061> <U0041>;<NONE>;<SMALL>;<U0041>    # a
  <U00AA> <U0041>;<NONE>;<U00AA>;<U0041>    # ª
  <U00C1> <U0041>;<ACUTE>;<CAPITAL>;<U0041>    # Á
  <U00E1> <U0041>;<ACUTE>;<SMALL>;<U0041>    # á
  ...
  <U005A> <U005A>;<NONE>;<CAPITAL>;<U005A>    # Z
  <U007A> <U005A>;<NONE>;<SMALL>;<U005A>    # z

So, it looks like __collseq_table_lookup is what the POSIX rationale document
calls "CEO".  I'll open a bug on the inconsistencies caused by using CEO.  In
the meantime, this bug remains open for the documentation part.
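
A toy model of CEO makes the observations above easier to follow.  The two order strings below only mimic the shape of the quoted tables over the ASCII letters (all lowercase before all uppercase, versus case pairs interleaved); they are not glibc's real collation data, and ceo_index and ceo_range_size are names invented for this sketch rather than glibc functions.

```c
#include <string.h>

/* Toy collating-element orders over the ASCII letters only.  They imitate
   the layout of the two tables quoted above and are NOT real glibc data.  */
static const char ORDER_LOWER_FIRST[] =
  "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
static const char ORDER_INTERLEAVED[] =
  "AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz";

/* Position of C within ORDER, or -1 if C is not listed.  */
static int
ceo_index (const char *order, char c)
{
  const char *p = (c != '\0') ? strchr (order, c) : NULL;

  return p ? (int) (p - order) : -1;
}

/* Number of listed letters that fall inside [LO-HI] under ORDER, or -1
   for an invalid range (an endpoint is unlisted, or HI sorts before LO).  */
static int
ceo_range_size (const char *order, char lo, char hi)
{
  int start = ceo_index (order, lo);
  int end = ceo_index (order, hi);

  if (start < 0 || end < 0 || end < start)
    return -1;
  return end - start + 1;
}
```

With ORDER_LOWER_FIRST, [A-Z] covers 26 letters, while [A-z] and [A-b] are invalid ranges and [a-Z] is accepted.  With ORDER_INTERLEAVED, [A-Z] covers 51 letters (everything except 'z') and [A-b] is a valid range, which lines up with the en_US and cs_CZ results reported in this thread.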


-- 
           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|regex                       |manual
            Summary|regex range semantics       |regex range semantics
                   |outside of POSIX should be  |outside of POSIX should be
                   |documented and consistent   |documented


http://sourceware.org/bugzilla/show_bug.cgi?id=12045

