public inbox for glibc-bugs-regex@sourceware.org
help / color / mirror / Atom feed
* [Bug regex/52] Repeated and nested subexpressions (reproducible in most other engines)
       [not found] <bug-52-132@http.sourceware.org/bugzilla/>
@ 2012-03-08  4:59 ` carlos at systemhalted dot org
  2013-07-03 21:25 ` eblake at redhat dot com
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 12+ messages in thread
From: carlos at systemhalted dot org @ 2012-03-08  4:59 UTC (permalink / raw)
  To: glibc-bugs-regex

http://sourceware.org/bugzilla/show_bug.cgi?id=52

Carlos O'Donell <carlos at systemhalted dot org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |carlos at systemhalted dot
                   |                            |org
              Flags|examined+                   |

-- 
Configure bugmail: http://sourceware.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Bug regex/52] Repeated and nested subexpressions (reproducible in most other engines)
       [not found] <bug-52-132@http.sourceware.org/bugzilla/>
  2012-03-08  4:59 ` [Bug regex/52] Repeated and nested subexpressions (reproducible in most other engines) carlos at systemhalted dot org
@ 2013-07-03 21:25 ` eblake at redhat dot com
  2013-07-04  0:28 ` bugdal at aerifal dot cx
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 12+ messages in thread
From: eblake at redhat dot com @ 2013-07-03 21:25 UTC (permalink / raw)
  To: glibc-bugs-regex

http://sourceware.org/bugzilla/show_bug.cgi?id=52

Eric Blake <eblake at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |eblake at redhat dot com

--- Comment #8 from Eric Blake <eblake at redhat dot com> ---
https://bugzilla.redhat.com/show_bug.cgi?id=876766 is another report of this
bug.

-- 
You are receiving this mail because:
You are on the CC list for the bug.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Bug regex/52] Repeated and nested subexpressions (reproducible in most other engines)
       [not found] <bug-52-132@http.sourceware.org/bugzilla/>
  2012-03-08  4:59 ` [Bug regex/52] Repeated and nested subexpressions (reproducible in most other engines) carlos at systemhalted dot org
  2013-07-03 21:25 ` eblake at redhat dot com
@ 2013-07-04  0:28 ` bugdal at aerifal dot cx
  2013-07-04  3:37 ` carlos at redhat dot com
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 12+ messages in thread
From: bugdal at aerifal dot cx @ 2013-07-04  0:28 UTC (permalink / raw)
  To: glibc-bugs-regex

http://sourceware.org/bugzilla/show_bug.cgi?id=52

Rich Felker <bugdal at aerifal dot cx> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |bugdal at aerifal dot cx

--- Comment #9 from Rich Felker <bugdal at aerifal dot cx> ---
I think the newly linked Red Hat bug report is more convincing that this needs
to be fixed. What is the justification for the "SUSPENDED" status?

-- 
You are receiving this mail because:
You are on the CC list for the bug.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Bug regex/52] Repeated and nested subexpressions (reproducible in most other engines)
       [not found] <bug-52-132@http.sourceware.org/bugzilla/>
                   ` (2 preceding siblings ...)
  2013-07-04  0:28 ` bugdal at aerifal dot cx
@ 2013-07-04  3:37 ` carlos at redhat dot com
  2013-07-04  3:38 ` carlos at redhat dot com
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 12+ messages in thread
From: carlos at redhat dot com @ 2013-07-04  3:37 UTC (permalink / raw)
  To: glibc-bugs-regex

http://sourceware.org/bugzilla/show_bug.cgi?id=52

Carlos O'Donell <carlos at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|SUSPENDED                   |NEW

-- 
You are receiving this mail because:
You are on the CC list for the bug.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Bug regex/52] Repeated and nested subexpressions (reproducible in most other engines)
       [not found] <bug-52-132@http.sourceware.org/bugzilla/>
                   ` (3 preceding siblings ...)
  2013-07-04  3:37 ` carlos at redhat dot com
@ 2013-07-04  3:38 ` carlos at redhat dot com
  2013-07-04  7:22 ` bonzini at gnu dot org
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 12+ messages in thread
From: carlos at redhat dot com @ 2013-07-04  3:38 UTC (permalink / raw)
  To: glibc-bugs-regex

http://sourceware.org/bugzilla/show_bug.cgi?id=52

--- Comment #10 from Carlos O'Donell <carlos at redhat dot com> ---
(In reply to Rich Felker from comment #9)
> I think the newly linked Red Hat bug report is more convincing that this
> needs to be fixed. What is the justification for the "SUSPENDED" status?

There is no justification, moved to NEW.

-- 
You are receiving this mail because:
You are on the CC list for the bug.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Bug regex/52] Repeated and nested subexpressions (reproducible in most other engines)
       [not found] <bug-52-132@http.sourceware.org/bugzilla/>
                   ` (4 preceding siblings ...)
  2013-07-04  3:38 ` carlos at redhat dot com
@ 2013-07-04  7:22 ` bonzini at gnu dot org
  2013-07-04  7:35 ` bugdal at aerifal dot cx
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 12+ messages in thread
From: bonzini at gnu dot org @ 2013-07-04  7:22 UTC (permalink / raw)
  To: glibc-bugs-regex

http://sourceware.org/bugzilla/show_bug.cgi?id=52

--- Comment #11 from Paolo Bonzini <bonzini at gnu dot org> ---
The justification for the "suspended" state is that this would be very
complicated to fix and wouldn't really add anything to the quality of the
implementation.

Even if the "(a(b)*)*" case would not be hard to fix, I'm not sure we can say
the same of the backreference testcase in the RH bug ('(a(b)*)*\2' matched
against 'abab') or the more complicated '(a(b)*)*x\1\2' matched against
'abaxa'.

The RH bugzilla was opened by Eric doesn't really add anything to the urgency
of this bug; Fedora bugs that also exist upstream can be closed liberally, and
that's what I did.

The grep bug on Savannah (https://savannah.gnu.org/bugs/?37737) might add
something, but it is not clear if the user actually encountered it in a
real-world usecase.  The same "bug" is present in hardly every regular
expression matcher, and I would suggest that the Austin group gives more leeway
to implementations.  For example, the following rules could work:

- it is undefined _which_ occurrence of the sub-RE is captured by a
parenthesized group (and matched in backreferences) if the subgroup, or any of
its parents, is quantified with + * {};

- the backreference should only match the empty string only if the
corresponding sub-RE can be empty, or if the corresponding parenthesized group,
or any of its parents, is quantified with * or {0,...}.

-- 
You are receiving this mail because:
You are on the CC list for the bug.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Bug regex/52] Repeated and nested subexpressions (reproducible in most other engines)
       [not found] <bug-52-132@http.sourceware.org/bugzilla/>
                   ` (5 preceding siblings ...)
  2013-07-04  7:22 ` bonzini at gnu dot org
@ 2013-07-04  7:35 ` bugdal at aerifal dot cx
  2013-07-04  7:42 ` bonzini at gnu dot org
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 12+ messages in thread
From: bugdal at aerifal dot cx @ 2013-07-04  7:35 UTC (permalink / raw)
  To: glibc-bugs-regex

http://sourceware.org/bugzilla/show_bug.cgi?id=52

--- Comment #12 from Rich Felker <bugdal at aerifal dot cx> ---
> The same "bug" is present in hardly every regular expression matcher,

Citation needed. GNU regex is the only implementation I've seen with this bug
that purports to be an implementation of POSIX RE.

The implementation we have in musl, which is based on TRE, passes the tests in
this bug report just fine.

As for the RH bug report, the reason I saw it as adding urgency is that it
shows how this bug can affect whether or not the regex matches at all, as
opposed to "just" a pedantic matter of extraneous data in the resulting
subexpression match array.

-- 
You are receiving this mail because:
You are on the CC list for the bug.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Bug regex/52] Repeated and nested subexpressions (reproducible in most other engines)
       [not found] <bug-52-132@http.sourceware.org/bugzilla/>
                   ` (6 preceding siblings ...)
  2013-07-04  7:35 ` bugdal at aerifal dot cx
@ 2013-07-04  7:42 ` bonzini at gnu dot org
  2013-07-04  7:51 ` bugdal at aerifal dot cx
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 12+ messages in thread
From: bonzini at gnu dot org @ 2013-07-04  7:42 UTC (permalink / raw)
  To: glibc-bugs-regex

http://sourceware.org/bugzilla/show_bug.cgi?id=52

--- Comment #13 from Paolo Bonzini <bonzini at gnu dot org> ---
At the time I reported the bug, I'm pretty sure I checked a few proprietary
Unices.  Also, even though it is not POSIX RE, Perl regular expressions have
the same behavior.  While it is true that backreferences can make this bug
affect the overall outcome of the match, it is still a very weird regular
expression.

Thanks for mentioning musl's matcher, I'll check it out.  Matching
backreferences at a decent speed _and_ obeying the POSIX leftmost/longest rules
is very hard to do.  The glibc matcher unfortunately is very badly documented
and the backreference part of it is basically black magic. :(

-- 
You are receiving this mail because:
You are on the CC list for the bug.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Bug regex/52] Repeated and nested subexpressions (reproducible in most other engines)
       [not found] <bug-52-132@http.sourceware.org/bugzilla/>
                   ` (7 preceding siblings ...)
  2013-07-04  7:42 ` bonzini at gnu dot org
@ 2013-07-04  7:51 ` bugdal at aerifal dot cx
  2013-07-04  8:04 ` bonzini at gnu dot org
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 12+ messages in thread
From: bugdal at aerifal dot cx @ 2013-07-04  7:51 UTC (permalink / raw)
  To: glibc-bugs-regex

http://sourceware.org/bugzilla/show_bug.cgi?id=52

--- Comment #14 from Rich Felker <bugdal at aerifal dot cx> ---
On Thu, Jul 04, 2013 at 07:42:51AM +0000, bonzini at gnu dot org wrote:
> Unices.  Also, even though it is not POSIX RE, Perl regular expressions have
> the same behavior.

Perl RE's are so different from POSIX that I would not consider them
in this discussion.

> Thanks for mentioning musl's matcher, I'll check it out.

The upstream source we got it from, TRE, is where the magic came from.
The author has some good documentation on how it works.

> Matching backreferences at a decent speed _and_ obeying the POSIX
> leftmost/longest rules is very hard to do.

Currently the speed is somewhat less than glibc, but not horrible by
any means. A large part of the cost is probably mbtowc. I'm planning
to eventually adapt it to compile the UTF-8 RE directly to a
byte-based state machine with actual character identity decoding only
when matching against character classes, but it's low priority right
now.

-- 
You are receiving this mail because:
You are on the CC list for the bug.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Bug regex/52] Repeated and nested subexpressions (reproducible in most other engines)
       [not found] <bug-52-132@http.sourceware.org/bugzilla/>
                   ` (8 preceding siblings ...)
  2013-07-04  7:51 ` bugdal at aerifal dot cx
@ 2013-07-04  8:04 ` bonzini at gnu dot org
  2013-07-04 15:05 ` eggert at gnu dot org
  2014-01-31  9:15 ` afaq.ahmed at agilosoft dot com
  11 siblings, 0 replies; 12+ messages in thread
From: bonzini at gnu dot org @ 2013-07-04  8:04 UTC (permalink / raw)
  To: glibc-bugs-regex

http://sourceware.org/bugzilla/show_bug.cgi?id=52

--- Comment #15 from Paolo Bonzini <bonzini at gnu dot org> ---
I'm curious about the speed on testcases with backreferences.  The
non-backreference part of the bug would be relatively easy to fix in glibc too.

The musl source says backreferences "can be spectacularly expensive", and
looking at the code I expect that to be true.

Unfortunately, GNU sed uses the GNU regex API, not the POSIX one, but it has
some "interesting" examples in its testsuite.  Perhaps you can try those tests
on another sed, compiled against both glibc and musl.

-- 
You are receiving this mail because:
You are on the CC list for the bug.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Bug regex/52] Repeated and nested subexpressions (reproducible in most other engines)
       [not found] <bug-52-132@http.sourceware.org/bugzilla/>
                   ` (9 preceding siblings ...)
  2013-07-04  8:04 ` bonzini at gnu dot org
@ 2013-07-04 15:05 ` eggert at gnu dot org
  2014-01-31  9:15 ` afaq.ahmed at agilosoft dot com
  11 siblings, 0 replies; 12+ messages in thread
From: eggert at gnu dot org @ 2013-07-04 15:05 UTC (permalink / raw)
  To: glibc-bugs-regex

http://sourceware.org/bugzilla/show_bug.cgi?id=52

Paul Eggert <eggert at gnu dot org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |eggert at gnu dot org

--- Comment #16 from Paul Eggert <eggert at gnu dot org> ---
The same issue is present in Solaris 9 and Solaris 11.1 POSIX grep
(/usr/xpg4/bin/grep), the only non-GNU greps I have easy access to.

$ echo abab | /usr/xpg4/bin/grep '\(a\(b\)*\)*\2'
abab

Perhaps someone should file a bug report with POSIX folks?  It's hard to
believe that the current requirement is intended, if so few implementations
conform.

-- 
You are receiving this mail because:
You are on the CC list for the bug.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Bug regex/52] Repeated and nested subexpressions (reproducible in most other engines)
       [not found] <bug-52-132@http.sourceware.org/bugzilla/>
                   ` (10 preceding siblings ...)
  2013-07-04 15:05 ` eggert at gnu dot org
@ 2014-01-31  9:15 ` afaq.ahmed at agilosoft dot com
  11 siblings, 0 replies; 12+ messages in thread
From: afaq.ahmed at agilosoft dot com @ 2014-01-31  9:15 UTC (permalink / raw)
  To: glibc-bugs-regex

https://sourceware.org/bugzilla/show_bug.cgi?id=52

Afaq Agilosoft <afaq.ahmed at agilosoft dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |afaq.ahmed at agilosoft dot com

-- 
You are receiving this mail because:
You are on the CC list for the bug.


^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2014-01-31  9:15 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <bug-52-132@http.sourceware.org/bugzilla/>
2012-03-08  4:59 ` [Bug regex/52] Repeated and nested subexpressions (reproducible in most other engines) carlos at systemhalted dot org
2013-07-03 21:25 ` eblake at redhat dot com
2013-07-04  0:28 ` bugdal at aerifal dot cx
2013-07-04  3:37 ` carlos at redhat dot com
2013-07-04  3:38 ` carlos at redhat dot com
2013-07-04  7:22 ` bonzini at gnu dot org
2013-07-04  7:35 ` bugdal at aerifal dot cx
2013-07-04  7:42 ` bonzini at gnu dot org
2013-07-04  7:51 ` bugdal at aerifal dot cx
2013-07-04  8:04 ` bonzini at gnu dot org
2013-07-04 15:05 ` eggert at gnu dot org
2014-01-31  9:15 ` afaq.ahmed at agilosoft dot com

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).