public inbox for glibc-bugs-regex@sourceware.org
help / color / mirror / Atom feed
* [Bug regex/429] New: regex hangs on backreferences
@ 2004-10-07 11:00 jakub at redhat dot com
  2004-10-07 14:17 ` [Bug regex/429] " gotom at debian dot or dot jp
                   ` (7 more replies)
  0 siblings, 8 replies; 10+ messages in thread
From: jakub at redhat dot com @ 2004-10-07 11:00 UTC (permalink / raw)
  To: glibc-bugs-regex

Just so that it isn't forgotten.
bug-regex11.c still has a few tests commented out that hang glibc regexec. E.g.

#include <regex.h>
#include <stdio.h>

int main ()
{
  regex_t rbuf;
  const char *p;
  int err;
  p = "^(.?)(.?)(.?)(.?)(.?)(.?)(.?)(.?)(.?).?"
      "\\9\\8\\7\\6\\5\\4\\3\\2\\1$";
  if ((err = regcomp  (&rbuf, p, REG_NOSUB | REG_EXTENDED))) {
    char errstr[300];
    regerror (err, &rbuf, errstr, sizeof (errstr));
    puts (errstr);
    return err;
  }
  return regexec (&rbuf, "civic", 0, NULL, 0);
}

takes really too long.

-- 
           Summary: regex hangs on backreferences
           Product: glibc
           Version: unspecified
            Status: NEW
          Severity: normal
          Priority: P2
         Component: regex
        AssignedTo: gotom at debian dot or dot jp
        ReportedBy: jakub at redhat dot com
                CC: glibc-bugs-regex at sources dot redhat dot com,glibc-
                    bugs at sources dot redhat dot com


http://sources.redhat.com/bugzilla/show_bug.cgi?id=429

------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug regex/429] regex hangs on backreferences
  2004-10-07 11:00 [Bug regex/429] New: regex hangs on backreferences jakub at redhat dot com
@ 2004-10-07 14:17 ` gotom at debian dot or dot jp
  2004-11-04  8:43 ` bonzini at gnu dot org
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 10+ messages in thread
From: gotom at debian dot or dot jp @ 2004-10-07 14:17 UTC (permalink / raw)
  To: glibc-bugs-regex


------- Additional Comments From gotom at debian dot or dot jp  2004-10-07 14:17 -------
memorandum for Jakub.

-- 
           What    |Removed                     |Added
----------------------------------------------------------------------------
         AssignedTo|gotom at debian dot or dot  |jakub at redhat dot com
                   |jp                          |
             Status|NEW                         |ASSIGNED


http://sources.redhat.com/bugzilla/show_bug.cgi?id=429

------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug regex/429] regex hangs on backreferences
  2004-10-07 11:00 [Bug regex/429] New: regex hangs on backreferences jakub at redhat dot com
  2004-10-07 14:17 ` [Bug regex/429] " gotom at debian dot or dot jp
@ 2004-11-04  8:43 ` bonzini at gnu dot org
  2004-11-04  9:04 ` bonzini at gnu dot org
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 10+ messages in thread
From: bonzini at gnu dot org @ 2004-11-04  8:43 UTC (permalink / raw)
  To: glibc-bugs-regex


------- Additional Comments From bonzini at gnu dot org  2004-11-04 08:43 -------
It does not hang, it has an incredibly high complexity, because a lot of
OP_BACKREF nodes have to be walked on the epsilon closure in
check_dst_limits_calc_pos, before reaching an OP_CLOSE_SUBEXP or OP_OPEN_SUBEXP.

Some simple-minded optimization can be done in check_dst_limits_calc_pos, see
patch at http://sources.redhat.com/ml/libc-alpha/2004-11/msg00018.html (applying
on top of other patches from me from late October and early November).


-- 


http://sources.redhat.com/bugzilla/show_bug.cgi?id=429

------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug regex/429] regex hangs on backreferences
  2004-10-07 11:00 [Bug regex/429] New: regex hangs on backreferences jakub at redhat dot com
  2004-10-07 14:17 ` [Bug regex/429] " gotom at debian dot or dot jp
  2004-11-04  8:43 ` bonzini at gnu dot org
@ 2004-11-04  9:04 ` bonzini at gnu dot org
  2004-11-09  7:55 ` bonzini at gnu dot org
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 10+ messages in thread
From: bonzini at gnu dot org @ 2004-11-04  9:04 UTC (permalink / raw)
  To: glibc-bugs-regex



-- 
           What    |Removed                     |Added
----------------------------------------------------------------------------
  BugsThisDependsOn|                            |508


http://sources.redhat.com/bugzilla/show_bug.cgi?id=429

------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug regex/429] regex hangs on backreferences
  2004-10-07 11:00 [Bug regex/429] New: regex hangs on backreferences jakub at redhat dot com
                   ` (2 preceding siblings ...)
  2004-11-04  9:04 ` bonzini at gnu dot org
@ 2004-11-09  7:55 ` bonzini at gnu dot org
  2004-11-09 15:24 ` jakub at redhat dot com
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 10+ messages in thread
From: bonzini at gnu dot org @ 2004-11-09  7:55 UTC (permalink / raw)
  To: glibc-bugs-regex



-- 
Bug 429 depends on bug 508, which changed state.

Bug 508 Summary: check_dst_limits is inefficient
http://sources.redhat.com/bugzilla/show_bug.cgi?id=508

           What    |Old Value                   |New Value
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED

http://sources.redhat.com/bugzilla/show_bug.cgi?id=429

------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug regex/429] regex hangs on backreferences
  2004-10-07 11:00 [Bug regex/429] New: regex hangs on backreferences jakub at redhat dot com
                   ` (3 preceding siblings ...)
  2004-11-09  7:55 ` bonzini at gnu dot org
@ 2004-11-09 15:24 ` jakub at redhat dot com
  2004-11-09 15:51   ` Paolo Bonzini
  2004-11-09 15:51 ` paolo dot bonzini at lu dot unisi dot ch
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 10+ messages in thread
From: jakub at redhat dot com @ 2004-11-09 15:24 UTC (permalink / raw)
  To: glibc-bugs-regex


------- Additional Comments From jakub at redhat dot com  2004-11-09 15:24 -------
But even with your patch bug-regex11.c with s/#if 0/#if 1/ eats certainly
more than 10 minutes of CPU time (until I killed it).

Either there is a better algorithm for many backreferences, or we should
consider using NFA for patterns where DFA with backtracing is known to take
too long.

-- 


http://sources.redhat.com/bugzilla/show_bug.cgi?id=429

------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug regex/429] regex hangs on backreferences
  2004-10-07 11:00 [Bug regex/429] New: regex hangs on backreferences jakub at redhat dot com
                   ` (4 preceding siblings ...)
  2004-11-09 15:24 ` jakub at redhat dot com
@ 2004-11-09 15:51 ` paolo dot bonzini at lu dot unisi dot ch
  2004-11-10 10:53 ` bonzini at gnu dot org
  2004-11-12 12:22 ` bonzini at gnu dot org
  7 siblings, 0 replies; 10+ messages in thread
From: paolo dot bonzini at lu dot unisi dot ch @ 2004-11-09 15:51 UTC (permalink / raw)
  To: glibc-bugs-regex


------- Additional Comments From paolo dot bonzini at lu dot unisi dot ch  2004-11-09 15:51 -------
Subject: Re:  regex hangs on backreferences

> But even with your patch bug-regex11.c with s/#if 0/#if 1/ eats certainly
> more than 10 minutes of CPU time (until I killed it).

Yes.  I managed to run a 7-backreference version, which took a few minutes
before my patch.  True, it does not make the full testcase feasible yet, but
I'll work on it.

> Either there is a better algorithm for many backreferences, or we should
> consider using NFA for patterns where DFA with backtracing is known to
take
> too long.

I think you can do some kind of caching to cut the number of invocations of
calc_dst_limits_pos.  Its implementation is naive.  Complexity is
exponential in N because every epsilon closure visits N backreferences
without any remote hope of succeeding, because no OP_{OPEN,CLOSE}_SUBEXP is
on the epsilon closure.

I have to figure out exactly how the backref cache enters the game, because
adding something ad hoc in regcomp.c does not seem the right way to fix it.
And also, I want to understand which cases are common both in practice (to
avoid slowing down the common case) and in the worst case: I'd like
factor.sed and dc.sed to be sped up by 10% while fixing this bug.  I
certainly hope to bring it down to O(N) in the number of backreferences,
albeit with a pretty big constant in front of it.

Paolo



-- 


http://sources.redhat.com/bugzilla/show_bug.cgi?id=429

------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Bug regex/429] regex hangs on backreferences
  2004-11-09 15:24 ` jakub at redhat dot com
@ 2004-11-09 15:51   ` Paolo Bonzini
  0 siblings, 0 replies; 10+ messages in thread
From: Paolo Bonzini @ 2004-11-09 15:51 UTC (permalink / raw)
  To: sourceware-bugzilla, glibc-bugs-regex

> But even with your patch bug-regex11.c with s/#if 0/#if 1/ eats certainly
> more than 10 minutes of CPU time (until I killed it).

Yes.  I managed to run a 7-backreference version, which took a few minutes
before my patch.  True, it does not make the full testcase feasible yet, but
I'll work on it.

> Either there is a better algorithm for many backreferences, or we should
> consider using NFA for patterns where DFA with backtracing is known to
take
> too long.

I think you can do some kind of caching to cut the number of invocations of
calc_dst_limits_pos.  Its implementation is naive.  Complexity is
exponential in N because every epsilon closure visits N backreferences
without any remote hope of succeeding, because no OP_{OPEN,CLOSE}_SUBEXP is
on the epsilon closure.

I have to figure out exactly how the backref cache enters the game, because
adding something ad hoc in regcomp.c does not seem the right way to fix it.
And also, I want to understand which cases are common both in practice (to
avoid slowing down the common case) and in the worst case: I'd like
factor.sed and dc.sed to be sped up by 10% while fixing this bug.  I
certainly hope to bring it down to O(N) in the number of backreferences,
albeit with a pretty big constant in front of it.

Paolo


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug regex/429] regex hangs on backreferences
  2004-10-07 11:00 [Bug regex/429] New: regex hangs on backreferences jakub at redhat dot com
                   ` (5 preceding siblings ...)
  2004-11-09 15:51 ` paolo dot bonzini at lu dot unisi dot ch
@ 2004-11-10 10:53 ` bonzini at gnu dot org
  2004-11-12 12:22 ` bonzini at gnu dot org
  7 siblings, 0 replies; 10+ messages in thread
From: bonzini at gnu dot org @ 2004-11-10 10:53 UTC (permalink / raw)
  To: glibc-bugs-regex


------- Additional Comments From bonzini at gnu dot org  2004-11-10 10:53 -------
Created an attachment (id=272)
 --> (http://sources.redhat.com/bugzilla/attachment.cgi?id=272&action=view)
Fixes exponential behavior in check_dst_limits_calc_pos_1

As promised, some caching does most of the job.  The tests in bug-regex11.c now
complete in a sane amount of time (11 seconds each on a G4 PowerMac), but
unluckily this does not speed up other testcases.

-- 


http://sources.redhat.com/bugzilla/show_bug.cgi?id=429

------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug regex/429] regex hangs on backreferences
  2004-10-07 11:00 [Bug regex/429] New: regex hangs on backreferences jakub at redhat dot com
                   ` (6 preceding siblings ...)
  2004-11-10 10:53 ` bonzini at gnu dot org
@ 2004-11-12 12:22 ` bonzini at gnu dot org
  7 siblings, 0 replies; 10+ messages in thread
From: bonzini at gnu dot org @ 2004-11-12 12:22 UTC (permalink / raw)
  To: glibc-bugs-regex


------- Additional Comments From bonzini at gnu dot org  2004-11-12 12:22 -------
Patch applied.

-- 
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|                            |FIXED


http://sources.redhat.com/bugzilla/show_bug.cgi?id=429

------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.


^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2004-11-12 12:22 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2004-10-07 11:00 [Bug regex/429] New: regex hangs on backreferences jakub at redhat dot com
2004-10-07 14:17 ` [Bug regex/429] " gotom at debian dot or dot jp
2004-11-04  8:43 ` bonzini at gnu dot org
2004-11-04  9:04 ` bonzini at gnu dot org
2004-11-09  7:55 ` bonzini at gnu dot org
2004-11-09 15:24 ` jakub at redhat dot com
2004-11-09 15:51   ` Paolo Bonzini
2004-11-09 15:51 ` paolo dot bonzini at lu dot unisi dot ch
2004-11-10 10:53 ` bonzini at gnu dot org
2004-11-12 12:22 ` bonzini at gnu dot org

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).