public inbox for glibc-bugs-regex@sourceware.org
* [Bug regex/18986] ERE '0|()0|\1|0' causes regexec undefined behavior
       [not found] <bug-18986-132@http.sourceware.org/bugzilla/>
@ 2015-09-20 18:10 ` hanno at hboeck dot de
  2015-11-02 14:32 ` arekm at maven dot pl
  1 sibling, 0 replies; 2+ messages in thread
From: hanno at hboeck dot de @ 2015-09-20 18:10 UTC (permalink / raw)
  To: glibc-bugs-regex

https://sourceware.org/bugzilla/show_bug.cgi?id=18986

Hanno Boeck <hanno at hboeck dot de> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |hanno at hboeck dot de

-- 
You are receiving this mail because:
You are on the CC list for the bug.



* [Bug regex/18986] ERE '0|()0|\1|0' causes regexec undefined behavior
       [not found] <bug-18986-132@http.sourceware.org/bugzilla/>
  2015-09-20 18:10 ` [Bug regex/18986] ERE '0|()0|\1|0' causes regexec undefined behavior hanno at hboeck dot de
@ 2015-11-02 14:32 ` arekm at maven dot pl
  1 sibling, 0 replies; 2+ messages in thread
From: arekm at maven dot pl @ 2015-11-02 14:32 UTC (permalink / raw)
  To: glibc-bugs-regex

https://sourceware.org/bugzilla/show_bug.cgi?id=18986

Arkadiusz Miskiewicz <arekm at maven dot pl> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |arekm at maven dot pl


* [Bug regex/19348] New: re_search is incredibly slow when processing '$' on long lines
From: alex_y_xu at yahoo dot ca @ 2015-12-09 14:40 UTC
  To: glibc-bugs-regex

https://sourceware.org/bugzilla/show_bug.cgi?id=19348

            Bug ID: 19348
           Summary: re_search is incredibly slow when processing '$' on
                    long lines
           Product: glibc
           Version: unspecified
            Status: NEW
          Severity: normal
          Priority: P2
         Component: regex
          Assignee: unassigned at sourceware dot org
          Reporter: alex_y_xu at yahoo dot ca
                CC: drepper.fsp at gmail dot com
  Target Milestone: ---

    $ echo {1..5000000} > file # adjust based on CPU speed
    $ time sed -e 's/$/stuff/' file >/dev/null # logical way to append to lines
    sed -e 's/$/stuff/' file > /dev/null  2.91s user 0.09s system 99% cpu 3.007 total
    $ time sed -e 's/.*/&stuff/' file >/dev/null
    sed -e 's/.*/&stuff/' file > /dev/null  1.62s user 0.34s system 99% cpu 1.972 total

With busybox sed built against musl, the first command tested about 2x faster
than the second.

Intuitively, the glibc result does not make sense. .* should be slower because
it needs to match the entire string, whereas $ can skip to the end of the line
(since sed must already find the newline in order to run the commands).

However, according to callgrind, glibc spends an inordinate amount of time
inside check_halt_state_context, re_state_reconstruct, and
re_string_context_at.

I am unsure whether this qualifies as a glibc bug or how to fix it, but I think
it is useful to have on the record.
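
A minimal C sketch for reproducing the comparison without sed, assuming a POSIX <regex.h>. It goes through regcomp/regexec rather than the re_search entry point named in the summary, and time_match and make_line are hypothetical helper names, not glibc functions. It runs each pattern once over a single long line, so absolute timings will not match the sed figures above.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical helper: build one long line "1 2 3 ... n", similar to the
   output of `echo {1..n}` minus the trailing newline. Caller frees. */
static char *make_line(int n)
{
    assert(n >= 0);
    size_t cap = (size_t)n * 12 + 1;   /* at most 11 bytes per number */
    char *buf = malloc(cap);
    if (buf == NULL)
        return NULL;
    buf[0] = '\0';
    size_t len = 0;
    for (int i = 1; i <= n; i++)
        len += (size_t)snprintf(buf + len, cap - len, "%s%d",
                                i == 1 ? "" : " ", i);
    return buf;
}

/* Hypothetical helper: compile `pattern` as a POSIX basic regex, run it once
   against `line`, store the match offsets in *m, and return the CPU seconds
   spent in regexec. Returns -1.0 on a compile error or when nothing matches. */
static double time_match(const char *pattern, const char *line, regmatch_t *m)
{
    regex_t re;
    if (regcomp(&re, pattern, 0) != 0)
        return -1.0;
    clock_t start = clock();
    int rc = regexec(&re, line, 1, m, 0);
    clock_t end = clock();
    regfree(&re);
    return rc == 0 ? (double)(end - start) / CLOCKS_PER_SEC : -1.0;
}
```

Calling time_match("$", line, &m) and time_match(".*", line, &m) on the result of make_line(5000000) and printing the two return values gives a rough per-pattern comparison; the line length may need adjusting to the CPU, as in the sed commands.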


* [Bug regex/19348] re_search matches $ much slower than .*
From: alex_y_xu at yahoo dot ca @ 2015-12-09 16:43 UTC
  To: glibc-bugs-regex

https://sourceware.org/bugzilla/show_bug.cgi?id=19348

alex_y_xu at yahoo dot ca changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|re_search is incredibly     |re_search matches $ much
                   |slow when processing '$' on |slower than .*
                   |long lines                  |


* [Bug regex/19376] New: regcomp.c needs to be upgraded to GNU Grep's one
From: t.rus76 at ya dot ru @ 2015-12-18 09:26 UTC
  To: glibc-bugs-regex

https://sourceware.org/bugzilla/show_bug.cgi?id=19376

            Bug ID: 19376
           Summary: regcomp.c needs to be upgraded to GNU Grep's one
           Product: glibc
           Version: 2.22
            Status: NEW
          Severity: normal
          Priority: P2
         Component: regex
          Assignee: unassigned at sourceware dot org
          Reporter: t.rus76 at ya dot ru
                CC: drepper.fsp at gmail dot com
  Target Milestone: ---

Symptom: GNU Grep does not handle Syriac characters (U+0700 – U+074F) correctly

$ echo 'ܫܠܡܐ' > peace
$ egrep '\<[ܐ-ܬ]' peace
grep: Invalid collation character
$ awk /'\<[ܐ-ܬ]'/ peace
ܫܠܡܐ

However, when grep is built with ./configure --with-included-regex,
it works fine and there is no REG_ECOLLATE error:

$ echo ܫܠܡܐ | src/egrep [ܫ-ܬ]
ܫܠܡܐ
$ echo ܫܠܡܐ | src/egrep [ܒ-ܓ]
$

This is because GNU Grep contains an improved version of regcomp.

The bug was found here:
http://forum.rosalab.ru/viewtopic.php?f=53&t=6219&p=54747 (in Russian)

It has also been tested and confirmed on Gentoo (both glibc and grep are 2.22).


I expect there are other bugs that could be fixed with this upgrade.
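
A minimal C sketch for checking the symptom directly against the C library's regcomp, without going through grep. It assumes a POSIX <regex.h>; compile_status and compile_in_env_locale are hypothetical helper names. Whether a given Syriac range compiles depends on the glibc version and the active locale, so no result is claimed here.

```c
#include <locale.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: compile `pattern` as a POSIX extended regex and return
   regcomp's error code (0 on success, e.g. REG_ECOLLATE on failure). When msg
   is non-NULL, the matching regerror text is copied into it. */
static int compile_status(const char *pattern, char *msg, size_t msg_len)
{
    regex_t re;
    int rc = regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB);
    if (msg != NULL && msg_len > 0)
        regerror(rc, &re, msg, msg_len);
    if (rc == 0)
        regfree(&re);
    return rc;
}

/* Adopt the locale from the environment first (for example a UTF-8 locale),
   so that a multibyte bracket range is read as characters, not bytes. */
static int compile_in_env_locale(const char *pattern, char *msg, size_t msg_len)
{
    setlocale(LC_ALL, "");
    return compile_status(pattern, msg, msg_len);
}
```

Passing the bracket expression from the report, such as "[ܐ-ܬ]", to compile_in_env_locale and printing msg shows whether the installed regcomp accepts the range. The \< word anchor in the original command is a GNU extension and is left out here.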
