public inbox for glibc-bugs-regex@sourceware.org
From: "arekm at maven dot pl" <sourceware-bugzilla@sourceware.org>
To: glibc-bugs-regex@sourceware.org
Subject: [Bug regex/18986] ERE '0|()0|\1|0' causes regexec undefined behavior
Date: Mon, 02 Nov 2015 14:32:00 -0000
Message-ID: <bug-18986-132-JOl7EGWgGB@http.sourceware.org/bugzilla/>
In-Reply-To: <bug-18986-132@http.sourceware.org/bugzilla/>

https://sourceware.org/bugzilla/show_bug.cgi?id=18986

Arkadiusz Miskiewicz <arekm at maven dot pl> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |arekm at maven dot pl

-- 
You are receiving this mail because:
You are on the CC list for the bug.
>From glibc-bugs-regex-return-700-listarch-glibc-bugs-regex=sources.redhat.com@sourceware.org Wed Dec 09 14:40:29 2015
Return-Path: <glibc-bugs-regex-return-700-listarch-glibc-bugs-regex=sources.redhat.com@sourceware.org>
Delivered-To: listarch-glibc-bugs-regex@sources.redhat.com
Received: (qmail 54966 invoked by alias); 9 Dec 2015 14:40:28 -0000
Mailing-List: contact glibc-bugs-regex-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id: <glibc-bugs-regex.sourceware.org>
List-Subscribe: <mailto:glibc-bugs-regex-subscribe@sourceware.org>
List-Post: <mailto:glibc-bugs-regex@sourceware.org>
List-Help: <mailto:glibc-bugs-regex-help@sourceware.org>, <http://sourceware.org/lists.html#faqs>
Sender: glibc-bugs-regex-owner@sourceware.org
Delivered-To: mailing list glibc-bugs-regex@sourceware.org
Received: (qmail 54630 invoked by uid 48); 9 Dec 2015 14:40:21 -0000
From: "alex_y_xu at yahoo dot ca" <sourceware-bugzilla@sourceware.org>
To: glibc-bugs-regex@sourceware.org
Subject: [Bug regex/19348] New: re_search is incredibly slow when processing '$' on long lines
Date: Wed, 09 Dec 2015 14:40:00 -0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: new
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: glibc
X-Bugzilla-Component: regex
X-Bugzilla-Version: unspecified
X-Bugzilla-Keywords:
X-Bugzilla-Severity: normal
X-Bugzilla-Who: alex_y_xu at yahoo dot ca
X-Bugzilla-Status: NEW
X-Bugzilla-Resolution:
X-Bugzilla-Priority: P2
X-Bugzilla-Assigned-To: unassigned at sourceware dot org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags:
X-Bugzilla-Changed-Fields: bug_id short_desc product version bug_status bug_severity priority component assigned_to reporter cc target_milestone
Message-ID: <bug-19348-132@http.sourceware.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://sourceware.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-SW-Source: 2015-12/txt/msg00000.txt.bz2
Content-length: 1548

https://sourceware.org/bugzilla/show_bug.cgi?id=19348

            Bug ID: 19348
           Summary: re_search is incredibly slow when processing '$' on
                    long lines
           Product: glibc
           Version: unspecified
            Status: NEW
          Severity: normal
          Priority: P2
         Component: regex
          Assignee: unassigned at sourceware dot org
          Reporter: alex_y_xu at yahoo dot ca
                CC: drepper.fsp at gmail dot com
  Target Milestone: ---

    $ echo {1..5000000} > file # adjust based on CPU speed
    $ time sed -e 's/$/stuff/' file >/dev/null # logical way to append to lines
    sed -e 's/$/stuff/' file > /dev/null  2.91s user 0.09s system 99% cpu 3.007 total
    $ time sed -e 's/.*/&stuff/' file >/dev/null
    sed -e 's/.*/&stuff/' file > /dev/null  1.62s user 0.34s system 99% cpu 1.972 total

With musl (via busybox sed), the first case tested 2x faster than the
second.

Intuitively, the glibc result does not make sense: .* should be slower because
it must match the entire string, whereas $ can skip to the end of the line
(sed must already find the newline in order to run the commands).

However, according to callgrind, glibc spends an inordinate amount of time
inside check_halt_state_context, re_state_reconstruct, and
re_string_context_at.

I am unsure whether this qualifies as a glibc bug or how to fix it, but I think
it is useful to have on the record.
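
The comparison above can also be tried without sed. The following is a minimal
sketch, assuming the POSIX regcomp()/regexec() interface exercises the same
matcher as the re_search call named in the summary (the report does not confirm
this). The helper names make_line and time_match are hypothetical and exist
only for this illustration.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Build one long line "1 2 3 ... n", like `echo {1..n}` without the
   trailing newline. The caller frees the result. */
char *make_line(unsigned n)
{
    /* Each number needs at most 10 digits plus one separator. */
    char *buf = malloc((size_t)n * 11 + 1);
    assert(buf != NULL);
    char *p = buf;
    for (unsigned i = 1; i <= n; i++)
        p += sprintf(p, i == 1 ? "%u" : " %u", i);
    *p = '\0';
    return buf;
}

/* Run a single regexec() of `pattern` (a basic regex, sed's default
   syntax) against `line`, store the match span in *m, and return the
   CPU time in seconds, or -1.0 if nothing matched. */
double time_match(const char *pattern, const char *line, regmatch_t *m)
{
    regex_t re;
    int rc = regcomp(&re, pattern, 0);
    assert(rc == 0);

    clock_t start = clock();
    rc = regexec(&re, line, 1, m, 0);
    clock_t end = clock();

    regfree(&re);
    return rc == 0 ? (double)(end - start) / CLOCKS_PER_SEC : -1.0;
}
```

Calling time_match("$", line, &m) and time_match(".*", line, &m) on the result
of make_line(5000000) gives a rough per-pattern timing. No figures are claimed
here; the numbers depend on the libc and the machine.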

From: "alex_y_xu at yahoo dot ca" <sourceware-bugzilla@sourceware.org>
To: glibc-bugs-regex@sourceware.org
Subject: [Bug regex/19348] re_search matches $ much slower than .*
Date: Wed, 09 Dec 2015 16:43:00 -0000
Message-ID: <bug-19348-132-EDdlkwqKnS@http.sourceware.org/bugzilla/>
In-Reply-To: <bug-19348-132@http.sourceware.org/bugzilla/>
References: <bug-19348-132@http.sourceware.org/bugzilla/>

https://sourceware.org/bugzilla/show_bug.cgi?id=19348

alex_y_xu at yahoo dot ca changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|re_search is incredibly     |re_search matches $ much
                   |slow when processing '$' on |slower than .*
                   |long lines                  |

From: "t.rus76 at ya dot ru" <sourceware-bugzilla@sourceware.org>
To: glibc-bugs-regex@sourceware.org
Subject: [Bug regex/19376] New: regcomp.c needs to be upgraded to GNU Grep's one
Date: Fri, 18 Dec 2015 09:26:00 -0000
Message-ID: <bug-19376-132@http.sourceware.org/bugzilla/>

https://sourceware.org/bugzilla/show_bug.cgi?id=19376

            Bug ID: 19376
           Summary: regcomp.c needs to be upgraded to GNU Grep's one
           Product: glibc
           Version: 2.22
            Status: NEW
          Severity: normal
          Priority: P2
         Component: regex
          Assignee: unassigned at sourceware dot org
          Reporter: t.rus76 at ya dot ru
                CC: drepper.fsp at gmail dot com
  Target Milestone: ---

Symptom: GNU Grep does not handle Syriac characters (U+0700 – U+074F) correctly

$ echo 'ܫܠܡܐ' > peace
$ egrep '\<[ܐ-ܬ]' peace
grep: Invalid collation character
$ awk /'\<[ܐ-ܬ]'/ peace
ܫܠܡܐ

However, when grep is built with ./configure --with-included-regex,
it works just fine and there is no REG_ECOLLATE error:

$ echo ܫܠܡܐ | src/egrep [ܫ-ܬ]
ܫܠܡܐ
$ echo ܫܠܡܐ | src/egrep [ܒ-ܓ]
$

This is because GNU Grep contains an improved version of regcomp.

The bug was found here:
http://forum.rosalab.ru/viewtopic.php?f=53&t=6219&p=54747 (in Russian)

It has also been tested and confirmed on Gentoo (both glibc and grep are 2.22).


I expect there are other bugs that could be fixed with this upgrade.
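
The REG_ECOLLATE failure described above can be checked directly against the
C library, without going through grep. This is a minimal sketch under stated
assumptions: try_compile is a hypothetical helper, and the caller is assumed to
have selected the locale first, for example with setlocale(LC_ALL, "") under a
UTF-8 locale. Whether the Syriac range from this report compiles depends on the
regcomp implementation in use, so no result is claimed for it.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Compile `pattern` as an extended regex and return regcomp()'s status
   code (0 on success, otherwise a REG_* error such as REG_ECOLLATE).
   On failure the regerror() message is copied into errbuf; on success
   errbuf is set to the empty string. */
int try_compile(const char *pattern, char *errbuf, size_t errlen)
{
    assert(pattern != NULL);

    regex_t re;
    int rc = regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB);

    if (errlen > 0)
        errbuf[0] = '\0';
    if (rc != 0)
        regerror(rc, &re, errbuf, errlen);  /* text of the compile error */
    else
        regfree(&re);                       /* only free a compiled regex */
    return rc;
}
```

Passing the bracket expression from the egrep command above, after setting the
locale, shows whether the system regcomp itself rejects the range or whether
the error comes from elsewhere.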
