From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <glibc-bugs-regex-return-522-listarch-glibc-bugs-regex=sources.redhat.com@sourceware.org>
Received: (qmail 14069 invoked by alias); 31 Jan 2012 12:48:12 -0000
Received: (qmail 14050 invoked by uid 22791); 31 Jan 2012 12:48:11 -0000
X-SWARE-Spam-Status: No, hits=-2.8 required=5.0	tests=ALL_TRUSTED,AWL,BAYES_00
X-Spam-Check-By: sourceware.org
Received: from localhost (HELO sourceware.org) (127.0.0.1)    by sourceware.org (qpsmtpd/0.43rc1) with ESMTP; Tue, 31 Jan 2012 12:47:58 +0000
From: "leonardo at ngdn dot org" <sourceware-bugzilla@sourceware.org>
To: glibc-bugs-regex@sources.redhat.com
Subject: [Bug regex/13637] New: incorrect match in multi-byte (non-UTF8) string
Date: Tue, 31 Jan 2012 12:48:00 -0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: new
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: glibc
X-Bugzilla-Component: regex
X-Bugzilla-Keywords:
X-Bugzilla-Severity: normal
X-Bugzilla-Who: leonardo at ngdn dot org
X-Bugzilla-Status: NEW
X-Bugzilla-Priority: P2
X-Bugzilla-Assigned-To: drepper.fsp at gmail dot com
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Changed-Fields:
Message-ID: <bug-13637-132@http.sourceware.org/bugzilla/>
X-Bugzilla-URL: http://sourceware.org/bugzilla/
Auto-Submitted: auto-generated
Content-Type: text/plain; charset="UTF-8"
MIME-Version: 1.0
Mailing-List: contact glibc-bugs-regex-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id: <glibc-bugs-regex.sourceware.org>
List-Subscribe: <mailto:glibc-bugs-regex-subscribe@sourceware.org>
List-Post: <mailto:glibc-bugs-regex@sourceware.org>
List-Help: <mailto:glibc-bugs-regex-help@sourceware.org>, <http://sourceware.org/lists.html#faqs>
Sender: glibc-bugs-regex-owner@sourceware.org
X-SW-Source: 2012-01/txt/msg00000.txt.bz2

http://sourceware.org/bugzilla/show_bug.cgi?id=13637

             Bug #: 13637
           Summary: incorrect match in multi-byte (non-UTF8) string
           Product: glibc
           Version: 2.15
            Status: NEW
          Severity: normal
          Priority: P2
         Component: regex
        AssignedTo: drepper.fsp@gmail.com
        ReportedBy: leonardo@ngdn.org
    Classification: Unclassified


Created attachment 6186
  --> http://sourceware.org/bugzilla/attachment.cgi?id=6186
reg.sh: a script to reproduce the problem

When a particular string composed of single-byte and multi-byte characters is
passed to re_search(), the function seems to lose track of which characters are
multi-byte and returns an incorrect match. So far I have only been able to
reproduce this in the ja_JP.eucjp locale.

The problem can be reproduced when the following string:

  aaa\xb7\xefa\xbf\xb7\xbd\xe8

... is matched against the pattern:

  \xb7\xbd

The two bytes in the pattern are respectively "the last byte of the second
multi-byte char" and "the first byte of the third multi-byte char" in the
original string.
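
To make the byte layout concrete, here is a minimal Python sketch (an
illustration only, not the attached reg.sh and not glibc code). It assumes the
usual EUC-JP layout: bytes 0xA1-0xFE and 0x8E start a two-byte character, 0x8F
starts a three-byte character, and everything else is a single byte. The helper
names eucjp_char_starts and bytewise_hits are made up for this example. It
shows that the only place the pattern bytes occur is byte offset 7, which falls
inside a character rather than on a character boundary, so a multi-byte-aware
search would not be expected to report a match there.

```python
def eucjp_char_starts(data):
    """Return the byte offsets at which each EUC-JP character starts.

    Assumes well-formed EUC-JP input (hypothetical helper for illustration).
    """
    starts = []
    i = 0
    while i < len(data):
        starts.append(i)
        b = data[i]
        if b == 0x8F:                          # SS3 prefix: three-byte character
            i += 3
        elif b == 0x8E or 0xA1 <= b <= 0xFE:   # SS2 prefix or two-byte lead byte
            i += 2
        else:                                  # single-byte (ASCII) character
            i += 1
    return starts


def bytewise_hits(data, pattern):
    """Return every byte offset where pattern occurs, ignoring character boundaries."""
    hits = []
    pos = data.find(pattern)
    while pos != -1:
        hits.append(pos)
        pos = data.find(pattern, pos + 1)
    return hits


SUBJECT = b"aaa\xb7\xefa\xbf\xb7\xbd\xe8"   # string from this report
PATTERN = b"\xb7\xbd"                        # pattern from this report

if __name__ == "__main__":
    starts = eucjp_char_starts(SUBJECT)
    print("character start offsets:", starts)
    for pos in bytewise_hits(SUBJECT, PATTERN):
        kind = "character boundary" if pos in starts else "inside a character"
        print("pattern bytes found at byte offset %d: %s" % (pos, kind))
```

The sketch only inspects bytes; it does not call re_search() or depend on any
installed locale, so it says nothing about where in glibc or sed the incorrect
match comes from.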

The number of "a"s prefixed in the original string seems to make all the
difference here. I could only reproduce the problem when exactly 3 or 4 "a"s
are prefixed. I.e., if you remove one "a" from the prefix of the original
string:

  aa\xb7\xefa\xbf\xb7\xbd\xe8

... the problem no longer happens.

I'm attaching a script that reproduces the problem. The 'sed' version I'm using
is compiled with "--without-included-regex", so it should use glibc's regex
functions. Unfortunately I can't yet confirm that the bug is not in sed, but I'm
trying to create a self-contained program to demonstrate the problem.

-- 
Configure bugmail: http://sourceware.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.