From: "leonardo at ngdn dot org"
To: glibc-bugs-regex@sources.redhat.com
Subject: [Bug regex/13637] New: incorrect match in multi-byte (non-UTF8) string
Date: Tue, 31 Jan 2012 12:48:00 -0000

http://sourceware.org/bugzilla/show_bug.cgi?id=13637

             Bug #: 13637
           Summary: incorrect match in multi-byte (non-UTF8) string
           Product: glibc
           Version: 2.15
            Status: NEW
          Severity: normal
          Priority: P2
         Component: regex
        AssignedTo: drepper.fsp@gmail.com
        ReportedBy: leonardo@ngdn.org
    Classification: Unclassified

Created attachment 6186
  --> http://sourceware.org/bugzilla/attachment.cgi?id=6186
reg.sh: a script to reproduce the problem

When a particular string mixing single-byte and multi-byte characters is passed to re_search(), the function seems to lose track of which bytes belong to multi-byte characters and returns an incorrect match. So far this seems to be exclusive to the ja_JP.eucjp locale.

The problem can be reproduced when the following string:

    aaa\xb7\xefa\xbf\xb7\xbd\xe8

... is matched against the pattern:

    \xb7\xbd

The two bytes in the pattern are, respectively, the last byte of the second multi-byte character and the first byte of the third multi-byte character in the original string. The pattern therefore straddles a character boundary and should not match.

The number of "a"s prefixed to the original string seems to make all the difference here. I could only reproduce the problem when exactly 3 or 4 "a"s are prefixed. For example, if you remove one "a" from the prefix of the original string:

    aa\xb7\xefa\xbf\xb7\xbd\xe8

... the problem no longer happens.

I'm attaching a script that reproduces the problem. The 'sed' version I'm using is compiled with "--without-included-regex", so it should use glibc's regex functions. Unfortunately, I can't yet rule out that the bug is in sed itself, but I'm trying to write a self-contained program to demonstrate the problem.
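
To make the boundary argument concrete, here is a minimal sketch in Python. It does not call glibc's re_search() and does not reproduce the bug. It only assumes that Python's "euc_jp" codec splits the bytes into characters the same way the ja_JP.eucjp locale does, and the variable names are made up for illustration. It shows where the characters in the reported string start and that the two pattern bytes only occur across a character boundary.

```python
# Subject string and pattern from the report, as raw EUC-JP bytes.
subject = b"aaa\xb7\xefa\xbf\xb7\xbd\xe8"
pattern = b"\xb7\xbd"

# Byte offsets at which each EUC-JP character of the subject starts.
# Assumption: Python's euc_jp codec stands in for the locale's
# multi-byte handling; glibc itself is not involved here.
boundaries = []
offset = 0
for ch in subject.decode("euc_jp"):
    boundaries.append(offset)
    offset += len(ch.encode("euc_jp"))

# A byte-wise search ignores character boundaries.
byte_hit = subject.find(pattern)

# A character-wise search compares whole decoded characters.
char_hit = subject.decode("euc_jp").find(pattern.decode("euc_jp"))

print("character start offsets:", boundaries)
print("byte-wise hit at offset:", byte_hit)
print("hit starts on a character boundary:", byte_hit in boundaries)
print("character-wise hit:", char_hit)
```

The byte-wise search finds the pattern at an offset that is not the start of any character, while the character-wise search finds nothing. The second result is what a multi-byte-aware matcher would be expected to report for this input.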