public inbox for glibc-bugs-regex@sourceware.org
* [Bug regex/13637] New: incorrect match in multi-byte (non-UTF8) string
@ 2012-01-31 12:48 leonardo at ngdn dot org
  2012-02-10 19:22 ` [Bug regex/13637] " sbrabec at suse dot cz
                   ` (5 more replies)
  0 siblings, 6 replies; 7+ messages in thread
From: leonardo at ngdn dot org @ 2012-01-31 12:48 UTC
  To: glibc-bugs-regex

http://sourceware.org/bugzilla/show_bug.cgi?id=13637

             Bug #: 13637
           Summary: incorrect match in multi-byte (non-UTF8) string
           Product: glibc
           Version: 2.15
            Status: NEW
          Severity: normal
          Priority: P2
         Component: regex
        AssignedTo: drepper.fsp@gmail.com
        ReportedBy: leonardo@ngdn.org
    Classification: Unclassified


Created attachment 6186
  --> http://sourceware.org/bugzilla/attachment.cgi?id=6186
reg.sh: a script to reproduce the problem

When a particular string composed of single-byte and multi-byte characters is
passed to re_search(), the function seems to lose track of which bytes belong
to multi-byte characters and returns an incorrect match. This seems to be
specific to the ja_JP.eucjp locale.

The problem can be reproduced when the following string:

  aaa\xb7\xefa\xbf\xb7\xbd\xe8

... is matched against the pattern:

  \xb7\xbd

The two bytes in the pattern are, respectively, the last byte of the second
multi-byte character and the first byte of the third multi-byte character in
the original string, so the pattern should not match at all.
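
To make the byte layout concrete, here is a minimal Python sketch that splits
the string into characters and shows where a plain byte search finds the
pattern. It does not call glibc's regex code. It only illustrates the result a
character-aware matcher would be expected to give. The helper names are
hypothetical, and the lead-byte ranges reflect my understanding of EUC-JP
(0x8E starts a two-byte sequence, 0x8F a three-byte sequence, 0xA1-0xFE a
two-byte sequence).

```python
def euc_jp_char_starts(data):
    """Return the byte offsets at which each EUC-JP character starts.

    Assumption: the input is well-formed EUC-JP; no validation is done.
    """
    starts = []
    i = 0
    while i < len(data):
        starts.append(i)
        b = data[i]
        if b == 0x8F:                          # SS3: three-byte sequence
            i += 3
        elif b == 0x8E or 0xA1 <= b <= 0xFE:   # SS2 or two-byte character
            i += 2
        else:                                  # single-byte (e.g. ASCII)
            i += 1
    return starts


def char_aligned_find(text, pattern):
    """Offset of the first occurrence of pattern that starts and ends on a
    character boundary of text, or -1 if there is none."""
    boundaries = set(euc_jp_char_starts(text)) | {len(text)}
    pos = text.find(pattern)
    while pos != -1:
        if pos in boundaries and pos + len(pattern) in boundaries:
            return pos
        pos = text.find(pattern, pos + 1)
    return -1


text = b"aaa\xb7\xefa\xbf\xb7\xbd\xe8"
pattern = b"\xb7\xbd"

if __name__ == "__main__":
    print("character starts:  ", euc_jp_char_starts(text))
    print("byte-level match:  ", text.find(pattern))
    print("char-aligned match:", char_aligned_find(text, pattern))
```

The byte-level search lands in the middle of the second multi-byte character,
while the character-aligned search finds nothing, which is the result I would
expect from re_search() in this locale.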

The number of "a"s at the start of the original string seems to make all the
difference here. I could only reproduce the problem with exactly 3 or 4
leading "a"s. For example, if you remove one "a" from the prefix of the
original string:

  aa\xb7\xefa\xbf\xb7\xbd\xe8

... the problem no longer happens.
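
For anyone who wants to try other prefix lengths, the following Python sketch
generates the test strings and records, for each one, where the pattern occurs
at the byte level and whether that occurrence sits on character boundaries. It
does not exercise glibc; it only produces inputs and the expected outcome for a
character-aware matcher. The function names, the range of prefix lengths, and
the EUC-JP lead-byte ranges are my own assumptions.

```python
TAIL = b"\xb7\xefa\xbf\xb7\xbd\xe8"   # the part of the string after the "a"s
PATTERN = b"\xb7\xbd"


def char_boundaries(data):
    """Set of byte offsets that are EUC-JP character boundaries in data
    (assumes well-formed input; includes the end-of-string offset)."""
    bounds = set()
    i = 0
    while i < len(data):
        bounds.add(i)
        b = data[i]
        if b == 0x8F:                          # three-byte sequence
            i += 3
        elif b == 0x8E or 0xA1 <= b <= 0xFE:   # two-byte sequence
            i += 2
        else:                                  # single byte
            i += 1
    bounds.add(len(data))
    return bounds


def expected_results(max_prefix=6):
    """Map each prefix length to (byte offset of PATTERN, aligned?)."""
    results = {}
    for n in range(max_prefix + 1):
        text = b"a" * n + TAIL
        pos = text.find(PATTERN)
        bounds = char_boundaries(text)
        aligned = pos in bounds and pos + len(PATTERN) in bounds
        results[n] = (pos, aligned)
    return results


if __name__ == "__main__":
    for n, (pos, aligned) in expected_results().items():
        print(n, "leading a(s): bytes found at", pos, "aligned:", aligned)
```

For every prefix length the byte-level occurrence is unaligned, so a correct
matcher should report no match regardless of how many "a"s are prefixed.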

I'm attaching a script that reproduces the problem. The 'sed' version I'm using
is compiled with "--without-included-regex", so it should use glibc's regex
functions. Unfortunately I can't yet confirm that the bug is not in sed itself,
but I'm trying to write a self-contained program to demonstrate the problem.


end of thread, other threads:[~2012-02-28 16:43 UTC | newest]

Thread overview: 7+ messages
2012-01-31 12:48 [Bug regex/13637] New: incorrect match in multi-byte (non-UTF8) string leonardo at ngdn dot org
2012-02-10 19:22 ` [Bug regex/13637] " sbrabec at suse dot cz
2012-02-10 19:22 ` sbrabec at suse dot cz
2012-02-24 20:25 ` sbrabec at suse dot cz
2012-02-24 20:27 ` sbrabec at suse dot cz
2012-02-27 15:14 ` sbrabec at suse dot cz
2012-02-28 16:43 ` sbrabec at suse dot cz
