public inbox for glibc-bugs-regex@sourceware.org
* [Bug regex/13637] New: incorrect match in multi-byte (non-UTF8) string
From: leonardo at ngdn dot org @ 2012-01-31 12:48 UTC (permalink / raw)
To: glibc-bugs-regex
http://sourceware.org/bugzilla/show_bug.cgi?id=13637
Bug #: 13637
Summary: incorrect match in multi-byte (non-UTF8) string
Product: glibc
Version: 2.15
Status: NEW
Severity: normal
Priority: P2
Component: regex
AssignedTo: drepper.fsp@gmail.com
ReportedBy: leonardo@ngdn.org
Classification: Unclassified
Created attachment 6186
--> http://sourceware.org/bugzilla/attachment.cgi?id=6186
reg.sh: a script to reproduce the problem
When a particular string composed of single-byte and multi-byte characters is
passed to re_search(), the function seems to lose track of which characters are
multi-byte and returns an incorrect match. This seems to be exclusive to the
ja_JP.eucjp locale.
The problem can be reproduced when the following string:
aaa\xb7\xefa\xbf\xb7\xbd\xe8
... is matched against the pattern:
\xb7\xbd
The two bytes in the pattern are respectively "the last byte of the second
multi-byte char" and "the first byte of the third multi-byte char" in the
original string.
The number of "a"s prefixed in the original string seems to make all the
difference here. I could only reproduce the problem when exactly 3 or 4 "a"s
are prefixed. I.e., if you remove one "a" from the prefix of the original
string:
aa\xb7\xefa\xbf\xb7\xbd\xe8
... the problem no longer happens.
I'm attaching a script that reproduces the problem. The 'sed' version I'm using
is compiled with "--without-included-regex", so it should use glibc's regex
functions. Unfortunately I can't yet confirm that the bug is not in sed, but I'm
trying to create a self-contained program to demonstrate the problem.
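A minimal sketch of what such a self-contained program could look like, assuming glibc's GNU regex interface and an installed ja_JP.eucJP locale. The function name search_b7bd and the two negative status codes are made up for illustration, and the 256-byte fastmap size is an assumption based on one entry per byte value. This is not the attached reg.sh and not the testcase from the later patch.

```c
#define _GNU_SOURCE            /* exposes the GNU re_* API in <regex.h> */
#include <assert.h>
#include <locale.h>
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative status codes, chosen not to clash with re_search() results. */
enum { LOCALE_MISSING = -3, COMPILE_FAILED = -4 };

/* Search `subject` (length `len` bytes) for the two-byte pattern from the
   report. Returns the re_search() result (match offset, -1 for no match,
   -2 for an internal error) or one of the codes above.
   Note: this changes the process locale and does not restore it. */
int search_b7bd(const char *subject, size_t len)
{
    static const char pat[] = "\xb7\xbd";
    struct re_pattern_buffer r;
    int result;

    assert(subject != NULL);
    if (setlocale(LC_ALL, "ja_JP.eucJP") == NULL)
        return LOCALE_MISSING;

    memset(&r, 0, sizeof r);
    r.fastmap = malloc(256);           /* assumed size: one entry per byte value */
    if (r.fastmap == NULL)
        return COMPILE_FAILED;

    /* re_compile_pattern() returns NULL on success, else an error string. */
    if (re_compile_pattern(pat, sizeof pat - 1, &r) != NULL) {
        free(r.fastmap);
        return COMPILE_FAILED;
    }

    result = (int) re_search(&r, subject, (regoff_t) len, 0,
                             (regoff_t) len, NULL);
    regfree(&r);                       /* also releases r.fastmap */
    return result;
}
```

When writing the subject as a C string literal, the literal has to be split after \xef ("aaa\xb7\xef" "a\xbf\xb7\xbd\xe8"), because a hex escape would otherwise absorb the following "a". On a glibc that carries a fix, the call would be expected to return -1 (no match) for that 10-byte string; the behaviour described in this report is that 2.15 returns a match instead.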
--
Configure bugmail: http://sourceware.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.
* [Bug regex/13637] incorrect match in multi-byte (non-UTF8) string
From: sbrabec at suse dot cz @ 2012-02-10 19:22 UTC (permalink / raw)
To: glibc-bugs-regex
http://sourceware.org/bugzilla/show_bug.cgi?id=13637
--- Comment #1 from Stanislav Brabec <sbrabec at suse dot cz> 2012-02-10 19:20:28 UTC ---
Created attachment 6207
--> http://sourceware.org/bugzilla/attachment.cgi?id=6207
glibc-regex-incomplete-char.patch
Proposed fix. There is also another bug in sed that triggers an infinite loop.
Description:
In re_search_internal(), case 6 of switch (match_kind) finds a possible
match. For our false match, the check meant to reject matches that do not
respect multi-byte character boundaries fails, and match_regex() returns the
index of the false match.
Going deeper, re_search_internal() calls re_string_reconstruct() and that calls
re_string_skip_chars().
re_string_skip_chars() is an I18N-specific function that advances character by
character up to the indexed character; it is multi-byte aware.
In a correct run, it returns the correct index of the next character to
inspect. When the bug occurs, __mbrtowc called from there returns -2
(incomplete multi-byte character). Why? It seems to be caused by remain_len
being equal to 1, even though there are still 6 bytes to inspect
("\267\357a\277\267\275").
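The -2 result can be reproduced in isolation with the standard mbrtowc() interface. The following is only a sketch under the assumption that an EUC-JP locale is installed and already selected; decode_first_char is a hypothetical helper, and its parameter n stands in for remain_len.

```c
#include <assert.h>
#include <locale.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: decode the first multi-byte character of the
   subject while letting mbrtowc() look at only `n` bytes. `n` plays the
   role of remain_len. The caller must already have selected an EUC-JP
   locale with setlocale(). */
size_t decode_first_char(size_t n)
{
    static const char ch[] = "\xb7\xef";   /* \267\357 in the dumps */
    mbstate_t st;
    wchar_t wc;

    assert(n >= 1 && n <= sizeof ch - 1);
    memset(&st, 0, sizeof st);             /* start in the initial state */
    return mbrtowc(&wc, ch, n, &st);
}
```

After a successful setlocale(LC_ALL, "ja_JP.eucJP"), decode_first_char(1) is expected to return (size_t) -2 because only the lead byte is visible, while decode_first_char(2) is expected to return 2.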
I believe that remain_len is computed incorrectly:
sed-4.2.1/lib/regex_internal.c:502 re_string_skip_chars()
remain_len = pstr->len - rawbuf_idx;
pstr->len seems to be the length of the remaining part of the string, and
rawbuf_idx is the index of the remaining part of the string in the original
(raw) string.
I am not very familiar with the code, but I believe that the expression should
be:
remain_len = pstr->raw_len - rawbuf_idx;
Example:
Stopped in the first iteration of re_string_skip_chars().
Correct case (two leading "a" characters):
rawbuf_idx = 5
*pstr = {
raw_mbs = 0x6479b0 "aa\267\357a\277\267\275", <incomplete sequence \350>,
mbs = 0x6479b2 "\267\357a\277\267\275", <incomplete sequence \350>,
wcs = 0x648190, offsets = 0x0, cur_state = {__count = 0, __value = {
__wch = 0, __wchb = "\000\000\000"}}, raw_mbs_idx = 2,
valid_len = 0, valid_raw_len = 3, bufs_len = 4, cur_idx = 2,
raw_len = 9, len = 7, raw_stop = 9, stop = 7, tip_context = 0,
trans = 0x0, word_char = 0x647d88, icase = 0 '\000',
is_utf8 = 0 '\000', map_notascii = 0 '\000', mbs_allocated = 0 '\000',
offsets_needed = 0 '\000', newline_anchor = 0 '\000',
word_ops_used = 0 '\000', mb_cur_max = 3}
Buggy case (three leading "a" characters):
rawbuf_idx = 6
*pstr = {
raw_mbs = 0x6479b0 "aaa\267\357a\277\267\275", <incomplete sequence \350>,
mbs = 0x6479b3 "\267\357a\277\267\275", <incomplete sequence \350>,
wcs = 0x648190, offsets = 0x0, cur_state = {__count = 0, __value = {
__wch = 0, __wchb = "\000\000\000"}}, raw_mbs_idx = 3,
valid_len = 0, valid_raw_len = 3, bufs_len = 4, cur_idx = 2,
raw_len = 10, len = 7, raw_stop = 10, stop = 7, tip_context = 0,
trans = 0x0, word_char = 0x647d88, icase = 0 '\000',
is_utf8 = 0 '\000', map_notascii = 0 '\000', mbs_allocated = 0 '\000',
offsets_needed = 0 '\000', newline_anchor = 0 '\000',
word_ops_used = 0 '\000', mb_cur_max = 3}
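Plugging the values from the two dumps into both expressions shows the difference. This is only an arithmetic sketch: struct lens is a hypothetical stand-in that carries just the two fields involved, and plain int is used instead of the real field types.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for the two re_string_t fields used here. */
struct lens {
    int raw_len;   /* raw_len from the dump */
    int len;       /* len from the dump */
};

/* The expression currently in re_string_skip_chars(). */
int remain_len_current(const struct lens *p, int rawbuf_idx)
{
    return p->len - rawbuf_idx;
}

/* The expression proposed in this comment. */
int remain_len_proposed(const struct lens *p, int rawbuf_idx)
{
    return p->raw_len - rawbuf_idx;
}

/* Print both results for one dump. */
void print_case(const char *label, const struct lens *p, int rawbuf_idx)
{
    assert(rawbuf_idx >= 0 && rawbuf_idx <= p->raw_len);
    printf("%s: current = %d, proposed = %d\n", label,
           remain_len_current(p, rawbuf_idx),
           remain_len_proposed(p, rawbuf_idx));
}
```

With the buggy dump (raw_len = 10, len = 7, rawbuf_idx = 6) the current expression gives 1 and the proposed one gives 4. With the correct dump (raw_len = 9, len = 7, rawbuf_idx = 5) they give 2 and 4, so in that case the current expression still leaves room for a two-byte character.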
If my observation is correct, the bug is not EUC-JP specific.
Bug triggers:
- The charset must allow a false match across the boundary of two
characters. EUC-JP meets this requirement; UTF-8 probably does not.
- There is a match when the string is compared byte by byte (as in ASCII) that
is a false match in the locale-specific charset.
- This false match must appear at an exact position, about two thirds of the
way into the string.
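The first two triggers can be checked by hand for the string from the original report. The sketch below assumes a simplified EUC-JP model in which every byte in the range 0xA1..0xFE starts a two-byte character and every other byte is a character of its own (the SS2/SS3 prefixes are ignored); the function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified EUC-JP model (an assumption, see above). */
size_t eucjp_char_len(unsigned char b)
{
    return (b >= 0xA1 && b <= 0xFE) ? 2 : 1;
}

/* Return 1 if byte offset `off` is the start of a character in
   s[0..len), 0 if it falls inside a two-byte character. */
int is_char_boundary(const unsigned char *s, size_t len, size_t off)
{
    size_t i = 0;

    assert(off <= len);
    while (i < off)
        i += eucjp_char_len(s[i]);
    return i == off;
}

/* Byte-wise search: offset of the first occurrence of pat, or -1. */
long find_bytes(const unsigned char *s, size_t len,
                const unsigned char *pat, size_t plen)
{
    for (size_t i = 0; i + plen <= len; i++)
        if (memcmp(s + i, pat, plen) == 0)
            return (long) i;
    return -1;
}
```

For the 10-byte subject a a a b7 ef a bf b7 bd e8 and the pattern b7 bd, find_bytes() reports offset 7, and is_char_boundary() reports that offset 7 lies inside the character bf b7, so the byte-wise hit is a false match at the character level.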
* [Bug regex/13637] incorrect match in multi-byte (non-UTF8) string
From: sbrabec at suse dot cz @ 2012-02-10 19:22 UTC (permalink / raw)
To: glibc-bugs-regex
http://sourceware.org/bugzilla/show_bug.cgi?id=13637
Stanislav Brabec <sbrabec at suse dot cz> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |sbrabec at suse dot cz
* [Bug regex/13637] incorrect match in multi-byte (non-UTF8) string
From: sbrabec at suse dot cz @ 2012-02-24 20:25 UTC (permalink / raw)
To: glibc-bugs-regex
http://sourceware.org/bugzilla/show_bug.cgi?id=13637
--- Comment #2 from Stanislav Brabec <sbrabec at suse dot cz> 2012-02-24 20:24:38 UTC ---
Cross references:
libc-alpha mailing list:
http://sourceware.org/ml/libc-alpha/2012-02/msg00206.html
http://sourceware.org/ml/libc-alpha/2012-02/msg00267.html
sed+grep:
sed: http://lists.gnu.org/archive/html/bug-gnu-utils/2012-02/msg00016.html
sed testcase:
http://lists.gnu.org/archive/html/bug-gnu-utils/2012-02/msg00017.html
grep: http://lists.gnu.org/archive/html/bug-gnu-utils/2012-02/msg00018.html
* [Bug regex/13637] incorrect match in multi-byte (non-UTF8) string
From: sbrabec at suse dot cz @ 2012-02-24 20:27 UTC (permalink / raw)
To: glibc-bugs-regex
http://sourceware.org/bugzilla/show_bug.cgi?id=13637
Stanislav Brabec <sbrabec at suse dot cz> changed:
What |Removed |Added
----------------------------------------------------------------------------
Attachment #6207|0 |1
is obsolete| |
--- Comment #3 from Stanislav Brabec <sbrabec at suse dot cz> 2012-02-24 20:26:54 UTC ---
Created attachment 6244
--> http://sourceware.org/bugzilla/attachment.cgi?id=6244
glibc-regex-incomplete-char.patch
New patch with fixed ChangeLog entry and testcase.
* [Bug regex/13637] incorrect match in multi-byte (non-UTF8) string
From: sbrabec at suse dot cz @ 2012-02-27 15:14 UTC (permalink / raw)
To: glibc-bugs-regex
http://sourceware.org/bugzilla/show_bug.cgi?id=13637
Stanislav Brabec <sbrabec at suse dot cz> changed:
What |Removed |Added
----------------------------------------------------------------------------
Attachment #6244|0 |1
is obsolete| |
--- Comment #4 from Stanislav Brabec <sbrabec at suse dot cz> 2012-02-27 15:13:21 UTC ---
Created attachment 6252
--> http://sourceware.org/bugzilla/attachment.cgi?id=6252
glibc-regex-incomplete-char.patch
Updated patch.
Changes since comment 3:
- Testcase uses test-skeleton.c.
- Uses SBC_MAX and includes "regex_internal.h".
- Set up fastmap before the call to re_compile_pattern.
- bug-regex33.6 comment updated: There is one true and one false match.
* [Bug regex/13637] incorrect match in multi-byte (non-UTF8) string
From: sbrabec at suse dot cz @ 2012-02-28 16:43 UTC (permalink / raw)
To: glibc-bugs-regex
http://sourceware.org/bugzilla/show_bug.cgi?id=13637
Stanislav Brabec <sbrabec at suse dot cz> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |RESOLVED
Resolution| |FIXED
--- Comment #5 from Stanislav Brabec <sbrabec at suse dot cz> 2012-02-28 16:42:34 UTC ---
Fixed by commit 71b5d1c (the patch from comment 4).