[Bug regex/13637] incorrect match in multi-byte (non-UTF8) string

From: "sbrabec at suse dot cz" <sourceware-bugzilla@sourceware.org>
To: glibc-bugs-regex@sources.redhat.com
Subject: [Bug regex/13637] incorrect match in multi-byte (non-UTF8) string
Date: Fri, 10 Feb 2012 19:22:00 -0000
Message-ID: <bug-13637-132-Dn1N2HLh55@http.sourceware.org/bugzilla/>
In-Reply-To: <bug-13637-132@http.sourceware.org/bugzilla/>

http://sourceware.org/bugzilla/show_bug.cgi?id=13637

--- Comment #1 from Stanislav Brabec <sbrabec at suse dot cz> 2012-02-10 19:20:28 UTC ---
Created attachment 6207
  --> http://sourceware.org/bugzilla/attachment.cgi?id=6207
glibc-regex-incomplete-char.patch

Proposed fix. Note that there is another bug in sed that triggers an infinite
loop.

Description:

In re_search_internal(), case 6 of switch (match_kind) finds a possible
match. For our false match, the check that should reject a match not aligned
to multi-byte character boundaries fails, and match_regex() returns the index
of that false match.

Going deeper, re_search_internal() calls re_string_reconstruct() and that calls
re_string_skip_chars().

re_string_skip_chars() is an I18N-specific function that advances character by
character up to the indexed character. It is multi-byte aware.
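To make the idea of "jumping by characters" concrete, here is a minimal sketch of such a loop built on the standard mbrtowc(). It is not the glibc implementation: the function name skip_chars and its parameters are hypothetical, and treating undecodable bytes as one-byte characters is an assumption of this sketch.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: starting at raw offset `start`, advance one
   character at a time until the offset reaches or passes `target`
   (or the end of the buffer). Returns the resulting offset.
   Bytes that mbrtowc() reports as invalid, incomplete, or NUL are
   stepped over as one-byte characters (an assumption of this sketch). */
size_t skip_chars(const char *raw, size_t raw_len, size_t start, size_t target)
{
    mbstate_t st;
    size_t idx = start;

    memset(&st, 0, sizeof st);          /* start in the initial shift state */
    while (idx < target && idx < raw_len) {
        mbstate_t prev = st;            /* saved so a failed decode can be undone */
        wchar_t wc;
        size_t n = mbrtowc(&wc, raw + idx, raw_len - idx, &st);

        if (n == (size_t) -1 || n == (size_t) -2 || n == 0) {
            n = 1;                      /* step over a single byte */
            st = prev;
        }
        idx += n;
    }
    return idx;
}
```

The third argument of mbrtowc() is the number of bytes it may look at. The sketch passes the distance to the end of the raw buffer, which is the quantity the rest of this comment is concerned with.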

In a correct run, it returns the correct index of the next character to
inspect. When the bug occurs, __mbrtowc called from there returns -2
(incomplete multi-byte character). Why? It seems to be caused by remain_len
being equal to 1, even though there are still 6 bytes to inspect
("\267\357a\277\267\275").
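The effect of a too-small length argument can be reproduced in isolation. The sketch below uses a UTF-8 locale instead of EUC-JP, purely because UTF-8 locales are more commonly installed; the locale names tried and the function names are assumptions of this example, not part of the bug report.

```c
#include <locale.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Decode one character from s, looking at no more than n bytes.
   Returns what mbrtowc() returns: the byte count on success,
   (size_t) -2 if the n bytes are only a prefix of a longer character,
   (size_t) -1 if they are invalid. */
static size_t decode_one(const char *s, size_t n)
{
    mbstate_t st;
    wchar_t wc;

    memset(&st, 0, sizeof st);          /* fresh conversion state */
    return mbrtowc(&wc, s, n, &st);
}

/* Returns 1 if a two-byte character decodes with n = 2 but yields
   (size_t) -2 with n = 1, 0 if not, and -1 if none of the UTF-8
   locale names tried below is available on this system. */
int truncated_length_demo(void)
{
    static const char *const names[] = { "C.UTF-8", "en_US.UTF-8", "en_US.utf8" };
    const char e_acute[] = "\xc3\xa9";  /* U+00E9 encoded in UTF-8 */
    size_t i;
    int have_locale = 0;

    for (i = 0; i < sizeof names / sizeof names[0]; i++) {
        if (setlocale(LC_CTYPE, names[i]) != NULL) {
            have_locale = 1;
            break;
        }
    }
    if (!have_locale)
        return -1;

    return decode_one(e_acute, 1) == (size_t) -2
        && decode_one(e_acute, 2) == 2;
}
```

The bytes are the same in both calls; only the length argument differs. That is consistent with the observation above that a remain_len of 1 is enough to make a complete character look incomplete.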

I believe that remain_len is computed incorrectly:

sed-4.2.1/lib/regex_internal.c:502 re_string_skip_chars()

      remain_len = pstr->len - rawbuf_idx;

pstr->len seems to be the length of the remaining part of the string, while
rawbuf_idx is the index of that remaining part within the original (raw)
string.

I am not very familiar with the code, but I believe that the expression should
be:

      remain_len = pstr->raw_len - rawbuf_idx;


Example:

Stopped in the first iteration of re_string_skip_chars():

Correct case (two leading "a" characters):
rawbuf_idx = 5
*pstr = {
  raw_mbs = 0x6479b0 "aa\267\357a\277\267\275", <incomplete sequence \350>,
  mbs = 0x6479b2 "\267\357a\277\267\275", <incomplete sequence \350>,
  wcs = 0x648190, offsets = 0x0, cur_state = {__count = 0, __value = {
      __wch = 0, __wchb = "\000\000\000"}}, raw_mbs_idx = 2, 
  valid_len = 0, valid_raw_len = 3, bufs_len = 4, cur_idx = 2, 
  raw_len = 9, len = 7, raw_stop = 9, stop = 7, tip_context = 0, 
  trans = 0x0, word_char = 0x647d88, icase = 0 '\000', 
  is_utf8 = 0 '\000', map_notascii = 0 '\000', mbs_allocated = 0 '\000', 
  offsets_needed = 0 '\000', newline_anchor = 0 '\000', 
  word_ops_used = 0 '\000', mb_cur_max = 3}

Buggy case (three leading "a" characters):
rawbuf_idx = 6
*pstr = {
  raw_mbs = 0x6479b0 "aaa\267\357a\277\267\275", <incomplete sequence \350>,
  mbs = 0x6479b3 "\267\357a\277\267\275", <incomplete sequence \350>,
  wcs = 0x648190, offsets = 0x0, cur_state = {__count = 0, __value = {
      __wch = 0, __wchb = "\000\000\000"}}, raw_mbs_idx = 3, 
  valid_len = 0, valid_raw_len = 3, bufs_len = 4, cur_idx = 2, 
  raw_len = 10, len = 7, raw_stop = 10, stop = 7, tip_context = 0, 
  trans = 0x0, word_char = 0x647d88, icase = 0 '\000', 
  is_utf8 = 0 '\000', map_notascii = 0 '\000', mbs_allocated = 0 '\000', 
  offsets_needed = 0 '\000', newline_anchor = 0 '\000', 
  word_ops_used = 0 '\000', mb_cur_max = 3}
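Plugging the values from the two dumps into both expressions gives a quick cross-check. The struct below is a hypothetical, cut-down stand-in for the two fields of *pstr that matter here, and the function names are made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in holding only the two length fields of *pstr. */
struct str_lens {
    size_t raw_len;  /* raw_len from the dump */
    size_t len;      /* len from the dump */
};

/* Expression quoted from regex_internal.c:502. */
size_t remain_len_current(const struct str_lens *s, size_t rawbuf_idx)
{
    assert(rawbuf_idx <= s->len);
    return s->len - rawbuf_idx;
}

/* Expression proposed in this comment. */
size_t remain_len_proposed(const struct str_lens *s, size_t rawbuf_idx)
{
    assert(rawbuf_idx <= s->raw_len);
    return s->raw_len - rawbuf_idx;
}
```

With the dump values, the current expression yields 7 - 5 = 2 in the correct case and 7 - 6 = 1 in the buggy case, which matches the remain_len of 1 observed above. The proposed expression yields 9 - 5 = 4 and 10 - 6 = 4, so it no longer depends on the number of leading "a" characters.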


If my observation is correct, the bug is not EUC-JP specific.

Bug triggers:
- The charset must allow a false match across the boundary of two
characters. EUC-JP fits this requirement, UTF-8 probably does not.
- There is a true match at the byte (ASCII) level that is a false match in the
locale-specific charset.
- This false match must appear at a specific position, about two thirds of the
way into the string.
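The first trigger can be illustrated without any locale support. The sketch below compares a byte-wise search with a search that only tests offsets where a character begins. It assumes a simplified EUC-JP model in which bytes 0xA1..0xFE start a two-byte character (the 0x8E and 0x8F prefixes are ignored), and the two-byte pattern used in the usage note is hypothetical; the pattern from the original report is not shown in this comment.

```c
#include <stddef.h>
#include <string.h>

/* Simplified EUC-JP model (assumption): bytes 0xA1..0xFE start a two-byte
   character, every other byte is a one-byte character. */
static size_t eucjp_char_len(const unsigned char *p, size_t remaining)
{
    if (*p >= 0xA1 && *p <= 0xFE && remaining >= 2)
        return 2;
    return 1;
}

/* Byte-wise search: offset of the first occurrence of pat, or -1. */
long find_bytes(const char *hay, size_t hay_len, const char *pat, size_t pat_len)
{
    size_t i;

    if (pat_len == 0 || pat_len > hay_len)
        return -1;
    for (i = 0; i + pat_len <= hay_len; i++)
        if (memcmp(hay + i, pat, pat_len) == 0)
            return (long) i;
    return -1;
}

/* Character-aware search: compares only at offsets where a character starts. */
long find_chars(const char *hay, size_t hay_len, const char *pat, size_t pat_len)
{
    size_t i = 0;

    if (pat_len == 0)
        return -1;
    while (i + pat_len <= hay_len) {
        if (memcmp(hay + i, pat, pat_len) == 0)
            return (long) i;
        i += eucjp_char_len((const unsigned char *) hay + i, hay_len - i);
    }
    return -1;
}
```

For the 7-byte string "\267\357a\277\267\275\350" from the dumps and the hypothetical pattern "\267\275", find_bytes() reports offset 4, which lies in the middle of the character "\277\267", while find_chars() reports no match. UTF-8 avoids this class of false match because its continuation bytes never overlap with lead bytes.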
