public inbox for libc-hacker@sourceware.org
From: Jakub Jelinek <jakub@redhat.com>
To: Isamu Hasegawa <isamu@yamato.ibm.com>,
	Ulrich Drepper <drepper@redhat.com>,
	Roland McGrath <roland@redhat.com>
Cc: Glibc hackers <libc-hacker@sources.redhat.com>
Subject: re_string bugs
Date: Wed, 06 Nov 2002 08:45:00 -0000
Message-ID: <20021106174459.N3451@sunsite.ms.mff.cuni.cz>

Hi!

There is at least one more use of uninitialized data, which may even crash:
tip_context handling.
It can be seen e.g. on Daniel's testcase:

#include <sys/types.h>
#include <regex.h>

int main()
{
  regex_t reg;
  regmatch_t pm[1];
  regcomp (&reg, "man", REG_ICASE);
  return regexec (&reg, "pipenightdreams", 1, pm, 0);
}

Here, re_search_internal calls re_string_allocate with len = 15 and
init_len = 5.
The loop in re_search_internal (with or without my patch from today) then
skips everything until "ms" at the end, so match_first is 13 and
re_string_reconstruct is called with that offset.
re_string_reconstruct calls:
      pstr->tip_context = re_string_context_at (pstr, offset - 1, eflags,
                                                newline);
but mbs[12] is well beyond pstr->valid_len, and even well beyond
pstr->buf_len, so with bad luck it could crash, and tip_context will
certainly be set incorrectly.
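
To make the failure mode concrete, here is a minimal sketch of a context
lookup that refuses to read past the initialized part of the folded buffer
and falls back to the raw input instead. Everything in it (struct toy_string,
the TOY_CONTEXT_* values, toy_context_at) is a hypothetical stand-in and not
the real re_string_t or re_string_context_at. It also assumes a single-byte
locale with no translate table, so that the raw byte classifies the same way
as the folded one.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for re_string_t.  */
struct toy_string
{
  const unsigned char *raw;   /* the caller's input, always readable */
  size_t raw_len;             /* length of raw */
  const unsigned char *mbs;   /* case-folded copy, only partly filled in */
  size_t valid_len;           /* number of bytes of mbs that are initialized */
};

enum toy_context
{
  TOY_CONTEXT_NONE,
  TOY_CONTEXT_WORD,
  TOY_CONTEXT_NEWLINE,
  TOY_CONTEXT_END             /* idx is outside the input altogether */
};

static enum toy_context
toy_context_at (const struct toy_string *s, size_t idx)
{
  unsigned char c;

  if (idx >= s->raw_len)
    return TOY_CONTEXT_END;

  /* Only trust mbs[] below valid_len; beyond that it is uninitialized
     (or not even allocated), so read the raw input instead.  */
  c = idx < s->valid_len ? s->mbs[idx] : s->raw[idx];

  if (isalnum (c) || c == '_')
    return TOY_CONTEXT_WORD;
  if (c == '\n')
    return TOY_CONTEXT_NEWLINE;
  return TOY_CONTEXT_NONE;
}
```

With the numbers from the testcase (15 bytes of input, 5 valid bytes, lookup
at index 12) this reads raw[12] rather than mbs[12].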
This works only in regexec style searching (i.e. start 0, range positive)
and matching if MBS ICASE or MBS translate and input_len for pstr is
bigger than MB_CUR_MAX, or if mbs points into raw_mbs (i.e. non-MBS,
no ICASE, no translate).
Backward searching, or increasing offset by more than buf_len, is broken.
For backward searching, I'm afraid we need to check the last MB_CUR_MAX
chars before raw_mbs + raw_mbs_idx and see what the last multibyte char is.
For UTF-8 this is trivial, just search backwards for the first byte with the
top bit clear, but for other charsets it may be more difficult.
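
As a rough illustration of the UTF-8 case, the sketch below walks backwards
from a byte offset to the start of the preceding character. The function name
utf8_prev_char_start is made up for this example, and the code assumes the
buffer holds valid UTF-8. It stops at any byte that is not a continuation
byte (10xxxxxx), which covers both ASCII bytes (top bit clear) and the lead
bytes of multibyte sequences.

```c
#include <stddef.h>

/* Return the index of the first byte of the UTF-8 character that ends
   just before buf[idx].  Returns 0 when idx is 0.  Assumes valid UTF-8;
   hypothetical helper, not part of the regex code.  */
static size_t
utf8_prev_char_start (const unsigned char *buf, size_t idx)
{
  while (idx > 0)
    {
      --idx;
      /* Continuation bytes look like 10xxxxxx; anything else starts
         a character.  */
      if ((buf[idx] & 0xC0) != 0x80)
        break;
    }
  return idx;
}
```

No such shortcut exists for stateful or non-self-synchronizing charsets,
which is why the general case needs a different approach.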

Another thing I'm not sure about is the re_string_context_at implementation
for MBS. Assuming all supported MBS locales have newline as the single byte
'\n', there is IMHO a problem with
#define IS_WORD_CHAR(ch) (isalnum (ch) || (ch) == '_')
  c = re_string_byte_at (input, idx);
  if (IS_WORD_CHAR (c))
    return CONTEXT_WORD;
Shouldn't this use re_string_wchar_at and iswalnum for MBS locales?
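
A wide-character variant of the check could look roughly like the sketch
below. The names is_word_wchar and is_word_mbchar are invented for this
example. The decoding step uses plain mbrtowc rather than
re_string_wchar_at, since the latter is internal to the regex code, and it
relies on whatever locale the caller has already set.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Wide-character counterpart of IS_WORD_CHAR (hypothetical name).  */
static int
is_word_wchar (wint_t wc)
{
  return iswalnum (wc) || wc == L'_';
}

/* Decode the multibyte character starting at s (at most n bytes) in the
   current locale and classify it.  Returns 1 for a word character, 0 for
   a non-word character, -1 for an invalid or incomplete sequence.  */
static int
is_word_mbchar (const char *s, size_t n)
{
  mbstate_t st;
  wchar_t wc;
  size_t r;

  memset (&st, 0, sizeof st);   /* start in the initial shift state */
  r = mbrtowc (&wc, s, n, &st);
  if (r == (size_t) -1 || r == (size_t) -2)
    return -1;
  return is_word_wchar ((wint_t) wc) ? 1 : 0;
}
```

The byte-wise isalnum check classifies each byte of a multibyte character on
its own, whereas this version classifies the decoded character as a whole.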

	Jakub
