From mboxrd@z Thu Jan 1 00:00:00 1970
Received: (qmail 4779 invoked by alias); 6 Nov 2002 16:45:12 -0000
Mailing-List: contact libc-hacker-help@sources.redhat.com; run by ezmlm
Precedence: bulk
Sender: libc-hacker-owner@sources.redhat.com
Received: (qmail 4761 invoked from network); 6 Nov 2002 16:45:10 -0000
Received: from unknown (HELO sunsite.mff.cuni.cz) (195.113.19.66) by sources.redhat.com with SMTP; 6 Nov 2002 16:45:10 -0000
Received: (from jakub@localhost) by sunsite.mff.cuni.cz (8.11.6/8.11.6) id gA6GixO21197; Wed, 6 Nov 2002 17:44:59 +0100
Date: Wed, 06 Nov 2002 08:45:00 -0000
From: Jakub Jelinek
To: Isamu Hasegawa, Ulrich Drepper, Roland McGrath
Cc: Glibc hackers
Subject: re_string bugs
Message-ID: <20021106174459.N3451@sunsite.ms.mff.cuni.cz>
Reply-To: Jakub Jelinek
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.2.5.1i
X-SW-Source: 2002-11/txt/msg00015.txt.bz2

Hi!

There is at least one more use of uninitialized data, which may even
crash: tip_context handling. It can be seen e.g. on Daniel's testcase:

  #include <sys/types.h>
  #include <regex.h>

  int main (void)
  {
    regex_t reg;
    regmatch_t pm[1];
    regcomp (&reg, "man", REG_ICASE);
    return regexec (&reg, "pipenightdreams", 1, pm, 0);
  }

Here, re_search_internal calls re_string_allocate with len = 15 and
init_len = 5. Then the loop in it (it doesn't matter whether with or
without my patch from today) skips everything until "ms" at the end,
thus match_first is 13 and re_string_reconstruct is called on it.
re_string_reconstruct calls:

  pstr->tip_context = re_string_context_at (pstr, offset - 1, eflags,
                                            newline);

but mbs[12] is well beyond pstr->valid_len, and even well beyond
pstr->buf_len, so if unlucky it could crash; in any case tip_context
will be set incorrectly.

This works only in regexec style searching (i.e. start 0, range
positive) and matching if MBS ICASE or MBS translate and input_len for
pstr is bigger than MB_CUR_MAX, or if mbs points into raw_mbs (i.e.
non-MBS no-ICASE no translate). Backward searching or increasing offset
by more than buf_len is broken.

For backward searching, I'm afraid we need to check the last MB_CUR_MAX
chars before raw_mbs + raw_mbs_idx and see what the last multibyte char
is. For UTF-8 this is trivial, just search backwards for the first byte
with top bit clear, but for other charsets it may be more difficult.

Another thing I'm not sure about is the re_string_context_at
implementation if MBS: assuming all supported MBS locales have newline
as the single byte '\n', there is IMHO a problem with

  #define IS_WORD_CHAR(ch) (isalnum (ch) || (ch) == '_')

  c = re_string_byte_at (input, idx);
  if (IS_WORD_CHAR (c))
    return CONTEXT_WORD;

Shouldn't this use re_string_wchar_at and iswalnum for MBS locales?

	Jakub
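
A minimal sketch of the UTF-8 case mentioned above for backward searching, assuming the bytes before the position are valid UTF-8. The name utf8_prev_char_start is hypothetical and is not a function in glibc's regex code. The sketch steps back over continuation bytes (10xxxxxx) rather than looking for a byte with the top bit clear, because the lead byte of a multibyte sequence also has its top bit set.

```c
#include <assert.h>
#include <stddef.h>

/* Longest UTF-8 sequence allowed by RFC 3629.  */
#define UTF8_MAX_LEN 4

/* Hypothetical helper, not part of glibc's regex code.
   Return the index of the first byte of the UTF-8 character that ends
   just before buf[idx], or 0 when idx is 0.
   Assumption: buf[0..idx) holds valid UTF-8.  */
static size_t
utf8_prev_char_start (const unsigned char *buf, size_t idx)
{
  assert (buf != NULL);

  if (idx == 0)
    return 0;

  size_t i = idx - 1;

  /* Continuation bytes look like 10xxxxxx.  Step back over them, but
     never further than one maximal sequence and never before buf[0].  */
  while (i > 0 && idx - i < UTF8_MAX_LEN && (buf[i] & 0xC0) == 0x80)
    --i;

  return i;
}
```

With raw_mbs as buf and raw_mbs_idx as idx, the returned index would mark where the preceding character starts, which is the character whose context tip_context has to describe. Other multibyte charsets are not self-synchronizing in this way, so this shortcut would not carry over to them.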