From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (qmail 20612 invoked by alias); 6 Nov 2002 16:13:38 -0000 Mailing-List: contact libc-hacker-help@sources.redhat.com; run by ezmlm Precedence: bulk List-Subscribe: List-Archive: List-Post: List-Help: , Sender: libc-hacker-owner@sources.redhat.com Received: (qmail 20595 invoked from network); 6 Nov 2002 16:13:36 -0000 Received: from unknown (HELO sunsite.mff.cuni.cz) (195.113.19.66) by sources.redhat.com with SMTP; 6 Nov 2002 16:13:36 -0000 Received: (from jakub@localhost) by sunsite.mff.cuni.cz (8.11.6/8.11.6) id gA6GD6e17289; Wed, 6 Nov 2002 17:13:06 +0100 Date: Wed, 06 Nov 2002 08:13:00 -0000 From: Jakub Jelinek To: Ulrich Drepper , Roland McGrath , Isamu Hasegawa Cc: Glibc hackers Subject: [PATCH] One more can_be_null fix, optimize regexec/re_search Message-ID: <20021106171306.M3451@sunsite.ms.mff.cuni.cz> Reply-To: Jakub Jelinek Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit User-Agent: Mutt/1.2.5.1i X-SW-Source: 2002-11/txt/msg00014.txt.bz2 Hi! Sorry for sending two things in one patch. The first part is making sure the whole regex_t structure is initialized. Although Uli yesterday fixed regcomp clearing can_be_null, re_compile_pattern doesn't initialize it either (and it is listed as regex implementation private variable) and later on re_search_internal uses it uninitialized. Although regs_allocated seems to be only used in the GNU variants, I think it is better to initialize it everywhere. The second part is optimization - the re_search_internal loop to find possible start of matching is very slow, so I've tried to speed it up. Here is how it speeds up tst-regex: Test "[äáâéíîöóôüú]" with 8-bit locale match 1: " Reported by Göran Uddeborg ." match 2: " typo. Reported by Göran Uddeborg ." -elapsed time: 0.029430392 sec +elapsed time: 0.010027121 sec Test "[äáâéíîöóôüú]" with multi-byte locale match 1: " Reported by GA¶ran Uddeborg ." match 2: " typo. Reported by GA¶ran Uddeborg ." -elapsed time: 0.119433648 sec +elapsed time: 0.099460593 sec Test "G.ran" with 8-bit locale match 1: " Reported by Göran Uddeborg ." match 2: " typo. Reported by Göran Uddeborg ." -elapsed time: 0.030021277 sec +elapsed time: 0.009890545 sec Test "G.ran" with multi-byte locale match 1: " Reported by GA¶ran Uddeborg ." match 2: " typo. Reported by GA¶ran Uddeborg ." -elapsed time: 0.163905369 sec +elapsed time: 0.143427098 sec Test "G.\{1\}ran" with 8-bit locale match 1: " Reported by Göran Uddeborg ." match 2: " typo. Reported by Göran Uddeborg ." -elapsed time: 0.029824000 sec +elapsed time: 0.010527323 sec Test "G.\{1\}ran" with multi-byte locale match 1: " Reported by GA¶ran Uddeborg ." match 2: " typo. Reported by GA¶ran Uddeborg ." -elapsed time: 0.164186502 sec +elapsed time: 0.143455197 sec Test "G.*ran" with 8-bit locale match 1: " Generalized by allowing file name transformation. Do not" match 2: " Reported by Göran Uddeborg ." match 3: " typo. Reported by Göran Uddeborg ." -elapsed time: 0.030943234 sec +elapsed time: 0.011773340 sec Test "G.*ran" with multi-byte locale match 1: " Generalized by allowing file name transformation. Do not" match 2: " Reported by GA¶ran Uddeborg ." match 3: " typo. Reported by GA¶ran Uddeborg ." -elapsed time: 0.171320491 sec +elapsed time: 0.152224724 sec Test "[äáâ]" with 8-bit locale -elapsed time: 0.027261718 sec +elapsed time: 0.008808948 sec Test "[äáâ]" with multi-byte locale -elapsed time: 0.116793876 sec +elapsed time: 0.098257414 sec I've measured also: time sed.glibc.regex -n 's/Jakub/jakub/p' /usr/src/libc/ChangeLog.13 > /dev/null which went from 0.051s down to 0.029s on average to see whether it matters for real-world apps too. Passes glibc make check and sed with glibc regex make check. But I think we need more testcases, e.g. for re_search searching backwards, can_be_null matching at the end of string, etc. Also, type == COMPLEX_BRACKET in re_compile_fastmap_iter was IMHO redundant, it was something like: ... #ifdef RE_ENABLE_I18N else if (type == COMPLEX_BRACKET) { ... } #endif /* RE_ENABLE_I18N */ else if (type == END_OF_RE || type == OP_PERIOD #ifdef RE_ENABLE_I18N || type == COMPLEX_BRACKET #endif /* RE_ENABLE_I18N */ ) ... ie. type cannot be COMPLEX_BRACKET in the second if. Also, I think re_string_reconstruct was not called when doing a MBS ICASE/translate search backwards. 2002-11-06 Jakub Jelinek * posix/regcomp.c (re_compile_pattern): Don't set regs_allocated here. (regcomp): Don't set can_be_null here. (re_comp): Clear whole re_comp_buf with the exception of fastmap. (re_compile_internal): Clear can_be_null, set regs_allocated. * posix/regcomp.c (re_set_fastmap): New function. (re_compile_fastmap_iter): Use it. Remove redundant type == COMPLEX_BRACKET check. * posix/regexec.c (re_search_internal): Optimize searching with fastmap. Call re_string_reconstruct even if match_first is smaller than raw_mbs_idx. --- libc/posix/regcomp.c.jj 2002-11-05 23:09:17.000000000 +0100 +++ libc/posix/regcomp.c 2002-11-06 14:01:37.000000000 +0100 @@ -267,10 +267,6 @@ re_compile_pattern (pattern, length, buf { reg_errcode_t ret; - /* GNU code is written to assume at least RE_NREGS registers will be set - (and at least one extra will be -1). */ - bufp->regs_allocated = REGS_UNALLOCATED; - /* And GNU code determines whether or not to get register information by passing null for the REGS argument to re_match, etc., not by setting no_sub. */ @@ -339,6 +335,16 @@ re_compile_fastmap (bufp) weak_alias (__re_compile_fastmap, re_compile_fastmap) #endif +static inline void +re_set_fastmap (fastmap, icase, ch) + char *fastmap; + int icase, ch; +{ + fastmap[ch] = 1; + if (icase) + fastmap[tolower (ch)] = 1; +} + /* Helper function for re_compile_fastmap. Compile fastmap for the initial_state INIT_STATE. */ @@ -349,21 +355,21 @@ re_compile_fastmap_iter (bufp, init_stat char *fastmap; { re_dfa_t *dfa = (re_dfa_t *) bufp->buffer; - int node_cnt; + int node_cnt, icase = (MB_CUR_MAX == 1 && (bufp->syntax & RE_ICASE)); for (node_cnt = 0; node_cnt < init_state->nodes.nelem; ++node_cnt) { int node = init_state->nodes.elems[node_cnt]; re_token_type_t type = dfa->nodes[node].type; if (type == CHARACTER) - fastmap[dfa->nodes[node].opr.c] = 1; + re_set_fastmap (fastmap, icase, dfa->nodes[node].opr.c); else if (type == SIMPLE_BRACKET) { int i, j, ch; for (i = 0, ch = 0; i < BITSET_UINTS; ++i) for (j = 0; j < UINT_BITS; ++j, ++ch) if (dfa->nodes[node].opr.sbcset[i] & (1 << j)) - fastmap[ch] = 1; + re_set_fastmap (fastmap, icase, ch); } #ifdef RE_ENABLE_I18N else if (type == COMPLEX_BRACKET) @@ -388,28 +394,24 @@ re_compile_fastmap_iter (bufp, init_stat for (i = 0, ch = 0; i < BITSET_UINTS; ++i) for (j = 0; j < UINT_BITS; ++j, ++ch) if (table[ch] < 0) - fastmap[ch] = 1; + re_set_fastmap (fastmap, icase, ch); } # else if (MB_CUR_MAX > 1) for (i = 0; i < SBC_MAX; ++i) if (__btowc (i) == WEOF) - fastmap[i] = 1; + re_set_fastmap (fastmap, icase, i); # endif /* not _LIBC */ } for (i = 0; i < cset->nmbchars; ++i) { char buf[256]; wctomb (buf, cset->mbchars[i]); - fastmap[*(unsigned char *) buf] = 1; + re_set_fastmap (fastmap, icase, *(unsigned char *) buf); } } #endif /* RE_ENABLE_I18N */ - else if (type == END_OF_RE || type == OP_PERIOD -#ifdef RE_ENABLE_I18N - || type == COMPLEX_BRACKET -#endif /* RE_ENABLE_I18N */ - ) + else if (type == END_OF_RE || type == OP_PERIOD) { memset (fastmap, '\1', sizeof (char) * SBC_MAX); if (type == END_OF_RE) @@ -468,7 +470,6 @@ regcomp (preg, pattern, cflags) preg->buffer = NULL; preg->allocated = 0; preg->used = 0; - preg->can_be_null = 0; /* Try to allocate space for the fastmap. */ preg->fastmap = re_malloc (char, SBC_MAX); @@ -668,8 +669,7 @@ re_comp (s) fastmap = re_comp_buf.fastmap; re_comp_buf.fastmap = NULL; __regfree (&re_comp_buf); - re_comp_buf.buffer = NULL; - re_comp_buf.allocated = 0; + memset (&re_comp_buf, 0, sizeof (re_comp_buf)); re_comp_buf.fastmap = fastmap; } @@ -726,6 +726,8 @@ re_compile_internal (preg, pattern, leng preg->not_bol = preg->not_eol = 0; preg->used = 0; preg->re_nsub = 0; + preg->can_be_null = 0; + preg->regs_allocated = REGS_UNALLOCATED; /* Initialize the dfa. */ dfa = (re_dfa_t *) preg->buffer; --- libc/posix/regexec.c.jj 2002-11-05 23:10:27.000000000 +0100 +++ libc/posix/regexec.c 2002-11-06 15:52:07.000000000 +0100 @@ -575,9 +575,10 @@ re_search_internal (preg, string, length re_string_t input; int left_lim, right_lim, incr; int fl_longest_match, match_first, match_last = -1; + int fast_translate, sb; re_match_context_t mctx; - char *fastmap = ((preg->fastmap != NULL && preg->fastmap_accurate) - ? preg->fastmap : NULL); + char *fastmap = ((preg->fastmap != NULL && preg->fastmap_accurate + && range && !preg->can_be_null) ? preg->fastmap : NULL); /* Check if the DFA haven't been compiled. */ if (BE (preg->used == 0 || dfa->init_state == NULL @@ -626,65 +627,108 @@ re_search_internal (preg, string, length incr = (range < 0) ? -1 : 1; left_lim = (range < 0) ? start + range : start; right_lim = (range < 0) ? start : start + range; + sb = MB_CUR_MAX == 1; + fast_translate = sb || !(preg->syntax & RE_ICASE || preg->translate); for (;;) { /* At first get the current byte from input string. */ - int ch; - if (MB_CUR_MAX > 1 && (preg->syntax & RE_ICASE || preg->translate)) - { - /* In this case, we can't determin easily the current byte, - since it might be a component byte of a multibyte character. - Then we use the constructed buffer instead. */ - /* If MATCH_FIRST is out of the valid range, reconstruct the - buffers. */ - if (input.raw_mbs_idx + input.valid_len <= match_first) - re_string_reconstruct (&input, match_first, eflags, - preg->newline_anchor); - /* If MATCH_FIRST is out of the buffer, leave it as '\0'. - Note that MATCH_FIRST must not be smaller than 0. */ - ch = ((match_first >= length) ? 0 - : re_string_byte_at (&input, match_first - input.raw_mbs_idx)); - } - else - { - /* We apply translate/conversion manually, since it is trivial - in this case. */ - /* If MATCH_FIRST is out of the buffer, leave it as '\0'. - Note that MATCH_FIRST must not be smaller than 0. */ - ch = (match_first < length) ? (unsigned char)string[match_first] : 0; - /* Apply translation if we need. */ - ch = preg->translate ? preg->translate[ch] : ch; - /* In case of case insensitive mode, convert to upper case. */ - ch = ((preg->syntax & RE_ICASE) && islower (ch)) ? toupper (ch) : ch; - } - - /* Eliminate inappropriate one by fastmap. */ - if (preg->can_be_null || fastmap == NULL || fastmap[ch]) - { - /* Reconstruct the buffers so that the matcher can assume that - the matching starts from the begining of the buffer. */ - re_string_reconstruct (&input, match_first, eflags, - preg->newline_anchor); + if (fastmap) + { + if (BE (fast_translate, 1)) + { + unsigned RE_TRANSLATE_TYPE t + = (unsigned RE_TRANSLATE_TYPE) preg->translate; + if (BE (range >= 0, 1)) + { + if (BE (t != NULL, 0)) + { + while (BE (match_first < right_lim, 1) + && !fastmap[t[(unsigned char) string[match_first]]]) + ++match_first; + } + else + { + while (BE (match_first < right_lim, 1) + && !fastmap[(unsigned char) string[match_first]]) + ++match_first; + } + if (BE (match_first == right_lim, 0)) + { + int ch = match_first >= length + ? 0 : (unsigned char) string[match_first]; + if (!fastmap[t ? t[ch] : ch]) + break; + } + } + else + { + while (match_first >= left_lim) + { + int ch = match_first >= length + ? 0 : (unsigned char) string[match_first]; + if (fastmap[t ? t[ch] : ch]) + break; + --match_first; + } + if (match_first < left_lim) + break; + } + } + else + { + int ch; + + do + { + /* In this case, we can't determine easily the current byte, + since it might be a component byte of a multibyte + character. Then we use the constructed buffer + instead. */ + /* If MATCH_FIRST is out of the valid range, reconstruct the + buffers. */ + if (input.raw_mbs_idx + input.valid_len <= match_first + || match_first < input.raw_mbs_idx) + re_string_reconstruct (&input, match_first, eflags, + preg->newline_anchor); + /* If MATCH_FIRST is out of the buffer, leave it as '\0'. + Note that MATCH_FIRST must not be smaller than 0. */ + ch = ((match_first >= length) ? 0 + : re_string_byte_at (&input, + match_first - input.raw_mbs_idx)); + if (fastmap[ch]) + break; + match_first += incr; + } + while (match_first >= left_lim && match_first <= right_lim); + if (! fastmap[ch]) + break; + } + } + + /* Reconstruct the buffers so that the matcher can assume that + the matching starts from the begining of the buffer. */ + re_string_reconstruct (&input, match_first, eflags, + preg->newline_anchor); #ifdef RE_ENABLE_I18N - /* Eliminate it when it is a component of a multibyte character - and isn't the head of a multibyte character. */ - if (MB_CUR_MAX == 1 || re_string_first_byte (&input, 0)) + /* Eliminate it when it is a component of a multibyte character + and isn't the head of a multibyte character. */ + if (sb || re_string_first_byte (&input, 0)) #endif - { - /* It seems to be appropriate one, then use the matcher. */ - /* We assume that the matching starts from 0. */ - mctx.state_log_top = mctx.nbkref_ents = mctx.max_mb_elem_len = 0; - match_last = check_matching (preg, &mctx, 0, fl_longest_match); - if (match_last != -1) - { - if (BE (match_last == -2, 0)) - return REG_ESPACE; - else - break; /* We found a matching. */ - } - } - } + { + /* It seems to be appropriate one, then use the matcher. */ + /* We assume that the matching starts from 0. */ + mctx.state_log_top = mctx.nbkref_ents = mctx.max_mb_elem_len = 0; + match_last = check_matching (preg, &mctx, 0, fl_longest_match); + if (match_last != -1) + { + if (BE (match_last == -2, 0)) + return REG_ESPACE; + else + break; /* We found a matching. */ + } + } + /* Update counter. */ match_first += incr; if (match_first < left_lim || right_lim < match_first) Jakub