* [PATCH] One more can_be_null fix, optimize regexec/re_search
@ 2002-11-06 8:13 Jakub Jelinek
2002-11-06 12:01 ` Ulrich Drepper
0 siblings, 1 reply; 2+ messages in thread
From: Jakub Jelinek @ 2002-11-06 8:13 UTC (permalink / raw)
To: Ulrich Drepper, Roland McGrath, Isamu Hasegawa; +Cc: Glibc hackers
Hi!
Sorry for sending two things in one patch.
The first part is making sure the whole regex_t structure is initialized.
Although Uli yesterday fixed regcomp clearing can_be_null,
re_compile_pattern doesn't initialize it either (and it is listed as regex
implementation private variable) and later on re_search_internal uses it
uninitialized. Although regs_allocated seems to be only used
in the GNU variants, I think it is better to initialize it everywhere.
The second part is optimization - the re_search_internal loop to find
possible start of matching is very slow, so I've tried to speed it up.
Here is how it speeds up tst-regex:
Test "[äáâéÃîöóôüú]" with 8-bit locale
match 1: " Reported by Göran Uddeborg <goeran@uddeborg.pp.se>."
match 2: " typo. Reported by Göran Uddeborg <goeran@uddeborg.pp.se>."
-elapsed time: 0.029430392 sec
+elapsed time: 0.010027121 sec
Test "[äáâéÃîöóôüú]" with multi-byte locale
match 1: " Reported by GA¶ran Uddeborg <goeran@uddeborg.pp.se>."
match 2: " typo. Reported by GA¶ran Uddeborg <goeran@uddeborg.pp.se>."
-elapsed time: 0.119433648 sec
+elapsed time: 0.099460593 sec
Test "G.ran" with 8-bit locale
match 1: " Reported by Göran Uddeborg <goeran@uddeborg.pp.se>."
match 2: " typo. Reported by Göran Uddeborg <goeran@uddeborg.pp.se>."
-elapsed time: 0.030021277 sec
+elapsed time: 0.009890545 sec
Test "G.ran" with multi-byte locale
match 1: " Reported by GA¶ran Uddeborg <goeran@uddeborg.pp.se>."
match 2: " typo. Reported by GA¶ran Uddeborg <goeran@uddeborg.pp.se>."
-elapsed time: 0.163905369 sec
+elapsed time: 0.143427098 sec
Test "G.\{1\}ran" with 8-bit locale
match 1: " Reported by Göran Uddeborg <goeran@uddeborg.pp.se>."
match 2: " typo. Reported by Göran Uddeborg <goeran@uddeborg.pp.se>."
-elapsed time: 0.029824000 sec
+elapsed time: 0.010527323 sec
Test "G.\{1\}ran" with multi-byte locale
match 1: " Reported by GA¶ran Uddeborg <goeran@uddeborg.pp.se>."
match 2: " typo. Reported by GA¶ran Uddeborg <goeran@uddeborg.pp.se>."
-elapsed time: 0.164186502 sec
+elapsed time: 0.143455197 sec
Test "G.*ran" with 8-bit locale
match 1: " Generalized by allowing file name transformation. Do not"
match 2: " Reported by Göran Uddeborg <goeran@uddeborg.pp.se>."
match 3: " typo. Reported by Göran Uddeborg <goeran@uddeborg.pp.se>."
-elapsed time: 0.030943234 sec
+elapsed time: 0.011773340 sec
Test "G.*ran" with multi-byte locale
match 1: " Generalized by allowing file name transformation. Do not"
match 2: " Reported by GA¶ran Uddeborg <goeran@uddeborg.pp.se>."
match 3: " typo. Reported by GA¶ran Uddeborg <goeran@uddeborg.pp.se>."
-elapsed time: 0.171320491 sec
+elapsed time: 0.152224724 sec
Test "[äáâ]" with 8-bit locale
-elapsed time: 0.027261718 sec
+elapsed time: 0.008808948 sec
Test "[äáâ]" with multi-byte locale
-elapsed time: 0.116793876 sec
+elapsed time: 0.098257414 sec
I've measured also:
time sed.glibc.regex -n 's/Jakub/jakub/p' /usr/src/libc/ChangeLog.13 > /dev/null
which went from 0.051s down to 0.029s on average to see whether it matters
for real-world apps too.
Passes glibc make check and sed with glibc regex make check.
But I think we need more testcases, e.g. for re_search searching backwards,
can_be_null matching at the end of string, etc.
Also, type == COMPLEX_BRACKET in re_compile_fastmap_iter was IMHO redundant,
it was something like:
...
#ifdef RE_ENABLE_I18N
else if (type == COMPLEX_BRACKET)
{
...
}
#endif /* RE_ENABLE_I18N */
else if (type == END_OF_RE || type == OP_PERIOD
#ifdef RE_ENABLE_I18N
|| type == COMPLEX_BRACKET
#endif /* RE_ENABLE_I18N */
)
...
ie. type cannot be COMPLEX_BRACKET in the second if.
Also, I think re_string_reconstruct was not called
when doing a MBS ICASE/translate search backwards.
2002-11-06 Jakub Jelinek <jakub@redhat.com>
* posix/regcomp.c (re_compile_pattern): Don't set regs_allocated
here.
(regcomp): Don't set can_be_null here.
(re_comp): Clear whole re_comp_buf with the exception of fastmap.
(re_compile_internal): Clear can_be_null, set regs_allocated.
* posix/regcomp.c (re_set_fastmap): New function.
(re_compile_fastmap_iter): Use it. Remove redundant type ==
COMPLEX_BRACKET check.
* posix/regexec.c (re_search_internal): Optimize searching with
fastmap. Call re_string_reconstruct even if match_first is
smaller than raw_mbs_idx.
--- libc/posix/regcomp.c.jj 2002-11-05 23:09:17.000000000 +0100
+++ libc/posix/regcomp.c 2002-11-06 14:01:37.000000000 +0100
@@ -267,10 +267,6 @@ re_compile_pattern (pattern, length, buf
{
reg_errcode_t ret;
- /* GNU code is written to assume at least RE_NREGS registers will be set
- (and at least one extra will be -1). */
- bufp->regs_allocated = REGS_UNALLOCATED;
-
/* And GNU code determines whether or not to get register information
by passing null for the REGS argument to re_match, etc., not by
setting no_sub. */
@@ -339,6 +335,16 @@ re_compile_fastmap (bufp)
weak_alias (__re_compile_fastmap, re_compile_fastmap)
#endif
+static inline void
+re_set_fastmap (fastmap, icase, ch)
+ char *fastmap;
+ int icase, ch;
+{
+ fastmap[ch] = 1;
+ if (icase)
+ fastmap[tolower (ch)] = 1;
+}
+
/* Helper function for re_compile_fastmap.
Compile fastmap for the initial_state INIT_STATE. */
@@ -349,21 +355,21 @@ re_compile_fastmap_iter (bufp, init_stat
char *fastmap;
{
re_dfa_t *dfa = (re_dfa_t *) bufp->buffer;
- int node_cnt;
+ int node_cnt, icase = (MB_CUR_MAX == 1 && (bufp->syntax & RE_ICASE));
for (node_cnt = 0; node_cnt < init_state->nodes.nelem; ++node_cnt)
{
int node = init_state->nodes.elems[node_cnt];
re_token_type_t type = dfa->nodes[node].type;
if (type == CHARACTER)
- fastmap[dfa->nodes[node].opr.c] = 1;
+ re_set_fastmap (fastmap, icase, dfa->nodes[node].opr.c);
else if (type == SIMPLE_BRACKET)
{
int i, j, ch;
for (i = 0, ch = 0; i < BITSET_UINTS; ++i)
for (j = 0; j < UINT_BITS; ++j, ++ch)
if (dfa->nodes[node].opr.sbcset[i] & (1 << j))
- fastmap[ch] = 1;
+ re_set_fastmap (fastmap, icase, ch);
}
#ifdef RE_ENABLE_I18N
else if (type == COMPLEX_BRACKET)
@@ -388,28 +394,24 @@ re_compile_fastmap_iter (bufp, init_stat
for (i = 0, ch = 0; i < BITSET_UINTS; ++i)
for (j = 0; j < UINT_BITS; ++j, ++ch)
if (table[ch] < 0)
- fastmap[ch] = 1;
+ re_set_fastmap (fastmap, icase, ch);
}
# else
if (MB_CUR_MAX > 1)
for (i = 0; i < SBC_MAX; ++i)
if (__btowc (i) == WEOF)
- fastmap[i] = 1;
+ re_set_fastmap (fastmap, icase, i);
# endif /* not _LIBC */
}
for (i = 0; i < cset->nmbchars; ++i)
{
char buf[256];
wctomb (buf, cset->mbchars[i]);
- fastmap[*(unsigned char *) buf] = 1;
+ re_set_fastmap (fastmap, icase, *(unsigned char *) buf);
}
}
#endif /* RE_ENABLE_I18N */
- else if (type == END_OF_RE || type == OP_PERIOD
-#ifdef RE_ENABLE_I18N
- || type == COMPLEX_BRACKET
-#endif /* RE_ENABLE_I18N */
- )
+ else if (type == END_OF_RE || type == OP_PERIOD)
{
memset (fastmap, '\1', sizeof (char) * SBC_MAX);
if (type == END_OF_RE)
@@ -468,7 +470,6 @@ regcomp (preg, pattern, cflags)
preg->buffer = NULL;
preg->allocated = 0;
preg->used = 0;
- preg->can_be_null = 0;
/* Try to allocate space for the fastmap. */
preg->fastmap = re_malloc (char, SBC_MAX);
@@ -668,8 +669,7 @@ re_comp (s)
fastmap = re_comp_buf.fastmap;
re_comp_buf.fastmap = NULL;
__regfree (&re_comp_buf);
- re_comp_buf.buffer = NULL;
- re_comp_buf.allocated = 0;
+ memset (&re_comp_buf, 0, sizeof (re_comp_buf));
re_comp_buf.fastmap = fastmap;
}
@@ -726,6 +726,8 @@ re_compile_internal (preg, pattern, leng
preg->not_bol = preg->not_eol = 0;
preg->used = 0;
preg->re_nsub = 0;
+ preg->can_be_null = 0;
+ preg->regs_allocated = REGS_UNALLOCATED;
/* Initialize the dfa. */
dfa = (re_dfa_t *) preg->buffer;
--- libc/posix/regexec.c.jj 2002-11-05 23:10:27.000000000 +0100
+++ libc/posix/regexec.c 2002-11-06 15:52:07.000000000 +0100
@@ -575,9 +575,10 @@ re_search_internal (preg, string, length
re_string_t input;
int left_lim, right_lim, incr;
int fl_longest_match, match_first, match_last = -1;
+ int fast_translate, sb;
re_match_context_t mctx;
- char *fastmap = ((preg->fastmap != NULL && preg->fastmap_accurate)
- ? preg->fastmap : NULL);
+ char *fastmap = ((preg->fastmap != NULL && preg->fastmap_accurate
+ && range && !preg->can_be_null) ? preg->fastmap : NULL);
/* Check if the DFA haven't been compiled. */
if (BE (preg->used == 0 || dfa->init_state == NULL
@@ -626,65 +627,108 @@ re_search_internal (preg, string, length
incr = (range < 0) ? -1 : 1;
left_lim = (range < 0) ? start + range : start;
right_lim = (range < 0) ? start : start + range;
+ sb = MB_CUR_MAX == 1;
+ fast_translate = sb || !(preg->syntax & RE_ICASE || preg->translate);
for (;;)
{
/* At first get the current byte from input string. */
- int ch;
- if (MB_CUR_MAX > 1 && (preg->syntax & RE_ICASE || preg->translate))
- {
- /* In this case, we can't determin easily the current byte,
- since it might be a component byte of a multibyte character.
- Then we use the constructed buffer instead. */
- /* If MATCH_FIRST is out of the valid range, reconstruct the
- buffers. */
- if (input.raw_mbs_idx + input.valid_len <= match_first)
- re_string_reconstruct (&input, match_first, eflags,
- preg->newline_anchor);
- /* If MATCH_FIRST is out of the buffer, leave it as '\0'.
- Note that MATCH_FIRST must not be smaller than 0. */
- ch = ((match_first >= length) ? 0
- : re_string_byte_at (&input, match_first - input.raw_mbs_idx));
- }
- else
- {
- /* We apply translate/conversion manually, since it is trivial
- in this case. */
- /* If MATCH_FIRST is out of the buffer, leave it as '\0'.
- Note that MATCH_FIRST must not be smaller than 0. */
- ch = (match_first < length) ? (unsigned char)string[match_first] : 0;
- /* Apply translation if we need. */
- ch = preg->translate ? preg->translate[ch] : ch;
- /* In case of case insensitive mode, convert to upper case. */
- ch = ((preg->syntax & RE_ICASE) && islower (ch)) ? toupper (ch) : ch;
- }
-
- /* Eliminate inappropriate one by fastmap. */
- if (preg->can_be_null || fastmap == NULL || fastmap[ch])
- {
- /* Reconstruct the buffers so that the matcher can assume that
- the matching starts from the begining of the buffer. */
- re_string_reconstruct (&input, match_first, eflags,
- preg->newline_anchor);
+ if (fastmap)
+ {
+ if (BE (fast_translate, 1))
+ {
+ unsigned RE_TRANSLATE_TYPE t
+ = (unsigned RE_TRANSLATE_TYPE) preg->translate;
+ if (BE (range >= 0, 1))
+ {
+ if (BE (t != NULL, 0))
+ {
+ while (BE (match_first < right_lim, 1)
+ && !fastmap[t[(unsigned char) string[match_first]]])
+ ++match_first;
+ }
+ else
+ {
+ while (BE (match_first < right_lim, 1)
+ && !fastmap[(unsigned char) string[match_first]])
+ ++match_first;
+ }
+ if (BE (match_first == right_lim, 0))
+ {
+ int ch = match_first >= length
+ ? 0 : (unsigned char) string[match_first];
+ if (!fastmap[t ? t[ch] : ch])
+ break;
+ }
+ }
+ else
+ {
+ while (match_first >= left_lim)
+ {
+ int ch = match_first >= length
+ ? 0 : (unsigned char) string[match_first];
+ if (fastmap[t ? t[ch] : ch])
+ break;
+ --match_first;
+ }
+ if (match_first < left_lim)
+ break;
+ }
+ }
+ else
+ {
+ int ch;
+
+ do
+ {
+ /* In this case, we can't determine easily the current byte,
+ since it might be a component byte of a multibyte
+ character. Then we use the constructed buffer
+ instead. */
+ /* If MATCH_FIRST is out of the valid range, reconstruct the
+ buffers. */
+ if (input.raw_mbs_idx + input.valid_len <= match_first
+ || match_first < input.raw_mbs_idx)
+ re_string_reconstruct (&input, match_first, eflags,
+ preg->newline_anchor);
+ /* If MATCH_FIRST is out of the buffer, leave it as '\0'.
+ Note that MATCH_FIRST must not be smaller than 0. */
+ ch = ((match_first >= length) ? 0
+ : re_string_byte_at (&input,
+ match_first - input.raw_mbs_idx));
+ if (fastmap[ch])
+ break;
+ match_first += incr;
+ }
+ while (match_first >= left_lim && match_first <= right_lim);
+ if (! fastmap[ch])
+ break;
+ }
+ }
+
+ /* Reconstruct the buffers so that the matcher can assume that
+ the matching starts from the begining of the buffer. */
+ re_string_reconstruct (&input, match_first, eflags,
+ preg->newline_anchor);
#ifdef RE_ENABLE_I18N
- /* Eliminate it when it is a component of a multibyte character
- and isn't the head of a multibyte character. */
- if (MB_CUR_MAX == 1 || re_string_first_byte (&input, 0))
+ /* Eliminate it when it is a component of a multibyte character
+ and isn't the head of a multibyte character. */
+ if (sb || re_string_first_byte (&input, 0))
#endif
- {
- /* It seems to be appropriate one, then use the matcher. */
- /* We assume that the matching starts from 0. */
- mctx.state_log_top = mctx.nbkref_ents = mctx.max_mb_elem_len = 0;
- match_last = check_matching (preg, &mctx, 0, fl_longest_match);
- if (match_last != -1)
- {
- if (BE (match_last == -2, 0))
- return REG_ESPACE;
- else
- break; /* We found a matching. */
- }
- }
- }
+ {
+ /* It seems to be appropriate one, then use the matcher. */
+ /* We assume that the matching starts from 0. */
+ mctx.state_log_top = mctx.nbkref_ents = mctx.max_mb_elem_len = 0;
+ match_last = check_matching (preg, &mctx, 0, fl_longest_match);
+ if (match_last != -1)
+ {
+ if (BE (match_last == -2, 0))
+ return REG_ESPACE;
+ else
+ break; /* We found a matching. */
+ }
+ }
+
/* Update counter. */
match_first += incr;
if (match_first < left_lim || right_lim < match_first)
Jakub
^ permalink raw reply [flat|nested] 2+ messages in thread
* Re: [PATCH] One more can_be_null fix, optimize regexec/re_search
2002-11-06 8:13 [PATCH] One more can_be_null fix, optimize regexec/re_search Jakub Jelinek
@ 2002-11-06 12:01 ` Ulrich Drepper
0 siblings, 0 replies; 2+ messages in thread
From: Ulrich Drepper @ 2002-11-06 12:01 UTC (permalink / raw)
To: Jakub Jelinek; +Cc: Roland McGrath, Isamu Hasegawa, Glibc hackers
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Jakub Jelinek wrote:
> The first part is making sure the whole regex_t structure is initialized.
> [...]
I've put the patch in after tweeking it to take Isamu's memory handling
changes into account. Thanks,
- --
- --------------. ,-. 444 Castro Street
Ulrich Drepper \ ,-----------------' \ Mountain View, CA 94041 USA
Red Hat `--' drepper at redhat.com `---------------------------
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.7 (GNU/Linux)
iD8DBQE9yXVD2ijCOnn/RHQRAmfyAJ9TvthffRiyqswWlLUIp7zy6TiIjwCfRkqs
9EngdTjErZ9/rSKzzsFa6SY=
=bzSi
-----END PGP SIGNATURE-----
^ permalink raw reply [flat|nested] 2+ messages in thread
end of thread, other threads:[~2002-11-06 20:01 UTC | newest]
Thread overview: 2+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2002-11-06 8:13 [PATCH] One more can_be_null fix, optimize regexec/re_search Jakub Jelinek
2002-11-06 12:01 ` Ulrich Drepper
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).