public inbox for libc-hacker@sourceware.org
 help / color / mirror / Atom feed
* [PATCH] One more can_be_null fix, optimize regexec/re_search
@ 2002-11-06  8:13 Jakub Jelinek
  2002-11-06 12:01 ` Ulrich Drepper
  0 siblings, 1 reply; 2+ messages in thread
From: Jakub Jelinek @ 2002-11-06  8:13 UTC (permalink / raw)
  To: Ulrich Drepper, Roland McGrath, Isamu Hasegawa; +Cc: Glibc hackers

Hi!

Sorry for sending two things in one patch.
The first part is making sure the whole regex_t structure is initialized.
Although Uli yesterday fixed regcomp clearing can_be_null,
re_compile_pattern doesn't initialize it either (and it is listed as regex
implementation private variable) and later on re_search_internal uses it
uninitialized. Although regs_allocated seems to be only used
in the GNU variants, I think it is better to initialize it everywhere.

The second part is optimization - the re_search_internal loop to find
possible start of matching is very slow, so I've tried to speed it up.

Here is how it speeds up tst-regex:

 Test "[äáâéíîöóôüú]" with 8-bit locale
 match 1: "     Reported by Göran Uddeborg <goeran@uddeborg.pp.se>."
 match 2: "     typo. Reported by Göran Uddeborg <goeran@uddeborg.pp.se>."
-elapsed time: 0.029430392 sec
+elapsed time: 0.010027121 sec

 Test "[äáâéíîöóôüú]" with multi-byte locale
 match 1: "     Reported by GA¶ran Uddeborg <goeran@uddeborg.pp.se>."
 match 2: "     typo. Reported by GA¶ran Uddeborg <goeran@uddeborg.pp.se>."
-elapsed time: 0.119433648 sec
+elapsed time: 0.099460593 sec

 Test "G.ran" with 8-bit locale
 match 1: "     Reported by Göran Uddeborg <goeran@uddeborg.pp.se>."
 match 2: "     typo. Reported by Göran Uddeborg <goeran@uddeborg.pp.se>."
-elapsed time: 0.030021277 sec
+elapsed time: 0.009890545 sec

 Test "G.ran" with multi-byte locale
 match 1: "     Reported by GA¶ran Uddeborg <goeran@uddeborg.pp.se>."
 match 2: "     typo. Reported by GA¶ran Uddeborg <goeran@uddeborg.pp.se>."
-elapsed time: 0.163905369 sec
+elapsed time: 0.143427098 sec

 Test "G.\{1\}ran" with 8-bit locale
 match 1: "     Reported by Göran Uddeborg <goeran@uddeborg.pp.se>."
 match 2: "     typo. Reported by Göran Uddeborg <goeran@uddeborg.pp.se>."
-elapsed time: 0.029824000 sec
+elapsed time: 0.010527323 sec

 Test "G.\{1\}ran" with multi-byte locale
 match 1: "     Reported by GA¶ran Uddeborg <goeran@uddeborg.pp.se>."
 match 2: "     typo. Reported by GA¶ran Uddeborg <goeran@uddeborg.pp.se>."
-elapsed time: 0.164186502 sec
+elapsed time: 0.143455197 sec

 Test "G.*ran" with 8-bit locale
 match 1: "     Generalized by allowing file name transformation.  Do not"
 match 2: "     Reported by Göran Uddeborg <goeran@uddeborg.pp.se>."
 match 3: "     typo. Reported by Göran Uddeborg <goeran@uddeborg.pp.se>."
-elapsed time: 0.030943234 sec
+elapsed time: 0.011773340 sec

 Test "G.*ran" with multi-byte locale
 match 1: "     Generalized by allowing file name transformation.  Do not"
 match 2: "     Reported by GA¶ran Uddeborg <goeran@uddeborg.pp.se>."
 match 3: "     typo. Reported by GA¶ran Uddeborg <goeran@uddeborg.pp.se>."
-elapsed time: 0.171320491 sec
+elapsed time: 0.152224724 sec

 Test "[äáâ]" with 8-bit locale
-elapsed time: 0.027261718 sec
+elapsed time: 0.008808948 sec

 Test "[äáâ]" with multi-byte locale
-elapsed time: 0.116793876 sec
+elapsed time: 0.098257414 sec

I've measured also:

time sed.glibc.regex -n 's/Jakub/jakub/p' /usr/src/libc/ChangeLog.13 > /dev/null

which went from 0.051s down to 0.029s on average to see whether it matters
for real-world apps too.

Passes glibc make check and sed with glibc regex make check.
But I think we need more testcases, e.g. for re_search searching backwards,
can_be_null matching at the end of string, etc.

Also, type == COMPLEX_BRACKET in re_compile_fastmap_iter was IMHO redundant,
it was something like:
...
#ifdef RE_ENABLE_I18N
      else if (type == COMPLEX_BRACKET)
	{
...
	}
#endif /* RE_ENABLE_I18N */
      else if (type == END_OF_RE || type == OP_PERIOD
#ifdef RE_ENABLE_I18N
               || type == COMPLEX_BRACKET
#endif /* RE_ENABLE_I18N */
              )
...
ie. type cannot be COMPLEX_BRACKET in the second if.
Also, I think re_string_reconstruct was not called
when doing a MBS ICASE/translate search backwards.

2002-11-06  Jakub Jelinek  <jakub@redhat.com>

	* posix/regcomp.c (re_compile_pattern): Don't set regs_allocated
	here.
	(regcomp): Don't set can_be_null here.
	(re_comp): Clear whole re_comp_buf with the exception of fastmap.
	(re_compile_internal): Clear can_be_null, set regs_allocated.

	* posix/regcomp.c (re_set_fastmap): New function.
	(re_compile_fastmap_iter): Use it.  Remove redundant type ==
	COMPLEX_BRACKET check.
	* posix/regexec.c (re_search_internal): Optimize searching with
	fastmap.  Call re_string_reconstruct even if match_first is
	smaller than raw_mbs_idx.

--- libc/posix/regcomp.c.jj	2002-11-05 23:09:17.000000000 +0100
+++ libc/posix/regcomp.c	2002-11-06 14:01:37.000000000 +0100
@@ -267,10 +267,6 @@ re_compile_pattern (pattern, length, buf
 {
   reg_errcode_t ret;
 
-  /* GNU code is written to assume at least RE_NREGS registers will be set
-     (and at least one extra will be -1).  */
-  bufp->regs_allocated = REGS_UNALLOCATED;
-
   /* And GNU code determines whether or not to get register information
      by passing null for the REGS argument to re_match, etc., not by
      setting no_sub.  */
@@ -339,6 +335,16 @@ re_compile_fastmap (bufp)
 weak_alias (__re_compile_fastmap, re_compile_fastmap)
 #endif
 
+static inline void
+re_set_fastmap (fastmap, icase, ch)
+     char *fastmap;
+     int icase, ch;
+{
+  fastmap[ch] = 1;
+  if (icase)
+    fastmap[tolower (ch)] = 1;
+}
+
 /* Helper function for re_compile_fastmap.
    Compile fastmap for the initial_state INIT_STATE.  */
 
@@ -349,21 +355,21 @@ re_compile_fastmap_iter (bufp, init_stat
      char *fastmap;
 {
   re_dfa_t *dfa = (re_dfa_t *) bufp->buffer;
-  int node_cnt;
+  int node_cnt, icase = (MB_CUR_MAX == 1 && (bufp->syntax & RE_ICASE));
   for (node_cnt = 0; node_cnt < init_state->nodes.nelem; ++node_cnt)
     {
       int node = init_state->nodes.elems[node_cnt];
       re_token_type_t type = dfa->nodes[node].type;
 
       if (type == CHARACTER)
-        fastmap[dfa->nodes[node].opr.c] = 1;
+	re_set_fastmap (fastmap, icase, dfa->nodes[node].opr.c);
       else if (type == SIMPLE_BRACKET)
         {
           int i, j, ch;
           for (i = 0, ch = 0; i < BITSET_UINTS; ++i)
             for (j = 0; j < UINT_BITS; ++j, ++ch)
               if (dfa->nodes[node].opr.sbcset[i] & (1 << j))
-                fastmap[ch] = 1;
+                re_set_fastmap (fastmap, icase, ch);
         }
 #ifdef RE_ENABLE_I18N
       else if (type == COMPLEX_BRACKET)
@@ -388,28 +394,24 @@ re_compile_fastmap_iter (bufp, init_stat
                   for (i = 0, ch = 0; i < BITSET_UINTS; ++i)
                     for (j = 0; j < UINT_BITS; ++j, ++ch)
                       if (table[ch] < 0)
-                        fastmap[ch] = 1;
+                        re_set_fastmap (fastmap, icase, ch);
                 }
 # else
               if (MB_CUR_MAX > 1)
                 for (i = 0; i < SBC_MAX; ++i)
                   if (__btowc (i) == WEOF)
-                    fastmap[i] = 1;
+		    re_set_fastmap (fastmap, icase, i);
 # endif /* not _LIBC */
             }
           for (i = 0; i < cset->nmbchars; ++i)
             {
               char buf[256];
               wctomb (buf, cset->mbchars[i]);
-              fastmap[*(unsigned char *) buf] = 1;
+	      re_set_fastmap (fastmap, icase, *(unsigned char *) buf);
             }
         }
 #endif /* RE_ENABLE_I18N */
-      else if (type == END_OF_RE || type == OP_PERIOD
-#ifdef RE_ENABLE_I18N
-               || type == COMPLEX_BRACKET
-#endif /* RE_ENABLE_I18N */
-              )
+      else if (type == END_OF_RE || type == OP_PERIOD)
         {
           memset (fastmap, '\1', sizeof (char) * SBC_MAX);
           if (type == END_OF_RE)
@@ -468,7 +470,6 @@ regcomp (preg, pattern, cflags)
   preg->buffer = NULL;
   preg->allocated = 0;
   preg->used = 0;
-  preg->can_be_null = 0;
 
   /* Try to allocate space for the fastmap.  */
   preg->fastmap = re_malloc (char, SBC_MAX);
@@ -668,8 +669,7 @@ re_comp (s)
       fastmap = re_comp_buf.fastmap;
       re_comp_buf.fastmap = NULL;
       __regfree (&re_comp_buf);
-      re_comp_buf.buffer = NULL;
-      re_comp_buf.allocated = 0;
+      memset (&re_comp_buf, 0, sizeof (re_comp_buf));
       re_comp_buf.fastmap = fastmap;
     }
 
@@ -726,6 +726,8 @@ re_compile_internal (preg, pattern, leng
   preg->not_bol = preg->not_eol = 0;
   preg->used = 0;
   preg->re_nsub = 0;
+  preg->can_be_null = 0;
+  preg->regs_allocated = REGS_UNALLOCATED;
 
   /* Initialize the dfa.  */
   dfa = (re_dfa_t *) preg->buffer;
--- libc/posix/regexec.c.jj	2002-11-05 23:10:27.000000000 +0100
+++ libc/posix/regexec.c	2002-11-06 15:52:07.000000000 +0100
@@ -575,9 +575,10 @@ re_search_internal (preg, string, length
   re_string_t input;
   int left_lim, right_lim, incr;
   int fl_longest_match, match_first, match_last = -1;
+  int fast_translate, sb;
   re_match_context_t mctx;
-  char *fastmap = ((preg->fastmap != NULL && preg->fastmap_accurate)
-                   ? preg->fastmap : NULL);
+  char *fastmap = ((preg->fastmap != NULL && preg->fastmap_accurate
+		    && range && !preg->can_be_null) ? preg->fastmap : NULL);
 
   /* Check if the DFA haven't been compiled.  */
   if (BE (preg->used == 0 || dfa->init_state == NULL
@@ -626,65 +627,108 @@ re_search_internal (preg, string, length
   incr = (range < 0) ? -1 : 1;
   left_lim = (range < 0) ? start + range : start;
   right_lim = (range < 0) ? start : start + range;
+  sb = MB_CUR_MAX == 1;
+  fast_translate = sb || !(preg->syntax & RE_ICASE || preg->translate);
 
   for (;;)
     {
       /* At first get the current byte from input string.  */
-      int ch;
-      if (MB_CUR_MAX > 1 && (preg->syntax & RE_ICASE || preg->translate))
-        {
-          /* In this case, we can't determin easily the current byte,
-             since it might be a component byte of a multibyte character.
-             Then we use the constructed buffer instead.  */
-          /* If MATCH_FIRST is out of the valid range, reconstruct the
-             buffers.  */
-          if (input.raw_mbs_idx + input.valid_len <= match_first)
-            re_string_reconstruct (&input, match_first, eflags,
-                                   preg->newline_anchor);
-          /* If MATCH_FIRST is out of the buffer, leave it as '\0'.
-             Note that MATCH_FIRST must not be smaller than 0.  */
-          ch = ((match_first >= length) ? 0
-                : re_string_byte_at (&input, match_first - input.raw_mbs_idx));
-        }
-      else
-        {
-          /* We apply translate/conversion manually, since it is trivial
-             in this case.  */
-          /* If MATCH_FIRST is out of the buffer, leave it as '\0'.
-             Note that MATCH_FIRST must not be smaller than 0.  */
-          ch = (match_first < length) ? (unsigned char)string[match_first] : 0;
-          /* Apply translation if we need.  */
-          ch = preg->translate ? preg->translate[ch] : ch;
-          /* In case of case insensitive mode, convert to upper case.  */
-          ch = ((preg->syntax & RE_ICASE) && islower (ch)) ? toupper (ch) : ch;
-        }
-
-      /* Eliminate inappropriate one by fastmap.  */
-      if (preg->can_be_null || fastmap == NULL || fastmap[ch])
-        {
-          /* Reconstruct the buffers so that the matcher can assume that
-             the matching starts from the begining of the buffer.  */
-          re_string_reconstruct (&input, match_first, eflags,
-                                 preg->newline_anchor);
+      if (fastmap)
+	{
+	  if (BE (fast_translate, 1))
+	    {
+	      unsigned RE_TRANSLATE_TYPE t
+		= (unsigned RE_TRANSLATE_TYPE) preg->translate;
+	      if (BE (range >= 0, 1))
+		{
+		  if (BE (t != NULL, 0))
+		    {
+		      while (BE (match_first < right_lim, 1)
+			     && !fastmap[t[(unsigned char) string[match_first]]])
+			++match_first;
+		    }
+		  else
+		    {
+		      while (BE (match_first < right_lim, 1)
+			     && !fastmap[(unsigned char) string[match_first]])
+			++match_first;
+		    }
+		  if (BE (match_first == right_lim, 0))
+		    {
+		      int ch = match_first >= length
+			       ? 0 : (unsigned char) string[match_first];
+		      if (!fastmap[t ? t[ch] : ch])
+			break;
+		    }
+		}
+	      else
+		{
+		  while (match_first >= left_lim)
+		    {
+		      int ch = match_first >= length
+			       ? 0 : (unsigned char) string[match_first];
+		      if (fastmap[t ? t[ch] : ch])
+			break;
+		      --match_first;
+		    }
+		  if (match_first < left_lim)
+		    break;
+		}
+	    }
+	  else
+	    {
+	      int ch;
+
+	      do
+		{
+		  /* In this case, we can't determine easily the current byte,
+		     since it might be a component byte of a multibyte
+		     character.  Then we use the constructed buffer
+		     instead.  */
+		  /* If MATCH_FIRST is out of the valid range, reconstruct the
+		     buffers.  */
+		  if (input.raw_mbs_idx + input.valid_len <= match_first
+		      || match_first < input.raw_mbs_idx)
+		    re_string_reconstruct (&input, match_first, eflags,
+					   preg->newline_anchor);
+		  /* If MATCH_FIRST is out of the buffer, leave it as '\0'.
+		     Note that MATCH_FIRST must not be smaller than 0.  */
+		  ch = ((match_first >= length) ? 0
+		       : re_string_byte_at (&input,
+					    match_first - input.raw_mbs_idx));
+		  if (fastmap[ch])
+		    break;
+		  match_first += incr;
+		}
+	      while (match_first >= left_lim && match_first <= right_lim);
+	      if (! fastmap[ch])
+		break;
+	    }
+	}
+
+      /* Reconstruct the buffers so that the matcher can assume that
+         the matching starts from the begining of the buffer.  */
+      re_string_reconstruct (&input, match_first, eflags,
+			     preg->newline_anchor);
 #ifdef RE_ENABLE_I18N
-          /* Eliminate it when it is a component of a multibyte character
-             and isn't the head of a multibyte character.  */
-          if (MB_CUR_MAX == 1 || re_string_first_byte (&input, 0))
+      /* Eliminate it when it is a component of a multibyte character
+         and isn't the head of a multibyte character.  */
+      if (sb || re_string_first_byte (&input, 0))
 #endif
-            {
-              /* It seems to be appropriate one, then use the matcher.  */
-              /* We assume that the matching starts from 0.  */
-              mctx.state_log_top = mctx.nbkref_ents = mctx.max_mb_elem_len = 0;
-              match_last = check_matching (preg, &mctx, 0, fl_longest_match);
-              if (match_last != -1)
-                {
-                  if (BE (match_last == -2, 0))
-                    return REG_ESPACE;
-                  else
-                    break; /* We found a matching.  */
-                }
-            }
-        }
+	{
+	  /* It seems to be appropriate one, then use the matcher.  */
+	  /* We assume that the matching starts from 0.  */
+	  mctx.state_log_top = mctx.nbkref_ents = mctx.max_mb_elem_len = 0;
+	  match_last = check_matching (preg, &mctx, 0, fl_longest_match);
+	  if (match_last != -1)
+	    {
+	      if (BE (match_last == -2, 0))
+		return REG_ESPACE;
+	      else
+		break; /* We found a matching.  */
+	    }
+	}
+
       /* Update counter.  */
       match_first += incr;
       if (match_first < left_lim || right_lim < match_first)

	Jakub

^ permalink raw reply	[flat|nested] 2+ messages in thread

* Re: [PATCH] One more can_be_null fix, optimize regexec/re_search
  2002-11-06  8:13 [PATCH] One more can_be_null fix, optimize regexec/re_search Jakub Jelinek
@ 2002-11-06 12:01 ` Ulrich Drepper
  0 siblings, 0 replies; 2+ messages in thread
From: Ulrich Drepper @ 2002-11-06 12:01 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Roland McGrath, Isamu Hasegawa, Glibc hackers

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Jakub Jelinek wrote:

> The first part is making sure the whole regex_t structure is initialized.
> [...]

I've put the patch in after tweeking it to take Isamu's memory handling
changes into account.  Thanks,

- -- 
- --------------.                        ,-.            444 Castro Street
Ulrich Drepper \    ,-----------------'   \ Mountain View, CA 94041 USA
Red Hat         `--' drepper at redhat.com `---------------------------
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.7 (GNU/Linux)

iD8DBQE9yXVD2ijCOnn/RHQRAmfyAJ9TvthffRiyqswWlLUIp7zy6TiIjwCfRkqs
9EngdTjErZ9/rSKzzsFa6SY=
=bzSi
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 2+ messages in thread

end of thread, other threads:[~2002-11-06 20:01 UTC | newest]

Thread overview: 2+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2002-11-06  8:13 [PATCH] One more can_be_null fix, optimize regexec/re_search Jakub Jelinek
2002-11-06 12:01 ` Ulrich Drepper

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).