public inbox for libc-hacker@sourceware.org
 help / color / mirror / Atom feed
* [PATCH] Optimize also UTF-8 mb chars in regexs, a small optimization for regexec.c
@ 2003-11-17 13:47 Jakub Jelinek
  2003-11-18  7:25 ` Ulrich Drepper
  0 siblings, 1 reply; 2+ messages in thread
From: Jakub Jelinek @ 2003-11-17 13:47 UTC (permalink / raw)
  To: Ulrich Drepper, Roland McGrath; +Cc: Glibc hackers

Hi!

Two changes:
1) Both b\xc3\xa4r and b\xc3\xa4*r can be optimized quite easily
   (I knew about the former one, didn't know it didn't need any changes
    for the latter one as in that case the parse tree has already
    DUPop's ->left pointing to CONCAT tree of the bytes of the multi-byte
    character, so if just mb_partial flag is cleared, it ought to work
    right)
2) small optimization tweak in regexec.c - it is not necessary to compute
   nrules in the expected case of elem_len <= 1 && char_len <= 1
   or for OP_PERIOD

Todo:
1) Stepan Kasal suggested to optimize . in optimize_utf8 as well
   (probably adding new OP_UTF8_PERIOD node could do the job, will
   try to implement it and benchmark it.  Another alternative would
   be to replace OP_PERIOD with the actual description of UTF-8
   character (something like
[00-7f]|([c2-df]|e0[a0-bf]|([e1-ef]|f0[90-bf]|([f1-f7]|f8[88-bf]|([f9-fb]|fc[84-bf]|fd[80-bf])[80-bf])[80-bf])[80-bf])[80-bf]
   where 00-ff are actually chars in hex and () are only used for grouping
   and not for register matching.  Also, 00 would have to be removed if
   RE_DOT_NOT_NULL and '\n' would have to be removed if !RE_DOT_NEWLINE)
   and use unmodified regexec.c
2) perhaps we could get better code by caching at least nrules
   in dfa/re_string_t
3) oprofiling the code and see where are hot spots

2003-11-17  Jakub Jelinek  <jakub@redhat.com>

	* posix/regcomp.c (optimize_utf8): Optimize multi-byte chars as
	well.
	* posix/bug-regex20.c (tests): Add new tests.  Multi-byte char
	followed by dup operator is expected to be optimized.

	* posix/regexec.c (check_node_accept_bytes): Move nrules and j
	variables to the block where they are only used, initialize
	nrules only immediately before using it.

--- libc/posix/regcomp.c.jj	2003-11-13 22:43:39.000000000 +0100
+++ libc/posix/regcomp.c	2003-11-15 14:07:49.000000000 +0100
@@ -964,17 +964,14 @@ static void
 optimize_utf8 (dfa)
      re_dfa_t *dfa;
 {
-  int node, i;
+  int node, i, mb_chars = 0;
 
   for (node = 0; node < dfa->nodes_len; ++node)
     switch (dfa->nodes[node].type)
       {
       case CHARACTER:
-        /* Chars >= 0x80 are optimizable in some cases (e.g. when not
-	   followed by DUP operator, not in bracket etc.).
-	   For now punt on them all.  */
 	if (dfa->nodes[node].opr.c >= 0x80)
-	  return;
+	  mb_chars = 1;
 	break;
       case ANCHOR:
 	switch (dfa->nodes[node].opr.idx)
@@ -1009,6 +1006,12 @@ optimize_utf8 (dfa)
 	return;
       }
 
+  if (mb_chars)
+    for (node = 0; node < dfa->nodes_len; ++node)
+      if (dfa->nodes[node].type == CHARACTER
+	  && dfa->nodes[node].opr.c >= 0x80)
+	dfa->nodes[node].mb_partial = 0;
+
   /* The search can be in single byte locale.  */
   dfa->mb_cur_max = 1;
   dfa->is_utf8 = 0;
--- libc/posix/bug-regex20.c.jj	2003-11-13 22:43:39.000000000 +0100
+++ libc/posix/bug-regex20.c	2003-11-15 14:33:29.000000000 +0100
@@ -43,15 +43,35 @@ static struct
      \xe2\x80\x94	EM DASH  */
   /* Should be optimized.  */
   {RE_SYNTAX_POSIX_BASIC, "foo", "b\xc3\xa4rfoob\xc3\xa4z", 4, 1},
+  {RE_SYNTAX_POSIX_BASIC, "b\xc3\xa4z", "b\xc3\xa4rfoob\xc3\xa4z", 7, 1},
+  {RE_SYNTAX_POSIX_BASIC, "b\xc3\xa4*z", "b\xc3\xa4rfoob\xc3\xa4z", 7, 1},
+  {RE_SYNTAX_POSIX_BASIC, "b\xc3\xa4*z", "b\xc3\xa4rfoobz", 7, 1},
+  {RE_SYNTAX_POSIX_BASIC, "b\xc3\xa4\\+z",
+   "b\xc3\xa4rfoob\xc3\xa4\xc3\xa4z", 7, 1},
+  {RE_SYNTAX_POSIX_BASIC, "b\xc3\xa4\\?z", "b\xc3\xa4rfoob\xc3\xa4z", 7, 1},
+  {RE_SYNTAX_POSIX_BASIC, "b\xc3\xa4\\{1,2\\}z",
+   "b\xc3\xa4rfoob\xc3\xa4z", 7, 1},
   {RE_SYNTAX_POSIX_BASIC, "^x\\|xy*z$", "\xc3\xb6xyyz", 2, 1},
   {RE_SYNTAX_POSIX_BASIC, "^x\\\\y\\{6\\}z\\+", "x\\yyyyyyzz\xc3\xb6", 0, 1},
   {RE_SYNTAX_POSIX_BASIC, "^x\\\\y\\{2,36\\}z\\+", "x\\yzz\xc3\xb6", -1, 1},
   {RE_SYNTAX_POSIX_BASIC, "^x\\\\y\\{,3\\}z\\+", "x\\yyyzz\xc3\xb6", 0, 1},
+  {RE_SYNTAX_POSIX_BASIC, "^x\\|x\xc3\xa4*z$",
+   "\xc3\xb6x\xc3\xa4\xc3\xa4z", 2, 1},
+  {RE_SYNTAX_POSIX_BASIC, "^x\\\\\xc3\x84\\{6\\}z\\+",
+   "x\\\xc3\x84\xc3\x84\xc3\x84\xc3\x84\xc3\x84\xc3\x84zz\xc3\xb6", 0, 1},
+  {RE_SYNTAX_POSIX_BASIC, "^x\\\\\xc3\x84\\{2,36\\}z\\+",
+   "x\\\xc3\x84zz\xc3\xb6", -1, 1},
+  {RE_SYNTAX_POSIX_BASIC, "^x\\\\\xc3\x84\\{,3\\}z\\+",
+   "x\\\xc3\x84\xc3\x84\xc3\x84zz\xc3\xb6", 0, 1},
   {RE_SYNTAX_POSIX_BASIC, "x[C]y", "axCy", 1, 1},
   {RE_SYNTAX_POSIX_BASIC, "x[ABC]y", "axCy", 1, 1},
   {RE_SYNTAX_POSIX_BASIC, "\\`x\\|z\\'", "x\xe2\x80\x94", 0, 1},
   {RE_SYNTAX_POSIX_BASIC, "\\(xy\\)z\\1a\\1", "\xe2\x80\x94xyzxyaxy\xc3\x84", 3, 1},
   {RE_SYNTAX_POSIX_BASIC, "xy\\?z", "\xc3\x84xz\xc3\xb6", 2, 1},
+  {RE_SYNTAX_POSIX_BASIC, "\\`\xc3\x84\\|z\\'", "\xc3\x84\xe2\x80\x94", 0, 1},
+  {RE_SYNTAX_POSIX_BASIC, "\\(x\xc3\x84\\)z\\1\x61\\1",
+   "\xe2\x80\x94x\xc3\x84zx\xc3\x84\x61x\xc3\x84\xc3\x96", 3, 1},
+  {RE_SYNTAX_POSIX_BASIC, "x\xc3\x96\\?z", "\xc3\x84xz\xc3\xb6", 2, 1},
   {RE_SYNTAX_POSIX_EXTENDED, "foo", "b\xc3\xa4rfoob\xc3\xa4z", 4, 1},
   {RE_SYNTAX_POSIX_EXTENDED, "^x|xy*z$", "\xc3\xb6xyyz", 2, 1},
   {RE_SYNTAX_POSIX_EXTENDED, "^x\\\\y{6}z+", "x\\yyyyyyzz\xc3\xb6", 0, 1},
@@ -64,7 +84,6 @@ static struct
   {RE_SYNTAX_POSIX_EXTENDED, "xy?z", "\xc3\x84xz\xc3\xb6", 2, 1},
   /* Should not be optimized.  */
   {RE_SYNTAX_POSIX_BASIC, "x.y", "ax\xe2\x80\x94yz", 1, 0},
-  {RE_SYNTAX_POSIX_BASIC, "x\xc3\x96*y", "ax\xc3\x96\xc3\x96yz", 1, 0},
   {RE_SYNTAX_POSIX_BASIC, "x[\xc3\x84\xc3\xa4]y", "ax\xc3\xa4y", 1, 0},
   {RE_SYNTAX_POSIX_BASIC, "x[A-Z,]y", "axCy", 1, 0},
   {RE_SYNTAX_POSIX_BASIC, "x[^y]z", "ax\xe2\x80\x94z", 1, 0},
@@ -77,7 +96,6 @@ static struct
   {RE_SYNTAX_POSIX_BASIC, "a\\wz", "a\xc3\x84z", 0, 0},
   {RE_SYNTAX_POSIX_BASIC, "x\\Wz", "\xc3\x96x\xe2\x80\x94z", 2, 0},
   {RE_SYNTAX_POSIX_EXTENDED, "x.y", "ax\xe2\x80\x94yz", 1, 0},
-  {RE_SYNTAX_POSIX_EXTENDED, "x\xc3\x96*y", "ax\xc3\x96\xc3\x96yz", 1, 0},
   {RE_SYNTAX_POSIX_EXTENDED, "x[\xc3\x84\xc3\xa4]y", "ax\xc3\xa4y", 1, 0},
   {RE_SYNTAX_POSIX_EXTENDED, "x[A-Z,]y", "axCy", 1, 0},
   {RE_SYNTAX_POSIX_EXTENDED, "x[^y]z", "ax\xe2\x80\x94z", 1, 0},
--- libc/posix/regexec.c.jj	2003-11-12 21:33:35.000000000 +0100
+++ libc/posix/regexec.c	2003-11-17 13:10:29.000000000 +0100
@@ -3484,10 +3484,6 @@ check_node_accept_bytes (preg, node_idx,
   int elem_len = re_string_elem_size_at (input, str_idx);
   int char_len = re_string_char_size_at (input, str_idx);
   int i;
-# ifdef _LIBC
-  int j;
-  uint32_t nrules = _NL_CURRENT_WORD (LC_COLLATE, _NL_COLLATE_NRULES);
-# endif /* _LIBC */
   if (elem_len <= 1 && char_len <= 1)
     return 0;
   if (node->type == OP_PERIOD)
@@ -3506,6 +3502,8 @@ check_node_accept_bytes (preg, node_idx,
 # ifdef _LIBC
       const unsigned char *pin = ((char *) re_string_get_buffer (input)
 				  + str_idx);
+      int j;
+      uint32_t nrules;
 # endif /* _LIBC */
       int match_len = 0;
       wchar_t wc = ((cset->nranges || cset->nchar_classes || cset->nmbchars)
@@ -3530,6 +3528,7 @@ check_node_accept_bytes (preg, node_idx,
 	}
 
 # ifdef _LIBC
+      nrules = _NL_CURRENT_WORD (LC_COLLATE, _NL_COLLATE_NRULES);
       if (nrules != 0)
 	{
 	  unsigned int in_collseq = 0;

	Jakub

^ permalink raw reply	[flat|nested] 2+ messages in thread

* Re: [PATCH] Optimize also UTF-8 mb chars in regexs, a small optimization for regexec.c
  2003-11-17 13:47 [PATCH] Optimize also UTF-8 mb chars in regexs, a small optimization for regexec.c Jakub Jelinek
@ 2003-11-18  7:25 ` Ulrich Drepper
  0 siblings, 0 replies; 2+ messages in thread
From: Ulrich Drepper @ 2003-11-18  7:25 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Glibc hackers

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Jakub Jelinek wrote:

> 2003-11-17  Jakub Jelinek  <jakub@redhat.com>
> 
> 	* posix/regcomp.c (optimize_utf8): Optimize multi-byte chars as
> 	well.
> 	* posix/bug-regex20.c (tests): Add new tests.  Multi-byte char
> 	followed by dup operator is expected to be optimized.
> 
> 	* posix/regexec.c (check_node_accept_bytes): Move nrules and j
> 	variables to the block where they are only used, initialize
> 	nrules only immediately before using it.

I've applied the patch.  Thanks,

- -- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.2 (GNU/Linux)

iD8DBQE/uclF2ijCOnn/RHQRAr1kAJ9n7bkBUWIHjzIkd7F3TFbklG+SwQCgiOV/
MNc8a5hpPBalNzflLLhDXd4=
=Gwct
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 2+ messages in thread

end of thread, other threads:[~2003-11-18  7:25 UTC | newest]

Thread overview: 2+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2003-11-17 13:47 [PATCH] Optimize also UTF-8 mb chars in regexs, a small optimization for regexec.c Jakub Jelinek
2003-11-18  7:25 ` Ulrich Drepper

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).