[PATCH v2] Improve fnmatch performance.

public inbox for libc-alpha@sourceware.org
 help / color / mirror / Atom feed

* [PATCH v2] Improve fnmatch performance.
@ 2015-05-13  9:48 Ondřej Bílka
  2015-05-13 16:48 ` Paul Eggert
  0 siblings, 1 reply; 6+ messages in thread
From: Ondřej Bílka @ 2015-05-13  9:48 UTC (permalink / raw)
  To: libc-alpha; +Cc: eggert


Hi, this is revised improvement of fnmatch. It improves queries of
locate command around three times for UTF locale on x64 by finding start
of pattern with fast strstr.

It uses fast locale type check as introduced in strdiff.

How to synchronize this with gnulib? Only implementation specific detail
is utf8 detection.

There are several possible improvements. Main idea is reject strings as
fast as possible. If string is matched then likely you will do some
expensive operation with it.

This for simplicity just tries to find starting quarted of strstr.
it doesn't apply on patterns that start with special character.
Natural extension would be parse pattern, find quartet sequence that
must occur and do same strstr test. Second would be use whole characters
in casefold strstr with nonascii UTF.

Passes test, OK to commit this?

	* posix/fnmatch.c (fnmatch): Improve performance.

diff --git a/posix/fnmatch.c b/posix/fnmatch.c
index a707847..7152055 100644
--- a/posix/fnmatch.c
+++ b/posix/fnmatch.c
@@ -333,7 +333,48 @@ fnmatch (pattern, string, flags)
      int flags;
 {
 # if HANDLE_MULTIBYTE
-  if (__builtin_expect (MB_CUR_MAX, 1) != 1)
+
+  struct __locale_data *current = _NL_CURRENT_LOCALE->__locales[LC_COLLATE];
+  uint_fast32_t encoding =
+    current->values[_NL_ITEM_INDEX (_NL_COLLATE_ENCODING_TYPE)].word;
+
+  /* ASCII with \+/.*?[{(@! excluded.  */
+  static unsigned char normal[256] = {
+ 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
+ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
+ 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 
+ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,
+ 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
+ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1,
+ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
+ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1,
+ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
+ };
+
+  if (encoding == !__cet_other)
+    {
+      char start[16];
+      char *string2;
+      size_t i;
+      for (i = 0; i < 4 && normal[(unsigned char) pattern[i]]; i++)
+        start[i] = pattern[i];
+      start[i] = 0;
+      if (flags & FNM_CASEFOLD)
+        string2 = strcasestr (string, start);
+      else  
+        string2 = strstr (string, start);
+      if (!string2)
+        return FNM_NOMATCH;
+    }
+
+  if (MB_CUR_MAX != 1)
     {
       mbstate_t ps;
       size_t n;

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v2] Improve fnmatch performance.
  2015-05-13  9:48 [PATCH v2] Improve fnmatch performance Ondřej Bílka
@ 2015-05-13 16:48 ` Paul Eggert
  2015-06-09 14:47   ` [PATCH v3] " Ondřej Bílka
  0 siblings, 1 reply; 6+ messages in thread
From: Paul Eggert @ 2015-05-13 16:48 UTC (permalink / raw)
  To: Ondřej Bílka, libc-alpha

OndÅ™ej BÃlka wrote:
> How to synchronize this with gnulib? Only implementation specific detail
> is utf8 detection.

It could be something like this:

  #if _LIBC
   struct __locale_data *current = _NL_CURRENT_LOCALE->__locales[LC_COLLATE];
   uint_fast32_t encoding =
     current->values[_NL_ITEM_INDEX (_NL_COLLATE_ENCODING_TYPE)].word;
   bool is_utf8 = encoding == !__cet_other;
  #else
   bool is_utf8 = STRCASEEQ (locale_charset (),
                             "UTF-8", 'U','T','F','-','8',0,0,0,0)
  #endif

We should package this sort of thing up and make it easier to use, but that 
could be another day.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH v3] Improve fnmatch performance.
  2015-05-13 16:48 ` Paul Eggert
@ 2015-06-09 14:47   ` Ondřej Bílka
  2015-06-16 17:34     ` [PING][PATCH " Ondřej Bílka
  0 siblings, 1 reply; 6+ messages in thread
From: Ondřej Bílka @ 2015-06-09 14:47 UTC (permalink / raw)
  To: Paul Eggert; +Cc: libc-alpha

On Wed, May 13, 2015 at 09:48:27AM -0700, Paul Eggert wrote:
> OndÅ™ej BÃlka wrote:
> >How to synchronize this with gnulib? Only implementation specific detail
> >is utf8 detection.
> 
> It could be something like this:
> 
>  #if _LIBC
>   struct __locale_data *current = _NL_CURRENT_LOCALE->__locales[LC_COLLATE];
>   uint_fast32_t encoding =
>     current->values[_NL_ITEM_INDEX (_NL_COLLATE_ENCODING_TYPE)].word;
>   bool is_utf8 = encoding == !__cet_other;
>  #else
>   bool is_utf8 = STRCASEEQ (locale_charset (),
>                             "UTF-8", 'U','T','F','-','8',0,0,0,0)
>  #endif
> 
> We should package this sort of thing up and make it easier to use,
> but that could be another day.

Yes I will use that pattern, it needs to change details like that it
also works for single-byte encodings.

I also removed expect on MB_CUR_MAX as unicode is widespread.

Also I now return directly match when entire pattern is normal and
FNM_PERIOD or FNM_FILE_NAME wasn't set which could also help performance
a bit.

Then I could allow nonascii characters to start pattern unless its utf8
and you have FNM_CASEFOLD, would it be better to add two tables or check
for testing these?

	* posix/fnmatch.c (fnmatch): Improve performance.

diff --git a/posix/fnmatch.c b/posix/fnmatch.c
index fd85efa..4c32992 100644
--- a/posix/fnmatch.c
+++ b/posix/fnmatch.c
@@ -131,6 +131,13 @@ extern int fnmatch (const char *pattern, const char *string, int flags);
 #   define ISWCTYPE(WC, WT)	iswctype (WC, WT)
 #  endif
 
+#  ifdef _LIBC
+#   define STRCASESTR		__strcasestr
+#  else
+#   define STRCASESTR		strcasestr
+#  endif
+
+
 #  if (HAVE_MBSTATE_T && HAVE_MBSRTOWCS) || _LIBC
 /* In this case we are implementing the multibyte character handling.  */
 #   define HANDLE_MULTIBYTE	1
@@ -332,8 +339,62 @@ fnmatch (pattern, string, flags)
      const char *string;
      int flags;
 {
+
+  /* ASCII with \+/.*?[{(@! excluded.  */
+  static unsigned char normal[256] = {
+ 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
+ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
+ 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 
+ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,
+ 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
+ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1,
+ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
+ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1,
+ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
+ };
+#  if _LIBC
+   struct __locale_data *current = _NL_CURRENT_LOCALE->__locales[LC_COLLATE];
+   uint_fast32_t encoding =
+     current->values[_NL_ITEM_INDEX (_NL_COLLATE_ENCODING_TYPE)].word;
+   bool fast_encoding = (encoding != __cet_other);
+#  else
+#   if HANDLE_MULTIBYTE
+   bool is_utf8 = STRCASEEQ (locale_charset (),
+                             "UTF-8", 'U','T','F','-','8',0,0,0,0);
+   bool fast_encoding = (MB_CUR_MAX == 1) || is_utf;
+#   else
+   bool fast_encoding = true;
+#   endif
+#  endif
+
+  if (fast_encoding)
+    {
+      char start[8];
+      char *string2;
+      size_t i;
+      for (i = 0; i < 7 && normal[(unsigned char) pattern[i]]; i++)
+        start[i] = pattern[i];
+      start[i] = 0;
+      if (flags & FNM_CASEFOLD)
+        string2 = STRCASESTR (string, start);
+      else  
+        string2 = strstr (string, start);
+      if (!string2)
+        return FNM_NOMATCH;
+ 
+      if (pattern[i] == '\0' && (flags & (FNM_FILE_NAME | FNM_PERIOD)) == 0)
+        return 0; 
+    }
+
 # if HANDLE_MULTIBYTE
-  if (__builtin_expect (MB_CUR_MAX, 1) != 1)
+  if (MB_CUR_MAX != 1)
     {
       mbstate_t ps;
       size_t n;

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PING][PATCH v3] Improve fnmatch performance.
  2015-06-09 14:47   ` [PATCH v3] " Ondřej Bílka
@ 2015-06-16 17:34     ` Ondřej Bílka
  2015-08-12  5:41       ` [PING^2][PATCH " Ondřej Bílka
  0 siblings, 1 reply; 6+ messages in thread
From: Ondřej Bílka @ 2015-06-16 17:34 UTC (permalink / raw)
  To: Paul Eggert; +Cc: libc-alpha

On Tue, Jun 09, 2015 at 04:27:48PM +0200, OndÅ™ej BÃlka wrote:
> On Wed, May 13, 2015 at 09:48:27AM -0700, Paul Eggert wrote:
> > OndÅ™ej BÃlka wrote:
> > >How to synchronize this with gnulib? Only implementation specific detail
> > >is utf8 detection.
> > 
> > It could be something like this:
> > 
> >  #if _LIBC
> >   struct __locale_data *current = _NL_CURRENT_LOCALE->__locales[LC_COLLATE];
> >   uint_fast32_t encoding =
> >     current->values[_NL_ITEM_INDEX (_NL_COLLATE_ENCODING_TYPE)].word;
> >   bool is_utf8 = encoding == !__cet_other;
> >  #else
> >   bool is_utf8 = STRCASEEQ (locale_charset (),
> >                             "UTF-8", 'U','T','F','-','8',0,0,0,0)
> >  #endif
> > 
> > We should package this sort of thing up and make it easier to use,
> > but that could be another day.
> 
> Yes I will use that pattern, it needs to change details like that it
> also works for single-byte encodings.
> 
> I also removed expect on MB_CUR_MAX as unicode is widespread.
> 
> Also I now return directly match when entire pattern is normal and
> FNM_PERIOD or FNM_FILE_NAME wasn't set which could also help performance
> a bit.
> 
> Then I could allow nonascii characters to start pattern unless its utf8
> and you have FNM_CASEFOLD, would it be better to add two tables or check
> for testing these?
> 
> 	* posix/fnmatch.c (fnmatch): Improve performance.
> 
> diff --git a/posix/fnmatch.c b/posix/fnmatch.c
> index fd85efa..4c32992 100644
> --- a/posix/fnmatch.c
> +++ b/posix/fnmatch.c
> @@ -131,6 +131,13 @@ extern int fnmatch (const char *pattern, const char *string, int flags);
>  #   define ISWCTYPE(WC, WT)	iswctype (WC, WT)
>  #  endif
>  
> +#  ifdef _LIBC
> +#   define STRCASESTR		__strcasestr
> +#  else
> +#   define STRCASESTR		strcasestr
> +#  endif
> +
> +
>  #  if (HAVE_MBSTATE_T && HAVE_MBSRTOWCS) || _LIBC
>  /* In this case we are implementing the multibyte character handling.  */
>  #   define HANDLE_MULTIBYTE	1
> @@ -332,8 +339,62 @@ fnmatch (pattern, string, flags)
>       const char *string;
>       int flags;
>  {
> +
> +  /* ASCII with \+/.*?[{(@! excluded.  */
> +  static unsigned char normal[256] = {
> + 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
> + 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
> + 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 
> + 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,
> + 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
> + 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1,
> + 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
> + 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1,
> + 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> + 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> + 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> + 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> + 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> + 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> + 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> + 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
> + };
> +#  if _LIBC
> +   struct __locale_data *current = _NL_CURRENT_LOCALE->__locales[LC_COLLATE];
> +   uint_fast32_t encoding =
> +     current->values[_NL_ITEM_INDEX (_NL_COLLATE_ENCODING_TYPE)].word;
> +   bool fast_encoding = (encoding != __cet_other);
> +#  else
> +#   if HANDLE_MULTIBYTE
> +   bool is_utf8 = STRCASEEQ (locale_charset (),
> +                             "UTF-8", 'U','T','F','-','8',0,0,0,0);
> +   bool fast_encoding = (MB_CUR_MAX == 1) || is_utf;
> +#   else
> +   bool fast_encoding = true;
> +#   endif
> +#  endif
> +
> +  if (fast_encoding)
> +    {
> +      char start[8];
> +      char *string2;
> +      size_t i;
> +      for (i = 0; i < 7 && normal[(unsigned char) pattern[i]]; i++)
> +        start[i] = pattern[i];
> +      start[i] = 0;
> +      if (flags & FNM_CASEFOLD)
> +        string2 = STRCASESTR (string, start);
> +      else  
> +        string2 = strstr (string, start);
> +      if (!string2)
> +        return FNM_NOMATCH;
> + 
> +      if (pattern[i] == '\0' && (flags & (FNM_FILE_NAME | FNM_PERIOD)) == 0)
> +        return 0; 
> +    }
> +
>  # if HANDLE_MULTIBYTE
> -  if (__builtin_expect (MB_CUR_MAX, 1) != 1)
> +  if (MB_CUR_MAX != 1)
>      {
>        mbstate_t ps;
>        size_t n;

-- 

We need a licensed electrician to replace the light bulbs in the computer room.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PING^2][PATCH v3] Improve fnmatch performance.
  2015-06-16 17:34     ` [PING][PATCH " Ondřej Bílka
@ 2015-08-12  5:41       ` Ondřej Bílka
  2015-08-19  9:32         ` [PING^3][PATCH " Ondřej Bílka
  0 siblings, 1 reply; 6+ messages in thread
From: Ondřej Bílka @ 2015-08-12  5:41 UTC (permalink / raw)
  To: Paul Eggert; +Cc: libc-alpha

ping
On Tue, Jun 16, 2015 at 07:30:17PM +0200, OndÅ™ej BÃlka wrote:
> On Tue, Jun 09, 2015 at 04:27:48PM +0200, OndÅ™ej BÃlka wrote:
> > On Wed, May 13, 2015 at 09:48:27AM -0700, Paul Eggert wrote:
> > > OndÅ™ej BÃlka wrote:
> > > >How to synchronize this with gnulib? Only implementation specific detail
> > > >is utf8 detection.
> > > 
> > > It could be something like this:
> > > 
> > >  #if _LIBC
> > >   struct __locale_data *current = _NL_CURRENT_LOCALE->__locales[LC_COLLATE];
> > >   uint_fast32_t encoding =
> > >     current->values[_NL_ITEM_INDEX (_NL_COLLATE_ENCODING_TYPE)].word;
> > >   bool is_utf8 = encoding == !__cet_other;
> > >  #else
> > >   bool is_utf8 = STRCASEEQ (locale_charset (),
> > >                             "UTF-8", 'U','T','F','-','8',0,0,0,0)
> > >  #endif
> > > 
> > > We should package this sort of thing up and make it easier to use,
> > > but that could be another day.
> > 
> > Yes I will use that pattern, it needs to change details like that it
> > also works for single-byte encodings.
> > 
> > I also removed expect on MB_CUR_MAX as unicode is widespread.
> > 
> > Also I now return directly match when entire pattern is normal and
> > FNM_PERIOD or FNM_FILE_NAME wasn't set which could also help performance
> > a bit.
> > 
> > Then I could allow nonascii characters to start pattern unless its utf8
> > and you have FNM_CASEFOLD, would it be better to add two tables or check
> > for testing these?
> > 
> > 	* posix/fnmatch.c (fnmatch): Improve performance.
> > 
> > diff --git a/posix/fnmatch.c b/posix/fnmatch.c
> > index fd85efa..4c32992 100644
> > --- a/posix/fnmatch.c
> > +++ b/posix/fnmatch.c
> > @@ -131,6 +131,13 @@ extern int fnmatch (const char *pattern, const char *string, int flags);
> >  #   define ISWCTYPE(WC, WT)	iswctype (WC, WT)
> >  #  endif
> >  
> > +#  ifdef _LIBC
> > +#   define STRCASESTR		__strcasestr
> > +#  else
> > +#   define STRCASESTR		strcasestr
> > +#  endif
> > +
> > +
> >  #  if (HAVE_MBSTATE_T && HAVE_MBSRTOWCS) || _LIBC
> >  /* In this case we are implementing the multibyte character handling.  */
> >  #   define HANDLE_MULTIBYTE	1
> > @@ -332,8 +339,62 @@ fnmatch (pattern, string, flags)
> >       const char *string;
> >       int flags;
> >  {
> > +
> > +  /* ASCII with \+/.*?[{(@! excluded.  */
> > +  static unsigned char normal[256] = {
> > + 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
> > + 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
> > + 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 
> > + 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,
> > + 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
> > + 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1,
> > + 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
> > + 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1,
> > + 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> > + 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> > + 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> > + 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> > + 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> > + 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> > + 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> > + 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
> > + };
> > +#  if _LIBC
> > +   struct __locale_data *current = _NL_CURRENT_LOCALE->__locales[LC_COLLATE];
> > +   uint_fast32_t encoding =
> > +     current->values[_NL_ITEM_INDEX (_NL_COLLATE_ENCODING_TYPE)].word;
> > +   bool fast_encoding = (encoding != __cet_other);
> > +#  else
> > +#   if HANDLE_MULTIBYTE
> > +   bool is_utf8 = STRCASEEQ (locale_charset (),
> > +                             "UTF-8", 'U','T','F','-','8',0,0,0,0);
> > +   bool fast_encoding = (MB_CUR_MAX == 1) || is_utf;
> > +#   else
> > +   bool fast_encoding = true;
> > +#   endif
> > +#  endif
> > +
> > +  if (fast_encoding)
> > +    {
> > +      char start[8];
> > +      char *string2;
> > +      size_t i;
> > +      for (i = 0; i < 7 && normal[(unsigned char) pattern[i]]; i++)
> > +        start[i] = pattern[i];
> > +      start[i] = 0;
> > +      if (flags & FNM_CASEFOLD)
> > +        string2 = STRCASESTR (string, start);
> > +      else  
> > +        string2 = strstr (string, start);
> > +      if (!string2)
> > +        return FNM_NOMATCH;
> > + 
> > +      if (pattern[i] == '\0' && (flags & (FNM_FILE_NAME | FNM_PERIOD)) == 0)
> > +        return 0; 
> > +    }
> > +
> >  # if HANDLE_MULTIBYTE
> > -  if (__builtin_expect (MB_CUR_MAX, 1) != 1)
> > +  if (MB_CUR_MAX != 1)
> >      {
> >        mbstate_t ps;
> >        size_t n;
> 

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PING^3][PATCH v3] Improve fnmatch performance.
  2015-08-12  5:41       ` [PING^2][PATCH " Ondřej Bílka
@ 2015-08-19  9:32         ` Ondřej Bílka
  0 siblings, 0 replies; 6+ messages in thread
From: Ondřej Bílka @ 2015-08-19  9:32 UTC (permalink / raw)
  To: Paul Eggert; +Cc: libc-alpha

ping
On Wed, Aug 12, 2015 at 07:41:20AM +0200, OndÅ™ej BÃlka wrote:
> ping
> On Tue, Jun 16, 2015 at 07:30:17PM +0200, OndÅ™ej BÃlka wrote:
> > On Tue, Jun 09, 2015 at 04:27:48PM +0200, OndÅ™ej BÃlka wrote:
> > > On Wed, May 13, 2015 at 09:48:27AM -0700, Paul Eggert wrote:
> > > > OndÅ™ej BÃlka wrote:
> > > > >How to synchronize this with gnulib? Only implementation specific detail
> > > > >is utf8 detection.
> > > > 
> > > > It could be something like this:
> > > > 
> > > >  #if _LIBC
> > > >   struct __locale_data *current = _NL_CURRENT_LOCALE->__locales[LC_COLLATE];
> > > >   uint_fast32_t encoding =
> > > >     current->values[_NL_ITEM_INDEX (_NL_COLLATE_ENCODING_TYPE)].word;
> > > >   bool is_utf8 = encoding == !__cet_other;
> > > >  #else
> > > >   bool is_utf8 = STRCASEEQ (locale_charset (),
> > > >                             "UTF-8", 'U','T','F','-','8',0,0,0,0)
> > > >  #endif
> > > > 
> > > > We should package this sort of thing up and make it easier to use,
> > > > but that could be another day.
> > > 
> > > Yes I will use that pattern, it needs to change details like that it
> > > also works for single-byte encodings.
> > > 
> > > I also removed expect on MB_CUR_MAX as unicode is widespread.
> > > 
> > > Also I now return directly match when entire pattern is normal and
> > > FNM_PERIOD or FNM_FILE_NAME wasn't set which could also help performance
> > > a bit.
> > > 
> > > Then I could allow nonascii characters to start pattern unless its utf8
> > > and you have FNM_CASEFOLD, would it be better to add two tables or check
> > > for testing these?
> > > 
> > > 	* posix/fnmatch.c (fnmatch): Improve performance.
> > > 
> > > diff --git a/posix/fnmatch.c b/posix/fnmatch.c
> > > index fd85efa..4c32992 100644
> > > --- a/posix/fnmatch.c
> > > +++ b/posix/fnmatch.c
> > > @@ -131,6 +131,13 @@ extern int fnmatch (const char *pattern, const char *string, int flags);
> > >  #   define ISWCTYPE(WC, WT)	iswctype (WC, WT)
> > >  #  endif
> > >  
> > > +#  ifdef _LIBC
> > > +#   define STRCASESTR		__strcasestr
> > > +#  else
> > > +#   define STRCASESTR		strcasestr
> > > +#  endif
> > > +
> > > +
> > >  #  if (HAVE_MBSTATE_T && HAVE_MBSRTOWCS) || _LIBC
> > >  /* In this case we are implementing the multibyte character handling.  */
> > >  #   define HANDLE_MULTIBYTE	1
> > > @@ -332,8 +339,62 @@ fnmatch (pattern, string, flags)
> > >       const char *string;
> > >       int flags;
> > >  {
> > > +
> > > +  /* ASCII with \+/.*?[{(@! excluded.  */
> > > +  static unsigned char normal[256] = {
> > > + 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
> > > + 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
> > > + 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 
> > > + 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,
> > > + 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
> > > + 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1,
> > > + 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
> > > + 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1,
> > > + 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> > > + 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> > > + 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> > > + 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> > > + 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> > > + 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> > > + 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> > > + 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
> > > + };
> > > +#  if _LIBC
> > > +   struct __locale_data *current = _NL_CURRENT_LOCALE->__locales[LC_COLLATE];
> > > +   uint_fast32_t encoding =
> > > +     current->values[_NL_ITEM_INDEX (_NL_COLLATE_ENCODING_TYPE)].word;
> > > +   bool fast_encoding = (encoding != __cet_other);
> > > +#  else
> > > +#   if HANDLE_MULTIBYTE
> > > +   bool is_utf8 = STRCASEEQ (locale_charset (),
> > > +                             "UTF-8", 'U','T','F','-','8',0,0,0,0);
> > > +   bool fast_encoding = (MB_CUR_MAX == 1) || is_utf;
> > > +#   else
> > > +   bool fast_encoding = true;
> > > +#   endif
> > > +#  endif
> > > +
> > > +  if (fast_encoding)
> > > +    {
> > > +      char start[8];
> > > +      char *string2;
> > > +      size_t i;
> > > +      for (i = 0; i < 7 && normal[(unsigned char) pattern[i]]; i++)
> > > +        start[i] = pattern[i];
> > > +      start[i] = 0;
> > > +      if (flags & FNM_CASEFOLD)
> > > +        string2 = STRCASESTR (string, start);
> > > +      else  
> > > +        string2 = strstr (string, start);
> > > +      if (!string2)
> > > +        return FNM_NOMATCH;
> > > + 
> > > +      if (pattern[i] == '\0' && (flags & (FNM_FILE_NAME | FNM_PERIOD)) == 0)
> > > +        return 0; 
> > > +    }
> > > +
> > >  # if HANDLE_MULTIBYTE
> > > -  if (__builtin_expect (MB_CUR_MAX, 1) != 1)
> > > +  if (MB_CUR_MAX != 1)
> > >      {
> > >        mbstate_t ps;
> > >        size_t n;
> > 

-- 

Quantum dynamics are affecting the transistors

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2015-08-19  9:32 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-05-13  9:48 [PATCH v2] Improve fnmatch performance Ondřej Bílka
2015-05-13 16:48 ` Paul Eggert
2015-06-09 14:47   ` [PATCH v3] " Ondřej Bílka
2015-06-16 17:34     ` [PING][PATCH " Ondřej Bílka
2015-08-12  5:41       ` [PING^2][PATCH " Ondřej Bílka
2015-08-19  9:32         ` [PING^3][PATCH " Ondřej Bílka

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).