[PATCH 1/3] Use generic strchrnul for strchr

public inbox for libc-alpha@sourceware.org
 help / color / mirror / Atom feed

* [PATCH 1/3] Use generic strchrnul for strchr
@ 2015-05-26 18:48 Ondřej Bílka
  2015-05-26 18:51 ` [PATCH 2/3] Optimize strchrnul with unrolling, better header and unaligned loads Ondřej Bílka
  2015-05-26 18:55 ` [RFC PATCH 3/3] Exploit that strchr needle is mostly ascii Ondřej Bílka
  0 siblings, 2 replies; 9+ messages in thread
From: Ondřej Bílka @ 2015-05-26 18:48 UTC (permalink / raw)
  To: libc-alpha

Hi, as I wrote about strchr optimized implementations here is first
part that removes duplicating implementations and letting gcc do job by
inlining strchrnul. 

	* string/strchr.c: Use strchrnul implementation.
	* string/strchrnul.c: Add inclusion guard.

---
 string/strchr.c    | 156 +++--------------------------------------------------
 string/strchrnul.c |   3 ++
 2 files changed, 11 insertions(+), 148 deletions(-)

diff --git a/string/strchr.c b/string/strchr.c
index 5f90075..f7327ad 100644
--- a/string/strchr.c
+++ b/string/strchr.c
@@ -25,158 +25,18 @@
 
 #undef strchr
 
+
+/* Include strchrnul and let gcc inline it.  */
+#define AS_STRCHRNUL
+#define STRCHRNUL strchrnul_i
+#include "string/strchrnul.c"
+
 /* Find the first occurrence of C in S.  */
 char *
 strchr (const char *s, int c_in)
 {
-  const unsigned char *char_ptr;
-  const unsigned long int *longword_ptr;
-  unsigned long int longword, magic_bits, charmask;
-  unsigned char c;
-
-  c = (unsigned char) c_in;
-
-  /* Handle the first few characters by reading one character at a time.
-     Do this until CHAR_PTR is aligned on a longword boundary.  */
-  for (char_ptr = (const unsigned char *) s;
-       ((unsigned long int) char_ptr & (sizeof (longword) - 1)) != 0;
-       ++char_ptr)
-    if (*char_ptr == c)
-      return (void *) char_ptr;
-    else if (*char_ptr == '\0')
-      return NULL;
-
-  /* All these elucidatory comments refer to 4-byte longwords,
-     but the theory applies equally well to 8-byte longwords.  */
-
-  longword_ptr = (unsigned long int *) char_ptr;
-
-  /* Bits 31, 24, 16, and 8 of this number are zero.  Call these bits
-     the "holes."  Note that there is a hole just to the left of
-     each byte, with an extra at the end:
-
-     bits:  01111110 11111110 11111110 11111111
-     bytes: AAAAAAAA BBBBBBBB CCCCCCCC DDDDDDDD
-
-     The 1-bits make sure that carries propagate to the next 0-bit.
-     The 0-bits provide holes for carries to fall into.  */
-  switch (sizeof (longword))
-    {
-    case 4: magic_bits = 0x7efefeffL; break;
-    case 8: magic_bits = ((0x7efefefeL << 16) << 16) | 0xfefefeffL; break;
-    default:
-      abort ();
-    }
-
-  /* Set up a longword, each of whose bytes is C.  */
-  charmask = c | (c << 8);
-  charmask |= charmask << 16;
-  if (sizeof (longword) > 4)
-    /* Do the shift in two steps to avoid a warning if long has 32 bits.  */
-    charmask |= (charmask << 16) << 16;
-  if (sizeof (longword) > 8)
-    abort ();
-
-  /* Instead of the traditional loop which tests each character,
-     we will test a longword at a time.  The tricky part is testing
-     if *any of the four* bytes in the longword in question are zero.  */
-  for (;;)
-    {
-      /* We tentatively exit the loop if adding MAGIC_BITS to
-	 LONGWORD fails to change any of the hole bits of LONGWORD.
-
-	 1) Is this safe?  Will it catch all the zero bytes?
-	 Suppose there is a byte with all zeros.  Any carry bits
-	 propagating from its left will fall into the hole at its
-	 least significant bit and stop.  Since there will be no
-	 carry from its most significant bit, the LSB of the
-	 byte to the left will be unchanged, and the zero will be
-	 detected.
-
-	 2) Is this worthwhile?  Will it ignore everything except
-	 zero bytes?  Suppose every byte of LONGWORD has a bit set
-	 somewhere.  There will be a carry into bit 8.  If bit 8
-	 is set, this will carry into bit 16.  If bit 8 is clear,
-	 one of bits 9-15 must be set, so there will be a carry
-	 into bit 16.  Similarly, there will be a carry into bit
-	 24.  If one of bits 24-30 is set, there will be a carry
-	 into bit 31, so all of the hole bits will be changed.
-
-	 The one misfire occurs when bits 24-30 are clear and bit
-	 31 is set; in this case, the hole at bit 31 is not
-	 changed.  If we had access to the processor carry flag,
-	 we could close this loophole by putting the fourth hole
-	 at bit 32!
-
-	 So it ignores everything except 128's, when they're aligned
-	 properly.
-
-	 3) But wait!  Aren't we looking for C as well as zero?
-	 Good point.  So what we do is XOR LONGWORD with a longword,
-	 each of whose bytes is C.  This turns each byte that is C
-	 into a zero.  */
-
-      longword = *longword_ptr++;
-
-      /* Add MAGIC_BITS to LONGWORD.  */
-      if ((((longword + magic_bits)
-
-	    /* Set those bits that were unchanged by the addition.  */
-	    ^ ~longword)
-
-	   /* Look at only the hole bits.  If any of the hole bits
-	      are unchanged, most likely one of the bytes was a
-	      zero.  */
-	   & ~magic_bits) != 0 ||
-
-	  /* That caught zeroes.  Now test for C.  */
-	  ((((longword ^ charmask) + magic_bits) ^ ~(longword ^ charmask))
-	   & ~magic_bits) != 0)
-	{
-	  /* Which of the bytes was C or zero?
-	     If none of them were, it was a misfire; continue the search.  */
-
-	  const unsigned char *cp = (const unsigned char *) (longword_ptr - 1);
-
-	  if (*cp == c)
-	    return (char *) cp;
-	  else if (*cp == '\0')
-	    return NULL;
-	  if (*++cp == c)
-	    return (char *) cp;
-	  else if (*cp == '\0')
-	    return NULL;
-	  if (*++cp == c)
-	    return (char *) cp;
-	  else if (*cp == '\0')
-	    return NULL;
-	  if (*++cp == c)
-	    return (char *) cp;
-	  else if (*cp == '\0')
-	    return NULL;
-	  if (sizeof (longword) > 4)
-	    {
-	      if (*++cp == c)
-		return (char *) cp;
-	      else if (*cp == '\0')
-		return NULL;
-	      if (*++cp == c)
-		return (char *) cp;
-	      else if (*cp == '\0')
-		return NULL;
-	      if (*++cp == c)
-		return (char *) cp;
-	      else if (*cp == '\0')
-		return NULL;
-	      if (*++cp == c)
-		return (char *) cp;
-	      else if (*cp == '\0')
-		return NULL;
-	    }
-	}
-    }
-
-  return NULL;
+  char *c = strchnul_i (s, c_in);
+  return *c ? c : NULL;
 }
 
 #ifdef weak_alias
diff --git a/string/strchrnul.c b/string/strchrnul.c
index 2678f1d..bbe24cd 100644
--- a/string/strchrnul.c
+++ b/string/strchrnul.c
@@ -32,6 +32,9 @@
 #endif
 
 /* Find the first occurrence of C in S or the final NUL byte.  */
+#ifdef AS_STRCHRNUL
+static __always_inline
+#endif
 char *
 STRCHRNUL (s, c_in)
      const char *s;
-- 
1.8.4.rc3

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH 2/3] Optimize strchrnul with unrolling, better header and unaligned loads
  2015-05-26 18:48 [PATCH 1/3] Use generic strchrnul for strchr Ondřej Bílka
@ 2015-05-26 18:51 ` Ondřej Bílka
  2015-05-28 17:45   ` Joseph Myers
  2015-05-26 18:55 ` [RFC PATCH 3/3] Exploit that strchr needle is mostly ascii Ondřej Bílka
  1 sibling, 1 reply; 9+ messages in thread
From: Ondřej Bílka @ 2015-05-26 18:51 UTC (permalink / raw)
  To: libc-alpha

Now this is real optimization of strchrnul.
This should be faster thanks to unrolling.

Second improvement is for practical strings. Most of these have match in
less than 32 bytes so we should avoid loop there.

If unaligned loads are available they should be used, I found that one
bottleneck on first checks that was too often false because it loaded
just few bytes.

Are these architectures only ones? I know that x64 forgotten to add
header so there may be others.

sysdeps/m68k/m680x0/m68020/bits/string.h:#define _STRING_ARCH_unaligned 1
sysdeps/s390/bits/string.h:#define _STRING_ARCH_unaligned       1
sysdeps/x86/bits/string.h:#define _STRING_ARCH_unaligned        1


Also checking start byte-by-byte is ineffective a better is just mask
bytes before start. It relies now on fast ffs, if it isn't case there
are other ways to find that quickly (binary search or multiplication).

There are several details that would make it faster but less readable so
I omitted them now. One would be add goto to end of function as gcc
likes to duplicate identical tail five times instead just jumping there.

Could arch mainainers test this and report if its improvement?

	* string/strchrnul.c (STRCHRNUL): Optimize implementation.

diff --git a/string/strchrnul.c b/string/strchrnul.c
index 4986c5d..ea91195 100644
--- a/string/strchrnul.c
+++ b/string/strchrnul.c
@@ -21,9 +21,8 @@
    <http://www.gnu.org/licenses/>.  */
 
 #include <string.h>
-#include <memcopy.h>
-#include <stdlib.h>
-
+#include <libc-internal.h>
+#include <stdint.h>
 #undef __strchrnul
 #undef strchrnul
 
@@ -31,146 +30,144 @@
 # define STRCHRNUL __strchrnul
 #endif
 
-/* Find the first occurrence of C in S or the final NUL byte.  */
-#ifdef AS_STRCHRNUL
-static __always_inline
-#endif
-char *
-STRCHRNUL (s, c_in)
-     const char *s;
-     int c_in;
-{
-  const unsigned char *char_ptr;
-  const unsigned long int *longword_ptr;
-  unsigned long int longword, magic_bits, charmask;
-  unsigned char c;
-
-  c = (unsigned char) c_in;
-
-  /* Handle the first few characters by reading one character at a time.
-     Do this until CHAR_PTR is aligned on a longword boundary.  */
-  for (char_ptr = (const unsigned char *) s;
-       ((unsigned long int) char_ptr & (sizeof (longword) - 1)) != 0;
-       ++char_ptr)
-    if (*char_ptr == c || *char_ptr == '\0')
-      return (void *) char_ptr;
+static const unsigned long int ones = (~0UL / 255); /* 0x0101...*/
+static const unsigned long int add = 127 * (~0UL / 255);
+static const unsigned long int high_bits = 128 * (~0UL / 255);
 
-  /* All these elucidatory comments refer to 4-byte longwords,
-     but the theory applies equally well to 8-byte longwords.  */
+/* Use vector arithmetic tricks. Idea is to take expression works on
+   unsigned byte and evaluates 0 for nozero byte and nonzero on zero byte.
+   Our expression is ((s + 127) ^ (~s)) & 128  
+   Now we evaluate this expression on each byte in parallel and on first 
+   nonzero byte our expression will have nonzero value. */
 
-  longword_ptr = (unsigned long int *) char_ptr;
+static always_inline
+unsigned long int 
+contains_zero (unsigned long int s)
+{
+  return ((s + add) ^ ~s) & high_bits;
+}
 
-  /* Bits 31, 24, 16, and 8 of this number are zero.  Call these bits
-     the "holes."  Note that there is a hole just to the left of
-     each byte, with an extra at the end:
+static always_inline
+int
+found_in_long_bytes(char *s, unsigned long int cmask, char **result)
+{
+  const unsigned long int *lptr = (const unsigned long int *) s;
+  unsigned long int mask = contains_zero (*lptr) | contains_zero (*lptr ^ cmask);
+  if (mask)
+    {
+      *result = s + ffsl (mask) / 8 - 1;
+      return 1;
+    }
+  else
+    return 0;
+}
 
-     bits:  01111110 11111110 11111110 11111111
-     bytes: AAAAAAAA BBBBBBBB CCCCCCCC DDDDDDDD
+#define LSIZE sizeof (unsigned long int)
+#define CROSS_PAGE(x, n) (((uintptr_t)x)%4096 >= 4096 - n)
 
-     The 1-bits make sure that carries propagate to the next 0-bit.
-     The 0-bits provide holes for carries to fall into.  */
-  switch (sizeof (longword))
+/* Find the first occurrence of C in S or the final NUL byte.  */
+#ifdef AS_STRCHR
+static always_inline
+#endif
+char *
+STRCHRNUL (const char *s_in, int c_in)
+{
+  unsigned long int mask;
+  const unsigned long int *lptr;
+  char *s = (char *) s_in;
+  unsigned char c = (unsigned char) c_in;
+  char *r;
+  unsigned long int cmask = c * ones;
+
+#if _STRING_ARCH_unaligned
+  /* We fetch 32 bytes while not crossing page boundary. 
+     Most strings in practice are of that size and we avoid a loop.
+     This looks as best in practice, alternative below uses aligned load 
+     but is slower when string starts just few 
+     bytes before 32 byte boundary. A tradeoff is that we rarely could 
+     fetch extra cache line without needing it but this optimization 
+     does pay for that. */
+  if (!CROSS_PAGE(s, 32))
     {
-    case 4: magic_bits = 0x7efefeffL; break;
-    case 8: magic_bits = ((0x7efefefeL << 16) << 16) | 0xfefefeffL; break;
-    default:
-      abort ();
+      if (found_in_long_bytes (s + 0 * LSIZE, cmask, &r))
+        return r;
+      if (found_in_long_bytes (s + 1 * LSIZE, cmask, &r))
+        return r;
+      if (found_in_long_bytes (s + 2 * LSIZE, cmask, &r))
+        return r;
+      if (found_in_long_bytes (s + 3 * LSIZE, cmask, &r))
+        return r;
+      if (sizeof (unsigned long int) == 4)
+        {
+          if (found_in_long_bytes (s + 0 * LSIZE, cmask, &r))
+            return r;
+          if (found_in_long_bytes (s + 1 * LSIZE, cmask, &r))
+            return r;
+          if (found_in_long_bytes (s + 2 * LSIZE, cmask, &r))
+            return r;
+          if (found_in_long_bytes (s + 3 * LSIZE, cmask, &r))
+            return r;
+        }
     }
+  else
+    {
+#endif
+  /* We need use aligned loads. For first load we read some bytes before 
+     start that we discard by shifting them down. */
+ 
+      char *s_aligned = PTR_ALIGN_DOWN (s, LSIZE);
+      lptr = (const unsigned long int *) s_aligned;
+      mask = (contains_zero (*lptr) 
+              | contains_zero (*lptr ^ cmask))
+             >> (8 * (s_aligned - s));
+
+      if (mask)
+        return s + ffsl (mask) / 8 - 1;
+
+      if (found_in_long_bytes (s + 1 * LSIZE, cmask, &r))
+        return r;
+      if (found_in_long_bytes (s + 2 * LSIZE, cmask, &r))
+        return r;
+      if (found_in_long_bytes (s + 3 * LSIZE, cmask, &r))
+        return r;
+#if _STRING_ARCH_unaligned
+    }
+#endif
+   /* Now we read enough bytes to start a loop.  */
 
-  /* Set up a longword, each of whose bytes is C.  */
-  charmask = c | (c << 8);
-  charmask |= charmask << 16;
-  if (sizeof (longword) > 4)
-    /* Do the shift in two steps to avoid a warning if long has 32 bits.  */
-    charmask |= (charmask << 16) << 16;
-  if (sizeof (longword) > 8)
-    abort ();
-
-  /* Instead of the traditional loop which tests each character,
-     we will test a longword at a time.  The tricky part is testing
-     if *any of the four* bytes in the longword in question are zero.  */
-  for (;;)
+  char *s_loop = PTR_ALIGN_DOWN (s, 4 * LSIZE);
+  do 
     {
-      /* We tentatively exit the loop if adding MAGIC_BITS to
-	 LONGWORD fails to change any of the hole bits of LONGWORD.
-
-	 1) Is this safe?  Will it catch all the zero bytes?
-	 Suppose there is a byte with all zeros.  Any carry bits
-	 propagating from its left will fall into the hole at its
-	 least significant bit and stop.  Since there will be no
-	 carry from its most significant bit, the LSB of the
-	 byte to the left will be unchanged, and the zero will be
-	 detected.
-
-	 2) Is this worthwhile?  Will it ignore everything except
-	 zero bytes?  Suppose every byte of LONGWORD has a bit set
-	 somewhere.  There will be a carry into bit 8.  If bit 8
-	 is set, this will carry into bit 16.  If bit 8 is clear,
-	 one of bits 9-15 must be set, so there will be a carry
-	 into bit 16.  Similarly, there will be a carry into bit
-	 24.  If one of bits 24-30 is set, there will be a carry
-	 into bit 31, so all of the hole bits will be changed.
-
-	 The one misfire occurs when bits 24-30 are clear and bit
-	 31 is set; in this case, the hole at bit 31 is not
-	 changed.  If we had access to the processor carry flag,
-	 we could close this loophole by putting the fourth hole
-	 at bit 32!
-
-	 So it ignores everything except 128's, when they're aligned
-	 properly.
-
-	 3) But wait!  Aren't we looking for C as well as zero?
-	 Good point.  So what we do is XOR LONGWORD with a longword,
-	 each of whose bytes is C.  This turns each byte that is C
-	 into a zero.  */
-
-      longword = *longword_ptr++;
-
-      /* Add MAGIC_BITS to LONGWORD.  */
-      if ((((longword + magic_bits)
-
-	    /* Set those bits that were unchanged by the addition.  */
-	    ^ ~longword)
-
-	   /* Look at only the hole bits.  If any of the hole bits
-	      are unchanged, most likely one of the bytes was a
-	      zero.  */
-	   & ~magic_bits) != 0 ||
-
-	  /* That caught zeroes.  Now test for C.  */
-	  ((((longword ^ charmask) + magic_bits) ^ ~(longword ^ charmask))
-	   & ~magic_bits) != 0)
-	{
-	  /* Which of the bytes was C or zero?
-	     If none of them were, it was a misfire; continue the search.  */
-
-	  const unsigned char *cp = (const unsigned char *) (longword_ptr - 1);
-
-	  if (*cp == c || *cp == '\0')
-	    return (char *) cp;
-	  if (*++cp == c || *cp == '\0')
-	    return (char *) cp;
-	  if (*++cp == c || *cp == '\0')
-	    return (char *) cp;
-	  if (*++cp == c || *cp == '\0')
-	    return (char *) cp;
-	  if (sizeof (longword) > 4)
-	    {
-	      if (*++cp == c || *cp == '\0')
-		return (char *) cp;
-	      if (*++cp == c || *cp == '\0')
-		return (char *) cp;
-	      if (*++cp == c || *cp == '\0')
-		return (char *) cp;
-	      if (*++cp == c || *cp == '\0')
-		return (char *) cp;
-	    }
-	}
+      s_aligned += 4 * LSIZE;
+      lptr = (const unsigned long int *) (s_loop + 0 * LSIZE);
+      mask = contains_zero (*lptr) | contains_zero (*lptr ^ cmask);
+      lptr = (const unsigned long int *) (s_loop + 1 * LSIZE);
+      mask |= contains_zero (*lptr) | contains_zero (*lptr ^ cmask);
+      lptr = (const unsigned long int *) (s_loop + 2 * LSIZE);
+      mask |= contains_zero (*lptr) | contains_zero (*lptr ^ cmask);
+      lptr = (const unsigned long int *) (s_loop + 3 * LSIZE);
+      mask |= contains_zero (*lptr) | contains_zero (*lptr ^ cmask);
+
     }
+  while (!mask);
+ if (found_in_long_bytes (s_loop + 0 * LSIZE, cmask, &r))
+   return r;
+ if (found_in_long_bytes (s_loop + 1 * LSIZE, cmask, &r))
+   return r;
+ if (found_in_long_bytes (s_loop + 2 * LSIZE, cmask, &r))
+   return r;
+ found_in_long_bytes (s_loop + 3 * LSIZE, cmask, &r);
+ return r;
+}
+
+
+char *strchr(char *s, int c)
+{
+  char *r = __strchrnule(s,c);
+  return *r ? r : 0;
 
-  /* This should never happen.  */
-  return NULL;
 }
 
+#ifndef AS_STRCHR
 weak_alias (__strchrnul, strchrnul)
+#endif

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 2/3] Optimize strchrnul with unrolling, better header and  unaligned loads
  2015-05-26 18:51 ` [PATCH 2/3] Optimize strchrnul with unrolling, better header and unaligned loads Ondřej Bílka
@ 2015-05-28 17:45   ` Joseph Myers
  2015-05-28 17:48     ` Ondřej Bílka
  0 siblings, 1 reply; 9+ messages in thread
From: Joseph Myers @ 2015-05-28 17:45 UTC (permalink / raw)
  To: Ondřej Bílka; +Cc: libc-alpha

[-- Attachment #1: Type: text/plain, Size: 687 bytes --]

On Tue, 26 May 2015, OndĂ¸ej BĂlka wrote:

> +static always_inline
> +int
> +found_in_long_bytes(char *s, unsigned long int cmask, char **result)
> +{
> +  const unsigned long int *lptr = (const unsigned long int *) s;
> +  unsigned long int mask = contains_zero (*lptr) | contains_zero (*lptr ^ cmask);
> +  if (mask)
> +    {
> +      *result = s + ffsl (mask) / 8 - 1;

If this gets used in strchr, note ffsl is in the user's namespace.  Are 
you sure this will always be inlined by all supported GCC versions on all 
supported architectures (or converted to a call to a libgcc __clz* 
function, which is just as good in namespace terms)?

-- 
Joseph S. Myers
joseph@codesourcery.com

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 2/3] Optimize strchrnul with unrolling, better header and  unaligned loads
  2015-05-28 17:45   ` Joseph Myers
@ 2015-05-28 17:48     ` Ondřej Bílka
  0 siblings, 0 replies; 9+ messages in thread
From: Ondřej Bílka @ 2015-05-28 17:48 UTC (permalink / raw)
  To: Joseph Myers; +Cc: libc-alpha

On Thu, May 28, 2015 at 05:02:32PM +0000, Joseph Myers wrote:
> On Tue, 26 May 2015, OndÅ™ej BÃlka wrote:
> 
> > +static always_inline
> > +int
> > +found_in_long_bytes(char *s, unsigned long int cmask, char **result)
> > +{
> > +  const unsigned long int *lptr = (const unsigned long int *) s;
> > +  unsigned long int mask = contains_zero (*lptr) | contains_zero (*lptr ^ cmask);
> > +  if (mask)
> > +    {
> > +      *result = s + ffsl (mask) / 8 - 1;
> 
> If this gets used in strchr, note ffsl is in the user's namespace.  Are 
> you sure this will always be inlined by all supported GCC versions on all 
> supported architectures (or converted to a call to a libgcc __clz* 
> function, which is just as good in namespace terms)?
> 
No, i handle that problem of next iteration of this patch that uses
skeleton. And you need different function for big endian.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [RFC PATCH 3/3] Exploit that strchr needle is mostly ascii
  2015-05-26 18:48 [PATCH 1/3] Use generic strchrnul for strchr Ondřej Bílka
  2015-05-26 18:51 ` [PATCH 2/3] Optimize strchrnul with unrolling, better header and unaligned loads Ondřej Bílka
@ 2015-05-26 18:55 ` Ondřej Bílka
  2015-05-26 20:34   ` [RFC PATCH 4/3] Fix previous patch and add header Ondřej Bílka
  1 sibling, 1 reply; 9+ messages in thread
From: Ondřej Bílka @ 2015-05-26 18:55 UTC (permalink / raw)
  To: libc-alpha

I realized that strchrnul could be optimized more with exploiting
statistical properties of input. A idea is that needle is most time in
range 0-127

When we handle bytes 128-255 separately it allows considerable
simplification as we do remove false positive 128 only once instead of
twice as previously done.

Comments?


	* string/strchrnul.c: Exploit that c is most times in ascii.

diff --git b/string/strchrnul.c a/string/strchrnul.c
index ea91195..5019773 100644
--- b/string/strchrnul.c
+++ a/string/strchrnul.c
@@ -47,12 +47,24 @@ contains_zero (unsigned long int s)
   return ((s + add) ^ ~s) & high_bits;
 }
 
+
+/* Here idea is still use the result of expression
+   contains_zero (*p) | contains_zero (*p ^ cmask)
+   but we can optimize it. By moving or to compute
+   ((s + add) | ((s ^ cmask) + add) a highest bit is set only for characters
+   0, c, 128, c + 128
+   As we ensured that c has highest byte zero a ^ ~*p eliminates both 128 and 
+   128 + c instead doing xor twice.  */
+
+
 static always_inline
 int
 found_in_long_bytes(char *s, unsigned long int cmask, char **result)
 {
   const unsigned long int *lptr = (const unsigned long int *) s;
-  unsigned long int mask = contains_zero (*lptr) | contains_zero (*lptr ^ cmask);
+  unsigned long int mask = (((*lptr + add) | ((*lptr ^ cmask) + add)) 
+                            ^ ~*lptr) & high_bits;
+
   if (mask)
     {
       *result = s + ffsl (mask) / 8 - 1;
@@ -76,9 +88,30 @@ STRCHRNUL (const char *s_in, int c_in)
   const unsigned long int *lptr;
   char *s = (char *) s_in;
   unsigned char c = (unsigned char) c_in;
-  char *r;
+  char *r, *s_aligned;
   unsigned long int cmask = c * ones;
 
+  if (__libc_unlikely (c > 127))
+    {
+      s_aligned = PTR_ALIGN_DOWN (s, LSIZE);
+      lptr = (const unsigned long int *) s_aligned;
+      mask = (contains_zero (*lptr) 
+              | contains_zero (*lptr ^ cmask))
+             >> (8 * (s_aligned - s));
+
+      if (mask)
+        return s + ffsl (mask) / 8 - 1;
+      while (1)
+        {
+          s_aligned += LSIZE;
+          lptr = (const unsigned long int *) s_aligned;
+          mask = (contains_zero (*lptr)
+               | contains_zero (*lptr ^ cmask));
+          if (mask)       
+            return s_aligned + ffsl (mask) / 8 - 1;
+        }
+    }
+
 #if _STRING_ARCH_unaligned
   /* We fetch 32 bytes while not crossing page boundary. 
      Most strings in practice are of that size and we avoid a loop.
@@ -115,7 +148,7 @@ STRCHRNUL (const char *s_in, int c_in)
   /* We need use aligned loads. For first load we read some bytes before 
      start that we discard by shifting them down. */
  
-      char *s_aligned = PTR_ALIGN_DOWN (s, LSIZE);
+      s_aligned = PTR_ALIGN_DOWN (s, LSIZE);
       lptr = (const unsigned long int *) s_aligned;
       mask = (contains_zero (*lptr) 
               | contains_zero (*lptr ^ cmask))

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [RFC PATCH 4/3] Fix previous patch and add header.
  2015-05-26 18:55 ` [RFC PATCH 3/3] Exploit that strchr needle is mostly ascii Ondřej Bílka
@ 2015-05-26 20:34   ` Ondřej Bílka
  2015-05-26 20:38     ` Ondřej Bílka
  2015-05-28 17:49     ` Joseph Myers
  0 siblings, 2 replies; 9+ messages in thread
From: Ondřej Bílka @ 2015-05-26 20:34 UTC (permalink / raw)
  To: libc-alpha


Sorry, found that my method of testing as compiling these as standalone
file then including them to libc caused problems. This should fix these.

Also moved reusable functions to separate header as I will cover a
strlen, rawmemchr, memchr and rawmemchr.

diff --git a/string/common.h b/string/common.h
new file mode 100644
index 0000000..84f9767
--- /dev/null
+++ b/string/common.h
@@ -0,0 +1,35 @@
+#include <stdint.h>
+
+static const unsigned long int ones = (~0UL / 255); /* 0x0101...*/
+static const unsigned long int add = 127 * (~0UL / 255);
+static const unsigned long int high_bits = 128 * (~0UL / 255);
+
+/* Use vector arithmetic tricks. Idea is to take expression works on
+   unsigned byte and evaluates 0 for nozero byte and nonzero on zero byte.
+   Our expression is ((s + 127) ^ (~s)) & 128  
+   Now we evaluate this expression on each byte in parallel and on first 
+   nonzero byte our expression will have nonzero value. */
+
+static __always_inline
+unsigned long int 
+contains_zero (unsigned long int s)
+{
+  return ((s + add) ^ ~s) & high_bits;
+}
+
+#define LSIZE sizeof (unsigned long int)
+#define CROSS_PAGE(x, n) (((uintptr_t)x) % 4096 >= 4096 - n)
+
+static __always_inline
+size_t
+first_nonzero_byte (unsigned long int u)
+{
+#ifdef FAST_FFS
+  return ffsl (u) / 8 - 1;
+#else
+  u = u = u ^ (u - 1);
+  u = u & ones;
+  u = u * ones;
+  return (u >> (8 * LSIZE - 8)) - 1;
+#endif
+}
diff --git a/string/strchr.c b/string/strchr.c
index 17473b8..ca543f4 100644
--- a/string/strchr.c
+++ b/string/strchr.c
@@ -25,9 +25,8 @@
 
 #undef strchr
 
-
 /* Include strchrnul and let gcc inline it.  */
-#define AS_STRCHRNUL
+#define AS_STRCHR
 #define STRCHRNUL strchrnul_i
 #include "string/strchrnul.c"
 
diff --git a/string/strchrnul.c b/string/strchrnul.c
index ea91195..5be9da9 100644
--- a/string/strchrnul.c
+++ b/string/strchrnul.c
@@ -23,6 +23,7 @@
 #include <string.h>
 #include <libc-internal.h>
 #include <stdint.h>
+#include "string/common.h"
 #undef __strchrnul
 #undef strchrnul
 
@@ -30,24 +31,8 @@
 # define STRCHRNUL __strchrnul
 #endif
 
-static const unsigned long int ones = (~0UL / 255); /* 0x0101...*/
-static const unsigned long int add = 127 * (~0UL / 255);
-static const unsigned long int high_bits = 128 * (~0UL / 255);
 
-/* Use vector arithmetic tricks. Idea is to take expression works on
-   unsigned byte and evaluates 0 for nozero byte and nonzero on zero byte.
-   Our expression is ((s + 127) ^ (~s)) & 128  
-   Now we evaluate this expression on each byte in parallel and on first 
-   nonzero byte our expression will have nonzero value. */
-
-static always_inline
-unsigned long int 
-contains_zero (unsigned long int s)
-{
-  return ((s + add) ^ ~s) & high_bits;
-}
-
-static always_inline
+static __always_inline
 int
 found_in_long_bytes(char *s, unsigned long int cmask, char **result)
 {
@@ -55,19 +40,16 @@ found_in_long_bytes(char *s, unsigned long int cmask, char **result)
   unsigned long int mask = contains_zero (*lptr) | contains_zero (*lptr ^ cmask);
   if (mask)
     {
-      *result = s + ffsl (mask) / 8 - 1;
+      *result = s + first_nonzero_byte (mask);
       return 1;
     }
   else
     return 0;
 }
 
-#define LSIZE sizeof (unsigned long int)
-#define CROSS_PAGE(x, n) (((uintptr_t)x)%4096 >= 4096 - n)
-
 /* Find the first occurrence of C in S or the final NUL byte.  */
 #ifdef AS_STRCHR
-static always_inline
+static __always_inline
 #endif
 char *
 STRCHRNUL (const char *s_in, int c_in)
@@ -122,7 +104,7 @@ STRCHRNUL (const char *s_in, int c_in)
              >> (8 * (s_aligned - s));
 
       if (mask)
-        return s + ffsl (mask) / 8 - 1;
+        return s + first_nonzero_byte (mask);
 
       if (found_in_long_bytes (s + 1 * LSIZE, cmask, &r))
         return r;
@@ -160,14 +142,6 @@ STRCHRNUL (const char *s_in, int c_in)
  return r;
 }
 
-
-char *strchr(char *s, int c)
-{
-  char *r = __strchrnule(s,c);
-  return *r ? r : 0;
-
-}
-
-#ifndef AS_STRCHR
+#ifdef AS_STRCHR
 weak_alias (__strchrnul, strchrnul)
 #endif



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH 4/3] Fix previous patch and add header.
  2015-05-26 20:34   ` [RFC PATCH 4/3] Fix previous patch and add header Ondřej Bílka
@ 2015-05-26 20:38     ` Ondřej Bílka
  2015-05-28 17:49     ` Joseph Myers
  1 sibling, 0 replies; 9+ messages in thread
From: Ondřej Bílka @ 2015-05-26 20:38 UTC (permalink / raw)
  To: libc-alpha

On Tue, May 26, 2015 at 08:55:34PM +0200, OndÅ™ej BÃlka wrote:
> 
Now I get second thoughts which are nasty.

So I decided to resubmit this as all these functions follow same
template so I will just write this template and add few specific bits.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH 4/3] Fix previous patch and add header.
  2015-05-26 20:34   ` [RFC PATCH 4/3] Fix previous patch and add header Ondřej Bílka
  2015-05-26 20:38     ` Ondřej Bílka
@ 2015-05-28 17:49     ` Joseph Myers
  2015-05-28 17:59       ` Ondřej Bílka
  1 sibling, 1 reply; 9+ messages in thread
From: Joseph Myers @ 2015-05-28 17:49 UTC (permalink / raw)
  To: Ondřej Bílka; +Cc: libc-alpha

[-- Attachment #1: Type: text/plain, Size: 1951 bytes --]

On Tue, 26 May 2015, OndĂ¸ej BĂlka wrote:

> diff --git a/string/common.h b/string/common.h
> new file mode 100644
> index 0000000..84f9767
> --- /dev/null
> +++ b/string/common.h
> @@ -0,0 +1,35 @@
> +#include <stdint.h>

All files should start with a comment that first describes the file, then 
has the copyright and license notice.

> +/* Use vector arithmetic tricks. Idea is to take expression works on
> +   unsigned byte and evaluates 0 for nozero byte and nonzero on zero byte.
> +   Our expression is ((s + 127) ^ (~s)) & 128  
> +   Now we evaluate this expression on each byte in parallel and on first 
> +   nonzero byte our expression will have nonzero value. */

I think a more detailed, multi-paragraph comment carefully explaining the 
algorithm used (and why and how it works) would be better.  Like the one 
discussed in bug 5806, but of course without the mistake discussed there.

The principle of a common header with this logic is a good one - hopefully 
if it gets used everywhere this can resolve bug 5806 by eliminating all 
copies of the old comment.

(The patch submission should carefully discuss how the algorithm relates 
to the ones discussed in bug 5806 / 5807 or used in existing code.  But 
that's part of justifying the patches rather than for the comments in the 
new code.)

> +static __always_inline
> +unsigned long int 
> +contains_zero (unsigned long int s)
> +{
> +  return ((s + add) ^ ~s) & high_bits;

Instead of using unsigned long int, architectures should be able to 
configure the type used.  I expect that for ilp32 configurations on 64-bit 
architectures, where registers are 64-bit but unsigned long int is 32-bit, 
such as x32 and MIPS n32, it will be better to use unsigned long long int.

> +#define CROSS_PAGE(x, n) (((uintptr_t)x) % 4096 >= 4096 - n)

It might also make sense for the use of 4096 here to be configurable by 
architectures as well.

-- 
Joseph S. Myers
joseph@codesourcery.com

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH 4/3] Fix previous patch and add header.
  2015-05-28 17:49     ` Joseph Myers
@ 2015-05-28 17:59       ` Ondřej Bílka
  0 siblings, 0 replies; 9+ messages in thread
From: Ondřej Bílka @ 2015-05-28 17:59 UTC (permalink / raw)
  To: Joseph Myers; +Cc: libc-alpha

On Thu, May 28, 2015 at 05:17:22PM +0000, Joseph Myers wrote:
> On Tue, 26 May 2015, OndÅ™ej BÃlka wrote:
> 
> > diff --git a/string/common.h b/string/common.h
> > new file mode 100644
> > index 0000000..84f9767
> > --- /dev/null
> > +++ b/string/common.h
> > @@ -0,0 +1,35 @@
> > +#include <stdint.h>
> 
> All files should start with a comment that first describes the file, then 
> has the copyright and license notice.
> 
> > +/* Use vector arithmetic tricks. Idea is to take expression works on
> > +   unsigned byte and evaluates 0 for nozero byte and nonzero on zero byte.
> > +   Our expression is ((s + 127) ^ (~s)) & 128  
> > +   Now we evaluate this expression on each byte in parallel and on first 
> > +   nonzero byte our expression will have nonzero value. */
> 
> I think a more detailed, multi-paragraph comment carefully explaining the 
> algorithm used (and why and how it works) would be better.  Like the one 
> discussed in bug 5806, but of course without the mistake discussed there.
> 
> The principle of a common header with this logic is a good one - hopefully 
> if it gets used everywhere this can resolve bug 5806 by eliminating all 
> copies of the old comment.
> 
> (The patch submission should carefully discuss how the algorithm relates 
> to the ones discussed in bug 5806 / 5807 or used in existing code.  But 
> that's part of justifying the patches rather than for the comments in the 
> new code.)
> 
> > +static __always_inline
> > +unsigned long int 
> > +contains_zero (unsigned long int s)
> > +{
> > +  return ((s + add) ^ ~s) & high_bits;
> 
> Instead of using unsigned long int, architectures should be able to 
> configure the type used.  I expect that for ilp32 configurations on 64-bit 
> architectures, where registers are 64-bit but unsigned long int is 32-bit, 
> such as x32 and MIPS n32, it will be better to use unsigned long long int.
>
in next iteration I would add typedef and archs would use it from
header. Perhaps something like.

#ifndef VECTOR_INT
# define VECTOR_INT unsigned long int
#endif
typedef VECTOR_INT vector_int;
 
> > +#define CROSS_PAGE(x, n) (((uintptr_t)x) % 4096 >= 4096 - n)
> 
> It might also make sense for the use of 4096 here to be configurable by 
> architectures as well.
> 
Could but it won't likely make noticable difference as you cross page
with probability 0.8% for uniformly random address.

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2015-05-28 17:25 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-05-26 18:48 [PATCH 1/3] Use generic strchrnul for strchr Ondřej Bílka
2015-05-26 18:51 ` [PATCH 2/3] Optimize strchrnul with unrolling, better header and unaligned loads Ondřej Bílka
2015-05-28 17:45   ` Joseph Myers
2015-05-28 17:48     ` Ondřej Bílka
2015-05-26 18:55 ` [RFC PATCH 3/3] Exploit that strchr needle is mostly ascii Ondřej Bílka
2015-05-26 20:34   ` [RFC PATCH 4/3] Fix previous patch and add header Ondřej Bílka
2015-05-26 20:38     ` Ondřej Bílka
2015-05-28 17:49     ` Joseph Myers
2015-05-28 17:59       ` Ondřej Bílka

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).