public inbox for libc-alpha@sourceware.org
 help / color / mirror / Atom feed
* [PATCH 00/11] Improve generic string routines
@ 2016-12-17  6:57 Richard Henderson
  2016-12-17  6:57 ` [PATCH 07/11] Improve generic strnlen Richard Henderson
                   ` (13 more replies)
  0 siblings, 14 replies; 32+ messages in thread
From: Richard Henderson @ 2016-12-17  6:57 UTC (permalink / raw)
  To: libc-alpha

The idea here was to improve hppa performance without going all
the way to hand-written assembly (at least for now) so that other
minor architectures could benefit.

The main improvement, IMO, is pulling out all of the scattered
"magic" byte tests to a new header, haszero.h.

I'll note that the comments regarding these tests are in fact out
of date -- they speak of a number 0x7efefeff which does not appear
(we use -0x01010101 or 0xfefefeff), and a test misfire which cannot
actually occur (presumably because we changed algorithms).

This header can then be easily overridden for a target to use a
cheaper zero-byte test, if one exists.  I've provided such for
both hppa and alpha.

I've tested this for hppa and alpha, both under qemu.  While I
cannot provide non-simulated numbers for benchmark improvements,
I'm confident that the real hardware ought to agree in magnitude.


r~



Richard Henderson (11):
  Improve generic strlen
  Improve generic strchr
  Improve generic memchr
  Improve generic strchrnul
  Improve generic strrchr
  Improve generic memrchr
  Improve generic strnlen
  Improve generic strcmp
  hppa: Add memcopy.h
  hppa: Add haszero.h and whichzero.h
  alpha: Add haszero.h and whichzero.h

 string/memchr.c               | 118 +++++++--------------------
 string/memrchr.c              | 180 +++++++-----------------------------------
 string/strchr.c               | 144 +++++----------------------------
 string/strchrnul.c            | 126 ++++-------------------------
 string/strcmp.c               | 100 +++++++++++++++++++++--
 string/strlen.c               |  80 ++++---------------
 string/strnlen.c              | 125 +++++------------------------
 string/strrchr.c              |  67 +++++++++++++---
 sysdeps/alpha/haszero.h       |  36 +++++++++
 sysdeps/alpha/whichzero.h     |  55 +++++++++++++
 sysdeps/generic/extractbyte.h |  34 ++++++++
 sysdeps/generic/haszero.h     |  42 ++++++++++
 sysdeps/generic/whichzero.h   |  67 ++++++++++++++++
 sysdeps/hppa/haszero.h        |  82 +++++++++++++++++++
 sysdeps/hppa/memcopy.h        |  44 +++++++++++
 sysdeps/hppa/whichzero.h      |  70 ++++++++++++++++
 16 files changed, 707 insertions(+), 663 deletions(-)
 create mode 100644 sysdeps/alpha/haszero.h
 create mode 100644 sysdeps/alpha/whichzero.h
 create mode 100644 sysdeps/generic/extractbyte.h
 create mode 100644 sysdeps/generic/haszero.h
 create mode 100644 sysdeps/generic/whichzero.h
 create mode 100644 sysdeps/hppa/haszero.h
 create mode 100644 sysdeps/hppa/memcopy.h
 create mode 100644 sysdeps/hppa/whichzero.h

-- 
2.9.3

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 03/11] Improve generic memchr
  2016-12-17  6:57 [PATCH 00/11] Improve generic string routines Richard Henderson
  2016-12-17  6:57 ` [PATCH 07/11] Improve generic strnlen Richard Henderson
  2016-12-17  6:57 ` [PATCH 02/11] Improve generic strchr Richard Henderson
@ 2016-12-17  6:57 ` Richard Henderson
  2016-12-17  6:57 ` [PATCH 01/11] Improve generic strlen Richard Henderson
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 32+ messages in thread
From: Richard Henderson @ 2016-12-17  6:57 UTC (permalink / raw)
  To: libc-alpha

        * string/memchr.c: Use haszero.h, whichzero.h.
---
 string/memchr.c | 118 ++++++++++++++------------------------------------------
 1 file changed, 29 insertions(+), 89 deletions(-)

diff --git a/string/memchr.c b/string/memchr.c
index ca9fd99..858a494 100644
--- a/string/memchr.c
+++ b/string/memchr.c
@@ -25,10 +25,11 @@
 #endif
 
 #include <string.h>
-
 #include <stddef.h>
-
 #include <limits.h>
+#include <stdint.h>
+#include <haszero.h>
+#include <whichzero.h>
 
 #undef __memchr
 #ifdef _LIBC
@@ -47,111 +48,50 @@
 void *
 MEMCHR (void const *s, int c_in, size_t n)
 {
-  /* On 32-bit hardware, choosing longword to be a 32-bit unsigned
-     long instead of a 64-bit uintmax_t tends to give better
-     performance.  On 64-bit hardware, unsigned long is generally 64
-     bits already.  Change this typedef to experiment with
-     performance.  */
-  typedef unsigned long int longword;
-
   const unsigned char *char_ptr;
-  const longword *longword_ptr;
-  longword repeated_one;
-  longword repeated_c;
+  const unsigned long int *longword_ptr;
+  unsigned long int longword, repeated_c;
   unsigned char c;
+  uintptr_t i, align;
 
   c = (unsigned char) c_in;
+  char_ptr = (const unsigned char *) s;
 
   /* Handle the first few bytes by reading one byte at a time.
      Do this until CHAR_PTR is aligned on a longword boundary.  */
-  for (char_ptr = (const unsigned char *) s;
-       n > 0 && (size_t) char_ptr % sizeof (longword) != 0;
-       --n, ++char_ptr)
+  align = -(uintptr_t)char_ptr % sizeof(longword);
+  if (align > n)
+    align = n;
+  for (i = 0; i < align; ++i, ++char_ptr)
     if (*char_ptr == c)
       return (void *) char_ptr;
 
-  longword_ptr = (const longword *) char_ptr;
+  /* Set up a longword, each of whose bytes is C.  */
+  repeated_c = (-1ul / 0xff) * c;
 
-  /* All these elucidatory comments refer to 4-byte longwords,
-     but the theory applies equally well to any size longwords.  */
+  longword_ptr = (const unsigned long int *) char_ptr;
+  n -= align;
+  if (__glibc_unlikely(n == 0))
+    return NULL;
 
-  /* Compute auxiliary longword values:
-     repeated_one is a value which has a 1 in every byte.
-     repeated_c has c in every byte.  */
-  repeated_one = 0x01010101;
-  repeated_c = c | (c << 8);
-  repeated_c |= repeated_c << 16;
-  if (0xffffffffU < (longword) -1)
+  /* Loop while we have more than one longword remaining.  */
+  while (n > sizeof (longword))
     {
-      repeated_one |= repeated_one << 31 << 1;
-      repeated_c |= repeated_c << 31 << 1;
-      if (8 < sizeof (longword))
-	{
-	  size_t i;
-
-	  for (i = 64; i < sizeof (longword) * 8; i *= 2)
-	    {
-	      repeated_one |= repeated_one << i;
-	      repeated_c |= repeated_c << i;
-	    }
-	}
-    }
-
-  /* Instead of the traditional loop which tests each byte, we will test a
-     longword at a time.  The tricky part is testing if *any of the four*
-     bytes in the longword in question are equal to c.  We first use an xor
-     with repeated_c.  This reduces the task to testing whether *any of the
-     four* bytes in longword1 is zero.
-
-     We compute tmp =
-       ((longword1 - repeated_one) & ~longword1) & (repeated_one << 7).
-     That is, we perform the following operations:
-       1. Subtract repeated_one.
-       2. & ~longword1.
-       3. & a mask consisting of 0x80 in every byte.
-     Consider what happens in each byte:
-       - If a byte of longword1 is zero, step 1 and 2 transform it into 0xff,
-	 and step 3 transforms it into 0x80.  A carry can also be propagated
-	 to more significant bytes.
-       - If a byte of longword1 is nonzero, let its lowest 1 bit be at
-	 position k (0 <= k <= 7); so the lowest k bits are 0.  After step 1,
-	 the byte ends in a single bit of value 0 and k bits of value 1.
-	 After step 2, the result is just k bits of value 1: 2^k - 1.  After
-	 step 3, the result is 0.  And no carry is produced.
-     So, if longword1 has only non-zero bytes, tmp is zero.
-     Whereas if longword1 has a zero byte, call j the position of the least
-     significant zero byte.  Then the result has a zero at positions 0, ...,
-     j-1 and a 0x80 at position j.  We cannot predict the result at the more
-     significant bytes (positions j+1..3), but it does not matter since we
-     already have a non-zero bit at position 8*j+7.
-
-     So, the test whether any byte in longword1 is zero is equivalent to
-     testing whether tmp is nonzero.  */
-
-  while (n >= sizeof (longword))
-    {
-      longword longword1 = *longword_ptr ^ repeated_c;
-
-      if ((((longword1 - repeated_one) & ~longword1)
-	   & (repeated_one << 7)) != 0)
-	break;
+      longword = *longword_ptr ^ repeated_c;
+      if (haszero(longword))
+	goto found;
       longword_ptr++;
       n -= sizeof (longword);
     }
 
-  char_ptr = (const unsigned char *) longword_ptr;
-
-  /* At this point, we know that either n < sizeof (longword), or one of the
-     sizeof (longword) bytes starting at char_ptr is == c.  On little-endian
-     machines, we could determine the first such byte without any further
-     memory accesses, just by looking at the tmp result from the last loop
-     iteration.  But this does not work on big-endian machines.  Choose code
-     that works in both cases.  */
-
-  for (; n > 0; --n, ++char_ptr)
+  /* Since our pointer is aligned, we can always read the last longword.  */
+  longword = *longword_ptr ^ repeated_c;
+  if (haszero(longword))
     {
-      if (*char_ptr == c)
-	return (void *) char_ptr;
+ found:
+      i = whichzero(longword);
+      if (i < n)
+	return (char *) longword_ptr + i;
     }
 
   return NULL;
-- 
2.9.3

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 05/11] Improve generic strrchr
  2016-12-17  6:57 [PATCH 00/11] Improve generic string routines Richard Henderson
                   ` (7 preceding siblings ...)
  2016-12-17  6:57 ` [PATCH 04/11] Improve generic strchrnul Richard Henderson
@ 2016-12-17  6:57 ` Richard Henderson
  2016-12-20 12:50   ` Adhemerval Zanella
  2016-12-17  6:58 ` [PATCH 11/11] alpha: Add haszero.h and whichzero.h Richard Henderson
                   ` (4 subsequent siblings)
  13 siblings, 1 reply; 32+ messages in thread
From: Richard Henderson @ 2016-12-17  6:57 UTC (permalink / raw)
  To: libc-alpha

     * string/strrchr.c: Use haszero.h.
---
 string/strrchr.c | 67 ++++++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 56 insertions(+), 11 deletions(-)

diff --git a/string/strrchr.c b/string/strrchr.c
index a07457e..bea7d76 100644
--- a/string/strrchr.c
+++ b/string/strrchr.c
@@ -16,6 +16,10 @@
    <http://www.gnu.org/licenses/>.  */
 
 #include <string.h>
+#include <stdint.h>
+#include <haszero.h>
+#include <whichzero.h>
+#include <extractbyte.h>
 
 #undef strrchr
 
@@ -25,25 +29,66 @@
 
 /* Find the last occurrence of C in S.  */
 char *
-STRRCHR (const char *s, int c)
+STRRCHR (const char *s, int int_c)
 {
-  const char *found, *p;
+  const unsigned char *found_c = NULL, *ptr_c;
+  const unsigned long int *found_w = NULL, *ptr_w;
+  unsigned long int longword, repeated_c;
+  uintptr_t i, align;
+  unsigned char c;
 
-  c = (unsigned char) c;
+  c = (unsigned char) int_c;
+  ptr_c = (const unsigned char *) s;
 
-  /* Since strchr is fast, we use it rather than the obvious loop.  */
+  /* Handle the first few characters by reading one character at a time.
+     Do this until CHAR_PTR is aligned on a longword boundary.  */
+  align = -(uintptr_t)ptr_c % sizeof(longword);
+  for (i = 0; i < align; ++i, ++ptr_c)
+    {
+      unsigned char this_c = *ptr_c;
+      if (this_c == c)
+        found_c = ptr_c;
+      if (this_c == '\0')
+        goto done;
+    }
+
+  /* Set up a longword, each of whose bytes is C.  */
+  repeated_c = (-1ul / 0xff) * c;
+
+  /* Search words for C.  At this point, merely record the last word
+     that contained the character.  Stop when we find EOS.  */
+  ptr_w = (const unsigned long int *) ptr_c;
+  while (1)
+    {
+      longword = *ptr_w;
+      if (haszero (longword))
+	break;
+      if (haszero (longword ^ repeated_c))
+	found_w = ptr_w;
+      ptr_w++;
+    }
 
-  if (c == '\0')
-    return strchr (s, '\0');
+  /* Check to see if we've got C in the last longword.  */
+  i = whichzero2 (longword, longword ^ repeated_c);
+  if (extractbyte (longword, i) == c)
+    found_w = ptr_w;
 
-  found = NULL;
-  while ((p = strchr (s, c)) != NULL)
+  /* If we found a word containing C, go back and search it byte by byte.  */
+  if (found_w)
     {
-      found = p;
-      s = p + 1;
+      ptr_c = (const unsigned char *) found_w;
+      for (i = 0; i < sizeof(longword); ++i, ++ptr_c)
+	{
+	  unsigned char this_c = *ptr_c;
+	  if (this_c == c)
+	    found_c = ptr_c;
+	  if (this_c == '\0')
+	    break;
+	}
     }
 
-  return (char *) found;
+ done:
+  return (char *) found_c;
 }
 
 #ifdef weak_alias
-- 
2.9.3

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 02/11] Improve generic strchr
  2016-12-17  6:57 [PATCH 00/11] Improve generic string routines Richard Henderson
  2016-12-17  6:57 ` [PATCH 07/11] Improve generic strnlen Richard Henderson
@ 2016-12-17  6:57 ` Richard Henderson
  2016-12-19 14:38   ` Adhemerval Zanella
  2016-12-17  6:57 ` [PATCH 03/11] Improve generic memchr Richard Henderson
                   ` (11 subsequent siblings)
  13 siblings, 1 reply; 32+ messages in thread
From: Richard Henderson @ 2016-12-17  6:57 UTC (permalink / raw)
  To: libc-alpha

	* string/strchr.c: Use haszero.h, whichzero.h, extractbyte.h.
---
 string/strchr.c | 144 ++++++++------------------------------------------------
 1 file changed, 19 insertions(+), 125 deletions(-)

diff --git a/string/strchr.c b/string/strchr.c
index 25c2fe4..a70d72a 100644
--- a/string/strchr.c
+++ b/string/strchr.c
@@ -22,6 +22,10 @@
 
 #include <string.h>
 #include <stdlib.h>
+#include <stdint.h>
+#include <haszero.h>
+#include <whichzero.h>
+#include <extractbyte.h>
 
 #undef strchr
 
@@ -35,147 +39,37 @@ STRCHR (const char *s, int c_in)
 {
   const unsigned char *char_ptr;
   const unsigned long int *longword_ptr;
-  unsigned long int longword, magic_bits, charmask;
+  unsigned long int longword, repeated_c, found;
+  uintptr_t i, align;
   unsigned char c;
 
   c = (unsigned char) c_in;
+  char_ptr = (const unsigned char *) s;
 
   /* Handle the first few characters by reading one character at a time.
      Do this until CHAR_PTR is aligned on a longword boundary.  */
-  for (char_ptr = (const unsigned char *) s;
-       ((unsigned long int) char_ptr & (sizeof (longword) - 1)) != 0;
-       ++char_ptr)
+  align = -(uintptr_t)char_ptr % sizeof(longword);
+  for (i = 0; i < align; ++i, ++char_ptr)
     if (*char_ptr == c)
-      return (void *) char_ptr;
+      return (char *) char_ptr;
     else if (*char_ptr == '\0')
       return NULL;
 
-  /* All these elucidatory comments refer to 4-byte longwords,
-     but the theory applies equally well to 8-byte longwords.  */
-
-  longword_ptr = (unsigned long int *) char_ptr;
-
-  /* Bits 31, 24, 16, and 8 of this number are zero.  Call these bits
-     the "holes."  Note that there is a hole just to the left of
-     each byte, with an extra at the end:
-
-     bits:  01111110 11111110 11111110 11111111
-     bytes: AAAAAAAA BBBBBBBB CCCCCCCC DDDDDDDD
-
-     The 1-bits make sure that carries propagate to the next 0-bit.
-     The 0-bits provide holes for carries to fall into.  */
-  magic_bits = -1;
-  magic_bits = magic_bits / 0xff * 0xfe << 1 >> 1 | 1;
-
   /* Set up a longword, each of whose bytes is C.  */
-  charmask = c | (c << 8);
-  charmask |= charmask << 16;
-  if (sizeof (longword) > 4)
-    /* Do the shift in two steps to avoid a warning if long has 32 bits.  */
-    charmask |= (charmask << 16) << 16;
-  if (sizeof (longword) > 8)
-    abort ();
+  repeated_c = (-1ul / 0xff) * c;
 
-  /* Instead of the traditional loop which tests each character,
-     we will test a longword at a time.  The tricky part is testing
-     if *any of the four* bytes in the longword in question are zero.  */
-  for (;;)
+  longword_ptr = (unsigned long int *) char_ptr;
+  do
     {
-      /* We tentatively exit the loop if adding MAGIC_BITS to
-	 LONGWORD fails to change any of the hole bits of LONGWORD.
-
-	 1) Is this safe?  Will it catch all the zero bytes?
-	 Suppose there is a byte with all zeros.  Any carry bits
-	 propagating from its left will fall into the hole at its
-	 least significant bit and stop.  Since there will be no
-	 carry from its most significant bit, the LSB of the
-	 byte to the left will be unchanged, and the zero will be
-	 detected.
-
-	 2) Is this worthwhile?  Will it ignore everything except
-	 zero bytes?  Suppose every byte of LONGWORD has a bit set
-	 somewhere.  There will be a carry into bit 8.  If bit 8
-	 is set, this will carry into bit 16.  If bit 8 is clear,
-	 one of bits 9-15 must be set, so there will be a carry
-	 into bit 16.  Similarly, there will be a carry into bit
-	 24.  If one of bits 24-30 is set, there will be a carry
-	 into bit 31, so all of the hole bits will be changed.
-
-	 The one misfire occurs when bits 24-30 are clear and bit
-	 31 is set; in this case, the hole at bit 31 is not
-	 changed.  If we had access to the processor carry flag,
-	 we could close this loophole by putting the fourth hole
-	 at bit 32!
-
-	 So it ignores everything except 128's, when they're aligned
-	 properly.
-
-	 3) But wait!  Aren't we looking for C as well as zero?
-	 Good point.  So what we do is XOR LONGWORD with a longword,
-	 each of whose bytes is C.  This turns each byte that is C
-	 into a zero.  */
-
       longword = *longword_ptr++;
-
-      /* Add MAGIC_BITS to LONGWORD.  */
-      if ((((longword + magic_bits)
-
-	    /* Set those bits that were unchanged by the addition.  */
-	    ^ ~longword)
-
-	   /* Look at only the hole bits.  If any of the hole bits
-	      are unchanged, most likely one of the bytes was a
-	      zero.  */
-	   & ~magic_bits) != 0 ||
-
-	  /* That caught zeroes.  Now test for C.  */
-	  ((((longword ^ charmask) + magic_bits) ^ ~(longword ^ charmask))
-	   & ~magic_bits) != 0)
-	{
-	  /* Which of the bytes was C or zero?
-	     If none of them were, it was a misfire; continue the search.  */
-
-	  const unsigned char *cp = (const unsigned char *) (longword_ptr - 1);
-
-	  if (*cp == c)
-	    return (char *) cp;
-	  else if (*cp == '\0')
-	    return NULL;
-	  if (*++cp == c)
-	    return (char *) cp;
-	  else if (*cp == '\0')
-	    return NULL;
-	  if (*++cp == c)
-	    return (char *) cp;
-	  else if (*cp == '\0')
-	    return NULL;
-	  if (*++cp == c)
-	    return (char *) cp;
-	  else if (*cp == '\0')
-	    return NULL;
-	  if (sizeof (longword) > 4)
-	    {
-	      if (*++cp == c)
-		return (char *) cp;
-	      else if (*cp == '\0')
-		return NULL;
-	      if (*++cp == c)
-		return (char *) cp;
-	      else if (*cp == '\0')
-		return NULL;
-	      if (*++cp == c)
-		return (char *) cp;
-	      else if (*cp == '\0')
-		return NULL;
-	      if (*++cp == c)
-		return (char *) cp;
-	      else if (*cp == '\0')
-		return NULL;
-	    }
-	}
     }
+  while (!haszero2(longword, longword ^ repeated_c));
 
-  return NULL;
+  found = whichzero2(longword, longword ^ repeated_c);
+  if (extractbyte(longword, found) == c)
+    return (char *) (longword_ptr - 1) + found;
+  else
+    return NULL;
 }
 
 #ifdef weak_alias
-- 
2.9.3

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 04/11] Improve generic strchrnul
  2016-12-17  6:57 [PATCH 00/11] Improve generic string routines Richard Henderson
                   ` (6 preceding siblings ...)
  2016-12-17  6:57 ` [PATCH 10/11] hppa: Add haszero.h and whichzero.h Richard Henderson
@ 2016-12-17  6:57 ` Richard Henderson
  2016-12-17  6:57 ` [PATCH 05/11] Improve generic strrchr Richard Henderson
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 32+ messages in thread
From: Richard Henderson @ 2016-12-17  6:57 UTC (permalink / raw)
  To: libc-alpha

	* string/strchrnul.c: Use haszero.h, whichzero.h.
---
 string/strchrnul.c | 126 +++++++----------------------------------------------
 1 file changed, 15 insertions(+), 111 deletions(-)

diff --git a/string/strchrnul.c b/string/strchrnul.c
index 629db46..fb2d1a4 100644
--- a/string/strchrnul.c
+++ b/string/strchrnul.c
@@ -21,8 +21,10 @@
    <http://www.gnu.org/licenses/>.  */
 
 #include <string.h>
-#include <memcopy.h>
 #include <stdlib.h>
+#include <stdint.h>
+#include <haszero.h>
+#include <whichzero.h>
 
 #undef __strchrnul
 #undef strchrnul
@@ -37,130 +39,32 @@ STRCHRNUL (const char *s, int c_in)
 {
   const unsigned char *char_ptr;
   const unsigned long int *longword_ptr;
-  unsigned long int longword, magic_bits, charmask;
+  unsigned long int longword, repeated_c, found;
+  uintptr_t i, align;
   unsigned char c;
 
   c = (unsigned char) c_in;
+  char_ptr = (const unsigned char *) s;
 
   /* Handle the first few characters by reading one character at a time.
      Do this until CHAR_PTR is aligned on a longword boundary.  */
-  for (char_ptr = (const unsigned char *) s;
-       ((unsigned long int) char_ptr & (sizeof (longword) - 1)) != 0;
-       ++char_ptr)
+  align = -(uintptr_t)char_ptr % sizeof(longword);
+  for (i = 0; i < align; ++i, ++char_ptr)
     if (*char_ptr == c || *char_ptr == '\0')
-      return (void *) char_ptr;
-
-  /* All these elucidatory comments refer to 4-byte longwords,
-     but the theory applies equally well to 8-byte longwords.  */
-
-  longword_ptr = (unsigned long int *) char_ptr;
-
-  /* Bits 31, 24, 16, and 8 of this number are zero.  Call these bits
-     the "holes."  Note that there is a hole just to the left of
-     each byte, with an extra at the end:
-
-     bits:  01111110 11111110 11111110 11111111
-     bytes: AAAAAAAA BBBBBBBB CCCCCCCC DDDDDDDD
-
-     The 1-bits make sure that carries propagate to the next 0-bit.
-     The 0-bits provide holes for carries to fall into.  */
-  magic_bits = -1;
-  magic_bits = magic_bits / 0xff * 0xfe << 1 >> 1 | 1;
+      return (char *) char_ptr;
 
   /* Set up a longword, each of whose bytes is C.  */
-  charmask = c | (c << 8);
-  charmask |= charmask << 16;
-  if (sizeof (longword) > 4)
-    /* Do the shift in two steps to avoid a warning if long has 32 bits.  */
-    charmask |= (charmask << 16) << 16;
-  if (sizeof (longword) > 8)
-    abort ();
+  repeated_c = (-1ul / 0xff) * c;
 
-  /* Instead of the traditional loop which tests each character,
-     we will test a longword at a time.  The tricky part is testing
-     if *any of the four* bytes in the longword in question are zero.  */
-  for (;;)
+  longword_ptr = (unsigned long int *) char_ptr;
+  do
     {
-      /* We tentatively exit the loop if adding MAGIC_BITS to
-	 LONGWORD fails to change any of the hole bits of LONGWORD.
-
-	 1) Is this safe?  Will it catch all the zero bytes?
-	 Suppose there is a byte with all zeros.  Any carry bits
-	 propagating from its left will fall into the hole at its
-	 least significant bit and stop.  Since there will be no
-	 carry from its most significant bit, the LSB of the
-	 byte to the left will be unchanged, and the zero will be
-	 detected.
-
-	 2) Is this worthwhile?  Will it ignore everything except
-	 zero bytes?  Suppose every byte of LONGWORD has a bit set
-	 somewhere.  There will be a carry into bit 8.  If bit 8
-	 is set, this will carry into bit 16.  If bit 8 is clear,
-	 one of bits 9-15 must be set, so there will be a carry
-	 into bit 16.  Similarly, there will be a carry into bit
-	 24.  If one of bits 24-30 is set, there will be a carry
-	 into bit 31, so all of the hole bits will be changed.
-
-	 The one misfire occurs when bits 24-30 are clear and bit
-	 31 is set; in this case, the hole at bit 31 is not
-	 changed.  If we had access to the processor carry flag,
-	 we could close this loophole by putting the fourth hole
-	 at bit 32!
-
-	 So it ignores everything except 128's, when they're aligned
-	 properly.
-
-	 3) But wait!  Aren't we looking for C as well as zero?
-	 Good point.  So what we do is XOR LONGWORD with a longword,
-	 each of whose bytes is C.  This turns each byte that is C
-	 into a zero.  */
-
       longword = *longword_ptr++;
-
-      /* Add MAGIC_BITS to LONGWORD.  */
-      if ((((longword + magic_bits)
-
-	    /* Set those bits that were unchanged by the addition.  */
-	    ^ ~longword)
-
-	   /* Look at only the hole bits.  If any of the hole bits
-	      are unchanged, most likely one of the bytes was a
-	      zero.  */
-	   & ~magic_bits) != 0 ||
-
-	  /* That caught zeroes.  Now test for C.  */
-	  ((((longword ^ charmask) + magic_bits) ^ ~(longword ^ charmask))
-	   & ~magic_bits) != 0)
-	{
-	  /* Which of the bytes was C or zero?
-	     If none of them were, it was a misfire; continue the search.  */
-
-	  const unsigned char *cp = (const unsigned char *) (longword_ptr - 1);
-
-	  if (*cp == c || *cp == '\0')
-	    return (char *) cp;
-	  if (*++cp == c || *cp == '\0')
-	    return (char *) cp;
-	  if (*++cp == c || *cp == '\0')
-	    return (char *) cp;
-	  if (*++cp == c || *cp == '\0')
-	    return (char *) cp;
-	  if (sizeof (longword) > 4)
-	    {
-	      if (*++cp == c || *cp == '\0')
-		return (char *) cp;
-	      if (*++cp == c || *cp == '\0')
-		return (char *) cp;
-	      if (*++cp == c || *cp == '\0')
-		return (char *) cp;
-	      if (*++cp == c || *cp == '\0')
-		return (char *) cp;
-	    }
-	}
     }
+  while (!haszero2(longword, longword ^ repeated_c));
 
-  /* This should never happen.  */
-  return NULL;
+  found = whichzero2(longword, longword ^ repeated_c);
+  return (char *) (longword_ptr - 1) + found;
 }
 
 weak_alias (__strchrnul, strchrnul)
-- 
2.9.3

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 06/11] Improve generic memrchr
  2016-12-17  6:57 [PATCH 00/11] Improve generic string routines Richard Henderson
                   ` (3 preceding siblings ...)
  2016-12-17  6:57 ` [PATCH 01/11] Improve generic strlen Richard Henderson
@ 2016-12-17  6:57 ` Richard Henderson
  2016-12-20 12:30   ` Adhemerval Zanella
  2016-12-17  6:57 ` [PATCH 09/11] hppa: Add memcopy.h Richard Henderson
                   ` (8 subsequent siblings)
  13 siblings, 1 reply; 32+ messages in thread
From: Richard Henderson @ 2016-12-17  6:57 UTC (permalink / raw)
  To: libc-alpha

    * string/memrchr.c: Use haszero.h, whichzero.h.
---
 string/memrchr.c | 180 ++++++++++---------------------------------------------
 1 file changed, 30 insertions(+), 150 deletions(-)

diff --git a/string/memrchr.c b/string/memrchr.c
index b9b0c9e..d45b261 100644
--- a/string/memrchr.c
+++ b/string/memrchr.c
@@ -21,180 +21,60 @@
    License along with the GNU C Library; if not, see
    <http://www.gnu.org/licenses/>.  */
 
-#include <stdlib.h>
-
-#ifdef HAVE_CONFIG_H
-# include <config.h>
-#endif
-
-#undef __ptr_t
-#define __ptr_t void *
-
-#if defined _LIBC
-# include <string.h>
-# include <memcopy.h>
-#endif
-
-#if defined HAVE_LIMITS_H || defined _LIBC
-# include <limits.h>
-#endif
-
-#define LONG_MAX_32_BITS 2147483647
-
-#ifndef LONG_MAX
-# define LONG_MAX LONG_MAX_32_BITS
-#endif
-
-#include <sys/types.h>
+#include <string.h>
+#include <stdint.h>
+#include <haszero.h>
+#include <whichzero.h>
 
 #undef __memrchr
 #undef memrchr
 
-#ifndef weak_alias
-# define __memrchr memrchr
-#endif
-
 /* Search no more than N bytes of S for C.  */
-__ptr_t
-#ifndef MEMRCHR
-__memrchr
-#else
-MEMRCHR
-#endif
-     (const __ptr_t s, int c_in, size_t n)
+void *
+__memrchr (const void *s, int c_in, size_t n)
 {
-  const unsigned char *char_ptr;
+  const unsigned char *char_ptr = (const unsigned char *) s + n;
   const unsigned long int *longword_ptr;
-  unsigned long int longword, magic_bits, charmask;
+  unsigned long int longword, repeated_c;
+  uintptr_t i, align;
   unsigned char c;
 
   c = (unsigned char) c_in;
 
   /* Handle the last few characters by reading one character at a time.
      Do this until CHAR_PTR is aligned on a longword boundary.  */
-  for (char_ptr = (const unsigned char *) s + n;
-       n > 0 && ((unsigned long int) char_ptr
-		 & (sizeof (longword) - 1)) != 0;
-       --n)
+  align = (uintptr_t)char_ptr % sizeof(longword);
+  for (i = 0; i < align; ++i)
     if (*--char_ptr == c)
-      return (__ptr_t) char_ptr;
+      return (void *) char_ptr;
 
-  /* All these elucidatory comments refer to 4-byte longwords,
-     but the theory applies equally well to 8-byte longwords.  */
+  /* Set up a longword, each of whose bytes is C.  */
+  repeated_c = (-1ul / 0xff) * c;
 
   longword_ptr = (const unsigned long int *) char_ptr;
+  n -= align;
+  if (__glibc_unlikely(n == 0))
+    return NULL;
 
-  /* Bits 31, 24, 16, and 8 of this number are zero.  Call these bits
-     the "holes."  Note that there is a hole just to the left of
-     each byte, with an extra at the end:
-
-     bits:  01111110 11111110 11111110 11111111
-     bytes: AAAAAAAA BBBBBBBB CCCCCCCC DDDDDDDD
-
-     The 1-bits make sure that carries propagate to the next 0-bit.
-     The 0-bits provide holes for carries to fall into.  */
-  magic_bits = -1;
-  magic_bits = magic_bits / 0xff * 0xfe << 1 >> 1 | 1;
-
-  /* Set up a longword, each of whose bytes is C.  */
-  charmask = c | (c << 8);
-  charmask |= charmask << 16;
-#if LONG_MAX > LONG_MAX_32_BITS
-  charmask |= charmask << 32;
-#endif
-
-  /* Instead of the traditional loop which tests each character,
-     we will test a longword at a time.  The tricky part is testing
-     if *any of the four* bytes in the longword in question are zero.  */
-  while (n >= sizeof (longword))
+  /* Loop while we have more than one longword remaining.  */
+  while (n > sizeof (longword))
     {
-      /* We tentatively exit the loop if adding MAGIC_BITS to
-	 LONGWORD fails to change any of the hole bits of LONGWORD.
-
-	 1) Is this safe?  Will it catch all the zero bytes?
-	 Suppose there is a byte with all zeros.  Any carry bits
-	 propagating from its left will fall into the hole at its
-	 least significant bit and stop.  Since there will be no
-	 carry from its most significant bit, the LSB of the
-	 byte to the left will be unchanged, and the zero will be
-	 detected.
-
-	 2) Is this worthwhile?  Will it ignore everything except
-	 zero bytes?  Suppose every byte of LONGWORD has a bit set
-	 somewhere.  There will be a carry into bit 8.  If bit 8
-	 is set, this will carry into bit 16.  If bit 8 is clear,
-	 one of bits 9-15 must be set, so there will be a carry
-	 into bit 16.  Similarly, there will be a carry into bit
-	 24.  If one of bits 24-30 is set, there will be a carry
-	 into bit 31, so all of the hole bits will be changed.
-
-	 The one misfire occurs when bits 24-30 are clear and bit
-	 31 is set; in this case, the hole at bit 31 is not
-	 changed.  If we had access to the processor carry flag,
-	 we could close this loophole by putting the fourth hole
-	 at bit 32!
-
-	 So it ignores everything except 128's, when they're aligned
-	 properly.
-
-	 3) But wait!  Aren't we looking for C, not zero?
-	 Good point.  So what we do is XOR LONGWORD with a longword,
-	 each of whose bytes is C.  This turns each byte that is C
-	 into a zero.  */
-
-      longword = *--longword_ptr ^ charmask;
-
-      /* Add MAGIC_BITS to LONGWORD.  */
-      if ((((longword + magic_bits)
-
-	    /* Set those bits that were unchanged by the addition.  */
-	    ^ ~longword)
-
-	   /* Look at only the hole bits.  If any of the hole bits
-	      are unchanged, most likely one of the bytes was a
-	      zero.  */
-	   & ~magic_bits) != 0)
-	{
-	  /* Which of the bytes was C?  If none of them were, it was
-	     a misfire; continue the search.  */
-
-	  const unsigned char *cp = (const unsigned char *) longword_ptr;
-
-#if LONG_MAX > 2147483647
-	  if (cp[7] == c)
-	    return (__ptr_t) &cp[7];
-	  if (cp[6] == c)
-	    return (__ptr_t) &cp[6];
-	  if (cp[5] == c)
-	    return (__ptr_t) &cp[5];
-	  if (cp[4] == c)
-	    return (__ptr_t) &cp[4];
-#endif
-	  if (cp[3] == c)
-	    return (__ptr_t) &cp[3];
-	  if (cp[2] == c)
-	    return (__ptr_t) &cp[2];
-	  if (cp[1] == c)
-	    return (__ptr_t) &cp[1];
-	  if (cp[0] == c)
-	    return (__ptr_t) cp;
-	}
-
+      longword = *--longword_ptr ^ repeated_c;
+      if (haszero (longword))
+	goto found;
       n -= sizeof (longword);
     }
 
-  char_ptr = (const unsigned char *) longword_ptr;
-
-  while (n-- > 0)
+  /* Since our pointer is aligned, we can always read the last longword.  */
+  longword = *--longword_ptr ^ repeated_c;
+  if (haszero (longword))
     {
-      if (*--char_ptr == c)
-	return (__ptr_t) char_ptr;
+ found:
+      i = whichzero (longword);
+      if (sizeof (longword) - 1 - i < n)
+	return (char *) longword_ptr + i;
     }
 
-  return 0;
+  return NULL;
 }
-#ifndef MEMRCHR
-# ifdef weak_alias
 weak_alias (__memrchr, memrchr)
-# endif
-#endif
-- 
2.9.3

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 07/11] Improve generic strnlen
  2016-12-17  6:57 [PATCH 00/11] Improve generic string routines Richard Henderson
@ 2016-12-17  6:57 ` Richard Henderson
  2016-12-19 15:00   ` Adhemerval Zanella
  2016-12-17  6:57 ` [PATCH 02/11] Improve generic strchr Richard Henderson
                   ` (12 subsequent siblings)
  13 siblings, 1 reply; 32+ messages in thread
From: Richard Henderson @ 2016-12-17  6:57 UTC (permalink / raw)
  To: libc-alpha

        * string/strnlen.c: Use haszero.h, whichzero.h.
---
 string/strnlen.c | 125 +++++++++----------------------------------------------
 1 file changed, 19 insertions(+), 106 deletions(-)

diff --git a/string/strnlen.c b/string/strnlen.c
index b2b0664..9927b9d 100644
--- a/string/strnlen.c
+++ b/string/strnlen.c
@@ -21,7 +21,9 @@
    not, see <http://www.gnu.org/licenses/>.  */
 
 #include <string.h>
-#include <stdlib.h>
+#include <stdint.h>
+#include <haszero.h>
+#include <whichzero.h>
 
 /* Find the length of S, but scan at most MAXLEN characters.  If no
    '\0' terminator is found in that many characters, return MAXLEN.  */
@@ -33,11 +35,12 @@
 size_t
 __strnlen (const char *str, size_t maxlen)
 {
-  const char *char_ptr, *end_ptr = str + maxlen;
+  const char *char_ptr = str, *end_ptr = str + maxlen;
   const unsigned long int *longword_ptr;
-  unsigned long int longword, himagic, lomagic;
+  unsigned long int longword;
+  size_t i, align;
 
-  if (maxlen == 0)
+  if (__glibc_unlikely (maxlen == 0))
     return 0;
 
   if (__glibc_unlikely (end_ptr < str))
@@ -45,116 +48,26 @@ __strnlen (const char *str, size_t maxlen)
 
   /* Handle the first few characters by reading one character at a time.
      Do this until CHAR_PTR is aligned on a longword boundary.  */
-  for (char_ptr = str; ((unsigned long int) char_ptr
-			& (sizeof (longword) - 1)) != 0;
-       ++char_ptr)
+  align = -(uintptr_t)char_ptr % sizeof(longword);
+  for (i = 0; i < align; ++i, ++char_ptr)
     if (*char_ptr == '\0')
-      {
-	if (char_ptr > end_ptr)
-	  char_ptr = end_ptr;
-	return char_ptr - str;
-      }
+      goto found;
 
-  /* All these elucidatory comments refer to 4-byte longwords,
-     but the theory applies equally well to 8-byte longwords.  */
+  longword_ptr = (const unsigned long int *) char_ptr;
+  char_ptr = end_ptr;
 
-  longword_ptr = (unsigned long int *) char_ptr;
-
-  /* Bits 31, 24, 16, and 8 of this number are zero.  Call these bits
-     the "holes."  Note that there is a hole just to the left of
-     each byte, with an extra at the end:
-
-     bits:  01111110 11111110 11111110 11111111
-     bytes: AAAAAAAA BBBBBBBB CCCCCCCC DDDDDDDD
-
-     The 1-bits make sure that carries propagate to the next 0-bit.
-     The 0-bits provide holes for carries to fall into.  */
-  himagic = 0x80808080L;
-  lomagic = 0x01010101L;
-  if (sizeof (longword) > 4)
+  while (longword_ptr < (const unsigned long int *) end_ptr)
     {
-      /* 64-bit version of the magic.  */
-      /* Do the shift in two steps to avoid a warning if long has 32 bits.  */
-      himagic = ((himagic << 16) << 16) | himagic;
-      lomagic = ((lomagic << 16) << 16) | lomagic;
-    }
-  if (sizeof (longword) > 8)
-    abort ();
-
-  /* Instead of the traditional loop which tests each character,
-     we will test a longword at a time.  The tricky part is testing
-     if *any of the four* bytes in the longword in question are zero.  */
-  while (longword_ptr < (unsigned long int *) end_ptr)
-    {
-      /* We tentatively exit the loop if adding MAGIC_BITS to
-	 LONGWORD fails to change any of the hole bits of LONGWORD.
-
-	 1) Is this safe?  Will it catch all the zero bytes?
-	 Suppose there is a byte with all zeros.  Any carry bits
-	 propagating from its left will fall into the hole at its
-	 least significant bit and stop.  Since there will be no
-	 carry from its most significant bit, the LSB of the
-	 byte to the left will be unchanged, and the zero will be
-	 detected.
-
-	 2) Is this worthwhile?  Will it ignore everything except
-	 zero bytes?  Suppose every byte of LONGWORD has a bit set
-	 somewhere.  There will be a carry into bit 8.  If bit 8
-	 is set, this will carry into bit 16.  If bit 8 is clear,
-	 one of bits 9-15 must be set, so there will be a carry
-	 into bit 16.  Similarly, there will be a carry into bit
-	 24.  If one of bits 24-30 is set, there will be a carry
-	 into bit 31, so all of the hole bits will be changed.
-
-	 The one misfire occurs when bits 24-30 are clear and bit
-	 31 is set; in this case, the hole at bit 31 is not
-	 changed.  If we had access to the processor carry flag,
-	 we could close this loophole by putting the fourth hole
-	 at bit 32!
-
-	 So it ignores everything except 128's, when they're aligned
-	 properly.  */
-
-      longword = *longword_ptr++;
-
-      if ((longword - lomagic) & himagic)
+      longword = *longword_ptr;
+      if (haszero (longword))
 	{
-	  /* Which of the bytes was the zero?  If none of them were, it was
-	     a misfire; continue the search.  */
-
-	  const char *cp = (const char *) (longword_ptr - 1);
-
-	  char_ptr = cp;
-	  if (cp[0] == 0)
-	    break;
-	  char_ptr = cp + 1;
-	  if (cp[1] == 0)
-	    break;
-	  char_ptr = cp + 2;
-	  if (cp[2] == 0)
-	    break;
-	  char_ptr = cp + 3;
-	  if (cp[3] == 0)
-	    break;
-	  if (sizeof (longword) > 4)
-	    {
-	      char_ptr = cp + 4;
-	      if (cp[4] == 0)
-		break;
-	      char_ptr = cp + 5;
-	      if (cp[5] == 0)
-		break;
-	      char_ptr = cp + 6;
-	      if (cp[6] == 0)
-		break;
-	      char_ptr = cp + 7;
-	      if (cp[7] == 0)
-		break;
-	    }
+	  char_ptr = (const char *) longword_ptr + whichzero (longword);
+	  break;
 	}
-      char_ptr = end_ptr;
+      longword_ptr++;
     }
 
+ found:
   if (char_ptr > end_ptr)
     char_ptr = end_ptr;
   return char_ptr - str;
-- 
2.9.3

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 01/11] Improve generic strlen
  2016-12-17  6:57 [PATCH 00/11] Improve generic string routines Richard Henderson
                   ` (2 preceding siblings ...)
  2016-12-17  6:57 ` [PATCH 03/11] Improve generic memchr Richard Henderson
@ 2016-12-17  6:57 ` Richard Henderson
  2016-12-19 14:33   ` Adhemerval Zanella
  2016-12-17  6:57 ` [PATCH 06/11] Improve generic memrchr Richard Henderson
                   ` (9 subsequent siblings)
  13 siblings, 1 reply; 32+ messages in thread
From: Richard Henderson @ 2016-12-17  6:57 UTC (permalink / raw)
  To: libc-alpha

Extract haszero and whichzero tests into headers that can be
tailored for the architecture.

	* sysdeps/generic/haszero.h: New file.
	* sysdeps/generic/whichzero.h: New file.
	* sysdeps/generic/extractbyte.h: New file.
	* string/strlen.c: Use them.
---
 string/strlen.c               | 80 +++++++++----------------------------------
 sysdeps/generic/extractbyte.h | 34 ++++++++++++++++++
 sysdeps/generic/haszero.h     | 42 +++++++++++++++++++++++
 sysdeps/generic/whichzero.h   | 67 ++++++++++++++++++++++++++++++++++++
 4 files changed, 160 insertions(+), 63 deletions(-)
 create mode 100644 sysdeps/generic/extractbyte.h
 create mode 100644 sysdeps/generic/haszero.h
 create mode 100644 sysdeps/generic/whichzero.h

diff --git a/string/strlen.c b/string/strlen.c
index 4943ce2..f2a53a6 100644
--- a/string/strlen.c
+++ b/string/strlen.c
@@ -20,6 +20,9 @@
 
 #include <string.h>
 #include <stdlib.h>
+#include <stdint.h>
+#include <haszero.h>
+#include <whichzero.h>
 
 #undef strlen
 
@@ -32,78 +35,29 @@
 size_t
 STRLEN (const char *str)
 {
-  const char *char_ptr;
+  const char *char_ptr = str;
   const unsigned long int *longword_ptr;
-  unsigned long int longword, himagic, lomagic;
+  unsigned long int longword;
+  uintptr_t i, align;
 
   /* Handle the first few characters by reading one character at a time.
      Do this until CHAR_PTR is aligned on a longword boundary.  */
-  for (char_ptr = str; ((unsigned long int) char_ptr
-			& (sizeof (longword) - 1)) != 0;
-       ++char_ptr)
+  align = -(uintptr_t)char_ptr % sizeof(longword);
+  for (i = 0; i < align; ++i, ++char_ptr)
     if (*char_ptr == '\0')
-      return char_ptr - str;
+      goto found;
 
-  /* All these elucidatory comments refer to 4-byte longwords,
-     but the theory applies equally well to 8-byte longwords.  */
-
-  longword_ptr = (unsigned long int *) char_ptr;
-
-  /* Bits 31, 24, 16, and 8 of this number are zero.  Call these bits
-     the "holes."  Note that there is a hole just to the left of
-     each byte, with an extra at the end:
-
-     bits:  01111110 11111110 11111110 11111111
-     bytes: AAAAAAAA BBBBBBBB CCCCCCCC DDDDDDDD
-
-     The 1-bits make sure that carries propagate to the next 0-bit.
-     The 0-bits provide holes for carries to fall into.  */
-  himagic = 0x80808080L;
-  lomagic = 0x01010101L;
-  if (sizeof (longword) > 4)
-    {
-      /* 64-bit version of the magic.  */
-      /* Do the shift in two steps to avoid a warning if long has 32 bits.  */
-      himagic = ((himagic << 16) << 16) | himagic;
-      lomagic = ((lomagic << 16) << 16) | lomagic;
-    }
-  if (sizeof (longword) > 8)
-    abort ();
-
-  /* Instead of the traditional loop which tests each character,
-     we will test a longword at a time.  The tricky part is testing
-     if *any of the four* bytes in the longword in question are zero.  */
-  for (;;)
+  longword_ptr = (const unsigned long int *) char_ptr;
+  do
     {
       longword = *longword_ptr++;
+    }
+  while (!haszero(longword));
 
-      if (((longword - lomagic) & ~longword & himagic) != 0)
-	{
-	  /* Which of the bytes was the zero?  If none of them were, it was
-	     a misfire; continue the search.  */
-
-	  const char *cp = (const char *) (longword_ptr - 1);
+  char_ptr = (const char *) (longword_ptr - 1);
+  char_ptr += whichzero(longword);
 
-	  if (cp[0] == 0)
-	    return cp - str;
-	  if (cp[1] == 0)
-	    return cp - str + 1;
-	  if (cp[2] == 0)
-	    return cp - str + 2;
-	  if (cp[3] == 0)
-	    return cp - str + 3;
-	  if (sizeof (longword) > 4)
-	    {
-	      if (cp[4] == 0)
-		return cp - str + 4;
-	      if (cp[5] == 0)
-		return cp - str + 5;
-	      if (cp[6] == 0)
-		return cp - str + 6;
-	      if (cp[7] == 0)
-		return cp - str + 7;
-	    }
-	}
-    }
+ found:
+  return char_ptr - str;
 }
 libc_hidden_builtin_def (strlen)
diff --git a/sysdeps/generic/extractbyte.h b/sysdeps/generic/extractbyte.h
new file mode 100644
index 0000000..ace2cda
--- /dev/null
+++ b/sysdeps/generic/extractbyte.h
@@ -0,0 +1,34 @@
+/* extractbyte.h -- function memory order byte extract.  Generic C version.
+   Copyright (C) 2016 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef EXTRACTBYTE_H
+#define EXTRACTBYTE_H	1
+
+#include <limits.h>
+#include <endian.h>
+
+static inline unsigned char
+extractbyte(unsigned long int x, unsigned idx)
+{
+  if (__BYTE_ORDER == __LITTLE_ENDIAN)
+    return x >> (idx * CHAR_BIT);
+  else
+    return x >> (sizeof(x) - 1 - idx) * CHAR_BIT;
+}
+
+#endif /* EXTRACTBYTE_H */
diff --git a/sysdeps/generic/haszero.h b/sysdeps/generic/haszero.h
new file mode 100644
index 0000000..084741c
--- /dev/null
+++ b/sysdeps/generic/haszero.h
@@ -0,0 +1,42 @@
+/* haszero.h -- function for zero byte detection.  Generic C version.
+   Copyright (C) 2016 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef HASZERO_H
+#define HASZERO_H	1
+
+/* See https://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord */
+
+static inline unsigned long int
+haszero(unsigned long int x)
+{
+  unsigned long int ones = -1UL / 0xff;
+  unsigned long int signs = ones * 0x80;
+  return (x - ones) & ~x & signs;
+}
+
+/* Likewise, but for two words simultaneously.  */
+
+static inline unsigned long int
+haszero2(unsigned long int x1, unsigned long int x2)
+{
+  unsigned long int ones = -1UL / 0xff;
+  unsigned long int signs = ones * 0x80;
+  return (((x1 - ones) & ~x1) | ((x2 - ones) & ~x2)) & signs;
+}
+
+#endif /* haszero.h */
diff --git a/sysdeps/generic/whichzero.h b/sysdeps/generic/whichzero.h
new file mode 100644
index 0000000..346b09a
--- /dev/null
+++ b/sysdeps/generic/whichzero.h
@@ -0,0 +1,67 @@
+/* whichzero.h -- functions for zero byte searching.  Generic C version.
+   Copyright (C) 2016 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef WHICHZERO_H
+#define WHICHZERO_H 1
+
+#include <endian.h>
+
+/* A subroutine for the whichzero and whichzero2 functions.  Given a
+   test word C, return the index of the first byte in memory order
+   that contains 0x80 (all other bits will be zero).  */
+
+static inline unsigned int
+whichzero_ffs(unsigned long int c)
+{
+  if (__BYTE_ORDER == __LITTLE_ENDIAN)
+    return __builtin_ctzl (c) / 8u;
+  else
+    return __builtin_clzl (c) / 8u;
+}
+
+/* Given a long that is known to contain a zero byte, return the
+   index of the first such within the long in host memory order.  */
+
+static inline unsigned int
+whichzero(unsigned long int x)
+{
+  /* For each byte, find not-zero by
+     (0) And 0x7f so that we cannot carry between bytes,
+     (1) Add 0x7f so that we carry into 0x80,
+     (2) Or in the original byte which might have contained 0x80.
+     Then invert and mask such that 0x80 is set iff the byte was zero.  */
+  unsigned long int m = (-1UL / 0xff) * 0x7f;
+  unsigned long int c = ~(((x & m) + m) | x | m);
+
+  return whichzero_ffs(c);
+}
+
+/* Similarly, but perform the test for two longs simultaneously.  */
+
+static inline unsigned int
+whichzero2(unsigned long int x1, unsigned long int x2)
+{
+  unsigned long int m = (-1UL / 0xff) * 0x7f;
+  unsigned long int c1 = ((x1 & m) + m) | x1;
+  unsigned long int c2 = ((x2 & m) + m) | x2;
+  unsigned long int c = ~((c1 & c2) | m);
+
+  return whichzero_ffs(c);
+}
+
+#endif /* whichzero.h */
-- 
2.9.3

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 09/11] hppa: Add memcopy.h
  2016-12-17  6:57 [PATCH 00/11] Improve generic string routines Richard Henderson
                   ` (4 preceding siblings ...)
  2016-12-17  6:57 ` [PATCH 06/11] Improve generic memrchr Richard Henderson
@ 2016-12-17  6:57 ` Richard Henderson
  2016-12-17  6:57 ` [PATCH 10/11] hppa: Add haszero.h and whichzero.h Richard Henderson
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 32+ messages in thread
From: Richard Henderson @ 2016-12-17  6:57 UTC (permalink / raw)
  To: libc-alpha

GCC's combine pass cannot merge (x >> c | y << (32 - c)) into a
double-word shift unless (1) the subtract is in the same basic block
and (2) the result of the subtract is used exactly once.  Neither
condition is true for any use of MERGE.

By forcing the use of a double-word shift, we not only reduce
contention on SAR, but also allow the setting of SAR to be hoisted
outside of a loop.

	* sysdeps/hppa/memcopy.h: New file.
---
 sysdeps/hppa/memcopy.h | 44 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 44 insertions(+)
 create mode 100644 sysdeps/hppa/memcopy.h

diff --git a/sysdeps/hppa/memcopy.h b/sysdeps/hppa/memcopy.h
new file mode 100644
index 0000000..4076b8b
--- /dev/null
+++ b/sysdeps/hppa/memcopy.h
@@ -0,0 +1,44 @@
+/* memcopy.h -- definitions for memory copy functions, PA-RISC version.
+   Copyright (C) 2016 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <sysdeps/generic/memcopy.h>
+
+/* Use a single double-word shift instead of two shifts and an ior.
+   If the uses of MERGE were close to the computation of shl/shr,
+   the compiler might have been able to create this itself.
+   But instead that computation is well separated.
+
+   Using an inline function instead of a macro is the easiest way
+   to ensure that the types are correct.  */
+
+#undef MERGE
+
+extern void link_error(void);
+
+static inline op_t
+MERGE(op_t w0, int shl, op_t w1, int shr)
+{
+  op_t res;
+  if (OPSIZ == 4)
+    asm("shrpw %1,%2,%%sar,%0" : "=r"(res) : "r"(w0), "r"(w1), "q"(shr));
+  else if (OPSIZ == 8)
+    asm("shrpd %1,%2,%%sar,%0" : "=r"(res) : "r"(w0), "r"(w1), "q"(shr));
+  else
+    link_error(), res = 0;
+  return res;
+}
-- 
2.9.3

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 10/11] hppa: Add haszero.h and whichzero.h
  2016-12-17  6:57 [PATCH 00/11] Improve generic string routines Richard Henderson
                   ` (5 preceding siblings ...)
  2016-12-17  6:57 ` [PATCH 09/11] hppa: Add memcopy.h Richard Henderson
@ 2016-12-17  6:57 ` Richard Henderson
  2016-12-19 14:54   ` Adhemerval Zanella
  2016-12-17  6:57 ` [PATCH 04/11] Improve generic strchrnul Richard Henderson
                   ` (6 subsequent siblings)
  13 siblings, 1 reply; 32+ messages in thread
From: Richard Henderson @ 2016-12-17  6:57 UTC (permalink / raw)
  To: libc-alpha

Use UXOR,SBZ to test for a zero byte within a word.  While we can
get semi-decent code out of asm-goto, we would do slightly better
with a compiler builtin.

For whichzero, sequential testing of bytes is less expensive than any
tricks that involve a count-leading-zeros insn that we don't have.

	* sysdeps/hppa/haszero.h: New file.
	* sysdeps/hppa/whichzero.h: New file.
---
 sysdeps/hppa/haszero.h   | 82 ++++++++++++++++++++++++++++++++++++++++++++++++
 sysdeps/hppa/whichzero.h | 70 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 152 insertions(+)
 create mode 100644 sysdeps/hppa/haszero.h
 create mode 100644 sysdeps/hppa/whichzero.h

diff --git a/sysdeps/hppa/haszero.h b/sysdeps/hppa/haszero.h
new file mode 100644
index 0000000..99cc6fc
--- /dev/null
+++ b/sysdeps/hppa/haszero.h
@@ -0,0 +1,82 @@
+/* haszero.h -- function for zero byte detection.  HPPA version.
+   Copyright (C) 2016 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef HASZERO_H
+#define HASZERO_H	1
+
+static inline unsigned long int
+haszero(unsigned long int x)
+{
+#if __GNUC_PREREQ(4, 5)
+  /* It's more useful to expose a control transfer to the compiler
+     than to expose a proper boolean result.  */
+  if (sizeof(x) == 8)
+    asm goto ("uxor,*sbz %%r0,%0,%%r0\n\tb,n %l1" : : "r"(x) : : nbz);
+  else
+    asm goto ("uxor,sbz %%r0,%0,%%r0\n\tb,n %l1" : : "r"(x) : : nbz);
+  return 1;
+ nbz:
+  return 0;
+#else
+  unsigned long int ret;
+  if (sizeof(x) == 8)
+    asm ("uxor,*sbz %%r0,%1,%%r0\n\tcopy %%r0,%0"
+	 : "=r"(ret) : "r"(x), "0"(1));
+  else
+    asm ("uxor,sbz %%r0,%1,%%r0\n\tcopy %%r0,%0"
+        : "=r"(ret) : "r"(x), "0"(1));
+  return ret;
+#endif
+}
+
+/* Likewise, but for two words simultaneously.  */
+
+static inline unsigned long int
+haszero2(unsigned long int x1, unsigned long int x2)
+{
+#if __GNUC_PREREQ(4, 5)
+  /* It's more useful to expose a control transfer to the compiler
+     than to expose a proper boolean result.  */
+  if (sizeof(x1) == 8)
+    asm goto ("uxor,*sbz %%r0,%0,%%r0\n\t"
+	      "uxor,*nbz %%r0,%1,%%r0\n\t"
+	      "b,n %l2" : : "r"(x1), "r"(x2) : : sbz);
+  else
+    asm goto ("uxor,sbz %%r0,%0,%%r0\n\t"
+	      "uxor,nbz %%r0,%1,%%r0\n\t"
+	      "b,n %l2" : : "r"(x1), "r"(x2) : : sbz);
+  return 0;
+ sbz:
+  return 1;
+#else
+  unsigned long int ret;
+  if (sizeof(x1) == 8)
+    asm ("uxor,*sbz %%r0,%1,%%r0\n\t"
+	 "uxor,*nbz %%r0,%2,%%r0\n\t"
+	 "ldi 1,%0"
+	 : "=r"(ret) : "r"(x1), "r"(x2), "0"(0));
+  else
+    asm ("uxor,sbz %%r0,%1,%%r0\n\t"
+	 "uxor,nbz %%r0,%2,%%r0\n\t"
+	 "ldi 1,%0"
+	 : "=r"(ret) : "r"(x1), "r"(x2), "0"(0));
+  return ret;
+#endif
+}
+
+#endif /* haszero.h */
diff --git a/sysdeps/hppa/whichzero.h b/sysdeps/hppa/whichzero.h
new file mode 100644
index 0000000..ef18cc7
--- /dev/null
+++ b/sysdeps/hppa/whichzero.h
@@ -0,0 +1,70 @@
+/* whichzero.h -- functions for zero byte searching.  HPPA version.
+   Copyright (C) 2016 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef HPPA_WHICHZERO_H
+#define HPPA_WHICHZERO_H 1
+
+/* Given a long that is known to contain a zero byte, return the
+   index of the first such within the long in host memory order.  */
+
+static inline unsigned int
+whichzero(unsigned long int x)
+{
+  unsigned int ret;
+
+  _Static_assert (sizeof(x) == 4, "64-bit not supported");
+
+  /* Since we have no clz insn, direct tests of the bytes is faster
+     than loading up the constants to do the masking.  */
+  asm ("extrw,u,<> %1,23,8,%%r0\n\t"
+       "ldi 2,%0\n\t"
+       "extrw,u,<> %1,15,8,%%r0\n\t"
+       "ldi 1,%0\n\t"
+       "extrw,u,<> %1,7,8,%%r0\n\t"
+       "ldi 0,%0"
+       : "=r"(ret) : "r"(x), "0"(3));
+
+  return ret;
+}
+
+/* Similarly, but perform the test for two longs simultaneously.  */
+
+static inline unsigned int
+whichzero2(unsigned long int x1, unsigned long int x2)
+{
+  unsigned int ret;
+
+  _Static_assert (sizeof(x1) == 4, "64-bit not supported");
+
+  /* Since we have no clz insn, direct tests of the bytes is faster
+     than loading up the constants to do the masking.  */
+  asm ("extrw,u,= %1,23,8,%%r0\n\t"
+       "extrw,u,<> %2,23,8,%%r0\n\t"
+       "ldi 2,%0\n\t"
+       "extrw,u,= %1,15,8,%%r0\n\t"
+       "extrw,u,<> %2,15,8,%%r0\n\t"
+       "ldi 1,%0\n\t"
+       "extrw,u,= %1,7,8,%%r0\n\t"
+       "extrw,u,<> %2,7,8,%%r0\n\t"
+       "ldi 0,%0"
+       : "=r"(ret) : "r"(x1), "r"(x2), "0"(3));
+
+  return ret;
+}
+
+#endif /* whichzero.h */
-- 
2.9.3

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 08/11] Improve generic strcmp
  2016-12-17  6:57 [PATCH 00/11] Improve generic string routines Richard Henderson
                   ` (9 preceding siblings ...)
  2016-12-17  6:58 ` [PATCH 11/11] alpha: Add haszero.h and whichzero.h Richard Henderson
@ 2016-12-17  6:58 ` Richard Henderson
  2016-12-17  7:24 ` [PATCH 00/11] Improve generic string routines Richard Henderson
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 32+ messages in thread
From: Richard Henderson @ 2016-12-17  6:58 UTC (permalink / raw)
  To: libc-alpha

	* string/strcmp.c: Rewrite using memcopy.h and haszero.h.
---
 string/strcmp.c | 100 +++++++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 92 insertions(+), 8 deletions(-)

diff --git a/string/strcmp.c b/string/strcmp.c
index 4b16f99..2b32854 100644
--- a/string/strcmp.c
+++ b/string/strcmp.c
@@ -16,6 +16,11 @@
    <http://www.gnu.org/licenses/>.  */
 
 #include <string.h>
+#include <stdint.h>
+#include <limits.h>
+#include <haszero.h>
+#include <extractbyte.h>
+#include <memcopy.h>
 
 #undef strcmp
 
@@ -29,19 +34,98 @@
 int
 STRCMP (const char *p1, const char *p2)
 {
-  const unsigned char *s1 = (const unsigned char *) p1;
-  const unsigned char *s2 = (const unsigned char *) p2;
+  const unsigned long int *x1, *x2;
+  unsigned long int w1, w2;
   unsigned char c1, c2;
+  uintptr_t i, n, ofs;
+  int diff;
 
-  do
+  /* Handle the unaligned bytes of p1 first.  */
+  n = -(uintptr_t)p1 % sizeof(unsigned long int);
+  for (i = 0; i < n; ++i)
     {
-      c1 = (unsigned char) *s1++;
-      c2 = (unsigned char) *s2++;
-      if (c1 == '\0')
-	return c1 - c2;
+      c1 = *p1++;
+      c2 = *p2++;
+      diff = c1 - c2;
+      if (c1 == '\0' || diff)
+	return diff;
     }
-  while (c1 == c2);
 
+  /* P1 is now aligned to unsigned long.  P2 may or may not be.  */
+  x1 = (const unsigned long int *)p1;
+  w1 = *x1++;
+  ofs = (uintptr_t)p2 % sizeof(unsigned long int);
+  if (ofs == 0)
+    {
+      x2 = (const unsigned long int *)p2;
+      w2 = *x2++;
+      /* Aligned loop.  If a difference is found, exit to compare the
+         bytes.  Else if a zero is found we have equal strings.  */
+      while (w1 == w2)
+	{
+	  if (haszero (w1))
+	    return 0;
+          w1 = *x1++;
+          w2 = *x2++;
+	}
+    }
+  else
+    {
+      unsigned long int w2a, w2b;
+      uintptr_t sh_1, sh_2;
+
+      x2 = (const unsigned long int *)(p2 - ofs);
+      w2a = *x2++;
+      sh_1 = ofs * CHAR_BIT;
+      sh_2 = sizeof(unsigned long int) * CHAR_BIT - sh_1;
+
+      /* Align the first partial of P2, with 0xff for the rest of the
+         bytes so that we can also apply the haszero test to see if we
+         have already reached EOS.  If we have, then we can simply fall
+         through to the final comparison.  */
+      w2 = MERGE (w2a, sh_1, -1UL, sh_2);
+      if (!haszero (w2))
+	{
+	  /* Unaligned loop.  The invariant is that W2B, which is "ahead"
+             of W1, does not contain end-of-string.  Therefore it is safe
+             (and necessary) to read another word from each while we do
+             not have a difference.  */
+	  while (1)
+	    {
+	      w2b = *x2++;
+	      w2 = MERGE (w2a, sh_1, w2b, sh_2);
+	      if (w1 != w2)
+		goto final_cmp;
+	      if (haszero (w2b))
+		break;
+	      w1 = *x1++;
+	      w2a = w2b;
+	    }
+
+	  /* Zero found in the second partial of P2.  If we had EOS
+	     in the aligned word, we have equality.  */
+	  if (haszero (w1))
+	    return 0;
+
+          /* Load the final word of P1 and align the final partial of P2.  */
+	  w1 = *x1++;
+          w2 = MERGE (w2b, sh_1, 0, sh_2);
+	}
+    }
+
+ final_cmp:
+  /* We have two aligned words of data.  */
+  for (i = 0; i < sizeof(unsigned long int) - 1; ++i)
+    {
+      c1 = extractbyte (w1, i);
+      c2 = extractbyte (w2, i);
+      diff = c1 - c2;
+      if (c1 == '\0' || diff)
+	return diff;
+    }
+  c1 = extractbyte (w1, i);
+  c2 = extractbyte (w2, i);
   return c1 - c2;
 }
+
 libc_hidden_builtin_def (strcmp)
-- 
2.9.3

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 11/11] alpha: Add haszero.h and whichzero.h
  2016-12-17  6:57 [PATCH 00/11] Improve generic string routines Richard Henderson
                   ` (8 preceding siblings ...)
  2016-12-17  6:57 ` [PATCH 05/11] Improve generic strrchr Richard Henderson
@ 2016-12-17  6:58 ` Richard Henderson
  2016-12-17  6:58 ` [PATCH 08/11] Improve generic strcmp Richard Henderson
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 32+ messages in thread
From: Richard Henderson @ 2016-12-17  6:58 UTC (permalink / raw)
  To: libc-alpha

While alpha has the more important string functions in assembly,
there are still a few for which the generic routines are used.

Use the CMPBGE insn, via the builtin, for testing of zeros.  Use a
simplified expansion of __builtin_ctz when the insn isn't available.

	* sysdeps/alpha/haszero.h: New file.
	* sysdeps/alpha/whichzero.h: New file.
---
 sysdeps/alpha/haszero.h   | 36 +++++++++++++++++++++++++++++++
 sysdeps/alpha/whichzero.h | 55 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 91 insertions(+)
 create mode 100644 sysdeps/alpha/haszero.h
 create mode 100644 sysdeps/alpha/whichzero.h

diff --git a/sysdeps/alpha/haszero.h b/sysdeps/alpha/haszero.h
new file mode 100644
index 0000000..e5c2c49
--- /dev/null
+++ b/sysdeps/alpha/haszero.h
@@ -0,0 +1,36 @@
+/* haszero.h -- function for zero byte detection.  Alpha version.
+   Copyright (C) 2016 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef HASZERO_H
+#define HASZERO_H	1
+
+static inline unsigned long int
+haszero(unsigned long int x)
+{
+  return __builtin_alpha_cmpbge (0, x);
+}
+
+/* Likewise, but for two words simultaneously.  */
+
+static inline unsigned long int
+haszero2(unsigned long int x1, unsigned long int x2)
+{
+  return __builtin_alpha_cmpbge (0, x1) | __builtin_alpha_cmpbge (0, x2);
+}
+
+#endif /* haszero.h */
diff --git a/sysdeps/alpha/whichzero.h b/sysdeps/alpha/whichzero.h
new file mode 100644
index 0000000..3f8e5c9
--- /dev/null
+++ b/sysdeps/alpha/whichzero.h
@@ -0,0 +1,55 @@
+/* whichzero.h -- functions for zero byte searching.  Alpha version.
+   Copyright (C) 2016 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef WHICHZERO_H
+#define WHICHZERO_H 1
+
+/* A subroutine for the whichzero and whichzero2 functions.  Given a
+   test word C, return the index of the first byte in memory order
+   that contains 0x80 (all other bits will be zero).  */
+
+static inline unsigned int
+whichzero_ffs(unsigned long int c)
+{
+#ifdef __alpha_cix__
+  return __builtin_ctzl (c);
+#else
+  c = c & -c;
+  return (c & 0xf0 ? 4 : 0) + (c & 0xcc ? 2 : 0) + (c & 0xaa ? 1 : 0);
+#endif
+}
+
+/* Given a long that is known to contain a zero byte, return the
+   index of the first such within the long in host memory order.  */
+
+static inline unsigned int
+whichzero(unsigned long int x)
+{
+  return whichzero_ffs (__builtin_alpha_cmpbge (0, x));
+}
+
+/* Similarly, but perform the test for two longs simultaneously.  */
+
+static inline unsigned int
+whichzero2(unsigned long int x1, unsigned long int x2)
+{
+  return whichzero_ffs (__builtin_alpha_cmpbge (0, x1)
+		        | __builtin_alpha_cmpbge (0, x2));
+}
+
+#endif /* whichzero.h */
-- 
2.9.3

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 00/11] Improve generic string routines
  2016-12-17  6:57 [PATCH 00/11] Improve generic string routines Richard Henderson
                   ` (10 preceding siblings ...)
  2016-12-17  6:58 ` [PATCH 08/11] Improve generic strcmp Richard Henderson
@ 2016-12-17  7:24 ` Richard Henderson
  2016-12-19 14:32 ` Adhemerval Zanella
  2016-12-19 16:15 ` Joseph Myers
  13 siblings, 0 replies; 32+ messages in thread
From: Richard Henderson @ 2016-12-17  7:24 UTC (permalink / raw)
  To: libc-alpha

On 12/16/2016 10:57 PM, Richard Henderson wrote:
> I'll note that the comments regarding these tests are in fact out
> of date -- they speak of a number 0x7efefeff which does not appear
> (we use -0x01010101 or 0xfefefeff), and a test misfire which cannot
> actually occur (presumably because we changed algorithms).

Hmph.  So, this comment really only applies to strlen and memchr.

Upon more rigorous examination, there appear to be three variants in use, one 
of which does use 0x7efefeff, and two of which do have the misfire problem.

So the other benefit is that we unify everything to the same, better, algorithm.


r~

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 00/11] Improve generic string routines
  2016-12-17  6:57 [PATCH 00/11] Improve generic string routines Richard Henderson
                   ` (11 preceding siblings ...)
  2016-12-17  7:24 ` [PATCH 00/11] Improve generic string routines Richard Henderson
@ 2016-12-19 14:32 ` Adhemerval Zanella
  2016-12-19 16:15 ` Joseph Myers
  13 siblings, 0 replies; 32+ messages in thread
From: Adhemerval Zanella @ 2016-12-19 14:32 UTC (permalink / raw)
  To: libc-alpha



On 17/12/2016 04:57, Richard Henderson wrote:
> The idea here was to improve hppa performance without going all
> the way to hand-written assembly (at least for now) so that other
> minor architectures could benefit.
> 
> The main improvement, IMO, is pulling out all of the scattered
> "magic" byte tests to a new header, haszero.h.
> 
> I'll note that the comments regarding these tests are in fact out
> of date -- they speak of a number 0x7efefeff which does not appear
> (we use -0x01010101 or 0xfefefeff), and a test misfire which cannot
> actually occur (presumably because we changed algorithms).

Nice, this is something I already started to implement which resembles
your implementation.  The bogus comment was already reported on 
BZ#5806 [1] and these cleanup seems a step further to clean fix it
(we will still need to check on i386 and m68k memchr, strchrnul, strlen,
and rawmemchr that copied and paste the same comment on its assembly
implementation).

> 
> This header can then be easily overridden for a target to use a
> cheaper zero-byte test, if one exists.  I've provided such for
> both hppa and alpha.
> 
> I've tested this for hppa and alpha, both under qemu.  While I
> cannot provide non-simulated numbers for benchmark improvements,
> I'm confident that the real hardware ought to agree in magnitude.
> 
> 
> r~

[1] https://sourceware.org/bugzilla/show_bug.cgi?id=5806

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 01/11] Improve generic strlen
  2016-12-17  6:57 ` [PATCH 01/11] Improve generic strlen Richard Henderson
@ 2016-12-19 14:33   ` Adhemerval Zanella
  2016-12-19 19:28     ` Richard Henderson
  0 siblings, 1 reply; 32+ messages in thread
From: Adhemerval Zanella @ 2016-12-19 14:33 UTC (permalink / raw)
  To: libc-alpha

LGTM and just a few nit below.

On 17/12/2016 04:57, Richard Henderson wrote:
> Extract haszero and whichzero tests into headers that can be
> tailored for the architecture.
> 
> 	* sysdeps/generic/haszero.h: New file.
> 	* sysdeps/generic/whichzero.h: New file.
> 	* sysdeps/generic/extractbyte.h: New file.
> 	* string/strlen.c: Use them.
> ---
>  string/strlen.c               | 80 +++++++++----------------------------------
>  sysdeps/generic/extractbyte.h | 34 ++++++++++++++++++
>  sysdeps/generic/haszero.h     | 42 +++++++++++++++++++++++
>  sysdeps/generic/whichzero.h   | 67 ++++++++++++++++++++++++++++++++++++
>  4 files changed, 160 insertions(+), 63 deletions(-)
>  create mode 100644 sysdeps/generic/extractbyte.h
>  create mode 100644 sysdeps/generic/haszero.h
>  create mode 100644 sysdeps/generic/whichzero.h
> 
> diff --git a/string/strlen.c b/string/strlen.c
> index 4943ce2..f2a53a6 100644
> --- a/string/strlen.c
> +++ b/string/strlen.c
> @@ -20,6 +20,9 @@
>  
>  #include <string.h>
>  #include <stdlib.h>
> +#include <stdint.h>
> +#include <haszero.h>
> +#include <whichzero.h>
>  
>  #undef strlen
>  
> @@ -32,78 +35,29 @@
>  size_t
>  STRLEN (const char *str)
>  {
> -  const char *char_ptr;
> +  const char *char_ptr = str;
>    const unsigned long int *longword_ptr;
> -  unsigned long int longword, himagic, lomagic;
> +  unsigned long int longword;
> +  uintptr_t i, align;
>  
>    /* Handle the first few characters by reading one character at a time.
>       Do this until CHAR_PTR is aligned on a longword boundary.  */
> -  for (char_ptr = str; ((unsigned long int) char_ptr
> -			& (sizeof (longword) - 1)) != 0;
> -       ++char_ptr)
> +  align = -(uintptr_t)char_ptr % sizeof(longword);
> +  for (i = 0; i < align; ++i, ++char_ptr)
>      if (*char_ptr == '\0')
> -      return char_ptr - str;
> +      goto found;

Why not just return 'char_ptr - str' and get rid of label?

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 02/11] Improve generic strchr
  2016-12-17  6:57 ` [PATCH 02/11] Improve generic strchr Richard Henderson
@ 2016-12-19 14:38   ` Adhemerval Zanella
  2016-12-19 19:31     ` Richard Henderson
  0 siblings, 1 reply; 32+ messages in thread
From: Adhemerval Zanella @ 2016-12-19 14:38 UTC (permalink / raw)
  To: libc-alpha



On 17/12/2016 04:57, Richard Henderson wrote:
> 	* string/strchr.c: Use haszero.h, whichzero.h, extractbyte.h.

Since you are changing strchrnul, why not just simplify strchr to just:

--
/* Find the first occurrence of C in S.  */
char *
STRCHR (const char *s, int c_in)
{
  char *r = __strchrnul (s, c_in);
  return *(unsigned char *) r == (unsigned char) c_in ? r : NULL;
}
--

?

It will incur in a function call, but as least it has two main advantages:

 - less icache pressure due potentially less code.
 - an architecture that uses default strchr and strchrnull will require just
   to provide an optimized strchrnul to get a boot on both symbols.

> ---
>  string/strchr.c | 144 ++++++++------------------------------------------------
>  1 file changed, 19 insertions(+), 125 deletions(-)
> 
> diff --git a/string/strchr.c b/string/strchr.c
> index 25c2fe4..a70d72a 100644
> --- a/string/strchr.c
> +++ b/string/strchr.c
> @@ -22,6 +22,10 @@
>  
>  #include <string.h>
>  #include <stdlib.h>
> +#include <stdint.h>
> +#include <haszero.h>
> +#include <whichzero.h>
> +#include <extractbyte.h>
>  
>  #undef strchr
>  
> @@ -35,147 +39,37 @@ STRCHR (const char *s, int c_in)
>  {
>    const unsigned char *char_ptr;
>    const unsigned long int *longword_ptr;
> -  unsigned long int longword, magic_bits, charmask;
> +  unsigned long int longword, repeated_c, found;
> +  uintptr_t i, align;
>    unsigned char c;
>  
>    c = (unsigned char) c_in;
> +  char_ptr = (const unsigned char *) s;
>  
>    /* Handle the first few characters by reading one character at a time.
>       Do this until CHAR_PTR is aligned on a longword boundary.  */
> -  for (char_ptr = (const unsigned char *) s;
> -       ((unsigned long int) char_ptr & (sizeof (longword) - 1)) != 0;
> -       ++char_ptr)
> +  align = -(uintptr_t)char_ptr % sizeof(longword);
> +  for (i = 0; i < align; ++i, ++char_ptr)
>      if (*char_ptr == c)
> -      return (void *) char_ptr;
> +      return (char *) char_ptr;
>      else if (*char_ptr == '\0')
>        return NULL;
>  
> -  /* All these elucidatory comments refer to 4-byte longwords,
> -     but the theory applies equally well to 8-byte longwords.  */
> -
> -  longword_ptr = (unsigned long int *) char_ptr;
> -
> -  /* Bits 31, 24, 16, and 8 of this number are zero.  Call these bits
> -     the "holes."  Note that there is a hole just to the left of
> -     each byte, with an extra at the end:
> -
> -     bits:  01111110 11111110 11111110 11111111
> -     bytes: AAAAAAAA BBBBBBBB CCCCCCCC DDDDDDDD
> -
> -     The 1-bits make sure that carries propagate to the next 0-bit.
> -     The 0-bits provide holes for carries to fall into.  */
> -  magic_bits = -1;
> -  magic_bits = magic_bits / 0xff * 0xfe << 1 >> 1 | 1;
> -
>    /* Set up a longword, each of whose bytes is C.  */
> -  charmask = c | (c << 8);
> -  charmask |= charmask << 16;
> -  if (sizeof (longword) > 4)
> -    /* Do the shift in two steps to avoid a warning if long has 32 bits.  */
> -    charmask |= (charmask << 16) << 16;
> -  if (sizeof (longword) > 8)
> -    abort ();
> +  repeated_c = (-1ul / 0xff) * c;
>  
> -  /* Instead of the traditional loop which tests each character,
> -     we will test a longword at a time.  The tricky part is testing
> -     if *any of the four* bytes in the longword in question are zero.  */
> -  for (;;)
> +  longword_ptr = (unsigned long int *) char_ptr;
> +  do
>      {
> -      /* We tentatively exit the loop if adding MAGIC_BITS to
> -	 LONGWORD fails to change any of the hole bits of LONGWORD.
> -
> -	 1) Is this safe?  Will it catch all the zero bytes?
> -	 Suppose there is a byte with all zeros.  Any carry bits
> -	 propagating from its left will fall into the hole at its
> -	 least significant bit and stop.  Since there will be no
> -	 carry from its most significant bit, the LSB of the
> -	 byte to the left will be unchanged, and the zero will be
> -	 detected.
> -
> -	 2) Is this worthwhile?  Will it ignore everything except
> -	 zero bytes?  Suppose every byte of LONGWORD has a bit set
> -	 somewhere.  There will be a carry into bit 8.  If bit 8
> -	 is set, this will carry into bit 16.  If bit 8 is clear,
> -	 one of bits 9-15 must be set, so there will be a carry
> -	 into bit 16.  Similarly, there will be a carry into bit
> -	 24.  If one of bits 24-30 is set, there will be a carry
> -	 into bit 31, so all of the hole bits will be changed.
> -
> -	 The one misfire occurs when bits 24-30 are clear and bit
> -	 31 is set; in this case, the hole at bit 31 is not
> -	 changed.  If we had access to the processor carry flag,
> -	 we could close this loophole by putting the fourth hole
> -	 at bit 32!
> -
> -	 So it ignores everything except 128's, when they're aligned
> -	 properly.
> -
> -	 3) But wait!  Aren't we looking for C as well as zero?
> -	 Good point.  So what we do is XOR LONGWORD with a longword,
> -	 each of whose bytes is C.  This turns each byte that is C
> -	 into a zero.  */
> -
>        longword = *longword_ptr++;
> -
> -      /* Add MAGIC_BITS to LONGWORD.  */
> -      if ((((longword + magic_bits)
> -
> -	    /* Set those bits that were unchanged by the addition.  */
> -	    ^ ~longword)
> -
> -	   /* Look at only the hole bits.  If any of the hole bits
> -	      are unchanged, most likely one of the bytes was a
> -	      zero.  */
> -	   & ~magic_bits) != 0 ||
> -
> -	  /* That caught zeroes.  Now test for C.  */
> -	  ((((longword ^ charmask) + magic_bits) ^ ~(longword ^ charmask))
> -	   & ~magic_bits) != 0)
> -	{
> -	  /* Which of the bytes was C or zero?
> -	     If none of them were, it was a misfire; continue the search.  */
> -
> -	  const unsigned char *cp = (const unsigned char *) (longword_ptr - 1);
> -
> -	  if (*cp == c)
> -	    return (char *) cp;
> -	  else if (*cp == '\0')
> -	    return NULL;
> -	  if (*++cp == c)
> -	    return (char *) cp;
> -	  else if (*cp == '\0')
> -	    return NULL;
> -	  if (*++cp == c)
> -	    return (char *) cp;
> -	  else if (*cp == '\0')
> -	    return NULL;
> -	  if (*++cp == c)
> -	    return (char *) cp;
> -	  else if (*cp == '\0')
> -	    return NULL;
> -	  if (sizeof (longword) > 4)
> -	    {
> -	      if (*++cp == c)
> -		return (char *) cp;
> -	      else if (*cp == '\0')
> -		return NULL;
> -	      if (*++cp == c)
> -		return (char *) cp;
> -	      else if (*cp == '\0')
> -		return NULL;
> -	      if (*++cp == c)
> -		return (char *) cp;
> -	      else if (*cp == '\0')
> -		return NULL;
> -	      if (*++cp == c)
> -		return (char *) cp;
> -	      else if (*cp == '\0')
> -		return NULL;
> -	    }
> -	}
>      }
> +  while (!haszero2(longword, longword ^ repeated_c));
>  
> -  return NULL;
> +  found = whichzero2(longword, longword ^ repeated_c);
> +  if (extractbyte(longword, found) == c)
> +    return (char *) (longword_ptr - 1) + found;
> +  else
> +    return NULL;
>  }
>  
>  #ifdef weak_alias
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 10/11] hppa: Add haszero.h and whichzero.h
  2016-12-17  6:57 ` [PATCH 10/11] hppa: Add haszero.h and whichzero.h Richard Henderson
@ 2016-12-19 14:54   ` Adhemerval Zanella
  2016-12-19 19:44     ` Richard Henderson
  0 siblings, 1 reply; 32+ messages in thread
From: Adhemerval Zanella @ 2016-12-19 14:54 UTC (permalink / raw)
  To: libc-alpha



On 17/12/2016 04:57, Richard Henderson wrote:

> +static inline unsigned long int
> +haszero(unsigned long int x)
> +{
> +#if __GNUC_PREREQ(4, 5)
> +  /* It's more useful to expose a control transfer to the compiler
> +     than to expose a proper boolean result.  */
> +  if (sizeof(x) == 8)
> +    asm goto ("uxor,*sbz %%r0,%0,%%r0\n\tb,n %l1" : : "r"(x) : : nbz);
> +  else
> +    asm goto ("uxor,sbz %%r0,%0,%%r0\n\tb,n %l1" : : "r"(x) : : nbz);
> +  return 1;
> + nbz:
> +  return 0;
> +#else

Since current GLIBC requires GCC 4.7 as minimum compiler I think we
can get rid of snippets for old compilers.  Same for the other
override functios.

> +  unsigned long int ret;
> +  if (sizeof(x) == 8)
> +    asm ("uxor,*sbz %%r0,%1,%%r0\n\tcopy %%r0,%0"
> +	 : "=r"(ret) : "r"(x), "0"(1));
> +  else
> +    asm ("uxor,sbz %%r0,%1,%%r0\n\tcopy %%r0,%0"
> +        : "=r"(ret) : "r"(x), "0"(1));
> +  return ret;
> +#endif
> +}
> +
> +/* Likewise, but for two words simultaneously.  */
> +
> +static inline unsigned long int
> +haszero2(unsigned long int x1, unsigned long int x2)
> +{
> +#if __GNUC_PREREQ(4, 5)
> +  /* It's more useful to expose a control transfer to the compiler
> +     than to expose a proper boolean result.  */
> +  if (sizeof(x1) == 8)
> +    asm goto ("uxor,*sbz %%r0,%0,%%r0\n\t"
> +	      "uxor,*nbz %%r0,%1,%%r0\n\t"
> +	      "b,n %l2" : : "r"(x1), "r"(x2) : : sbz);
> +  else
> +    asm goto ("uxor,sbz %%r0,%0,%%r0\n\t"
> +	      "uxor,nbz %%r0,%1,%%r0\n\t"
> +	      "b,n %l2" : : "r"(x1), "r"(x2) : : sbz);
> +  return 0;
> + sbz:
> +  return 1;
> +#else
> +  unsigned long int ret;
> +  if (sizeof(x1) == 8)
> +    asm ("uxor,*sbz %%r0,%1,%%r0\n\t"
> +	 "uxor,*nbz %%r0,%2,%%r0\n\t"
> +	 "ldi 1,%0"
> +	 : "=r"(ret) : "r"(x1), "r"(x2), "0"(0));
> +  else
> +    asm ("uxor,sbz %%r0,%1,%%r0\n\t"
> +	 "uxor,nbz %%r0,%2,%%r0\n\t"
> +	 "ldi 1,%0"
> +	 : "=r"(ret) : "r"(x1), "r"(x2), "0"(0));
> +  return ret;
> +#endif
> +}
> +
> +#endif /* haszero.h */
> diff --git a/sysdeps/hppa/whichzero.h b/sysdeps/hppa/whichzero.h
> new file mode 100644
> index 0000000..ef18cc7
> --- /dev/null
> +++ b/sysdeps/hppa/whichzero.h
> @@ -0,0 +1,70 @@
> +/* whichzero.h -- functions for zero byte searching.  HPPA version.
> +   Copyright (C) 2016 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +#ifndef HPPA_WHICHZERO_H
> +#define HPPA_WHICHZERO_H 1
> +
> +/* Given a long that is known to contain a zero byte, return the
> +   index of the first such within the long in host memory order.  */
> +
> +static inline unsigned int
> +whichzero(unsigned long int x)
> +{
> +  unsigned int ret;
> +
> +  _Static_assert (sizeof(x) == 4, "64-bit not supported");
> +
> +  /* Since we have no clz insn, direct tests of the bytes is faster
> +     than loading up the constants to do the masking.  */
> +  asm ("extrw,u,<> %1,23,8,%%r0\n\t"
> +       "ldi 2,%0\n\t"
> +       "extrw,u,<> %1,15,8,%%r0\n\t"
> +       "ldi 1,%0\n\t"
> +       "extrw,u,<> %1,7,8,%%r0\n\t"
> +       "ldi 0,%0"
> +       : "=r"(ret) : "r"(x), "0"(3));
> +
> +  return ret;
> +}
> +
> +/* Similarly, but perform the test for two longs simultaneously.  */
> +
> +static inline unsigned int
> +whichzero2(unsigned long int x1, unsigned long int x2)
> +{
> +  unsigned int ret;
> +
> +  _Static_assert (sizeof(x1) == 4, "64-bit not supported");
> +
> +  /* Since we have no clz insn, direct tests of the bytes is faster
> +     than loading up the constants to do the masking.  */
> +  asm ("extrw,u,= %1,23,8,%%r0\n\t"
> +       "extrw,u,<> %2,23,8,%%r0\n\t"
> +       "ldi 2,%0\n\t"
> +       "extrw,u,= %1,15,8,%%r0\n\t"
> +       "extrw,u,<> %2,15,8,%%r0\n\t"
> +       "ldi 1,%0\n\t"
> +       "extrw,u,= %1,7,8,%%r0\n\t"
> +       "extrw,u,<> %2,7,8,%%r0\n\t"
> +       "ldi 0,%0"
> +       : "=r"(ret) : "r"(x1), "r"(x2), "0"(3));
> +
> +  return ret;
> +}
> +
> +#endif /* whichzero.h */

I am far from a hppa expert, but can't we code the same snippet in C? How
bad would it be compared to this optimized asm?

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 07/11] Improve generic strnlen
  2016-12-17  6:57 ` [PATCH 07/11] Improve generic strnlen Richard Henderson
@ 2016-12-19 15:00   ` Adhemerval Zanella
  2016-12-19 19:32     ` Richard Henderson
  0 siblings, 1 reply; 32+ messages in thread
From: Adhemerval Zanella @ 2016-12-19 15:00 UTC (permalink / raw)
  To: libc-alpha



On 17/12/2016 04:57, Richard Henderson wrote:
>         * string/strnlen.c: Use haszero.h, whichzero.h.

As for strchr, since you already optimizing memchr why not base
strnlen on memchr instead:

--
size_t
__strnlen (const char *str, size_t maxlen)
{
  const char *char_str = memchr (str, 0, maxlen);
  return char_str ? char_str  - s : maxlen;
}
--

> ---
>  string/strnlen.c | 125 +++++++++----------------------------------------------
>  1 file changed, 19 insertions(+), 106 deletions(-)
> 
> diff --git a/string/strnlen.c b/string/strnlen.c
> index b2b0664..9927b9d 100644
> --- a/string/strnlen.c
> +++ b/string/strnlen.c
> @@ -21,7 +21,9 @@
>     not, see <http://www.gnu.org/licenses/>.  */
>  
>  #include <string.h>
> -#include <stdlib.h>
> +#include <stdint.h>
> +#include <haszero.h>
> +#include <whichzero.h>
>  
>  /* Find the length of S, but scan at most MAXLEN characters.  If no
>     '\0' terminator is found in that many characters, return MAXLEN.  */
> @@ -33,11 +35,12 @@
>  size_t
>  __strnlen (const char *str, size_t maxlen)
>  {
> -  const char *char_ptr, *end_ptr = str + maxlen;
> +  const char *char_ptr = str, *end_ptr = str + maxlen;
>    const unsigned long int *longword_ptr;
> -  unsigned long int longword, himagic, lomagic;
> +  unsigned long int longword;
> +  size_t i, align;
>  
> -  if (maxlen == 0)
> +  if (__glibc_unlikely (maxlen == 0))
>      return 0;
>  
>    if (__glibc_unlikely (end_ptr < str))
> @@ -45,116 +48,26 @@ __strnlen (const char *str, size_t maxlen)
>  
>    /* Handle the first few characters by reading one character at a time.
>       Do this until CHAR_PTR is aligned on a longword boundary.  */
> -  for (char_ptr = str; ((unsigned long int) char_ptr
> -			& (sizeof (longword) - 1)) != 0;
> -       ++char_ptr)
> +  align = -(uintptr_t)char_ptr % sizeof(longword);
> +  for (i = 0; i < align; ++i, ++char_ptr)
>      if (*char_ptr == '\0')
> -      {
> -	if (char_ptr > end_ptr)
> -	  char_ptr = end_ptr;
> -	return char_ptr - str;
> -      }
> +      goto found;
>  
> -  /* All these elucidatory comments refer to 4-byte longwords,
> -     but the theory applies equally well to 8-byte longwords.  */
> +  longword_ptr = (const unsigned long int *) char_ptr;
> +  char_ptr = end_ptr;
>  
> -  longword_ptr = (unsigned long int *) char_ptr;
> -
> -  /* Bits 31, 24, 16, and 8 of this number are zero.  Call these bits
> -     the "holes."  Note that there is a hole just to the left of
> -     each byte, with an extra at the end:
> -
> -     bits:  01111110 11111110 11111110 11111111
> -     bytes: AAAAAAAA BBBBBBBB CCCCCCCC DDDDDDDD
> -
> -     The 1-bits make sure that carries propagate to the next 0-bit.
> -     The 0-bits provide holes for carries to fall into.  */
> -  himagic = 0x80808080L;
> -  lomagic = 0x01010101L;
> -  if (sizeof (longword) > 4)
> +  while (longword_ptr < (const unsigned long int *) end_ptr)
>      {
> -      /* 64-bit version of the magic.  */
> -      /* Do the shift in two steps to avoid a warning if long has 32 bits.  */
> -      himagic = ((himagic << 16) << 16) | himagic;
> -      lomagic = ((lomagic << 16) << 16) | lomagic;
> -    }
> -  if (sizeof (longword) > 8)
> -    abort ();
> -
> -  /* Instead of the traditional loop which tests each character,
> -     we will test a longword at a time.  The tricky part is testing
> -     if *any of the four* bytes in the longword in question are zero.  */
> -  while (longword_ptr < (unsigned long int *) end_ptr)
> -    {
> -      /* We tentatively exit the loop if adding MAGIC_BITS to
> -	 LONGWORD fails to change any of the hole bits of LONGWORD.
> -
> -	 1) Is this safe?  Will it catch all the zero bytes?
> -	 Suppose there is a byte with all zeros.  Any carry bits
> -	 propagating from its left will fall into the hole at its
> -	 least significant bit and stop.  Since there will be no
> -	 carry from its most significant bit, the LSB of the
> -	 byte to the left will be unchanged, and the zero will be
> -	 detected.
> -
> -	 2) Is this worthwhile?  Will it ignore everything except
> -	 zero bytes?  Suppose every byte of LONGWORD has a bit set
> -	 somewhere.  There will be a carry into bit 8.  If bit 8
> -	 is set, this will carry into bit 16.  If bit 8 is clear,
> -	 one of bits 9-15 must be set, so there will be a carry
> -	 into bit 16.  Similarly, there will be a carry into bit
> -	 24.  If one of bits 24-30 is set, there will be a carry
> -	 into bit 31, so all of the hole bits will be changed.
> -
> -	 The one misfire occurs when bits 24-30 are clear and bit
> -	 31 is set; in this case, the hole at bit 31 is not
> -	 changed.  If we had access to the processor carry flag,
> -	 we could close this loophole by putting the fourth hole
> -	 at bit 32!
> -
> -	 So it ignores everything except 128's, when they're aligned
> -	 properly.  */
> -
> -      longword = *longword_ptr++;
> -
> -      if ((longword - lomagic) & himagic)
> +      longword = *longword_ptr;
> +      if (haszero (longword))
>  	{
> -	  /* Which of the bytes was the zero?  If none of them were, it was
> -	     a misfire; continue the search.  */
> -
> -	  const char *cp = (const char *) (longword_ptr - 1);
> -
> -	  char_ptr = cp;
> -	  if (cp[0] == 0)
> -	    break;
> -	  char_ptr = cp + 1;
> -	  if (cp[1] == 0)
> -	    break;
> -	  char_ptr = cp + 2;
> -	  if (cp[2] == 0)
> -	    break;
> -	  char_ptr = cp + 3;
> -	  if (cp[3] == 0)
> -	    break;
> -	  if (sizeof (longword) > 4)
> -	    {
> -	      char_ptr = cp + 4;
> -	      if (cp[4] == 0)
> -		break;
> -	      char_ptr = cp + 5;
> -	      if (cp[5] == 0)
> -		break;
> -	      char_ptr = cp + 6;
> -	      if (cp[6] == 0)
> -		break;
> -	      char_ptr = cp + 7;
> -	      if (cp[7] == 0)
> -		break;
> -	    }
> +	  char_ptr = (const char *) longword_ptr + whichzero (longword);
> +	  break;
>  	}
> -      char_ptr = end_ptr;
> +      longword_ptr++;
>      }
>  
> + found:
>    if (char_ptr > end_ptr)
>      char_ptr = end_ptr;
>    return char_ptr - str;
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 00/11] Improve generic string routines
  2016-12-17  6:57 [PATCH 00/11] Improve generic string routines Richard Henderson
                   ` (12 preceding siblings ...)
  2016-12-19 14:32 ` Adhemerval Zanella
@ 2016-12-19 16:15 ` Joseph Myers
  2016-12-19 18:16   ` Richard Henderson
  13 siblings, 1 reply; 32+ messages in thread
From: Joseph Myers @ 2016-12-19 16:15 UTC (permalink / raw)
  To: Richard Henderson; +Cc: libc-alpha

[-- Attachment #1: Type: text/plain, Size: 1819 bytes --]

On Fri, 16 Dec 2016, Richard Henderson wrote:

> I'll note that the comments regarding these tests are in fact out
> of date -- they speak of a number 0x7efefeff which does not appear
> (we use -0x01010101 or 0xfefefeff), and a test misfire which cannot
> actually occur (presumably because we changed algorithms).

Another issue with this comment is bug 5806.  Thus, patches removing that 
comment should include [BZ #5806] in their ChangeLog entries, and once all 
instances of the comment have been removed, that bug should be resolved as 
FIXED with an appropriate milestone set.

Other general observations on the patch series:

* Missing spaces before '(', in lots of places in the patch series.

* Rather than hardcoding unsigned long int as the word type to use, it 
would be better to have another sysdeps header where architectures can 
choose the type to use for this purpose, so that ILP32 configurations with 
64-bit registers can use unsigned long long int instead.  (E.g. there 
aren't many MIPS-specific string functions, so MIPS n32 would benefit from 
using 64-bit integers here.)

* Ondøej Bílka had patches doing similar things in May/June 2015, e.g. 
<https://sourceware.org/ml/libc-alpha/2015-05/msg00882.html> 
<https://sourceware.org/ml/libc-alpha/2015-05/msg00884.html> 
<https://sourceware.org/ml/libc-alpha/2015-05/msg00881.html> 
<https://sourceware.org/ml/libc-alpha/2015-06/msg00072.html> (these may 
not be the latest versions of the relevant patches and may not constitute 
a complete set of relevant patches; it's not very clear which of the 
patches in patchwork are or are not superseded).  I haven't compared the 
approaches in detail to see if there is anything useful to take from those 
old patches not covered in the present series.

-- 
Joseph S. Myers
joseph@codesourcery.com

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 00/11] Improve generic string routines
  2016-12-19 16:15 ` Joseph Myers
@ 2016-12-19 18:16   ` Richard Henderson
  2016-12-19 18:24     ` Joseph Myers
  2016-12-19 18:26     ` Adhemerval Zanella
  0 siblings, 2 replies; 32+ messages in thread
From: Richard Henderson @ 2016-12-19 18:16 UTC (permalink / raw)
  To: Joseph Myers; +Cc: libc-alpha

On 12/19/2016 08:15 AM, Joseph Myers wrote:
> On Fri, 16 Dec 2016, Richard Henderson wrote:
> 
>> I'll note that the comments regarding these tests are in fact out
>> of date -- they speak of a number 0x7efefeff which does not appear
>> (we use -0x01010101 or 0xfefefeff), and a test misfire which cannot
>> actually occur (presumably because we changed algorithms).
> 
> Another issue with this comment is bug 5806.  Thus, patches removing that 
> comment should include [BZ #5806] in their ChangeLog entries, and once all 
> instances of the comment have been removed, that bug should be resolved as 
> FIXED with an appropriate milestone set.

Oh how nice -- outstanding for 8 years.

> * Missing spaces before '(', in lots of places in the patch series.

Ug, yes.  Too much recent work in code bases that prohibit this.

> * Rather than hardcoding unsigned long int as the word type to use, it 
> would be better to have another sysdeps header where architectures can 
> choose the type to use for this purpose, so that ILP32 configurations with 
> 64-bit registers can use unsigned long long int instead.  (E.g. there 
> aren't many MIPS-specific string functions, so MIPS n32 would benefit from 
> using 64-bit integers here.)

Since my strcmp uses memcopy.h for MERGE, perhaps the op_t that's already
defined in there?  Or...

> * Ondøej Bílka had patches doing similar things in May/June 2015, e.g. 
> <https://sourceware.org/ml/libc-alpha/2015-05/msg00882.html> 
> <https://sourceware.org/ml/libc-alpha/2015-05/msg00884.html> 
> <https://sourceware.org/ml/libc-alpha/2015-05/msg00881.html> 
> <https://sourceware.org/ml/libc-alpha/2015-06/msg00072.html> (these may 
> not be the latest versions of the relevant patches and may not constitute 
> a complete set of relevant patches; it's not very clear which of the 
> patches in patchwork are or are not superseded).  I haven't compared the 
> approaches in detail to see if there is anything useful to take from those 
> old patches not covered in the present series.

... if we go with Ondrej's skeletons that defines its own type.

Is there a preference for or against many little headers, like I was using, or
a larger header using ifdefs to allow an architecture to override specific
portions, like Ondrej was using?

Given our recent-ish preference for avoiding #ifdef in favor of #if, I really
don't know which is more preferable.

The things that Ondrej does that I don't are (1) cater to cpus with unaligned
accesses, (2) unroll loops and headers, (3) handle more routines.  There are a
lot of little-endian-only checks in his code and I'm not sure why it would have
to be so.

Perhaps I should look more closely at Ondrej's patches and see how easy it
might be to apply my inline hppa asms...


r~

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 00/11] Improve generic string routines
  2016-12-19 18:16   ` Richard Henderson
@ 2016-12-19 18:24     ` Joseph Myers
  2016-12-19 18:26     ` Adhemerval Zanella
  1 sibling, 0 replies; 32+ messages in thread
From: Joseph Myers @ 2016-12-19 18:24 UTC (permalink / raw)
  To: Richard Henderson; +Cc: libc-alpha

On Mon, 19 Dec 2016, Richard Henderson wrote:

> > * Rather than hardcoding unsigned long int as the word type to use, it 
> > would be better to have another sysdeps header where architectures can 
> > choose the type to use for this purpose, so that ILP32 configurations with 
> > 64-bit registers can use unsigned long long int instead.  (E.g. there 
> > aren't many MIPS-specific string functions, so MIPS n32 would benefit from 
> > using 64-bit integers here.)
> 
> Since my strcmp uses memcopy.h for MERGE, perhaps the op_t that's already
> defined in there?  Or...

gmp-mparam.h is also similar in spirit (i.e. something that should be 
overridden to be optimal in cases of 64-bit registers with ILP32).  Maybe 
there should be a common internal header to declare register size (default 
__WORDSIZE = size of long), which gmp-mparam.h could use to define other 
macros (possibly avoiding the need for architecture-specific versions of 
that header) and which could also be used for defining the type used in 
string functions?

> Is there a preference for or against many little headers, like I was using, or
> a larger header using ifdefs to allow an architecture to override specific
> portions, like Ondrej was using?
> 
> Given our recent-ish preference for avoiding #ifdef in favor of #if, I really
> don't know which is more preferable.

I prefer little headers that avoid #ifdef.  (The only way in which those 
fail to be typo-proof is if you typo the header name itself and so have an 
unused architecture-specific file.)

-- 
Joseph S. Myers
joseph@codesourcery.com

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 00/11] Improve generic string routines
  2016-12-19 18:16   ` Richard Henderson
  2016-12-19 18:24     ` Joseph Myers
@ 2016-12-19 18:26     ` Adhemerval Zanella
  1 sibling, 0 replies; 32+ messages in thread
From: Adhemerval Zanella @ 2016-12-19 18:26 UTC (permalink / raw)
  To: libc-alpha



On 19/12/2016 16:16, Richard Henderson wrote:
> On 12/19/2016 08:15 AM, Joseph Myers wrote:
>> On Fri, 16 Dec 2016, Richard Henderson wrote:
>>
>>> I'll note that the comments regarding these tests are in fact out
>>> of date -- they speak of a number 0x7efefeff which does not appear
>>> (we use -0x01010101 or 0xfefefeff), and a test misfire which cannot
>>> actually occur (presumably because we changed algorithms).
>>
>> Another issue with this comment is bug 5806.  Thus, patches removing that 
>> comment should include [BZ #5806] in their ChangeLog entries, and once all 
>> instances of the comment have been removed, that bug should be resolved as 
>> FIXED with an appropriate milestone set.
> 
> Oh how nice -- outstanding for 8 years.
> 
>> * Missing spaces before '(', in lots of places in the patch series.
> 
> Ug, yes.  Too much recent work in code bases that prohibit this.
> 
>> * Rather than hardcoding unsigned long int as the word type to use, it 
>> would be better to have another sysdeps header where architectures can 
>> choose the type to use for this purpose, so that ILP32 configurations with 
>> 64-bit registers can use unsigned long long int instead.  (E.g. there 
>> aren't many MIPS-specific string functions, so MIPS n32 would benefit from 
>> using 64-bit integers here.)
> 
> Since my strcmp uses memcopy.h for MERGE, perhaps the op_t that's already
> defined in there?  Or...
> 
>> * Ondøej Bílka had patches doing similar things in May/June 2015, e.g. 
>> <https://sourceware.org/ml/libc-alpha/2015-05/msg00882.html> 
>> <https://sourceware.org/ml/libc-alpha/2015-05/msg00884.html> 
>> <https://sourceware.org/ml/libc-alpha/2015-05/msg00881.html> 
>> <https://sourceware.org/ml/libc-alpha/2015-06/msg00072.html> (these may 
>> not be the latest versions of the relevant patches and may not constitute 
>> a complete set of relevant patches; it's not very clear which of the 
>> patches in patchwork are or are not superseded).  I haven't compared the 
>> approaches in detail to see if there is anything useful to take from those 
>> old patches not covered in the present series.
> 
> ... if we go with Ondrej's skeletons that defines its own type.
> 
> Is there a preference for or against many little headers, like I was using, or
> a larger header using ifdefs to allow an architecture to override specific
> portions, like Ondrej was using?
> 

As Joseph, I also prefer small headers since it usually requires less boiler
plate to override specific definitions.

> Given our recent-ish preference for avoiding #ifdef in favor of #if, I really
> don't know which is more preferable.
> 
> The things that Ondrej does that I don't are (1) cater to cpus with unaligned
> accesses, (2) unroll loops and headers, (3) handle more routines.  There are a
> lot of little-endian-only checks in his code and I'm not sure why it would have
> to be so.

I think all 3 more things Ondrej patches added are orthogonal to the
ones in you patchset and I think it would be more simpler to limit
the optimization and changes instead of drop a large one with all
the possible outlined ones on it. 

> 
> Perhaps I should look more closely at Ondrej's patches and see how easy it
> might be to apply my inline hppa asms...
> 
> 
> r~
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 01/11] Improve generic strlen
  2016-12-19 14:33   ` Adhemerval Zanella
@ 2016-12-19 19:28     ` Richard Henderson
  0 siblings, 0 replies; 32+ messages in thread
From: Richard Henderson @ 2016-12-19 19:28 UTC (permalink / raw)
  To: Adhemerval Zanella, libc-alpha

On 12/19/2016 06:32 AM, Adhemerval Zanella wrote:
>> -      return char_ptr - str;
>> +      goto found;
> Why not just return 'char_ptr - str' and get rid of label?

Premature optimization, perhaps, having looked at gcc 6 output that failed to
merge the two code paths.


r~

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 02/11] Improve generic strchr
  2016-12-19 14:38   ` Adhemerval Zanella
@ 2016-12-19 19:31     ` Richard Henderson
  0 siblings, 0 replies; 32+ messages in thread
From: Richard Henderson @ 2016-12-19 19:31 UTC (permalink / raw)
  To: Adhemerval Zanella, libc-alpha

On 12/19/2016 06:38 AM, Adhemerval Zanella wrote:
> Since you are changing strchrnul, why not just simplify strchr to just:
> 
> --
> /* Find the first occurrence of C in S.  */
> char *
> STRCHR (const char *s, int c_in)
> {
>   char *r = __strchrnul (s, c_in);
>   return *(unsigned char *) r == (unsigned char) c_in ? r : NULL;
> }
> --
> 
> ?
> 
> It will incur in a function call, but as least it has two main advantages:
> 
>  - less icache pressure due potentially less code.
>  - an architecture that uses default strchr and strchrnull will require just
>    to provide an optimized strchrnul to get a boot on both symbols.

Surely strchr is used vastly more often than strchrnul.  I'd prefer the more
common symbol to avoid another function call.


r~

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 07/11] Improve generic strnlen
  2016-12-19 15:00   ` Adhemerval Zanella
@ 2016-12-19 19:32     ` Richard Henderson
  0 siblings, 0 replies; 32+ messages in thread
From: Richard Henderson @ 2016-12-19 19:32 UTC (permalink / raw)
  To: Adhemerval Zanella, libc-alpha

On 12/19/2016 06:59 AM, Adhemerval Zanella wrote:
> As for strchr, since you already optimizing memchr why not base
> strnlen on memchr instead:
> 
> --
> size_t
> __strnlen (const char *str, size_t maxlen)
> {
>   const char *char_str = memchr (str, 0, maxlen);
>   return char_str ? char_str  - s : maxlen;
> }
> --

Yes, I think this is the right thing here.  Unlike strchr, strnlen is a lesser
used function.


r~

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 10/11] hppa: Add haszero.h and whichzero.h
  2016-12-19 14:54   ` Adhemerval Zanella
@ 2016-12-19 19:44     ` Richard Henderson
  0 siblings, 0 replies; 32+ messages in thread
From: Richard Henderson @ 2016-12-19 19:44 UTC (permalink / raw)
  To: Adhemerval Zanella, libc-alpha

On 12/19/2016 06:54 AM, Adhemerval Zanella wrote:
>> +static inline unsigned long int
>> +haszero(unsigned long int x)
>> +{
>> +#if __GNUC_PREREQ(4, 5)
>> +  /* It's more useful to expose a control transfer to the compiler
>> +     than to expose a proper boolean result.  */
>> +  if (sizeof(x) == 8)
>> +    asm goto ("uxor,*sbz %%r0,%0,%%r0\n\tb,n %l1" : : "r"(x) : : nbz);
>> +  else
>> +    asm goto ("uxor,sbz %%r0,%0,%%r0\n\tb,n %l1" : : "r"(x) : : nbz);
>> +  return 1;
>> + nbz:
>> +  return 0;
>> +#else
> 
> Since current GLIBC requires GCC 4.7 as minimum compiler I think we
> can get rid of snippets for old compilers.  Same for the other
> override functios.

Ah good.  I'd meant to go back and look for the minimum required gcc.

>> +  /* Since we have no clz insn, direct tests of the bytes is faster
>> +     than loading up the constants to do the masking.  */
>> +  asm ("extrw,u,= %1,23,8,%%r0\n\t"
>> +       "extrw,u,<> %2,23,8,%%r0\n\t"
>> +       "ldi 2,%0\n\t"
>> +       "extrw,u,= %1,15,8,%%r0\n\t"
>> +       "extrw,u,<> %2,15,8,%%r0\n\t"
>> +       "ldi 1,%0\n\t"
>> +       "extrw,u,= %1,7,8,%%r0\n\t"
>> +       "extrw,u,<> %2,7,8,%%r0\n\t"
>> +       "ldi 0,%0"
>> +       : "=r"(ret) : "r"(x1), "r"(x2), "0"(3));
>> +
>> +  return ret;
>> +}
>> +
>> +#endif /* whichzero.h */
> 
> I am far from a hppa expert, but can't we code the same snippet in C? How
> bad would it be compared to this optimized asm?

The compiler is not great at this.  It only attempts nullification on
comparisons (not directly as a result of an operation like extract), and it
never attempts double nullification as above.

So for whichzero gcc will use 10 insns instead of my 7; for whichzero2 gcc will
use 19 insns instead of my 10.


r~

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 06/11] Improve generic memrchr
  2016-12-17  6:57 ` [PATCH 06/11] Improve generic memrchr Richard Henderson
@ 2016-12-20 12:30   ` Adhemerval Zanella
  2016-12-20 17:19     ` Richard Henderson
  0 siblings, 1 reply; 32+ messages in thread
From: Adhemerval Zanella @ 2016-12-20 12:30 UTC (permalink / raw)
  To: libc-alpha

This implementation has a lot of issues, checking on both x86_64 and i386
(by manually removing the assembly implementation to use the new default
one) I am seeing a lot of issues with string/testers and others. 
Unfortunately the test-memrchr itself does not trigger most of the issues.

On 17/12/2016 04:57, Richard Henderson wrote:
>     * string/memrchr.c: Use haszero.h, whichzero.h.
> ---
>  string/memrchr.c | 180 ++++++++++---------------------------------------------
>  1 file changed, 30 insertions(+), 150 deletions(-)
> 
> diff --git a/string/memrchr.c b/string/memrchr.c
> index b9b0c9e..d45b261 100644
> --- a/string/memrchr.c
> +++ b/string/memrchr.c
> @@ -21,180 +21,60 @@
>     License along with the GNU C Library; if not, see
>     <http://www.gnu.org/licenses/>.  */
>  
> -#include <stdlib.h>
> -
> -#ifdef HAVE_CONFIG_H
> -# include <config.h>
> -#endif
> -
> -#undef __ptr_t
> -#define __ptr_t void *
> -
> -#if defined _LIBC
> -# include <string.h>
> -# include <memcopy.h>
> -#endif
> -
> -#if defined HAVE_LIMITS_H || defined _LIBC
> -# include <limits.h>
> -#endif
> -
> -#define LONG_MAX_32_BITS 2147483647
> -
> -#ifndef LONG_MAX
> -# define LONG_MAX LONG_MAX_32_BITS
> -#endif
> -
> -#include <sys/types.h>
> +#include <string.h>
> +#include <stdint.h>
> +#include <haszero.h>
> +#include <whichzero.h>
>  
>  #undef __memrchr
>  #undef memrchr
>  
> -#ifndef weak_alias
> -# define __memrchr memrchr
> -#endif
> -
>  /* Search no more than N bytes of S for C.  */
> -__ptr_t
> -#ifndef MEMRCHR
> -__memrchr
> -#else
> -MEMRCHR
> -#endif

We can not get rid of this indirections without change the architectures
that uses the default implementation on ifunc (i386 for instance).

> -     (const __ptr_t s, int c_in, size_t n)
> +void *
> +__memrchr (const void *s, int c_in, size_t n)
>  {
> -  const unsigned char *char_ptr;
> +  const unsigned char *char_ptr = (const unsigned char *) s + n;
>    const unsigned long int *longword_ptr;
> -  unsigned long int longword, magic_bits, charmask;
> +  unsigned long int longword, repeated_c;
> +  uintptr_t i, align;
>    unsigned char c;
>  
>    c = (unsigned char) c_in;
>  
>    /* Handle the last few characters by reading one character at a time.
>       Do this until CHAR_PTR is aligned on a longword boundary.  */
> -  for (char_ptr = (const unsigned char *) s + n;
> -       n > 0 && ((unsigned long int) char_ptr
> -		 & (sizeof (longword) - 1)) != 0;
> -       --n)
> +  align = (uintptr_t)char_ptr % sizeof(longword);
> +  for (i = 0; i < align; ++i)
>      if (*--char_ptr == c)
> -      return (__ptr_t) char_ptr;
> +      return (void *) char_ptr;

We need to take care of 'n' while interacting over the string to align
it.  It might the case where 's' is unaligned and 'n' is less than
the aligned value.

>  
> -  /* All these elucidatory comments refer to 4-byte longwords,
> -     but the theory applies equally well to 8-byte longwords.  */
> +  /* Set up a longword, each of whose bytes is C.  */
> +  repeated_c = (-1ul / 0xff) * c;
>  
>    longword_ptr = (const unsigned long int *) char_ptr;
> +  n -= align;
> +  if (__glibc_unlikely(n == 0))
> +    return NULL;
>  
> -  /* Bits 31, 24, 16, and 8 of this number are zero.  Call these bits
> -     the "holes."  Note that there is a hole just to the left of
> -     each byte, with an extra at the end:
> -
> -     bits:  01111110 11111110 11111110 11111111
> -     bytes: AAAAAAAA BBBBBBBB CCCCCCCC DDDDDDDD
> -
> -     The 1-bits make sure that carries propagate to the next 0-bit.
> -     The 0-bits provide holes for carries to fall into.  */
> -  magic_bits = -1;
> -  magic_bits = magic_bits / 0xff * 0xfe << 1 >> 1 | 1;
> -
> -  /* Set up a longword, each of whose bytes is C.  */
> -  charmask = c | (c << 8);
> -  charmask |= charmask << 16;
> -#if LONG_MAX > LONG_MAX_32_BITS
> -  charmask |= charmask << 32;
> -#endif
> -
> -  /* Instead of the traditional loop which tests each character,
> -     we will test a longword at a time.  The tricky part is testing
> -     if *any of the four* bytes in the longword in question are zero.  */
> -  while (n >= sizeof (longword))
> +  /* Loop while we have more than one longword remaining.  */
> +  while (n > sizeof (longword))
>      {
> -      /* We tentatively exit the loop if adding MAGIC_BITS to
> -	 LONGWORD fails to change any of the hole bits of LONGWORD.
> -
> -	 1) Is this safe?  Will it catch all the zero bytes?
> -	 Suppose there is a byte with all zeros.  Any carry bits
> -	 propagating from its left will fall into the hole at its
> -	 least significant bit and stop.  Since there will be no
> -	 carry from its most significant bit, the LSB of the
> -	 byte to the left will be unchanged, and the zero will be
> -	 detected.
> -
> -	 2) Is this worthwhile?  Will it ignore everything except
> -	 zero bytes?  Suppose every byte of LONGWORD has a bit set
> -	 somewhere.  There will be a carry into bit 8.  If bit 8
> -	 is set, this will carry into bit 16.  If bit 8 is clear,
> -	 one of bits 9-15 must be set, so there will be a carry
> -	 into bit 16.  Similarly, there will be a carry into bit
> -	 24.  If one of bits 24-30 is set, there will be a carry
> -	 into bit 31, so all of the hole bits will be changed.
> -
> -	 The one misfire occurs when bits 24-30 are clear and bit
> -	 31 is set; in this case, the hole at bit 31 is not
> -	 changed.  If we had access to the processor carry flag,
> -	 we could close this loophole by putting the fourth hole
> -	 at bit 32!
> -
> -	 So it ignores everything except 128's, when they're aligned
> -	 properly.
> -
> -	 3) But wait!  Aren't we looking for C, not zero?
> -	 Good point.  So what we do is XOR LONGWORD with a longword,
> -	 each of whose bytes is C.  This turns each byte that is C
> -	 into a zero.  */
> -
> -      longword = *--longword_ptr ^ charmask;
> -
> -      /* Add MAGIC_BITS to LONGWORD.  */
> -      if ((((longword + magic_bits)
> -
> -	    /* Set those bits that were unchanged by the addition.  */
> -	    ^ ~longword)
> -
> -	   /* Look at only the hole bits.  If any of the hole bits
> -	      are unchanged, most likely one of the bytes was a
> -	      zero.  */
> -	   & ~magic_bits) != 0)
> -	{
> -	  /* Which of the bytes was C?  If none of them were, it was
> -	     a misfire; continue the search.  */
> -
> -	  const unsigned char *cp = (const unsigned char *) longword_ptr;
> -
> -#if LONG_MAX > 2147483647
> -	  if (cp[7] == c)
> -	    return (__ptr_t) &cp[7];
> -	  if (cp[6] == c)
> -	    return (__ptr_t) &cp[6];
> -	  if (cp[5] == c)
> -	    return (__ptr_t) &cp[5];
> -	  if (cp[4] == c)
> -	    return (__ptr_t) &cp[4];
> -#endif
> -	  if (cp[3] == c)
> -	    return (__ptr_t) &cp[3];
> -	  if (cp[2] == c)
> -	    return (__ptr_t) &cp[2];
> -	  if (cp[1] == c)
> -	    return (__ptr_t) &cp[1];
> -	  if (cp[0] == c)
> -	    return (__ptr_t) cp;
> -	}
> -
> +      longword = *--longword_ptr ^ repeated_c;
> +      if (haszero (longword))
> +	goto found;
>        n -= sizeof (longword);
>      }
>  
> -  char_ptr = (const unsigned char *) longword_ptr;
> -
> -  while (n-- > 0)
> +  /* Since our pointer is aligned, we can always read the last longword.  */
> +  longword = *--longword_ptr ^ repeated_c;
> +  if (haszero (longword))
>      {
> -      if (*--char_ptr == c)
> -	return (__ptr_t) char_ptr;
> + found:
> +      i = whichzero (longword);
> +      if (sizeof (longword) - 1 - i < n)
> +	return (char *) longword_ptr + i;

We need to check longword in reverse word since we are checking for last
occurrence. 

Below it is a workable implementation (on x86_64, I haven't checked on
a BE machine since I do not have access to one). It passes all the string
tests:

--
#ifndef weak_alias
# define __memrchr memrchr
#endif

static inline unsigned int
whichzero_reverse (unsigned long int x)
{ 
  if (sizeof (x) == 4)
    return whichzero (__builtin_bswap32 (x));
  else
    return whichzero (__builtin_bswap64 (x));
}

/* Search no more than N bytes of S for C.  */
void *
#ifndef MEMRCHR
__memrchr
#else
MEMRCHR
#endif
     (const void *s, int c_in, size_t n)
{ 
  const unsigned char *char_ptr = (const unsigned char *) s + n;
  const unsigned long int *longword_ptr;
  unsigned long int longword, repeated_c;
  uintptr_t i;
  unsigned char c;

  c = (unsigned char) c_in;

  /* Align s to size_t reading one character at time.  */
  for (;  
       ((uintptr_t)char_ptr & (sizeof(longword) - 1)) && (n != 0);
       --n)
    if (*--char_ptr == c_in)
      return (void *) char_ptr;

  /* Set up a longword, each of whose bytes is C.  */
  repeated_c = (-1ul / 0xff) * c;

  /* Loop while we have more than one longword remaining.  */
  for (longword_ptr = (const void*)char_ptr;
       n > sizeof(size_t);
       n -= sizeof (size_t))
    {
      longword = *--longword_ptr ^ repeated_c;
      if (haszero (longword))
        goto found;
    }

  longword = *--longword_ptr ^ repeated_c;
  if (haszero (longword))
    {
found:
      i = whichzero_reverse (longword);
      if (i < n)
        return (char*) (longword_ptr) + (sizeof (longword) - 1 - i);
     }
  return NULL;
}
#ifndef MEMRCHR
# ifdef weak_alias
weak_alias (__memrchr, memrchr)
# endif
#endif

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 05/11] Improve generic strrchr
  2016-12-17  6:57 ` [PATCH 05/11] Improve generic strrchr Richard Henderson
@ 2016-12-20 12:50   ` Adhemerval Zanella
  2016-12-20 17:20     ` Richard Henderson
  0 siblings, 1 reply; 32+ messages in thread
From: Adhemerval Zanella @ 2016-12-20 12:50 UTC (permalink / raw)
  To: libc-alpha

As for strchr and and strnlen, I think default strrchr can be changed 
use already optimized symbols:

--
#ifndef STRRCHR
# define STRRCHR strrchr
#endif

/* Find the last occurrence of C in S.  */
char *
STRRCHR (const char *s, int int_c)
{
  return __memrchr (s, int_c, strlen (s) + 1);
}
--

On 17/12/2016 04:57, Richard Henderson wrote:
>      * string/strrchr.c: Use haszero.h.
> ---
>  string/strrchr.c | 67 ++++++++++++++++++++++++++++++++++++++++++++++----------
>  1 file changed, 56 insertions(+), 11 deletions(-)
> 
> diff --git a/string/strrchr.c b/string/strrchr.c
> index a07457e..bea7d76 100644
> --- a/string/strrchr.c
> +++ b/string/strrchr.c
> @@ -16,6 +16,10 @@
>     <http://www.gnu.org/licenses/>.  */
>  
>  #include <string.h>
> +#include <stdint.h>
> +#include <haszero.h>
> +#include <whichzero.h>
> +#include <extractbyte.h>
>  
>  #undef strrchr
>  
> @@ -25,25 +29,66 @@
>  
>  /* Find the last occurrence of C in S.  */
>  char *
> -STRRCHR (const char *s, int c)
> +STRRCHR (const char *s, int int_c)
>  {
> -  const char *found, *p;
> +  const unsigned char *found_c = NULL, *ptr_c;
> +  const unsigned long int *found_w = NULL, *ptr_w;
> +  unsigned long int longword, repeated_c;
> +  uintptr_t i, align;
> +  unsigned char c;
>  
> -  c = (unsigned char) c;
> +  c = (unsigned char) int_c;
> +  ptr_c = (const unsigned char *) s;
>  
> -  /* Since strchr is fast, we use it rather than the obvious loop.  */
> +  /* Handle the first few characters by reading one character at a time.
> +     Do this until CHAR_PTR is aligned on a longword boundary.  */
> +  align = -(uintptr_t)ptr_c % sizeof(longword);
> +  for (i = 0; i < align; ++i, ++ptr_c)
> +    {
> +      unsigned char this_c = *ptr_c;
> +      if (this_c == c)
> +        found_c = ptr_c;
> +      if (this_c == '\0')
> +        goto done;
> +    }
> +
> +  /* Set up a longword, each of whose bytes is C.  */
> +  repeated_c = (-1ul / 0xff) * c;
> +
> +  /* Search words for C.  At this point, merely record the last word
> +     that contained the character.  Stop when we find EOS.  */
> +  ptr_w = (const unsigned long int *) ptr_c;
> +  while (1)
> +    {
> +      longword = *ptr_w;
> +      if (haszero (longword))
> +	break;
> +      if (haszero (longword ^ repeated_c))
> +	found_w = ptr_w;
> +      ptr_w++;
> +    }
>  
> -  if (c == '\0')
> -    return strchr (s, '\0');
> +  /* Check to see if we've got C in the last longword.  */
> +  i = whichzero2 (longword, longword ^ repeated_c);
> +  if (extractbyte (longword, i) == c)
> +    found_w = ptr_w;
>  
> -  found = NULL;
> -  while ((p = strchr (s, c)) != NULL)
> +  /* If we found a word containing C, go back and search it byte by byte.  */
> +  if (found_w)
>      {
> -      found = p;
> -      s = p + 1;
> +      ptr_c = (const unsigned char *) found_w;
> +      for (i = 0; i < sizeof(longword); ++i, ++ptr_c)
> +	{
> +	  unsigned char this_c = *ptr_c;
> +	  if (this_c == c)
> +	    found_c = ptr_c;
> +	  if (this_c == '\0')
> +	    break;
> +	}
>      }
>  
> -  return (char *) found;
> + done:
> +  return (char *) found_c;
>  }
>  
>  #ifdef weak_alias
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 06/11] Improve generic memrchr
  2016-12-20 12:30   ` Adhemerval Zanella
@ 2016-12-20 17:19     ` Richard Henderson
  2016-12-20 20:25       ` Adhemerval Zanella
  0 siblings, 1 reply; 32+ messages in thread
From: Richard Henderson @ 2016-12-20 17:19 UTC (permalink / raw)
  To: Adhemerval Zanella, libc-alpha

On 12/20/2016 04:30 AM, Adhemerval Zanella wrote:
> This implementation has a lot of issues, checking on both x86_64 and i386
> (by manually removing the assembly implementation to use the new default
> one) I am seeing a lot of issues with string/testers and others. 
> Unfortunately the test-memrchr itself does not trigger most of the issues.

I did notice problems by eye when going through the patch set for the second
time.  It is unfortunate that the tester doesn't trigger them.

>> +  align = (uintptr_t)char_ptr % sizeof(longword);
>> +  for (i = 0; i < align; ++i)
>>      if (*--char_ptr == c)
>> -      return (__ptr_t) char_ptr;
>> +      return (void *) char_ptr;
> 
> We need to take care of 'n' while interacting over the string to align
> it.  It might the case where 's' is unaligned and 'n' is less than
> the aligned value.

As for memchr, I had intended

  if (align > n)
    align = n;

>> + found:
>> +      i = whichzero (longword);
>> +      if (sizeof (longword) - 1 - i < n)
>> +	return (char *) longword_ptr + i;
> 
> We need to check longword in reverse word since we are checking for last
> occurrence. 

Well, yes and no.  We check from reverse but want the last match.  So it's
still a forward search.  What's needed is to mask out any match that is
excluded by N.

I currently have

  /* Handle any remaining portion of a word.  */
  if (n > 0)
    {
      word = *--word_ptr ^ repeated_c;

      /* Mask out the garbage bytes.  */
      op_t m = -1;
      if (__BYTE_ORDER == __LITTLE_ENDIAN)
        m <<= n * CHAR_BIT;
      else
        m >>= n * CHAR_BIT;
      word |= ~m;

      if (haszero (word))
        {
        found:
          return (char *) word_ptr + findzero (word);
        }
    }


r~

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 05/11] Improve generic strrchr
  2016-12-20 12:50   ` Adhemerval Zanella
@ 2016-12-20 17:20     ` Richard Henderson
  2016-12-20 21:01       ` Adhemerval Zanella
  0 siblings, 1 reply; 32+ messages in thread
From: Richard Henderson @ 2016-12-20 17:20 UTC (permalink / raw)
  To: Adhemerval Zanella, libc-alpha

On 12/20/2016 04:50 AM, Adhemerval Zanella wrote:
> /* Find the last occurrence of C in S.  */
> char *
> STRRCHR (const char *s, int int_c)
> {
>   return __memrchr (s, int_c, strlen (s) + 1);
> }

Do we really want to touch the memory twice for such a common function?


r~

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 06/11] Improve generic memrchr
  2016-12-20 17:19     ` Richard Henderson
@ 2016-12-20 20:25       ` Adhemerval Zanella
  0 siblings, 0 replies; 32+ messages in thread
From: Adhemerval Zanella @ 2016-12-20 20:25 UTC (permalink / raw)
  To: Richard Henderson, libc-alpha



On 20/12/2016 15:19, Richard Henderson wrote:
> On 12/20/2016 04:30 AM, Adhemerval Zanella wrote:
>> This implementation has a lot of issues, checking on both x86_64 and i386
>> (by manually removing the assembly implementation to use the new default
>> one) I am seeing a lot of issues with string/testers and others. 
>> Unfortunately the test-memrchr itself does not trigger most of the issues.
> 
> I did notice problems by eye when going through the patch set for the second
> time.  It is unfortunate that the tester doesn't trigger them.
> 
>>> +  align = (uintptr_t)char_ptr % sizeof(longword);
>>> +  for (i = 0; i < align; ++i)
>>>      if (*--char_ptr == c)
>>> -      return (__ptr_t) char_ptr;
>>> +      return (void *) char_ptr;
>>
>> We need to take care of 'n' while interacting over the string to align
>> it.  It might the case where 's' is unaligned and 'n' is less than
>> the aligned value.
> 
> As for memchr, I had intended
> 
>   if (align > n)
>     align = n;
> 
>>> + found:
>>> +      i = whichzero (longword);
>>> +      if (sizeof (longword) - 1 - i < n)
>>> +	return (char *) longword_ptr + i;
>>
>> We need to check longword in reverse word since we are checking for last
>> occurrence. 
> 
> Well, yes and no.  We check from reverse but want the last match.  So it's
> still a forward search.  What's needed is to mask out any match that is
> excluded by N.
> 
> I currently have
> 
>   /* Handle any remaining portion of a word.  */
>   if (n > 0)
>     {
>       word = *--word_ptr ^ repeated_c;
> 
>       /* Mask out the garbage bytes.  */
>       op_t m = -1;
>       if (__BYTE_ORDER == __LITTLE_ENDIAN)
>         m <<= n * CHAR_BIT;
>       else
>         m >>= n * CHAR_BIT;
>       word |= ~m;
> 
>       if (haszero (word))
>         {
>         found:
>           return (char *) word_ptr + findzero (word);
>         }
>     }

I do not have a strong preference here, neither I am sure which strategy
would yield better performance for mostly architectures.  I would be usually
one instruction if the architecture supports direct swap instruction (like
x86 and aarch64), but the ones that do not it might generate some long
sequences.  Anyway, In such cases I would go either for code simplicity.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 05/11] Improve generic strrchr
  2016-12-20 17:20     ` Richard Henderson
@ 2016-12-20 21:01       ` Adhemerval Zanella
  0 siblings, 0 replies; 32+ messages in thread
From: Adhemerval Zanella @ 2016-12-20 21:01 UTC (permalink / raw)
  To: Richard Henderson, libc-alpha



On 20/12/2016 15:20, Richard Henderson wrote:
> On 12/20/2016 04:50 AM, Adhemerval Zanella wrote:
>> /* Find the last occurrence of C in S.  */
>> char *
>> STRRCHR (const char *s, int int_c)
>> {
>>   return __memrchr (s, int_c, strlen (s) + 1);
>> }
> 
> Do we really want to touch the memory twice for such a common function?

It really depends of the pattern strrchr is usually issue with.  If last
occurrence is near end of string it can be potentially faster, however
if it is not the case it can potentially touch memory twice indeed.
Using current benchtest I am seeing exactly this:

  - first two loops where string size way larger than character position
    the simplified implementation is slower (about 60% in some cases);

  - third loop show some gains from size 2 and larger (about 15%)

  - it shows also some gain on 4th loop and forward where is acts 
    basically as strlen (since it searches for 0).

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2016-12-20 21:01 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-12-17  6:57 [PATCH 00/11] Improve generic string routines Richard Henderson
2016-12-17  6:57 ` [PATCH 07/11] Improve generic strnlen Richard Henderson
2016-12-19 15:00   ` Adhemerval Zanella
2016-12-19 19:32     ` Richard Henderson
2016-12-17  6:57 ` [PATCH 02/11] Improve generic strchr Richard Henderson
2016-12-19 14:38   ` Adhemerval Zanella
2016-12-19 19:31     ` Richard Henderson
2016-12-17  6:57 ` [PATCH 03/11] Improve generic memchr Richard Henderson
2016-12-17  6:57 ` [PATCH 01/11] Improve generic strlen Richard Henderson
2016-12-19 14:33   ` Adhemerval Zanella
2016-12-19 19:28     ` Richard Henderson
2016-12-17  6:57 ` [PATCH 06/11] Improve generic memrchr Richard Henderson
2016-12-20 12:30   ` Adhemerval Zanella
2016-12-20 17:19     ` Richard Henderson
2016-12-20 20:25       ` Adhemerval Zanella
2016-12-17  6:57 ` [PATCH 09/11] hppa: Add memcopy.h Richard Henderson
2016-12-17  6:57 ` [PATCH 10/11] hppa: Add haszero.h and whichzero.h Richard Henderson
2016-12-19 14:54   ` Adhemerval Zanella
2016-12-19 19:44     ` Richard Henderson
2016-12-17  6:57 ` [PATCH 04/11] Improve generic strchrnul Richard Henderson
2016-12-17  6:57 ` [PATCH 05/11] Improve generic strrchr Richard Henderson
2016-12-20 12:50   ` Adhemerval Zanella
2016-12-20 17:20     ` Richard Henderson
2016-12-20 21:01       ` Adhemerval Zanella
2016-12-17  6:58 ` [PATCH 11/11] alpha: Add haszero.h and whichzero.h Richard Henderson
2016-12-17  6:58 ` [PATCH 08/11] Improve generic strcmp Richard Henderson
2016-12-17  7:24 ` [PATCH 00/11] Improve generic string routines Richard Henderson
2016-12-19 14:32 ` Adhemerval Zanella
2016-12-19 16:15 ` Joseph Myers
2016-12-19 18:16   ` Richard Henderson
2016-12-19 18:24     ` Joseph Myers
2016-12-19 18:26     ` Adhemerval Zanella

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).