[PATCH 04/17] Add string vectorized find and detection functions

public inbox for libc-alpha@sourceware.org

* [PATCH 04/17] Add string vectorized find and detection functions
@ 2022-09-03 13:13 Wilco Dijkstra
  2022-09-19 13:59 ` Adhemerval Zanella Netto
  0 siblings, 1 reply; 5+ messages in thread
From: Wilco Dijkstra @ 2022-09-03 13:13 UTC (permalink / raw)
  To: 'GNU C Library'; +Cc: Adhemerval Zanella

Hi Adhemerval,

+static inline unsigned int
+__clz (op_t x)
+{
+#if !HAVE_BUILTIN_CLZ
+  unsigned r;
+  op_t i;
+
+  x |= x >> 1;
+  x |= x >> 2;
+  x |= x >> 4;
+  x |= x >> 8;
+  x |= x >> 16;
+# if __WORDSIZE == 64
+  x |= x >> 32;
+  i = x * 0x03F79D71B4CB0A89ull >> 58;
+# else
+  i = x * 0x07C4ACDDU >> 27;
+# endif
+  r = index_access (i);
+  return r ^ (sizeof (op_t) * CHAR_BIT - 1);
+#else
+  if (sizeof (op_t) == sizeof (long int))
+    return __builtin_clzl (x);
+  else
+    return __builtin_clzll (x);
+#endif
+}

This is a really bad idea. Firstly it is incorrect - sizeof (op_t) != __WORDSIZE due to
the odd way it is defined (it can be 64 bits on 32-bit targets). That in itself is
problematic since it isn't clear that using 64-bit operations extensively is efficient
on 32-bit targets (using 64-bit multiplies in GMP is different from using 64-bit
load/store in memcpy/memset which is different from 64-bit logical operations and
shifts, so all of these should be decoupled rather than forced together).

Secondly, there are already several ways to count leading zeroes in GLIBC.
One is to use the builtin unconditionally (done in lots of places, e.g. by math code),
another is count_leading_zeros defined in longlong.h. This would add a third way.
It's not clear how much gain inlining gives over using the libgcc implementation,
but if it is significant then we could provide a generic inline clzl/clzll that can be
used throughout GLIBC (replacing existing builtin_clz and count_leading_zeros).

Finally, emulating a full clz is inefficient. If you have already called find_zero_low
then there are at most 4 bits set on a 32-bit LE target, so you can trivially get the
index of the first zero byte via:

x = x & -x;
x = (x >> 15) + (x >> 22) + 3 * (x >> 31);

This is many times faster. There may be similar sequences for big-endian, but
you could just do a multiply with a magic word that gives the correct result
without needing a lookup table.

Cheers,
Wilco
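
A minimal standalone sketch of the two-line sequence above, assuming a
32-bit little-endian word.  The names find_zero_low32, first_zero_raw and
first_zero_index_le32 are hypothetical and not part of the patch.  Evaluated
literally, the sum has extra higher bits set for byte indexes 2 and 3
(0x102 and 0x10203), so this sketch assumes the result is truncated to its
low byte; that truncation is an interpretation, not something stated in the
thread.

```c
#include <stdint.h>

/* Hypothetical 32-bit stand-in for find_zero_low: sets bit 7 of the
   lowest zero byte (higher bytes may hold false positives).  */
static inline uint32_t
find_zero_low32 (uint32_t x)
{
  return (x - 0x01010101u) & ~x & 0x80808080u;
}

/* Raw value of the sequence from the review, before any truncation.
   MASK is expected to have bits set only at positions 7, 15, 23, 31.  */
static inline uint32_t
first_zero_raw (uint32_t mask)
{
  mask &= 0u - mask;		/* Isolate the lowest set bit.  */
  return (mask >> 15) + (mask >> 22) + 3 * (mask >> 31);
}

/* Index (0..3) of the first zero byte of X in little-endian memory
   order; X must contain a zero byte.  Only the low byte of the raw
   value is kept (assumption of this sketch).  */
static inline unsigned int
first_zero_index_le32 (uint32_t x)
{
  return first_zero_raw (find_zero_low32 (x)) & 0xff;
}
```

No lookup table or clz instruction is involved; the cost is one negate, one
and, three shifts, and a few adds.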


* Re: [PATCH 04/17] Add string vectorized find and detection functions
  2022-09-03 13:13 [PATCH 04/17] Add string vectorized find and detection functions Wilco Dijkstra
@ 2022-09-19 13:59 ` Adhemerval Zanella Netto
  0 siblings, 0 replies; 5+ messages in thread
From: Adhemerval Zanella Netto @ 2022-09-19 13:59 UTC (permalink / raw)
  To: Wilco Dijkstra, 'GNU C Library'



On 03/09/22 10:13, Wilco Dijkstra wrote:
> Hi Adhemerval,
> 
> +static inline unsigned int
> +__clz (op_t x)
> +{
> +#if !HAVE_BUILTIN_CLZ
> +  unsigned r;
> +  op_t i;
> +
> +  x |= x >> 1;
> +  x |= x >> 2;
> +  x |= x >> 4;
> +  x |= x >> 8;
> +  x |= x >> 16;
> +# if __WORDSIZE == 64
> +  x |= x >> 32;
> +  i = x * 0x03F79D71B4CB0A89ull >> 58;
> +# else
> +  i = x * 0x07C4ACDDU >> 27;
> +# endif
> +  r = index_access (i);
> +  return r ^ (sizeof (op_t) * CHAR_BIT - 1);
> +#else
> +  if (sizeof (op_t) == sizeof (long int))
> +    return __builtin_clzl (x);
> +  else
> +    return __builtin_clzll (x);
> +#endif
> +}
> 
> This is a really bad idea. Firstly it is incorrect - sizeof (op_t) != __WORDSIZE due to
> the odd way it is defined (it can be 64 bits on 32-bit targets). That in itself is
> problematic since it isn't clear that using 64-bit operations extensively is efficient
> on 32-bit targets (using 64-bit multiplies in GMP is different from using 64-bit
> load/store in memcpy/memset which is different from 64-bit logical operations and
> shifts, so all of these should be decoupled rather than forced together).
> 
> Secondly, there are already several ways to count leading zeroes in GLIBC.
> One is to use the builtin unconditionally (done in lots of places, e.g. by math code),
> another is count_leading_zeros defined in longlong.h. This would add a third way.
> It's not clear how much gain inlining gives over using the libgcc implementation,
> but if it is significant then we could provide a generic inline clzl/clzll that can be
> used throughout GLIBC (replacing existing builtin_clz and count_leading_zeros).
> 

Fair enough, I can't really recall why I added another bit-counting routine instead
of using the ones already provided by longlong.h.  longlong.h already takes care
of avoiding the libcall, so I adjusted the patch to use them instead.

> Finally, emulating a full clz is inefficient. If you have already called find_zero_low
> then there are at most 4 bits set on a 32-bit LE target, so you can trivially get the
> index of the first zero byte via:
> 
> x = x & -x;
> x = (x >> 15) + (x >> 22) + 3 * (x >> 31);
> 
> This is many times faster. There may be similar sequences for big-endian, but
> you could just do a multiply with a magic word that gives the correct result
> without needing a lookup table.

I take it that on recent architectures it would be faster, assuming a clz
instruction exists, and this specific code is only used in the memrchr tail call. So
I think we can optimize it further in a subsequent patch.


* [PATCH 00/17] Improve generic string routines
@ 2022-09-02 20:39 Adhemerval Zanella
  2022-09-02 20:39 ` [PATCH 04/17] Add string vectorized find and detection functions Adhemerval Zanella
  0 siblings, 1 reply; 5+ messages in thread
From: Adhemerval Zanella @ 2022-09-02 20:39 UTC (permalink / raw)
  To: libc-alpha; +Cc: Joseph Myers, caiyinyu

This is an update of my previous patchset [1] to provide generic string
implementations for newer ports, so that they can focus on just
specific routines to get a better overall improvement.

It is done by:

  1. Parametrizing the internal routines (for instance the find zero
     in a word) so each architecture can reimplement them without the
     need to reimplement the whole routine.

  2. Vectorizing more string implementations (for instance strcpy
     and strcmp).

  3. Changing some implementations to use already optimized ones
     (for instance strnlen).  This lets new ports focus on providing
     optimized implementations of only a handful of symbols (for
     instance memchr), whose improvements are then used in a larger
     set of routines.

For the rest of #5806 I think we can handle them later, and if the
performance of the generic implementation is close I think it is better
to just remove the old assembly implementations.

I also checked on x86_64-linux-gnu, i686-linux-gnu, powerpc-linux-gnu,
and powerpc64-linux-gnu by removing the arch-specific assembly 
implementation and disabling multiarch (it covers both LE and BE
for 64 and 32 bits). I also checked the string routines on alpha, hppa,
and sh.

Changes since v3:
  * Rebased against master.
  * Dropped strcpy optimization.
  * Refactor strcmp implementation.
  * Some minor changes in comments.

Changes since v2:
  * Move string-fz{a,b,i} to its own patch.
  * Add an inline implementation for __builtin_c{l,t}z to avoid using
    compiler provided symbols.
  * Add a new header, string-maskoff.h, to handle unaligned accesses
    on some implementations.
  * Fixed strcmp on LE machines.
  * Added an unaligned strcpy variant for architectures that define
    _STRING_ARCH_unaligned.
  * Add SH string-fzb.h (which uses cmp/str instruction to find
    a zero in word).

Changes since v1:
  * Marked ChangeLog entries with [BZ #5806], as appropriate.
  * Reorganized the headers, so that armv6t2 and power6 need override
    as little as possible to use their (integer) zero detection insns.
  * Hopefully fixed all of the coding style issues.
  * Adjusted the memrchr algorithm as discussed.
  * Replaced the #ifdef STRRCHR etc that are used by the multiarch
    files.
  * Tested on i386, i686, x86_64 (verified this is unused), ppc64,
    ppc64le --with-cpu=power8 (to use power6 in multiarch), armv7,
    aarch64, alpha (qemu) and hppa (qemu).

[1] https://sourceware.org/legacy-ml/libc-alpha/2018-01/msg00318.html

Adhemerval Zanella (10):
  Add string-maskoff.h generic header
  Add string vectorized find and detection functions
  string: Improve generic strlen
  string: Improve generic strnlen
  string: Improve generic strchr
  string: Improve generic strchrnul
  string: Improve generic strcmp
  string: Improve generic memchr
  string: Improve generic memrchr
  sh: Add string-fzb.h

Richard Henderson (7):
  Parameterize op_t from memcopy.h
  Parameterize OP_T_THRES from memcopy.h
  hppa: Add memcopy.h
  hppa: Add string-fzb.h and string-fzi.h
  alpha: Add string-fzb.h and string-fzi.h
  arm: Add string-fza.h
  powerpc: Add string-fza.h

 config.h.in                                   |   8 +
 configure                                     |  54 +++++
 configure.ac                                  |  34 +++
 string/memchr.c                               | 168 ++++----------
 string/memcmp.c                               |   4 -
 string/memrchr.c                              | 189 +++-------------
 string/strchr.c                               | 172 +++------------
 string/strchrnul.c                            | 156 ++-----------
 string/strcmp.c                               | 117 ++++++++--
 string/strlen.c                               |  90 ++------
 string/strnlen.c                              | 137 +-----------
 sysdeps/alpha/string-fzb.h                    |  51 +++++
 sysdeps/alpha/string-fzi.h                    | 113 ++++++++++
 sysdeps/arm/armv6t2/string-fza.h              |  70 ++++++
 sysdeps/generic/memcopy.h                     |  10 +-
 sysdeps/generic/string-extbyte.h              |  37 ++++
 sysdeps/generic/string-fza.h                  | 106 +++++++++
 sysdeps/generic/string-fzb.h                  |  49 +++++
 sysdeps/generic/string-fzi.h                  | 208 ++++++++++++++++++
 sysdeps/generic/string-maskoff.h              |  73 ++++++
 sysdeps/generic/string-opthr.h                |  25 +++
 sysdeps/generic/string-optype.h               |  31 +++
 sysdeps/hppa/memcopy.h                        |  42 ++++
 sysdeps/hppa/string-fzb.h                     |  69 ++++++
 sysdeps/hppa/string-fzi.h                     | 135 ++++++++++++
 sysdeps/i386/i686/multiarch/strnlen-c.c       |  14 +-
 sysdeps/i386/memcopy.h                        |   3 -
 sysdeps/i386/string-opthr.h                   |  25 +++
 sysdeps/m68k/memcopy.h                        |   3 -
 sysdeps/powerpc/powerpc32/power4/memcopy.h    |   5 -
 .../powerpc32/power4/multiarch/memchr-ppc32.c |  14 +-
 .../power4/multiarch/strchrnul-ppc32.c        |   4 -
 .../power4/multiarch/strnlen-ppc32.c          |  14 +-
 .../powerpc64/multiarch/memchr-ppc64.c        |   9 +-
 sysdeps/powerpc/string-fza.h                  |  70 ++++++
 sysdeps/s390/strchr-c.c                       |  11 +-
 sysdeps/s390/strchrnul-c.c                    |   2 -
 sysdeps/s390/strlen-c.c                       |  10 +-
 sysdeps/s390/strnlen-c.c                      |  14 +-
 sysdeps/sh/string-fzb.h                       |  53 +++++
 40 files changed, 1548 insertions(+), 851 deletions(-)
 create mode 100644 sysdeps/alpha/string-fzb.h
 create mode 100644 sysdeps/alpha/string-fzi.h
 create mode 100644 sysdeps/arm/armv6t2/string-fza.h
 create mode 100644 sysdeps/generic/string-extbyte.h
 create mode 100644 sysdeps/generic/string-fza.h
 create mode 100644 sysdeps/generic/string-fzb.h
 create mode 100644 sysdeps/generic/string-fzi.h
 create mode 100644 sysdeps/generic/string-maskoff.h
 create mode 100644 sysdeps/generic/string-opthr.h
 create mode 100644 sysdeps/generic/string-optype.h
 create mode 100644 sysdeps/hppa/memcopy.h
 create mode 100644 sysdeps/hppa/string-fzb.h
 create mode 100644 sysdeps/hppa/string-fzi.h
 create mode 100644 sysdeps/i386/string-opthr.h
 create mode 100644 sysdeps/powerpc/string-fza.h
 create mode 100644 sysdeps/sh/string-fzb.h

-- 
2.34.1



* [PATCH 04/17] Add string vectorized find and detection functions
  2022-09-02 20:39 [PATCH 00/17] Improve generic string routines Adhemerval Zanella
@ 2022-09-02 20:39 ` Adhemerval Zanella
  2022-09-03  3:20   ` Noah Goldstein
  0 siblings, 1 reply; 5+ messages in thread
From: Adhemerval Zanella @ 2022-09-02 20:39 UTC (permalink / raw)
  To: libc-alpha; +Cc: Joseph Myers, caiyinyu, Richard Henderson

This patch adds generic string find and detection functions meant to be
used in generic vectorized string implementations.  The idea is to decompose
the basic string operations so each architecture can reimplement them if it
provides any specialized hardware instruction.

The 'string-fza.h' provides zero byte detection functions (find_zero_low,
find_zero_all, find_eq_low, find_eq_all, find_zero_eq_low, find_zero_eq_all,
find_zero_ne_low, and find_zero_ne_all).  They are used by the functions
provided by both 'string-fzb.h' and 'string-fzi.h'.

The 'string-fzb.h' provides boolean zero byte detection with the
functions:

  - has_zero: determine if any byte within a word is zero.
  - has_eq: determine byte equality between two words.
  - has_zero_eq: determine if any byte within a word is zero along with
    byte equality between two words.

The 'string-fzi.h' provides zero byte detection along with its positions:

  - index_first_zero: return index of first zero byte within a word.
  - index_first_eq: return index of first byte equal between two words.
  - index_first_zero_eq: return index of first zero byte within a word or
    first byte equal between two words.
  - index_first_zero_ne: return index of first zero byte within a word or
    first byte different between two words.
  - index_last_zero: return index of last zero byte within a word.
  - index_last_eq: return index of last byte equal between two words.
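
As a rough illustration of how these helpers fit together, the following
standalone sketch re-creates find_zero_low and find_zero_all on a fixed
64-bit word and derives the first and last zero-byte indexes for a
little-endian layout.  It assumes the GCC/Clang builtins __builtin_ctzll and
__builtin_clzll; word_t, repeat_byte64, first_zero_le and last_zero_le are
illustrative names, not part of the patch.

```c
#include <stdint.h>

typedef uint64_t word_t;	/* Stand-in for op_t, assumed 64-bit here.  */

/* Replicate byte C into every byte of a word.  */
static inline word_t
repeat_byte64 (unsigned char c)
{
  return (word_t) -1 / 0xff * c;
}

/* Bit 7 is set in the lowest zero byte; bytes above it may be
   false positives caused by the borrow.  */
static inline word_t
find_zero_low (word_t x)
{
  return (x - repeat_byte64 (0x01)) & ~x & repeat_byte64 (0x80);
}

/* Bit 7 is set in exactly the bytes of X that are zero.  */
static inline word_t
find_zero_all (word_t x)
{
  word_t m = repeat_byte64 (0x7f);
  return ~(((x & m) + m) | x | m);
}

static inline int
has_zero (word_t x)
{
  return find_zero_low (x) != 0;
}

/* Little-endian only: byte 0 is the least significant byte.
   X must contain a zero byte.  */
static inline unsigned int
first_zero_le (word_t x)
{
  return (unsigned int) __builtin_ctzll (find_zero_low (x)) / 8;
}

static inline unsigned int
last_zero_le (word_t x)
{
  return 7 - (unsigned int) __builtin_clzll (find_zero_all (x)) / 8;
}
```

For the word 0xFFFFFFFFFFFF0100 the cheap variant yields 0x8080 (a false
positive in byte 1) while the exact variant yields 0x80, which is why the
"last" index needs find_zero_all on little-endian.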

Also, to avoid libcalls in the '__builtin_c{t,l}z{l}' calls (which may
degrade performance), inline implementations based on De Bruijn
sequences are added (enabled by a configure check).
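
The De Bruijn approach can be sketched for a fixed 32-bit word as follows.
The table and multiplier are the 32-bit ones used in this patch; ctz32,
clz32 and debruijn32 are hypothetical names.  Fixed-width types are an
assumption of the sketch, chosen so that the multiply and shift stay 32 bits
wide regardless of the platform word size.

```c
#include <stdint.h>

/* Maps the top 5 bits of (2^(k+1) - 1) * 0x07C4ACDD to k.  */
static const unsigned char debruijn32[32] =
{
   0,  9,  1, 10, 13, 21,  2, 29,
  11, 14, 16, 18, 22, 25,  3, 30,
   8, 12, 20, 28, 15, 17, 24,  7,
  19, 27, 23,  6, 26,  5,  4, 31
};

/* Count trailing zeros; undefined for x == 0, as with the builtin.  */
static inline unsigned int
ctz32 (uint32_t x)
{
  /* x ^ (x - 1) has ones up to and including the lowest set bit.  */
  uint32_t m = x ^ (x - 1);
  return debruijn32[(uint32_t) (m * 0x07C4ACDDu) >> 27];
}

/* Count leading zeros; undefined for x == 0, as with the builtin.  */
static inline unsigned int
clz32 (uint32_t x)
{
  /* Smear the highest set bit into every lower position.  */
  x |= x >> 1;
  x |= x >> 2;
  x |= x >> 4;
  x |= x >> 8;
  x |= x >> 16;
  /* The table yields the index of the highest set bit; flip to clz.  */
  return debruijn32[(uint32_t) (x * 0x07C4ACDDu) >> 27] ^ 31;
}
```

Both functions are branchless and need only a 32-byte table, at the cost of
a multiply and a dependent load.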

Co-authored-by: Richard Henderson <rth@twiddle.net>
---
 config.h.in                      |   8 ++
 configure                        |  54 ++++++++
 configure.ac                     |  34 +++++
 sysdeps/generic/string-extbyte.h |  37 ++++++
 sysdeps/generic/string-fza.h     | 106 ++++++++++++++++
 sysdeps/generic/string-fzb.h     |  49 ++++++++
 sysdeps/generic/string-fzi.h     | 208 +++++++++++++++++++++++++++++++
 7 files changed, 496 insertions(+)
 create mode 100644 sysdeps/generic/string-extbyte.h
 create mode 100644 sysdeps/generic/string-fza.h
 create mode 100644 sysdeps/generic/string-fzb.h
 create mode 100644 sysdeps/generic/string-fzi.h

diff --git a/config.h.in b/config.h.in
index 43d32518ab..8f0540d85a 100644
--- a/config.h.in
+++ b/config.h.in
@@ -202,6 +202,14 @@
 /* An integer used to scale the timeout of test programs.  */
 #define TIMEOUTFACTOR 1
 
+/* If compiler supports __builtin_ctz{l} without any external dependencies
+   (libgcc for instance).  */
+#define HAVE_BUILTIN_CTZ 0
+
+/* If compiler supports __builtin_clz{l} without any external dependencies
+   (libgcc for instance).  */
+#define HAVE_BUILTIN_CLZ 0
+
 /*
 \f */
 
diff --git a/configure b/configure
index ff2c406b3b..99cdec1bef 100755
--- a/configure
+++ b/configure
@@ -6726,6 +6726,60 @@ if test $libc_cv_builtin_trap = yes; then
 
 fi
 
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __builtin_ctz{l} with no external dependencies" >&5
+$as_echo_n "checking for __builtin_ctz{l} with no external dependencies... " >&6; }
+if ${libc_cv_builtin_ctz+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  libc_cv_builtin_ctz=yes
+echo 'int foo (unsigned long x) { return __builtin_ctz (x); }' > conftest.c
+if { ac_try='${CC-cc} $CFLAGS $CPPFLAGS -S conftest.c -o conftest.s 1>&5'
+  { { eval echo "\"\$as_me\":${as_lineno-$LINENO}: \"$ac_try\""; } >&5
+  (eval $ac_try) 2>&5
+  ac_status=$?
+  $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
+  test $ac_status = 0; }; }; then
+  if grep '__ctz[s,d]i2' conftest.s > /dev/null; then
+    libc_cv_builtin_ctz=no
+  fi
+fi
+rm -f conftest.c conftest.s
+
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $libc_cv_builtin_ctz" >&5
+$as_echo "$libc_cv_builtin_ctz" >&6; }
+if test x$libc_cv_builtin_ctz = xyes; then
+  $as_echo "#define HAVE_BUILTIN_CTZ 1" >>confdefs.h
+
+fi
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __builtin_clz{l} with no external dependencies" >&5
+$as_echo_n "checking for __builtin_clz{l} with no external dependencies... " >&6; }
+if ${libc_cv_builtin_clz+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  libc_cv_builtin_clz=yes
+echo 'int foo (unsigned long x) { return __builtin_clz (x); }' > conftest.c
+if { ac_try='${CC-cc} $CFLAGS $CPPFLAGS -S conftest.c -o conftest.s 1>&5'
+  { { eval echo "\"\$as_me\":${as_lineno-$LINENO}: \"$ac_try\""; } >&5
+  (eval $ac_try) 2>&5
+  ac_status=$?
+  $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
+  test $ac_status = 0; }; }; then
+  if grep '__clz[s,d]i2' conftest.s > /dev/null; then
+    libc_cv_builtin_clz=no
+  fi
+fi
+rm -f conftest.c conftest.s
+
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $libc_cv_builtin_clz" >&5
+$as_echo "$libc_cv_builtin_clz" >&6; }
+if test x$libc_cv_builtin_clz = xyes; then
+  $as_echo "#define HAVE_BUILTIN_CLZ 1" >>confdefs.h
+
+fi
+
 ac_ext=cpp
 ac_cpp='$CXXCPP $CPPFLAGS'
 ac_compile='$CXX -c $CXXFLAGS $CPPFLAGS conftest.$ac_ext >&5'
diff --git a/configure.ac b/configure.ac
index eb5bc6a131..66e02a0566 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1618,6 +1618,40 @@ if test $libc_cv_builtin_trap = yes; then
   AC_DEFINE([HAVE_BUILTIN_TRAP])
 fi
 
+AC_CACHE_CHECK(for __builtin_ctz{l} with no external dependencies,
+	       libc_cv_builtin_ctz, [dnl
+libc_cv_builtin_ctz=yes
+echo 'int foo (unsigned long x) { return __builtin_ctz (x); }' > conftest.c
+if AC_TRY_COMMAND(${CC-cc} $CFLAGS $CPPFLAGS -S conftest.c -o conftest.s 1>&AS_MESSAGE_LOG_FD); then
+changequote(,)dnl
+  if grep '__ctz[s,d]i2' conftest.s > /dev/null; then
+    libc_cv_builtin_ctz=no
+  fi
+changequote([,])dnl
+fi
+rm -f conftest.c conftest.s
+])
+if test x$libc_cv_builtin_ctz = xyes; then
+  AC_DEFINE(HAVE_BUILTIN_CTZ)
+fi
+
+AC_CACHE_CHECK(for __builtin_clz{l} with no external dependencies,
+	       libc_cv_builtin_clz, [dnl
+libc_cv_builtin_clz=yes
+echo 'int foo (unsigned long x) { return __builtin_clz (x); }' > conftest.c
+if AC_TRY_COMMAND(${CC-cc} $CFLAGS $CPPFLAGS -S conftest.c -o conftest.s 1>&AS_MESSAGE_LOG_FD); then
+changequote(,)dnl
+  if grep '__clz[s,d]i2' conftest.s > /dev/null; then
+    libc_cv_builtin_clz=no
+  fi
+changequote([,])dnl
+fi
+rm -f conftest.c conftest.s
+])
+if test x$libc_cv_builtin_clz = xyes; then
+  AC_DEFINE(HAVE_BUILTIN_CLZ)
+fi
+
 dnl C++ feature tests.
 AC_LANG_PUSH([C++])
 
diff --git a/sysdeps/generic/string-extbyte.h b/sysdeps/generic/string-extbyte.h
new file mode 100644
index 0000000000..c8fecd259f
--- /dev/null
+++ b/sysdeps/generic/string-extbyte.h
@@ -0,0 +1,37 @@
+/* Extract byte from memory word.  Generic C version.
+   Copyright (C) 2022 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef _STRING_EXTBYTE_H
+#define _STRING_EXTBYTE_H 1
+
+#include <limits.h>
+#include <endian.h>
+#include <string-optype.h>
+
+/* Extract the byte at index IDX from word X, with index 0 being the
+   first byte in memory order.  */
+static inline unsigned char
+extractbyte (op_t x, unsigned int idx)
+{
+  if (__BYTE_ORDER == __LITTLE_ENDIAN)
+    return x >> (idx * CHAR_BIT);
+  else
+    return x >> (sizeof (x) - 1 - idx) * CHAR_BIT;
+}
+
+#endif /* _STRING_EXTBYTE_H */
diff --git a/sysdeps/generic/string-fza.h b/sysdeps/generic/string-fza.h
new file mode 100644
index 0000000000..8470437824
--- /dev/null
+++ b/sysdeps/generic/string-fza.h
@@ -0,0 +1,106 @@
+/* Basic zero byte detection.  Generic C version.
+   Copyright (C) 2022 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef _STRING_FZA_H
+#define _STRING_FZA_H 1
+
+#include <limits.h>
+#include <string-optype.h>
+#include <string-maskoff.h>
+
+/* This function returns non-zero if any byte in X is zero.
+   More specifically, at least one bit set within the least significant
+   byte that was zero; other bytes within the word are indeterminate.  */
+static inline op_t
+find_zero_low (op_t x)
+{
+  /* This expression comes from
+       https://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord
+     Subtracting 1 sets 0x80 in a byte that was 0; anding ~x clears
+     0x80 in a byte that was >= 128; anding 0x80 isolates that test bit.  */
+  op_t lsb = repeat_bytes (0x01);
+  op_t msb = repeat_bytes (0x80);
+  return (x - lsb) & ~x & msb;
+}
+
+/* This function returns at least one bit set within every byte of X that
+   is zero.  The result is exact in that, unlike find_zero_low, all bytes
+   are determinate.  This is usually used for finding the index of the
+   most significant byte that was zero.  */
+static inline op_t
+find_zero_all (op_t x)
+{
+  /* For each byte, find not-zero by
+     (0) And 0x7f so that we cannot carry between bytes,
+     (1) Add 0x7f so that non-zero carries into 0x80,
+     (2) Or in the original byte (which might have had 0x80 set).
+     Then invert and mask such that 0x80 is set iff that byte was zero.  */
+  op_t m = ((op_t)-1 / 0xff) * 0x7f;
+  return ~(((x & m) + m) | x | m);
+}
+
+/* With similar caveats, identify bytes that are equal between X1 and X2.  */
+static inline op_t
+find_eq_low (op_t x1, op_t x2)
+{
+  return find_zero_low (x1 ^ x2);
+}
+
+static inline op_t
+find_eq_all (op_t x1, op_t x2)
+{
+  return find_zero_all (x1 ^ x2);
+}
+
+/* With similar caveats, identify zero bytes in X1 and bytes that are
+   equal between X1 and X2.  */
+static inline op_t
+find_zero_eq_low (op_t x1, op_t x2)
+{
+  return find_zero_low (x1) | find_zero_low (x1 ^ x2);
+}
+
+static inline op_t
+find_zero_eq_all (op_t x1, op_t x2)
+{
+  return find_zero_all (x1) | find_zero_all (x1 ^ x2);
+}
+
+/* With similar caveats, identify zero bytes in X1 and bytes that are
+   not equal between X1 and X2.  */
+static inline op_t
+find_zero_ne_low (op_t x1, op_t x2)
+{
+  op_t m = ((op_t)-1 / 0xff) * 0x7f;
+  op_t eq = x1 ^ x2;
+  op_t nz1 = (x1 + m) | x1;	/* msb set if byte not zero.  */
+  op_t ne2 = (eq + m) | eq;	/* msb set if byte not equal.  */
+  return (ne2 | ~nz1) & ~m;	/* msb set if x1 zero or x2 not equal.  */
+}
+
+static inline op_t
+find_zero_ne_all (op_t x1, op_t x2)
+{
+  op_t m = ((op_t)-1 / 0xff) * 0x7f;
+  op_t eq = x1 ^ x2;
+  op_t nz1 = ((x1 & m) + m) | x1;
+  op_t ne2 = ((eq & m) + m) | eq;
+  return (ne2 | ~nz1) & ~m;
+}
+
+#endif /* _STRING_FZA_H */
diff --git a/sysdeps/generic/string-fzb.h b/sysdeps/generic/string-fzb.h
new file mode 100644
index 0000000000..f1c0ae0922
--- /dev/null
+++ b/sysdeps/generic/string-fzb.h
@@ -0,0 +1,49 @@
+/* Zero byte detection, boolean.  Generic C version.
+   Copyright (C) 2022 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef _STRING_FZB_H
+#define _STRING_FZB_H 1
+
+#include <endian.h>
+#include <string-fza.h>
+
+/* Determine if any byte within X is zero.  This is a pure boolean test.  */
+
+static inline _Bool
+has_zero (op_t x)
+{
+  return find_zero_low (x) != 0;
+}
+
+/* Likewise, but for byte equality between X1 and X2.  */
+
+static inline _Bool
+has_eq (op_t x1, op_t x2)
+{
+  return find_eq_low (x1, x2) != 0;
+}
+
+/* Likewise, but for zeros in X1 and equal bytes between X1 and X2.  */
+
+static inline _Bool
+has_zero_eq (op_t x1, op_t x2)
+{
+  return find_zero_eq_low (x1, x2) != 0;
+}
+
+#endif /* _STRING_FZB_H */
diff --git a/sysdeps/generic/string-fzi.h b/sysdeps/generic/string-fzi.h
new file mode 100644
index 0000000000..8c6b6dc3d8
--- /dev/null
+++ b/sysdeps/generic/string-fzi.h
@@ -0,0 +1,208 @@
+/* Zero byte detection; indexes.  Generic C version.
+   Copyright (C) 2022 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef _STRING_FZI_H
+#define _STRING_FZI_H 1
+
+#include <limits.h>
+#include <endian.h>
+#include <string-fza.h>
+
+/* An improved bitscan routine, multiplying the De Bruijn sequence with a
+   0-1 mask separated by the least significant one bit of a scanned integer
+   or bitboard [1].
+
+   [1] https://chessprogramming.wikispaces.com/Kim+Walisch  */
+static inline unsigned int
+index_access (const op_t i)
+{
+  static const char index[] =
+  {
+# if __WORDSIZE == 64
+     0, 47,  1, 56, 48, 27,  2, 60,
+    57, 49, 41, 37, 28, 16,  3, 61,
+    54, 58, 35, 52, 50, 42, 21, 44,
+    38, 32, 29, 23, 17, 11,  4, 62,
+    46, 55, 26, 59, 40, 36, 15, 53,
+    34, 51, 20, 43, 31, 22, 10, 45,
+    25, 39, 14, 33, 19, 30,  9, 24,
+    13, 18,  8, 12,  7,  6,  5, 63
+# else
+     0,  9,  1, 10, 13, 21,  2, 29,
+    11, 14, 16, 18, 22, 25,  3, 30,
+     8, 12, 20, 28, 15, 17, 24,  7,
+    19, 27, 23,  6, 26,  5,  4, 31
+# endif
+  };
+  return index[i];
+}
+
+/* For architectures which only provide __builtin_clz{l} (HAVE_BUILTIN_CLZ)
+   and/or __builtin_ctz{l} (HAVE_BUILTIN_CTZ) through external libcalls
+   (for instance __c{l,t}z{s,d}i2 from libgcc), the following wrappers provide
+   inline implementations of both count leading zeros and count trailing
+   zeros using branchless computation.  As for the builtins, if x is 0 the
+   result is undefined.  */
+static inline unsigned int
+__ctz (op_t x)
+{
+#if !HAVE_BUILTIN_CTZ
+  op_t i;
+# if __WORDSIZE == 64
+  i = (x ^ (x - 1)) * 0x03F79D71B4CB0A89ull >> 58;
+# else
+  i = (x ^ (x - 1)) * 0x07C4ACDDU >> 27;
+# endif
+  return index_access (i);
+#else
+  if (sizeof (op_t) == sizeof (long int))
+    return __builtin_ctzl (x);
+  else
+    return __builtin_ctzll (x);
+#endif
+}
+
+static inline unsigned int
+__clz (op_t x)
+{
+#if !HAVE_BUILTIN_CLZ
+  unsigned r;
+  op_t i;
+
+  x |= x >> 1;
+  x |= x >> 2;
+  x |= x >> 4;
+  x |= x >> 8;
+  x |= x >> 16;
+# if __WORDSIZE == 64
+  x |= x >> 32;
+  i = x * 0x03F79D71B4CB0A89ull >> 58;
+# else
+  i = x * 0x07C4ACDDU >> 27;
+# endif
+  r = index_access (i);
+  return r ^ (sizeof (op_t) * CHAR_BIT - 1);
+#else
+  if (sizeof (op_t) == sizeof (long int))
+    return __builtin_clzl (x);
+  else
+    return __builtin_clzll (x);
+#endif
+}
+
+/* A subroutine for the index_zero functions.  Given a test word C, return
+   the index of the first byte, in memory order, that is
+   non-zero.  */
+static inline unsigned int
+index_first_ (op_t c)
+{
+  _Static_assert (sizeof (op_t) == sizeof (long int)
+		  || sizeof (op_t) == sizeof (long long int),
+		  "Unhandled word size");
+
+  unsigned r;
+  if (__BYTE_ORDER == __LITTLE_ENDIAN)
+    r = __ctz (c);
+  else
+    r = __clz (c);
+  return r / CHAR_BIT;
+}
+
+/* Similarly, but return the (memory order) index of the last byte that is
+   non-zero.  */
+static inline unsigned int
+index_last_ (op_t c)
+{
+  _Static_assert (sizeof (op_t) == sizeof (long int)
+		  || sizeof (op_t) == sizeof (long long int),
+		  "Unhandled word size");
+
+  unsigned r;
+  if (__BYTE_ORDER == __LITTLE_ENDIAN)
+    r = __clz (c);
+  else
+    r = __ctz (c);
+  return sizeof (op_t) - 1 - (r / CHAR_BIT);
+}
+
+/* Given a word X that is known to contain a zero byte, return the
+   index of the first such within the word in memory order.  */
+
+static inline unsigned int
+index_first_zero (op_t x)
+{
+  if (__BYTE_ORDER == __LITTLE_ENDIAN)
+    x = find_zero_low (x);
+  else
+    x = find_zero_all (x);
+  return index_first_ (x);
+}
+
+/* Similarly, but perform the search for byte equality between X1 and X2.  */
+static inline unsigned int
+index_first_eq (op_t x1, op_t x2)
+{
+  if (__BYTE_ORDER == __LITTLE_ENDIAN)
+    x1 = find_eq_low (x1, x2);
+  else
+    x1 = find_eq_all (x1, x2);
+  return index_first_ (x1);
+}
+
+/* Similarly, but perform the search for zero within X1 or equality between
+   X1 and X2.  */
+static inline unsigned int
+index_first_zero_eq (op_t x1, op_t x2)
+{
+  if (__BYTE_ORDER == __LITTLE_ENDIAN)
+    x1 = find_zero_eq_low (x1, x2);
+  else
+    x1 = find_zero_eq_all (x1, x2);
+  return index_first_ (x1);
+}
+
+/* Similarly, but perform the search for zero within X1 or
+   inequality between X1 and X2.  */
+static inline unsigned int
+index_first_zero_ne (op_t x1, op_t x2)
+{
+  if (__BYTE_ORDER == __LITTLE_ENDIAN)
+    x1 = find_zero_ne_low (x1, x2);
+  else
+    x1 = find_zero_ne_all (x1, x2);
+  return index_first_ (x1);
+}
+
+/* Similarly, but search for the last zero within X.  */
+static inline unsigned int
+index_last_zero (op_t x)
+{
+  if (__BYTE_ORDER == __LITTLE_ENDIAN)
+    x = find_zero_all (x);
+  else
+    x = find_zero_low (x);
+  return index_last_ (x);
+}
+
+static inline unsigned int
+index_last_eq (op_t x1, op_t x2)
+{
+  return index_last_zero (x1 ^ x2);
+}
+
+#endif /* _STRING_FZI_H */
-- 
2.34.1



* Re: [PATCH 04/17] Add string vectorized find and detection functions
  2022-09-02 20:39 ` [PATCH 04/17] Add string vectorized find and detection functions Adhemerval Zanella
@ 2022-09-03  3:20   ` Noah Goldstein
  2022-09-19 14:00     ` Adhemerval Zanella Netto
  0 siblings, 1 reply; 5+ messages in thread
From: Noah Goldstein @ 2022-09-03  3:20 UTC (permalink / raw)
  To: Adhemerval Zanella
  Cc: GNU C Library, Richard Henderson, Joseph Myers, caiyinyu

On Fri, Sep 2, 2022 at 1:42 PM Adhemerval Zanella via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> This patch adds generic string find and detection functions meant to be
> used in the generic vectorized string implementations.  The idea is to
> decompose the basic string operations so each architecture can reimplement
> them if it provides any specialized hardware instruction.
>
> The 'string-fza.h' provides zero byte detection functions (find_zero_low,
> find_zero_all, find_eq_low, find_eq_all, find_zero_eq_low, find_zero_eq_all,
> find_zero_ne_low, and find_zero_ne_all).  They are used by the functions
> provided by both 'string-fzb.h' and 'string-fzi.h'.
>
> The 'string-fzb.h' provides boolean zero byte detection with the
> functions:
>
>   - has_zero: determine if any byte within a word is zero.
>   - has_eq: determine byte equality between two words.
>   - has_zero_eq: determine if any byte within a word is zero along with
>     byte equality between two words.
>
> The 'string-fzi.h' provides zero byte detection along with its position:
>
>   - index_first_zero: return index of first zero byte within a word.
>   - index_first_eq: return index of first byte equal between two words.
>   - index_first_zero_eq: return index of first zero byte within a word or
>     first byte equal between two words.
>   - index_first_zero_ne: return index of first zero byte within a word or
>     first byte different between two words.
>   - index_last_zero: return index of last zero byte within a word.
>   - index_last_eq: return index of last byte equal between two words.
>
> Also, to avoid libcalls in the '__builtin_c{t,l}z{l}' calls (which may
> degrade performance), inline implementations based on De Bruijn
> sequences are added (enabled by a configure check).
>
> Co-authored-by: Richard Henderson <rth@twiddle.net>
> ---
>  config.h.in                      |   8 ++
>  configure                        |  54 ++++++++
>  configure.ac                     |  34 +++++
>  sysdeps/generic/string-extbyte.h |  37 ++++++
>  sysdeps/generic/string-fza.h     | 106 ++++++++++++++++
>  sysdeps/generic/string-fzb.h     |  49 ++++++++
>  sysdeps/generic/string-fzi.h     | 208 +++++++++++++++++++++++++++++++
>  7 files changed, 496 insertions(+)
>  create mode 100644 sysdeps/generic/string-extbyte.h
>  create mode 100644 sysdeps/generic/string-fza.h
>  create mode 100644 sysdeps/generic/string-fzb.h
>  create mode 100644 sysdeps/generic/string-fzi.h
>
> diff --git a/config.h.in b/config.h.in
> index 43d32518ab..8f0540d85a 100644
> --- a/config.h.in
> +++ b/config.h.in
> @@ -202,6 +202,14 @@
>  /* An integer used to scale the timeout of test programs.  */
>  #define TIMEOUTFACTOR 1
>
> +/* If compiler supports __builtin_ctz{l} without any external dependencies
> +   (libgcc for instance).  */
> +#define HAVE_BUILTIN_CTZ 0
> +
> +/* If compiler supports __builtin_clz{l} without any external dependencies
> +   (libgcc for instance).  */
> +#define HAVE_BUILTIN_CLZ 0
> +
>  /*
>   */
>
> diff --git a/configure b/configure
> index ff2c406b3b..99cdec1bef 100755
> --- a/configure
> +++ b/configure
> @@ -6726,6 +6726,60 @@ if test $libc_cv_builtin_trap = yes; then
>
>  fi
>
> +{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __builtin_ctz{l} with no external dependencies" >&5
> +$as_echo_n "checking for __builtin_ctz{l} with no external dependencies... " >&6; }
> +if ${libc_cv_builtin_ctz+:} false; then :
> +  $as_echo_n "(cached) " >&6
> +else
> +  libc_cv_builtin_ctz=yes
> +echo 'int foo (unsigned long x) { return __builtin_ctz (x); }' > conftest.c
> +if { ac_try='${CC-cc} $CFLAGS $CPPFLAGS -S conftest.c -o conftest.s 1>&5'
> +  { { eval echo "\"\$as_me\":${as_lineno-$LINENO}: \"$ac_try\""; } >&5
> +  (eval $ac_try) 2>&5
> +  ac_status=$?
> +  $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
> +  test $ac_status = 0; }; }; then
> +  if grep '__ctz[s,d]i2' conftest.s > /dev/null; then
> +    libc_cv_builtin_ctz=no
> +  fi
> +fi
> +rm -f conftest.c conftest.s
> +
> +fi
> +{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $libc_cv_builtin_ctz" >&5
> +$as_echo "$libc_cv_builtin_ctz" >&6; }
> +if test x$libc_cv_builtin_ctz = xyes; then
> +  $as_echo "#define HAVE_BUILTIN_CTZ 1" >>confdefs.h
> +
> +fi
> +
> +{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __builtin_clz{l} with no external dependencies" >&5
> +$as_echo_n "checking for __builtin_clz{l} with no external dependencies... " >&6; }
> +if ${libc_cv_builtin_clz+:} false; then :
> +  $as_echo_n "(cached) " >&6
> +else
> +  libc_cv_builtin_clz=yes
> +echo 'int foo (unsigned long x) { return __builtin_clz (x); }' > conftest.c
> +if { ac_try='${CC-cc} $CFLAGS $CPPFLAGS -S conftest.c -o conftest.s 1>&5'
> +  { { eval echo "\"\$as_me\":${as_lineno-$LINENO}: \"$ac_try\""; } >&5
> +  (eval $ac_try) 2>&5
> +  ac_status=$?
> +  $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
> +  test $ac_status = 0; }; }; then
> +  if grep '__clz[s,d]i2' conftest.s > /dev/null; then
> +    libc_cv_builtin_clz=no
> +  fi
> +fi
> +rm -f conftest.c conftest.s
> +
> +fi
> +{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $libc_cv_builtin_clz" >&5
> +$as_echo "$libc_cv_builtin_clz" >&6; }
> +if test x$libc_cv_builtin_clz = xyes; then
> +  $as_echo "#define HAVE_BUILTIN_CLZ 1" >>confdefs.h
> +
> +fi
> +
>  ac_ext=cpp
>  ac_cpp='$CXXCPP $CPPFLAGS'
>  ac_compile='$CXX -c $CXXFLAGS $CPPFLAGS conftest.$ac_ext >&5'
> diff --git a/configure.ac b/configure.ac
> index eb5bc6a131..66e02a0566 100644
> --- a/configure.ac
> +++ b/configure.ac
> @@ -1618,6 +1618,40 @@ if test $libc_cv_builtin_trap = yes; then
>    AC_DEFINE([HAVE_BUILTIN_TRAP])
>  fi
>
> +AC_CACHE_CHECK(for __builtin_ctz{l} with no external dependencies,
> +              libc_cv_builtin_ctz, [dnl
> +libc_cv_builtin_ctz=yes
> +echo 'int foo (unsigned long x) { return __builtin_ctz (x); }' > conftest.c
> +if AC_TRY_COMMAND(${CC-cc} $CFLAGS $CPPFLAGS -S conftest.c -o conftest.s 1>&AS_MESSAGE_LOG_FD); then
> +changequote(,)dnl
> +  if grep '__ctz[s,d]i2' conftest.s > /dev/null; then
> +    libc_cv_builtin_ctz=no
> +  fi
> +changequote([,])dnl
> +fi
> +rm -f conftest.c conftest.s
> +])
> +if test x$libc_cv_builtin_ctz = xyes; then
> +  AC_DEFINE(HAVE_BUILTIN_CTZ)
> +fi
> +
> +AC_CACHE_CHECK(for __builtin_clz{l} with no external dependencies,
> +              libc_cv_builtin_clz, [dnl
> +libc_cv_builtin_clz=yes
> +echo 'int foo (unsigned long x) { return __builtin_clz (x); }' > conftest.c
> +if AC_TRY_COMMAND(${CC-cc} $CFLAGS $CPPFLAGS -S conftest.c -o conftest.s 1>&AS_MESSAGE_LOG_FD); then
> +changequote(,)dnl
> +  if grep '__clz[s,d]i2' conftest.s > /dev/null; then
> +    libc_cv_builtin_clz=no
> +  fi
> +changequote([,])dnl
> +fi
> +rm -f conftest.c conftest.s
> +])
> +if test x$libc_cv_builtin_clz = xyes; then
> +  AC_DEFINE(HAVE_BUILTIN_CLZ)
> +fi
> +
>  dnl C++ feature tests.
>  AC_LANG_PUSH([C++])
>
> diff --git a/sysdeps/generic/string-extbyte.h b/sysdeps/generic/string-extbyte.h
> new file mode 100644
> index 0000000000..c8fecd259f
> --- /dev/null
> +++ b/sysdeps/generic/string-extbyte.h
> @@ -0,0 +1,37 @@
> +/* Extract byte from memory word.  Generic C version.
> +   Copyright (C) 2022 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +#ifndef _STRING_EXTBYTE_H
> +#define _STRING_EXTBYTE_H 1
> +
> +#include <limits.h>
> +#include <endian.h>
> +#include <string-optype.h>
> +
> +/* Extract the byte at index IDX from word X, with index 0 being the
> +   least significant byte.  */
> +static inline unsigned char
> +extractbyte (op_t x, unsigned int idx)
> +{
> +  if (__BYTE_ORDER == __LITTLE_ENDIAN)
> +    return x >> (idx * CHAR_BIT);
> +  else
> +    return x >> (sizeof (x) - 1 - idx) * CHAR_BIT;
> +}
> +
> +#endif /* _STRING_EXTBYTE_H */
> diff --git a/sysdeps/generic/string-fza.h b/sysdeps/generic/string-fza.h
> new file mode 100644
> index 0000000000..8470437824
> --- /dev/null
> +++ b/sysdeps/generic/string-fza.h
> @@ -0,0 +1,106 @@
> +/* Basic zero byte detection.  Generic C version.
> +   Copyright (C) 2022 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +#ifndef _STRING_FZA_H
> +#define _STRING_FZA_H 1
> +
> +#include <limits.h>
> +#include <string-optype.h>
> +#include <string-maskoff.h>
> +
> +/* This function returns non-zero if any byte in X is zero.
> +   More specifically, at least one bit is set within the least significant
> +   byte that was zero; other bytes within the word are indeterminate.  */
> +static inline op_t
> +find_zero_low (op_t x)
> +{
> +  /* This expression comes from
> +       https://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord
> +     Subtracting 1 sets 0x80 in a byte that was 0; anding ~x clears
> +     0x80 in a byte that was >= 128; anding 0x80 isolates that test bit.  */
> +  op_t lsb = repeat_bytes (0x01);
> +  op_t msb = repeat_bytes (0x80);
> +  return (x - lsb) & ~x & msb;
> +}
> +
> +/* This function returns at least one bit set within every byte of X that
> +   is zero.  The result is exact in that, unlike find_zero_low, all bytes
> +   are determinate.  This is usually used for finding the index of the
> +   most significant byte that was zero.  */
> +static inline op_t
> +find_zero_all (op_t x)
> +{
> +  /* For each byte, find not-zero by
> +     (0) And 0x7f so that we cannot carry between bytes,
> +     (1) Add 0x7f so that non-zero carries into 0x80,
> +     (2) Or in the original byte (which might have had 0x80 set).
> +     Then invert and mask such that 0x80 is set iff that byte was zero.  */
> +  op_t m = ((op_t)-1 / 0xff) * 0x7f;
> +  return ~(((x & m) + m) | x | m);
> +}
> +
> +/* With similar caveats, identify bytes that are equal between X1 and X2.  */
> +static inline op_t
> +find_eq_low (op_t x1, op_t x2)
> +{
> +  return find_zero_low (x1 ^ x2);
> +}
> +
> +static inline op_t
> +find_eq_all (op_t x1, op_t x2)
> +{
> +  return find_zero_all (x1 ^ x2);
> +}
> +
> +/* With similar caveats, identify zero bytes in X1 and bytes that are
> +   equal between X1 and X2.  */
> +static inline op_t
> +find_zero_eq_low (op_t x1, op_t x2)
> +{
> +  return find_zero_low (x1) | find_zero_low (x1 ^ x2);
> +}
> +
> +static inline op_t
> +find_zero_eq_all (op_t x1, op_t x2)
> +{
> +  return find_zero_all (x1) | find_zero_all (x1 ^ x2);
> +}
> +
> +/* With similar caveats, identify zero bytes in X1 and bytes that are
> +   not equal between X1 and X2.  */
> +static inline op_t
> +find_zero_ne_low (op_t x1, op_t x2)
> +{
> +  op_t m = ((op_t)-1 / 0xff) * 0x7f;

Can this use repeat_bytes?
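
For context, a minimal sketch of what that change amounts to.  The
definition of repeat_bytes below is an assumption (the real one comes from
string-maskoff.h, which is not part of this patch), op_t is replaced by a
plain unsigned long int stand-in, and mask_7f_open_coded and
check_same_mask are hypothetical names used only for the comparison.

```c
#include <assert.h>

/* Stand-in for op_t; the real type comes from string-optype.h.  */
typedef unsigned long int op_t;

/* Assumed definition: replicate byte C into every byte of an op_t.  */
static inline op_t
repeat_bytes (unsigned char c)
{
  return ((op_t) -1 / 0xff) * c;
}

/* The expression as currently written in the patch.  */
static inline op_t
mask_7f_open_coded (void)
{
  return ((op_t) -1 / 0xff) * 0x7f;
}

/* Both spellings must produce the same 0x7f7f...7f mask.  */
static inline void
check_same_mask (void)
{
  assert (repeat_bytes (0x7f) == mask_7f_open_coded ());
}
```
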
> +  op_t eq = x1 ^ x2;
> +  op_t nz1 = (x1 + m) | x1;    /* msb set if byte not zero.  */
> +  op_t ne2 = (eq + m) | eq;    /* msb set if byte not equal.  */
> +  return (ne2 | ~nz1) & ~m;    /* msb set if x1 zero or x2 not equal.  */
> +}
> +
> +static inline op_t
> +find_zero_ne_all (op_t x1, op_t x2)
> +{
> +  op_t m = ((op_t)-1 / 0xff) * 0x7f;

Likewise. Elsewhere too.
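
One of the other places is find_zero_all.  Below is a sketch of it written
with repeat_bytes, next to find_zero_low for comparison.  The repeat_bytes
definition is an assumption (string-maskoff.h is not shown in this patch),
op_t is fixed to uint64_t for the example, and check_low_vs_all is a
hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for op_t, fixed at 64 bits for this sketch.  */
typedef uint64_t op_t;

/* Assumed definition: replicate byte C into every byte of an op_t.  */
static inline op_t
repeat_bytes (unsigned char c)
{
  return ((op_t) -1 / 0xff) * c;
}

/* As in the patch: exact only for the least significant zero byte.  */
static inline op_t
find_zero_low (op_t x)
{
  op_t lsb = repeat_bytes (0x01);
  op_t msb = repeat_bytes (0x80);
  return (x - lsb) & ~x & msb;
}

/* Same expression as the patch, with the mask built by repeat_bytes.  */
static inline op_t
find_zero_all (op_t x)
{
  op_t m = repeat_bytes (0x7f);
  return ~(((x & m) + m) | x | m);
}

/* Byte 0 is zero and byte 1 is 0x01.  The borrow out of byte 0 makes
   find_zero_low also flag byte 1, while find_zero_all flags byte 0 only.  */
static inline void
check_low_vs_all (void)
{
  op_t x = UINT64_C (0x1111111111110100);
  assert (find_zero_low (x) == UINT64_C (0x8080));
  assert (find_zero_all (x) == UINT64_C (0x80));
}
```

The example value shows why the comments call the bytes above the first
match "indeterminate" for the _low variant.
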
> +  op_t eq = x1 ^ x2;
> +  op_t nz1 = ((x1 & m) + m) | x1;
> +  op_t ne2 = ((eq & m) + m) | eq;
> +  return (ne2 | ~nz1) & ~m;
> +}
> +
> +#endif /* _STRING_FZA_H */
> diff --git a/sysdeps/generic/string-fzb.h b/sysdeps/generic/string-fzb.h
> new file mode 100644
> index 0000000000..f1c0ae0922
> --- /dev/null
> +++ b/sysdeps/generic/string-fzb.h
> @@ -0,0 +1,49 @@
> +/* Zero byte detection, boolean.  Generic C version.
> +   Copyright (C) 2022 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +#ifndef _STRING_FZB_H
> +#define _STRING_FZB_H 1
> +
> +#include <endian.h>
> +#include <string-fza.h>
> +
> +/* Determine if any byte within X is zero.  This is a pure boolean test.  */
> +
> +static inline _Bool
> +has_zero (op_t x)
> +{
> +  return find_zero_low (x) != 0;
> +}
> +
> +/* Likewise, but for byte equality between X1 and X2.  */
> +
> +static inline _Bool
> +has_eq (op_t x1, op_t x2)
> +{
> +  return find_eq_low (x1, x2) != 0;
> +}
> +
> +/* Likewise, but for zeros in X1 and equal bytes between X1 and X2.  */
> +
> +static inline _Bool
> +has_zero_eq (op_t x1, op_t x2)
> +{
> +  return find_zero_eq_low (x1, x2);
> +}
> +
> +#endif /* _STRING_FZB_H */
> diff --git a/sysdeps/generic/string-fzi.h b/sysdeps/generic/string-fzi.h
> new file mode 100644
> index 0000000000..8c6b6dc3d8
> --- /dev/null
> +++ b/sysdeps/generic/string-fzi.h
> @@ -0,0 +1,208 @@
> +/* Zero byte detection; indexes.  Generic C version.
> +   Copyright (C) 2022 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +#ifndef _STRING_FZI_H
> +#define _STRING_FZI_H 1
> +
> +#include <limits.h>
> +#include <endian.h>
> +#include <string-fza.h>
> +
> +/* An improved bitscan routine, multiplying the De Bruijn sequence with a
> +   0-1 mask separated by the least significant one bit of a scanned integer
> +   or bitboard [1].
> +
> +   [1] https://chessprogramming.wikispaces.com/Kim+Walisch  */
> +static inline unsigned int
> +index_access (const op_t i)
> +{
> +  static const char index[] =
> +  {
> +# if __WORDSIZE == 64
> +     0, 47,  1, 56, 48, 27,  2, 60,
> +    57, 49, 41, 37, 28, 16,  3, 61,
> +    54, 58, 35, 52, 50, 42, 21, 44,
> +    38, 32, 29, 23, 17, 11,  4, 62,
> +    46, 55, 26, 59, 40, 36, 15, 53,
> +    34, 51, 20, 43, 31, 22, 10, 45,
> +    25, 39, 14, 33, 19, 30,  9, 24,
> +    13, 18,  8, 12,  7,  6,  5, 63
> +# else
> +     0,  9,  1, 10, 13, 21,  2, 29,
> +    11, 14, 16, 18, 22, 25,  3, 30,
> +     8, 12, 20, 28, 15, 17, 24,  7,
> +    19, 27, 23,  6, 26,  5,  4, 31
> +# endif
> +  };
> +  return index[i];
> +}
> +
> +/* For architectures which only provide __builtin_clz{l} (HAVE_BUILTIN_CLZ)
> +   and/or __builtin_ctz{l} (HAVE_BUILTIN_CTZ) through external libcalls
> +   (for instance __c{l,t}z{s,d}i2 from libgcc), the following wrappers
> +   provide inline implementations of both count leading zeros and count
> +   trailing zeros using branchless computation.  As for the builtins, if
> +   x is 0 the result is undefined.  */
> +static inline unsigned int
> +__ctz (op_t x)
> +{
> +#if !HAVE_BUILTIN_CTZ
> +  op_t i;
> +# if __WORDSIZE == 64
> +  i = (x ^ (x - 1)) * 0x03F79D71B4CB0A89ull >> 58;
> +# else
> +  i = (x ^ (x - 1)) * 0x07C4ACDDU >> 27;
> +# endif
> +  return index_access (i);
> +#else
> +  if (sizeof (op_t) == sizeof (long int))
> +    return __builtin_ctzl (x);
> +  else
> +    return __builtin_ctzll (x);
> +#endif
> +}
> +
> +static inline unsigned int
> +__clz (op_t x)
> +{
> +#if !HAVE_BUILTIN_CLZ
> +  unsigned r;
> +  op_t i;
> +
> +  x |= x >> 1;
> +  x |= x >> 2;
> +  x |= x >> 4;
> +  x |= x >> 8;
> +  x |= x >> 16;
> +# if __WORDSIZE == 64
> +  x |= x >> 32;
> +  i = x * 0x03F79D71B4CB0A89ull >> 58;
> +# else
> +  i = x * 0x07C4ACDDU >> 27;
> +# endif
> +  r = index_access (i);
> +  return r ^ (sizeof (op_t) * CHAR_BIT - 1);
> +#else
> +  if (sizeof (op_t) == sizeof (long int))
> +    return __builtin_clzl (x);
> +  else
> +    return __builtin_clzll (x);
> +#endif
> +}
> +
> +/* A subroutine for the index_zero functions.  Given a test word C, return
> +   the index of the first byte (in memory order) that is non-zero.  */
> +static inline unsigned int
> +index_first_ (op_t c)
> +{
> +  _Static_assert (sizeof (op_t) == sizeof (long int)
> +                 || sizeof (op_t) == sizeof (long long int),
> +                 "Unhandled word size");
> +
> +  unsigned r;
> +  if (__BYTE_ORDER == __LITTLE_ENDIAN)
> +    r = __ctz (c);
> +  else
> +    r = __clz (c);
> +  return r / CHAR_BIT;
> +}
> +
> +/* Similarly, but return the (memory order) index of the last byte that is
> +   non-zero.  */
> +static inline unsigned int
> +index_last_ (op_t c)
> +{
> +  _Static_assert (sizeof (op_t) == sizeof (long int)
> +                 || sizeof (op_t) == sizeof (long long int),
> +                 "Unhandled word size");
> +
> +  unsigned r;
> +  if (__BYTE_ORDER == __LITTLE_ENDIAN)
> +    r = __clz (c);
> +  else
> +    r = __ctz (c);
> +  return sizeof (op_t) - 1 - (r / CHAR_BIT);
> +}
> +
> +/* Given a word X that is known to contain a zero byte, return the
> +   index of the first such within the word in memory order.  */
> +
> +static inline unsigned int
> +index_first_zero (op_t x)
> +{
> +  if (__BYTE_ORDER == __LITTLE_ENDIAN)
> +    x = find_zero_low (x);
> +  else
> +    x = find_zero_all (x);
> +  return index_first_ (x);
> +}
> +
> +/* Similarly, but perform the search for byte equality between X1 and X2.  */
> +static inline unsigned int
> +index_first_eq (op_t x1, op_t x2)
> +{
> +  if (__BYTE_ORDER == __LITTLE_ENDIAN)
> +    x1 = find_eq_low (x1, x2);
> +  else
> +    x1 = find_eq_all (x1, x2);
> +  return index_first_ (x1);
> +}
> +
> +/* Similarly, but perform the search for zero within X1 or equality between
> +   X1 and X2.  */
> +static inline unsigned int
> +index_first_zero_eq (op_t x1, op_t x2)
> +{
> +  if (__BYTE_ORDER == __LITTLE_ENDIAN)
> +    x1 = find_zero_eq_low (x1, x2);
> +  else
> +    x1 = find_zero_eq_all (x1, x2);
> +  return index_first_ (x1);
> +}
> +
> +/* Similarly, but perform the search for zero within X1 or
> +   inequality between X1 and X2.  */
> +static inline unsigned int
> +index_first_zero_ne (op_t x1, op_t x2)
> +{
> +  if (__BYTE_ORDER == __LITTLE_ENDIAN)
> +    x1 = find_zero_ne_low (x1, x2);
> +  else
> +    x1 = find_zero_ne_all (x1, x2);
> +  return index_first_ (x1);
> +}
> +
> +/* Similarly, but search for the last zero within X.  */
> +static inline unsigned int
> +index_last_zero (op_t x)
> +{
> +  if (__BYTE_ORDER == __LITTLE_ENDIAN)
> +    x = find_zero_all (x);
> +  else
> +    x = find_zero_low (x);
> +  return index_last_ (x);
> +}
> +
> +static inline unsigned int
> +index_last_eq (op_t x1, op_t x2)
> +{
> +  return index_last_zero (x1 ^ x2);
> +}
> +
> +#endif /* _STRING_FZI_H */
> --
> 2.34.1
>
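
For readers unfamiliar with the De Bruijn fallback in __ctz and __clz, a
standalone 64-bit sketch follows.  It reuses the multiplier and table from
the patch, but fixes the type to uint64_t instead of op_t and skips the
HAVE_BUILTIN_* selection; the names debruijn_index, DEBRUIJN64,
ctz64_debruijn and clz64_debruijn are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Multiplier and lookup table copied from the 64-bit branch of the patch.  */
#define DEBRUIJN64 UINT64_C (0x03F79D71B4CB0A89)

static const unsigned char debruijn_index[64] =
{
   0, 47,  1, 56, 48, 27,  2, 60,
  57, 49, 41, 37, 28, 16,  3, 61,
  54, 58, 35, 52, 50, 42, 21, 44,
  38, 32, 29, 23, 17, 11,  4, 62,
  46, 55, 26, 59, 40, 36, 15, 53,
  34, 51, 20, 43, 31, 22, 10, 45,
  25, 39, 14, 33, 19, 30,  9, 24,
  13, 18,  8, 12,  7,  6,  5, 63
};

/* Count trailing zeros.  x ^ (x - 1) sets every bit up to and including
   the lowest set bit; the multiply and shift then select a table slot.  */
static inline unsigned int
ctz64_debruijn (uint64_t x)
{
  assert (x != 0);
  return debruijn_index[((x ^ (x - 1)) * DEBRUIJN64) >> 58];
}

/* Count leading zeros.  The shifts smear the highest set bit downwards,
   the table yields its bit index, and XOR with 63 turns that index into
   a count from the top.  */
static inline unsigned int
clz64_debruijn (uint64_t x)
{
  assert (x != 0);
  x |= x >> 1;
  x |= x >> 2;
  x |= x >> 4;
  x |= x >> 8;
  x |= x >> 16;
  x |= x >> 32;
  return debruijn_index[(x * DEBRUIJN64) >> 58] ^ 63u;
}
```

Both helpers feed the table a word of the form 2^(k+1) - 1, which is why
one table serves both directions.  As with the builtins, a zero argument
is not supported.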


* Re: [PATCH 04/17] Add string vectorized find and detection functions
  2022-09-03  3:20   ` Noah Goldstein
@ 2022-09-19 14:00     ` Adhemerval Zanella Netto
  0 siblings, 0 replies; 5+ messages in thread
From: Adhemerval Zanella Netto @ 2022-09-19 14:00 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: GNU C Library, Richard Henderson, Joseph Myers, caiyinyu



On 03/09/22 00:20, Noah Goldstein wrote:
>> +
>> +/* With similar caveats, identify zero bytes in X1 and bytes that are
>> +   not equal between X1 and X2.  */
>> +static inline op_t
>> +find_zero_ne_low (op_t x1, op_t x2)
>> +{
>> +  op_t m = ((op_t)-1 / 0xff) * 0x7f;
> 
> Can this use repeat_bytes?

We can, I will update it.

>> +  op_t eq = x1 ^ x2;
>> +  op_t nz1 = (x1 + m) | x1;    /* msb set if byte not zero.  */
>> +  op_t ne2 = (eq + m) | eq;    /* msb set if byte not equal.  */
>> +  return (ne2 | ~nz1) & ~m;    /* msb set if x1 zero or x2 not equal.  */
>> +}
>> +
>> +static inline op_t
>> +find_zero_ne_all (op_t x1, op_t x2)
>> +{
>> +  op_t m = ((op_t)-1 / 0xff) * 0x7f;
> 
> Likewise. Elsewhere too.

Ack.
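
A rough sketch of how find_zero_ne_all could look after that change,
checked against a byte-at-a-time reference.  This is an illustration, not
the updated patch: the repeat_bytes definition is assumed
(string-maskoff.h is not shown here), op_t is fixed to uint64_t, and
find_zero_ne_ref and check_pair are hypothetical helpers.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Stand-in for op_t, fixed at 64 bits for this sketch.  */
typedef uint64_t op_t;

/* Assumed definition: replicate byte C into every byte of an op_t.  */
static inline op_t
repeat_bytes (unsigned char c)
{
  return ((op_t) -1 / 0xff) * c;
}

/* Body as in the patch, with the mask built by repeat_bytes.  */
static inline op_t
find_zero_ne_all (op_t x1, op_t x2)
{
  op_t m = repeat_bytes (0x7f);
  op_t eq = x1 ^ x2;
  op_t nz1 = ((x1 & m) + m) | x1;   /* msb set if byte of x1 not zero.  */
  op_t ne2 = ((eq & m) + m) | eq;   /* msb set if bytes differ.  */
  return (ne2 | ~nz1) & ~m;
}

/* Reference: set 0x80 in each byte where x1 is zero or x1 != x2.  */
static inline op_t
find_zero_ne_ref (op_t x1, op_t x2)
{
  op_t r = 0;
  for (unsigned int i = 0; i < sizeof (op_t); i++)
    {
      unsigned char b1 = (unsigned char) (x1 >> (i * CHAR_BIT));
      unsigned char b2 = (unsigned char) (x2 >> (i * CHAR_BIT));
      if (b1 == 0 || b1 != b2)
        r |= (op_t) 0x80 << (i * CHAR_BIT);
    }
  return r;
}

static inline void
check_pair (op_t x1, op_t x2)
{
  assert (find_zero_ne_all (x1, x2) == find_zero_ne_ref (x1, x2));
}
```
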


end of thread, other threads:[~2022-09-19 14:00 UTC | newest]

Thread overview: 5+ messages
2022-09-03 13:13 [PATCH 04/17] Add string vectorized find and detection functions Wilco Dijkstra
2022-09-19 13:59 ` Adhemerval Zanella Netto
  -- strict thread matches above, loose matches on Subject: below --
2022-09-02 20:39 [PATCH 00/17] Improve generic string routines Adhemerval Zanella
2022-09-02 20:39 ` [PATCH 04/17] Add string vectorized find and detection functions Adhemerval Zanella
2022-09-03  3:20   ` Noah Goldstein
2022-09-19 14:00     ` Adhemerval Zanella Netto
