public inbox for libc-alpha@sourceware.org
* [PATCH v5 00/17] Improve generic string routines
@ 2022-09-19 19:59 Adhemerval Zanella
  2022-09-19 19:59 ` [PATCH v5 01/17] Parameterize op_t from memcopy.h Adhemerval Zanella
                   ` (17 more replies)
  0 siblings, 18 replies; 55+ messages in thread
From: Adhemerval Zanella @ 2022-09-19 19:59 UTC (permalink / raw)
  To: libc-alpha

It is done by:

  1. Parameterizing the internal routines (for instance, the
     find-zero-in-a-word primitive) so each architecture can
     reimplement them without the need to reimplement the whole
     routine.

  2. Vectorizing more string implementations (for instance strcpy
     and strcmp).

  3. Changing some implementations to reuse already-optimized ones
     (for instance strnlen).  It lets new ports focus on providing
     optimized implementations of only a handful of symbols (for
     instance memchr) and makes those improvements benefit a larger
     set of routines.

The remaining items of BZ#5806 can be handled later, and if the
performance of the generic implementations is close enough, I think it
is better to just remove the old assembly implementations.

I also checked on x86_64-linux-gnu, i686-linux-gnu, powerpc-linux-gnu,
and powerpc64-linux-gnu by removing the arch-specific assembly
implementation and disabling multiarch (it covers both LE and BE
for 64 and 32 bits). I also checked the string routines on alpha, hppa,
and sh.

Changes since v4:
  * Removed __clz and __ctz in favor of count_leading_zero and
    count_trailing_zeros from longlong.h.
  * Use repeat_bytes more often.
  * Added a comment on strcmp final_cmp explaining why
    index_first_zero_ne cannot be used.

Changes since v3:
  * Rebased against master.
  * Dropped strcpy optimization.
  * Refactor strcmp implementation.
  * Some minor changes in comments.

Changes since v2:
  * Moved string-fz{a,b,i} to their own patch.
  * Added an inline implementation for __builtin_c{l,t}z to avoid
    using compiler-provided symbols.
  * Added a new header, string-maskoff.h, to handle unaligned accesses
    in some implementations.
  * Fixed strcmp on LE machines.
  * Added an unaligned strcpy variant for architectures that define
    _STRING_ARCH_unaligned.
  * Added SH string-fzb.h (which uses the cmp/str instruction to find
    a zero in a word).

Changes since v1:
  * Marked ChangeLog entries with [BZ #5806], as appropriate.
  * Reorganized the headers, so that armv6t2 and power6 need to
    override as little as possible to use their (integer) zero
    detection insns.
  * Hopefully fixed all of the coding style issues.
  * Adjusted the memrchr algorithm as discussed.
  * Replaced the #ifdef STRRCHR etc. that are used by the multiarch
    files.
  * Tested on i386, i686, x86_64 (verified this is unused), ppc64,
    ppc64le --with-cpu=power8 (to use power6 in multiarch), armv7,
    aarch64, alpha (qemu) and hppa (qemu).

Adhemerval Zanella (10):
  Add string-maskoff.h generic header
  Add string vectorized find and detection functions
  string: Improve generic strlen
  string: Improve generic strnlen
  string: Improve generic strchr
  string: Improve generic strchrnul
  string: Improve generic strcmp
  string: Improve generic memchr
  string: Improve generic memrchr
  sh: Add string-fzb.h

Richard Henderson (7):
  Parameterize op_t from memcopy.h
  Parameterize OP_T_THRES from memcopy.h
  hppa: Add memcopy.h
  hppa: Add string-fzb.h and string-fzi.h
  alpha: Add string-fzb.h and string-fzi.h
  arm: Add string-fza.h
  powerpc: Add string-fza.h

 string/memchr.c                               | 168 ++++------------
 string/memcmp.c                               |   4 -
 string/memrchr.c                              | 189 +++---------------
 string/strchr.c                               | 172 +++-------------
 string/strchrnul.c                            | 156 +++------------
 string/strcmp.c                               | 119 +++++++++--
 string/strlen.c                               |  90 ++-------
 string/strnlen.c                              | 137 +------------
 sysdeps/alpha/string-fzb.h                    |  51 +++++
 sysdeps/alpha/string-fzi.h                    | 113 +++++++++++
 sysdeps/arm/armv6t2/string-fza.h              |  70 +++++++
 sysdeps/generic/memcopy.h                     |  10 +-
 sysdeps/generic/string-extbyte.h              |  37 ++++
 sysdeps/generic/string-fza.h                  | 106 ++++++++++
 sysdeps/generic/string-fzb.h                  |  49 +++++
 sysdeps/generic/string-fzi.h                  | 120 +++++++++++
 sysdeps/generic/string-maskoff.h              |  73 +++++++
 sysdeps/generic/string-opthr.h                |  25 +++
 sysdeps/generic/string-optype.h               |  31 +++
 sysdeps/hppa/memcopy.h                        |  42 ++++
 sysdeps/hppa/string-fzb.h                     |  69 +++++++
 sysdeps/hppa/string-fzi.h                     | 135 +++++++++++++
 sysdeps/i386/i686/multiarch/strnlen-c.c       |  14 +-
 sysdeps/i386/memcopy.h                        |   3 -
 sysdeps/i386/string-opthr.h                   |  25 +++
 sysdeps/m68k/memcopy.h                        |   3 -
 sysdeps/powerpc/powerpc32/power4/memcopy.h    |   5 -
 .../powerpc32/power4/multiarch/memchr-ppc32.c |  14 +-
 .../power4/multiarch/strchrnul-ppc32.c        |   4 -
 .../power4/multiarch/strnlen-ppc32.c          |  14 +-
 .../powerpc64/multiarch/memchr-ppc64.c        |   9 +-
 sysdeps/powerpc/string-fza.h                  |  70 +++++++
 sysdeps/s390/strchr-c.c                       |  11 +-
 sysdeps/s390/strchrnul-c.c                    |   2 -
 sysdeps/s390/strlen-c.c                       |  10 +-
 sysdeps/s390/strnlen-c.c                      |  14 +-
 sysdeps/sh/string-fzb.h                       |  53 +++++
 37 files changed, 1366 insertions(+), 851 deletions(-)
 create mode 100644 sysdeps/alpha/string-fzb.h
 create mode 100644 sysdeps/alpha/string-fzi.h
 create mode 100644 sysdeps/arm/armv6t2/string-fza.h
 create mode 100644 sysdeps/generic/string-extbyte.h
 create mode 100644 sysdeps/generic/string-fza.h
 create mode 100644 sysdeps/generic/string-fzb.h
 create mode 100644 sysdeps/generic/string-fzi.h
 create mode 100644 sysdeps/generic/string-maskoff.h
 create mode 100644 sysdeps/generic/string-opthr.h
 create mode 100644 sysdeps/generic/string-optype.h
 create mode 100644 sysdeps/hppa/memcopy.h
 create mode 100644 sysdeps/hppa/string-fzb.h
 create mode 100644 sysdeps/hppa/string-fzi.h
 create mode 100644 sysdeps/i386/string-opthr.h
 create mode 100644 sysdeps/powerpc/string-fza.h
 create mode 100644 sysdeps/sh/string-fzb.h

-- 
2.34.1


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH v5 01/17] Parameterize op_t from memcopy.h
  2022-09-19 19:59 [PATCH v5 00/17] Improve generic string routines Adhemerval Zanella
@ 2022-09-19 19:59 ` Adhemerval Zanella
  2022-09-19 19:59 ` [PATCH v5 02/17] Parameterize OP_T_THRES " Adhemerval Zanella
                   ` (16 subsequent siblings)
  17 siblings, 0 replies; 55+ messages in thread
From: Adhemerval Zanella @ 2022-09-19 19:59 UTC (permalink / raw)
  To: libc-alpha; +Cc: Richard Henderson

From: Richard Henderson <rth@twiddle.net>

It moves the op_t definition out to a specific header, adds the
'may_alias' attribute, and cleans up its duplicated definitions.

Checked with a build and make check with run-built-tests=no for all
major Linux ABIs.

Co-authored-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
---
 string/memcmp.c                 |  1 -
 sysdeps/generic/memcopy.h       |  6 ++----
 sysdeps/generic/string-optype.h | 31 +++++++++++++++++++++++++++++++
 3 files changed, 33 insertions(+), 5 deletions(-)
 create mode 100644 sysdeps/generic/string-optype.h

diff --git a/string/memcmp.c b/string/memcmp.c
index 40029474e6..6a9ceb8ac3 100644
--- a/string/memcmp.c
+++ b/string/memcmp.c
@@ -46,7 +46,6 @@
 /* Type to use for aligned memory operations.
    This should normally be the biggest type supported by a single load
    and store.  Must be an unsigned type.  */
-# define op_t	unsigned long int
 # define OPSIZ	(sizeof (op_t))
 
 /* Threshold value for when to enter the unrolled loops.  */
diff --git a/sysdeps/generic/memcopy.h b/sysdeps/generic/memcopy.h
index 251632e8ae..efe5f2475d 100644
--- a/sysdeps/generic/memcopy.h
+++ b/sysdeps/generic/memcopy.h
@@ -55,10 +55,8 @@
      [I fail to understand.  I feel stupid.  --roland]
 */
 
-/* Type to use for aligned memory operations.
-   This should normally be the biggest type supported by a single load
-   and store.  */
-#define	op_t	unsigned long int
+/* Type to use for aligned memory operations.  */
+#include <string-optype.h>
 #define OPSIZ	(sizeof (op_t))
 
 /* Type to use for unaligned operations.  */
diff --git a/sysdeps/generic/string-optype.h b/sysdeps/generic/string-optype.h
new file mode 100644
index 0000000000..fb6d67a19e
--- /dev/null
+++ b/sysdeps/generic/string-optype.h
@@ -0,0 +1,31 @@
+/* Define a type to use for word access.  Generic version.
+   Copyright (C) 2022 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef _STRING_OPTYPE_H
+#define _STRING_OPTYPE_H 1
+
+/* Use the existing parameterization from gmp as a default.  */
+#include <gmp-mparam.h>
+
+#ifdef _LONG_LONG_LIMB
+typedef unsigned long long int __attribute__ ((__may_alias__)) op_t;
+#else
+typedef unsigned long int __attribute__ ((__may_alias__)) op_t;
+#endif
+
+#endif /* string-optype.h */
-- 
2.34.1



* [PATCH v5 02/17] Parameterize OP_T_THRES from memcopy.h
  2022-09-19 19:59 [PATCH v5 00/17] Improve generic string routines Adhemerval Zanella
  2022-09-19 19:59 ` [PATCH v5 01/17] Parameterize op_t from memcopy.h Adhemerval Zanella
@ 2022-09-19 19:59 ` Adhemerval Zanella
  2022-09-20 10:49   ` Carlos O'Donell
  2022-09-19 19:59 ` [PATCH v5 03/17] Add string-maskoff.h generic header Adhemerval Zanella
                   ` (15 subsequent siblings)
  17 siblings, 1 reply; 55+ messages in thread
From: Adhemerval Zanella @ 2022-09-19 19:59 UTC (permalink / raw)
  To: libc-alpha; +Cc: Richard Henderson

From: Richard Henderson <rth@twiddle.net>

It moves OP_T_THRES out of memcopy.h to its own header and adjusts
each architecture that redefines it.

Checked with a build and make check with run-built-tests=no for all
major Linux ABIs.

Co-authored-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
---
 string/memcmp.c                            |  3 ---
 sysdeps/generic/memcopy.h                  |  4 +---
 sysdeps/generic/string-opthr.h             | 25 ++++++++++++++++++++++
 sysdeps/i386/memcopy.h                     |  3 ---
 sysdeps/i386/string-opthr.h                | 25 ++++++++++++++++++++++
 sysdeps/m68k/memcopy.h                     |  3 ---
 sysdeps/powerpc/powerpc32/power4/memcopy.h |  5 -----
 7 files changed, 51 insertions(+), 17 deletions(-)
 create mode 100644 sysdeps/generic/string-opthr.h
 create mode 100644 sysdeps/i386/string-opthr.h

diff --git a/string/memcmp.c b/string/memcmp.c
index 6a9ceb8ac3..7c4606c2d0 100644
--- a/string/memcmp.c
+++ b/string/memcmp.c
@@ -48,9 +48,6 @@
    and store.  Must be an unsigned type.  */
 # define OPSIZ	(sizeof (op_t))
 
-/* Threshold value for when to enter the unrolled loops.  */
-# define OP_T_THRES	16
-
 /* Type to use for unaligned operations.  */
 typedef unsigned char byte;
 
diff --git a/sysdeps/generic/memcopy.h b/sysdeps/generic/memcopy.h
index efe5f2475d..a6baa4dfbb 100644
--- a/sysdeps/generic/memcopy.h
+++ b/sysdeps/generic/memcopy.h
@@ -57,6 +57,7 @@
 
 /* Type to use for aligned memory operations.  */
 #include <string-optype.h>
+#include <string-opthr.h>
 #define OPSIZ	(sizeof (op_t))
 
 /* Type to use for unaligned operations.  */
@@ -188,9 +189,6 @@ extern void _wordcopy_bwd_dest_aligned (long int, long int, size_t)
 
 #endif
 
-/* Threshold value for when to enter the unrolled loops.  */
-#define	OP_T_THRES	16
-
 /* Set to 1 if memcpy is safe to use for forward-copying memmove with
    overlapping addresses.  This is 0 by default because memcpy implementations
    are generally not safe for overlapping addresses.  */
diff --git a/sysdeps/generic/string-opthr.h b/sysdeps/generic/string-opthr.h
new file mode 100644
index 0000000000..eabd9fd669
--- /dev/null
+++ b/sysdeps/generic/string-opthr.h
@@ -0,0 +1,25 @@
+/* Define a threshold for word access.  Generic version.
+   Copyright (C) 2022 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef _STRING_OPTHR_H
+#define _STRING_OPTHR_H 1
+
+/* Threshold value for when to enter the unrolled loops.  */
+#define OP_T_THRES	16
+
+#endif /* string-opthr.h */
diff --git a/sysdeps/i386/memcopy.h b/sysdeps/i386/memcopy.h
index 8cbf182096..66f5665f82 100644
--- a/sysdeps/i386/memcopy.h
+++ b/sysdeps/i386/memcopy.h
@@ -18,9 +18,6 @@
 
 #include <sysdeps/generic/memcopy.h>
 
-#undef	OP_T_THRES
-#define	OP_T_THRES	8
-
 #undef	BYTE_COPY_FWD
 #define BYTE_COPY_FWD(dst_bp, src_bp, nbytes)				      \
   do {									      \
diff --git a/sysdeps/i386/string-opthr.h b/sysdeps/i386/string-opthr.h
new file mode 100644
index 0000000000..ed3e4b2ddb
--- /dev/null
+++ b/sysdeps/i386/string-opthr.h
@@ -0,0 +1,25 @@
+/* Define a threshold for word access.  i386 version.
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef I386_STRING_OPTHR_H
+#define I386_STRING_OPTHR_H 1
+
+/* Threshold value for when to enter the unrolled loops.  */
+#define OP_T_THRES	8
+
+#endif /* I386_STRING_OPTHR_H */
diff --git a/sysdeps/m68k/memcopy.h b/sysdeps/m68k/memcopy.h
index cf147f2c4a..3777baac21 100644
--- a/sysdeps/m68k/memcopy.h
+++ b/sysdeps/m68k/memcopy.h
@@ -20,9 +20,6 @@
 
 #if	defined(__mc68020__) || defined(mc68020)
 
-#undef	OP_T_THRES
-#define	OP_T_THRES	16
-
 /* WORD_COPY_FWD and WORD_COPY_BWD are not symmetric on the 68020,
    because of its weird instruction overlap characteristics.  */
 
diff --git a/sysdeps/powerpc/powerpc32/power4/memcopy.h b/sysdeps/powerpc/powerpc32/power4/memcopy.h
index a98f6662d8..d27caa2277 100644
--- a/sysdeps/powerpc/powerpc32/power4/memcopy.h
+++ b/sysdeps/powerpc/powerpc32/power4/memcopy.h
@@ -50,11 +50,6 @@
      [I fail to understand.  I feel stupid.  --roland]
 */
 
-
-/* Threshold value for when to enter the unrolled loops.  */
-#undef	OP_T_THRES
-#define OP_T_THRES 16
-
 /* Copy exactly NBYTES bytes from SRC_BP to DST_BP,
    without any assumptions about alignment of the pointers.  */
 #undef BYTE_COPY_FWD
-- 
2.34.1



* [PATCH v5 03/17] Add string-maskoff.h generic header
  2022-09-19 19:59 [PATCH v5 00/17] Improve generic string routines Adhemerval Zanella
  2022-09-19 19:59 ` [PATCH v5 01/17] Parameterize op_t from memcopy.h Adhemerval Zanella
  2022-09-19 19:59 ` [PATCH v5 02/17] Parameterize OP_T_THRES " Adhemerval Zanella
@ 2022-09-19 19:59 ` Adhemerval Zanella
  2022-09-20 11:43   ` Carlos O'Donell
  2023-01-05 22:49   ` Noah Goldstein
  2022-09-19 19:59 ` [PATCH v5 04/17] Add string vectorized find and detection functions Adhemerval Zanella
                   ` (14 subsequent siblings)
  17 siblings, 2 replies; 55+ messages in thread
From: Adhemerval Zanella @ 2022-09-19 19:59 UTC (permalink / raw)
  To: libc-alpha

Macros to operate on unaligned accesses for string operations:

  - create_mask: create a mask, based on the pointer alignment, that
    sets up non-zero bytes before the beginning of the word, so a
    following operation (such as find zero) can ignore these bytes.

  - highbit_mask: clear the high bit of each byte of a mask, keeping
    only the low 7 bits of each byte.

These macros are meant to be used in optimized, vectorized string
implementations.
---
 sysdeps/generic/string-maskoff.h | 73 ++++++++++++++++++++++++++++++++
 1 file changed, 73 insertions(+)
 create mode 100644 sysdeps/generic/string-maskoff.h

diff --git a/sysdeps/generic/string-maskoff.h b/sysdeps/generic/string-maskoff.h
new file mode 100644
index 0000000000..831647bda6
--- /dev/null
+++ b/sysdeps/generic/string-maskoff.h
@@ -0,0 +1,73 @@
+/* Mask off bits.  Generic C version.
+   Copyright (C) 2022 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef _STRING_MASKOFF_H
+#define _STRING_MASKOFF_H 1
+
+#include <endian.h>
+#include <limits.h>
+#include <stdint.h>
+#include <string-optype.h>
+
+/* Provide a mask based on the pointer alignment that sets up non-zero
+   bytes before the beginning of the word.  It is used to mask off
+   undesirable bits from an aligned read from an unaligned pointer.
+   For instance, on a 64-bit machine with a pointer alignment of 3 the
+   function returns 0x0000000000ffffff for LE and 0xffffff0000000000
+   for BE (meaning to mask off the initial 3 bytes).  */
+static inline op_t
+create_mask (uintptr_t i)
+{
+  i = i % sizeof (op_t);
+  if (__BYTE_ORDER == __LITTLE_ENDIAN)
+    return ~(((op_t)-1) << (i * CHAR_BIT));
+  else
+    return ~(((op_t)-1) >> (i * CHAR_BIT));
+}
+
+/* Set up a word with each byte being C_IN.  For instance, on a 64-bit
+   machine with input 0xce the function returns 0xcececececececece.  */
+static inline op_t
+repeat_bytes (unsigned char c_in)
+{
+  return ((op_t)-1 / 0xff) * c_in;
+}
+
+/* Based on a mask created by 'create_mask', clear the high bit of
+   each byte in the mask.  It is used to mask off undesirable bits
+   from an aligned read from an unaligned pointer, while taking care
+   not to generate false matches on the bytes that were masked off.
+   For instance, on a 64-bit machine with a mask created from a
+   pointer with an alignment of 3 (0x0000000000ffffff) the function
+   returns 0x7f7f7f0000000000 for BE and 0x00000000007f7f7f for LE.  */
+static inline op_t
+highbit_mask (op_t m)
+{
+  return m & repeat_bytes (0x7f);
+}
+
+/* Return the address of the op_t word containing the address P.  For
+   instance on address 0x0011223344556677 and op_t with size of 8,
+   it returns 0x0011223344556670.  */
+static inline op_t *
+word_containing (char const *p)
+{
+  return (op_t *) (p - (uintptr_t) p % sizeof (op_t));
+}
+
+#endif /* _STRING_MASKOFF_H  */
-- 
2.34.1



* [PATCH v5 04/17] Add string vectorized find and detection functions
  2022-09-19 19:59 [PATCH v5 00/17] Improve generic string routines Adhemerval Zanella
                   ` (2 preceding siblings ...)
  2022-09-19 19:59 ` [PATCH v5 03/17] Add string-maskoff.h generic header Adhemerval Zanella
@ 2022-09-19 19:59 ` Adhemerval Zanella
  2023-01-05 22:53   ` Noah Goldstein
  2023-01-05 23:04   ` Noah Goldstein
  2022-09-19 19:59 ` [PATCH v5 05/17] string: Improve generic strlen Adhemerval Zanella
                   ` (13 subsequent siblings)
  17 siblings, 2 replies; 55+ messages in thread
From: Adhemerval Zanella @ 2022-09-19 19:59 UTC (permalink / raw)
  To: libc-alpha; +Cc: Richard Henderson

This patch adds generic string find and detection functions meant to
be used in the generic vectorized string implementations.  The idea is
to decompose the basic string operations so each architecture can
reimplement them if it provides specialized hardware instructions.

The 'string-fza.h' header provides zero byte detection functions
(find_zero_low, find_zero_all, find_eq_low, find_eq_all,
find_zero_eq_low, find_zero_eq_all, find_zero_ne_low, and
find_zero_ne_all).  They are used by the functions provided by both
'string-fzb.h' and 'string-fzi.h'.

The 'string-fzb.h' provides boolean zero byte detection with the
functions:

  - has_zero: determine if any byte within a word is zero.
  - has_eq: determine if any byte is equal between two words.
  - has_zero_eq: determine if any byte within a word is zero or if
    any byte is equal between two words.

The 'string-fzi.h' provides zero byte detection along with its positions:

  - index_first_zero: return index of first zero byte within a word.
  - index_first_eq: return index of first equal byte between two words.
  - index_first_zero_eq: return index of first zero byte within a word
    or first equal byte between two words.
  - index_first_zero_ne: return index of first zero byte within a word
    or first byte not equal between two words.
  - index_last_zero: return index of last zero byte within a word.
  - index_last_eq: return index of last equal byte between two words.

Co-authored-by: Richard Henderson <rth@twiddle.net>
---
 sysdeps/generic/string-extbyte.h |  37 ++++++++++
 sysdeps/generic/string-fza.h     | 106 +++++++++++++++++++++++++++
 sysdeps/generic/string-fzb.h     |  49 +++++++++++++
 sysdeps/generic/string-fzi.h     | 120 +++++++++++++++++++++++++++++++
 4 files changed, 312 insertions(+)
 create mode 100644 sysdeps/generic/string-extbyte.h
 create mode 100644 sysdeps/generic/string-fza.h
 create mode 100644 sysdeps/generic/string-fzb.h
 create mode 100644 sysdeps/generic/string-fzi.h

diff --git a/sysdeps/generic/string-extbyte.h b/sysdeps/generic/string-extbyte.h
new file mode 100644
index 0000000000..c8fecd259f
--- /dev/null
+++ b/sysdeps/generic/string-extbyte.h
@@ -0,0 +1,37 @@
+/* Extract a byte from a memory word.  Generic C version.
+   Copyright (C) 2022 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef _STRING_EXTBYTE_H
+#define _STRING_EXTBYTE_H 1
+
+#include <limits.h>
+#include <endian.h>
+#include <string-optype.h>
+
+/* Extract the byte at index IDX from word X, with index 0 being the
+   least significant byte.  */
+static inline unsigned char
+extractbyte (op_t x, unsigned int idx)
+{
+  if (__BYTE_ORDER == __LITTLE_ENDIAN)
+    return x >> (idx * CHAR_BIT);
+  else
+    return x >> (sizeof (x) - 1 - idx) * CHAR_BIT;
+}
+
+#endif /* _STRING_EXTBYTE_H */
diff --git a/sysdeps/generic/string-fza.h b/sysdeps/generic/string-fza.h
new file mode 100644
index 0000000000..54be34e5f0
--- /dev/null
+++ b/sysdeps/generic/string-fza.h
@@ -0,0 +1,106 @@
+/* Basic zero byte detection.  Generic C version.
+   Copyright (C) 2022 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef _STRING_FZA_H
+#define _STRING_FZA_H 1
+
+#include <limits.h>
+#include <string-optype.h>
+#include <string-maskoff.h>
+
+/* This function returns non-zero if any byte in X is zero.
+   More specifically, at least one bit is set within the least
+   significant byte that was zero; other bytes are indeterminate.  */
+static inline op_t
+find_zero_low (op_t x)
+{
+  /* This expression comes from
+       https://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord
+     Subtracting 1 sets 0x80 in a byte that was 0; anding ~x clears
+     0x80 in a byte that was >= 128; anding 0x80 isolates that test bit.  */
+  op_t lsb = repeat_bytes (0x01);
+  op_t msb = repeat_bytes (0x80);
+  return (x - lsb) & ~x & msb;
+}
+
+/* This function returns at least one bit set within every byte of X that
+   is zero.  The result is exact in that, unlike find_zero_low, all bytes
+   are determinate.  This is usually used for finding the index of the
+   most significant byte that was zero.  */
+static inline op_t
+find_zero_all (op_t x)
+{
+  /* For each byte, find not-zero by
+     (0) And 0x7f so that we cannot carry between bytes,
+     (1) Add 0x7f so that non-zero carries into 0x80,
+     (2) Or in the original byte (which might have had 0x80 set).
+     Then invert and mask such that 0x80 is set iff that byte was zero.  */
+  op_t m = ((op_t)-1 / 0xff) * 0x7f;
+  return ~(((x & m) + m) | x | m);
+}
+
+/* With similar caveats, identify bytes that are equal between X1 and X2.  */
+static inline op_t
+find_eq_low (op_t x1, op_t x2)
+{
+  return find_zero_low (x1 ^ x2);
+}
+
+static inline op_t
+find_eq_all (op_t x1, op_t x2)
+{
+  return find_zero_all (x1 ^ x2);
+}
+
+/* With similar caveats, identify zero bytes in X1 and bytes that are
+   equal between X1 and X2.  */
+static inline op_t
+find_zero_eq_low (op_t x1, op_t x2)
+{
+  return find_zero_low (x1) | find_zero_low (x1 ^ x2);
+}
+
+static inline op_t
+find_zero_eq_all (op_t x1, op_t x2)
+{
+  return find_zero_all (x1) | find_zero_all (x1 ^ x2);
+}
+
+/* With similar caveats, identify zero bytes in X1 and bytes that are
+   not equal between X1 and X2.  */
+static inline op_t
+find_zero_ne_low (op_t x1, op_t x2)
+{
+  op_t m = repeat_bytes (0x7f);
+  op_t eq = x1 ^ x2;
+  op_t nz1 = (x1 + m) | x1;	/* msb set if byte not zero.  */
+  op_t ne2 = (eq + m) | eq;	/* msb set if byte not equal.  */
+  return (ne2 | ~nz1) & ~m;	/* msb set if x1 zero or x2 not equal.  */
+}
+
+static inline op_t
+find_zero_ne_all (op_t x1, op_t x2)
+{
+  op_t m = repeat_bytes (0x7f);
+  op_t eq = x1 ^ x2;
+  op_t nz1 = ((x1 & m) + m) | x1;
+  op_t ne2 = ((eq & m) + m) | eq;
+  return (ne2 | ~nz1) & ~m;
+}
+
+#endif /* _STRING_FZA_H */
diff --git a/sysdeps/generic/string-fzb.h b/sysdeps/generic/string-fzb.h
new file mode 100644
index 0000000000..f1c0ae0922
--- /dev/null
+++ b/sysdeps/generic/string-fzb.h
@@ -0,0 +1,49 @@
+/* Zero byte detection, boolean.  Generic C version.
+   Copyright (C) 2022 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef _STRING_FZB_H
+#define _STRING_FZB_H 1
+
+#include <endian.h>
+#include <string-fza.h>
+
+/* Determine if any byte within X is zero.  This is a pure boolean test.  */
+
+static inline _Bool
+has_zero (op_t x)
+{
+  return find_zero_low (x) != 0;
+}
+
+/* Likewise, but for byte equality between X1 and X2.  */
+
+static inline _Bool
+has_eq (op_t x1, op_t x2)
+{
+  return find_eq_low (x1, x2) != 0;
+}
+
+/* Likewise, but for zeros in X1 and equal bytes between X1 and X2.  */
+
+static inline _Bool
+has_zero_eq (op_t x1, op_t x2)
+{
+  return find_zero_eq_low (x1, x2);
+}
+
+#endif /* _STRING_FZB_H */
diff --git a/sysdeps/generic/string-fzi.h b/sysdeps/generic/string-fzi.h
new file mode 100644
index 0000000000..888e1b8baa
--- /dev/null
+++ b/sysdeps/generic/string-fzi.h
@@ -0,0 +1,120 @@
+/* Zero byte detection; indexes.  Generic C version.
+   Copyright (C) 2022 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef _STRING_FZI_H
+#define _STRING_FZI_H 1
+
+#include <limits.h>
+#include <endian.h>
+#include <string-fza.h>
+#include <gmp.h>
+#include <stdlib/gmp-impl.h>
+#include <stdlib/longlong.h>
+
+/* A subroutine for the index_zero functions.  Given a test word C,
+   return the index (in memory order) of the first byte that is
+   non-zero.  */
+static inline unsigned int
+index_first_ (op_t c)
+{
+  int r;
+  if (__BYTE_ORDER == __LITTLE_ENDIAN)
+    count_trailing_zeros (r, c);
+  else
+    count_leading_zeros (r, c);
+  return r / CHAR_BIT;
+}
+
+/* Similarly, but return the (memory order) index of the last byte that is
+   non-zero.  */
+static inline unsigned int
+index_last_ (op_t c)
+{
+  int r;
+  if (__BYTE_ORDER == __LITTLE_ENDIAN)
+    count_leading_zeros (r, c);
+  else
+    count_trailing_zeros (r, c);
+  return sizeof (op_t) - 1 - (r / CHAR_BIT);
+}
+
+/* Given a word X that is known to contain a zero byte, return the index of
+   the first such within the word in memory order.  */
+static inline unsigned int
+index_first_zero (op_t x)
+{
+  if (__BYTE_ORDER == __LITTLE_ENDIAN)
+    x = find_zero_low (x);
+  else
+    x = find_zero_all (x);
+  return index_first_ (x);
+}
+
+/* Similarly, but perform the search for byte equality between X1 and X2.  */
+static inline unsigned int
+index_first_eq (op_t x1, op_t x2)
+{
+  if (__BYTE_ORDER == __LITTLE_ENDIAN)
+    x1 = find_eq_low (x1, x2);
+  else
+    x1 = find_eq_all (x1, x2);
+  return index_first_ (x1);
+}
+
+/* Similarly, but perform the search for zero within X1 or equality between
+   X1 and X2.  */
+static inline unsigned int
+index_first_zero_eq (op_t x1, op_t x2)
+{
+  if (__BYTE_ORDER == __LITTLE_ENDIAN)
+    x1 = find_zero_eq_low (x1, x2);
+  else
+    x1 = find_zero_eq_all (x1, x2);
+  return index_first_ (x1);
+}
+
+/* Similarly, but perform the search for zero within X1 or inequality between
+   X1 and X2.  */
+static inline unsigned int
+index_first_zero_ne (op_t x1, op_t x2)
+{
+  if (__BYTE_ORDER == __LITTLE_ENDIAN)
+    x1 = find_zero_ne_low (x1, x2);
+  else
+    x1 = find_zero_ne_all (x1, x2);
+  return index_first_ (x1);
+}
+
+/* Similarly, but search for the last zero within X.  */
+static inline unsigned int
+index_last_zero (op_t x)
+{
+  if (__BYTE_ORDER == __LITTLE_ENDIAN)
+    x = find_zero_all (x);
+  else
+    x = find_zero_low (x);
+  return index_last_ (x);
+}
+
+static inline unsigned int
+index_last_eq (op_t x1, op_t x2)
+{
+  return index_last_zero (x1 ^ x2);
+}
+
+#endif /* _STRING_FZI_H */
-- 
2.34.1


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH v5 05/17] string: Improve generic strlen
  2022-09-19 19:59 [PATCH v5 00/17] Improve generic string routines Adhemerval Zanella
                   ` (3 preceding siblings ...)
  2022-09-19 19:59 ` [PATCH v5 04/17] Add string vectorized find and detection functions Adhemerval Zanella
@ 2022-09-19 19:59 ` Adhemerval Zanella
  2022-09-19 19:59 ` [PATCH v5 06/17] string: Improve generic strnlen Adhemerval Zanella
                   ` (12 subsequent siblings)
  17 siblings, 0 replies; 55+ messages in thread
From: Adhemerval Zanella @ 2022-09-19 19:59 UTC (permalink / raw)
  To: libc-alpha; +Cc: Richard Henderson

The new algorithm has the following key differences:

  - Reads the first word unaligned and uses the string-maskoff
    functions to remove unwanted data.  This strategy follows the
    arch-specific optimizations used on powerpc, sparc, and SH.

  - Uses the parametrized has_zero and index_first_zero functions.

Checked on x86_64-linux-gnu, i686-linux-gnu, powerpc-linux-gnu,
and powerpc64-linux-gnu by removing the arch-specific assembly
implementation and disabling multi-arch (it covers both LE and BE
for 64 and 32 bits).
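
The has_zero and index_first_zero helpers come from the string-fz{a,b,i}
headers added earlier in the series; this hunk only includes them.  As a
rough, self-contained sketch of the resulting loop shape, using the
classic (x - 0x01..01) & ~x & 0x80..80 zero-byte test as a little-endian
stand-in for the arch-overridable primitives (the names find_zero_low,
index_first_zero, and toy_strlen below are illustrative, not the actual
sysdeps definitions):

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

typedef unsigned long op_t;

/* Flag the zero bytes of X: the test sets the high bit of flagged
   bytes, and its lowest set bit always lands in the first
   (little-endian) zero byte, since borrows only propagate upward.  */
static op_t
find_zero_low (op_t x)
{
  op_t lo = (op_t) -1 / 0xff;                   /* 0x0101...01 */
  return (x - lo) & ~x & (lo << (CHAR_BIT - 1));
}

/* Index of the first zero byte, little-endian memory order.  */
static unsigned int
index_first_zero (op_t x)
{
  return __builtin_ctzl (find_zero_low (x)) / CHAR_BIT;
}

/* Word-at-a-time strlen over op_t-aligned input; the real patch also
   masks the first word so that unaligned starts work.  */
static size_t
toy_strlen (const char *str)
{
  const op_t *word_ptr = (const op_t *) str;    /* assumes alignment */
  op_t word = *word_ptr;
  while (find_zero_low (word) == 0)
    word = *++word_ptr;
  return (const char *) word_ptr + index_first_zero (word) - str;
}
```

(Little-endian only; the big-endian paths in the series use the
find_zero_all variant plus count_leading_zeros instead.)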

Co-authored-by: Richard Henderson  <rth@twiddle.net>
---
 string/strlen.c         | 90 +++++++++--------------------------------
 sysdeps/s390/strlen-c.c | 10 +++--
 2 files changed, 26 insertions(+), 74 deletions(-)

diff --git a/string/strlen.c b/string/strlen.c
index 54f3fb8167..ed71c22414 100644
--- a/string/strlen.c
+++ b/string/strlen.c
@@ -17,84 +17,34 @@
 
 #include <string.h>
 #include <stdlib.h>
-
-#undef strlen
-
-#ifndef STRLEN
-# define STRLEN strlen
+#include <stdint.h>
+#include <string-fza.h>
+#include <string-fzb.h>
+#include <string-fzi.h>
+#include <string-maskoff.h>
+
+#ifdef STRLEN
+# define __strlen STRLEN
 #endif
 
 /* Return the length of the null-terminated string STR.  Scan for
    the null terminator quickly by testing four bytes at a time.  */
 size_t
-STRLEN (const char *str)
+__strlen (const char *str)
 {
-  const char *char_ptr;
-  const unsigned long int *longword_ptr;
-  unsigned long int longword, himagic, lomagic;
-
-  /* Handle the first few characters by reading one character at a time.
-     Do this until CHAR_PTR is aligned on a longword boundary.  */
-  for (char_ptr = str; ((unsigned long int) char_ptr
-			& (sizeof (longword) - 1)) != 0;
-       ++char_ptr)
-    if (*char_ptr == '\0')
-      return char_ptr - str;
-
-  /* All these elucidatory comments refer to 4-byte longwords,
-     but the theory applies equally well to 8-byte longwords.  */
+  /* Align pointer to sizeof op_t.  */
+  const uintptr_t s_int = (uintptr_t) str;
+  const op_t *word_ptr = word_containing (str);
 
-  longword_ptr = (unsigned long int *) char_ptr;
+  /* Read and mask the first word.  */
+  op_t word = *word_ptr | create_mask (s_int);
 
-  /* Computing (longword - lomagic) sets the high bit of any corresponding
-     byte that is either zero or greater than 0x80.  The latter case can be
-     filtered out by computing (~longword & himagic).  The final result
-     will always be non-zero if one of the bytes of longword is zero.  */
-  himagic = 0x80808080L;
-  lomagic = 0x01010101L;
-  if (sizeof (longword) > 4)
-    {
-      /* 64-bit version of the magic.  */
-      /* Do the shift in two steps to avoid a warning if long has 32 bits.  */
-      himagic = ((himagic << 16) << 16) | himagic;
-      lomagic = ((lomagic << 16) << 16) | lomagic;
-    }
-  if (sizeof (longword) > 8)
-    abort ();
+  while (! has_zero (word))
+    word = *++word_ptr;
 
-  /* Instead of the traditional loop which tests each character,
-     we will test a longword at a time.  The tricky part is testing
-     if *any of the four* bytes in the longword in question are zero.  */
-  for (;;)
-    {
-      longword = *longword_ptr++;
-
-      if (((longword - lomagic) & ~longword & himagic) != 0)
-	{
-	  /* Which of the bytes was the zero?  */
-
-	  const char *cp = (const char *) (longword_ptr - 1);
-
-	  if (cp[0] == 0)
-	    return cp - str;
-	  if (cp[1] == 0)
-	    return cp - str + 1;
-	  if (cp[2] == 0)
-	    return cp - str + 2;
-	  if (cp[3] == 0)
-	    return cp - str + 3;
-	  if (sizeof (longword) > 4)
-	    {
-	      if (cp[4] == 0)
-		return cp - str + 4;
-	      if (cp[5] == 0)
-		return cp - str + 5;
-	      if (cp[6] == 0)
-		return cp - str + 6;
-	      if (cp[7] == 0)
-		return cp - str + 7;
-	    }
-	}
-    }
+  return ((const char *) word_ptr) + index_first_zero (word) - str;
 }
+#ifndef STRLEN
+weak_alias (__strlen, strlen)
 libc_hidden_builtin_def (strlen)
+#endif
diff --git a/sysdeps/s390/strlen-c.c b/sysdeps/s390/strlen-c.c
index c96767e329..06413c9a57 100644
--- a/sysdeps/s390/strlen-c.c
+++ b/sysdeps/s390/strlen-c.c
@@ -21,12 +21,14 @@
 #if HAVE_STRLEN_C
 # if HAVE_STRLEN_IFUNC
 #  define STRLEN STRLEN_C
+# endif
+
+# include <string/strlen.c>
+
+# if HAVE_STRLEN_IFUNC
 #  if defined SHARED && IS_IN (libc)
-#   undef libc_hidden_builtin_def
-#   define libc_hidden_builtin_def(name)		\
-  __hidden_ver1 (__strlen_c, __GI_strlen, __strlen_c);
+__hidden_ver1 (__strlen_c, __GI_strlen, __strlen_c);
 #  endif
 # endif
 
-# include <string/strlen.c>
 #endif
-- 
2.34.1



* [PATCH v5 06/17] string: Improve generic strnlen
  2022-09-19 19:59 [PATCH v5 00/17] Improve generic string routines Adhemerval Zanella
                   ` (4 preceding siblings ...)
  2022-09-19 19:59 ` [PATCH v5 05/17] string: Improve generic strlen Adhemerval Zanella
@ 2022-09-19 19:59 ` Adhemerval Zanella
  2022-09-19 19:59 ` [PATCH v5 07/17] string: Improve generic strchr Adhemerval Zanella
                   ` (11 subsequent siblings)
  17 siblings, 0 replies; 55+ messages in thread
From: Adhemerval Zanella @ 2022-09-19 19:59 UTC (permalink / raw)
  To: libc-alpha; +Cc: Richard Henderson

With an optimized memchr, the new strnlen implementation simply calls
memchr and adjusts the result pointer value.

It also cleans up the multiple inclusion by leaving it to the ifunc
implementations to handle the weak_alias and libc_hidden_def overrides.
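
The two-line shape this reduces to (the same logic as the patched
string/strnlen.c, modulo glibc's aliasing machinery; the wrapper name
here is illustrative):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* strnlen as a thin wrapper over memchr: if memchr finds the NUL
   terminator within MAXLEN bytes, its offset is the length;
   otherwise the length is at least MAXLEN.  */
static size_t
strnlen_via_memchr (const char *str, size_t maxlen)
{
  const char *found = memchr (str, '\0', maxlen);
  return found ? (size_t) (found - str) : maxlen;
}
```

This makes strnlen inherit whatever memchr optimization a port
provides, which is the point of the patch.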

Co-authored-by: Richard Henderson  <rth@twiddle.net>
---
 string/strnlen.c                              | 137 +-----------------
 sysdeps/i386/i686/multiarch/strnlen-c.c       |  14 +-
 .../power4/multiarch/strnlen-ppc32.c          |  14 +-
 sysdeps/s390/strnlen-c.c                      |  14 +-
 4 files changed, 27 insertions(+), 152 deletions(-)

diff --git a/string/strnlen.c b/string/strnlen.c
index e463ed79bf..a6205741dc 100644
--- a/string/strnlen.c
+++ b/string/strnlen.c
@@ -1,10 +1,6 @@
 /* Find the length of STRING, but scan at most MAXLEN characters.
    Copyright (C) 1991-2022 Free Software Foundation, Inc.
 
-   Based on strlen written by Torbjorn Granlund (tege@sics.se),
-   with help from Dan Sahlin (dan@sics.se);
-   commentary by Jim Blandy (jimb@ai.mit.edu).
-
    The GNU C Library is free software; you can redistribute it and/or
    modify it under the terms of the GNU Lesser General Public License as
    published by the Free Software Foundation; either version 2.1 of the
@@ -20,7 +16,6 @@
    not, see <https://www.gnu.org/licenses/>.  */
 
 #include <string.h>
-#include <stdlib.h>
 
 /* Find the length of S, but scan at most MAXLEN characters.  If no
    '\0' terminator is found in that many characters, return MAXLEN.  */
@@ -32,134 +27,12 @@
 size_t
 __strnlen (const char *str, size_t maxlen)
 {
-  const char *char_ptr, *end_ptr = str + maxlen;
-  const unsigned long int *longword_ptr;
-  unsigned long int longword, himagic, lomagic;
-
-  if (maxlen == 0)
-    return 0;
-
-  if (__glibc_unlikely (end_ptr < str))
-    end_ptr = (const char *) ~0UL;
-
-  /* Handle the first few characters by reading one character at a time.
-     Do this until CHAR_PTR is aligned on a longword boundary.  */
-  for (char_ptr = str; ((unsigned long int) char_ptr
-			& (sizeof (longword) - 1)) != 0;
-       ++char_ptr)
-    if (*char_ptr == '\0')
-      {
-	if (char_ptr > end_ptr)
-	  char_ptr = end_ptr;
-	return char_ptr - str;
-      }
-
-  /* All these elucidatory comments refer to 4-byte longwords,
-     but the theory applies equally well to 8-byte longwords.  */
-
-  longword_ptr = (unsigned long int *) char_ptr;
-
-  /* Bits 31, 24, 16, and 8 of this number are zero.  Call these bits
-     the "holes."  Note that there is a hole just to the left of
-     each byte, with an extra at the end:
-
-     bits:  01111110 11111110 11111110 11111111
-     bytes: AAAAAAAA BBBBBBBB CCCCCCCC DDDDDDDD
-
-     The 1-bits make sure that carries propagate to the next 0-bit.
-     The 0-bits provide holes for carries to fall into.  */
-  himagic = 0x80808080L;
-  lomagic = 0x01010101L;
-  if (sizeof (longword) > 4)
-    {
-      /* 64-bit version of the magic.  */
-      /* Do the shift in two steps to avoid a warning if long has 32 bits.  */
-      himagic = ((himagic << 16) << 16) | himagic;
-      lomagic = ((lomagic << 16) << 16) | lomagic;
-    }
-  if (sizeof (longword) > 8)
-    abort ();
-
-  /* Instead of the traditional loop which tests each character,
-     we will test a longword at a time.  The tricky part is testing
-     if *any of the four* bytes in the longword in question are zero.  */
-  while (longword_ptr < (unsigned long int *) end_ptr)
-    {
-      /* We tentatively exit the loop if adding MAGIC_BITS to
-	 LONGWORD fails to change any of the hole bits of LONGWORD.
-
-	 1) Is this safe?  Will it catch all the zero bytes?
-	 Suppose there is a byte with all zeros.  Any carry bits
-	 propagating from its left will fall into the hole at its
-	 least significant bit and stop.  Since there will be no
-	 carry from its most significant bit, the LSB of the
-	 byte to the left will be unchanged, and the zero will be
-	 detected.
-
-	 2) Is this worthwhile?  Will it ignore everything except
-	 zero bytes?  Suppose every byte of LONGWORD has a bit set
-	 somewhere.  There will be a carry into bit 8.  If bit 8
-	 is set, this will carry into bit 16.  If bit 8 is clear,
-	 one of bits 9-15 must be set, so there will be a carry
-	 into bit 16.  Similarly, there will be a carry into bit
-	 24.  If one of bits 24-30 is set, there will be a carry
-	 into bit 31, so all of the hole bits will be changed.
-
-	 The one misfire occurs when bits 24-30 are clear and bit
-	 31 is set; in this case, the hole at bit 31 is not
-	 changed.  If we had access to the processor carry flag,
-	 we could close this loophole by putting the fourth hole
-	 at bit 32!
-
-	 So it ignores everything except 128's, when they're aligned
-	 properly.  */
-
-      longword = *longword_ptr++;
-
-      if ((longword - lomagic) & himagic)
-	{
-	  /* Which of the bytes was the zero?  If none of them were, it was
-	     a misfire; continue the search.  */
-
-	  const char *cp = (const char *) (longword_ptr - 1);
-
-	  char_ptr = cp;
-	  if (cp[0] == 0)
-	    break;
-	  char_ptr = cp + 1;
-	  if (cp[1] == 0)
-	    break;
-	  char_ptr = cp + 2;
-	  if (cp[2] == 0)
-	    break;
-	  char_ptr = cp + 3;
-	  if (cp[3] == 0)
-	    break;
-	  if (sizeof (longword) > 4)
-	    {
-	      char_ptr = cp + 4;
-	      if (cp[4] == 0)
-		break;
-	      char_ptr = cp + 5;
-	      if (cp[5] == 0)
-		break;
-	      char_ptr = cp + 6;
-	      if (cp[6] == 0)
-		break;
-	      char_ptr = cp + 7;
-	      if (cp[7] == 0)
-		break;
-	    }
-	}
-      char_ptr = end_ptr;
-    }
-
-  if (char_ptr > end_ptr)
-    char_ptr = end_ptr;
-  return char_ptr - str;
+  const char *found = memchr (str, '\0', maxlen);
+  return found ? found - str : maxlen;
 }
+
 #ifndef STRNLEN
-libc_hidden_def (__strnlen)
 weak_alias (__strnlen, strnlen)
-#endif
+libc_hidden_def (__strnlen)
 libc_hidden_def (strnlen)
+#endif
diff --git a/sysdeps/i386/i686/multiarch/strnlen-c.c b/sysdeps/i386/i686/multiarch/strnlen-c.c
index 351e939a93..beb0350d53 100644
--- a/sysdeps/i386/i686/multiarch/strnlen-c.c
+++ b/sysdeps/i386/i686/multiarch/strnlen-c.c
@@ -1,10 +1,10 @@
 #define STRNLEN  __strnlen_ia32
+#include <string/strnlen.c>
+
 #ifdef SHARED
-# undef libc_hidden_def
-# define libc_hidden_def(name)  \
-    __hidden_ver1 (__strnlen_ia32, __GI_strnlen, __strnlen_ia32); \
-    strong_alias (__strnlen_ia32, __strnlen_ia32_1); \
-    __hidden_ver1 (__strnlen_ia32_1, __GI___strnlen, __strnlen_ia32_1);
+/* Alias for internal symbol to avoid PLT generation; it redirects the
+   libc_hidden_def (__strnlen/strnlen) to the default implementation.  */
+__hidden_ver1 (__strnlen_ia32, __GI_strnlen, __strnlen_ia32);
+strong_alias (__strnlen_ia32, __strnlen_ia32_1);
+__hidden_ver1 (__strnlen_ia32_1, __GI___strnlen, __strnlen_ia32_1);
 #endif
-
-#include "string/strnlen.c"
diff --git a/sysdeps/powerpc/powerpc32/power4/multiarch/strnlen-ppc32.c b/sysdeps/powerpc/powerpc32/power4/multiarch/strnlen-ppc32.c
index 576f2bc456..a692f5c3e8 100644
--- a/sysdeps/powerpc/powerpc32/power4/multiarch/strnlen-ppc32.c
+++ b/sysdeps/powerpc/powerpc32/power4/multiarch/strnlen-ppc32.c
@@ -17,12 +17,12 @@
    <https://www.gnu.org/licenses/>.  */
 
 #define STRNLEN  __strnlen_ppc
+#include <string/strnlen.c>
+
 #ifdef SHARED
-# undef libc_hidden_def
-# define libc_hidden_def(name)  \
-    __hidden_ver1 (__strnlen_ppc, __GI_strnlen, __strnlen_ppc); \
-    strong_alias (__strnlen_ppc, __strnlen_ppc_1); \
-    __hidden_ver1 (__strnlen_ppc_1, __GI___strnlen, __strnlen_ppc_1);
+/* Alias for internal symbol to avoid PLT generation; it redirects the
+   libc_hidden_def (__strnlen/strnlen) to the default implementation.  */
+__hidden_ver1 (__strnlen_ppc, __GI_strnlen, __strnlen_ppc);
+strong_alias (__strnlen_ppc, __strnlen_ppc_1);
+__hidden_ver1 (__strnlen_ppc_1, __GI___strnlen, __strnlen_ppc_1);
 #endif
-
-#include <string/strnlen.c>
diff --git a/sysdeps/s390/strnlen-c.c b/sysdeps/s390/strnlen-c.c
index 4f1e494dca..c3fdc485ea 100644
--- a/sysdeps/s390/strnlen-c.c
+++ b/sysdeps/s390/strnlen-c.c
@@ -21,14 +21,16 @@
 #if HAVE_STRNLEN_C
 # if HAVE_STRNLEN_IFUNC
 #  define STRNLEN STRNLEN_C
+# endif
+
+# include <string/strnlen.c>
+
+# if HAVE_STRNLEN_IFUNC
 #  if defined SHARED && IS_IN (libc)
-#   undef libc_hidden_def
-#   define libc_hidden_def(name)					\
-  __hidden_ver1 (__strnlen_c, __GI_strnlen, __strnlen_c);	\
-  strong_alias (__strnlen_c, __strnlen_c_1);			\
-  __hidden_ver1 (__strnlen_c_1, __GI___strnlen, __strnlen_c_1);
+__hidden_ver1 (__strnlen_c, __GI_strnlen, __strnlen_c);
+strong_alias (__strnlen_c, __strnlen_c_1);
+__hidden_ver1 (__strnlen_c_1, __GI___strnlen, __strnlen_c_1);
 #  endif
 # endif
 
-# include <string/strnlen.c>
 #endif
-- 
2.34.1



* [PATCH v5 07/17] string: Improve generic strchr
  2022-09-19 19:59 [PATCH v5 00/17] Improve generic string routines Adhemerval Zanella
                   ` (5 preceding siblings ...)
  2022-09-19 19:59 ` [PATCH v5 06/17] string: Improve generic strnlen Adhemerval Zanella
@ 2022-09-19 19:59 ` Adhemerval Zanella
  2023-01-05 23:09   ` Noah Goldstein
  2022-09-19 19:59 ` [PATCH v5 08/17] string: Improve generic strchrnul Adhemerval Zanella
                   ` (10 subsequent siblings)
  17 siblings, 1 reply; 55+ messages in thread
From: Adhemerval Zanella @ 2022-09-19 19:59 UTC (permalink / raw)
  To: libc-alpha; +Cc: Richard Henderson

The new algorithm has the following key differences:

  - Reads the first word unaligned and uses the string-maskoff
    function to remove unwanted data.  This strategy follows the
    arch-specific optimizations used on aarch64 and powerpc.

  - Uses the string-fz{b,i} and string-extbyte functions.

Checked on x86_64-linux-gnu, i686-linux-gnu, powerpc-linux-gnu,
and powerpc64-linux-gnu by removing the arch-specific assembly
implementation and disabling multi-arch (it covers both LE and BE
for 64 and 32 bits).
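
The loop condition relies on has_zero_eq, which is built from the same
zero-byte test by XOR-ing the word with the broadcast goal byte.  A
small sketch with hypothetical stand-ins for the string-fza.h helpers
(the real ones are arch-overridable; only the shape is shown):

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

typedef unsigned long op_t;

/* Broadcast byte C to every byte position (cf. repeat_bytes).  */
static op_t
repeat_bytes (unsigned char c)
{
  return ((op_t) -1 / 0xff) * c;
}

/* Exact boolean zero-byte test (cf. find_zero_low): the result is
   nonzero iff some byte of X is zero.  */
static op_t
find_zero_low (op_t x)
{
  op_t lo = (op_t) -1 / 0xff;
  return (x - lo) & ~x & (lo << (CHAR_BIT - 1));
}

/* True if WORD contains a NUL byte or a byte equal to the broadcast
   goal: XOR turns matching bytes into zero bytes, so one primitive
   covers both searches.  This is the new strchr loop condition.  */
static int
has_zero_eq (op_t word, op_t repeated_c)
{
  return (find_zero_low (word) | find_zero_low (word ^ repeated_c)) != 0;
}
```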

Co-authored-by: Richard Henderson  <rth@twiddle.net>
---
 string/strchr.c         | 172 +++++++---------------------------------
 sysdeps/s390/strchr-c.c |  11 +--
 2 files changed, 34 insertions(+), 149 deletions(-)

diff --git a/string/strchr.c b/string/strchr.c
index bfd0c4e4bc..6bbee7f79d 100644
--- a/string/strchr.c
+++ b/string/strchr.c
@@ -22,164 +22,48 @@
 
 #include <string.h>
 #include <stdlib.h>
+#include <stdint.h>
+#include <string-fza.h>
+#include <string-fzb.h>
+#include <string-fzi.h>
+#include <string-extbyte.h>
+#include <string-maskoff.h>
 
 #undef strchr
+#undef index
 
-#ifndef STRCHR
-# define STRCHR strchr
+#ifdef STRCHR
+# define strchr STRCHR
 #endif
 
 /* Find the first occurrence of C in S.  */
 char *
-STRCHR (const char *s, int c_in)
+strchr (const char *s, int c_in)
 {
-  const unsigned char *char_ptr;
-  const unsigned long int *longword_ptr;
-  unsigned long int longword, magic_bits, charmask;
-  unsigned char c;
-
-  c = (unsigned char) c_in;
-
-  /* Handle the first few characters by reading one character at a time.
-     Do this until CHAR_PTR is aligned on a longword boundary.  */
-  for (char_ptr = (const unsigned char *) s;
-       ((unsigned long int) char_ptr & (sizeof (longword) - 1)) != 0;
-       ++char_ptr)
-    if (*char_ptr == c)
-      return (void *) char_ptr;
-    else if (*char_ptr == '\0')
-      return NULL;
-
-  /* All these elucidatory comments refer to 4-byte longwords,
-     but the theory applies equally well to 8-byte longwords.  */
-
-  longword_ptr = (unsigned long int *) char_ptr;
-
-  /* Bits 31, 24, 16, and 8 of this number are zero.  Call these bits
-     the "holes."  Note that there is a hole just to the left of
-     each byte, with an extra at the end:
-
-     bits:  01111110 11111110 11111110 11111111
-     bytes: AAAAAAAA BBBBBBBB CCCCCCCC DDDDDDDD
-
-     The 1-bits make sure that carries propagate to the next 0-bit.
-     The 0-bits provide holes for carries to fall into.  */
-  magic_bits = -1;
-  magic_bits = magic_bits / 0xff * 0xfe << 1 >> 1 | 1;
-
-  /* Set up a longword, each of whose bytes is C.  */
-  charmask = c | (c << 8);
-  charmask |= charmask << 16;
-  if (sizeof (longword) > 4)
-    /* Do the shift in two steps to avoid a warning if long has 32 bits.  */
-    charmask |= (charmask << 16) << 16;
-  if (sizeof (longword) > 8)
-    abort ();
-
-  /* Instead of the traditional loop which tests each character,
-     we will test a longword at a time.  The tricky part is testing
-     if *any of the four* bytes in the longword in question are zero.  */
-  for (;;)
-    {
-      /* We tentatively exit the loop if adding MAGIC_BITS to
-	 LONGWORD fails to change any of the hole bits of LONGWORD.
-
-	 1) Is this safe?  Will it catch all the zero bytes?
-	 Suppose there is a byte with all zeros.  Any carry bits
-	 propagating from its left will fall into the hole at its
-	 least significant bit and stop.  Since there will be no
-	 carry from its most significant bit, the LSB of the
-	 byte to the left will be unchanged, and the zero will be
-	 detected.
+  /* Set up a word, each of whose bytes is C.  */
+  unsigned char c = (unsigned char) c_in;
+  op_t repeated_c = repeat_bytes (c_in);
 
-	 2) Is this worthwhile?  Will it ignore everything except
-	 zero bytes?  Suppose every byte of LONGWORD has a bit set
-	 somewhere.  There will be a carry into bit 8.  If bit 8
-	 is set, this will carry into bit 16.  If bit 8 is clear,
-	 one of bits 9-15 must be set, so there will be a carry
-	 into bit 16.  Similarly, there will be a carry into bit
-	 24.  If one of bits 24-30 is set, there will be a carry
-	 into bit 31, so all of the hole bits will be changed.
+  /* Align the input address to op_t.  */
+  uintptr_t s_int = (uintptr_t) s;
+  const op_t *word_ptr = word_containing (s);
 
-	 The one misfire occurs when bits 24-30 are clear and bit
-	 31 is set; in this case, the hole at bit 31 is not
-	 changed.  If we had access to the processor carry flag,
-	 we could close this loophole by putting the fourth hole
-	 at bit 32!
+  /* Read the first aligned word, but force bytes before the string to
+     match neither zero nor goal (we make sure the high bit of each byte
+     is 1, and the low 7 bits are all the opposite of the goal byte).  */
+  op_t bmask = create_mask (s_int);
+  op_t word = (*word_ptr | bmask) ^ (repeated_c & highbit_mask (bmask));
 
-	 So it ignores everything except 128's, when they're aligned
-	 properly.
+  while (! has_zero_eq (word, repeated_c))
+    word = *++word_ptr;
 
-	 3) But wait!  Aren't we looking for C as well as zero?
-	 Good point.  So what we do is XOR LONGWORD with a longword,
-	 each of whose bytes is C.  This turns each byte that is C
-	 into a zero.  */
-
-      longword = *longword_ptr++;
-
-      /* Add MAGIC_BITS to LONGWORD.  */
-      if ((((longword + magic_bits)
-
-	    /* Set those bits that were unchanged by the addition.  */
-	    ^ ~longword)
-
-	   /* Look at only the hole bits.  If any of the hole bits
-	      are unchanged, most likely one of the bytes was a
-	      zero.  */
-	   & ~magic_bits) != 0
-
-	  /* That caught zeroes.  Now test for C.  */
-	  || ((((longword ^ charmask) + magic_bits) ^ ~(longword ^ charmask))
-	      & ~magic_bits) != 0)
-	{
-	  /* Which of the bytes was C or zero?
-	     If none of them were, it was a misfire; continue the search.  */
-
-	  const unsigned char *cp = (const unsigned char *) (longword_ptr - 1);
-
-	  if (*cp == c)
-	    return (char *) cp;
-	  else if (*cp == '\0')
-	    return NULL;
-	  if (*++cp == c)
-	    return (char *) cp;
-	  else if (*cp == '\0')
-	    return NULL;
-	  if (*++cp == c)
-	    return (char *) cp;
-	  else if (*cp == '\0')
-	    return NULL;
-	  if (*++cp == c)
-	    return (char *) cp;
-	  else if (*cp == '\0')
-	    return NULL;
-	  if (sizeof (longword) > 4)
-	    {
-	      if (*++cp == c)
-		return (char *) cp;
-	      else if (*cp == '\0')
-		return NULL;
-	      if (*++cp == c)
-		return (char *) cp;
-	      else if (*cp == '\0')
-		return NULL;
-	      if (*++cp == c)
-		return (char *) cp;
-	      else if (*cp == '\0')
-		return NULL;
-	      if (*++cp == c)
-		return (char *) cp;
-	      else if (*cp == '\0')
-		return NULL;
-	    }
-	}
-    }
+  op_t found = index_first_zero_eq (word, repeated_c);
 
+  if (extractbyte (word, found) == c)
+    return (char *) (word_ptr) + found;
   return NULL;
 }
-
-#ifdef weak_alias
-# undef index
+#ifndef STRCHR
 weak_alias (strchr, index)
-#endif
 libc_hidden_builtin_def (strchr)
+#endif
diff --git a/sysdeps/s390/strchr-c.c b/sysdeps/s390/strchr-c.c
index 4ac3a62fba..a5a1781b1c 100644
--- a/sysdeps/s390/strchr-c.c
+++ b/sysdeps/s390/strchr-c.c
@@ -21,13 +21,14 @@
 #if HAVE_STRCHR_C
 # if HAVE_STRCHR_IFUNC
 #  define STRCHR STRCHR_C
-#  undef weak_alias
+# endif
+
+# include <string/strchr.c>
+
+# if HAVE_STRCHR_IFUNC
 #  if defined SHARED && IS_IN (libc)
-#   undef libc_hidden_builtin_def
-#   define libc_hidden_builtin_def(name)			\
-     __hidden_ver1 (__strchr_c, __GI_strchr, __strchr_c);
+__hidden_ver1 (__strchr_c, __GI_strchr, __strchr_c);
 #  endif
 # endif
 
-# include <string/strchr.c>
 #endif
-- 
2.34.1



* [PATCH v5 08/17] string: Improve generic strchrnul
  2022-09-19 19:59 [PATCH v5 00/17] Improve generic string routines Adhemerval Zanella
                   ` (6 preceding siblings ...)
  2022-09-19 19:59 ` [PATCH v5 07/17] string: Improve generic strchr Adhemerval Zanella
@ 2022-09-19 19:59 ` Adhemerval Zanella
  2023-01-05 23:17   ` Noah Goldstein
  2022-09-19 19:59 ` [PATCH v5 09/17] string: Improve generic strcmp Adhemerval Zanella
                   ` (9 subsequent siblings)
  17 siblings, 1 reply; 55+ messages in thread
From: Adhemerval Zanella @ 2022-09-19 19:59 UTC (permalink / raw)
  To: libc-alpha; +Cc: Richard Henderson

The new algorithm has the following key differences:

  - Reads the first word unaligned and uses the string-maskoff
    function to remove unwanted data.  This strategy follows the
    arch-specific optimizations used on aarch64 and powerpc.

  - Uses the string-fz{b,i} functions.

Checked on x86_64-linux-gnu, i686-linux-gnu, powerpc64-linux-gnu,
and powerpc-linux-gnu by removing the arch-specific assembly
implementation and disabling multi-arch (it covers both LE and BE
for 64 and 32 bits).
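
Byte-wise, the contract the word-wise loop implements is simply the
following (unlike strchr, a miss returns a pointer to the terminator,
never NULL); the reference function name is illustrative:

```c
#include <assert.h>

/* Reference semantics of strchrnul: return the address of the first
   byte equal to C, or of the final NUL if C does not occur.  */
static char *
strchrnul_ref (const char *s, int c_in)
{
  const unsigned char *p = (const unsigned char *) s;
  unsigned char c = (unsigned char) c_in;
  while (*p != c && *p != '\0')
    p++;
  return (char *) p;
}
```

Because a result is always produced, the word-wise version needs no
final NULL check, which is why it is shorter than the strchr patch.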

Co-authored-by: Richard Henderson  <rth@twiddle.net>
---
 string/strchrnul.c                            | 156 +++---------------
 .../power4/multiarch/strchrnul-ppc32.c        |   4 -
 sysdeps/s390/strchrnul-c.c                    |   2 -
 3 files changed, 24 insertions(+), 138 deletions(-)

diff --git a/string/strchrnul.c b/string/strchrnul.c
index 0cc1fc6bb0..67defa3dab 100644
--- a/string/strchrnul.c
+++ b/string/strchrnul.c
@@ -1,10 +1,5 @@
 /* Copyright (C) 1991-2022 Free Software Foundation, Inc.
    This file is part of the GNU C Library.
-   Based on strlen implementation by Torbjorn Granlund (tege@sics.se),
-   with help from Dan Sahlin (dan@sics.se) and
-   bug fix and commentary by Jim Blandy (jimb@ai.mit.edu);
-   adaptation to strchr suggested by Dick Karpinski (dick@cca.ucsf.edu),
-   and implemented by Roland McGrath (roland@ai.mit.edu).
 
    The GNU C Library is free software; you can redistribute it and/or
    modify it under the terms of the GNU Lesser General Public
@@ -21,146 +16,43 @@
    <https://www.gnu.org/licenses/>.  */
 
 #include <string.h>
-#include <memcopy.h>
 #include <stdlib.h>
+#include <stdint.h>
+#include <string-fza.h>
+#include <string-fzb.h>
+#include <string-fzi.h>
+#include <string-maskoff.h>
 
 #undef __strchrnul
 #undef strchrnul
 
-#ifndef STRCHRNUL
-# define STRCHRNUL __strchrnul
+#ifdef STRCHRNUL
+# define __strchrnul STRCHRNUL
 #endif
 
 /* Find the first occurrence of C in S or the final NUL byte.  */
 char *
-STRCHRNUL (const char *s, int c_in)
+__strchrnul (const char *str, int c_in)
 {
-  const unsigned char *char_ptr;
-  const unsigned long int *longword_ptr;
-  unsigned long int longword, magic_bits, charmask;
-  unsigned char c;
-
-  c = (unsigned char) c_in;
-
-  /* Handle the first few characters by reading one character at a time.
-     Do this until CHAR_PTR is aligned on a longword boundary.  */
-  for (char_ptr = (const unsigned char *) s;
-       ((unsigned long int) char_ptr & (sizeof (longword) - 1)) != 0;
-       ++char_ptr)
-    if (*char_ptr == c || *char_ptr == '\0')
-      return (void *) char_ptr;
-
-  /* All these elucidatory comments refer to 4-byte longwords,
-     but the theory applies equally well to 8-byte longwords.  */
-
-  longword_ptr = (unsigned long int *) char_ptr;
-
-  /* Bits 31, 24, 16, and 8 of this number are zero.  Call these bits
-     the "holes."  Note that there is a hole just to the left of
-     each byte, with an extra at the end:
-
-     bits:  01111110 11111110 11111110 11111111
-     bytes: AAAAAAAA BBBBBBBB CCCCCCCC DDDDDDDD
-
-     The 1-bits make sure that carries propagate to the next 0-bit.
-     The 0-bits provide holes for carries to fall into.  */
-  magic_bits = -1;
-  magic_bits = magic_bits / 0xff * 0xfe << 1 >> 1 | 1;
-
-  /* Set up a longword, each of whose bytes is C.  */
-  charmask = c | (c << 8);
-  charmask |= charmask << 16;
-  if (sizeof (longword) > 4)
-    /* Do the shift in two steps to avoid a warning if long has 32 bits.  */
-    charmask |= (charmask << 16) << 16;
-  if (sizeof (longword) > 8)
-    abort ();
-
-  /* Instead of the traditional loop which tests each character,
-     we will test a longword at a time.  The tricky part is testing
-     if *any of the four* bytes in the longword in question are zero.  */
-  for (;;)
-    {
-      /* We tentatively exit the loop if adding MAGIC_BITS to
-	 LONGWORD fails to change any of the hole bits of LONGWORD.
-
-	 1) Is this safe?  Will it catch all the zero bytes?
-	 Suppose there is a byte with all zeros.  Any carry bits
-	 propagating from its left will fall into the hole at its
-	 least significant bit and stop.  Since there will be no
-	 carry from its most significant bit, the LSB of the
-	 byte to the left will be unchanged, and the zero will be
-	 detected.
+  /* Set up a word, each of whose bytes is C.  */
+  op_t repeated_c = repeat_bytes (c_in);
 
-	 2) Is this worthwhile?  Will it ignore everything except
-	 zero bytes?  Suppose every byte of LONGWORD has a bit set
-	 somewhere.  There will be a carry into bit 8.  If bit 8
-	 is set, this will carry into bit 16.  If bit 8 is clear,
-	 one of bits 9-15 must be set, so there will be a carry
-	 into bit 16.  Similarly, there will be a carry into bit
-	 24.  If one of bits 24-30 is set, there will be a carry
-	 into bit 31, so all of the hole bits will be changed.
+  /* Align the input address to op_t.  */
+  uintptr_t s_int = (uintptr_t) str;
+  const op_t *word_ptr = word_containing (str);
 
-	 The one misfire occurs when bits 24-30 are clear and bit
-	 31 is set; in this case, the hole at bit 31 is not
-	 changed.  If we had access to the processor carry flag,
-	 we could close this loophole by putting the fourth hole
-	 at bit 32!
+  /* Read the first aligned word, but force bytes before the string to
+     match neither zero nor goal (we make sure the high bit of each byte
+     is 1, and the low 7 bits are all the opposite of the goal byte).  */
+  op_t bmask = create_mask (s_int);
+  op_t word = (*word_ptr | bmask) ^ (repeated_c & highbit_mask (bmask));
 
-	 So it ignores everything except 128's, when they're aligned
-	 properly.
+  while (! has_zero_eq (word, repeated_c))
+    word = *++word_ptr;
 
-	 3) But wait!  Aren't we looking for C as well as zero?
-	 Good point.  So what we do is XOR LONGWORD with a longword,
-	 each of whose bytes is C.  This turns each byte that is C
-	 into a zero.  */
-
-      longword = *longword_ptr++;
-
-      /* Add MAGIC_BITS to LONGWORD.  */
-      if ((((longword + magic_bits)
-
-	    /* Set those bits that were unchanged by the addition.  */
-	    ^ ~longword)
-
-	   /* Look at only the hole bits.  If any of the hole bits
-	      are unchanged, most likely one of the bytes was a
-	      zero.  */
-	   & ~magic_bits) != 0
-
-	  /* That caught zeroes.  Now test for C.  */
-	  || ((((longword ^ charmask) + magic_bits) ^ ~(longword ^ charmask))
-	      & ~magic_bits) != 0)
-	{
-	  /* Which of the bytes was C or zero?
-	     If none of them were, it was a misfire; continue the search.  */
-
-	  const unsigned char *cp = (const unsigned char *) (longword_ptr - 1);
-
-	  if (*cp == c || *cp == '\0')
-	    return (char *) cp;
-	  if (*++cp == c || *cp == '\0')
-	    return (char *) cp;
-	  if (*++cp == c || *cp == '\0')
-	    return (char *) cp;
-	  if (*++cp == c || *cp == '\0')
-	    return (char *) cp;
-	  if (sizeof (longword) > 4)
-	    {
-	      if (*++cp == c || *cp == '\0')
-		return (char *) cp;
-	      if (*++cp == c || *cp == '\0')
-		return (char *) cp;
-	      if (*++cp == c || *cp == '\0')
-		return (char *) cp;
-	      if (*++cp == c || *cp == '\0')
-		return (char *) cp;
-	    }
-	}
-    }
-
-  /* This should never happen.  */
-  return NULL;
+  op_t found = index_first_zero_eq (word, repeated_c);
+  return (char *) (word_ptr) + found;
 }
-
+#ifndef STRCHRNUL
 weak_alias (__strchrnul, strchrnul)
+#endif
diff --git a/sysdeps/powerpc/powerpc32/power4/multiarch/strchrnul-ppc32.c b/sysdeps/powerpc/powerpc32/power4/multiarch/strchrnul-ppc32.c
index ed86b5e671..9c85e269f7 100644
--- a/sysdeps/powerpc/powerpc32/power4/multiarch/strchrnul-ppc32.c
+++ b/sysdeps/powerpc/powerpc32/power4/multiarch/strchrnul-ppc32.c
@@ -19,10 +19,6 @@
 #include <string.h>
 
 #define STRCHRNUL  __strchrnul_ppc
-
-#undef weak_alias
-#define weak_alias(a,b )
-
 extern __typeof (strchrnul) __strchrnul_ppc attribute_hidden;
 
 #include <string/strchrnul.c>
diff --git a/sysdeps/s390/strchrnul-c.c b/sysdeps/s390/strchrnul-c.c
index 4ffac54edd..2ebbcc62f7 100644
--- a/sysdeps/s390/strchrnul-c.c
+++ b/sysdeps/s390/strchrnul-c.c
@@ -22,8 +22,6 @@
 # if HAVE_STRCHRNUL_IFUNC
 #  define STRCHRNUL STRCHRNUL_C
 #  define __strchrnul STRCHRNUL
-#  undef weak_alias
-#  define weak_alias(name, alias)
 # endif
 
 # include <string/strchrnul.c>
-- 
2.34.1


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH v5 09/17] string: Improve generic strcmp
  2022-09-19 19:59 [PATCH v5 00/17] Improve generic string routines Adhemerval Zanella
                   ` (7 preceding siblings ...)
  2022-09-19 19:59 ` [PATCH v5 08/17] string: Improve generic strchrnul Adhemerval Zanella
@ 2022-09-19 19:59 ` Adhemerval Zanella
  2022-09-19 19:59 ` [PATCH v5 10/17] string: Improve generic memchr Adhemerval Zanella
                   ` (8 subsequent siblings)
  17 siblings, 0 replies; 55+ messages in thread
From: Adhemerval Zanella @ 2022-09-19 19:59 UTC (permalink / raw)
  To: libc-alpha; +Cc: Richard Henderson

The new generic implementation tries to use word operations along with
the new string-fz{b,i} functions even for inputs with different
alignments (it still uses aligned accesses plus a merge operation
to obtain a correct word-by-word comparison).
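
For readers unfamiliar with the string-fz{b,i} interface, the boolean
word tests it provides can be sketched in portable C.  This is an
illustration of the classic zero-byte bit trick, not the exact glibc
headers; the op_t typedef stands in for the word type from memcopy.h:

```c
#include <assert.h>
#include <stdint.h>

typedef uintptr_t op_t;   /* one machine word, as in glibc's memcopy.h */

/* Replicate the low byte of C into every byte of a word.  */
static op_t
repeat_bytes (unsigned char c)
{
  return ((op_t) -1 / 0xff) * c;
}

/* Classic zero-byte test: nonzero iff some byte of X is zero.  */
static int
has_zero (op_t x)
{
  op_t lsb = (op_t) -1 / 0xff;          /* 0x0101...01 */
  return ((x - lsb) & ~x & (lsb << 7)) != 0;
}

/* Some byte of X is zero, or equals the byte replicated in REPEATED_C.  */
static int
has_zero_eq (op_t x, op_t repeated_c)
{
  return has_zero (x) || has_zero (x ^ repeated_c);
}
```

Architectures override these with cheaper sequences (see the hppa
patches later in the series); the generic forms above are what the
parametrization falls back to.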

Checked on x86_64-linux-gnu, i686-linux-gnu, powerpc64-linux-gnu,
and powerpc-linux-gnu by removing the arch-specific assembly
implementation and disabling multi-arch (it covers both LE and BE
for 64 and 32 bits).

Co-authored-by: Richard Henderson  <rth@twiddle.net>
---
 string/strcmp.c | 119 +++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 103 insertions(+), 16 deletions(-)

diff --git a/string/strcmp.c b/string/strcmp.c
index d4962be4ec..9cc726d877 100644
--- a/string/strcmp.c
+++ b/string/strcmp.c
@@ -15,33 +15,120 @@
    License along with the GNU C Library; if not, see
    <https://www.gnu.org/licenses/>.  */
 
+#include <stdint.h>
+#include <string-extbyte.h>
+#include <string-fzb.h>
+#include <string-fzi.h>
 #include <string.h>
+#include <memcopy.h>
 
-#undef strcmp
-
-#ifndef STRCMP
-# define STRCMP strcmp
+#ifdef STRCMP
+# define strcmp STRCMP
 #endif
 
+static inline int
+final_cmp (const op_t w1, const op_t w2)
+{
+  /* It cannot use index_first_zero_ne because it must not compare past the
+     final '\0' (and final_cmp is called before the has_zero check).  */
+  for (size_t i = 0; i < sizeof (op_t); i++)
+    {
+      unsigned char c1 = extractbyte (w1, i);
+      unsigned char c2 = extractbyte (w2, i);
+      if (c1 == '\0' || c1 != c2)
+        return c1 - c2;
+    }
+  return 0;
+}
+
+/* Aligned loop: if a difference is found, exit to compare the bytes.  Else
+   if a zero is found we have equal strings.  */
+static inline int
+strcmp_aligned_loop (const op_t *x1, const op_t *x2, op_t w1)
+{
+  op_t w2 = *x2++;
+
+  while (w1 == w2)
+    {
+      if (has_zero (w1))
+	return 0;
+      w1 = *x1++;
+      w2 = *x2++;
+    }
+
+  return final_cmp (w1, w2);
+}
+
+/* Unaligned loop: align the first partial of P2, with 0xff for the rest of
+   the bytes so that we can also apply the has_zero test to see if we have
+   already reached EOS.  If we have, then we can simply fall through to the
+   final comparison.  */
+static inline int
+strcmp_unaligned_loop (const op_t *x1, const op_t *x2, op_t w1, uintptr_t ofs)
+{
+  op_t w2a = *x2++;
+  uintptr_t sh_1 = ofs * CHAR_BIT;
+  uintptr_t sh_2 = sizeof(op_t) * CHAR_BIT - sh_1;
+
+  op_t w2 = MERGE (w2a, sh_1, (op_t)-1, sh_2);
+  if (!has_zero (w2))
+    {
+      op_t w2b;
+
+      /* Unaligned loop.  The invariant is that W2B, which is "ahead" of W1,
+	 does not contain end-of-string.  Therefore it is safe (and necessary)
+	 to read another word from each while we do not have a difference.  */
+      while (1)
+	{
+	  w2b = *x2++;
+	  w2 = MERGE (w2a, sh_1, w2b, sh_2);
+	  if (w1 != w2)
+	    return final_cmp (w1, w2);
+	  if (has_zero (w2b))
+	    break;
+	  w1 = *x1++;
+	  w2a = w2b;
+	}
+
+      /* Zero found in the second partial of P2.  If we had EOS in the aligned
+	 word, we have equality.  */
+      if (has_zero (w1))
+	return 0;
+
+      /* Load the final word of P1 and align the final partial of P2.  */
+      w1 = *x1++;
+      w2 = MERGE (w2b, sh_1, 0, sh_2);
+    }
+
+  return final_cmp (w1, w2);
+}
+
 /* Compare S1 and S2, returning less than, equal to or
    greater than zero if S1 is lexicographically less than,
    equal to or greater than S2.  */
 int
-STRCMP (const char *p1, const char *p2)
+strcmp (const char *p1, const char *p2)
 {
-  const unsigned char *s1 = (const unsigned char *) p1;
-  const unsigned char *s2 = (const unsigned char *) p2;
-  unsigned char c1, c2;
-
-  do
+  /* Handle the unaligned bytes of p1 first.  */
+  uintptr_t n = -(uintptr_t)p1 % sizeof(op_t);
+  for (int i = 0; i < n; ++i)
     {
-      c1 = (unsigned char) *s1++;
-      c2 = (unsigned char) *s2++;
-      if (c1 == '\0')
-	return c1 - c2;
+      unsigned char c1 = *p1++;
+      unsigned char c2 = *p2++;
+      int diff = c1 - c2;
+      if (c1 == '\0' || diff != 0)
+	return diff;
     }
-  while (c1 == c2);
 
-  return c1 - c2;
+  /* P1 is now aligned to op_t.  P2 may or may not be.  */
+  const op_t *x1 = (const op_t *) p1;
+  op_t w1 = *x1++;
+  uintptr_t ofs = (uintptr_t) p2 % sizeof(op_t);
+  return ofs == 0
+    ? strcmp_aligned_loop (x1, (const op_t *)p2, w1)
+    : strcmp_unaligned_loop (x1, (const op_t *)(p2 - ofs), w1, ofs);
 }
+#ifndef STRCMP
 libc_hidden_builtin_def (strcmp)
+#endif
-- 
2.34.1


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH v5 10/17] string: Improve generic memchr
  2022-09-19 19:59 [PATCH v5 00/17] Improve generic string routines Adhemerval Zanella
                   ` (8 preceding siblings ...)
  2022-09-19 19:59 ` [PATCH v5 09/17] string: Improve generic strcmp Adhemerval Zanella
@ 2022-09-19 19:59 ` Adhemerval Zanella
  2023-01-05 23:47   ` Noah Goldstein
  2023-01-05 23:49   ` Noah Goldstein
  2022-09-19 19:59 ` [PATCH v5 11/17] string: Improve generic memrchr Adhemerval Zanella
                   ` (7 subsequent siblings)
  17 siblings, 2 replies; 55+ messages in thread
From: Adhemerval Zanella @ 2022-09-19 19:59 UTC (permalink / raw)
  To: libc-alpha; +Cc: Richard Henderson

The new algorithm has the following key differences:

  - Reads the first word unaligned and uses the string-maskoff functions
    to remove unwanted data.  This strategy follows the arch-specific
    optimizations used on aarch64 and powerpc.

  - Use string-fz{b,i} and string-opthr functions.
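
One detail of the new implementation is that it saturates the address of
the last byte so that `s + n` cannot wrap past the top of the address
space.  A sketch of that helper using only standard C, without gnulib's
INT_ADD_OVERFLOW from intprops.h:

```c
#include <assert.h>
#include <stdint.h>

/* Saturating add of two pointer-sized values: if X + Y would overflow,
   return the maximum address instead (a sketch of the sadd helper used
   by the new memchr).  */
static uintptr_t
sadd (uintptr_t x, uintptr_t y)
{
  return y > (uintptr_t) -1 - x ? (uintptr_t) -1 : x + y;
}
```

memchr computes the last-byte address as `sadd (s_int, n - 1)`, so a
huge N degrades to "scan to the end of memory" rather than to a bogus
wrapped pointer.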

Checked on x86_64-linux-gnu, i686-linux-gnu, powerpc-linux-gnu,
and powerpc64-linux-gnu by removing the arch-specific assembly
implementation and disabling multi-arch (it covers both LE and BE
for 64 and 32 bits).

Co-authored-by: Richard Henderson  <rth@twiddle.net>
---
 string/memchr.c                               | 168 +++++-------------
 .../powerpc32/power4/multiarch/memchr-ppc32.c |  14 +-
 .../powerpc64/multiarch/memchr-ppc64.c        |   9 +-
 3 files changed, 48 insertions(+), 143 deletions(-)

diff --git a/string/memchr.c b/string/memchr.c
index 422bcd0cd6..08d518b02d 100644
--- a/string/memchr.c
+++ b/string/memchr.c
@@ -1,10 +1,6 @@
-/* Copyright (C) 1991-2022 Free Software Foundation, Inc.
+/* Scan memory for a character.  Generic version.
+   Copyright (C) 1991-2022 Free Software Foundation, Inc.
    This file is part of the GNU C Library.
-   Based on strlen implementation by Torbjorn Granlund (tege@sics.se),
-   with help from Dan Sahlin (dan@sics.se) and
-   commentary by Jim Blandy (jimb@ai.mit.edu);
-   adaptation to memchr suggested by Dick Karpinski (dick@cca.ucsf.edu),
-   and implemented by Roland McGrath (roland@ai.mit.edu).
 
    The GNU C Library is free software; you can redistribute it and/or
    modify it under the terms of the GNU Lesser General Public
@@ -20,143 +16,65 @@
    License along with the GNU C Library; if not, see
    <https://www.gnu.org/licenses/>.  */
 
-#ifndef _LIBC
-# include <config.h>
-#endif
-
+#include <intprops.h>
+#include <string-fza.h>
+#include <string-fzb.h>
+#include <string-fzi.h>
+#include <string-maskoff.h>
+#include <string-opthr.h>
 #include <string.h>
 
-#include <stddef.h>
+#undef memchr
 
-#include <limits.h>
-
-#undef __memchr
-#ifdef _LIBC
-# undef memchr
+#ifdef MEMCHR
+# define __memchr MEMCHR
 #endif
 
-#ifndef weak_alias
-# define __memchr memchr
-#endif
-
-#ifndef MEMCHR
-# define MEMCHR __memchr
-#endif
+static inline const char *
+sadd (uintptr_t x, uintptr_t y)
+{
+  uintptr_t ret = INT_ADD_OVERFLOW (x, y) ? (uintptr_t)-1 : x + y;
+  return (const char *)ret;
+}
 
 /* Search no more than N bytes of S for C.  */
 void *
-MEMCHR (void const *s, int c_in, size_t n)
+__memchr (void const *s, int c_in, size_t n)
 {
-  /* On 32-bit hardware, choosing longword to be a 32-bit unsigned
-     long instead of a 64-bit uintmax_t tends to give better
-     performance.  On 64-bit hardware, unsigned long is generally 64
-     bits already.  Change this typedef to experiment with
-     performance.  */
-  typedef unsigned long int longword;
+  if (__glibc_unlikely (n == 0))
+    return NULL;
 
-  const unsigned char *char_ptr;
-  const longword *longword_ptr;
-  longword repeated_one;
-  longword repeated_c;
-  unsigned char c;
+  uintptr_t s_int = (uintptr_t) s;
 
-  c = (unsigned char) c_in;
+  /* Set up a word, each of whose bytes is C.  */
+  op_t repeated_c = repeat_bytes (c_in);
+  op_t before_mask = create_mask (s_int);
 
-  /* Handle the first few bytes by reading one byte at a time.
-     Do this until CHAR_PTR is aligned on a longword boundary.  */
-  for (char_ptr = (const unsigned char *) s;
-       n > 0 && (size_t) char_ptr % sizeof (longword) != 0;
-       --n, ++char_ptr)
-    if (*char_ptr == c)
-      return (void *) char_ptr;
+  /* Compute the address of the last byte, taking into consideration
+     possible overflow.  */
+  const char *lbyte = sadd (s_int, n - 1);
 
-  longword_ptr = (const longword *) char_ptr;
+  /* Compute the address of the word containing the last byte. */
+  const op_t *lword = word_containing (lbyte);
 
-  /* All these elucidatory comments refer to 4-byte longwords,
-     but the theory applies equally well to any size longwords.  */
+  /* Read the first word, but munge it so that bytes before the array
+     will not match goal.  */
+  const op_t *word_ptr = word_containing (s);
+  op_t word = (*word_ptr | before_mask) ^ (repeated_c & before_mask);
 
-  /* Compute auxiliary longword values:
-     repeated_one is a value which has a 1 in every byte.
-     repeated_c has c in every byte.  */
-  repeated_one = 0x01010101;
-  repeated_c = c | (c << 8);
-  repeated_c |= repeated_c << 16;
-  if (0xffffffffU < (longword) -1)
+  while (has_eq (word, repeated_c) == 0)
     {
-      repeated_one |= repeated_one << 31 << 1;
-      repeated_c |= repeated_c << 31 << 1;
-      if (8 < sizeof (longword))
-	{
-	  size_t i;
-
-	  for (i = 64; i < sizeof (longword) * 8; i *= 2)
-	    {
-	      repeated_one |= repeated_one << i;
-	      repeated_c |= repeated_c << i;
-	    }
-	}
+      if (word_ptr == lword)
+	return NULL;
+      word = *++word_ptr;
     }
 
-  /* Instead of the traditional loop which tests each byte, we will test a
-     longword at a time.  The tricky part is testing if *any of the four*
-     bytes in the longword in question are equal to c.  We first use an xor
-     with repeated_c.  This reduces the task to testing whether *any of the
-     four* bytes in longword1 is zero.
-
-     We compute tmp =
-       ((longword1 - repeated_one) & ~longword1) & (repeated_one << 7).
-     That is, we perform the following operations:
-       1. Subtract repeated_one.
-       2. & ~longword1.
-       3. & a mask consisting of 0x80 in every byte.
-     Consider what happens in each byte:
-       - If a byte of longword1 is zero, step 1 and 2 transform it into 0xff,
-	 and step 3 transforms it into 0x80.  A carry can also be propagated
-	 to more significant bytes.
-       - If a byte of longword1 is nonzero, let its lowest 1 bit be at
-	 position k (0 <= k <= 7); so the lowest k bits are 0.  After step 1,
-	 the byte ends in a single bit of value 0 and k bits of value 1.
-	 After step 2, the result is just k bits of value 1: 2^k - 1.  After
-	 step 3, the result is 0.  And no carry is produced.
-     So, if longword1 has only non-zero bytes, tmp is zero.
-     Whereas if longword1 has a zero byte, call j the position of the least
-     significant zero byte.  Then the result has a zero at positions 0, ...,
-     j-1 and a 0x80 at position j.  We cannot predict the result at the more
-     significant bytes (positions j+1..3), but it does not matter since we
-     already have a non-zero bit at position 8*j+7.
-
-     So, the test whether any byte in longword1 is zero is equivalent to
-     testing whether tmp is nonzero.  */
-
-  while (n >= sizeof (longword))
-    {
-      longword longword1 = *longword_ptr ^ repeated_c;
-
-      if ((((longword1 - repeated_one) & ~longword1)
-	   & (repeated_one << 7)) != 0)
-	break;
-      longword_ptr++;
-      n -= sizeof (longword);
-    }
-
-  char_ptr = (const unsigned char *) longword_ptr;
-
-  /* At this point, we know that either n < sizeof (longword), or one of the
-     sizeof (longword) bytes starting at char_ptr is == c.  On little-endian
-     machines, we could determine the first such byte without any further
-     memory accesses, just by looking at the tmp result from the last loop
-     iteration.  But this does not work on big-endian machines.  Choose code
-     that works in both cases.  */
-
-  for (; n > 0; --n, ++char_ptr)
-    {
-      if (*char_ptr == c)
-	return (void *) char_ptr;
-    }
-
-  return NULL;
+  /* We found a match, but it might be in a byte past the end
+     of the array.  */
+  char *ret = (char *) word_ptr + index_first_eq (word, repeated_c);
+  return (ret <= lbyte) ? ret : NULL;
 }
-#ifdef weak_alias
+#ifndef MEMCHR
 weak_alias (__memchr, memchr)
-#endif
 libc_hidden_builtin_def (memchr)
+#endif
diff --git a/sysdeps/powerpc/powerpc32/power4/multiarch/memchr-ppc32.c b/sysdeps/powerpc/powerpc32/power4/multiarch/memchr-ppc32.c
index fc69df54b3..02877d3c98 100644
--- a/sysdeps/powerpc/powerpc32/power4/multiarch/memchr-ppc32.c
+++ b/sysdeps/powerpc/powerpc32/power4/multiarch/memchr-ppc32.c
@@ -18,17 +18,11 @@
 
 #include <string.h>
 
-#define MEMCHR  __memchr_ppc
+extern __typeof (memchr) __memchr_ppc attribute_hidden;
 
-#undef weak_alias
-#define weak_alias(a, b)
+#define MEMCHR  __memchr_ppc
+#include <string/memchr.c>
 
 #ifdef SHARED
-# undef libc_hidden_builtin_def
-# define libc_hidden_builtin_def(name) \
-  __hidden_ver1(__memchr_ppc, __GI_memchr, __memchr_ppc);
+__hidden_ver1(__memchr_ppc, __GI_memchr, __memchr_ppc);
 #endif
-
-extern __typeof (memchr) __memchr_ppc attribute_hidden;
-
-#include <string/memchr.c>
diff --git a/sysdeps/powerpc/powerpc64/multiarch/memchr-ppc64.c b/sysdeps/powerpc/powerpc64/multiarch/memchr-ppc64.c
index 3c966f4403..15beca787b 100644
--- a/sysdeps/powerpc/powerpc64/multiarch/memchr-ppc64.c
+++ b/sysdeps/powerpc/powerpc64/multiarch/memchr-ppc64.c
@@ -18,14 +18,7 @@
 
 #include <string.h>
 
-#define MEMCHR  __memchr_ppc
-
-#undef weak_alias
-#define weak_alias(a, b)
-
-# undef libc_hidden_builtin_def
-# define libc_hidden_builtin_def(name)
-
 extern __typeof (memchr) __memchr_ppc attribute_hidden;
 
+#define MEMCHR  __memchr_ppc
 #include <string/memchr.c>
-- 
2.34.1


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH v5 11/17] string: Improve generic memrchr
  2022-09-19 19:59 [PATCH v5 00/17] Improve generic string routines Adhemerval Zanella
                   ` (9 preceding siblings ...)
  2022-09-19 19:59 ` [PATCH v5 10/17] string: Improve generic memchr Adhemerval Zanella
@ 2022-09-19 19:59 ` Adhemerval Zanella
  2023-01-05 23:51   ` Noah Goldstein
  2022-09-19 19:59 ` [PATCH v5 12/17] hppa: Add memcopy.h Adhemerval Zanella
                   ` (6 subsequent siblings)
  17 siblings, 1 reply; 55+ messages in thread
From: Adhemerval Zanella @ 2022-09-19 19:59 UTC (permalink / raw)
  To: libc-alpha; +Cc: Richard Henderson

The new algorithm has the following key differences:

  - Use string-fz{b,i} functions.
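
After has_eq fires, memrchr needs the index (in memory order) of the
*last* matching byte within the word.  A portable fallback sketch of
index_last_eq, assuming little-endian byte order and a 32-bit word for
determinism (the real string-fzi.h is parametrized on both):

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t op_t;  /* 32-bit word for a deterministic example */

/* Index, in little-endian memory order, of the last byte of X equal to
   the byte replicated in REPEATED_C.  X is assumed to contain at least
   one such byte.  */
static unsigned int
index_last_eq (op_t x, op_t repeated_c)
{
  op_t eq = x ^ repeated_c;          /* matching bytes become zero */
  for (unsigned int i = sizeof (op_t); i-- > 0; )
    if (((eq >> (i * 8)) & 0xff) == 0)
      return i;
  return 0;  /* unreachable when the precondition holds */
}
```

Targets with a count-leading-zeros insn implement this without the byte
loop; the point of the parametrization is that memrchr itself no longer
cares which form is used.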

Checked on x86_64-linux-gnu, i686-linux-gnu, powerpc-linux-gnu,
and powerpc64-linux-gnu by removing the arch-specific assembly
implementation and disabling multi-arch (it covers both LE and BE
for 64 and 32 bits).

Co-authored-by: Richard Henderson  <rth@twiddle.net>
---
 string/memrchr.c | 189 ++++++++---------------------------------------
 1 file changed, 32 insertions(+), 157 deletions(-)

diff --git a/string/memrchr.c b/string/memrchr.c
index 8eb6829e45..5491689c66 100644
--- a/string/memrchr.c
+++ b/string/memrchr.c
@@ -1,11 +1,6 @@
 /* memrchr -- find the last occurrence of a byte in a memory block
    Copyright (C) 1991-2022 Free Software Foundation, Inc.
    This file is part of the GNU C Library.
-   Based on strlen implementation by Torbjorn Granlund (tege@sics.se),
-   with help from Dan Sahlin (dan@sics.se) and
-   commentary by Jim Blandy (jimb@ai.mit.edu);
-   adaptation to memchr suggested by Dick Karpinski (dick@cca.ucsf.edu),
-   and implemented by Roland McGrath (roland@ai.mit.edu).
 
    The GNU C Library is free software; you can redistribute it and/or
    modify it under the terms of the GNU Lesser General Public
@@ -21,177 +16,57 @@
    License along with the GNU C Library; if not, see
    <https://www.gnu.org/licenses/>.  */
 
-#include <stdlib.h>
-
-#ifdef HAVE_CONFIG_H
-# include <config.h>
-#endif
-
-#if defined _LIBC
-# include <string.h>
-# include <memcopy.h>
-#endif
-
-#if defined HAVE_LIMITS_H || defined _LIBC
-# include <limits.h>
-#endif
-
-#define LONG_MAX_32_BITS 2147483647
-
-#ifndef LONG_MAX
-# define LONG_MAX LONG_MAX_32_BITS
-#endif
-
-#include <sys/types.h>
+#include <string-fzb.h>
+#include <string-fzi.h>
+#include <string-maskoff.h>
+#include <string-opthr.h>
+#include <string.h>
 
 #undef __memrchr
 #undef memrchr
 
-#ifndef weak_alias
-# define __memrchr memrchr
+#ifdef MEMRCHR
+# define __memrchr MEMRCHR
 #endif
 
-/* Search no more than N bytes of S for C.  */
 void *
-#ifndef MEMRCHR
-__memrchr
-#else
-MEMRCHR
-#endif
-     (const void *s, int c_in, size_t n)
+__memrchr (const void *s, int c_in, size_t n)
 {
-  const unsigned char *char_ptr;
-  const unsigned long int *longword_ptr;
-  unsigned long int longword, magic_bits, charmask;
-  unsigned char c;
-
-  c = (unsigned char) c_in;
-
   /* Handle the last few characters by reading one character at a time.
-     Do this until CHAR_PTR is aligned on a longword boundary.  */
-  for (char_ptr = (const unsigned char *) s + n;
-       n > 0 && ((unsigned long int) char_ptr
-		 & (sizeof (longword) - 1)) != 0;
-       --n)
-    if (*--char_ptr == c)
+     Do this until CHAR_PTR is aligned on a word boundary, or, for small
+     inputs, until the entire buffer has been checked.  */
+  const unsigned char *char_ptr = (const unsigned char *) (s + n);
+  size_t align = (uintptr_t) char_ptr % sizeof (op_t);
+  if (n < OP_T_THRES || align > n)
+    align = n;
+  for (size_t i = 0; i < align; ++i)
+    if (*--char_ptr == c_in)
       return (void *) char_ptr;
 
-  /* All these elucidatory comments refer to 4-byte longwords,
-     but the theory applies equally well to 8-byte longwords.  */
+  const op_t *word_ptr = (const op_t *) char_ptr;
+  n -= align;
+  if (__glibc_unlikely (n == 0))
+    return NULL;
 
-  longword_ptr = (const unsigned long int *) char_ptr;
+  /* Compute the address of the word containing the initial byte. */
+  const op_t *lword = word_containing (s);
 
-  /* Bits 31, 24, 16, and 8 of this number are zero.  Call these bits
-     the "holes."  Note that there is a hole just to the left of
-     each byte, with an extra at the end:
-
-     bits:  01111110 11111110 11111110 11111111
-     bytes: AAAAAAAA BBBBBBBB CCCCCCCC DDDDDDDD
-
-     The 1-bits make sure that carries propagate to the next 0-bit.
-     The 0-bits provide holes for carries to fall into.  */
-  magic_bits = -1;
-  magic_bits = magic_bits / 0xff * 0xfe << 1 >> 1 | 1;
-
-  /* Set up a longword, each of whose bytes is C.  */
-  charmask = c | (c << 8);
-  charmask |= charmask << 16;
-#if LONG_MAX > LONG_MAX_32_BITS
-  charmask |= charmask << 32;
-#endif
+  /* Set up a word, each of whose bytes is C.  */
+  op_t repeated_c = repeat_bytes (c_in);
 
-  /* Instead of the traditional loop which tests each character,
-     we will test a longword at a time.  The tricky part is testing
-     if *any of the four* bytes in the longword in question are zero.  */
-  while (n >= sizeof (longword))
+  while (word_ptr != lword)
     {
-      /* We tentatively exit the loop if adding MAGIC_BITS to
-	 LONGWORD fails to change any of the hole bits of LONGWORD.
-
-	 1) Is this safe?  Will it catch all the zero bytes?
-	 Suppose there is a byte with all zeros.  Any carry bits
-	 propagating from its left will fall into the hole at its
-	 least significant bit and stop.  Since there will be no
-	 carry from its most significant bit, the LSB of the
-	 byte to the left will be unchanged, and the zero will be
-	 detected.
-
-	 2) Is this worthwhile?  Will it ignore everything except
-	 zero bytes?  Suppose every byte of LONGWORD has a bit set
-	 somewhere.  There will be a carry into bit 8.  If bit 8
-	 is set, this will carry into bit 16.  If bit 8 is clear,
-	 one of bits 9-15 must be set, so there will be a carry
-	 into bit 16.  Similarly, there will be a carry into bit
-	 24.  If one of bits 24-30 is set, there will be a carry
-	 into bit 31, so all of the hole bits will be changed.
-
-	 The one misfire occurs when bits 24-30 are clear and bit
-	 31 is set; in this case, the hole at bit 31 is not
-	 changed.  If we had access to the processor carry flag,
-	 we could close this loophole by putting the fourth hole
-	 at bit 32!
-
-	 So it ignores everything except 128's, when they're aligned
-	 properly.
-
-	 3) But wait!  Aren't we looking for C, not zero?
-	 Good point.  So what we do is XOR LONGWORD with a longword,
-	 each of whose bytes is C.  This turns each byte that is C
-	 into a zero.  */
-
-      longword = *--longword_ptr ^ charmask;
-
-      /* Add MAGIC_BITS to LONGWORD.  */
-      if ((((longword + magic_bits)
-
-	    /* Set those bits that were unchanged by the addition.  */
-	    ^ ~longword)
-
-	   /* Look at only the hole bits.  If any of the hole bits
-	      are unchanged, most likely one of the bytes was a
-	      zero.  */
-	   & ~magic_bits) != 0)
+      op_t word = *--word_ptr;
+      if (has_eq (word, repeated_c))
 	{
-	  /* Which of the bytes was C?  If none of them were, it was
-	     a misfire; continue the search.  */
-
-	  const unsigned char *cp = (const unsigned char *) longword_ptr;
-
-#if LONG_MAX > 2147483647
-	  if (cp[7] == c)
-	    return (void *) &cp[7];
-	  if (cp[6] == c)
-	    return (void *) &cp[6];
-	  if (cp[5] == c)
-	    return (void *) &cp[5];
-	  if (cp[4] == c)
-	    return (void *) &cp[4];
-#endif
-	  if (cp[3] == c)
-	    return (void *) &cp[3];
-	  if (cp[2] == c)
-	    return (void *) &cp[2];
-	  if (cp[1] == c)
-	    return (void *) &cp[1];
-	  if (cp[0] == c)
-	    return (void *) cp;
+	  /* We found a match, but it might be in a byte past the start
+	     of the array.  */
+	  char *ret = (char *) word_ptr + index_last_eq (word, repeated_c);
+	  return ret >= (char *) s ? ret : NULL;
 	}
-
-      n -= sizeof (longword);
     }
-
-  char_ptr = (const unsigned char *) longword_ptr;
-
-  while (n-- > 0)
-    {
-      if (*--char_ptr == c)
-	return (void *) char_ptr;
-    }
-
-  return 0;
+  return NULL;
 }
 #ifndef MEMRCHR
-# ifdef weak_alias
 weak_alias (__memrchr, memrchr)
-# endif
 #endif
-- 
2.34.1


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH v5 12/17] hppa: Add memcopy.h
  2022-09-19 19:59 [PATCH v5 00/17] Improve generic string routines Adhemerval Zanella
                   ` (10 preceding siblings ...)
  2022-09-19 19:59 ` [PATCH v5 11/17] string: Improve generic memrchr Adhemerval Zanella
@ 2022-09-19 19:59 ` Adhemerval Zanella
  2022-09-19 19:59 ` [PATCH v5 13/17] hppa: Add string-fzb.h and string-fzi.h Adhemerval Zanella
                   ` (5 subsequent siblings)
  17 siblings, 0 replies; 55+ messages in thread
From: Adhemerval Zanella @ 2022-09-19 19:59 UTC (permalink / raw)
  To: libc-alpha; +Cc: Richard Henderson

From: Richard Henderson <rth@twiddle.net>

GCC's combine pass cannot merge (x >> c | y << (32 - c)) into a
double-word shift unless (1) the subtract is in the same basic block
and (2) the result of the subtract is used exactly once.  Neither
condition is true for any use of MERGE.

By forcing the use of a double-word shift, we not only reduce
contention on SAR, but also allow the setting of SAR to be hoisted
outside of a loop.
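
For reference, the generic little-endian-style MERGE from
sysdeps/generic/memcopy.h is the two-shift-and-or form that this patch
replaces on hppa (sketched here with a fixed 32-bit word; SHL + SHR must
equal the word size in bits):

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t op_t;  /* 32-bit word for a deterministic example */

/* Combine the high part of W0 with the low part of W1, little-endian
   style: two shifts and an or.  The hppa version collapses this pair
   into a single shrpw/shrpd double-word shift through SAR.  */
static op_t
merge (op_t w0, int shl, op_t w1, int shr)
{
  return (w0 >> shl) | (w1 << shr);
}
```

Because SAR is set once from the fixed misalignment, hoisting that
setup out of the copy loop is exactly the win the commit message
describes.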

Checked on hppa-linux-gnu.
---
 sysdeps/hppa/memcopy.h | 42 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 42 insertions(+)
 create mode 100644 sysdeps/hppa/memcopy.h

diff --git a/sysdeps/hppa/memcopy.h b/sysdeps/hppa/memcopy.h
new file mode 100644
index 0000000000..288b5e9520
--- /dev/null
+++ b/sysdeps/hppa/memcopy.h
@@ -0,0 +1,42 @@
+/* Definitions for memory copy functions, PA-RISC version.
+   Copyright (C) 2022 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <sysdeps/generic/memcopy.h>
+
+/* Use a single double-word shift instead of two shifts and an ior.
+   If the uses of MERGE were close to the computation of shl/shr,
+   the compiler might have been able to create this itself.
+   But instead that computation is well separated.
+
+   Using an inline function instead of a macro is the easiest way
+   to ensure that the types are correct.  */
+
+#undef MERGE
+
+static inline op_t
+MERGE (op_t w0, int shl, op_t w1, int shr)
+{
+  _Static_assert (OPSIZ == 4 || OPSIZ == 8, "Invalid OPSIZE");
+
+  op_t res;
+  if (OPSIZ == 4)
+    asm ("shrpw %1,%2,%%sar,%0" : "=r"(res) : "r"(w0), "r"(w1), "q"(shr));
+  else if (OPSIZ == 8)
+    asm ("shrpd %1,%2,%%sar,%0" : "=r"(res) : "r"(w0), "r"(w1), "q"(shr));
+  return res;
+}
-- 
2.34.1


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH v5 13/17] hppa: Add string-fzb.h and string-fzi.h
  2022-09-19 19:59 [PATCH v5 00/17] Improve generic string routines Adhemerval Zanella
                   ` (11 preceding siblings ...)
  2022-09-19 19:59 ` [PATCH v5 12/17] hppa: Add memcopy.h Adhemerval Zanella
@ 2022-09-19 19:59 ` Adhemerval Zanella
  2022-09-19 19:59 ` [PATCH v5 14/17] alpha: " Adhemerval Zanella
                   ` (4 subsequent siblings)
  17 siblings, 0 replies; 55+ messages in thread
From: Adhemerval Zanella @ 2022-09-19 19:59 UTC (permalink / raw)
  To: libc-alpha; +Cc: Richard Henderson

From: Richard Henderson <rth@twiddle.net>

Use UXOR,SBZ to test for a zero byte within a word.  While we can
get semi-decent code out of asm-goto, we would do slightly better
with a compiler builtin.

For index_zero et al, sequential testing of bytes is less expensive than
any tricks that involve a count-leading-zeros insn that we don't have.
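
The sequential strategy can be sketched in plain C: test each byte in
big-endian memory order (as on hppa, most-significant byte first) and
stop at the first zero.  This is an illustration of the approach the
asm version takes, not the header itself:

```c
#include <assert.h>
#include <stdint.h>

/* Index of the first zero byte of X in big-endian memory order.
   X is known to contain at least one zero byte, so if the first three
   bytes are nonzero the answer must be 3.  */
static unsigned int
index_first_zero (uint32_t x)
{
  for (unsigned int i = 0; i < 3; i++)
    if (((x >> (24 - i * 8)) & 0xff) == 0)
      return i;
  return 3;
}
```

Each byte test maps to one extrw,u with a conditional nullify, which is
why this beats synthesizing a count-leading-zeros sequence on this
target.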

Checked on hppa-linux-gnu.
---
 sysdeps/hppa/string-fzb.h |  69 +++++++++++++++++++
 sysdeps/hppa/string-fzi.h | 135 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 204 insertions(+)
 create mode 100644 sysdeps/hppa/string-fzb.h
 create mode 100644 sysdeps/hppa/string-fzi.h

diff --git a/sysdeps/hppa/string-fzb.h b/sysdeps/hppa/string-fzb.h
new file mode 100644
index 0000000000..dc02757522
--- /dev/null
+++ b/sysdeps/hppa/string-fzb.h
@@ -0,0 +1,69 @@
+/* Zero byte detection, boolean.  HPPA version.
+   Copyright (C) 2022 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef _STRING_FZB_H
+#define _STRING_FZB_H 1
+
+#include <string-optype.h>
+
+/* Determine if any byte within X is zero.  This is a pure boolean test.  */
+
+static inline _Bool
+has_zero (op_t x)
+{
+  _Static_assert (sizeof (op_t) == 4, "64-bit not supported");
+
+  /* It's more useful to expose a control transfer to the compiler
+     than to expose a proper boolean result.  */
+  asm goto ("uxor,sbz %%r0,%0,%%r0\n\t"
+	    "b,n %l1" : : "r"(x) : : nbz);
+  return 1;
+ nbz:
+  return 0;
+}
+
+/* Likewise, but for byte equality between X1 and X2.  */
+
+static inline _Bool
+has_eq (op_t x1, op_t x2)
+{
+  _Static_assert (sizeof (op_t) == 4, "64-bit not supported");
+
+  asm goto ("uxor,sbz %0,%1,%%r0\n\t"
+	    "b,n %l2" : : "r"(x1), "r"(x2) : : nbz);
+  return 1;
+ nbz:
+  return 0;
+}
+
+/* Likewise, but for zeros in X1 and equal bytes between X1 and X2.  */
+
+static inline _Bool
+has_zero_eq (op_t x1, op_t x2)
+{
+  _Static_assert (sizeof (op_t) == 4, "64-bit not supported");
+
+  asm goto ("uxor,sbz %%r0,%0,%%r0\n\t"
+	    "uxor,nbz %0,%1,%%r0\n\t"
+	    "b,n %l2" : : "r"(x1), "r"(x2) : : sbz);
+  return 0;
+ sbz:
+  return 1;
+}
+
+#endif /* _STRING_FZB_H */
diff --git a/sysdeps/hppa/string-fzi.h b/sysdeps/hppa/string-fzi.h
new file mode 100644
index 0000000000..2b8747ddbd
--- /dev/null
+++ b/sysdeps/hppa/string-fzi.h
@@ -0,0 +1,135 @@
+/* string-fzi.h -- zero byte detection; indexes.  HPPA version.
+   Copyright (C) 2022 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef _STRING_FZI_H
+#define _STRING_FZI_H 1
+
+#include <string-optype.h>
+
+/* Given a word X that is known to contain a zero byte, return the
+   index of the first such byte within the word in memory order.  */
+
+static inline unsigned int
+index_first_zero (op_t x)
+{
+  unsigned int ret;
+
+  _Static_assert (sizeof (op_t) == 4, "64-bit not supported");
+
+  /* Since we have no clz insn, direct tests of the bytes are faster
+     than loading up the constants to do the masking.  */
+  asm ("extrw,u,<> %1,23,8,%%r0\n\t"
+       "ldi 2,%0\n\t"
+       "extrw,u,<> %1,15,8,%%r0\n\t"
+       "ldi 1,%0\n\t"
+       "extrw,u,<> %1,7,8,%%r0\n\t"
+       "ldi 0,%0"
+       : "=r"(ret) : "r"(x), "0"(3));
+
+  return ret;
+}
+
+/* Similarly, but perform the search for byte equality between X1 and X2.  */
+
+static inline unsigned int
+index_first_eq (op_t x1, op_t x2)
+{
+  return index_first_zero (x1 ^ x2);
+}
+
+/* Similarly, but perform the search for zero within X1 or
+   equality between X1 and X2.  */
+
+static inline unsigned int
+index_first_zero_eq (op_t x1, op_t x2)
+{
+  unsigned int ret;
+
+  _Static_assert (sizeof (op_t) == 4, "64-bit not supported");
+
+  /* Since we have no clz insn, direct tests of the bytes are faster
+     than loading up the constants to do the masking.  */
+  asm ("extrw,u,= %1,23,8,%%r0\n\t"
+       "extrw,u,<> %2,23,8,%%r0\n\t"
+       "ldi 2,%0\n\t"
+       "extrw,u,= %1,15,8,%%r0\n\t"
+       "extrw,u,<> %2,15,8,%%r0\n\t"
+       "ldi 1,%0\n\t"
+       "extrw,u,= %1,7,8,%%r0\n\t"
+       "extrw,u,<> %2,7,8,%%r0\n\t"
+       "ldi 0,%0"
+       : "=r"(ret) : "r"(x1), "r"(x1 ^ x2), "0"(3));
+
+  return ret;
+}
+
+/* Similarly, but perform the search for zero within X1 or
+   inequality between X1 and X2. */
+
+static inline unsigned int
+index_first_zero_ne (op_t x1, op_t x2)
+{
+  unsigned int ret;
+
+  _Static_assert (sizeof (op_t) == 4, "64-bit not supported");
+
+  /* Since we have no clz insn, direct tests of the bytes are faster
+     than loading up the constants to do the masking.  */
+  asm ("extrw,u,<> %2,23,8,%%r0\n\t"
+       "extrw,u,<> %1,23,8,%%r0\n\t"
+       "ldi 2,%0\n\t"
+       "extrw,u,<> %2,15,8,%%r0\n\t"
+       "extrw,u,<> %1,15,8,%%r0\n\t"
+       "ldi 1,%0\n\t"
+       "extrw,u,<> %2,7,8,%%r0\n\t"
+       "extrw,u,<> %1,7,8,%%r0\n\t"
+       "ldi 0,%0"
+       : "=r"(ret) : "r"(x1), "r"(x1 ^ x2), "0"(3));
+
+  return ret;
+}
+
+/* Similarly, but search for the last zero within X.  */
+
+static inline unsigned int
+index_last_zero (op_t x)
+{
+  unsigned int ret;
+
+  _Static_assert (sizeof (op_t) == 4, "64-bit not supported");
+
+  /* Since we have no ctz insn, direct tests of the bytes are faster
+     than loading up the constants to do the masking.  */
+  asm ("extrw,u,<> %1,15,8,%%r0\n\t"
+       "ldi 1,%0\n\t"
+       "extrw,u,<> %1,23,8,%%r0\n\t"
+       "ldi 2,%0\n\t"
+       "extrw,u,<> %1,31,8,%%r0\n\t"
+       "ldi 3,%0"
+       : "=r"(ret) : "r"(x), "0"(0));
+
+  return ret;
+}
+
+static inline unsigned int
+index_last_eq (op_t x1, op_t x2)
+{
+  return index_last_zero (x1 ^ x2);
+}
+
+#endif /* _STRING_FZI_H */
-- 
2.34.1


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH v5 14/17] alpha: Add string-fzb.h and string-fzi.h
  2022-09-19 19:59 [PATCH v5 00/17] Improve generic string routines Adhemerval Zanella
                   ` (12 preceding siblings ...)
  2022-09-19 19:59 ` [PATCH v5 13/17] hppa: Add string-fzb.h and string-fzi.h Adhemerval Zanella
@ 2022-09-19 19:59 ` Adhemerval Zanella
  2022-09-19 19:59 ` [PATCH v5 15/17] arm: Add string-fza.h Adhemerval Zanella
                   ` (3 subsequent siblings)
  17 siblings, 0 replies; 55+ messages in thread
From: Adhemerval Zanella @ 2022-09-19 19:59 UTC (permalink / raw)
  To: libc-alpha; +Cc: Richard Henderson

From: Richard Henderson <rth@twiddle.net>

While alpha has the more important string functions in assembly,
there are still a few for which the generic routines are used.

Use the CMPBGE insn, via the builtin, for testing of zeros.  Use a
simplified expansion of __builtin_ctz when the insn isn't available.

Checked on alpha-linux-gnu.
---
 sysdeps/alpha/string-fzb.h |  51 +++++++++++++++++
 sysdeps/alpha/string-fzi.h | 113 +++++++++++++++++++++++++++++++++++++
 2 files changed, 164 insertions(+)
 create mode 100644 sysdeps/alpha/string-fzb.h
 create mode 100644 sysdeps/alpha/string-fzi.h

diff --git a/sysdeps/alpha/string-fzb.h b/sysdeps/alpha/string-fzb.h
new file mode 100644
index 0000000000..6b19a2106c
--- /dev/null
+++ b/sysdeps/alpha/string-fzb.h
@@ -0,0 +1,51 @@
+/* Zero byte detection; boolean.  Alpha version.
+   Copyright (C) 2022 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef _STRING_FZB_H
+#define _STRING_FZB_H 1
+
+#include <string-optype.h>
+
+/* Note that since CMPBGE creates a bit mask rather than a byte mask,
+   we cannot simply provide a target-specific string-fza.h.  */
+
+/* Determine if any byte within X is zero.  This is a pure boolean test.  */
+
+static inline _Bool
+has_zero (op_t x)
+{
+  return __builtin_alpha_cmpbge (0, x) != 0;
+}
+
+/* Likewise, but for byte equality between X1 and X2.  */
+
+static inline _Bool
+has_eq (op_t x1, op_t x2)
+{
+  return has_zero (x1 ^ x2);
+}
+
+/* Likewise, but for zeros in X1 and equal bytes between X1 and X2.  */
+
+static inline _Bool
+has_zero_eq (op_t x1, op_t x2)
+{
+  return has_zero (x1) | has_eq (x1, x2);
+}
+
+#endif /* _STRING_FZB_H */
diff --git a/sysdeps/alpha/string-fzi.h b/sysdeps/alpha/string-fzi.h
new file mode 100644
index 0000000000..c1a4683590
--- /dev/null
+++ b/sysdeps/alpha/string-fzi.h
@@ -0,0 +1,113 @@
+/* string-fzi.h -- zero byte detection; indices.  Alpha version.
+   Copyright (C) 2022 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef _STRING_FZI_H
+#define _STRING_FZI_H
+
+#include <limits.h>
+#include <string-optype.h>
+
+/* Note that since CMPBGE creates a bit mask rather than a byte mask,
+   we cannot simply provide a target-specific string-fza.h.  */
+
+/* A subroutine for the index_zero functions.  Given a bitmask C,
+   return the index of the first bit set in memory order.  */
+
+static inline unsigned int
+index_first_ (unsigned long int c)
+{
+#ifdef __alpha_cix__
+  return __builtin_ctzl (c);
+#else
+  c = c & -c;
+  return (c & 0xf0 ? 4 : 0) + (c & 0xcc ? 2 : 0) + (c & 0xaa ? 1 : 0);
+#endif
+}
+
+/* Similarly, but return the (memory order) index of the last bit
+   that is non-zero.  Note that only the least significant 8 bits may be
+   nonzero.  */
+
+static inline unsigned int
+index_last_ (unsigned long int x)
+{
+#ifdef __alpha_cix__
+  return __builtin_clzl (x) ^ 63;
+#else
+  unsigned r = 0;
+  if (x & 0xf0)
+    r += 4;
+  if (x & (0xc << r))
+    r += 2;
+  if (x & (0x2 << r))
+    r += 1;
+  return r;
+#endif
+}
+
+/* Given a word X that is known to contain a zero byte, return the
+   index of the first such within the word in memory order.  */
+
+static inline unsigned int
+index_first_zero (op_t x)
+{
+  return index_first_ (__builtin_alpha_cmpbge (0, x));
+}
+
+/* Similarly, but perform the test for byte equality between X1 and X2.  */
+
+static inline unsigned int
+index_first_eq (op_t x1, op_t x2)
+{
+  return index_first_zero (x1 ^ x2);
+}
+
+/* Similarly, but perform the search for zero within X1 or
+   equality between X1 and X2.  */
+
+static inline unsigned int
+index_first_zero_eq (op_t x1, op_t x2)
+{
+  return index_first_ (__builtin_alpha_cmpbge (0, x1)
+		       | __builtin_alpha_cmpbge (0, x1 ^ x2));
+}
+
+/* Similarly, but perform the search for zero within X1 or
+   inequality between X1 and X2.  */
+
+static inline unsigned int
+index_first_zero_ne (op_t x1, op_t x2)
+{
+  return index_first_ (__builtin_alpha_cmpbge (0, x1)
+		       | (__builtin_alpha_cmpbge (0, x1 ^ x2) ^ 0xFF));
+}
+
+/* Similarly, but search for the last zero within X.  */
+
+static inline unsigned int
+index_last_zero (op_t x)
+{
+  return index_last_ (__builtin_alpha_cmpbge (0, x));
+}
+
+static inline unsigned int
+index_last_eq (op_t x1, op_t x2)
+{
+  return index_last_zero (x1 ^ x2);
+}
+
+#endif /* _STRING_FZI_H */
-- 
2.34.1



* [PATCH v5 15/17] arm: Add string-fza.h
  2022-09-19 19:59 [PATCH v5 00/17] Improve generic string routines Adhemerval Zanella
                   ` (13 preceding siblings ...)
  2022-09-19 19:59 ` [PATCH v5 14/17] alpha: " Adhemerval Zanella
@ 2022-09-19 19:59 ` Adhemerval Zanella
  2022-09-19 19:59 ` [PATCH v5 16/17] powerpc: " Adhemerval Zanella
                   ` (2 subsequent siblings)
  17 siblings, 0 replies; 55+ messages in thread
From: Adhemerval Zanella @ 2022-09-19 19:59 UTC (permalink / raw)
  To: libc-alpha; +Cc: Richard Henderson

From: Richard Henderson <rth@twiddle.net>

While arm has the more important string functions in assembly,
there are still a few generic routines used.

Use the UQSUB8 insn for testing of zeros.

Checked on armv7-linux-gnueabihf.
---
 sysdeps/arm/armv6t2/string-fza.h | 70 ++++++++++++++++++++++++++++++++
 1 file changed, 70 insertions(+)
 create mode 100644 sysdeps/arm/armv6t2/string-fza.h

diff --git a/sysdeps/arm/armv6t2/string-fza.h b/sysdeps/arm/armv6t2/string-fza.h
new file mode 100644
index 0000000000..4fe2e8383f
--- /dev/null
+++ b/sysdeps/arm/armv6t2/string-fza.h
@@ -0,0 +1,70 @@
+/* Zero byte detection; basics.  ARM version.
+   Copyright (C) 2022 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef _STRING_FZA_H
+#define _STRING_FZA_H 1
+
+#include <string-optype.h>
+#include <string-maskoff.h>
+
+/* This function returns at least one bit set within every byte
+   of X that is zero.  */
+
+static inline op_t
+find_zero_all (op_t x)
+{
+  /* Use unsigned saturated subtraction from 1 in each byte.
+     That leaves 1 for every byte that was zero.  */
+  op_t ret, ones = repeat_bytes (0x01);
+  asm ("uqsub8 %0,%1,%2" : "=r"(ret) : "r"(ones), "r"(x));
+  return ret;
+}
+
+/* Identify bytes that are equal between X1 and X2.  */
+
+static inline op_t
+find_eq_all (op_t x1, op_t x2)
+{
+  return find_zero_all (x1 ^ x2);
+}
+
+/* Identify zero bytes in X1 or equality between X1 and X2.  */
+
+static inline op_t
+find_zero_eq_all (op_t x1, op_t x2)
+{
+  return find_zero_all (x1) | find_zero_all (x1 ^ x2);
+}
+
+/* Identify zero bytes in X1 or inequality between X1 and X2.  */
+
+static inline op_t
+find_zero_ne_all (op_t x1, op_t x2)
+{
+  /* Make use of the fact that we'll already have ONES in a register.  */
+  op_t ones = repeat_bytes (0x01);
+  return find_zero_all (x1) | (find_zero_all (x1 ^ x2) ^ ones);
+}
+
+/* Define the "inexact" versions in terms of the exact versions.  */
+#define find_zero_low		find_zero_all
+#define find_eq_low		find_eq_all
+#define find_zero_eq_low	find_zero_eq_all
+#define find_zero_ne_low	find_zero_ne_all
+
+#endif /* _STRING_FZA_H */
-- 
2.34.1



* [PATCH v5 16/17] powerpc: Add string-fza.h
  2022-09-19 19:59 [PATCH v5 00/17] Improve generic string routines Adhemerval Zanella
                   ` (14 preceding siblings ...)
  2022-09-19 19:59 ` [PATCH v5 15/17] arm: Add string-fza.h Adhemerval Zanella
@ 2022-09-19 19:59 ` Adhemerval Zanella
  2022-09-19 19:59 ` [PATCH v5 17/17] sh: Add string-fzb.h Adhemerval Zanella
  2022-12-05 17:07 ` [PATCH v5 00/17] Improve generic string routines Xi Ruoyao
  17 siblings, 0 replies; 55+ messages in thread
From: Adhemerval Zanella @ 2022-09-19 19:59 UTC (permalink / raw)
  To: libc-alpha; +Cc: Richard Henderson

From: Richard Henderson <rth@twiddle.net>

While ppc has the more important string functions in assembly,
there are still a few generic routines used.

Use the POWER6 CMPB insn for testing of zeros.

Checked on powerpc64le-linux-gnu.
---
 sysdeps/powerpc/string-fza.h | 70 ++++++++++++++++++++++++++++++++++++
 1 file changed, 70 insertions(+)
 create mode 100644 sysdeps/powerpc/string-fza.h

diff --git a/sysdeps/powerpc/string-fza.h b/sysdeps/powerpc/string-fza.h
new file mode 100644
index 0000000000..21fc697a9c
--- /dev/null
+++ b/sysdeps/powerpc/string-fza.h
@@ -0,0 +1,70 @@
+/* Zero byte detection; basics.  PowerPC version.
+   Copyright (C) 2022 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef _POWERPC_STRING_FZA_H
+#define _POWERPC_STRING_FZA_H 1
+
+/* PowerISA 2.05 (POWER6) provides the cmpb instruction.  */
+#ifdef _ARCH_PWR6
+# include <string-optype.h>
+
+/* This function returns 0xff for each byte that is
+   equal between X1 and X2.  */
+
+static inline op_t
+find_eq_all (op_t x1, op_t x2)
+{
+  op_t ret;
+  asm ("cmpb %0,%1,%2" : "=r"(ret) : "r"(x1), "r"(x2));
+  return ret;
+}
+
+/* This function returns 0xff for each byte that is zero in X.  */
+
+static inline op_t
+find_zero_all (op_t x)
+{
+  return find_eq_all (x, 0);
+}
+
+/* Identify zero bytes in X1 or equality between X1 and X2.  */
+
+static inline op_t
+find_zero_eq_all (op_t x1, op_t x2)
+{
+  return find_zero_all (x1) | find_eq_all (x1, x2);
+}
+
+/* Identify zero bytes in X1 or inequality between X1 and X2.  */
+
+static inline op_t
+find_zero_ne_all (op_t x1, op_t x2)
+{
+  return find_zero_all (x1) | ~find_eq_all (x1, x2);
+}
+
+/* Define the "inexact" versions in terms of the exact versions.  */
+# define find_zero_low		find_zero_all
+# define find_eq_low		find_eq_all
+# define find_zero_eq_low	find_zero_eq_all
+# define find_zero_ne_low	find_zero_ne_all
+#else
+# include <sysdeps/generic/string-fza.h>
+#endif /* _ARCH_PWR6  */
+
+#endif /* _POWERPC_STRING_FZA_H  */
-- 
2.34.1



* [PATCH v5 17/17] sh: Add string-fzb.h
  2022-09-19 19:59 [PATCH v5 00/17] Improve generic string routines Adhemerval Zanella
                   ` (15 preceding siblings ...)
  2022-09-19 19:59 ` [PATCH v5 16/17] powerpc: " Adhemerval Zanella
@ 2022-09-19 19:59 ` Adhemerval Zanella
  2022-12-05 17:07 ` [PATCH v5 00/17] Improve generic string routines Xi Ruoyao
  17 siblings, 0 replies; 55+ messages in thread
From: Adhemerval Zanella @ 2022-09-19 19:59 UTC (permalink / raw)
  To: libc-alpha

Use the SH cmp/str insn to implement has_{zero,eq,zero_eq}.

Checked on sh4-linux-gnu.
---
 sysdeps/sh/string-fzb.h | 53 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 53 insertions(+)
 create mode 100644 sysdeps/sh/string-fzb.h

diff --git a/sysdeps/sh/string-fzb.h b/sysdeps/sh/string-fzb.h
new file mode 100644
index 0000000000..62823b4d0e
--- /dev/null
+++ b/sysdeps/sh/string-fzb.h
@@ -0,0 +1,53 @@
+/* Zero byte detection; boolean.  SH4 version.
+   Copyright (C) 2022 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef STRING_FZB_H
+#define STRING_FZB_H 1
+
+#include <string-optype.h>
+
+/* Determine if any byte within X is zero.  This is a pure boolean test.  */
+
+static inline _Bool
+has_zero (op_t x)
+{
+  op_t zero = 0x0, ret;
+  asm volatile ("cmp/str %1,%2\n"
+		"movt %0\n"
+		: "=r" (ret)
+		: "r" (zero), "r" (x));
+  return ret;
+}
+
+/* Likewise, but for byte equality between X1 and X2.  */
+
+static inline _Bool
+has_eq (op_t x1, op_t x2)
+{
+  return has_zero (x1 ^ x2);
+}
+
+/* Likewise, but for zeros in X1 and equal bytes between X1 and X2.  */
+
+static inline _Bool
+has_zero_eq (op_t x1, op_t x2)
+{
+  return has_zero (x1) | has_eq (x1, x2);
+}
+
+#endif /* STRING_FZB_H */
-- 
2.34.1



* Re: [PATCH v5 02/17] Parameterize OP_T_THRES from memcopy.h
  2022-09-19 19:59 ` [PATCH v5 02/17] Parameterize OP_T_THRES " Adhemerval Zanella
@ 2022-09-20 10:49   ` Carlos O'Donell
  0 siblings, 0 replies; 55+ messages in thread
From: Carlos O'Donell @ 2022-09-20 10:49 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: libc-alpha, Richard Henderson

On Mon, Sep 19, 2022 at 04:59:05PM -0300, Adhemerval Zanella via Libc-alpha wrote:
> From: Richard Henderson <rth@twiddle.net>
> 
> It moves OP_T_THRES out of memcopy.h to its own header and adjust
> each architecture that redefines it.
> 
> Checked with a build and check with run-built-tests=no for all major
> Linux ABIs.

This is a generic refactor which I think can go in regardless.

We can always reorganize again if we end up with too many split headers
and want to talk about unrolling in different contexts.

LGTM.

Reviewed-by: Carlos O'Donell <carlos@redhat.com>

> Co-authored-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
> ---
>  string/memcmp.c                            |  3 ---
>  sysdeps/generic/memcopy.h                  |  4 +---
>  sysdeps/generic/string-opthr.h             | 25 ++++++++++++++++++++++
>  sysdeps/i386/memcopy.h                     |  3 ---
>  sysdeps/i386/string-opthr.h                | 25 ++++++++++++++++++++++
>  sysdeps/m68k/memcopy.h                     |  3 ---
>  sysdeps/powerpc/powerpc32/power4/memcopy.h |  5 -----
>  7 files changed, 51 insertions(+), 17 deletions(-)
>  create mode 100644 sysdeps/generic/string-opthr.h
>  create mode 100644 sysdeps/i386/string-opthr.h
> 
> diff --git a/string/memcmp.c b/string/memcmp.c
> index 6a9ceb8ac3..7c4606c2d0 100644
> --- a/string/memcmp.c
> +++ b/string/memcmp.c
> @@ -48,9 +48,6 @@
>     and store.  Must be an unsigned type.  */
>  # define OPSIZ	(sizeof (op_t))
>  
> -/* Threshold value for when to enter the unrolled loops.  */
> -# define OP_T_THRES	16
> -
>  /* Type to use for unaligned operations.  */
>  typedef unsigned char byte;
>  
> diff --git a/sysdeps/generic/memcopy.h b/sysdeps/generic/memcopy.h
> index efe5f2475d..a6baa4dfbb 100644
> --- a/sysdeps/generic/memcopy.h
> +++ b/sysdeps/generic/memcopy.h
> @@ -57,6 +57,7 @@
>  
>  /* Type to use for aligned memory operations.  */
>  #include <string-optype.h>
> +#include <string-opthr.h>
>  #define OPSIZ	(sizeof (op_t))
>  
>  /* Type to use for unaligned operations.  */
> @@ -188,9 +189,6 @@ extern void _wordcopy_bwd_dest_aligned (long int, long int, size_t)
>  
>  #endif
>  
> -/* Threshold value for when to enter the unrolled loops.  */
> -#define	OP_T_THRES	16
> -
>  /* Set to 1 if memcpy is safe to use for forward-copying memmove with
>     overlapping addresses.  This is 0 by default because memcpy implementations
>     are generally not safe for overlapping addresses.  */
> diff --git a/sysdeps/generic/string-opthr.h b/sysdeps/generic/string-opthr.h
> new file mode 100644
> index 0000000000..eabd9fd669
> --- /dev/null
> +++ b/sysdeps/generic/string-opthr.h
> @@ -0,0 +1,25 @@
> +/* Define a threshold for word access.  Generic version.
> +   Copyright (C) 2022 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +#ifndef _STRING_OPTHR_H
> +#define _STRING_OPTHR_H 1
> +
> +/* Threshold value for when to enter the unrolled loops.  */
> +#define OP_T_THRES	16
> +
> +#endif /* string-opthr.h */
> diff --git a/sysdeps/i386/memcopy.h b/sysdeps/i386/memcopy.h
> index 8cbf182096..66f5665f82 100644
> --- a/sysdeps/i386/memcopy.h
> +++ b/sysdeps/i386/memcopy.h
> @@ -18,9 +18,6 @@
>  
>  #include <sysdeps/generic/memcopy.h>
>  
> -#undef	OP_T_THRES
> -#define	OP_T_THRES	8
> -
>  #undef	BYTE_COPY_FWD
>  #define BYTE_COPY_FWD(dst_bp, src_bp, nbytes)				      \
>    do {									      \
> diff --git a/sysdeps/i386/string-opthr.h b/sysdeps/i386/string-opthr.h
> new file mode 100644
> index 0000000000..ed3e4b2ddb
> --- /dev/null
> +++ b/sysdeps/i386/string-opthr.h
> @@ -0,0 +1,25 @@
> +/* Define a threshold for word access.  i386 version.
> +   Copyright (C) 2018 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +#ifndef I386_STRING_OPTHR_H
> +#define I386_STRING_OPTHR_H 1
> +
> +/* Threshold value for when to enter the unrolled loops.  */
> +#define OP_T_THRES	8
> +
> +#endif /* I386_STRING_OPTHR_H */
> diff --git a/sysdeps/m68k/memcopy.h b/sysdeps/m68k/memcopy.h
> index cf147f2c4a..3777baac21 100644
> --- a/sysdeps/m68k/memcopy.h
> +++ b/sysdeps/m68k/memcopy.h
> @@ -20,9 +20,6 @@
>  
>  #if	defined(__mc68020__) || defined(mc68020)
>  
> -#undef	OP_T_THRES
> -#define	OP_T_THRES	16
> -
>  /* WORD_COPY_FWD and WORD_COPY_BWD are not symmetric on the 68020,
>     because of its weird instruction overlap characteristics.  */
>  
> diff --git a/sysdeps/powerpc/powerpc32/power4/memcopy.h b/sysdeps/powerpc/powerpc32/power4/memcopy.h
> index a98f6662d8..d27caa2277 100644
> --- a/sysdeps/powerpc/powerpc32/power4/memcopy.h
> +++ b/sysdeps/powerpc/powerpc32/power4/memcopy.h
> @@ -50,11 +50,6 @@
>       [I fail to understand.  I feel stupid.  --roland]
>  */
>  
> -
> -/* Threshold value for when to enter the unrolled loops.  */
> -#undef	OP_T_THRES
> -#define OP_T_THRES 16
> -
>  /* Copy exactly NBYTES bytes from SRC_BP to DST_BP,
>     without any assumptions about alignment of the pointers.  */
>  #undef BYTE_COPY_FWD
> -- 
> 2.34.1
> 



* Re: [PATCH v5 03/17] Add string-maskoff.h generic header
  2022-09-19 19:59 ` [PATCH v5 03/17] Add string-maskoff.h generic header Adhemerval Zanella
@ 2022-09-20 11:43   ` Carlos O'Donell
  2022-09-22 17:31     ` Adhemerval Zanella Netto
  2023-01-05 22:49   ` Noah Goldstein
  1 sibling, 1 reply; 55+ messages in thread
From: Carlos O'Donell @ 2022-09-20 11:43 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: libc-alpha

On Mon, Sep 19, 2022 at 04:59:06PM -0300, Adhemerval Zanella via Libc-alpha wrote:
> Macros to operate on unaligned access for string operations:
> 
>   - create_mask: create a mask based on pointer alignment to sets up
>     non-zero bytes before the beginning of the word so a following
>     operation (such as find zero) might ignore these bytes.
> 
>   - highbit_mask: create a mask with high bit of each byte being 1,
>     and the low 7 bits being all the opposite of the input.

I really appreciate the effort you've put into documenting the purpose
of each function! It's really awesome to read such nice comments. Thank
you for that. I've gone through this to review the implementation and
the descriptions. I think it needs a little more tweaking.

> These macros are meant to be used on optimized vectorized string
> implementations.
> ---
>  sysdeps/generic/string-maskoff.h | 73 ++++++++++++++++++++++++++++++++
>  1 file changed, 73 insertions(+)
>  create mode 100644 sysdeps/generic/string-maskoff.h
> 
> diff --git a/sysdeps/generic/string-maskoff.h b/sysdeps/generic/string-maskoff.h
> new file mode 100644
> index 0000000000..831647bda6
> --- /dev/null
> +++ b/sysdeps/generic/string-maskoff.h
> @@ -0,0 +1,73 @@
> +/* Mask off bits.  Generic C version.
> +   Copyright (C) 2022 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +#ifndef _STRING_MASKOFF_H
> +#define _STRING_MASKOFF_H 1
> +
> +#include <endian.h>
> +#include <limits.h>
> +#include <stdint.h>
> +#include <string-optype.h>
> +
> +/* Provide a mask based on the pointer alignment that sets up non-zero
> +   bytes before the beginning of the word.  It is used to mask off
> +   undesirable bits from an aligned read from an unaligned pointer.
> +   For instance, on a 64 bits machine with a pointer alignment of

s/bits/-bit/g

While it is technically correct English to say "A 64-bits machine", this
is not the normative usage.

I suggest we use the normative "64-bit machine." We can talk about 64
bits, and alignment as bits etc.

> +   3 the function returns 0x0000000000ffffff for LE and 0xffffff0000000000
> +   (meaning to mask off the initial 3 bytes).  */

Missing "for BE" ?

> +static inline op_t
> +create_mask (uintptr_t i)
> +{
> +  i = i % sizeof (op_t);

OK. Wrap the value.

> +  if (__BYTE_ORDER == __LITTLE_ENDIAN)
> +    return ~(((op_t)-1) << (i * CHAR_BIT));
> +  else
> +    return ~(((op_t)-1) >> (i * CHAR_BIT));

OK. Shift.

> +}
> +
> +/* Setup an word with each byte being c_in.  For instance, on a 64 bits

s/an/a/g
s/bits/-bit/g

> +   machine with input as 0xce the functions returns 0xcececececececece.  */
> +static inline op_t
> +repeat_bytes (unsigned char c_in)
> +{
> +  return ((op_t)-1 / 0xff) * c_in;
> +}

How does the compiler do here on the various architectures to produce
the deposit/expand instructions that could be used for this operation?

aarch64 gcc trunk:
        ldrb    r3, [r7, #7]    @ zero_extendqisi2
        mov     r2, #16843009
        mul     r3, r2, r3

x86_64 gcc trunk:
        movzx   eax, BYTE PTR [rbp-4]
        imul    eax, eax, 16843009

s390x gcc12:
	ic      %r1,167(%r11)
        lhi     %r2,255
        nr      %r1,%r2
        ms      %r1,.L4-.L3(%r5)
        llgfr   %r1,%r1

Looks OK, and the static inline will get optimized with the rest of
the operations.

> +
> +/* Based on mask created by 'create_mask', mask off the high bit of each

s/on/on a/g

> +   byte in the mask.  It is used to mask off undesirable bits from an
> +   aligned read from an unaligned pointer, and also taking care to avoid

s/and/while/g

> +   match possible bytes meant to be matched.  For instance, on a 64 bits

Suggest:
matching possible bytes not meant to be matched.

s/bits/-bit/g

> +   machine with a mask created from a pointer with an alignment of 3
> +   (0x0000000000ffffff) the function returns 0x7f7f7f0000000000 for BE
> +   and 0x00000000007f7f7f for LE.  */
> +static inline op_t
> +highbit_mask (op_t m)
> +{
> +  return m & repeat_bytes (0x7f);

OK.

> +}
> +
> +/* Return the address of the op_t word containing the address P.  For
> +   instance on address 0x0011223344556677 and op_t with size of 8,
> +   it returns 0x0011223344556670.  */

Could you expand on this a bit more? It's a bit opaque what we might use
this for (I have some ideas).

> +static inline op_t *
> +word_containing (char const *p)
> +{
> +  return (op_t *) (p - (uintptr_t) p % sizeof (op_t));
> +}
> +
> +#endif /* _STRING_MASKOFF_H  */
> -- 
> 2.34.1
> 


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v5 03/17] Add string-maskoff.h generic header
  2022-09-20 11:43   ` Carlos O'Donell
@ 2022-09-22 17:31     ` Adhemerval Zanella Netto
  0 siblings, 0 replies; 55+ messages in thread
From: Adhemerval Zanella Netto @ 2022-09-22 17:31 UTC (permalink / raw)
  To: Carlos O'Donell; +Cc: libc-alpha



On 20/09/22 08:43, Carlos O'Donell wrote:
> On Mon, Sep 19, 2022 at 04:59:06PM -0300, Adhemerval Zanella via Libc-alpha wrote:
>> Macros to operate on unaligned access for string operations:
>>
>>   - create_mask: create a mask based on pointer alignment to sets up
>>     non-zero bytes before the beginning of the word so a following
>>     operation (such as find zero) might ignore these bytes.
>>
>>   - highbit_mask: create a mask with high bit of each byte being 1,
>>     and the low 7 bits being all the opposite of the input.
> 
> I really appreciate the effort you've put into documenting the purpose
> of each function! It's really awesome to read such nice comments. Thank
> you for that. I've gone through this to review the implementation and
> the descriptions. I think it needs a little more tweaking.
> 
>> These macros are meant to be used on optimized vectorized string
>> implementations.
>> ---
>>  sysdeps/generic/string-maskoff.h | 73 ++++++++++++++++++++++++++++++++
>>  1 file changed, 73 insertions(+)
>>  create mode 100644 sysdeps/generic/string-maskoff.h
>>
>> diff --git a/sysdeps/generic/string-maskoff.h b/sysdeps/generic/string-maskoff.h
>> new file mode 100644
>> index 0000000000..831647bda6
>> --- /dev/null
>> +++ b/sysdeps/generic/string-maskoff.h
>> @@ -0,0 +1,73 @@
>> +/* Mask off bits.  Generic C version.
>> +   Copyright (C) 2022 Free Software Foundation, Inc.
>> +   This file is part of the GNU C Library.
>> +
>> +   The GNU C Library is free software; you can redistribute it and/or
>> +   modify it under the terms of the GNU Lesser General Public
>> +   License as published by the Free Software Foundation; either
>> +   version 2.1 of the License, or (at your option) any later version.
>> +
>> +   The GNU C Library is distributed in the hope that it will be useful,
>> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
>> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>> +   Lesser General Public License for more details.
>> +
>> +   You should have received a copy of the GNU Lesser General Public
>> +   License along with the GNU C Library; if not, see
>> +   <http://www.gnu.org/licenses/>.  */
>> +
>> +#ifndef _STRING_MASKOFF_H
>> +#define _STRING_MASKOFF_H 1
>> +
>> +#include <endian.h>
>> +#include <limits.h>
>> +#include <stdint.h>
>> +#include <string-optype.h>
>> +
>> +/* Provide a mask based on the pointer alignment that sets up non-zero
>> +   bytes before the beginning of the word.  It is used to mask off
>> +   undesirable bits from an aligned read from an unaligned pointer.
>> +   For instance, on a 64 bits machine with a pointer alignment of
> 
> s/bits/-bit/g
> 
> While it is technically correct English to say "A 64-bits machine", this
> is not the normative usage.
> 
> I suggest we use the normative "64-bit machine." We can talk about 64
> bits, and alignment as bits etc.

Alright.

> 
>> +   3 the function returns 0x0000000000ffffff for LE and 0xffffff0000000000
>> +   (meaning to mask off the initial 3 bytes).  */
> 
> Missing "for BE" ?

Ack.

> 
>> +static inline op_t
>> +create_mask (uintptr_t i)
>> +{
>> +  i = i % sizeof (op_t);
> 
> OK. Wrap the value.
> 
>> +  if (__BYTE_ORDER == __LITTLE_ENDIAN)
>> +    return ~(((op_t)-1) << (i * CHAR_BIT));
>> +  else
>> +    return ~(((op_t)-1) >> (i * CHAR_BIT));
> 
> OK. Shift.
> 
>> +}
>> +
>> +/* Setup an word with each byte being c_in.  For instance, on a 64 bits
> 
> s/an/a/g
> s/bits/-bit/g

Ack.

> 
>> +   machine with input as 0xce the functions returns 0xcececececececece.  */
>> +static inline op_t
>> +repeat_bytes (unsigned char c_in)
>> +{
>> +  return ((op_t)-1 / 0xff) * c_in;
>> +}
> 
> How does the compiler do here on the various architectures to produce
> the deposit/expand instructions that could be used for this operation?
> 
> aarch64 gcc trunk:
>         ldrb    r3, [r7, #7]    @ zero_extendqisi2
>         mov     r2, #16843009
>         mul     r3, r2, r3
> 
> x86_64 gcc trunk:
>         movzx   eax, BYTE PTR [rbp-4]
>         imul    eax, eax, 16843009
> 
> s390x gcc12:
> 	ic      %r1,167(%r11)
>         lhi     %r2,255
>         nr      %r1,%r2
>         ms      %r1,.L4-.L3(%r5)
>         llgfr   %r1,%r1
> 
> Looks OK, and the static inline will get optimized with the rest of
> the operations.
> 
>> +
>> +/* Based on mask created by 'create_mask', mask off the high bit of each
> 
> s/on/on a/g

Ack.

> 
>> +   byte in the mask.  It is used to mask off undesirable bits from an
>> +   aligned read from an unaligned pointer, and also taking care to avoid
> 
> s/and/while/g

Ack.

> 
>> +   match possible bytes meant to be matched.  For instance, on a 64 bits
> 
> Suggest:
> matching possible bytes not meant to be matched.
> 
> s/bits/-bit/g

Ack.

> 
>> +   machine with a mask created from a pointer with an alignment of 3
>> +   (0x0000000000ffffff) the function returns 0x7f7f7f0000000000 for BE
>> +   and 0x00000000007f7f7f for LE.  */
>> +static inline op_t
>> +highbit_mask (op_t m)
>> +{
>> +  return m & repeat_bytes (0x7f);
> 
> OK.
> 
>> +}
>> +
>> +/* Return the address of the op_t word containing the address P.  For
>> +   instance on address 0x0011223344556677 and op_t with size of 8,
>> +   it returns 0x0011223344556670.  */
> 
> Could you expand on this a bit more? It's a bit opaque what we might use
> this for (I have some ideas).

Maybe: 

/* Return the word-aligned address containing the address P.  For instance,
   for the address 0x0011223344556677 and an op_t of size 8, it returns
   0x0011223344556670.  */

> 
>> +static inline op_t *
>> +word_containing (char const *p)
>> +{
>> +  return (op_t *) (p - (uintptr_t) p % sizeof (op_t));
>> +}
>> +
>> +#endif /* _STRING_MASKOFF_H  */
>> -- 
>> 2.34.1
>>
> 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v5 00/17] Improve generic string routines
  2022-09-19 19:59 [PATCH v5 00/17] Improve generic string routines Adhemerval Zanella
                   ` (16 preceding siblings ...)
  2022-09-19 19:59 ` [PATCH v5 17/17] sh: Add string-fzb.h Adhemerval Zanella
@ 2022-12-05 17:07 ` Xi Ruoyao
  2023-01-05 21:56   ` Adhemerval Zanella Netto
  17 siblings, 1 reply; 55+ messages in thread
From: Xi Ruoyao @ 2022-12-05 17:07 UTC (permalink / raw)
  To: Adhemerval Zanella, libc-alpha

Hi,

Any status update on this series?

On Mon, 2022-09-19 at 16:59 -0300, Adhemerval Zanella via Libc-alpha
wrote:
> It is done by:
> 
>   1. parametrizing the internal routines (for instance finding a zero
>      byte in a word) so each architecture can reimplement them without
>      rewriting the whole routine.
> 
>   2. vectorizing more string implementations (for instance strcpy
>      and strcmp).
> 
>   3. Changing some implementations to use other, already optimized
>      routines (for instance strnlen).  It lets new ports focus on
>      providing optimized implementations of only a handful of symbols
>      (for instance memchr) and makes those improvements apply to a
>      larger set of routines.
> 
> The rest of #5806 can be handled later, and if the performance of the
> generic implementation is close enough I think it is better to just
> remove the old assembly implementations.
> 
> I also checked on x86_64-linux-gnu, i686-linux-gnu, powerpc-linux-gnu,
> and powerpc64-linux-gnu by removing the arch-specific assembly
> implementation and disabling multiarch (it covers both LE and BE
> for 64 and 32 bits). I also checked the string routines on alpha,
> hppa, and sh.
> 
> Changes since v4:
>   * Removed __clz and __ctz in favor of count_leading_zero and
>     count_trailing_zeros from longlong.h.
>   * Use repeat_bytes more often.
>   * Added a comment on strcmp final_cmp on why index_first_zero_ne can
>     not be used.
> 
> Changes since v3:
>   * Rebased against master.
>   * Dropped strcpy optimization.
>   * Refactor strcmp implementation.
>   * Some minor changes in comments.
> 
> Changes since v2:
>   * Move string-fz{a,b,i} to its own patch.
>   * Add an inline implementation for __builtin_c{l,t}z to avoid using
>     compiler provided symbols.
>   * Add a new header, string-maskoff.h, to handle unaligned accesses
>     on some implementation.
>   * Fixed strcmp on LE machines.
>   * Added an unaligned strcpy variant for architectures that define
>     _STRING_ARCH_unaligned.
>   * Add SH string-fzb.h (which uses cmp/str instruction to find
>     a zero in word).
> 
> Changes since v1:
>   * Marked ChangeLog entries with [BZ #5806], as appropriate.
>   * Reorganized the headers, so that armv6t2 and power6 need override
>     as little as possible to use their (integer) zero detection insns.
>   * Hopefully fixed all of the coding style issues.
>   * Adjusted the memrchr algorithm as discussed.
>   * Replaced the #ifdef STRRCHR etc that are used by the multiarch
>   * files.
>   * Tested on i386, i686, x86_64 (verified this is unused), ppc64,
>     ppc64le --with-cpu=power8 (to use power6 in multiarch), armv7,
>     aarch64, alpha (qemu) and hppa (qemu).
> 
> Adhemerval Zanella (10):
>   Add string-maskoff.h generic header
>   Add string vectorized find and detection functions
>   string: Improve generic strlen
>   string: Improve generic strnlen
>   string: Improve generic strchr
>   string: Improve generic strchrnul
>   string: Improve generic strcmp
>   string: Improve generic memchr
>   string: Improve generic memrchr
>   sh: Add string-fzb.h
> 
> Richard Henderson (7):
>   Parameterize op_t from memcopy.h
>   Parameterize OP_T_THRES from memcopy.h
>   hppa: Add memcopy.h
>   hppa: Add string-fzb.h and string-fzi.h
>   alpha: Add string-fzb.h and string-fzi.h
>   arm: Add string-fza.h
>   powerpc: Add string-fza.h
> 
>  string/memchr.c                               | 168 ++++------------
>  string/memcmp.c                               |   4 -
>  string/memrchr.c                              | 189 +++---------------
>  string/strchr.c                               | 172 +++-------------
>  string/strchrnul.c                            | 156 +++------------
>  string/strcmp.c                               | 119 +++++++++--
>  string/strlen.c                               |  90 ++-------
>  string/strnlen.c                              | 137 +------------
>  sysdeps/alpha/string-fzb.h                    |  51 +++++
>  sysdeps/alpha/string-fzi.h                    | 113 +++++++++++
>  sysdeps/arm/armv6t2/string-fza.h              |  70 +++++++
>  sysdeps/generic/memcopy.h                     |  10 +-
>  sysdeps/generic/string-extbyte.h              |  37 ++++
>  sysdeps/generic/string-fza.h                  | 106 ++++++++++
>  sysdeps/generic/string-fzb.h                  |  49 +++++
>  sysdeps/generic/string-fzi.h                  | 120 +++++++++++
>  sysdeps/generic/string-maskoff.h              |  73 +++++++
>  sysdeps/generic/string-opthr.h                |  25 +++
>  sysdeps/generic/string-optype.h               |  31 +++
>  sysdeps/hppa/memcopy.h                        |  42 ++++
>  sysdeps/hppa/string-fzb.h                     |  69 +++++++
>  sysdeps/hppa/string-fzi.h                     | 135 +++++++++++++
>  sysdeps/i386/i686/multiarch/strnlen-c.c       |  14 +-
>  sysdeps/i386/memcopy.h                        |   3 -
>  sysdeps/i386/string-opthr.h                   |  25 +++
>  sysdeps/m68k/memcopy.h                        |   3 -
>  sysdeps/powerpc/powerpc32/power4/memcopy.h    |   5 -
>  .../powerpc32/power4/multiarch/memchr-ppc32.c |  14 +-
>  .../power4/multiarch/strchrnul-ppc32.c        |   4 -
>  .../power4/multiarch/strnlen-ppc32.c          |  14 +-
>  .../powerpc64/multiarch/memchr-ppc64.c        |   9 +-
>  sysdeps/powerpc/string-fza.h                  |  70 +++++++
>  sysdeps/s390/strchr-c.c                       |  11 +-
>  sysdeps/s390/strchrnul-c.c                    |   2 -
>  sysdeps/s390/strlen-c.c                       |  10 +-
>  sysdeps/s390/strnlen-c.c                      |  14 +-
>  sysdeps/sh/string-fzb.h                       |  53 +++++
>  37 files changed, 1366 insertions(+), 851 deletions(-)
>  create mode 100644 sysdeps/alpha/string-fzb.h
>  create mode 100644 sysdeps/alpha/string-fzi.h
>  create mode 100644 sysdeps/arm/armv6t2/string-fza.h
>  create mode 100644 sysdeps/generic/string-extbyte.h
>  create mode 100644 sysdeps/generic/string-fza.h
>  create mode 100644 sysdeps/generic/string-fzb.h
>  create mode 100644 sysdeps/generic/string-fzi.h
>  create mode 100644 sysdeps/generic/string-maskoff.h
>  create mode 100644 sysdeps/generic/string-opthr.h
>  create mode 100644 sysdeps/generic/string-optype.h
>  create mode 100644 sysdeps/hppa/memcopy.h
>  create mode 100644 sysdeps/hppa/string-fzb.h
>  create mode 100644 sysdeps/hppa/string-fzi.h
>  create mode 100644 sysdeps/i386/string-opthr.h
>  create mode 100644 sysdeps/powerpc/string-fza.h
>  create mode 100644 sysdeps/sh/string-fzb.h
> 

-- 
Xi Ruoyao <xry111@xry111.site>
School of Aerospace Science and Technology, Xidian University

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v5 00/17] Improve generic string routines
  2022-12-05 17:07 ` [PATCH v5 00/17] Improve generic string routines Xi Ruoyao
@ 2023-01-05 21:56   ` Adhemerval Zanella Netto
  2023-01-05 23:52     ` Noah Goldstein
  0 siblings, 1 reply; 55+ messages in thread
From: Adhemerval Zanella Netto @ 2023-01-05 21:56 UTC (permalink / raw)
  To: Xi Ruoyao, libc-alpha

Unfortunately no one worked on reviewing it.  It would be good to have
it for 2.37, although I think it is too late.  However, since most
architectures do use arch-specific routines, I think the possible disruption
of using this patchset should be minimal.

On 05/12/22 14:07, Xi Ruoyao wrote:
> Hi,
> 
> Any status update on this series?
> 
> On Mon, 2022-09-19 at 16:59 -0300, Adhemerval Zanella via Libc-alpha
> wrote:
>> It is done by:
>>
>>   1. parametrizing the internal routines (for instance finding a zero
>>      byte in a word) so each architecture can reimplement them without
>>      rewriting the whole routine.
>>
>>   2. vectorizing more string implementations (for instance strcpy
>>      and strcmp).
>>
>>   3. Changing some implementations to use other, already optimized
>>      routines (for instance strnlen).  It lets new ports focus on
>>      providing optimized implementations of only a handful of symbols
>>      (for instance memchr) and makes those improvements apply to a
>>      larger set of routines.
>>
>> The rest of #5806 can be handled later, and if the performance of the
>> generic implementation is close enough I think it is better to just
>> remove the old assembly implementations.
>>
>> I also checked on x86_64-linux-gnu, i686-linux-gnu, powerpc-linux-gnu,
>> and powerpc64-linux-gnu by removing the arch-specific assembly
>> implementation and disabling multiarch (it covers both LE and BE
>> for 64 and 32 bits). I also checked the string routines on alpha,
>> hppa, and sh.
>>
>> Changes since v4:
>>   * Removed __clz and __ctz in favor of count_leading_zero and
>>     count_trailing_zeros from longlong.h.
>>   * Use repeat_bytes more often.
>>   * Added a comment on strcmp final_cmp on why index_first_zero_ne can
>>     not be used.
>>
>> Changes since v3:
>>   * Rebased against master.
>>   * Dropped strcpy optimization.
>>   * Refactor strcmp implementation.
>>   * Some minor changes in comments.
>>
>> Changes since v2:
>>   * Move string-fz{a,b,i} to its own patch.
>>   * Add an inline implementation for __builtin_c{l,t}z to avoid using
>>     compiler provided symbols.
>>   * Add a new header, string-maskoff.h, to handle unaligned accesses
>>     on some implementation.
>>   * Fixed strcmp on LE machines.
>>   * Added an unaligned strcpy variant for architectures that define
>>     _STRING_ARCH_unaligned.
>>   * Add SH string-fzb.h (which uses cmp/str instruction to find
>>     a zero in word).
>>
>> Changes since v1:
>>   * Marked ChangeLog entries with [BZ #5806], as appropriate.
>>   * Reorganized the headers, so that armv6t2 and power6 need override
>>     as little as possible to use their (integer) zero detection insns.
>>   * Hopefully fixed all of the coding style issues.
>>   * Adjusted the memrchr algorithm as discussed.
>>   * Replaced the #ifdef STRRCHR etc that are used by the multiarch
>>   * files.
>>   * Tested on i386, i686, x86_64 (verified this is unused), ppc64,
>>     ppc64le --with-cpu=power8 (to use power6 in multiarch), armv7,
>>     aarch64, alpha (qemu) and hppa (qemu).
>>
>> Adhemerval Zanella (10):
>>   Add string-maskoff.h generic header
>>   Add string vectorized find and detection functions
>>   string: Improve generic strlen
>>   string: Improve generic strnlen
>>   string: Improve generic strchr
>>   string: Improve generic strchrnul
>>   string: Improve generic strcmp
>>   string: Improve generic memchr
>>   string: Improve generic memrchr
>>   sh: Add string-fzb.h
>>
>> Richard Henderson (7):
>>   Parameterize op_t from memcopy.h
>>   Parameterize OP_T_THRES from memcopy.h
>>   hppa: Add memcopy.h
>>   hppa: Add string-fzb.h and string-fzi.h
>>   alpha: Add string-fzb.h and string-fzi.h
>>   arm: Add string-fza.h
>>   powerpc: Add string-fza.h
>>
>>  string/memchr.c                               | 168 ++++------------
>>  string/memcmp.c                               |   4 -
>>  string/memrchr.c                              | 189 +++---------------
>>  string/strchr.c                               | 172 +++-------------
>>  string/strchrnul.c                            | 156 +++------------
>>  string/strcmp.c                               | 119 +++++++++--
>>  string/strlen.c                               |  90 ++-------
>>  string/strnlen.c                              | 137 +------------
>>  sysdeps/alpha/string-fzb.h                    |  51 +++++
>>  sysdeps/alpha/string-fzi.h                    | 113 +++++++++++
>>  sysdeps/arm/armv6t2/string-fza.h              |  70 +++++++
>>  sysdeps/generic/memcopy.h                     |  10 +-
>>  sysdeps/generic/string-extbyte.h              |  37 ++++
>>  sysdeps/generic/string-fza.h                  | 106 ++++++++++
>>  sysdeps/generic/string-fzb.h                  |  49 +++++
>>  sysdeps/generic/string-fzi.h                  | 120 +++++++++++
>>  sysdeps/generic/string-maskoff.h              |  73 +++++++
>>  sysdeps/generic/string-opthr.h                |  25 +++
>>  sysdeps/generic/string-optype.h               |  31 +++
>>  sysdeps/hppa/memcopy.h                        |  42 ++++
>>  sysdeps/hppa/string-fzb.h                     |  69 +++++++
>>  sysdeps/hppa/string-fzi.h                     | 135 +++++++++++++
>>  sysdeps/i386/i686/multiarch/strnlen-c.c       |  14 +-
>>  sysdeps/i386/memcopy.h                        |   3 -
>>  sysdeps/i386/string-opthr.h                   |  25 +++
>>  sysdeps/m68k/memcopy.h                        |   3 -
>>  sysdeps/powerpc/powerpc32/power4/memcopy.h    |   5 -
>>  .../powerpc32/power4/multiarch/memchr-ppc32.c |  14 +-
>>  .../power4/multiarch/strchrnul-ppc32.c        |   4 -
>>  .../power4/multiarch/strnlen-ppc32.c          |  14 +-
>>  .../powerpc64/multiarch/memchr-ppc64.c        |   9 +-
>>  sysdeps/powerpc/string-fza.h                  |  70 +++++++
>>  sysdeps/s390/strchr-c.c                       |  11 +-
>>  sysdeps/s390/strchrnul-c.c                    |   2 -
>>  sysdeps/s390/strlen-c.c                       |  10 +-
>>  sysdeps/s390/strnlen-c.c                      |  14 +-
>>  sysdeps/sh/string-fzb.h                       |  53 +++++
>>  37 files changed, 1366 insertions(+), 851 deletions(-)
>>  create mode 100644 sysdeps/alpha/string-fzb.h
>>  create mode 100644 sysdeps/alpha/string-fzi.h
>>  create mode 100644 sysdeps/arm/armv6t2/string-fza.h
>>  create mode 100644 sysdeps/generic/string-extbyte.h
>>  create mode 100644 sysdeps/generic/string-fza.h
>>  create mode 100644 sysdeps/generic/string-fzb.h
>>  create mode 100644 sysdeps/generic/string-fzi.h
>>  create mode 100644 sysdeps/generic/string-maskoff.h
>>  create mode 100644 sysdeps/generic/string-opthr.h
>>  create mode 100644 sysdeps/generic/string-optype.h
>>  create mode 100644 sysdeps/hppa/memcopy.h
>>  create mode 100644 sysdeps/hppa/string-fzb.h
>>  create mode 100644 sysdeps/hppa/string-fzi.h
>>  create mode 100644 sysdeps/i386/string-opthr.h
>>  create mode 100644 sysdeps/powerpc/string-fza.h
>>  create mode 100644 sysdeps/sh/string-fzb.h
>>
> 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v5 03/17] Add string-maskoff.h generic header
  2022-09-19 19:59 ` [PATCH v5 03/17] Add string-maskoff.h generic header Adhemerval Zanella
  2022-09-20 11:43   ` Carlos O'Donell
@ 2023-01-05 22:49   ` Noah Goldstein
  2023-01-05 23:26     ` Alejandro Colomar
  2023-01-09 18:02     ` Adhemerval Zanella Netto
  1 sibling, 2 replies; 55+ messages in thread
From: Noah Goldstein @ 2023-01-05 22:49 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: libc-alpha

On Mon, Sep 19, 2022 at 12:59 PM Adhemerval Zanella via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> Macros to operate on unaligned access for string operations:
>
>   - create_mask: create a mask based on pointer alignment to sets up
>     non-zero bytes before the beginning of the word so a following
>     operation (such as find zero) might ignore these bytes.
>
>   - highbit_mask: create a mask with high bit of each byte being 1,
>     and the low 7 bits being all the opposite of the input.
>
> These macros are meant to be used on optimized vectorized string
> implementations.
> ---
>  sysdeps/generic/string-maskoff.h | 73 ++++++++++++++++++++++++++++++++
>  1 file changed, 73 insertions(+)
>  create mode 100644 sysdeps/generic/string-maskoff.h
>
> diff --git a/sysdeps/generic/string-maskoff.h b/sysdeps/generic/string-maskoff.h
> new file mode 100644
> index 0000000000..831647bda6
> --- /dev/null
> +++ b/sysdeps/generic/string-maskoff.h
> @@ -0,0 +1,73 @@
> +/* Mask off bits.  Generic C version.
> +   Copyright (C) 2022 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +#ifndef _STRING_MASKOFF_H
> +#define _STRING_MASKOFF_H 1
> +
> +#include <endian.h>
> +#include <limits.h>
> +#include <stdint.h>
> +#include <string-optype.h>
> +
> +/* Provide a mask based on the pointer alignment that sets up non-zero
> +   bytes before the beginning of the word.  It is used to mask off
> +   undesirable bits from an aligned read from an unaligned pointer.
> +   For instance, on a 64 bits machine with a pointer alignment of
> +   3 the function returns 0x0000000000ffffff for LE and 0xffffff0000000000
> +   (meaning to mask off the initial 3 bytes).  */
> +static inline op_t
> +create_mask (uintptr_t i)
> +{
> +  i = i % sizeof (op_t);
> +  if (__BYTE_ORDER == __LITTLE_ENDIAN)
> +    return ~(((op_t)-1) << (i * CHAR_BIT));
> +  else
> +    return ~(((op_t)-1) >> (i * CHAR_BIT));
> +}
> +
> +/* Setup an word with each byte being c_in.  For instance, on a 64 bits
> +   machine with input as 0xce the functions returns 0xcececececececece.  */
> +static inline op_t
> +repeat_bytes (unsigned char c_in)
> +{
> +  return ((op_t)-1 / 0xff) * c_in;
> +}
> +
> +/* Based on mask created by 'create_mask', mask off the high bit of each
> +   byte in the mask.  It is used to mask off undesirable bits from an
> +   aligned read from an unaligned pointer, and also taking care to avoid
> +   match possible bytes meant to be matched.  For instance, on a 64 bits
> +   machine with a mask created from a pointer with an alignment of 3
> +   (0x0000000000ffffff) the function returns 0x7f7f7f0000000000 for BE
> +   and 0x00000000007f7f7f for LE.  */
> +static inline op_t
> +highbit_mask (op_t m)
> +{
> +  return m & repeat_bytes (0x7f);
> +}
> +
> +/* Return the address of the op_t word containing the address P.  For
> +   instance on address 0x0011223344556677 and op_t with size of 8,
> +   it returns 0x0011223344556670.  */
> +static inline op_t *
> +word_containing (char const *p)
> +{
> +  return (op_t *) (p - (uintptr_t) p % sizeof (op_t));

This can just be (p & -sizeof (p)) I think.
Other than that looks good.
> +}
> +
> +#endif /* _STRING_MASKOFF_H  */
> --
> 2.34.1
>

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v5 04/17] Add string vectorized find and detection functions
  2022-09-19 19:59 ` [PATCH v5 04/17] Add string vectorized find and detection functions Adhemerval Zanella
@ 2023-01-05 22:53   ` Noah Goldstein
  2023-01-09 18:51     ` Adhemerval Zanella Netto
  2023-01-05 23:04   ` Noah Goldstein
  1 sibling, 1 reply; 55+ messages in thread
From: Noah Goldstein @ 2023-01-05 22:53 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: libc-alpha, Richard Henderson

On Mon, Sep 19, 2022 at 1:02 PM Adhemerval Zanella via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> This patch adds generic string find and detection meant to be used in
> generic vectorized string implementation.  The idea is to decompose the
> basic string operation so each architecture can reimplement if it
> provides any specialized hardware instruction.
>
> The 'string-fza.h' provides zero byte detection functions (find_zero_low,
> find_zero_all, find_eq_low, find_eq_all, find_zero_eq_low, find_zero_eq_all,
> find_zero_ne_low, and find_zero_ne_all).  They are used on both functions
> provided by 'string-fzb.h' and 'string-fzi'.
>
> The 'string-fzb.h' provides boolean zero byte detection with the
> functions:
>
>   - has_zero: determine if any byte within a word is zero.
>   - has_eq: determine byte equality between two words.
>   - has_zero_eq: determine if any byte within a word is zero along with
>     byte equality between two words.
>
> The 'string-fzi.h' provides zero byte detection along with its positions:
>
>   - index_first_zero: return index of first zero byte within a word.
>   - index_first_eq: return index of first byte different between two words.
>   - index_first_zero_eq: return index of first zero byte within a word or
>     first byte different between two words.
>   - index_first_zero_ne: return index of first zero byte within a word or
>     first byte equal between two words.
>   - index_last_zero: return index of last zero byte within a word.
>   - index_last_eq: return index of last byte different between two words.
>
> Co-authored-by: Richard Henderson <rth@twiddle.net>
> ---
>  sysdeps/generic/string-extbyte.h |  37 ++++++++++
>  sysdeps/generic/string-fza.h     | 106 +++++++++++++++++++++++++++
>  sysdeps/generic/string-fzb.h     |  49 +++++++++++++
>  sysdeps/generic/string-fzi.h     | 120 +++++++++++++++++++++++++++++++
>  4 files changed, 312 insertions(+)
>  create mode 100644 sysdeps/generic/string-extbyte.h
>  create mode 100644 sysdeps/generic/string-fza.h
>  create mode 100644 sysdeps/generic/string-fzb.h
>  create mode 100644 sysdeps/generic/string-fzi.h
>
> diff --git a/sysdeps/generic/string-extbyte.h b/sysdeps/generic/string-extbyte.h
> new file mode 100644
> index 0000000000..c8fecd259f
> --- /dev/null
> +++ b/sysdeps/generic/string-extbyte.h
> @@ -0,0 +1,37 @@
> +/* Extract byte from memory word.  Generic C version.
> +   Copyright (C) 2022 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +#ifndef _STRING_EXTBYTE_H
> +#define _STRING_EXTBYTE_H 1
> +
> +#include <limits.h>
> +#include <endian.h>
> +#include <string-optype.h>
> +
> +/* Extract the byte at index IDX from word X, with index 0 being the
> +   least significant byte.  */
> +static inline unsigned char
> +extractbyte (op_t x, unsigned int idx)
> +{
> +  if (__BYTE_ORDER == __LITTLE_ENDIAN)
> +    return x >> (idx * CHAR_BIT);
> +  else
> +    return x >> (sizeof (x) - 1 - idx) * CHAR_BIT;
> +}
> +
> +#endif /* _STRING_EXTBYTE_H */
> diff --git a/sysdeps/generic/string-fza.h b/sysdeps/generic/string-fza.h
> new file mode 100644
> index 0000000000..54be34e5f0
> --- /dev/null
> +++ b/sysdeps/generic/string-fza.h
> @@ -0,0 +1,106 @@
> +/* Basic zero byte detection.  Generic C version.
> +   Copyright (C) 2022 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +#ifndef _STRING_FZA_H
> +#define _STRING_FZA_H 1
> +
> +#include <limits.h>
> +#include <string-optype.h>
> +#include <string-maskoff.h>
> +
> +/* This function returns non-zero if any byte in X is zero.
> +   More specifically, at least one bit set within the least significant
> +   byte that was zero; other bytes within the word are indeterminate.  */
> +static inline op_t
> +find_zero_low (op_t x)
> +{
> +  /* This expression comes from
> +       https://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord
> +     Subtracting 1 sets 0x80 in a byte that was 0; anding ~x clears
> +     0x80 in a byte that was >= 128; anding 0x80 isolates that test bit.  */
> +  op_t lsb = repeat_bytes (0x01);
> +  op_t msb = repeat_bytes (0x80);
> +  return (x - lsb) & ~x & msb;
> +}
> +
> +/* This function returns at least one bit set within every byte of X that
> +   is zero.  The result is exact in that, unlike find_zero_low, all bytes
> +   are determinate.  This is usually used for finding the index of the
> +   most significant byte that was zero.  */
> +static inline op_t
> +find_zero_all (op_t x)
> +{
> +  /* For each byte, find not-zero by
> +     (0) And 0x7f so that we cannot carry between bytes,
> +     (1) Add 0x7f so that non-zero carries into 0x80,
> +     (2) Or in the original byte (which might have had 0x80 set).
> +     Then invert and mask such that 0x80 is set iff that byte was zero.  */
> +  op_t m = ((op_t)-1 / 0xff) * 0x7f;

Use repeat_bytes here?
> +  return ~(((x & m) + m) | x | m);
> +}
> +
> +/* With similar caveats, identify bytes that are equal between X1 and X2.  */
> +static inline op_t
> +find_eq_low (op_t x1, op_t x2)
> +{
> +  return find_zero_low (x1 ^ x2);
> +}
> +
> +static inline op_t
> +find_eq_all (op_t x1, op_t x2)
> +{
> +  return find_zero_all (x1 ^ x2);
> +}
> +
> +/* With similar caveats, identify zero bytes in X1 and bytes that are
> +   equal between X1 and X2.  */
> +static inline op_t
> +find_zero_eq_low (op_t x1, op_t x2)
> +{
> +  return find_zero_low (x1) | find_zero_low (x1 ^ x2);
> +}
> +
> +static inline op_t
> +find_zero_eq_all (op_t x1, op_t x2)
> +{
> +  return find_zero_all (x1) | find_zero_all (x1 ^ x2);
> +}
> +
> +/* With similar caveats, identify zero bytes in X1 and bytes that are
> +   not equal between X1 and X2.  */
> +static inline op_t
> +find_zero_ne_low (op_t x1, op_t x2)
> +{
> +  op_t m = repeat_bytes (0x7f);
> +  op_t eq = x1 ^ x2;
> +  op_t nz1 = (x1 + m) | x1;    /* msb set if byte not zero.  */
> +  op_t ne2 = (eq + m) | eq;    /* msb set if byte not equal.  */
> +  return (ne2 | ~nz1) & ~m;    /* msb set if x1 zero or x2 not equal.  */
> +}
> +
> +static inline op_t
> +find_zero_ne_all (op_t x1, op_t x2)
> +{
> +  op_t m = repeat_bytes (0x7f);
> +  op_t eq = x1 ^ x2;
> +  op_t nz1 = ((x1 & m) + m) | x1;
> +  op_t ne2 = ((eq & m) + m) | eq;
> +  return (ne2 | ~nz1) & ~m;
> +}
> +
> +#endif /* _STRING_FZA_H */
> diff --git a/sysdeps/generic/string-fzb.h b/sysdeps/generic/string-fzb.h
> new file mode 100644
> index 0000000000..f1c0ae0922
> --- /dev/null
> +++ b/sysdeps/generic/string-fzb.h
> @@ -0,0 +1,49 @@
> +/* Zero byte detection, boolean.  Generic C version.
> +   Copyright (C) 2022 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +#ifndef _STRING_FZB_H
> +#define _STRING_FZB_H 1
> +
> +#include <endian.h>
> +#include <string-fza.h>
> +
> +/* Determine if any byte within X is zero.  This is a pure boolean test.  */
> +
> +static inline _Bool
> +has_zero (op_t x)
> +{
> +  return find_zero_low (x) != 0;
> +}
> +
> +/* Likewise, but for byte equality between X1 and X2.  */
> +
> +static inline _Bool
> +has_eq (op_t x1, op_t x2)
> +{
> +  return find_eq_low (x1, x2) != 0;
> +}
> +
> +/* Likewise, but for zeros in X1 and equal bytes between X1 and X2.  */
> +
> +static inline _Bool
> +has_zero_eq (op_t x1, op_t x2)
> +{
> +  return find_zero_eq_low (x1, x2);
> +}
> +
> +#endif /* _STRING_FZB_H */
> diff --git a/sysdeps/generic/string-fzi.h b/sysdeps/generic/string-fzi.h
> new file mode 100644
> index 0000000000..888e1b8baa
> --- /dev/null
> +++ b/sysdeps/generic/string-fzi.h
> @@ -0,0 +1,120 @@
> +/* Zero byte detection; indexes.  Generic C version.
> +   Copyright (C) 2022 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +#ifndef _STRING_FZI_H
> +#define _STRING_FZI_H 1
> +
> +#include <limits.h>
> +#include <endian.h>
> +#include <string-fza.h>
> +#include <gmp.h>
> +#include <stdlib/gmp-impl.h>
> +#include <stdlib/longlong.h>
> +
> +/* A subroutine for the index_zero functions.  Given a test word C, return
> +   the index (in memory order) of the first byte that is non-zero.  */
> +static inline unsigned int
> +index_first_ (op_t c)
> +{
> +  int r;
> +  if (__BYTE_ORDER == __LITTLE_ENDIAN)
> +    count_trailing_zeros (r, c);
> +  else
> +    count_leading_zeros (r, c);
> +  return r / CHAR_BIT;
> +}
> +
> +/* Similarly, but return the (memory order) index of the last byte that is
> +   non-zero.  */
> +static inline unsigned int
> +index_last_ (op_t c)
> +{
> +  int r;
> +  if (__BYTE_ORDER == __LITTLE_ENDIAN)
> +    count_leading_zeros (r, c);
> +  else
> +    count_trailing_zeros (r, c);
> +  return sizeof (op_t) - 1 - (r / CHAR_BIT);
> +}
> +
> +/* Given a word X that is known to contain a zero byte, return the index of
> +   the first such within the word in memory order.  */
> +static inline unsigned int
> +index_first_zero (op_t x)
> +{
> +  if (__BYTE_ORDER == __LITTLE_ENDIAN)
> +    x = find_zero_low (x);
> +  else
> +    x = find_zero_all (x);
> +  return index_first_ (x);
> +}
> +
> +/* Similarly, but perform the search for byte equality between X1 and X2.  */
> +static inline unsigned int
> +index_first_eq (op_t x1, op_t x2)
> +{
> +  if (__BYTE_ORDER == __LITTLE_ENDIAN)
> +    x1 = find_eq_low (x1, x2);
> +  else
> +    x1 = find_eq_all (x1, x2);
> +  return index_first_ (x1);
> +}
> +
> +/* Similarly, but perform the search for zero within X1 or equality between
> +   X1 and X2.  */
> +static inline unsigned int
> +index_first_zero_eq (op_t x1, op_t x2)
> +{
> +  if (__BYTE_ORDER == __LITTLE_ENDIAN)
> +    x1 = find_zero_eq_low (x1, x2);
> +  else
> +    x1 = find_zero_eq_all (x1, x2);
> +  return index_first_ (x1);
> +}
> +
> +/* Similarly, but perform the search for zero within X1 or inequality between
> +   X1 and X2.  */
> +static inline unsigned int
> +index_first_zero_ne (op_t x1, op_t x2)
> +{
> +  if (__BYTE_ORDER == __LITTLE_ENDIAN)
> +    x1 = find_zero_ne_low (x1, x2);
> +  else
> +    x1 = find_zero_ne_all (x1, x2);
> +  return index_first_ (x1);
> +}
> +
> +/* Similarly, but search for the last zero within X.  */
> +static inline unsigned int
> +index_last_zero (op_t x)
> +{
> +  if (__BYTE_ORDER == __LITTLE_ENDIAN)
> +    x = find_zero_all (x);
> +  else
> +    x = find_zero_low (x);
> +  return index_last_ (x);
> +}
> +
> +static inline unsigned int
> +index_last_eq (op_t x1, op_t x2)
> +{
> +  return index_last_zero (x1 ^ x2);
> +}
> +
> +#endif /* _STRING_FZI_H */
> --
> 2.34.1
>

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v5 04/17] Add string vectorized find and detection functions
  2022-09-19 19:59 ` [PATCH v5 04/17] Add string vectorized find and detection functions Adhemerval Zanella
  2023-01-05 22:53   ` Noah Goldstein
@ 2023-01-05 23:04   ` Noah Goldstein
  2023-01-09 19:34     ` Adhemerval Zanella Netto
  1 sibling, 1 reply; 55+ messages in thread
From: Noah Goldstein @ 2023-01-05 23:04 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: libc-alpha, Richard Henderson

On Mon, Sep 19, 2022 at 1:02 PM Adhemerval Zanella via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> This patch adds generic string find and detection functions meant to be
> used in the generic vectorized string implementations.  The idea is to
> decompose the basic string operations so each architecture can
> reimplement them if it provides any specialized hardware instruction.
>
> The 'string-fza.h' provides zero byte detection functions (find_zero_low,
> find_zero_all, find_eq_low, find_eq_all, find_zero_eq_low, find_zero_eq_all,
> find_zero_ne_low, and find_zero_ne_all).  They are used by the functions
> provided by both 'string-fzb.h' and 'string-fzi.h'.
>
> The 'string-fzb.h' provides boolean zero byte detection with the
> functions:
>
>   - has_zero: determine if any byte within a word is zero.
>   - has_eq: determine byte equality between two words.
>   - has_zero_eq: determine if any byte within a word is zero along with
>     byte equality between two words.
>
> The 'string-fzi.h' provides zero byte detection along with its positions:
>
>   - index_first_zero: return index of first zero byte within a word.
>   - index_first_eq: return index of first byte different between two words.
>   - index_first_zero_eq: return index of first zero byte within a word or
>     first byte different between two words.
>   - index_first_zero_ne: return index of first zero byte within a word or
>     first byte equal between two words.
>   - index_last_zero: return index of last zero byte within a word.
>   - index_last_eq: return index of last byte different between two words.
>
> Co-authored-by: Richard Henderson <rth@twiddle.net>
> ---
>  sysdeps/generic/string-extbyte.h |  37 ++++++++++
>  sysdeps/generic/string-fza.h     | 106 +++++++++++++++++++++++++++
>  sysdeps/generic/string-fzb.h     |  49 +++++++++++++
>  sysdeps/generic/string-fzi.h     | 120 +++++++++++++++++++++++++++++++
>  4 files changed, 312 insertions(+)
>  create mode 100644 sysdeps/generic/string-extbyte.h
>  create mode 100644 sysdeps/generic/string-fza.h
>  create mode 100644 sysdeps/generic/string-fzb.h
>  create mode 100644 sysdeps/generic/string-fzi.h
>
> diff --git a/sysdeps/generic/string-extbyte.h b/sysdeps/generic/string-extbyte.h
> new file mode 100644
> index 0000000000..c8fecd259f
> --- /dev/null
> +++ b/sysdeps/generic/string-extbyte.h
> @@ -0,0 +1,37 @@
> +/* Extract byte from memory word.  Generic C version.
> +   Copyright (C) 2022 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +#ifndef _STRING_EXTBYTE_H
> +#define _STRING_EXTBYTE_H 1
> +
> +#include <limits.h>
> +#include <endian.h>
> +#include <string-optype.h>
> +
> +/* Extract the byte at index IDX from word X, with index 0 being the
> +   byte at the lowest memory address.  */
> +static inline unsigned char
> +extractbyte (op_t x, unsigned int idx)
> +{
> +  if (__BYTE_ORDER == __LITTLE_ENDIAN)
> +    return x >> (idx * CHAR_BIT);
> +  else
> +    return x >> (sizeof (x) - 1 - idx) * CHAR_BIT;
> +}
> +
> +#endif /* _STRING_EXTBYTE_H */
> diff --git a/sysdeps/generic/string-fza.h b/sysdeps/generic/string-fza.h
> new file mode 100644
> index 0000000000..54be34e5f0
> --- /dev/null
> +++ b/sysdeps/generic/string-fza.h
> @@ -0,0 +1,106 @@
> +/* Basic zero byte detection.  Generic C version.
> +   Copyright (C) 2022 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +#ifndef _STRING_FZA_H
> +#define _STRING_FZA_H 1
> +
> +#include <limits.h>
> +#include <string-optype.h>
> +#include <string-maskoff.h>
> +
> +/* This function returns non-zero if any byte in X is zero.
> +   More specifically, the result has at least one bit set within the
> +   least significant byte that was zero; other bytes within the word
> +   are indeterminate.  */
> +static inline op_t
> +find_zero_low (op_t x)
> +{
> +  /* This expression comes from
> +       https://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord
> +     Subtracting 1 sets 0x80 in a byte that was 0; anding ~x clears
> +     0x80 in a byte that was >= 128; anding 0x80 isolates that test bit.  */
> +  op_t lsb = repeat_bytes (0x01);
> +  op_t msb = repeat_bytes (0x80);
> +  return (x - lsb) & ~x & msb;
> +}
> +
> +/* This function returns at least one bit set within every byte of X that
> +   is zero.  The result is exact in that, unlike find_zero_low, all bytes
> +   are determinate.  This is usually used for finding the index of the
> +   most significant byte that was zero.  */
> +static inline op_t
> +find_zero_all (op_t x)
> +{
> +  /* For each byte, find not-zero by
> +     (0) And 0x7f so that we cannot carry between bytes,
> +     (1) Add 0x7f so that non-zero carries into 0x80,
> +     (2) Or in the original byte (which might have had 0x80 set).
> +     Then invert and mask such that 0x80 is set iff that byte was zero.  */
> +  op_t m = ((op_t)-1 / 0xff) * 0x7f;
> +  return ~(((x & m) + m) | x | m);
> +}
> +
> +/* With similar caveats, identify bytes that are equal between X1 and X2.  */
> +static inline op_t
> +find_eq_low (op_t x1, op_t x2)
> +{
> +  return find_zero_low (x1 ^ x2);
> +}
> +
> +static inline op_t
> +find_eq_all (op_t x1, op_t x2)
> +{
> +  return find_zero_all (x1 ^ x2);
> +}
> +
> +/* With similar caveats, identify zero bytes in X1 and bytes that are
> +   equal between X1 and X2.  */
> +static inline op_t
> +find_zero_eq_low (op_t x1, op_t x2)
> +{
> +  return find_zero_low (x1) | find_zero_low (x1 ^ x2);
> +}
> +
> +static inline op_t
> +find_zero_eq_all (op_t x1, op_t x2)
> +{
> +  return find_zero_all (x1) | find_zero_all (x1 ^ x2);
> +}
> +
> +/* With similar caveats, identify zero bytes in X1 and bytes that are
> +   not equal between X1 and X2.  */
> +static inline op_t
> +find_zero_ne_low (op_t x1, op_t x2)
> +{
> +  op_t m = repeat_bytes (0x7f);
> +  op_t eq = x1 ^ x2;
> +  op_t nz1 = (x1 + m) | x1;    /* msb set if byte not zero.  */
> +  op_t ne2 = (eq + m) | eq;    /* msb set if byte not equal.  */
> +  return (ne2 | ~nz1) & ~m;    /* msb set if x1 zero or x2 not equal.  */
> +}
Can't this just be `(~find_zero_eq_low (x1, x2)) + 1`?  (It seems to get
better codegen.)

> +
> +static inline op_t
> +find_zero_ne_all (op_t x1, op_t x2)
> +{
> +  op_t m = repeat_bytes (0x7f);
> +  op_t eq = x1 ^ x2;
> +  op_t nz1 = ((x1 & m) + m) | x1;
> +  op_t ne2 = ((eq & m) + m) | eq;
> +  return (ne2 | ~nz1) & ~m;
> +}
> +
> +#endif /* _STRING_FZA_H */
> diff --git a/sysdeps/generic/string-fzb.h b/sysdeps/generic/string-fzb.h
> new file mode 100644
> index 0000000000..f1c0ae0922
> --- /dev/null
> +++ b/sysdeps/generic/string-fzb.h
> @@ -0,0 +1,49 @@
> +/* Zero byte detection, boolean.  Generic C version.
> +   Copyright (C) 2022 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +#ifndef _STRING_FZB_H
> +#define _STRING_FZB_H 1
> +
> +#include <endian.h>
> +#include <string-fza.h>
> +
> +/* Determine if any byte within X is zero.  This is a pure boolean test.  */
> +
> +static inline _Bool
> +has_zero (op_t x)
> +{
> +  return find_zero_low (x) != 0;
> +}
> +
> +/* Likewise, but for byte equality between X1 and X2.  */
> +
> +static inline _Bool
> +has_eq (op_t x1, op_t x2)
> +{
> +  return find_eq_low (x1, x2) != 0;
> +}
> +
> +/* Likewise, but for zeros in X1 and equal bytes between X1 and X2.  */
> +
> +static inline _Bool
> +has_zero_eq (op_t x1, op_t x2)
> +{
> +  return find_zero_eq_low (x1, x2);
> +}
> +
> +#endif /* _STRING_FZB_H */
> diff --git a/sysdeps/generic/string-fzi.h b/sysdeps/generic/string-fzi.h
> new file mode 100644
> index 0000000000..888e1b8baa
> --- /dev/null
> +++ b/sysdeps/generic/string-fzi.h
> @@ -0,0 +1,120 @@
> +/* Zero byte detection; indexes.  Generic C version.
> +   Copyright (C) 2022 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +#ifndef _STRING_FZI_H
> +#define _STRING_FZI_H 1
> +
> +#include <limits.h>
> +#include <endian.h>
> +#include <string-fza.h>
> +#include <gmp.h>
> +#include <stdlib/gmp-impl.h>
> +#include <stdlib/longlong.h>
> +
> +/* A subroutine for the index_zero functions.  Given a test word C, return
> +   the index (in memory order) of the first byte that is non-zero.  */
> +static inline unsigned int
> +index_first_ (op_t c)
> +{
> +  int r;
> +  if (__BYTE_ORDER == __LITTLE_ENDIAN)
> +    count_trailing_zeros (r, c);
> +  else
> +    count_leading_zeros (r, c);
> +  return r / CHAR_BIT;
> +}
> +
> +/* Similarly, but return the (memory order) index of the last byte that is
> +   non-zero.  */
> +static inline unsigned int
> +index_last_ (op_t c)
> +{
> +  int r;
> +  if (__BYTE_ORDER == __LITTLE_ENDIAN)
> +    count_leading_zeros (r, c);
> +  else
> +    count_trailing_zeros (r, c);
> +  return sizeof (op_t) - 1 - (r / CHAR_BIT);
> +}
> +
> +/* Given a word X that is known to contain a zero byte, return the index of
> +   the first such within the word in memory order.  */
> +static inline unsigned int
> +index_first_zero (op_t x)
> +{
> +  if (__BYTE_ORDER == __LITTLE_ENDIAN)
> +    x = find_zero_low (x);
> +  else
> +    x = find_zero_all (x);
> +  return index_first_ (x);
> +}
> +
> +/* Similarly, but perform the search for byte equality between X1 and X2.  */
> +static inline unsigned int
> +index_first_eq (op_t x1, op_t x2)
> +{
> +  if (__BYTE_ORDER == __LITTLE_ENDIAN)
> +    x1 = find_eq_low (x1, x2);
> +  else
> +    x1 = find_eq_all (x1, x2);
> +  return index_first_ (x1);
> +}
> +
> +/* Similarly, but perform the search for zero within X1 or equality between
> +   X1 and X2.  */
> +static inline unsigned int
> +index_first_zero_eq (op_t x1, op_t x2)
> +{
> +  if (__BYTE_ORDER == __LITTLE_ENDIAN)
> +    x1 = find_zero_eq_low (x1, x2);
> +  else
> +    x1 = find_zero_eq_all (x1, x2);
> +  return index_first_ (x1);
> +}
> +
> +/* Similarly, but perform the search for zero within X1 or inequality between
> +   X1 and X2.  */
> +static inline unsigned int
> +index_first_zero_ne (op_t x1, op_t x2)
> +{
> +  if (__BYTE_ORDER == __LITTLE_ENDIAN)
> +    x1 = find_zero_ne_low (x1, x2);
> +  else
> +    x1 = find_zero_ne_all (x1, x2);
> +  return index_first_ (x1);
> +}
> +
> +/* Similarly, but search for the last zero within X.  */
> +static inline unsigned int
> +index_last_zero (op_t x)
> +{
> +  if (__BYTE_ORDER == __LITTLE_ENDIAN)
> +    x = find_zero_all (x);
> +  else
> +    x = find_zero_low (x);
> +  return index_last_ (x);
> +}
> +
> +static inline unsigned int
> +index_last_eq (op_t x1, op_t x2)
> +{
> +  return index_last_zero (x1 ^ x2);
> +}
> +
> +#endif /* _STRING_FZI_H */
> --
> 2.34.1
>

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v5 07/17] string: Improve generic strchr
  2022-09-19 19:59 ` [PATCH v5 07/17] string: Improve generic strchr Adhemerval Zanella
@ 2023-01-05 23:09   ` Noah Goldstein
  2023-01-05 23:19     ` Noah Goldstein
  0 siblings, 1 reply; 55+ messages in thread
From: Noah Goldstein @ 2023-01-05 23:09 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: libc-alpha, Richard Henderson

On Mon, Sep 19, 2022 at 1:01 PM Adhemerval Zanella via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> The new algorithm has the following key differences:
>
>   - Reads the first word unaligned and uses the string-maskoff function
>     to remove unwanted data.  This strategy follows the arch-specific
>     optimization used on aarch64 and powerpc.
>
>   - Use string-fz{b,i} and string-extbyte function.
>
> Checked on x86_64-linux-gnu, i686-linux-gnu, powerpc-linux-gnu,
> and powerpc64-linux-gnu by removing the arch-specific assembly
> implementation and disabling multi-arch (it covers both LE and BE
> for 64 and 32 bits).
>
> Co-authored-by: Richard Henderson  <rth@twiddle.net>
> ---
>  string/strchr.c         | 172 +++++++---------------------------------
>  sysdeps/s390/strchr-c.c |  11 +--
>  2 files changed, 34 insertions(+), 149 deletions(-)
>
> diff --git a/string/strchr.c b/string/strchr.c
> index bfd0c4e4bc..6bbee7f79d 100644
> --- a/string/strchr.c
> +++ b/string/strchr.c
> @@ -22,164 +22,48 @@
>
>  #include <string.h>
>  #include <stdlib.h>
> +#include <stdint.h>
> +#include <string-fza.h>
> +#include <string-fzb.h>
> +#include <string-fzi.h>
> +#include <string-extbyte.h>
> +#include <string-maskoff.h>
>
>  #undef strchr
> +#undef index
>
> -#ifndef STRCHR
> -# define STRCHR strchr
> +#ifdef STRCHR
> +# define strchr STRCHR
>  #endif
>
>  /* Find the first occurrence of C in S.  */
>  char *
> -STRCHR (const char *s, int c_in)
> +strchr (const char *s, int c_in)
>  {
> -  const unsigned char *char_ptr;
> -  const unsigned long int *longword_ptr;
> -  unsigned long int longword, magic_bits, charmask;
> -  unsigned char c;
> -
> -  c = (unsigned char) c_in;
> -
> -  /* Handle the first few characters by reading one character at a time.
> -     Do this until CHAR_PTR is aligned on a longword boundary.  */
> -  for (char_ptr = (const unsigned char *) s;
> -       ((unsigned long int) char_ptr & (sizeof (longword) - 1)) != 0;
> -       ++char_ptr)
> -    if (*char_ptr == c)
> -      return (void *) char_ptr;
> -    else if (*char_ptr == '\0')
> -      return NULL;
> -
> -  /* All these elucidatory comments refer to 4-byte longwords,
> -     but the theory applies equally well to 8-byte longwords.  */
> -
> -  longword_ptr = (unsigned long int *) char_ptr;
> -
> -  /* Bits 31, 24, 16, and 8 of this number are zero.  Call these bits
> -     the "holes."  Note that there is a hole just to the left of
> -     each byte, with an extra at the end:
> -
> -     bits:  01111110 11111110 11111110 11111111
> -     bytes: AAAAAAAA BBBBBBBB CCCCCCCC DDDDDDDD
> -
> -     The 1-bits make sure that carries propagate to the next 0-bit.
> -     The 0-bits provide holes for carries to fall into.  */
> -  magic_bits = -1;
> -  magic_bits = magic_bits / 0xff * 0xfe << 1 >> 1 | 1;
> -
> -  /* Set up a longword, each of whose bytes is C.  */
> -  charmask = c | (c << 8);
> -  charmask |= charmask << 16;
> -  if (sizeof (longword) > 4)
> -    /* Do the shift in two steps to avoid a warning if long has 32 bits.  */
> -    charmask |= (charmask << 16) << 16;
> -  if (sizeof (longword) > 8)
> -    abort ();
> -
> -  /* Instead of the traditional loop which tests each character,
> -     we will test a longword at a time.  The tricky part is testing
> -     if *any of the four* bytes in the longword in question are zero.  */
> -  for (;;)
> -    {
> -      /* We tentatively exit the loop if adding MAGIC_BITS to
> -        LONGWORD fails to change any of the hole bits of LONGWORD.
> -
> -        1) Is this safe?  Will it catch all the zero bytes?
> -        Suppose there is a byte with all zeros.  Any carry bits
> -        propagating from its left will fall into the hole at its
> -        least significant bit and stop.  Since there will be no
> -        carry from its most significant bit, the LSB of the
> -        byte to the left will be unchanged, and the zero will be
> -        detected.
> +  /* Set up a word, each of whose bytes is C.  */
> +  unsigned char c = (unsigned char) c_in;
> +  op_t repeated_c = repeat_bytes (c_in);
>
> -        2) Is this worthwhile?  Will it ignore everything except
> -        zero bytes?  Suppose every byte of LONGWORD has a bit set
> -        somewhere.  There will be a carry into bit 8.  If bit 8
> -        is set, this will carry into bit 16.  If bit 8 is clear,
> -        one of bits 9-15 must be set, so there will be a carry
> -        into bit 16.  Similarly, there will be a carry into bit
> -        24.  If one of bits 24-30 is set, there will be a carry
> -        into bit 31, so all of the hole bits will be changed.
> +  /* Align the input address to op_t.  */
> +  uintptr_t s_int = (uintptr_t) s;
> +  const op_t *word_ptr = word_containing (s);
>
> -        The one misfire occurs when bits 24-30 are clear and bit
> -        31 is set; in this case, the hole at bit 31 is not
> -        changed.  If we had access to the processor carry flag,
> -        we could close this loophole by putting the fourth hole
> -        at bit 32!
> +  /* Read the first aligned word, but force bytes before the string to
> +     match neither zero nor goal (we make sure the high bit of each byte
> +     is 1, and the low 7 bits are all the opposite of the goal byte).  */
> +  op_t bmask = create_mask (s_int);
> +  op_t word = (*word_ptr | bmask) ^ (repeated_c & highbit_mask (bmask));
>
> -        So it ignores everything except 128's, when they're aligned
> -        properly.
> +  while (! has_zero_eq (word, repeated_c))
> +    word = *++word_ptr;
>
> -        3) But wait!  Aren't we looking for C as well as zero?
> -        Good point.  So what we do is XOR LONGWORD with a longword,
> -        each of whose bytes is C.  This turns each byte that is C
> -        into a zero.  */
> -
> -      longword = *longword_ptr++;
> -
> -      /* Add MAGIC_BITS to LONGWORD.  */
> -      if ((((longword + magic_bits)
> -
> -           /* Set those bits that were unchanged by the addition.  */
> -           ^ ~longword)
> -
> -          /* Look at only the hole bits.  If any of the hole bits
> -             are unchanged, most likely one of the bytes was a
> -             zero.  */
> -          & ~magic_bits) != 0
> -
> -         /* That caught zeroes.  Now test for C.  */
> -         || ((((longword ^ charmask) + magic_bits) ^ ~(longword ^ charmask))
> -             & ~magic_bits) != 0)
> -       {
> -         /* Which of the bytes was C or zero?
> -            If none of them were, it was a misfire; continue the search.  */
> -
> -         const unsigned char *cp = (const unsigned char *) (longword_ptr - 1);
> -
> -         if (*cp == c)
> -           return (char *) cp;
> -         else if (*cp == '\0')
> -           return NULL;
> -         if (*++cp == c)
> -           return (char *) cp;
> -         else if (*cp == '\0')
> -           return NULL;
> -         if (*++cp == c)
> -           return (char *) cp;
> -         else if (*cp == '\0')
> -           return NULL;
> -         if (*++cp == c)
> -           return (char *) cp;
> -         else if (*cp == '\0')
> -           return NULL;
> -         if (sizeof (longword) > 4)
> -           {
> -             if (*++cp == c)
> -               return (char *) cp;
> -             else if (*cp == '\0')
> -               return NULL;
> -             if (*++cp == c)
> -               return (char *) cp;
> -             else if (*cp == '\0')
> -               return NULL;
> -             if (*++cp == c)
> -               return (char *) cp;
> -             else if (*cp == '\0')
> -               return NULL;
> -             if (*++cp == c)
> -               return (char *) cp;
> -             else if (*cp == '\0')
> -               return NULL;
> -           }
> -       }
> -    }
> +  op_t found = index_first_zero_eq (word, repeated_c);
>
> +  if (extractbyte (word, found) == c)
> +    return (char *) (word_ptr) + found;
>    return NULL;
>  }
> -
> -#ifdef weak_alias
> -# undef index
> +#ifndef STRCHR
>  weak_alias (strchr, index)
> -#endif
>  libc_hidden_builtin_def (strchr)
> +#endif
> diff --git a/sysdeps/s390/strchr-c.c b/sysdeps/s390/strchr-c.c
> index 4ac3a62fba..a5a1781b1c 100644
> --- a/sysdeps/s390/strchr-c.c
> +++ b/sysdeps/s390/strchr-c.c
> @@ -21,13 +21,14 @@
>  #if HAVE_STRCHR_C
>  # if HAVE_STRCHR_IFUNC
>  #  define STRCHR STRCHR_C
> -#  undef weak_alias
> +# endif
> +
> +# include <string/strchr.c>
> +
> +# if HAVE_STRCHR_IFUNC
>  #  if defined SHARED && IS_IN (libc)
> -#   undef libc_hidden_builtin_def
> -#   define libc_hidden_builtin_def(name)                       \
> -     __hidden_ver1 (__strchr_c, __GI_strchr, __strchr_c);
> +__hidden_ver1 (__strchr_c, __GI_strchr, __strchr_c);
>  #  endif
>  # endif
>
> -# include <string/strchr.c>
>  #endif
> --
> 2.34.1
>

Can this just be implemented as:

char *r = strchrnul (p, c);
return *r ? r : NULL;

so there is only the strchrnul implementation to worry about?

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v5 08/17] string: Improve generic strchrnul
  2022-09-19 19:59 ` [PATCH v5 08/17] string: Improve generic strchrnul Adhemerval Zanella
@ 2023-01-05 23:17   ` Noah Goldstein
  2023-01-09 20:35     ` Adhemerval Zanella Netto
  0 siblings, 1 reply; 55+ messages in thread
From: Noah Goldstein @ 2023-01-05 23:17 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: libc-alpha, Richard Henderson

On Mon, Sep 19, 2022 at 1:04 PM Adhemerval Zanella via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> The new algorithm has the following key differences:
>
>   - Reads the first word unaligned and uses the string-maskoff
>     function to remove unwanted data.  This strategy follows the
>     arch-specific optimizations used on aarch64 and powerpc.
>
>   - Use string-fz{b,i} functions.
>
> Checked on x86_64-linux-gnu, i686-linux-gnu, powerpc64-linux-gnu,
> and powerpc-linux-gnu by removing the arch-specific assembly
> implementation and disabling multi-arch (it covers both LE and BE
> for 64 and 32 bits).
>
> Co-authored-by: Richard Henderson  <rth@twiddle.net>
> ---
>  string/strchrnul.c                            | 156 +++---------------
>  .../power4/multiarch/strchrnul-ppc32.c        |   4 -
>  sysdeps/s390/strchrnul-c.c                    |   2 -
>  3 files changed, 24 insertions(+), 138 deletions(-)
>
> diff --git a/string/strchrnul.c b/string/strchrnul.c
> index 0cc1fc6bb0..67defa3dab 100644
> --- a/string/strchrnul.c
> +++ b/string/strchrnul.c
> @@ -1,10 +1,5 @@
>  /* Copyright (C) 1991-2022 Free Software Foundation, Inc.
>     This file is part of the GNU C Library.
> -   Based on strlen implementation by Torbjorn Granlund (tege@sics.se),
> -   with help from Dan Sahlin (dan@sics.se) and
> -   bug fix and commentary by Jim Blandy (jimb@ai.mit.edu);
> -   adaptation to strchr suggested by Dick Karpinski (dick@cca.ucsf.edu),
> -   and implemented by Roland McGrath (roland@ai.mit.edu).
>
>     The GNU C Library is free software; you can redistribute it and/or
>     modify it under the terms of the GNU Lesser General Public
> @@ -21,146 +16,43 @@
>     <https://www.gnu.org/licenses/>.  */
>
>  #include <string.h>
> -#include <memcopy.h>
>  #include <stdlib.h>
> +#include <stdint.h>
> +#include <string-fza.h>
> +#include <string-fzb.h>
> +#include <string-fzi.h>
> +#include <string-maskoff.h>
>
>  #undef __strchrnul
>  #undef strchrnul
>
> -#ifndef STRCHRNUL
> -# define STRCHRNUL __strchrnul
> +#ifdef STRCHRNUL
> +# define __strchrnul STRCHRNUL
>  #endif
>
>  /* Find the first occurrence of C in S or the final NUL byte.  */
>  char *
> -STRCHRNUL (const char *s, int c_in)
> +__strchrnul (const char *str, int c_in)
>  {
> -  const unsigned char *char_ptr;
> -  const unsigned long int *longword_ptr;
> -  unsigned long int longword, magic_bits, charmask;
> -  unsigned char c;
> -
> -  c = (unsigned char) c_in;
> -
> -  /* Handle the first few characters by reading one character at a time.
> -     Do this until CHAR_PTR is aligned on a longword boundary.  */
> -  for (char_ptr = (const unsigned char *) s;
> -       ((unsigned long int) char_ptr & (sizeof (longword) - 1)) != 0;
> -       ++char_ptr)
> -    if (*char_ptr == c || *char_ptr == '\0')
> -      return (void *) char_ptr;
> -
> -  /* All these elucidatory comments refer to 4-byte longwords,
> -     but the theory applies equally well to 8-byte longwords.  */
> -
> -  longword_ptr = (unsigned long int *) char_ptr;
> -
> -  /* Bits 31, 24, 16, and 8 of this number are zero.  Call these bits
> -     the "holes."  Note that there is a hole just to the left of
> -     each byte, with an extra at the end:
> -
> -     bits:  01111110 11111110 11111110 11111111
> -     bytes: AAAAAAAA BBBBBBBB CCCCCCCC DDDDDDDD
> -
> -     The 1-bits make sure that carries propagate to the next 0-bit.
> -     The 0-bits provide holes for carries to fall into.  */
> -  magic_bits = -1;
> -  magic_bits = magic_bits / 0xff * 0xfe << 1 >> 1 | 1;
> -
> -  /* Set up a longword, each of whose bytes is C.  */
> -  charmask = c | (c << 8);
> -  charmask |= charmask << 16;
> -  if (sizeof (longword) > 4)
> -    /* Do the shift in two steps to avoid a warning if long has 32 bits.  */
> -    charmask |= (charmask << 16) << 16;
> -  if (sizeof (longword) > 8)
> -    abort ();
> -
> -  /* Instead of the traditional loop which tests each character,
> -     we will test a longword at a time.  The tricky part is testing
> -     if *any of the four* bytes in the longword in question are zero.  */
> -  for (;;)
> -    {
> -      /* We tentatively exit the loop if adding MAGIC_BITS to
> -        LONGWORD fails to change any of the hole bits of LONGWORD.
> -
> -        1) Is this safe?  Will it catch all the zero bytes?
> -        Suppose there is a byte with all zeros.  Any carry bits
> -        propagating from its left will fall into the hole at its
> -        least significant bit and stop.  Since there will be no
> -        carry from its most significant bit, the LSB of the
> -        byte to the left will be unchanged, and the zero will be
> -        detected.
> +  /* Set up a word, each of whose bytes is C.  */
> +  op_t repeated_c = repeat_bytes (c_in);
>
> -        2) Is this worthwhile?  Will it ignore everything except
> -        zero bytes?  Suppose every byte of LONGWORD has a bit set
> -        somewhere.  There will be a carry into bit 8.  If bit 8
> -        is set, this will carry into bit 16.  If bit 8 is clear,
> -        one of bits 9-15 must be set, so there will be a carry
> -        into bit 16.  Similarly, there will be a carry into bit
> -        24.  If one of bits 24-30 is set, there will be a carry
> -        into bit 31, so all of the hole bits will be changed.
> +  /* Align the input address to op_t.  */
> +  uintptr_t s_int = (uintptr_t) str;
> +  const op_t *word_ptr = word_containing (str);
>
> -        The one misfire occurs when bits 24-30 are clear and bit
> -        31 is set; in this case, the hole at bit 31 is not
> -        changed.  If we had access to the processor carry flag,
> -        we could close this loophole by putting the fourth hole
> -        at bit 32!
> +  /* Read the first aligned word, but force bytes before the string to
> +     match neither zero nor goal (we make sure the high bit of each byte
> +     is 1, and the low 7 bits are all the opposite of the goal byte).  */
> +  op_t bmask = create_mask (s_int);
> +  op_t word = (*word_ptr | bmask) ^ (repeated_c & highbit_mask (bmask));

I think a much clearer option (and probably better codegen) is:
find_zero_eq_low/all(word, repeated) >> (s_int * CHAR_BIT)

>
> -        So it ignores everything except 128's, when they're aligned
> -        properly.
> +  while (! has_zero_eq (word, repeated_c))
> +    word = *++word_ptr;
>
> -        3) But wait!  Aren't we looking for C as well as zero?
> -        Good point.  So what we do is XOR LONGWORD with a longword,
> -        each of whose bytes is C.  This turns each byte that is C
> -        into a zero.  */
> -
> -      longword = *longword_ptr++;
> -
> -      /* Add MAGIC_BITS to LONGWORD.  */
> -      if ((((longword + magic_bits)
> -
> -           /* Set those bits that were unchanged by the addition.  */
> -           ^ ~longword)
> -
> -          /* Look at only the hole bits.  If any of the hole bits
> -             are unchanged, most likely one of the bytes was a
> -             zero.  */
> -          & ~magic_bits) != 0
> -
> -         /* That caught zeroes.  Now test for C.  */
> -         || ((((longword ^ charmask) + magic_bits) ^ ~(longword ^ charmask))
> -             & ~magic_bits) != 0)
> -       {
> -         /* Which of the bytes was C or zero?
> -            If none of them were, it was a misfire; continue the search.  */
> -
> -         const unsigned char *cp = (const unsigned char *) (longword_ptr - 1);
> -
> -         if (*cp == c || *cp == '\0')
> -           return (char *) cp;
> -         if (*++cp == c || *cp == '\0')
> -           return (char *) cp;
> -         if (*++cp == c || *cp == '\0')
> -           return (char *) cp;
> -         if (*++cp == c || *cp == '\0')
> -           return (char *) cp;
> -         if (sizeof (longword) > 4)
> -           {
> -             if (*++cp == c || *cp == '\0')
> -               return (char *) cp;
> -             if (*++cp == c || *cp == '\0')
> -               return (char *) cp;
> -             if (*++cp == c || *cp == '\0')
> -               return (char *) cp;
> -             if (*++cp == c || *cp == '\0')
> -               return (char *) cp;
> -           }
> -       }
> -    }
> -
> -  /* This should never happen.  */
> -  return NULL;
> +  op_t found = index_first_zero_eq (word, repeated_c);
> +  return (char *) (word_ptr) + found;
>  }
> -
> +#ifndef STRCHRNUL
>  weak_alias (__strchrnul, strchrnul)
> +#endif
> diff --git a/sysdeps/powerpc/powerpc32/power4/multiarch/strchrnul-ppc32.c b/sysdeps/powerpc/powerpc32/power4/multiarch/strchrnul-ppc32.c
> index ed86b5e671..9c85e269f7 100644
> --- a/sysdeps/powerpc/powerpc32/power4/multiarch/strchrnul-ppc32.c
> +++ b/sysdeps/powerpc/powerpc32/power4/multiarch/strchrnul-ppc32.c
> @@ -19,10 +19,6 @@
>  #include <string.h>
>
>  #define STRCHRNUL  __strchrnul_ppc
> -
> -#undef weak_alias
> -#define weak_alias(a,b )
> -
>  extern __typeof (strchrnul) __strchrnul_ppc attribute_hidden;
>
>  #include <string/strchrnul.c>
> diff --git a/sysdeps/s390/strchrnul-c.c b/sysdeps/s390/strchrnul-c.c
> index 4ffac54edd..2ebbcc62f7 100644
> --- a/sysdeps/s390/strchrnul-c.c
> +++ b/sysdeps/s390/strchrnul-c.c
> @@ -22,8 +22,6 @@
>  # if HAVE_STRCHRNUL_IFUNC
>  #  define STRCHRNUL STRCHRNUL_C
>  #  define __strchrnul STRCHRNUL
> -#  undef weak_alias
> -#  define weak_alias(name, alias)
>  # endif
>
>  # include <string/strchrnul.c>
> --
> 2.34.1
>


* Re: [PATCH v5 07/17] string: Improve generic strchr
  2023-01-05 23:09   ` Noah Goldstein
@ 2023-01-05 23:19     ` Noah Goldstein
  2023-01-09 19:39       ` Adhemerval Zanella Netto
  0 siblings, 1 reply; 55+ messages in thread
From: Noah Goldstein @ 2023-01-05 23:19 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: libc-alpha, Richard Henderson

On Thu, Jan 5, 2023 at 3:09 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> On Mon, Sep 19, 2022 at 1:01 PM Adhemerval Zanella via Libc-alpha
> <libc-alpha@sourceware.org> wrote:
> >
> > The new algorithm has the following key differences:
> >
> >   - Reads the first word unaligned and uses the string-maskoff
> >     function to remove unwanted data.  This strategy follows the
> >     arch-specific optimizations used on aarch64 and powerpc.
> >
> >   - Use string-fz{b,i} and string-extbyte function.
> >
> > Checked on x86_64-linux-gnu, i686-linux-gnu, powerpc-linux-gnu,
> > and powerpc64-linux-gnu by removing the arch-specific assembly
> > implementation and disabling multi-arch (it covers both LE and BE
> > for 64 and 32 bits).
> >
> > Co-authored-by: Richard Henderson  <rth@twiddle.net>
> > ---
> >  string/strchr.c         | 172 +++++++---------------------------------
> >  sysdeps/s390/strchr-c.c |  11 +--
> >  2 files changed, 34 insertions(+), 149 deletions(-)
> >
> > diff --git a/string/strchr.c b/string/strchr.c
> > index bfd0c4e4bc..6bbee7f79d 100644
> > --- a/string/strchr.c
> > +++ b/string/strchr.c
> > @@ -22,164 +22,48 @@
> >
> >  #include <string.h>
> >  #include <stdlib.h>
> > +#include <stdint.h>
> > +#include <string-fza.h>
> > +#include <string-fzb.h>
> > +#include <string-fzi.h>
> > +#include <string-extbyte.h>
> > +#include <string-maskoff.h>
> >
> >  #undef strchr
> > +#undef index
> >
> > -#ifndef STRCHR
> > -# define STRCHR strchr
> > +#ifdef STRCHR
> > +# define strchr STRCHR
> >  #endif
> >
> >  /* Find the first occurrence of C in S.  */
> >  char *
> > -STRCHR (const char *s, int c_in)
> > +strchr (const char *s, int c_in)
> >  {
> > -  const unsigned char *char_ptr;
> > -  const unsigned long int *longword_ptr;
> > -  unsigned long int longword, magic_bits, charmask;
> > -  unsigned char c;
> > -
> > -  c = (unsigned char) c_in;
> > -
> > -  /* Handle the first few characters by reading one character at a time.
> > -     Do this until CHAR_PTR is aligned on a longword boundary.  */
> > -  for (char_ptr = (const unsigned char *) s;
> > -       ((unsigned long int) char_ptr & (sizeof (longword) - 1)) != 0;
> > -       ++char_ptr)
> > -    if (*char_ptr == c)
> > -      return (void *) char_ptr;
> > -    else if (*char_ptr == '\0')
> > -      return NULL;
> > -
> > -  /* All these elucidatory comments refer to 4-byte longwords,
> > -     but the theory applies equally well to 8-byte longwords.  */
> > -
> > -  longword_ptr = (unsigned long int *) char_ptr;
> > -
> > -  /* Bits 31, 24, 16, and 8 of this number are zero.  Call these bits
> > -     the "holes."  Note that there is a hole just to the left of
> > -     each byte, with an extra at the end:
> > -
> > -     bits:  01111110 11111110 11111110 11111111
> > -     bytes: AAAAAAAA BBBBBBBB CCCCCCCC DDDDDDDD
> > -
> > -     The 1-bits make sure that carries propagate to the next 0-bit.
> > -     The 0-bits provide holes for carries to fall into.  */
> > -  magic_bits = -1;
> > -  magic_bits = magic_bits / 0xff * 0xfe << 1 >> 1 | 1;
> > -
> > -  /* Set up a longword, each of whose bytes is C.  */
> > -  charmask = c | (c << 8);
> > -  charmask |= charmask << 16;
> > -  if (sizeof (longword) > 4)
> > -    /* Do the shift in two steps to avoid a warning if long has 32 bits.  */
> > -    charmask |= (charmask << 16) << 16;
> > -  if (sizeof (longword) > 8)
> > -    abort ();
> > -
> > -  /* Instead of the traditional loop which tests each character,
> > -     we will test a longword at a time.  The tricky part is testing
> > -     if *any of the four* bytes in the longword in question are zero.  */
> > -  for (;;)
> > -    {
> > -      /* We tentatively exit the loop if adding MAGIC_BITS to
> > -        LONGWORD fails to change any of the hole bits of LONGWORD.
> > -
> > -        1) Is this safe?  Will it catch all the zero bytes?
> > -        Suppose there is a byte with all zeros.  Any carry bits
> > -        propagating from its left will fall into the hole at its
> > -        least significant bit and stop.  Since there will be no
> > -        carry from its most significant bit, the LSB of the
> > -        byte to the left will be unchanged, and the zero will be
> > -        detected.
> > +  /* Set up a word, each of whose bytes is C.  */
> > +  unsigned char c = (unsigned char) c_in;
> > +  op_t repeated_c = repeat_bytes (c_in);
> >
> > -        2) Is this worthwhile?  Will it ignore everything except
> > -        zero bytes?  Suppose every byte of LONGWORD has a bit set
> > -        somewhere.  There will be a carry into bit 8.  If bit 8
> > -        is set, this will carry into bit 16.  If bit 8 is clear,
> > -        one of bits 9-15 must be set, so there will be a carry
> > -        into bit 16.  Similarly, there will be a carry into bit
> > -        24.  If one of bits 24-30 is set, there will be a carry
> > -        into bit 31, so all of the hole bits will be changed.
> > +  /* Align the input address to op_t.  */
> > +  uintptr_t s_int = (uintptr_t) s;
> > +  const op_t *word_ptr = word_containing (s);
> >
> > -        The one misfire occurs when bits 24-30 are clear and bit
> > -        31 is set; in this case, the hole at bit 31 is not
> > -        changed.  If we had access to the processor carry flag,
> > -        we could close this loophole by putting the fourth hole
> > -        at bit 32!
> > +  /* Read the first aligned word, but force bytes before the string to
> > +     match neither zero nor goal (we make sure the high bit of each byte
> > +     is 1, and the low 7 bits are all the opposite of the goal byte).  */
> > +  op_t bmask = create_mask (s_int);
> > +  op_t word = (*word_ptr | bmask) ^ (repeated_c & highbit_mask (bmask));
> >
> > -        So it ignores everything except 128's, when they're aligned
> > -        properly.
> > +  while (! has_zero_eq (word, repeated_c))
> > +    word = *++word_ptr;
> >
> > -        3) But wait!  Aren't we looking for C as well as zero?
> > -        Good point.  So what we do is XOR LONGWORD with a longword,
> > -        each of whose bytes is C.  This turns each byte that is C
> > -        into a zero.  */
> > -
> > -      longword = *longword_ptr++;
> > -
> > -      /* Add MAGIC_BITS to LONGWORD.  */
> > -      if ((((longword + magic_bits)
> > -
> > -           /* Set those bits that were unchanged by the addition.  */
> > -           ^ ~longword)
> > -
> > -          /* Look at only the hole bits.  If any of the hole bits
> > -             are unchanged, most likely one of the bytes was a
> > -             zero.  */
> > -          & ~magic_bits) != 0
> > -
> > -         /* That caught zeroes.  Now test for C.  */
> > -         || ((((longword ^ charmask) + magic_bits) ^ ~(longword ^ charmask))
> > -             & ~magic_bits) != 0)
> > -       {
> > -         /* Which of the bytes was C or zero?
> > -            If none of them were, it was a misfire; continue the search.  */
> > -
> > -         const unsigned char *cp = (const unsigned char *) (longword_ptr - 1);
> > -
> > -         if (*cp == c)
> > -           return (char *) cp;
> > -         else if (*cp == '\0')
> > -           return NULL;
> > -         if (*++cp == c)
> > -           return (char *) cp;
> > -         else if (*cp == '\0')
> > -           return NULL;
> > -         if (*++cp == c)
> > -           return (char *) cp;
> > -         else if (*cp == '\0')
> > -           return NULL;
> > -         if (*++cp == c)
> > -           return (char *) cp;
> > -         else if (*cp == '\0')
> > -           return NULL;
> > -         if (sizeof (longword) > 4)
> > -           {
> > -             if (*++cp == c)
> > -               return (char *) cp;
> > -             else if (*cp == '\0')
> > -               return NULL;
> > -             if (*++cp == c)
> > -               return (char *) cp;
> > -             else if (*cp == '\0')
> > -               return NULL;
> > -             if (*++cp == c)
> > -               return (char *) cp;
> > -             else if (*cp == '\0')
> > -               return NULL;
> > -             if (*++cp == c)
> > -               return (char *) cp;
> > -             else if (*cp == '\0')
> > -               return NULL;
> > -           }
> > -       }
> > -    }
> > +  op_t found = index_first_zero_eq (word, repeated_c);
> >
> > +  if (extractbyte (word, found) == c)
> > +    return (char *) (word_ptr) + found;
> >    return NULL;
> >  }
> > -
> > -#ifdef weak_alias
> > -# undef index
> > +#ifndef STRCHR
> >  weak_alias (strchr, index)
> > -#endif
> >  libc_hidden_builtin_def (strchr)
> > +#endif
> > diff --git a/sysdeps/s390/strchr-c.c b/sysdeps/s390/strchr-c.c
> > index 4ac3a62fba..a5a1781b1c 100644
> > --- a/sysdeps/s390/strchr-c.c
> > +++ b/sysdeps/s390/strchr-c.c
> > @@ -21,13 +21,14 @@
> >  #if HAVE_STRCHR_C
> >  # if HAVE_STRCHR_IFUNC
> >  #  define STRCHR STRCHR_C
> > -#  undef weak_alias
> > +# endif
> > +
> > +# include <string/strchr.c>
> > +
> > +# if HAVE_STRCHR_IFUNC
> >  #  if defined SHARED && IS_IN (libc)
> > -#   undef libc_hidden_builtin_def
> > -#   define libc_hidden_builtin_def(name)                       \
> > -     __hidden_ver1 (__strchr_c, __GI_strchr, __strchr_c);
> > +__hidden_ver1 (__strchr_c, __GI_strchr, __strchr_c);
> >  #  endif
> >  # endif
> >
> > -# include <string/strchr.c>
> >  #endif
> > --
> > 2.34.1
> >
>
> Can this just be implemented as:
>
> char * r = strchrnul(p, c);
> return *r ? r : NULL;
That's wrong; it should be: `return (*r == c) ? r : NULL;`
>
> then only have strchrnul impl to worry about?


* Re: [PATCH v5 03/17] Add string-maskoff.h generic header
  2023-01-05 22:49   ` Noah Goldstein
@ 2023-01-05 23:26     ` Alejandro Colomar
  2023-01-09 18:19       ` Adhemerval Zanella Netto
  2023-01-09 18:02     ` Adhemerval Zanella Netto
  1 sibling, 1 reply; 55+ messages in thread
From: Alejandro Colomar @ 2023-01-05 23:26 UTC (permalink / raw)
  To: Noah Goldstein, Adhemerval Zanella; +Cc: libc-alpha





On 1/5/23 23:49, Noah Goldstein via Libc-alpha wrote:
> On Mon, Sep 19, 2022 at 12:59 PM Adhemerval Zanella via Libc-alpha
> <libc-alpha@sourceware.org> wrote:
>>
>> Macros to operate on unaligned access for string operations:
>>
>>    - create_mask: create a mask based on pointer alignment to sets up
>>      non-zero bytes before the beginning of the word so a following
>>      operation (such as find zero) might ignore these bytes.
>>
>>    - highbit_mask: create a mask with high bit of each byte being 1,
>>      and the low 7 bits being all the opposite of the input.
>>
>> These macros are meant to be used on optimized vectorized string
>> implementations.
>> ---
>>   sysdeps/generic/string-maskoff.h | 73 ++++++++++++++++++++++++++++++++
>>   1 file changed, 73 insertions(+)
>>   create mode 100644 sysdeps/generic/string-maskoff.h
>>
>> diff --git a/sysdeps/generic/string-maskoff.h b/sysdeps/generic/string-maskoff.h
>> new file mode 100644
>> index 0000000000..831647bda6
>> --- /dev/null
>> +++ b/sysdeps/generic/string-maskoff.h
>> @@ -0,0 +1,73 @@
>> +/* Mask off bits.  Generic C version.
>> +   Copyright (C) 2022 Free Software Foundation, Inc.
>> +   This file is part of the GNU C Library.
>> +
>> +   The GNU C Library is free software; you can redistribute it and/or
>> +   modify it under the terms of the GNU Lesser General Public
>> +   License as published by the Free Software Foundation; either
>> +   version 2.1 of the License, or (at your option) any later version.
>> +
>> +   The GNU C Library is distributed in the hope that it will be useful,
>> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
>> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>> +   Lesser General Public License for more details.
>> +
>> +   You should have received a copy of the GNU Lesser General Public
>> +   License along with the GNU C Library; if not, see
>> +   <http://www.gnu.org/licenses/>.  */
>> +
>> +#ifndef _STRING_MASKOFF_H
>> +#define _STRING_MASKOFF_H 1
>> +
>> +#include <endian.h>
>> +#include <limits.h>
>> +#include <stdint.h>
>> +#include <string-optype.h>
>> +
>> +/* Provide a mask based on the pointer alignment that sets up non-zero
>> +   bytes before the beginning of the word.  It is used to mask off
>> +   undesirable bits from an aligned read from an unaligned pointer.
>> +   For instance, on a 64 bits machine with a pointer alignment of
>> +   3 the function returns 0x0000000000ffffff for LE and 0xffffff0000000000
>> +   (meaning to mask off the initial 3 bytes).  */
>> +static inline op_t
>> +create_mask (uintptr_t i)
>> +{
>> +  i = i % sizeof (op_t);
>> +  if (__BYTE_ORDER == __LITTLE_ENDIAN)
>> +    return ~(((op_t)-1) << (i * CHAR_BIT));
>> +  else
>> +    return ~(((op_t)-1) >> (i * CHAR_BIT));
>> +}
>> +
>> +/* Setup an word with each byte being c_in.  For instance, on a 64 bits
>> +   machine with input as 0xce the functions returns 0xcececececececece.  */
>> +static inline op_t

Hi Adhemerval and Noah,

I don't know what the minimum C version for compiling glibc is, but if you 
can ignore C89, I would propose the following:

'static inline' should be restricted to .c files, since if the compiler decides 
to not inline and you have it in a header, you end up with multiple static 
definitions for the same code.

In headers, I use C99 inline, which doesn't emit any object code when the 
compiler decides to not inline.  Then in a .c file, you add a prototype using 
'extern inline', and the compiler will emit code there, exactly once.

Even if you have to support C89, I'd use [[gnu::always_inline]] together with 
'static inline', to make sure that the compiler doesn't do nefarious stuff.

Cheers,

Alex

-- 
<http://www.alejandro-colomar.es/>



* Re: [PATCH v5 10/17] string: Improve generic memchr
  2022-09-19 19:59 ` [PATCH v5 10/17] string: Improve generic memchr Adhemerval Zanella
@ 2023-01-05 23:47   ` Noah Goldstein
  2023-01-09 20:50     ` Adhemerval Zanella Netto
  2023-01-05 23:49   ` Noah Goldstein
  1 sibling, 1 reply; 55+ messages in thread
From: Noah Goldstein @ 2023-01-05 23:47 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: libc-alpha, Richard Henderson

On Mon, Sep 19, 2022 at 1:05 PM Adhemerval Zanella via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> The new algorithm has the following key differences:
>
>   - Reads the first word unaligned and uses the string-maskoff
>     function to remove unwanted data.  This strategy follows the
>     arch-specific optimizations used on aarch64 and powerpc.
>
>   - Use string-fz{b,i} and string-opthr functions.
>
> Checked on x86_64-linux-gnu, i686-linux-gnu, powerpc-linux-gnu,
> and powerpc64-linux-gnu by removing the arch-specific assembly
> implementation and disabling multi-arch (it covers both LE and BE
> for 64 and 32 bits).
>
> Co-authored-by: Richard Henderson  <rth@twiddle.net>
> ---
>  string/memchr.c                               | 168 +++++-------------
>  .../powerpc32/power4/multiarch/memchr-ppc32.c |  14 +-
>  .../powerpc64/multiarch/memchr-ppc64.c        |   9 +-
>  3 files changed, 48 insertions(+), 143 deletions(-)
>
> diff --git a/string/memchr.c b/string/memchr.c
> index 422bcd0cd6..08d518b02d 100644
> --- a/string/memchr.c
> +++ b/string/memchr.c
> @@ -1,10 +1,6 @@
> -/* Copyright (C) 1991-2022 Free Software Foundation, Inc.
> +/* Scan memory for a character.  Generic version
> +   Copyright (C) 1991-2022 Free Software Foundation, Inc.
>     This file is part of the GNU C Library.
> -   Based on strlen implementation by Torbjorn Granlund (tege@sics.se),
> -   with help from Dan Sahlin (dan@sics.se) and
> -   commentary by Jim Blandy (jimb@ai.mit.edu);
> -   adaptation to memchr suggested by Dick Karpinski (dick@cca.ucsf.edu),
> -   and implemented by Roland McGrath (roland@ai.mit.edu).
>
>     The GNU C Library is free software; you can redistribute it and/or
>     modify it under the terms of the GNU Lesser General Public
> @@ -20,143 +16,65 @@
>     License along with the GNU C Library; if not, see
>     <https://www.gnu.org/licenses/>.  */
>
> -#ifndef _LIBC
> -# include <config.h>
> -#endif
> -
> +#include <intprops.h>
> +#include <string-fza.h>
> +#include <string-fzb.h>
> +#include <string-fzi.h>
> +#include <string-maskoff.h>
> +#include <string-opthr.h>
>  #include <string.h>
>
> -#include <stddef.h>
> +#undef memchr
>
> -#include <limits.h>
> -
> -#undef __memchr
> -#ifdef _LIBC
> -# undef memchr
> +#ifdef MEMCHR
> +# define __memchr MEMCHR
>  #endif
>
> -#ifndef weak_alias
> -# define __memchr memchr
> -#endif
> -
> -#ifndef MEMCHR
> -# define MEMCHR __memchr
> -#endif
> +static inline const char *
> +sadd (uintptr_t x, uintptr_t y)
> +{
> +  uintptr_t ret = INT_ADD_OVERFLOW (x, y) ? (uintptr_t)-1 : x + y;
> +  return (const char *)ret;
> +}
>
>  /* Search no more than N bytes of S for C.  */
>  void *
> -MEMCHR (void const *s, int c_in, size_t n)
> +__memchr (void const *s, int c_in, size_t n)
>  {
> -  /* On 32-bit hardware, choosing longword to be a 32-bit unsigned
> -     long instead of a 64-bit uintmax_t tends to give better
> -     performance.  On 64-bit hardware, unsigned long is generally 64
> -     bits already.  Change this typedef to experiment with
> -     performance.  */
> -  typedef unsigned long int longword;
> +  if (__glibc_unlikely (n == 0))
> +    return NULL;
>
> -  const unsigned char *char_ptr;
> -  const longword *longword_ptr;
> -  longword repeated_one;
> -  longword repeated_c;
> -  unsigned char c;
> +  uintptr_t s_int = (uintptr_t) s;
>
> -  c = (unsigned char) c_in;
> +  /* Set up a word, each of whose bytes is C.  */
> +  op_t repeated_c = repeat_bytes (c_in);
> +  op_t before_mask = create_mask (s_int);
>
> -  /* Handle the first few bytes by reading one byte at a time.
> -     Do this until CHAR_PTR is aligned on a longword boundary.  */
> -  for (char_ptr = (const unsigned char *) s;
> -       n > 0 && (size_t) char_ptr % sizeof (longword) != 0;
> -       --n, ++char_ptr)
> -    if (*char_ptr == c)
> -      return (void *) char_ptr;
> +  /* Compute the address of the last byte taking in consideration possible
> +     overflow.  */
> +  const char *lbyte = sadd (s_int, n - 1);

Do you need this? The comparison in the loop is ==, so letting it
overflow should be fine, no?
>
> -  longword_ptr = (const longword *) char_ptr;
> +  /* Compute the address of the word containing the last byte. */
> +  const op_t *lword = word_containing (lbyte);
>
> -  /* All these elucidatory comments refer to 4-byte longwords,
> -     but the theory applies equally well to any size longwords.  */
> +  /* Read the first word, but munge it so that bytes before the array
> +     will not match goal.  */
> +  const op_t *word_ptr = word_containing (s);
> +  op_t word = (*word_ptr | before_mask) ^ (repeated_c & before_mask);

Likewise, prefer just shifting out the invalid comparisons on the first word.
>
> -  /* Compute auxiliary longword values:
> -     repeated_one is a value which has a 1 in every byte.
> -     repeated_c has c in every byte.  */
> -  repeated_one = 0x01010101;
> -  repeated_c = c | (c << 8);
> -  repeated_c |= repeated_c << 16;
> -  if (0xffffffffU < (longword) -1)
> +  while (has_eq (word, repeated_c) == 0)
>      {
> -      repeated_one |= repeated_one << 31 << 1;
> -      repeated_c |= repeated_c << 31 << 1;
> -      if (8 < sizeof (longword))
> -       {
> -         size_t i;
> -
> -         for (i = 64; i < sizeof (longword) * 8; i *= 2)
> -           {
> -             repeated_one |= repeated_one << i;
> -             repeated_c |= repeated_c << i;
> -           }
> -       }
> +      if (word_ptr == lword)
> +       return NULL;
> +      word = *++word_ptr;
>      }
>
> -  /* Instead of the traditional loop which tests each byte, we will test a
> -     longword at a time.  The tricky part is testing if *any of the four*
> -     bytes in the longword in question are equal to c.  We first use an xor
> -     with repeated_c.  This reduces the task to testing whether *any of the
> -     four* bytes in longword1 is zero.
> -
> -     We compute tmp =
> -       ((longword1 - repeated_one) & ~longword1) & (repeated_one << 7).
> -     That is, we perform the following operations:
> -       1. Subtract repeated_one.
> -       2. & ~longword1.
> -       3. & a mask consisting of 0x80 in every byte.
> -     Consider what happens in each byte:
> -       - If a byte of longword1 is zero, step 1 and 2 transform it into 0xff,
> -        and step 3 transforms it into 0x80.  A carry can also be propagated
> -        to more significant bytes.
> -       - If a byte of longword1 is nonzero, let its lowest 1 bit be at
> -        position k (0 <= k <= 7); so the lowest k bits are 0.  After step 1,
> -        the byte ends in a single bit of value 0 and k bits of value 1.
> -        After step 2, the result is just k bits of value 1: 2^k - 1.  After
> -        step 3, the result is 0.  And no carry is produced.
> -     So, if longword1 has only non-zero bytes, tmp is zero.
> -     Whereas if longword1 has a zero byte, call j the position of the least
> -     significant zero byte.  Then the result has a zero at positions 0, ...,
> -     j-1 and a 0x80 at position j.  We cannot predict the result at the more
> -     significant bytes (positions j+1..3), but it does not matter since we
> -     already have a non-zero bit at position 8*j+7.
> -
> -     So, the test whether any byte in longword1 is zero is equivalent to
> -     testing whether tmp is nonzero.  */
> -
> -  while (n >= sizeof (longword))
> -    {
> -      longword longword1 = *longword_ptr ^ repeated_c;
> -
> -      if ((((longword1 - repeated_one) & ~longword1)
> -          & (repeated_one << 7)) != 0)
> -       break;
> -      longword_ptr++;
> -      n -= sizeof (longword);
> -    }
> -
> -  char_ptr = (const unsigned char *) longword_ptr;
> -
> -  /* At this point, we know that either n < sizeof (longword), or one of the
> -     sizeof (longword) bytes starting at char_ptr is == c.  On little-endian
> -     machines, we could determine the first such byte without any further
> -     memory accesses, just by looking at the tmp result from the last loop
> -     iteration.  But this does not work on big-endian machines.  Choose code
> -     that works in both cases.  */
> -
> -  for (; n > 0; --n, ++char_ptr)
> -    {
> -      if (*char_ptr == c)
> -       return (void *) char_ptr;
> -    }
> -
> -  return NULL;
> +  /* We found a match, but it might be in a byte past the end
> +     of the array.  */
> +  char *ret = (char *) word_ptr + index_first_eq (word, repeated_c);
> +  return (ret <= lbyte) ? ret : NULL;
>  }
> -#ifdef weak_alias
> +#ifndef MEMCHR
>  weak_alias (__memchr, memchr)
> -#endif
>  libc_hidden_builtin_def (memchr)
> +#endif
> diff --git a/sysdeps/powerpc/powerpc32/power4/multiarch/memchr-ppc32.c b/sysdeps/powerpc/powerpc32/power4/multiarch/memchr-ppc32.c
> index fc69df54b3..02877d3c98 100644
> --- a/sysdeps/powerpc/powerpc32/power4/multiarch/memchr-ppc32.c
> +++ b/sysdeps/powerpc/powerpc32/power4/multiarch/memchr-ppc32.c
> @@ -18,17 +18,11 @@
>
>  #include <string.h>
>
> -#define MEMCHR  __memchr_ppc
> +extern __typeof (memchr) __memchr_ppc attribute_hidden;
>
> -#undef weak_alias
> -#define weak_alias(a, b)
> +#define MEMCHR  __memchr_ppc
> +#include <string/memchr.c>
>
>  #ifdef SHARED
> -# undef libc_hidden_builtin_def
> -# define libc_hidden_builtin_def(name) \
> -  __hidden_ver1(__memchr_ppc, __GI_memchr, __memchr_ppc);
> +__hidden_ver1(__memchr_ppc, __GI_memchr, __memchr_ppc);
>  #endif
> -
> -extern __typeof (memchr) __memchr_ppc attribute_hidden;
> -
> -#include <string/memchr.c>
> diff --git a/sysdeps/powerpc/powerpc64/multiarch/memchr-ppc64.c b/sysdeps/powerpc/powerpc64/multiarch/memchr-ppc64.c
> index 3c966f4403..15beca787b 100644
> --- a/sysdeps/powerpc/powerpc64/multiarch/memchr-ppc64.c
> +++ b/sysdeps/powerpc/powerpc64/multiarch/memchr-ppc64.c
> @@ -18,14 +18,7 @@
>
>  #include <string.h>
>
> -#define MEMCHR  __memchr_ppc
> -
> -#undef weak_alias
> -#define weak_alias(a, b)
> -
> -# undef libc_hidden_builtin_def
> -# define libc_hidden_builtin_def(name)
> -
>  extern __typeof (memchr) __memchr_ppc attribute_hidden;
>
> +#define MEMCHR  __memchr_ppc
>  #include <string/memchr.c>
> --
> 2.34.1
>

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v5 10/17] string: Improve generic memchr
  2022-09-19 19:59 ` [PATCH v5 10/17] string: Improve generic memchr Adhemerval Zanella
  2023-01-05 23:47   ` Noah Goldstein
@ 2023-01-05 23:49   ` Noah Goldstein
  2023-01-09 20:51     ` Adhemerval Zanella Netto
  1 sibling, 1 reply; 55+ messages in thread
From: Noah Goldstein @ 2023-01-05 23:49 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: libc-alpha, Richard Henderson

On Mon, Sep 19, 2022 at 1:05 PM Adhemerval Zanella via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> The new algorithm has the following key differences:
>
>   - Reads the first word unaligned and uses the string-maskoff function
>     to remove unwanted data.  This strategy follows the arch-specific
>     optimization used on aarch64 and powerpc.
>
>   - Use string-fz{b,i} and string-opthr functions.
>
> Checked on x86_64-linux-gnu, i686-linux-gnu, powerpc-linux-gnu,
> and powerpc64-linux-gnu by removing the arch-specific assembly
> implementation and disabling multi-arch (it covers both LE and BE
> for 64 and 32 bits).
>
> Co-authored-by: Richard Henderson  <rth@twiddle.net>
> ---
>  string/memchr.c                               | 168 +++++-------------
>  .../powerpc32/power4/multiarch/memchr-ppc32.c |  14 +-
>  .../powerpc64/multiarch/memchr-ppc64.c        |   9 +-
>  3 files changed, 48 insertions(+), 143 deletions(-)
>
> diff --git a/string/memchr.c b/string/memchr.c
> index 422bcd0cd6..08d518b02d 100644
> --- a/string/memchr.c
> +++ b/string/memchr.c
> @@ -1,10 +1,6 @@
> -/* Copyright (C) 1991-2022 Free Software Foundation, Inc.
> +/* Scan memory for a character.  Generic version
> +   Copyright (C) 1991-2022 Free Software Foundation, Inc.
>     This file is part of the GNU C Library.
> -   Based on strlen implementation by Torbjorn Granlund (tege@sics.se),
> -   with help from Dan Sahlin (dan@sics.se) and
> -   commentary by Jim Blandy (jimb@ai.mit.edu);
> -   adaptation to memchr suggested by Dick Karpinski (dick@cca.ucsf.edu),
> -   and implemented by Roland McGrath (roland@ai.mit.edu).
>
>     The GNU C Library is free software; you can redistribute it and/or
>     modify it under the terms of the GNU Lesser General Public
> @@ -20,143 +16,65 @@
>     License along with the GNU C Library; if not, see
>     <https://www.gnu.org/licenses/>.  */
>
> -#ifndef _LIBC
> -# include <config.h>
> -#endif
> -
> +#include <intprops.h>
> +#include <string-fza.h>
> +#include <string-fzb.h>
> +#include <string-fzi.h>
> +#include <string-maskoff.h>
> +#include <string-opthr.h>
>  #include <string.h>
>
> -#include <stddef.h>
> +#undef memchr
>
> -#include <limits.h>
> -
> -#undef __memchr
> -#ifdef _LIBC
> -# undef memchr
> +#ifdef MEMCHR
> +# define __memchr MEMCHR
>  #endif
>
> -#ifndef weak_alias
> -# define __memchr memchr
> -#endif
> -
> -#ifndef MEMCHR
> -# define MEMCHR __memchr
> -#endif
> +static inline const char *
> +sadd (uintptr_t x, uintptr_t y)
> +{
> +  uintptr_t ret = INT_ADD_OVERFLOW (x, y) ? (uintptr_t)-1 : x + y;
> +  return (const char *)ret;
> +}
>
>  /* Search no more than N bytes of S for C.  */
>  void *
> -MEMCHR (void const *s, int c_in, size_t n)
> +__memchr (void const *s, int c_in, size_t n)
>  {
> -  /* On 32-bit hardware, choosing longword to be a 32-bit unsigned
> -     long instead of a 64-bit uintmax_t tends to give better
> -     performance.  On 64-bit hardware, unsigned long is generally 64
> -     bits already.  Change this typedef to experiment with
> -     performance.  */
> -  typedef unsigned long int longword;
> +  if (__glibc_unlikely (n == 0))
> +    return NULL;
>
> -  const unsigned char *char_ptr;
> -  const longword *longword_ptr;
> -  longword repeated_one;
> -  longword repeated_c;
> -  unsigned char c;
> +  uintptr_t s_int = (uintptr_t) s;
>
> -  c = (unsigned char) c_in;
> +  /* Set up a word, each of whose bytes is C.  */
> +  op_t repeated_c = repeat_bytes (c_in);
> +  op_t before_mask = create_mask (s_int);
>
> -  /* Handle the first few bytes by reading one byte at a time.
> -     Do this until CHAR_PTR is aligned on a longword boundary.  */
> -  for (char_ptr = (const unsigned char *) s;
> -       n > 0 && (size_t) char_ptr % sizeof (longword) != 0;
> -       --n, ++char_ptr)
> -    if (*char_ptr == c)
> -      return (void *) char_ptr;
> +  /* Compute the address of the last byte taking in consideration possible
> +     overflow.  */
> +  const char *lbyte = sadd (s_int, n - 1);
>
> -  longword_ptr = (const longword *) char_ptr;
> +  /* Compute the address of the word containing the last byte. */
> +  const op_t *lword = word_containing (lbyte);
>
> -  /* All these elucidatory comments refer to 4-byte longwords,
> -     but the theory applies equally well to any size longwords.  */
> +  /* Read the first word, but munge it so that bytes before the array
> +     will not match goal.  */
> +  const op_t *word_ptr = word_containing (s);
> +  op_t word = (*word_ptr | before_mask) ^ (repeated_c & before_mask);
>
> -  /* Compute auxiliary longword values:
> -     repeated_one is a value which has a 1 in every byte.
> -     repeated_c has c in every byte.  */
> -  repeated_one = 0x01010101;
> -  repeated_c = c | (c << 8);
> -  repeated_c |= repeated_c << 16;
> -  if (0xffffffffU < (longword) -1)
> +  while (has_eq (word, repeated_c) == 0)
>      {
> -      repeated_one |= repeated_one << 31 << 1;
> -      repeated_c |= repeated_c << 31 << 1;
> -      if (8 < sizeof (longword))
> -       {
> -         size_t i;
> -
> -         for (i = 64; i < sizeof (longword) * 8; i *= 2)
> -           {
> -             repeated_one |= repeated_one << i;
> -             repeated_c |= repeated_c << i;
> -           }
> -       }
> +      if (word_ptr == lword)
> +       return NULL;
Intuitively, making lword be lword - 1 so that normal returns don't need the
extra NULL check would be faster.
> +      word = *++word_ptr;
>      }
>
> -  /* Instead of the traditional loop which tests each byte, we will test a
> -     longword at a time.  The tricky part is testing if *any of the four*
> -     bytes in the longword in question are equal to c.  We first use an xor
> -     with repeated_c.  This reduces the task to testing whether *any of the
> -     four* bytes in longword1 is zero.
> -
> -     We compute tmp =
> -       ((longword1 - repeated_one) & ~longword1) & (repeated_one << 7).
> -     That is, we perform the following operations:
> -       1. Subtract repeated_one.
> -       2. & ~longword1.
> -       3. & a mask consisting of 0x80 in every byte.
> -     Consider what happens in each byte:
> -       - If a byte of longword1 is zero, step 1 and 2 transform it into 0xff,
> -        and step 3 transforms it into 0x80.  A carry can also be propagated
> -        to more significant bytes.
> -       - If a byte of longword1 is nonzero, let its lowest 1 bit be at
> -        position k (0 <= k <= 7); so the lowest k bits are 0.  After step 1,
> -        the byte ends in a single bit of value 0 and k bits of value 1.
> -        After step 2, the result is just k bits of value 1: 2^k - 1.  After
> -        step 3, the result is 0.  And no carry is produced.
> -     So, if longword1 has only non-zero bytes, tmp is zero.
> -     Whereas if longword1 has a zero byte, call j the position of the least
> -     significant zero byte.  Then the result has a zero at positions 0, ...,
> -     j-1 and a 0x80 at position j.  We cannot predict the result at the more
> -     significant bytes (positions j+1..3), but it does not matter since we
> -     already have a non-zero bit at position 8*j+7.
> -
> -     So, the test whether any byte in longword1 is zero is equivalent to
> -     testing whether tmp is nonzero.  */
> -
> -  while (n >= sizeof (longword))
> -    {
> -      longword longword1 = *longword_ptr ^ repeated_c;
> -
> -      if ((((longword1 - repeated_one) & ~longword1)
> -          & (repeated_one << 7)) != 0)
> -       break;
> -      longword_ptr++;
> -      n -= sizeof (longword);
> -    }
> -
> -  char_ptr = (const unsigned char *) longword_ptr;
> -
> -  /* At this point, we know that either n < sizeof (longword), or one of the
> -     sizeof (longword) bytes starting at char_ptr is == c.  On little-endian
> -     machines, we could determine the first such byte without any further
> -     memory accesses, just by looking at the tmp result from the last loop
> -     iteration.  But this does not work on big-endian machines.  Choose code
> -     that works in both cases.  */
> -
> -  for (; n > 0; --n, ++char_ptr)
> -    {
> -      if (*char_ptr == c)
> -       return (void *) char_ptr;
> -    }
> -
> -  return NULL;
> +  /* We found a match, but it might be in a byte past the end
> +     of the array.  */
> +  char *ret = (char *) word_ptr + index_first_eq (word, repeated_c);
> +  return (ret <= lbyte) ? ret : NULL;
>  }
> -#ifdef weak_alias
> +#ifndef MEMCHR
>  weak_alias (__memchr, memchr)
> -#endif
>  libc_hidden_builtin_def (memchr)
> +#endif
> diff --git a/sysdeps/powerpc/powerpc32/power4/multiarch/memchr-ppc32.c b/sysdeps/powerpc/powerpc32/power4/multiarch/memchr-ppc32.c
> index fc69df54b3..02877d3c98 100644
> --- a/sysdeps/powerpc/powerpc32/power4/multiarch/memchr-ppc32.c
> +++ b/sysdeps/powerpc/powerpc32/power4/multiarch/memchr-ppc32.c
> @@ -18,17 +18,11 @@
>
>  #include <string.h>
>
> -#define MEMCHR  __memchr_ppc
> +extern __typeof (memchr) __memchr_ppc attribute_hidden;
>
> -#undef weak_alias
> -#define weak_alias(a, b)
> +#define MEMCHR  __memchr_ppc
> +#include <string/memchr.c>
>
>  #ifdef SHARED
> -# undef libc_hidden_builtin_def
> -# define libc_hidden_builtin_def(name) \
> -  __hidden_ver1(__memchr_ppc, __GI_memchr, __memchr_ppc);
> +__hidden_ver1(__memchr_ppc, __GI_memchr, __memchr_ppc);
>  #endif
> -
> -extern __typeof (memchr) __memchr_ppc attribute_hidden;
> -
> -#include <string/memchr.c>
> diff --git a/sysdeps/powerpc/powerpc64/multiarch/memchr-ppc64.c b/sysdeps/powerpc/powerpc64/multiarch/memchr-ppc64.c
> index 3c966f4403..15beca787b 100644
> --- a/sysdeps/powerpc/powerpc64/multiarch/memchr-ppc64.c
> +++ b/sysdeps/powerpc/powerpc64/multiarch/memchr-ppc64.c
> @@ -18,14 +18,7 @@
>
>  #include <string.h>
>
> -#define MEMCHR  __memchr_ppc
> -
> -#undef weak_alias
> -#define weak_alias(a, b)
> -
> -# undef libc_hidden_builtin_def
> -# define libc_hidden_builtin_def(name)
> -
>  extern __typeof (memchr) __memchr_ppc attribute_hidden;
>
> +#define MEMCHR  __memchr_ppc
>  #include <string/memchr.c>
> --
> 2.34.1
>

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v5 11/17] string: Improve generic memrchr
  2022-09-19 19:59 ` [PATCH v5 11/17] string: Improve generic memrchr Adhemerval Zanella
@ 2023-01-05 23:51   ` Noah Goldstein
  0 siblings, 0 replies; 55+ messages in thread
From: Noah Goldstein @ 2023-01-05 23:51 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: libc-alpha, Richard Henderson

On Mon, Sep 19, 2022 at 1:02 PM Adhemerval Zanella via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> The new algorithm has the following key differences:
>
>   - Use string-fz{b,i} functions.
>
> Checked on x86_64-linux-gnu, i686-linux-gnu, powerpc-linux-gnu,
> and powerpc64-linux-gnu by removing the arch-specific assembly
> implementation and disabling multi-arch (it covers both LE and BE
> for 64 and 32 bits).
>
> Co-authored-by: Richard Henderson  <rth@twiddle.net>
> ---
>  string/memrchr.c | 189 ++++++++---------------------------------------
>  1 file changed, 32 insertions(+), 157 deletions(-)
>
> diff --git a/string/memrchr.c b/string/memrchr.c
> index 8eb6829e45..5491689c66 100644
> --- a/string/memrchr.c
> +++ b/string/memrchr.c
> @@ -1,11 +1,6 @@
>  /* memrchr -- find the last occurrence of a byte in a memory block
>     Copyright (C) 1991-2022 Free Software Foundation, Inc.
>     This file is part of the GNU C Library.
> -   Based on strlen implementation by Torbjorn Granlund (tege@sics.se),
> -   with help from Dan Sahlin (dan@sics.se) and
> -   commentary by Jim Blandy (jimb@ai.mit.edu);
> -   adaptation to memchr suggested by Dick Karpinski (dick@cca.ucsf.edu),
> -   and implemented by Roland McGrath (roland@ai.mit.edu).
>
>     The GNU C Library is free software; you can redistribute it and/or
>     modify it under the terms of the GNU Lesser General Public
> @@ -21,177 +16,57 @@
>     License along with the GNU C Library; if not, see
>     <https://www.gnu.org/licenses/>.  */
>
> -#include <stdlib.h>
> -
> -#ifdef HAVE_CONFIG_H
> -# include <config.h>
> -#endif
> -
> -#if defined _LIBC
> -# include <string.h>
> -# include <memcopy.h>
> -#endif
> -
> -#if defined HAVE_LIMITS_H || defined _LIBC
> -# include <limits.h>
> -#endif
> -
> -#define LONG_MAX_32_BITS 2147483647
> -
> -#ifndef LONG_MAX
> -# define LONG_MAX LONG_MAX_32_BITS
> -#endif
> -
> -#include <sys/types.h>
> +#include <string-fzb.h>
> +#include <string-fzi.h>
> +#include <string-maskoff.h>
> +#include <string-opthr.h>
> +#include <string.h>
>
>  #undef __memrchr
>  #undef memrchr
>
> -#ifndef weak_alias
> -# define __memrchr memrchr
> +#ifdef MEMRCHR
> +# define __memrchr MEMRCHR
>  #endif
>
> -/* Search no more than N bytes of S for C.  */
>  void *
> -#ifndef MEMRCHR
> -__memrchr
> -#else
> -MEMRCHR
> -#endif
> -     (const void *s, int c_in, size_t n)
> +__memrchr (const void *s, int c_in, size_t n)
>  {
> -  const unsigned char *char_ptr;
> -  const unsigned long int *longword_ptr;
> -  unsigned long int longword, magic_bits, charmask;
> -  unsigned char c;
> -
> -  c = (unsigned char) c_in;
> -
>    /* Handle the last few characters by reading one character at a time.
> -     Do this until CHAR_PTR is aligned on a longword boundary.  */
> -  for (char_ptr = (const unsigned char *) s + n;
> -       n > 0 && ((unsigned long int) char_ptr
> -                & (sizeof (longword) - 1)) != 0;
> -       --n)
> -    if (*--char_ptr == c)
> +     Do this until CHAR_PTR is aligned on a word boundary, or
> +     the entirety of small inputs.  */
> +  const unsigned char *char_ptr = (const unsigned char *) (s + n);
> +  size_t align = (uintptr_t) char_ptr  % sizeof (op_t);
> +  if (n < OP_T_THRES || align > n)
> +    align = n;
> +  for (size_t i = 0; i < align; ++i)
> +    if (*--char_ptr == c_in)
>        return (void *) char_ptr;
>
> -  /* All these elucidatory comments refer to 4-byte longwords,
> -     but the theory applies equally well to 8-byte longwords.  */
> +  const op_t *word_ptr = (const op_t *) char_ptr;
> +  n -= align;
> +  if (__glibc_unlikely (n == 0))
> +    return NULL;
>
> -  longword_ptr = (const unsigned long int *) char_ptr;
> +  /* Compute the address of the word containing the initial byte. */
> +  const op_t *lword = word_containing (s);
>
> -  /* Bits 31, 24, 16, and 8 of this number are zero.  Call these bits
> -     the "holes."  Note that there is a hole just to the left of
> -     each byte, with an extra at the end:
> -
> -     bits:  01111110 11111110 11111110 11111111
> -     bytes: AAAAAAAA BBBBBBBB CCCCCCCC DDDDDDDD
> -
> -     The 1-bits make sure that carries propagate to the next 0-bit.
> -     The 0-bits provide holes for carries to fall into.  */
> -  magic_bits = -1;
> -  magic_bits = magic_bits / 0xff * 0xfe << 1 >> 1 | 1;
> -
> -  /* Set up a longword, each of whose bytes is C.  */
> -  charmask = c | (c << 8);
> -  charmask |= charmask << 16;
> -#if LONG_MAX > LONG_MAX_32_BITS
> -  charmask |= charmask << 32;
> -#endif
> +  /* Set up a word, each of whose bytes is C.  */
> +  op_t repeated_c = repeat_bytes (c_in);
>
> -  /* Instead of the traditional loop which tests each character,
> -     we will test a longword at a time.  The tricky part is testing
> -     if *any of the four* bytes in the longword in question are zero.  */
> -  while (n >= sizeof (longword))
> +  while (word_ptr != lword)
Again, I would make lword be lword - 1 and move the out-of-bounds check to
the return outside of the loop.
>      {
> -      /* We tentatively exit the loop if adding MAGIC_BITS to
> -        LONGWORD fails to change any of the hole bits of LONGWORD.
> -
> -        1) Is this safe?  Will it catch all the zero bytes?
> -        Suppose there is a byte with all zeros.  Any carry bits
> -        propagating from its left will fall into the hole at its
> -        least significant bit and stop.  Since there will be no
> -        carry from its most significant bit, the LSB of the
> -        byte to the left will be unchanged, and the zero will be
> -        detected.
> -
> -        2) Is this worthwhile?  Will it ignore everything except
> -        zero bytes?  Suppose every byte of LONGWORD has a bit set
> -        somewhere.  There will be a carry into bit 8.  If bit 8
> -        is set, this will carry into bit 16.  If bit 8 is clear,
> -        one of bits 9-15 must be set, so there will be a carry
> -        into bit 16.  Similarly, there will be a carry into bit
> -        24.  If one of bits 24-30 is set, there will be a carry
> -        into bit 31, so all of the hole bits will be changed.
> -
> -        The one misfire occurs when bits 24-30 are clear and bit
> -        31 is set; in this case, the hole at bit 31 is not
> -        changed.  If we had access to the processor carry flag,
> -        we could close this loophole by putting the fourth hole
> -        at bit 32!
> -
> -        So it ignores everything except 128's, when they're aligned
> -        properly.
> -
> -        3) But wait!  Aren't we looking for C, not zero?
> -        Good point.  So what we do is XOR LONGWORD with a longword,
> -        each of whose bytes is C.  This turns each byte that is C
> -        into a zero.  */
> -
> -      longword = *--longword_ptr ^ charmask;
> -
> -      /* Add MAGIC_BITS to LONGWORD.  */
> -      if ((((longword + magic_bits)
> -
> -           /* Set those bits that were unchanged by the addition.  */
> -           ^ ~longword)
> -
> -          /* Look at only the hole bits.  If any of the hole bits
> -             are unchanged, most likely one of the bytes was a
> -             zero.  */
> -          & ~magic_bits) != 0)
> +      op_t word = *--word_ptr;
> +      if (has_eq (word, repeated_c))
>         {
> -         /* Which of the bytes was C?  If none of them were, it was
> -            a misfire; continue the search.  */
> -
> -         const unsigned char *cp = (const unsigned char *) longword_ptr;
> -
> -#if LONG_MAX > 2147483647
> -         if (cp[7] == c)
> -           return (void *) &cp[7];
> -         if (cp[6] == c)
> -           return (void *) &cp[6];
> -         if (cp[5] == c)
> -           return (void *) &cp[5];
> -         if (cp[4] == c)
> -           return (void *) &cp[4];
> -#endif
> -         if (cp[3] == c)
> -           return (void *) &cp[3];
> -         if (cp[2] == c)
> -           return (void *) &cp[2];
> -         if (cp[1] == c)
> -           return (void *) &cp[1];
> -         if (cp[0] == c)
> -           return (void *) cp;
> +         /* We found a match, but it might be in a byte past the start
> +            of the array.  */
> +         char *ret = (char *) word_ptr + index_last_eq (word, repeated_c);
> +         return ret >= (char *) s ? ret : NULL;
>         }
> -
> -      n -= sizeof (longword);
>      }
> -
> -  char_ptr = (const unsigned char *) longword_ptr;
> -
> -  while (n-- > 0)
> -    {
> -      if (*--char_ptr == c)
> -       return (void *) char_ptr;
> -    }
> -
> -  return 0;
> +  return NULL;
>  }
>  #ifndef MEMRCHR
> -# ifdef weak_alias
>  weak_alias (__memrchr, memrchr)
> -# endif
>  #endif
> --
> 2.34.1
>

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v5 00/17] Improve generic string routines
  2023-01-05 21:56   ` Adhemerval Zanella Netto
@ 2023-01-05 23:52     ` Noah Goldstein
  2023-01-06 13:43       ` Adhemerval Zanella Netto
  0 siblings, 1 reply; 55+ messages in thread
From: Noah Goldstein @ 2023-01-05 23:52 UTC (permalink / raw)
  To: Adhemerval Zanella Netto; +Cc: Xi Ruoyao, libc-alpha

On Thu, Jan 5, 2023 at 1:56 PM Adhemerval Zanella Netto via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> Unfortunately no one worked on reviewing it.  It would be good to have
> it for 2.37, although I think it is too late.  However, since most
> architectures do use arch-specific routines, I think the possible disruption
> of using this patchset should be minimal.

I can start reviewing this. Not sure I can do all the arch headers, but I
can get up to 11/17.
>
> On 05/12/22 14:07, Xi Ruoyao wrote:
> > Hi,
> >
> > Any status update on this series?
> >
> > On Mon, 2022-09-19 at 16:59 -0300, Adhemerval Zanella via Libc-alpha
> > wrote:
> >> It is done by:
> >>
> >>   1. parametrizing the internal routines (for instance the find zero
> >>      in a word) so each architecture can reimplement them without the
> >>      need to rewrite the whole routine.
> >>
> >>   2. vectorizing more string implementations (for instance strcpy
> >>      and strcmp).
> >>
> >>   3. Changing some implementations to use already optimized ones
> >>      (for instance strnlen).  This lets new ports focus on providing
> >>      optimized implementations of only a handful of symbols (for
> >>      instance memchr) and makes each improvement benefit a larger
> >>      set of routines.
> >>
> >> For the rest of #5806, I think we can handle them later, and if the
> >> performance of the generic implementation is close enough, I think it
> >> is better to just remove the old assembly implementations.
> >>
> >> I also checked on x86_64-linux-gnu, i686-linux-gnu, powerpc-linux-gnu,
> >> and powerpc64-linux-gnu by removing the arch-specific assembly
> >> implementation and disabling multiarch (it covers both LE and BE
> >> for 64 and 32 bits). I also checked the string routines on alpha,
> >> hppa,
> >> and sh.
> >>
> >> Changes since v4:
> >>   * Removed __clz and __ctz in favor of count_leading_zero and
> >>     count_trailing_zeros from longlong.h.
> >>   * Use repeat_bytes more often.
> >>   * Added a comment on strcmp final_cmp on why index_first_zero_ne can
> >>     not be used.
> >>
> >> Changes since v3:
> >>   * Rebased against master.
> >>   * Dropped strcpy optimization.
> >>   * Refactor strcmp implementation.
> >>   * Some minor changes in comments.
> >>
> >> Changes since v2:
> >>   * Move string-fz{a,b,i} to its own patch.
> >>   * Add an inline implementation for __builtin_c{l,t}z to avoid using
> >>     compiler provided symbols.
> >>   * Add a new header, string-maskoff.h, to handle unaligned accesses
> >>     on some implementation.
> >>   * Fixed strcmp on LE machines.
> >>   * Added an unaligned strcpy variant for architectures that define
> >>     _STRING_ARCH_unaligned.
> >>   * Add SH string-fzb.h (which uses cmp/str instruction to find
> >>     a zero in word).
> >>
> >> Changes since v1:
> >>   * Marked ChangeLog entries with [BZ #5806], as appropriate.
> >>   * Reorganized the headers, so that armv6t2 and power6 need override
> >>     as little as possible to use their (integer) zero detection insns.
> >>   * Hopefully fixed all of the coding style issues.
> >>   * Adjusted the memrchr algorithm as discussed.
> >>   * Replaced the #ifdef STRRCHR etc that are used by the multiarch
> >>   * files.
> >>   * Tested on i386, i686, x86_64 (verified this is unused), ppc64,
> >>     ppc64le --with-cpu=power8 (to use power6 in multiarch), armv7,
> >>     aarch64, alpha (qemu) and hppa (qemu).
> >>
> >> Adhemerval Zanella (10):
> >>   Add string-maskoff.h generic header
> >>   Add string vectorized find and detection functions
> >>   string: Improve generic strlen
> >>   string: Improve generic strnlen
> >>   string: Improve generic strchr
> >>   string: Improve generic strchrnul
> >>   string: Improve generic strcmp
> >>   string: Improve generic memchr
> >>   string: Improve generic memrchr
> >>   sh: Add string-fzb.h
> >>
> >> Richard Henderson (7):
> >>   Parameterize op_t from memcopy.h
> >>   Parameterize OP_T_THRES from memcopy.h
> >>   hppa: Add memcopy.h
> >>   hppa: Add string-fzb.h and string-fzi.h
> >>   alpha: Add string-fzb.h and string-fzi.h
> >>   arm: Add string-fza.h
> >>   powerpc: Add string-fza.h
> >>
> >>  string/memchr.c                               | 168 ++++------------
> >>  string/memcmp.c                               |   4 -
> >>  string/memrchr.c                              | 189 +++--------------
> >> -
> >>  string/strchr.c                               | 172 +++-------------
> >>  string/strchrnul.c                            | 156 +++------------
> >>  string/strcmp.c                               | 119 +++++++++--
> >>  string/strlen.c                               |  90 ++-------
> >>  string/strnlen.c                              | 137 +------------
> >>  sysdeps/alpha/string-fzb.h                    |  51 +++++
> >>  sysdeps/alpha/string-fzi.h                    | 113 +++++++++++
> >>  sysdeps/arm/armv6t2/string-fza.h              |  70 +++++++
> >>  sysdeps/generic/memcopy.h                     |  10 +-
> >>  sysdeps/generic/string-extbyte.h              |  37 ++++
> >>  sysdeps/generic/string-fza.h                  | 106 ++++++++++
> >>  sysdeps/generic/string-fzb.h                  |  49 +++++
> >>  sysdeps/generic/string-fzi.h                  | 120 +++++++++++
> >>  sysdeps/generic/string-maskoff.h              |  73 +++++++
> >>  sysdeps/generic/string-opthr.h                |  25 +++
> >>  sysdeps/generic/string-optype.h               |  31 +++
> >>  sysdeps/hppa/memcopy.h                        |  42 ++++
> >>  sysdeps/hppa/string-fzb.h                     |  69 +++++++
> >>  sysdeps/hppa/string-fzi.h                     | 135 +++++++++++++
> >>  sysdeps/i386/i686/multiarch/strnlen-c.c       |  14 +-
> >>  sysdeps/i386/memcopy.h                        |   3 -
> >>  sysdeps/i386/string-opthr.h                   |  25 +++
> >>  sysdeps/m68k/memcopy.h                        |   3 -
> >>  sysdeps/powerpc/powerpc32/power4/memcopy.h    |   5 -
> >>  .../powerpc32/power4/multiarch/memchr-ppc32.c |  14 +-
> >>  .../power4/multiarch/strchrnul-ppc32.c        |   4 -
> >>  .../power4/multiarch/strnlen-ppc32.c          |  14 +-
> >>  .../powerpc64/multiarch/memchr-ppc64.c        |   9 +-
> >>  sysdeps/powerpc/string-fza.h                  |  70 +++++++
> >>  sysdeps/s390/strchr-c.c                       |  11 +-
> >>  sysdeps/s390/strchrnul-c.c                    |   2 -
> >>  sysdeps/s390/strlen-c.c                       |  10 +-
> >>  sysdeps/s390/strnlen-c.c                      |  14 +-
> >>  sysdeps/sh/string-fzb.h                       |  53 +++++
> >>  37 files changed, 1366 insertions(+), 851 deletions(-)
> >>  create mode 100644 sysdeps/alpha/string-fzb.h
> >>  create mode 100644 sysdeps/alpha/string-fzi.h
> >>  create mode 100644 sysdeps/arm/armv6t2/string-fza.h
> >>  create mode 100644 sysdeps/generic/string-extbyte.h
> >>  create mode 100644 sysdeps/generic/string-fza.h
> >>  create mode 100644 sysdeps/generic/string-fzb.h
> >>  create mode 100644 sysdeps/generic/string-fzi.h
> >>  create mode 100644 sysdeps/generic/string-maskoff.h
> >>  create mode 100644 sysdeps/generic/string-opthr.h
> >>  create mode 100644 sysdeps/generic/string-optype.h
> >>  create mode 100644 sysdeps/hppa/memcopy.h
> >>  create mode 100644 sysdeps/hppa/string-fzb.h
> >>  create mode 100644 sysdeps/hppa/string-fzi.h
> >>  create mode 100644 sysdeps/i386/string-opthr.h
> >>  create mode 100644 sysdeps/powerpc/string-fza.h
> >>  create mode 100644 sysdeps/sh/string-fzb.h
> >>
> >

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v5 00/17] Improve generic string routines
  2023-01-05 23:52     ` Noah Goldstein
@ 2023-01-06 13:43       ` Adhemerval Zanella Netto
  0 siblings, 0 replies; 55+ messages in thread
From: Adhemerval Zanella Netto @ 2023-01-06 13:43 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: Xi Ruoyao, libc-alpha



On 05/01/23 20:52, Noah Goldstein wrote:
> On Thu, Jan 5, 2023 at 1:56 PM Adhemerval Zanella Netto via Libc-alpha
> <libc-alpha@sourceware.org> wrote:
>>
>> Unfortunately no one worked on reviewing it.  It would be good to have
>> it for 2.37, although I think it is too late.  However, since most
>> architectures do use arch-specific routines, I think the possible disruption
>> of using this patchset should be minimal.
> 
> I can start reviewing this. Not sure I can do all the arch headers but
> can get up
> to 11/17.

Thanks, I will try to follow up on the reviews to get this sorted out for next week.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v5 03/17] Add string-maskoff.h generic header
  2023-01-05 22:49   ` Noah Goldstein
  2023-01-05 23:26     ` Alejandro Colomar
@ 2023-01-09 18:02     ` Adhemerval Zanella Netto
  1 sibling, 0 replies; 55+ messages in thread
From: Adhemerval Zanella Netto @ 2023-01-09 18:02 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: libc-alpha



On 05/01/23 19:49, Noah Goldstein wrote:
> On Mon, Sep 19, 2022 at 12:59 PM Adhemerval Zanella via Libc-alpha
> <libc-alpha@sourceware.org> wrote:
>>
>> Macros to operate on unaligned access for string operations:
>>
>>   - create_mask: create a mask based on pointer alignment to sets up
>>     non-zero bytes before the beginning of the word so a following
>>     operation (such as find zero) might ignore these bytes.
>>
>>   - highbit_mask: create a mask with high bit of each byte being 1,
>>     and the low 7 bits being all the opposite of the input.
>>
>> These macros are meant to be used on optimized vectorized string
>> implementations.
>> ---
>>  sysdeps/generic/string-maskoff.h | 73 ++++++++++++++++++++++++++++++++
>>  1 file changed, 73 insertions(+)
>>  create mode 100644 sysdeps/generic/string-maskoff.h
>>
>> diff --git a/sysdeps/generic/string-maskoff.h b/sysdeps/generic/string-maskoff.h
>> new file mode 100644
>> index 0000000000..831647bda6
>> --- /dev/null
>> +++ b/sysdeps/generic/string-maskoff.h
>> @@ -0,0 +1,73 @@
>> +/* Mask off bits.  Generic C version.
>> +   Copyright (C) 2022 Free Software Foundation, Inc.
>> +   This file is part of the GNU C Library.
>> +
>> +   The GNU C Library is free software; you can redistribute it and/or
>> +   modify it under the terms of the GNU Lesser General Public
>> +   License as published by the Free Software Foundation; either
>> +   version 2.1 of the License, or (at your option) any later version.
>> +
>> +   The GNU C Library is distributed in the hope that it will be useful,
>> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
>> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>> +   Lesser General Public License for more details.
>> +
>> +   You should have received a copy of the GNU Lesser General Public
>> +   License along with the GNU C Library; if not, see
>> +   <http://www.gnu.org/licenses/>.  */
>> +
>> +#ifndef _STRING_MASKOFF_H
>> +#define _STRING_MASKOFF_H 1
>> +
>> +#include <endian.h>
>> +#include <limits.h>
>> +#include <stdint.h>
>> +#include <string-optype.h>
>> +
>> +/* Provide a mask based on the pointer alignment that sets up non-zero
>> +   bytes before the beginning of the word.  It is used to mask off
>> +   undesirable bits from an aligned read from an unaligned pointer.
>> +   For instance, on a 64 bits machine with a pointer alignment of
>> +   3 the function returns 0x0000000000ffffff for LE and 0xffffff0000000000
>> +   (meaning to mask off the initial 3 bytes).  */
>> +static inline op_t
>> +create_mask (uintptr_t i)
>> +{
>> +  i = i % sizeof (op_t);
>> +  if (__BYTE_ORDER == __LITTLE_ENDIAN)
>> +    return ~(((op_t)-1) << (i * CHAR_BIT));
>> +  else
>> +    return ~(((op_t)-1) >> (i * CHAR_BIT));
>> +}
>> +
>> +/* Setup an word with each byte being c_in.  For instance, on a 64 bits
>> +   machine with input as 0xce the functions returns 0xcececececececece.  */
>> +static inline op_t
>> +repeat_bytes (unsigned char c_in)
>> +{
>> +  return ((op_t)-1 / 0xff) * c_in;
>> +}
>> +
>> +/* Based on mask created by 'create_mask', mask off the high bit of each
>> +   byte in the mask.  It is used to mask off undesirable bits from an
>> +   aligned read from an unaligned pointer, and also taking care to avoid
>> +   match possible bytes meant to be matched.  For instance, on a 64 bits
>> +   machine with a mask created from a pointer with an alignment of 3
>> +   (0x0000000000ffffff) the function returns 0x7f7f7f0000000000 for BE
>> +   and 0x00000000007f7f7f for LE.  */
>> +static inline op_t
>> +highbit_mask (op_t m)
>> +{
>> +  return m & repeat_bytes (0x7f);
>> +}
>> +
>> +/* Return the address of the op_t word containing the address P.  For
>> +   instance on address 0x0011223344556677 and op_t with size of 8,
>> +   it returns 0x0011223344556670.  */
>> +static inline op_t *
>> +word_containing (char const *p)
>> +{
>> +  return (op_t *) (p - (uintptr_t) p % sizeof (op_t));
> 
> This can just be (p & (-sizeof(p)) I think.

Indeed it is simpler.

> Other than that look goods.

Thanks.


* Re: [PATCH v5 03/17] Add string-maskoff.h generic header
  2023-01-05 23:26     ` Alejandro Colomar
@ 2023-01-09 18:19       ` Adhemerval Zanella Netto
  0 siblings, 0 replies; 55+ messages in thread
From: Adhemerval Zanella Netto @ 2023-01-09 18:19 UTC (permalink / raw)
  To: Alejandro Colomar, Noah Goldstein; +Cc: libc-alpha



On 05/01/23 20:26, Alejandro Colomar wrote:
> 
> 
> On 1/5/23 23:49, Noah Goldstein via Libc-alpha wrote:
>> On Mon, Sep 19, 2022 at 12:59 PM Adhemerval Zanella via Libc-alpha
>> <libc-alpha@sourceware.org> wrote:
>>>
>>> Macros to operate on unaligned access for string operations:
>>>
>>>    - create_mask: create a mask based on pointer alignment to sets up
>>>      non-zero bytes before the beginning of the word so a following
>>>      operation (such as find zero) might ignore these bytes.
>>>
>>>    - highbit_mask: create a mask with high bit of each byte being 1,
>>>      and the low 7 bits being all the opposite of the input.
>>>
>>> These macros are meant to be used on optimized vectorized string
>>> implementations.
>>> ---
>>>   sysdeps/generic/string-maskoff.h | 73 ++++++++++++++++++++++++++++++++
>>>   1 file changed, 73 insertions(+)
>>>   create mode 100644 sysdeps/generic/string-maskoff.h
>>>
>>> diff --git a/sysdeps/generic/string-maskoff.h b/sysdeps/generic/string-maskoff.h
>>> new file mode 100644
>>> index 0000000000..831647bda6
>>> --- /dev/null
>>> +++ b/sysdeps/generic/string-maskoff.h
>>> @@ -0,0 +1,73 @@
>>> +/* Mask off bits.  Generic C version.
>>> +   Copyright (C) 2022 Free Software Foundation, Inc.
>>> +   This file is part of the GNU C Library.
>>> +
>>> +   The GNU C Library is free software; you can redistribute it and/or
>>> +   modify it under the terms of the GNU Lesser General Public
>>> +   License as published by the Free Software Foundation; either
>>> +   version 2.1 of the License, or (at your option) any later version.
>>> +
>>> +   The GNU C Library is distributed in the hope that it will be useful,
>>> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
>>> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>>> +   Lesser General Public License for more details.
>>> +
>>> +   You should have received a copy of the GNU Lesser General Public
>>> +   License along with the GNU C Library; if not, see
>>> +   <http://www.gnu.org/licenses/>.  */
>>> +
>>> +#ifndef _STRING_MASKOFF_H
>>> +#define _STRING_MASKOFF_H 1
>>> +
>>> +#include <endian.h>
>>> +#include <limits.h>
>>> +#include <stdint.h>
>>> +#include <string-optype.h>
>>> +
>>> +/* Provide a mask based on the pointer alignment that sets up non-zero
>>> +   bytes before the beginning of the word.  It is used to mask off
>>> +   undesirable bits from an aligned read from an unaligned pointer.
>>> +   For instance, on a 64 bits machine with a pointer alignment of
>>> +   3 the function returns 0x0000000000ffffff for LE and 0xffffff0000000000
>>> +   (meaning to mask off the initial 3 bytes).  */
>>> +static inline op_t
>>> +create_mask (uintptr_t i)
>>> +{
>>> +  i = i % sizeof (op_t);
>>> +  if (__BYTE_ORDER == __LITTLE_ENDIAN)
>>> +    return ~(((op_t)-1) << (i * CHAR_BIT));
>>> +  else
>>> +    return ~(((op_t)-1) >> (i * CHAR_BIT));
>>> +}
>>> +
>>> +/* Setup an word with each byte being c_in.  For instance, on a 64 bits
>>> +   machine with input as 0xce the functions returns 0xcececececececece.  */
>>> +static inline op_t
> 
> Hi Adhemerval and Noah,
> 
> I don't know what is the minimum C version for compiling glibc, but if you can ignore C89, I would propose something:
> 
> 'static inline' should be restricted to .c files, since if the compiler decides to not inline and you have it in a header, you end up with multiple static definitions for the same code.
> 
> In headers, I use C99 inline, which doesn't emit any object code when the compiler decides to not inline.  Then in a .c file, you add a prototype using 'extern inline', and the compiler will emit code there, exactly once.
> 
> Even if you have to support C89, I'd use [[gnu::always_inline]] together with 'static inline', to make sure that the compiler doesn't do nefarious stuff.

Although we build glibc with -std=gnu11, we also use -fgnu89-inline; but regardless of
the inline mode, 'multiple definitions' are usually not an issue.  We do use
__always_inline in performance-critical code, so it should be fine to use it here as well
(although I would expect the compiler to always inline these short functions).


* Re: [PATCH v5 04/17] Add string vectorized find and detection functions
  2023-01-05 22:53   ` Noah Goldstein
@ 2023-01-09 18:51     ` Adhemerval Zanella Netto
  0 siblings, 0 replies; 55+ messages in thread
From: Adhemerval Zanella Netto @ 2023-01-09 18:51 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: libc-alpha, Richard Henderson



On 05/01/23 19:53, Noah Goldstein wrote:
> On Mon, Sep 19, 2022 at 1:02 PM Adhemerval Zanella via Libc-alpha
> <libc-alpha@sourceware.org> wrote:

>> +/* This function returns at least one bit set within every byte of X that
>> +   is zero.  The result is exact in that, unlike find_zero_low, all bytes
>> +   are determinate.  This is usually used for finding the index of the
>> +   most significant byte that was zero.  */
>> +static inline op_t
>> +find_zero_all (op_t x)
>> +{
>> +  /* For each byte, find not-zero by
>> +     (0) And 0x7f so that we cannot carry between bytes,
>> +     (1) Add 0x7f so that non-zero carries into 0x80,
>> +     (2) Or in the original byte (which might have had 0x80 set).
>> +     Then invert and mask such that 0x80 is set iff that byte was zero.  */
>> +  op_t m = ((op_t)-1 / 0xff) * 0x7f;
> 
> Use repeat_byte here?

Ack.


* Re: [PATCH v5 04/17] Add string vectorized find and detection functions
  2023-01-05 23:04   ` Noah Goldstein
@ 2023-01-09 19:34     ` Adhemerval Zanella Netto
  0 siblings, 0 replies; 55+ messages in thread
From: Adhemerval Zanella Netto @ 2023-01-09 19:34 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: libc-alpha, Richard Henderson



On 05/01/23 20:04, Noah Goldstein wrote:
> On Mon, Sep 19, 2022 at 1:02 PM Adhemerval Zanella via Libc-alpha

>> +/* With similar caveats, identify zero bytes in X1 and bytes that are
>> +   not equal between in X1 and X2.  */
>> +static inline op_t
>> +find_zero_ne_low (op_t x1, op_t x2)
>> +{
>> +  op_t m = repeat_bytes (0x7f);
>> +  op_t eq = x1 ^ x2;
>> +  op_t nz1 = (x1 + m) | x1;    /* msb set if byte not zero.  */
>> +  op_t ne2 = (eq + m) | eq;    /* msb set if byte not equal.  */
>> +  return (ne2 | ~nz1) & ~m;    /* msb set if x1 zero or x2 not equal.  */
>> +}
> Cant this just be `(~find_zero_eq_low(x1, x2)) + 1` (seems to get
> better codegen)?

I think we can; I will change it in the next revision.



* Re: [PATCH v5 07/17] string: Improve generic strchr
  2023-01-05 23:19     ` Noah Goldstein
@ 2023-01-09 19:39       ` Adhemerval Zanella Netto
  0 siblings, 0 replies; 55+ messages in thread
From: Adhemerval Zanella Netto @ 2023-01-09 19:39 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: libc-alpha, Richard Henderson



On 05/01/23 20:19, Noah Goldstein wrote:
> On Thu, Jan 5, 2023 at 3:09 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>>
>> On Mon, Sep 19, 2022 at 1:01 PM Adhemerval Zanella via Libc-alpha
>> <libc-alpha@sourceware.org> wrote:
>>>
>>> New algorithm have the following key differences:
>>>
>>>   - Reads first word unaligned and use string-maskoff function to
>>>     remove unwanted data.  This strategy follow arch-specific
>>>     optimization used on aarch64 and powerpc.
>>>
>>>   - Use string-fz{b,i} and string-extbyte function.
>>>
>>> Checked on x86_64-linux-gnu, i686-linux-gnu, powerpc-linux-gnu,
>>> and powerpc64-linux-gnu by removing the arch-specific assembly
>>> implementation and disabling multi-arch (it covers both LE and BE
>>> for 64 and 32 bits).
>>>
>>> Co-authored-by: Richard Henderson  <rth@twiddle.net>
>>> ---
>>>  string/strchr.c         | 172 +++++++---------------------------------
>>>  sysdeps/s390/strchr-c.c |  11 +--
>>>  2 files changed, 34 insertions(+), 149 deletions(-)
>>>
>>> diff --git a/string/strchr.c b/string/strchr.c
>>> index bfd0c4e4bc..6bbee7f79d 100644
>>> --- a/string/strchr.c
>>> +++ b/string/strchr.c
>>> @@ -22,164 +22,48 @@
>>>
>>>  #include <string.h>
>>>  #include <stdlib.h>
>>> +#include <stdint.h>
>>> +#include <string-fza.h>
>>> +#include <string-fzb.h>
>>> +#include <string-fzi.h>
>>> +#include <string-extbyte.h>
>>> +#include <string-maskoff.h>
>>>
>>>  #undef strchr
>>> +#undef index
>>>
>>> -#ifndef STRCHR
>>> -# define STRCHR strchr
>>> +#ifdef STRCHR
>>> +# define strchr STRCHR
>>>  #endif
>>>
>>>  /* Find the first occurrence of C in S.  */
>>>  char *
>>> -STRCHR (const char *s, int c_in)
>>> +strchr (const char *s, int c_in)
>>>  {
>>> -  const unsigned char *char_ptr;
>>> -  const unsigned long int *longword_ptr;
>>> -  unsigned long int longword, magic_bits, charmask;
>>> -  unsigned char c;
>>> -
>>> -  c = (unsigned char) c_in;
>>> -
>>> -  /* Handle the first few characters by reading one character at a time.
>>> -     Do this until CHAR_PTR is aligned on a longword boundary.  */
>>> -  for (char_ptr = (const unsigned char *) s;
>>> -       ((unsigned long int) char_ptr & (sizeof (longword) - 1)) != 0;
>>> -       ++char_ptr)
>>> -    if (*char_ptr == c)
>>> -      return (void *) char_ptr;
>>> -    else if (*char_ptr == '\0')
>>> -      return NULL;
>>> -
>>> -  /* All these elucidatory comments refer to 4-byte longwords,
>>> -     but the theory applies equally well to 8-byte longwords.  */
>>> -
>>> -  longword_ptr = (unsigned long int *) char_ptr;
>>> -
>>> -  /* Bits 31, 24, 16, and 8 of this number are zero.  Call these bits
>>> -     the "holes."  Note that there is a hole just to the left of
>>> -     each byte, with an extra at the end:
>>> -
>>> -     bits:  01111110 11111110 11111110 11111111
>>> -     bytes: AAAAAAAA BBBBBBBB CCCCCCCC DDDDDDDD
>>> -
>>> -     The 1-bits make sure that carries propagate to the next 0-bit.
>>> -     The 0-bits provide holes for carries to fall into.  */
>>> -  magic_bits = -1;
>>> -  magic_bits = magic_bits / 0xff * 0xfe << 1 >> 1 | 1;
>>> -
>>> -  /* Set up a longword, each of whose bytes is C.  */
>>> -  charmask = c | (c << 8);
>>> -  charmask |= charmask << 16;
>>> -  if (sizeof (longword) > 4)
>>> -    /* Do the shift in two steps to avoid a warning if long has 32 bits.  */
>>> -    charmask |= (charmask << 16) << 16;
>>> -  if (sizeof (longword) > 8)
>>> -    abort ();
>>> -
>>> -  /* Instead of the traditional loop which tests each character,
>>> -     we will test a longword at a time.  The tricky part is testing
>>> -     if *any of the four* bytes in the longword in question are zero.  */
>>> -  for (;;)
>>> -    {
>>> -      /* We tentatively exit the loop if adding MAGIC_BITS to
>>> -        LONGWORD fails to change any of the hole bits of LONGWORD.
>>> -
>>> -        1) Is this safe?  Will it catch all the zero bytes?
>>> -        Suppose there is a byte with all zeros.  Any carry bits
>>> -        propagating from its left will fall into the hole at its
>>> -        least significant bit and stop.  Since there will be no
>>> -        carry from its most significant bit, the LSB of the
>>> -        byte to the left will be unchanged, and the zero will be
>>> -        detected.
>>> +  /* Set up a word, each of whose bytes is C.  */
>>> +  unsigned char c = (unsigned char) c_in;
>>> +  op_t repeated_c = repeat_bytes (c_in);
>>>
>>> -        2) Is this worthwhile?  Will it ignore everything except
>>> -        zero bytes?  Suppose every byte of LONGWORD has a bit set
>>> -        somewhere.  There will be a carry into bit 8.  If bit 8
>>> -        is set, this will carry into bit 16.  If bit 8 is clear,
>>> -        one of bits 9-15 must be set, so there will be a carry
>>> -        into bit 16.  Similarly, there will be a carry into bit
>>> -        24.  If one of bits 24-30 is set, there will be a carry
>>> -        into bit 31, so all of the hole bits will be changed.
>>> +  /* Align the input address to op_t.  */
>>> +  uintptr_t s_int = (uintptr_t) s;
>>> +  const op_t *word_ptr = word_containing (s);
>>>
>>> -        The one misfire occurs when bits 24-30 are clear and bit
>>> -        31 is set; in this case, the hole at bit 31 is not
>>> -        changed.  If we had access to the processor carry flag,
>>> -        we could close this loophole by putting the fourth hole
>>> -        at bit 32!
>>> +  /* Read the first aligned word, but force bytes before the string to
>>> +     match neither zero nor goal (we make sure the high bit of each byte
>>> +     is 1, and the low 7 bits are all the opposite of the goal byte).  */
>>> +  op_t bmask = create_mask (s_int);
>>> +  op_t word = (*word_ptr | bmask) ^ (repeated_c & highbit_mask (bmask));
>>>
>>> -        So it ignores everything except 128's, when they're aligned
>>> -        properly.
>>> +  while (! has_zero_eq (word, repeated_c))
>>> +    word = *++word_ptr;
>>>
>>> -        3) But wait!  Aren't we looking for C as well as zero?
>>> -        Good point.  So what we do is XOR LONGWORD with a longword,
>>> -        each of whose bytes is C.  This turns each byte that is C
>>> -        into a zero.  */
>>> -
>>> -      longword = *longword_ptr++;
>>> -
>>> -      /* Add MAGIC_BITS to LONGWORD.  */
>>> -      if ((((longword + magic_bits)
>>> -
>>> -           /* Set those bits that were unchanged by the addition.  */
>>> -           ^ ~longword)
>>> -
>>> -          /* Look at only the hole bits.  If any of the hole bits
>>> -             are unchanged, most likely one of the bytes was a
>>> -             zero.  */
>>> -          & ~magic_bits) != 0
>>> -
>>> -         /* That caught zeroes.  Now test for C.  */
>>> -         || ((((longword ^ charmask) + magic_bits) ^ ~(longword ^ charmask))
>>> -             & ~magic_bits) != 0)
>>> -       {
>>> -         /* Which of the bytes was C or zero?
>>> -            If none of them were, it was a misfire; continue the search.  */
>>> -
>>> -         const unsigned char *cp = (const unsigned char *) (longword_ptr - 1);
>>> -
>>> -         if (*cp == c)
>>> -           return (char *) cp;
>>> -         else if (*cp == '\0')
>>> -           return NULL;
>>> -         if (*++cp == c)
>>> -           return (char *) cp;
>>> -         else if (*cp == '\0')
>>> -           return NULL;
>>> -         if (*++cp == c)
>>> -           return (char *) cp;
>>> -         else if (*cp == '\0')
>>> -           return NULL;
>>> -         if (*++cp == c)
>>> -           return (char *) cp;
>>> -         else if (*cp == '\0')
>>> -           return NULL;
>>> -         if (sizeof (longword) > 4)
>>> -           {
>>> -             if (*++cp == c)
>>> -               return (char *) cp;
>>> -             else if (*cp == '\0')
>>> -               return NULL;
>>> -             if (*++cp == c)
>>> -               return (char *) cp;
>>> -             else if (*cp == '\0')
>>> -               return NULL;
>>> -             if (*++cp == c)
>>> -               return (char *) cp;
>>> -             else if (*cp == '\0')
>>> -               return NULL;
>>> -             if (*++cp == c)
>>> -               return (char *) cp;
>>> -             else if (*cp == '\0')
>>> -               return NULL;
>>> -           }
>>> -       }
>>> -    }
>>> +  op_t found = index_first_zero_eq (word, repeated_c);
>>>
>>> +  if (extractbyte (word, found) == c)
>>> +    return (char *) (word_ptr) + found;
>>>    return NULL;
>>>  }
>>> -
>>> -#ifdef weak_alias
>>> -# undef index
>>> +#ifndef STRCHR
>>>  weak_alias (strchr, index)
>>> -#endif
>>>  libc_hidden_builtin_def (strchr)
>>> +#endif
>>> diff --git a/sysdeps/s390/strchr-c.c b/sysdeps/s390/strchr-c.c
>>> index 4ac3a62fba..a5a1781b1c 100644
>>> --- a/sysdeps/s390/strchr-c.c
>>> +++ b/sysdeps/s390/strchr-c.c
>>> @@ -21,13 +21,14 @@
>>>  #if HAVE_STRCHR_C
>>>  # if HAVE_STRCHR_IFUNC
>>>  #  define STRCHR STRCHR_C
>>> -#  undef weak_alias
>>> +# endif
>>> +
>>> +# include <string/strchr.c>
>>> +
>>> +# if HAVE_STRCHR_IFUNC
>>>  #  if defined SHARED && IS_IN (libc)
>>> -#   undef libc_hidden_builtin_def
>>> -#   define libc_hidden_builtin_def(name)                       \
>>> -     __hidden_ver1 (__strchr_c, __GI_strchr, __strchr_c);
>>> +__hidden_ver1 (__strchr_c, __GI_strchr, __strchr_c);
>>>  #  endif
>>>  # endif
>>>
>>> -# include <string/strchr.c>
>>>  #endif
>>> --
>>> 2.34.1
>>>
>>
>> Can this just be implemented as:
>>
>> char * r = strchrnul(p, c);
>> return *r ? r : NULL;
> Thats wrong, should be: `return (*r == c) ? r : NULL;`
>>
>> then only have strchrnul impl to worry about?

Yes, although I think strchr is a more heavily used symbol than strchrnul.  However,
we can optimize it later by adding a __strchrnul_inline and expanding it
in both strchr and strchrnul.  I will change it to use strchrnul as you suggested.


* Re: [PATCH v5 08/17] string: Improve generic strchrnul
  2023-01-05 23:17   ` Noah Goldstein
@ 2023-01-09 20:35     ` Adhemerval Zanella Netto
  2023-01-09 20:49       ` Richard Henderson
                         ` (2 more replies)
  0 siblings, 3 replies; 55+ messages in thread
From: Adhemerval Zanella Netto @ 2023-01-09 20:35 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: libc-alpha, Richard Henderson



On 05/01/23 20:17, Noah Goldstein wrote:
> On Mon, Sep 19, 2022 at 1:04 PM Adhemerval Zanella via Libc-alpha
> <libc-alpha@sourceware.org> wrote:
>>
>> New algorithm have the following key differences:
>>
>>   - Reads first word unaligned and use string-maskoff function to
>>     remove unwanted data.  This strategy follow arch-specific
>>     optimization used on aarch64 and powerpc.
>>
>>   - Use string-fz{b,i} functions.
>>
>> Checked on x86_64-linux-gnu, i686-linux-gnu, powerpc64-linux-gnu,
>> and powerpc-linux-gnu by removing the arch-specific assembly
>> implementation and disabling multi-arch (it covers both LE and BE
>> for 64 and 32 bits).
>>
>> Co-authored-by: Richard Henderson  <rth@twiddle.net>
>> ---
>>  string/strchrnul.c                            | 156 +++---------------
>>  .../power4/multiarch/strchrnul-ppc32.c        |   4 -
>>  sysdeps/s390/strchrnul-c.c                    |   2 -
>>  3 files changed, 24 insertions(+), 138 deletions(-)
>>
>> diff --git a/string/strchrnul.c b/string/strchrnul.c
>> index 0cc1fc6bb0..67defa3dab 100644
>> --- a/string/strchrnul.c
>> +++ b/string/strchrnul.c
>> @@ -1,10 +1,5 @@
>>  /* Copyright (C) 1991-2022 Free Software Foundation, Inc.
>>     This file is part of the GNU C Library.
>> -   Based on strlen implementation by Torbjorn Granlund (tege@sics.se),
>> -   with help from Dan Sahlin (dan@sics.se) and
>> -   bug fix and commentary by Jim Blandy (jimb@ai.mit.edu);
>> -   adaptation to strchr suggested by Dick Karpinski (dick@cca.ucsf.edu),
>> -   and implemented by Roland McGrath (roland@ai.mit.edu).
>>
>>     The GNU C Library is free software; you can redistribute it and/or
>>     modify it under the terms of the GNU Lesser General Public
>> @@ -21,146 +16,43 @@
>>     <https://www.gnu.org/licenses/>.  */
>>
>>  #include <string.h>
>> -#include <memcopy.h>
>>  #include <stdlib.h>
>> +#include <stdint.h>
>> +#include <string-fza.h>
>> +#include <string-fzb.h>
>> +#include <string-fzi.h>
>> +#include <string-maskoff.h>
>>
>>  #undef __strchrnul
>>  #undef strchrnul
>>
>> -#ifndef STRCHRNUL
>> -# define STRCHRNUL __strchrnul
>> +#ifdef STRCHRNUL
>> +# define __strchrnul STRCHRNUL
>>  #endif
>>
>>  /* Find the first occurrence of C in S or the final NUL byte.  */
>>  char *
>> -STRCHRNUL (const char *s, int c_in)
>> +__strchrnul (const char *str, int c_in)
>>  {
>> -  const unsigned char *char_ptr;
>> -  const unsigned long int *longword_ptr;
>> -  unsigned long int longword, magic_bits, charmask;
>> -  unsigned char c;
>> -
>> -  c = (unsigned char) c_in;
>> -
>> -  /* Handle the first few characters by reading one character at a time.
>> -     Do this until CHAR_PTR is aligned on a longword boundary.  */
>> -  for (char_ptr = (const unsigned char *) s;
>> -       ((unsigned long int) char_ptr & (sizeof (longword) - 1)) != 0;
>> -       ++char_ptr)
>> -    if (*char_ptr == c || *char_ptr == '\0')
>> -      return (void *) char_ptr;
>> -
>> -  /* All these elucidatory comments refer to 4-byte longwords,
>> -     but the theory applies equally well to 8-byte longwords.  */
>> -
>> -  longword_ptr = (unsigned long int *) char_ptr;
>> -
>> -  /* Bits 31, 24, 16, and 8 of this number are zero.  Call these bits
>> -     the "holes."  Note that there is a hole just to the left of
>> -     each byte, with an extra at the end:
>> -
>> -     bits:  01111110 11111110 11111110 11111111
>> -     bytes: AAAAAAAA BBBBBBBB CCCCCCCC DDDDDDDD
>> -
>> -     The 1-bits make sure that carries propagate to the next 0-bit.
>> -     The 0-bits provide holes for carries to fall into.  */
>> -  magic_bits = -1;
>> -  magic_bits = magic_bits / 0xff * 0xfe << 1 >> 1 | 1;
>> -
>> -  /* Set up a longword, each of whose bytes is C.  */
>> -  charmask = c | (c << 8);
>> -  charmask |= charmask << 16;
>> -  if (sizeof (longword) > 4)
>> -    /* Do the shift in two steps to avoid a warning if long has 32 bits.  */
>> -    charmask |= (charmask << 16) << 16;
>> -  if (sizeof (longword) > 8)
>> -    abort ();
>> -
>> -  /* Instead of the traditional loop which tests each character,
>> -     we will test a longword at a time.  The tricky part is testing
>> -     if *any of the four* bytes in the longword in question are zero.  */
>> -  for (;;)
>> -    {
>> -      /* We tentatively exit the loop if adding MAGIC_BITS to
>> -        LONGWORD fails to change any of the hole bits of LONGWORD.
>> -
>> -        1) Is this safe?  Will it catch all the zero bytes?
>> -        Suppose there is a byte with all zeros.  Any carry bits
>> -        propagating from its left will fall into the hole at its
>> -        least significant bit and stop.  Since there will be no
>> -        carry from its most significant bit, the LSB of the
>> -        byte to the left will be unchanged, and the zero will be
>> -        detected.
>> +  /* Set up a word, each of whose bytes is C.  */
>> +  op_t repeated_c = repeat_bytes (c_in);
>>
>> -        2) Is this worthwhile?  Will it ignore everything except
>> -        zero bytes?  Suppose every byte of LONGWORD has a bit set
>> -        somewhere.  There will be a carry into bit 8.  If bit 8
>> -        is set, this will carry into bit 16.  If bit 8 is clear,
>> -        one of bits 9-15 must be set, so there will be a carry
>> -        into bit 16.  Similarly, there will be a carry into bit
>> -        24.  If one of bits 24-30 is set, there will be a carry
>> -        into bit 31, so all of the hole bits will be changed.
>> +  /* Align the input address to op_t.  */
>> +  uintptr_t s_int = (uintptr_t) str;
>> +  const op_t *word_ptr = word_containing (str);
>>
>> -        The one misfire occurs when bits 24-30 are clear and bit
>> -        31 is set; in this case, the hole at bit 31 is not
>> -        changed.  If we had access to the processor carry flag,
>> -        we could close this loophole by putting the fourth hole
>> -        at bit 32!
>> +  /* Read the first aligned word, but force bytes before the string to
>> +     match neither zero nor goal (we make sure the high bit of each byte
>> +     is 1, and the low 7 bits are all the opposite of the goal byte).  */
>> +  op_t bmask = create_mask (s_int);
>> +  op_t word = (*word_ptr | bmask) ^ (repeated_c & highbit_mask (bmask));
> 
> Think much clearer (and probably better codegen) is:
> find_zero_eq_low/all(word, repeated) >> (s_int * CHAR_BIT)

It does not seem to work, at least not when replacing the two lines with:

  op_t word = find_zero_eq_all/low (*word_ptr, repeated_c) >> (s_int * CHAR_BIT);

The loader itself cannot load anything (which means strchr is failing
somewhere).


* Re: [PATCH v5 08/17] string: Improve generic strchrnul
  2023-01-09 20:35     ` Adhemerval Zanella Netto
@ 2023-01-09 20:49       ` Richard Henderson
  2023-01-09 20:59       ` Noah Goldstein
  2023-01-09 23:33       ` Richard Henderson
  2 siblings, 0 replies; 55+ messages in thread
From: Richard Henderson @ 2023-01-09 20:49 UTC (permalink / raw)
  To: Adhemerval Zanella Netto, Noah Goldstein; +Cc: libc-alpha

On 1/9/23 12:35, Adhemerval Zanella Netto wrote:
> 
> 
> On 05/01/23 20:17, Noah Goldstein wrote:
>> On Mon, Sep 19, 2022 at 1:04 PM Adhemerval Zanella via Libc-alpha
>> <libc-alpha@sourceware.org> wrote:
>>>
>>> New algorithm have the following key differences:
>>>
>>>    - Reads first word unaligned and use string-maskoff function to
>>>      remove unwanted data.  This strategy follow arch-specific
>>>      optimization used on aarch64 and powerpc.
>>>
>>>    - Use string-fz{b,i} functions.
>>>
>>> Checked on x86_64-linux-gnu, i686-linux-gnu, powerpc64-linux-gnu,
>>> and powerpc-linux-gnu by removing the arch-specific assembly
>>> implementation and disabling multi-arch (it covers both LE and BE
>>> for 64 and 32 bits).
>>>
>>> Co-authored-by: Richard Henderson  <rth@twiddle.net>
>>> ---
>>>   string/strchrnul.c                            | 156 +++---------------
>>>   .../power4/multiarch/strchrnul-ppc32.c        |   4 -
>>>   sysdeps/s390/strchrnul-c.c                    |   2 -
>>>   3 files changed, 24 insertions(+), 138 deletions(-)
>>>
>>> diff --git a/string/strchrnul.c b/string/strchrnul.c
>>> index 0cc1fc6bb0..67defa3dab 100644
>>> --- a/string/strchrnul.c
>>> +++ b/string/strchrnul.c
>>> @@ -1,10 +1,5 @@
>>>   /* Copyright (C) 1991-2022 Free Software Foundation, Inc.
>>>      This file is part of the GNU C Library.
>>> -   Based on strlen implementation by Torbjorn Granlund (tege@sics.se),
>>> -   with help from Dan Sahlin (dan@sics.se) and
>>> -   bug fix and commentary by Jim Blandy (jimb@ai.mit.edu);
>>> -   adaptation to strchr suggested by Dick Karpinski (dick@cca.ucsf.edu),
>>> -   and implemented by Roland McGrath (roland@ai.mit.edu).
>>>
>>>      The GNU C Library is free software; you can redistribute it and/or
>>>      modify it under the terms of the GNU Lesser General Public
>>> @@ -21,146 +16,43 @@
>>>      <https://www.gnu.org/licenses/>.  */
>>>
>>>   #include <string.h>
>>> -#include <memcopy.h>
>>>   #include <stdlib.h>
>>> +#include <stdint.h>
>>> +#include <string-fza.h>
>>> +#include <string-fzb.h>
>>> +#include <string-fzi.h>
>>> +#include <string-maskoff.h>
>>>
>>>   #undef __strchrnul
>>>   #undef strchrnul
>>>
>>> -#ifndef STRCHRNUL
>>> -# define STRCHRNUL __strchrnul
>>> +#ifdef STRCHRNUL
>>> +# define __strchrnul STRCHRNUL
>>>   #endif
>>>
>>>   /* Find the first occurrence of C in S or the final NUL byte.  */
>>>   char *
>>> -STRCHRNUL (const char *s, int c_in)
>>> +__strchrnul (const char *str, int c_in)
>>>   {
>>> -  const unsigned char *char_ptr;
>>> -  const unsigned long int *longword_ptr;
>>> -  unsigned long int longword, magic_bits, charmask;
>>> -  unsigned char c;
>>> -
>>> -  c = (unsigned char) c_in;
>>> -
>>> -  /* Handle the first few characters by reading one character at a time.
>>> -     Do this until CHAR_PTR is aligned on a longword boundary.  */
>>> -  for (char_ptr = (const unsigned char *) s;
>>> -       ((unsigned long int) char_ptr & (sizeof (longword) - 1)) != 0;
>>> -       ++char_ptr)
>>> -    if (*char_ptr == c || *char_ptr == '\0')
>>> -      return (void *) char_ptr;
>>> -
>>> -  /* All these elucidatory comments refer to 4-byte longwords,
>>> -     but the theory applies equally well to 8-byte longwords.  */
>>> -
>>> -  longword_ptr = (unsigned long int *) char_ptr;
>>> -
>>> -  /* Bits 31, 24, 16, and 8 of this number are zero.  Call these bits
>>> -     the "holes."  Note that there is a hole just to the left of
>>> -     each byte, with an extra at the end:
>>> -
>>> -     bits:  01111110 11111110 11111110 11111111
>>> -     bytes: AAAAAAAA BBBBBBBB CCCCCCCC DDDDDDDD
>>> -
>>> -     The 1-bits make sure that carries propagate to the next 0-bit.
>>> -     The 0-bits provide holes for carries to fall into.  */
>>> -  magic_bits = -1;
>>> -  magic_bits = magic_bits / 0xff * 0xfe << 1 >> 1 | 1;
>>> -
>>> -  /* Set up a longword, each of whose bytes is C.  */
>>> -  charmask = c | (c << 8);
>>> -  charmask |= charmask << 16;
>>> -  if (sizeof (longword) > 4)
>>> -    /* Do the shift in two steps to avoid a warning if long has 32 bits.  */
>>> -    charmask |= (charmask << 16) << 16;
>>> -  if (sizeof (longword) > 8)
>>> -    abort ();
>>> -
>>> -  /* Instead of the traditional loop which tests each character,
>>> -     we will test a longword at a time.  The tricky part is testing
>>> -     if *any of the four* bytes in the longword in question are zero.  */
>>> -  for (;;)
>>> -    {
>>> -      /* We tentatively exit the loop if adding MAGIC_BITS to
>>> -        LONGWORD fails to change any of the hole bits of LONGWORD.
>>> -
>>> -        1) Is this safe?  Will it catch all the zero bytes?
>>> -        Suppose there is a byte with all zeros.  Any carry bits
>>> -        propagating from its left will fall into the hole at its
>>> -        least significant bit and stop.  Since there will be no
>>> -        carry from its most significant bit, the LSB of the
>>> -        byte to the left will be unchanged, and the zero will be
>>> -        detected.
>>> +  /* Set up a word, each of whose bytes is C.  */
>>> +  op_t repeated_c = repeat_bytes (c_in);
>>>
>>> -        2) Is this worthwhile?  Will it ignore everything except
>>> -        zero bytes?  Suppose every byte of LONGWORD has a bit set
>>> -        somewhere.  There will be a carry into bit 8.  If bit 8
>>> -        is set, this will carry into bit 16.  If bit 8 is clear,
>>> -        one of bits 9-15 must be set, so there will be a carry
>>> -        into bit 16.  Similarly, there will be a carry into bit
>>> -        24.  If one of bits 24-30 is set, there will be a carry
>>> -        into bit 31, so all of the hole bits will be changed.
>>> +  /* Align the input address to op_t.  */
>>> +  uintptr_t s_int = (uintptr_t) str;
>>> +  const op_t *word_ptr = word_containing (str);
>>>
>>> -        The one misfire occurs when bits 24-30 are clear and bit
>>> -        31 is set; in this case, the hole at bit 31 is not
>>> -        changed.  If we had access to the processor carry flag,
>>> -        we could close this loophole by putting the fourth hole
>>> -        at bit 32!
>>> +  /* Read the first aligned word, but force bytes before the string to
>>> +     match neither zero nor goal (we make sure the high bit of each byte
>>> +     is 1, and the low 7 bits are all the opposite of the goal byte).  */
>>> +  op_t bmask = create_mask (s_int);
>>> +  op_t word = (*word_ptr | bmask) ^ (repeated_c & highbit_mask (bmask));
>>
>> Think much clearer (and probably better codegen) is:
>> find_zero_eq_low/all(word, repeated) >> (s_int * CHAR_BIT)
> 
> It does not seem to work, at least not replacing the two lines with:
> 
>    op_t word = find_zero_eq_all/low (*word_ptr, repeated_c) >> (s_int * CHAR_BIT);
> 
> The loader itself can not load anything (which means strchr is failing
> somewhere).

You'd need to update the extract arithmetic too.  I'm sure it's still using ctz and the
aligned pointer, which might be why I used the slightly more complex arithmetic here, so
that the tail of the function didn't need changing...


r~


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v5 10/17] string: Improve generic memchr
  2023-01-05 23:47   ` Noah Goldstein
@ 2023-01-09 20:50     ` Adhemerval Zanella Netto
  0 siblings, 0 replies; 55+ messages in thread
From: Adhemerval Zanella Netto @ 2023-01-09 20:50 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: libc-alpha, Richard Henderson



On 05/01/23 20:47, Noah Goldstein wrote:
> On Mon, Sep 19, 2022 at 1:05 PM Adhemerval Zanella via Libc-alpha
> <libc-alpha@sourceware.org> wrote:
>>
>> New algorithm has the following key differences:
>>
>>   - Reads the first word unaligned and uses the string-maskoff function to
>>     remove unwanted data.  This strategy follows the arch-specific
>>     optimization used on aarch64 and powerpc.
>>
>>   - Use string-fz{b,i} and string-opthr functions.
>>
>> Checked on x86_64-linux-gnu, i686-linux-gnu, powerpc-linux-gnu,
>> and powerpc64-linux-gnu by removing the arch-specific assembly
>> implementation and disabling multi-arch (it covers both LE and BE
>> for 64 and 32 bits).
>>
>> Co-authored-by: Richard Henderson  <rth@twiddle.net>
>> ---
>>  string/memchr.c                               | 168 +++++-------------
>>  .../powerpc32/power4/multiarch/memchr-ppc32.c |  14 +-
>>  .../powerpc64/multiarch/memchr-ppc64.c        |   9 +-
>>  3 files changed, 48 insertions(+), 143 deletions(-)
>>
>> diff --git a/string/memchr.c b/string/memchr.c
>> index 422bcd0cd6..08d518b02d 100644
>> --- a/string/memchr.c
>> +++ b/string/memchr.c
>> @@ -1,10 +1,6 @@
>> -/* Copyright (C) 1991-2022 Free Software Foundation, Inc.
>> +/* Scan memory for a character.  Generic version
>> +   Copyright (C) 1991-2022 Free Software Foundation, Inc.
>>     This file is part of the GNU C Library.
>> -   Based on strlen implementation by Torbjorn Granlund (tege@sics.se),
>> -   with help from Dan Sahlin (dan@sics.se) and
>> -   commentary by Jim Blandy (jimb@ai.mit.edu);
>> -   adaptation to memchr suggested by Dick Karpinski (dick@cca.ucsf.edu),
>> -   and implemented by Roland McGrath (roland@ai.mit.edu).
>>
>>     The GNU C Library is free software; you can redistribute it and/or
>>     modify it under the terms of the GNU Lesser General Public
>> @@ -20,143 +16,65 @@
>>     License along with the GNU C Library; if not, see
>>     <https://www.gnu.org/licenses/>.  */
>>
>> -#ifndef _LIBC
>> -# include <config.h>
>> -#endif
>> -
>> +#include <intprops.h>
>> +#include <string-fza.h>
>> +#include <string-fzb.h>
>> +#include <string-fzi.h>
>> +#include <string-maskoff.h>
>> +#include <string-opthr.h>
>>  #include <string.h>
>>
>> -#include <stddef.h>
>> +#undef memchr
>>
>> -#include <limits.h>
>> -
>> -#undef __memchr
>> -#ifdef _LIBC
>> -# undef memchr
>> +#ifdef MEMCHR
>> +# define __memchr MEMCHR
>>  #endif
>>
>> -#ifndef weak_alias
>> -# define __memchr memchr
>> -#endif
>> -
>> -#ifndef MEMCHR
>> -# define MEMCHR __memchr
>> -#endif
>> +static inline const char *
>> +sadd (uintptr_t x, uintptr_t y)
>> +{
>> +  uintptr_t ret = INT_ADD_OVERFLOW (x, y) ? (uintptr_t)-1 : x + y;
>> +  return (const char *)ret;
>> +}
>>
>>  /* Search no more than N bytes of S for C.  */
>>  void *
>> -MEMCHR (void const *s, int c_in, size_t n)
>> +__memchr (void const *s, int c_in, size_t n)
>>  {
>> -  /* On 32-bit hardware, choosing longword to be a 32-bit unsigned
>> -     long instead of a 64-bit uintmax_t tends to give better
>> -     performance.  On 64-bit hardware, unsigned long is generally 64
>> -     bits already.  Change this typedef to experiment with
>> -     performance.  */
>> -  typedef unsigned long int longword;
>> +  if (__glibc_unlikely (n == 0))
>> +    return NULL;
>>
>> -  const unsigned char *char_ptr;
>> -  const longword *longword_ptr;
>> -  longword repeated_one;
>> -  longword repeated_c;
>> -  unsigned char c;
>> +  uintptr_t s_int = (uintptr_t) s;
>>
>> -  c = (unsigned char) c_in;
>> +  /* Set up a word, each of whose bytes is C.  */
>> +  op_t repeated_c = repeat_bytes (c_in);
>> +  op_t before_mask = create_mask (s_int);
>>
>> -  /* Handle the first few bytes by reading one byte at a time.
>> -     Do this until CHAR_PTR is aligned on a longword boundary.  */
>> -  for (char_ptr = (const unsigned char *) s;
>> -       n > 0 && (size_t) char_ptr % sizeof (longword) != 0;
>> -       --n, ++char_ptr)
>> -    if (*char_ptr == c)
>> -      return (void *) char_ptr;
>> +  /* Compute the address of the last byte taking in consideration possible
>> +     overflow.  */
>> +  const char *lbyte = sadd (s_int, n - 1);
> 
> Do you need this? The comparison in the loop is == so letting it
> overflow should be fine no?

Do you mean the saturating add or the last lbyte check?  For the saturating add,
I recall it is required for memchr (..., SIZE_MAX); otherwise the last
byte/word would be computed incorrectly (I fixed some assembly routines that
triggered this issue in the past).

>>
>> -  longword_ptr = (const longword *) char_ptr;
>> +  /* Compute the address of the word containing the last byte. */
>> +  const op_t *lword = word_containing (lbyte);
>>
>> -  /* All these elucidatory comments refer to 4-byte longwords,
>> -     but the theory applies equally well to any size longwords.  */
>> +  /* Read the first word, but munge it so that bytes before the array
>> +     will not match goal.  */
>> +  const op_t *word_ptr = word_containing (s);
>> +  op_t word = (*word_ptr | before_mask) ^ (repeated_c & before_mask);
> 
> Likewise, prefer just shifting out the invalid comparisons on the first word.

I will need to check why this is not really working; I think I suggested it
on a previous iteration and could not make it work for some reason.

>>
>> -  /* Compute auxiliary longword values:
>> -     repeated_one is a value which has a 1 in every byte.
>> -     repeated_c has c in every byte.  */
>> -  repeated_one = 0x01010101;
>> -  repeated_c = c | (c << 8);
>> -  repeated_c |= repeated_c << 16;
>> -  if (0xffffffffU < (longword) -1)
>> +  while (has_eq (word, repeated_c) == 0)
>>      {
>> -      repeated_one |= repeated_one << 31 << 1;
>> -      repeated_c |= repeated_c << 31 << 1;
>> -      if (8 < sizeof (longword))
>> -       {
>> -         size_t i;
>> -
>> -         for (i = 64; i < sizeof (longword) * 8; i *= 2)
>> -           {
>> -             repeated_one |= repeated_one << i;
>> -             repeated_c |= repeated_c << i;
>> -           }
>> -       }
>> +      if (word_ptr == lword)
>> +       return NULL;
>> +      word = *++word_ptr;
>>      }
>>
>> -  /* Instead of the traditional loop which tests each byte, we will test a
>> -     longword at a time.  The tricky part is testing if *any of the four*
>> -     bytes in the longword in question are equal to c.  We first use an xor
>> -     with repeated_c.  This reduces the task to testing whether *any of the
>> -     four* bytes in longword1 is zero.
>> -
>> -     We compute tmp =
>> -       ((longword1 - repeated_one) & ~longword1) & (repeated_one << 7).
>> -     That is, we perform the following operations:
>> -       1. Subtract repeated_one.
>> -       2. & ~longword1.
>> -       3. & a mask consisting of 0x80 in every byte.
>> -     Consider what happens in each byte:
>> -       - If a byte of longword1 is zero, step 1 and 2 transform it into 0xff,
>> -        and step 3 transforms it into 0x80.  A carry can also be propagated
>> -        to more significant bytes.
>> -       - If a byte of longword1 is nonzero, let its lowest 1 bit be at
>> -        position k (0 <= k <= 7); so the lowest k bits are 0.  After step 1,
>> -        the byte ends in a single bit of value 0 and k bits of value 1.
>> -        After step 2, the result is just k bits of value 1: 2^k - 1.  After
>> -        step 3, the result is 0.  And no carry is produced.
>> -     So, if longword1 has only non-zero bytes, tmp is zero.
>> -     Whereas if longword1 has a zero byte, call j the position of the least
>> -     significant zero byte.  Then the result has a zero at positions 0, ...,
>> -     j-1 and a 0x80 at position j.  We cannot predict the result at the more
>> -     significant bytes (positions j+1..3), but it does not matter since we
>> -     already have a non-zero bit at position 8*j+7.
>> -
>> -     So, the test whether any byte in longword1 is zero is equivalent to
>> -     testing whether tmp is nonzero.  */
>> -
>> -  while (n >= sizeof (longword))
>> -    {
>> -      longword longword1 = *longword_ptr ^ repeated_c;
>> -
>> -      if ((((longword1 - repeated_one) & ~longword1)
>> -          & (repeated_one << 7)) != 0)
>> -       break;
>> -      longword_ptr++;
>> -      n -= sizeof (longword);
>> -    }
>> -
>> -  char_ptr = (const unsigned char *) longword_ptr;
>> -
>> -  /* At this point, we know that either n < sizeof (longword), or one of the
>> -     sizeof (longword) bytes starting at char_ptr is == c.  On little-endian
>> -     machines, we could determine the first such byte without any further
>> -     memory accesses, just by looking at the tmp result from the last loop
>> -     iteration.  But this does not work on big-endian machines.  Choose code
>> -     that works in both cases.  */
>> -
>> -  for (; n > 0; --n, ++char_ptr)
>> -    {
>> -      if (*char_ptr == c)
>> -       return (void *) char_ptr;
>> -    }
>> -
>> -  return NULL;
>> +  /* We found a match, but it might be in a byte past the end
>> +     of the array.  */
>> +  char *ret = (char *) word_ptr + index_first_eq (word, repeated_c);
>> +  return (ret <= lbyte) ? ret : NULL;
>>  }
>> -#ifdef weak_alias
>> +#ifndef MEMCHR
>>  weak_alias (__memchr, memchr)
>> -#endif
>>  libc_hidden_builtin_def (memchr)
>> +#endif
>> diff --git a/sysdeps/powerpc/powerpc32/power4/multiarch/memchr-ppc32.c b/sysdeps/powerpc/powerpc32/power4/multiarch/memchr-ppc32.c
>> index fc69df54b3..02877d3c98 100644
>> --- a/sysdeps/powerpc/powerpc32/power4/multiarch/memchr-ppc32.c
>> +++ b/sysdeps/powerpc/powerpc32/power4/multiarch/memchr-ppc32.c
>> @@ -18,17 +18,11 @@
>>
>>  #include <string.h>
>>
>> -#define MEMCHR  __memchr_ppc
>> +extern __typeof (memchr) __memchr_ppc attribute_hidden;
>>
>> -#undef weak_alias
>> -#define weak_alias(a, b)
>> +#define MEMCHR  __memchr_ppc
>> +#include <string/memchr.c>
>>
>>  #ifdef SHARED
>> -# undef libc_hidden_builtin_def
>> -# define libc_hidden_builtin_def(name) \
>> -  __hidden_ver1(__memchr_ppc, __GI_memchr, __memchr_ppc);
>> +__hidden_ver1(__memchr_ppc, __GI_memchr, __memchr_ppc);
>>  #endif
>> -
>> -extern __typeof (memchr) __memchr_ppc attribute_hidden;
>> -
>> -#include <string/memchr.c>
>> diff --git a/sysdeps/powerpc/powerpc64/multiarch/memchr-ppc64.c b/sysdeps/powerpc/powerpc64/multiarch/memchr-ppc64.c
>> index 3c966f4403..15beca787b 100644
>> --- a/sysdeps/powerpc/powerpc64/multiarch/memchr-ppc64.c
>> +++ b/sysdeps/powerpc/powerpc64/multiarch/memchr-ppc64.c
>> @@ -18,14 +18,7 @@
>>
>>  #include <string.h>
>>
>> -#define MEMCHR  __memchr_ppc
>> -
>> -#undef weak_alias
>> -#define weak_alias(a, b)
>> -
>> -# undef libc_hidden_builtin_def
>> -# define libc_hidden_builtin_def(name)
>> -
>>  extern __typeof (memchr) __memchr_ppc attribute_hidden;
>>
>> +#define MEMCHR  __memchr_ppc
>>  #include <string/memchr.c>
>> --
>> 2.34.1
>>


* Re: [PATCH v5 10/17] string: Improve generic memchr
  2023-01-05 23:49   ` Noah Goldstein
@ 2023-01-09 20:51     ` Adhemerval Zanella Netto
  2023-01-09 21:26       ` Noah Goldstein
  0 siblings, 1 reply; 55+ messages in thread
From: Adhemerval Zanella Netto @ 2023-01-09 20:51 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: libc-alpha, Richard Henderson



On 05/01/23 20:49, Noah Goldstein wrote:
> On Mon, Sep 19, 2022 at 1:05 PM Adhemerval Zanella via Libc-alpha
> <libc-alpha@sourceware.org> wrote:
>>
>> New algorithm has the following key differences:
>>
>>   - Reads the first word unaligned and uses the string-maskoff function to
>>     remove unwanted data.  This strategy follows the arch-specific
>>     optimization used on aarch64 and powerpc.
>>
>>   - Use string-fz{b,i} and string-opthr functions.
>>
>> Checked on x86_64-linux-gnu, i686-linux-gnu, powerpc-linux-gnu,
>> and powerpc64-linux-gnu by removing the arch-specific assembly
>> implementation and disabling multi-arch (it covers both LE and BE
>> for 64 and 32 bits).
>>
>> Co-authored-by: Richard Henderson  <rth@twiddle.net>
>> ---
>>  string/memchr.c                               | 168 +++++-------------
>>  .../powerpc32/power4/multiarch/memchr-ppc32.c |  14 +-
>>  .../powerpc64/multiarch/memchr-ppc64.c        |   9 +-
>>  3 files changed, 48 insertions(+), 143 deletions(-)
>>
>> diff --git a/string/memchr.c b/string/memchr.c
>> index 422bcd0cd6..08d518b02d 100644
>> --- a/string/memchr.c
>> +++ b/string/memchr.c
>> @@ -1,10 +1,6 @@
>> -/* Copyright (C) 1991-2022 Free Software Foundation, Inc.
>> +/* Scan memory for a character.  Generic version
>> +   Copyright (C) 1991-2022 Free Software Foundation, Inc.
>>     This file is part of the GNU C Library.
>> -   Based on strlen implementation by Torbjorn Granlund (tege@sics.se),
>> -   with help from Dan Sahlin (dan@sics.se) and
>> -   commentary by Jim Blandy (jimb@ai.mit.edu);
>> -   adaptation to memchr suggested by Dick Karpinski (dick@cca.ucsf.edu),
>> -   and implemented by Roland McGrath (roland@ai.mit.edu).
>>
>>     The GNU C Library is free software; you can redistribute it and/or
>>     modify it under the terms of the GNU Lesser General Public
>> @@ -20,143 +16,65 @@
>>     License along with the GNU C Library; if not, see
>>     <https://www.gnu.org/licenses/>.  */
>>
>> -#ifndef _LIBC
>> -# include <config.h>
>> -#endif
>> -
>> +#include <intprops.h>
>> +#include <string-fza.h>
>> +#include <string-fzb.h>
>> +#include <string-fzi.h>
>> +#include <string-maskoff.h>
>> +#include <string-opthr.h>
>>  #include <string.h>
>>
>> -#include <stddef.h>
>> +#undef memchr
>>
>> -#include <limits.h>
>> -
>> -#undef __memchr
>> -#ifdef _LIBC
>> -# undef memchr
>> +#ifdef MEMCHR
>> +# define __memchr MEMCHR
>>  #endif
>>
>> -#ifndef weak_alias
>> -# define __memchr memchr
>> -#endif
>> -
>> -#ifndef MEMCHR
>> -# define MEMCHR __memchr
>> -#endif
>> +static inline const char *
>> +sadd (uintptr_t x, uintptr_t y)
>> +{
>> +  uintptr_t ret = INT_ADD_OVERFLOW (x, y) ? (uintptr_t)-1 : x + y;
>> +  return (const char *)ret;
>> +}
>>
>>  /* Search no more than N bytes of S for C.  */
>>  void *
>> -MEMCHR (void const *s, int c_in, size_t n)
>> +__memchr (void const *s, int c_in, size_t n)
>>  {
>> -  /* On 32-bit hardware, choosing longword to be a 32-bit unsigned
>> -     long instead of a 64-bit uintmax_t tends to give better
>> -     performance.  On 64-bit hardware, unsigned long is generally 64
>> -     bits already.  Change this typedef to experiment with
>> -     performance.  */
>> -  typedef unsigned long int longword;
>> +  if (__glibc_unlikely (n == 0))
>> +    return NULL;
>>
>> -  const unsigned char *char_ptr;
>> -  const longword *longword_ptr;
>> -  longword repeated_one;
>> -  longword repeated_c;
>> -  unsigned char c;
>> +  uintptr_t s_int = (uintptr_t) s;
>>
>> -  c = (unsigned char) c_in;
>> +  /* Set up a word, each of whose bytes is C.  */
>> +  op_t repeated_c = repeat_bytes (c_in);
>> +  op_t before_mask = create_mask (s_int);
>>
>> -  /* Handle the first few bytes by reading one byte at a time.
>> -     Do this until CHAR_PTR is aligned on a longword boundary.  */
>> -  for (char_ptr = (const unsigned char *) s;
>> -       n > 0 && (size_t) char_ptr % sizeof (longword) != 0;
>> -       --n, ++char_ptr)
>> -    if (*char_ptr == c)
>> -      return (void *) char_ptr;
>> +  /* Compute the address of the last byte taking in consideration possible
>> +     overflow.  */
>> +  const char *lbyte = sadd (s_int, n - 1);
>>
>> -  longword_ptr = (const longword *) char_ptr;
>> +  /* Compute the address of the word containing the last byte. */
>> +  const op_t *lword = word_containing (lbyte);
>>
>> -  /* All these elucidatory comments refer to 4-byte longwords,
>> -     but the theory applies equally well to any size longwords.  */
>> +  /* Read the first word, but munge it so that bytes before the array
>> +     will not match goal.  */
>> +  const op_t *word_ptr = word_containing (s);
>> +  op_t word = (*word_ptr | before_mask) ^ (repeated_c & before_mask);
>>
>> -  /* Compute auxiliary longword values:
>> -     repeated_one is a value which has a 1 in every byte.
>> -     repeated_c has c in every byte.  */
>> -  repeated_one = 0x01010101;
>> -  repeated_c = c | (c << 8);
>> -  repeated_c |= repeated_c << 16;
>> -  if (0xffffffffU < (longword) -1)
>> +  while (has_eq (word, repeated_c) == 0)
>>      {
>> -      repeated_one |= repeated_one << 31 << 1;
>> -      repeated_c |= repeated_c << 31 << 1;
>> -      if (8 < sizeof (longword))
>> -       {
>> -         size_t i;
>> -
>> -         for (i = 64; i < sizeof (longword) * 8; i *= 2)
>> -           {
>> -             repeated_one |= repeated_one << i;
>> -             repeated_c |= repeated_c << i;
>> -           }
>> -       }
>> +      if (word_ptr == lword)
>> +       return NULL;
> Intuitively, making lword be lword - 1, so that normal returns don't need
> the extra null check, would be faster.

Hmm, I did not follow; could you explain in more detail what you mean here?


* Re: [PATCH v5 08/17] string: Improve generic strchrnul
  2023-01-09 20:35     ` Adhemerval Zanella Netto
  2023-01-09 20:49       ` Richard Henderson
@ 2023-01-09 20:59       ` Noah Goldstein
  2023-01-09 21:01         ` Noah Goldstein
  2023-01-09 23:33       ` Richard Henderson
  2 siblings, 1 reply; 55+ messages in thread
From: Noah Goldstein @ 2023-01-09 20:59 UTC (permalink / raw)
  To: Adhemerval Zanella Netto; +Cc: libc-alpha, Richard Henderson

On Mon, Jan 9, 2023 at 12:35 PM Adhemerval Zanella Netto
<adhemerval.zanella@linaro.org> wrote:
>
>
>
> On 05/01/23 20:17, Noah Goldstein wrote:
> > On Mon, Sep 19, 2022 at 1:04 PM Adhemerval Zanella via Libc-alpha
> > <libc-alpha@sourceware.org> wrote:
> >>
> >> New algorithm has the following key differences:
> >>
> >>   - Reads the first word unaligned and uses the string-maskoff function to
> >>     remove unwanted data.  This strategy follows the arch-specific
> >>     optimization used on aarch64 and powerpc.
> >>
> >>   - Use string-fz{b,i} functions.
> >>
> >> Checked on x86_64-linux-gnu, i686-linux-gnu, powerpc64-linux-gnu,
> >> and powerpc-linux-gnu by removing the arch-specific assembly
> >> implementation and disabling multi-arch (it covers both LE and BE
> >> for 64 and 32 bits).
> >>
> >> Co-authored-by: Richard Henderson  <rth@twiddle.net>
> >> ---
> >>  string/strchrnul.c                            | 156 +++---------------
> >>  .../power4/multiarch/strchrnul-ppc32.c        |   4 -
> >>  sysdeps/s390/strchrnul-c.c                    |   2 -
> >>  3 files changed, 24 insertions(+), 138 deletions(-)
> >>
> >> diff --git a/string/strchrnul.c b/string/strchrnul.c
> >> index 0cc1fc6bb0..67defa3dab 100644
> >> --- a/string/strchrnul.c
> >> +++ b/string/strchrnul.c
> >> @@ -1,10 +1,5 @@
> >>  /* Copyright (C) 1991-2022 Free Software Foundation, Inc.
> >>     This file is part of the GNU C Library.
> >> -   Based on strlen implementation by Torbjorn Granlund (tege@sics.se),
> >> -   with help from Dan Sahlin (dan@sics.se) and
> >> -   bug fix and commentary by Jim Blandy (jimb@ai.mit.edu);
> >> -   adaptation to strchr suggested by Dick Karpinski (dick@cca.ucsf.edu),
> >> -   and implemented by Roland McGrath (roland@ai.mit.edu).
> >>
> >>     The GNU C Library is free software; you can redistribute it and/or
> >>     modify it under the terms of the GNU Lesser General Public
> >> @@ -21,146 +16,43 @@
> >>     <https://www.gnu.org/licenses/>.  */
> >>
> >>  #include <string.h>
> >> -#include <memcopy.h>
> >>  #include <stdlib.h>
> >> +#include <stdint.h>
> >> +#include <string-fza.h>
> >> +#include <string-fzb.h>
> >> +#include <string-fzi.h>
> >> +#include <string-maskoff.h>
> >>
> >>  #undef __strchrnul
> >>  #undef strchrnul
> >>
> >> -#ifndef STRCHRNUL
> >> -# define STRCHRNUL __strchrnul
> >> +#ifdef STRCHRNUL
> >> +# define __strchrnul STRCHRNUL
> >>  #endif
> >>
> >>  /* Find the first occurrence of C in S or the final NUL byte.  */
> >>  char *
> >> -STRCHRNUL (const char *s, int c_in)
> >> +__strchrnul (const char *str, int c_in)
> >>  {
> >> -  const unsigned char *char_ptr;
> >> -  const unsigned long int *longword_ptr;
> >> -  unsigned long int longword, magic_bits, charmask;
> >> -  unsigned char c;
> >> -
> >> -  c = (unsigned char) c_in;
> >> -
> >> -  /* Handle the first few characters by reading one character at a time.
> >> -     Do this until CHAR_PTR is aligned on a longword boundary.  */
> >> -  for (char_ptr = (const unsigned char *) s;
> >> -       ((unsigned long int) char_ptr & (sizeof (longword) - 1)) != 0;
> >> -       ++char_ptr)
> >> -    if (*char_ptr == c || *char_ptr == '\0')
> >> -      return (void *) char_ptr;
> >> -
> >> -  /* All these elucidatory comments refer to 4-byte longwords,
> >> -     but the theory applies equally well to 8-byte longwords.  */
> >> -
> >> -  longword_ptr = (unsigned long int *) char_ptr;
> >> -
> >> -  /* Bits 31, 24, 16, and 8 of this number are zero.  Call these bits
> >> -     the "holes."  Note that there is a hole just to the left of
> >> -     each byte, with an extra at the end:
> >> -
> >> -     bits:  01111110 11111110 11111110 11111111
> >> -     bytes: AAAAAAAA BBBBBBBB CCCCCCCC DDDDDDDD
> >> -
> >> -     The 1-bits make sure that carries propagate to the next 0-bit.
> >> -     The 0-bits provide holes for carries to fall into.  */
> >> -  magic_bits = -1;
> >> -  magic_bits = magic_bits / 0xff * 0xfe << 1 >> 1 | 1;
> >> -
> >> -  /* Set up a longword, each of whose bytes is C.  */
> >> -  charmask = c | (c << 8);
> >> -  charmask |= charmask << 16;
> >> -  if (sizeof (longword) > 4)
> >> -    /* Do the shift in two steps to avoid a warning if long has 32 bits.  */
> >> -    charmask |= (charmask << 16) << 16;
> >> -  if (sizeof (longword) > 8)
> >> -    abort ();
> >> -
> >> -  /* Instead of the traditional loop which tests each character,
> >> -     we will test a longword at a time.  The tricky part is testing
> >> -     if *any of the four* bytes in the longword in question are zero.  */
> >> -  for (;;)
> >> -    {
> >> -      /* We tentatively exit the loop if adding MAGIC_BITS to
> >> -        LONGWORD fails to change any of the hole bits of LONGWORD.
> >> -
> >> -        1) Is this safe?  Will it catch all the zero bytes?
> >> -        Suppose there is a byte with all zeros.  Any carry bits
> >> -        propagating from its left will fall into the hole at its
> >> -        least significant bit and stop.  Since there will be no
> >> -        carry from its most significant bit, the LSB of the
> >> -        byte to the left will be unchanged, and the zero will be
> >> -        detected.
> >> +  /* Set up a word, each of whose bytes is C.  */
> >> +  op_t repeated_c = repeat_bytes (c_in);
> >>
> >> -        2) Is this worthwhile?  Will it ignore everything except
> >> -        zero bytes?  Suppose every byte of LONGWORD has a bit set
> >> -        somewhere.  There will be a carry into bit 8.  If bit 8
> >> -        is set, this will carry into bit 16.  If bit 8 is clear,
> >> -        one of bits 9-15 must be set, so there will be a carry
> >> -        into bit 16.  Similarly, there will be a carry into bit
> >> -        24.  If one of bits 24-30 is set, there will be a carry
> >> -        into bit 31, so all of the hole bits will be changed.
> >> +  /* Align the input address to op_t.  */
> >> +  uintptr_t s_int = (uintptr_t) str;
> >> +  const op_t *word_ptr = word_containing (str);
> >>
> >> -        The one misfire occurs when bits 24-30 are clear and bit
> >> -        31 is set; in this case, the hole at bit 31 is not
> >> -        changed.  If we had access to the processor carry flag,
> >> -        we could close this loophole by putting the fourth hole
> >> -        at bit 32!
> >> +  /* Read the first aligned word, but force bytes before the string to
> >> +     match neither zero nor goal (we make sure the high bit of each byte
> >> +     is 1, and the low 7 bits are all the opposite of the goal byte).  */
> >> +  op_t bmask = create_mask (s_int);
> >> +  op_t word = (*word_ptr | bmask) ^ (repeated_c & highbit_mask (bmask));
> >
> > Think much clearer (and probably better codegen) is:
> > find_zero_eq_low/all(word, repeated) >> (s_int * CHAR_BIT)
>
> It does not seem to work, at least not replacing the two lines with:
>
>   op_t word = find_zero_eq_all/low (*word_ptr, repeated_c) >> (s_int * CHAR_BIT);
>
> The loader itself can not load anything (which means strchr is failing
> somewhere).

The following works for me (only checking test-strchrnul, so maybe it's missing
a case); also, as I have it here, it is little-endian only.  (It would need an
ifdef here, or an API for shifting out out-of-bounds bits, to be
cross-platform.)
```
  op_t word = *word_ptr;
  op_t mask = find_zero_eq_low(word, repeated_c)
              >> (CHAR_BIT * (s_int % sizeof(uintptr_t)));
  if(mask) {
      return (char *) str + index_first_(mask);
  }

  while (! has_zero_eq (word, repeated_c))
    word = *++word_ptr;

  op_t found = index_first_zero_eq (word, repeated_c);
  return (char *) (word_ptr) + found;
```


* Re: [PATCH v5 08/17] string: Improve generic strchrnul
  2023-01-09 20:59       ` Noah Goldstein
@ 2023-01-09 21:01         ` Noah Goldstein
  0 siblings, 0 replies; 55+ messages in thread
From: Noah Goldstein @ 2023-01-09 21:01 UTC (permalink / raw)
  To: Adhemerval Zanella Netto; +Cc: libc-alpha, Richard Henderson

On Mon, Jan 9, 2023 at 12:59 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> On Mon, Jan 9, 2023 at 12:35 PM Adhemerval Zanella Netto
> <adhemerval.zanella@linaro.org> wrote:
> >
> >
> >
> > On 05/01/23 20:17, Noah Goldstein wrote:
> > > On Mon, Sep 19, 2022 at 1:04 PM Adhemerval Zanella via Libc-alpha
> > > <libc-alpha@sourceware.org> wrote:
> > >>
> > >> New algorithm has the following key differences:
> > >>
> > >>   - Reads the first word unaligned and uses the string-maskoff function to
> > >>     remove unwanted data.  This strategy follows the arch-specific
> > >>     optimization used on aarch64 and powerpc.
> > >>
> > >>   - Use string-fz{b,i} functions.
> > >>
> > >> Checked on x86_64-linux-gnu, i686-linux-gnu, powerpc64-linux-gnu,
> > >> and powerpc-linux-gnu by removing the arch-specific assembly
> > >> implementation and disabling multi-arch (it covers both LE and BE
> > >> for 64 and 32 bits).
> > >>
> > >> Co-authored-by: Richard Henderson  <rth@twiddle.net>
> > >> ---
> > >>  string/strchrnul.c                            | 156 +++---------------
> > >>  .../power4/multiarch/strchrnul-ppc32.c        |   4 -
> > >>  sysdeps/s390/strchrnul-c.c                    |   2 -
> > >>  3 files changed, 24 insertions(+), 138 deletions(-)
> > >>
> > >> diff --git a/string/strchrnul.c b/string/strchrnul.c
> > >> index 0cc1fc6bb0..67defa3dab 100644
> > >> --- a/string/strchrnul.c
> > >> +++ b/string/strchrnul.c
> > >> @@ -1,10 +1,5 @@
> > >>  /* Copyright (C) 1991-2022 Free Software Foundation, Inc.
> > >>     This file is part of the GNU C Library.
> > >> -   Based on strlen implementation by Torbjorn Granlund (tege@sics.se),
> > >> -   with help from Dan Sahlin (dan@sics.se) and
> > >> -   bug fix and commentary by Jim Blandy (jimb@ai.mit.edu);
> > >> -   adaptation to strchr suggested by Dick Karpinski (dick@cca.ucsf.edu),
> > >> -   and implemented by Roland McGrath (roland@ai.mit.edu).
> > >>
> > >>     The GNU C Library is free software; you can redistribute it and/or
> > >>     modify it under the terms of the GNU Lesser General Public
> > >> @@ -21,146 +16,43 @@
> > >>     <https://www.gnu.org/licenses/>.  */
> > >>
> > >>  #include <string.h>
> > >> -#include <memcopy.h>
> > >>  #include <stdlib.h>
> > >> +#include <stdint.h>
> > >> +#include <string-fza.h>
> > >> +#include <string-fzb.h>
> > >> +#include <string-fzi.h>
> > >> +#include <string-maskoff.h>
> > >>
> > >>  #undef __strchrnul
> > >>  #undef strchrnul
> > >>
> > >> -#ifndef STRCHRNUL
> > >> -# define STRCHRNUL __strchrnul
> > >> +#ifdef STRCHRNUL
> > >> +# define __strchrnul STRCHRNUL
> > >>  #endif
> > >>
> > >>  /* Find the first occurrence of C in S or the final NUL byte.  */
> > >>  char *
> > >> -STRCHRNUL (const char *s, int c_in)
> > >> +__strchrnul (const char *str, int c_in)
> > >>  {
> > >> -  const unsigned char *char_ptr;
> > >> -  const unsigned long int *longword_ptr;
> > >> -  unsigned long int longword, magic_bits, charmask;
> > >> -  unsigned char c;
> > >> -
> > >> -  c = (unsigned char) c_in;
> > >> -
> > >> -  /* Handle the first few characters by reading one character at a time.
> > >> -     Do this until CHAR_PTR is aligned on a longword boundary.  */
> > >> -  for (char_ptr = (const unsigned char *) s;
> > >> -       ((unsigned long int) char_ptr & (sizeof (longword) - 1)) != 0;
> > >> -       ++char_ptr)
> > >> -    if (*char_ptr == c || *char_ptr == '\0')
> > >> -      return (void *) char_ptr;
> > >> -
> > >> -  /* All these elucidatory comments refer to 4-byte longwords,
> > >> -     but the theory applies equally well to 8-byte longwords.  */
> > >> -
> > >> -  longword_ptr = (unsigned long int *) char_ptr;
> > >> -
> > >> -  /* Bits 31, 24, 16, and 8 of this number are zero.  Call these bits
> > >> -     the "holes."  Note that there is a hole just to the left of
> > >> -     each byte, with an extra at the end:
> > >> -
> > >> -     bits:  01111110 11111110 11111110 11111111
> > >> -     bytes: AAAAAAAA BBBBBBBB CCCCCCCC DDDDDDDD
> > >> -
> > >> -     The 1-bits make sure that carries propagate to the next 0-bit.
> > >> -     The 0-bits provide holes for carries to fall into.  */
> > >> -  magic_bits = -1;
> > >> -  magic_bits = magic_bits / 0xff * 0xfe << 1 >> 1 | 1;
> > >> -
> > >> -  /* Set up a longword, each of whose bytes is C.  */
> > >> -  charmask = c | (c << 8);
> > >> -  charmask |= charmask << 16;
> > >> -  if (sizeof (longword) > 4)
> > >> -    /* Do the shift in two steps to avoid a warning if long has 32 bits.  */
> > >> -    charmask |= (charmask << 16) << 16;
> > >> -  if (sizeof (longword) > 8)
> > >> -    abort ();
> > >> -
> > >> -  /* Instead of the traditional loop which tests each character,
> > >> -     we will test a longword at a time.  The tricky part is testing
> > >> -     if *any of the four* bytes in the longword in question are zero.  */
> > >> -  for (;;)
> > >> -    {
> > >> -      /* We tentatively exit the loop if adding MAGIC_BITS to
> > >> -        LONGWORD fails to change any of the hole bits of LONGWORD.
> > >> -
> > >> -        1) Is this safe?  Will it catch all the zero bytes?
> > >> -        Suppose there is a byte with all zeros.  Any carry bits
> > >> -        propagating from its left will fall into the hole at its
> > >> -        least significant bit and stop.  Since there will be no
> > >> -        carry from its most significant bit, the LSB of the
> > >> -        byte to the left will be unchanged, and the zero will be
> > >> -        detected.
> > >> +  /* Set up a word, each of whose bytes is C.  */
> > >> +  op_t repeated_c = repeat_bytes (c_in);
> > >>
> > >> -        2) Is this worthwhile?  Will it ignore everything except
> > >> -        zero bytes?  Suppose every byte of LONGWORD has a bit set
> > >> -        somewhere.  There will be a carry into bit 8.  If bit 8
> > >> -        is set, this will carry into bit 16.  If bit 8 is clear,
> > >> -        one of bits 9-15 must be set, so there will be a carry
> > >> -        into bit 16.  Similarly, there will be a carry into bit
> > >> -        24.  If one of bits 24-30 is set, there will be a carry
> > >> -        into bit 31, so all of the hole bits will be changed.
> > >> +  /* Align the input address to op_t.  */
> > >> +  uintptr_t s_int = (uintptr_t) str;
> > >> +  const op_t *word_ptr = word_containing (str);
> > >>
> > >> -        The one misfire occurs when bits 24-30 are clear and bit
> > >> -        31 is set; in this case, the hole at bit 31 is not
> > >> -        changed.  If we had access to the processor carry flag,
> > >> -        we could close this loophole by putting the fourth hole
> > >> -        at bit 32!
> > >> +  /* Read the first aligned word, but force bytes before the string to
> > >> +     match neither zero nor goal (we make sure the high bit of each byte
> > >> +     is 1, and the low 7 bits are all the opposite of the goal byte).  */
> > >> +  op_t bmask = create_mask (s_int);
> > >> +  op_t word = (*word_ptr | bmask) ^ (repeated_c & highbit_mask (bmask));
> > >
> > > Think much clearer (and probably better codegen) is:
> > > find_zero_eq_low/all(word, repeated) >> (s_int * CHAR_BIT)
> >
> > It does not seem to work, at least not replacing the two lines with:
> >
> >   op_t word = find_zero_eq_all/low (*word_ptr, repeated_c) >> (s_int * CHAR_BIT);
> >
> > The loader itself cannot load anything (which means strchr is failing
> > somewhere).
>
> The following works for me (only checking test-strchrnul, so maybe it's
> missing a case); also, as I have it here, it's little-endian only.  (It
> would need an ifdef here, or an API for shifting out out-of-bounds bits,
> to be cross-platform.)
> ```
>   op_t word = *word_ptr;
>   op_t mask = find_zero_eq_low (word, repeated_c)
>               >> (CHAR_BIT * (s_int % sizeof (uintptr_t)));
>   if (mask)
>     return (char *) str + index_first_ (mask);
>
>   word = *++word_ptr;
>   while (! has_zero_eq (word, repeated_c))
>     word = *++word_ptr;
>
>   op_t found = index_first_zero_eq (word, repeated_c);
>   return (char *) word_ptr + found;
> ```
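For reference, the find/index primitives the sketch above relies on can be
written generically along these lines.  This is an illustrative,
little-endian-only sketch with stand-in definitions (the actual
string-fza.h/string-fzi.h headers differ); only the names follow the patch:

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>
#include <string.h>

typedef uintptr_t op_t;

/* Broadcast byte C into every byte of a word.  */
static inline op_t
repeat_bytes (int c_in)
{
  return ((op_t) -1 / 0xff) * (unsigned char) c_in;
}

/* Set the high bit of each byte of the result exactly where the
   corresponding byte of X is zero.  */
static inline op_t
find_zero_all (op_t x)
{
  op_t lsb = repeat_bytes (0x01);
  op_t msb = repeat_bytes (0x80);
  return ~(x | ((x | msb) - lsb)) & msb;
}

/* Markers for bytes of X that are zero or equal to the goal byte.  */
static inline op_t
find_zero_eq_all (op_t x, op_t repeated_c)
{
  return find_zero_all (x) | find_zero_all (x ^ repeated_c);
}

/* Byte index of the first marker (little-endian: lowest set bit).  */
static inline unsigned int
index_first_ (op_t mask)
{
  return __builtin_ctzl (mask) / CHAR_BIT;
}
```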

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v5 10/17] string: Improve generic memchr
  2023-01-09 20:51     ` Adhemerval Zanella Netto
@ 2023-01-09 21:26       ` Noah Goldstein
  2023-01-10 14:33         ` Adhemerval Zanella Netto
  0 siblings, 1 reply; 55+ messages in thread
From: Noah Goldstein @ 2023-01-09 21:26 UTC (permalink / raw)
  To: Adhemerval Zanella Netto; +Cc: libc-alpha, Richard Henderson

On Mon, Jan 9, 2023 at 12:51 PM Adhemerval Zanella Netto
<adhemerval.zanella@linaro.org> wrote:
>
>
>
> On 05/01/23 20:49, Noah Goldstein wrote:
> > On Mon, Sep 19, 2022 at 1:05 PM Adhemerval Zanella via Libc-alpha
> > <libc-alpha@sourceware.org> wrote:
> >>
> >> New algorithm have the following key differences:
> >>
> >>   - Reads first word unaligned and use string-maskoff function to
> >>     remove unwanted data.  This strategy follow arch-specific
> >>     optimization used on aarch64 and powerpc.
> >>
> >>   - Use string-fz{b,i} and string-opthr functions.
> >>
> >> Checked on x86_64-linux-gnu, i686-linux-gnu, powerpc-linux-gnu,
> >> and powerpc64-linux-gnu by removing the arch-specific assembly
> >> implementation and disabling multi-arch (it covers both LE and BE
> >> for 64 and 32 bits).
> >>
> >> Co-authored-by: Richard Henderson  <rth@twiddle.net>
> >> ---
> >>  string/memchr.c                               | 168 +++++-------------
> >>  .../powerpc32/power4/multiarch/memchr-ppc32.c |  14 +-
> >>  .../powerpc64/multiarch/memchr-ppc64.c        |   9 +-
> >>  3 files changed, 48 insertions(+), 143 deletions(-)
> >>
> >> diff --git a/string/memchr.c b/string/memchr.c
> >> index 422bcd0cd6..08d518b02d 100644
> >> --- a/string/memchr.c
> >> +++ b/string/memchr.c
> >> @@ -1,10 +1,6 @@
> >> -/* Copyright (C) 1991-2022 Free Software Foundation, Inc.
> >> +/* Scan memory for a character.  Generic version
> >> +   Copyright (C) 1991-2022 Free Software Foundation, Inc.
> >>     This file is part of the GNU C Library.
> >> -   Based on strlen implementation by Torbjorn Granlund (tege@sics.se),
> >> -   with help from Dan Sahlin (dan@sics.se) and
> >> -   commentary by Jim Blandy (jimb@ai.mit.edu);
> >> -   adaptation to memchr suggested by Dick Karpinski (dick@cca.ucsf.edu),
> >> -   and implemented by Roland McGrath (roland@ai.mit.edu).
> >>
> >>     The GNU C Library is free software; you can redistribute it and/or
> >>     modify it under the terms of the GNU Lesser General Public
> >> @@ -20,143 +16,65 @@
> >>     License along with the GNU C Library; if not, see
> >>     <https://www.gnu.org/licenses/>.  */
> >>
> >> -#ifndef _LIBC
> >> -# include <config.h>
> >> -#endif
> >> -
> >> +#include <intprops.h>
> >> +#include <string-fza.h>
> >> +#include <string-fzb.h>
> >> +#include <string-fzi.h>
> >> +#include <string-maskoff.h>
> >> +#include <string-opthr.h>
> >>  #include <string.h>
> >>
> >> -#include <stddef.h>
> >> +#undef memchr
> >>
> >> -#include <limits.h>
> >> -
> >> -#undef __memchr
> >> -#ifdef _LIBC
> >> -# undef memchr
> >> +#ifdef MEMCHR
> >> +# define __memchr MEMCHR
> >>  #endif
> >>
> >> -#ifndef weak_alias
> >> -# define __memchr memchr
> >> -#endif
> >> -
> >> -#ifndef MEMCHR
> >> -# define MEMCHR __memchr
> >> -#endif
> >> +static inline const char *
> >> +sadd (uintptr_t x, uintptr_t y)
> >> +{
> >> +  uintptr_t ret = INT_ADD_OVERFLOW (x, y) ? (uintptr_t)-1 : x + y;
> >> +  return (const char *)ret;
> >> +}
> >>
> >>  /* Search no more than N bytes of S for C.  */
> >>  void *
> >> -MEMCHR (void const *s, int c_in, size_t n)
> >> +__memchr (void const *s, int c_in, size_t n)
> >>  {
> >> -  /* On 32-bit hardware, choosing longword to be a 32-bit unsigned
> >> -     long instead of a 64-bit uintmax_t tends to give better
> >> -     performance.  On 64-bit hardware, unsigned long is generally 64
> >> -     bits already.  Change this typedef to experiment with
> >> -     performance.  */
> >> -  typedef unsigned long int longword;
> >> +  if (__glibc_unlikely (n == 0))
> >> +    return NULL;
> >>
> >> -  const unsigned char *char_ptr;
> >> -  const longword *longword_ptr;
> >> -  longword repeated_one;
> >> -  longword repeated_c;
> >> -  unsigned char c;
> >> +  uintptr_t s_int = (uintptr_t) s;
> >>
> >> -  c = (unsigned char) c_in;
> >> +  /* Set up a word, each of whose bytes is C.  */
> >> +  op_t repeated_c = repeat_bytes (c_in);
> >> +  op_t before_mask = create_mask (s_int);
> >>
> >> -  /* Handle the first few bytes by reading one byte at a time.
> >> -     Do this until CHAR_PTR is aligned on a longword boundary.  */
> >> -  for (char_ptr = (const unsigned char *) s;
> >> -       n > 0 && (size_t) char_ptr % sizeof (longword) != 0;
> >> -       --n, ++char_ptr)
> >> -    if (*char_ptr == c)
> >> -      return (void *) char_ptr;
> >> +  /* Compute the address of the last byte taking in consideration possible
> >> +     overflow.  */
> >> +  const char *lbyte = sadd (s_int, n - 1);
> >>
> >> -  longword_ptr = (const longword *) char_ptr;
> >> +  /* Compute the address of the word containing the last byte. */
> >> +  const op_t *lword = word_containing (lbyte);
> >>
> >> -  /* All these elucidatory comments refer to 4-byte longwords,
> >> -     but the theory applies equally well to any size longwords.  */
> >> +  /* Read the first word, but munge it so that bytes before the array
> >> +     will not match goal.  */
> >> +  const op_t *word_ptr = word_containing (s);
> >> +  op_t word = (*word_ptr | before_mask) ^ (repeated_c & before_mask);
> >>
> >> -  /* Compute auxiliary longword values:
> >> -     repeated_one is a value which has a 1 in every byte.
> >> -     repeated_c has c in every byte.  */
> >> -  repeated_one = 0x01010101;
> >> -  repeated_c = c | (c << 8);
> >> -  repeated_c |= repeated_c << 16;
> >> -  if (0xffffffffU < (longword) -1)
> >> +  while (has_eq (word, repeated_c) == 0)
> >>      {
> >> -      repeated_one |= repeated_one << 31 << 1;
> >> -      repeated_c |= repeated_c << 31 << 1;
> >> -      if (8 < sizeof (longword))
> >> -       {
> >> -         size_t i;
> >> -
> >> -         for (i = 64; i < sizeof (longword) * 8; i *= 2)
> >> -           {
> >> -             repeated_one |= repeated_one << i;
> >> -             repeated_c |= repeated_c << i;
> >> -           }
> >> -       }
> >> +      if (word_ptr == lword)
> >> +       return NULL;
> > Intuitively, making lword be lword - 1, so that normal returns don't need
> > the extra NULL check, would be faster.
>
> Hum, I did not follow; could you explain it with more details what you mean here?

I was thinking something like:

```
  op_t word = *word_ptr;
  op_t mask = find_eq_low (word, repeated_c)
      >> (CHAR_BIT * (s_int % sizeof (uintptr_t)));
  if (mask)
    {
      char *ret = (char *) s + index_first_ (mask);
      return (ret <= lbyte) ? ret : NULL;
    }
  if (word_ptr == lword)
    return NULL;

  word = *++word_ptr;
  while (word_ptr != lword)
    {
      if (has_eq (word, repeated_c))
        return (char *) word_ptr + index_first_eq (word, repeated_c);
      word = *++word_ptr;
    }

  if (has_eq (word, repeated_c))
    {
      /* We found a match, but it might be in a byte past the end
         of the array.  */
      char *ret = (char *) word_ptr + index_first_eq (word, repeated_c);
      if (ret <= lbyte)
        return ret;
    }
  return NULL;
```

The idea is that, until the last word, you don't need the extra bounds check
(tested with test-memchr.c on little-endian).
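The lbyte bound used in the sketch above comes from the patch's saturating
add.  A minimal standalone version, using the GCC/Clang
__builtin_add_overflow builtin rather than intprops.h's INT_ADD_OVERFLOW
(an assumption for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* Saturating add, mirroring the patch's sadd(): if X + Y would wrap
   around uintptr_t, clamp to the maximum address instead, so that
   "s + n - 1" can never overflow past the end of the address space.  */
static inline uintptr_t
sadd (uintptr_t x, uintptr_t y)
{
  uintptr_t sum;
  return __builtin_add_overflow (x, y, &sum) ? (uintptr_t) -1 : sum;
}
```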

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v5 08/17] string: Improve generic strchrnul
  2023-01-09 20:35     ` Adhemerval Zanella Netto
  2023-01-09 20:49       ` Richard Henderson
  2023-01-09 20:59       ` Noah Goldstein
@ 2023-01-09 23:33       ` Richard Henderson
  2023-01-10 14:18         ` Adhemerval Zanella Netto
  2 siblings, 1 reply; 55+ messages in thread
From: Richard Henderson @ 2023-01-09 23:33 UTC (permalink / raw)
  To: Adhemerval Zanella Netto, Noah Goldstein; +Cc: libc-alpha

On 1/9/23 12:35, Adhemerval Zanella Netto wrote:
> 
> 
> On 05/01/23 20:17, Noah Goldstein wrote:
>> On Mon, Sep 19, 2022 at 1:04 PM Adhemerval Zanella via Libc-alpha
>> <libc-alpha@sourceware.org> wrote:
>>>
>>> New algorithm have the following key differences:
>>>
>>>    - Reads first word unaligned and use string-maskoff function to
>>>      remove unwanted data.  This strategy follow arch-specific
>>>      optimization used on aarch64 and powerpc.
>>>
>>>    - Use string-fz{b,i} functions.
>>>
>>> Checked on x86_64-linux-gnu, i686-linux-gnu, powerpc64-linux-gnu,
>>> and powerpc-linux-gnu by removing the arch-specific assembly
>>> implementation and disabling multi-arch (it covers both LE and BE
>>> for 64 and 32 bits).
>>>
>>> Co-authored-by: Richard Henderson  <rth@twiddle.net>
>>> ---
>>>   string/strchrnul.c                            | 156 +++---------------
>>>   .../power4/multiarch/strchrnul-ppc32.c        |   4 -
>>>   sysdeps/s390/strchrnul-c.c                    |   2 -
>>>   3 files changed, 24 insertions(+), 138 deletions(-)
>>>
>>> diff --git a/string/strchrnul.c b/string/strchrnul.c
>>> index 0cc1fc6bb0..67defa3dab 100644
>>> --- a/string/strchrnul.c
>>> +++ b/string/strchrnul.c
>>> @@ -1,10 +1,5 @@
>>>   /* Copyright (C) 1991-2022 Free Software Foundation, Inc.
>>>      This file is part of the GNU C Library.
>>> -   Based on strlen implementation by Torbjorn Granlund (tege@sics.se),
>>> -   with help from Dan Sahlin (dan@sics.se) and
>>> -   bug fix and commentary by Jim Blandy (jimb@ai.mit.edu);
>>> -   adaptation to strchr suggested by Dick Karpinski (dick@cca.ucsf.edu),
>>> -   and implemented by Roland McGrath (roland@ai.mit.edu).
>>>
>>>      The GNU C Library is free software; you can redistribute it and/or
>>>      modify it under the terms of the GNU Lesser General Public
>>> @@ -21,146 +16,43 @@
>>>      <https://www.gnu.org/licenses/>.  */
>>>
>>>   #include <string.h>
>>> -#include <memcopy.h>
>>>   #include <stdlib.h>
>>> +#include <stdint.h>
>>> +#include <string-fza.h>
>>> +#include <string-fzb.h>
>>> +#include <string-fzi.h>
>>> +#include <string-maskoff.h>
>>>
>>>   #undef __strchrnul
>>>   #undef strchrnul
>>>
>>> -#ifndef STRCHRNUL
>>> -# define STRCHRNUL __strchrnul
>>> +#ifdef STRCHRNUL
>>> +# define __strchrnul STRCHRNUL
>>>   #endif
>>>
>>>   /* Find the first occurrence of C in S or the final NUL byte.  */
>>>   char *
>>> -STRCHRNUL (const char *s, int c_in)
>>> +__strchrnul (const char *str, int c_in)
>>>   {
>>> -  const unsigned char *char_ptr;
>>> -  const unsigned long int *longword_ptr;
>>> -  unsigned long int longword, magic_bits, charmask;
>>> -  unsigned char c;
>>> -
>>> -  c = (unsigned char) c_in;
>>> -
>>> -  /* Handle the first few characters by reading one character at a time.
>>> -     Do this until CHAR_PTR is aligned on a longword boundary.  */
>>> -  for (char_ptr = (const unsigned char *) s;
>>> -       ((unsigned long int) char_ptr & (sizeof (longword) - 1)) != 0;
>>> -       ++char_ptr)
>>> -    if (*char_ptr == c || *char_ptr == '\0')
>>> -      return (void *) char_ptr;
>>> -
>>> -  /* All these elucidatory comments refer to 4-byte longwords,
>>> -     but the theory applies equally well to 8-byte longwords.  */
>>> -
>>> -  longword_ptr = (unsigned long int *) char_ptr;
>>> -
>>> -  /* Bits 31, 24, 16, and 8 of this number are zero.  Call these bits
>>> -     the "holes."  Note that there is a hole just to the left of
>>> -     each byte, with an extra at the end:
>>> -
>>> -     bits:  01111110 11111110 11111110 11111111
>>> -     bytes: AAAAAAAA BBBBBBBB CCCCCCCC DDDDDDDD
>>> -
>>> -     The 1-bits make sure that carries propagate to the next 0-bit.
>>> -     The 0-bits provide holes for carries to fall into.  */
>>> -  magic_bits = -1;
>>> -  magic_bits = magic_bits / 0xff * 0xfe << 1 >> 1 | 1;
>>> -
>>> -  /* Set up a longword, each of whose bytes is C.  */
>>> -  charmask = c | (c << 8);
>>> -  charmask |= charmask << 16;
>>> -  if (sizeof (longword) > 4)
>>> -    /* Do the shift in two steps to avoid a warning if long has 32 bits.  */
>>> -    charmask |= (charmask << 16) << 16;
>>> -  if (sizeof (longword) > 8)
>>> -    abort ();
>>> -
>>> -  /* Instead of the traditional loop which tests each character,
>>> -     we will test a longword at a time.  The tricky part is testing
>>> -     if *any of the four* bytes in the longword in question are zero.  */
>>> -  for (;;)
>>> -    {
>>> -      /* We tentatively exit the loop if adding MAGIC_BITS to
>>> -        LONGWORD fails to change any of the hole bits of LONGWORD.
>>> -
>>> -        1) Is this safe?  Will it catch all the zero bytes?
>>> -        Suppose there is a byte with all zeros.  Any carry bits
>>> -        propagating from its left will fall into the hole at its
>>> -        least significant bit and stop.  Since there will be no
>>> -        carry from its most significant bit, the LSB of the
>>> -        byte to the left will be unchanged, and the zero will be
>>> -        detected.
>>> +  /* Set up a word, each of whose bytes is C.  */
>>> +  op_t repeated_c = repeat_bytes (c_in);
>>>
>>> -        2) Is this worthwhile?  Will it ignore everything except
>>> -        zero bytes?  Suppose every byte of LONGWORD has a bit set
>>> -        somewhere.  There will be a carry into bit 8.  If bit 8
>>> -        is set, this will carry into bit 16.  If bit 8 is clear,
>>> -        one of bits 9-15 must be set, so there will be a carry
>>> -        into bit 16.  Similarly, there will be a carry into bit
>>> -        24.  If one of bits 24-30 is set, there will be a carry
>>> -        into bit 31, so all of the hole bits will be changed.
>>> +  /* Align the input address to op_t.  */
>>> +  uintptr_t s_int = (uintptr_t) str;
>>> +  const op_t *word_ptr = word_containing (str);
>>>
>>> -        The one misfire occurs when bits 24-30 are clear and bit
>>> -        31 is set; in this case, the hole at bit 31 is not
>>> -        changed.  If we had access to the processor carry flag,
>>> -        we could close this loophole by putting the fourth hole
>>> -        at bit 32!
>>> +  /* Read the first aligned word, but force bytes before the string to
>>> +     match neither zero nor goal (we make sure the high bit of each byte
>>> +     is 1, and the low 7 bits are all the opposite of the goal byte).  */
>>> +  op_t bmask = create_mask (s_int);
>>> +  op_t word = (*word_ptr | bmask) ^ (repeated_c & highbit_mask (bmask));
>>
>> Think much clearer (and probably better codegen) is:
>> find_zero_eq_low/all(word, repeated) >> (s_int * CHAR_BIT)
> 
> It does not seem to work, at least not replacing the two lines with:
> 
>    op_t word = find_zero_eq_all/low (*word_ptr, repeated_c) >> (s_int * CHAR_BIT);

Oh, two fine points:

(1) big-endian would want shifting left,
(2) alpha would want shifting by bits not bytes,
     because the cmpbge insn produces an 8-bit mask.

so you'd need to hide this shift in the headers like create_mask().


r~

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v5 08/17] string: Improve generic strchrnul
  2023-01-09 23:33       ` Richard Henderson
@ 2023-01-10 14:18         ` Adhemerval Zanella Netto
  2023-01-10 16:24           ` Richard Henderson
  2023-01-10 17:17           ` Noah Goldstein
  0 siblings, 2 replies; 55+ messages in thread
From: Adhemerval Zanella Netto @ 2023-01-10 14:18 UTC (permalink / raw)
  To: Richard Henderson, Noah Goldstein; +Cc: libc-alpha



On 09/01/23 20:33, Richard Henderson wrote:
> On 1/9/23 12:35, Adhemerval Zanella Netto wrote:
>>
>>
>> On 05/01/23 20:17, Noah Goldstein wrote:
>>> On Mon, Sep 19, 2022 at 1:04 PM Adhemerval Zanella via Libc-alpha
>>> <libc-alpha@sourceware.org> wrote:
>>>>
>>>> New algorithm have the following key differences:
>>>>
>>>>    - Reads first word unaligned and use string-maskoff function to
>>>>      remove unwanted data.  This strategy follow arch-specific
>>>>      optimization used on aarch64 and powerpc.
>>>>
>>>>    - Use string-fz{b,i} functions.
>>>>
>>>> Checked on x86_64-linux-gnu, i686-linux-gnu, powerpc64-linux-gnu,
>>>> and powerpc-linux-gnu by removing the arch-specific assembly
>>>> implementation and disabling multi-arch (it covers both LE and BE
>>>> for 64 and 32 bits).
>>>>
>>>> Co-authored-by: Richard Henderson  <rth@twiddle.net>
>>>> ---
>>>>   string/strchrnul.c                            | 156 +++---------------
>>>>   .../power4/multiarch/strchrnul-ppc32.c        |   4 -
>>>>   sysdeps/s390/strchrnul-c.c                    |   2 -
>>>>   3 files changed, 24 insertions(+), 138 deletions(-)
>>>>
>>>> diff --git a/string/strchrnul.c b/string/strchrnul.c
>>>> index 0cc1fc6bb0..67defa3dab 100644
>>>> --- a/string/strchrnul.c
>>>> +++ b/string/strchrnul.c
>>>> @@ -1,10 +1,5 @@
>>>>   /* Copyright (C) 1991-2022 Free Software Foundation, Inc.
>>>>      This file is part of the GNU C Library.
>>>> -   Based on strlen implementation by Torbjorn Granlund (tege@sics.se),
>>>> -   with help from Dan Sahlin (dan@sics.se) and
>>>> -   bug fix and commentary by Jim Blandy (jimb@ai.mit.edu);
>>>> -   adaptation to strchr suggested by Dick Karpinski (dick@cca.ucsf.edu),
>>>> -   and implemented by Roland McGrath (roland@ai.mit.edu).
>>>>
>>>>      The GNU C Library is free software; you can redistribute it and/or
>>>>      modify it under the terms of the GNU Lesser General Public
>>>> @@ -21,146 +16,43 @@
>>>>      <https://www.gnu.org/licenses/>.  */
>>>>
>>>>   #include <string.h>
>>>> -#include <memcopy.h>
>>>>   #include <stdlib.h>
>>>> +#include <stdint.h>
>>>> +#include <string-fza.h>
>>>> +#include <string-fzb.h>
>>>> +#include <string-fzi.h>
>>>> +#include <string-maskoff.h>
>>>>
>>>>   #undef __strchrnul
>>>>   #undef strchrnul
>>>>
>>>> -#ifndef STRCHRNUL
>>>> -# define STRCHRNUL __strchrnul
>>>> +#ifdef STRCHRNUL
>>>> +# define __strchrnul STRCHRNUL
>>>>   #endif
>>>>
>>>>   /* Find the first occurrence of C in S or the final NUL byte.  */
>>>>   char *
>>>> -STRCHRNUL (const char *s, int c_in)
>>>> +__strchrnul (const char *str, int c_in)
>>>>   {
>>>> -  const unsigned char *char_ptr;
>>>> -  const unsigned long int *longword_ptr;
>>>> -  unsigned long int longword, magic_bits, charmask;
>>>> -  unsigned char c;
>>>> -
>>>> -  c = (unsigned char) c_in;
>>>> -
>>>> -  /* Handle the first few characters by reading one character at a time.
>>>> -     Do this until CHAR_PTR is aligned on a longword boundary.  */
>>>> -  for (char_ptr = (const unsigned char *) s;
>>>> -       ((unsigned long int) char_ptr & (sizeof (longword) - 1)) != 0;
>>>> -       ++char_ptr)
>>>> -    if (*char_ptr == c || *char_ptr == '\0')
>>>> -      return (void *) char_ptr;
>>>> -
>>>> -  /* All these elucidatory comments refer to 4-byte longwords,
>>>> -     but the theory applies equally well to 8-byte longwords.  */
>>>> -
>>>> -  longword_ptr = (unsigned long int *) char_ptr;
>>>> -
>>>> -  /* Bits 31, 24, 16, and 8 of this number are zero.  Call these bits
>>>> -     the "holes."  Note that there is a hole just to the left of
>>>> -     each byte, with an extra at the end:
>>>> -
>>>> -     bits:  01111110 11111110 11111110 11111111
>>>> -     bytes: AAAAAAAA BBBBBBBB CCCCCCCC DDDDDDDD
>>>> -
>>>> -     The 1-bits make sure that carries propagate to the next 0-bit.
>>>> -     The 0-bits provide holes for carries to fall into.  */
>>>> -  magic_bits = -1;
>>>> -  magic_bits = magic_bits / 0xff * 0xfe << 1 >> 1 | 1;
>>>> -
>>>> -  /* Set up a longword, each of whose bytes is C.  */
>>>> -  charmask = c | (c << 8);
>>>> -  charmask |= charmask << 16;
>>>> -  if (sizeof (longword) > 4)
>>>> -    /* Do the shift in two steps to avoid a warning if long has 32 bits.  */
>>>> -    charmask |= (charmask << 16) << 16;
>>>> -  if (sizeof (longword) > 8)
>>>> -    abort ();
>>>> -
>>>> -  /* Instead of the traditional loop which tests each character,
>>>> -     we will test a longword at a time.  The tricky part is testing
>>>> -     if *any of the four* bytes in the longword in question are zero.  */
>>>> -  for (;;)
>>>> -    {
>>>> -      /* We tentatively exit the loop if adding MAGIC_BITS to
>>>> -        LONGWORD fails to change any of the hole bits of LONGWORD.
>>>> -
>>>> -        1) Is this safe?  Will it catch all the zero bytes?
>>>> -        Suppose there is a byte with all zeros.  Any carry bits
>>>> -        propagating from its left will fall into the hole at its
>>>> -        least significant bit and stop.  Since there will be no
>>>> -        carry from its most significant bit, the LSB of the
>>>> -        byte to the left will be unchanged, and the zero will be
>>>> -        detected.
>>>> +  /* Set up a word, each of whose bytes is C.  */
>>>> +  op_t repeated_c = repeat_bytes (c_in);
>>>>
>>>> -        2) Is this worthwhile?  Will it ignore everything except
>>>> -        zero bytes?  Suppose every byte of LONGWORD has a bit set
>>>> -        somewhere.  There will be a carry into bit 8.  If bit 8
>>>> -        is set, this will carry into bit 16.  If bit 8 is clear,
>>>> -        one of bits 9-15 must be set, so there will be a carry
>>>> -        into bit 16.  Similarly, there will be a carry into bit
>>>> -        24.  If one of bits 24-30 is set, there will be a carry
>>>> -        into bit 31, so all of the hole bits will be changed.
>>>> +  /* Align the input address to op_t.  */
>>>> +  uintptr_t s_int = (uintptr_t) str;
>>>> +  const op_t *word_ptr = word_containing (str);
>>>>
>>>> -        The one misfire occurs when bits 24-30 are clear and bit
>>>> -        31 is set; in this case, the hole at bit 31 is not
>>>> -        changed.  If we had access to the processor carry flag,
>>>> -        we could close this loophole by putting the fourth hole
>>>> -        at bit 32!
>>>> +  /* Read the first aligned word, but force bytes before the string to
>>>> +     match neither zero nor goal (we make sure the high bit of each byte
>>>> +     is 1, and the low 7 bits are all the opposite of the goal byte).  */
>>>> +  op_t bmask = create_mask (s_int);
>>>> +  op_t word = (*word_ptr | bmask) ^ (repeated_c & highbit_mask (bmask));
>>>
>>> Think much clearer (and probably better codegen) is:
>>> find_zero_eq_low/all(word, repeated) >> (s_int * CHAR_BIT)
>>
>> It does not seem to work, at least not replacing the two lines with:
>>
>>    op_t word = find_zero_eq_all/low (*word_ptr, repeated_c) >> (s_int * CHAR_BIT);
> 
> Oh, two fine points:
> 
> (1) big-endian would want shifting left,
> (2) alpha would want shifting by bits not bytes,
>     because the cmpbge insn produces an 8-bit mask.
> 
> so you'd need to hide this shift in the headers like create_mask().

Alright, the following works:


static __always_inline op_t
check_mask (op_t word, uintptr_t s_int)
{
  if (__BYTE_ORDER == __LITTLE_ENDIAN)
    return word >> (CHAR_BIT * (s_int % sizeof (s_int)));
  else
    return word << (CHAR_BIT * (s_int % sizeof (s_int)));
}

char *
__strchrnul (const char *str, int c_in)
{
  op_t repeated_c = repeat_bytes (c_in);

  uintptr_t s_int = (uintptr_t) str;
  const op_t *word_ptr = word_containing (str);

  op_t word = *word_ptr;

  op_t mask = check_mask (find_zero_eq_all (word, repeated_c), s_int);
  if (mask != 0)
    return (char *) str + index_first_(mask);

  do
    word = *++word_ptr;
  while (! has_zero_eq (word, repeated_c));

  op_t found = index_first_zero_eq (word, repeated_c);
  return (char *) word_ptr + found;
}

 
I had to use find_zero_eq_all to avoid the uninitialized bytes that triggered
some regressions on tests that use strchr (for instance test-strpbrk).

I will update the patch based on this version.
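For readers of the archive, the version above can be fleshed out into a
self-contained sketch.  The stand-in definitions below are illustrative
only (little-endian, sizeof (op_t) == sizeof (uintptr_t) assumed), not the
actual string-fz{a,b,i}.h contents, and my_strchrnul is a hypothetical name:

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>
#include <string.h>

typedef uintptr_t op_t;

/* Broadcast byte C into every byte of a word.  */
static inline op_t
repeat_bytes (int c_in)
{
  return ((op_t) -1 / 0xff) * (unsigned char) c_in;
}

/* High bit of each byte set exactly where the byte of X is zero.  */
static inline op_t
find_zero_all (op_t x)
{
  op_t lsb = repeat_bytes (0x01), msb = repeat_bytes (0x80);
  return ~(x | ((x | msb) - lsb)) & msb;
}

static inline op_t
find_zero_eq_all (op_t x, op_t repeated_c)
{
  return find_zero_all (x) | find_zero_all (x ^ repeated_c);
}

static inline int
has_zero_eq (op_t x, op_t repeated_c)
{
  return find_zero_eq_all (x, repeated_c) != 0;
}

/* Byte index of the first marker (little-endian: lowest set bit).  */
static inline size_t
index_first_ (op_t mask)
{
  return __builtin_ctzl (mask) / CHAR_BIT;
}

static inline size_t
index_first_zero_eq (op_t x, op_t repeated_c)
{
  return index_first_ (find_zero_eq_all (x, repeated_c));
}

/* Round P down to the aligned word containing it.  */
static inline const op_t *
word_containing (const char *p)
{
  return (const op_t *) ((uintptr_t) p & -sizeof (op_t));
}

/* Little-endian only: shift out markers for bytes before the string.  */
static inline op_t
check_mask (op_t word, uintptr_t s_int)
{
  return word >> (CHAR_BIT * (s_int % sizeof (op_t)));
}

char *
my_strchrnul (const char *str, int c_in)
{
  op_t repeated_c = repeat_bytes (c_in);
  uintptr_t s_int = (uintptr_t) str;
  const op_t *word_ptr = word_containing (str);

  /* First aligned word: drop markers belonging to bytes before STR.  */
  op_t mask = check_mask (find_zero_eq_all (*word_ptr, repeated_c), s_int);
  if (mask != 0)
    return (char *) str + index_first_ (mask);

  op_t word;
  do
    word = *++word_ptr;
  while (! has_zero_eq (word, repeated_c));

  return (char *) word_ptr + index_first_zero_eq (word, repeated_c);
}
```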

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v5 10/17] string: Improve generic memchr
  2023-01-09 21:26       ` Noah Goldstein
@ 2023-01-10 14:33         ` Adhemerval Zanella Netto
  0 siblings, 0 replies; 55+ messages in thread
From: Adhemerval Zanella Netto @ 2023-01-10 14:33 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: libc-alpha, Richard Henderson



On 09/01/23 18:26, Noah Goldstein wrote:
> On Mon, Jan 9, 2023 at 12:51 PM Adhemerval Zanella Netto
> <adhemerval.zanella@linaro.org> wrote:
>>
>>
>>
>> On 05/01/23 20:49, Noah Goldstein wrote:
>>> On Mon, Sep 19, 2022 at 1:05 PM Adhemerval Zanella via Libc-alpha
>>> <libc-alpha@sourceware.org> wrote:
>>>>
>>>> New algorithm have the following key differences:
>>>>
>>>>   - Reads first word unaligned and use string-maskoff function to
>>>>     remove unwanted data.  This strategy follow arch-specific
>>>>     optimization used on aarch64 and powerpc.
>>>>
>>>>   - Use string-fz{b,i} and string-opthr functions.
>>>>
>>>> Checked on x86_64-linux-gnu, i686-linux-gnu, powerpc-linux-gnu,
>>>> and powerpc64-linux-gnu by removing the arch-specific assembly
>>>> implementation and disabling multi-arch (it covers both LE and BE
>>>> for 64 and 32 bits).
>>>>
>>>> Co-authored-by: Richard Henderson  <rth@twiddle.net>
>>>> ---
>>>>  string/memchr.c                               | 168 +++++-------------
>>>>  .../powerpc32/power4/multiarch/memchr-ppc32.c |  14 +-
>>>>  .../powerpc64/multiarch/memchr-ppc64.c        |   9 +-
>>>>  3 files changed, 48 insertions(+), 143 deletions(-)
>>>>
>>>> diff --git a/string/memchr.c b/string/memchr.c
>>>> index 422bcd0cd6..08d518b02d 100644
>>>> --- a/string/memchr.c
>>>> +++ b/string/memchr.c
>>>> @@ -1,10 +1,6 @@
>>>> -/* Copyright (C) 1991-2022 Free Software Foundation, Inc.
>>>> +/* Scan memory for a character.  Generic version
>>>> +   Copyright (C) 1991-2022 Free Software Foundation, Inc.
>>>>     This file is part of the GNU C Library.
>>>> -   Based on strlen implementation by Torbjorn Granlund (tege@sics.se),
>>>> -   with help from Dan Sahlin (dan@sics.se) and
>>>> -   commentary by Jim Blandy (jimb@ai.mit.edu);
>>>> -   adaptation to memchr suggested by Dick Karpinski (dick@cca.ucsf.edu),
>>>> -   and implemented by Roland McGrath (roland@ai.mit.edu).
>>>>
>>>>     The GNU C Library is free software; you can redistribute it and/or
>>>>     modify it under the terms of the GNU Lesser General Public
>>>> @@ -20,143 +16,65 @@
>>>>     License along with the GNU C Library; if not, see
>>>>     <https://www.gnu.org/licenses/>.  */
>>>>
>>>> -#ifndef _LIBC
>>>> -# include <config.h>
>>>> -#endif
>>>> -
>>>> +#include <intprops.h>
>>>> +#include <string-fza.h>
>>>> +#include <string-fzb.h>
>>>> +#include <string-fzi.h>
>>>> +#include <string-maskoff.h>
>>>> +#include <string-opthr.h>
>>>>  #include <string.h>
>>>>
>>>> -#include <stddef.h>
>>>> +#undef memchr
>>>>
>>>> -#include <limits.h>
>>>> -
>>>> -#undef __memchr
>>>> -#ifdef _LIBC
>>>> -# undef memchr
>>>> +#ifdef MEMCHR
>>>> +# define __memchr MEMCHR
>>>>  #endif
>>>>
>>>> -#ifndef weak_alias
>>>> -# define __memchr memchr
>>>> -#endif
>>>> -
>>>> -#ifndef MEMCHR
>>>> -# define MEMCHR __memchr
>>>> -#endif
>>>> +static inline const char *
>>>> +sadd (uintptr_t x, uintptr_t y)
>>>> +{
>>>> +  uintptr_t ret = INT_ADD_OVERFLOW (x, y) ? (uintptr_t)-1 : x + y;
>>>> +  return (const char *)ret;
>>>> +}
>>>>
>>>>  /* Search no more than N bytes of S for C.  */
>>>>  void *
>>>> -MEMCHR (void const *s, int c_in, size_t n)
>>>> +__memchr (void const *s, int c_in, size_t n)
>>>>  {
>>>> -  /* On 32-bit hardware, choosing longword to be a 32-bit unsigned
>>>> -     long instead of a 64-bit uintmax_t tends to give better
>>>> -     performance.  On 64-bit hardware, unsigned long is generally 64
>>>> -     bits already.  Change this typedef to experiment with
>>>> -     performance.  */
>>>> -  typedef unsigned long int longword;
>>>> +  if (__glibc_unlikely (n == 0))
>>>> +    return NULL;
>>>>
>>>> -  const unsigned char *char_ptr;
>>>> -  const longword *longword_ptr;
>>>> -  longword repeated_one;
>>>> -  longword repeated_c;
>>>> -  unsigned char c;
>>>> +  uintptr_t s_int = (uintptr_t) s;
>>>>
>>>> -  c = (unsigned char) c_in;
>>>> +  /* Set up a word, each of whose bytes is C.  */
>>>> +  op_t repeated_c = repeat_bytes (c_in);
>>>> +  op_t before_mask = create_mask (s_int);
>>>>
>>>> -  /* Handle the first few bytes by reading one byte at a time.
>>>> -     Do this until CHAR_PTR is aligned on a longword boundary.  */
>>>> -  for (char_ptr = (const unsigned char *) s;
>>>> -       n > 0 && (size_t) char_ptr % sizeof (longword) != 0;
>>>> -       --n, ++char_ptr)
>>>> -    if (*char_ptr == c)
>>>> -      return (void *) char_ptr;
>>>> +  /* Compute the address of the last byte, taking possible overflow into
>>>> +     account.  */
>>>> +  const char *lbyte = sadd (s_int, n - 1);
>>>>
>>>> -  longword_ptr = (const longword *) char_ptr;
>>>> +  /* Compute the address of the word containing the last byte.  */
>>>> +  const op_t *lword = word_containing (lbyte);
>>>>
>>>> -  /* All these elucidatory comments refer to 4-byte longwords,
>>>> -     but the theory applies equally well to any size longwords.  */
>>>> +  /* Read the first word, but munge it so that bytes before the array
>>>> +     will not match goal.  */
>>>> +  const op_t *word_ptr = word_containing (s);
>>>> +  op_t word = (*word_ptr | before_mask) ^ (repeated_c & before_mask);
>>>>
>>>> -  /* Compute auxiliary longword values:
>>>> -     repeated_one is a value which has a 1 in every byte.
>>>> -     repeated_c has c in every byte.  */
>>>> -  repeated_one = 0x01010101;
>>>> -  repeated_c = c | (c << 8);
>>>> -  repeated_c |= repeated_c << 16;
>>>> -  if (0xffffffffU < (longword) -1)
>>>> +  while (has_eq (word, repeated_c) == 0)
>>>>      {
>>>> -      repeated_one |= repeated_one << 31 << 1;
>>>> -      repeated_c |= repeated_c << 31 << 1;
>>>> -      if (8 < sizeof (longword))
>>>> -       {
>>>> -         size_t i;
>>>> -
>>>> -         for (i = 64; i < sizeof (longword) * 8; i *= 2)
>>>> -           {
>>>> -             repeated_one |= repeated_one << i;
>>>> -             repeated_c |= repeated_c << i;
>>>> -           }
>>>> -       }
>>>> +      if (word_ptr == lword)
>>>> +       return NULL;
>>> Intuitively, making lword be lword - 1, so that normal returns don't need
>>> the extra null check, would be faster.
>>
>> Hum, I did not follow; could you explain in more detail what you mean here?
> 
> I was thinking something like:
> 
> ```
>   op_t word = *word_ptr;
>   op_t mask = find_eq_low (word, repeated_c)
>       >> (CHAR_BIT * (s_int % sizeof (uintptr_t)));
>   if (mask)
>     {
>       char *ret = (char *) s + index_first_ (mask);
>       return (ret <= lbyte) ? ret : NULL;
>     }
>   if (word_ptr == lword)
>     return NULL;
> 
>   word = *++word_ptr;
>   while (word_ptr != lword)
>     {
>       if (has_eq (word, repeated_c))
>         return (char *) word_ptr + index_first_eq (word, repeated_c);
>       word = *++word_ptr;
>     }
> 
>   if (has_eq (word, repeated_c))
>     {
>       /* We found a match, but it might be in a byte past the end
>          of the array.  */
>       char *ret = (char *) word_ptr + index_first_eq (word, repeated_c);
>       if (ret <= lbyte)
>         return ret;
>     }
>   return NULL;
> ```
> 
> The idea is that, until the last byte, you don't need the extra bounds check
> (tested with test-memchr.c on little-endian).

Alright, this works. I will update the patch.
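For anyone following along without the string-fz{a,b} headers handy, the has_eq/has_zero primitives discussed above boil down to the classic word-at-a-time zero-byte test. A minimal standalone sketch (the names mirror the glibc helpers, but this is an illustration, not the actual implementation):

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

typedef uintptr_t op_t;

/* Replicate byte C into every byte of a word.  */
static op_t
repeat_bytes (unsigned char c)
{
  return ((op_t) -1 / 0xff) * c;
}

/* Nonzero iff some byte of X is zero (exact variant of the classic
   (x - 0x01...01) & ~x & 0x80...80 test).  */
static int
has_zero (op_t x)
{
  op_t lsb = (op_t) -1 / 0xff;           /* 0x01...01 */
  op_t msb = lsb << (CHAR_BIT - 1);      /* 0x80...80 */
  return ((x - lsb) & ~x & msb) != 0;
}

/* Nonzero iff some byte of X equals the byte replicated in REPEATED_C:
   XOR turns a matching byte into a zero byte.  */
static int
has_eq (op_t x, op_t repeated_c)
{
  return has_zero (x ^ repeated_c);
}
```

The find_eq_low/index_first_eq helpers then refine the same msb-per-byte mask into the position of the first match.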

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v5 08/17] string: Improve generic strchrnul
  2023-01-10 14:18         ` Adhemerval Zanella Netto
@ 2023-01-10 16:24           ` Richard Henderson
  2023-01-10 17:16             ` Noah Goldstein
  2023-01-10 17:17           ` Noah Goldstein
  1 sibling, 1 reply; 55+ messages in thread
From: Richard Henderson @ 2023-01-10 16:24 UTC (permalink / raw)
  To: Adhemerval Zanella Netto, Richard Henderson, Noah Goldstein; +Cc: libc-alpha

On 1/10/23 06:18, Adhemerval Zanella Netto via Libc-alpha wrote:
> static __always_inline op_t
> check_mask (op_t word, uintptr_t s_int)
> {
>    if (__BYTE_ORDER == __LITTLE_ENDIAN)
>      return word >> (CHAR_BIT * (s_int % sizeof (s_int)));
>    else
>      return word << (CHAR_BIT * (s_int % sizeof (s_int)));
> }

sizeof(op_t), which is usually the same size, but doesn't have to be.
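To make the shift concrete (a made-up little-endian example with a fixed 32-bit stand-in for op_t, so the numbers are easy to follow; not the actual glibc code):

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* 32-bit stand-in for op_t; little-endian byte order assumed.  */
typedef uint32_t op32_t;

/* Discard comparison-mask lanes for bytes that precede the start of the
   string, shifting by the misalignment in bits (using sizeof (op32_t),
   per the correction above).  */
static op32_t
check_mask_le (op32_t mask, uintptr_t s_int)
{
  return mask >> (CHAR_BIT * (s_int % sizeof (op32_t)));
}
```

If the aligned word covers offsets 0..3 and the string starts at offset 2, a comparison mask that flagged the two junk bytes (0x00008080) plus a real match in byte 3 (0x80000000) shifts down to 0x8000: only the real match survives, in lane 1 relative to the string start.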


r~

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v5 08/17] string: Improve generic strchrnul
  2023-01-10 16:24           ` Richard Henderson
@ 2023-01-10 17:16             ` Noah Goldstein
  2023-01-10 18:19               ` Adhemerval Zanella Netto
  0 siblings, 1 reply; 55+ messages in thread
From: Noah Goldstein @ 2023-01-10 17:16 UTC (permalink / raw)
  To: Richard Henderson; +Cc: Adhemerval Zanella Netto, Richard Henderson, libc-alpha

On Tue, Jan 10, 2023 at 8:24 AM Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> On 1/10/23 06:18, Adhemerval Zanella Netto via Libc-alpha wrote:
> > static __always_inline op_t
> > check_mask (op_t word, uintptr_t s_int)
> > {
> >    if (__BYTE_ORDER == __LITTLE_ENDIAN)
> >      return word >> (CHAR_BIT * (s_int % sizeof (s_int)));
> >    else
> >      return word << (CHAR_BIT * (s_int % sizeof (s_int)));
> > }
>
> sizeof(op_t), which is usually the same size, but doesn't have to be.
>
Are we aligning by sizeof(op_t) or sizeof(void *)?
If the former, then `word_containing` will also need to be changed.

If the latter, then sizeof(s_int) is correct.
>
> r~

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v5 08/17] string: Improve generic strchrnul
  2023-01-10 14:18         ` Adhemerval Zanella Netto
  2023-01-10 16:24           ` Richard Henderson
@ 2023-01-10 17:17           ` Noah Goldstein
  2023-01-10 18:16             ` Adhemerval Zanella Netto
  1 sibling, 1 reply; 55+ messages in thread
From: Noah Goldstein @ 2023-01-10 17:17 UTC (permalink / raw)
  To: Adhemerval Zanella Netto; +Cc: Richard Henderson, libc-alpha

On Tue, Jan 10, 2023 at 6:18 AM Adhemerval Zanella Netto
<adhemerval.zanella@linaro.org> wrote:
>
>
>
> On 09/01/23 20:33, Richard Henderson wrote:
> > On 1/9/23 12:35, Adhemerval Zanella Netto wrote:
> >>
> >>
> >> On 05/01/23 20:17, Noah Goldstein wrote:
> >>> On Mon, Sep 19, 2022 at 1:04 PM Adhemerval Zanella via Libc-alpha
> >>> <libc-alpha@sourceware.org> wrote:
> >>>>
> >>>> The new algorithm has the following key differences:
> >>>>
> >>>>    - Reads the first word unaligned and uses the string-maskoff function
> >>>>      to remove unwanted data.  This strategy follows the arch-specific
> >>>>      optimization used on aarch64 and powerpc.
> >>>>
> >>>>    - Use string-fz{b,i} functions.
> >>>>
> >>>> Checked on x86_64-linux-gnu, i686-linux-gnu, powerpc64-linux-gnu,
> >>>> and powerpc-linux-gnu by removing the arch-specific assembly
> >>>> implementation and disabling multi-arch (it covers both LE and BE
> >>>> for 64 and 32 bits).
> >>>>
> >>>> Co-authored-by: Richard Henderson  <rth@twiddle.net>
> >>>> ---
> >>>>   string/strchrnul.c                            | 156 +++---------------
> >>>>   .../power4/multiarch/strchrnul-ppc32.c        |   4 -
> >>>>   sysdeps/s390/strchrnul-c.c                    |   2 -
> >>>>   3 files changed, 24 insertions(+), 138 deletions(-)
> >>>>
> >>>> diff --git a/string/strchrnul.c b/string/strchrnul.c
> >>>> index 0cc1fc6bb0..67defa3dab 100644
> >>>> --- a/string/strchrnul.c
> >>>> +++ b/string/strchrnul.c
> >>>> @@ -1,10 +1,5 @@
> >>>>   /* Copyright (C) 1991-2022 Free Software Foundation, Inc.
> >>>>      This file is part of the GNU C Library.
> >>>> -   Based on strlen implementation by Torbjorn Granlund (tege@sics.se),
> >>>> -   with help from Dan Sahlin (dan@sics.se) and
> >>>> -   bug fix and commentary by Jim Blandy (jimb@ai.mit.edu);
> >>>> -   adaptation to strchr suggested by Dick Karpinski (dick@cca.ucsf.edu),
> >>>> -   and implemented by Roland McGrath (roland@ai.mit.edu).
> >>>>
> >>>>      The GNU C Library is free software; you can redistribute it and/or
> >>>>      modify it under the terms of the GNU Lesser General Public
> >>>> @@ -21,146 +16,43 @@
> >>>>      <https://www.gnu.org/licenses/>.  */
> >>>>
> >>>>   #include <string.h>
> >>>> -#include <memcopy.h>
> >>>>   #include <stdlib.h>
> >>>> +#include <stdint.h>
> >>>> +#include <string-fza.h>
> >>>> +#include <string-fzb.h>
> >>>> +#include <string-fzi.h>
> >>>> +#include <string-maskoff.h>
> >>>>
> >>>>   #undef __strchrnul
> >>>>   #undef strchrnul
> >>>>
> >>>> -#ifndef STRCHRNUL
> >>>> -# define STRCHRNUL __strchrnul
> >>>> +#ifdef STRCHRNUL
> >>>> +# define __strchrnul STRCHRNUL
> >>>>   #endif
> >>>>
> >>>>   /* Find the first occurrence of C in S or the final NUL byte.  */
> >>>>   char *
> >>>> -STRCHRNUL (const char *s, int c_in)
> >>>> +__strchrnul (const char *str, int c_in)
> >>>>   {
> >>>> -  const unsigned char *char_ptr;
> >>>> -  const unsigned long int *longword_ptr;
> >>>> -  unsigned long int longword, magic_bits, charmask;
> >>>> -  unsigned char c;
> >>>> -
> >>>> -  c = (unsigned char) c_in;
> >>>> -
> >>>> -  /* Handle the first few characters by reading one character at a time.
> >>>> -     Do this until CHAR_PTR is aligned on a longword boundary.  */
> >>>> -  for (char_ptr = (const unsigned char *) s;
> >>>> -       ((unsigned long int) char_ptr & (sizeof (longword) - 1)) != 0;
> >>>> -       ++char_ptr)
> >>>> -    if (*char_ptr == c || *char_ptr == '\0')
> >>>> -      return (void *) char_ptr;
> >>>> -
> >>>> -  /* All these elucidatory comments refer to 4-byte longwords,
> >>>> -     but the theory applies equally well to 8-byte longwords.  */
> >>>> -
> >>>> -  longword_ptr = (unsigned long int *) char_ptr;
> >>>> -
> >>>> -  /* Bits 31, 24, 16, and 8 of this number are zero.  Call these bits
> >>>> -     the "holes."  Note that there is a hole just to the left of
> >>>> -     each byte, with an extra at the end:
> >>>> -
> >>>> -     bits:  01111110 11111110 11111110 11111111
> >>>> -     bytes: AAAAAAAA BBBBBBBB CCCCCCCC DDDDDDDD
> >>>> -
> >>>> -     The 1-bits make sure that carries propagate to the next 0-bit.
> >>>> -     The 0-bits provide holes for carries to fall into.  */
> >>>> -  magic_bits = -1;
> >>>> -  magic_bits = magic_bits / 0xff * 0xfe << 1 >> 1 | 1;
> >>>> -
> >>>> -  /* Set up a longword, each of whose bytes is C.  */
> >>>> -  charmask = c | (c << 8);
> >>>> -  charmask |= charmask << 16;
> >>>> -  if (sizeof (longword) > 4)
> >>>> -    /* Do the shift in two steps to avoid a warning if long has 32 bits.  */
> >>>> -    charmask |= (charmask << 16) << 16;
> >>>> -  if (sizeof (longword) > 8)
> >>>> -    abort ();
> >>>> -
> >>>> -  /* Instead of the traditional loop which tests each character,
> >>>> -     we will test a longword at a time.  The tricky part is testing
> >>>> -     if *any of the four* bytes in the longword in question are zero.  */
> >>>> -  for (;;)
> >>>> -    {
> >>>> -      /* We tentatively exit the loop if adding MAGIC_BITS to
> >>>> -        LONGWORD fails to change any of the hole bits of LONGWORD.
> >>>> -
> >>>> -        1) Is this safe?  Will it catch all the zero bytes?
> >>>> -        Suppose there is a byte with all zeros.  Any carry bits
> >>>> -        propagating from its left will fall into the hole at its
> >>>> -        least significant bit and stop.  Since there will be no
> >>>> -        carry from its most significant bit, the LSB of the
> >>>> -        byte to the left will be unchanged, and the zero will be
> >>>> -        detected.
> >>>> +  /* Set up a word, each of whose bytes is C.  */
> >>>> +  op_t repeated_c = repeat_bytes (c_in);
> >>>>
> >>>> -        2) Is this worthwhile?  Will it ignore everything except
> >>>> -        zero bytes?  Suppose every byte of LONGWORD has a bit set
> >>>> -        somewhere.  There will be a carry into bit 8.  If bit 8
> >>>> -        is set, this will carry into bit 16.  If bit 8 is clear,
> >>>> -        one of bits 9-15 must be set, so there will be a carry
> >>>> -        into bit 16.  Similarly, there will be a carry into bit
> >>>> -        24.  If one of bits 24-30 is set, there will be a carry
> >>>> -        into bit 31, so all of the hole bits will be changed.
> >>>> +  /* Align the input address to op_t.  */
> >>>> +  uintptr_t s_int = (uintptr_t) str;
> >>>> +  const op_t *word_ptr = word_containing (str);
> >>>>
> >>>> -        The one misfire occurs when bits 24-30 are clear and bit
> >>>> -        31 is set; in this case, the hole at bit 31 is not
> >>>> -        changed.  If we had access to the processor carry flag,
> >>>> -        we could close this loophole by putting the fourth hole
> >>>> -        at bit 32!
> >>>> +  /* Read the first aligned word, but force bytes before the string to
> >>>> +     match neither zero nor goal (we make sure the high bit of each byte
> >>>> +     is 1, and the low 7 bits are all the opposite of the goal byte).  */
> >>>> +  op_t bmask = create_mask (s_int);
> >>>> +  op_t word = (*word_ptr | bmask) ^ (repeated_c & highbit_mask (bmask));
> >>>
> >>> Think much clearer (and probably better codegen) is:
> >>> find_zero_eq_low/all(word, repeated) >> (s_int * CHAR_BIT)
> >>
> >> It does not seem to work, at least not replacing the two lines with:
> >>
> >>    op_t word = find_zero_eq_all/low (*word_ptr, repeated_c) >> (s_int * CHAR_BIT);
> >
> > Oh, two fine points:
> >
> > (1) big-endian would want shifting left,
> > (2) alpha would want shifting by bits not bytes,
> >     because the cmpbge insn produces an 8-bit mask.
> >
> > so you'd need to hide this shift in the headers like create_mask().
>
> Alright, the following works:
>
>
> static __always_inline op_t
> check_mask (op_t word, uintptr_t s_int)
> {
>   if (__BYTE_ORDER == __LITTLE_ENDIAN)
>     return word >> (CHAR_BIT * (s_int % sizeof (s_int)));
>   else
>     return word << (CHAR_BIT * (s_int % sizeof (s_int)));
> }

Imo put this in with "[PATCH v5 03/17] Add string-maskoff.h generic header";
I think it may also be needed for memchr.
>
> char *
> __strchrnul (const char *str, int c_in)
> {
>   op_t repeated_c = repeat_bytes (c_in);
>
>   uintptr_t s_int = (uintptr_t) str;
>   const op_t *word_ptr = word_containing (str);
>
>   op_t word = *word_ptr;
>
>   op_t mask = check_mask (find_zero_eq_all (word, repeated_c), s_int);
>   if (mask != 0)
>     return (char *) str + index_first_(mask);
>
>   do
>     word = *++word_ptr;
>   while (! has_zero_eq (word, repeated_c));
>
>   op_t found = index_first_zero_eq (word, repeated_c);
>   return (char *) word_ptr + found;
> }
>
>
> I had to use find_zero_eq_all to avoid uninitialized bytes, that triggered
> some regression on tests that use strchr (for instance test-strpbrk).
>
> I will update the patch based on this version.
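As an aside, a minimal little-endian index_first along the lines used above could look like the following. Illustrative only: the patch set routes this through count_trailing_zeros from longlong.h, and the GCC builtin merely stands in here.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

typedef uintptr_t op_t;

/* Byte index of the first (least significant, on little-endian) set
   lane in a nonzero comparison mask, where each matching byte has its
   high bit set.  */
static unsigned int
index_first (op_t mask)
{
  return (unsigned int) __builtin_ctzl (mask) / CHAR_BIT;
}
```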

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v5 08/17] string: Improve generic strchrnul
  2023-01-10 17:17           ` Noah Goldstein
@ 2023-01-10 18:16             ` Adhemerval Zanella Netto
  0 siblings, 0 replies; 55+ messages in thread
From: Adhemerval Zanella Netto @ 2023-01-10 18:16 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: Richard Henderson, libc-alpha



On 10/01/23 14:17, Noah Goldstein wrote:
> On Tue, Jan 10, 2023 at 6:18 AM Adhemerval Zanella Netto
> <adhemerval.zanella@linaro.org> wrote:
>>
>>
>>
>> On 09/01/23 20:33, Richard Henderson wrote:
>>> On 1/9/23 12:35, Adhemerval Zanella Netto wrote:
>>>>
>>>>
>>>> On 05/01/23 20:17, Noah Goldstein wrote:
>>>>> On Mon, Sep 19, 2022 at 1:04 PM Adhemerval Zanella via Libc-alpha
>>>>> <libc-alpha@sourceware.org> wrote:
>>>>>>
>>>>>> The new algorithm has the following key differences:
>>>>>>
>>>>>>    - Reads the first word unaligned and uses the string-maskoff function
>>>>>>      to remove unwanted data.  This strategy follows the arch-specific
>>>>>>      optimization used on aarch64 and powerpc.
>>>>>>
>>>>>>    - Use string-fz{b,i} functions.
>>>>>>
>>>>>> Checked on x86_64-linux-gnu, i686-linux-gnu, powerpc64-linux-gnu,
>>>>>> and powerpc-linux-gnu by removing the arch-specific assembly
>>>>>> implementation and disabling multi-arch (it covers both LE and BE
>>>>>> for 64 and 32 bits).
>>>>>>
>>>>>> Co-authored-by: Richard Henderson  <rth@twiddle.net>
>>>>>> ---
>>>>>>   string/strchrnul.c                            | 156 +++---------------
>>>>>>   .../power4/multiarch/strchrnul-ppc32.c        |   4 -
>>>>>>   sysdeps/s390/strchrnul-c.c                    |   2 -
>>>>>>   3 files changed, 24 insertions(+), 138 deletions(-)
>>>>>>
>>>>>> diff --git a/string/strchrnul.c b/string/strchrnul.c
>>>>>> index 0cc1fc6bb0..67defa3dab 100644
>>>>>> --- a/string/strchrnul.c
>>>>>> +++ b/string/strchrnul.c
>>>>>> @@ -1,10 +1,5 @@
>>>>>>   /* Copyright (C) 1991-2022 Free Software Foundation, Inc.
>>>>>>      This file is part of the GNU C Library.
>>>>>> -   Based on strlen implementation by Torbjorn Granlund (tege@sics.se),
>>>>>> -   with help from Dan Sahlin (dan@sics.se) and
>>>>>> -   bug fix and commentary by Jim Blandy (jimb@ai.mit.edu);
>>>>>> -   adaptation to strchr suggested by Dick Karpinski (dick@cca.ucsf.edu),
>>>>>> -   and implemented by Roland McGrath (roland@ai.mit.edu).
>>>>>>
>>>>>>      The GNU C Library is free software; you can redistribute it and/or
>>>>>>      modify it under the terms of the GNU Lesser General Public
>>>>>> @@ -21,146 +16,43 @@
>>>>>>      <https://www.gnu.org/licenses/>.  */
>>>>>>
>>>>>>   #include <string.h>
>>>>>> -#include <memcopy.h>
>>>>>>   #include <stdlib.h>
>>>>>> +#include <stdint.h>
>>>>>> +#include <string-fza.h>
>>>>>> +#include <string-fzb.h>
>>>>>> +#include <string-fzi.h>
>>>>>> +#include <string-maskoff.h>
>>>>>>
>>>>>>   #undef __strchrnul
>>>>>>   #undef strchrnul
>>>>>>
>>>>>> -#ifndef STRCHRNUL
>>>>>> -# define STRCHRNUL __strchrnul
>>>>>> +#ifdef STRCHRNUL
>>>>>> +# define __strchrnul STRCHRNUL
>>>>>>   #endif
>>>>>>
>>>>>>   /* Find the first occurrence of C in S or the final NUL byte.  */
>>>>>>   char *
>>>>>> -STRCHRNUL (const char *s, int c_in)
>>>>>> +__strchrnul (const char *str, int c_in)
>>>>>>   {
>>>>>> -  const unsigned char *char_ptr;
>>>>>> -  const unsigned long int *longword_ptr;
>>>>>> -  unsigned long int longword, magic_bits, charmask;
>>>>>> -  unsigned char c;
>>>>>> -
>>>>>> -  c = (unsigned char) c_in;
>>>>>> -
>>>>>> -  /* Handle the first few characters by reading one character at a time.
>>>>>> -     Do this until CHAR_PTR is aligned on a longword boundary.  */
>>>>>> -  for (char_ptr = (const unsigned char *) s;
>>>>>> -       ((unsigned long int) char_ptr & (sizeof (longword) - 1)) != 0;
>>>>>> -       ++char_ptr)
>>>>>> -    if (*char_ptr == c || *char_ptr == '\0')
>>>>>> -      return (void *) char_ptr;
>>>>>> -
>>>>>> -  /* All these elucidatory comments refer to 4-byte longwords,
>>>>>> -     but the theory applies equally well to 8-byte longwords.  */
>>>>>> -
>>>>>> -  longword_ptr = (unsigned long int *) char_ptr;
>>>>>> -
>>>>>> -  /* Bits 31, 24, 16, and 8 of this number are zero.  Call these bits
>>>>>> -     the "holes."  Note that there is a hole just to the left of
>>>>>> -     each byte, with an extra at the end:
>>>>>> -
>>>>>> -     bits:  01111110 11111110 11111110 11111111
>>>>>> -     bytes: AAAAAAAA BBBBBBBB CCCCCCCC DDDDDDDD
>>>>>> -
>>>>>> -     The 1-bits make sure that carries propagate to the next 0-bit.
>>>>>> -     The 0-bits provide holes for carries to fall into.  */
>>>>>> -  magic_bits = -1;
>>>>>> -  magic_bits = magic_bits / 0xff * 0xfe << 1 >> 1 | 1;
>>>>>> -
>>>>>> -  /* Set up a longword, each of whose bytes is C.  */
>>>>>> -  charmask = c | (c << 8);
>>>>>> -  charmask |= charmask << 16;
>>>>>> -  if (sizeof (longword) > 4)
>>>>>> -    /* Do the shift in two steps to avoid a warning if long has 32 bits.  */
>>>>>> -    charmask |= (charmask << 16) << 16;
>>>>>> -  if (sizeof (longword) > 8)
>>>>>> -    abort ();
>>>>>> -
>>>>>> -  /* Instead of the traditional loop which tests each character,
>>>>>> -     we will test a longword at a time.  The tricky part is testing
>>>>>> -     if *any of the four* bytes in the longword in question are zero.  */
>>>>>> -  for (;;)
>>>>>> -    {
>>>>>> -      /* We tentatively exit the loop if adding MAGIC_BITS to
>>>>>> -        LONGWORD fails to change any of the hole bits of LONGWORD.
>>>>>> -
>>>>>> -        1) Is this safe?  Will it catch all the zero bytes?
>>>>>> -        Suppose there is a byte with all zeros.  Any carry bits
>>>>>> -        propagating from its left will fall into the hole at its
>>>>>> -        least significant bit and stop.  Since there will be no
>>>>>> -        carry from its most significant bit, the LSB of the
>>>>>> -        byte to the left will be unchanged, and the zero will be
>>>>>> -        detected.
>>>>>> +  /* Set up a word, each of whose bytes is C.  */
>>>>>> +  op_t repeated_c = repeat_bytes (c_in);
>>>>>>
>>>>>> -        2) Is this worthwhile?  Will it ignore everything except
>>>>>> -        zero bytes?  Suppose every byte of LONGWORD has a bit set
>>>>>> -        somewhere.  There will be a carry into bit 8.  If bit 8
>>>>>> -        is set, this will carry into bit 16.  If bit 8 is clear,
>>>>>> -        one of bits 9-15 must be set, so there will be a carry
>>>>>> -        into bit 16.  Similarly, there will be a carry into bit
>>>>>> -        24.  If one of bits 24-30 is set, there will be a carry
>>>>>> -        into bit 31, so all of the hole bits will be changed.
>>>>>> +  /* Align the input address to op_t.  */
>>>>>> +  uintptr_t s_int = (uintptr_t) str;
>>>>>> +  const op_t *word_ptr = word_containing (str);
>>>>>>
>>>>>> -        The one misfire occurs when bits 24-30 are clear and bit
>>>>>> -        31 is set; in this case, the hole at bit 31 is not
>>>>>> -        changed.  If we had access to the processor carry flag,
>>>>>> -        we could close this loophole by putting the fourth hole
>>>>>> -        at bit 32!
>>>>>> +  /* Read the first aligned word, but force bytes before the string to
>>>>>> +     match neither zero nor goal (we make sure the high bit of each byte
>>>>>> +     is 1, and the low 7 bits are all the opposite of the goal byte).  */
>>>>>> +  op_t bmask = create_mask (s_int);
>>>>>> +  op_t word = (*word_ptr | bmask) ^ (repeated_c & highbit_mask (bmask));
>>>>>
>>>>> Think much clearer (and probably better codegen) is:
>>>>> find_zero_eq_low/all(word, repeated) >> (s_int * CHAR_BIT)
>>>>
>>>> It does not seem to work, at least not replacing the two lines with:
>>>>
>>>>    op_t word = find_zero_eq_all/low (*word_ptr, repeated_c) >> (s_int * CHAR_BIT);
>>>
>>> Oh, two fine points:
>>>
>>> (1) big-endian would want shifting left,
>>> (2) alpha would want shifting by bits not bytes,
>>>     because the cmpbge insn produces an 8-bit mask.
>>>
>>> so you'd need to hide this shift in the headers like create_mask().
>>
>> Alright, the following works:
>>
>>
>> static __always_inline op_t
>> check_mask (op_t word, uintptr_t s_int)
>> {
>>   if (__BYTE_ORDER == __LITTLE_ENDIAN)
>>     return word >> (CHAR_BIT * (s_int % sizeof (s_int)));
>>   else
>>     return word << (CHAR_BIT * (s_int % sizeof (s_int)));
>> }
> 
> Imo put this in with "[PATCH v5 03/17] Add string-maskoff.h generic header";
> I think it may also be needed for memchr.

Yeap, this is what I have done.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v5 08/17] string: Improve generic strchrnul
  2023-01-10 17:16             ` Noah Goldstein
@ 2023-01-10 18:19               ` Adhemerval Zanella Netto
  0 siblings, 0 replies; 55+ messages in thread
From: Adhemerval Zanella Netto @ 2023-01-10 18:19 UTC (permalink / raw)
  To: Noah Goldstein, Richard Henderson; +Cc: Richard Henderson, libc-alpha



On 10/01/23 14:16, Noah Goldstein wrote:
> On Tue, Jan 10, 2023 at 8:24 AM Richard Henderson
> <richard.henderson@linaro.org> wrote:
>>
>> On 1/10/23 06:18, Adhemerval Zanella Netto via Libc-alpha wrote:
>>> static __always_inline op_t
>>> check_mask (op_t word, uintptr_t s_int)
>>> {
>>>    if (__BYTE_ORDER == __LITTLE_ENDIAN)
>>>      return word >> (CHAR_BIT * (s_int % sizeof (s_int)));
>>>    else
>>>      return word << (CHAR_BIT * (s_int % sizeof (s_int)));
>>> }
>>
>> sizeof(op_t), which is usually the same size, but doesn't have to be.
>>
> Are we aligning by sizeof(op_t) or sizeof(void *)?
> If the former, then `word_containing` will also need to be changed.
> 
> If the latter, then sizeof(s_int) is correct.

The read/write operations are being done by op_t, so it makes sense to use
op_t in the sizeof.  It only really matters on ABIs with op_t different
from uintptr_t (mips64n32 and x32).

^ permalink raw reply	[flat|nested] 55+ messages in thread

end of thread, other threads:[~2023-01-10 18:19 UTC | newest]

Thread overview: 55+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-09-19 19:59 [PATCH v5 00/17] Improve generic string routines Adhemerval Zanella
2022-09-19 19:59 ` [PATCH v5 01/17] Parameterize op_t from memcopy.h Adhemerval Zanella
2022-09-19 19:59 ` [PATCH v5 02/17] Parameterize OP_T_THRES " Adhemerval Zanella
2022-09-20 10:49   ` Carlos O'Donell
2022-09-19 19:59 ` [PATCH v5 03/17] Add string-maskoff.h generic header Adhemerval Zanella
2022-09-20 11:43   ` Carlos O'Donell
2022-09-22 17:31     ` Adhemerval Zanella Netto
2023-01-05 22:49   ` Noah Goldstein
2023-01-05 23:26     ` Alejandro Colomar
2023-01-09 18:19       ` Adhemerval Zanella Netto
2023-01-09 18:02     ` Adhemerval Zanella Netto
2022-09-19 19:59 ` [PATCH v5 04/17] Add string vectorized find and detection functions Adhemerval Zanella
2023-01-05 22:53   ` Noah Goldstein
2023-01-09 18:51     ` Adhemerval Zanella Netto
2023-01-05 23:04   ` Noah Goldstein
2023-01-09 19:34     ` Adhemerval Zanella Netto
2022-09-19 19:59 ` [PATCH v5 05/17] string: Improve generic strlen Adhemerval Zanella
2022-09-19 19:59 ` [PATCH v5 06/17] string: Improve generic strnlen Adhemerval Zanella
2022-09-19 19:59 ` [PATCH v5 07/17] string: Improve generic strchr Adhemerval Zanella
2023-01-05 23:09   ` Noah Goldstein
2023-01-05 23:19     ` Noah Goldstein
2023-01-09 19:39       ` Adhemerval Zanella Netto
2022-09-19 19:59 ` [PATCH v5 08/17] string: Improve generic strchrnul Adhemerval Zanella
2023-01-05 23:17   ` Noah Goldstein
2023-01-09 20:35     ` Adhemerval Zanella Netto
2023-01-09 20:49       ` Richard Henderson
2023-01-09 20:59       ` Noah Goldstein
2023-01-09 21:01         ` Noah Goldstein
2023-01-09 23:33       ` Richard Henderson
2023-01-10 14:18         ` Adhemerval Zanella Netto
2023-01-10 16:24           ` Richard Henderson
2023-01-10 17:16             ` Noah Goldstein
2023-01-10 18:19               ` Adhemerval Zanella Netto
2023-01-10 17:17           ` Noah Goldstein
2023-01-10 18:16             ` Adhemerval Zanella Netto
2022-09-19 19:59 ` [PATCH v5 09/17] string: Improve generic strcmp Adhemerval Zanella
2022-09-19 19:59 ` [PATCH v5 10/17] string: Improve generic memchr Adhemerval Zanella
2023-01-05 23:47   ` Noah Goldstein
2023-01-09 20:50     ` Adhemerval Zanella Netto
2023-01-05 23:49   ` Noah Goldstein
2023-01-09 20:51     ` Adhemerval Zanella Netto
2023-01-09 21:26       ` Noah Goldstein
2023-01-10 14:33         ` Adhemerval Zanella Netto
2022-09-19 19:59 ` [PATCH v5 11/17] string: Improve generic memrchr Adhemerval Zanella
2023-01-05 23:51   ` Noah Goldstein
2022-09-19 19:59 ` [PATCH v5 12/17] hppa: Add memcopy.h Adhemerval Zanella
2022-09-19 19:59 ` [PATCH v5 13/17] hppa: Add string-fzb.h and string-fzi.h Adhemerval Zanella
2022-09-19 19:59 ` [PATCH v5 14/17] alpha: " Adhemerval Zanella
2022-09-19 19:59 ` [PATCH v5 15/17] arm: Add string-fza.h Adhemerval Zanella
2022-09-19 19:59 ` [PATCH v5 16/17] powerpc: " Adhemerval Zanella
2022-09-19 19:59 ` [PATCH v5 17/17] sh: Add string-fzb.h Adhemerval Zanella
2022-12-05 17:07 ` [PATCH v5 00/17] Improve generic string routines Xi Ruoyao
2023-01-05 21:56   ` Adhemerval Zanella Netto
2023-01-05 23:52     ` Noah Goldstein
2023-01-06 13:43       ` Adhemerval Zanella Netto
