public inbox for libc-alpha@sourceware.org
* [PATCH v6] sysdeps/x86_64/multiarch/memmem-avx2.c: add memmem-avx2.c
@ 2024-02-18  8:26 James Tirta Halim
  2024-02-19  0:07 ` Noah Goldstein
  2024-02-21  6:57 ` [PATCH v7] " James Tirta Halim
  0 siblings, 2 replies; 25+ messages in thread
From: James Tirta Halim @ 2024-02-18  8:26 UTC (permalink / raw)
  To: libc-alpha; +Cc: goldstein.w.n, James Tirta Halim

Find the rarest byte in NE. Find the parts of HS that match the rare byte
and the byte after it. If found, shift back to the start of NE in HS and
vector-compare the first VEC_SIZE bytes with NE. If they match, compare the
rest with MEMCMPEQ.
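
A scalar sketch of the algorithm (illustration only, not part of the
patch; rarity_table stands in for ___rarebyte_table below, ne_len >= 2 is
assumed since shorter needles take the memchr path, and the real code
tests the two filter bytes VEC_SIZE positions at a time):

#include <stddef.h>
#include <string.h>

/* Stand-in for the patch's ___rarebyte_table (lower = rarer).  */
static const unsigned char rarity_table[256] = { 0 };

static void *
memmem_sketch (const unsigned char *hs, size_t hs_len,
               const unsigned char *ne, size_t ne_len)
{
  /* Position of the rarest needle byte.  */
  size_t rare = 0;
  for (size_t i = 1; i < ne_len; i++)
    if (rarity_table[ne[i]] < rarity_table[ne[rare]])
      rare = i;
  /* The filter also reads the byte after RARE, so back off by one
     when RARE is the last needle byte.  */
  if (rare == ne_len - 1)
    --rare;
  for (size_t i = rare; i + ne_len - rare <= hs_len; i++)
    /* Cheap two-byte filter, then shift back to the would-be start
       of NE in HS and verify in full.  */
    if (hs[i] == ne[rare] && hs[i + 1] == ne[rare + 1]
        && memcmp (hs + i - rare, ne, ne_len) == 0)
      return (void *) (hs + i - rare);
  return NULL;
}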

Timings (Core i3-1115G4):
          basic_memmem  twoway_memmem  __memmem_avx512  __memmem_avx2  __memmem_generic
Total:    6.80124e+06   1.06087e+06    219483           345385         768041
Average:  25958.9       4049.11        837.721          1318.26        2931.45

Passes make check.

Changes in v1:
1. Add memmem-avx2.c

Changes in v2:
1. Add avx512 support with a generic header file
2. Use __memcmpeq instead of memcmp
3. Remove scalar loop
4. Fix unsafe unaligned load

Changes in v3:
1. Avoid checking for alignment to the start of the page since that will be rare
2. Use __memcmpeq instead of __memcmpeq_avx2 (it generates undefined
reference errors)
3. Add memmem.c (needs review)
4. Add __memcmpeq_avx2 and __memcmpeq_avx512 to ifunc-impl-list.c (needs
review)
5. Add libc_hidden_builtin_def and MEMMEM to memmem.c (needs review)

Changes in v4:
1. Correct the cpu feature checks in ifunc-impl-list.c and memmem.c to
use AVX512BW and BMI1 for the AVX512 version, and AVX2 and BMI1 for the
AVX2 version
2. Correct the Makefile to use the appropriate flags
3. Rename memmem-vectorized-avx.h to memmem-avx-base.h
4. Remove unused vector macros (POPCNT and LZCNT)

Changes in v5:
1. Rename SHIFT to RARE, OFF to OFF_S, OFF2 to OFF_E
2. Remove conditional for VEC_SIZE and ONES, and remove unused MASK_SIZE
3. Add comments
4. Limit needle length to VEC_SIZE when finding the rare byte

Changes in v6:
1. Fix patch apply error in memmem.c
2. Correctly use MIN(ne_len, VEC_SIZE) when checking whether RARE is found
at the end of the needle
3. Always do unaligned load at the tail code
4. Rename rarebyte_table to ___rarebyte_table
5. Add memmem-avx-base.c in which ___rarebyte_table is defined
6. Add memmem-avx-base to the Makefile
7. Add always_inline to find_rarest_byte
8. Change ((m << off) >> off) to (m & (ONES >> off)) (see the sketch after
this list)
9. Change void * to unsigned char * in find_rarest_byte
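
The two forms in item 8 clear the same high OFF bits of an unsigned mask;
a quick check (32-bit mask assumed for illustration):

#include <assert.h>
#include <stdint.h>

int
main (void)
{
  uint32_t m = 0xdeadbeef;      /* arbitrary match mask */
  for (unsigned int off = 0; off < 32; off++)
    /* Both clear the OFF most-significant bits; the second form avoids
       the double shift.  */
    assert (((m << off) >> off) == (m & (0xffffffffu >> off)));
  return 0;
}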

---
 string/memmem.c                            |   7 +-
 sysdeps/x86_64/multiarch/Makefile          |   6 +
 sysdeps/x86_64/multiarch/ifunc-impl-list.c |  12 ++
 sysdeps/x86_64/multiarch/memmem-avx-base.c |  20 +++
 sysdeps/x86_64/multiarch/memmem-avx-base.h | 183 +++++++++++++++++++++
 sysdeps/x86_64/multiarch/memmem-avx2.c     |   3 +
 sysdeps/x86_64/multiarch/memmem-avx512.c   |  12 ++
 sysdeps/x86_64/multiarch/memmem.c          |  67 ++++++++
 8 files changed, 309 insertions(+), 1 deletion(-)
 create mode 100644 sysdeps/x86_64/multiarch/memmem-avx-base.c
 create mode 100644 sysdeps/x86_64/multiarch/memmem-avx-base.h
 create mode 100644 sysdeps/x86_64/multiarch/memmem-avx2.c
 create mode 100644 sysdeps/x86_64/multiarch/memmem-avx512.c
 create mode 100644 sysdeps/x86_64/multiarch/memmem.c

diff --git a/string/memmem.c b/string/memmem.c
index a4117f8e1e..a315c7d0b5 100644
--- a/string/memmem.c
+++ b/string/memmem.c
@@ -25,6 +25,10 @@
 # define __memmem	memmem
 #endif
 
+#ifndef MEMMEM
+# define MEMMEM __memmem
+#endif
+
 #define RETURN_TYPE void *
 #define AVAILABLE(h, h_l, j, n_l) ((j) <= (h_l) - (n_l))
 #define FASTSEARCH(S,C,N) (void*) memchr ((void *)(S), (C), (N))
@@ -50,7 +54,7 @@
    The limit also implies worst-case performance is linear.
    Needles larger than 256 characters use the linear-time Two-Way algorithm.  */
 void *
-__memmem (const void *haystack, size_t hs_len,
+MEMMEM (const void *haystack, size_t hs_len,
 	  const void *needle, size_t ne_len)
 {
   const unsigned char *hs = (const unsigned char *) haystack;
@@ -127,3 +131,4 @@ __memmem (const void *haystack, size_t hs_len,
 libc_hidden_def (__memmem)
 weak_alias (__memmem, memmem)
 libc_hidden_weak (memmem)
+libc_hidden_builtin_def (MEMMEM)
diff --git a/sysdeps/x86_64/multiarch/Makefile b/sysdeps/x86_64/multiarch/Makefile
index d3d2270394..0b46d5f341 100644
--- a/sysdeps/x86_64/multiarch/Makefile
+++ b/sysdeps/x86_64/multiarch/Makefile
@@ -15,6 +15,9 @@ sysdep_routines += \
   memcmpeq-avx2-rtm \
   memcmpeq-evex \
   memcmpeq-sse2 \
+  memmem-avx-base \
+  memmem-avx2 \
+  memmem-avx512 \
   memmove-avx-unaligned-erms \
   memmove-avx-unaligned-erms-rtm \
   memmove-avx512-no-vzeroupper \
@@ -122,6 +125,9 @@ sysdep_routines += \
   varshift \
 # sysdep_routines
 
+CFLAGS-memmem-avx2.c += -mavx2 -mbmi -O3
+CFLAGS-memmem-avx512.c += -mavx512f -mavx512bw -mbmi -O3
+
 CFLAGS-strcspn-sse4.c += -msse4
 CFLAGS-strpbrk-sse4.c += -msse4
 CFLAGS-strspn-sse4.c += -msse4
diff --git a/sysdeps/x86_64/multiarch/ifunc-impl-list.c b/sysdeps/x86_64/multiarch/ifunc-impl-list.c
index c4a21d4b7c..5fe1440235 100644
--- a/sysdeps/x86_64/multiarch/ifunc-impl-list.c
+++ b/sysdeps/x86_64/multiarch/ifunc-impl-list.c
@@ -798,6 +798,18 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
                               __strstr_avx512)
 	      IFUNC_IMPL_ADD (array, i, strstr, 1, __strstr_sse2_unaligned)
 	      IFUNC_IMPL_ADD (array, i, strstr, 1, __strstr_generic))
+  
+    /* Support sysdeps/x86_64/multiarch/memmem.c.  */
+  IFUNC_IMPL (i, name, memmem,
+              IFUNC_IMPL_ADD (array, i, memmem,
+                              (CPU_FEATURE_USABLE (AVX512BW)
+                               && CPU_FEATURE_USABLE (BMI1)),
+                              __memmem_avx512)
+              IFUNC_IMPL_ADD (array, i, memmem,
+		              (CPU_FEATURE_USABLE (AVX2)
+			      && CPU_FEATURE_USABLE (BMI1)),
+			      __memmem_avx2)
+	      IFUNC_IMPL_ADD (array, i, memmem, 1, __memmem_generic))
 
   /* Support sysdeps/x86_64/multiarch/wcschr.c.  */
   IFUNC_IMPL (i, name, wcschr,
diff --git a/sysdeps/x86_64/multiarch/memmem-avx-base.c b/sysdeps/x86_64/multiarch/memmem-avx-base.c
new file mode 100644
index 0000000000..212d75c96f
--- /dev/null
+++ b/sysdeps/x86_64/multiarch/memmem-avx-base.c
@@ -0,0 +1,20 @@
+const unsigned char ___rarebyte_table[256] attribute_hidden
+    = { 0,   1,	  13,  56,  59,	 60,  61,  62,	63,  232, 248, 2,   158, 4,
+	5,   6,	  7,   8,   9,	 10,  14,  20,	26,  29,  37,  46,  52,	 53,
+	54,  55,  57,  58,  255, 172, 242, 193, 162, 174, 178, 182, 218, 219,
+	212, 180, 249, 197, 221, 210, 253, 231, 230, 224, 225, 226, 227, 223,
+	222, 220, 176, 213, 184, 229, 188, 164, 159, 209, 181, 203, 189, 216,
+	196, 192, 185, 205, 161, 168, 215, 187, 211, 194, 195, 165, 206, 204,
+	214, 198, 173, 179, 175, 183, 167, 202, 239, 201, 160, 241, 163, 246,
+	233, 238, 240, 254, 237, 208, 234, 250, 169, 186, 236, 217, 245, 243,
+	228, 170, 247, 244, 251, 235, 199, 200, 252, 207, 177, 191, 171, 190,
+	166, 3,	  140, 134, 124, 126, 86,  128, 95,  117, 114, 93,  81,	 87,
+	132, 96,  112, 97,  103, 82,  139, 89,	98,  88,  119, 74,  156, 115,
+	104, 75,  120, 106, 76,	 155, 90,  122, 107, 125, 152, 145, 136, 137,
+	101, 116, 102, 108, 99,	 141, 77,  78,	118, 79,  109, 100, 150, 73,
+	94,  72,  121, 151, 113, 135, 110, 105, 83,  91,  11,  12,  64,	 149,
+	146, 111, 65,  69,  66,	 15,  16,  17,	18,  19,  130, 92,  144, 123,
+	21,  22,  23,  24,  131, 133, 127, 142, 25,  70,  129, 27,  28,	 67,
+	153, 84,  143, 138, 147, 157, 148, 68,	71,  30,  31,  32,  33,	 34,
+	35,  36,  154, 38,  39,	 40,  41,  42,	80,  43,  44,  45,  47,	 48,
+	85,  49,  50,  51 };
diff --git a/sysdeps/x86_64/multiarch/memmem-avx-base.h b/sysdeps/x86_64/multiarch/memmem-avx-base.h
new file mode 100644
index 0000000000..1333eac5b5
--- /dev/null
+++ b/sysdeps/x86_64/multiarch/memmem-avx-base.h
@@ -0,0 +1,183 @@
+#include <immintrin.h>
+#include <inttypes.h>
+#include <string.h>
+#include <libc-pointer-arith.h>
+
+#ifndef FUNC_NAME
+#  define FUNC_NAME __memmem_avx2
+#endif
+#ifndef VEC
+#  define VEC __m256i
+#endif
+#ifndef MASK
+#  define MASK uint32_t
+#endif
+#ifndef LOAD
+#  define LOAD(x) _mm256_load_si256 (x)
+#endif
+#ifndef LOADU
+#  define LOADU(x) _mm256_loadu_si256 (x)
+#endif
+#ifndef CMPEQ8_MASK
+#  define CMPEQ8_MASK(x, y) _mm256_movemask_epi8 (_mm256_cmpeq_epi8 (x, y))
+#endif
+#ifndef SETONE8
+#  define SETONE8(x) _mm256_set1_epi8 (x)
+#endif
+#ifndef TZCNT
+#  define TZCNT(x) _tzcnt_u32 (x)
+#endif
+#ifndef BLSR
+#  define BLSR(x) _blsr_u32 (x)
+#endif
+#define VEC_SIZE sizeof (VEC)
+#define ONES ((MASK) -1)
+
+#ifndef MEMCMPEQ
+#  define MEMCMPEQ __memcmpeq
+#endif
+#ifndef MEMCPY
+#  define MEMCPY memcpy
+#endif
+#ifndef MEMCHR
+#  define MEMCHR memchr
+#endif
+#ifndef PAGE_SIZE
+#  define PAGE_SIZE 4096
+#endif
+#define MIN(x, y) (((x) < (y)) ? (x) : (y))
+
+/* Lower is rarer. The table is based on the
+ *.c and *.h files in glibc. */
+extern const unsigned char ___rarebyte_table[256] attribute_hidden;
+
+static inline void *__attribute__ ((always_inline))
+find_rarest_byte (const unsigned char *rare, size_t n)
+{
+  const unsigned char *p = (const unsigned char *) rare;
+  int c_rare = ___rarebyte_table[*rare];
+  int c;
+  for (; n--; ++p)
+    {
+      c = ___rarebyte_table[*p];
+      if (c < c_rare)
+	{
+	  rare = p;
+	  c_rare = c;
+	}
+    }
+  return (void *) rare;
+}
+
+void *
+FUNC_NAME (const void *hs, size_t hs_len, const void *ne, size_t ne_len)
+{
+  if (ne_len == 1)
+    return (void *) MEMCHR (hs, *(unsigned char *) ne, hs_len);
+  if (__glibc_unlikely (ne_len == 0))
+    return (void *) hs;
+  if (__glibc_unlikely (hs_len < ne_len))
+    return NULL;
+  VEC hv0, hv1, hv, nv;
+  MASK i, hm0, hm1, m, cmpm;
+  const unsigned int matchsh = ne_len < VEC_SIZE ? VEC_SIZE - ne_len : 0;
+  const MASK matchm = ONES << matchsh;
+  const unsigned char *h = (const unsigned char *) hs;
+  const unsigned char *const end = h + hs_len - ne_len;
+  const unsigned char *hp;
+  size_t rare = PTR_DIFF (find_rarest_byte ((const unsigned char *)ne, MIN (ne_len, VEC_SIZE)), ne);
+  /* RARE will always be the first byte to find.
+     If RARE is at the end of the needle, use the byte before it. */
+  if (rare == MIN (ne_len, VEC_SIZE) - 1)
+    --rare;
+  const VEC nv0 = SETONE8 (*((char *) ne + rare));
+  const VEC nv1 = SETONE8 (*((char *) ne + rare + 1));
+  unsigned int off_e = (PTR_DIFF (end, h) < VEC_SIZE)
+			   ? VEC_SIZE - (unsigned int) (end - h) - 1
+			   : 0;
+  /* Start from the position of RARE. */
+  h += rare;
+  /* Load the needle vector. */
+  if (((uintptr_t) ne & (PAGE_SIZE - 1)) > (PAGE_SIZE - VEC_SIZE)
+      || ne_len >= VEC_SIZE)
+    nv = LOADU ((VEC *) ne);
+  else
+    MEMCPY (&nv, ne, MIN (VEC_SIZE, ne_len));
+  const unsigned int off_s = PTR_DIFF (h, PTR_ALIGN_DOWN (h, VEC_SIZE));
+  /* Align down to VEC_SIZE. */
+  h -= off_s;
+  hv0 = LOAD ((const VEC *) h);
+  hm0 = (MASK) CMPEQ8_MASK (hv0, nv0);
+  hm1 = (MASK) CMPEQ8_MASK (hv0, nv1) >> 1;
+  /* Clear the irrelevant bits from aligning down (OFF_S) and ones that are out
+   * of bounds (OFF_E). */
+  m = ((hm0 & hm1) >> off_s) & (ONES >> off_e);
+  while (m)
+    {
+      i = TZCNT (m);
+      m = BLSR (m);
+      hp = h + off_s + i - rare;
+      if (PTR_DIFF (PTR_ALIGN_UP (hp, PAGE_SIZE), hp) >= VEC_SIZE)
+	{
+	  /* Do a vector compare if we are not crossing a page. */
+	  hv = LOADU ((VEC *) hp);
+	  cmpm = (MASK) CMPEQ8_MASK (hv, nv) << matchsh;
+	  /* Compare only the relevant bits of the needle vector. */
+	  if (cmpm == matchm)
+	    /* Compare the rest of the needle. */
+	    if (ne_len <= VEC_SIZE
+		|| !MEMCMPEQ (hp + VEC_SIZE, (const char *) ne + VEC_SIZE,
+			      ne_len - VEC_SIZE))
+	      return (void *) hp;
+	}
+      else
+	{
+	  if (!MEMCMPEQ (hp, ne, ne_len))
+	    return (void *) hp;
+	}
+    }
+  h += VEC_SIZE - 1;
+  for (; h - rare + VEC_SIZE <= end; h += VEC_SIZE)
+    {
+      hv0 = LOADU ((const VEC *) h);
+      hv1 = LOAD ((const VEC *) (h + 1));
+      hm1 = (MASK) CMPEQ8_MASK (hv1, nv1);
+      hm0 = (MASK) CMPEQ8_MASK (hv0, nv0);
+      m = hm0 & hm1;
+      while (m)
+	{
+	match:
+	  i = TZCNT (m);
+	  m = BLSR (m);
+	  hp = h + i - rare;
+	  if (PTR_DIFF (PTR_ALIGN_UP (hp, PAGE_SIZE), hp) >= VEC_SIZE)
+	    {
+	      hv = LOADU ((VEC *) hp);
+	      cmpm = (MASK) CMPEQ8_MASK (hv, nv) << matchsh;
+	      if (cmpm == matchm)
+		if (ne_len <= VEC_SIZE
+		    || !MEMCMPEQ (hp + VEC_SIZE, (const char *) ne + VEC_SIZE,
+				  ne_len - VEC_SIZE))
+		  return (void *) hp;
+	    }
+	  else
+	    {
+	      if (!MEMCMPEQ (hp, ne, ne_len))
+		return (void *) hp;
+	    }
+	}
+    }
+  if (h - rare <= end)
+    {
+      off_e = VEC_SIZE - (unsigned int) (end - (h - rare)) - 1;
+      hv0 = LOADU ((const VEC *) h);
+      hv1 = LOAD ((const VEC *) (h + 1));
+      hm1 = (MASK) CMPEQ8_MASK (hv1, nv1);
+      hm0 = (MASK) CMPEQ8_MASK (hv0, nv0);
+      /* Clear the irrelevant bits that are out of bounds. */
+      m = hm0 & hm1 & (ONES >> off_e);
+      if (m)
+	goto match;
+    }
+  return NULL;
+}
diff --git a/sysdeps/x86_64/multiarch/memmem-avx2.c b/sysdeps/x86_64/multiarch/memmem-avx2.c
new file mode 100644
index 0000000000..91f5d5d331
--- /dev/null
+++ b/sysdeps/x86_64/multiarch/memmem-avx2.c
@@ -0,0 +1,3 @@
+#define FUNC_NAME __memmem_avx2
+
+#include "memmem-avx-base.h"
diff --git a/sysdeps/x86_64/multiarch/memmem-avx512.c b/sysdeps/x86_64/multiarch/memmem-avx512.c
new file mode 100644
index 0000000000..76016c1cfe
--- /dev/null
+++ b/sysdeps/x86_64/multiarch/memmem-avx512.c
@@ -0,0 +1,12 @@
+#define VEC __m512i
+#define MASK uint64_t
+#define LOAD(x) _mm512_load_si512 (x)
+#define LOADU(x) _mm512_loadu_si512 (x)
+#define CMPEQ8_MASK(x, y) _mm512_cmpeq_epi8_mask (x, y)
+#define SETONE8(x) _mm512_set1_epi8 (x)
+#define TZCNT(x) _tzcnt_u64 (x)
+#define BLSR(x) _blsr_u64 (x)
+
+#define FUNC_NAME __memmem_avx512
+
+#include "memmem-avx-base.h"
diff --git a/sysdeps/x86_64/multiarch/memmem.c b/sysdeps/x86_64/multiarch/memmem.c
new file mode 100644
index 0000000000..8fe7b77d33
--- /dev/null
+++ b/sysdeps/x86_64/multiarch/memmem.c
@@ -0,0 +1,67 @@
+/* Multiple versions of memmem.
+   All versions must be listed in ifunc-impl-list.c.
+   Copyright (C) 2012-2023 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+/* Redefine memmem so that the compiler won't complain about the type
+   mismatch with the IFUNC selector in strong_alias, below.  */
+#undef  memmem
+#define memmem __redirect_memmem
+#include <string.h>
+#undef  memmem
+
+#define MEMMEM __memmem_generic
+#ifdef SHARED
+# undef libc_hidden_builtin_def
+# define libc_hidden_builtin_def(name) \
+  __hidden_ver1 (__memmem_generic, __GI_memmem, __memmem_generic);
+#endif
+
+#include "string/memmem.c"
+
+extern __typeof (__redirect_memmem) __memmem_avx2 attribute_hidden;
+extern __typeof (__redirect_memmem) __memmem_generic attribute_hidden;
+extern __typeof (__redirect_memmem) __memmem_avx512 attribute_hidden;
+
+#define SYMBOL_NAME memmem
+
+#include "init-arch.h"
+
+/* Avoid DWARF definition DIE on ifunc symbol so that GDB can handle
+   ifunc symbol properly.  */
+extern __typeof (__redirect_memmem) __libc_memmem;
+
+static inline void *
+IFUNC_SELECTOR (void)
+{
+  const struct cpu_features *cpu_features = __get_cpu_features ();
+
+  if (!CPU_FEATURES_ARCH_P (cpu_features, Prefer_No_AVX512)
+      && CPU_FEATURE_USABLE_P (cpu_features, AVX512BW)
+      && CPU_FEATURE_USABLE_P (cpu_features, BMI1))
+    return __memmem_avx512;
+
+  if (CPU_FEATURE_USABLE_P (cpu_features, AVX2)
+      && CPU_FEATURE_USABLE_P (cpu_features, BMI1))
+    return __memmem_avx2;
+
+  return __memmem_generic;
+}
+
+libc_ifunc_redirected (__redirect_memmem, __libc_memmem, IFUNC_SELECTOR ());
+#undef memmem
+strong_alias (__libc_memmem, __memmem)
-- 
2.43.2



* Re: [PATCH v6] sysdeps/x86_64/multiarch/memmem-avx2.c: add memmem-avx2.c
  2024-02-18  8:26 [PATCH v6] sysdeps/x86_64/multiarch/memmem-avx2.c: add memmem-avx2.c James Tirta Halim
@ 2024-02-19  0:07 ` Noah Goldstein
  2024-02-19  8:13   ` Alexander Monakov
  2024-02-21  6:57 ` [PATCH v7] " James Tirta Halim
  1 sibling, 1 reply; 25+ messages in thread
From: Noah Goldstein @ 2024-02-19  0:07 UTC (permalink / raw)
  To: James Tirta Halim; +Cc: libc-alpha

On Sun, Feb 18, 2024 at 8:26 AM James Tirta Halim
<tirtajames45@gmail.com> wrote:
>
> [... v6 patch quoted in full, identical to the message above ...]

It doesn't seem you have addressed many of the comments from your v5 patch.
It would help if you:
1: Reply to the comments indicating they are handled / why you are choosing not
to handle them.
2: Send further versions to the same email chain (`--in-reply-to`
with `git send-email`).


* Re: [PATCH v6] sysdeps/x86_64/multiarch/memmem-avx2.c: add memmem-avx2.c
  2024-02-19  0:07 ` Noah Goldstein
@ 2024-02-19  8:13   ` Alexander Monakov
  2024-02-19 14:25     ` Adhemerval Zanella Netto
  0 siblings, 1 reply; 25+ messages in thread
From: Alexander Monakov @ 2024-02-19  8:13 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: James Tirta Halim, libc-alpha


On Mon, 19 Feb 2024, Noah Goldstein wrote:

> It doesn't seem you have addressed many of the comments from your v5 patch.
> It would help if you:
> 1: Reply to the comments indicating they are handled / why you are choosing not
> to handle them.
> 2: Send further versions to the same email chain (`--in-reply-to`
> with `git send-email`).

Are you ok with the change in worst-case time complexity? The existing generic
implementation is O(n+m), the proposed variants are O(n*m).
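
For concreteness, a constructed worst case (an illustration; any needle of
a single repeated byte defeats the rare-byte filter, since the filter pair
then occurs at almost every haystack position):

#define _GNU_SOURCE   /* for memmem */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

int
main (void)
{
  size_t m = 256, n = 1 << 20;
  char *ne = malloc (m), *hs = malloc (n);
  memset (ne, 'a', m);                  /* needle: m times 'a' */
  memset (hs, 'a', n);
  for (size_t i = m - 1; i < n; i += m)
    hs[i] = 'b';                        /* haystack: (m-1)*'a', 'b', repeated */
  /* No match exists, yet nearly every offset passes the two-byte filter
     and each verification scans up to m bytes before hitting a 'b':
     O(n*m) work in total, where Two-Way stays O(n+m).  */
  assert (memmem (hs, n, ne, m) == NULL);
  free (hs);
  free (ne);
  return 0;
}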

Alexander


* Re: [PATCH v6] sysdeps/x86_64/multiarch/memmem-avx2.c: add memmem-avx2.c
  2024-02-19  8:13   ` Alexander Monakov
@ 2024-02-19 14:25     ` Adhemerval Zanella Netto
  2024-02-19 17:20       ` Noah Goldstein
  0 siblings, 1 reply; 25+ messages in thread
From: Adhemerval Zanella Netto @ 2024-02-19 14:25 UTC (permalink / raw)
  To: Alexander Monakov, Noah Goldstein; +Cc: James Tirta Halim, libc-alpha



On 19/02/24 05:13, Alexander Monakov wrote:
> 
> On Mon, 19 Feb 2024, Noah Goldstein wrote:
> 
>> It doesn't seem you have addressed many of the comments from your v5 patch.
>> It would help if you:
>> 1: Reply to the comments indicating they are handled / why you are choosing not
>> to handle them.
>> 2: Send further versions to the same email chain (`--in-reply-to`
>> with `git send-email`).
> 
> Are you ok with the change in worst-case time complexity? The existing generic
> implementation is O(n+m), the proposed variants are O(n*m).

I think we should consider this a regression; we already have a bug opened
for wcsstr [1] for a similar issue. We already had a similar issue for
PowerPC [2], and we did not reach consensus back then because the generic
implementation was also O(m*n) (this was before Wilco's new implementation).

[1] https://sourceware.org/bugzilla/show_bug.cgi?id=23865
[2] https://sourceware.org/pipermail/libc-alpha/2015-July/062808.html


* Re: [PATCH v6] sysdeps/x86_64/multiarch/memmem-avx2.c: add memmem-avx2.c
  2024-02-19 14:25     ` Adhemerval Zanella Netto
@ 2024-02-19 17:20       ` Noah Goldstein
  2024-02-20  3:00         ` James
  0 siblings, 1 reply; 25+ messages in thread
From: Noah Goldstein @ 2024-02-19 17:20 UTC (permalink / raw)
  To: Adhemerval Zanella Netto; +Cc: Alexander Monakov, James Tirta Halim, libc-alpha

On Mon, Feb 19, 2024 at 2:25 PM Adhemerval Zanella Netto
<adhemerval.zanella@linaro.org> wrote:
>
>
>
> On 19/02/24 05:13, Alexander Monakov wrote:
> >
> > On Mon, 19 Feb 2024, Noah Goldstein wrote:
> >
> >> It doesn't seem you have addressed many of the comments from your v5 patch.
> >> It would help if you:
> >> 1: Reply to the comments indicating they are handled / why you are choosing not
> >> to handle them.
> >> 2: Send further versions to the same email chain (`--in-reply-to`
> >> with `git send-email`).
> >
> > Are you ok with the change in worst-case time complexity? The existing generic
> > implementation is O(n+m), the proposed variants are O(n*m).
>
> I think we should consider this a regression; we already have a bug opened
> for wcsstr [1] for a similar issue. We already had a similar issue for
> PowerPC [2], and we did not reach consensus back then because the generic
> implementation was also O(m*n) (this was before Wilco's new implementation).

I think practically this impl would be faster for short needles. Maybe
limit to `m < ~16`, otherwise fall back to generic?
>
> [1] https://sourceware.org/bugzilla/show_bug.cgi?id=23865
> [2] https://sourceware.org/pipermail/libc-alpha/2015-July/062808.html


* Re: [PATCH v6] sysdeps/x86_64/multiarch/memmem-avx2.c: add memmem-avx2.c
  2024-02-19 17:20       ` Noah Goldstein
@ 2024-02-20  3:00         ` James
  2024-02-20 14:30           ` Adhemerval Zanella Netto
  0 siblings, 1 reply; 25+ messages in thread
From: James @ 2024-02-20  3:00 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: Adhemerval Zanella Netto, Alexander Monakov, libc-alpha


On Tue, Feb 20, 2024 at 12:20 AM Noah Goldstein <goldstein.w.n@gmail.com>
wrote:

> [... earlier quotes trimmed ...]
>
>     I think practically this impl would be faster for short needles. Maybe
>     limit to `m < ~16`, otherwise fall back to generic?
>
This implementation is virtually O(n) for m <= VEC_SIZE, so I think the
cutoff should be at least m <= VEC_SIZE; and since the generic
implementation only uses the O(n+m) Two-Way algorithm for m > 256, the
cutoff could be m <= 256, unless we want to use str-two-way.h directly,
which I think would be a waste of code size.
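
A sketch of that cutoff (hypothetical wrapper; __memmem_avx2 and
__memmem_generic as named in the patch, thresholds as discussed above):

#include <stddef.h>

#define VEC_SIZE 32   /* AVX2 vector size */

extern void *__memmem_avx2 (const void *, size_t, const void *, size_t);
extern void *__memmem_generic (const void *, size_t, const void *, size_t);

static void *
memmem_dispatch (const void *hs, size_t hs_len,
                 const void *ne, size_t ne_len)
{
  /* The filter's worst case grows with the needle length, so keep the
     vectorized path for short needles and fall back to the O(n+m)
     generic implementation for longer ones.  */
  if (ne_len <= VEC_SIZE)
    return __memmem_avx2 (hs, hs_len, ne, ne_len);
  return __memmem_generic (hs, hs_len, ne, ne_len);
}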

> >
> > [1] https://sourceware.org/bugzilla/show_bug.cgi?id=23865
> > [2] https://sourceware.org/pipermail/libc-alpha/2015-July/062808.html
>


* Re: [PATCH v6] sysdeps/x86_64/multiarch/memmem-avx2.c: add memmem-avx2.c
  2024-02-20  3:00         ` James
@ 2024-02-20 14:30           ` Adhemerval Zanella Netto
  2024-02-20 15:16             ` James
  0 siblings, 1 reply; 25+ messages in thread
From: Adhemerval Zanella Netto @ 2024-02-20 14:30 UTC (permalink / raw)
  To: James, Noah Goldstein; +Cc: Alexander Monakov, libc-alpha



On 20/02/24 00:00, James wrote:
> On Tue, Feb 20, 2024 at 12:20 AM Noah Goldstein <goldstein.w.n@gmail.com <mailto:goldstein.w.n@gmail.com>> wrote:
> 
>     [... earlier quotes trimmed ...]
> 
>     I think practically this impl would be faster for short needles. Maybe
>     limit to `m < ~16`, otherwise fall back to generic?
> 
> This implementation is virtually O(n) for m <= VEC_SIZE, so I think the cutoff should be at least m <= VEC_SIZE; and since the generic implementation only uses the O(n+m) Two-Way algorithm for m > 256, the cutoff could be m <= 256, unless we want to use str-two-way.h directly, which I think would be a waste of code size.

Afaik the s390x port uses a similar strategy, so it should be ok to
optimize for m <= VEC_SIZE.

Also, please check why your patch makes the aarch64/arm buildbot fail to
build [1]. You can bootstrap a toolchain using the build-many-glibcs.py
script if required.

[1] https://patchwork.sourceware.org/project/glibc/patch/20240218082621.131128-1-tirtajames45@gmail.com/



* Re: [PATCH v6] sysdeps/x86_64/multiarch/memmem-avx2.c: add memmem-avx2.c
  2024-02-20 14:30           ` Adhemerval Zanella Netto
@ 2024-02-20 15:16             ` James
  2024-02-20 16:13               ` Noah Goldstein
  0 siblings, 1 reply; 25+ messages in thread
From: James @ 2024-02-20 15:16 UTC (permalink / raw)
  To: Adhemerval Zanella Netto; +Cc: Noah Goldstein, Alexander Monakov, libc-alpha


(Resend because I didn't reply all)

On Tue, Feb 20, 2024 at 9:30 PM Adhemerval Zanella Netto <
adhemerval.zanella@linaro.org> wrote:

>
> [... earlier quotes trimmed ...]
> >
> >     I think practically this impl would be faster for short needles. Maybe
> >     limit to `m < ~16`, otherwise fall back to generic?
> >
> > This implementation is virtually O(n) for m <= VEC_SIZE, so I think the
> > cutoff should be at least m <= VEC_SIZE; and since the generic
> > implementation only uses the O(n+m) Two-Way algorithm for m > 256, the
> > cutoff could be m <= 256, unless we want to use str-two-way.h directly,
> > which I think would be a waste of code size.
>
> Afaik the s390x port uses a similar strategy, so it should be ok to
> optimize for m <= VEC_SIZE.
>
> Also, please check why your patch makes the aarch64/arm buildbot fail to
> build [1]. You can bootstrap a toolchain using the build-many-glibcs.py
> script if required.
>
It seems that it has to do with the libc_hidden_builtin_def in
string/memmem.c which I don't really understand.

>
> [1]
> https://patchwork.sourceware.org/project/glibc/patch/20240218082621.131128-1-tirtajames45@gmail.com/
>
>


* Re: [PATCH v6] sysdeps/x86_64/multiarch/memmem-avx2.c: add memmem-avx2.c
  2024-02-20 15:16             ` James
@ 2024-02-20 16:13               ` Noah Goldstein
  2024-02-20 16:26                 ` James
  0 siblings, 1 reply; 25+ messages in thread
From: Noah Goldstein @ 2024-02-20 16:13 UTC (permalink / raw)
  To: James; +Cc: Adhemerval Zanella Netto, Alexander Monakov, libc-alpha

On Tue, Feb 20, 2024 at 3:16 PM James <tirtajames45@gmail.com> wrote:
>
> (Resend because I didn't reply all)
>
> On Tue, Feb 20, 2024 at 9:30 PM Adhemerval Zanella Netto <adhemerval.zanella@linaro.org> wrote:
>>
>>
>>
>> [... earlier quotes trimmed ...]
>>
>> Afaik the s390x port uses a similar strategy, so it should be ok to
>> optimize for m <= VEC_SIZE.
>>
>> Also, please check why your patch makes the aarch64/arm buildbot fail to
>> build [1]. You can bootstrap a toolchain using the build-many-glibcs.py
>> script if required.
>
> It seems that it has to do with the libc_hidden_builtin_def in string/memmem.c which I don't really understand.

Instead of adding a new hidden def at the end of `string/memmem.c`, just
replace the existing uses of `__memmem` with `MEMMEM`.
>>
>>
>> [1] https://patchwork.sourceware.org/project/glibc/patch/20240218082621.131128-1-tirtajames45@gmail.com/
>>


* Re: [PATCH v6] sysdeps/x86_64/multiarch/memmem-avx2.c: add memmem-avx2.c
  2024-02-20 16:13               ` Noah Goldstein
@ 2024-02-20 16:26                 ` James
  2024-02-20 16:38                   ` Noah Goldstein
  0 siblings, 1 reply; 25+ messages in thread
From: James @ 2024-02-20 16:26 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: Adhemerval Zanella Netto, Alexander Monakov, libc-alpha


On Tue, Feb 20, 2024 at 11:14 PM Noah Goldstein <goldstein.w.n@gmail.com>
wrote:

> [... earlier quotes trimmed ...]
> >
> > It seems that it has to do with the libc_hidden_builtin_def in
> > string/memmem.c which I don't really understand.
>
> Instead of adding a new hidden def at the end of `string/memmem.c`, just
> replace the existing uses of `__memmem` with `MEMMEM`.
>
With

#ifndef _LIBC
# define __memmem memmem
#endif

#ifndef MEMMEM
# define MEMMEM __memmem
#endif

void *
MEMMEM (const void *haystack, size_t hs_len,
        const void *needle, size_t ne_len)

libc_hidden_def (MEMMEM)
weak_alias (MEMMEM, memmem)
libc_hidden_weak (memmem)

make test t=string/test-memmem on x86-64 shows

./../include/libc-symbols.h:472:33: error: ‘__EI___memmem_generic’ aliased to undefined symbol ‘__GI___memmem_generic’
  472 |   extern thread __typeof (name) __EI_##name \
      |                                 ^~~~~
./../include/libc-symbols.h:468:3: note: in expansion of macro ‘__hidden_ver2’
  468 |   __hidden_ver2 (, local, internal, name)
      |   ^~~~~~~~~~~~~
./../include/libc-symbols.h:476:41: note: in expansion of macro ‘__hidden_ver1’
  476 | #  define hidden_def(name)              __hidden_ver1(__GI_##name, name, name);
      |                                         ^~~~~~~~~~~~~
./../include/libc-symbols.h:557:32: note: in expansion of macro ‘hidden_def’
  557 | # define libc_hidden_def(name) hidden_def (name)
      |                                ^~~~~~~~~~
../string/memmem.c:131:1: note: in expansion of macro ‘libc_hidden_def’
  131 | libc_hidden_def (MEMMEM)
      | ^~~~~~~~~~~~~~~
./../include/libc-symbols.h:472:33: error: ‘__EI_memmem’ aliased to undefined symbol ‘__GI_memmem’
  472 |   extern thread __typeof (name) __EI_##name \
      |                                 ^~~~~
./../include/libc-symbols.h:468:3: note: in expansion of macro ‘__hidden_ver2’
  468 |   __hidden_ver2 (, local, internal, name)
      |   ^~~~~~~~~~~~~
./../include/libc-symbols.h:484:9: note: in expansion of macro ‘__hidden_ver1’
  484 |         __hidden_ver1(__GI_##name, name, name) __attribute__((weak));
      |         ^~~~~~~~~~~~~
./../include/libc-symbols.h:558:33: note: in expansion of macro ‘hidden_weak’
  558 | # define libc_hidden_weak(name) hidden_weak (name)
      |                                 ^~~~~~~~~~~
../string/memmem.c:133:1: note: in expansion of macro ‘libc_hidden_weak’
  133 | libc_hidden_weak (memmem)
      | ^~~~~~~~~~~~~~~~
make[2]: *** [/home/james/.local/src/glibc/build/sysd-rules:671: /home/james/.local/src/glibc/build/string/memmem.os] Error 1
make[2]: Leaving directory '/home/james/.local/src/glibc/string'
make[1]: *** [Makefile:759: test] Error 2
make[1]: Leaving directory '/home/james/.local/src/glibc'
make: *** [Makefile:9: test] Error 2
>
> >>
> >>
> >> [1] https://patchwork.sourceware.org/project/glibc/patch/20240218082621.131128-1-tirtajames45@gmail.com/
> >>
>


* Re: [PATCH v6] sysdeps/x86_64/multiarch/memmem-avx2.c: add memmem-avx2.c
  2024-02-20 16:26                 ` James
@ 2024-02-20 16:38                   ` Noah Goldstein
  0 siblings, 0 replies; 25+ messages in thread
From: Noah Goldstein @ 2024-02-20 16:38 UTC (permalink / raw)
  To: James; +Cc: Adhemerval Zanella Netto, Alexander Monakov, libc-alpha

On Tue, Feb 20, 2024 at 4:26 PM James <tirtajames45@gmail.com> wrote:
>
> On Tue, Feb 20, 2024 at 11:14 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>>
>> On Tue, Feb 20, 2024 at 3:16 PM James <tirtajames45@gmail.com> wrote:
>> >
>> > (Resend because I didn't reply all)
>> >
>> > On Tue, Feb 20, 2024 at 9:30 PM Adhemerval Zanella Netto <adhemerval.zanella@linaro.org> wrote:
>> >>
>> >>
>> >>
>> >> On 20/02/24 00:00, James wrote:
>> >> > On Tue, Feb 20, 2024 at 12:20 AM Noah Goldstein <goldstein.w.n@gmail.com <mailto:goldstein.w.n@gmail.com>> wrote:
>> >> >
>> >> >     On Mon, Feb 19, 2024 at 2:25 PM Adhemerval Zanella Netto
>> >> >     <adhemerval.zanella@linaro.org <mailto:adhemerval.zanella@linaro.org>> wrote:
>> >> >     >
>> >> >     >
>> >> >     >
>> >> >     > On 19/02/24 05:13, Alexander Monakov wrote:
>> >> >     > >
>> >> >     > > On Mon, 19 Feb 2024, Noah Goldstein wrote:
>> >> >     > >
>> >> >     > >> It doesn't seem you have addressed many of the comments from your v5 patch.
>> >> >     > >> It would help if you:
>> >> >     > >> 1: Reply to the comments indicating they are handled / why you are choosing not
>> >> >     > >> to handle them.
>> >> >     > >> 2: Send further versions to the same email chain. (`--in-reply-to`
>> >> >     > >> with `git send-email`).
>> >> >     > >
>> >> >     > > Are you ok with the change in worst-case time complexity? The existing generic
>> >> >     > > implementation is O(n+m), the proposed variants are O(n*m).
>> >> >     >
>> >> >     > I think we should consider this a regression, we already have a bug opened for
>> >> >     > wcsstr [1] for a similar issue. We already had another similar issue for
>> >> >     > PowerPC [2], and we did not have consensus back then because the generic
>> >> >     > implementation was also O(m*n) (it was before Wilco new implementation).
>> >> >
>> >> >     Think practically this impl would be faster for short needles. Maybe
>> >> >     limit to `m < ~16`, otherwise fallback to generic?
>> >> >
>> >> > This implementation is virtually O(n) for m <= VEC_SIZE, so I think it should be at least m <= VEC_SIZE, and since generic implementation uses O(n+m) for m > 256, it should be m <= 256, unless we want to directly use str-two-way.h, which I think would be a waste of code size.
>> >>
>> >> Afaik s390x does use a similar strategy, so it should be ok to optimize for
>> >> m <= VEC_SIZE.
>> >>
>> >> Also, please check why your patch is making the aarch64/arm buildbots fail to
>> >> build [1]. You can bootstrap a toolchain using the build-many-glibcs.py
>> >> script if required.
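
(For reference, a typical build-many-glibcs.py sequence looks like the
following; the working directory is illustrative:

$ scripts/build-many-glibcs.py /tmp/bmg checkout
$ scripts/build-many-glibcs.py /tmp/bmg host-libraries
$ scripts/build-many-glibcs.py /tmp/bmg compilers aarch64-linux-gnu
$ scripts/build-many-glibcs.py /tmp/bmg glibcs aarch64-linux-gnu
)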
>> >
>> > It seems that it has to do with the libc_hidden_builtin_def in string/memmem.c which I don't really understand.
>>
>> Instead of adding a new hidden def at the end of `string/memmem.c`,
>> just replace the existing
>> uses of `__memmem` with `MEMMEM`.

So if the target is just using this as the generic impl (and defines the
defs in sysdeps/*), nothing further should be needed.


See how we do `wcscpy` in x86_64; you should be able to
follow the same pattern.


>
> With
>
> #ifndef _LIBC
> # define __memmem memmem
> #endif
>
> #ifndef MEMMEM
> # define MEMMEM __memmem
> #endif
>
> void *
> MEMMEM (const void *haystack, size_t hs_len,
>  const void *needle, size_t ne_len)
>
> libc_hidden_def (MEMMEM)
> weak_alias (MEMMEM, memmem)
> libc_hidden_weak (memmem)
>
> make test t=string/test-memmem on x86-64 shows
>
>  ./../include/libc-symbols.h:472:33: error: ‘__EI___memmem_generic’ aliased to undefined symbol ‘__GI___memmem_generic’
>   472 |   extern thread __typeof (name) __EI_##name \
>       |                                 ^~~~~
> ./../include/libc-symbols.h:468:3: note: in expansion of macro ‘__hidden_ver2’
>   468 |   __hidden_ver2 (, local, internal, name)
>       |   ^~~~~~~~~~~~~
> ./../include/libc-symbols.h:476:41: note: in expansion of macro ‘__hidden_ver1’
>   476 | #  define hidden_def(name)              __hidden_ver1(__GI_##name, name, name);
>       |                                         ^~~~~~~~~~~~~
> ./../include/libc-symbols.h:557:32: note: in expansion of macro ‘hidden_def’
>   557 | # define libc_hidden_def(name) hidden_def (name)
>       |                                ^~~~~~~~~~
> ../string/memmem.c:131:1: note: in expansion of macro ‘libc_hidden_def’
>   131 | libc_hidden_def (MEMMEM)
>       | ^~~~~~~~~~~~~~~
> ./../include/libc-symbols.h:472:33: error: ‘__EI_memmem’ aliased to undefined symbol ‘__GI_memmem’
>   472 |   extern thread __typeof (name) __EI_##name \
>       |                                 ^~~~~
> ./../include/libc-symbols.h:468:3: note: in expansion of macro ‘__hidden_ver2’
>   468 |   __hidden_ver2 (, local, internal, name)
>       |   ^~~~~~~~~~~~~
> ./../include/libc-symbols.h:484:9: note: in expansion of macro ‘__hidden_ver1’
>   484 |         __hidden_ver1(__GI_##name, name, name) __attribute__((weak));
>       |         ^~~~~~~~~~~~~
> ./../include/libc-symbols.h:558:33: note: in expansion of macro ‘hidden_weak’
>   558 | # define libc_hidden_weak(name) hidden_weak (name)
>       |                                 ^~~~~~~~~~~
> ../string/memmem.c:133:1: note: in expansion of macro ‘libc_hidden_weak’
>   133 | libc_hidden_weak (memmem)
>       | ^~~~~~~~~~~~~~~~
> make[2]: *** [/home/james/.local/src/glibc/build/sysd-rules:671: /home/james/.local/src/glibc/build/string/memmem.os] Error 1
> make[2]: Leaving directory '/home/james/.local/src/glibc/string'
> make[1]: *** [Makefile:759: test] Error 2
> make[1]: Leaving directory '/home/james/.local/src/glibc'
> make: *** [Makefile:9: test] Error 2
>>
>> >>
>> >>
>> >> [1] https://patchwork.sourceware.org/project/glibc/patch/20240218082621.131128-1-tirtajames45@gmail.com/
>> >>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v7] sysdeps/x86_64/multiarch/memmem-avx2.c: add memmem-avx2.c
  2024-02-18  8:26 [PATCH v6] sysdeps/x86_64/multiarch/memmem-avx2.c: add memmem-avx2.c James Tirta Halim
  2024-02-19  0:07 ` Noah Goldstein
@ 2024-02-21  6:57 ` James Tirta Halim
  2024-02-21 17:17   ` Noah Goldstein
                     ` (2 more replies)
  1 sibling, 3 replies; 25+ messages in thread
From: James Tirta Halim @ 2024-02-21  6:57 UTC (permalink / raw)
  To: tirtajames45; +Cc: goldstein.w.n, libc-alpha

Find the rarest byte in NE. Find the parts of HS that match the rare byte
and the byte after it. If found, shift back to the start of NE in HS and
vector-compare the first VEC_SIZE bytes with NE. If they match, compare the
rest with MEMCMPEQ.
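
A scalar sketch of that strategy (illustrative only, with a made-up helper
name; the patch performs the rare-byte filter VEC_SIZE lanes at a time with
SIMD compares and verifies with one vector compare plus MEMCMPEQ; ne_len <= 1
is dispatched to memchr before this point):

#include <string.h>

/* RARE is the index of the rarest needle byte, moved back by one when
   it falls on the last byte so that ne[rare + 1] is always valid.  */
static void *
memmem_sketch (const unsigned char *hs, size_t hs_len,
	       const unsigned char *ne, size_t ne_len, size_t rare)
{
  for (size_t i = 0; i + ne_len <= hs_len; i++)
    /* Cheap filter: the rare byte and the byte after it must match.  */
    if (hs[i + rare] == ne[rare] && hs[i + rare + 1] == ne[rare + 1]
	/* Shift back to the candidate start and verify the needle.  */
	&& memcmp (hs + i, ne, ne_len) == 0)
      return (void *) (hs + i);
  return NULL;
}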

Timings (Core i3-1115G4):

          basic_memmem  twoway_memmem  __memmem_avx512  __memmem_avx2  __memmem_generic
Total:     6.80124e+06    1.06087e+06           219483         345385            768041
Average:       25958.9        4049.11          837.721        1318.26           2931.45

Passes make check.

Changes in v1:
1. Add memmem-avx2.c

Changes in v2:
1. Add avx512 support with a generic header file
2. Use __memcmpeq instead of memcmp
3. Remove scalar loop
4. Fix unsafe unaligned load

Changes in v3:
1. Avoid checking for alignment to the start of the page since that will be rare
2. Use __memcmpeq instead of __memcmpeq_avx2 (it generates undefined
reference errors)
3. Add memmem.c (needs review)
4. Add __memcmpeq_avx2 and __memcmpeq_avx512 to ifunc-impl-list.c (needs
review)
5. Add libc_hidden_builtin_def and MEMMEM to memmem.c (needs review)

Changes in v4:
1. Correct the cpu feature checks in ifunc-impl-list.c and memmem.c to
use AVX512BW and BMI1 for AVX512 and AVX2 and BMI1 for AVX2
2. Correct the Makefile to use the appropriate flags
3. Rename memmem-vectorized-avx.h to memmem-avx-base.h
4. Remove unused vector macros (POPCNT and LZCNT)

Changes in v5:
1. Rename SHIFT to RARE, OFF to OFF_S, OFF2 to OFF_E
2. Remove conditional for VEC_SIZE and ONES, and remove unused MASK_SIZE
3. Add comments
4. Limit needle length to VEC_SIZE when finding the rare byte

Changes in v6:
1. Fix patch apply error in memmem.c
2. Correctly use MIN(ne_len, VEC_SIZE) when checking if RARE is found at the end
of needle
3. Always do unaligned load at the tail code
4. Rename rarebyte_table to ___rarebyte_table
5. Add memmem-avx-base.c in which ___rarebyte_table is defined
6. Add memmem-avx-base to the Makefile
7. Add always_inline to find_rarest_byte
8. Change ((m << off) >> off) to (m & (ONES >> off))
9. Change void * to unsigned char * in find_rarest_byte

Changes in v7:
1. Fallback to generic memmem for long needles for guaranteed
linear-time worst-case performance
2. Use memmem instead of MEMMEM for libc_hidden_builtin_def in
memmem.c (string/memmem.c and sysdeps/x86_64/multiarch/memmem.c may
still need to be fixed for non-x86_64 builds to work. The changes were
made following string/strstr.c and sysdeps/x86_64/multiarch/strstr.c)
3. Change some (VEC *) casts to (const VEC *)

---
 string/memmem.c                            |   7 +-
 sysdeps/x86_64/multiarch/Makefile          |   6 +
 sysdeps/x86_64/multiarch/ifunc-impl-list.c |  12 ++
 sysdeps/x86_64/multiarch/memmem-avx-base.c |  20 +++
 sysdeps/x86_64/multiarch/memmem-avx-base.h | 191 +++++++++++++++++++++
 sysdeps/x86_64/multiarch/memmem-avx2.c     |   3 +
 sysdeps/x86_64/multiarch/memmem-avx512.c   |  12 ++
 sysdeps/x86_64/multiarch/memmem.c          |  67 ++++++++
 8 files changed, 317 insertions(+), 1 deletion(-)
 create mode 100644 sysdeps/x86_64/multiarch/memmem-avx-base.c
 create mode 100644 sysdeps/x86_64/multiarch/memmem-avx-base.h
 create mode 100644 sysdeps/x86_64/multiarch/memmem-avx2.c
 create mode 100644 sysdeps/x86_64/multiarch/memmem-avx512.c
 create mode 100644 sysdeps/x86_64/multiarch/memmem.c

diff --git a/string/memmem.c b/string/memmem.c
index a4117f8e1e..0a89bd5f7c 100644
--- a/string/memmem.c
+++ b/string/memmem.c
@@ -25,6 +25,10 @@
 # define __memmem	memmem
 #endif
 
+#ifndef MEMMEM
+# define MEMMEM __memmem
+#endif
+
 #define RETURN_TYPE void *
 #define AVAILABLE(h, h_l, j, n_l) ((j) <= (h_l) - (n_l))
 #define FASTSEARCH(S,C,N) (void*) memchr ((void *)(S), (C), (N))
@@ -50,7 +54,7 @@
    The limit also implies worst-case performance is linear.
    Needles larger than 256 characters use the linear-time Two-Way algorithm.  */
 void *
-__memmem (const void *haystack, size_t hs_len,
+MEMMEM (const void *haystack, size_t hs_len,
 	  const void *needle, size_t ne_len)
 {
   const unsigned char *hs = (const unsigned char *) haystack;
@@ -127,3 +131,4 @@ __memmem (const void *haystack, size_t hs_len,
 libc_hidden_def (__memmem)
 weak_alias (__memmem, memmem)
 libc_hidden_weak (memmem)
+libc_hidden_builtin_def (memmem)
diff --git a/sysdeps/x86_64/multiarch/Makefile b/sysdeps/x86_64/multiarch/Makefile
index d3d2270394..0b46d5f341 100644
--- a/sysdeps/x86_64/multiarch/Makefile
+++ b/sysdeps/x86_64/multiarch/Makefile
@@ -15,6 +15,9 @@ sysdep_routines += \
   memcmpeq-avx2-rtm \
   memcmpeq-evex \
   memcmpeq-sse2 \
+  memmem-avx-base \
+  memmem-avx2 \
+  memmem-avx512 \
   memmove-avx-unaligned-erms \
   memmove-avx-unaligned-erms-rtm \
   memmove-avx512-no-vzeroupper \
@@ -122,6 +125,9 @@ sysdep_routines += \
   varshift \
 # sysdep_routines
 
+CFLAGS-memmem-avx2.c += -mavx2 -mbmi -O3
+CFLAGS-memmem-avx512.c += -mavx512f -mavx512bw -mbmi -O3
+
 CFLAGS-strcspn-sse4.c += -msse4
 CFLAGS-strpbrk-sse4.c += -msse4
 CFLAGS-strspn-sse4.c += -msse4
diff --git a/sysdeps/x86_64/multiarch/ifunc-impl-list.c b/sysdeps/x86_64/multiarch/ifunc-impl-list.c
index c4a21d4b7c..20a8b85da9 100644
--- a/sysdeps/x86_64/multiarch/ifunc-impl-list.c
+++ b/sysdeps/x86_64/multiarch/ifunc-impl-list.c
@@ -799,6 +799,18 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
 	      IFUNC_IMPL_ADD (array, i, strstr, 1, __strstr_sse2_unaligned)
 	      IFUNC_IMPL_ADD (array, i, strstr, 1, __strstr_generic))
 
+    /* Support sysdeps/x86_64/multiarch/memmem.c.  */
+  IFUNC_IMPL (i, name, memmem,
+              IFUNC_IMPL_ADD (array, i, memmem,
+                              (CPU_FEATURE_USABLE (AVX512BW)
+                               && CPU_FEATURE_USABLE (BMI1)),
+                              __memmem_avx512)
+              IFUNC_IMPL_ADD (array, i, memmem,
+		              (CPU_FEATURE_USABLE (AVX2)
+			      && CPU_FEATURE_USABLE (BMI1)),
+			      __memmem_avx2)
+	      IFUNC_IMPL_ADD (array, i, memmem, 1, __memmem_generic))
+
   /* Support sysdeps/x86_64/multiarch/wcschr.c.  */
   IFUNC_IMPL (i, name, wcschr,
 	      X86_IFUNC_IMPL_ADD_V4 (array, i, wcschr,
diff --git a/sysdeps/x86_64/multiarch/memmem-avx-base.c b/sysdeps/x86_64/multiarch/memmem-avx-base.c
new file mode 100644
index 0000000000..212d75c96f
--- /dev/null
+++ b/sysdeps/x86_64/multiarch/memmem-avx-base.c
@@ -0,0 +1,20 @@
+const unsigned char ___rarebyte_table[256] attribute_hidden
+    = { 0,   1,	  13,  56,  59,	 60,  61,  62,	63,  232, 248, 2,   158, 4,
+	5,   6,	  7,   8,   9,	 10,  14,  20,	26,  29,  37,  46,  52,	 53,
+	54,  55,  57,  58,  255, 172, 242, 193, 162, 174, 178, 182, 218, 219,
+	212, 180, 249, 197, 221, 210, 253, 231, 230, 224, 225, 226, 227, 223,
+	222, 220, 176, 213, 184, 229, 188, 164, 159, 209, 181, 203, 189, 216,
+	196, 192, 185, 205, 161, 168, 215, 187, 211, 194, 195, 165, 206, 204,
+	214, 198, 173, 179, 175, 183, 167, 202, 239, 201, 160, 241, 163, 246,
+	233, 238, 240, 254, 237, 208, 234, 250, 169, 186, 236, 217, 245, 243,
+	228, 170, 247, 244, 251, 235, 199, 200, 252, 207, 177, 191, 171, 190,
+	166, 3,	  140, 134, 124, 126, 86,  128, 95,  117, 114, 93,  81,	 87,
+	132, 96,  112, 97,  103, 82,  139, 89,	98,  88,  119, 74,  156, 115,
+	104, 75,  120, 106, 76,	 155, 90,  122, 107, 125, 152, 145, 136, 137,
+	101, 116, 102, 108, 99,	 141, 77,  78,	118, 79,  109, 100, 150, 73,
+	94,  72,  121, 151, 113, 135, 110, 105, 83,  91,  11,  12,  64,	 149,
+	146, 111, 65,  69,  66,	 15,  16,  17,	18,  19,  130, 92,  144, 123,
+	21,  22,  23,  24,  131, 133, 127, 142, 25,  70,  129, 27,  28,	 67,
+	153, 84,  143, 138, 147, 157, 148, 68,	71,  30,  31,  32,  33,	 34,
+	35,  36,  154, 38,  39,	 40,  41,  42,	80,  43,  44,  45,  47,	 48,
+	85,  49,  50,  51 };
diff --git a/sysdeps/x86_64/multiarch/memmem-avx-base.h b/sysdeps/x86_64/multiarch/memmem-avx-base.h
new file mode 100644
index 0000000000..08941798ff
--- /dev/null
+++ b/sysdeps/x86_64/multiarch/memmem-avx-base.h
@@ -0,0 +1,191 @@
+#include <immintrin.h>
+#include <inttypes.h>
+#include <string.h>
+#include <libc-pointer-arith.h>
+
+#ifndef FUNC_NAME
+#  define FUNC_NAME __memmem_avx2
+#endif
+#ifndef VEC
+#  define VEC __m256i
+#endif
+#ifndef MASK
+#  define MASK uint32_t
+#endif
+#ifndef LOAD
+#  define LOAD(x) _mm256_load_si256 (x)
+#endif
+#ifndef LOADU
+#  define LOADU(x) _mm256_loadu_si256 (x)
+#endif
+#ifndef CMPEQ8_MASK
+#  define CMPEQ8_MASK(x, y) _mm256_movemask_epi8 (_mm256_cmpeq_epi8 (x, y))
+#endif
+#ifndef SETONE8
+#  define SETONE8(x) _mm256_set1_epi8 (x)
+#endif
+#ifndef TZCNT
+#  define TZCNT(x) _tzcnt_u32 (x)
+#endif
+#ifndef BLSR
+#  define BLSR(x) _blsr_u32 (x)
+#endif
+#define VEC_SIZE sizeof (VEC)
+#define ONES ((MASK) -1)
+
+#ifndef MEMCMPEQ
+#  define MEMCMPEQ __memcmpeq
+#endif
+#ifndef MEMCPY
+#  define MEMCPY memcpy
+#endif
+#ifndef MEMCHR
+#  define MEMCHR memchr
+#endif
+#ifndef PAGE_SIZE
+#  define PAGE_SIZE 4096
+#endif
+#define MIN(x, y) (((x) < (y)) ? (x) : (y))
+
+extern void *__memmem_generic (const void *, size_t, const void *,
+			       size_t) attribute_hidden;
+
+/* Lower is rarer. The table is based on the *.c and *.h files in glibc. */
+extern const unsigned char ___rarebyte_table[256] attribute_hidden;
+
+static inline void *__attribute__ ((always_inline))
+find_rarest_byte (const unsigned char *rare, size_t n)
+{
+  const unsigned char *p = (const unsigned char *) rare;
+  int c_rare = ___rarebyte_table[*rare];
+  int c;
+  for (; n--; ++p)
+    {
+      c = ___rarebyte_table[*p];
+      if (c < c_rare)
+	{
+	  rare = p;
+	  c_rare = c;
+	}
+    }
+  return (void *) rare;
+}
+
+void *
+FUNC_NAME (const void *hs, size_t hs_len, const void *ne, size_t ne_len)
+{
+  if (ne_len == 1)
+    return (void *) MEMCHR (hs, *(unsigned char *) ne, hs_len);
+  if (__glibc_unlikely (ne_len == 0))
+    return (void *) hs;
+  if (__glibc_unlikely (hs_len < ne_len))
+    return NULL;
+  /* Linear-time worst-case performance is guaranteed by the generic
+   * implementation using the Two-Way algorithm. */
+  if (__glibc_unlikely (ne_len > 256))
+    return __memmem_generic (hs, hs_len, ne, ne_len);
+  VEC hv0, hv1, hv, nv;
+  MASK i, hm0, hm1, m, cmpm;
+  const unsigned int matchsh = ne_len < VEC_SIZE ? VEC_SIZE - ne_len : 0;
+  const MASK matchm = ONES << matchsh;
+  const unsigned char *h = (const unsigned char *) hs;
+  const unsigned char *const end = h + hs_len - ne_len;
+  const unsigned char *hp;
+  size_t rare = PTR_DIFF (
+      find_rarest_byte ((const unsigned char *) ne, MIN (ne_len, VEC_SIZE)),
+      ne);
+  /* RARE will always be the first byte to find.
+     If RARE is at the end of the needle, use the byte before it. */
+  if (rare == MIN (ne_len, VEC_SIZE) - 1)
+    --rare;
+  const VEC nv0 = SETONE8 (*((char *) ne + rare));
+  const VEC nv1 = SETONE8 (*((char *) ne + rare + 1));
+  unsigned int off_e = (PTR_DIFF (end, h) < VEC_SIZE)
+			   ? VEC_SIZE - (unsigned int) (end - h) - 1
+			   : 0;
+  /* Start from the position of RARE. */
+  h += rare;
+  /* Load the needle vector. */
+  if (((uintptr_t) ne & (PAGE_SIZE - 1)) > (PAGE_SIZE - VEC_SIZE)
+      || ne_len >= VEC_SIZE)
+    nv = LOADU ((const VEC *) ne);
+  else
+    MEMCPY (&nv, ne, MIN (VEC_SIZE, ne_len));
+  const unsigned int off_s = PTR_DIFF (h, PTR_ALIGN_DOWN (h, VEC_SIZE));
+  /* Align down to VEC_SIZE. */
+  h -= off_s;
+  hv0 = LOAD ((const VEC *) h);
+  hm0 = (MASK) CMPEQ8_MASK (hv0, nv0);
+  hm1 = (MASK) CMPEQ8_MASK (hv0, nv1) >> 1;
+  /* Clear the irrelevant bits from aligning down (OFF_S) and ones that are out
+   * of bounds (OFF_E). */
+  m = ((hm0 & hm1) >> off_s) & (ONES >> off_e);
+  while (m)
+    {
+      i = TZCNT (m);
+      m = BLSR (m);
+      hp = h + off_s + i - rare;
+      if (PTR_DIFF (PTR_ALIGN_UP (hp, PAGE_SIZE), hp) >= VEC_SIZE)
+	{
+	  /* Do a vector compare if we are not crossing a page. */
+	  hv = LOADU ((const VEC *) hp);
+	  cmpm = (MASK) CMPEQ8_MASK (hv, nv) << matchsh;
+	  /* Compare only the relevant bits of the needle vector. */
+	  if (cmpm == matchm)
+	    /* Compare the rest of the needle. */
+	    if (ne_len <= VEC_SIZE
+		|| !MEMCMPEQ (hp + VEC_SIZE, (const char *) ne + VEC_SIZE,
+			      ne_len - VEC_SIZE))
+	      return (void *) hp;
+	}
+      else
+	{
+	  if (!MEMCMPEQ (hp, ne, ne_len))
+	    return (void *) hp;
+	}
+    }
+  h += VEC_SIZE - 1;
+  for (; h - rare + VEC_SIZE <= end; h += VEC_SIZE)
+    {
+      hv0 = LOADU ((const VEC *) h);
+      hv1 = LOAD ((const VEC *) (h + 1));
+      hm1 = (MASK) CMPEQ8_MASK (hv1, nv1);
+      hm0 = (MASK) CMPEQ8_MASK (hv0, nv0);
+      m = hm0 & hm1;
+      while (m)
+	{
+	match:
+	  i = TZCNT (m);
+	  m = BLSR (m);
+	  hp = h + i - rare;
+	  if (PTR_DIFF (PTR_ALIGN_UP (hp, PAGE_SIZE), hp) >= VEC_SIZE)
+	    {
+	      hv = LOADU ((const VEC *) hp);
+	      cmpm = (MASK) CMPEQ8_MASK (hv, nv) << matchsh;
+	      if (cmpm == matchm)
+		if (ne_len <= VEC_SIZE
+		    || !MEMCMPEQ (hp + VEC_SIZE, (const char *) ne + VEC_SIZE,
+				  ne_len - VEC_SIZE))
+		  return (void *) hp;
+	    }
+	  else
+	    {
+	      if (!MEMCMPEQ (hp, ne, ne_len))
+		return (void *) hp;
+	    }
+	}
+    }
+  if (h - rare <= end)
+    {
+      off_e = VEC_SIZE - (unsigned int) (end - (h - rare)) - 1;
+      hv0 = LOADU ((const VEC *) h);
+      hv1 = LOAD ((const VEC *) (h + 1));
+      hm1 = (MASK) CMPEQ8_MASK (hv1, nv1);
+      hm0 = (MASK) CMPEQ8_MASK (hv0, nv0);
+      /* Clear the irrelevant bits that are out of bounds. */
+      m = hm0 & hm1 & (ONES >> off_e);
+      if (m)
+	goto match;
+    }
+  return NULL;
+}
diff --git a/sysdeps/x86_64/multiarch/memmem-avx2.c b/sysdeps/x86_64/multiarch/memmem-avx2.c
new file mode 100644
index 0000000000..91f5d5d331
--- /dev/null
+++ b/sysdeps/x86_64/multiarch/memmem-avx2.c
@@ -0,0 +1,3 @@
+#define FUNC_NAME __memmem_avx2
+
+#include "memmem-avx-base.h"
diff --git a/sysdeps/x86_64/multiarch/memmem-avx512.c b/sysdeps/x86_64/multiarch/memmem-avx512.c
new file mode 100644
index 0000000000..76016c1cfe
--- /dev/null
+++ b/sysdeps/x86_64/multiarch/memmem-avx512.c
@@ -0,0 +1,12 @@
+#define VEC __m512i
+#define MASK uint64_t
+#define LOAD(x) _mm512_load_si512 (x)
+#define LOADU(x) _mm512_loadu_si512 (x)
+#define CMPEQ8_MASK(x, y) _mm512_cmpeq_epi8_mask (x, y)
+#define SETONE8(x) _mm512_set1_epi8 (x)
+#define TZCNT(x) _tzcnt_u64 (x)
+#define BLSR(x) _blsr_u64 (x)
+
+#define FUNC_NAME __memmem_avx512
+
+#include "memmem-avx-base.h"
diff --git a/sysdeps/x86_64/multiarch/memmem.c b/sysdeps/x86_64/multiarch/memmem.c
new file mode 100644
index 0000000000..8fe7b77d33
--- /dev/null
+++ b/sysdeps/x86_64/multiarch/memmem.c
@@ -0,0 +1,67 @@
+/* Multiple versions of memmem.
+   All versions must be listed in ifunc-impl-list.c.
+   Copyright (C) 2012-2023 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+/* Redefine memmem so that the compiler won't complain about the type
+   mismatch with the IFUNC selector in strong_alias, below.  */
+#undef  memmem
+#define memmem __redirect_memmem
+#include <string.h>
+#undef  memmem
+
+#define MEMMEM __memmem_generic
+#ifdef SHARED
+# undef libc_hidden_builtin_def
+# define libc_hidden_builtin_def(name) \
+  __hidden_ver1 (__memmem_generic, __GI_memmem, __memmem_generic);
+#endif
+
+#include "string/memmem.c"
+
+extern __typeof (__redirect_memmem) __memmem_avx2 attribute_hidden;
+extern __typeof (__redirect_memmem) __memmem_generic attribute_hidden;
+extern __typeof (__redirect_memmem) __memmem_avx512 attribute_hidden;
+
+#define SYMBOL_NAME memmem
+
+#include "init-arch.h"
+
+/* Avoid DWARF definition DIE on ifunc symbol so that GDB can handle
+   ifunc symbol properly.  */
+extern __typeof (__redirect_memmem) __libc_memmem;
+
+static inline void *
+IFUNC_SELECTOR (void)
+{
+  const struct cpu_features *cpu_features = __get_cpu_features ();
+
+  if (!CPU_FEATURES_ARCH_P (cpu_features, Prefer_No_AVX512)
+      && CPU_FEATURE_USABLE_P (cpu_features, AVX512BW)
+      && CPU_FEATURE_USABLE_P (cpu_features, BMI1))
+    return __memmem_avx512;
+
+  if (CPU_FEATURE_USABLE_P (cpu_features, AVX2)
+      && CPU_FEATURE_USABLE_P (cpu_features, BMI1))
+    return __memmem_avx2;
+
+  return __memmem_generic;
+}
+
+libc_ifunc_redirected (__redirect_memmem, __libc_memmem, IFUNC_SELECTOR ());
+#undef memmem
+strong_alias (__libc_memmem, __memmem)
-- 
2.43.2


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7] sysdeps/x86_64/multiarch/memmem-avx2.c: add memmem-avx2.c
  2024-02-21  6:57 ` [PATCH v7] " James Tirta Halim
@ 2024-02-21 17:17   ` Noah Goldstein
  2024-02-21 20:30     ` Alexander Monakov
  2024-02-24  4:25     ` James
  2024-02-24  9:09   ` [PATCH v8] " James Tirta Halim
  2024-02-24  9:29   ` James Tirta Halim
  2 siblings, 2 replies; 25+ messages in thread
From: Noah Goldstein @ 2024-02-21 17:17 UTC (permalink / raw)
  To: James Tirta Halim; +Cc: libc-alpha

On Wed, Feb 21, 2024 at 12:58 AM James Tirta Halim
<tirtajames45@gmail.com> wrote:
>
> Find the rarest byte in NE. Find the parts of HS that matches the rare byte
> and the byte after it. If found, shift back to the start of NE in HS and
> vector compare the first VEC_SIZE with NE. If matches, compare the rest
> with MEMCMPEQ.
>
> Timings (Core i3-1115G4):
> basic_memmem twoway_memmem __memmem_avx512 __memmem_avx2
> __memmem_generic
> Total:
> 6.80124e+06 1.06087e+06 219483 345385 768041
> Average:
> 25958.9 4049.11 837.721 1318.26 2931.45
>
> Passes make check.
>
> Changes in v1:
> 1. Add memmem-avx2.c
>
> Changes in v2:
> 1. Add avx512 support with a generic header file
> 2. Use __memcmpeq instead of memcmp
> 3. Remove scalar loop
> 4. Fix unsafe unaligned load
>
> Changes in v3:
> 1. Avoid checking for alignment to the start of the page since that will be rare
> 2. Use __memcmpeq instead of __memcmpeq_avx2 (it generates undefined
> reference errors)
> 3. Add memmem.c (needs review)
> 4. Add __memcmpeq_avx2 and __memcmpeq_avx512 to ifunc-impl-list.c (needs
> review)
> 5. Add libc_hidden_builtin_def and MEMMEM to memmem.c (needs review)
>
> Changes in v4:
> 1. Correct the cpu feature checks in ifunc-impl-list.c and memmem.c to
> use AVX512BW and BMI1 for AVX512 and AVX2 and BMI1 for AVX2
> 2. Correct the Makefile to use the appropriate flags
> 3. Rename memmem-vectorized-avx.h to memmem-avx-base.h
> 4. Remove unused vector macros (POPCNT and LZCNT)
>
> Changes in v5:
> 1. Rename SHIFT to RARE, OFF to OFF_S, OFF2 to OFF_E
> 2. Remove conditional for VEC_SIZE and ONES, and remove unused MASK_SIZE
> 3. Add comments
> 4. Limit needle length to VEC_SIZE when finding the rare byte
>
> Changes in v6:
> 1. Fix patch apply error in memmem.c
> 2. Correctly use MIN(ne_len, VEC_SIZE) when checking if RARE is found at the end
> of needle
> 3. Always do unaligned load at the tail code
> 4. Rename rarebyte_table to ___rarebyte_table
> 5. Add memmem-avx-base.c in which ___rarebyte_table is defined
> 6. Add memmem-avx-base to the Makefile
> 7. Add always_inline to find_rarest_byte
> 8. Change ((m << off) >> off) to (m & (ONES >> off))
> 9. Change void * to unsigned char * in find_rarest_byte
>
> Changes in v7:
> 1. Fallback to generic memmem for long needles for guaranteed
> linear-time worst-case performance
> 2. Use memmem instead of MEMMEM for libc_hidden_builtin_def in
> memmem.c (string/memmem.c and sysdeps/x86_64/multiarch/memmem.c may
> still need to be fixed for non-x86_64 builds to work. The changes were
> made following string/strstr.c and sysdeps/x86_64/multiarch/strstr.c)
> 3. Change some (VEC *) casts to (const VEC *)
>
> ---
>  string/memmem.c                            |   7 +-
>  sysdeps/x86_64/multiarch/Makefile          |   6 +
>  sysdeps/x86_64/multiarch/ifunc-impl-list.c |  12 ++
>  sysdeps/x86_64/multiarch/memmem-avx-base.c |  20 +++
>  sysdeps/x86_64/multiarch/memmem-avx-base.h | 191 +++++++++++++++++++++
>  sysdeps/x86_64/multiarch/memmem-avx2.c     |   3 +
>  sysdeps/x86_64/multiarch/memmem-avx512.c   |  12 ++
>  sysdeps/x86_64/multiarch/memmem.c          |  67 ++++++++
>  8 files changed, 317 insertions(+), 1 deletion(-)
>  create mode 100644 sysdeps/x86_64/multiarch/memmem-avx-base.c
>  create mode 100644 sysdeps/x86_64/multiarch/memmem-avx-base.h
>  create mode 100644 sysdeps/x86_64/multiarch/memmem-avx2.c
>  create mode 100644 sysdeps/x86_64/multiarch/memmem-avx512.c
>  create mode 100644 sysdeps/x86_64/multiarch/memmem.c
>
> diff --git a/string/memmem.c b/string/memmem.c
> index a4117f8e1e..0a89bd5f7c 100644
> --- a/string/memmem.c
> +++ b/string/memmem.c
> @@ -25,6 +25,10 @@
>  # define __memmem      memmem
>  #endif
>
> +#ifndef MEMMEM
> +# define MEMMEM __memmem
> +#endif
> +
>  #define RETURN_TYPE void *
>  #define AVAILABLE(h, h_l, j, n_l) ((j) <= (h_l) - (n_l))
>  #define FASTSEARCH(S,C,N) (void*) memchr ((void *)(S), (C), (N))
> @@ -50,7 +54,7 @@
>     The limit also implies worst-case performance is linear.
>     Needles larger than 256 characters use the linear-time Two-Way algorithm.  */
>  void *
> -__memmem (const void *haystack, size_t hs_len,
> +MEMMEM (const void *haystack, size_t hs_len,
>           const void *needle, size_t ne_len)
>  {
>    const unsigned char *hs = (const unsigned char *) haystack;
> @@ -127,3 +131,4 @@ __memmem (const void *haystack, size_t hs_len,
>  libc_hidden_def (__memmem)
>  weak_alias (__memmem, memmem)
>  libc_hidden_weak (memmem)
> +libc_hidden_builtin_def (memmem)
> diff --git a/sysdeps/x86_64/multiarch/Makefile b/sysdeps/x86_64/multiarch/Makefile
> index d3d2270394..0b46d5f341 100644
> --- a/sysdeps/x86_64/multiarch/Makefile
> +++ b/sysdeps/x86_64/multiarch/Makefile
> @@ -15,6 +15,9 @@ sysdep_routines += \
>    memcmpeq-avx2-rtm \
>    memcmpeq-evex \
>    memcmpeq-sse2 \
> +  memmem-avx-base \
> +  memmem-avx2 \
> +  memmem-avx512 \
>    memmove-avx-unaligned-erms \
>    memmove-avx-unaligned-erms-rtm \
>    memmove-avx512-no-vzeroupper \
> @@ -122,6 +125,9 @@ sysdep_routines += \
>    varshift \
>  # sysdep_routines
>
> +CFLAGS-memmem-avx2.c += -mavx2 -mbmi -O3
> +CFLAGS-memmem-avx512.c += -mavx512f -mavx512bw -mbmi -O3
> +
>  CFLAGS-strcspn-sse4.c += -msse4
>  CFLAGS-strpbrk-sse4.c += -msse4
>  CFLAGS-strspn-sse4.c += -msse4
> diff --git a/sysdeps/x86_64/multiarch/ifunc-impl-list.c b/sysdeps/x86_64/multiarch/ifunc-impl-list.c
> index c4a21d4b7c..20a8b85da9 100644
> --- a/sysdeps/x86_64/multiarch/ifunc-impl-list.c
> +++ b/sysdeps/x86_64/multiarch/ifunc-impl-list.c
> @@ -799,6 +799,18 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
>               IFUNC_IMPL_ADD (array, i, strstr, 1, __strstr_sse2_unaligned)
>               IFUNC_IMPL_ADD (array, i, strstr, 1, __strstr_generic))
>
> +    /* Support sysdeps/x86_64/multiarch/memmem.c.  */
> +  IFUNC_IMPL (i, name, memmem,
> +              IFUNC_IMPL_ADD (array, i, memmem,
> +                              (CPU_FEATURE_USABLE (AVX512BW)
> +                               && CPU_FEATURE_USABLE (BMI1)),
> +                              __memmem_avx512)
> +              IFUNC_IMPL_ADD (array, i, memmem,
> +                             (CPU_FEATURE_USABLE (AVX2)
> +                             && CPU_FEATURE_USABLE (BMI1)),
> +                             __memmem_avx2)
> +             IFUNC_IMPL_ADD (array, i, memmem, 1, __memmem_generic))
> +
>    /* Support sysdeps/x86_64/multiarch/wcschr.c.  */
>    IFUNC_IMPL (i, name, wcschr,
>               X86_IFUNC_IMPL_ADD_V4 (array, i, wcschr,
> diff --git a/sysdeps/x86_64/multiarch/memmem-avx-base.c b/sysdeps/x86_64/multiarch/memmem-avx-base.c
> new file mode 100644
> index 0000000000..212d75c96f
> --- /dev/null
> +++ b/sysdeps/x86_64/multiarch/memmem-avx-base.c
> @@ -0,0 +1,20 @@
> +const unsigned char ___rarebyte_table[256] attribute_hidden
> +    = { 0,   1,          13,  56,  59,  60,  61,  62,  63,  232, 248, 2,   158, 4,
> +       5,   6,   7,   8,   9,   10,  14,  20,  26,  29,  37,  46,  52,  53,
> +       54,  55,  57,  58,  255, 172, 242, 193, 162, 174, 178, 182, 218, 219,
> +       212, 180, 249, 197, 221, 210, 253, 231, 230, 224, 225, 226, 227, 223,
> +       222, 220, 176, 213, 184, 229, 188, 164, 159, 209, 181, 203, 189, 216,
> +       196, 192, 185, 205, 161, 168, 215, 187, 211, 194, 195, 165, 206, 204,
> +       214, 198, 173, 179, 175, 183, 167, 202, 239, 201, 160, 241, 163, 246,
> +       233, 238, 240, 254, 237, 208, 234, 250, 169, 186, 236, 217, 245, 243,
> +       228, 170, 247, 244, 251, 235, 199, 200, 252, 207, 177, 191, 171, 190,
> +       166, 3,   140, 134, 124, 126, 86,  128, 95,  117, 114, 93,  81,  87,
> +       132, 96,  112, 97,  103, 82,  139, 89,  98,  88,  119, 74,  156, 115,
> +       104, 75,  120, 106, 76,  155, 90,  122, 107, 125, 152, 145, 136, 137,
> +       101, 116, 102, 108, 99,  141, 77,  78,  118, 79,  109, 100, 150, 73,
> +       94,  72,  121, 151, 113, 135, 110, 105, 83,  91,  11,  12,  64,  149,
> +       146, 111, 65,  69,  66,  15,  16,  17,  18,  19,  130, 92,  144, 123,
> +       21,  22,  23,  24,  131, 133, 127, 142, 25,  70,  129, 27,  28,  67,
> +       153, 84,  143, 138, 147, 157, 148, 68,  71,  30,  31,  32,  33,  34,
> +       35,  36,  154, 38,  39,  40,  41,  42,  80,  43,  44,  45,  47,  48,
> +       85,  49,  50,  51 };
> diff --git a/sysdeps/x86_64/multiarch/memmem-avx-base.h b/sysdeps/x86_64/multiarch/memmem-avx-base.h
> new file mode 100644
> index 0000000000..08941798ff
> --- /dev/null
> +++ b/sysdeps/x86_64/multiarch/memmem-avx-base.h
> @@ -0,0 +1,191 @@
> +#include <immintrin.h>
> +#include <inttypes.h>
> +#include <string.h>
> +#include <libc-pointer-arith.h>
> +
> +#ifndef FUNC_NAME
> +#  define __memmem_avx2
> +#endif
> +#ifndef VEC
> +#  define VEC __m256i
> +#endif
> +#ifndef MASK
> +#  define MASK uint32_t
> +#endif
> +#ifndef LOAD
> +#  define LOAD(x) _mm256_load_si256 (x)
> +#endif
> +#ifndef LOADU
> +#  define LOADU(x) _mm256_loadu_si256 (x)
> +#endif
> +#ifndef CMPEQ8_MASK
> +#  define CMPEQ8_MASK(x, y) _mm256_movemask_epi8 (_mm256_cmpeq_epi8 (x, y))
> +#endif
> +#ifndef SETONE8
> +#  define SETONE8(x) _mm256_set1_epi8 (x)
> +#endif
> +#ifndef TZCNT
> +#  define TZCNT(x) _tzcnt_u32 (x)
> +#endif
Use `__builtin_ctz`
> +#ifndef BLSR
> +#  define BLSR(x) _blsr_u32 (x)
> +#endif

Think you can drop the `BLSR` define (here and in the avx512)
and just replace with `((x) & ((x) - 1))`
any reasonable compiler will optimize that correctly.
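For illustration, something like this (a sketch of the suggested
replacements, not code from the patch; the avx512 variant's 64-bit MASK
would use __builtin_ctzll, and both are only applied to nonzero masks):

#define TZCNT(x) __builtin_ctz (x)   /* index of the lowest set bit */
#define BLSR(x) ((x) & ((x) - 1))    /* clear the lowest set bit */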
> +#define VEC_SIZE sizeof (VEC)
> +#define ONES ((MASK) -1)
> +
> +#ifndef MEMCMPEQ
> +#  define MEMCMPEQ __memcmpeq
> +#endif
> +#ifndef MEMCPY
> +#  define MEMCPY memcpy
> +#endif
> +#ifndef MEMCHR
> +#  define MEMCHR memchr
> +#endif
> +#ifndef PAGE_SIZE
> +#  define PAGE_SIZE 4096
> +#endif
> +#define MIN(x, y) (((x) < (y)) ? (x) : (y))
> +
> +extern void *__memmem_generic (const void *, size_t, const void *,
> +                              size_t) attribute_hidden;
> +
> +/* Lower is rarer. The table is based on the *.c and *.h files in glibc. */
> +extern const unsigned char ___rarebyte_table[256] attribute_hidden;
> +
> +static inline void *__attribute__ ((always_inline))
> +find_rarest_byte (const unsigned char *rare, size_t n)
> +{
> +  const unsigned char *p = (const unsigned char *) rare;
> +  int c_rare = ___rarebyte_table[*rare];
> +  int c;
> +  for (; n--; ++p)
> +    {
> +      c = ___rarebyte_table[*p];
> +      if (c < c_rare)
> +       {
> +         rare = p;
> +         c_rare = c;
> +       }
> +    }
> +  return (void *) rare;
> +}
> +
> +void *
> +FUNC_NAME (const void *hs, size_t hs_len, const void *ne, size_t ne_len)
> +{
> +  if (ne_len == 1)
> +    return (void *) MEMCHR (hs, *(unsigned char *) ne, hs_len);
> +  if (__glibc_unlikely (ne_len == 0))
> +    return (void *) hs;
> +  if (__glibc_unlikely (hs_len < ne_len))
> +    return NULL;
> +  /* Linear-time worst-case performance is guaranteed by the generic
> +   * implementation using the Two-Way algorithm. */
> +  if (__glibc_unlikely (ne_len > 256))
> +    return __memmem_generic (hs, hs_len, ne, ne_len)
Think this impl makes sense up to VEC_SIZE * 1 + 1, but after that
it doesn't seem to have that much advantage.
> +  VEC hv0, hv1, hv, nv;
> +  MASK i, hm0, hm1, m, cmpm;
> +  const unsigned int matchsh = ne_len < VEC_SIZE ? VEC_SIZE - ne_len : 0;
> +  const MASK matchm = ONES << matchsh;
> +  const unsigned char *h = (const unsigned char *) hs;
> +  const unsigned char *const end = h + hs_len - ne_len;
> +  const unsigned char *hp;
> +  size_t rare = PTR_DIFF (
> +      find_rarest_byte ((const unsigned char *) ne, MIN (ne_len, VEC_SIZE)),
> +      ne);
> +  /* RARE will always be the first byte to find.
> +     If RARE is at the end of the needle, use the byte before it. */
> +  if (rare == MIN (ne_len, VEC_SIZE) - 1)
> +    --rare;
> +  const VEC nv0 = SETONE8 (*((char *) ne + rare));
> +  const VEC nv1 = SETONE8 (*((char *) ne + rare + 1));
> +  unsigned int off_e = (PTR_DIFF (end, h) < VEC_SIZE)
> +                          ? VEC_SIZE - (unsigned int) (end - h) - 1
> +                          : 0;
> +  /* Start from the position of RARE. */
> +  h += rare;
> +  /* Load the needle vector. */
> +  if (((uintptr_t) ne & (PAGE_SIZE - 1)) > (PAGE_SIZE - VEC_SIZE)
> +      || ne_len >= VEC_SIZE)
the `ne_len >= VEC_SIZE` should probably be the first check here.
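i.e. roughly (the same condition with the operands swapped so the cheap,
common length test short-circuits the page-offset computation; untested):

if (ne_len >= VEC_SIZE
    || ((uintptr_t) ne & (PAGE_SIZE - 1)) > (PAGE_SIZE - VEC_SIZE))
  nv = LOADU ((const VEC *) ne);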
> +    nv = LOADU ((const VEC *) ne);
> +  else
> +    MEMCPY (&nv, ne, MIN (VEC_SIZE, ne_len));
> +  const unsigned int off_s = PTR_DIFF (h, PTR_ALIGN_DOWN (h, VEC_SIZE));
> +  /* Align down to VEC_SIZE. */
> +  h -= off_s;
> +  hv0 = LOAD ((const VEC *) h);
> +  hm0 = (MASK) CMPEQ8_MASK (hv0, nv0);
> +  hm1 = (MASK) CMPEQ8_MASK (hv0, nv1) >> 1;
> +  /* Clear the irrelevant bits from aligning down (OFF_S) and ones that are out
> +   * of bounds (OFF_E). */
> +  m = ((hm0 & hm1) >> off_s) & (ONES >> off_e);
> +  while (m)
> +    {
> +      i = TZCNT (m);
> +      m = BLSR (m);
> +      hp = h + off_s + i - rare;
> +      if (PTR_DIFF (PTR_ALIGN_UP (hp, PAGE_SIZE), hp) >= VEC_SIZE)
> +       {
> +         /* Do a vector compare if we are not crossing a page. */
> +         hv = LOADU ((const VEC *) hp);
> +         cmpm = (MASK) CMPEQ8_MASK (hv, nv) << matchsh;
> +         /* Compare only the relevant bits of the needle vector. */
> +         if (cmpm == matchm)
> +           /* Compare the rest of the needle. */
> +           if (ne_len <= VEC_SIZE
> +               || !MEMCMPEQ (hp + VEC_SIZE, (const char *) ne + VEC_SIZE,
> +                             ne_len - VEC_SIZE))
> +             return (void *) hp;
> +       }
> +      else
> +       {
> +         if (!MEMCMPEQ (hp, ne, ne_len))
> +           return (void *) hp;
think (assuming you bound ne_len <= ~VEC_SIZE * 2), you can
just make a little inline impl of this that will be much faster
than a call to __memcmpeq.
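Something along these lines, for instance (hypothetical sketch under the
ne_len <= 2 * VEC_SIZE bound suggested above; not code from the patch):

/* Return zero iff the N bytes at A and B are equal.  For such small N
   a simple byte loop avoids the out-of-line __memcmpeq call.  */
static inline int __attribute__ ((always_inline))
short_memcmpeq (const unsigned char *a, const unsigned char *b, size_t n)
{
  size_t i;
  for (i = 0; i < n; i++)
    if (a[i] != b[i])
      return 1;
  return 0;
}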
> +       }
> +    }
> +  h += VEC_SIZE - 1;
> +  for (; h - rare + VEC_SIZE <= end; h += VEC_SIZE)
> +    {
> +      hv0 = LOADU ((const VEC *) h);
> +      hv1 = LOAD ((const VEC *) (h + 1));
> +      hm1 = (MASK) CMPEQ8_MASK (hv1, nv1);
> +      hm0 = (MASK) CMPEQ8_MASK (hv0, nv0);
> +      m = hm0 & hm1;
> +      while (m)
> +       {
> +       match:
> +         i = TZCNT (m);
> +         m = BLSR (m);
> +         hp = h + i - rare;
> +         if (PTR_DIFF (PTR_ALIGN_UP (hp, PAGE_SIZE), hp) >= VEC_SIZE)
> +           {
> +             hv = LOADU ((const VEC *) hp);
> +             cmpm = (MASK) CMPEQ8_MASK (hv, nv) << matchsh;
> +             if (cmpm == matchm)
> +               if (ne_len <= VEC_SIZE
> +                   || !MEMCMPEQ (hp + VEC_SIZE, (const char *) ne + VEC_SIZE,
> +                                 ne_len - VEC_SIZE))
> +                 return (void *) hp;
> +           }
> +         else
> +           {
> +             if (!MEMCMPEQ (hp, ne, ne_len))
> +               return (void *) hp;
> +           }
> +       }
> +    }
> +  if (h - rare <= end)
> +    {
> +      off_e = VEC_SIZE - (unsigned int) (end - (h - rare)) - 1;
> +      hv0 = LOADU ((const VEC *) h);
> +      hv1 = LOAD ((const VEC *) (h + 1));
> +      hm1 = (MASK) CMPEQ8_MASK (hv1, nv1);
> +      hm0 = (MASK) CMPEQ8_MASK (hv0, nv0);
> +      /* Clear the irrelevant bits that are out of bounds. */
> +      m = hm0 & hm1 & (ONES >> off_e);
> +      if (m)
> +       goto match;
> +    }
> +  return NULL;
> +}
> diff --git a/sysdeps/x86_64/multiarch/memmem-avx2.c b/sysdeps/x86_64/multiarch/memmem-avx2.c
> new file mode 100644
> index 0000000000..91f5d5d331
> --- /dev/null
> +++ b/sysdeps/x86_64/multiarch/memmem-avx2.c
> @@ -0,0 +1,3 @@
> +#define FUNC_NAME __memmem_avx2
> +
> +#include "memmem-avx-base.h"
> diff --git a/sysdeps/x86_64/multiarch/memmem-avx512.c b/sysdeps/x86_64/multiarch/memmem-avx512.c
> new file mode 100644
> index 0000000000..76016c1cfe
> --- /dev/null
> +++ b/sysdeps/x86_64/multiarch/memmem-avx512.c
> @@ -0,0 +1,12 @@
> +#define VEC __m512i
> +#define MASK uint64_t
> +#define LOAD(x) _mm512_load_si512 (x)
> +#define LOADU(x) _mm512_loadu_si512 (x)
> +#define CMPEQ8_MASK(x, y) _mm512_cmpeq_epi8_mask (x, y)
> +#define SETONE8(x) _mm512_set1_epi8 (x)
> +#define TZCNT(x) _tzcnt_u64 (x)
> +#define BLSR(x) _blsr_u64 (x)
> +
> +#define FUNC_NAME __memmem_avx512
> +
> +#include "memmem-avx-base.h"
> diff --git a/sysdeps/x86_64/multiarch/memmem.c b/sysdeps/x86_64/multiarch/memmem.c
> new file mode 100644
> index 0000000000..8fe7b77d33
> --- /dev/null
> +++ b/sysdeps/x86_64/multiarch/memmem.c
> @@ -0,0 +1,67 @@
> +/* Multiple versions of memmem.
> +   All versions must be listed in ifunc-impl-list.c.
> +   Copyright (C) 2012-2023 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <https://www.gnu.org/licenses/>.  */
> +
> +/* Redefine memmem so that the compiler won't complain about the type
> +   mismatch with the IFUNC selector in strong_alias, below.  */
> +#undef  memmem
> +#define memmem __redirect_memmem
> +#include <string.h>
> +#undef  memmem
> +
> +#define MEMMEM __memmem_generic
> +#ifdef SHARED
> +# undef libc_hidden_builtin_def
> +# define libc_hidden_builtin_def(name) \
> +  __hidden_ver1 (__memmem_generic, __GI_memmem, __memmem_generic);
> +#endif
> +
> +#include "string/memmem.c"
> +
> +extern __typeof (__redirect_memmem) __memmem_avx2 attribute_hidden;
> +extern __typeof (__redirect_memmem) __memmem_generic attribute_hidden;
> +extern __typeof (__redirect_memmem) __memmem_avx512 attribute_hidden;
> +
> +#define SYMBOL_NAME memmem
> +
> +#include "init-arch.h"
> +
> +/* Avoid DWARF definition DIE on ifunc symbol so that GDB can handle
> +   ifunc symbol properly.  */
> +extern __typeof (__redirect_memmem) __libc_memmem;
> +
> +static inline void *
> +IFUNC_SELECTOR (void)
> +{
> +  const struct cpu_features *cpu_features = __get_cpu_features ();
> +
> +  if (!CPU_FEATURES_ARCH_P (cpu_features, Prefer_No_AVX512)
> +      && CPU_FEATURE_USABLE_P (cpu_features, AVX512BW)
> +      && CPU_FEATURE_USABLE_P (cpu_features, BMI1))
> +    return __memmem_avx512;
> +
> +  if (CPU_FEATURE_USABLE_P (cpu_features, AVX2)
> +      && CPU_FEATURE_USABLE_P (cpu_features, BMI1))
> +    return __memmem_avx2;
> +
> +  return __memmem_generic;
> +}
> +
> +libc_ifunc_redirected (__redirect_memmem, __libc_memmem, IFUNC_SELECTOR ());
> +#undef memmem
> +strong_alias (__libc_memmem, __memmem)
> --
> 2.43.2
>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7] sysdeps/x86_64/multiarch/memmem-avx2.c: add memmem-avx2.c
  2024-02-21 17:17   ` Noah Goldstein
@ 2024-02-21 20:30     ` Alexander Monakov
  2024-02-21 22:17       ` Noah Goldstein
  2024-02-27 15:06       ` Rich Felker
  2024-02-24  4:25     ` James
  1 sibling, 2 replies; 25+ messages in thread
From: Alexander Monakov @ 2024-02-21 20:30 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: James Tirta Halim, libc-alpha


On Wed, 21 Feb 2024, Noah Goldstein wrote:

> > +#ifndef TZCNT
> > +#  define TZCNT(x) _tzcnt_u32 (x)
> > +#endif
> Use `__builtin_ctz`
> > +#ifndef BLSR
> > +#  define BLSR(x) _blsr_u32 (x)
> > +#endif
> 
> Think you can drop the `BLSR` define (here and in the avx512)
> and just replace with `((x) & ((x) - 1))`
> any reasonable compiler will optimize that correctly.

I am really confused why review of such minor technical details is happening
as if the proposed change is desirable and the goal is to include it in Glibc,
and algorithm-wise it's all fine including the relevance of rarebyte_table to
real-world uses of memmem and handling of page boundaries when iterating over
the haystack. Not to mention the necessity of carrying SIMD variants of memmem
in Glibc.

Alexander

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7] sysdeps/x86_64/multiarch/memmem-avx2.c: add memmem-avx2.c
  2024-02-21 20:30     ` Alexander Monakov
@ 2024-02-21 22:17       ` Noah Goldstein
  2024-02-23 17:27         ` Adhemerval Zanella Netto
  2024-02-27 15:06       ` Rich Felker
  1 sibling, 1 reply; 25+ messages in thread
From: Noah Goldstein @ 2024-02-21 22:17 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: James Tirta Halim, libc-alpha

On Wed, Feb 21, 2024 at 2:30 PM Alexander Monakov <amonakov@ispras.ru> wrote:
>
>
> On Wed, 21 Feb 2024, Noah Goldstein wrote:
>
> > > +#ifndef TZCNT
> > > +#  define TZCNT(x) _tzcnt_u32 (x)
> > > +#endif
> > Use `__builtin_ctz`
> > > +#ifndef BLSR
> > > +#  define BLSR(x) _blsr_u32 (x)
> > > +#endif
> >
> > Think you can drop the `BLSR` define (here and in the avx512)
> > and just replace with `((x) & ((x) - 1))`
> > any reasonable compiler will optimize that correctly.
>
> I am really confused why review of such minor technical details is happening
> as if the proposed change is desirable and the goal is to include it in Glibc,
> and algorithm-wise it's all fine including the relevance of rarebyte_table to
> real-world uses of memmem and handling of page boundaries when iterating over
> the haystack. Not to mention the necessity of carrying SIMD variants of memmem
> in Glibc.

Is there consensus that we don't want the change?
I thought we landed on roughly it's okay for ne_len <= ~VEC_SIZE
assuming it has a performance advantage in such cases.
>
> Alexander

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7] sysdeps/x86_64/multiarch/memmem-avx2.c: add memmem-avx2.c
  2024-02-21 22:17       ` Noah Goldstein
@ 2024-02-23 17:27         ` Adhemerval Zanella Netto
  2024-02-29 20:19           ` Alexander Monakov
  0 siblings, 1 reply; 25+ messages in thread
From: Adhemerval Zanella Netto @ 2024-02-23 17:27 UTC (permalink / raw)
  To: Noah Goldstein, Alexander Monakov; +Cc: James Tirta Halim, libc-alpha



On 21/02/24 19:17, Noah Goldstein wrote:
> On Wed, Feb 21, 2024 at 2:30 PM Alexander Monakov <amonakov@ispras.ru> wrote:
>>
>>
>> On Wed, 21 Feb 2024, Noah Goldstein wrote:
>>
>>>> +#ifndef TZCNT
>>>> +#  define TZCNT(x) _tzcnt_u32 (x)
>>>> +#endif
>>> Use `__builtin_ctz`
>>>> +#ifndef BLSR
>>>> +#  define BLSR(x) _blsr_u32 (x)
>>>> +#endif
>>>
>>> Think you can drop the `BLSR` define (here and in the avx512)
>>> and just replace with `((x) & ((x) - 1))`
>>> any reasonable compiler will optimize that correctly.
>>
>> I am really confused why review of such minor technical details is happening
>> as if the proposed change is desirable and the goal is to include it in Glibc,
>> and algorithm-wise it's all fine including the relevance of rarebyte_table to
>> real-world uses of memmem and handling of page boundaries when iterating over
>> the haystack. Not to mention the necessity of carrying SIMD variants of memmem
>> in Glibc.
> 
> Is there consensus that we don't want the change?
> I thought we landed on roughly it's okay for ne_len <= ~VEC_SIZE
> assuming it has a performance advantage in such cases.
>>

The patch needs something like:

index 0a89bd5f7c..8d0a1a2131 100644
--- a/string/memmem.c
+++ b/string/memmem.c
@@ -131,4 +131,3 @@ MEMMEM (const void *haystack, size_t hs_len,
 libc_hidden_def (__memmem)
 weak_alias (__memmem, memmem)
 libc_hidden_weak (memmem)
-libc_hidden_builtin_def (memmem)
diff --git a/sysdeps/x86_64/multiarch/memmem.c b/sysdeps/x86_64/multiarch/memmem.c
index 8fe7b77d33..66fe304f93 100644
--- a/sysdeps/x86_64/multiarch/memmem.c
+++ b/sysdeps/x86_64/multiarch/memmem.c
@@ -26,8 +26,8 @@

 #define MEMMEM __memmem_generic
 #ifdef SHARED
-# undef libc_hidden_builtin_def
-# define libc_hidden_builtin_def(name) \
+# undef libc_hidden_weak
+# define libc_hidden_weak(name) \
   __hidden_ver1 (__memmem_generic, __GI_memmem, __memmem_generic);
 #endif

To avoid breaking builds on other architectures.  There are minor issues with the
patch, like a missing copyright header, and some minor style issues.

And I don't have a strong opinion here; s390x seems to use a similar strategy
(sysdeps/s390/strstr-arch13.S, though I haven't dug into it), so we have
precedent.  There are other projects that seem to use similar strategies as well [1].

The implementation also does seem to provide some speedup for small needles
compared to the generic one, at least based on your benchmark.  However, the
benchmark also shows that twoway_memmem (which was used previously, before
commit 680942b0167715) is also slightly better, so I am not sure how
representative our current benchmark is.

Alexander, is your reservation about this optimization related to the extra code
and data required to optimize for a limited input range?

[1] https://github.com/BurntSushi/memchr

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7] sysdeps/x86_64/multiarch/memmem-avx2.c: add memmem-avx2.c
  2024-02-21 17:17   ` Noah Goldstein
  2024-02-21 20:30     ` Alexander Monakov
@ 2024-02-24  4:25     ` James
  1 sibling, 0 replies; 25+ messages in thread
From: James @ 2024-02-24  4:25 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: libc-alpha

[-- Attachment #1: Type: text/plain, Size: 21291 bytes --]

On Thu, Feb 22, 2024 at 12:17 AM Noah Goldstein <goldstein.w.n@gmail.com>
wrote:

> On Wed, Feb 21, 2024 at 12:58 AM James Tirta Halim
> <tirtajames45@gmail.com> wrote:
> >
> > Find the rarest byte in NE. Find the parts of HS that matches the rare
> byte
> > and the byte after it. If found, shift back to the start of NE in HS and
> > vector compare the first VEC_SIZE with NE. If matches, compare the rest
> > with MEMCMPEQ.
> >
> > Timings (Core i3-1115G4):
> > basic_memmem twoway_memmem __memmem_avx512 __memmem_avx2
> > __memmem_generic
> > Total:
> > 6.80124e+06 1.06087e+06 219483 345385 768041
> > Average:
> > 25958.9 4049.11 837.721 1318.26 2931.45
> >
> > Passes make check.
> >
> > Changes in v1:
> > 1. Add memmem-avx2.c
> >
> > Changes in v2:
> > 1. Add avx512 support with a generic header file
> > 2. Use __memcmpeq instead of memcmp
> > 3. Remove scalar loop
> > 4. Fix unsafe unaligned load
> >
> > Changes in v3:
> > 1. Avoid checking for alignment to the start of the page since that will
> be rare
> > 2. Use __memcmpeq instead of __memcmpeq_avx2 (it generates undefined
> > reference errors)
> > 3. Add memmem.c (needs review)
> > 4. Add __memcmpeq_avx2 and __memcmpeq_avx512 to ifunc-impl-list.c (needs
> > review)
> > 5. Add libc_hidden_builtin_def and MEMMEM to memmem.c (needs review)
> >
> > Changes in v4:
> > 1. Correct the cpu feature checks in ifunc-impl-list.c and memmem.c to
> > use AVX512BW and BMI1 for AVX512 and AVX2 and BMI1 for AVX2
> > 2. Correct the Makefile to use the appropriate flags
> > 3. Rename memmem-vectorized-avx.h to memmem-avx-base.h
> > 4. Remove unused vector macros (POPCNT and LZCNT)
> >
> > Changes in v5:
> > 1. Rename SHIFT to RARE, OFF to OFF_S, OFF2 to OFF_E
> > 2. Remove conditional for VEC_SIZE and ONES, and remove unused MASK_SIZE
> > 3. Add comments
> > 4. Limit needle length to VEC_SIZE when finding the rare byte
> >
> > Changes in v6:
> > 1. Fix patch apply error in memmem.c
> > 2. Correctly use MIN(ne_len, VEC_SIZE) when checking if RARE is found at
> the end
> > of needle
> > 3. Always do unaligned load at the tail code
> > 4. Rename rarebyte_table to ___rarebyte_table
> > 5. Add memmem-avx-base.c in which ___rarebyte_table is defined
> > 6. Add memmem-avx-base to the Makefile
> > 7. Add always_inline to find_rarest_byte
> > 8. Change ((m << off) >> off) to (m & (ONES >> off))
> > 9. Change void * to unsigned char * in find_rarest_byte
> >
> > Changes in v7:
> > 1. Fallback to generic memmem for long needles for guaranteed
> > linear-time worst-case performance
> > 2. Use memmem instead of MEMMEM for libc_hidden_builtin_def in
> > memmem.c (string/memmem.c and sysdeps/x86_64/multiarch/memmem.c may
> > still need to be fixed for non-x86_64 builds to work. The changes were
> > made following string/strstr.c and sysdeps/x86_64/multiarch/strstr.c)
> > 3. Change some (VEC *) casts to (const VEC *)
> >
> > ---
> >  string/memmem.c                            |   7 +-
> >  sysdeps/x86_64/multiarch/Makefile          |   6 +
> >  sysdeps/x86_64/multiarch/ifunc-impl-list.c |  12 ++
> >  sysdeps/x86_64/multiarch/memmem-avx-base.c |  20 +++
> >  sysdeps/x86_64/multiarch/memmem-avx-base.h | 191 +++++++++++++++++++++
> >  sysdeps/x86_64/multiarch/memmem-avx2.c     |   3 +
> >  sysdeps/x86_64/multiarch/memmem-avx512.c   |  12 ++
> >  sysdeps/x86_64/multiarch/memmem.c          |  67 ++++++++
> >  8 files changed, 317 insertions(+), 1 deletion(-)
> >  create mode 100644 sysdeps/x86_64/multiarch/memmem-avx-base.c
> >  create mode 100644 sysdeps/x86_64/multiarch/memmem-avx-base.h
> >  create mode 100644 sysdeps/x86_64/multiarch/memmem-avx2.c
> >  create mode 100644 sysdeps/x86_64/multiarch/memmem-avx512.c
> >  create mode 100644 sysdeps/x86_64/multiarch/memmem.c
> >
> > diff --git a/string/memmem.c b/string/memmem.c
> > index a4117f8e1e..0a89bd5f7c 100644
> > --- a/string/memmem.c
> > +++ b/string/memmem.c
> > @@ -25,6 +25,10 @@
> >  # define __memmem      memmem
> >  #endif
> >
> > +#ifndef MEMMEM
> > +# define MEMMEM __memmem
> > +#endif
> > +
> >  #define RETURN_TYPE void *
> >  #define AVAILABLE(h, h_l, j, n_l) ((j) <= (h_l) - (n_l))
> >  #define FASTSEARCH(S,C,N) (void*) memchr ((void *)(S), (C), (N))
> > @@ -50,7 +54,7 @@
> >     The limit also implies worst-case performance is linear.
> >     Needles larger than 256 characters use the linear-time Two-Way
> algorithm.  */
> >  void *
> > -__memmem (const void *haystack, size_t hs_len,
> > +MEMMEM (const void *haystack, size_t hs_len,
> >           const void *needle, size_t ne_len)
> >  {
> >    const unsigned char *hs = (const unsigned char *) haystack;
> > @@ -127,3 +131,4 @@ __memmem (const void *haystack, size_t hs_len,
> >  libc_hidden_def (__memmem)
> >  weak_alias (__memmem, memmem)
> >  libc_hidden_weak (memmem)
> > +libc_hidden_builtin_def (memmem)
> > diff --git a/sysdeps/x86_64/multiarch/Makefile
> b/sysdeps/x86_64/multiarch/Makefile
> > index d3d2270394..0b46d5f341 100644
> > --- a/sysdeps/x86_64/multiarch/Makefile
> > +++ b/sysdeps/x86_64/multiarch/Makefile
> > @@ -15,6 +15,9 @@ sysdep_routines += \
> >    memcmpeq-avx2-rtm \
> >    memcmpeq-evex \
> >    memcmpeq-sse2 \
> > +  memmem-avx-base \
> > +  memmem-avx2 \
> > +  memmem-avx512 \
> >    memmove-avx-unaligned-erms \
> >    memmove-avx-unaligned-erms-rtm \
> >    memmove-avx512-no-vzeroupper \
> > @@ -122,6 +125,9 @@ sysdep_routines += \
> >    varshift \
> >  # sysdep_routines
> >
> > +CFLAGS-memmem-avx2.c += -mavx2 -mbmi -O3
> > +CFLAGS-memmem-avx512.c += -mavx512f -mavx512bw -mbmi -O3
> > +
> >  CFLAGS-strcspn-sse4.c += -msse4
> >  CFLAGS-strpbrk-sse4.c += -msse4
> >  CFLAGS-strspn-sse4.c += -msse4
> > diff --git a/sysdeps/x86_64/multiarch/ifunc-impl-list.c
> b/sysdeps/x86_64/multiarch/ifunc-impl-list.c
> > index c4a21d4b7c..20a8b85da9 100644
> > --- a/sysdeps/x86_64/multiarch/ifunc-impl-list.c
> > +++ b/sysdeps/x86_64/multiarch/ifunc-impl-list.c
> > @@ -799,6 +799,18 @@ __libc_ifunc_impl_list (const char *name, struct
> libc_ifunc_impl *array,
> >               IFUNC_IMPL_ADD (array, i, strstr, 1,
> __strstr_sse2_unaligned)
> >               IFUNC_IMPL_ADD (array, i, strstr, 1, __strstr_generic))
> >
> > +    /* Support sysdeps/x86_64/multiarch/memmem.c.  */
> > +  IFUNC_IMPL (i, name, memmem,
> > +              IFUNC_IMPL_ADD (array, i, memmem,
> > +                              (CPU_FEATURE_USABLE (AVX512BW)
> > +                               && CPU_FEATURE_USABLE (BMI1)),
> > +                              __memmem_avx512)
> > +              IFUNC_IMPL_ADD (array, i, memmem,
> > +                             (CPU_FEATURE_USABLE (AVX2)
> > +                             && CPU_FEATURE_USABLE (BMI1)),
> > +                             __memmem_avx2)
> > +             IFUNC_IMPL_ADD (array, i, memmem, 1, __memmem_generic))
> > +
> >    /* Support sysdeps/x86_64/multiarch/wcschr.c.  */
> >    IFUNC_IMPL (i, name, wcschr,
> >               X86_IFUNC_IMPL_ADD_V4 (array, i, wcschr,
> > diff --git a/sysdeps/x86_64/multiarch/memmem-avx-base.c
> b/sysdeps/x86_64/multiarch/memmem-avx-base.c
> > new file mode 100644
> > index 0000000000..212d75c96f
> > --- /dev/null
> > +++ b/sysdeps/x86_64/multiarch/memmem-avx-base.c
> > @@ -0,0 +1,20 @@
> > +const unsigned char ___rarebyte_table[256] attribute_hidden
> > +    = { 0,   1,   13,  56,  59,  60,  61,  62,  63,  232, 248, 2,   158, 4,
> > +       5,   6,   7,   8,   9,   10,  14,  20,  26,  29,  37,  46,  52,  53,
> > +       54,  55,  57,  58,  255, 172, 242, 193, 162, 174, 178, 182, 218, 219,
> > +       212, 180, 249, 197, 221, 210, 253, 231, 230, 224, 225, 226, 227, 223,
> > +       222, 220, 176, 213, 184, 229, 188, 164, 159, 209, 181, 203, 189, 216,
> > +       196, 192, 185, 205, 161, 168, 215, 187, 211, 194, 195, 165, 206, 204,
> > +       214, 198, 173, 179, 175, 183, 167, 202, 239, 201, 160, 241, 163, 246,
> > +       233, 238, 240, 254, 237, 208, 234, 250, 169, 186, 236, 217, 245, 243,
> > +       228, 170, 247, 244, 251, 235, 199, 200, 252, 207, 177, 191, 171, 190,
> > +       166, 3,   140, 134, 124, 126, 86,  128, 95,  117, 114, 93,  81,  87,
> > +       132, 96,  112, 97,  103, 82,  139, 89,  98,  88,  119, 74,  156, 115,
> > +       104, 75,  120, 106, 76,  155, 90,  122, 107, 125, 152, 145, 136, 137,
> > +       101, 116, 102, 108, 99,  141, 77,  78,  118, 79,  109, 100, 150, 73,
> > +       94,  72,  121, 151, 113, 135, 110, 105, 83,  91,  11,  12,  64,  149,
> > +       146, 111, 65,  69,  66,  15,  16,  17,  18,  19,  130, 92,  144, 123,
> > +       21,  22,  23,  24,  131, 133, 127, 142, 25,  70,  129, 27,  28,  67,
> > +       153, 84,  143, 138, 147, 157, 148, 68,  71,  30,  31,  32,  33,  34,
> > +       35,  36,  154, 38,  39,  40,  41,  42,  80,  43,  44,  45,  47,  48,
> > +       85,  49,  50,  51 };
> > diff --git a/sysdeps/x86_64/multiarch/memmem-avx-base.h b/sysdeps/x86_64/multiarch/memmem-avx-base.h
> > new file mode 100644
> > index 0000000000..08941798ff
> > --- /dev/null
> > +++ b/sysdeps/x86_64/multiarch/memmem-avx-base.h
> > @@ -0,0 +1,191 @@
> > +#include <immintrin.h>
> > +#include <inttypes.h>
> > +#include <string.h>
> > +#include <libc-pointer-arith.h>
> > +
> > +#ifndef FUNC_NAME
> > +#  define __memmem_avx2
> > +#endif
> > +#ifndef VEC
> > +#  define VEC __m256i
> > +#endif
> > +#ifndef MASK
> > +#  define MASK uint32_t
> > +#endif
> > +#ifndef LOAD
> > +#  define LOAD(x) _mm256_load_si256 (x)
> > +#endif
> > +#ifndef LOADU
> > +#  define LOADU(x) _mm256_loadu_si256 (x)
> > +#endif
> > +#ifndef CMPEQ8_MASK
> > +#  define CMPEQ8_MASK(x, y) _mm256_movemask_epi8 (_mm256_cmpeq_epi8 (x,
> y))
> > +#endif
> > +#ifndef SETONE8
> > +#  define SETONE8(x) _mm256_set1_epi8 (x)
> > +#endif
> > +#ifndef TZCNT
> > +#  define TZCNT(x) _tzcnt_u32 (x)
> > +#endif
> Use `__builtin_ctz`
>
 Is it more portable? Are we dropping tzcnt and blsr to drop BMI1?
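
For reference, a minimal sketch of the difference (my illustration, not
patch text); both forms agree whenever the mask is nonzero, which is the
only case the scan loops rely on, since TZCNT is reached only inside
`while (m)`:

#include <immintrin.h>

static inline unsigned int
tzcnt_bmi (unsigned int m)
{
  return _tzcnt_u32 (m);    /* Defined for m == 0 (returns 32); needs BMI1.  */
}

static inline unsigned int
tzcnt_generic (unsigned int m)
{
  return __builtin_ctz (m); /* Undefined for m == 0; no BMI1 needed.  */
}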

> > +#ifndef BLSR
> > +#  define BLSR(x) _blsr_u32 (x)
> > +#endif
>
> Think you can drop the `BLSR` define (here and in the avx512)
> and just replace with `((x) & ((x) - 1))`
> any reasonable compiler will optimize that correctly.
>
Ok.
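
For reference, a minimal sketch of the replacement (my illustration, not
patch text): the portable form clears the lowest set bit and compiles to
a single BLSR with -mbmi, and to a short lea/and sequence without it.

static inline unsigned int
blsr_portable (unsigned int m)
{
  return m & (m - 1);  /* Clear the lowest set bit.  */
}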

> > +#define VEC_SIZE sizeof (VEC)
> > +#define ONES ((MASK) -1)
> > +
> > +#ifndef MEMCMPEQ
> > +#  define MEMCMPEQ __memcmpeq
> > +#endif
> > +#ifndef MEMCPY
> > +#  define MEMCPY memcpy
> > +#endif
> > +#ifndef MEMCHR
> > +#  define MEMCHR memchr
> > +#endif
> > +#ifndef PAGE_SIZE
> > +#  define PAGE_SIZE 4096
> > +#endif
> > +#define MIN(x, y) (((x) < (y)) ? (x) : (y))
> > +
> > +extern void *__memmem_generic (const void *, size_t, const void *,
> > +                              size_t) attribute_hidden;
> > +
> > +/* Lower is rarer. The table is based on the *.c and *.h files in glibc. */
> > +extern const unsigned char ___rarebyte_table[256] attribute_hidden;
> > +
> > +static inline void *__attribute__ ((always_inline))
> > +find_rarest_byte (const unsigned char *rare, size_t n)
> > +{
> > +  const unsigned char *p = (const unsigned char *) rare;
> > +  int c_rare = ___rarebyte_table[*rare];
> > +  int c;
> > +  for (; n--; ++p)
> > +    {
> > +      c = ___rarebyte_table[*p];
> > +      if (c < c_rare)
> > +       {
> > +         rare = p;
> > +         c_rare = c;
> > +       }
> > +    }
> > +  return (void *) rare;
> > +}
> > +
> > +void *
> > +FUNC_NAME (const void *hs, size_t hs_len, const void *ne, size_t ne_len)
> > +{
> > +  if (ne_len == 1)
> > +    return (void *) MEMCHR (hs, *(unsigned char *) ne, hs_len);
> > +  if (__glibc_unlikely (ne_len == 0))
> > +    return (void *) hs;
> > +  if (__glibc_unlikely (hs_len < ne_len))
> > +    return NULL;
> > +  /* Linear-time worst-case performance is guaranteed by the generic
> > +   * implementation using the Two-Way algorithm. */
> > +  if (__glibc_unlikely (ne_len > 256))
> > +    return __memmem_generic (hs, hs_len, ne, ne_len);
> Think this impl makes sense up to VEC_SIZE * 1 + 1, but after that
> it doesn't seem to have that much advantage.
>
Should we fall back directly to two_way_long_needle then (make it
non-static)?

> > +  VEC hv0, hv1, hv, nv;
> > +  MASK i, hm0, hm1, m, cmpm;
> > +  const unsigned int matchsh = ne_len < VEC_SIZE ? VEC_SIZE - ne_len : 0;
> > +  const MASK matchm = ONES << matchsh;
> > +  const unsigned char *h = (const unsigned char *) hs;
> > +  const unsigned char *const end = h + hs_len - ne_len;
> > +  const unsigned char *hp;
> > +  size_t rare = PTR_DIFF (
> > +      find_rarest_byte ((const unsigned char *) ne, MIN (ne_len, VEC_SIZE)),
> > +      ne);
> > +  /* RARE will always be the first byte to find.
> > +     If RARE is at the end of the needle, use the byte before it. */
> > +  if (rare == MIN (ne_len, VEC_SIZE) - 1)
> > +    --rare;
> > +  const VEC nv0 = SETONE8 (*((char *) ne + rare));
> > +  const VEC nv1 = SETONE8 (*((char *) ne + rare + 1));
> > +  unsigned int off_e = (PTR_DIFF (end, h) < VEC_SIZE)
> > +                          ? VEC_SIZE - (unsigned int) (end - h) - 1
> > +                          : 0;
> > +  /* Start from the position of RARE. */
> > +  h += rare;
> > +  /* Load the needle vector. */
> > +  if (((uintptr_t) ne & (PAGE_SIZE - 1)) > (PAGE_SIZE - VEC_SIZE)
> > +      || ne_len >= VEC_SIZE)
> the `ne_len >= VEC_SIZE` should probably be the first check here.
>
I'm keeping it as is because that is faster for short needles. And I
think I'll reuse PTR_DIFF (PTR_ALIGN_UP (ne, PAGE_SIZE), ne) >= VEC_SIZE
instead, because I've run into some problems with the current condition.
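
To illustrate the condition under discussion (a sketch using the
PTR_ALIGN_UP/PTR_DIFF helpers from libc-pointer-arith.h, not patch
text): the unaligned VEC_SIZE load of the needle is safe whenever at
least VEC_SIZE bytes remain before the next page boundary, so the
over-read cannot fault; otherwise we fall back to copying exactly
ne_len bytes into the needle vector.

static inline int
not_crossing_page (const void *p, size_t size)
{
  /* True if SIZE bytes starting at P stay within P's page.  */
  return PTR_DIFF (PTR_ALIGN_UP (p, PAGE_SIZE), p) >= size;
}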

> > +    nv = LOADU ((const VEC *) ne);
> > +  else
> > +    MEMCPY (&nv, ne, MIN (VEC_SIZE, ne_len));
> > +  const unsigned int off_s = PTR_DIFF (h, PTR_ALIGN_DOWN (h, VEC_SIZE));
> > +  /* Align down to VEC_SIZE. */
> > +  h -= off_s;
> > +  hv0 = LOAD ((const VEC *) h);
> > +  hm0 = (MASK) CMPEQ8_MASK (hv0, nv0);
> > +  hm1 = (MASK) CMPEQ8_MASK (hv0, nv1) >> 1;
> > +  /* Clear the irrelevant bits from aligning down (OFF_S) and ones that are out
> > +   * of bounds (OFF_E). */
> > +  m = ((hm0 & hm1) >> off_s) & (ONES >> off_e);
> > +  while (m)
> > +    {
> > +      i = TZCNT (m);
> > +      m = BLSR (m);
> > +      hp = h + off_s + i - rare;
> > +      if (PTR_DIFF (PTR_ALIGN_UP (hp, PAGE_SIZE), hp) >= VEC_SIZE)
> > +       {
> > +         /* Do a vector compare if we are not crossing a page. */
> > +         hv = LOADU ((const VEC *) hp);
> > +         cmpm = (MASK) CMPEQ8_MASK (hv, nv) << matchsh;
> > +         /* Compare only the relevant bits of the needle vector. */
> > +         if (cmpm == matchm)
> > +           /* Compare the rest of the needle. */
> > +           if (ne_len <= VEC_SIZE
> > +               || !MEMCMPEQ (hp + VEC_SIZE, (const char *) ne + VEC_SIZE,
> > +                             ne_len - VEC_SIZE))
> > +             return (void *) hp;
> > +       }
> > +      else
> > +       {
> > +         if (!MEMCMPEQ (hp, ne, ne_len))
> > +           return (void *) hp;
> think (assuming you bound ne_len <= ~VEC_SIZE * 2), you can
> just make a little inline impl of this that will be much faster
> than a call to __memcmpeq.

Realistically, how often are we going to have needles longer than 64
bytes from normal input? That said, I think ne_len <= VEC_SIZE * 2 is
fine for avx2.
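
Sketched out (assuming ne_len <= VEC_SIZE * 2, with a second needle
vector nv_e and matchsh_e/matchm_e built like matchsh/matchm but against
VEC_SIZE * 2 -- those names are illustrative here), the inline check
could replace the __memcmpeq call like this:

  hv = LOADU ((const VEC *) hp);
  cmpm = (MASK) CMPEQ8_MASK (hv, nv) << matchsh;
  if (cmpm == matchm)
    {
      if (ne_len <= VEC_SIZE)
        return (void *) hp;
      /* Compare the tail of the needle with a second masked compare.  */
      hv = LOADU ((const VEC *) hp + 1);
      cmpm = (MASK) CMPEQ8_MASK (hv, nv_e) << matchsh_e;
      if (cmpm == matchm_e)
        return (void *) hp;
    }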

> > +       }
> > +    }
> > +  h += VEC_SIZE - 1;
> > +  for (; h - rare + VEC_SIZE <= end; h += VEC_SIZE)
> > +    {
> > +      hv0 = LOADU ((const VEC *) h);
> > +      hv1 = LOAD ((const VEC *) (h + 1));
> > +      hm1 = (MASK) CMPEQ8_MASK (hv1, nv1);
> > +      hm0 = (MASK) CMPEQ8_MASK (hv0, nv0);
> > +      m = hm0 & hm1;
> > +      while (m)
> > +       {
> > +       match:
> > +         i = TZCNT (m);
> > +         m = BLSR (m);
> > +         hp = h + i - rare;
> > +         if (PTR_DIFF (PTR_ALIGN_UP (hp, PAGE_SIZE), hp) >= VEC_SIZE)
> > +           {
> > +             hv = LOADU ((const VEC *) hp);
> > +             cmpm = (MASK) CMPEQ8_MASK (hv, nv) << matchsh;
> > +             if (cmpm == matchm)
> > +               if (ne_len <= VEC_SIZE
> > +                   || !MEMCMPEQ (hp + VEC_SIZE, (const char *) ne + VEC_SIZE,
> > +                                 ne_len - VEC_SIZE))
> > +                 return (void *) hp;
> > +           }
> > +         else
> > +           {
> > +             if (!MEMCMPEQ (hp, ne, ne_len))
> > +               return (void *) hp;
> > +           }
> > +       }
> > +    }
> > +  if (h - rare <= end)
> > +    {
> > +      off_e = VEC_SIZE - (unsigned int) (end - (h - rare)) - 1;
> > +      hv0 = LOADU ((const VEC *) h);
> > +      hv1 = LOAD ((const VEC *) (h + 1));
> > +      hm1 = (MASK) CMPEQ8_MASK (hv1, nv1);
> > +      hm0 = (MASK) CMPEQ8_MASK (hv0, nv0);
> > +      /* Clear the irrelevant bits that are out of bounds. */
> > +      m = hm0 & hm1 & (ONES >> off_e);
> > +      if (m)
> > +       goto match;
> > +    }
> > +  return NULL;
> > +}
> > diff --git a/sysdeps/x86_64/multiarch/memmem-avx2.c b/sysdeps/x86_64/multiarch/memmem-avx2.c
> > new file mode 100644
> > index 0000000000..91f5d5d331
> > --- /dev/null
> > +++ b/sysdeps/x86_64/multiarch/memmem-avx2.c
> > @@ -0,0 +1,3 @@
> > +#define FUNC_NAME __memmem_avx2
> > +
> > +#include "memmem-avx-base.h"
> > diff --git a/sysdeps/x86_64/multiarch/memmem-avx512.c b/sysdeps/x86_64/multiarch/memmem-avx512.c
> > new file mode 100644
> > index 0000000000..76016c1cfe
> > --- /dev/null
> > +++ b/sysdeps/x86_64/multiarch/memmem-avx512.c
> > @@ -0,0 +1,12 @@
> > +#define VEC __m512i
> > +#define MASK uint64_t
> > +#define LOAD(x) _mm512_load_si512 (x)
> > +#define LOADU(x) _mm512_loadu_si512 (x)
> > +#define CMPEQ8_MASK(x, y) _mm512_cmpeq_epi8_mask (x, y)
> > +#define SETONE8(x) _mm512_set1_epi8 (x)
> > +#define TZCNT(x) _tzcnt_u64 (x)
> > +#define BLSR(x) _blsr_u64 (x)
> > +
> > +#define FUNC_NAME __memmem_avx512
> > +
> > +#include "memmem-avx-base.h"
> > diff --git a/sysdeps/x86_64/multiarch/memmem.c b/sysdeps/x86_64/multiarch/memmem.c
> > new file mode 100644
> > index 0000000000..8fe7b77d33
> > --- /dev/null
> > +++ b/sysdeps/x86_64/multiarch/memmem.c
> > @@ -0,0 +1,67 @@
> > +/* Multiple versions of memmem.
> > +   All versions must be listed in ifunc-impl-list.c.
> > +   Copyright (C) 2012-2023 Free Software Foundation, Inc.
> > +   This file is part of the GNU C Library.
> > +
> > +   The GNU C Library is free software; you can redistribute it and/or
> > +   modify it under the terms of the GNU Lesser General Public
> > +   License as published by the Free Software Foundation; either
> > +   version 2.1 of the License, or (at your option) any later version.
> > +
> > +   The GNU C Library is distributed in the hope that it will be useful,
> > +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> > +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > +   Lesser General Public License for more details.
> > +
> > +   You should have received a copy of the GNU Lesser General Public
> > +   License along with the GNU C Library; if not, see
> > +   <https://www.gnu.org/licenses/>.  */
> > +
> > +/* Redefine memmem so that the compiler won't complain about the type
> > +   mismatch with the IFUNC selector in strong_alias, below.  */
> > +#undef  memmem
> > +#define memmem __redirect_memmem
> > +#include <string.h>
> > +#undef  memmem
> > +
> > +#define MEMMEM __memmem_generic
> > +#ifdef SHARED
> > +# undef libc_hidden_builtin_def
> > +# define libc_hidden_builtin_def(name) \
> > +  __hidden_ver1 (__memmem_generic, __GI_memmem, __memmem_generic);
> > +#endif
> > +
> > +#include "string/memmem.c"
> > +
> > +extern __typeof (__redirect_memmem) __memmem_avx2 attribute_hidden;
> > +extern __typeof (__redirect_memmem) __memmem_generic attribute_hidden;
> > +extern __typeof (__redirect_memmem) __memmem_avx512 attribute_hidden;
> > +
> > +#define SYMBOL_NAME memmem
> > +
> > +#include "init-arch.h"
> > +
> > +/* Avoid DWARF definition DIE on ifunc symbol so that GDB can handle
> > +   ifunc symbol properly.  */
> > +extern __typeof (__redirect_memmem) __libc_memmem;
> > +
> > +static inline void *
> > +IFUNC_SELECTOR (void)
> > +{
> > +  const struct cpu_features *cpu_features = __get_cpu_features ();
> > +
> > +  if (!CPU_FEATURES_ARCH_P (cpu_features, Prefer_No_AVX512)
> > +      && CPU_FEATURE_USABLE_P (cpu_features, AVX512BW)
> > +      && CPU_FEATURE_USABLE_P (cpu_features, BMI1))
> > +    return __memmem_avx512;
> > +
> > +  if (CPU_FEATURE_USABLE_P (cpu_features, AVX2)
> > +      && CPU_FEATURE_USABLE_P (cpu_features, BMI1))
> > +    return __memmem_avx2;
> > +
> > +  return __memmem_generic;
> > +}
> > +
> > +libc_ifunc_redirected (__redirect_memmem, __libc_memmem, IFUNC_SELECTOR ());
> > +#undef memmem
> > +strong_alias (__libc_memmem, __memmem)
> > --
> > 2.43.2
> >
>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v8] sysdeps/x86_64/multiarch/memmem-avx2.c: add memmem-avx2.c
  2024-02-21  6:57 ` [PATCH v7] " James Tirta Halim
  2024-02-21 17:17   ` Noah Goldstein
@ 2024-02-24  9:09   ` James Tirta Halim
  2024-02-24  9:29   ` James Tirta Halim
  2 siblings, 0 replies; 25+ messages in thread
From: James Tirta Halim @ 2024-02-24  9:09 UTC (permalink / raw)
  To: tirtajames45; +Cc: goldstein.w.n, libc-alpha

Find the rarest byte in NE. Find the parts of HS that matches the rare byte
and the byte after it. If found, shift back to the start of NE in HS and
vector compare with NE.
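
(For readers following along, a scalar sketch of the idea -- illustration
only; the patch below vectorizes this filter. rank[] stands in for
___rarebyte_table, and ne_len >= 2 is assumed since the patch
special-cases shorter needles.)

#include <stddef.h>
#include <string.h>

static void *
memmem_rarebyte_ref (const unsigned char *hs, size_t hs_len,
                     const unsigned char *ne, size_t ne_len,
                     const unsigned char rank[256])
{
  size_t rare = 0, i;
  /* Pick the rarest needle byte, excluding the last byte so that
     ne[rare + 1] is still inside the needle.  */
  for (i = 1; i + 1 < ne_len; i++)
    if (rank[ne[i]] < rank[ne[rare]])
      rare = i;
  for (; hs_len >= ne_len; ++hs, --hs_len)
    if (hs[rare] == ne[rare] && hs[rare + 1] == ne[rare + 1]
        && memcmp (hs, ne, ne_len) == 0)
      return (void *) hs;
  return NULL;
}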

Timings (Core i3-1115G4):
                  Average      Total
basic_memmem      25905.8      6.78732e+06
twoway_memmem     4117.55      1.0788e+06
__memmem_avx2     1574.32      412471
__memmem_avx512   850.412      222808
__memmem_generic  3011.89      789116
__memmem_sse2     2190.56      573927

Passes test-memmem.
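
(The ___rarebyte_table in the patch ranks bytes by how rarely they occur
in glibc's own *.c and *.h files; as a hedged sketch, such a rank table
could be generated along these lines -- this is not the actual script
that was used:)

#include <stdio.h>

int
main (void)
{
  static unsigned long long freq[256];
  static unsigned char rank[256];
  int order[256], c, i, j;
  while ((c = getchar ()) != EOF)
    freq[c]++;
  for (i = 0; i < 256; i++)
    order[i] = i;
  /* Sort byte values by ascending frequency (insertion sort).  */
  for (i = 1; i < 256; i++)
    for (j = i; j > 0 && freq[order[j]] < freq[order[j - 1]]; j--)
      {
        int t = order[j];
        order[j] = order[j - 1];
        order[j - 1] = t;
      }
  /* rank[byte] == 0 for the rarest byte ("lower is rarer").  */
  for (i = 0; i < 256; i++)
    rank[order[i]] = (unsigned char) i;
  for (i = 0; i < 256; i++)
    printf ("%u,%c", rank[i], (i % 14 == 13) ? '\n' : ' ');
  return 0;
}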

Changes in v1:
1. Add memmem-avx2.c

Changes in v2:
1. Add avx512 support with a generic header file
2. Use __memcmpeq instead of memcmp
3. Remove scalar loop
4. Fix unsafe unaligned load

Changes in v3:
1. Avoid checking for alignment to the start of the page since that will be rare
2. Use __memcmpeq instead of __memcmpeq_avx2 (it generates undefined
reference errors)
3. Add memmem.c (needs review)
4. Add __memcmpeq_avx2 and __memcmpeq_avx512 to ifunc-impl-list.c (needs
review)
5. Add libc_hidden_builtin_def and MEMMEM to memmem.c (needs review)

Changes in v4:
1. Correct the cpu feature checks in ifunc-impl-list.c and memmem.c to
use AVX512BW and BMI1 for AVX512 and AVX2 and BMI1 for AVX2
2. Correct the Makefile to use the appropriate flags
3. Rename memmem-vectorized-avx.h to memmem-avx-base.h
4. Remove unused vector macros (POPCNT and LZCNT)

Changes in v5:
1. Rename SHIFT to RARE, OFF to OFF_S, OFF2 to OFF_E
2. Remove conditional for VEC_SIZE and ONES, and remove unused MASK_SIZE
3. Add comments
4. Limit needle length to VEC_SIZE when finding the rare byte

Changes in v6:
1. Fix patch apply error in memmem.c
2. Correctly use MIN(ne_len, VEC_SIZE) when checking if RARE is found at the end
of needle
3. Always do unaligned load at the tail code
4. Rename rarebyte_table to ___rarebyte_table
5. Add memmem-avx-base.c in which ___rarebyte_table is defined
6. Add memmem-avx-base to the Makefile
7. Add always_inline to find_rarest_byte
8. Change ((m << off) >> off) to (m & (ONES >> off))
9. Change void * to unsigned char * in find_rarest_byte

Changes in v7:
1. Fall back to generic memmem for long needles for guaranteed
linear-time worst-case performance
2. Use memmem instead of MEMMEM for libc_hidden_builtin_def in
memmem.c (string/memmem.c and sysdeps/x86_64/multiarch/memmem.c may
still need to be fixed for non-x86_64 builds to work. The changes were
made following string/strstr.c and sysdeps/x86_64/multiarch/strstr.c)
3. Change some (VEC *) casts to (const VEC *)

Changes in v8:
1. Remove libc_hidden_builtin_def in string/memmem.c and change libc_hidden_builtin_def to
libc_hidden_weak in sysdeps/*/memmem.c
2. Add memmem-sse2 (add to ifunc-impl-list.c, sysdeps/*/memmem.c, and
Makefile). sse2 is used if we have Fast_Unaligned_Load
3. avx2 and avx512 are used for ne_len <= VEC_SIZE * 2; sse2 for ne_len <=
VEC_SIZE (benchmark shows that sse2 is slower for ne_len <= VEC_SIZE * 2)
4. avx2 and avx512 fall back to two_way_long_needle; sse2 falls back to
__memmem_generic
5. Replace the MEMCMPEQ that is used for comparing the rest of the needle
with CMPEQ8. If ne_len <= VEC_SIZE * 2, CMPEQ8 the start and end of the
needle
6. If ne_len <= VEC_SIZE * 2, load the second needle vector
7. Implement BLSR with ((x) & ((x) - 1)), TZCNT (avx2) with
__builtin_ctz
8. Implement TZCNT (sse2) with ((x) ? _bit_scan_forward (x) : (MASK)
sizeof (MASK) * CHAR_BIT)
9. Add NOT_CROSSING_PAGE macro
10. Add MIN_VEC macro. When LONG_NEEDLE is set (ne_len may be up to
VEC_SIZE * 2), it expands to MIN (ne_len, VEC_SIZE). Otherwise, it
expands to ne_len, since ne_len will always be <= VEC_SIZE
11. Add LONG_NEEDLE macro for checking whether ne_len may be up to VEC_SIZE * 2
12. Add macros to change the name of two_way_long_needle and make it non-static in string/str-two-way.h

---
 string/memmem.c                            |   8 +-
 string/str-two-way.h                       |  13 +-
 sysdeps/x86_64/multiarch/Makefile          |   8 +
 sysdeps/x86_64/multiarch/ifunc-impl-list.c |  13 ++
 sysdeps/x86_64/multiarch/memmem-avx-base.c |  37 +++
 sysdeps/x86_64/multiarch/memmem-avx-base.h | 255 +++++++++++++++++++++
 sysdeps/x86_64/multiarch/memmem-avx2.c     |   6 +
 sysdeps/x86_64/multiarch/memmem-avx512.c   |  13 ++
 sysdeps/x86_64/multiarch/memmem-sse2.c     |  16 ++
 sysdeps/x86_64/multiarch/memmem.c          |  73 ++++++
 10 files changed, 438 insertions(+), 4 deletions(-)
 create mode 100644 sysdeps/x86_64/multiarch/memmem-avx-base.c
 create mode 100644 sysdeps/x86_64/multiarch/memmem-avx-base.h
 create mode 100644 sysdeps/x86_64/multiarch/memmem-avx2.c
 create mode 100644 sysdeps/x86_64/multiarch/memmem-avx512.c
 create mode 100644 sysdeps/x86_64/multiarch/memmem-sse2.c
 create mode 100644 sysdeps/x86_64/multiarch/memmem.c

diff --git a/string/memmem.c b/string/memmem.c
index a4117f8e1e..d04710bf92 100644
--- a/string/memmem.c
+++ b/string/memmem.c
@@ -25,6 +25,10 @@
 # define __memmem	memmem
 #endif
 
+#ifndef MEMMEM
+# define MEMMEM __memmem
+#endif
+
 #define RETURN_TYPE void *
 #define AVAILABLE(h, h_l, j, n_l) ((j) <= (h_l) - (n_l))
 #define FASTSEARCH(S,C,N) (void*) memchr ((void *)(S), (C), (N))
@@ -50,7 +54,7 @@
    The limit also implies worst-case performance is linear.
    Needles larger than 256 characters use the linear-time Two-Way algorithm.  */
 void *
-__memmem (const void *haystack, size_t hs_len,
+MEMMEM (const void *haystack, size_t hs_len,
 	  const void *needle, size_t ne_len)
 {
   const unsigned char *hs = (const unsigned char *) haystack;
@@ -77,7 +81,7 @@ __memmem (const void *haystack, size_t hs_len,
 
   /* Use Two-Way algorithm for very long needles.  */
   if (__builtin_expect (ne_len > 256, 0))
-    return two_way_long_needle (hs, hs_len, ne, ne_len);
+    return TWO_WAY_LONG_NEEDLE_FUNC_NAME (hs, hs_len, ne, ne_len);
 
   uint8_t shift[256];
   size_t tmp, shift1;
diff --git a/string/str-two-way.h b/string/str-two-way.h
index 0e663b957c..26d2853e0f 100644
--- a/string/str-two-way.h
+++ b/string/str-two-way.h
@@ -91,6 +91,15 @@
 # define RET0_IF_0(a) /* nothing */
 #endif
 
+#ifndef TWO_WAY_LONG_NEEDLE_FUNC_NAME
+# define TWO_WAY_LONG_NEEDLE_FUNC_NAME two_way_long_needle
+#endif
+#ifndef TWO_WAY_LONG_NEEDLE_NON_STATIC
+# define TWO_WAY_LONG_NEEDLE_STATIC static
+#else
+# define TWO_WAY_LONG_NEEDLE_STATIC
+#endif
+
 /* Perform a critical factorization of NEEDLE, of length NEEDLE_LEN.
    Return the index of the first byte in the right half, and set
    *PERIOD to the global period of the right half.
@@ -386,8 +395,8 @@ two_way_short_needle (const unsigned char *haystack, size_t haystack_len,
 
    Since this function is large and complex, block inlining to avoid
    slowing down the common case of small needles.  */
-__attribute__((noinline)) static RETURN_TYPE
-two_way_long_needle (const unsigned char *haystack, size_t haystack_len,
+__attribute__((noinline)) TWO_WAY_LONG_NEEDLE_STATIC RETURN_TYPE
+TWO_WAY_LONG_NEEDLE_FUNC_NAME (const unsigned char *haystack, size_t haystack_len,
 		     const unsigned char *needle, size_t needle_len)
 {
   size_t i; /* Index into current byte of NEEDLE.  */
diff --git a/sysdeps/x86_64/multiarch/Makefile b/sysdeps/x86_64/multiarch/Makefile
index d3d2270394..5c0139f17a 100644
--- a/sysdeps/x86_64/multiarch/Makefile
+++ b/sysdeps/x86_64/multiarch/Makefile
@@ -15,6 +15,10 @@ sysdep_routines += \
   memcmpeq-avx2-rtm \
   memcmpeq-evex \
   memcmpeq-sse2 \
+  memmem-avx-base \
+  memmem-avx2 \
+  memmem-avx512 \
+  memmem-sse2 \
   memmove-avx-unaligned-erms \
   memmove-avx-unaligned-erms-rtm \
   memmove-avx512-no-vzeroupper \
@@ -122,6 +126,10 @@ sysdep_routines += \
   varshift \
 # sysdep_routines
 
+CFLAGS-memmem-avx2.c += -mavx2 -mbmi -O3
+CFLAGS-memmem-avx512.c += -mavx512f -mavx512bw -mbmi -O3
+CFLAGS-memmem-sse2.c += -O3
+
 CFLAGS-strcspn-sse4.c += -msse4
 CFLAGS-strpbrk-sse4.c += -msse4
 CFLAGS-strspn-sse4.c += -msse4
diff --git a/sysdeps/x86_64/multiarch/ifunc-impl-list.c b/sysdeps/x86_64/multiarch/ifunc-impl-list.c
index c4a21d4b7c..002d255e16 100644
--- a/sysdeps/x86_64/multiarch/ifunc-impl-list.c
+++ b/sysdeps/x86_64/multiarch/ifunc-impl-list.c
@@ -798,6 +798,19 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
                               __strstr_avx512)
 	      IFUNC_IMPL_ADD (array, i, strstr, 1, __strstr_sse2_unaligned)
 	      IFUNC_IMPL_ADD (array, i, strstr, 1, __strstr_generic))
+  
+    /* Support sysdeps/x86_64/multiarch/memmem.c.  */
+  IFUNC_IMPL (i, name, memmem,
+              IFUNC_IMPL_ADD (array, i, memmem,
+		              (CPU_FEATURE_USABLE (AVX2)
+			      && CPU_FEATURE_USABLE (BMI1)),
+			      __memmem_avx2)
+              IFUNC_IMPL_ADD (array, i, memmem,
+                              (CPU_FEATURE_USABLE (AVX512BW)
+                               && CPU_FEATURE_USABLE (BMI1)),
+                              __memmem_avx512)
+	      IFUNC_IMPL_ADD (array, i, memmem, 1, __memmem_generic)
+	      IFUNC_IMPL_ADD (array, i, memmem, 1, __memmem_sse2))
 
   /* Support sysdeps/x86_64/multiarch/wcschr.c.  */
   IFUNC_IMPL (i, name, wcschr,
diff --git a/sysdeps/x86_64/multiarch/memmem-avx-base.c b/sysdeps/x86_64/multiarch/memmem-avx-base.c
new file mode 100644
index 0000000000..f8c5ed5f37
--- /dev/null
+++ b/sysdeps/x86_64/multiarch/memmem-avx-base.c
@@ -0,0 +1,37 @@
+/* Copyright (C) 2012-2023 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+const unsigned char ___rarebyte_table[256] attribute_hidden
+    = { 0,   1,	  13,  56,  59,	 60,  61,  62,	63,  232, 248, 2,   158, 4,
+	5,   6,	  7,   8,   9,	 10,  14,  20,	26,  29,  37,  46,  52,	 53,
+	54,  55,  57,  58,  255, 172, 242, 193, 162, 174, 178, 182, 218, 219,
+	212, 180, 249, 197, 221, 210, 253, 231, 230, 224, 225, 226, 227, 223,
+	222, 220, 176, 213, 184, 229, 188, 164, 159, 209, 181, 203, 189, 216,
+	196, 192, 185, 205, 161, 168, 215, 187, 211, 194, 195, 165, 206, 204,
+	214, 198, 173, 179, 175, 183, 167, 202, 239, 201, 160, 241, 163, 246,
+	233, 238, 240, 254, 237, 208, 234, 250, 169, 186, 236, 217, 245, 243,
+	228, 170, 247, 244, 251, 235, 199, 200, 252, 207, 177, 191, 171, 190,
+	166, 3,	  140, 134, 124, 126, 86,  128, 95,  117, 114, 93,  81,	 87,
+	132, 96,  112, 97,  103, 82,  139, 89,	98,  88,  119, 74,  156, 115,
+	104, 75,  120, 106, 76,	 155, 90,  122, 107, 125, 152, 145, 136, 137,
+	101, 116, 102, 108, 99,	 141, 77,  78,	118, 79,  109, 100, 150, 73,
+	94,  72,  121, 151, 113, 135, 110, 105, 83,  91,  11,  12,  64,	 149,
+	146, 111, 65,  69,  66,	 15,  16,  17,	18,  19,  130, 92,  144, 123,
+	21,  22,  23,  24,  131, 133, 127, 142, 25,  70,  129, 27,  28,	 67,
+	153, 84,  143, 138, 147, 157, 148, 68,	71,  30,  31,  32,  33,	 34,
+	35,  36,  154, 38,  39,	 40,  41,  42,	80,  43,  44,  45,  47,	 48,
+	85,  49,  50,  51 };
diff --git a/sysdeps/x86_64/multiarch/memmem-avx-base.h b/sysdeps/x86_64/multiarch/memmem-avx-base.h
new file mode 100644
index 0000000000..71c15d8c2f
--- /dev/null
+++ b/sysdeps/x86_64/multiarch/memmem-avx-base.h
@@ -0,0 +1,255 @@
+/* Copyright (C) 2012-2023 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#include <immintrin.h>
+#include <inttypes.h>
+#include <string.h>
+#include <libc-pointer-arith.h>
+#include "str-two-way.h"
+
+#ifndef FUNC_NAME
+#  define FUNC_NAME __memmem_avx2
+#endif
+#ifndef VEC
+#  define VEC __m256i
+#endif
+#ifndef MASK
+#  define MASK uint32_t
+#endif
+#ifndef LOAD
+#  define LOAD(x) _mm256_load_si256 (x)
+#endif
+#ifndef LOADU
+#  define LOADU(x) _mm256_loadu_si256 (x)
+#endif
+#ifndef CMPEQ8_MASK
+#  define CMPEQ8_MASK(x, y) _mm256_movemask_epi8 (_mm256_cmpeq_epi8 (x, y))
+#endif
+#ifndef SETONE8
+#  define SETONE8(x) _mm256_set1_epi8 (x)
+#endif
+#ifndef TZCNT
+#  define TZCNT(x) __builtin_ctz (x)
+#endif
+#ifndef BLSR
+#  define BLSR(x) ((x) & ((x) -1))
+#endif
+#ifndef MEMMEM_GENERIC
+#  define MEMMEM_GENERIC __memmem_generic
+#endif
+#ifndef TWO_WAY_LONG_NEEDLE_THRESHOLD
+#  define TWO_WAY_LONG_NEEDLE_THRESHOLD VEC_SIZE
+#endif
+#ifndef VEC_SIZE
+#  define VEC_SIZE 32
+#endif
+#define ONES ((MASK) -1)
+
+#ifndef MEMCMPEQ
+#  define MEMCMPEQ __memcmpeq
+#endif
+#ifndef MEMCPY
+#  define MEMCPY memcpy
+#endif
+#ifndef MEMCHR
+#  define MEMCHR memchr
+#endif
+#ifndef PAGE_SIZE
+#  define PAGE_SIZE 4096
+#endif
+#define MIN(x, y) (((x) < (y)) ? (x) : (y))
+#define NOT_CROSSING_PAGE(p, obj_size)                                        \
+  (PTR_DIFF (PTR_ALIGN_UP (p, PAGE_SIZE), p) >= obj_size)
+#if TWO_WAY_LONG_NEEDLE_THRESHOLD > VEC_SIZE
+#  define LONG_NEEDLE 1
+#  define MIN_VEC(ne_len) MIN (ne_len, VEC_SIZE)
+#else
+#  define LONG_NEEDLE 0
+#  define MIN_VEC(ne_len) (ne_len)
+#endif
+
+_Static_assert (VEC_SIZE == sizeof (VEC), "VEC_SIZE != sizeof (VEC).");
+_Static_assert (
+    TWO_WAY_LONG_NEEDLE_THRESHOLD <= VEC_SIZE * 2,
+    "FIND_MATCH() assumes TWO_WAY_LONG_NEEDLE_THRESHOLD <= VEC_SIZE * 2.");
+
+#if LONG_NEEDLE
+#  define FIND_MATCH()                                                        \
+    if (NOT_CROSSING_PAGE (hp, VEC_SIZE * 2))                                 \
+      {                                                                       \
+	/* Do a vector compare if we are not crossing a page. */              \
+	hv = LOADU ((const VEC *) hp);                                        \
+	cmpm = (MASK) CMPEQ8_MASK (hv, nv) << matchsh;                        \
+	/* Compare only the relevant bits of the needle vector. */            \
+	if (cmpm == matchm)                                                   \
+	  {                                                                   \
+	    if (ne_len <= VEC_SIZE)                                           \
+	      return (void *) hp;                                             \
+	    /* Compare the rest of the needle. */                             \
+	    hv = LOADU ((const VEC *) hp + 1);                                \
+	    cmpm = (MASK) CMPEQ8_MASK (hv, nv_e) << matchsh_e;                \
+	    if (cmpm == matchm_e)                                             \
+	      return (void *) hp;                                             \
+	  }                                                                   \
+      }                                                                       \
+    else                                                                      \
+      {                                                                       \
+	if (!MEMCMPEQ (hp, ne, ne_len))                                       \
+	  return (void *) hp;                                                 \
+      }
+#else
+#  define FIND_MATCH()                                                        \
+    if (NOT_CROSSING_PAGE (hp, VEC_SIZE))                                     \
+      {                                                                       \
+	hv = LOADU ((const VEC *) hp);                                        \
+	cmpm = (MASK) CMPEQ8_MASK (hv, nv) << matchsh;                        \
+	if (cmpm == matchm)                                                   \
+	  return (void *) hp;                                                 \
+      }                                                                       \
+    else                                                                      \
+      {                                                                       \
+	if (!MEMCMPEQ (hp, ne, ne_len))                                       \
+	  return (void *) hp;                                                 \
+      }
+#endif
+
+extern void *MEMMEM_GENERIC (const void *, size_t, const void *,
+			     size_t) attribute_hidden;
+
+/* Lower is rarer. The table is based on the *.c and *.h files in glibc. */
+extern const unsigned char ___rarebyte_table[256] attribute_hidden;
+
+static inline void *__attribute__ ((always_inline))
+find_rarest_byte (const unsigned char *rare, size_t n)
+{
+  const unsigned char *p = (const unsigned char *) rare;
+  int c_rare = ___rarebyte_table[*rare];
+  int c;
+  for (; n--; ++p)
+    {
+      c = ___rarebyte_table[*p];
+      if (c < c_rare)
+	{
+	  rare = p;
+	  c_rare = c;
+	}
+    }
+  return (void *) rare;
+}
+
+void *
+FUNC_NAME (const void *hs, size_t hs_len, const void *ne, size_t ne_len)
+{
+  if (ne_len == 1)
+    return (void *) MEMCHR (hs, *(unsigned char *) ne, hs_len);
+  if (__glibc_unlikely (ne_len == 0))
+    return (void *) hs;
+  if (__glibc_unlikely (hs_len < ne_len))
+    return NULL;
+  /* Linear-time worst-case performance is guaranteed by the generic
+   * implementation using the Two-Way algorithm. */
+  if (__glibc_unlikely (ne_len > TWO_WAY_LONG_NEEDLE_THRESHOLD))
+    return MEMMEM_GENERIC (hs, hs_len, ne, ne_len);
+  VEC hv0, hv1, hv, nv;
+  MASK i, hm0, hm1, m, cmpm;
+  const unsigned int matchsh = ne_len < VEC_SIZE ? VEC_SIZE - ne_len : 0;
+  const MASK matchm = ONES << matchsh;
+#if LONG_NEEDLE
+  VEC nv_e;
+  const unsigned int matchsh_e
+      = ne_len < VEC_SIZE * 2 ? VEC_SIZE * 2 - ne_len : 0;
+  const MASK matchm_e = ONES << matchsh_e;
+#endif
+  const unsigned char *h = (const unsigned char *) hs;
+  const unsigned char *const end = h + hs_len - ne_len;
+  const unsigned char *hp;
+  size_t rare = PTR_DIFF (
+      find_rarest_byte ((const unsigned char *) ne, MIN_VEC (ne_len)), ne);
+  /* RARE will always be the first byte to find.
+     If RARE is at the end of the needle, use the byte before it. */
+  if (rare == MIN_VEC (ne_len) - 1)
+    --rare;
+  const VEC nv0 = SETONE8 (*((char *) ne + rare));
+  const VEC nv1 = SETONE8 (*((char *) ne + rare + 1));
+  unsigned int off_e = (PTR_DIFF (end, h) < VEC_SIZE)
+			   ? VEC_SIZE - (unsigned int) (end - h) - 1
+			   : 0;
+  /* Start from the position of RARE. */
+  h += rare;
+  /* Load the needle vector. */
+  if (NOT_CROSSING_PAGE (ne, VEC_SIZE)
+      || (LONG_NEEDLE ? ne_len >= VEC_SIZE : 0))
+    nv = LOADU ((const VEC *) ne);
+  else
+    MEMCPY (&nv, ne, MIN_VEC (ne_len));
+#if LONG_NEEDLE
+  if (ne_len >= VEC_SIZE)
+    {
+      if (NOT_CROSSING_PAGE (ne, VEC_SIZE * 2))
+	nv_e = LOADU ((const VEC *) ne + 1);
+      else
+	MEMCPY (&nv_e, (const unsigned char *) ne + VEC_SIZE,
+		MIN (VEC_SIZE, ne_len - VEC_SIZE));
+    }
+#endif
+  const unsigned int off_s = PTR_DIFF (h, PTR_ALIGN_DOWN (h, VEC_SIZE));
+  /* Align down to VEC_SIZE. */
+  h -= off_s;
+  hv0 = LOAD ((const VEC *) h);
+  hm0 = (MASK) CMPEQ8_MASK (hv0, nv0);
+  hm1 = (MASK) CMPEQ8_MASK (hv0, nv1) >> 1;
+  /* Clear the irrelevant bits from aligning down (OFF_S) and ones that are out
+   * of bounds (OFF_E). */
+  m = ((hm0 & hm1) >> off_s) & (ONES >> off_e);
+  while (m)
+    {
+      i = TZCNT (m);
+      m = BLSR (m);
+      hp = h + off_s + i - rare;
+      FIND_MATCH ();
+    }
+  h += VEC_SIZE - 1;
+  for (; h - rare + VEC_SIZE <= end; h += VEC_SIZE)
+    {
+      hv0 = LOADU ((const VEC *) h);
+      hv1 = LOAD ((const VEC *) (h + 1));
+      hm1 = (MASK) CMPEQ8_MASK (hv1, nv1);
+      hm0 = (MASK) CMPEQ8_MASK (hv0, nv0);
+      m = hm0 & hm1;
+      while (m)
+	{
+	match:
+	  i = TZCNT (m);
+	  m = BLSR (m);
+	  hp = h + i - rare;
+	  FIND_MATCH ();
+	}
+    }
+  if (h - rare <= end)
+    {
+      off_e = VEC_SIZE - (unsigned int) (end - (h - rare)) - 1;
+      hv0 = LOADU ((const VEC *) h);
+      hv1 = LOAD ((const VEC *) (h + 1));
+      hm1 = (MASK) CMPEQ8_MASK (hv1, nv1);
+      hm0 = (MASK) CMPEQ8_MASK (hv0, nv0);
+      /* Clear the irrelevant bits that are out of bounds. */
+      m = hm0 & hm1 & (ONES >> off_e);
+      if (m)
+	goto match;
+    }
+  return NULL;
+}
diff --git a/sysdeps/x86_64/multiarch/memmem-avx2.c b/sysdeps/x86_64/multiarch/memmem-avx2.c
new file mode 100644
index 0000000000..ef5e7c1c67
--- /dev/null
+++ b/sysdeps/x86_64/multiarch/memmem-avx2.c
@@ -0,0 +1,6 @@
+#include "str-two-way.h"
+#define MEMMEM_GENERIC TWO_WAY_LONG_NEEDLE_FUNC_NAME
+#define TWO_WAY_LONG_NEEDLE_THRESHOLD ((VEC_SIZE) *2)
+#define VEC_SIZE 32
+#define FUNC_NAME __memmem_avx2
+#include "memmem-avx-base.h"
diff --git a/sysdeps/x86_64/multiarch/memmem-avx512.c b/sysdeps/x86_64/multiarch/memmem-avx512.c
new file mode 100644
index 0000000000..b1f23889ec
--- /dev/null
+++ b/sysdeps/x86_64/multiarch/memmem-avx512.c
@@ -0,0 +1,13 @@
+#include "str-two-way.h"
+#define MEMMEM_GENERIC TWO_WAY_LONG_NEEDLE_FUNC_NAME
+#define TWO_WAY_LONG_NEEDLE_THRESHOLD ((VEC_SIZE) *2)
+#define VEC_SIZE 64
+#define VEC __m512i
+#define MASK uint64_t
+#define LOAD(x) _mm512_load_si512 (x)
+#define LOADU(x) _mm512_loadu_si512 (x)
+#define CMPEQ8_MASK(x, y) _mm512_cmpeq_epi8_mask (x, y)
+#define SETONE8(x) _mm512_set1_epi8 (x)
+#define TZCNT(x) _tzcnt_u64 (x)
+#define FUNC_NAME __memmem_avx512
+#include "memmem-avx-base.h"
diff --git a/sysdeps/x86_64/multiarch/memmem-sse2.c b/sysdeps/x86_64/multiarch/memmem-sse2.c
new file mode 100644
index 0000000000..a69e35a25b
--- /dev/null
+++ b/sysdeps/x86_64/multiarch/memmem-sse2.c
@@ -0,0 +1,16 @@
+#include <x86intrin.h>
+
+#define VEC __m128i
+#define VEC_SIZE 16
+#define MASK uint16_t
+#define LOAD(x) _mm_load_si128 (x)
+#define LOADU(x) _mm_loadu_si128 (x)
+#define CMPEQ8_MASK(x, y) _mm_movemask_epi8 (_mm_cmpeq_epi8 (x, y))
+#define SETONE8(x) _mm_set1_epi8 (x)
+#define TZCNT(x)                                                              \
+  ((x) ? _bit_scan_forward (x) : (MASK) sizeof (MASK) * CHAR_BIT)
+
+#define FUNC_NAME __memmem_sse2
+#define TWO_WAY_LONG_NEEDLE_THRESHOLD VEC_SIZE
+
+#include "memmem-avx-base.h"
diff --git a/sysdeps/x86_64/multiarch/memmem.c b/sysdeps/x86_64/multiarch/memmem.c
new file mode 100644
index 0000000000..69ee4867ad
--- /dev/null
+++ b/sysdeps/x86_64/multiarch/memmem.c
@@ -0,0 +1,73 @@
+/* Multiple versions of memmem.
+   All versions must be listed in ifunc-impl-list.c.
+   Copyright (C) 2012-2023 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+/* Redefine memmem so that the compiler won't complain about the type
+   mismatch with the IFUNC selector in strong_alias, below.  */
+#undef memmem
+#define memmem __redirect_memmem
+#include <string.h>
+#undef memmem
+
+#define MEMMEM __memmem_generic
+#ifdef SHARED
+#  undef libc_hidden_weak
+#  define libc_hidden_weak(name)                                              \
+    __hidden_ver1 (__memmem_generic, __GI_memmem, __memmem_generic);
+#endif
+
+#include "str-two-way.h"
+#define TWO_WAY_LONG_NEEDLE_NON_STATIC
+#include "string/memmem.c"
+
+extern __typeof (__redirect_memmem) __memmem_avx2 attribute_hidden;
+extern __typeof (__redirect_memmem) __memmem_avx512 attribute_hidden;
+extern __typeof (__redirect_memmem) __memmem_generic attribute_hidden;
+extern __typeof (__redirect_memmem) __memmem_sse2 attribute_hidden;
+
+#define SYMBOL_NAME memmem
+
+#include "init-arch.h"
+
+/* Avoid DWARF definition DIE on ifunc symbol so that GDB can handle
+   ifunc symbol properly.  */
+extern __typeof (__redirect_memmem) __libc_memmem;
+
+static inline void *
+IFUNC_SELECTOR (void)
+{
+  const struct cpu_features *cpu_features = __get_cpu_features ();
+
+  if (!CPU_FEATURES_ARCH_P (cpu_features, Prefer_No_AVX512)
+      && CPU_FEATURE_USABLE_P (cpu_features, AVX512BW)
+      && CPU_FEATURE_USABLE_P (cpu_features, BMI1))
+    return __memmem_avx512;
+
+  if (CPU_FEATURE_USABLE_P (cpu_features, AVX2)
+      && CPU_FEATURE_USABLE_P (cpu_features, BMI1))
+    return __memmem_avx2;
+
+  if (CPU_FEATURES_ARCH_P (cpu_features, Fast_Unaligned_Load))
+    return __memmem_sse2;
+
+  return __memmem_generic;
+}
+
+libc_ifunc_redirected (__redirect_memmem, __libc_memmem, IFUNC_SELECTOR ());
+#undef memmem
+strong_alias (__libc_memmem, __memmem)
-- 
2.43.2


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v8] sysdeps/x86_64/multiarch/memmem-avx2.c: add memmem-avx2.c
  2024-02-21  6:57 ` [PATCH v7] " James Tirta Halim
  2024-02-21 17:17   ` Noah Goldstein
  2024-02-24  9:09   ` [PATCH v8] " James Tirta Halim
@ 2024-02-24  9:29   ` James Tirta Halim
  2 siblings, 0 replies; 25+ messages in thread
From: James Tirta Halim @ 2024-02-24  9:29 UTC (permalink / raw)
  To: tirtajames45; +Cc: goldstein.w.n, libc-alpha

(Resend because previous v8 was missing sysdeps/*/str-two-way.h)

Find the rarest byte in NE. Find the parts of HS that matches the rare byte
and the byte after it. If found, shift back to the start of NE in HS and
vector compare with NE.

Timings (Core i3-1115G4):
                  Average      Total
basic_memmem      25905.8      6.78732e+06
twoway_memmem     4117.55      1.0788e+06
__memmem_avx2     1574.32      412471
__memmem_avx512   850.412      222808
__memmem_generic  3011.89      789116
__memmem_sse2     2190.56      573927

Passes test-memmem.

Changes in v1:
1. Add memmem-avx2.c

Changes in v2:
1. Add avx512 support with a generic header file
2. Use __memcmpeq instead of memcmp
3. Remove scalar loop
4. Fix unsafe unaligned load

Changes in v3:
1. Avoid checking for alignment to the start of the page since that will be rare
2. Use __memcmpeq instead of __memcmpeq_avx2 (it generates undefined
reference errors)
3. Add memmem.c (needs review)
4. Add __memcmpeq_avx2 and __memcmpeq_avx512 to ifunc-impl-list.c (needs
review)
5. Add libc_hidden_builtin_def and MEMMEM to memmem.c (needs review)

Changes in v4:
1. Correct the cpu feature checks in ifunc-impl-list.c and memmem.c to
use AVX512BW and BMI1 for AVX512 and AVX2 and BMI1 for AVX2
2. Correct the Makefile to use the appropriate flags
3. Rename memmem-vectorized-avx.h to memmem-avx-base.h
4. Remove unused vector macros (POPCNT and LZCNT)

Changes in v5:
1. Rename SHIFT to RARE, OFF to OFF_S, OFF2 to OFF_E
2. Remove conditional for VEC_SIZE and ONES, and remove unused MASK_SIZE
3. Add comments
4. Limit needle length to VEC_SIZE when finding the rare byte

Changes in v6:
1. Fix patch apply error in memmem.c
2. Correctly use MIN(ne_len, VEC_SIZE) when checking if RARE is found at the end
of needle
3. Always do unaligned load at the tail code
4. Rename rarebyte_table to ___rarebyte_table
5. Add memmem-avx-base.c in which ___rarebyte_table is defined
6. Add memmem-avx-base to the Makefile
7. Add always_inline to find_rarest_byte
8. Change ((m << off) >> off) to (m & (ONES >> off))
9. Change void * to unsigned char * in find_rarest_byte

Changes in v7:
1. Fall back to generic memmem for long needles for guaranteed
linear-time worst-case performance
2. Use memmem instead of MEMMEM for libc_hidden_builtin_def in
memmem.c (string/memmem.c and sysdeps/x86_64/multiarch/memmem.c may
still need to be fixed for non-x86_64 builds to work. The changes were
made following string/strstr.c and sysdeps/x86_64/multiarch/strstr.c)
3. Change some (VEC *) casts to (const VEC *)

Changes in v8:
1. Remove libc_hidden_builtin_def in string/memmem.c and change libc_hidden_builtin_def to
libc_hidden_weak in sysdeps/*/memmem.c
2. Add memmem-sse2 (add to ifunc-impl-list.c, sysdeps/*/memmem.c, and
Makefile). sse2 is used if we have Fast_Unaligned_Load
3. avx2 and avx512 are used for ne_len <= VEC_SIZE * 2; sse2 for ne_len <=
VEC_SIZE (benchmark shows that sse2 is slower for ne_len <= VEC_SIZE * 2)
4. avx2 and avx512 fall back to two_way_long_needle; sse2 falls back to
__memmem_generic
5. Replace the MEMCMPEQ that is used for comparing the rest of the needle
with CMPEQ8. If ne_len <= VEC_SIZE * 2, CMPEQ8 the start and end of the
needle
6. If ne_len <= VEC_SIZE * 2, load the second needle vector
7. Implement BLSR with ((x) & ((x) - 1)), TZCNT (avx2) with
__builtin_ctz
8. Implement TZCNT (sse2) with ((x) ? _bit_scan_forward (x) : (MASK)
sizeof (MASK) * CHAR_BIT) (see the sketch after this list)
9. Add NOT_CROSSING_PAGE macro
10. Add MIN_VEC macro. When LONG_NEEDLE is set (ne_len may be up to
VEC_SIZE * 2), it expands to MIN (ne_len, VEC_SIZE). Otherwise, it
expands to ne_len, since ne_len will always be <= VEC_SIZE
11. Add LONG_NEEDLE macro for checking whether ne_len may be up to VEC_SIZE * 2
12. Add macros to change the name of two_way_long_needle and make it non-static in string/str-two-way.h
13. Add sysdeps/*/str-two-way.h

---
 string/memmem.c                            |   8 +-
 string/str-two-way.h                       |  13 +-
 sysdeps/x86_64/multiarch/Makefile          |   8 +
 sysdeps/x86_64/multiarch/ifunc-impl-list.c |  13 ++
 sysdeps/x86_64/multiarch/memmem-avx-base.c |  37 +++
 sysdeps/x86_64/multiarch/memmem-avx-base.h | 255 +++++++++++++++++++++
 sysdeps/x86_64/multiarch/memmem-avx2.c     |   6 +
 sysdeps/x86_64/multiarch/memmem-avx512.c   |  13 ++
 sysdeps/x86_64/multiarch/memmem-sse2.c     |  16 ++
 sysdeps/x86_64/multiarch/memmem.c          |  73 ++++++
 sysdeps/x86_64/multiarch/str-two-way.h     |   1 +
 11 files changed, 439 insertions(+), 4 deletions(-)
 create mode 100644 sysdeps/x86_64/multiarch/memmem-avx-base.c
 create mode 100644 sysdeps/x86_64/multiarch/memmem-avx-base.h
 create mode 100644 sysdeps/x86_64/multiarch/memmem-avx2.c
 create mode 100644 sysdeps/x86_64/multiarch/memmem-avx512.c
 create mode 100644 sysdeps/x86_64/multiarch/memmem-sse2.c
 create mode 100644 sysdeps/x86_64/multiarch/memmem.c
 create mode 100644 sysdeps/x86_64/multiarch/str-two-way.h

diff --git a/string/memmem.c b/string/memmem.c
index a4117f8e1e..d04710bf92 100644
--- a/string/memmem.c
+++ b/string/memmem.c
@@ -25,6 +25,10 @@
 # define __memmem	memmem
 #endif
 
+#ifndef MEMMEM
+# define MEMMEM __memmem
+#endif
+
 #define RETURN_TYPE void *
 #define AVAILABLE(h, h_l, j, n_l) ((j) <= (h_l) - (n_l))
 #define FASTSEARCH(S,C,N) (void*) memchr ((void *)(S), (C), (N))
@@ -50,7 +54,7 @@
    The limit also implies worst-case performance is linear.
    Needles larger than 256 characters use the linear-time Two-Way algorithm.  */
 void *
-__memmem (const void *haystack, size_t hs_len,
+MEMMEM (const void *haystack, size_t hs_len,
 	  const void *needle, size_t ne_len)
 {
   const unsigned char *hs = (const unsigned char *) haystack;
@@ -77,7 +81,7 @@ __memmem (const void *haystack, size_t hs_len,
 
   /* Use Two-Way algorithm for very long needles.  */
   if (__builtin_expect (ne_len > 256, 0))
-    return two_way_long_needle (hs, hs_len, ne, ne_len);
+    return TWO_WAY_LONG_NEEDLE_FUNC_NAME (hs, hs_len, ne, ne_len);
 
   uint8_t shift[256];
   size_t tmp, shift1;
diff --git a/string/str-two-way.h b/string/str-two-way.h
index 0e663b957c..26d2853e0f 100644
--- a/string/str-two-way.h
+++ b/string/str-two-way.h
@@ -91,6 +91,15 @@
 # define RET0_IF_0(a) /* nothing */
 #endif
 
+#ifndef TWO_WAY_LONG_NEEDLE_FUNC_NAME
+# define TWO_WAY_LONG_NEEDLE_FUNC_NAME two_way_long_needle
+#endif
+#ifndef TWO_WAY_LONG_NEEDLE_NON_STATIC
+# define TWO_WAY_LONG_NEEDLE_STATIC static
+#else
+# define TWO_WAY_LONG_NEEDLE_STATIC
+#endif
+
 /* Perform a critical factorization of NEEDLE, of length NEEDLE_LEN.
    Return the index of the first byte in the right half, and set
    *PERIOD to the global period of the right half.
@@ -386,8 +395,8 @@ two_way_short_needle (const unsigned char *haystack, size_t haystack_len,
 
    Since this function is large and complex, block inlining to avoid
    slowing down the common case of small needles.  */
-__attribute__((noinline)) static RETURN_TYPE
-two_way_long_needle (const unsigned char *haystack, size_t haystack_len,
+__attribute__((noinline)) TWO_WAY_LONG_NEEDLE_STATIC RETURN_TYPE
+TWO_WAY_LONG_NEEDLE_FUNC_NAME (const unsigned char *haystack, size_t haystack_len,
 		     const unsigned char *needle, size_t needle_len)
 {
   size_t i; /* Index into current byte of NEEDLE.  */
diff --git a/sysdeps/x86_64/multiarch/Makefile b/sysdeps/x86_64/multiarch/Makefile
index d3d2270394..5c0139f17a 100644
--- a/sysdeps/x86_64/multiarch/Makefile
+++ b/sysdeps/x86_64/multiarch/Makefile
@@ -15,6 +15,10 @@ sysdep_routines += \
   memcmpeq-avx2-rtm \
   memcmpeq-evex \
   memcmpeq-sse2 \
+  memmem-avx-base \
+  memmem-avx2 \
+  memmem-avx512 \
+  memmem-sse2 \
   memmove-avx-unaligned-erms \
   memmove-avx-unaligned-erms-rtm \
   memmove-avx512-no-vzeroupper \
@@ -122,6 +126,10 @@ sysdep_routines += \
   varshift \
 # sysdep_routines
 
+CFLAGS-memmem-avx2.c += -mavx2 -mbmi -O3
+CFLAGS-memmem-avx512.c += -mavx512f -mavx512bw -mbmi -O3
+CFLAGS-memmem-sse2.c += -O3
+
 CFLAGS-strcspn-sse4.c += -msse4
 CFLAGS-strpbrk-sse4.c += -msse4
 CFLAGS-strspn-sse4.c += -msse4
diff --git a/sysdeps/x86_64/multiarch/ifunc-impl-list.c b/sysdeps/x86_64/multiarch/ifunc-impl-list.c
index c4a21d4b7c..002d255e16 100644
--- a/sysdeps/x86_64/multiarch/ifunc-impl-list.c
+++ b/sysdeps/x86_64/multiarch/ifunc-impl-list.c
@@ -798,6 +798,19 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
                               __strstr_avx512)
 	      IFUNC_IMPL_ADD (array, i, strstr, 1, __strstr_sse2_unaligned)
 	      IFUNC_IMPL_ADD (array, i, strstr, 1, __strstr_generic))
+  
+    /* Support sysdeps/x86_64/multiarch/memmem.c.  */
+  IFUNC_IMPL (i, name, memmem,
+              IFUNC_IMPL_ADD (array, i, memmem,
+		              (CPU_FEATURE_USABLE (AVX2)
+			      && CPU_FEATURE_USABLE (BMI1)),
+			      __memmem_avx2)
+              IFUNC_IMPL_ADD (array, i, memmem,
+                              (CPU_FEATURE_USABLE (AVX512BW)
+                               && CPU_FEATURE_USABLE (BMI1)),
+                              __memmem_avx512)
+	      IFUNC_IMPL_ADD (array, i, memmem, 1, __memmem_generic)
+	      IFUNC_IMPL_ADD (array, i, memmem, 1, __memmem_sse2))
 
   /* Support sysdeps/x86_64/multiarch/wcschr.c.  */
   IFUNC_IMPL (i, name, wcschr,
diff --git a/sysdeps/x86_64/multiarch/memmem-avx-base.c b/sysdeps/x86_64/multiarch/memmem-avx-base.c
new file mode 100644
index 0000000000..f8c5ed5f37
--- /dev/null
+++ b/sysdeps/x86_64/multiarch/memmem-avx-base.c
@@ -0,0 +1,37 @@
+/* Copyright (C) 2012-2023 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+const unsigned char ___rarebyte_table[256] attribute_hidden
+    = { 0,   1,	  13,  56,  59,	 60,  61,  62,	63,  232, 248, 2,   158, 4,
+	5,   6,	  7,   8,   9,	 10,  14,  20,	26,  29,  37,  46,  52,	 53,
+	54,  55,  57,  58,  255, 172, 242, 193, 162, 174, 178, 182, 218, 219,
+	212, 180, 249, 197, 221, 210, 253, 231, 230, 224, 225, 226, 227, 223,
+	222, 220, 176, 213, 184, 229, 188, 164, 159, 209, 181, 203, 189, 216,
+	196, 192, 185, 205, 161, 168, 215, 187, 211, 194, 195, 165, 206, 204,
+	214, 198, 173, 179, 175, 183, 167, 202, 239, 201, 160, 241, 163, 246,
+	233, 238, 240, 254, 237, 208, 234, 250, 169, 186, 236, 217, 245, 243,
+	228, 170, 247, 244, 251, 235, 199, 200, 252, 207, 177, 191, 171, 190,
+	166, 3,	  140, 134, 124, 126, 86,  128, 95,  117, 114, 93,  81,	 87,
+	132, 96,  112, 97,  103, 82,  139, 89,	98,  88,  119, 74,  156, 115,
+	104, 75,  120, 106, 76,	 155, 90,  122, 107, 125, 152, 145, 136, 137,
+	101, 116, 102, 108, 99,	 141, 77,  78,	118, 79,  109, 100, 150, 73,
+	94,  72,  121, 151, 113, 135, 110, 105, 83,  91,  11,  12,  64,	 149,
+	146, 111, 65,  69,  66,	 15,  16,  17,	18,  19,  130, 92,  144, 123,
+	21,  22,  23,  24,  131, 133, 127, 142, 25,  70,  129, 27,  28,	 67,
+	153, 84,  143, 138, 147, 157, 148, 68,	71,  30,  31,  32,  33,	 34,
+	35,  36,  154, 38,  39,	 40,  41,  42,	80,  43,  44,  45,  47,	 48,
+	85,  49,  50,  51 };
diff --git a/sysdeps/x86_64/multiarch/memmem-avx-base.h b/sysdeps/x86_64/multiarch/memmem-avx-base.h
new file mode 100644
index 0000000000..71c15d8c2f
--- /dev/null
+++ b/sysdeps/x86_64/multiarch/memmem-avx-base.h
@@ -0,0 +1,255 @@
+/* Copyright (C) 2012-2023 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#include <immintrin.h>
+#include <inttypes.h>
+#include <string.h>
+#include <libc-pointer-arith.h>
+#include "str-two-way.h"
+
+#ifndef FUNC_NAME
+#  define FUNC_NAME __memmem_avx2
+#endif
+#ifndef VEC
+#  define VEC __m256i
+#endif
+#ifndef MASK
+#  define MASK uint32_t
+#endif
+#ifndef LOAD
+#  define LOAD(x) _mm256_load_si256 (x)
+#endif
+#ifndef LOADU
+#  define LOADU(x) _mm256_loadu_si256 (x)
+#endif
+#ifndef CMPEQ8_MASK
+#  define CMPEQ8_MASK(x, y) _mm256_movemask_epi8 (_mm256_cmpeq_epi8 (x, y))
+#endif
+#ifndef SETONE8
+#  define SETONE8(x) _mm256_set1_epi8 (x)
+#endif
+#ifndef TZCNT
+#  define TZCNT(x) __builtin_ctz (x)
+#endif
+#ifndef BLSR
+#  define BLSR(x) ((x) & ((x) -1))
+#endif
+#ifndef MEMMEM_GENERIC
+#  define MEMMEM_GENERIC __memmem_generic
+#endif
+#ifndef TWO_WAY_LONG_NEEDLE_THRESHOLD
+#  define TWO_WAY_LONG_NEEDLE_THRESHOLD VEC_SIZE
+#endif
+#ifndef VEC_SIZE
+#  define VEC_SIZE 32
+#endif
+#define ONES ((MASK) -1)
+
+#ifndef MEMCMPEQ
+#  define MEMCMPEQ __memcmpeq
+#endif
+#ifndef MEMCPY
+#  define MEMCPY memcpy
+#endif
+#ifndef MEMCHR
+#  define MEMCHR memchr
+#endif
+#ifndef PAGE_SIZE
+#  define PAGE_SIZE 4096
+#endif
+#define MIN(x, y) (((x) < (y)) ? (x) : (y))
+#define NOT_CROSSING_PAGE(p, obj_size)                                        \
+  (PTR_DIFF (PTR_ALIGN_UP (p, PAGE_SIZE), p) >= obj_size)
+#if TWO_WAY_LONG_NEEDLE_THRESHOLD > VEC_SIZE
+#  define LONG_NEEDLE 1
+#  define MIN_VEC(ne_len) MIN (ne_len, VEC_SIZE)
+#else
+#  define LONG_NEEDLE 0
+#  define MIN_VEC(ne_len) (ne_len)
+#endif
+
+_Static_assert (VEC_SIZE == sizeof (VEC), "VEC_SIZE != sizeof (VEC).");
+_Static_assert (
+    TWO_WAY_LONG_NEEDLE_THRESHOLD <= VEC_SIZE * 2,
+    "FIND_MATCH() assumes TWO_WAY_LONG_NEEDLE_THRESHOLD <= VEC_SIZE * 2.");
+
+#if LONG_NEEDLE
+#  define FIND_MATCH()                                                        \
+    if (NOT_CROSSING_PAGE (hp, VEC_SIZE * 2))                                 \
+      {                                                                       \
+	/* Do a vector compare if we are not crossing a page. */              \
+	hv = LOADU ((const VEC *) hp);                                        \
+	cmpm = (MASK) CMPEQ8_MASK (hv, nv) << matchsh;                        \
+	/* Compare only the relevant bits of the needle vector. */            \
+	if (cmpm == matchm)                                                   \
+	  {                                                                   \
+	    if (ne_len <= VEC_SIZE)                                           \
+	      return (void *) hp;                                             \
+	    /* Compare the rest of the needle. */                             \
+	    hv = LOADU ((const VEC *) hp + 1);                                \
+	    cmpm = (MASK) CMPEQ8_MASK (hv, nv_e) << matchsh_e;                \
+	    if (cmpm == matchm_e)                                             \
+	      return (void *) hp;                                             \
+	  }                                                                   \
+      }                                                                       \
+    else                                                                      \
+      {                                                                       \
+	if (!MEMCMPEQ (hp, ne, ne_len))                                       \
+	  return (void *) hp;                                                 \
+      }
+#else
+#  define FIND_MATCH()                                                        \
+    if (NOT_CROSSING_PAGE (hp, VEC_SIZE))                                     \
+      {                                                                       \
+	hv = LOADU ((const VEC *) hp);                                        \
+	cmpm = (MASK) CMPEQ8_MASK (hv, nv) << matchsh;                        \
+	if (cmpm == matchm)                                                   \
+	  return (void *) hp;                                                 \
+      }                                                                       \
+    else                                                                      \
+      {                                                                       \
+	if (!MEMCMPEQ (hp, ne, ne_len))                                       \
+	  return (void *) hp;                                                 \
+      }
+#endif
+
+extern void *MEMMEM_GENERIC (const void *, size_t, const void *,
+			     size_t) attribute_hidden;
+
+/* Lower is rarer. The table is based on the *.c and *.h files in glibc. */
+extern const unsigned char ___rarebyte_table[256] attribute_hidden;
+
+static inline void *__attribute__ ((always_inline))
+find_rarest_byte (const unsigned char *rare, size_t n)
+{
+  const unsigned char *p = (const unsigned char *) rare;
+  int c_rare = ___rarebyte_table[*rare];
+  int c;
+  for (; n--; ++p)
+    {
+      c = ___rarebyte_table[*p];
+      if (c < c_rare)
+	{
+	  rare = p;
+	  c_rare = c;
+	}
+    }
+  return (void *) rare;
+}
+
+void *
+FUNC_NAME (const void *hs, size_t hs_len, const void *ne, size_t ne_len)
+{
+  if (ne_len == 1)
+    return (void *) MEMCHR (hs, *(unsigned char *) ne, hs_len);
+  if (__glibc_unlikely (ne_len == 0))
+    return (void *) hs;
+  if (__glibc_unlikely (hs_len < ne_len))
+    return NULL;
+  /* Linear-time worst-case performance is guaranteed by the generic
+     implementation using the Two-Way algorithm.  */
+  if (__glibc_unlikely (ne_len > TWO_WAY_LONG_NEEDLE_THRESHOLD))
+    return MEMMEM_GENERIC (hs, hs_len, ne, ne_len);
+  VEC hv0, hv1, hv, nv;
+  MASK i, hm0, hm1, m, cmpm;
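+  /* MATCHSH shifts out the compare-mask bits past the needle end; a full
+     match of the first MIN (NE_LEN, VEC_SIZE) bytes then equals MATCHM.  */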
+  const unsigned int matchsh = ne_len < VEC_SIZE ? VEC_SIZE - ne_len : 0;
+  const MASK matchm = ONES << matchsh;
+#if LONG_NEEDLE
+  VEC nv_e;
+  const unsigned int matchsh_e
+      = ne_len < VEC_SIZE * 2 ? VEC_SIZE * 2 - ne_len : 0;
+  const MASK matchm_e = ONES << matchsh_e;
+#endif
+  const unsigned char *h = (const unsigned char *) hs;
+  const unsigned char *const end = h + hs_len - ne_len;
+  const unsigned char *hp;
+  size_t rare = PTR_DIFF (
+      find_rarest_byte ((const unsigned char *) ne, MIN_VEC (ne_len)), ne);
+  /* RARE will always be the first byte to find.
+     If RARE is at the end of the needle, use the byte before it. */
+  if (rare == MIN_VEC (ne_len) - 1)
+    --rare;
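+  /* Broadcast the rare byte and the byte following it; candidate positions
+     must match this pair before the full comparison is attempted.  */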
+  const VEC nv0 = SETONE8 (*((char *) ne + rare));
+  const VEC nv1 = SETONE8 (*((char *) ne + rare + 1));
+  unsigned int off_e = (PTR_DIFF (end, h) < VEC_SIZE)
+			   ? VEC_SIZE - (unsigned int) (end - h) - 1
+			   : 0;
+  /* Start from the position of RARE. */
+  h += rare;
+  /* Load the needle vector. */
+  if (NOT_CROSSING_PAGE (ne, VEC_SIZE)
+      || (LONG_NEEDLE ? ne_len >= VEC_SIZE : 0))
+    nv = LOADU ((const VEC *) ne);
+  else
+    MEMCPY (&nv, ne, MIN_VEC (ne_len));
+#if LONG_NEEDLE
+  if (ne_len >= VEC_SIZE)
+    {
+      if (NOT_CROSSING_PAGE (ne, VEC_SIZE * 2))
+	nv_e = LOADU ((const VEC *) ne + 1);
+      else
+	MEMCPY (&nv_e, (const unsigned char *) ne + VEC_SIZE,
+		MIN (VEC_SIZE, ne_len - VEC_SIZE));
+    }
+#endif
+  const unsigned int off_s = PTR_DIFF (h, PTR_ALIGN_DOWN (h, VEC_SIZE));
+  /* Align down to VEC_SIZE. */
+  h -= off_s;
+  hv0 = LOAD ((const VEC *) h);
+  hm0 = (MASK) CMPEQ8_MASK (hv0, nv0);
+  hm1 = (MASK) CMPEQ8_MASK (hv0, nv1) >> 1;
+  /* Clear the irrelevant bits from aligning down (OFF_S) and the ones
+     that are out of bounds (OFF_E).  */
+  m = ((hm0 & hm1) >> off_s) & (ONES >> off_e);
+  while (m)
+    {
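+      /* I indexes the lowest candidate bit; BLSR (m & (m - 1)) clears it.  */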
+      i = TZCNT (m);
+      m = BLSR (m);
+      hp = h + off_s + i - rare;
+      FIND_MATCH ();
+    }
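+  /* Step H so that H + 1 is VEC_SIZE-aligned for the aligned load in the
+     main loop below.  */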
+  h += VEC_SIZE - 1;
+  for (; h - rare + VEC_SIZE <= end; h += VEC_SIZE)
+    {
+      hv0 = LOADU ((const VEC *) h);
+      hv1 = LOAD ((const VEC *) (h + 1));
+      hm1 = (MASK) CMPEQ8_MASK (hv1, nv1);
+      hm0 = (MASK) CMPEQ8_MASK (hv0, nv0);
+      m = hm0 & hm1;
+      while (m)
+	{
+	match:
+	  i = TZCNT (m);
+	  m = BLSR (m);
+	  hp = h + i - rare;
+	  FIND_MATCH ();
+	}
+    }
+  if (h - rare <= end)
+    {
+      off_e = VEC_SIZE - (unsigned int) (end - (h - rare)) - 1;
+      hv0 = LOADU ((const VEC *) h);
+      hv1 = LOAD ((const VEC *) (h + 1));
+      hm1 = (MASK) CMPEQ8_MASK (hv1, nv1);
+      hm0 = (MASK) CMPEQ8_MASK (hv0, nv0);
+      /* Clear the irrelevant bits that are out of bounds. */
+      m = hm0 & hm1 & (ONES >> off_e);
+      if (m)
+	goto match;
+    }
+  return NULL;
+}
diff --git a/sysdeps/x86_64/multiarch/memmem-avx2.c b/sysdeps/x86_64/multiarch/memmem-avx2.c
new file mode 100644
index 0000000000..ef5e7c1c67
--- /dev/null
+++ b/sysdeps/x86_64/multiarch/memmem-avx2.c
@@ -0,0 +1,6 @@
+#include "str-two-way.h"
+#define MEMMEM_GENERIC TWO_WAY_LONG_NEEDLE_FUNC_NAME
+#define TWO_WAY_LONG_NEEDLE_THRESHOLD ((VEC_SIZE) * 2)
+#define VEC_SIZE 32
+#define FUNC_NAME __memmem_avx2
+#include "memmem-avx-base.h"
diff --git a/sysdeps/x86_64/multiarch/memmem-avx512.c b/sysdeps/x86_64/multiarch/memmem-avx512.c
new file mode 100644
index 0000000000..b1f23889ec
--- /dev/null
+++ b/sysdeps/x86_64/multiarch/memmem-avx512.c
@@ -0,0 +1,13 @@
+#include "str-two-way.h"
+#define MEMMEM_GENERIC TWO_WAY_LONG_NEEDLE_FUNC_NAME
+#define TWO_WAY_LONG_NEEDLE_THRESHOLD ((VEC_SIZE) * 2)
+#define VEC_SIZE 64
+#define VEC __m512i
+#define MASK uint64_t
+#define LOAD(x) _mm512_load_si512 (x)
+#define LOADU(x) _mm512_loadu_si512 (x)
+#define CMPEQ8_MASK(x, y) _mm512_cmpeq_epi8_mask (x, y)
+#define SETONE8(x) _mm512_set1_epi8 (x)
+#define TZCNT(x) _tzcnt_u64 (x)
+#define FUNC_NAME __memmem_avx512
+#include "memmem-avx-base.h"
diff --git a/sysdeps/x86_64/multiarch/memmem-sse2.c b/sysdeps/x86_64/multiarch/memmem-sse2.c
new file mode 100644
index 0000000000..a69e35a25b
--- /dev/null
+++ b/sysdeps/x86_64/multiarch/memmem-sse2.c
@@ -0,0 +1,16 @@
+#include <limits.h>
+#include <stdint.h>
+#include <x86intrin.h>
+
+#define VEC __m128i
+#define VEC_SIZE 16
+#define MASK uint16_t
+#define LOAD(x) _mm_load_si128 (x)
+#define LOADU(x) _mm_loadu_si128 (x)
+#define CMPEQ8_MASK(x, y) _mm_movemask_epi8 (_mm_cmpeq_epi8 (x, y))
+#define SETONE8(x) _mm_set1_epi8 (x)
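+/* BSF is undefined for a zero input, so map 0 to the mask width.  */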
+#define TZCNT(x)                                                              \
+  ((x) ? _bit_scan_forward (x) : (MASK) sizeof (MASK) * CHAR_BIT)
+
+#define FUNC_NAME __memmem_sse2
+#define TWO_WAY_LONG_NEEDLE_THRESHOLD VEC_SIZE
+
+#include "memmem-avx-base.h"
diff --git a/sysdeps/x86_64/multiarch/memmem.c b/sysdeps/x86_64/multiarch/memmem.c
new file mode 100644
index 0000000000..69ee4867ad
--- /dev/null
+++ b/sysdeps/x86_64/multiarch/memmem.c
@@ -0,0 +1,73 @@
+/* Multiple versions of memmem.
+   All versions must be listed in ifunc-impl-list.c.
+   Copyright (C) 2012-2023 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+/* Redefine memmem so that the compiler won't complain about the type
+   mismatch with the IFUNC selector in strong_alias, below.  */
+#undef memmem
+#define memmem __redirect_memmem
+#include <string.h>
+#undef memmem
+
+#define MEMMEM __memmem_generic
+#ifdef SHARED
+#  undef libc_hidden_weak
+#  define libc_hidden_weak(name)                                              \
+    __hidden_ver1 (__memmem_generic, __GI_memmem, __memmem_generic);
+#endif
+
+#include "str-two-way.h"
+#define TWO_WAY_LONG_NEEDLE_NON_STATIC
+#include "string/memmem.c"
+
+extern __typeof (__redirect_memmem) __memmem_avx2 attribute_hidden;
+extern __typeof (__redirect_memmem) __memmem_avx512 attribute_hidden;
+extern __typeof (__redirect_memmem) __memmem_generic attribute_hidden;
+extern __typeof (__redirect_memmem) __memmem_sse2 attribute_hidden;
+
+#define SYMBOL_NAME memmem
+
+#include "init-arch.h"
+
+/* Avoid DWARF definition DIE on ifunc symbol so that GDB can handle
+   ifunc symbol properly.  */
+extern __typeof (__redirect_memmem) __libc_memmem;
+
+static inline void *
+IFUNC_SELECTOR (void)
+{
+  const struct cpu_features *cpu_features = __get_cpu_features ();
+
+  if (!CPU_FEATURES_ARCH_P (cpu_features, Prefer_No_AVX512)
+      && CPU_FEATURE_USABLE_P (cpu_features, AVX512BW)
+      && CPU_FEATURE_USABLE_P (cpu_features, BMI1))
+    return __memmem_avx512;
+
+  if (CPU_FEATURE_USABLE_P (cpu_features, AVX2)
+      && CPU_FEATURE_USABLE_P (cpu_features, BMI1))
+    return __memmem_avx2;
+
+  if (CPU_FEATURES_ARCH_P (cpu_features, Fast_Unaligned_Load))
+    return __memmem_sse2;
+
+  return __memmem_generic;
+}
+
+libc_ifunc_redirected (__redirect_memmem, __libc_memmem, IFUNC_SELECTOR ());
+#undef memmem
+strong_alias (__libc_memmem, __memmem)
diff --git a/sysdeps/x86_64/multiarch/str-two-way.h b/sysdeps/x86_64/multiarch/str-two-way.h
new file mode 100644
index 0000000000..b9a6ddc455
--- /dev/null
+++ b/sysdeps/x86_64/multiarch/str-two-way.h
@@ -0,0 +1 @@
+#define TWO_WAY_LONG_NEEDLE_FUNC_NAME __two_way_long_needle
-- 
2.43.2



* Re: [PATCH v7] sysdeps/x86_64/multiarch/memmem-avx2.c: add memmem-avx2.c
  2024-02-21 20:30     ` Alexander Monakov
  2024-02-21 22:17       ` Noah Goldstein
@ 2024-02-27 15:06       ` Rich Felker
  2024-03-01 21:31         ` Gabriel Ravier
  1 sibling, 1 reply; 25+ messages in thread
From: Rich Felker @ 2024-02-27 15:06 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: Noah Goldstein, James Tirta Halim, libc-alpha

On Wed, Feb 21, 2024 at 11:30:16PM +0300, Alexander Monakov wrote:
> 
> On Wed, 21 Feb 2024, Noah Goldstein wrote:
> 
> > > +#ifndef TZCNT
> > > +#  define TZCNT(x) _tzcnt_u32 (x)
> > > +#endif
> > Use `__builtin_ctz`
> > > +#ifndef BLSR
> > > +#  define BLSR(x) _blsr_u32 (x)
> > > +#endif
> > 
> > Think you can drop the `BLSR` define (here and in the avx512)
> > and just replace with `((x) & ((x) - 1))`
> > any reasonable compiler will optimize that correctly.
> 
> I am really confused why review of such minor technical details is happening
> as if the proposed change is desirable and the goal is to include it in Glibc,
> and as if algorithm-wise it's all fine, including the relevance of rarebyte_table to
> real-world uses of memmem and handling of page boundaries when iterating over
> the haystack. Not to mention the necessity of carrying SIMD variants of memmem
> in Glibc.

Same. I would really like to see glibc stop entertaining big-O
regressions in the form of magic tricks that happen to work well on
the submitter's test cases. It's reminiscent of the good ol' days of:

https://sourceware.org/git/?p=glibc.git;a=blob;f=string/strstr.c;hb=0ecb606cb6cf65de1d9fc8a919bceb4be476c602

It's also really not nice to drag people who honestly want to
contribute through revising something that's never going to
make sense to include. High-level "is this desirable to begin with?"
should really be resolved before code-review-for-inclusion.

Rich


* Re: [PATCH v7] sysdeps/x86_64/multiarch/memmem-avx2.c: add memmem-avx2.c
  2024-02-23 17:27         ` Adhemerval Zanella Netto
@ 2024-02-29 20:19           ` Alexander Monakov
  2024-03-02 21:00             ` Noah Goldstein
  2024-03-05 15:25             ` Adhemerval Zanella Netto
  0 siblings, 2 replies; 25+ messages in thread
From: Alexander Monakov @ 2024-02-29 20:19 UTC (permalink / raw)
  To: Adhemerval Zanella Netto; +Cc: Noah Goldstein, James Tirta Halim, libc-alpha


On Fri, 23 Feb 2024, Adhemerval Zanella Netto wrote:

> Alexandre, is your reservation about this optimization related to the extra code
> and data required to optimize for a limited input range?

No, my concern is more general. As I see it, Noah is offering target-specific
feedback without making it clear whether he is deferring high-level decisions
to someone else, or taking the responsibility for them himself (and giving
an implicit ack by jumping straight to technical review). But as Rich said,
high-level review really needs to be done before the patch is rerolled to v8
on coding style and other miscellanea. That includes:

1. "Is this desirable on the high level?" The people who initially bear
the cost of mistakes are the users (who did not ask for an AVX2 memmem
in the first place) and distribution maintainers who triage the issues. Adding
a new SIMD variant to Glibc is not without cost. Why is it important that
Glibc carries an AVX2 memmem which achieves only a 2x speedup according to
the microbenchmark provided by the submitter, despite using 32-byte vectors?
Shouldn't it aim for a 32x speedup over the generic implementation?
Would you entertain AVX-512 strfry and memfrob?

2. "Is the algorithm correct?"

3. "Is the algorithm efficient?" (big-O time and space complexity)

4. "Are the risks of bugs and regressions acceptable?"

5. "Are there any potential security issues?"

6. "Are the size and energy trade-offs acceptable?" In this particular case,
the look-up table probably incurs a page fault on first use, and might even
cause an extra page fault for programs that don't use memmem, by virtue of
pushing apart other read-only data that is more frequently used. A
microbenchmark wouldn't capture this.

7. "Is test coverage adequate?" If I understand correctly, the difficult
cases from the strstr testsuite were not used for memmem, and there was
no discussion of cases that hit the worst case for the proposed algorithm.
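
For instance, an input along these lines (an illustrative sketch, sized
for the v7 AVX2 constants, where the Two-Way fallback only kicks in above
2 * VEC_SIZE = 64 bytes) makes every haystack position pass the rare-byte
filter and fail only on the last needle byte:

  #define _GNU_SOURCE
  #include <stdlib.h>
  #include <string.h>

  int
  main (void)
  {
    size_t n = 1 << 20;
    char *hs = malloc (n);
    char ne[63];                 /* just below the 64-byte cutoff */
    memset (hs, 'a', n);
    memset (ne, 'a', sizeof ne);
    ne[62] = 'b';                /* every candidate dies on the last byte */
    void *r = memmem (hs, n, ne, sizeof ne);
    free (hs);
    return r != NULL;            /* never matches */
  }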

I see AVX-512 strstr was accepted without mentioning it's O(n*m).

Alexander


* Re: [PATCH v7] sysdeps/x86_64/multiarch/memmem-avx2.c: add memmem-avx2.c
  2024-02-27 15:06       ` Rich Felker
@ 2024-03-01 21:31         ` Gabriel Ravier
  0 siblings, 0 replies; 25+ messages in thread
From: Gabriel Ravier @ 2024-03-01 21:31 UTC (permalink / raw)
  To: Rich Felker, Alexander Monakov
  Cc: Noah Goldstein, James Tirta Halim, libc-alpha

On 2/27/24 15:06, Rich Felker wrote:
> On Wed, Feb 21, 2024 at 11:30:16PM +0300, Alexander Monakov wrote:
>> On Wed, 21 Feb 2024, Noah Goldstein wrote:
>>
>>>> +#ifndef TZCNT
>>>> +#  define TZCNT(x) _tzcnt_u32 (x)
>>>> +#endif
>>> Use `__builtin_ctz`
>>>> +#ifndef BLSR
>>>> +#  define BLSR(x) _blsr_u32 (x)
>>>> +#endif
>>> Think you can drop the `BLSR` define (here and in the avx512)
>>> and just replace with `((x) & ((x) - 1))`
>>> any reasonable compiler will optimize that correctly.
>> I am really confused why review of such minor technical details is happening
>> as if the proposed change is desirable and the goal is to include it in Glibc,
>> and as if algorithm-wise it's all fine, including the relevance of rarebyte_table to
>> real-world uses of memmem and handling of page boundaries when iterating over
>> the haystack. Not to mention the necessity of carrying SIMD variants of memmem
>> in Glibc.
> Same. I would really like to see glibc stop entertaining big-O
> regressions in the form of magic tricks that happen to work well on
> the submitter's test cases. It's reminiscent of the good ol' days of:
>
> https://sourceware.org/git/?p=glibc.git;a=blob;f=string/strstr.c;hb=0ecb606cb6cf65de1d9fc8a919bceb4be476c602


...or reminiscent of right now, given that glibc still seems to use
pretty much the same algorithm for wcsstr. At least it looks like
there's a patch currently under review to fix that.


>
> It's also really not nice to drag people who honestly want to
> contribute through revising something that's never going to
> make sense to include. High-level "is this desirable to begin with?"
> should really be resolved before code-review-for-inclusion.
>
> Rich




* Re: [PATCH v7] sysdeps/x86_64/multiarch/memmem-avx2.c: add memmem-avx2.c
  2024-02-29 20:19           ` Alexander Monakov
@ 2024-03-02 21:00             ` Noah Goldstein
  2024-03-05 15:25             ` Adhemerval Zanella Netto
  1 sibling, 0 replies; 25+ messages in thread
From: Noah Goldstein @ 2024-03-02 21:00 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: Adhemerval Zanella Netto, James Tirta Halim, libc-alpha

On Thu, Feb 29, 2024 at 2:19 PM Alexander Monakov <amonakov@ispras.ru> wrote:
>
>
> On Fri, 23 Feb 2024, Adhemerval Zanella Netto wrote:
>
> > Alexandre, is your reservation about this optimization related to the extra code
> > and data required to optimize for a limited input range?
>
> No, my concern is more general. As I see it, Noah is offering target-specific
> feedback without making it clear whether he is deferring high-level decisions
> to someone else, or taking the responsibility for them himself (and giving
> an implicit ack by jumping straight to technical review). But as Rich said,
> high-level review really needs to be done before the patch is rerolled to v8
> on coding style and other miscellanea. That includes:
>

There was no implicit ack (or at the very least no intended one).
My opinion is (and was) that we can review the technical details in
parallel with, and independently from, deciding whether the patch is
desirable at all.


* Re: [PATCH v7] sysdeps/x86_64/multiarch/memmem-avx2.c: add memmem-avx2.c
  2024-02-29 20:19           ` Alexander Monakov
  2024-03-02 21:00             ` Noah Goldstein
@ 2024-03-05 15:25             ` Adhemerval Zanella Netto
  2024-03-05 17:05               ` Noah Goldstein
  1 sibling, 1 reply; 25+ messages in thread
From: Adhemerval Zanella Netto @ 2024-03-05 15:25 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: Noah Goldstein, James Tirta Halim, libc-alpha, H.J. Lu



On 29/02/24 17:19, Alexander Monakov wrote:
> 
> On Fri, 23 Feb 2024, Adhemerval Zanella Netto wrote:
> 
>> Alexandre, is your reservation about this optimization related to the extra code
>> and data required to optimize for a limited input range?
> 
> No, my concern is more general. As I see it, Noah is offering target-specific
> feedback without making it clear whether he is deferring high-level decisions
> to someone else, or taking the responsibility for them himself (and giving
> an implicit ack by jumping straight to technical review). But as Rich said,
> high-level review really needs to be done before the patch is rerolled to v8
> on coding style and other miscellanea. That includes:
> 
> 1. "Is this desirable on the high level?" The people who initially bear
> the cost of mistakes are the users (who did not ask for an AVX2 memmem
> in the first place) and distribution maintainers who triage the issues. Adding
> a new SIMD variant to Glibc is not without cost. Why is it important that
> Glibc carries an AVX2 memmem which achieves only a 2x speedup according to
> the microbenchmark provided by the submitter, despite using 32-byte vectors?
> Shouldn't it aim for a 32x speedup over the generic implementation?
> Would you entertain AVX-512 strfry and memfrob?

I tend to agree, and I was outvoted when Intel proposed an SSE/AVX2-optimized
strcat implementation (especially because we already have optimized strlen
and strcpy, and strcat is also a bad interface).

But for a memmem/strstr SIMD version I don't have a strong opinion, nor on
which speedup threshold we should require for inclusion. I tend to agree
with you that a 2x speedup with a limited haystack size is not really
ideal for such code complexity.

> 
> 2. "Is the algorithm correct?"
> 
> 3. "Is the algorithm efficient?" (big-O time and space complexity)

Also agreed, and I think we have already had discussions concluding that
inefficient implementations should not be accepted, especially when the
generic implementation does not show the deficiency.

> 
> 4. "Are the risks of bugs and regressions acceptable?"
> 
> 5. "Are there any potential security issues?"
> 
> 6. "Are the size and energy trade-offs acceptable?" In this particular case,
> the look-up table probably incurs a page fault on first use, and might even
> cause an extra page fault for programs that don't use memmem, by virtue of
> pushing apart other read-only data that is more frequently used. A
> microbenchmark wouldn't capture this.

This would be quite hard to evaluate, but I agree that we should be parsimonious
about data segment increase. 

> 
> 7. "Is test coverage adequate?" If I understand correctly, the difficult
> cases from the strstr testsuite were not used for memmem, and there was
> no discussion of cases that hit the worst case for the proposed algorithm.

Yes, we are lacking test coverage for inputs that might trigger quadratic
behavior. I added some extra tests in my recent wcsstr patch [1], but I
do agree that we should improve it further.

> 
> I see AVX-512 strstr was accepted without mentioning it's O(n*m).

Yes, and I think it was a mistake (I was not aware of this until now).
So now we have these arch optimizations for strstr/memmem/strcasestr:

  1. sysdeps/x86_64/multiarch/strstr-sse2-unaligned.S

  2. sysdeps/x86_64/multiarch/strstr-avx512.c

  3. sysdeps/powerpc/powerpc64/power8/strcasestr.S

  4. sysdeps/s390/strstr-arch13.S

  5. sysdeps/s390/memmem-arch13.S

The x86_64 sse2 one (1.) seems to optimize the linear search for short
needles, similar to the generic implementation (strstr2/strstr3).
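
For reference, the generic short-needle loop works roughly like this
(quoting from memory, so see strstr2 in string/strstr.c for the real
code): it packs the first two needle bytes into an integer and slides a
same-width window over the haystack, so it stays linear in the haystack
by construction:

  static inline char *
  strstr2 (const unsigned char *hs, const unsigned char *ne)
  {
    uint32_t h1 = (ne[0] << 16) | ne[1];
    uint32_t h2 = 0;
    /* One shift/or and one compare per haystack byte.  */
    for (int c = hs[0]; h1 != h2 && c != 0; c = *++hs)
      h2 = (h2 << 16) | c;
    return h1 == h2 ? (char *) hs - 2 : NULL;
  }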

I have not dug into the x86_64 avx one (2.), but if it is really O(n*m) I
think we should remove it.

For powerpc my wild guess is that this is similar to the old strstr
optimization, which was not really an improvement
(1e9a550ba41a5453c6578bb748fe2223a87e3024).

The s390 ones (4., 5.) seem similar to the x86_64 sse2 one, optimizing the
linear search for short needles (but I am not fully sure whether they
avoid O(n*m)).

So I think it would be worth discussing whether we should remove the x86_64
avx512 one and set the bar to avoid adding new strstr/memmem/strcasestr
implementations with O(n*m) behavior.

Thoughts?

> 
> Alexander

[1] https://patchwork.sourceware.org/project/glibc/patch/20240301171524.3706554-3-adhemerval.zanella@linaro.org/


* Re: [PATCH v7] sysdeps/x86_64/multiarch/memmem-avx2.c: add memmem-avx2.c
  2024-03-05 15:25             ` Adhemerval Zanella Netto
@ 2024-03-05 17:05               ` Noah Goldstein
  0 siblings, 0 replies; 25+ messages in thread
From: Noah Goldstein @ 2024-03-05 17:05 UTC (permalink / raw)
  To: Adhemerval Zanella Netto
  Cc: Alexander Monakov, James Tirta Halim, libc-alpha, H.J. Lu

On Tue, Mar 5, 2024 at 9:25 AM Adhemerval Zanella Netto
<adhemerval.zanella@linaro.org> wrote:
> [...]
>
> So I think it would be worth discussing whether we should remove the x86_64
> avx512 one and set the bar to avoid adding new strstr/memmem/strcasestr
> implementations with O(n*m) behavior.

+1



Thread overview: 25+ messages
2024-02-18  8:26 [PATCH v6] sysdeps/x86_64/multiarch/memmem-avx2.c: add memmem-avx2.c James Tirta Halim
2024-02-19  0:07 ` Noah Goldstein
2024-02-19  8:13   ` Alexander Monakov
2024-02-19 14:25     ` Adhemerval Zanella Netto
2024-02-19 17:20       ` Noah Goldstein
2024-02-20  3:00         ` James
2024-02-20 14:30           ` Adhemerval Zanella Netto
2024-02-20 15:16             ` James
2024-02-20 16:13               ` Noah Goldstein
2024-02-20 16:26                 ` James
2024-02-20 16:38                   ` Noah Goldstein
2024-02-21  6:57 ` [PATCH v7] " James Tirta Halim
2024-02-21 17:17   ` Noah Goldstein
2024-02-21 20:30     ` Alexander Monakov
2024-02-21 22:17       ` Noah Goldstein
2024-02-23 17:27         ` Adhemerval Zanella Netto
2024-02-29 20:19           ` Alexander Monakov
2024-03-02 21:00             ` Noah Goldstein
2024-03-05 15:25             ` Adhemerval Zanella Netto
2024-03-05 17:05               ` Noah Goldstein
2024-02-27 15:06       ` Rich Felker
2024-03-01 21:31         ` Gabriel Ravier
2024-02-24  4:25     ` James
2024-02-24  9:09   ` [PATCH v8] " James Tirta Halim
2024-02-24  9:29   ` James Tirta Halim
