public inbox for libc-alpha@sourceware.org
 help / color / mirror / Atom feed
* [PATCH v1 1/6] String: Add __memcmpeq as build target
@ 2021-10-27  2:43 Noah Goldstein
  2021-10-27  2:43 ` [PATCH v1 2/6] Benchtests: Add benchtests for __memcmpeq Noah Goldstein
                   ` (6 more replies)
  0 siblings, 7 replies; 24+ messages in thread
From: Noah Goldstein @ 2021-10-27  2:43 UTC (permalink / raw)
  To: libc-alpha

No bug. This commit just adds __memcmpeq as a build target so that
__memcmpeq implementations that are not simply aliases of memcmp can
be supported.
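
For reference, __memcmpeq has the same prototype as memcmp but only
its zero/nonzero result is meaningful, which is why an alias to memcmp
is already a valid implementation. A minimal C sketch of the contract
(illustration only, not the glibc implementation; the intent is that
compilers call __memcmpeq when a program only tests memcmp's result
against zero):

    #include <stddef.h>
    #include <string.h>

    /* Sketch: any nonzero value may be returned on mismatch, so simply
       forwarding to memcmp already satisfies the contract.  */
    int
    __memcmpeq (const void *s1, const void *s2, size_t n)
    {
      return memcmp (s1, s2, n);
    }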
---
 string/Makefile   |  2 +-
 string/memcmpeq.c | 24 ++++++++++++++++++++++++
 2 files changed, 25 insertions(+), 1 deletion(-)
 create mode 100644 string/memcmpeq.c

diff --git a/string/Makefile b/string/Makefile
index 40d6fac133..2199dd30b7 100644
--- a/string/Makefile
+++ b/string/Makefile
@@ -34,7 +34,7 @@ routines	:= strcat strchr strcmp strcoll strcpy strcspn		\
 		   strerror _strerror strlen strnlen			\
 		   strncat strncmp strncpy				\
 		   strrchr strpbrk strsignal strspn strstr strtok	\
-		   strtok_r strxfrm memchr memcmp memmove memset	\
+		   strtok_r strxfrm memchr memcmp memcmpeq memmove memset	\
 		   mempcpy bcopy bzero ffs ffsll stpcpy stpncpy		\
 		   strcasecmp strncase strcasecmp_l strncase_l		\
 		   memccpy memcpy wordcopy strsep strcasestr		\
diff --git a/string/memcmpeq.c b/string/memcmpeq.c
new file mode 100644
index 0000000000..08726325a8
--- /dev/null
+++ b/string/memcmpeq.c
@@ -0,0 +1,24 @@
+/* Copyright (C) 1991-2021 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+/* This file is intentionally left empty.  It exists so that both
+   architectures which implement __memcmpeq separately from memcmp and
+   architectures which implement __memcmpeq by having it alias memcmp will
+   build.
+
+   The alias for __memcmpeq to memcmp for the C implementation is in
+   memcmp.c.  */
-- 
2.25.1


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v1 2/6] Benchtests: Add benchtests for __memcmpeq
  2021-10-27  2:43 [PATCH v1 1/6] String: Add __memcmpeq as build target Noah Goldstein
@ 2021-10-27  2:43 ` Noah Goldstein
  2021-10-27 12:45   ` H.J. Lu
  2021-10-27 16:07   ` [PATCH v2 " Noah Goldstein
  2021-10-27  2:43 ` [PATCH v1 3/6] x86_64: Add support for __memcmpeq using sse2, avx2, and evex Noah Goldstein
                   ` (5 subsequent siblings)
  6 siblings, 2 replies; 24+ messages in thread
From: Noah Goldstein @ 2021-10-27  2:43 UTC (permalink / raw)
  To: libc-alpha

No bug. This commit adds benchmarks for __memcmpeq. The benchmarks
simply reuse the existing memcmp benchmarks. This will be useful for
testing implementations of __memcmpeq that do not just alias memcmp.
---
 benchtests/Makefile         |  2 +-
 benchtests/bench-memcmp.c   |  4 +++-
 benchtests/bench-memcmpeq.c | 20 ++++++++++++++++++++
 3 files changed, 24 insertions(+), 2 deletions(-)
 create mode 100644 benchtests/bench-memcmpeq.c

diff --git a/benchtests/Makefile b/benchtests/Makefile
index b690aaf65b..7be0e47c47 100644
--- a/benchtests/Makefile
+++ b/benchtests/Makefile
@@ -103,7 +103,7 @@ bench := $(foreach B,$(filter bench-%,${BENCHSET}), ${${B}})
 endif
 
 # String function benchmarks.
-string-benchset := memccpy memchr memcmp memcpy memmem memmove \
+string-benchset := memccpy memchr memcmp memcmpeq memcpy memmem memmove \
 		   mempcpy memset rawmemchr stpcpy stpncpy strcasecmp strcasestr \
 		   strcat strchr strchrnul strcmp strcpy strcspn strlen \
 		   strncasecmp strncat strncmp strncpy strnlen strpbrk strrchr \
diff --git a/benchtests/bench-memcmp.c b/benchtests/bench-memcmp.c
index 0d6a93bf29..546b06e1ab 100644
--- a/benchtests/bench-memcmp.c
+++ b/benchtests/bench-memcmp.c
@@ -17,7 +17,9 @@
    <https://www.gnu.org/licenses/>.  */
 
 #define TEST_MAIN
-#ifdef WIDE
+#ifdef TEST_MEMCMPEQ
+# define TEST_NAME "__memcmpeq"
+#elif defined WIDE
 # define TEST_NAME "wmemcmp"
 #else
 # define TEST_NAME "memcmp"
diff --git a/benchtests/bench-memcmpeq.c b/benchtests/bench-memcmpeq.c
new file mode 100644
index 0000000000..e918d4f77c
--- /dev/null
+++ b/benchtests/bench-memcmpeq.c
@@ -0,0 +1,20 @@
+/* Measure __memcmpeq functions.
+   Copyright (C) 2015-2021 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#define TEST_MEMCMPEQ 1
+#include "bench-memcmp.c"
-- 
2.25.1


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v1 3/6] x86_64: Add support for __memcmpeq using sse2, avx2, and evex
  2021-10-27  2:43 [PATCH v1 1/6] String: Add __memcmpeq as build target Noah Goldstein
  2021-10-27  2:43 ` [PATCH v1 2/6] Benchtests: Add benchtests for __memcmpeq Noah Goldstein
@ 2021-10-27  2:43 ` Noah Goldstein
  2021-10-27 12:47   ` H.J. Lu
  2021-10-27  2:43 ` [PATCH v1 4/6] x86_64: Add sse2 optimized __memcmpeq in memcmp-sse2.S Noah Goldstein
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 24+ messages in thread
From: Noah Goldstein @ 2021-10-27  2:43 UTC (permalink / raw)
  To: libc-alpha

No bug. This commit adds support for __memcmpeq to be implemented
separately from memcmp. Support is added for versions optimized with
sse2, avx2, and evex.
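
The reason ifunc-init.h grows an OPTIMIZE1 variant: SYMBOL_NAME for
this selector is __memcmpeq, which already starts with "__", so the
usual OPTIMIZE/OPTIMIZE2 paste would produce a name with four leading
underscores. A rough preprocessor sketch, assuming the PASTER1/PASTER2
helpers defined earlier in ifunc-init.h:

    #define PASTER1(x, y)    x##_##y
    #define EVALUATOR1(x, y) PASTER1 (x, y)
    #define PASTER2(x, y)    __##x##_##y
    #define EVALUATOR2(x, y) PASTER2 (x, y)

    #define SYMBOL_NAME      __memcmpeq
    #define OPTIMIZE1(name)  EVALUATOR1 (SYMBOL_NAME, name)
    #define OPTIMIZE2(name)  EVALUATOR2 (SYMBOL_NAME, name)

    OPTIMIZE1 (sse2)   /* expands to __memcmpeq_sse2 */
    OPTIMIZE2 (sse2)   /* expands to ____memcmpeq_sse2 (extra "__") */

This is why ifunc-memcmpeq.h below declares its implementations with
OPTIMIZE1.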
---
 sysdeps/generic/ifunc-init.h                 |  5 +-
 sysdeps/x86_64/memcmp.S                      |  9 ++--
 sysdeps/x86_64/multiarch/Makefile            |  4 ++
 sysdeps/x86_64/multiarch/ifunc-impl-list.c   | 21 +++++++++
 sysdeps/x86_64/multiarch/ifunc-memcmpeq.h    | 49 ++++++++++++++++++++
 sysdeps/x86_64/multiarch/memcmp-sse2.S       |  4 +-
 sysdeps/x86_64/multiarch/memcmp.c            |  3 --
 sysdeps/x86_64/multiarch/memcmpeq-avx2-rtm.S | 12 +++++
 sysdeps/x86_64/multiarch/memcmpeq-avx2.S     | 23 +++++++++
 sysdeps/x86_64/multiarch/memcmpeq-evex.S     | 23 +++++++++
 sysdeps/x86_64/multiarch/memcmpeq-sse2.S     | 23 +++++++++
 sysdeps/x86_64/multiarch/memcmpeq.c          | 35 ++++++++++++++
 12 files changed, 202 insertions(+), 9 deletions(-)
 create mode 100644 sysdeps/x86_64/multiarch/ifunc-memcmpeq.h
 create mode 100644 sysdeps/x86_64/multiarch/memcmpeq-avx2-rtm.S
 create mode 100644 sysdeps/x86_64/multiarch/memcmpeq-avx2.S
 create mode 100644 sysdeps/x86_64/multiarch/memcmpeq-evex.S
 create mode 100644 sysdeps/x86_64/multiarch/memcmpeq-sse2.S
 create mode 100644 sysdeps/x86_64/multiarch/memcmpeq.c

diff --git a/sysdeps/generic/ifunc-init.h b/sysdeps/generic/ifunc-init.h
index 7f69485de8..ee8a8289c8 100644
--- a/sysdeps/generic/ifunc-init.h
+++ b/sysdeps/generic/ifunc-init.h
@@ -50,5 +50,8 @@
    '__<symbol>_<variant>' as the optimized implementation and
    '<symbol>_ifunc_selector' as the IFUNC selector.  */
 #define REDIRECT_NAME	EVALUATOR1 (__redirect, SYMBOL_NAME)
-#define OPTIMIZE(name)	EVALUATOR2 (SYMBOL_NAME, name)
 #define IFUNC_SELECTOR	EVALUATOR1 (SYMBOL_NAME, ifunc_selector)
+#define OPTIMIZE1(name)	EVALUATOR1 (SYMBOL_NAME, name)
+#define OPTIMIZE2(name)	EVALUATOR2 (SYMBOL_NAME, name)
+/* Default is to use OPTIMIZE2.  */
+#define OPTIMIZE(name)	OPTIMIZE2(name)
diff --git a/sysdeps/x86_64/memcmp.S b/sysdeps/x86_64/memcmp.S
index 8a03e572e8..b53f2c0866 100644
--- a/sysdeps/x86_64/memcmp.S
+++ b/sysdeps/x86_64/memcmp.S
@@ -356,9 +356,10 @@ L(ATR32res):
 	.p2align 4,, 4
 END(memcmp)
 
-#undef bcmp
+#ifdef USE_AS_MEMCMPEQ
+libc_hidden_def (memcmp)
+#else
+# undef bcmp
 weak_alias (memcmp, bcmp)
-#undef __memcmpeq
-strong_alias (memcmp, __memcmpeq)
 libc_hidden_builtin_def (memcmp)
-libc_hidden_def (__memcmpeq)
+#endif
diff --git a/sysdeps/x86_64/multiarch/Makefile b/sysdeps/x86_64/multiarch/Makefile
index 26be40959c..044778585b 100644
--- a/sysdeps/x86_64/multiarch/Makefile
+++ b/sysdeps/x86_64/multiarch/Makefile
@@ -7,7 +7,9 @@ sysdep_routines += strncat-c stpncpy-c strncpy-c \
 		   memchr-sse2 rawmemchr-sse2 memchr-avx2 rawmemchr-avx2 \
 		   memrchr-sse2 memrchr-avx2 \
 		   memcmp-sse2 \
+		   memcmpeq-sse2 \
 		   memcmp-avx2-movbe \
+		   memcmpeq-avx2 \
 		   memcmp-sse4 memcpy-ssse3 \
 		   memmove-ssse3 \
 		   memcpy-ssse3-back \
@@ -42,6 +44,7 @@ sysdep_routines += strncat-c stpncpy-c strncpy-c \
 		   memset-avx512-unaligned-erms \
 		   memchr-avx2-rtm \
 		   memcmp-avx2-movbe-rtm \
+		   memcmpeq-avx2-rtm \
 		   memmove-avx-unaligned-erms-rtm \
 		   memrchr-avx2-rtm \
 		   memset-avx2-unaligned-erms-rtm \
@@ -61,6 +64,7 @@ sysdep_routines += strncat-c stpncpy-c strncpy-c \
 		   strrchr-avx2-rtm \
 		   memchr-evex \
 		   memcmp-evex-movbe \
+		   memcmpeq-evex \
 		   memmove-evex-unaligned-erms \
 		   memrchr-evex \
 		   memset-evex-unaligned-erms \
diff --git a/sysdeps/x86_64/multiarch/ifunc-impl-list.c b/sysdeps/x86_64/multiarch/ifunc-impl-list.c
index 39ab10613b..f7f3806d1d 100644
--- a/sysdeps/x86_64/multiarch/ifunc-impl-list.c
+++ b/sysdeps/x86_64/multiarch/ifunc-impl-list.c
@@ -38,6 +38,27 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
 
   size_t i = 0;
 
+  /* Support sysdeps/x86_64/multiarch/memcmpeq.c.  */
+  IFUNC_IMPL (i, name, __memcmpeq,
+	      IFUNC_IMPL_ADD (array, i, __memcmpeq,
+			      (CPU_FEATURE_USABLE (AVX2)
+                   && CPU_FEATURE_USABLE (MOVBE)
+			       && CPU_FEATURE_USABLE (BMI2)),
+			      __memcmpeq_avx2)
+	      IFUNC_IMPL_ADD (array, i, __memcmpeq,
+			      (CPU_FEATURE_USABLE (AVX2)
+			       && CPU_FEATURE_USABLE (BMI2)
+                   && CPU_FEATURE_USABLE (MOVBE)
+			       && CPU_FEATURE_USABLE (RTM)),
+			      __memcmpeq_avx2_rtm)
+	      IFUNC_IMPL_ADD (array, i, __memcmpeq,
+			      (CPU_FEATURE_USABLE (AVX512VL)
+			       && CPU_FEATURE_USABLE (AVX512BW)
+                   && CPU_FEATURE_USABLE (MOVBE)
+			       && CPU_FEATURE_USABLE (BMI2)),
+			      __memcmpeq_evex)
+	      IFUNC_IMPL_ADD (array, i, __memcmpeq, 1, __memcmpeq_sse2))
+
   /* Support sysdeps/x86_64/multiarch/memchr.c.  */
   IFUNC_IMPL (i, name, memchr,
 	      IFUNC_IMPL_ADD (array, i, memchr,
diff --git a/sysdeps/x86_64/multiarch/ifunc-memcmpeq.h b/sysdeps/x86_64/multiarch/ifunc-memcmpeq.h
new file mode 100644
index 0000000000..3319a9568a
--- /dev/null
+++ b/sysdeps/x86_64/multiarch/ifunc-memcmpeq.h
@@ -0,0 +1,49 @@
+/* Common definition for __memcmpeq ifunc selections.
+   All versions must be listed in ifunc-impl-list.c.
+   Copyright (C) 2017-2021 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+# include <init-arch.h>
+
+extern __typeof (REDIRECT_NAME) OPTIMIZE1 (sse2) attribute_hidden;
+extern __typeof (REDIRECT_NAME) OPTIMIZE1 (avx2) attribute_hidden;
+extern __typeof (REDIRECT_NAME) OPTIMIZE1 (avx2_rtm) attribute_hidden;
+extern __typeof (REDIRECT_NAME) OPTIMIZE1 (evex) attribute_hidden;
+
+static inline void *
+IFUNC_SELECTOR (void)
+{
+  const struct cpu_features* cpu_features = __get_cpu_features ();
+
+  if (CPU_FEATURE_USABLE_P (cpu_features, AVX2)
+      && CPU_FEATURE_USABLE_P (cpu_features, BMI2)
+      && CPU_FEATURE_USABLE_P (cpu_features, MOVBE)
+      && CPU_FEATURES_ARCH_P (cpu_features, AVX_Fast_Unaligned_Load))
+    {
+      if (CPU_FEATURE_USABLE_P (cpu_features, AVX512VL)
+	  && CPU_FEATURE_USABLE_P (cpu_features, AVX512BW))
+	return OPTIMIZE1 (evex);
+
+      if (CPU_FEATURE_USABLE_P (cpu_features, RTM))
+	return OPTIMIZE1 (avx2_rtm);
+
+      if (!CPU_FEATURES_ARCH_P (cpu_features, Prefer_No_VZEROUPPER))
+	return OPTIMIZE1 (avx2);
+    }
+
+  return OPTIMIZE1 (sse2);
+}
diff --git a/sysdeps/x86_64/multiarch/memcmp-sse2.S b/sysdeps/x86_64/multiarch/memcmp-sse2.S
index 7b30b7ca2e..132d6fb339 100644
--- a/sysdeps/x86_64/multiarch/memcmp-sse2.S
+++ b/sysdeps/x86_64/multiarch/memcmp-sse2.S
@@ -17,7 +17,9 @@
    <https://www.gnu.org/licenses/>.  */
 
 #if IS_IN (libc)
-# define memcmp __memcmp_sse2
+# ifndef memcmp
+#  define memcmp __memcmp_sse2
+# endif
 
 # ifdef SHARED
 #  undef libc_hidden_builtin_def
diff --git a/sysdeps/x86_64/multiarch/memcmp.c b/sysdeps/x86_64/multiarch/memcmp.c
index 7b3409b1dd..fe725f3563 100644
--- a/sysdeps/x86_64/multiarch/memcmp.c
+++ b/sysdeps/x86_64/multiarch/memcmp.c
@@ -29,9 +29,6 @@
 libc_ifunc_redirected (__redirect_memcmp, memcmp, IFUNC_SELECTOR ());
 # undef bcmp
 weak_alias (memcmp, bcmp)
-# undef __memcmpeq
-strong_alias (memcmp, __memcmpeq)
-libc_hidden_def (__memcmpeq)
 
 # ifdef SHARED
 __hidden_ver1 (memcmp, __GI_memcmp, __redirect_memcmp)
diff --git a/sysdeps/x86_64/multiarch/memcmpeq-avx2-rtm.S b/sysdeps/x86_64/multiarch/memcmpeq-avx2-rtm.S
new file mode 100644
index 0000000000..24b6a0c9ff
--- /dev/null
+++ b/sysdeps/x86_64/multiarch/memcmpeq-avx2-rtm.S
@@ -0,0 +1,12 @@
+#ifndef MEMCMP
+# define MEMCMP __memcmpeq_avx2_rtm
+#endif
+
+#define ZERO_UPPER_VEC_REGISTERS_RETURN \
+  ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST
+
+#define VZEROUPPER_RETURN jmp	 L(return_vzeroupper)
+
+#define SECTION(p) p##.avx.rtm
+
+#include "memcmpeq-avx2.S"
diff --git a/sysdeps/x86_64/multiarch/memcmpeq-avx2.S b/sysdeps/x86_64/multiarch/memcmpeq-avx2.S
new file mode 100644
index 0000000000..0181ea0d8d
--- /dev/null
+++ b/sysdeps/x86_64/multiarch/memcmpeq-avx2.S
@@ -0,0 +1,23 @@
+/* __memcmpeq optimized with AVX2.
+   Copyright (C) 2017-2021 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#ifndef MEMCMP
+# define MEMCMP	__memcmpeq_avx2
+#endif
+
+#include "memcmp-avx2-movbe.S"
diff --git a/sysdeps/x86_64/multiarch/memcmpeq-evex.S b/sysdeps/x86_64/multiarch/memcmpeq-evex.S
new file mode 100644
index 0000000000..951e1e9560
--- /dev/null
+++ b/sysdeps/x86_64/multiarch/memcmpeq-evex.S
@@ -0,0 +1,23 @@
+/* __memcmpeq optimized with EVEX.
+   Copyright (C) 2017-2021 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#ifndef MEMCMP
+# define MEMCMP	__memcmpeq_evex
+#endif
+
+#include "memcmp-evex-movbe.S"
diff --git a/sysdeps/x86_64/multiarch/memcmpeq-sse2.S b/sysdeps/x86_64/multiarch/memcmpeq-sse2.S
new file mode 100644
index 0000000000..c488cbbcd9
--- /dev/null
+++ b/sysdeps/x86_64/multiarch/memcmpeq-sse2.S
@@ -0,0 +1,23 @@
+/* __memcmpeq optimized with SSE2.
+   Copyright (C) 2017-2021 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#ifndef memcmp
+# define memcmp	__memcmpeq_sse2
+#endif
+#define USE_AS_MEMCMPEQ	1
+#include "memcmp-sse2.S"
diff --git a/sysdeps/x86_64/multiarch/memcmpeq.c b/sysdeps/x86_64/multiarch/memcmpeq.c
new file mode 100644
index 0000000000..163e56047e
--- /dev/null
+++ b/sysdeps/x86_64/multiarch/memcmpeq.c
@@ -0,0 +1,35 @@
+/* Multiple versions of __memcmpeq.
+   All versions must be listed in ifunc-impl-list.c.
+   Copyright (C) 2017-2021 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+/* Define multiple versions only for the definition in libc.  */
+#if IS_IN (libc)
+# define __memcmpeq __redirect___memcmpeq
+# include <string.h>
+# undef __memcmpeq
+
+# define SYMBOL_NAME __memcmpeq
+# include "ifunc-memcmpeq.h"
+
+libc_ifunc_redirected (__redirect___memcmpeq, __memcmpeq, IFUNC_SELECTOR ());
+
+# ifdef SHARED
+__hidden_ver1 (__memcmpeq, __GI___memcmpeq, __redirect___memcmpeq)
+    __attribute__ ((visibility ("hidden"))) __attribute_copy__ (__memcmpeq);
+# endif
+#endif
-- 
2.25.1


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v1 4/6] x86_64: Add sse2 optimized __memcmpeq in memcmp-sse2.S
  2021-10-27  2:43 [PATCH v1 1/6] String: Add __memcmpeq as build target Noah Goldstein
  2021-10-27  2:43 ` [PATCH v1 2/6] Benchtests: Add benchtests for __memcmpeq Noah Goldstein
  2021-10-27  2:43 ` [PATCH v1 3/6] x86_64: Add support for __memcmpeq using sse2, avx2, and evex Noah Goldstein
@ 2021-10-27  2:43 ` Noah Goldstein
  2021-10-27 12:48   ` H.J. Lu
  2021-10-27  2:43 ` [PATCH v1 5/6] x86_64: Add avx2 optimized __memcmpeq in memcmpeq-avx2.S Noah Goldstein
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 24+ messages in thread
From: Noah Goldstein @ 2021-10-27  2:43 UTC (permalink / raw)
  To: libc-alpha

No bug. This commit does not modify the existing memcmp
implementation. It just adds __memcmpeq ifdefs that skip obvious cases
where computing the exact 1/-1 result required by memcmp is not needed.
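
As a C-level illustration of what the ifdefs buy (a sketch with SSE2
intrinsics, not the code added by this patch): memcmp has to locate
the first mismatching byte and return its signed difference, while
__memcmpeq only needs zero/nonzero, so for a 16-byte block the compare
mask can be turned into the return value directly, as in the new
USE_AS_MEMCMPEQ path of L(s16b):

    #include <immintrin.h>

    /* Sketch: the mask is 0xffff when all 16 bytes match, so the
       subtraction yields 0; any mismatch gives a nonzero (possibly
       negative) value, which is all __memcmpeq guarantees.  */
    static int
    block16_neq (const void *s1, const void *s2)
    {
      __m128i a = _mm_loadu_si128 ((const __m128i *) s1);
      __m128i b = _mm_loadu_si128 ((const __m128i *) s2);
      return _mm_movemask_epi8 (_mm_cmpeq_epi8 (a, b)) - 0xffff;
    }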
---
 sysdeps/x86_64/memcmp.S | 55 ++++++++++++++++++++++++++++++++++++++---
 1 file changed, 51 insertions(+), 4 deletions(-)

diff --git a/sysdeps/x86_64/memcmp.S b/sysdeps/x86_64/memcmp.S
index b53f2c0866..c245383963 100644
--- a/sysdeps/x86_64/memcmp.S
+++ b/sysdeps/x86_64/memcmp.S
@@ -49,34 +49,63 @@ L(s2b):
 	movzwl	(%rdi),	%eax
 	movzwl	(%rdi, %rsi), %edx
 	subq    $2, %r10
+#ifdef USE_AS_MEMCMPEQ
+	je	L(finz1)
+#else
 	je	L(fin2_7)
+#endif
 	addq	$2, %rdi
 	cmpl	%edx, %eax
+#ifdef USE_AS_MEMCMPEQ
+	jnz	L(neq_early)
+#else
 	jnz	L(fin2_7)
+#endif
 L(s4b):
 	testq	$4, %r10
 	jz	L(s8b)
 	movl	(%rdi),	%eax
 	movl	(%rdi, %rsi), %edx
 	subq    $4, %r10
+#ifdef USE_AS_MEMCMPEQ
+	je	L(finz1)
+#else
 	je	L(fin2_7)
+#endif
 	addq	$4, %rdi
 	cmpl	%edx, %eax
+#ifdef USE_AS_MEMCMPEQ
+	jnz	L(neq_early)
+#else
 	jnz	L(fin2_7)
+#endif
 L(s8b):
 	testq	$8, %r10
 	jz	L(s16b)
 	movq	(%rdi),	%rax
 	movq	(%rdi, %rsi), %rdx
 	subq    $8, %r10
+#ifdef USE_AS_MEMCMPEQ
+	je	L(sub_return8)
+#else
 	je	L(fin2_7)
+#endif
 	addq	$8, %rdi
 	cmpq	%rdx, %rax
+#ifdef USE_AS_MEMCMPEQ
+	jnz	L(neq_early)
+#else
 	jnz	L(fin2_7)
+#endif
 L(s16b):
 	movdqu    (%rdi), %xmm1
 	movdqu    (%rdi, %rsi), %xmm0
 	pcmpeqb   %xmm0, %xmm1
+#ifdef USE_AS_MEMCMPEQ
+	pmovmskb  %xmm1, %eax
+	subl      $0xffff, %eax
+	ret
+#else
 	pmovmskb  %xmm1, %edx
 	xorl	  %eax, %eax
 	subl      $0xffff, %edx
@@ -86,7 +115,7 @@ L(s16b):
 	movzbl	 (%rcx), %eax
 	movzbl	 (%rsi, %rcx), %edx
 	jmp	 L(finz1)
-
+#endif
 	.p2align 4,, 4
 L(finr1b):
 	movzbl	(%rdi), %eax
@@ -95,7 +124,15 @@ L(finz1):
 	subl	%edx, %eax
 L(exit):
 	ret
-
+#ifdef USE_AS_MEMCMPEQ
+	.p2align 4,, 4
+L(sub_return8):
+	subq	%rdx, %rax
+	movl	%eax, %edx
+	shrq	$32, %rax
+	orl	%edx, %eax
+	ret
+#else
 	.p2align 4,, 4
 L(fin2_7):
 	cmpq	%rdx, %rax
@@ -111,12 +148,17 @@ L(fin2_7):
 	movzbl  %dl, %edx
 	subl	%edx, %eax
 	ret
-
+#endif
 	.p2align 4,, 4
 L(finz):
 	xorl	%eax, %eax
 	ret
-
+#ifdef USE_AS_MEMCMPEQ
+	.p2align 4,, 4
+L(neq_early):
+	movl	$1, %eax
+	ret
+#endif
 	/* For blocks bigger than 32 bytes
 	   1. Advance one of the addr pointer to be 16B aligned.
 	   2. Treat the case of both addr pointers aligned to 16B
@@ -246,11 +288,16 @@ L(mt16):
 
 	.p2align 4,, 4
 L(neq):
+#ifdef USE_AS_MEMCMPEQ
+	movl	$1, %eax
+	ret
+#else
 	bsfl      %edx, %ecx
 	movzbl	 (%rdi, %rcx), %eax
 	addq	 %rdi, %rsi
 	movzbl	 (%rsi,%rcx), %edx
 	jmp	 L(finz1)
+#endif
 
 	.p2align 4,, 4
 L(ATR):
-- 
2.25.1


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v1 5/6] x86_64: Add avx2 optimized __memcmpeq in memcmpeq-avx2.S
  2021-10-27  2:43 [PATCH v1 1/6] String: Add __memcmpeq as build target Noah Goldstein
                   ` (2 preceding siblings ...)
  2021-10-27  2:43 ` [PATCH v1 4/6] x86_64: Add sse2 optimized __memcmpeq in memcmp-sse2.S Noah Goldstein
@ 2021-10-27  2:43 ` Noah Goldstein
  2021-10-27 12:48   ` H.J. Lu
  2021-10-27  2:43 ` [PATCH v1 6/6] x86_64: Add evex optimized __memcmpeq in memcmpeq-evex.S Noah Goldstein
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 24+ messages in thread
From: Noah Goldstein @ 2021-10-27  2:43 UTC (permalink / raw)
  To: libc-alpha

No bug. This commit adds a new optimized __memcmpeq implementation
for avx2.

The primary optimizations are:

1) skipping the logic to find the difference of the first mismatched
byte.

2) not updating src/dst addresses as the non-equals logic does not
need to be reused by different areas.
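
The building block used throughout is the same: compare 32 bytes with
vpcmpeqb, take the byte mask with vpmovmskb, and rely on the fact that
adding 1 to an all-ones 32-bit mask wraps to zero. A rough intrinsics
sketch of one such check (illustration only, not the assembly below):

    #include <immintrin.h>

    /* Sketch: the mask is 0xffffffff when all 32 bytes match, so
       mask + 1 wraps to 0; any mismatch gives a nonzero result, which
       is all __memcmpeq has to report.  */
    static unsigned int
    vec32_neq (const void *s1, const void *s2)
    {
      __m256i a = _mm256_loadu_si256 ((const __m256i *) s1);
      __m256i b = _mm256_loadu_si256 ((const __m256i *) s2);
      unsigned int mask = _mm256_movemask_epi8 (_mm256_cmpeq_epi8 (a, b));
      return mask + 1;
    }

This is what the repeated vpmovmskb/incl pairs below do; the real code
also folds several compare results together with vpand before the
final check.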
---
 sysdeps/x86_64/multiarch/ifunc-impl-list.c   |   2 -
 sysdeps/x86_64/multiarch/ifunc-memcmpeq.h    |   2 +-
 sysdeps/x86_64/multiarch/memcmpeq-avx2-rtm.S |   4 +-
 sysdeps/x86_64/multiarch/memcmpeq-avx2.S     | 309 ++++++++++++++++++-
 4 files changed, 308 insertions(+), 9 deletions(-)

diff --git a/sysdeps/x86_64/multiarch/ifunc-impl-list.c b/sysdeps/x86_64/multiarch/ifunc-impl-list.c
index f7f3806d1d..535450f52c 100644
--- a/sysdeps/x86_64/multiarch/ifunc-impl-list.c
+++ b/sysdeps/x86_64/multiarch/ifunc-impl-list.c
@@ -42,13 +42,11 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
   IFUNC_IMPL (i, name, __memcmpeq,
 	      IFUNC_IMPL_ADD (array, i, __memcmpeq,
 			      (CPU_FEATURE_USABLE (AVX2)
-                   && CPU_FEATURE_USABLE (MOVBE)
 			       && CPU_FEATURE_USABLE (BMI2)),
 			      __memcmpeq_avx2)
 	      IFUNC_IMPL_ADD (array, i, __memcmpeq,
 			      (CPU_FEATURE_USABLE (AVX2)
 			       && CPU_FEATURE_USABLE (BMI2)
-                   && CPU_FEATURE_USABLE (MOVBE)
 			       && CPU_FEATURE_USABLE (RTM)),
 			      __memcmpeq_avx2_rtm)
 	      IFUNC_IMPL_ADD (array, i, __memcmpeq,
diff --git a/sysdeps/x86_64/multiarch/ifunc-memcmpeq.h b/sysdeps/x86_64/multiarch/ifunc-memcmpeq.h
index 3319a9568a..e596c5048b 100644
--- a/sysdeps/x86_64/multiarch/ifunc-memcmpeq.h
+++ b/sysdeps/x86_64/multiarch/ifunc-memcmpeq.h
@@ -31,10 +31,10 @@ IFUNC_SELECTOR (void)
 
   if (CPU_FEATURE_USABLE_P (cpu_features, AVX2)
       && CPU_FEATURE_USABLE_P (cpu_features, BMI2)
-      && CPU_FEATURE_USABLE_P (cpu_features, MOVBE)
       && CPU_FEATURES_ARCH_P (cpu_features, AVX_Fast_Unaligned_Load))
     {
       if (CPU_FEATURE_USABLE_P (cpu_features, AVX512VL)
+	  && CPU_FEATURE_USABLE_P (cpu_features, MOVBE)
 	  && CPU_FEATURE_USABLE_P (cpu_features, AVX512BW))
 	return OPTIMIZE1 (evex);
 
diff --git a/sysdeps/x86_64/multiarch/memcmpeq-avx2-rtm.S b/sysdeps/x86_64/multiarch/memcmpeq-avx2-rtm.S
index 24b6a0c9ff..3264a4a76c 100644
--- a/sysdeps/x86_64/multiarch/memcmpeq-avx2-rtm.S
+++ b/sysdeps/x86_64/multiarch/memcmpeq-avx2-rtm.S
@@ -1,5 +1,5 @@
-#ifndef MEMCMP
-# define MEMCMP __memcmpeq_avx2_rtm
+#ifndef MEMCMPEQ
+# define MEMCMPEQ __memcmpeq_avx2_rtm
 #endif
 
 #define ZERO_UPPER_VEC_REGISTERS_RETURN \
diff --git a/sysdeps/x86_64/multiarch/memcmpeq-avx2.S b/sysdeps/x86_64/multiarch/memcmpeq-avx2.S
index 0181ea0d8d..0bf59fb8fa 100644
--- a/sysdeps/x86_64/multiarch/memcmpeq-avx2.S
+++ b/sysdeps/x86_64/multiarch/memcmpeq-avx2.S
@@ -16,8 +16,309 @@
    License along with the GNU C Library; if not, see
    <https://www.gnu.org/licenses/>.  */
 
-#ifndef MEMCMP
-# define MEMCMP	__memcmpeq_avx2
-#endif
+#if IS_IN (libc)
+
+/* __memcmpeq is implemented as:
+   1. Use ymm vector compares when possible. The only case where
+      vector compares are not possible is when size < VEC_SIZE
+      and loading from either s1 or s2 would cause a page cross.
+   2. Use xmm vector compare when size >= 8 bytes.
+   3. Optimistically compare up to the first 4 * VEC_SIZE one VEC at a
+      time to check for early mismatches. Only do this if it is
+      guaranteed the work is not wasted.
+   4. If size is 8 * VEC_SIZE or less, unroll the loop.
+   5. Compare 4 * VEC_SIZE at a time with the aligned first memory
+      area.
+   6. Use 2 vector compares when size is 2 * VEC_SIZE or less.
+   7. Use 4 vector compares when size is 4 * VEC_SIZE or less.
+   8. Use 8 vector compares when size is 8 * VEC_SIZE or less.  */
+
+# include <sysdep.h>
+
+# ifndef MEMCMPEQ
+#  define MEMCMPEQ	__memcmpeq_avx2
+# endif
+
+# define VPCMPEQ	vpcmpeqb
+
+# ifndef VZEROUPPER
+#  define VZEROUPPER	vzeroupper
+# endif
+
+# ifndef SECTION
+#  define SECTION(p)	p##.avx
+# endif
+
+# define VEC_SIZE 32
+# define PAGE_SIZE	4096
+
+	.section SECTION(.text), "ax", @progbits
+ENTRY_P2ALIGN (MEMCMPEQ, 6)
+# ifdef __ILP32__
+	/* Clear the upper 32 bits.  */
+	movl	%edx, %edx
+# endif
+	cmp	$VEC_SIZE, %RDX_LP
+	jb	L(less_vec)
+
+	/* From VEC to 2 * VEC.  No branch when size == VEC_SIZE.  */
+	vmovdqu	(%rsi), %ymm1
+	VPCMPEQ	(%rdi), %ymm1, %ymm1
+	vpmovmskb %ymm1, %eax
+	incl	%eax
+	jnz	L(return_neq0)
+	cmpq	$(VEC_SIZE * 2), %rdx
+	jbe	L(last_1x_vec)
+
+	/* Check second VEC no matter what.  */
+	vmovdqu	VEC_SIZE(%rsi), %ymm2
+	VPCMPEQ	VEC_SIZE(%rdi), %ymm2, %ymm2
+	vpmovmskb %ymm2, %eax
+	/* If all 4 VEC were equal eax will be all 1s so incl will overflow
+	   and set the zero flag.  */
+	incl	%eax
+	jnz	L(return_neq0)
+
+	/* Less than 4 * VEC.  */
+	cmpq	$(VEC_SIZE * 4), %rdx
+	jbe	L(last_2x_vec)
+
+	/* Check third and fourth VEC no matter what.  */
+	vmovdqu	(VEC_SIZE * 2)(%rsi), %ymm3
+	VPCMPEQ	(VEC_SIZE * 2)(%rdi), %ymm3, %ymm3
+	vpmovmskb %ymm3, %eax
+	incl	%eax
+	jnz	L(return_neq0)
+
+	vmovdqu	(VEC_SIZE * 3)(%rsi), %ymm4
+	VPCMPEQ	(VEC_SIZE * 3)(%rdi), %ymm4, %ymm4
+	vpmovmskb %ymm4, %eax
+	incl	%eax
+	jnz	L(return_neq0)
+
+	/* Go to 4x VEC loop.  */
+	cmpq	$(VEC_SIZE * 8), %rdx
+	ja	L(more_8x_vec)
+
+	/* Handle remainder of size = 4 * VEC + 1 to 8 * VEC without any
+	   branches.  */
+
+	/* Adjust rsi and rdi to avoid indexed address mode. This ends up
+	   saving 16 bytes of code, prevents unlamination, and avoids
+	   bottlenecks in the AGU.  */
+	addq	%rdx, %rsi
+	vmovdqu	-(VEC_SIZE * 4)(%rsi), %ymm1
+	vmovdqu	-(VEC_SIZE * 3)(%rsi), %ymm2
+	addq	%rdx, %rdi
+
+	VPCMPEQ	-(VEC_SIZE * 4)(%rdi), %ymm1, %ymm1
+	VPCMPEQ	-(VEC_SIZE * 3)(%rdi), %ymm2, %ymm2
+
+	vmovdqu	-(VEC_SIZE * 2)(%rsi), %ymm3
+	VPCMPEQ	-(VEC_SIZE * 2)(%rdi), %ymm3, %ymm3
+	vmovdqu	-VEC_SIZE(%rsi), %ymm4
+	VPCMPEQ	-VEC_SIZE(%rdi), %ymm4, %ymm4
+
+	/* Reduce VEC0 - VEC4.  */
+	vpand	%ymm1, %ymm2, %ymm2
+	vpand	%ymm3, %ymm4, %ymm4
+	vpand	%ymm2, %ymm4, %ymm4
+	vpmovmskb %ymm4, %eax
+	incl	%eax
+L(return_neq0):
+L(return_vzeroupper):
+	ZERO_UPPER_VEC_REGISTERS_RETURN
 
-#include "memcmp-avx2-movbe.S"
+	/* NB: p2align 5 here will ensure the L(loop_4x_vec) is also 32 byte
+	   aligned.  */
+	.p2align 5
+L(less_vec):
+	/* Check if size is one or less. This is necessary for size = 0 but
+	   is also faster for size = 1.  */
+	cmpl	$1, %edx
+	jbe	L(one_or_less)
+
+	/* Check if loading one VEC from either s1 or s2 could cause a page
+	   cross. This can have false positives but is by far the fastest
+	   method.  */
+	movl	%edi, %eax
+	orl	%esi, %eax
+	andl	$(PAGE_SIZE - 1), %eax
+	cmpl	$(PAGE_SIZE - VEC_SIZE), %eax
+	jg	L(page_cross_less_vec)
+
+	/* No page cross possible.  */
+	vmovdqu	(%rsi), %ymm2
+	VPCMPEQ	(%rdi), %ymm2, %ymm2
+	vpmovmskb %ymm2, %eax
+	incl	%eax
+	/* Result will be zero if s1 and s2 match. Otherwise first set bit
+	   will be first mismatch.  */
+	bzhil	%edx, %eax, %eax
+	VZEROUPPER_RETURN
+
+	/* Relatively cold but placing close to L(less_vec) for 2 byte jump
+	   encoding.  */
+	.p2align 4
+L(one_or_less):
+	jb	L(zero)
+	movzbl	(%rsi), %ecx
+	movzbl	(%rdi), %eax
+	subl	%ecx, %eax
+	/* No ymm register was touched.  */
+	ret
+	/* Within the same 16 byte block is L(one_or_less).  */
+L(zero):
+	xorl	%eax, %eax
+	ret
+
+	.p2align 4
+L(last_1x_vec):
+	vmovdqu	-(VEC_SIZE * 1)(%rsi, %rdx), %ymm1
+	VPCMPEQ	-(VEC_SIZE * 1)(%rdi, %rdx), %ymm1, %ymm1
+	vpmovmskb %ymm1, %eax
+	incl	%eax
+	VZEROUPPER_RETURN
+
+	.p2align 4
+L(last_2x_vec):
+	vmovdqu	-(VEC_SIZE * 2)(%rsi, %rdx), %ymm1
+	VPCMPEQ	-(VEC_SIZE * 2)(%rdi, %rdx), %ymm1, %ymm1
+	vmovdqu	-(VEC_SIZE * 1)(%rsi, %rdx), %ymm2
+	VPCMPEQ	-(VEC_SIZE * 1)(%rdi, %rdx), %ymm2, %ymm2
+	vpand	%ymm1, %ymm2, %ymm2
+	vpmovmskb %ymm2, %eax
+	incl	%eax
+	VZEROUPPER_RETURN
+
+	.p2align 4
+L(more_8x_vec):
+	/* Set end of s1 in rdx.  */
+	leaq	-(VEC_SIZE * 4)(%rdi, %rdx), %rdx
+	/* rsi stores s2 - s1. This allows loop to only update one pointer.
+	 */
+	subq	%rdi, %rsi
+	/* Align s1 pointer.  */
+	andq	$-VEC_SIZE, %rdi
+	/* Adjust because the first 4x vec were checked already.  */
+	subq	$-(VEC_SIZE * 4), %rdi
+	.p2align 4
+L(loop_4x_vec):
+	/* rsi has s2 - s1 so get correct address by adding s1 (in rdi).  */
+	vmovdqu	(%rsi, %rdi), %ymm1
+	VPCMPEQ	(%rdi), %ymm1, %ymm1
+
+	vmovdqu	VEC_SIZE(%rsi, %rdi), %ymm2
+	VPCMPEQ	VEC_SIZE(%rdi), %ymm2, %ymm2
+
+	vmovdqu	(VEC_SIZE * 2)(%rsi, %rdi), %ymm3
+	VPCMPEQ	(VEC_SIZE * 2)(%rdi), %ymm3, %ymm3
+
+	vmovdqu	(VEC_SIZE * 3)(%rsi, %rdi), %ymm4
+	VPCMPEQ	(VEC_SIZE * 3)(%rdi), %ymm4, %ymm4
+
+	vpand	%ymm1, %ymm2, %ymm2
+	vpand	%ymm3, %ymm4, %ymm4
+	vpand	%ymm2, %ymm4, %ymm4
+	vpmovmskb %ymm4, %eax
+	incl	%eax
+	jnz	L(return_neq1)
+	subq	$-(VEC_SIZE * 4), %rdi
+	/* Check if s1 pointer at end.  */
+	cmpq	%rdx, %rdi
+	jb	L(loop_4x_vec)
+
+	vmovdqu	(VEC_SIZE * 3)(%rsi, %rdx), %ymm4
+	VPCMPEQ	(VEC_SIZE * 3)(%rdx), %ymm4, %ymm4
+	subq	%rdx, %rdi
+	/* rdi has 4 * VEC_SIZE - remaining length.  */
+	cmpl	$(VEC_SIZE * 3), %edi
+	jae	L(8x_last_1x_vec)
+	/* Load regardless of branch.  */
+	vmovdqu	(VEC_SIZE * 2)(%rsi, %rdx), %ymm3
+	VPCMPEQ	(VEC_SIZE * 2)(%rdx), %ymm3, %ymm3
+	cmpl	$(VEC_SIZE * 2), %edi
+	jae	L(8x_last_2x_vec)
+	/* Check last 4 VEC.  */
+	vmovdqu	VEC_SIZE(%rsi, %rdx), %ymm1
+	VPCMPEQ	VEC_SIZE(%rdx), %ymm1, %ymm1
+
+	vmovdqu	(%rsi, %rdx), %ymm2
+	VPCMPEQ	(%rdx), %ymm2, %ymm2
+
+	vpand	%ymm3, %ymm4, %ymm4
+	vpand	%ymm1, %ymm2, %ymm3
+L(8x_last_2x_vec):
+	vpand	%ymm3, %ymm4, %ymm4
+L(8x_last_1x_vec):
+	vpmovmskb %ymm4, %eax
+	/* Restore s1 pointer to rdi.  */
+	incl	%eax
+L(return_neq1):
+	VZEROUPPER_RETURN
+
+	/* Relatively cold case as page crosses are unexpected.  */
+	.p2align 4
+L(page_cross_less_vec):
+	cmpl	$16, %edx
+	jae	L(between_16_31)
+	cmpl	$8, %edx
+	ja	L(between_9_15)
+	cmpl	$4, %edx
+	jb	L(between_2_3)
+	/* From 4 to 8 bytes.  No branch when size == 4.  */
+	movl	(%rdi), %eax
+	subl	(%rsi), %eax
+	movl	-4(%rdi, %rdx), %ecx
+	movl	-4(%rsi, %rdx), %edi
+	subl	%edi, %ecx
+	orl	%ecx, %eax
+	ret
+
+	.p2align 4,, 8
+L(between_16_31):
+	/* From 16 to 31 bytes.  No branch when size == 16.  */
+
+	/* Safe to use xmm[0, 15] as no vzeroupper is needed so RTM safe.
+	 */
+	vmovdqu	(%rsi), %xmm1
+	vpcmpeqb (%rdi), %xmm1, %xmm1
+	vmovdqu	-16(%rsi, %rdx), %xmm2
+	vpcmpeqb -16(%rdi, %rdx), %xmm2, %xmm2
+	vpand	%xmm1, %xmm2, %xmm2
+	vpmovmskb %xmm2, %eax
+	notw	%ax
+	/* No ymm register was touched.  */
+	ret
+
+	.p2align 4,, 8
+L(between_9_15):
+	/* From 9 to 15 bytes.  */
+	movq	(%rdi), %rax
+	subq	(%rsi), %rax
+	movq	-8(%rdi, %rdx), %rcx
+	movq	-8(%rsi, %rdx), %rdi
+	subq	%rdi, %rcx
+	orq	%rcx, %rax
+	/* edx is guaranteed to be a non-zero int.  */
+	cmovnz	%edx, %eax
+	ret
+
+	/* Don't align. This is cold and aligning here will cause code
+	   to spill into next cache line.  */
+L(between_2_3):
+	/* From 2 to 3 bytes.  No branch when size == 2.  */
+	movzwl	(%rdi), %eax
+	movzwl	(%rsi), %ecx
+	subl	%ecx, %eax
+	movzbl	-1(%rdi, %rdx), %ecx
+	/* All machines that support evex will insert a "merging uop"
+	   avoiding any serious partial register stalls.  */
+	subb	-1(%rsi, %rdx), %cl
+	orl	%ecx, %eax
+	/* No ymm register was touched.  */
+	ret
+
+    /* 2 Bytes from next cache line. */
+END (MEMCMPEQ)
+#endif
-- 
2.25.1


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v1 6/6] x86_64: Add evex optimized __memcmpeq in memcmpeq-evex.S
  2021-10-27  2:43 [PATCH v1 1/6] String: Add __memcmpeq as build target Noah Goldstein
                   ` (3 preceding siblings ...)
  2021-10-27  2:43 ` [PATCH v1 5/6] x86_64: Add avx2 optimized __memcmpeq in memcmpeq-avx2.S Noah Goldstein
@ 2021-10-27  2:43 ` Noah Goldstein
  2021-10-27  2:44   ` Noah Goldstein
  2021-10-27 12:49   ` H.J. Lu
  2021-10-27 12:42 ` [PATCH v1 1/6] String: Add __memcmpeq as build target H.J. Lu
  2021-10-28 17:57 ` Joseph Myers
  6 siblings, 2 replies; 24+ messages in thread
From: Noah Goldstein @ 2021-10-27  2:43 UTC (permalink / raw)
  To: libc-alpha

No bug. This commit adds a new optimized __memcmpeq implementation
for evex.

The primary optimizations are:

1) skipping the logic to find the difference of the first mismatched
byte.

2) not updating src/dst addresses as the non-equals logic does not
need to be reused by different areas.
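
The evex version can drop the movemask/incl trick: vpcmpub with the
not-equal predicate (or vpxorq plus a vpternlogd OR-reduction) leaves
a mask or vector that is zero exactly when the blocks match, and
vptestmb turns it into the return value. A rough intrinsics sketch of
the reduction used in the unrolled paths (illustration only; needs
AVX512VL and AVX512BW, and the 0xfe immediate is a three-way OR):

    #include <immintrin.h>

    /* Sketch: xor three 32-byte block pairs, OR the xor results
       together with one vpternlogd, and test the accumulator with
       vptestmb.  A nonzero mask means some byte differed.  */
    static int
    blocks96_neq (const unsigned char *s1, const unsigned char *s2)
    {
      __m256i x0 = _mm256_xor_si256 (
          _mm256_loadu_si256 ((const __m256i *) (s1 + 0)),
          _mm256_loadu_si256 ((const __m256i *) (s2 + 0)));
      __m256i x1 = _mm256_xor_si256 (
          _mm256_loadu_si256 ((const __m256i *) (s1 + 32)),
          _mm256_loadu_si256 ((const __m256i *) (s2 + 32)));
      __m256i x2 = _mm256_xor_si256 (
          _mm256_loadu_si256 ((const __m256i *) (s1 + 64)),
          _mm256_loadu_si256 ((const __m256i *) (s2 + 64)));
      __m256i acc = _mm256_ternarylogic_epi32 (x0, x1, x2, 0xfe);
      return _mm256_test_epi8_mask (acc, acc) != 0;
    }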
---
 sysdeps/x86_64/multiarch/ifunc-impl-list.c |   1 -
 sysdeps/x86_64/multiarch/ifunc-memcmpeq.h  |   1 -
 sysdeps/x86_64/multiarch/memcmpeq-evex.S   | 308 ++++++++++++++++++++-
 3 files changed, 304 insertions(+), 6 deletions(-)

diff --git a/sysdeps/x86_64/multiarch/ifunc-impl-list.c b/sysdeps/x86_64/multiarch/ifunc-impl-list.c
index 535450f52c..ea8df9f9b9 100644
--- a/sysdeps/x86_64/multiarch/ifunc-impl-list.c
+++ b/sysdeps/x86_64/multiarch/ifunc-impl-list.c
@@ -52,7 +52,6 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
 	      IFUNC_IMPL_ADD (array, i, __memcmpeq,
 			      (CPU_FEATURE_USABLE (AVX512VL)
 			       && CPU_FEATURE_USABLE (AVX512BW)
-                   && CPU_FEATURE_USABLE (MOVBE)
 			       && CPU_FEATURE_USABLE (BMI2)),
 			      __memcmpeq_evex)
 	      IFUNC_IMPL_ADD (array, i, __memcmpeq, 1, __memcmpeq_sse2))
diff --git a/sysdeps/x86_64/multiarch/ifunc-memcmpeq.h b/sysdeps/x86_64/multiarch/ifunc-memcmpeq.h
index e596c5048b..2ea38adf05 100644
--- a/sysdeps/x86_64/multiarch/ifunc-memcmpeq.h
+++ b/sysdeps/x86_64/multiarch/ifunc-memcmpeq.h
@@ -34,7 +34,6 @@ IFUNC_SELECTOR (void)
       && CPU_FEATURES_ARCH_P (cpu_features, AVX_Fast_Unaligned_Load))
     {
       if (CPU_FEATURE_USABLE_P (cpu_features, AVX512VL)
-	  && CPU_FEATURE_USABLE_P (cpu_features, MOVBE)
 	  && CPU_FEATURE_USABLE_P (cpu_features, AVX512BW))
 	return OPTIMIZE1 (evex);
 
diff --git a/sysdeps/x86_64/multiarch/memcmpeq-evex.S b/sysdeps/x86_64/multiarch/memcmpeq-evex.S
index 951e1e9560..f27e732036 100644
--- a/sysdeps/x86_64/multiarch/memcmpeq-evex.S
+++ b/sysdeps/x86_64/multiarch/memcmpeq-evex.S
@@ -16,8 +16,308 @@
    License along with the GNU C Library; if not, see
    <https://www.gnu.org/licenses/>.  */
 
-#ifndef MEMCMP
-# define MEMCMP	__memcmpeq_evex
-#endif
+#if IS_IN (libc)
+
+/* __memcmpeq is implemented as:
+   1. Use ymm vector compares when possible. The only case where
+      vector compares are not possible is when size < VEC_SIZE
+      and loading from either s1 or s2 would cause a page cross.
+   2. Use xmm vector compare when size >= 8 bytes.
+   3. Optimistically compare up to the first 4 * VEC_SIZE one VEC at a
+      time to check for early mismatches. Only do this if it is
+      guaranteed the work is not wasted.
+   4. If size is 8 * VEC_SIZE or less, unroll the loop.
+   5. Compare 4 * VEC_SIZE at a time with the aligned first memory
+      area.
+   6. Use 2 vector compares when size is 2 * VEC_SIZE or less.
+   7. Use 4 vector compares when size is 4 * VEC_SIZE or less.
+   8. Use 8 vector compares when size is 8 * VEC_SIZE or less.  */
+
+# include <sysdep.h>
+
+# ifndef MEMCMPEQ
+#  define MEMCMPEQ	__memcmpeq_evex
+# endif
+
+# define VMOVU	vmovdqu64
+# define VPCMP	vpcmpub
+# define VPTEST	vptestmb
+
+# define VEC_SIZE	32
+# define PAGE_SIZE	4096
+
+# define YMM0		ymm16
+# define YMM1		ymm17
+# define YMM2		ymm18
+# define YMM3		ymm19
+# define YMM4		ymm20
+# define YMM5		ymm21
+# define YMM6		ymm22
+
+
+	.section .text.evex, "ax", @progbits
+ENTRY_P2ALIGN (MEMCMPEQ, 6)
+# ifdef __ILP32__
+	/* Clear the upper 32 bits.  */
+	movl	%edx, %edx
+# endif
+	cmp	$VEC_SIZE, %RDX_LP
+	jb	L(less_vec)
+
+	/* From VEC to 2 * VEC.  No branch when size == VEC_SIZE.  */
+	VMOVU	(%rsi), %YMM1
+	/* Use compare not equals to directly check for mismatch.  */
+	VPCMP	$4, (%rdi), %YMM1, %k1
+	kmovd	%k1, %eax
+	testl	%eax, %eax
+	jnz	L(return_neq0)
+
+	cmpq	$(VEC_SIZE * 2), %rdx
+	jbe	L(last_1x_vec)
+
+	/* Check second VEC no matter what.  */
+	VMOVU	VEC_SIZE(%rsi), %YMM2
+	VPCMP	$4, VEC_SIZE(%rdi), %YMM2, %k1
+	kmovd	%k1, %eax
+	testl	%eax, %eax
+	jnz	L(return_neq0)
+
+	/* Less than 4 * VEC.  */
+	cmpq	$(VEC_SIZE * 4), %rdx
+	jbe	L(last_2x_vec)
+
+	/* Check third and fourth VEC no matter what.  */
+	VMOVU	(VEC_SIZE * 2)(%rsi), %YMM3
+	VPCMP	$4, (VEC_SIZE * 2)(%rdi), %YMM3, %k1
+	kmovd	%k1, %eax
+	testl	%eax, %eax
+	jnz	L(return_neq0)
+
+	VMOVU	(VEC_SIZE * 3)(%rsi), %YMM4
+	VPCMP	$4, (VEC_SIZE * 3)(%rdi), %YMM4, %k1
+	kmovd	%k1, %eax
+	testl	%eax, %eax
+	jnz	L(return_neq0)
+
+	/* Go to 4x VEC loop.  */
+	cmpq	$(VEC_SIZE * 8), %rdx
+	ja	L(more_8x_vec)
+
+	/* Handle remainder of size = 4 * VEC + 1 to 8 * VEC without any
+	   branches.  */
+
+	VMOVU	-(VEC_SIZE * 4)(%rsi, %rdx), %YMM1
+	VMOVU	-(VEC_SIZE * 3)(%rsi, %rdx), %YMM2
+	addq	%rdx, %rdi
+
+	/* Wait to load from s1 until the address is adjusted, due to
+	   unlamination.  */
+
+	/* vpxor will be all 0s if s1 and s2 are equal. Otherwise it
+	   will have some 1s.  */
+	vpxorq	-(VEC_SIZE * 4)(%rdi), %YMM1, %YMM1
+	/* Ternary logic to xor -(VEC_SIZE * 3)(%rdi) with YMM2 while
+	   oring with YMM1. Result is stored in YMM2.  */
+	vpternlogd $0xde, -(VEC_SIZE * 3)(%rdi), %YMM1, %YMM2
+
+	VMOVU	-(VEC_SIZE * 2)(%rsi, %rdx), %YMM3
+	vpxorq	-(VEC_SIZE * 2)(%rdi), %YMM3, %YMM3
+	/* Or together YMM1, YMM2, and YMM3 into YMM3.  */
+	VMOVU	-(VEC_SIZE)(%rsi, %rdx), %YMM4
+	vpxorq	-(VEC_SIZE)(%rdi), %YMM4, %YMM4
+
+	/* Or together YMM2, YMM3, and YMM4 into YMM4.  */
+	vpternlogd $0xfe, %YMM2, %YMM3, %YMM4
 
-#include "memcmp-evex-movbe.S"
+	/* Compare YMM4 with 0. If any 1s s1 and s2 don't match.  */
+	VPTEST	%YMM4, %YMM4, %k1
+	kmovd	%k1, %eax
+L(return_neq0):
+	ret
+
+	/* Fits in padding needed to .p2align 5 L(less_vec).  */
+L(last_1x_vec):
+	VMOVU	-(VEC_SIZE * 1)(%rsi, %rdx), %YMM1
+	VPCMP	$4, -(VEC_SIZE * 1)(%rdi, %rdx), %YMM1, %k1
+	kmovd	%k1, %eax
+	ret
+
+	/* NB: p2align 5 here will ensure the L(loop_4x_vec) is also 32
+	   byte aligned.  */
+	.p2align 5
+L(less_vec):
+	/* Check if size is one or less. This is necessary for size = 0 but
+	   is also faster for size = 1.  */
+	cmpl	$1, %edx
+	jbe	L(one_or_less)
+
+	/* Check if loading one VEC from either s1 or s2 could cause a
+	   page cross. This can have false positives but is by far the
+	   fastest method.  */
+	movl	%edi, %eax
+	orl	%esi, %eax
+	andl	$(PAGE_SIZE - 1), %eax
+	cmpl	$(PAGE_SIZE - VEC_SIZE), %eax
+	jg	L(page_cross_less_vec)
+
+	/* No page cross possible.  */
+	VMOVU	(%rsi), %YMM2
+	VPCMP	$4, (%rdi), %YMM2, %k1
+	kmovd	%k1, %eax
+	/* Result will be zero if s1 and s2 match. Otherwise first set
+	   bit will be first mismatch.  */
+	bzhil	%edx, %eax, %eax
+	ret
+
+	/* Relatively cold but placing close to L(less_vec) for 2 byte
+	   jump encoding.  */
+	.p2align 4
+L(one_or_less):
+	jb	L(zero)
+	movzbl	(%rsi), %ecx
+	movzbl	(%rdi), %eax
+	subl	%ecx, %eax
+	/* No ymm register was touched.  */
+	ret
+	/* Within the same 16 byte block is L(one_or_less).  */
+L(zero):
+	xorl	%eax, %eax
+	ret
+
+	.p2align 4
+L(last_2x_vec):
+	VMOVU	-(VEC_SIZE * 2)(%rsi, %rdx), %YMM1
+	vpxorq	-(VEC_SIZE * 2)(%rdi, %rdx), %YMM1, %YMM1
+	VMOVU	-(VEC_SIZE * 1)(%rsi, %rdx), %YMM2
+	vpternlogd $0xde, -(VEC_SIZE * 1)(%rdi, %rdx), %YMM1, %YMM2
+	VPTEST	%YMM2, %YMM2, %k1
+	kmovd	%k1, %eax
+	ret
+
+	.p2align 4
+L(more_8x_vec):
+	/* Set end of s1 in rdx.  */
+	leaq	-(VEC_SIZE * 4)(%rdi, %rdx), %rdx
+	/* rsi stores s2 - s1. This allows loop to only update one
+	   pointer.  */
+	subq	%rdi, %rsi
+	/* Align s1 pointer.  */
+	andq	$-VEC_SIZE, %rdi
+	/* Adjust because the first 4x vec were checked already.  */
+	subq	$-(VEC_SIZE * 4), %rdi
+	.p2align 4
+L(loop_4x_vec):
+	VMOVU	(%rsi, %rdi), %YMM1
+	vpxorq	(%rdi), %YMM1, %YMM1
+
+	VMOVU	VEC_SIZE(%rsi, %rdi), %YMM2
+	vpternlogd $0xde, (VEC_SIZE)(%rdi), %YMM1, %YMM2
+
+	VMOVU	(VEC_SIZE * 2)(%rsi, %rdi), %YMM3
+	vpxorq	(VEC_SIZE * 2)(%rdi), %YMM3, %YMM3
+
+	VMOVU	(VEC_SIZE * 3)(%rsi, %rdi), %YMM4
+	vpxorq	(VEC_SIZE * 3)(%rdi), %YMM4, %YMM4
+
+	vpternlogd $0xfe, %YMM2, %YMM3, %YMM4
+	VPTEST	%YMM4, %YMM4, %k1
+	kmovd	%k1, %eax
+	testl	%eax, %eax
+	jnz	L(return_neq2)
+	subq	$-(VEC_SIZE * 4), %rdi
+	cmpq	%rdx, %rdi
+	jb	L(loop_4x_vec)
+
+	subq	%rdx, %rdi
+	VMOVU	(VEC_SIZE * 3)(%rsi, %rdx), %YMM4
+	vpxorq	(VEC_SIZE * 3)(%rdx), %YMM4, %YMM4
+	/* rdi has 4 * VEC_SIZE - remaining length.  */
+	cmpl	$(VEC_SIZE * 3), %edi
+	jae	L(8x_last_1x_vec)
+	/* Load regardless of branch.  */
+	VMOVU	(VEC_SIZE * 2)(%rsi, %rdx), %YMM3
+	/* Ternary logic to xor (VEC_SIZE * 2)(%rdx) with YMM3 while
+	   oring with YMM4. Result is stored in YMM4.  */
+	vpternlogd $0xf6, (VEC_SIZE * 2)(%rdx), %YMM3, %YMM4
+	cmpl	$(VEC_SIZE * 2), %edi
+	jae	L(8x_last_2x_vec)
+
+	VMOVU	VEC_SIZE(%rsi, %rdx), %YMM2
+	vpxorq	VEC_SIZE(%rdx), %YMM2, %YMM2
+
+	VMOVU	(%rsi, %rdx), %YMM1
+	vpxorq	(%rdx), %YMM1, %YMM1
+
+	vpternlogd $0xfe, %YMM1, %YMM2, %YMM4
+L(8x_last_1x_vec):
+L(8x_last_2x_vec):
+	VPTEST	%YMM4, %YMM4, %k1
+	kmovd	%k1, %eax
+L(return_neq2):
+	ret
+
+	/* Relatively cold case as page crosses are unexpected.  */
+	.p2align 4
+L(page_cross_less_vec):
+	cmpl	$16, %edx
+	jae	L(between_16_31)
+	cmpl	$8, %edx
+	ja	L(between_9_15)
+	cmpl	$4, %edx
+	jb	L(between_2_3)
+	/* From 4 to 8 bytes.  No branch when size == 4.  */
+	movl	(%rdi), %eax
+	subl	(%rsi), %eax
+	movl	-4(%rdi, %rdx), %ecx
+	movl	-4(%rsi, %rdx), %edi
+	subl	%edi, %ecx
+	orl	%ecx, %eax
+	ret
+
+	.p2align 4,, 8
+L(between_16_31):
+	/* From 16 to 31 bytes.  No branch when size == 16.  */
+
+	/* Safe to use xmm[0, 15] as no vzeroupper is needed so RTM safe.
+	 */
+	vmovdqu	(%rsi), %xmm1
+	vpcmpeqb (%rdi), %xmm1, %xmm1
+	vmovdqu	-16(%rsi, %rdx), %xmm2
+	vpcmpeqb -16(%rdi, %rdx), %xmm2, %xmm2
+	vpand	%xmm1, %xmm2, %xmm2
+	vpmovmskb %xmm2, %eax
+	notw	%ax
+	/* No ymm register was touched.  */
+	ret
+
+	.p2align 4,, 8
+L(between_9_15):
+	/* From 9 to 15 bytes.  */
+	movq	(%rdi), %rax
+	subq	(%rsi), %rax
+	movq	-8(%rdi, %rdx), %rcx
+	movq	-8(%rsi, %rdx), %rdi
+	subq	%rdi, %rcx
+	orq	%rcx, %rax
+	/* edx is guaranteed to be a non-zero int.  */
+	cmovnz	%edx, %eax
+	ret
+
+	/* Don't align. This is cold and aligning here will cause code
+	   to spill into next cache line.  */
+L(between_2_3):
+	/* From 2 to 3 bytes.  No branch when size == 2.  */
+	movzwl	(%rdi), %eax
+	movzwl	(%rsi), %ecx
+	subl	%ecx, %eax
+	movzbl	-1(%rdi, %rdx), %ecx
+	/* All machines that support evex will insert a "merging uop"
+	   avoiding any serious partial register stalls.  */
+	subb	-1(%rsi, %rdx), %cl
+	orl	%ecx, %eax
+	/* No ymm register was touched.  */
+	ret
+
+    /* 4 Bytes from next cache line. */
+END (MEMCMPEQ)
+#endif
-- 
2.25.1


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v1 6/6] x86_64: Add evex optimized __memcmpeq in memcmpeq-evex.S
  2021-10-27  2:43 ` [PATCH v1 6/6] x86_64: Add evex optimized __memcmpeq in memcmpeq-evex.S Noah Goldstein
@ 2021-10-27  2:44   ` Noah Goldstein
  2021-10-27 12:49   ` H.J. Lu
  1 sibling, 0 replies; 24+ messages in thread
From: Noah Goldstein @ 2021-10-27  2:44 UTC (permalink / raw)
  To: GNU C Library

[-- Attachment #1: Type: text/plain, Size: 13943 bytes --]

On Tue, Oct 26, 2021 at 9:43 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> No bug. This commit adds new optimized __memcmpeq implementation for
> evex.
>
> The primary optimizations are:
>
> 1) skipping the logic to find the difference of the first mismatched
> byte.
>
> 2) not updating src/dst addresses as the non-equals logic does not
> need to be reused by different areas.
> ---
>  sysdeps/x86_64/multiarch/ifunc-impl-list.c |   1 -
>  sysdeps/x86_64/multiarch/ifunc-memcmpeq.h  |   1 -
>  sysdeps/x86_64/multiarch/memcmpeq-evex.S   | 308 ++++++++++++++++++++-
>  3 files changed, 304 insertions(+), 6 deletions(-)
>
> diff --git a/sysdeps/x86_64/multiarch/ifunc-impl-list.c b/sysdeps/x86_64/multiarch/ifunc-impl-list.c
> index 535450f52c..ea8df9f9b9 100644
> --- a/sysdeps/x86_64/multiarch/ifunc-impl-list.c
> +++ b/sysdeps/x86_64/multiarch/ifunc-impl-list.c
> @@ -52,7 +52,6 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
>               IFUNC_IMPL_ADD (array, i, __memcmpeq,
>                               (CPU_FEATURE_USABLE (AVX512VL)
>                                && CPU_FEATURE_USABLE (AVX512BW)
> -                   && CPU_FEATURE_USABLE (MOVBE)
>                                && CPU_FEATURE_USABLE (BMI2)),
>                               __memcmpeq_evex)
>               IFUNC_IMPL_ADD (array, i, __memcmpeq, 1, __memcmpeq_sse2))
> diff --git a/sysdeps/x86_64/multiarch/ifunc-memcmpeq.h b/sysdeps/x86_64/multiarch/ifunc-memcmpeq.h
> index e596c5048b..2ea38adf05 100644
> --- a/sysdeps/x86_64/multiarch/ifunc-memcmpeq.h
> +++ b/sysdeps/x86_64/multiarch/ifunc-memcmpeq.h
> @@ -34,7 +34,6 @@ IFUNC_SELECTOR (void)
>        && CPU_FEATURES_ARCH_P (cpu_features, AVX_Fast_Unaligned_Load))
>      {
>        if (CPU_FEATURE_USABLE_P (cpu_features, AVX512VL)
> -         && CPU_FEATURE_USABLE_P (cpu_features, MOVBE)
>           && CPU_FEATURE_USABLE_P (cpu_features, AVX512BW))
>         return OPTIMIZE1 (evex);
>
> diff --git a/sysdeps/x86_64/multiarch/memcmpeq-evex.S b/sysdeps/x86_64/multiarch/memcmpeq-evex.S
> index 951e1e9560..f27e732036 100644
> --- a/sysdeps/x86_64/multiarch/memcmpeq-evex.S
> +++ b/sysdeps/x86_64/multiarch/memcmpeq-evex.S
> @@ -16,8 +16,308 @@
>     License along with the GNU C Library; if not, see
>     <https://www.gnu.org/licenses/>.  */
>
> -#ifndef MEMCMP
> -# define MEMCMP        __memcmpeq_evex
> -#endif
> +#if IS_IN (libc)
> +
> +/* __memcmpeq is implemented as:
> +   1. Use ymm vector compares when possible. The only case where
> +      vector compares is not possible for when size < VEC_SIZE
> +      and loading from either s1 or s2 would cause a page cross.
> +   2. Use xmm vector compare when size >= 8 bytes.
> +   3. Optimistically compare up to first 4 * VEC_SIZE one at a
> +      to check for early mismatches. Only do this if its guranteed the
> +      work is not wasted.
> +   4. If size is 8 * VEC_SIZE or less, unroll the loop.
> +   5. Compare 4 * VEC_SIZE at a time with the aligned first memory
> +      area.
> +   6. Use 2 vector compares when size is 2 * VEC_SIZE or less.
> +   7. Use 4 vector compares when size is 4 * VEC_SIZE or less.
> +   8. Use 8 vector compares when size is 8 * VEC_SIZE or less.  */
> +
> +# include <sysdep.h>
> +
> +# ifndef MEMCMPEQ
> +#  define MEMCMPEQ     __memcmpeq_evex
> +# endif
> +
> +# define VMOVU vmovdqu64
> +# define VPCMP vpcmpub
> +# define VPTEST        vptestmb
> +
> +# define VEC_SIZE      32
> +# define PAGE_SIZE     4096
> +
> +# define YMM0          ymm16
> +# define YMM1          ymm17
> +# define YMM2          ymm18
> +# define YMM3          ymm19
> +# define YMM4          ymm20
> +# define YMM5          ymm21
> +# define YMM6          ymm22
> +
> +
> +       .section .text.evex, "ax", @progbits
> +ENTRY_P2ALIGN (MEMCMPEQ, 6)
> +# ifdef __ILP32__
> +       /* Clear the upper 32 bits.  */
> +       movl    %edx, %edx
> +# endif
> +       cmp     $VEC_SIZE, %RDX_LP
> +       jb      L(less_vec)
> +
> +       /* From VEC to 2 * VEC.  No branch when size == VEC_SIZE.  */
> +       VMOVU   (%rsi), %YMM1
> +       /* Use compare not equals to directly check for mismatch.  */
> +       VPCMP   $4, (%rdi), %YMM1, %k1
> +       kmovd   %k1, %eax
> +       testl   %eax, %eax
> +       jnz     L(return_neq0)
> +
> +       cmpq    $(VEC_SIZE * 2), %rdx
> +       jbe     L(last_1x_vec)
> +
> +       /* Check second VEC no matter what.  */
> +       VMOVU   VEC_SIZE(%rsi), %YMM2
> +       VPCMP   $4, VEC_SIZE(%rdi), %YMM2, %k1
> +       kmovd   %k1, %eax
> +       testl   %eax, %eax
> +       jnz     L(return_neq0)
> +
> +       /* Less than 4 * VEC.  */
> +       cmpq    $(VEC_SIZE * 4), %rdx
> +       jbe     L(last_2x_vec)
> +
> +       /* Check third and fourth VEC no matter what.  */
> +       VMOVU   (VEC_SIZE * 2)(%rsi), %YMM3
> +       VPCMP   $4, (VEC_SIZE * 2)(%rdi), %YMM3, %k1
> +       kmovd   %k1, %eax
> +       testl   %eax, %eax
> +       jnz     L(return_neq0)
> +
> +       VMOVU   (VEC_SIZE * 3)(%rsi), %YMM4
> +       VPCMP   $4, (VEC_SIZE * 3)(%rdi), %YMM4, %k1
> +       kmovd   %k1, %eax
> +       testl   %eax, %eax
> +       jnz     L(return_neq0)
> +
> +       /* Go to 4x VEC loop.  */
> +       cmpq    $(VEC_SIZE * 8), %rdx
> +       ja      L(more_8x_vec)
> +
> +       /* Handle remainder of size = 4 * VEC + 1 to 8 * VEC without any
> +          branches.  */
> +
> +       VMOVU   -(VEC_SIZE * 4)(%rsi, %rdx), %YMM1
> +       VMOVU   -(VEC_SIZE * 3)(%rsi, %rdx), %YMM2
> +       addq    %rdx, %rdi
> +
> +       /* Wait to load from s1 until addressed adjust due to
> +          unlamination.  */
> +
> +       /* vpxor will be all 0s if s1 and s2 are equal. Otherwise it
> +          will have some 1s.  */
> +       vpxorq  -(VEC_SIZE * 4)(%rdi), %YMM1, %YMM1
> +       /* Ternary logic to xor -(VEC_SIZE * 3)(%rdi) with YMM2 while
> +          oring with YMM1. Result is stored in YMM1.  */
> +       vpternlogd $0xde, -(VEC_SIZE * 3)(%rdi), %YMM1, %YMM2
> +
> +       VMOVU   -(VEC_SIZE * 2)(%rsi, %rdx), %YMM3
> +       vpxorq  -(VEC_SIZE * 2)(%rdi), %YMM3, %YMM3
> +       /* Or together YMM1, YMM2, and YMM3 into YMM3.  */
> +       VMOVU   -(VEC_SIZE)(%rsi, %rdx), %YMM4
> +       vpxorq  -(VEC_SIZE)(%rdi), %YMM4, %YMM4
> +
> +       /* Or together YMM2, YMM3, and YMM4 into YMM4.  */
> +       vpternlogd $0xfe, %YMM2, %YMM3, %YMM4
>
> -#include "memcmp-evex-movbe.S"
> +       /* Compare YMM4 with 0. If any 1s s1 and s2 don't match.  */
> +       VPTEST  %YMM4, %YMM4, %k1
> +       kmovd   %k1, %eax
> +L(return_neq0):
> +       ret
> +
> +       /* Fits in padding needed to .p2align 5 L(less_vec).  */
> +L(last_1x_vec):
> +       VMOVU   -(VEC_SIZE * 1)(%rsi, %rdx), %YMM1
> +       VPCMP   $4, -(VEC_SIZE * 1)(%rdi, %rdx), %YMM1, %k1
> +       kmovd   %k1, %eax
> +       ret
> +
> +       /* NB: p2align 5 here will ensure the L(loop_4x_vec) is also 32
> +          byte aligned.  */
> +       .p2align 5
> +L(less_vec):
> +       /* Check if one or less char. This is necessary for size = 0 but
> +          is also faster for size = 1.  */
> +       cmpl    $1, %edx
> +       jbe     L(one_or_less)
> +
> +       /* Check if loading one VEC from either s1 or s2 could cause a
> +          page cross. This can have false positives but is by far the
> +          fastest method.  */
> +       movl    %edi, %eax
> +       orl     %esi, %eax
> +       andl    $(PAGE_SIZE - 1), %eax
> +       cmpl    $(PAGE_SIZE - VEC_SIZE), %eax
> +       jg      L(page_cross_less_vec)
> +
> +       /* No page cross possible.  */
> +       VMOVU   (%rsi), %YMM2
> +       VPCMP   $4, (%rdi), %YMM2, %k1
> +       kmovd   %k1, %eax
> +       /* Result will be zero if s1 and s2 match. Otherwise first set
> +          bit will be first mismatch.  */
> +       bzhil   %edx, %eax, %eax
> +       ret
> +
> +       /* Relatively cold but placing close to L(less_vec) for 2 byte
> +          jump encoding.  */
> +       .p2align 4
> +L(one_or_less):
> +       jb      L(zero)
> +       movzbl  (%rsi), %ecx
> +       movzbl  (%rdi), %eax
> +       subl    %ecx, %eax
> +       /* No ymm register was touched.  */
> +       ret
> +       /* Within the same 16 byte block is L(one_or_less).  */
> +L(zero):
> +       xorl    %eax, %eax
> +       ret
> +
> +       .p2align 4
> +L(last_2x_vec):
> +       VMOVU   -(VEC_SIZE * 2)(%rsi, %rdx), %YMM1
> +       vpxorq  -(VEC_SIZE * 2)(%rdi, %rdx), %YMM1, %YMM1
> +       VMOVU   -(VEC_SIZE * 1)(%rsi, %rdx), %YMM2
> +       vpternlogd $0xde, -(VEC_SIZE * 1)(%rdi, %rdx), %YMM1, %YMM2
> +       VPTEST  %YMM2, %YMM2, %k1
> +       kmovd   %k1, %eax
> +       ret
> +
> +       .p2align 4
> +L(more_8x_vec):
> +       /* Set end of s1 in rdx.  */
> +       leaq    -(VEC_SIZE * 4)(%rdi, %rdx), %rdx
> +       /* rsi stores s2 - s1. This allows loop to only update one
> +          pointer.  */
> +       subq    %rdi, %rsi
> +       /* Align s1 pointer.  */
> +       andq    $-VEC_SIZE, %rdi
> +       /* Adjust because first 4x vec where check already.  */
> +       subq    $-(VEC_SIZE * 4), %rdi
> +       .p2align 4
> +L(loop_4x_vec):
> +       VMOVU   (%rsi, %rdi), %YMM1
> +       vpxorq  (%rdi), %YMM1, %YMM1
> +
> +       VMOVU   VEC_SIZE(%rsi, %rdi), %YMM2
> +       vpternlogd $0xde, (VEC_SIZE)(%rdi), %YMM1, %YMM2
> +
> +       VMOVU   (VEC_SIZE * 2)(%rsi, %rdi), %YMM3
> +       vpxorq  (VEC_SIZE * 2)(%rdi), %YMM3, %YMM3
> +
> +       VMOVU   (VEC_SIZE * 3)(%rsi, %rdi), %YMM4
> +       vpxorq  (VEC_SIZE * 3)(%rdi), %YMM4, %YMM4
> +
> +       vpternlogd $0xfe, %YMM2, %YMM3, %YMM4
> +       VPTEST  %YMM4, %YMM4, %k1
> +       kmovd   %k1, %eax
> +       testl   %eax, %eax
> +       jnz     L(return_neq2)
> +       subq    $-(VEC_SIZE * 4), %rdi
> +       cmpq    %rdx, %rdi
> +       jb      L(loop_4x_vec)
> +
> +       subq    %rdx, %rdi
> +       VMOVU   (VEC_SIZE * 3)(%rsi, %rdx), %YMM4
> +       vpxorq  (VEC_SIZE * 3)(%rdx), %YMM4, %YMM4
> +       /* rdi has 4 * VEC_SIZE - remaining length.  */
> +       cmpl    $(VEC_SIZE * 3), %edi
> +       jae     L(8x_last_1x_vec)
> +       /* Load regardless of branch.  */
> +       VMOVU   (VEC_SIZE * 2)(%rsi, %rdx), %YMM3
> +       /* Ternary logic to xor (VEC_SIZE * 2)(%rdx) with YMM3 while
> +          oring with YMM4. Result is stored in YMM4.  */
> +       vpternlogd $0xf6, (VEC_SIZE * 2)(%rdx), %YMM3, %YMM4
> +       cmpl    $(VEC_SIZE * 2), %edi
> +       jae     L(8x_last_2x_vec)
> +
> +       VMOVU   VEC_SIZE(%rsi, %rdx), %YMM2
> +       vpxorq  VEC_SIZE(%rdx), %YMM2, %YMM2
> +
> +       VMOVU   (%rsi, %rdx), %YMM1
> +       vpxorq  (%rdx), %YMM1, %YMM1
> +
> +       vpternlogd $0xfe, %YMM1, %YMM2, %YMM4
> +L(8x_last_1x_vec):
> +L(8x_last_2x_vec):
> +       VPTEST  %YMM4, %YMM4, %k1
> +       kmovd   %k1, %eax
> +L(return_neq2):
> +       ret
> +
> +       /* Relatively cold case as page crosses are unexpected.  */
> +       .p2align 4
> +L(page_cross_less_vec):
> +       cmpl    $16, %edx
> +       jae     L(between_16_31)
> +       cmpl    $8, %edx
> +       ja      L(between_9_15)
> +       cmpl    $4, %edx
> +       jb      L(between_2_3)
> +       /* From 4 to 8 bytes.  No branch when size == 4.  */
> +       movl    (%rdi), %eax
> +       subl    (%rsi), %eax
> +       movl    -4(%rdi, %rdx), %ecx
> +       movl    -4(%rsi, %rdx), %edi
> +       subl    %edi, %ecx
> +       orl     %ecx, %eax
> +       ret
> +
> +       .p2align 4,, 8
> +L(between_16_31):
> +       /* From 16 to 31 bytes.  No branch when size == 16.  */
> +
> +       /* Safe to use xmm[0, 15] as no vzeroupper is needed so RTM safe.
> +        */
> +       vmovdqu (%rsi), %xmm1
> +       vpcmpeqb (%rdi), %xmm1, %xmm1
> +       vmovdqu -16(%rsi, %rdx), %xmm2
> +       vpcmpeqb -16(%rdi, %rdx), %xmm2, %xmm2
> +       vpand   %xmm1, %xmm2, %xmm2
> +       vpmovmskb %xmm2, %eax
> +       notw    %ax
> +       /* No ymm register was touched.  */
> +       ret
> +
> +       .p2align 4,, 8
> +L(between_9_15):
> +       /* From 9 to 15 bytes.  */
> +       movq    (%rdi), %rax
> +       subq    (%rsi), %rax
> +       movq    -8(%rdi, %rdx), %rcx
> +       movq    -8(%rsi, %rdx), %rdi
> +       subq    %rdi, %rcx
> +       orq     %rcx, %rax
> +       /* edx is guaranteed to be a non-zero int.  */
> +       cmovnz  %edx, %eax
> +       ret
> +
> +       /* Don't align. This is cold and aligning here will cause code
> +          to spill into next cache line.  */
> +L(between_2_3):
> +       /* From 2 to 3 bytes.  No branch when size == 2.  */
> +       movzwl  (%rdi), %eax
> +       movzwl  (%rsi), %ecx
> +       subl    %ecx, %eax
> +       movzbl  -1(%rdi, %rdx), %ecx
> +       /* All machines that support evex will insert a "merging uop"
> +          avoiding any serious partial register stalls.  */
> +       subb    -1(%rsi, %rdx), %cl
> +       orl     %ecx, %eax
> +       /* No ymm register was touched.  */
> +       ret
> +
> +    /* 4 Bytes from next cache line. */
> +END (MEMCMPEQ)
> +#endif
> --
> 2.25.1
>


Roughly ~0-25% improvement over memcmp. The improvement is generally
larger for the smaller size ranges, which ultimately are the most
important to optimize for.

Numbers for new implementations attached.

Tests were run on the following CPUs:

Tigerlake: https://ark.intel.com/content/www/us/en/ark/products/208921/intel-core-i7-1165g7-processor-12m-cache-up-to-4-70-ghz-with-ipu.html
Skylake: https://ark.intel.com/content/www/us/en/ark/products/149091/intel-core-i7-8565u-processor-8m-cache-up-to-4-60-ghz.html

Some notes on the numbers.

There are some regressions in the sse2 version. I didn't optimize
these versions beyond defining out obviously irrelevant code for
bcmp. My intuition is that the slowdowns are alignment related. As
well, I tested on hardware for which sse2 is outdated, so I am not
sure whether these issues would translate to architectures that would
actually use sse2.

The avx2 and evex versions are basically universal improvements.
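
For reference, the following is a minimal C sketch of the contract
being benchmarked here (it is not the glibc implementation, and
memcmpeq_ref is only an illustrative name): __memcmpeq need only
return zero when the buffers are equal and some nonzero value
otherwise, so a mismatch never requires finding the first differing
byte or computing its signed difference the way memcmp must.

#include <stddef.h>
#include <stdio.h>

/* Reference sketch of the __memcmpeq contract: return 0 iff the first
   n bytes of s1 and s2 are equal, any nonzero value otherwise.  Unlike
   memcmp, the sign and magnitude of a nonzero result are unspecified,
   which is what lets the optimized versions skip the byte-difference
   logic.  */
static int
memcmpeq_ref (const void *s1, const void *s2, size_t n)
{
  const unsigned char *p1 = s1, *p2 = s2;
  unsigned char acc = 0;
  while (n--)
    acc |= *p1++ ^ *p2++;	/* Accumulate any bytewise difference.  */
  return acc;			/* Zero only if every byte matched.  */
}

int
main (void)
{
  /* Prints "0 1": equal buffers give 0, unequal give nonzero.  */
  printf ("%d %d\n", memcmpeq_ref ("abc", "abc", 3),
          memcmpeq_ref ("abc", "abd", 3) != 0);
  return 0;
}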

[-- Attachment #2: bcmp-tgl.pdf --]
[-- Type: application/pdf, Size: 223172 bytes --]

[-- Attachment #3: bcmp-skl.pdf --]
[-- Type: application/pdf, Size: 195097 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v1 1/6] String: Add __memcmpeq as build target
  2021-10-27  2:43 [PATCH v1 1/6] String: Add __memcmpeq as build target Noah Goldstein
                   ` (4 preceding siblings ...)
  2021-10-27  2:43 ` [PATCH v1 6/6] x86_64: Add evex optimized __memcmpeq in memcmpeq-evex.S Noah Goldstein
@ 2021-10-27 12:42 ` H.J. Lu
  2021-10-27 18:46   ` Noah Goldstein
  2021-10-28 17:57 ` Joseph Myers
  6 siblings, 1 reply; 24+ messages in thread
From: H.J. Lu @ 2021-10-27 12:42 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: GNU C Library, Carlos O'Donell

On Tue, Oct 26, 2021 at 7:43 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> No bug. This commit just adds __memcmpeq as a build target so that
> implementations for __memcmpeq that are not just aliases to memcmp can
> be supported.
> ---
>  string/Makefile   |  2 +-
>  string/memcmpeq.c | 24 ++++++++++++++++++++++++
>  2 files changed, 25 insertions(+), 1 deletion(-)
>  create mode 100644 string/memcmpeq.c
>
> diff --git a/string/Makefile b/string/Makefile
> index 40d6fac133..2199dd30b7 100644
> --- a/string/Makefile
> +++ b/string/Makefile
> @@ -34,7 +34,7 @@ routines      := strcat strchr strcmp strcoll strcpy strcspn          \
>                    strerror _strerror strlen strnlen                    \
>                    strncat strncmp strncpy                              \
>                    strrchr strpbrk strsignal strspn strstr strtok       \
> -                  strtok_r strxfrm memchr memcmp memmove memset        \
> +                  strtok_r strxfrm memchr memcmp memcmpeq memmove memset       \
>                    mempcpy bcopy bzero ffs ffsll stpcpy stpncpy         \
>                    strcasecmp strncase strcasecmp_l strncase_l          \
>                    memccpy memcpy wordcopy strsep strcasestr            \
> diff --git a/string/memcmpeq.c b/string/memcmpeq.c
> new file mode 100644
> index 0000000000..08726325a8
> --- /dev/null
> +++ b/string/memcmpeq.c
> @@ -0,0 +1,24 @@
> +/* Copyright (C) 1991-2021 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <https://www.gnu.org/licenses/>.  */
> +
> +/* This file is intentionally left empty.  It exists so that both
> +   architectures which implement __memcmpeq seperately from memcmp and
> +   architectures which implement __memcmpeq by having it alias memcmp will
> +   build.
> +
> +   The alias for __memcmpeq to memcmp for the C implementation is in
> +   memcmp.c.  */
> --
> 2.25.1
>

LGTM.

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

Thanks.

H.J.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v1 2/6] Benchtests: Add benchtests for __memcmpeq
  2021-10-27  2:43 ` [PATCH v1 2/6] Benchtests: Add benchtests for __memcmpeq Noah Goldstein
@ 2021-10-27 12:45   ` H.J. Lu
  2021-10-27 16:08     ` Noah Goldstein
  2021-10-27 16:07   ` [PATCH v2 " Noah Goldstein
  1 sibling, 1 reply; 24+ messages in thread
From: H.J. Lu @ 2021-10-27 12:45 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: GNU C Library, Carlos O'Donell

On Tue, Oct 26, 2021 at 7:43 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> No bug. This commit adds __memcmpeq benchmarks. The benchmarks just
> use the existing ones in memcmp. This will be useful for testing
> implementations of __memcmpeq that do not just alias memcmp.
> ---
>  benchtests/Makefile         |  2 +-
>  benchtests/bench-memcmp.c   |  4 +++-
>  benchtests/bench-memcmpeq.c | 20 ++++++++++++++++++++
>  3 files changed, 24 insertions(+), 2 deletions(-)
>  create mode 100644 benchtests/bench-memcmpeq.c
>
> diff --git a/benchtests/Makefile b/benchtests/Makefile
> index b690aaf65b..7be0e47c47 100644
> --- a/benchtests/Makefile
> +++ b/benchtests/Makefile
> @@ -103,7 +103,7 @@ bench := $(foreach B,$(filter bench-%,${BENCHSET}), ${${B}})
>  endif
>
>  # String function benchmarks.
> -string-benchset := memccpy memchr memcmp memcpy memmem memmove \
> +string-benchset := memccpy memchr memcmp memcmpeq memcpy memmem memmove \
>                    mempcpy memset rawmemchr stpcpy stpncpy strcasecmp strcasestr \
>                    strcat strchr strchrnul strcmp strcpy strcspn strlen \
>                    strncasecmp strncat strncmp strncpy strnlen strpbrk strrchr \
> diff --git a/benchtests/bench-memcmp.c b/benchtests/bench-memcmp.c
> index 0d6a93bf29..546b06e1ab 100644
> --- a/benchtests/bench-memcmp.c
> +++ b/benchtests/bench-memcmp.c
> @@ -17,7 +17,9 @@
>     <https://www.gnu.org/licenses/>.  */
>
>  #define TEST_MAIN
> -#ifdef WIDE
> +#ifdef TEST_MEMCMPEQ
> +# define TEST_NAME "__memcmpeq"
> +#elif defined WIDE
>  # define TEST_NAME "wmemcmp"
>  #else
>  # define TEST_NAME "memcmp"

Please rename simple_memcmp to simple_memcmpeq.

> diff --git a/benchtests/bench-memcmpeq.c b/benchtests/bench-memcmpeq.c
> new file mode 100644
> index 0000000000..e918d4f77c
> --- /dev/null
> +++ b/benchtests/bench-memcmpeq.c
> @@ -0,0 +1,20 @@
> +/* Measure __memcmpeq functions.
> +   Copyright (C) 2015-2021 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <https://www.gnu.org/licenses/>.  */
> +
> +#define TEST_MEMCMPEQ 1
> +#include "bench-memcmp.c"
> --
> 2.25.1
>


-- 
H.J.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v1 3/6] x86_64: Add support for __memcmpeq using sse2, avx2, and evex
  2021-10-27  2:43 ` [PATCH v1 3/6] x86_64: Add support for __memcmpeq using sse2, avx2, and evex Noah Goldstein
@ 2021-10-27 12:47   ` H.J. Lu
  0 siblings, 0 replies; 24+ messages in thread
From: H.J. Lu @ 2021-10-27 12:47 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: GNU C Library, Carlos O'Donell

On Tue, Oct 26, 2021 at 7:43 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> No bug. This commit adds support for __memcmpeq to be implemented
> > separately from memcmp. Support is added for versions optimized with
> sse2, avx2, and evex.
> ---
>  sysdeps/generic/ifunc-init.h                 |  5 +-
>  sysdeps/x86_64/memcmp.S                      |  9 ++--
>  sysdeps/x86_64/multiarch/Makefile            |  4 ++
>  sysdeps/x86_64/multiarch/ifunc-impl-list.c   | 21 +++++++++
>  sysdeps/x86_64/multiarch/ifunc-memcmpeq.h    | 49 ++++++++++++++++++++
>  sysdeps/x86_64/multiarch/memcmp-sse2.S       |  4 +-
>  sysdeps/x86_64/multiarch/memcmp.c            |  3 --
>  sysdeps/x86_64/multiarch/memcmpeq-avx2-rtm.S | 12 +++++
>  sysdeps/x86_64/multiarch/memcmpeq-avx2.S     | 23 +++++++++
>  sysdeps/x86_64/multiarch/memcmpeq-evex.S     | 23 +++++++++
>  sysdeps/x86_64/multiarch/memcmpeq-sse2.S     | 23 +++++++++
>  sysdeps/x86_64/multiarch/memcmpeq.c          | 35 ++++++++++++++
>  12 files changed, 202 insertions(+), 9 deletions(-)
>  create mode 100644 sysdeps/x86_64/multiarch/ifunc-memcmpeq.h
>  create mode 100644 sysdeps/x86_64/multiarch/memcmpeq-avx2-rtm.S
>  create mode 100644 sysdeps/x86_64/multiarch/memcmpeq-avx2.S
>  create mode 100644 sysdeps/x86_64/multiarch/memcmpeq-evex.S
>  create mode 100644 sysdeps/x86_64/multiarch/memcmpeq-sse2.S
>  create mode 100644 sysdeps/x86_64/multiarch/memcmpeq.c
>
> diff --git a/sysdeps/generic/ifunc-init.h b/sysdeps/generic/ifunc-init.h
> index 7f69485de8..ee8a8289c8 100644
> --- a/sysdeps/generic/ifunc-init.h
> +++ b/sysdeps/generic/ifunc-init.h
> @@ -50,5 +50,8 @@
>     '__<symbol>_<variant>' as the optimized implementation and
>     '<symbol>_ifunc_selector' as the IFUNC selector.  */
>  #define REDIRECT_NAME  EVALUATOR1 (__redirect, SYMBOL_NAME)
> -#define OPTIMIZE(name) EVALUATOR2 (SYMBOL_NAME, name)
>  #define IFUNC_SELECTOR EVALUATOR1 (SYMBOL_NAME, ifunc_selector)
> +#define OPTIMIZE1(name)        EVALUATOR1 (SYMBOL_NAME, name)
> +#define OPTIMIZE2(name)        EVALUATOR2 (SYMBOL_NAME, name)
> +/* Default is to use OPTIMIZE2.  */
> +#define OPTIMIZE(name) OPTIMIZE2(name)
> diff --git a/sysdeps/x86_64/memcmp.S b/sysdeps/x86_64/memcmp.S
> index 8a03e572e8..b53f2c0866 100644
> --- a/sysdeps/x86_64/memcmp.S
> +++ b/sysdeps/x86_64/memcmp.S
> @@ -356,9 +356,10 @@ L(ATR32res):
>         .p2align 4,, 4
>  END(memcmp)
>
> -#undef bcmp
> +#ifdef USE_AS_MEMCMPEQ
> +libc_hidden_def (memcmp)
> +#else
> +# undef bcmp
>  weak_alias (memcmp, bcmp)
> -#undef __memcmpeq
> -strong_alias (memcmp, __memcmpeq)
>  libc_hidden_builtin_def (memcmp)
> -libc_hidden_def (__memcmpeq)
> +#endif
> diff --git a/sysdeps/x86_64/multiarch/Makefile b/sysdeps/x86_64/multiarch/Makefile
> index 26be40959c..044778585b 100644
> --- a/sysdeps/x86_64/multiarch/Makefile
> +++ b/sysdeps/x86_64/multiarch/Makefile
> @@ -7,7 +7,9 @@ sysdep_routines += strncat-c stpncpy-c strncpy-c \
>                    memchr-sse2 rawmemchr-sse2 memchr-avx2 rawmemchr-avx2 \
>                    memrchr-sse2 memrchr-avx2 \
>                    memcmp-sse2 \
> +                  memcmpeq-sse2 \
>                    memcmp-avx2-movbe \
> +                  memcmpeq-avx2 \
>                    memcmp-sse4 memcpy-ssse3 \
>                    memmove-ssse3 \
>                    memcpy-ssse3-back \
> @@ -42,6 +44,7 @@ sysdep_routines += strncat-c stpncpy-c strncpy-c \
>                    memset-avx512-unaligned-erms \
>                    memchr-avx2-rtm \
>                    memcmp-avx2-movbe-rtm \
> +                  memcmpeq-avx2-rtm \
>                    memmove-avx-unaligned-erms-rtm \
>                    memrchr-avx2-rtm \
>                    memset-avx2-unaligned-erms-rtm \
> @@ -61,6 +64,7 @@ sysdep_routines += strncat-c stpncpy-c strncpy-c \
>                    strrchr-avx2-rtm \
>                    memchr-evex \
>                    memcmp-evex-movbe \
> +                  memcmpeq-evex \
>                    memmove-evex-unaligned-erms \
>                    memrchr-evex \
>                    memset-evex-unaligned-erms \
> diff --git a/sysdeps/x86_64/multiarch/ifunc-impl-list.c b/sysdeps/x86_64/multiarch/ifunc-impl-list.c
> index 39ab10613b..f7f3806d1d 100644
> --- a/sysdeps/x86_64/multiarch/ifunc-impl-list.c
> +++ b/sysdeps/x86_64/multiarch/ifunc-impl-list.c
> @@ -38,6 +38,27 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
>
>    size_t i = 0;
>
> +  /* Support sysdeps/x86_64/multiarch/memcmpeq.c.  */
> +  IFUNC_IMPL (i, name, __memcmpeq,
> +             IFUNC_IMPL_ADD (array, i, __memcmpeq,
> +                             (CPU_FEATURE_USABLE (AVX2)
> +                   && CPU_FEATURE_USABLE (MOVBE)
> +                              && CPU_FEATURE_USABLE (BMI2)),
> +                             __memcmpeq_avx2)
> +             IFUNC_IMPL_ADD (array, i, __memcmpeq,
> +                             (CPU_FEATURE_USABLE (AVX2)
> +                              && CPU_FEATURE_USABLE (BMI2)
> +                   && CPU_FEATURE_USABLE (MOVBE)
> +                              && CPU_FEATURE_USABLE (RTM)),
> +                             __memcmpeq_avx2_rtm)
> +             IFUNC_IMPL_ADD (array, i, __memcmpeq,
> +                             (CPU_FEATURE_USABLE (AVX512VL)
> +                              && CPU_FEATURE_USABLE (AVX512BW)
> +                   && CPU_FEATURE_USABLE (MOVBE)
> +                              && CPU_FEATURE_USABLE (BMI2)),
> +                             __memcmpeq_evex)
> +             IFUNC_IMPL_ADD (array, i, __memcmpeq, 1, __memcmpeq_sse2))
> +
>    /* Support sysdeps/x86_64/multiarch/memchr.c.  */
>    IFUNC_IMPL (i, name, memchr,
>               IFUNC_IMPL_ADD (array, i, memchr,
> diff --git a/sysdeps/x86_64/multiarch/ifunc-memcmpeq.h b/sysdeps/x86_64/multiarch/ifunc-memcmpeq.h
> new file mode 100644
> index 0000000000..3319a9568a
> --- /dev/null
> +++ b/sysdeps/x86_64/multiarch/ifunc-memcmpeq.h
> @@ -0,0 +1,49 @@
> +/* Common definition for __memcmpeq ifunc selections.
> +   All versions must be listed in ifunc-impl-list.c.
> +   Copyright (C) 2017-2021 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <https://www.gnu.org/licenses/>.  */
> +
> +# include <init-arch.h>
> +
> +extern __typeof (REDIRECT_NAME) OPTIMIZE1 (sse2) attribute_hidden;
> +extern __typeof (REDIRECT_NAME) OPTIMIZE1 (avx2) attribute_hidden;
> +extern __typeof (REDIRECT_NAME) OPTIMIZE1 (avx2_rtm) attribute_hidden;
> +extern __typeof (REDIRECT_NAME) OPTIMIZE1 (evex) attribute_hidden;
> +
> +static inline void *
> +IFUNC_SELECTOR (void)
> +{
> +  const struct cpu_features* cpu_features = __get_cpu_features ();
> +
> +  if (CPU_FEATURE_USABLE_P (cpu_features, AVX2)
> +      && CPU_FEATURE_USABLE_P (cpu_features, BMI2)
> +      && CPU_FEATURE_USABLE_P (cpu_features, MOVBE)
> +      && CPU_FEATURES_ARCH_P (cpu_features, AVX_Fast_Unaligned_Load))
> +    {
> +      if (CPU_FEATURE_USABLE_P (cpu_features, AVX512VL)
> +         && CPU_FEATURE_USABLE_P (cpu_features, AVX512BW))
> +       return OPTIMIZE1 (evex);
> +
> +      if (CPU_FEATURE_USABLE_P (cpu_features, RTM))
> +       return OPTIMIZE1 (avx2_rtm);
> +
> +      if (!CPU_FEATURES_ARCH_P (cpu_features, Prefer_No_VZEROUPPER))
> +       return OPTIMIZE1 (avx2);
> +    }
> +
> +  return OPTIMIZE1 (sse2);
> +}
> diff --git a/sysdeps/x86_64/multiarch/memcmp-sse2.S b/sysdeps/x86_64/multiarch/memcmp-sse2.S
> index 7b30b7ca2e..132d6fb339 100644
> --- a/sysdeps/x86_64/multiarch/memcmp-sse2.S
> +++ b/sysdeps/x86_64/multiarch/memcmp-sse2.S
> @@ -17,7 +17,9 @@
>     <https://www.gnu.org/licenses/>.  */
>
>  #if IS_IN (libc)
> -# define memcmp __memcmp_sse2
> +# ifndef memcmp
> +#  define memcmp __memcmp_sse2
> +# endif
>
>  # ifdef SHARED
>  #  undef libc_hidden_builtin_def
> diff --git a/sysdeps/x86_64/multiarch/memcmp.c b/sysdeps/x86_64/multiarch/memcmp.c
> index 7b3409b1dd..fe725f3563 100644
> --- a/sysdeps/x86_64/multiarch/memcmp.c
> +++ b/sysdeps/x86_64/multiarch/memcmp.c
> @@ -29,9 +29,6 @@
>  libc_ifunc_redirected (__redirect_memcmp, memcmp, IFUNC_SELECTOR ());
>  # undef bcmp
>  weak_alias (memcmp, bcmp)
> -# undef __memcmpeq
> -strong_alias (memcmp, __memcmpeq)
> -libc_hidden_def (__memcmpeq)
>
>  # ifdef SHARED
>  __hidden_ver1 (memcmp, __GI_memcmp, __redirect_memcmp)
> diff --git a/sysdeps/x86_64/multiarch/memcmpeq-avx2-rtm.S b/sysdeps/x86_64/multiarch/memcmpeq-avx2-rtm.S
> new file mode 100644
> index 0000000000..24b6a0c9ff
> --- /dev/null
> +++ b/sysdeps/x86_64/multiarch/memcmpeq-avx2-rtm.S
> @@ -0,0 +1,12 @@
> +#ifndef MEMCMP
> +# define MEMCMP __memcmpeq_avx2_rtm
> +#endif
> +
> +#define ZERO_UPPER_VEC_REGISTERS_RETURN \
> +  ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST
> +
> +#define VZEROUPPER_RETURN jmp   L(return_vzeroupper)
> +
> +#define SECTION(p) p##.avx.rtm
> +
> +#include "memcmpeq-avx2.S"
> diff --git a/sysdeps/x86_64/multiarch/memcmpeq-avx2.S b/sysdeps/x86_64/multiarch/memcmpeq-avx2.S
> new file mode 100644
> index 0000000000..0181ea0d8d
> --- /dev/null
> +++ b/sysdeps/x86_64/multiarch/memcmpeq-avx2.S
> @@ -0,0 +1,23 @@
> +/* __memcmpeq optimized with AVX2.
> +   Copyright (C) 2017-2021 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <https://www.gnu.org/licenses/>.  */
> +
> +#ifndef MEMCMP
> +# define MEMCMP        __memcmpeq_avx2
> +#endif
> +
> +#include "memcmp-avx2-movbe.S"
> diff --git a/sysdeps/x86_64/multiarch/memcmpeq-evex.S b/sysdeps/x86_64/multiarch/memcmpeq-evex.S
> new file mode 100644
> index 0000000000..951e1e9560
> --- /dev/null
> +++ b/sysdeps/x86_64/multiarch/memcmpeq-evex.S
> @@ -0,0 +1,23 @@
> +/* __memcmpeq optimized with EVEX.
> +   Copyright (C) 2017-2021 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <https://www.gnu.org/licenses/>.  */
> +
> +#ifndef MEMCMP
> +# define MEMCMP        __memcmpeq_evex
> +#endif
> +
> +#include "memcmp-evex-movbe.S"
> diff --git a/sysdeps/x86_64/multiarch/memcmpeq-sse2.S b/sysdeps/x86_64/multiarch/memcmpeq-sse2.S
> new file mode 100644
> index 0000000000..c488cbbcd9
> --- /dev/null
> +++ b/sysdeps/x86_64/multiarch/memcmpeq-sse2.S
> @@ -0,0 +1,23 @@
> +/* __memcmpeq optimized with SSE2.
> +   Copyright (C) 2017-2021 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <https://www.gnu.org/licenses/>.  */
> +
> +#ifndef memcmp
> +# define memcmp        __memcmpeq_sse2
> +#endif
> +#define USE_AS_MEMCMPEQ        1
> +#include "memcmp-sse2.S"
> diff --git a/sysdeps/x86_64/multiarch/memcmpeq.c b/sysdeps/x86_64/multiarch/memcmpeq.c
> new file mode 100644
> index 0000000000..163e56047e
> --- /dev/null
> +++ b/sysdeps/x86_64/multiarch/memcmpeq.c
> @@ -0,0 +1,35 @@
> +/* Multiple versions of __memcmpeq.
> +   All versions must be listed in ifunc-impl-list.c.
> +   Copyright (C) 2017-2021 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <https://www.gnu.org/licenses/>.  */
> +
> +/* Define multiple versions only for the definition in libc.  */
> +#if IS_IN (libc)
> +# define __memcmpeq __redirect___memcmpeq
> +# include <string.h>
> +# undef __memcmpeq
> +
> +# define SYMBOL_NAME __memcmpeq
> +# include "ifunc-memcmpeq.h"
> +
> +libc_ifunc_redirected (__redirect___memcmpeq, __memcmpeq, IFUNC_SELECTOR ());
> +
> +# ifdef SHARED
> +__hidden_ver1 (__memcmpeq, __GI___memcmpeq, __redirect___memcmpeq)
> +    __attribute__ ((visibility ("hidden"))) __attribute_copy__ (__memcmpeq);
> +# endif
> +#endif
> --
> 2.25.1
>

LGTM.

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

Thanks.

-- 
H.J.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v1 4/6] x86_64: Add sse2 optimized __memcmpeq in memcmp-sse2.S
  2021-10-27  2:43 ` [PATCH v1 4/6] x86_64: Add sse2 optimized __memcmpeq in memcmp-sse2.S Noah Goldstein
@ 2021-10-27 12:48   ` H.J. Lu
  0 siblings, 0 replies; 24+ messages in thread
From: H.J. Lu @ 2021-10-27 12:48 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: GNU C Library, Carlos O'Donell

On Tue, Oct 26, 2021 at 7:43 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> No bug. This commit does not modify any of the memcmp
> implementation. It just adds __memcmpeq ifdefs to skip obvious cases
> where computing the proper 1/-1 required by memcmp is not needed.
> ---
>  sysdeps/x86_64/memcmp.S | 55 ++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 51 insertions(+), 4 deletions(-)
>
> diff --git a/sysdeps/x86_64/memcmp.S b/sysdeps/x86_64/memcmp.S
> index b53f2c0866..c245383963 100644
> --- a/sysdeps/x86_64/memcmp.S
> +++ b/sysdeps/x86_64/memcmp.S
> @@ -49,34 +49,63 @@ L(s2b):
>         movzwl  (%rdi), %eax
>         movzwl  (%rdi, %rsi), %edx
>         subq    $2, %r10
> +#ifdef USE_AS_MEMCMPEQ
> +       je      L(finz1)
> +#else
>         je      L(fin2_7)
> +#endif
>         addq    $2, %rdi
>         cmpl    %edx, %eax
> +#ifdef USE_AS_MEMCMPEQ
> +       jnz     L(neq_early)
> +#else
>         jnz     L(fin2_7)
> +#endif
>  L(s4b):
>         testq   $4, %r10
>         jz      L(s8b)
>         movl    (%rdi), %eax
>         movl    (%rdi, %rsi), %edx
>         subq    $4, %r10
> +#ifdef USE_AS_MEMCMPEQ
> +       je      L(finz1)
> +#else
>         je      L(fin2_7)
> +#endif
>         addq    $4, %rdi
>         cmpl    %edx, %eax
> +#ifdef USE_AS_MEMCMPEQ
> +       jnz     L(neq_early)
> +#else
>         jnz     L(fin2_7)
> +#endif
>  L(s8b):
>         testq   $8, %r10
>         jz      L(s16b)
>         movq    (%rdi), %rax
>         movq    (%rdi, %rsi), %rdx
>         subq    $8, %r10
> +#ifdef USE_AS_MEMCMPEQ
> +       je      L(sub_return8)
> +#else
>         je      L(fin2_7)
> +#endif
>         addq    $8, %rdi
>         cmpq    %rdx, %rax
> +#ifdef USE_AS_MEMCMPEQ
> +       jnz     L(neq_early)
> +#else
>         jnz     L(fin2_7)
> +#endif
>  L(s16b):
>         movdqu    (%rdi), %xmm1
>         movdqu    (%rdi, %rsi), %xmm0
>         pcmpeqb   %xmm0, %xmm1
> +#ifdef USE_AS_MEMCMPEQ
> +       pmovmskb  %xmm1, %eax
> +       subl      $0xffff, %eax
> +       ret
> +#else
>         pmovmskb  %xmm1, %edx
>         xorl      %eax, %eax
>         subl      $0xffff, %edx
> @@ -86,7 +115,7 @@ L(s16b):
>         movzbl   (%rcx), %eax
>         movzbl   (%rsi, %rcx), %edx
>         jmp      L(finz1)
> -
> +#endif
>         .p2align 4,, 4
>  L(finr1b):
>         movzbl  (%rdi), %eax
> @@ -95,7 +124,15 @@ L(finz1):
>         subl    %edx, %eax
>  L(exit):
>         ret
> -
> +#ifdef USE_AS_MEMCMPEQ
> +       .p2align 4,, 4
> +L(sub_return8):
> +       subq    %rdx, %rax
> +       movl    %eax, %edx
> +       shrq    $32, %rax
> +       orl     %edx, %eax
> +       ret
> +#else
>         .p2align 4,, 4
>  L(fin2_7):
>         cmpq    %rdx, %rax
> @@ -111,12 +148,17 @@ L(fin2_7):
>         movzbl  %dl, %edx
>         subl    %edx, %eax
>         ret
> -
> +#endif
>         .p2align 4,, 4
>  L(finz):
>         xorl    %eax, %eax
>         ret
> -
> +#ifdef USE_AS_MEMCMPEQ
> +       .p2align 4,, 4
> +L(neq_early):
> +       movl    $1, %eax
> +       ret
> +#endif
>         /* For blocks bigger than 32 bytes
>            1. Advance one of the addr pointer to be 16B aligned.
>            2. Treat the case of both addr pointers aligned to 16B
> @@ -246,11 +288,16 @@ L(mt16):
>
>         .p2align 4,, 4
>  L(neq):
> +#ifdef USE_AS_MEMCMPEQ
> +       movl    $1, %eax
> +    ret
> +#else
>         bsfl      %edx, %ecx
>         movzbl   (%rdi, %rcx), %eax
>         addq     %rdi, %rsi
>         movzbl   (%rsi,%rcx), %edx
>         jmp      L(finz1)
> +#endif
>
>         .p2align 4,, 4
>  L(ATR):
> --
> 2.25.1
>

LGTM.

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

Thanks.

-- 
H.J.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v1 5/6] x86_64: Add avx2 optimized __memcmpeq in memcmpeq-avx2.S
  2021-10-27  2:43 ` [PATCH v1 5/6] x86_64: Add avx2 optimized __memcmpeq in memcmpeq-avx2.S Noah Goldstein
@ 2021-10-27 12:48   ` H.J. Lu
  0 siblings, 0 replies; 24+ messages in thread
From: H.J. Lu @ 2021-10-27 12:48 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: GNU C Library, Carlos O'Donell

On Tue, Oct 26, 2021 at 7:43 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> No bug. This commit adds new optimized __memcmpeq implementation for
> avx2.
>
> The primary optimizations are:
>
> 1) skipping the logic to find the difference of the first mismatched
> byte.
>
> 2) not updating src/dst addresses as the non-equals logic does not
> need to be reused by different areas.
> ---
>  sysdeps/x86_64/multiarch/ifunc-impl-list.c   |   2 -
>  sysdeps/x86_64/multiarch/ifunc-memcmpeq.h    |   2 +-
>  sysdeps/x86_64/multiarch/memcmpeq-avx2-rtm.S |   4 +-
>  sysdeps/x86_64/multiarch/memcmpeq-avx2.S     | 309 ++++++++++++++++++-
>  4 files changed, 308 insertions(+), 9 deletions(-)
>
> diff --git a/sysdeps/x86_64/multiarch/ifunc-impl-list.c b/sysdeps/x86_64/multiarch/ifunc-impl-list.c
> index f7f3806d1d..535450f52c 100644
> --- a/sysdeps/x86_64/multiarch/ifunc-impl-list.c
> +++ b/sysdeps/x86_64/multiarch/ifunc-impl-list.c
> @@ -42,13 +42,11 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
>    IFUNC_IMPL (i, name, __memcmpeq,
>               IFUNC_IMPL_ADD (array, i, __memcmpeq,
>                               (CPU_FEATURE_USABLE (AVX2)
> -                   && CPU_FEATURE_USABLE (MOVBE)
>                                && CPU_FEATURE_USABLE (BMI2)),
>                               __memcmpeq_avx2)
>               IFUNC_IMPL_ADD (array, i, __memcmpeq,
>                               (CPU_FEATURE_USABLE (AVX2)
>                                && CPU_FEATURE_USABLE (BMI2)
> -                   && CPU_FEATURE_USABLE (MOVBE)
>                                && CPU_FEATURE_USABLE (RTM)),
>                               __memcmpeq_avx2_rtm)
>               IFUNC_IMPL_ADD (array, i, __memcmpeq,
> diff --git a/sysdeps/x86_64/multiarch/ifunc-memcmpeq.h b/sysdeps/x86_64/multiarch/ifunc-memcmpeq.h
> index 3319a9568a..e596c5048b 100644
> --- a/sysdeps/x86_64/multiarch/ifunc-memcmpeq.h
> +++ b/sysdeps/x86_64/multiarch/ifunc-memcmpeq.h
> @@ -31,10 +31,10 @@ IFUNC_SELECTOR (void)
>
>    if (CPU_FEATURE_USABLE_P (cpu_features, AVX2)
>        && CPU_FEATURE_USABLE_P (cpu_features, BMI2)
> -      && CPU_FEATURE_USABLE_P (cpu_features, MOVBE)
>        && CPU_FEATURES_ARCH_P (cpu_features, AVX_Fast_Unaligned_Load))
>      {
>        if (CPU_FEATURE_USABLE_P (cpu_features, AVX512VL)
> +         && CPU_FEATURE_USABLE_P (cpu_features, MOVBE)
>           && CPU_FEATURE_USABLE_P (cpu_features, AVX512BW))
>         return OPTIMIZE1 (evex);
>
> diff --git a/sysdeps/x86_64/multiarch/memcmpeq-avx2-rtm.S b/sysdeps/x86_64/multiarch/memcmpeq-avx2-rtm.S
> index 24b6a0c9ff..3264a4a76c 100644
> --- a/sysdeps/x86_64/multiarch/memcmpeq-avx2-rtm.S
> +++ b/sysdeps/x86_64/multiarch/memcmpeq-avx2-rtm.S
> @@ -1,5 +1,5 @@
> -#ifndef MEMCMP
> -# define MEMCMP __memcmpeq_avx2_rtm
> +#ifndef MEMCMPEQ
> +# define MEMCMPEQ __memcmpeq_avx2_rtm
>  #endif
>
>  #define ZERO_UPPER_VEC_REGISTERS_RETURN \
> diff --git a/sysdeps/x86_64/multiarch/memcmpeq-avx2.S b/sysdeps/x86_64/multiarch/memcmpeq-avx2.S
> index 0181ea0d8d..0bf59fb8fa 100644
> --- a/sysdeps/x86_64/multiarch/memcmpeq-avx2.S
> +++ b/sysdeps/x86_64/multiarch/memcmpeq-avx2.S
> @@ -16,8 +16,309 @@
>     License along with the GNU C Library; if not, see
>     <https://www.gnu.org/licenses/>.  */
>
> -#ifndef MEMCMP
> -# define MEMCMP        __memcmpeq_avx2
> -#endif
> +#if IS_IN (libc)
> +
> +/* __memcmpeq is implemented as:
> +   1. Use ymm vector compares when possible. The only case where
> +      vector compares are not possible is when size < VEC_SIZE
> +      and loading from either s1 or s2 would cause a page cross.
> +   2. Use xmm vector compare when size >= 8 bytes.
> +   3. Optimistically compare up to first 4 * VEC_SIZE one at a time
> +      to check for early mismatches. Only do this if it's guaranteed the
> +      work is not wasted.
> +   4. If size is 8 * VEC_SIZE or less, unroll the loop.
> +   5. Compare 4 * VEC_SIZE at a time with the aligned first memory
> +      area.
> +   6. Use 2 vector compares when size is 2 * VEC_SIZE or less.
> +   7. Use 4 vector compares when size is 4 * VEC_SIZE or less.
> +   8. Use 8 vector compares when size is 8 * VEC_SIZE or less.  */
> +
> +# include <sysdep.h>
> +
> +# ifndef MEMCMPEQ
> +#  define MEMCMPEQ     __memcmpeq_avx2
> +# endif
> +
> +# define VPCMPEQ       vpcmpeqb
> +
> +# ifndef VZEROUPPER
> +#  define VZEROUPPER   vzeroupper
> +# endif
> +
> +# ifndef SECTION
> +#  define SECTION(p)   p##.avx
> +# endif
> +
> +# define VEC_SIZE 32
> +# define PAGE_SIZE     4096
> +
> +       .section SECTION(.text), "ax", @progbits
> +ENTRY_P2ALIGN (MEMCMPEQ, 6)
> +# ifdef __ILP32__
> +       /* Clear the upper 32 bits.  */
> +       movl    %edx, %edx
> +# endif
> +       cmp     $VEC_SIZE, %RDX_LP
> +       jb      L(less_vec)
> +
> +       /* From VEC to 2 * VEC.  No branch when size == VEC_SIZE.  */
> +       vmovdqu (%rsi), %ymm1
> +       VPCMPEQ (%rdi), %ymm1, %ymm1
> +       vpmovmskb %ymm1, %eax
> +       incl    %eax
> +       jnz     L(return_neq0)
> +       cmpq    $(VEC_SIZE * 2), %rdx
> +       jbe     L(last_1x_vec)
> +
> +       /* Check second VEC no matter what.  */
> +       vmovdqu VEC_SIZE(%rsi), %ymm2
> +       VPCMPEQ VEC_SIZE(%rdi), %ymm2, %ymm2
> +       vpmovmskb %ymm2, %eax
> +       /* If all 4 VEC were equal eax will be all 1s so incl will overflow
> +          and set zero flag.  */
> +       incl    %eax
> +       jnz     L(return_neq0)
> +
> +       /* Less than 4 * VEC.  */
> +       cmpq    $(VEC_SIZE * 4), %rdx
> +       jbe     L(last_2x_vec)
> +
> +       /* Check third and fourth VEC no matter what.  */
> +       vmovdqu (VEC_SIZE * 2)(%rsi), %ymm3
> +       VPCMPEQ (VEC_SIZE * 2)(%rdi), %ymm3, %ymm3
> +       vpmovmskb %ymm3, %eax
> +       incl    %eax
> +       jnz     L(return_neq0)
> +
> +       vmovdqu (VEC_SIZE * 3)(%rsi), %ymm4
> +       VPCMPEQ (VEC_SIZE * 3)(%rdi), %ymm4, %ymm4
> +       vpmovmskb %ymm4, %eax
> +       incl    %eax
> +       jnz     L(return_neq0)
> +
> +       /* Go to 4x VEC loop.  */
> +       cmpq    $(VEC_SIZE * 8), %rdx
> +       ja      L(more_8x_vec)
> +
> +       /* Handle remainder of size = 4 * VEC + 1 to 8 * VEC without any
> +          branches.  */
> +
> +       /* Adjust rsi and rdi to avoid indexed address mode. This ends up
> +          saving 16 bytes of code and prevents unlamination and bottlenecks
> +          in the AGU.  */
> +       addq    %rdx, %rsi
> +       vmovdqu -(VEC_SIZE * 4)(%rsi), %ymm1
> +       vmovdqu -(VEC_SIZE * 3)(%rsi), %ymm2
> +       addq    %rdx, %rdi
> +
> +       VPCMPEQ -(VEC_SIZE * 4)(%rdi), %ymm1, %ymm1
> +       VPCMPEQ -(VEC_SIZE * 3)(%rdi), %ymm2, %ymm2
> +
> +       vmovdqu -(VEC_SIZE * 2)(%rsi), %ymm3
> +       VPCMPEQ -(VEC_SIZE * 2)(%rdi), %ymm3, %ymm3
> +       vmovdqu -VEC_SIZE(%rsi), %ymm4
> +       VPCMPEQ -VEC_SIZE(%rdi), %ymm4, %ymm4
> +
> +       /* Reduce VEC0 - VEC4.  */
> +       vpand   %ymm1, %ymm2, %ymm2
> +       vpand   %ymm3, %ymm4, %ymm4
> +       vpand   %ymm2, %ymm4, %ymm4
> +       vpmovmskb %ymm4, %eax
> +       incl    %eax
> +L(return_neq0):
> +L(return_vzeroupper):
> +       ZERO_UPPER_VEC_REGISTERS_RETURN
>
> -#include "memcmp-avx2-movbe.S"
> +       /* NB: p2align 5 here will ensure the L(loop_4x_vec) is also 32 byte
> +          aligned.  */
> +       .p2align 5
> +L(less_vec):
> +       /* Check if one or less char. This is necessary for size = 0 but is
> +          also faster for size = 1.  */
> +       cmpl    $1, %edx
> +       jbe     L(one_or_less)
> +
> +       /* Check if loading one VEC from either s1 or s2 could cause a page
> +          cross. This can have false positives but is by far the fastest
> +          method.  */
> +       movl    %edi, %eax
> +       orl     %esi, %eax
> +       andl    $(PAGE_SIZE - 1), %eax
> +       cmpl    $(PAGE_SIZE - VEC_SIZE), %eax
> +       jg      L(page_cross_less_vec)
> +
> +       /* No page cross possible.  */
> +       vmovdqu (%rsi), %ymm2
> +       VPCMPEQ (%rdi), %ymm2, %ymm2
> +       vpmovmskb %ymm2, %eax
> +       incl    %eax
> +       /* Result will be zero if s1 and s2 match. Otherwise first set bit
> +          will be first mismatch.  */
> +       bzhil   %edx, %eax, %eax
> +       VZEROUPPER_RETURN
> +
> +       /* Relatively cold but placing close to L(less_vec) for 2 byte jump
> +          encoding.  */
> +       .p2align 4
> +L(one_or_less):
> +       jb      L(zero)
> +       movzbl  (%rsi), %ecx
> +       movzbl  (%rdi), %eax
> +       subl    %ecx, %eax
> +       /* No ymm register was touched.  */
> +       ret
> +       /* Within the same 16 byte block is L(one_or_less).  */
> +L(zero):
> +       xorl    %eax, %eax
> +       ret
> +
> +       .p2align 4
> +L(last_1x_vec):
> +       vmovdqu -(VEC_SIZE * 1)(%rsi, %rdx), %ymm1
> +       VPCMPEQ -(VEC_SIZE * 1)(%rdi, %rdx), %ymm1, %ymm1
> +       vpmovmskb %ymm1, %eax
> +       incl    %eax
> +       VZEROUPPER_RETURN
> +
> +       .p2align 4
> +L(last_2x_vec):
> +       vmovdqu -(VEC_SIZE * 2)(%rsi, %rdx), %ymm1
> +       VPCMPEQ -(VEC_SIZE * 2)(%rdi, %rdx), %ymm1, %ymm1
> +       vmovdqu -(VEC_SIZE * 1)(%rsi, %rdx), %ymm2
> +       VPCMPEQ -(VEC_SIZE * 1)(%rdi, %rdx), %ymm2, %ymm2
> +       vpand   %ymm1, %ymm2, %ymm2
> +       vpmovmskb %ymm2, %eax
> +       incl    %eax
> +       VZEROUPPER_RETURN
> +
> +       .p2align 4
> +L(more_8x_vec):
> +       /* Set end of s1 in rdx.  */
> +       leaq    -(VEC_SIZE * 4)(%rdi, %rdx), %rdx
> +       /* rsi stores s2 - s1. This allows loop to only update one pointer.
> +        */
> +       subq    %rdi, %rsi
> +       /* Align s1 pointer.  */
> +       andq    $-VEC_SIZE, %rdi
> +       /* Adjust because first 4x vec were checked already.  */
> +       subq    $-(VEC_SIZE * 4), %rdi
> +       .p2align 4
> +L(loop_4x_vec):
> +       /* rsi has s2 - s1 so get correct address by adding s1 (in rdi).  */
> +       vmovdqu (%rsi, %rdi), %ymm1
> +       VPCMPEQ (%rdi), %ymm1, %ymm1
> +
> +       vmovdqu VEC_SIZE(%rsi, %rdi), %ymm2
> +       VPCMPEQ VEC_SIZE(%rdi), %ymm2, %ymm2
> +
> +       vmovdqu (VEC_SIZE * 2)(%rsi, %rdi), %ymm3
> +       VPCMPEQ (VEC_SIZE * 2)(%rdi), %ymm3, %ymm3
> +
> +       vmovdqu (VEC_SIZE * 3)(%rsi, %rdi), %ymm4
> +       VPCMPEQ (VEC_SIZE * 3)(%rdi), %ymm4, %ymm4
> +
> +       vpand   %ymm1, %ymm2, %ymm2
> +       vpand   %ymm3, %ymm4, %ymm4
> +       vpand   %ymm2, %ymm4, %ymm4
> +       vpmovmskb %ymm4, %eax
> +       incl    %eax
> +       jnz     L(return_neq1)
> +       subq    $-(VEC_SIZE * 4), %rdi
> +       /* Check if s1 pointer at end.  */
> +       cmpq    %rdx, %rdi
> +       jb      L(loop_4x_vec)
> +
> +       vmovdqu (VEC_SIZE * 3)(%rsi, %rdx), %ymm4
> +       VPCMPEQ (VEC_SIZE * 3)(%rdx), %ymm4, %ymm4
> +       subq    %rdx, %rdi
> +       /* rdi has 4 * VEC_SIZE - remaining length.  */
> +       cmpl    $(VEC_SIZE * 3), %edi
> +       jae     L(8x_last_1x_vec)
> +       /* Load regardless of branch.  */
> +       vmovdqu (VEC_SIZE * 2)(%rsi, %rdx), %ymm3
> +       VPCMPEQ (VEC_SIZE * 2)(%rdx), %ymm3, %ymm3
> +       cmpl    $(VEC_SIZE * 2), %edi
> +       jae     L(8x_last_2x_vec)
> +       /* Check last 4 VEC.  */
> +       vmovdqu VEC_SIZE(%rsi, %rdx), %ymm1
> +       VPCMPEQ VEC_SIZE(%rdx), %ymm1, %ymm1
> +
> +       vmovdqu (%rsi, %rdx), %ymm2
> +       VPCMPEQ (%rdx), %ymm2, %ymm2
> +
> +       vpand   %ymm3, %ymm4, %ymm4
> +       vpand   %ymm1, %ymm2, %ymm3
> +L(8x_last_2x_vec):
> +       vpand   %ymm3, %ymm4, %ymm4
> +L(8x_last_1x_vec):
> +       vpmovmskb %ymm4, %eax
> +       /* Restore s1 pointer to rdi.  */
> +       incl    %eax
> +L(return_neq1):
> +       VZEROUPPER_RETURN
> +
> +       /* Relatively cold case as page crosses are unexpected.  */
> +       .p2align 4
> +L(page_cross_less_vec):
> +       cmpl    $16, %edx
> +       jae     L(between_16_31)
> +       cmpl    $8, %edx
> +       ja      L(between_9_15)
> +       cmpl    $4, %edx
> +       jb      L(between_2_3)
> +       /* From 4 to 8 bytes.  No branch when size == 4.  */
> +       movl    (%rdi), %eax
> +       subl    (%rsi), %eax
> +       movl    -4(%rdi, %rdx), %ecx
> +       movl    -4(%rsi, %rdx), %edi
> +       subl    %edi, %ecx
> +       orl     %ecx, %eax
> +       ret
> +
> +       .p2align 4,, 8
> +L(between_16_31):
> +       /* From 16 to 31 bytes.  No branch when size == 16.  */
> +
> +       /* Safe to use xmm[0, 15] as no vzeroupper is needed so RTM safe.
> +        */
> +       vmovdqu (%rsi), %xmm1
> +       vpcmpeqb (%rdi), %xmm1, %xmm1
> +       vmovdqu -16(%rsi, %rdx), %xmm2
> +       vpcmpeqb -16(%rdi, %rdx), %xmm2, %xmm2
> +       vpand   %xmm1, %xmm2, %xmm2
> +       vpmovmskb %xmm2, %eax
> +       notw    %ax
> +       /* No ymm register was touched.  */
> +       ret
> +
> +       .p2align 4,, 8
> +L(between_9_15):
> +       /* From 9 to 15 bytes.  */
> +       movq    (%rdi), %rax
> +       subq    (%rsi), %rax
> +       movq    -8(%rdi, %rdx), %rcx
> +       movq    -8(%rsi, %rdx), %rdi
> +       subq    %rdi, %rcx
> +       orq     %rcx, %rax
> +       /* edx is guaranteed to be a non-zero int.  */
> +       cmovnz  %edx, %eax
> +       ret
> +
> +       /* Don't align. This is cold and aligning here will cause code
> +          to spill into next cache line.  */
> +L(between_2_3):
> +       /* From 2 to 3 bytes.  No branch when size == 2.  */
> +       movzwl  (%rdi), %eax
> +       movzwl  (%rsi), %ecx
> +       subl    %ecx, %eax
> +       movzbl  -1(%rdi, %rdx), %ecx
> +       /* All machines that support evex will insert a "merging uop"
> +          avoiding any serious partial register stalls.  */
> +       subb    -1(%rsi, %rdx), %cl
> +       orl     %ecx, %eax
> +       /* No ymm register was touched.  */
> +       ret
> +
> +    /* 2 Bytes from next cache line. */
> +END (MEMCMPEQ)
> +#endif
> --
> 2.25.1
>

LGTM.

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

Thanks.

-- 
H.J.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v1 6/6] x86_64: Add evex optimized __memcmpeq in memcmpeq-evex.S
  2021-10-27  2:43 ` [PATCH v1 6/6] x86_64: Add evex optimized __memcmpeq in memcmpeq-evex.S Noah Goldstein
  2021-10-27  2:44   ` Noah Goldstein
@ 2021-10-27 12:49   ` H.J. Lu
  1 sibling, 0 replies; 24+ messages in thread
From: H.J. Lu @ 2021-10-27 12:49 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: GNU C Library, Carlos O'Donell

On Tue, Oct 26, 2021 at 7:43 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> No bug. This commit adds new optimized __memcmpeq implementation for
> evex.
>
> The primary optimizations are:
>
> 1) skipping the logic to find the difference of the first mismatched
> byte.
>
> 2) not updating src/dst addresses as the non-equals logic does not
> need to be reused by different areas.
> ---
>  sysdeps/x86_64/multiarch/ifunc-impl-list.c |   1 -
>  sysdeps/x86_64/multiarch/ifunc-memcmpeq.h  |   1 -
>  sysdeps/x86_64/multiarch/memcmpeq-evex.S   | 308 ++++++++++++++++++++-
>  3 files changed, 304 insertions(+), 6 deletions(-)
>
> diff --git a/sysdeps/x86_64/multiarch/ifunc-impl-list.c b/sysdeps/x86_64/multiarch/ifunc-impl-list.c
> index 535450f52c..ea8df9f9b9 100644
> --- a/sysdeps/x86_64/multiarch/ifunc-impl-list.c
> +++ b/sysdeps/x86_64/multiarch/ifunc-impl-list.c
> @@ -52,7 +52,6 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
>               IFUNC_IMPL_ADD (array, i, __memcmpeq,
>                               (CPU_FEATURE_USABLE (AVX512VL)
>                                && CPU_FEATURE_USABLE (AVX512BW)
> -                   && CPU_FEATURE_USABLE (MOVBE)
>                                && CPU_FEATURE_USABLE (BMI2)),
>                               __memcmpeq_evex)
>               IFUNC_IMPL_ADD (array, i, __memcmpeq, 1, __memcmpeq_sse2))
> diff --git a/sysdeps/x86_64/multiarch/ifunc-memcmpeq.h b/sysdeps/x86_64/multiarch/ifunc-memcmpeq.h
> index e596c5048b..2ea38adf05 100644
> --- a/sysdeps/x86_64/multiarch/ifunc-memcmpeq.h
> +++ b/sysdeps/x86_64/multiarch/ifunc-memcmpeq.h
> @@ -34,7 +34,6 @@ IFUNC_SELECTOR (void)
>        && CPU_FEATURES_ARCH_P (cpu_features, AVX_Fast_Unaligned_Load))
>      {
>        if (CPU_FEATURE_USABLE_P (cpu_features, AVX512VL)
> -         && CPU_FEATURE_USABLE_P (cpu_features, MOVBE)
>           && CPU_FEATURE_USABLE_P (cpu_features, AVX512BW))
>         return OPTIMIZE1 (evex);
>
> diff --git a/sysdeps/x86_64/multiarch/memcmpeq-evex.S b/sysdeps/x86_64/multiarch/memcmpeq-evex.S
> index 951e1e9560..f27e732036 100644
> --- a/sysdeps/x86_64/multiarch/memcmpeq-evex.S
> +++ b/sysdeps/x86_64/multiarch/memcmpeq-evex.S
> @@ -16,8 +16,308 @@
>     License along with the GNU C Library; if not, see
>     <https://www.gnu.org/licenses/>.  */
>
> -#ifndef MEMCMP
> -# define MEMCMP        __memcmpeq_evex
> -#endif
> +#if IS_IN (libc)
> +
> +/* __memcmpeq is implemented as:
> +   1. Use ymm vector compares when possible. The only case where
> +      vector compares are not possible is when size < VEC_SIZE
> +      and loading from either s1 or s2 would cause a page cross.
> +   2. Use xmm vector compare when size >= 8 bytes.
> +   3. Optimistically compare up to first 4 * VEC_SIZE one at a time
> +      to check for early mismatches. Only do this if it's guaranteed the
> +      work is not wasted.
> +   4. If size is 8 * VEC_SIZE or less, unroll the loop.
> +   5. Compare 4 * VEC_SIZE at a time with the aligned first memory
> +      area.
> +   6. Use 2 vector compares when size is 2 * VEC_SIZE or less.
> +   7. Use 4 vector compares when size is 4 * VEC_SIZE or less.
> +   8. Use 8 vector compares when size is 8 * VEC_SIZE or less.  */
> +
> +# include <sysdep.h>
> +
> +# ifndef MEMCMPEQ
> +#  define MEMCMPEQ     __memcmpeq_evex
> +# endif
> +
> +# define VMOVU vmovdqu64
> +# define VPCMP vpcmpub
> +# define VPTEST        vptestmb
> +
> +# define VEC_SIZE      32
> +# define PAGE_SIZE     4096
> +
> +# define YMM0          ymm16
> +# define YMM1          ymm17
> +# define YMM2          ymm18
> +# define YMM3          ymm19
> +# define YMM4          ymm20
> +# define YMM5          ymm21
> +# define YMM6          ymm22
> +
> +
> +       .section .text.evex, "ax", @progbits
> +ENTRY_P2ALIGN (MEMCMPEQ, 6)
> +# ifdef __ILP32__
> +       /* Clear the upper 32 bits.  */
> +       movl    %edx, %edx
> +# endif
> +       cmp     $VEC_SIZE, %RDX_LP
> +       jb      L(less_vec)
> +
> +       /* From VEC to 2 * VEC.  No branch when size == VEC_SIZE.  */
> +       VMOVU   (%rsi), %YMM1
> +       /* Use compare not equals to directly check for mismatch.  */
> +       VPCMP   $4, (%rdi), %YMM1, %k1
> +       kmovd   %k1, %eax
> +       testl   %eax, %eax
> +       jnz     L(return_neq0)
> +
> +       cmpq    $(VEC_SIZE * 2), %rdx
> +       jbe     L(last_1x_vec)
> +
> +       /* Check second VEC no matter what.  */
> +       VMOVU   VEC_SIZE(%rsi), %YMM2
> +       VPCMP   $4, VEC_SIZE(%rdi), %YMM2, %k1
> +       kmovd   %k1, %eax
> +       testl   %eax, %eax
> +       jnz     L(return_neq0)
> +
> +       /* Less than 4 * VEC.  */
> +       cmpq    $(VEC_SIZE * 4), %rdx
> +       jbe     L(last_2x_vec)
> +
> +       /* Check third and fourth VEC no matter what.  */
> +       VMOVU   (VEC_SIZE * 2)(%rsi), %YMM3
> +       VPCMP   $4, (VEC_SIZE * 2)(%rdi), %YMM3, %k1
> +       kmovd   %k1, %eax
> +       testl   %eax, %eax
> +       jnz     L(return_neq0)
> +
> +       VMOVU   (VEC_SIZE * 3)(%rsi), %YMM4
> +       VPCMP   $4, (VEC_SIZE * 3)(%rdi), %YMM4, %k1
> +       kmovd   %k1, %eax
> +       testl   %eax, %eax
> +       jnz     L(return_neq0)
> +
> +       /* Go to 4x VEC loop.  */
> +       cmpq    $(VEC_SIZE * 8), %rdx
> +       ja      L(more_8x_vec)
> +
> +       /* Handle remainder of size = 4 * VEC + 1 to 8 * VEC without any
> +          branches.  */
> +
> +       VMOVU   -(VEC_SIZE * 4)(%rsi, %rdx), %YMM1
> +       VMOVU   -(VEC_SIZE * 3)(%rsi, %rdx), %YMM2
> +       addq    %rdx, %rdi
> +
> +       /* Wait to load from s1 until the addresses are adjusted, due to
> +          unlamination.  */
> +
> +       /* vpxor will be all 0s if s1 and s2 are equal. Otherwise it
> +          will have some 1s.  */
> +       vpxorq  -(VEC_SIZE * 4)(%rdi), %YMM1, %YMM1
> +       /* Ternary logic to xor -(VEC_SIZE * 3)(%rdi) with YMM2 while
> +          oring with YMM1. Result is stored in YMM1.  */
> +       vpternlogd $0xde, -(VEC_SIZE * 3)(%rdi), %YMM1, %YMM2
> +
> +       VMOVU   -(VEC_SIZE * 2)(%rsi, %rdx), %YMM3
> +       vpxorq  -(VEC_SIZE * 2)(%rdi), %YMM3, %YMM3
> +       /* Or together YMM1, YMM2, and YMM3 into YMM3.  */
> +       VMOVU   -(VEC_SIZE)(%rsi, %rdx), %YMM4
> +       vpxorq  -(VEC_SIZE)(%rdi), %YMM4, %YMM4
> +
> +       /* Or together YMM2, YMM3, and YMM4 into YMM4.  */
> +       vpternlogd $0xfe, %YMM2, %YMM3, %YMM4
>
> -#include "memcmp-evex-movbe.S"
> +       /* Compare YMM4 with 0. If any 1s s1 and s2 don't match.  */
> +       VPTEST  %YMM4, %YMM4, %k1
> +       kmovd   %k1, %eax
> +L(return_neq0):
> +       ret
> +
> +       /* Fits in padding needed to .p2align 5 L(less_vec).  */
> +L(last_1x_vec):
> +       VMOVU   -(VEC_SIZE * 1)(%rsi, %rdx), %YMM1
> +       VPCMP   $4, -(VEC_SIZE * 1)(%rdi, %rdx), %YMM1, %k1
> +       kmovd   %k1, %eax
> +       ret
> +
> +       /* NB: p2align 5 here will ensure the L(loop_4x_vec) is also 32
> +          byte aligned.  */
> +       .p2align 5
> +L(less_vec):
> +       /* Check if one or less char. This is necessary for size = 0 but
> +          is also faster for size = 1.  */
> +       cmpl    $1, %edx
> +       jbe     L(one_or_less)
> +
> +       /* Check if loading one VEC from either s1 or s2 could cause a
> +          page cross. This can have false positives but is by far the
> +          fastest method.  */
> +       movl    %edi, %eax
> +       orl     %esi, %eax
> +       andl    $(PAGE_SIZE - 1), %eax
> +       cmpl    $(PAGE_SIZE - VEC_SIZE), %eax
> +       jg      L(page_cross_less_vec)
> +
> +       /* No page cross possible.  */
> +       VMOVU   (%rsi), %YMM2
> +       VPCMP   $4, (%rdi), %YMM2, %k1
> +       kmovd   %k1, %eax
> +       /* Result will be zero if s1 and s2 match. Otherwise first set
> +          bit will be first mismatch.  */
> +       bzhil   %edx, %eax, %eax
> +       ret
> +
> +       /* Relatively cold but placing close to L(less_vec) for 2 byte
> +          jump encoding.  */
> +       .p2align 4
> +L(one_or_less):
> +       jb      L(zero)
> +       movzbl  (%rsi), %ecx
> +       movzbl  (%rdi), %eax
> +       subl    %ecx, %eax
> +       /* No ymm register was touched.  */
> +       ret
> +       /* Within the same 16 byte block is L(one_or_less).  */
> +L(zero):
> +       xorl    %eax, %eax
> +       ret
> +
> +       .p2align 4
> +L(last_2x_vec):
> +       VMOVU   -(VEC_SIZE * 2)(%rsi, %rdx), %YMM1
> +       vpxorq  -(VEC_SIZE * 2)(%rdi, %rdx), %YMM1, %YMM1
> +       VMOVU   -(VEC_SIZE * 1)(%rsi, %rdx), %YMM2
> +       vpternlogd $0xde, -(VEC_SIZE * 1)(%rdi, %rdx), %YMM1, %YMM2
> +       VPTEST  %YMM2, %YMM2, %k1
> +       kmovd   %k1, %eax
> +       ret
> +
> +       .p2align 4
> +L(more_8x_vec):
> +       /* Set end of s1 in rdx.  */
> +       leaq    -(VEC_SIZE * 4)(%rdi, %rdx), %rdx
> +       /* rsi stores s2 - s1. This allows loop to only update one
> +          pointer.  */
> +       subq    %rdi, %rsi
> +       /* Align s1 pointer.  */
> +       andq    $-VEC_SIZE, %rdi
> +       /* Adjust because first 4x vec were checked already.  */
> +       subq    $-(VEC_SIZE * 4), %rdi
> +       .p2align 4
> +L(loop_4x_vec):
> +       VMOVU   (%rsi, %rdi), %YMM1
> +       vpxorq  (%rdi), %YMM1, %YMM1
> +
> +       VMOVU   VEC_SIZE(%rsi, %rdi), %YMM2
> +       vpternlogd $0xde, (VEC_SIZE)(%rdi), %YMM1, %YMM2
> +
> +       VMOVU   (VEC_SIZE * 2)(%rsi, %rdi), %YMM3
> +       vpxorq  (VEC_SIZE * 2)(%rdi), %YMM3, %YMM3
> +
> +       VMOVU   (VEC_SIZE * 3)(%rsi, %rdi), %YMM4
> +       vpxorq  (VEC_SIZE * 3)(%rdi), %YMM4, %YMM4
> +
> +       vpternlogd $0xfe, %YMM2, %YMM3, %YMM4
> +       VPTEST  %YMM4, %YMM4, %k1
> +       kmovd   %k1, %eax
> +       testl   %eax, %eax
> +       jnz     L(return_neq2)
> +       subq    $-(VEC_SIZE * 4), %rdi
> +       cmpq    %rdx, %rdi
> +       jb      L(loop_4x_vec)
> +
> +       subq    %rdx, %rdi
> +       VMOVU   (VEC_SIZE * 3)(%rsi, %rdx), %YMM4
> +       vpxorq  (VEC_SIZE * 3)(%rdx), %YMM4, %YMM4
> +       /* rdi has 4 * VEC_SIZE - remaining length.  */
> +       cmpl    $(VEC_SIZE * 3), %edi
> +       jae     L(8x_last_1x_vec)
> +       /* Load regardless of branch.  */
> +       VMOVU   (VEC_SIZE * 2)(%rsi, %rdx), %YMM3
> +       /* Ternary logic to xor (VEC_SIZE * 2)(%rdx) with YMM3 while
> +          oring with YMM4. Result is stored in YMM4.  */
> +       vpternlogd $0xf6, (VEC_SIZE * 2)(%rdx), %YMM3, %YMM4
> +       cmpl    $(VEC_SIZE * 2), %edi
> +       jae     L(8x_last_2x_vec)
> +
> +       VMOVU   VEC_SIZE(%rsi, %rdx), %YMM2
> +       vpxorq  VEC_SIZE(%rdx), %YMM2, %YMM2
> +
> +       VMOVU   (%rsi, %rdx), %YMM1
> +       vpxorq  (%rdx), %YMM1, %YMM1
> +
> +       vpternlogd $0xfe, %YMM1, %YMM2, %YMM4
> +L(8x_last_1x_vec):
> +L(8x_last_2x_vec):
> +       VPTEST  %YMM4, %YMM4, %k1
> +       kmovd   %k1, %eax
> +L(return_neq2):
> +       ret
> +
> +       /* Relatively cold case as page crosses are unexpected.  */
> +       .p2align 4
> +L(page_cross_less_vec):
> +       cmpl    $16, %edx
> +       jae     L(between_16_31)
> +       cmpl    $8, %edx
> +       ja      L(between_9_15)
> +       cmpl    $4, %edx
> +       jb      L(between_2_3)
> +       /* From 4 to 8 bytes.  No branch when size == 4.  */
> +       movl    (%rdi), %eax
> +       subl    (%rsi), %eax
> +       movl    -4(%rdi, %rdx), %ecx
> +       movl    -4(%rsi, %rdx), %edi
> +       subl    %edi, %ecx
> +       orl     %ecx, %eax
> +       ret
> +
> +       .p2align 4,, 8
> +L(between_16_31):
> +       /* From 16 to 31 bytes.  No branch when size == 16.  */
> +
> +       /* Safe to use xmm[0, 15] as no vzeroupper is needed so RTM safe.
> +        */
> +       vmovdqu (%rsi), %xmm1
> +       vpcmpeqb (%rdi), %xmm1, %xmm1
> +       vmovdqu -16(%rsi, %rdx), %xmm2
> +       vpcmpeqb -16(%rdi, %rdx), %xmm2, %xmm2
> +       vpand   %xmm1, %xmm2, %xmm2
> +       vpmovmskb %xmm2, %eax
> +       notw    %ax
> +       /* No ymm register was touched.  */
> +       ret
> +
> +       .p2align 4,, 8
> +L(between_9_15):
> +       /* From 9 to 15 bytes.  */
> +       movq    (%rdi), %rax
> +       subq    (%rsi), %rax
> +       movq    -8(%rdi, %rdx), %rcx
> +       movq    -8(%rsi, %rdx), %rdi
> +       subq    %rdi, %rcx
> +       orq     %rcx, %rax
> +       /* edx is guaranteed to be a non-zero int.  */
> +       cmovnz  %edx, %eax
> +       ret
> +
> +       /* Don't align. This is cold and aligning here will cause code
> +          to spill into next cache line.  */
> +L(between_2_3):
> +       /* From 2 to 3 bytes.  No branch when size == 2.  */
> +       movzwl  (%rdi), %eax
> +       movzwl  (%rsi), %ecx
> +       subl    %ecx, %eax
> +       movzbl  -1(%rdi, %rdx), %ecx
> +       /* All machines that support evex will insert a "merging uop"
> +          avoiding any serious partial register stalls.  */
> +       subb    -1(%rsi, %rdx), %cl
> +       orl     %ecx, %eax
> +       /* No ymm register was touched.  */
> +       ret
> +
> +    /* 4 bytes from next cache line.  */
> +END (MEMCMPEQ)
> +#endif
> --
> 2.25.1
>
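
For reference, a minimal C sketch of the zero/non-zero strategy the quoted
loop implements: XOR corresponding chunks, OR all the differences into one
accumulator, and test that accumulator once at the end, with an overlapping
load covering the tail.  The 8-byte chunk type and the helper name are
illustrative only; the quoted code does the same thing on VEC_SIZE-byte
vectors with vpxorq/vpternlogd and a single VPTEST.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch of the accumulate-then-test idea above.  Returns zero iff the
   buffers are equal, and some nonzero value otherwise, which is all that
   __memcmpeq guarantees.  */
static int
memcmpeq_sketch (const void *s1, const void *s2, size_t n)
{
  const unsigned char *p1 = s1, *p2 = s2;
  uint64_t acc = 0, a, b;

  if (n < sizeof (uint64_t))
    /* A real implementation handles small sizes separately.  */
    return memcmp (p1, p2, n) != 0;

  while (n > sizeof (uint64_t))
    {
      memcpy (&a, p1, sizeof a);
      memcpy (&b, p2, sizeof b);
      acc |= a ^ b;   /* vpxorq / vpternlogd accumulation.  */
      p1 += sizeof (uint64_t);
      p2 += sizeof (uint64_t);
      n -= sizeof (uint64_t);
    }

  /* Overlapping load of the last 8 bytes, like the loads relative to the
     end of the buffers in the assembly above.  */
  memcpy (&a, p1 + n - sizeof a, sizeof a);
  memcpy (&b, p2 + n - sizeof b, sizeof b);
  acc |= a ^ b;

  return acc != 0;
}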

LGTM.

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

Thanks.

-- 
H.J.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v2 2/6] Benchtests: Add benchtests for __memcmpeq
  2021-10-27  2:43 ` [PATCH v1 2/6] Benchtests: Add benchtests for __memcmpeq Noah Goldstein
  2021-10-27 12:45   ` H.J. Lu
@ 2021-10-27 16:07   ` Noah Goldstein
  2021-10-27 17:59     ` H.J. Lu
  1 sibling, 1 reply; 24+ messages in thread
From: Noah Goldstein @ 2021-10-27 16:07 UTC (permalink / raw)
  To: libc-alpha

No bug. This commit adds __memcmpeq benchmarks. The benchmarks just
use the existing ones in memcmp. This will be useful for testing
implementations of __memcmpeq that do not just alias memcmp.
---
 benchtests/Makefile         |  2 +-
 benchtests/bench-memcmp.c   | 14 ++++++++------
 benchtests/bench-memcmpeq.c | 20 ++++++++++++++++++++
 3 files changed, 29 insertions(+), 7 deletions(-)
 create mode 100644 benchtests/bench-memcmpeq.c

diff --git a/benchtests/Makefile b/benchtests/Makefile
index b690aaf65b..7be0e47c47 100644
--- a/benchtests/Makefile
+++ b/benchtests/Makefile
@@ -103,7 +103,7 @@ bench := $(foreach B,$(filter bench-%,${BENCHSET}), ${${B}})
 endif
 
 # String function benchmarks.
-string-benchset := memccpy memchr memcmp memcpy memmem memmove \
+string-benchset := memccpy memchr memcmp memcmpeq memcpy memmem memmove \
 		   mempcpy memset rawmemchr stpcpy stpncpy strcasecmp strcasestr \
 		   strcat strchr strchrnul strcmp strcpy strcspn strlen \
 		   strncasecmp strncat strncmp strncpy strnlen strpbrk strrchr \
diff --git a/benchtests/bench-memcmp.c b/benchtests/bench-memcmp.c
index 0d6a93bf29..2cf65525bb 100644
--- a/benchtests/bench-memcmp.c
+++ b/benchtests/bench-memcmp.c
@@ -17,17 +17,21 @@
    <https://www.gnu.org/licenses/>.  */
 
 #define TEST_MAIN
-#ifdef WIDE
+#ifdef TEST_MEMCMPEQ
+# define TEST_NAME "__memcmpeq"
+# define SIMPLE_MEMCMP simple_memcmpeq
+#elif defined WIDE
 # define TEST_NAME "wmemcmp"
+# define SIMPLE_MEMCMP simple_wmemcmp
 #else
 # define TEST_NAME "memcmp"
+# define SIMPLE_MEMCMP simple_memcmp
 #endif
 #include "bench-string.h"
 #ifdef WIDE
 
-# define SIMPLE_MEMCMP simple_wmemcmp
 int
-simple_wmemcmp (const wchar_t *s1, const wchar_t *s2, size_t n)
+SIMPLE_MEMCMP (const wchar_t *s1, const wchar_t *s2, size_t n)
 {
   int ret = 0;
   /* Warning!
@@ -40,10 +44,8 @@ simple_wmemcmp (const wchar_t *s1, const wchar_t *s2, size_t n)
 #else
 # include <limits.h>
 
-# define SIMPLE_MEMCMP simple_memcmp
-
 int
-simple_memcmp (const char *s1, const char *s2, size_t n)
+SIMPLE_MEMCMP (const char *s1, const char *s2, size_t n)
 {
   int ret = 0;
 
diff --git a/benchtests/bench-memcmpeq.c b/benchtests/bench-memcmpeq.c
new file mode 100644
index 0000000000..e918d4f77c
--- /dev/null
+++ b/benchtests/bench-memcmpeq.c
@@ -0,0 +1,20 @@
+/* Measure __memcmpeq functions.
+   Copyright (C) 2015-2021 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#define TEST_MEMCMPEQ 1
+#include "bench-memcmp.c"
-- 
2.25.1


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v1 2/6] Benchtests: Add benchtests for __memcmpeq
  2021-10-27 12:45   ` H.J. Lu
@ 2021-10-27 16:08     ` Noah Goldstein
  0 siblings, 0 replies; 24+ messages in thread
From: Noah Goldstein @ 2021-10-27 16:08 UTC (permalink / raw)
  To: H.J. Lu; +Cc: GNU C Library, Carlos O'Donell

On Wed, Oct 27, 2021 at 7:46 AM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Tue, Oct 26, 2021 at 7:43 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >
> > No bug. This commit adds __memcmpeq benchmarks. The benchmarks just
> > use the existing ones in memcmp. This will be useful for testing
> > implementations of __memcmpeq that do not just alias memcmp.
> > ---
> >  benchtests/Makefile         |  2 +-
> >  benchtests/bench-memcmp.c   |  4 +++-
> >  benchtests/bench-memcmpeq.c | 20 ++++++++++++++++++++
> >  3 files changed, 24 insertions(+), 2 deletions(-)
> >  create mode 100644 benchtests/bench-memcmpeq.c
> >
> > diff --git a/benchtests/Makefile b/benchtests/Makefile
> > index b690aaf65b..7be0e47c47 100644
> > --- a/benchtests/Makefile
> > +++ b/benchtests/Makefile
> > @@ -103,7 +103,7 @@ bench := $(foreach B,$(filter bench-%,${BENCHSET}), ${${B}})
> >  endif
> >
> >  # String function benchmarks.
> > -string-benchset := memccpy memchr memcmp memcpy memmem memmove \
> > +string-benchset := memccpy memchr memcmp memcmpeq memcpy memmem memmove \
> >                    mempcpy memset rawmemchr stpcpy stpncpy strcasecmp strcasestr \
> >                    strcat strchr strchrnul strcmp strcpy strcspn strlen \
> >                    strncasecmp strncat strncmp strncpy strnlen strpbrk strrchr \
> > diff --git a/benchtests/bench-memcmp.c b/benchtests/bench-memcmp.c
> > index 0d6a93bf29..546b06e1ab 100644
> > --- a/benchtests/bench-memcmp.c
> > +++ b/benchtests/bench-memcmp.c
> > @@ -17,7 +17,9 @@
> >     <https://www.gnu.org/licenses/>.  */
> >
> >  #define TEST_MAIN
> > -#ifdef WIDE
> > +#ifdef TEST_MEMCMPEQ
> > +# define TEST_NAME "__memcmpeq"
> > +#elif defined WIDE
> >  # define TEST_NAME "wmemcmp"
> >  #else
> >  # define TEST_NAME "memcmp"
>
> Please rename simple_memcmp to simple_memcmpeq.

Fixed.

>
> > diff --git a/benchtests/bench-memcmpeq.c b/benchtests/bench-memcmpeq.c
> > new file mode 100644
> > index 0000000000..e918d4f77c
> > --- /dev/null
> > +++ b/benchtests/bench-memcmpeq.c
> > @@ -0,0 +1,20 @@
> > +/* Measure __memcmpeq functions.
> > +   Copyright (C) 2015-2021 Free Software Foundation, Inc.
> > +   This file is part of the GNU C Library.
> > +
> > +   The GNU C Library is free software; you can redistribute it and/or
> > +   modify it under the terms of the GNU Lesser General Public
> > +   License as published by the Free Software Foundation; either
> > +   version 2.1 of the License, or (at your option) any later version.
> > +
> > +   The GNU C Library is distributed in the hope that it will be useful,
> > +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> > +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > +   Lesser General Public License for more details.
> > +
> > +   You should have received a copy of the GNU Lesser General Public
> > +   License along with the GNU C Library; if not, see
> > +   <https://www.gnu.org/licenses/>.  */
> > +
> > +#define TEST_MEMCMPEQ 1
> > +#include "bench-memcmp.c"
> > --
> > 2.25.1
> >
>
>
> --
> H.J.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 2/6] Benchtests: Add benchtests for __memcmpeq
  2021-10-27 16:07   ` [PATCH v2 " Noah Goldstein
@ 2021-10-27 17:59     ` H.J. Lu
  0 siblings, 0 replies; 24+ messages in thread
From: H.J. Lu @ 2021-10-27 17:59 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: GNU C Library

On Wed, Oct 27, 2021 at 9:08 AM Noah Goldstein via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> No bug. This commit adds __memcmpeq benchmarks. The benchmarks just
> use the existing ones in memcmp. This will be useful for testing
> implementations of __memcmpeq that do not just alias memcmp.
> ---
>  benchtests/Makefile         |  2 +-
>  benchtests/bench-memcmp.c   | 14 ++++++++------
>  benchtests/bench-memcmpeq.c | 20 ++++++++++++++++++++
>  3 files changed, 29 insertions(+), 7 deletions(-)
>  create mode 100644 benchtests/bench-memcmpeq.c
>
> diff --git a/benchtests/Makefile b/benchtests/Makefile
> index b690aaf65b..7be0e47c47 100644
> --- a/benchtests/Makefile
> +++ b/benchtests/Makefile
> @@ -103,7 +103,7 @@ bench := $(foreach B,$(filter bench-%,${BENCHSET}), ${${B}})
>  endif
>
>  # String function benchmarks.
> -string-benchset := memccpy memchr memcmp memcpy memmem memmove \
> +string-benchset := memccpy memchr memcmp memcmpeq memcpy memmem memmove \
>                    mempcpy memset rawmemchr stpcpy stpncpy strcasecmp strcasestr \
>                    strcat strchr strchrnul strcmp strcpy strcspn strlen \
>                    strncasecmp strncat strncmp strncpy strnlen strpbrk strrchr \
> diff --git a/benchtests/bench-memcmp.c b/benchtests/bench-memcmp.c
> index 0d6a93bf29..2cf65525bb 100644
> --- a/benchtests/bench-memcmp.c
> +++ b/benchtests/bench-memcmp.c
> @@ -17,17 +17,21 @@
>     <https://www.gnu.org/licenses/>.  */
>
>  #define TEST_MAIN
> -#ifdef WIDE
> +#ifdef TEST_MEMCMPEQ
> +# define TEST_NAME "__memcmpeq"
> +# define SIMPLE_MEMCMP simple_memcmpeq
> +#elif defined WIDE
>  # define TEST_NAME "wmemcmp"
> +# define SIMPLE_MEMCMP simple_wmemcmp
>  #else
>  # define TEST_NAME "memcmp"
> +# define SIMPLE_MEMCMP simple_memcmp
>  #endif
>  #include "bench-string.h"
>  #ifdef WIDE
>
> -# define SIMPLE_MEMCMP simple_wmemcmp
>  int
> -simple_wmemcmp (const wchar_t *s1, const wchar_t *s2, size_t n)
> +SIMPLE_MEMCMP (const wchar_t *s1, const wchar_t *s2, size_t n)
>  {
>    int ret = 0;
>    /* Warning!
> @@ -40,10 +44,8 @@ simple_wmemcmp (const wchar_t *s1, const wchar_t *s2, size_t n)
>  #else
>  # include <limits.h>
>
> -# define SIMPLE_MEMCMP simple_memcmp
> -
>  int
> -simple_memcmp (const char *s1, const char *s2, size_t n)
> +SIMPLE_MEMCMP (const char *s1, const char *s2, size_t n)
>  {
>    int ret = 0;
>
> diff --git a/benchtests/bench-memcmpeq.c b/benchtests/bench-memcmpeq.c
> new file mode 100644
> index 0000000000..e918d4f77c
> --- /dev/null
> +++ b/benchtests/bench-memcmpeq.c
> @@ -0,0 +1,20 @@
> +/* Measure __memcmpeq functions.
> +   Copyright (C) 2015-2021 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <https://www.gnu.org/licenses/>.  */
> +
> +#define TEST_MEMCMPEQ 1
> +#include "bench-memcmp.c"
> --
> 2.25.1
>

LGTM.

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

Thanks.

-- 
H.J.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v1 1/6] String: Add __memcmpeq as build target
  2021-10-27 12:42 ` [PATCH v1 1/6] String: Add __memcmpeq as build target H.J. Lu
@ 2021-10-27 18:46   ` Noah Goldstein
  0 siblings, 0 replies; 24+ messages in thread
From: Noah Goldstein @ 2021-10-27 18:46 UTC (permalink / raw)
  To: H.J. Lu; +Cc: GNU C Library, Carlos O'Donell

On Wed, Oct 27, 2021 at 7:43 AM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Tue, Oct 26, 2021 at 7:43 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >
> > No bug. This commit just adds __memcmpeq as a build target so that
> > implementations for __memcmpeq that are not just aliases to memcmp can
> > be supported.
> > ---
> >  string/Makefile   |  2 +-
> >  string/memcmpeq.c | 24 ++++++++++++++++++++++++
> >  2 files changed, 25 insertions(+), 1 deletion(-)
> >  create mode 100644 string/memcmpeq.c
> >
> > diff --git a/string/Makefile b/string/Makefile
> > index 40d6fac133..2199dd30b7 100644
> > --- a/string/Makefile
> > +++ b/string/Makefile
> > @@ -34,7 +34,7 @@ routines      := strcat strchr strcmp strcoll strcpy strcspn          \
> >                    strerror _strerror strlen strnlen                    \
> >                    strncat strncmp strncpy                              \
> >                    strrchr strpbrk strsignal strspn strstr strtok       \
> > -                  strtok_r strxfrm memchr memcmp memmove memset        \
> > +                  strtok_r strxfrm memchr memcmp memcmpeq memmove memset       \
> >                    mempcpy bcopy bzero ffs ffsll stpcpy stpncpy         \
> >                    strcasecmp strncase strcasecmp_l strncase_l          \
> >                    memccpy memcpy wordcopy strsep strcasestr            \
> > diff --git a/string/memcmpeq.c b/string/memcmpeq.c
> > new file mode 100644
> > index 0000000000..08726325a8
> > --- /dev/null
> > +++ b/string/memcmpeq.c
> > @@ -0,0 +1,24 @@
> > +/* Copyright (C) 1991-2021 Free Software Foundation, Inc.
> > +   This file is part of the GNU C Library.
> > +
> > +   The GNU C Library is free software; you can redistribute it and/or
> > +   modify it under the terms of the GNU Lesser General Public
> > +   License as published by the Free Software Foundation; either
> > +   version 2.1 of the License, or (at your option) any later version.
> > +
> > +   The GNU C Library is distributed in the hope that it will be useful,
> > +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> > +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > +   Lesser General Public License for more details.
> > +
> > +   You should have received a copy of the GNU Lesser General Public
> > +   License along with the GNU C Library; if not, see
> > +   <https://www.gnu.org/licenses/>.  */
> > +
> > +/* This file is intentionally left empty.  It exists so that both
> > +   architectures which implement __memcmpeq separately from memcmp and
> > +   architectures which implement __memcmpeq by having it alias memcmp will
> > +   build.
> > +
> > +   The alias for __memcmpeq to memcmp for the C implementation is in
> > +   memcmp.c.  */
> > --
> > 2.25.1
> >
>
> LGTM.
>
> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
>
> Thanks.
>
> H.J.

Thanks. Pushed the patchset.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v1 1/6] String: Add __memcmpeq as build target
  2021-10-27  2:43 [PATCH v1 1/6] String: Add __memcmpeq as build target Noah Goldstein
                   ` (5 preceding siblings ...)
  2021-10-27 12:42 ` [PATCH v1 1/6] String: Add __memcmpeq as build target H.J. Lu
@ 2021-10-28 17:57 ` Joseph Myers
  2021-10-28 18:25   ` Noah Goldstein
  6 siblings, 1 reply; 24+ messages in thread
From: Joseph Myers @ 2021-10-28 17:57 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: libc-alpha

One of the patches in this series has broken the testsuite build for 
x86_64 --disable-multi-arch.

string/tester.o: in function `test_memcmpeq':
/scratch/jmyers/glibc-bot/src/glibc/string/tester.c:1456: undefined reference to `__memcmpeq'

https://sourceware.org/pipermail/libc-testresults/2021q4/008775.html

(11c88336e3013653d473fd58d8658d0cd30887e3 was OK, 
9b7cfab1802b71763da00982f772208544cf4a95 fails, so it's definitely 
something in this series.)

-- 
Joseph S. Myers
joseph@codesourcery.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v1 1/6] String: Add __memcmpeq as build target
  2021-10-28 17:57 ` Joseph Myers
@ 2021-10-28 18:25   ` Noah Goldstein
  0 siblings, 0 replies; 24+ messages in thread
From: Noah Goldstein @ 2021-10-28 18:25 UTC (permalink / raw)
  To: Joseph Myers; +Cc: GNU C Library

On Thu, Oct 28, 2021 at 12:57 PM Joseph Myers <joseph@codesourcery.com> wrote:
>
> One of the patches in this series has broken the testsuite build for
> x86_64 --disable-multi-arch.
>
> string/tester.o: in function `test_memcmpeq':
> /scratch/jmyers/glibc-bot/src/glibc/string/tester.c:1456: undefined reference to `__memcmpeq'
>
> https://sourceware.org/pipermail/libc-testresults/2021q4/008775.html
>
> (11c88336e3013653d473fd58d8658d0cd30887e3 was OK,
> 9b7cfab1802b71763da00982f772208544cf4a95 fails, so it's definitely
> something in this series.)
>

Yup, I was able to reproduce it.  Sorry, I only tested the build, not the
full xcheck, with --disable-multi-arch.

Have a fix. Testing right now.

> --
> Joseph S. Myers
> joseph@codesourcery.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 2/6] Benchtests: Add benchtests for __memcmpeq
  2021-11-03 13:39     ` Wilco Dijkstra
@ 2021-11-03 17:17       ` Noah Goldstein
  0 siblings, 0 replies; 24+ messages in thread
From: Noah Goldstein @ 2021-11-03 17:17 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: GNU C Library

On Wed, Nov 3, 2021 at 8:39 AM Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
>
> Hi Noah,
>
> > > As committed, it's literally the same benchmark as memcmp, however I
> > > expected it to call __memcmpeq... You'll need to change the definition of
> > > MEMCMP as well.
>
> > Not certain why, but it does seem to be getting the right ifuncs. Can update it
> > if desirable.
>
> > Is this also true for 'wmemcmp'? I don't see it defining MEMCMP either.
>
> Ifuncs use TEST_NAME (which is right) while there is also IMPL (MEMCMP, 1)
> which is not correct. It is defined in bench-string - it does support wmemcmp,
> but it is missing support for __memcmpeq.

That all makes sense. What's confusing is that despite not setting MEMCMP
for __memcmpeq, the benchmarks still test the correct implementations.

It appears that in FOR_EACH_IMPL:
https://sourceware.org/git/?p=glibc.git;a=blob;f=benchtests/bench-string.h;h=12f27473ff5714e50a94f3f1c5b06a56b99bc742;hb=HEAD#l200

The only check when selecting a function from the ifunc list is that its
name is not an exact match of whatever MEMCMP is defined as.

Will post a patch to fix this, but it is a bit odd.
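
A rough sketch of that check, using made-up names rather than the real
bench-string.h internals (the candidate list itself comes from TEST_NAME,
as noted above, which is why the right implementations end up being
benchmarked anyway):

#include <string.h>

/* Hypothetical model of the selection described above: a candidate from
   the ifunc list is skipped only when its name is an exact string match
   for whatever MEMCMP expands to, so the "__memcmpeq" ifuncs are still
   included even though MEMCMP was left defined as memcmp.  */
struct impl_entry
{
  const char *name;
  void *fn;
};

static int
should_benchmark (const struct impl_entry *e, const char *memcmp_name)
{
  return strcmp (e->name, memcmp_name) != 0;
}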

>
> Cheers,
> Wilco

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 2/6] Benchtests: Add benchtests for __memcmpeq
  2021-11-02 20:57   ` Noah Goldstein
@ 2021-11-03 13:39     ` Wilco Dijkstra
  2021-11-03 17:17       ` Noah Goldstein
  0 siblings, 1 reply; 24+ messages in thread
From: Wilco Dijkstra @ 2021-11-03 13:39 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: GNU C Library

Hi Noah,

> > As committed, it's literally the same benchmark as memcmp, however I
> > expected it to call __memcmpeq... You'll need to change the definition of
> > MEMCMP as well.

> Not certain why, but it does seem to be getting the right ifuncs. Can update it
> if desirable.

> Is this also true for 'wmemcmp'? I don't see it defining MEMCMP either.

Ifuncs use TEST_NAME (which is right), while there is also IMPL (MEMCMP, 1),
which is not correct.  It is defined in bench-string.h; it does support
wmemcmp, but it is missing support for __memcmpeq.
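
A minimal sketch of the kind of change being asked for, reusing the macro
layout from the v2 patch quoted earlier in the thread (the actual fix that
gets posted may differ, and __memcmpeq would still need a declaration
visible to the benchmark):

/* bench-memcmp.c: also repoint MEMCMP when building the __memcmpeq
   variant, so that IMPL (MEMCMP, 1) registers __memcmpeq rather than
   memcmp.  Sketch only.  */
#ifdef TEST_MEMCMPEQ
# define TEST_NAME "__memcmpeq"
# define SIMPLE_MEMCMP simple_memcmpeq
# define MEMCMP __memcmpeq
#elif defined WIDE
# define TEST_NAME "wmemcmp"
# define SIMPLE_MEMCMP simple_wmemcmp
#else
# define TEST_NAME "memcmp"
# define SIMPLE_MEMCMP simple_memcmp
#endif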

Cheers,
Wilco

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 2/6] Benchtests: Add benchtests for __memcmpeq
  2021-11-01 16:01 ` Noah Goldstein
@ 2021-11-02 20:57   ` Noah Goldstein
  2021-11-03 13:39     ` Wilco Dijkstra
  0 siblings, 1 reply; 24+ messages in thread
From: Noah Goldstein @ 2021-11-02 20:57 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: GNU C Library

On Mon, Nov 1, 2021 at 11:01 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> On Mon, Nov 1, 2021 at 8:56 AM Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
> >
> > Hi Noah,
> >
> > As committed, it's literally the same benchmark as memcmp, however I
> > expected it to call __memcmpeq... You'll need to change the definition of
> > MEMCMP as well.
>

Not certain why, but it does seem to be getting the right ifuncs. Can update it
if desirable.

> Is this also true for 'wmemcmp'? I don't see it defining MEMCMP either.
>
> >
> > Cheers,
> > Wilco

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 2/6] Benchtests: Add benchtests for __memcmpeq
  2021-11-01 13:56 [PATCH v2 2/6] Benchtests: Add benchtests for __memcmpeq Wilco Dijkstra
@ 2021-11-01 16:01 ` Noah Goldstein
  2021-11-02 20:57   ` Noah Goldstein
  0 siblings, 1 reply; 24+ messages in thread
From: Noah Goldstein @ 2021-11-01 16:01 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: GNU C Library

On Mon, Nov 1, 2021 at 8:56 AM Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
>
> Hi Noah,
>
> As committed, it's literally the same benchmark as memcmp, however I
> expected it to call __memcmpeq... You'll need to change the definition of
> MEMCMP as well.

Is this also true for 'wmemcmp'? I don't see it defining MEMCMP either.

>
> Cheers,
> Wilco

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v2 2/6] Benchtests: Add benchtests for __memcmpeq
@ 2021-11-01 13:56 Wilco Dijkstra
  2021-11-01 16:01 ` Noah Goldstein
  0 siblings, 1 reply; 24+ messages in thread
From: Wilco Dijkstra @ 2021-11-01 13:56 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: 'GNU C Library'

Hi Noah,

As committed, it's literally the same benchmark as memcmp, however I
expected it to call __memcmpeq... You'll need to change the definition of
MEMCMP as well.

Cheers,
Wilco

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2021-11-03 17:17 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-10-27  2:43 [PATCH v1 1/6] String: Add __memcmpeq as build target Noah Goldstein
2021-10-27  2:43 ` [PATCH v1 2/6] Benchtests: Add benchtests for __memcmpeq Noah Goldstein
2021-10-27 12:45   ` H.J. Lu
2021-10-27 16:08     ` Noah Goldstein
2021-10-27 16:07   ` [PATCH v2 " Noah Goldstein
2021-10-27 17:59     ` H.J. Lu
2021-10-27  2:43 ` [PATCH v1 3/6] x86_64: Add support for __memcmpeq using sse2, avx2, and evex Noah Goldstein
2021-10-27 12:47   ` H.J. Lu
2021-10-27  2:43 ` [PATCH v1 4/6] x86_64: Add sse2 optimized __memcmpeq in memcmp-sse2.S Noah Goldstein
2021-10-27 12:48   ` H.J. Lu
2021-10-27  2:43 ` [PATCH v1 5/6] x86_64: Add avx2 optimized __memcmpeq in memcmpeq-avx2.S Noah Goldstein
2021-10-27 12:48   ` H.J. Lu
2021-10-27  2:43 ` [PATCH v1 6/6] x86_64: Add evex optimized __memcmpeq in memcmpeq-evex.S Noah Goldstein
2021-10-27  2:44   ` Noah Goldstein
2021-10-27 12:49   ` H.J. Lu
2021-10-27 12:42 ` [PATCH v1 1/6] String: Add __memcmpeq as build target H.J. Lu
2021-10-27 18:46   ` Noah Goldstein
2021-10-28 17:57 ` Joseph Myers
2021-10-28 18:25   ` Noah Goldstein
2021-11-01 13:56 [PATCH v2 2/6] Benchtests: Add benchtests for __memcmpeq Wilco Dijkstra
2021-11-01 16:01 ` Noah Goldstein
2021-11-02 20:57   ` Noah Goldstein
2021-11-03 13:39     ` Wilco Dijkstra
2021-11-03 17:17       ` Noah Goldstein

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).