public inbox for libc-alpha@sourceware.org
* [PATCH 0/4] x86: Improve ERMS usage on Zen3+
@ 2023-10-31 20:09 Adhemerval Zanella
  2023-10-31 20:09 ` [PATCH 1/4] elf: Add a way to check if tunable is set (BZ 27069) Adhemerval Zanella
                   ` (4 more replies)
  0 siblings, 5 replies; 9+ messages in thread
From: Adhemerval Zanella @ 2023-10-31 20:09 UTC (permalink / raw)
  To: libc-alpha, Noah Goldstein, H . J . Lu, Bruce Merry

For the sizes where REP MOVSB and REP STOSB are used on Zen3+ cores, the
resulting performance is lower than with vectorized instructions (and some
input alignments show a very large performance gap, as indicated by
BZ#30995).

glibc enables ERMS on AMD cores for sizes between 2113
(rep_movsb_threshold) and the L2 cache size (rep_movsb_stop_threshold,
524288 on a Zen3 core).  Using the benchmarks provided in BZ#30995, memcpy
on a Ryzen 9 5900X shows:

  Size (bytes)   Destination Alignment      Throughput (GB/s)
  2113                               0                84.2448              
  2113                              15                 4.4310
  524287                             0                57.1122 
  524287                            15                4.34671

While using vectorized instructions, selected with the tunable
GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000, it shows:

  Size (bytes)   Destination Alignment      Throughput (GB/s)
  2113                               0               124.1830             
  2113                              15               121.8720
  524287                             0                58.3212 
  524287                            15                58.5352 

Increasing the number of concurrent jobs does not show any improvement of
ERMS over vectorized instructions either.  The ERMS performance difference
narrows when input alignments are equal, although it still does not reach
parity with the vectorized path.

memset shows a similar performance improvement with vectorized
instructions instead of REP STOSB.  On the same machine, the default
strategy shows:

  Size (bytes)   Destination Alignment      Throughput (GB/s)
  2113                               0                68.0113            
  2113                              15                56.1880
  524287                             0               119.3670
  524287                            15               116.2590

While with GLIBC_TUNABLES=glibc.cpu.x86_rep_stosb_threshold=1000000: 

  Size (bytes)   Destination Alignment      Throughput (GB/s)
  2113                               0               133.2310
  2113                              15               132.5800
  524287                             0               112.0650
  524287                            15               118.0960

I also saw a slight performance increase on 502.gcc_r (1 copy), where
the result went from 9.82 to 9.85.  This benchmark stresses both memcpy
and memset heavily.
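
A minimal sketch of the kind of measurement used above (not the BZ#30995
benchmark itself; the buffer setup and iteration count here are arbitrary):

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <time.h>

  int
  main (int argc, char *argv[])
  {
    size_t size = argc > 1 ? strtoul (argv[1], NULL, 0) : 2113;
    size_t align = argc > 2 ? strtoul (argv[2], NULL, 0) : 0;
    size_t iters = 1000000;
    /* Round the buffer size up so aligned_alloc gets a multiple of 64.  */
    size_t bufsz = (size + align + 63) & ~(size_t) 63;
    char *src = aligned_alloc (64, bufsz);
    char *dst = aligned_alloc (64, bufsz);
    memset (src, 0x5a, bufsz);

    struct timespec t0, t1;
    clock_gettime (CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < iters; i++)
      memcpy (dst + align, src, size);
    clock_gettime (CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf ("%zu bytes, dst align %zu: %.4f GB/s\n", size, align,
            (double) size * iters / secs / 1e9);
    return 0;
  }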

The first patch adds a way to check whether a tunable is set (BZ 27069),
which is used in the second patch to select the best strategy.  The BZ 30994
fix also adds a new tunable, glibc.cpu.x86_rep_movsb_stop_threshold, so
the caller can specify a size range to force ERMS usage (from the BZ #30994
discussion, there are some cases where ERMS is profitable).  Patch 3
disables ERMS usage for memset on Zen3+, and patch 4 slightly expands
the x86 memset documentation.

Adhemerval Zanella (4):
  elf: Add a way to check if tunable is set (BZ 27069)
  x86: Fix Zen3/Zen4 ERMS selection (BZ 30994)
  x86: Do not prefer ERMS for memset on Zen3+
  x86: Expand the comment on when REP STOSB is used on memset

 elf/dl-tunable-types.h                        |  1 +
 elf/dl-tunables.c                             | 40 ++++++++++
 elf/dl-tunables.h                             | 28 +++++++
 elf/dl-tunables.list                          |  1 +
 manual/tunables.texi                          |  9 +++
 scripts/gen-tunables.awk                      |  4 +-
 sysdeps/x86/dl-cacheinfo.h                    | 74 ++++++++++++-------
 sysdeps/x86/dl-tunables.list                  | 10 +++
 .../multiarch/memset-vec-unaligned-erms.S     |  4 +-
 9 files changed, 142 insertions(+), 29 deletions(-)

-- 
2.34.1



* [PATCH 1/4] elf: Add a way to check if tunable is set (BZ 27069)
  2023-10-31 20:09 [PATCH 0/4] x86: Improve ERMS usage on Zen3+ Adhemerval Zanella
@ 2023-10-31 20:09 ` Adhemerval Zanella
  2023-10-31 20:09 ` [PATCH 2/4] x86: Fix Zen3/Zen4 ERMS selection (BZ 30994) Adhemerval Zanella
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 9+ messages in thread
From: Adhemerval Zanella @ 2023-10-31 20:09 UTC (permalink / raw)
  To: libc-alpha, Noah Goldstein, H . J . Lu, Bruce Merry

The tunable already keeps a field indicating whether it was initialized.
To query the default value, it is easier to add a new constant field.

The patch adds two new macros, TUNABLE_GET_DEFAULT and
TUNABLE_IS_INITIALIZED: the former gets the default value
with a signature similar to TUNABLE_GET, while the latter returns whether
the tunable was set by the environment.
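
A rough usage sketch (assuming TOP_NAMESPACE/TUNABLE_NAMESPACE are defined
by the module; the x86_rep_movsb_threshold tunable is used here purely as
an illustration):

  size_t threshold = TUNABLE_GET (x86_rep_movsb_threshold, size_t, NULL);
  if (!TUNABLE_IS_INITIALIZED (x86_rep_movsb_threshold))
    threshold = TUNABLE_GET_DEFAULT (x86_rep_movsb_threshold, size_t);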

Checked on x86_64-linux-gnu.
---
 elf/dl-tunable-types.h   |  1 +
 elf/dl-tunables.c        | 40 ++++++++++++++++++++++++++++++++++++++++
 elf/dl-tunables.h        | 28 ++++++++++++++++++++++++++++
 elf/dl-tunables.list     |  1 +
 scripts/gen-tunables.awk |  4 ++--
 5 files changed, 72 insertions(+), 2 deletions(-)

diff --git a/elf/dl-tunable-types.h b/elf/dl-tunable-types.h
index c88332657e..c41a3b3bdb 100644
--- a/elf/dl-tunable-types.h
+++ b/elf/dl-tunable-types.h
@@ -61,6 +61,7 @@ struct _tunable
 {
   const char name[TUNABLE_NAME_MAX];	/* Internal name of the tunable.  */
   tunable_type_t type;			/* Data type of the tunable.  */
+  const tunable_val_t def;		/* The default value.  */
   tunable_val_t val;			/* The value.  */
   bool initialized;			/* Flag to indicate that the tunable is
 					   initialized.  */
diff --git a/elf/dl-tunables.c b/elf/dl-tunables.c
index cae67efa0a..79b4d542a3 100644
--- a/elf/dl-tunables.c
+++ b/elf/dl-tunables.c
@@ -145,6 +145,13 @@ tunable_initialize (tunable_t *cur, const char *strval)
   do_tunable_update_val (cur, &val, NULL, NULL);
 }
 
+bool
+__tunable_is_initialized (tunable_id_t id)
+{
+  return tunable_list[id].initialized;
+}
+rtld_hidden_def (__tunable_is_initialized)
+
 void
 __tunable_set_val (tunable_id_t id, tunable_val_t *valp, tunable_num_t *minp,
 		   tunable_num_t *maxp)
@@ -388,6 +395,39 @@ __tunables_print (void)
     }
 }
 
+void
+__tunable_get_default (tunable_id_t id, void *valp)
+{
+  tunable_t *cur = &tunable_list[id];
+
+  switch (cur->type.type_code)
+    {
+    case TUNABLE_TYPE_UINT_64:
+	{
+	  *((uint64_t *) valp) = (uint64_t) cur->def.numval;
+	  break;
+	}
+    case TUNABLE_TYPE_INT_32:
+	{
+	  *((int32_t *) valp) = (int32_t) cur->def.numval;
+	  break;
+	}
+    case TUNABLE_TYPE_SIZE_T:
+	{
+	  *((size_t *) valp) = (size_t) cur->def.numval;
+	  break;
+	}
+    case TUNABLE_TYPE_STRING:
+	{
+	  *((const char **)valp) = cur->def.strval;
+	  break;
+	}
+    default:
+      __builtin_unreachable ();
+    }
+}
+rtld_hidden_def (__tunable_get_default)
+
 /* Set the tunable value.  This is called by the module that the tunable exists
    in. */
 void
diff --git a/elf/dl-tunables.h b/elf/dl-tunables.h
index 45c191e021..0df4dde24e 100644
--- a/elf/dl-tunables.h
+++ b/elf/dl-tunables.h
@@ -45,18 +45,26 @@ typedef void (*tunable_callback_t) (tunable_val_t *);
 
 extern void __tunables_init (char **);
 extern void __tunables_print (void);
+extern bool __tunable_is_initialized (tunable_id_t);
 extern void __tunable_get_val (tunable_id_t, void *, tunable_callback_t);
 extern void __tunable_set_val (tunable_id_t, tunable_val_t *, tunable_num_t *,
 			       tunable_num_t *);
+extern void __tunable_get_default (tunable_id_t id, void *valp);
 rtld_hidden_proto (__tunables_init)
 rtld_hidden_proto (__tunables_print)
+rtld_hidden_proto (__tunable_is_initialized)
 rtld_hidden_proto (__tunable_get_val)
 rtld_hidden_proto (__tunable_set_val)
+rtld_hidden_proto (__tunable_get_default)
 
 /* Define TUNABLE_GET and TUNABLE_SET in short form if TOP_NAMESPACE and
    TUNABLE_NAMESPACE are defined.  This is useful shorthand to get and set
    tunables within a module.  */
 #if defined TOP_NAMESPACE && defined TUNABLE_NAMESPACE
+# define TUNABLE_IS_INITIALIZED(__id) \
+  TUNABLE_IS_INITIALIZED_FULL(TOP_NAMESPACE, TUNABLE_NAMESPACE, __id)
+# define TUNABLE_GET_DEFAULT(__id, __type) \
+  TUNABLE_GET_DEFAULT_FULL(TOP_NAMESPACE, TUNABLE_NAMESPACE,__id, __type)
 # define TUNABLE_GET(__id, __type, __cb) \
   TUNABLE_GET_FULL (TOP_NAMESPACE, TUNABLE_NAMESPACE, __id, __type, __cb)
 # define TUNABLE_SET(__id, __val) \
@@ -65,6 +73,10 @@ rtld_hidden_proto (__tunable_set_val)
   TUNABLE_SET_WITH_BOUNDS_FULL (TOP_NAMESPACE, TUNABLE_NAMESPACE, __id, \
 				__val, __min, __max)
 #else
+# define TUNABLE_IS_INITIALIZED(__top, __ns, __id) \
+  TUNABLE_IS_INITIALIZED_FULL(__top, __ns, __id)
+# define TUNABLE_GET_DEFAULT(__top, __ns, __id, __type) \
+  TUNABLE_GET_DEFAULT_FULL(__top, __ns, __id, __type)
 # define TUNABLE_GET(__top, __ns, __id, __type, __cb) \
   TUNABLE_GET_FULL (__top, __ns, __id, __type, __cb)
 # define TUNABLE_SET(__top, __ns, __id, __val) \
@@ -73,6 +85,22 @@ rtld_hidden_proto (__tunable_set_val)
   TUNABLE_SET_WITH_BOUNDS_FULL (__top, __ns, __id, __val, __min, __max)
 #endif
 
+/* Return whether the tunable was initialized by the environment variable.  */
+#define TUNABLE_IS_INITIALIZED_FULL(__top, __ns, __id) \
+({									      \
+  tunable_id_t id = TUNABLE_ENUM_NAME (__top, __ns, __id);		      \
+  __tunable_is_initialized (id);					      \
+})
+
+/* Return the default value of the tunable.  */
+#define TUNABLE_GET_DEFAULT_FULL(__top, __ns, __id, __type) \
+({									      \
+  tunable_id_t id = TUNABLE_ENUM_NAME (__top, __ns, __id);		      \
+  __type __ret;								      \
+  __tunable_get_default (id, &__ret);					      \
+  __ret;								      \
+})
+
 /* Get and return a tunable value.  If the tunable was set externally and __CB
    is defined then call __CB before returning the value.  */
 #define TUNABLE_GET_FULL(__top, __ns, __id, __type, __cb) \
diff --git a/elf/dl-tunables.list b/elf/dl-tunables.list
index 695ba7192e..5bb858b1d8 100644
--- a/elf/dl-tunables.list
+++ b/elf/dl-tunables.list
@@ -20,6 +20,7 @@
 # type: Defaults to STRING
 # minval: Optional minimum acceptable value
 # maxval: Optional maximum acceptable value
+# default: Optional default value (if not specified it will be 0 or "")
 # env_alias: An alias environment variable
 # security_level: Specify security level of the tunable for AT_SECURE binaries.
 # 		  Valid values are:
diff --git a/scripts/gen-tunables.awk b/scripts/gen-tunables.awk
index d6de100df0..9726b05217 100644
--- a/scripts/gen-tunables.awk
+++ b/scripts/gen-tunables.awk
@@ -177,8 +177,8 @@ END {
     n = indices[2];
     m = indices[3];
     printf ("  {TUNABLE_NAME_S(%s, %s, %s)", t, n, m)
-    printf (", {TUNABLE_TYPE_%s, %s, %s}, {%s}, false, TUNABLE_SECLEVEL_%s, %s},\n",
-	    types[t,n,m], minvals[t,n,m], maxvals[t,n,m],
+    printf (", {TUNABLE_TYPE_%s, %s, %s}, {%s}, {%s}, false, TUNABLE_SECLEVEL_%s, %s},\n",
+	    types[t,n,m], minvals[t,n,m], maxvals[t,n,m], default_val[t,n,m],
 	    default_val[t,n,m], security_level[t,n,m], env_alias[t,n,m]);
   }
   print "};"
-- 
2.34.1



* [PATCH 2/4] x86: Fix Zen3/Zen4 ERMS selection (BZ 30994)
  2023-10-31 20:09 [PATCH 0/4] x86: Improve ERMS usage on Zen3+ Adhemerval Zanella
  2023-10-31 20:09 ` [PATCH 1/4] elf: Add a way to check if tunable is set (BZ 27069) Adhemerval Zanella
@ 2023-10-31 20:09 ` Adhemerval Zanella
  2023-10-31 20:09 ` [PATCH 3/4] x86: Do not prefer ERMS for memset on Zen3+ Adhemerval Zanella
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 9+ messages in thread
From: Adhemerval Zanella @ 2023-10-31 20:09 UTC (permalink / raw)
  To: libc-alpha, Noah Goldstein, H . J . Lu, Bruce Merry

The REP MOVSB usage on memcpy/memmove does not show any performance gain
on Zen3/Zen4 cores compared to the vectorized loops.  Also, as reported
in BZ 30994, if the source is aligned and the destination is not, the
performance can be as much as 20x slower.

The performance difference is really noticeable with small buffer sizes,
closer to the lower bound limit where memcpy/memmove starts to use ERMS.
The performance of REP MOVSB is similar to the vectorized instructions
only near the upper size limit (the L2 cache size).  Also, there is no
drawback from multiple cores sharing the cache.

A new tunable, glibc.cpu.x86_rep_movsb_stop_threshold, allows the user to
set the upper bound size for using 'rep movsb'.
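
As a rough sketch of the effect (the real logic lives in the memmove
assembly implementation; types and declarations are omitted here), the two
thresholds bound the size range where memcpy/memmove picks REP MOVSB:

  /* Sketch only: mirrors the check done against the glibc internals that
     are set from the tunables.  */
  static inline _Bool
  use_rep_movsb (unsigned long int size)
  {
    return size >= __x86_rep_movsb_threshold
	   && size < __x86_rep_movsb_stop_threshold;
  }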

Checked on x86_64-linux-gnu on Zen3.
---
 manual/tunables.texi         |  9 ++++++
 sysdeps/x86/dl-cacheinfo.h   | 58 +++++++++++++++++++++++-------------
 sysdeps/x86/dl-tunables.list | 10 +++++++
 3 files changed, 56 insertions(+), 21 deletions(-)

diff --git a/manual/tunables.texi b/manual/tunables.texi
index 776fd93fd9..5d3263bc2e 100644
--- a/manual/tunables.texi
+++ b/manual/tunables.texi
@@ -570,6 +570,15 @@ greater than zero, and currently defaults to 2048 bytes.
 This tunable is specific to i386 and x86-64.
 @end deftp
 
+@deftp Tunable glibc.cpu.x86_rep_movsb_stop_threshold
+The @code{glibc.cpu.x86_rep_movsb_stop_threshold} tunable allows the user to
+set the threshold in bytes at which to stop using "rep movsb".  The value
+must be greater than zero; the default depends on the CPU and on the
+cache size.
+
+This tunable is specific to i386 and x86-64.
+@end deftp
+
 @deftp Tunable glibc.cpu.x86_rep_stosb_threshold
 The @code{glibc.cpu.x86_rep_stosb_threshold} tunable allows the user to
 set threshold in bytes to start using "rep stosb".  The value must be
diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
index 87486054f9..51e5ba200f 100644
--- a/sysdeps/x86/dl-cacheinfo.h
+++ b/sysdeps/x86/dl-cacheinfo.h
@@ -784,6 +784,14 @@ get_common_cache_info (long int *shared_ptr, long int * shared_per_thread_ptr, u
   *threads_ptr = threads;
 }
 
+static inline bool
+is_rep_movsb_stop_threshold_valid (unsigned long int v)
+{
+  unsigned long int rep_movsb_threshold
+    = TUNABLE_GET (x86_rep_movsb_threshold, long int, NULL);
+  return v > rep_movsb_threshold;
+}
+
 static void
 dl_init_cacheinfo (struct cpu_features *cpu_features)
 {
@@ -791,7 +799,6 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
   long int data = -1;
   long int shared = -1;
   long int shared_per_thread = -1;
-  long int core = -1;
   unsigned int threads = 0;
   unsigned long int level1_icache_size = -1;
   unsigned long int level1_icache_linesize = -1;
@@ -809,7 +816,6 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
   if (cpu_features->basic.kind == arch_kind_intel)
     {
       data = handle_intel (_SC_LEVEL1_DCACHE_SIZE, cpu_features);
-      core = handle_intel (_SC_LEVEL2_CACHE_SIZE, cpu_features);
       shared = handle_intel (_SC_LEVEL3_CACHE_SIZE, cpu_features);
       shared_per_thread = shared;
 
@@ -822,7 +828,8 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
 	= handle_intel (_SC_LEVEL1_DCACHE_ASSOC, cpu_features);
       level1_dcache_linesize
 	= handle_intel (_SC_LEVEL1_DCACHE_LINESIZE, cpu_features);
-      level2_cache_size = core;
+      level2_cache_size
+	= handle_intel (_SC_LEVEL2_CACHE_SIZE, cpu_features);
       level2_cache_assoc
 	= handle_intel (_SC_LEVEL2_CACHE_ASSOC, cpu_features);
       level2_cache_linesize
@@ -835,12 +842,12 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
       level4_cache_size
 	= handle_intel (_SC_LEVEL4_CACHE_SIZE, cpu_features);
 
-      get_common_cache_info (&shared, &shared_per_thread, &threads, core);
+      get_common_cache_info (&shared, &shared_per_thread, &threads,
+			     level2_cache_size);
     }
   else if (cpu_features->basic.kind == arch_kind_zhaoxin)
     {
       data = handle_zhaoxin (_SC_LEVEL1_DCACHE_SIZE);
-      core = handle_zhaoxin (_SC_LEVEL2_CACHE_SIZE);
       shared = handle_zhaoxin (_SC_LEVEL3_CACHE_SIZE);
       shared_per_thread = shared;
 
@@ -849,19 +856,19 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
       level1_dcache_size = data;
       level1_dcache_assoc = handle_zhaoxin (_SC_LEVEL1_DCACHE_ASSOC);
       level1_dcache_linesize = handle_zhaoxin (_SC_LEVEL1_DCACHE_LINESIZE);
-      level2_cache_size = core;
+      level2_cache_size = handle_zhaoxin (_SC_LEVEL2_CACHE_SIZE);
       level2_cache_assoc = handle_zhaoxin (_SC_LEVEL2_CACHE_ASSOC);
       level2_cache_linesize = handle_zhaoxin (_SC_LEVEL2_CACHE_LINESIZE);
       level3_cache_size = shared;
       level3_cache_assoc = handle_zhaoxin (_SC_LEVEL3_CACHE_ASSOC);
       level3_cache_linesize = handle_zhaoxin (_SC_LEVEL3_CACHE_LINESIZE);
 
-      get_common_cache_info (&shared, &shared_per_thread, &threads, core);
+      get_common_cache_info (&shared, &shared_per_thread, &threads,
+			     level2_cache_size);
     }
   else if (cpu_features->basic.kind == arch_kind_amd)
     {
       data = handle_amd (_SC_LEVEL1_DCACHE_SIZE);
-      core = handle_amd (_SC_LEVEL2_CACHE_SIZE);
       shared = handle_amd (_SC_LEVEL3_CACHE_SIZE);
 
       level1_icache_size = handle_amd (_SC_LEVEL1_ICACHE_SIZE);
@@ -869,7 +876,7 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
       level1_dcache_size = data;
       level1_dcache_assoc = handle_amd (_SC_LEVEL1_DCACHE_ASSOC);
       level1_dcache_linesize = handle_amd (_SC_LEVEL1_DCACHE_LINESIZE);
-      level2_cache_size = core;
+      level2_cache_size = handle_amd (_SC_LEVEL2_CACHE_SIZE);
       level2_cache_assoc = handle_amd (_SC_LEVEL2_CACHE_ASSOC);
       level2_cache_linesize = handle_amd (_SC_LEVEL2_CACHE_LINESIZE);
       level3_cache_size = shared;
@@ -880,12 +887,12 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
       if (shared <= 0)
         {
            /* No shared L3 cache.  All we have is the L2 cache.  */
-           shared = core;
+           shared = level2_cache_size;
         }
       else if (cpu_features->basic.family < 0x17)
         {
            /* Account for exclusive L2 and L3 caches.  */
-           shared += core;
+           shared += level2_cache_size;
         }
 
       shared_per_thread = shared;
@@ -1028,16 +1035,25 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
 			   SIZE_MAX);
 
   unsigned long int rep_movsb_stop_threshold;
-  /* ERMS feature is implemented from AMD Zen3 architecture and it is
-     performing poorly for data above L2 cache size. Henceforth, adding
-     an upper bound threshold parameter to limit the usage of Enhanced
-     REP MOVSB operations and setting its value to L2 cache size.  */
-  if (cpu_features->basic.kind == arch_kind_amd)
-    rep_movsb_stop_threshold = core;
-  /* Setting the upper bound of ERMS to the computed value of
-     non-temporal threshold for architectures other than AMD.  */
-  else
-    rep_movsb_stop_threshold = non_temporal_threshold;
+  /* If the tunable is not set or if the value is not larger than
+     x86_rep_movsb_threshold, use the default values.  */
+  rep_movsb_stop_threshold = TUNABLE_GET (x86_rep_movsb_stop_threshold,
+					  long int, NULL);
+  if (!TUNABLE_IS_INITIALIZED (x86_rep_movsb_stop_threshold)
+      || !is_rep_movsb_stop_threshold_valid (rep_movsb_stop_threshold))
+    {
+      /* For AMD cpus that support ERMS (Zen3+), REP MOVSB is in most cases
+	 slower than the vectorized path (and for some alignments it is really
+	 slow, check BZ #30994).  */
+      if (cpu_features->basic.kind == arch_kind_amd)
+	rep_movsb_stop_threshold = 0;
+      else
+	/* Setting the upper bound of ERMS to the computed value of
+	   non-temporal threshold for architectures other than AMD.  */
+	rep_movsb_stop_threshold = non_temporal_threshold;
+    }
+  TUNABLE_SET_WITH_BOUNDS (x86_rep_stosb_threshold, rep_stosb_threshold, 1,
+			   SIZE_MAX);
 
   cpu_features->data_cache_size = data;
   cpu_features->shared_cache_size = shared;
diff --git a/sysdeps/x86/dl-tunables.list b/sysdeps/x86/dl-tunables.list
index feb7004036..5e9831b610 100644
--- a/sysdeps/x86/dl-tunables.list
+++ b/sysdeps/x86/dl-tunables.list
@@ -49,6 +49,16 @@ glibc {
       # if the tunable value is set by user or not [BZ #27069].
       minval: 1
     }
+    x86_rep_movsb_stop_threshold {
+      # For AMD cpus that support ERMS (Zen3+), REP MOVSB is not faster
+      # than the vectorized path (and for some destination alignments it
+      # is really slow, check BZ #30994).  On Intel cpus, the size limit
+      # to use ERMS is [1/8, 1/2] of the size of the chip's cache (check
+      # dl-cacheinfo.h).
+      # This tunable allows the caller to set the upper limit for using
+      # REP MOVSB on memcpy/memmove.
+      type: SIZE_T
+    }
     x86_rep_stosb_threshold {
       type: SIZE_T
       # Since there is overhead to set up REP STOSB operation, REP STOSB
-- 
2.34.1



* [PATCH 3/4] x86: Do not prefer ERMS for memset on Zen3+
  2023-10-31 20:09 [PATCH 0/4] x86: Improve ERMS usage on Zen3+ Adhemerval Zanella
  2023-10-31 20:09 ` [PATCH 1/4] elf: Add a way to check if tunable is set (BZ 27069) Adhemerval Zanella
  2023-10-31 20:09 ` [PATCH 2/4] x86: Fix Zen3/Zen4 ERMS selection (BZ 30994) Adhemerval Zanella
@ 2023-10-31 20:09 ` Adhemerval Zanella
  2023-10-31 20:09 ` [PATCH 4/4] x86: Expand the comment on when REP STOSB is used on memset Adhemerval Zanella
  2023-11-15 19:05 ` [PATCH 0/4] x86: Improve ERMS usage on Zen3+ sajan.karumanchi
  4 siblings, 0 replies; 9+ messages in thread
From: Adhemerval Zanella @ 2023-10-31 20:09 UTC (permalink / raw)
  To: libc-alpha, Noah Goldstein, H . J . Lu, Bruce Merry

The REP STOSB usage on memset does not show any performance gain on
Zen3/Zen4 cores compared to the vectorized loops.
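
Note that an explicitly set tunable still takes precedence: if
glibc.cpu.x86_rep_stosb_threshold is set in GLIBC_TUNABLES, the
user-provided threshold is honored even on Zen3+ (see the
TUNABLE_IS_INITIALIZED check below).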

Checked on x86_64-linux-gnu.
---
 sysdeps/x86/dl-cacheinfo.h | 16 +++++++++++-----
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
index 51e5ba200f..99ba0f776a 100644
--- a/sysdeps/x86/dl-cacheinfo.h
+++ b/sysdeps/x86/dl-cacheinfo.h
@@ -1018,11 +1018,17 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
   if (tunable_size > minimum_rep_movsb_threshold)
     rep_movsb_threshold = tunable_size;
 
-  /* NB: The default value of the x86_rep_stosb_threshold tunable is the
-     same as the default value of __x86_rep_stosb_threshold and the
-     minimum value is fixed.  */
-  rep_stosb_threshold = TUNABLE_GET (x86_rep_stosb_threshold,
-				     long int, NULL);
+  /* For AMD Zen3+ architectures, the performance of the vectorized loop is
+     slightly better than ERMS.  */
+  if (cpu_features->basic.kind == arch_kind_amd)
+    rep_stosb_threshold = SIZE_MAX;
+
+  if (TUNABLE_IS_INITIALIZED (x86_rep_stosb_threshold))
+    /* NB: The default value of the x86_rep_stosb_threshold tunable is the
+       same as the default value of __x86_rep_stosb_threshold and the
+       minimum value is fixed.  */
+    rep_stosb_threshold = TUNABLE_GET (x86_rep_stosb_threshold,
+				       long int, NULL);
 
   TUNABLE_SET_WITH_BOUNDS (x86_data_cache_size, data, 0, SIZE_MAX);
   TUNABLE_SET_WITH_BOUNDS (x86_shared_cache_size, shared, 0, SIZE_MAX);
-- 
2.34.1



* [PATCH 4/4] x86: Expand the comment on when REP STOSB is used on memset
  2023-10-31 20:09 [PATCH 0/4] x86: Improve ERMS usage on Zen3+ Adhemerval Zanella
                   ` (2 preceding siblings ...)
  2023-10-31 20:09 ` [PATCH 3/4] x86: Do not prefer ERMS for memset on Zen3+ Adhemerval Zanella
@ 2023-10-31 20:09 ` Adhemerval Zanella
  2023-11-15 19:05 ` [PATCH 0/4] x86: Improve ERMS usage on Zen3+ sajan.karumanchi
  4 siblings, 0 replies; 9+ messages in thread
From: Adhemerval Zanella @ 2023-10-31 20:09 UTC (permalink / raw)
  To: libc-alpha, Noah Goldstein, H . J . Lu, Bruce Merry

---
 sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
index 3d9ad49cb9..0821b32997 100644
--- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
+++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
@@ -21,7 +21,9 @@
    2. If size is less than VEC, use integer register stores.
    3. If size is from VEC_SIZE to 2 * VEC_SIZE, use 2 VEC stores.
    4. If size is from 2 * VEC_SIZE to 4 * VEC_SIZE, use 4 VEC stores.
-   5. If size is more to 4 * VEC_SIZE, align to 4 * VEC_SIZE with
+   5. On machines with the ERMS feature, if the size is greater than or
+      equal to __x86_rep_stosb_threshold, then REP STOSB will be used.
+   6. If size is more than 4 * VEC_SIZE, align to 4 * VEC_SIZE with
       4 VEC stores and store 4 * VEC at a time until done.  */
 
 #include <sysdep.h>
-- 
2.34.1



* Re: [PATCH 0/4] x86: Improve ERMS usage on Zen3+
  2023-10-31 20:09 [PATCH 0/4] x86: Improve ERMS usage on Zen3+ Adhemerval Zanella
                   ` (3 preceding siblings ...)
  2023-10-31 20:09 ` [PATCH 4/4] x86: Expand the comment on when REP STOSB is used on memset Adhemerval Zanella
@ 2023-11-15 19:05 ` sajan.karumanchi
  2023-11-16 18:35   ` Adhemerval Zanella Netto
  4 siblings, 1 reply; 9+ messages in thread
From: sajan.karumanchi @ 2023-11-15 19:05 UTC (permalink / raw)
  To: adhemerval.zanella
  Cc: bmerry, goldstein.w.n, hjl.tools, libc-alpha, sajan.karumanchi, pmallapp

Adhemerval,

We added this to our todo list, and will get back shortly after verifying the patches.

-Sajan


* Re: [PATCH 0/4] x86: Improve ERMS usage on Zen3+
  2023-11-15 19:05 ` [PATCH 0/4] x86: Improve ERMS usage on Zen3+ sajan.karumanchi
@ 2023-11-16 18:35   ` Adhemerval Zanella Netto
  2024-02-05 19:01     ` Sajan Karumanchi
  0 siblings, 1 reply; 9+ messages in thread
From: Adhemerval Zanella Netto @ 2023-11-16 18:35 UTC (permalink / raw)
  To: sajan.karumanchi
  Cc: bmerry, goldstein.w.n, hjl.tools, libc-alpha, sajan.karumanchi, pmallapp



On 15/11/23 16:05, sajan.karumanchi@gmail.com wrote:
> Adhemerval,
> 
> We added this to our todo list, and will get back shortly after verifying the patches.
> 
> -Sajan

Thanks Sajan, let me know if you need anything else.  I only have access to a Zen3
machine, so if you could also check BZ 30995 [1] it would be helpful (it is related
to Zen4 memcpy performance).

[1] https://sourceware.org/bugzilla/show_bug.cgi?id=30995


* RE: [PATCH 0/4] x86: Improve ERMS usage on Zen3+
  2023-11-16 18:35   ` Adhemerval Zanella Netto
@ 2024-02-05 19:01     ` Sajan Karumanchi
  2024-02-06 13:00       ` Adhemerval Zanella Netto
  0 siblings, 1 reply; 9+ messages in thread
From: Sajan Karumanchi @ 2024-02-05 19:01 UTC (permalink / raw)
  To: adhemerval.zanella
  Cc: bmerry, goldstein.w.n, hjl.tools, libc-alpha, pmallapp,
	sajan.karumanchi, fweimer


Adhemerval,

In our extensive testing, we observed mixed results for rep-movs/stos performance with the ERMS feature enabled. 
Henceforth, we approve this patch to avoid the ERMS code path on AMD processors for better performance.

-Sajan




* Re: [PATCH 0/4] x86: Improve ERMS usage on Zen3+
  2024-02-05 19:01     ` Sajan Karumanchi
@ 2024-02-06 13:00       ` Adhemerval Zanella Netto
  0 siblings, 0 replies; 9+ messages in thread
From: Adhemerval Zanella Netto @ 2024-02-06 13:00 UTC (permalink / raw)
  To: Sajan Karumanchi
  Cc: bmerry, goldstein.w.n, hjl.tools, libc-alpha, pmallapp,
	sajan.karumanchi, fweimer



On 05/02/24 16:01, Sajan Karumanchi wrote:
> 
> Adhemerval,
> 
> In our extensive testing, we observed mixed results for rep-movs/stos performance with the ERMS feature enabled. 
> Henceforth, we approve this patch to avoid the ERMS code path on AMD processors for better performance.
> 
> -Sajan
> 
> 

Thanks for checking this out Sajan, I will rebase with some wording fixes
in comments and double check that everything is ok.  If you can, please
send an Acked-by or Reviewed-by in the next comment.  I will also check with
H.J. and Noah (x86 maintainers) to see if everything is ok.


