From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <azanella@sourceware.org>
Received: by sourceware.org (Postfix, from userid 1791)
	id D84A93858298; Fri, 27 Oct 2023 12:29:52 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org D84A93858298
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=sourceware.org;
	s=default; t=1698409792;
	bh=9NK013XL3QrTNooG4YzwLTwjx1HG3dnaNVubNZ6QO6Q=;
	h=From:To:Subject:Date:From;
	b=VcMhN3yHwubdgaSiC1BZdKNvcH6XTZHmaCOJQ6QDVnCsjOakA235fgegmXt+Ykafb
	 ioJs8RQNnH0zM/Vic/L91pDnQ7dzOK4bAPPeNlf4gW2uk+acrHnHlco/GD8dIhz4wH
	 f8cgGKjtlmkUfpLK6plcO32efp2fgooU2+f2AHqc=
Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
From: Adhemerval Zanella <azanella@sourceware.org>
To: glibc-cvs@sourceware.org
Subject: [glibc/azanella/bz30944-memcpy-zen] x86: Fix Zen3/Zen4 ERMS selection
 (BZ 30994, BZ 30995)
X-Act-Checkin: glibc
X-Git-Author: Adhemerval Zanella <adhemerval.zanella@linaro.org>
X-Git-Refname: refs/heads/azanella/bz30944-memcpy-zen
X-Git-Oldrev: 2bd00179885928fd95fcabfafc50e7b5c6e660d2
X-Git-Newrev: 2869a2eb9669e8cc1cd019805c419d11e5f3a1d3
Message-Id: <20231027122952.D84A93858298@sourceware.org>
Date: Fri, 27 Oct 2023 12:29:52 +0000 (GMT)
List-Id: <glibc-cvs.sourceware.org>

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=2869a2eb9669e8cc1cd019805c419d11e5f3a1d3

commit 2869a2eb9669e8cc1cd019805c419d11e5f3a1d3
Author: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Date:   Thu Oct 26 23:24:35 2023 -0300

    x86: Fix Zen3/Zen4 ERMS selection (BZ 30994, BZ 30995)
    
    The REP MOVSB usage on memcpy/memmove does not show any performance
    gain on Zen3/Zen4 cores compared to the vectorized loops.  Also, as
    reported in BZ 30994, if the source is aligned and the destination is
    not, the performance can be as much as 20x slower.
    
    The performance difference is most noticeable with small buffer sizes,
    close to the lower bound at which memcpy/memmove starts to use ERMS.
    The performance of REP MOVSB is similar to the vectorized
    instructions at the upper size limit (the L2 cache size).
    
    Also, there is no drawback to multiple cores sharing the cache.
    
    Checked on x86_64-linux-gnu on Zen3.
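
The effect of the threshold change can be illustrated with a small model.
This is only a sketch, not the glibc implementation: it assumes that
memcpy/memmove picks REP MOVSB when the size falls in the half-open window
[rep_movsb_threshold, rep_movsb_stop_threshold), and the function names,
the 2048-byte lower bound, and the cache sizes used below are hypothetical
values chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model (an assumption, not glibc code): REP MOVSB is
   selected only for sizes in [rep_movsb_threshold, stop_threshold).  */
static bool
uses_rep_movsb (size_t size, size_t rep_movsb_threshold,
                size_t stop_threshold)
{
  return size >= rep_movsb_threshold && size < stop_threshold;
}

/* Upper bound before this change: the L2 cache size on AMD, the
   non-temporal threshold on other vendors.  */
static size_t
stop_threshold_old (bool is_amd, size_t l2_size, size_t non_temporal)
{
  return is_amd ? l2_size : non_temporal;
}

/* Upper bound after this change: 0 on AMD, which leaves the window
   empty, so REP MOVSB is never selected there.  */
static size_t
stop_threshold_new (bool is_amd, size_t non_temporal)
{
  return is_amd ? 0 : non_temporal;
}

/* Usage example with illustrative sizes: a 4 KiB copy, a 2 KiB lower
   bound, a 512 KiB L2 cache and a 1 MiB non-temporal threshold.  */
static void
erms_selection_selfcheck (void)
{
  size_t old_amd = stop_threshold_old (true, 512 * 1024, 1024 * 1024);
  size_t new_amd = stop_threshold_new (true, 1024 * 1024);
  assert (uses_rep_movsb (4096, 2048, old_amd));
  assert (!uses_rep_movsb (4096, 2048, new_amd));
}
```

Under this model, setting the upper bound to 0 disables ERMS on AMD
without touching the lower bound, and other vendors keep their window.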

Diff:
---
 sysdeps/x86/dl-cacheinfo.h | 31 +++++++++++++++----------------
 1 file changed, 15 insertions(+), 16 deletions(-)

diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
index 87486054f9..546ff0725a 100644
--- a/sysdeps/x86/dl-cacheinfo.h
+++ b/sysdeps/x86/dl-cacheinfo.h
@@ -791,7 +791,6 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
   long int data = -1;
   long int shared = -1;
   long int shared_per_thread = -1;
-  long int core = -1;
   unsigned int threads = 0;
   unsigned long int level1_icache_size = -1;
   unsigned long int level1_icache_linesize = -1;
@@ -809,7 +808,6 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
   if (cpu_features->basic.kind == arch_kind_intel)
     {
       data = handle_intel (_SC_LEVEL1_DCACHE_SIZE, cpu_features);
-      core = handle_intel (_SC_LEVEL2_CACHE_SIZE, cpu_features);
       shared = handle_intel (_SC_LEVEL3_CACHE_SIZE, cpu_features);
       shared_per_thread = shared;
 
@@ -822,7 +820,8 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
 	= handle_intel (_SC_LEVEL1_DCACHE_ASSOC, cpu_features);
       level1_dcache_linesize
 	= handle_intel (_SC_LEVEL1_DCACHE_LINESIZE, cpu_features);
-      level2_cache_size = core;
+      level2_cache_size
+	= handle_intel (_SC_LEVEL2_CACHE_SIZE, cpu_features);
       level2_cache_assoc
 	= handle_intel (_SC_LEVEL2_CACHE_ASSOC, cpu_features);
       level2_cache_linesize
@@ -835,12 +834,12 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
       level4_cache_size
 	= handle_intel (_SC_LEVEL4_CACHE_SIZE, cpu_features);
 
-      get_common_cache_info (&shared, &shared_per_thread, &threads, core);
+      get_common_cache_info (&shared, &shared_per_thread, &threads,
+			     level2_cache_size);
     }
   else if (cpu_features->basic.kind == arch_kind_zhaoxin)
     {
       data = handle_zhaoxin (_SC_LEVEL1_DCACHE_SIZE);
-      core = handle_zhaoxin (_SC_LEVEL2_CACHE_SIZE);
       shared = handle_zhaoxin (_SC_LEVEL3_CACHE_SIZE);
       shared_per_thread = shared;
 
@@ -849,19 +848,19 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
       level1_dcache_size = data;
       level1_dcache_assoc = handle_zhaoxin (_SC_LEVEL1_DCACHE_ASSOC);
       level1_dcache_linesize = handle_zhaoxin (_SC_LEVEL1_DCACHE_LINESIZE);
-      level2_cache_size = core;
+      level2_cache_size = handle_zhaoxin (_SC_LEVEL2_CACHE_SIZE);
       level2_cache_assoc = handle_zhaoxin (_SC_LEVEL2_CACHE_ASSOC);
       level2_cache_linesize = handle_zhaoxin (_SC_LEVEL2_CACHE_LINESIZE);
       level3_cache_size = shared;
       level3_cache_assoc = handle_zhaoxin (_SC_LEVEL3_CACHE_ASSOC);
       level3_cache_linesize = handle_zhaoxin (_SC_LEVEL3_CACHE_LINESIZE);
 
-      get_common_cache_info (&shared, &shared_per_thread, &threads, core);
+      get_common_cache_info (&shared, &shared_per_thread, &threads,
+			     level2_cache_size);
     }
   else if (cpu_features->basic.kind == arch_kind_amd)
     {
       data = handle_amd (_SC_LEVEL1_DCACHE_SIZE);
-      core = handle_amd (_SC_LEVEL2_CACHE_SIZE);
       shared = handle_amd (_SC_LEVEL3_CACHE_SIZE);
 
       level1_icache_size = handle_amd (_SC_LEVEL1_ICACHE_SIZE);
@@ -869,7 +868,7 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
       level1_dcache_size = data;
       level1_dcache_assoc = handle_amd (_SC_LEVEL1_DCACHE_ASSOC);
       level1_dcache_linesize = handle_amd (_SC_LEVEL1_DCACHE_LINESIZE);
-      level2_cache_size = core;
+      level2_cache_size = handle_amd (_SC_LEVEL2_CACHE_SIZE);
       level2_cache_assoc = handle_amd (_SC_LEVEL2_CACHE_ASSOC);
       level2_cache_linesize = handle_amd (_SC_LEVEL2_CACHE_LINESIZE);
       level3_cache_size = shared;
@@ -880,12 +879,12 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
       if (shared <= 0)
         {
            /* No shared L3 cache.  All we have is the L2 cache.  */
-           shared = core;
+           shared = level2_cache_size;
         }
       else if (cpu_features->basic.family < 0x17)
         {
            /* Account for exclusive L2 and L3 caches.  */
-           shared += core;
+           shared += level2_cache_size;
         }
 
       shared_per_thread = shared;
@@ -1028,12 +1027,12 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
 			   SIZE_MAX);
 
   unsigned long int rep_movsb_stop_threshold;
-  /* ERMS feature is implemented from AMD Zen3 architecture and it is
-     performing poorly for data above L2 cache size. Henceforth, adding
-     an upper bound threshold parameter to limit the usage of Enhanced
-     REP MOVSB operations and setting its value to L2 cache size.  */
+  /* Although the ERMS feature is implemented on AMD Zen3+ architectures, it
+     performs really badly if the source is aligned and the destination is
+     unaligned.  Even for other alignments, the performance is not better
+     than the vectorized loop.  Disable ERMS on AMD.  */
   if (cpu_features->basic.kind == arch_kind_amd)
-    rep_movsb_stop_threshold = core;
+    rep_movsb_stop_threshold = 0;
   /* Setting the upper bound of ERMS to the computed value of
      non-temporal threshold for architectures other than AMD.  */
   else