From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=guw8=GN=linaro.org=adhemerval.zanella@sourceware.org>
Received: from mail-yw1-x112d.google.com (mail-yw1-x112d.google.com [IPv6:2607:f8b0:4864:20::112d])
	by sourceware.org (Postfix) with ESMTPS id B2F153857C43
	for <libc-alpha@sourceware.org>; Tue, 31 Oct 2023 20:09:34 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org B2F153857C43
Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=linaro.org
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=linaro.org
ARC-Filter: OpenARC Filter v1.0.0 sourceware.org B2F153857C43
Authentication-Results: server2.sourceware.org; arc=none smtp.remote-ip=2607:f8b0:4864:20::112d
ARC-Seal: i=1; a=rsa-sha256; d=sourceware.org; s=key; t=1698782976; cv=none;
	b=MUqahYDIsC6nxHZ6L2mDegngIqQa8F+492Xm/nCKB3WNVCXbkg/nXFMemtnzfU4iXlHSdysPfyx5twq4OCevVHltQsX/HjpHXqH3W5JJB7OJxrjtgDGa9pC+y1tPdvR6myCZNdbKdCdXeW8CuzXEiFBMuZPCVymVA/hYkhFb2mw=
ARC-Message-Signature: i=1; a=rsa-sha256; d=sourceware.org; s=key;
	t=1698782976; c=relaxed/simple;
	bh=uoxo7hUdQgKwNmmyubhslxK3kKPhMwm/tHfmjv/gwPU=;
	h=DKIM-Signature:From:To:Subject:Date:Message-Id:MIME-Version; b=DJ5CVAESWfQifOJPI+GTDjy3ePiSIoZoNrTzo/uHY27ZWevdEasXEhJ7iQ3FyzbxhFUPevtvP1opX5wq42mvmQaf2Yjy7MrgIprC+ebsj/EK/FABZLq22NS6dUmx0stbNOdsht9HRriiw+39qqM7DTLUqkteRWzArET2dpj9c70=
ARC-Authentication-Results: i=1; server2.sourceware.org
Received: by mail-yw1-x112d.google.com with SMTP id 00721157ae682-5a8628e54d4so2244917b3.0
        for <libc-alpha@sourceware.org>; Tue, 31 Oct 2023 13:09:34 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=linaro.org; s=google; t=1698782973; x=1699387773; darn=sourceware.org;
        h=content-transfer-encoding:mime-version:references:in-reply-to
         :message-id:date:subject:to:from:from:to:cc:subject:date:message-id
         :reply-to;
        bh=Denfu+yaEZW3lZNjVu3uJCgtHsql+oOPqFX94ecuTp4=;
        b=V78q9RHDY3mKWXRbJjpz+v4ok5tw8+nbJS66Ap2YBzaPYdqKMeFQ9HWxjC6lYGuYRL
         d5DGgp4n1ZXSFL1ZJoHJbSNQelRCmjYWmDrDRdLw+95KFT1ekXjy2PwaYBSvd/nQaqd1
         I0Qa8sfczo+V6yMUIpqwMzwwedGgG50/Hw2Y6kYTVsd5ZKSSf37r+R4GQiu+rs9o0IqQ
         dUsgN9cuAGNAOQj5W97FReCKKCp31bJs5M5LU4jn5vcDpGNYXZmdoyBPbBpSaNKA5/tr
         pRJ333HZ2seg02o7hZLkgv6qNfOQ1lyXNxNAdr1XHCaDROdm2uRUIVI97Lshpd1BF4HY
         JsSQ==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20230601; t=1698782973; x=1699387773;
        h=content-transfer-encoding:mime-version:references:in-reply-to
         :message-id:date:subject:to:from:x-gm-message-state:from:to:cc
         :subject:date:message-id:reply-to;
        bh=Denfu+yaEZW3lZNjVu3uJCgtHsql+oOPqFX94ecuTp4=;
        b=lh65Vz72c3D1+FOYEDYd/17Fhu7p0lsPRYtl/+R+l6fCP4yafp9CiF7L5NX23UakAi
         zogh9Ma/zg+QHblwvT3jsyaPWbZTUz/GATjEwQIdMcEu9vlm3gnr4iShcGt+Lv9GwEOj
         m++BuiPeKX8xKjRHs+5RtN88goB+F9+NX6nWB2RNkPn2Rp0UXU9MomaLYayljyKwVtSk
         NG9qCRzBiEpZjlIxMSPoU58Nl28geWfh1qT0WgMj7jiVRBnp8x4zmh9kuPCvLIgz8BK6
         XqNendHr+e8tkxVbw9cTphOiRz98AaWs9pRQJbjtzQcX4TezKaImSvOlXYDqvvY8cV+P
         vP4Q==
X-Gm-Message-State: AOJu0YwMn2RJAkJ1Rg1+4UBkTYHswVeXS93JBPUEr+oRpdnzmEFx/VFn
	ewfEroHeFgV/c/Dy8YlnT01tLKAsZw7i1ziIBJXI1A==
X-Google-Smtp-Source: AGHT+IHAlM85qsyxHTtSLyqv6rD8nGOcxCvXbnicQ5v23bWwIRLBSWLrpyPFmjcjVD2YnuGUxmLi4A==
X-Received: by 2002:a05:690c:70a:b0:5a7:ba3e:d1d1 with SMTP id bs10-20020a05690c070a00b005a7ba3ed1d1mr601790ywb.25.1698782973290;
        Tue, 31 Oct 2023 13:09:33 -0700 (PDT)
Received: from mandiga.. ([2804:1b3:a7c0:3d3c:6c87:9be3:8cfc:976d])
        by smtp.gmail.com with ESMTPSA id q69-20020a819948000000b005a7fa3ccb32sm1264111ywg.35.2023.10.31.13.09.31
        (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
        Tue, 31 Oct 2023 13:09:32 -0700 (PDT)
From: Adhemerval Zanella <adhemerval.zanella@linaro.org>
To: libc-alpha@sourceware.org,
	Noah Goldstein <goldstein.w.n@gmail.com>,
	"H . J . Lu" <hjl.tools@gmail.com>,
	Bruce Merry <bmerry@sarao.ac.za>
Subject: [PATCH 2/4] x86: Fix Zen3/Zen4 ERMS selection (BZ 30994)
Date: Tue, 31 Oct 2023 17:09:23 -0300
Message-Id: <20231031200925.3297456-3-adhemerval.zanella@linaro.org>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20231031200925.3297456-1-adhemerval.zanella@linaro.org>
References: <20231031200925.3297456-1-adhemerval.zanella@linaro.org>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Spam-Status: No, score=-12.6 required=5.0 tests=BAYES_00,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,GIT_PATCH_0,RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_PASS,TXREP,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org
List-Id: <libc-alpha.sourceware.org>

The REP MOVSB usage on memcpy/memmove does not show any performance gain
on Zen3/Zen4 cores compared to the vectorized loops.  Also, as
reported in BZ 30994, if the source is aligned and the destination is
not, the performance can be up to 20x slower.

The performance difference is really noticeable with small buffer sizes,
closer to the lower bound at which memcpy/memmove starts to use ERMS.
The performance of REP MOVSB is similar to the vectorized instructions
at the upper size limit (the L2 cache).  Also, there is no drawback
from multiple cores sharing the cache.

A new tunable, glibc.cpu.x86_rep_movsb_stop_threshold, allows setting
the upper bound size for using 'rep movsb'.

Checked on x86_64-linux-gnu on Zen3.
---
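A rough way to observe the alignment effect described above is to time
memcpy from an aligned source to a destination at a chosen offset.  This
is only a sketch: time_memcpy, ALIGNMENT, and any sizes or iteration
counts passed to it are hypothetical and are not part of this patch or
of the glibc benchtests.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define ALIGNMENT 64  /* Assumed cache-line size; not taken from the patch.  */

/* Copy SIZE bytes ITERS times from an ALIGNMENT-aligned source to a
   destination DST_OFFSET bytes past an ALIGNMENT boundary.  Return the
   elapsed CPU time in seconds, or -1.0 on allocation or copy failure.  */
static double
time_memcpy (size_t size, size_t dst_offset, unsigned int iters)
{
  /* aligned_alloc expects a size that is a multiple of the alignment.  */
  size_t alloc = (size + dst_offset + ALIGNMENT - 1) / ALIGNMENT * ALIGNMENT;
  if (alloc == 0)
    alloc = ALIGNMENT;

  unsigned char *src = aligned_alloc (ALIGNMENT, alloc);
  unsigned char *dst = aligned_alloc (ALIGNMENT, alloc);
  if (src == NULL || dst == NULL)
    {
      free (src);
      free (dst);
      return -1.0;
    }
  memset (src, 0x5a, alloc);
  memset (dst, 0, alloc);

  clock_t start = clock ();
  for (unsigned int i = 0; i < iters; i++)
    {
      memcpy (dst + dst_offset, src, size);
      /* Compiler barrier (GNU extension) so the copies are not elided.  */
      __asm__ volatile ("" : : "r" (dst) : "memory");
    }
  clock_t end = clock ();

  /* Check the first and last copied bytes before reporting a time.  */
  int ok = size == 0
	   || (dst[dst_offset] == 0x5a && dst[dst_offset + size - 1] == 0x5a);
  free (src);
  free (dst);
  return ok ? (double) (end - start) / CLOCKS_PER_SEC : -1.0;
}
```

Comparing time_memcpy (n, 0, iters) against time_memcpy (n, 1, iters)
for n just above glibc.cpu.x86_rep_movsb_threshold should show whether
a given machine is affected; the numbers depend on the CPU and the
installed glibc.
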
 manual/tunables.texi         |  9 ++++++
 sysdeps/x86/dl-cacheinfo.h   | 58 +++++++++++++++++++++++-------------
 sysdeps/x86/dl-tunables.list | 10 +++++++
 3 files changed, 56 insertions(+), 21 deletions(-)
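
For reviewers, the default selection in dl_init_cacheinfo after this
patch can be modelled as below.  The names pick_stop_threshold,
uses_rep_movsb, and enum cpu_kind are hypothetical, and the
[start, stop) size check is my assumption about how memmove consumes
the two thresholds, not code from this series.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for cpu_features->basic.kind.  */
enum cpu_kind { KIND_INTEL, KIND_AMD };

/* Model of the selection in this patch: a user-supplied value is kept
   only when the tunable is set and is larger than the start threshold.
   Otherwise AMD gets 0 and other vendors get the non-temporal
   threshold.  */
static size_t
pick_stop_threshold (enum cpu_kind kind, bool tunable_set,
		     size_t tunable_val, size_t start_threshold,
		     size_t non_temporal_threshold)
{
  if (tunable_set && tunable_val > start_threshold)
    return tunable_val;
  return kind == KIND_AMD ? 0 : non_temporal_threshold;
}

/* Assumed consumer: REP MOVSB is used for sizes in [START, STOP).  With
   STOP == 0 the range is empty, so the vectorized path is always
   taken.  */
static bool
uses_rep_movsb (size_t n, size_t start, size_t stop)
{
  return n >= start && n < stop;
}
```

With the 2048-byte start threshold mentioned in the manual, an AMD
system without the tunable set never selects REP MOVSB under this
model, while setting the tunable to a larger value re-enables it for
sizes below that value.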

diff --git a/manual/tunables.texi b/manual/tunables.texi
index 776fd93fd9..5d3263bc2e 100644
--- a/manual/tunables.texi
+++ b/manual/tunables.texi
@@ -570,6 +570,15 @@ greater than zero, and currently defaults to 2048 bytes.
 This tunable is specific to i386 and x86-64.
 @end deftp
 
+@deftp Tunable glibc.cpu.x86_rep_movsb_stop_threshold
+The @code{glibc.cpu.x86_rep_movsb_stop_threshold} tunable allows the user
+to set the threshold in bytes to stop using "rep movsb".  The value must
+be greater than @code{glibc.cpu.x86_rep_movsb_threshold}; otherwise the
+default is used.  The default depends on the CPU and the cache size.
+
+This tunable is specific to i386 and x86-64.
+@end deftp
+
 @deftp Tunable glibc.cpu.x86_rep_stosb_threshold
 The @code{glibc.cpu.x86_rep_stosb_threshold} tunable allows the user to
 set threshold in bytes to start using "rep stosb".  The value must be
diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
index 87486054f9..51e5ba200f 100644
--- a/sysdeps/x86/dl-cacheinfo.h
+++ b/sysdeps/x86/dl-cacheinfo.h
@@ -784,6 +784,14 @@ get_common_cache_info (long int *shared_ptr, long int * shared_per_thread_ptr, u
   *threads_ptr = threads;
 }
 
+static inline bool
+is_rep_movsb_stop_threshold_valid (unsigned long int v)
+{
+  unsigned long int rep_movsb_threshold
+    = TUNABLE_GET (x86_rep_movsb_threshold, long int, NULL);
+  return v > rep_movsb_threshold;
+}
+
 static void
 dl_init_cacheinfo (struct cpu_features *cpu_features)
 {
@@ -791,7 +799,6 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
   long int data = -1;
   long int shared = -1;
   long int shared_per_thread = -1;
-  long int core = -1;
   unsigned int threads = 0;
   unsigned long int level1_icache_size = -1;
   unsigned long int level1_icache_linesize = -1;
@@ -809,7 +816,6 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
   if (cpu_features->basic.kind == arch_kind_intel)
     {
       data = handle_intel (_SC_LEVEL1_DCACHE_SIZE, cpu_features);
-      core = handle_intel (_SC_LEVEL2_CACHE_SIZE, cpu_features);
       shared = handle_intel (_SC_LEVEL3_CACHE_SIZE, cpu_features);
       shared_per_thread = shared;
 
@@ -822,7 +828,8 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
 	= handle_intel (_SC_LEVEL1_DCACHE_ASSOC, cpu_features);
       level1_dcache_linesize
 	= handle_intel (_SC_LEVEL1_DCACHE_LINESIZE, cpu_features);
-      level2_cache_size = core;
+      level2_cache_size
+	= handle_intel (_SC_LEVEL2_CACHE_SIZE, cpu_features);
       level2_cache_assoc
 	= handle_intel (_SC_LEVEL2_CACHE_ASSOC, cpu_features);
       level2_cache_linesize
@@ -835,12 +842,12 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
       level4_cache_size
 	= handle_intel (_SC_LEVEL4_CACHE_SIZE, cpu_features);
 
-      get_common_cache_info (&shared, &shared_per_thread, &threads, core);
+      get_common_cache_info (&shared, &shared_per_thread, &threads,
+			     level2_cache_size);
     }
   else if (cpu_features->basic.kind == arch_kind_zhaoxin)
     {
       data = handle_zhaoxin (_SC_LEVEL1_DCACHE_SIZE);
-      core = handle_zhaoxin (_SC_LEVEL2_CACHE_SIZE);
       shared = handle_zhaoxin (_SC_LEVEL3_CACHE_SIZE);
       shared_per_thread = shared;
 
@@ -849,19 +856,19 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
       level1_dcache_size = data;
       level1_dcache_assoc = handle_zhaoxin (_SC_LEVEL1_DCACHE_ASSOC);
       level1_dcache_linesize = handle_zhaoxin (_SC_LEVEL1_DCACHE_LINESIZE);
-      level2_cache_size = core;
+      level2_cache_size = handle_zhaoxin (_SC_LEVEL2_CACHE_SIZE);
       level2_cache_assoc = handle_zhaoxin (_SC_LEVEL2_CACHE_ASSOC);
       level2_cache_linesize = handle_zhaoxin (_SC_LEVEL2_CACHE_LINESIZE);
       level3_cache_size = shared;
       level3_cache_assoc = handle_zhaoxin (_SC_LEVEL3_CACHE_ASSOC);
       level3_cache_linesize = handle_zhaoxin (_SC_LEVEL3_CACHE_LINESIZE);
 
-      get_common_cache_info (&shared, &shared_per_thread, &threads, core);
+      get_common_cache_info (&shared, &shared_per_thread, &threads,
+			     level2_cache_size);
     }
   else if (cpu_features->basic.kind == arch_kind_amd)
     {
       data = handle_amd (_SC_LEVEL1_DCACHE_SIZE);
-      core = handle_amd (_SC_LEVEL2_CACHE_SIZE);
       shared = handle_amd (_SC_LEVEL3_CACHE_SIZE);
 
       level1_icache_size = handle_amd (_SC_LEVEL1_ICACHE_SIZE);
@@ -869,7 +876,7 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
       level1_dcache_size = data;
       level1_dcache_assoc = handle_amd (_SC_LEVEL1_DCACHE_ASSOC);
       level1_dcache_linesize = handle_amd (_SC_LEVEL1_DCACHE_LINESIZE);
-      level2_cache_size = core;
+      level2_cache_size = handle_amd (_SC_LEVEL2_CACHE_SIZE);
       level2_cache_assoc = handle_amd (_SC_LEVEL2_CACHE_ASSOC);
       level2_cache_linesize = handle_amd (_SC_LEVEL2_CACHE_LINESIZE);
       level3_cache_size = shared;
@@ -880,12 +887,12 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
       if (shared <= 0)
         {
            /* No shared L3 cache.  All we have is the L2 cache.  */
-           shared = core;
+           shared = level2_cache_size;
         }
       else if (cpu_features->basic.family < 0x17)
         {
            /* Account for exclusive L2 and L3 caches.  */
-           shared += core;
+           shared += level2_cache_size;
         }
 
       shared_per_thread = shared;
@@ -1028,16 +1035,25 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
 			   SIZE_MAX);
 
   unsigned long int rep_movsb_stop_threshold;
-  /* ERMS feature is implemented from AMD Zen3 architecture and it is
-     performing poorly for data above L2 cache size. Henceforth, adding
-     an upper bound threshold parameter to limit the usage of Enhanced
-     REP MOVSB operations and setting its value to L2 cache size.  */
-  if (cpu_features->basic.kind == arch_kind_amd)
-    rep_movsb_stop_threshold = core;
-  /* Setting the upper bound of ERMS to the computed value of
-     non-temporal threshold for architectures other than AMD.  */
-  else
-    rep_movsb_stop_threshold = non_temporal_threshold;
+  /* If the tunable is not set or if the value is not larger than
+     x86_rep_movsb_threshold, use the default values.  */
+  rep_movsb_stop_threshold = TUNABLE_GET (x86_rep_movsb_stop_threshold,
+					  long int, NULL);
+  if (!TUNABLE_IS_INITIALIZED (x86_rep_movsb_stop_threshold)
+      || !is_rep_movsb_stop_threshold_valid (rep_movsb_stop_threshold))
+    {
+      /* For AMD cpus that support ERMS (Zen3+), REP MOVSB is in many cases
+	 slower than the vectorized path (and for some alignments it is really
+	 slow, check BZ #30994).  */
+      if (cpu_features->basic.kind == arch_kind_amd)
+	rep_movsb_stop_threshold = 0;
+      else
+      /* Setting the upper bound of ERMS to the computed value of
+	 non-temporal threshold for architectures other than AMD.  */
+	rep_movsb_stop_threshold = non_temporal_threshold;
+    }
+  TUNABLE_SET_WITH_BOUNDS (x86_rep_stosb_threshold, rep_stosb_threshold, 1,
+			   SIZE_MAX);
 
   cpu_features->data_cache_size = data;
   cpu_features->shared_cache_size = shared;
diff --git a/sysdeps/x86/dl-tunables.list b/sysdeps/x86/dl-tunables.list
index feb7004036..5e9831b610 100644
--- a/sysdeps/x86/dl-tunables.list
+++ b/sysdeps/x86/dl-tunables.list
@@ -49,6 +49,16 @@ glibc {
       # if the tunable value is set by user or not [BZ #27069].
       minval: 1
     }
+    x86_rep_movsb_stop_threshold {
+      # For AMD cpus that support ERMS (Zen3+), REP MOVSB is not faster
+      # than the vectorized path (and for some destination alignments it
+      # is really slow, check BZ #30994).  On Intel cpus, the size limit
+      # to use ERMS is [1/8, 1/2] of the size of the chip's cache (check
+      # dl-cacheinfo.h).
+      # This tunable allows the caller to set the upper size limit for
+      # using REP MOVSB on memcpy/memmove.
+      type: SIZE_T
+    }
     x86_rep_stosb_threshold {
       type: SIZE_T
       # Since there is overhead to set up REP STOSB operation, REP STOSB
-- 
2.34.1