From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=pwla=A7=gmail.com=goldstein.w.n@sourceware.org>
Received: from mail-ej1-x62f.google.com (mail-ej1-x62f.google.com [IPv6:2a00:1450:4864:20::62f])
	by sourceware.org (Postfix) with ESMTPS id 30A6E3857036
	for <libc-alpha@sourceware.org>; Wed, 10 May 2023 22:12:40 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 30A6E3857036
Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=gmail.com
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=gmail.com
Received: by mail-ej1-x62f.google.com with SMTP id a640c23a62f3a-965d2749e2eso1114453466b.1
        for <libc-alpha@sourceware.org>; Wed, 10 May 2023 15:12:40 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20221208; t=1683756757; x=1686348757;
        h=content-transfer-encoding:mime-version:references:in-reply-to
         :message-id:date:subject:cc:to:from:from:to:cc:subject:date
         :message-id:reply-to;
        bh=0JB530HnaG2OZ2huNedSxSI1hsjYpY+O5SWkAyi1ueA=;
        b=dV5plNY0zqoB0p937fCFY4Lue5ZXZtPgkAxV+dvOcAcD38HaveeO7m0Wh7LFbHKuKE
         mEMs2CRdyxxOPoTSjTI3TCY3nD3/9lFPWvN85hNH7YMRm+pMiyYyOwzfqVUszkjDyumm
         mDJlAO2XQeHzUEjLAdXi+AsA/oXi72GAFqZGmdQsOpzvf1vPWo82NDxbR9s5a+2wyI7V
         E0ND0QscdgWuSFDtDxmsk/XP1/7/iW3eAZxcYBStcYwLi8ZFebm07ZS5pLxWIPTvlMUv
         zEA7+H/jdZxcRFAReLAMzoERQYWA/cjqigZ4NQIiTIvtVL22+K5y4ndbNjZlywh36daI
         D3aQ==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20221208; t=1683756757; x=1686348757;
        h=content-transfer-encoding:mime-version:references:in-reply-to
         :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc
         :subject:date:message-id:reply-to;
        bh=0JB530HnaG2OZ2huNedSxSI1hsjYpY+O5SWkAyi1ueA=;
        b=Dwlmq4AZFnrclwR1cpr55WbI3wq5n+VCsVj5n4Ab8D2vETBBon51aQK+Qb3eM5+SBe
         pZZy6sKaON58sDpMpfjFNxk5/wwLtM/j/rIwR74YUuIaxWKBvj6PB+vetmEwqehaO5Wc
         zDHnK8HXqitjIxz76bBF2oN5cM/Za/5hceuDZN4YR2h87a52/V4fENfdGpW+wD/lRHLq
         fdSvklJAPGENOvulquRLuFi8P9W4BkTPNnKS7yJDpygRfudsJs9lDbE72m934+N36MjM
         ZVVb6jNxf2L2RHqno3xyKJ2C3fUaUXnzs4IitEMqIYo8tdjDDVKfH8j8iYl78XHOj0zj
         bmGQ==
X-Gm-Message-State: AC+VfDwpzwAkFyijNdr0V/s6vaVXuln5VMaYNMO5ZDAWSLdkXLurtYvd
	SSOevYGO14NCXF/JMB+F+kIxDN2x8bw=
X-Google-Smtp-Source: ACHHUZ6pSbWrNxendHQocIQ521FbGM5RNR+8o1EXnedWGgV27tZRTwah/V92D88owBWrVPxxVEiItw==
X-Received: by 2002:a17:907:7fa2:b0:932:4255:5902 with SMTP id qk34-20020a1709077fa200b0093242555902mr18879351ejc.76.1683756757059;
        Wed, 10 May 2023 15:12:37 -0700 (PDT)
Received: from noahgold-desk.lan (2603-8080-1301-76c6-d514-288c-2603-6ec7.res6.spectrum.com. [2603:8080:1301:76c6:d514:288c:2603:6ec7])
        by smtp.gmail.com with ESMTPSA id mc27-20020a170906eb5b00b00966330021e9sm3141378ejb.47.2023.05.10.15.12.35
        (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
        Wed, 10 May 2023 15:12:36 -0700 (PDT)
From: Noah Goldstein <goldstein.w.n@gmail.com>
To: libc-alpha@sourceware.org
Cc: goldstein.w.n@gmail.com,
	hjl.tools@gmail.com,
	carlos@systemhalted.org
Subject: [PATCH v7 3/4] x86: Make the divisor in setting `non_temporal_threshold` cpu specific
Date: Wed, 10 May 2023 17:12:12 -0500
Message-Id: <20230510221213.765754-2-goldstein.w.n@gmail.com>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20230510221213.765754-1-goldstein.w.n@gmail.com>
References: <20230424050329.1501348-1-goldstein.w.n@gmail.com>
 <20230510221213.765754-1-goldstein.w.n@gmail.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Spam-Status: No, score=-12.1 required=5.0 tests=BAYES_00,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,FREEMAIL_FROM,GIT_PATCH_0,RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_PASS,TXREP,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org
List-Id: <libc-alpha.sourceware.org>

Different systems prefer different divisors.

>From benchmarks[1] so far the following divisors have been found:
    ICX     : 2
    SKX     : 2
    BWD     : 8

For Intel, we are generalizing that BWD and older prefer 8 as a
divisor, and SKL and newer prefer 2. All other CPUs keep the default
divisor of 4. These numbers can be further tuned as more benchmarks
are run.

[1]: https://github.com/goldsteinn/memcpy-nt-benchmarks
---
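The effect of the divisor on the threshold can be sketched as below.
This is only an illustration under stated assumptions, not the glibc
code: `enum uarch_group`, `divisor_for` and `nt_threshold` are
hypothetical names, and the three groups stand in for the much longer
list of microarchitecture cases in the patch.

```c
/* Hypothetical stand-ins for the microarchitecture groups handled by
   this patch; these are not the glibc enumerators.  */
enum uarch_group
{
  UARCH_OTHER,
  UARCH_BWD_AND_OLDER,
  UARCH_SKL_AND_NEWER
};

/* Pick the divisor the way the patch does: 8 for Broadwell and older,
   2 for Skylake and newer, and 4 for everything else.  */
static unsigned long int
divisor_for (enum uarch_group group)
{
  switch (group)
    {
    case UARCH_BWD_AND_OLDER:
      return 8;
    case UARCH_SKL_AND_NEWER:
      return 2;
    default:
      return 4;
    }
}

/* Compute shared / divisor, falling back to a divisor of 4 when the
   field was never set (zero).  */
static unsigned long int
nt_threshold (unsigned long int shared, unsigned long int divisor)
{
  if (divisor == 0)
    divisor = 4;
  return shared / divisor;
}
```

Assuming a 32 MiB shared cache size, this gives a threshold of 4 MiB
for the BWD-and-older group, 16 MiB for the SKL-and-newer group, and
8 MiB for everything else.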
 sysdeps/x86/cpu-features.c         | 16 +++++++++++++--
 sysdeps/x86/dl-cacheinfo.h         | 32 ++++++++++++++++++------------
 sysdeps/x86/include/cpu-features.h |  3 +++
 3 files changed, 36 insertions(+), 15 deletions(-)

diff --git a/sysdeps/x86/cpu-features.c b/sysdeps/x86/cpu-features.c
index 9d433f8144..4cc1cd9fed 100644
--- a/sysdeps/x86/cpu-features.c
+++ b/sysdeps/x86/cpu-features.c
@@ -637,6 +637,7 @@ init_cpu_features (struct cpu_features *cpu_features)
   unsigned int stepping = 0;
   enum cpu_features_kind kind;
 
+  cpu_features->cachesize_non_temporal_divisor = 4;
 #if !HAS_CPUID
   if (__get_cpuid_max (0, 0) == 0)
     {
@@ -720,8 +721,8 @@ init_cpu_features (struct cpu_features *cpu_features)
 		 of Core i3/i5/i7 processors if AVX is available.  */
 	      if (!CPU_FEATURES_CPU_P (cpu_features, AVX))
 		break;
-	    case INTEL_BIGCORE_NEHALEM:
-	    case INTEL_BIGCORE_WESTMERE:
+
+	    enable_modern_features:
 	      /* Rep string instructions, unaligned load, unaligned copy,
 		 and pminub are fast on Intel Core i3, i5 and i7.  */
 	      cpu_features->preferred[index_arch_Fast_Rep_String]
@@ -730,11 +731,20 @@ init_cpu_features (struct cpu_features *cpu_features)
 		      | bit_arch_Prefer_PMINUB_for_stringop);
 	      break;
 
+	    case INTEL_BIGCORE_NEHALEM:
+	    case INTEL_BIGCORE_WESTMERE:
+	      /* Older CPUs prefer non-temporal stores at a lower threshold.  */
+	      cpu_features->cachesize_non_temporal_divisor = 8;
+	      goto enable_modern_features;
+
 	      /* Default tuned Bigcore microarch.  */
 	    case INTEL_BIGCORE_SANDYBRIDGE:
 	    case INTEL_BIGCORE_IVYBRIDGE:
 	    case INTEL_BIGCORE_HASWELL:
 	    case INTEL_BIGCORE_BROADWELL:
+	      cpu_features->cachesize_non_temporal_divisor = 8;
+	      goto default_tuning;
+
 	    case INTEL_BIGCORE_SKYLAKE:
 	    case INTEL_BIGCORE_AMBERLAKE:
 	    case INTEL_BIGCORE_COFFEELAKE:
@@ -755,11 +765,13 @@ init_cpu_features (struct cpu_features *cpu_features)
 	    case INTEL_BIGCORE_SAPPHIRERAPIDS:
 	    case INTEL_BIGCORE_EMERALDRAPIDS:
 	    case INTEL_BIGCORE_GRANITERAPIDS:
+	      cpu_features->cachesize_non_temporal_divisor = 2;
 	      goto default_tuning;
 
 	    /* Default tuned Mixed (bigcore + atom SOC).  */
 	    case INTEL_MIXED_LAKEFIELD:
 	    case INTEL_MIXED_ALDERLAKE:
+	      cpu_features->cachesize_non_temporal_divisor = 2;
 	      goto default_tuning;
 	    }
 
diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
index 4a1a5423ff..864b00a521 100644
--- a/sysdeps/x86/dl-cacheinfo.h
+++ b/sysdeps/x86/dl-cacheinfo.h
@@ -738,19 +738,25 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
   cpu_features->level3_cache_linesize = level3_cache_linesize;
   cpu_features->level4_cache_size = level4_cache_size;
 
-  /* The default setting for the non_temporal threshold is 1/4 of size
-     of the chip's cache. For most Intel and AMD processors with an
-     initial release date between 2017 and 2023, a thread's typical
-     share of the cache is from 18-64MB. Using the 1/4 L3 is meant to
-     estimate the point where non-temporal stores begin outcompeting
-     REP MOVSB. As well the point where the fact that non-temporal
-     stores are forced back to main memory would already occurred to the
-     majority of the lines in the copy. Note, concerns about the
-     entire L3 cache being evicted by the copy are mostly alleviated
-     by the fact that modern HW detects streaming patterns and
-     provides proper LRU hints so that the maximum thrashing
-     capped at 1/associativity. */
-  unsigned long int non_temporal_threshold = shared / 4;
+  unsigned long int cachesize_non_temporal_divisor
+      = cpu_features->cachesize_non_temporal_divisor;
+  if (cachesize_non_temporal_divisor == 0)
+    cachesize_non_temporal_divisor = 4;
+
+  /* The default setting for the non_temporal threshold is [1/8, 1/2] of size
+     of the chip's cache (depending on `cachesize_non_temporal_divisor` which
+     is microarch specific. The default is 1/4). For most Intel and AMD
+     processors with an initial release date between 2017 and 2023, a thread's
+     typical share of the cache is from 18-64MB. Using a reasonably sized
+     fraction of L3 is meant to estimate the point where non-temporal stores
+     begin outcompeting REP MOVSB, as well as the point where being forced
+     back to main memory (as non-temporal stores are) would already have
+     occurred for the majority of the lines in the copy. Note, concerns about
+     the entire L3 cache being evicted by the copy are mostly alleviated by the
+     fact that modern HW detects streaming patterns and provides proper LRU
+     hints so that the maximum thrashing is capped at 1/associativity. */
+  unsigned long int non_temporal_threshold
+      = shared / cachesize_non_temporal_divisor;
   /* If no ERMS, we use the per-thread L3 chunking. Normal cacheable stores run
      a higher risk of actually thrashing the cache as they don't have a HW LRU
      hint. As well, there performance in highly parallel situations is
diff --git a/sysdeps/x86/include/cpu-features.h b/sysdeps/x86/include/cpu-features.h
index 40b8129d6a..f5b9dd54fe 100644
--- a/sysdeps/x86/include/cpu-features.h
+++ b/sysdeps/x86/include/cpu-features.h
@@ -915,6 +915,9 @@ struct cpu_features
   unsigned long int shared_cache_size;
   /* Threshold to use non temporal store.  */
   unsigned long int non_temporal_threshold;
+  /* When no user non_temporal_threshold is specified, we default to
+     cachesize / cachesize_non_temporal_divisor.  */
+  unsigned long int cachesize_non_temporal_divisor;
   /* Threshold to use "rep movsb".  */
   unsigned long int rep_movsb_threshold;
   /* Threshold to stop using "rep movsb".  */
-- 
2.34.1