From: Noah Goldstein
To: libc-alpha@sourceware.org
Cc: goldstein.w.n@gmail.com, hjl.tools@gmail.com, carlos@systemhalted.org
Subject: [PATCH v8 3/3] x86: Make the divisor in setting `non_temporal_threshold` cpu specific
Date: Fri, 12 May 2023 17:03:26 -0500
Message-Id: <20230512220326.1918608-1-goldstein.w.n@gmail.com>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20230424050329.1501348-1-goldstein.w.n@gmail.com>
References: <20230424050329.1501348-1-goldstein.w.n@gmail.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

Different systems prefer different divisors.

So far, benchmarks[1] have found the following divisors:
    ICX : 2
    SKX : 2
    BWD : 8

For Intel, we are generalizing that BWD and older prefer 8 as a
divisor, and SKL and newer prefer 2. This number can be further tuned
as benchmarks are run.

[1]: https://github.com/goldsteinn/memcpy-nt-benchmarks
---
 sysdeps/x86/cpu-features.c         | 27 +++++++++++++++++--------
 sysdeps/x86/dl-cacheinfo.h         | 32 ++++++++++++++++++------------
 sysdeps/x86/include/cpu-features.h |  3 +++
 3 files changed, 41 insertions(+), 21 deletions(-)

diff --git a/sysdeps/x86/cpu-features.c b/sysdeps/x86/cpu-features.c
index 264d309dd7..3ec7e6f2df 100644
--- a/sysdeps/x86/cpu-features.c
+++ b/sysdeps/x86/cpu-features.c
@@ -638,6 +638,7 @@ init_cpu_features (struct cpu_features *cpu_features)
   unsigned int stepping = 0;
   enum cpu_features_kind kind;

+  cpu_features->cachesize_non_temporal_divisor = 4;
 #if !HAS_CPUID
   if (__get_cpuid_max (0, 0) == 0)
     {
@@ -717,12 +718,13 @@ init_cpu_features (struct cpu_features *cpu_features)

 	      /* Bigcore/Default Tuning.  */
 	    default:
+	    default_tuning:
 	      /* Unknown family 0x06 processors.  Assuming this is one
 		 of Core i3/i5/i7 processors if AVX is available.  */
 	      if (!CPU_FEATURES_CPU_P (cpu_features, AVX))
 		break;
-	    case INTEL_BIGCORE_NEHALEM:
-	    case INTEL_BIGCORE_WESTMERE:
+
+	    enable_modern_features:
 	      /* Rep string instructions, unaligned load, unaligned copy,
 		 and pminub are fast on Intel Core i3, i5 and i7.  */
 	      cpu_features->preferred[index_arch_Fast_Rep_String]
@@ -731,12 +733,20 @@ init_cpu_features (struct cpu_features *cpu_features)
 		      | bit_arch_Prefer_PMINUB_for_stringop);
 	      break;

-	    /*
-	     Default tuned Bigcore microarch.
+	    case INTEL_BIGCORE_NEHALEM:
+	    case INTEL_BIGCORE_WESTMERE:
+	      /* Older CPUs prefer non-temporal stores at lower threshold.  */
+	      cpu_features->cachesize_non_temporal_divisor = 8;
+	      goto enable_modern_features;
+
+	      /* Default tuned Bigcore microarch.  */
 	    case INTEL_BIGCORE_SANDYBRIDGE:
 	    case INTEL_BIGCORE_IVYBRIDGE:
 	    case INTEL_BIGCORE_HASWELL:
 	    case INTEL_BIGCORE_BROADWELL:
+	      cpu_features->cachesize_non_temporal_divisor = 8;
+	      goto default_tuning;
+
 	    case INTEL_BIGCORE_SKYLAKE:
 	    case INTEL_BIGCORE_AMBERLAKE:
 	    case INTEL_BIGCORE_COFFEELAKE:
@@ -755,13 +765,14 @@ init_cpu_features (struct cpu_features *cpu_features)
 	    case INTEL_BIGCORE_SAPPHIRERAPIDS:
 	    case INTEL_BIGCORE_EMERALDRAPIDS:
 	    case INTEL_BIGCORE_GRANITERAPIDS:
-	    */
+	      cpu_features->cachesize_non_temporal_divisor = 2;
+	      goto default_tuning;

-	    /*
-	     Default tuned Mixed (bigcore + atom SOC).
+	      /* Default tuned Mixed (bigcore + atom SOC).  */
 	    case INTEL_MIXED_LAKEFIELD:
 	    case INTEL_MIXED_ALDERLAKE:
-	    */
+	      cpu_features->cachesize_non_temporal_divisor = 2;
+	      goto default_tuning;
 	    }

       /* Disable TSX on some processors to avoid TSX on kernels that
diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
index 4a1a5423ff..864b00a521 100644
--- a/sysdeps/x86/dl-cacheinfo.h
+++ b/sysdeps/x86/dl-cacheinfo.h
@@ -738,19 +738,25 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
   cpu_features->level3_cache_linesize = level3_cache_linesize;
   cpu_features->level4_cache_size = level4_cache_size;

-  /* The default setting for the non_temporal threshold is 1/4 of size
-     of the chip's cache. For most Intel and AMD processors with an
-     initial release date between 2017 and 2023, a thread's typical
-     share of the cache is from 18-64MB. Using the 1/4 L3 is meant to
-     estimate the point where non-temporal stores begin outcompeting
-     REP MOVSB. As well the point where the fact that non-temporal
-     stores are forced back to main memory would already occurred to the
-     majority of the lines in the copy. Note, concerns about the
-     entire L3 cache being evicted by the copy are mostly alleviated
-     by the fact that modern HW detects streaming patterns and
-     provides proper LRU hints so that the maximum thrashing
-     capped at 1/associativity. */
-  unsigned long int non_temporal_threshold = shared / 4;
+  unsigned long int cachesize_non_temporal_divisor
+      = cpu_features->cachesize_non_temporal_divisor;
+  if (cachesize_non_temporal_divisor <= 0)
+    cachesize_non_temporal_divisor = 4;
+
+  /* The default setting for the non_temporal threshold is [1/2, 1/8] of size
+     of the chip's cache (depending on `cachesize_non_temporal_divisor` which
+     is microarch specific. The default is 1/4). For most Intel and AMD
+     processors with an initial release date between 2017 and 2023, a thread's
+     typical share of the cache is from 18-64MB. Using a reasonable size
+     fraction of L3 is meant to estimate the point where non-temporal stores
+     begin outcompeting REP MOVSB. As well the point where the fact that
+     non-temporal stores are forced back to main memory would already occurred
+     to the majority of the lines in the copy. Note, concerns about the entire
+     L3 cache being evicted by the copy are mostly alleviated by the fact that
+     modern HW detects streaming patterns and provides proper LRU hints so that
+     the maximum thrashing capped at 1/associativity. */
+  unsigned long int non_temporal_threshold
+      = shared / cachesize_non_temporal_divisor;
   /* If no ERMS, we use the per-thread L3 chunking. Normal cacheable stores run
      a higher risk of actually thrashing the cache as they don't have a HW LRU
      hint. As well, there performance in highly parallel situations is
diff --git a/sysdeps/x86/include/cpu-features.h b/sysdeps/x86/include/cpu-features.h
index 40b8129d6a..f5b9dd54fe 100644
--- a/sysdeps/x86/include/cpu-features.h
+++ b/sysdeps/x86/include/cpu-features.h
@@ -915,6 +915,9 @@ struct cpu_features
   unsigned long int shared_cache_size;
   /* Threshold to use non temporal store.  */
   unsigned long int non_temporal_threshold;
+  /* When no user non_temporal_threshold is specified. We default to
+     cachesize / cachesize_non_temporal_divisor.  */
+  unsigned long int cachesize_non_temporal_divisor;
   /* Threshold to use "rep movsb".  */
   unsigned long int rep_movsb_threshold;
   /* Threshold to stop using "rep movsb".  */
-- 
2.34.1
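
For readers skimming the patch, the divisor selection and the default
threshold computation can be sketched in isolation. This is a simplified
illustration and not the glibc code: the enum `uarch_group` and the
helpers `divisor_for`, `nt_threshold` and `example_usage` are
hypothetical names introduced here, the 32 MiB cache size is an assumed
input, and the sketch leaves out everything else dl_init_cacheinfo does
with the value afterwards (for example the no-ERMS handling visible in
the context lines).

```c
#include <assert.h>

/* Hypothetical grouping of the microarchitectures named in the patch.  */
enum uarch_group
{
  UARCH_UNKNOWN,            /* Anything the switch does not list.  */
  UARCH_BROADWELL_OR_OLDER, /* Nehalem, Westmere, Sandy Bridge .. Broadwell.  */
  UARCH_SKYLAKE_OR_NEWER    /* Skylake and later bigcore, plus mixed SOCs.  */
};

/* Mirror the per-group divisors the patch assigns.  */
static unsigned long int
divisor_for (enum uarch_group group)
{
  switch (group)
    {
    case UARCH_BROADWELL_OR_OLDER:
      return 8;
    case UARCH_SKYLAKE_OR_NEWER:
      return 2;
    default:
      return 4; /* Default set at the top of init_cpu_features.  */
    }
}

/* Default threshold: shared cache size divided by the divisor.  A zero
   divisor falls back to 4; the value is unsigned, so the patch's
   `<= 0` test is equivalent to `== 0`.  */
static unsigned long int
nt_threshold (unsigned long int shared, unsigned long int divisor)
{
  if (divisor == 0)
    divisor = 4;
  return shared / divisor;
}

/* Worked example with an assumed 32 MiB shared cache.  */
void
example_usage (void)
{
  const unsigned long int mib = 1024UL * 1024UL;
  const unsigned long int shared = 32UL * mib;

  assert (nt_threshold (shared, divisor_for (UARCH_SKYLAKE_OR_NEWER))
          == 16UL * mib);
  assert (nt_threshold (shared, divisor_for (UARCH_BROADWELL_OR_OLDER))
          == 4UL * mib);
  assert (nt_threshold (shared, 0) == 8UL * mib);
}
```

With the same cache size, the older group switches to non-temporal
stores at a quarter of the copy size that the newer group does, which is
the effect the commit message describes.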