From: Noah Goldstein
Date: Sat, 27 May 2023 13:46:14 -0500
Subject: Re: [PATCH v9 3/3] x86: Make the divisor in setting `non_temporal_threshold` cpu specific
To: DJ Delorie
Cc: libc-alpha@sourceware.org, hjl.tools@gmail.com, carlos@systemhalted.org
References: <20230513051906.1287611-3-goldstein.w.n@gmail.com>

On Thu, May 25, 2023 at 10:34 PM DJ Delorie wrote:
>
>
> One question about upgradability, one comment nit that I don't care
> about but include for completeness.
>
> Noah Goldstein via Libc-alpha writes:
> > Different systems prefer a different divisors.
> >
> > From benchmarks[1] so far the following divisors have been found:
> > ICX : 2
> > SKX : 2
> > BWD : 8
> >
> > For Intel, we are generalizing that BWD and older prefers 8 as a
> > divisor, and SKL and newer prefers 2. This number can be further tuned
> > as benchmarks are run.
> >
> > [1]: https://github.com/goldsteinn/memcpy-nt-benchmarks
> > ---
> >  sysdeps/x86/cpu-features.c         | 27 +++++++++++++++++--------
> >  sysdeps/x86/dl-cacheinfo.h         | 32 ++++++++++++++++++------------
> >  sysdeps/x86/include/cpu-features.h |  3 +++
> >  3 files changed, 41 insertions(+), 21 deletions(-)
> >
>
> > diff --git a/sysdeps/x86/include/cpu-features.h b/sysdeps/x86/include/cpu-features.h
> > index 40b8129d6a..f5b9dd54fe 100644
> > --- a/sysdeps/x86/include/cpu-features.h
> > +++ b/sysdeps/x86/include/cpu-features.h
> > @@ -915,6 +915,9 @@ struct cpu_features
> >    unsigned long int shared_cache_size;
> >    /* Threshold to use non temporal store. */
> >    unsigned long int non_temporal_threshold;
> > +  /* When no user non_temporal_threshold is specified. We default to
> > +     cachesize / cachesize_non_temporal_divisor. */
> > +  unsigned long int cachesize_non_temporal_divisor;
> >    /* Threshold to use "rep movsb". */
> >    unsigned long int rep_movsb_threshold;
> >    /* Threshold to stop using "rep movsb". */
>
> This adds a new field to "struct cpu_features". Is this structure
> something that is shared between ld.so and libc.so ? I.e. tunables
> related? If so, does this field need to be added to the end of the
> struct, so as to not cause problems during an upgrade (when we have an
> old ld.so and a new libc.so)?

Not sure. HJ, do you know? I moved it to the end of the struct for now,
as a kind of "why not".

>
> > diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
> > index 4a1a5423ff..864b00a521 100644
> > --- a/sysdeps/x86/dl-cacheinfo.h
> > +++ b/sysdeps/x86/dl-cacheinfo.h
> > @@ -738,19 +738,25 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
> >    cpu_features->level3_cache_linesize = level3_cache_linesize;
> >    cpu_features->level4_cache_size = level4_cache_size;
> >
> > -  /* The default setting for the non_temporal threshold is 1/4 of size
> > -     of the chip's cache. For most Intel and AMD processors with an
> > -     initial release date between 2017 and 2023, a thread's typical
> > -     share of the cache is from 18-64MB. Using the 1/4 L3 is meant to
> > -     estimate the point where non-temporal stores begin outcompeting
> > -     REP MOVSB. As well the point where the fact that non-temporal
> > -     stores are forced back to main memory would already occurred to the
> > -     majority of the lines in the copy. Note, concerns about the
> > -     entire L3 cache being evicted by the copy are mostly alleviated
> > -     by the fact that modern HW detects streaming patterns and
> > -     provides proper LRU hints so that the maximum thrashing
> > -     capped at 1/associativity. */
> > -  unsigned long int non_temporal_threshold = shared / 4;
>
> > +  unsigned long int cachesize_non_temporal_divisor
> > +      = cpu_features->cachesize_non_temporal_divisor;
> > +  if (cachesize_non_temporal_divisor <= 0)
> > +    cachesize_non_temporal_divisor = 4;
> > +
> > +  /* The default setting for the non_temporal threshold is [1/2, 1/8] of size
>
> FYI this range is backwards ;-)

Fixed.

>
> > +     of the chip's cache (depending on `cachesize_non_temporal_divisor` which
> > +     is microarch specific. The defeault is 1/4). For most Intel and AMD
> > +     processors with an initial release date between 2017 and 2023, a thread's
> > +     typical share of the cache is from 18-64MB. Using a reasonable size
> > +     fraction of L3 is meant to estimate the point where non-temporal stores
> > +     begin outcompeting REP MOVSB. As well the point where the fact that
> > +     non-temporal stores are forced back to main memory would already occurred
> > +     to the majority of the lines in the copy. Note, concerns about the entire
> > +     L3 cache being evicted by the copy are mostly alleviated by the fact that
> > +     modern HW detects streaming patterns and provides proper LRU hints so that
> > +     the maximum thrashing capped at 1/associativity. */
> > +  unsigned long int non_temporal_threshold
> > +      = shared / cachesize_non_temporal_divisor;
> >    /* If no ERMS, we use the per-thread L3 chunking. Normal cacheable stores run
> >       a higher risk of actually thrashing the cache as they don't have a HW LRU
> >       hint. As well, there performance in highly parallel situations is
>
> Ok, defaults to the same behavior.
>
>
> > diff --git a/sysdeps/x86/cpu-features.c b/sysdeps/x86/cpu-features.c
> > index 29b8c8c133..ba789d6fc1 100644
> > --- a/sysdeps/x86/cpu-features.c
> > +++ b/sysdeps/x86/cpu-features.c
> > @@ -635,6 +635,7 @@ init_cpu_features (struct cpu_features *cpu_features)
> >    unsigned int stepping = 0;
> >    enum cpu_features_kind kind;
> >
> > +  cpu_features->cachesize_non_temporal_divisor = 4;
>
> Ok.
>
> > @@ -714,12 +715,13 @@ init_cpu_features (struct cpu_features *cpu_features)
> >
> >        /* Bigcore/Default Tuning. */
> >        default:
> > +      default_tuning:
> >          /* Unknown family 0x06 processors. Assuming this is one
> >             of Core i3/i5/i7 processors if AVX is available. */
> >          if (!CPU_FEATURES_CPU_P (cpu_features, AVX))
> >            break;
>
> Ok.
>
> > -      case INTEL_BIGCORE_NEHALEM:
> > -      case INTEL_BIGCORE_WESTMERE:
> > +
> > +      enable_modern_features:
>
> Ok.
>
> >          /* Rep string instructions, unaligned load, unaligned copy,
> >             and pminub are fast on Intel Core i3, i5 and i7. */
> >          cpu_features->preferred[index_arch_Fast_Rep_String]
> > @@ -728,12 +730,20 @@ init_cpu_features (struct cpu_features *cpu_features)
> >              | bit_arch_Prefer_PMINUB_for_stringop);
> >          break;
> >
> > -      /*
> > -       Default tuned Bigcore microarch.
>
> Note comment begin removed here...
>
> > +      case INTEL_BIGCORE_NEHALEM:
> > +      case INTEL_BIGCORE_WESTMERE:
> > +        /* Older CPUs prefer non-temporal stores at lower threshold. */
> > +        cpu_features->cachesize_non_temporal_divisor = 8;
> > +        goto enable_modern_features;
> > +
> > +      /* Default tuned Bigcore microarch. */
>
> Ok.
>
> >      case INTEL_BIGCORE_SANDYBRIDGE:
> >      case INTEL_BIGCORE_IVYBRIDGE:
> >      case INTEL_BIGCORE_HASWELL:
> >      case INTEL_BIGCORE_BROADWELL:
> > +      cpu_features->cachesize_non_temporal_divisor = 8;
> > +      goto default_tuning;
> > +
>
> Ok.
>
> >      case INTEL_BIGCORE_SKYLAKE:
> >      case INTEL_BIGCORE_KABYLAKE:
> >      case INTEL_BIGCORE_COMETLAKE:
> Note nothing but more cases here, ok.
> >      case INTEL_BIGCORE_SAPPHIRERAPIDS:
> >      case INTEL_BIGCORE_EMERALDRAPIDS:
> >      case INTEL_BIGCORE_GRANITERAPIDS:
> > -      */
>
> ... and comment end removed here. Ok.
>
> > +      cpu_features->cachesize_non_temporal_divisor = 2;
> > +      goto default_tuning;
>
> Ok.
>
> > -      /*
> > -       Default tuned Mixed (bigcore + atom SOC).
> > +      /* Default tuned Mixed (bigcore + atom SOC). */
> >      case INTEL_MIXED_LAKEFIELD:
> >      case INTEL_MIXED_ALDERLAKE:
> > -      */
> > +      cpu_features->cachesize_non_temporal_divisor = 2;
> > +      goto default_tuning;
> >    }
>
> Ok.
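
For anyone skimming the thread, here is a minimal standalone sketch of how
the per-microarchitecture divisor feeds the default threshold. It is not the
glibc code: `uarch_group`, `pick_divisor` and `default_nt_threshold` are
hypothetical names used only for illustration, and the three groups merely
summarize the INTEL_* cases quoted above.

```c
/* Hypothetical summary of the INTEL_* case groups in the patch.  */
enum uarch_group
{
  UARCH_UNKNOWN,            /* Anything not listed keeps the default.  */
  UARCH_BROADWELL_OR_OLDER, /* Nehalem through Broadwell.  */
  UARCH_SKYLAKE_OR_NEWER    /* Skylake and newer, plus the mixed SOCs.  */
};

/* Divisors as assigned in the quoted hunks: 8 for Broadwell and older,
   2 for Skylake and newer, 4 when nothing more specific is known.  */
static unsigned long int
pick_divisor (enum uarch_group group)
{
  switch (group)
    {
    case UARCH_BROADWELL_OR_OLDER:
      return 8;
    case UARCH_SKYLAKE_OR_NEWER:
      return 2;
    default:
      return 4;
    }
}

/* Same shape as the dl_init_cacheinfo change: fall back to 4 when the
   divisor was never set, then divide the shared cache size.  The patch
   writes the check as `<= 0`; for an unsigned value that is the same
   as `== 0`.  */
static unsigned long int
default_nt_threshold (unsigned long int shared, unsigned long int divisor)
{
  if (divisor == 0)
    divisor = 4;
  return shared / divisor;
}
```

With an assumed shared cache size of 32 MiB, this gives a 4 MiB threshold
for the Broadwell-and-older group, 16 MiB for Skylake-and-newer, and 8 MiB
otherwise.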