public inbox for libc-alpha@sourceware.org
From: DJ Delorie <dj@redhat.com>
To: Noah Goldstein <goldstein.w.n@gmail.com>
Cc: libc-alpha@sourceware.org, goldstein.w.n@gmail.com,
	hjl.tools@gmail.com, carlos@systemhalted.org
Subject: Re: [PATCH v9 3/3] x86: Make the divisor in setting `non_temporal_threshold` cpu specific
Date: Thu, 25 May 2023 23:34:53 -0400
Message-ID: <xnttvz7oki.fsf@greed.delorie.com>
In-Reply-To: <20230513051906.1287611-3-goldstein.w.n@gmail.com>


One question about upgradability, one comment nit that I don't care
about but include for completeness.

Noah Goldstein via Libc-alpha <libc-alpha@sourceware.org> writes:
> Different systems prefer different divisors.
>
> From benchmarks[1] so far the following divisors have been found:
>     ICX     : 2
>     SKX     : 2
>     BWD     : 8
>
> For Intel, we are generalizing that BWD and older prefers 8 as a
> divisor, and SKL and newer prefers 2. This number can be further tuned
> as benchmarks are run.
>
> [1]: https://github.com/goldsteinn/memcpy-nt-benchmarks
> ---
>  sysdeps/x86/cpu-features.c         | 27 +++++++++++++++++--------
>  sysdeps/x86/dl-cacheinfo.h         | 32 ++++++++++++++++++------------
>  sysdeps/x86/include/cpu-features.h |  3 +++
>  3 files changed, 41 insertions(+), 21 deletions(-)
>

> diff --git a/sysdeps/x86/include/cpu-features.h b/sysdeps/x86/include/cpu-features.h
> index 40b8129d6a..f5b9dd54fe 100644
> --- a/sysdeps/x86/include/cpu-features.h
> +++ b/sysdeps/x86/include/cpu-features.h
> @@ -915,6 +915,9 @@ struct cpu_features
>    unsigned long int shared_cache_size;
>    /* Threshold to use non temporal store.  */
>    unsigned long int non_temporal_threshold;
> +  /* When no user non_temporal_threshold is specified. We default to
> +     cachesize / cachesize_non_temporal_divisor.  */
> +  unsigned long int cachesize_non_temporal_divisor;
>    /* Threshold to use "rep movsb".  */
>    unsigned long int rep_movsb_threshold;
>    /* Threshold to stop using "rep movsb".  */

This adds a new field to "struct cpu_features".  Is this structure
shared between ld.so and libc.so (i.e. tunables related)?  If so, does
this field need to be added to the end of the struct, so as not to
cause problems during an upgrade (when we have an old ld.so and a new
libc.so)?
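
To illustrate the concern, here is a minimal sketch using hypothetical,
trimmed-down structs (these are not the real glibc layouts, and whether
the real struct is actually shared across that boundary is exactly the
open question).  Inserting a field in the middle shifts the offset of
every later field, while appending it does not:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct cpu_features; only two of the
   real fields are kept so the offsets are easy to follow.  */
struct features_old
{
  unsigned long int non_temporal_threshold;
  unsigned long int rep_movsb_threshold;
};

/* New field inserted in the middle, as the patch does.  */
struct features_new_middle
{
  unsigned long int non_temporal_threshold;
  unsigned long int cachesize_non_temporal_divisor;
  unsigned long int rep_movsb_threshold;
};

/* New field appended at the end instead.  */
struct features_new_end
{
  unsigned long int non_temporal_threshold;
  unsigned long int rep_movsb_threshold;
  unsigned long int cachesize_non_temporal_divisor;
};

/* Nonzero if rep_movsb_threshold keeps its old offset when the new
   field is inserted in the middle.  */
static int
rep_movsb_offset_unchanged_middle (void)
{
  return offsetof (struct features_old, rep_movsb_threshold)
	 == offsetof (struct features_new_middle, rep_movsb_threshold);
}

/* Nonzero if rep_movsb_threshold keeps its old offset when the new
   field is appended at the end.  */
static int
rep_movsb_offset_unchanged_end (void)
{
  return offsetof (struct features_old, rep_movsb_threshold)
	 == offsetof (struct features_new_end, rep_movsb_threshold);
}
```

If two objects built against different layouts ever read the same
instance, only the appended variant keeps the existing fields where the
older code expects them.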

> diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
> index 4a1a5423ff..864b00a521 100644
> --- a/sysdeps/x86/dl-cacheinfo.h
> +++ b/sysdeps/x86/dl-cacheinfo.h
> @@ -738,19 +738,25 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
>    cpu_features->level3_cache_linesize = level3_cache_linesize;
>    cpu_features->level4_cache_size = level4_cache_size;
>  
> -  /* The default setting for the non_temporal threshold is 1/4 of size
> -     of the chip's cache. For most Intel and AMD processors with an
> -     initial release date between 2017 and 2023, a thread's typical
> -     share of the cache is from 18-64MB. Using the 1/4 L3 is meant to
> -     estimate the point where non-temporal stores begin outcompeting
> -     REP MOVSB. As well the point where the fact that non-temporal
> -     stores are forced back to main memory would already occurred to the
> -     majority of the lines in the copy. Note, concerns about the
> -     entire L3 cache being evicted by the copy are mostly alleviated
> -     by the fact that modern HW detects streaming patterns and
> -     provides proper LRU hints so that the maximum thrashing
> -     capped at 1/associativity. */
> -  unsigned long int non_temporal_threshold = shared / 4;

> +  unsigned long int cachesize_non_temporal_divisor
> +      = cpu_features->cachesize_non_temporal_divisor;
> +  if (cachesize_non_temporal_divisor <= 0)
> +    cachesize_non_temporal_divisor = 4;
> +
> +  /* The default setting for the non_temporal threshold is [1/2, 1/8] of size

FYI this range is backwards ;-)  (low to high it would be [1/8, 1/2].)

> +     of the chip's cache (depending on `cachesize_non_temporal_divisor` which
> +     is microarch specific. The defeault is 1/4). For most Intel and AMD
> +     processors with an initial release date between 2017 and 2023, a thread's
> +     typical share of the cache is from 18-64MB. Using a reasonable size
> +     fraction of L3 is meant to estimate the point where non-temporal stores
> +     begin outcompeting REP MOVSB. As well the point where the fact that
> +     non-temporal stores are forced back to main memory would already occurred
> +     to the majority of the lines in the copy. Note, concerns about the entire
> +     L3 cache being evicted by the copy are mostly alleviated by the fact that
> +     modern HW detects streaming patterns and provides proper LRU hints so that
> +     the maximum thrashing capped at 1/associativity. */
> +  unsigned long int non_temporal_threshold
> +      = shared / cachesize_non_temporal_divisor;
>    /* If no ERMS, we use the per-thread L3 chunking. Normal cacheable stores run
>       a higher risk of actually thrashing the cache as they don't have a HW LRU
>       hint. As well, there performance in highly parallel situations is

Ok, defaults to the same behavior.
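
For reference, the fallback can be sketched as below.  The helper name
is hypothetical and it covers only the division step, not the later
adjustments in dl_init_cacheinfo.  Since the field is unsigned, the
"<= 0" test in the patch can only ever match zero:

```c
#include <assert.h>

/* Hypothetical helper mirroring the patched default computation:
   a divisor of 0 (unset) falls back to 4, matching the old
   "shared / 4" behavior.  */
static unsigned long int
default_non_temporal_threshold (unsigned long int shared,
				unsigned long int divisor)
{
  if (divisor == 0)
    divisor = 4;
  return shared / divisor;
}
```

With a 32MB shared cache this gives 8MB for an unset divisor, 16MB for
a divisor of 2, and 4MB for a divisor of 8.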


> diff --git a/sysdeps/x86/cpu-features.c b/sysdeps/x86/cpu-features.c
> index 29b8c8c133..ba789d6fc1 100644
> --- a/sysdeps/x86/cpu-features.c
> +++ b/sysdeps/x86/cpu-features.c
> @@ -635,6 +635,7 @@ init_cpu_features (struct cpu_features *cpu_features)
>    unsigned int stepping = 0;
>    enum cpu_features_kind kind;
>  
> +  cpu_features->cachesize_non_temporal_divisor = 4;

Ok.

> @@ -714,12 +715,13 @@ init_cpu_features (struct cpu_features *cpu_features)
>  
>  	      /* Bigcore/Default Tuning.  */
>  	    default:
> +	    default_tuning:
>  	      /* Unknown family 0x06 processors.  Assuming this is one
>  		 of Core i3/i5/i7 processors if AVX is available.  */
>  	      if (!CPU_FEATURES_CPU_P (cpu_features, AVX))
>  		break;

Ok.

> -	    case INTEL_BIGCORE_NEHALEM:
> -	    case INTEL_BIGCORE_WESTMERE:
> +
> +	    enable_modern_features:

Ok.

>  	      /* Rep string instructions, unaligned load, unaligned copy,
>  		 and pminub are fast on Intel Core i3, i5 and i7.  */
>  	      cpu_features->preferred[index_arch_Fast_Rep_String]
> @@ -728,12 +730,20 @@ init_cpu_features (struct cpu_features *cpu_features)
>  		      | bit_arch_Prefer_PMINUB_for_stringop);
>  	      break;
>  
> -	   /*
> -	    Default tuned Bigcore microarch.

Note comment begin removed here...

> +	    case INTEL_BIGCORE_NEHALEM:
> +	    case INTEL_BIGCORE_WESTMERE:
> +	      /* Older CPUs prefer non-temporal stores at lower threshold.  */
> +	      cpu_features->cachesize_non_temporal_divisor = 8;
> +	      goto enable_modern_features;
> +
> +	      /* Default tuned Bigcore microarch.  */

Ok.

>  	    case INTEL_BIGCORE_SANDYBRIDGE:
>  	    case INTEL_BIGCORE_IVYBRIDGE:
>  	    case INTEL_BIGCORE_HASWELL:
>  	    case INTEL_BIGCORE_BROADWELL:
> +	      cpu_features->cachesize_non_temporal_divisor = 8;
> +	      goto default_tuning;
> +

Ok.

>  	    case INTEL_BIGCORE_SKYLAKE:
>  	    case INTEL_BIGCORE_KABYLAKE:
>  	    case INTEL_BIGCORE_COMETLAKE:

Note nothing but more cases here, ok.

>  	    case INTEL_BIGCORE_SAPPHIRERAPIDS:
>  	    case INTEL_BIGCORE_EMERALDRAPIDS:
>  	    case INTEL_BIGCORE_GRANITERAPIDS:
> -	    */

... and comment end removed here.  Ok.

> +	      cpu_features->cachesize_non_temporal_divisor = 2;
> +	      goto default_tuning;

Ok.

> -	   /*
> -	    Default tuned Mixed (bigcore + atom SOC).
> +	      /* Default tuned Mixed (bigcore + atom SOC). */
>  	    case INTEL_MIXED_LAKEFIELD:
>  	    case INTEL_MIXED_ALDERLAKE:
> -	    */
> +	      cpu_features->cachesize_non_temporal_divisor = 2;
> +	      goto default_tuning;
>  	    }

Ok.
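
As I read the hunks above, the resulting mapping can be summarized with
the sketch below.  The enum and function names are hypothetical
groupings for illustration only; the patch itself sets the field inside
the existing switch over the INTEL_* microarch cases:

```c
#include <assert.h>

/* Hypothetical grouping of the microarch cases touched by the patch.  */
enum uarch_group
{
  UARCH_UNKNOWN,		/* Anything not listed: keeps the default.  */
  UARCH_NEHALEM_WESTMERE,	/* INTEL_BIGCORE_NEHALEM, _WESTMERE.  */
  UARCH_SANDYBRIDGE_TO_BROADWELL,
  UARCH_SKYLAKE_AND_NEWER,	/* INTEL_BIGCORE_SKYLAKE onwards.  */
  UARCH_MIXED			/* INTEL_MIXED_LAKEFIELD, _ALDERLAKE.  */
};

/* Return the cachesize_non_temporal_divisor chosen for each group.  */
static unsigned long int
divisor_for (enum uarch_group group)
{
  switch (group)
    {
    case UARCH_NEHALEM_WESTMERE:
    case UARCH_SANDYBRIDGE_TO_BROADWELL:
      return 8;
    case UARCH_SKYLAKE_AND_NEWER:
    case UARCH_MIXED:
      return 2;
    default:
      /* Value set at the top of init_cpu_features.  */
      return 4;
    }
}
```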

