public inbox for libc-alpha@sourceware.org
* [PATCH] Reversing calculation of __x86_shared_non_temporal_threshold
From: Patrick McGehearty @ 2020-09-23 15:13 UTC
  To: libc-alpha

The __x86_shared_non_temporal_threshold determines when memcpy on x86
uses non-temporal stores to avoid pushing other data out of the
last-level cache.

This patch proposes to revert the calculation change made by H.J. Lu's
patch of June 2, 2017.

H.J. Lu's patch selected a threshold suitable for a single thread
getting maximum performance. It was tuned using the single-threaded
large memcpy micro benchmark on an 8-core processor. That change
moved the point at which memcpy switches to non-temporal stores from
3/4 of one thread's share of the cache to 3/4 of the entire cache of
a multi-threaded system. Multi-threaded systems with more than a few
threads are server-class and typically have many active threads. If
one thread consumes 3/4 of the cache available to all threads, it
evicts data belonging to the other active threads.

Two examples show the range of the effect. John McCalpin's Stream
benchmark, which runs in parallel and fetches data sequentially, saw
a 20% slowdown with the 2017 patch on an internal system test of 128
threads. This regression was discovered when comparing OL8
performance to OL7. An example that compares normal stores to
non-temporal stores may be found at
https://vgatherps.github.io/2018-09-02-nontemporal/. A simple test
shows a performance loss of 400 to 500% when non-temporal stores are
not used. These performance losses are most likely to occur when the
system load is heaviest and good performance is critical.
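
To make the difference between the two defaults concrete, here is a
minimal sketch of the arithmetic. The function names are illustrative
and do not exist in glibc, and the sketch assumes that
__x86_shared_cache_size holds one thread's share of the shared cache,
as the description above implies.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative only: default before this patch, i.e. 3/4 of the whole
   shared cache, reached by scaling one thread's share by the thread
   count.  */
size_t
threshold_all_threads (size_t per_thread_share, unsigned int threads)
{
  assert (threads > 0);
  return per_thread_share * threads * 3 / 4;
}

/* Illustrative only: default with this patch, i.e. 3/4 of one
   thread's share of the shared cache.  */
size_t
threshold_one_thread (size_t per_thread_share)
{
  return per_thread_share * 3 / 4;
}
```

With a hypothetical 32 MiB shared cache and 16 threads, one thread's
share is 2 MiB (2097152 bytes). The first function then yields
25165824 bytes (24 MiB) and the second 1572864 bytes (1.5 MiB), so
the proposed default switches to non-temporal stores at a copy size
16 times smaller.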

A knowledgeable user who really wants maximum cache allocation for a
single thread in a multi-threaded system can override the default
with the x86_non_temporal_threshold tunable.
The manual entry for the tunable has been expanded to provide
more information about its purpose.
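
As a rough sketch of how such an override might be applied, the
helper below builds a GLIBC_TUNABLES value and starts a program with
it. Tunables are read at process start, so the variable has to be in
the environment before the exec. Both function names are hypothetical.
The glibc.tune namespace is the one shown in the manual entry touched
by this patch; other glibc releases may use a different namespace, so
check the manual for the installed version.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: write "name=bytes" for the tunable into BUF.
   Returns the snprintf result (length needed, or negative on error).  */
int
format_threshold_tunable (char *buf, size_t len, size_t bytes)
{
  assert (buf != NULL && len > 0);
  return snprintf (buf, len,
                   "glibc.tune.x86_non_temporal_threshold=%zu", bytes);
}

/* Hypothetical helper: replace the current image with ARGV, running
   it with the given threshold.  Any existing GLIBC_TUNABLES value is
   overwritten.  Returns -1 only on failure.  */
int
run_with_threshold (size_t bytes, char *const argv[])
{
  char value[64];
  int n = format_threshold_tunable (value, sizeof value, bytes);
  if (n < 0 || (size_t) n >= sizeof value)
    return -1;
  if (setenv ("GLIBC_TUNABLES", value, 1) != 0)
    return -1;
  execvp (argv[0], argv);
  return -1; /* Reached only if execvp failed.  */
}
```

Calling run_with_threshold (25165824, argv) would, under these
assumptions, start the program with a 24 MiB threshold, which matches
the pre-patch default in the hypothetical 32 MiB example.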

	modified: sysdeps/x86/cacheinfo.c
	modified: manual/tunables.texi
---
 manual/tunables.texi    | 6 +++++-
 sysdeps/x86/cacheinfo.c | 2 +-
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/manual/tunables.texi b/manual/tunables.texi
index b6bb54d..94d4fbd 100644
--- a/manual/tunables.texi
+++ b/manual/tunables.texi
@@ -364,7 +364,11 @@ set shared cache size in bytes for use in memory and string routines.
 
 @deftp Tunable glibc.tune.x86_non_temporal_threshold
 The @code{glibc.tune.x86_non_temporal_threshold} tunable allows the user
-to set threshold in bytes for non temporal store.
+to set threshold in bytes for non temporal store. Non temporal stores
+give a hint to the hardware to move data directly to memory without
+displacing other data from the cache. This tunable is used by some
+platforms to determine when to use non temporal stores in operations
+like memmove and memcpy.
 
 This tunable is specific to i386 and x86-64.
 @end deftp
diff --git a/sysdeps/x86/cacheinfo.c b/sysdeps/x86/cacheinfo.c
index b9444dd..5c5192a 100644
--- a/sysdeps/x86/cacheinfo.c
+++ b/sysdeps/x86/cacheinfo.c
@@ -785,7 +785,7 @@ intel_bug_no_cache_info:
   __x86_shared_non_temporal_threshold
     = (cpu_features->non_temporal_threshold != 0
        ? cpu_features->non_temporal_threshold
-       : __x86_shared_cache_size * threads * 3 / 4);
+       : __x86_shared_cache_size * 3 / 4);
 }
 
 #endif
-- 
1.8.3.1



* Re: [PATCH] Reversing calculation of __x86_shared_non_temporal_threshold
From: H.J. Lu @ 2020-09-23 17:19 UTC
  To: Patrick McGehearty; +Cc: GNU C Library

On Wed, Sep 23, 2020 at 8:14 AM Patrick McGehearty via Libc-alpha
<libc-alpha@sourceware.org> wrote:
> diff --git a/sysdeps/x86/cacheinfo.c b/sysdeps/x86/cacheinfo.c
> index b9444dd..5c5192a 100644
> --- a/sysdeps/x86/cacheinfo.c
> +++ b/sysdeps/x86/cacheinfo.c
> @@ -785,7 +785,7 @@ intel_bug_no_cache_info:
>    __x86_shared_non_temporal_threshold
>      = (cpu_features->non_temporal_threshold != 0
>         ? cpu_features->non_temporal_threshold
> -       : __x86_shared_cache_size * threads * 3 / 4);
> +       : __x86_shared_cache_size * 3 / 4);
>  }
>

You need to update the comment for __x86_shared_non_temporal_threshold.

-- 
H.J.
