* [PATCH 0/1] x86: Tuning NT Threshold parameter for AMD machines
@ 2020-08-19 10:45 Sajan Karumanchi
2020-08-19 10:45 ` [PATCH 1/1] " Sajan Karumanchi
0 siblings, 1 reply; 5+ messages in thread
From: Sajan Karumanchi @ 2020-08-19 10:45 UTC
To: libc-alpha, carlos; +Cc: Sajan Karumanchi, premachandra.mallappa
Tuning the NT threshold parameter '__x86_shared_non_temporal_threshold' to 2/3 of
the shared cache size on AMD Zen[1|2] machines brings performance gains
for memcpy/memmove, as per the Large and Walk bench variant results.
As there are run-to-run variations in the bench results, I took the average of 100 runs
for both vanilla and patched glibc.
AMD Zen[1/2] architectures don't have the ERMS CPU feature.
So, on the Zen architecture, memcpy takes the 'memcpy_avx_unaligned' entry point.
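The entry-point choice described above can be pictured with a small sketch.
This is a simplified model, not glibc's actual ifunc selector: the struct
cpu_caps, the function select_memcpy_entry and the "memcpy_generic"
placeholder are hypothetical names introduced only for illustration, and
the real selector considers many more CPU features.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified view of the CPU features relevant here.  */
struct cpu_caps
{
  bool avx_fast_unaligned; /* AVX unaligned loads/stores are fast.  */
  bool erms;               /* Enhanced REP MOVSB/STOSB is available.  */
};

/* Return the name of the memcpy entry point this simplified model
   picks.  Without ERMS, the plain avx_unaligned variant is chosen,
   which is the Zen[1/2] case described in the cover letter.  */
const char *
select_memcpy_entry (const struct cpu_caps *caps)
{
  assert (caps != NULL);
  if (caps->avx_fast_unaligned)
    return caps->erms ? "memcpy_avx_unaligned_erms" : "memcpy_avx_unaligned";
  return "memcpy_generic"; /* Placeholder for every other variant.  */
}
```

With avx_fast_unaligned set and erms clear, the model returns
"memcpy_avx_unaligned", matching the statement above.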
Below is the Large bench test results comparison for the entry points
avx_unaligned and avx_unaligned_erms.
--------------------------------------------------------------------------
size      load_align  store_align  avx_unaligned(%)  avx_unaligned_erms(%)
--------------------------------------------------------------------------
1048583   0           0            1.89              68.28
1048591   0           3            1.19              94.56
1048607   3           0            -0.25             68.25
1048639   3           5            -90.7             89.69
2097159   0           0            -75.11            43.18
2097167   0           3            -74.08            90.16
2097183   3           0            -78.12            43.81
2097215   3           5            -73.75            90.58
4194311   0           0            -88.5             39.26
4194319   0           3            -72.13            90.21
4194335   3           0            -78.31            43.97
4194367   3           5            -72               90.64
8388615   0           0            -12.22            43.24
8388623   0           3            -15.76            90.3
8388639   3           0            -22.31            39.92
8388671   3           5            -15.34            90.74
16777223  0           0            49.8              46.89
16777231  0           3            52.5              90.14
16777247  3           0            51.82             46.68
16777279  3           5            52.35             90.55
33554439  0           0            41.76             52.72
33554447  0           3            44.17             88.29
33554463  3           0            43.74             53.62
33554495  3           5            44.09             88.78
--------------------------------------------------------------------------
Below is the Walk bench test results comparison for the entry points
avx_unaligned and avx_unaligned_erms.
---------------------------------------------------
size      avx_unaligned(%)  avx_unaligned_erms(%)
---------------------------------------------------
1048576   -0.2              15.03
1048577   0.92              15.52
2097152   40.52             50.92
2097153   40.76             50.84
4194304   40.6              51.22
4194305   40.57             51.25
8388608   40.61             51.23
8388609   40.82             51.32
16777216  40.56             51.11
16777217  40.35             51.29
33554432  40.15             37.41
33554433  20.75             41.22
---------------------------------------------------
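The thread does not spell out how the percentage columns were derived. One
plausible reading, given that the timings were averaged over 100 runs, is
the relative change of the mean patched time against the mean vanilla time.
The sketch below works under that assumption; mean_time and gain_percent
are hypothetical helper names and are not part of the glibc benchtests.

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of N timings.  */
double
mean_time (const double *t, size_t n)
{
  assert (n > 0);
  double sum = 0.0;
  for (size_t i = 0; i < n; i++)
    sum += t[i];
  return sum / (double) n;
}

/* ASSUMPTION: percentage gain of patched over vanilla, computed on the
   mean timings.  Positive means the patched build is faster, negative
   means it is slower.  */
double
gain_percent (const double *vanilla, const double *patched, size_t n)
{
  double v = mean_time (vanilla, n);
  double p = mean_time (patched, n);
  assert (v > 0.0);
  return (v - p) / v * 100.0;
}
```

Under this reading, a mean time that halves gives a gain of 50, and a mean
time that doubles gives -100.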
Question:
Why do we see discrepancies in the results of Large bench, though code path
taken for NT Stores in memcpy is same for both entry points
"memcpy_avx_unaligned" and "memcpy_avx_unaligned_erms"?
Sajan Karumanchi (1):
x86: Tuning NT Threshold parameter for AMD machines.
sysdeps/x86/cacheinfo.c | 18 ++++++++++++++++--
1 file changed, 16 insertions(+), 2 deletions(-)
--
2.17.1
* [PATCH 1/1] x86: Tuning NT Threshold parameter for AMD machines.
2020-08-19 10:45 [PATCH 0/1] x86: Tuning NT Threshold parameter for AMD machines Sajan Karumanchi
@ 2020-08-19 10:45 ` Sajan Karumanchi
2020-09-01 19:23 ` H.J. Lu
0 siblings, 1 reply; 5+ messages in thread
From: Sajan Karumanchi @ 2020-08-19 10:45 UTC
To: libc-alpha, carlos; +Cc: Sajan Karumanchi, premachandra.mallappa
Tuning NT threshold parameter to bring in performance gains of
memcpy/memmove on AMD CPUs.
Based on Large and Walk bench variant results,
setting __x86_shared_non_temporal_threshold to 2/3 of shared cache size
brings in performance gains for memcpy/memmove on AMD machines.
Reviewed-by: Premachandra Mallappa <premachandra.mallappa@amd.com>
Signed-off-by: Premachandra Mallappa <premachandra.mallappa@amd.com>
Signed-off-by: Sajan Karumanchi <sajan.karumanchi@amd.com>
---
sysdeps/x86/cacheinfo.c | 18 ++++++++++++++++--
1 file changed, 16 insertions(+), 2 deletions(-)
diff --git a/sysdeps/x86/cacheinfo.c b/sysdeps/x86/cacheinfo.c
index 217c21c34f..5487f382a8 100644
--- a/sysdeps/x86/cacheinfo.c
+++ b/sysdeps/x86/cacheinfo.c
@@ -829,7 +829,8 @@ init_cacheinfo (void)
}
if (cpu_features->data_cache_size != 0)
- data = cpu_features->data_cache_size;
+ if (data == 0 || cpu_features->basic.kind != arch_kind_amd)
+ data = cpu_features->data_cache_size;
if (data > 0)
{
@@ -842,7 +843,8 @@ init_cacheinfo (void)
}
if (cpu_features->shared_cache_size != 0)
- shared = cpu_features->shared_cache_size;
+ if (shared == 0 || cpu_features->basic.kind != arch_kind_amd)
+ shared = cpu_features->shared_cache_size;
if (shared > 0)
{
@@ -854,6 +856,17 @@ init_cacheinfo (void)
__x86_shared_cache_size = shared;
}
+ if (cpu_features->basic.kind == arch_kind_amd)
+ {
+ /* Large and Walk benchmarks in glibc show that 2/3 of the shared cache
+ size is the threshold above which non-temporal stores perform better. */
+ __x86_shared_non_temporal_threshold
+ = (cpu_features->non_temporal_threshold != 0
+ ? cpu_features->non_temporal_threshold
+ : __x86_shared_cache_size * 2 / 3);
+ }
+ else
+ {
/* The large memcpy micro benchmark in glibc shows that 6 times of
shared cache size is the approximate value above which non-temporal
store becomes faster on a 8-core processor. This is the 3/4 of the
@@ -862,6 +875,7 @@ init_cacheinfo (void)
= (cpu_features->non_temporal_threshold != 0
? cpu_features->non_temporal_threshold
: __x86_shared_cache_size * threads * 3 / 4);
+ }
/* NB: The REP MOVSB threshold must be greater than VEC_SIZE * 8. */
unsigned int minimum_rep_movsb_threshold;
--
2.17.1
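The selection made by the patch above can be restated as a stand-alone
function. This is only a sketch: nt_threshold and its parameters are
hypothetical names, with tunable standing in for
cpu_features->non_temporal_threshold and shared for
__x86_shared_cache_size, both in bytes. It mirrors the arithmetic in the
diff and nothing else from init_cacheinfo.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified model of the threshold selection in the patch.
   TUNABLE models cpu_features->non_temporal_threshold and SHARED
   models __x86_shared_cache_size (hypothetical parameter names).  */
long int
nt_threshold (bool is_amd, long int tunable, long int shared,
              long int threads)
{
  assert (shared >= 0 && threads > 0);
  if (tunable != 0)
    return tunable;                 /* An explicit setting always wins.  */
  if (is_amd)
    return shared * 2 / 3;          /* Value proposed by the patch.  */
  return shared * threads * 3 / 4;  /* Default kept for other CPUs.  */
}
```

For an illustrative shared cache size of 16 MiB and 8 threads (numbers
chosen only for the example, not taken from the benchmark machine), the AMD
branch yields 11184810 bytes while the other branch yields 100663296 bytes.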
* Re: [PATCH 1/1] x86: Tuning NT Threshold parameter for AMD machines.
2020-08-19 10:45 ` [PATCH 1/1] " Sajan Karumanchi
@ 2020-09-01 19:23 ` H.J. Lu
2020-09-08 11:36 ` Sajan Karumanchi
0 siblings, 1 reply; 5+ messages in thread
From: H.J. Lu @ 2020-09-01 19:23 UTC
To: Sajan Karumanchi
Cc: GNU C Library, Carlos O'Donell, Sajan Karumanchi, Mallappa,
Premachandra
On Wed, Aug 19, 2020 at 3:58 AM Sajan Karumanchi via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> Tuning NT threshold parameter to bring in performance gains of
> memcpy/memmove on AMD CPUs.
>
> Based on Large and Walk bench variant results,
> setting __x86_shared_non_temporal_threshold to 2/3 of shared cache size
> brings in performance gains for memcpy/memmove on AMD machines.
>
The patch looks mostly OK. But I have quite a few x86 patches queued
which touch the same code. Please take a look at
https://gitlab.com/x86-glibc/glibc/-/commits/users/hjl/tunable/master
and put your patch on top of mine.
--
H.J.
* Re: [PATCH 1/1] x86: Tuning NT Threshold parameter for AMD machines.
2020-09-01 19:23 ` H.J. Lu
@ 2020-09-08 11:36 ` Sajan Karumanchi
2020-12-07 14:23 ` H.J. Lu
0 siblings, 1 reply; 5+ messages in thread
From: Sajan Karumanchi @ 2020-09-08 11:36 UTC
To: hjl.tools; +Cc: libc-alpha, carlos, premachandra.mallappa, Sajan Karumanchi
Thanks H.J.Lu for reviewing the patch.
Before pushing a rebased patch, I am looking for answers regarding
the performance drop observed only in large bench variant results for
size ranges of 1MB to 8MB.
For more details, please refer to the cover letter
https://sourceware.org/pipermail/libc-alpha/2020-August/117080.html
* Re: [PATCH 1/1] x86: Tuning NT Threshold parameter for AMD machines.
2020-09-08 11:36 ` Sajan Karumanchi
@ 2020-12-07 14:23 ` H.J. Lu
0 siblings, 0 replies; 5+ messages in thread
From: H.J. Lu @ 2020-12-07 14:23 UTC
To: Sajan Karumanchi
Cc: GNU C Library, Carlos O'Donell, Mallappa, Premachandra,
Sajan Karumanchi
On Tue, Sep 8, 2020 at 4:39 AM Sajan Karumanchi
<sajan.karumanchi@gmail.com> wrote:
>
> Thanks H.J.Lu for reviewing the patch.
> Before pushing a rebased patch, I am looking for answers regarding
> the performance drop observed only in large bench variant results for
> size ranges of 1MB to 8MB.
> For more details, please refer to the cover letter
> https://sourceware.org/pipermail/libc-alpha/2020-August/117080.html
>
Please update your patch since the code has been changed by
commit d3c57027470b78dba79c6d931e4e409b1fecfc80
Author: Patrick McGehearty <patrick.mcgehearty@oracle.com>
Date: Mon Sep 28 20:11:28 2020 +0000
Reversing calculation of __x86_shared_non_temporal_threshold
--
H.J.