* [PATCH 0/1] Optimizing memcpy for AMD Zen architecture. @ 2020-10-22 4:50 sajan.karumanchi 2020-10-22 4:50 ` [PATCH 1/1] x86: " sajan.karumanchi 0 siblings, 1 reply; 9+ messages in thread From: sajan.karumanchi @ 2020-10-22 4:50 UTC (permalink / raw) To: libc-alpha, carlos, fweimer; +Cc: Sajan Karumanchi, premachandra.mallappa From: Sajan Karumanchi <sajan.karumanchi@amd.com> Modifying the shareable cache '__x86_shared_cache_size', which is a factor in computing the non-temporal threshold parameter '__x86_shared_non_temporal_threshold' to optimize memcpy for AMD Zen architecture. In the existing implementation, the shareable cache is computed as 'L3 per thread, L2 per core'. Recomputing this shareable cache as 'L3 per CCX'(Core-Complex) has brought performance gains of ~44% for memory sizes greater than 16MB. The patch I posted earlier: 'Tuning NT Threshold parameter for AMD machines' https://sourceware.org/pipermail/libc-alpha/2020-August/117080.html and the recent patch committed by Patrick McGehearty: 'Reversing calculation of __x86_shared_non_temporal_threshold', both have regression problems on AMD Zen machines for memory ranges of 1MB to 8MB as per the large bench variant results. This patch addresses the regression problem on AMD Zen machines. The below link will show the performance results chart comparison of 'Master' branch and 'AMD' patch against the 2.32 stable release. https://i.imgur.com/0ZJAwes.png Summary: On master branch we see a regression for memoery sizes below 8MB with performance drop of upto 99%, whereas AMD patch has performance gains for 16MB and above with no regressions. Note: The benchmarking is done by isolating all the cpu cores in a CCX, configuring them to fixed frequency mode and routing the IRQs to other cpu cores. Then the large bench tests were run by pinning to one of the isolated cores for 1000 iterations and the performance computation is done by taking average of these iterations. Sajan Karumanchi (1): x86: Optimizing memcpy for AMD Zen architecture. sysdeps/x86/cacheinfo.h | 31 +++++++++++++++++++++++++------ 1 file changed, 25 insertions(+), 6 deletions(-) -- 2.17.1 ^ permalink raw reply [flat|nested] 9+ messages in thread
* [PATCH 1/1] x86: Optimizing memcpy for AMD Zen architecture. 2020-10-22 4:50 [PATCH 0/1] Optimizing memcpy for AMD Zen architecture sajan.karumanchi @ 2020-10-22 4:50 ` sajan.karumanchi 2020-10-27 10:38 ` Florian Weimer 2020-10-28 14:00 ` [PATCH 1/1] " H.J. Lu 0 siblings, 2 replies; 9+ messages in thread From: sajan.karumanchi @ 2020-10-22 4:50 UTC (permalink / raw) To: libc-alpha, carlos, fweimer; +Cc: Sajan Karumanchi, premachandra.mallappa From: Sajan Karumanchi <sajan.karumanchi@amd.com> Modifying the shareable cache '__x86_shared_cache_size', which is a factor in computing the non-temporal threshold parameter '__x86_shared_non_temporal_threshold' to optimize memcpy for AMD Zen architectures. In the existing implementation, the shareable cache is computed as 'L3 per thread, L2 per core'. Recomputing this shareable cache as 'L3 per CCX(Core-Complex)' has brought in performance gains. As per the large bench variant results, this patch also addresses the regression problem on AMD Zen architectures. Reviewed-by: Premachandra Mallappa <premachandra.mallappa@amd.com> Signed-off-by: Premachandra Mallappa <premachandra.mallappa@amd.com> Signed-off-by: Sajan Karumanchi <sajan.karumanchi@amd.com> --- sysdeps/x86/cacheinfo.h | 31 +++++++++++++++++++++++++------ 1 file changed, 25 insertions(+), 6 deletions(-) diff --git a/sysdeps/x86/cacheinfo.h b/sysdeps/x86/cacheinfo.h index 7f342fdc23..d6d6877702 100644 --- a/sysdeps/x86/cacheinfo.h +++ b/sysdeps/x86/cacheinfo.h @@ -303,6 +303,8 @@ init_cacheinfo (void) data = handle_amd (_SC_LEVEL1_DCACHE_SIZE); long int core = handle_amd (_SC_LEVEL2_CACHE_SIZE); shared = handle_amd (_SC_LEVEL3_CACHE_SIZE); + unsigned int eax; + unsigned int threads_per_ccx = 0; /* Get maximum extended function. */ __cpuid (0x80000000, max_cpuid_ex, ebx, ecx, edx); @@ -320,7 +322,7 @@ init_cacheinfo (void) threads = 1 << ((ecx >> 12) & 0x0f); } - if (threads == 0) + if (threads == 0 || cpu_features->basic.family >= 0x17) { /* If APIC ID width is not available, use logical processor count. */ @@ -335,13 +337,27 @@ init_cacheinfo (void) if (threads > 0) shared /= threads; - /* Account for exclusive L2 and L3 caches. */ - shared += core; - } + /* Get shared cache per ccx for Zen architectures */ + if (cpu_features->basic.family >= 0x17) + { + /* Get number of threads share the L3 cache in CCX */ + __cpuid_count(0x8000001D, 0x3, eax, ebx, ecx, edx); + threads_per_ccx = ((eax >> 14) & 0xfff) + 1; + shared = shared * threads_per_ccx; + } + else + { + /* Account for exclusive L2 and L3 caches. */ + shared += core; + } + } } if (cpu_features->data_cache_size != 0) - data = cpu_features->data_cache_size; + { + if (data == 0 || cpu_features->basic.kind != arch_kind_amd) + data = cpu_features->data_cache_size; + } if (data > 0) { @@ -354,7 +370,10 @@ init_cacheinfo (void) } if (cpu_features->shared_cache_size != 0) - shared = cpu_features->shared_cache_size; + { + if (shared == 0 || cpu_features->basic.kind != arch_kind_amd) + shared = cpu_features->shared_cache_size; + } if (shared > 0) { -- 2.17.1 ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 1/1] x86: Optimizing memcpy for AMD Zen architecture. 2020-10-22 4:50 ` [PATCH 1/1] x86: " sajan.karumanchi @ 2020-10-27 10:38 ` Florian Weimer 2020-10-28 7:35 ` [PATCH] " sajan.karumanchi 2020-10-28 14:00 ` [PATCH 1/1] " H.J. Lu 1 sibling, 1 reply; 9+ messages in thread From: Florian Weimer @ 2020-10-27 10:38 UTC (permalink / raw) To: sajan.karumanchi; +Cc: libc-alpha, carlos, premachandra.mallappa * sajan karumanchi: > diff --git a/sysdeps/x86/cacheinfo.h b/sysdeps/x86/cacheinfo.h > index 7f342fdc23..d6d6877702 100644 > --- a/sysdeps/x86/cacheinfo.h > +++ b/sysdeps/x86/cacheinfo.h > @@ -303,6 +303,8 @@ init_cacheinfo (void) > data = handle_amd (_SC_LEVEL1_DCACHE_SIZE); > long int core = handle_amd (_SC_LEVEL2_CACHE_SIZE); > shared = handle_amd (_SC_LEVEL3_CACHE_SIZE); > + unsigned int eax; > + unsigned int threads_per_ccx = 0; > > /* Get maximum extended function. */ > __cpuid (0x80000000, max_cpuid_ex, ebx, ecx, edx); > @@ -320,7 +322,7 @@ init_cacheinfo (void) > threads = 1 << ((ecx >> 12) & 0x0f); > } > > - if (threads == 0) > + if (threads == 0 || cpu_features->basic.family >= 0x17) > { > /* If APIC ID width is not available, use logical > processor count. */ > @@ -335,13 +337,27 @@ init_cacheinfo (void) > if (threads > 0) > shared /= threads; > > - /* Account for exclusive L2 and L3 caches. */ > - shared += core; > - } > + /* Get shared cache per ccx for Zen architectures */ > + if (cpu_features->basic.family >= 0x17) > + { > + /* Get number of threads share the L3 cache in CCX */ > + __cpuid_count(0x8000001D, 0x3, eax, ebx, ecx, edx); > + threads_per_ccx = ((eax >> 14) & 0xfff) + 1; > + shared = shared * threads_per_ccx; > + } > + else > + { > + /* Account for exclusive L2 and L3 caches. */ > + shared += core; > + } > + } > } Although not visible in the patch, these changes a properly guarded by an arch_kind_amd check, as expected. Beyond that, I can't comment on the substance of the patch, but I'd like to request the follow style changes: * Move the definitions of eax and threads_per_ccx closer to their usage site. Initialize threads_per_ccx directly with its final variable. (The separate variable is nice for documentation purposes.) * Add a space after __cpuid_count (to follow GNU style). * Add ". " (period and two spaces) at the end of all new comments. * Remove Signed-off-by. glibc does not use DCO <https://developercertificate.org/>. I assume this patch is covered by AMD's copyright assignment instead. I can make these changes for you and push this, or you can post a new patch. Thanks, Florian -- Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn, Commercial register: Amtsgericht Muenchen, HRB 153243, Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neill ^ permalink raw reply [flat|nested] 9+ messages in thread
* [PATCH] x86: Optimizing memcpy for AMD Zen architecture. 2020-10-27 10:38 ` Florian Weimer @ 2020-10-28 7:35 ` sajan.karumanchi 2020-10-28 9:22 ` Florian Weimer 0 siblings, 1 reply; 9+ messages in thread From: sajan.karumanchi @ 2020-10-28 7:35 UTC (permalink / raw) To: libc-alpha, carlos, fweimer; +Cc: premachandra.mallappa, Sajan Karumanchi From: Sajan Karumanchi <sajan.karumanchi@amd.com> Modifying the shareable cache '__x86_shared_cache_size', which is a factor in computing the non-temporal threshold parameter '__x86_shared_non_temporal_threshold' to optimize memcpy for AMD Zen architectures. In the existing implementation, the shareable cache is computed as 'L3 per thread, L2 per core'. Recomputing this shareable cache as 'L3 per CCX(Core-Complex)' has brought in performance gains. As per the large bench variant results, this patch also addresses the regression problem on AMD Zen architectures. Reviewed-by: Premachandra Mallappa <premachandra.mallappa@amd.com> --- sysdeps/x86/cacheinfo.h | 32 ++++++++++++++++++++++++++------ 1 file changed, 26 insertions(+), 6 deletions(-) diff --git a/sysdeps/x86/cacheinfo.h b/sysdeps/x86/cacheinfo.h index 7f342fdc23..1296c93b2b 100644 --- a/sysdeps/x86/cacheinfo.h +++ b/sysdeps/x86/cacheinfo.h @@ -320,7 +320,7 @@ init_cacheinfo (void) threads = 1 << ((ecx >> 12) & 0x0f); } - if (threads == 0) + if (threads == 0 || cpu_features->basic.family >= 0x17) { /* If APIC ID width is not available, use logical processor count. */ @@ -335,13 +335,30 @@ init_cacheinfo (void) if (threads > 0) shared /= threads; - /* Account for exclusive L2 and L3 caches. */ - shared += core; - } + /* Get shared cache per ccx for Zen architectures. */ + if (cpu_features->basic.family >= 0x17) + { + unsigned int eax; + + /* Get number of threads share the L3 cache in CCX. */ + __cpuid_count (0x8000001D, 0x3, eax, ebx, ecx, edx); + + unsigned int threads_per_ccx = ((eax >> 14) & 0xfff) + 1; + shared *= threads_per_ccx; + } + else + { + /* Account for exclusive L2 and L3 caches. */ + shared += core; + } + } } if (cpu_features->data_cache_size != 0) - data = cpu_features->data_cache_size; + { + if (data == 0 || cpu_features->basic.kind != arch_kind_amd) + data = cpu_features->data_cache_size; + } if (data > 0) { @@ -354,7 +371,10 @@ init_cacheinfo (void) } if (cpu_features->shared_cache_size != 0) - shared = cpu_features->shared_cache_size; + { + if (shared == 0 || cpu_features->basic.kind != arch_kind_amd) + shared = cpu_features->shared_cache_size; + } if (shared > 0) { -- 2.17.1 ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] x86: Optimizing memcpy for AMD Zen architecture. 2020-10-28 7:35 ` [PATCH] " sajan.karumanchi @ 2020-10-28 9:22 ` Florian Weimer 0 siblings, 0 replies; 9+ messages in thread From: Florian Weimer @ 2020-10-28 9:22 UTC (permalink / raw) To: sajan.karumanchi; +Cc: libc-alpha, carlos, premachandra.mallappa * sajan karumanchi: > From: Sajan Karumanchi <sajan.karumanchi@amd.com> > > Modifying the shareable cache '__x86_shared_cache_size', which is a > factor in computing the non-temporal threshold parameter > '__x86_shared_non_temporal_threshold' to optimize memcpy for AMD Zen > architectures. > In the existing implementation, the shareable cache is computed as 'L3 > per thread, L2 per core'. Recomputing this shareable cache as 'L3 per > CCX(Core-Complex)' has brought in performance gains. > As per the large bench variant results, this patch also addresses the > regression problem on AMD Zen architectures. > > Reviewed-by: Premachandra Mallappa <premachandra.mallappa@amd.com> Thanks, I've pushed this version for you. Florian -- Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn, Commercial register: Amtsgericht Muenchen, HRB 153243, Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neill ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 1/1] x86: Optimizing memcpy for AMD Zen architecture. 2020-10-22 4:50 ` [PATCH 1/1] x86: " sajan.karumanchi 2020-10-27 10:38 ` Florian Weimer @ 2020-10-28 14:00 ` H.J. Lu 2020-10-28 14:29 ` Florian Weimer 1 sibling, 1 reply; 9+ messages in thread From: H.J. Lu @ 2020-10-28 14:00 UTC (permalink / raw) To: Sajan Karumanchi Cc: GNU C Library, Carlos O'Donell, Florian Weimer, Mallappa, Premachandra On Wed, Oct 21, 2020 at 9:51 PM <sajan.karumanchi@amd.com> wrote: > > From: Sajan Karumanchi <sajan.karumanchi@amd.com> > > Modifying the shareable cache '__x86_shared_cache_size', which is a > factor in computing the non-temporal threshold parameter > '__x86_shared_non_temporal_threshold' to optimize memcpy for AMD Zen > architectures. > In the existing implementation, the shareable cache is computed as 'L3 > per thread, L2 per core'. Recomputing this shareable cache as 'L3 per > CCX(Core-Complex)' has brought in performance gains. > As per the large bench variant results, this patch also addresses the > regression problem on AMD Zen architectures. > > Reviewed-by: Premachandra Mallappa <premachandra.mallappa@amd.com> > Signed-off-by: Premachandra Mallappa <premachandra.mallappa@amd.com> > Signed-off-by: Sajan Karumanchi <sajan.karumanchi@amd.com> > --- > sysdeps/x86/cacheinfo.h | 31 +++++++++++++++++++++++++------ > 1 file changed, 25 insertions(+), 6 deletions(-) > > diff --git a/sysdeps/x86/cacheinfo.h b/sysdeps/x86/cacheinfo.h > index 7f342fdc23..d6d6877702 100644 > --- a/sysdeps/x86/cacheinfo.h > +++ b/sysdeps/x86/cacheinfo.h > @@ -303,6 +303,8 @@ init_cacheinfo (void) > data = handle_amd (_SC_LEVEL1_DCACHE_SIZE); > long int core = handle_amd (_SC_LEVEL2_CACHE_SIZE); > shared = handle_amd (_SC_LEVEL3_CACHE_SIZE); > + unsigned int eax; > + unsigned int threads_per_ccx = 0; > > /* Get maximum extended function. */ > __cpuid (0x80000000, max_cpuid_ex, ebx, ecx, edx); > @@ -320,7 +322,7 @@ init_cacheinfo (void) > threads = 1 << ((ecx >> 12) & 0x0f); > } > > - if (threads == 0) > + if (threads == 0 || cpu_features->basic.family >= 0x17) > { > /* If APIC ID width is not available, use logical > processor count. */ > @@ -335,13 +337,27 @@ init_cacheinfo (void) > if (threads > 0) > shared /= threads; > > - /* Account for exclusive L2 and L3 caches. */ > - shared += core; > - } > + /* Get shared cache per ccx for Zen architectures */ > + if (cpu_features->basic.family >= 0x17) > + { > + /* Get number of threads share the L3 cache in CCX */ > + __cpuid_count(0x8000001D, 0x3, eax, ebx, ecx, edx); > + threads_per_ccx = ((eax >> 14) & 0xfff) + 1; > + shared = shared * threads_per_ccx; > + } > + else > + { > + /* Account for exclusive L2 and L3 caches. */ > + shared += core; > + } > + } > } > > if (cpu_features->data_cache_size != 0) > - data = cpu_features->data_cache_size; > + { > + if (data == 0 || cpu_features->basic.kind != arch_kind_amd) > + data = cpu_features->data_cache_size; > + } This looks wrong. cpu_features->data_cache_size is set by GLIBC tunables: -- Tunable: glibc.cpu.x86_data_cache_size The ‘glibc.cpu.x86_data_cache_size’ tunable allows the user to set data cache size in bytes for use in memory and string routines. This tunable is specific to i386 and x86-64. Why is it ignored? > if (data > 0) > { > @@ -354,7 +370,10 @@ init_cacheinfo (void) > } > > if (cpu_features->shared_cache_size != 0) > - shared = cpu_features->shared_cache_size; > + { > + if (shared == 0 || cpu_features->basic.kind != arch_kind_amd) > + shared = cpu_features->shared_cache_size; > + } This looks wrong. cpu_features->shared_cache_size is set by GLIBC tunables: -- Tunable: glibc.cpu.x86_shared_cache_size The ‘glibc.cpu.x86_shared_cache_size’ tunable allows the user to set shared cache size in bytes for use in memory and string routines. Why is it ignored? > if (shared > 0) > { > -- > 2.17.1 > I think they should be reverted. -- H.J. ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 1/1] x86: Optimizing memcpy for AMD Zen architecture. 2020-10-28 14:00 ` [PATCH 1/1] " H.J. Lu @ 2020-10-28 14:29 ` Florian Weimer 2020-10-28 14:44 ` H.J. Lu 0 siblings, 1 reply; 9+ messages in thread From: Florian Weimer @ 2020-10-28 14:29 UTC (permalink / raw) To: H.J. Lu Cc: Sajan Karumanchi, GNU C Library, Carlos O'Donell, Mallappa, Premachandra * H. J. Lu: >> if (cpu_features->data_cache_size != 0) >> - data = cpu_features->data_cache_size; >> + { >> + if (data == 0 || cpu_features->basic.kind != arch_kind_amd) >> + data = cpu_features->data_cache_size; >> + } > > This looks wrong. cpu_features->data_cache_size is set by > GLIBC tunables: > > -- Tunable: glibc.cpu.x86_data_cache_size > The ‘glibc.cpu.x86_data_cache_size’ tunable allows the user to set > data cache size in bytes for use in memory and string routines. > > This tunable is specific to i386 and x86-64. > > Why is it ignored? So we should revert those two hunks only? Thanks, Florian -- Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn, Commercial register: Amtsgericht Muenchen, HRB 153243, Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neill ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 1/1] x86: Optimizing memcpy for AMD Zen architecture. 2020-10-28 14:29 ` Florian Weimer @ 2020-10-28 14:44 ` H.J. Lu 2020-10-30 6:35 ` Karumanchi, Sajan 0 siblings, 1 reply; 9+ messages in thread From: H.J. Lu @ 2020-10-28 14:44 UTC (permalink / raw) To: Florian Weimer Cc: Sajan Karumanchi, GNU C Library, Carlos O'Donell, Mallappa, Premachandra On Wed, Oct 28, 2020 at 7:29 AM Florian Weimer <fweimer@redhat.com> wrote: > > * H. J. Lu: > > >> if (cpu_features->data_cache_size != 0) > >> - data = cpu_features->data_cache_size; > >> + { > >> + if (data == 0 || cpu_features->basic.kind != arch_kind_amd) > >> + data = cpu_features->data_cache_size; > >> + } > > > > This looks wrong. cpu_features->data_cache_size is set by > > GLIBC tunables: > > > > -- Tunable: glibc.cpu.x86_data_cache_size > > The ‘glibc.cpu.x86_data_cache_size’ tunable allows the user to set > > data cache size in bytes for use in memory and string routines. > > > > This tunable is specific to i386 and x86-64. > > > > Why is it ignored? > > So we should revert those two hunks only? > Yes. -- H.J. ^ permalink raw reply [flat|nested] 9+ messages in thread
* RE: [PATCH 1/1] x86: Optimizing memcpy for AMD Zen architecture. 2020-10-28 14:44 ` H.J. Lu @ 2020-10-30 6:35 ` Karumanchi, Sajan 0 siblings, 0 replies; 9+ messages in thread From: Karumanchi, Sajan @ 2020-10-30 6:35 UTC (permalink / raw) To: H.J. Lu, Florian Weimer Cc: GNU C Library, Carlos O'Donell, Mallappa, Premachandra [AMD Public Use] Hi H. J .Lu, Thanks for pointing it out and sorry for my ignorance of the Glibc tunables. I overlooked this tunables part, as it was once reviewed by you for another patch which did not make through. Patch: https://sourceware.org/pipermail/libc-alpha/2020-August/117081.html Review: https://sourceware.org/pipermail/libc-alpha/2020-September/117385.html Thanks & Regards, Sajan K. -----Original Message----- From: H.J. Lu <hjl.tools@gmail.com> Sent: Wednesday, October 28, 2020 8:14 PM To: Florian Weimer <fweimer@redhat.com> Cc: Karumanchi, Sajan <Sajan.Karumanchi@amd.com>; GNU C Library <libc-alpha@sourceware.org>; Carlos O'Donell <carlos@redhat.com>; Mallappa, Premachandra <Premachandra.Mallappa@amd.com> Subject: Re: [PATCH 1/1] x86: Optimizing memcpy for AMD Zen architecture. [CAUTION: External Email] On Wed, Oct 28, 2020 at 7:29 AM Florian Weimer <fweimer@redhat.com> wrote: > > * H. J. Lu: > > >> if (cpu_features->data_cache_size != 0) > >> - data = cpu_features->data_cache_size; > >> + { > >> + if (data == 0 || cpu_features->basic.kind != arch_kind_amd) > >> + data = cpu_features->data_cache_size; > >> + } > > > > This looks wrong. cpu_features->data_cache_size is set by GLIBC > > tunables: > > > > -- Tunable: glibc.cpu.x86_data_cache_size > > The ‘glibc.cpu.x86_data_cache_size’ tunable allows the user to set > > data cache size in bytes for use in memory and string routines. > > > > This tunable is specific to i386 and x86-64. > > > > Why is it ignored? > > So we should revert those two hunks only? > Yes. -- H.J. ^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2020-10-30 6:35 UTC | newest] Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2020-10-22 4:50 [PATCH 0/1] Optimizing memcpy for AMD Zen architecture sajan.karumanchi 2020-10-22 4:50 ` [PATCH 1/1] x86: " sajan.karumanchi 2020-10-27 10:38 ` Florian Weimer 2020-10-28 7:35 ` [PATCH] " sajan.karumanchi 2020-10-28 9:22 ` Florian Weimer 2020-10-28 14:00 ` [PATCH 1/1] " H.J. Lu 2020-10-28 14:29 ` Florian Weimer 2020-10-28 14:44 ` H.J. Lu 2020-10-30 6:35 ` Karumanchi, Sajan
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).