* [PATCH v3 0/3] x86: Improve ERMS usage on Zen3+
@ 2024-02-08 13:08 Adhemerval Zanella
  2024-02-08 13:08 ` [PATCH v3 1/3] x86: Fix Zen3/Zen4 ERMS selection (BZ 30994) Adhemerval Zanella
                   ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: Adhemerval Zanella @ 2024-02-08 13:08 UTC (permalink / raw)
  To: libc-alpha; +Cc: H . J . Lu, Noah Goldstein, Sajan Karumanchi, bmerry, pmallapp

For the sizes where REP MOVSB and REP STOSB are used on Zen3+ cores, the
resulting performance is lower than with the vectorized instructions (and
some input alignments show a very large performance gap, as indicated by
BZ#30995).

glibc enables ERMS on AMD cores for sizes between 2113 (rep_movsb_threshold)
and the L2 cache size (rep_movsb_stop_threshold, or 524288 on a Zen3 core).
Using the benchmark provided in BZ#30995, memcpy on a Ryzen 9 5900X shows:

  Size (bytes)  Destination Alignment  Throughput (GB/s)
          2113                      0            84.2448
          2113                     15             4.4310
        524287                      0            57.1122
        524287                     15             4.34671

While forcing the vectorized path with the tunable
GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 it shows:

  Size (bytes)  Destination Alignment  Throughput (GB/s)
          2113                      0           124.1830
          2113                     15           121.8720
        524287                      0            58.3212
        524287                     15            58.5352

Increasing the number of concurrent jobs does show improvements for ERMS
over the vectorized instructions as well.  The performance difference with
ERMS improves if input alignments are equal, although it does not reach
parity with the vectorized path.

memset shows a similar performance improvement with vectorized instructions
instead of REP STOSB.  On the same machine, the default strategy shows:

  Size (bytes)  Destination Alignment  Throughput (GB/s)
          2113                      0            68.0113
          2113                     15            56.1880
        524287                      0           119.3670
        524287                     15           116.2590

While with GLIBC_TUNABLES=glibc.cpu.x86_rep_stosb_threshold=1000000:

  Size (bytes)  Destination Alignment  Throughput (GB/s)
          2113                      0           133.2310
          2113                     15           132.5800
        524287                      0           112.0650
        524287                     15           118.0960

I also saw a slight performance increase on 502.gcc_r (1 copy), where the
result went from 9.82 to 9.85; this benchmark exercises both memcpy and
memset heavily.

Changes from v2:
- Removed the rep_movsb_stop_threshold tunable.
- Simplified the memset change.

Changes from v1:
- Reworded the comment and commit message.

Adhemerval Zanella (3):
  x86: Fix Zen3/Zen4 ERMS selection (BZ 30994)
  x86: Do not prefer ERMS for memset on Zen3+
  x86: Expand the comment on when REP STOSB is used on memset

 sysdeps/x86/dl-cacheinfo.h                    | 43 ++++++++++---------
 .../multiarch/memset-vec-unaligned-erms.S     |  4 +-
 2 files changed, 26 insertions(+), 21 deletions(-)

--
2.34.1

^ permalink raw reply	[flat|nested] 9+ messages in thread
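For readers without the BZ#30995 attachment, the following is a minimal
sketch of a throughput test in the same shape as the tables above.  It is
not the original benchmark: the byte budget per configuration, the buffer
handling, and the use of CLOCK_MONOTONIC are assumptions made only for
illustration.

  /* Minimal throughput sketch (not the BZ#30995 benchmark): memcpy from a
     64-byte-aligned source to a destination offset by dst_off bytes, for
     the sizes quoted in the cover letter.  */
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <time.h>

  static double
  throughput_gbs (size_t size, size_t dst_off, size_t iters)
  {
    /* Round up so aligned_alloc's size is a multiple of 64 and the
       destination can be shifted by up to 63 bytes.  */
    size_t alloc = (size + 64 + 63) & ~(size_t) 63;
    char *src = aligned_alloc (64, alloc);
    char *dstbuf = aligned_alloc (64, alloc);
    if (src == NULL || dstbuf == NULL)
      abort ();
    memset (src, 0x5a, alloc);

    struct timespec t0, t1;
    clock_gettime (CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < iters; i++)
      memcpy (dstbuf + dst_off, src, size);
    clock_gettime (CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    free (src);
    free (dstbuf);
    return (double) size * iters / secs / 1e9;
  }

  int
  main (void)
  {
    static const size_t sizes[] = { 2113, 524287 };
    static const size_t offs[] = { 0, 15 };
    for (int i = 0; i < 2; i++)
      for (int j = 0; j < 2; j++)
        printf ("%7zu bytes, dst offset %2zu: %8.2f GB/s\n",
                sizes[i], offs[j],
                throughput_gbs (sizes[i], offs[j], (1UL << 31) / sizes[i]));
    return 0;
  }

Running it once with the default thresholds and once with the
glibc.cpu.x86_rep_movsb_threshold setting shown above is what exposes the
gap at destination offset 15.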
* [PATCH v3 1/3] x86: Fix Zen3/Zen4 ERMS selection (BZ 30994) 2024-02-08 13:08 [PATCH v3 0/3] x86: Improve ERMS usage on Zen3+ Adhemerval Zanella @ 2024-02-08 13:08 ` Adhemerval Zanella 2024-02-12 15:56 ` H.J. Lu 2024-02-08 13:08 ` [PATCH v3 2/3] x86: Do not prefer ERMS for memset on Zen3+ Adhemerval Zanella ` (2 subsequent siblings) 3 siblings, 1 reply; 9+ messages in thread From: Adhemerval Zanella @ 2024-02-08 13:08 UTC (permalink / raw) To: libc-alpha; +Cc: H . J . Lu, Noah Goldstein, Sajan Karumanchi, bmerry, pmallapp The REP MOVSB usage on memcpy/memmove does not show much performance improvement on Zen3/Zen4 cores compared to the vectorized loops. Also, as from BZ 30994, if the source is aligned and the destination is not the performance can be 20x slower. The performance difference is noticeable with small buffer sizes, closer to the lower bounds limits when memcpy/memmove starts to use ERMS. The performance of REP MOVSB is similar to vectorized instruction on the size limit (the L2 cache). Also, there is no drawback to multiple cores sharing the cache. Checked on x86_64-linux-gnu on Zen3. --- sysdeps/x86/dl-cacheinfo.h | 38 ++++++++++++++++++-------------------- 1 file changed, 18 insertions(+), 20 deletions(-) diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h index d5101615e3..f34d12846c 100644 --- a/sysdeps/x86/dl-cacheinfo.h +++ b/sysdeps/x86/dl-cacheinfo.h @@ -791,7 +791,6 @@ dl_init_cacheinfo (struct cpu_features *cpu_features) long int data = -1; long int shared = -1; long int shared_per_thread = -1; - long int core = -1; unsigned int threads = 0; unsigned long int level1_icache_size = -1; unsigned long int level1_icache_linesize = -1; @@ -809,7 +808,6 @@ dl_init_cacheinfo (struct cpu_features *cpu_features) if (cpu_features->basic.kind == arch_kind_intel) { data = handle_intel (_SC_LEVEL1_DCACHE_SIZE, cpu_features); - core = handle_intel (_SC_LEVEL2_CACHE_SIZE, cpu_features); shared = handle_intel (_SC_LEVEL3_CACHE_SIZE, cpu_features); shared_per_thread = shared; @@ -822,7 +820,8 @@ dl_init_cacheinfo (struct cpu_features *cpu_features) = handle_intel (_SC_LEVEL1_DCACHE_ASSOC, cpu_features); level1_dcache_linesize = handle_intel (_SC_LEVEL1_DCACHE_LINESIZE, cpu_features); - level2_cache_size = core; + level2_cache_size + = handle_intel (_SC_LEVEL2_CACHE_SIZE, cpu_features); level2_cache_assoc = handle_intel (_SC_LEVEL2_CACHE_ASSOC, cpu_features); level2_cache_linesize @@ -835,12 +834,12 @@ dl_init_cacheinfo (struct cpu_features *cpu_features) level4_cache_size = handle_intel (_SC_LEVEL4_CACHE_SIZE, cpu_features); - get_common_cache_info (&shared, &shared_per_thread, &threads, core); + get_common_cache_info (&shared, &shared_per_thread, &threads, + level2_cache_size); } else if (cpu_features->basic.kind == arch_kind_zhaoxin) { data = handle_zhaoxin (_SC_LEVEL1_DCACHE_SIZE); - core = handle_zhaoxin (_SC_LEVEL2_CACHE_SIZE); shared = handle_zhaoxin (_SC_LEVEL3_CACHE_SIZE); shared_per_thread = shared; @@ -849,19 +848,19 @@ dl_init_cacheinfo (struct cpu_features *cpu_features) level1_dcache_size = data; level1_dcache_assoc = handle_zhaoxin (_SC_LEVEL1_DCACHE_ASSOC); level1_dcache_linesize = handle_zhaoxin (_SC_LEVEL1_DCACHE_LINESIZE); - level2_cache_size = core; + level2_cache_size = handle_zhaoxin (_SC_LEVEL2_CACHE_SIZE); level2_cache_assoc = handle_zhaoxin (_SC_LEVEL2_CACHE_ASSOC); level2_cache_linesize = handle_zhaoxin (_SC_LEVEL2_CACHE_LINESIZE); level3_cache_size = shared; level3_cache_assoc = handle_zhaoxin (_SC_LEVEL3_CACHE_ASSOC); level3_cache_linesize = 
handle_zhaoxin (_SC_LEVEL3_CACHE_LINESIZE); - get_common_cache_info (&shared, &shared_per_thread, &threads, core); + get_common_cache_info (&shared, &shared_per_thread, &threads, + level2_cache_size); } else if (cpu_features->basic.kind == arch_kind_amd) { data = handle_amd (_SC_LEVEL1_DCACHE_SIZE); - core = handle_amd (_SC_LEVEL2_CACHE_SIZE); shared = handle_amd (_SC_LEVEL3_CACHE_SIZE); level1_icache_size = handle_amd (_SC_LEVEL1_ICACHE_SIZE); @@ -869,7 +868,7 @@ dl_init_cacheinfo (struct cpu_features *cpu_features) level1_dcache_size = data; level1_dcache_assoc = handle_amd (_SC_LEVEL1_DCACHE_ASSOC); level1_dcache_linesize = handle_amd (_SC_LEVEL1_DCACHE_LINESIZE); - level2_cache_size = core; + level2_cache_size = handle_amd (_SC_LEVEL2_CACHE_SIZE);; level2_cache_assoc = handle_amd (_SC_LEVEL2_CACHE_ASSOC); level2_cache_linesize = handle_amd (_SC_LEVEL2_CACHE_LINESIZE); level3_cache_size = shared; @@ -880,12 +879,12 @@ dl_init_cacheinfo (struct cpu_features *cpu_features) if (shared <= 0) { /* No shared L3 cache. All we have is the L2 cache. */ - shared = core; + shared = level2_cache_size; } else if (cpu_features->basic.family < 0x17) { /* Account for exclusive L2 and L3 caches. */ - shared += core; + shared += level2_cache_size; } shared_per_thread = shared; @@ -987,6 +986,12 @@ dl_init_cacheinfo (struct cpu_features *cpu_features) if (CPU_FEATURE_USABLE_P (cpu_features, FSRM)) rep_movsb_threshold = 2112; + /* For AMD CPUs that support ERMS (Zen3+), REP MOVSB is in a lot of + cases slower than the vectorized path (and for some alignments, + it is really slow, check BZ #30994). */ + if (cpu_features->basic.kind == arch_kind_amd) + rep_movsb_threshold = non_temporal_threshold; + /* The default threshold to use Enhanced REP STOSB. */ unsigned long int rep_stosb_threshold = 2048; @@ -1028,16 +1033,9 @@ dl_init_cacheinfo (struct cpu_features *cpu_features) SIZE_MAX); unsigned long int rep_movsb_stop_threshold; - /* ERMS feature is implemented from AMD Zen3 architecture and it is - performing poorly for data above L2 cache size. Henceforth, adding - an upper bound threshold parameter to limit the usage of Enhanced - REP MOVSB operations and setting its value to L2 cache size. */ - if (cpu_features->basic.kind == arch_kind_amd) - rep_movsb_stop_threshold = core; /* Setting the upper bound of ERMS to the computed value of - non-temporal threshold for architectures other than AMD. */ - else - rep_movsb_stop_threshold = non_temporal_threshold; + non-temporal threshold for all architectures. */ + rep_movsb_stop_threshold = non_temporal_threshold; cpu_features->data_cache_size = data; cpu_features->shared_cache_size = shared; -- 2.34.1 ^ permalink raw reply [flat|nested] 9+ messages in thread
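The effect of the patch is easiest to see through the size-based selection
driven by the two thresholds.  Below is a simplified, self-contained C
rendering of that selection only: the copy helpers are illustrative
stand-ins (the real code is assembly and also handles overlap, alignment,
and copy direction), and the 512 KiB / 768 KiB figures are example values,
not ones computed by glibc.

  /* Illustrative sketch of the strategy selection driven by
     __x86_rep_movsb_threshold and __x86_rep_movsb_stop_threshold.
     The copy helpers are placeholders, not glibc symbols.  */
  #include <stdio.h>
  #include <string.h>

  static void copy_vec (void *d, const void *s, size_t n)
  { printf ("%8zu bytes -> vectorized loop\n", n); memcpy (d, s, n); }
  static void copy_movsb (void *d, const void *s, size_t n)
  { printf ("%8zu bytes -> REP MOVSB\n", n); memcpy (d, s, n); }
  static void copy_nt (void *d, const void *s, size_t n)
  { printf ("%8zu bytes -> non-temporal stores\n", n); memcpy (d, s, n); }

  static void
  dispatch (void *d, const void *s, size_t n,
            size_t movsb_threshold, size_t movsb_stop_threshold)
  {
    if (n < movsb_threshold)
      copy_vec (d, s, n);
    else if (n < movsb_stop_threshold)
      copy_movsb (d, s, n);          /* the ERMS window */
    else
      copy_nt (d, s, n);
  }

  int
  main (void)
  {
    static char a[1 << 20], b[1 << 20];
    size_t l2 = 512 * 1024;          /* Zen3 L2 size, as in the cover letter */
    size_t nt = 768 * 1024;          /* example non-temporal threshold */

    /* Before the patch on AMD: the ERMS window is [2113, L2 size).  */
    dispatch (a, b, 2113, 2113, l2);
    /* After the patch: rep_movsb_threshold == rep_movsb_stop_threshold ==
       non_temporal_threshold on AMD, so the window is empty and the
       vectorized loop handles everything below the non-temporal bound.  */
    dispatch (a, b, 2113, nt, nt);
    return 0;
  }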
* Re: [PATCH v3 1/3] x86: Fix Zen3/Zen4 ERMS selection (BZ 30994) 2024-02-08 13:08 ` [PATCH v3 1/3] x86: Fix Zen3/Zen4 ERMS selection (BZ 30994) Adhemerval Zanella @ 2024-02-12 15:56 ` H.J. Lu 0 siblings, 0 replies; 9+ messages in thread From: H.J. Lu @ 2024-02-12 15:56 UTC (permalink / raw) To: Adhemerval Zanella Cc: libc-alpha, Noah Goldstein, Sajan Karumanchi, bmerry, pmallapp On Thu, Feb 8, 2024 at 5:08 AM Adhemerval Zanella <adhemerval.zanella@linaro.org> wrote: > > The REP MOVSB usage on memcpy/memmove does not show much performance > improvement on Zen3/Zen4 cores compared to the vectorized loops. Also, > as from BZ 30994, if the source is aligned and the destination is not > the performance can be 20x slower. > > The performance difference is noticeable with small buffer sizes, closer > to the lower bounds limits when memcpy/memmove starts to use ERMS. The > performance of REP MOVSB is similar to vectorized instruction on the > size limit (the L2 cache). Also, there is no drawback to multiple cores > sharing the cache. > > Checked on x86_64-linux-gnu on Zen3. > --- > sysdeps/x86/dl-cacheinfo.h | 38 ++++++++++++++++++-------------------- > 1 file changed, 18 insertions(+), 20 deletions(-) > > diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h > index d5101615e3..f34d12846c 100644 > --- a/sysdeps/x86/dl-cacheinfo.h > +++ b/sysdeps/x86/dl-cacheinfo.h > @@ -791,7 +791,6 @@ dl_init_cacheinfo (struct cpu_features *cpu_features) > long int data = -1; > long int shared = -1; > long int shared_per_thread = -1; > - long int core = -1; > unsigned int threads = 0; > unsigned long int level1_icache_size = -1; > unsigned long int level1_icache_linesize = -1; > @@ -809,7 +808,6 @@ dl_init_cacheinfo (struct cpu_features *cpu_features) > if (cpu_features->basic.kind == arch_kind_intel) > { > data = handle_intel (_SC_LEVEL1_DCACHE_SIZE, cpu_features); > - core = handle_intel (_SC_LEVEL2_CACHE_SIZE, cpu_features); > shared = handle_intel (_SC_LEVEL3_CACHE_SIZE, cpu_features); > shared_per_thread = shared; > > @@ -822,7 +820,8 @@ dl_init_cacheinfo (struct cpu_features *cpu_features) > = handle_intel (_SC_LEVEL1_DCACHE_ASSOC, cpu_features); > level1_dcache_linesize > = handle_intel (_SC_LEVEL1_DCACHE_LINESIZE, cpu_features); > - level2_cache_size = core; > + level2_cache_size > + = handle_intel (_SC_LEVEL2_CACHE_SIZE, cpu_features); > level2_cache_assoc > = handle_intel (_SC_LEVEL2_CACHE_ASSOC, cpu_features); > level2_cache_linesize > @@ -835,12 +834,12 @@ dl_init_cacheinfo (struct cpu_features *cpu_features) > level4_cache_size > = handle_intel (_SC_LEVEL4_CACHE_SIZE, cpu_features); > > - get_common_cache_info (&shared, &shared_per_thread, &threads, core); > + get_common_cache_info (&shared, &shared_per_thread, &threads, > + level2_cache_size); > } > else if (cpu_features->basic.kind == arch_kind_zhaoxin) > { > data = handle_zhaoxin (_SC_LEVEL1_DCACHE_SIZE); > - core = handle_zhaoxin (_SC_LEVEL2_CACHE_SIZE); > shared = handle_zhaoxin (_SC_LEVEL3_CACHE_SIZE); > shared_per_thread = shared; > > @@ -849,19 +848,19 @@ dl_init_cacheinfo (struct cpu_features *cpu_features) > level1_dcache_size = data; > level1_dcache_assoc = handle_zhaoxin (_SC_LEVEL1_DCACHE_ASSOC); > level1_dcache_linesize = handle_zhaoxin (_SC_LEVEL1_DCACHE_LINESIZE); > - level2_cache_size = core; > + level2_cache_size = handle_zhaoxin (_SC_LEVEL2_CACHE_SIZE); > level2_cache_assoc = handle_zhaoxin (_SC_LEVEL2_CACHE_ASSOC); > level2_cache_linesize = handle_zhaoxin (_SC_LEVEL2_CACHE_LINESIZE); > level3_cache_size = shared; 
> level3_cache_assoc = handle_zhaoxin (_SC_LEVEL3_CACHE_ASSOC); > level3_cache_linesize = handle_zhaoxin (_SC_LEVEL3_CACHE_LINESIZE); > > - get_common_cache_info (&shared, &shared_per_thread, &threads, core); > + get_common_cache_info (&shared, &shared_per_thread, &threads, > + level2_cache_size); > } > else if (cpu_features->basic.kind == arch_kind_amd) > { > data = handle_amd (_SC_LEVEL1_DCACHE_SIZE); > - core = handle_amd (_SC_LEVEL2_CACHE_SIZE); > shared = handle_amd (_SC_LEVEL3_CACHE_SIZE); > > level1_icache_size = handle_amd (_SC_LEVEL1_ICACHE_SIZE); > @@ -869,7 +868,7 @@ dl_init_cacheinfo (struct cpu_features *cpu_features) > level1_dcache_size = data; > level1_dcache_assoc = handle_amd (_SC_LEVEL1_DCACHE_ASSOC); > level1_dcache_linesize = handle_amd (_SC_LEVEL1_DCACHE_LINESIZE); > - level2_cache_size = core; > + level2_cache_size = handle_amd (_SC_LEVEL2_CACHE_SIZE);; > level2_cache_assoc = handle_amd (_SC_LEVEL2_CACHE_ASSOC); > level2_cache_linesize = handle_amd (_SC_LEVEL2_CACHE_LINESIZE); > level3_cache_size = shared; > @@ -880,12 +879,12 @@ dl_init_cacheinfo (struct cpu_features *cpu_features) > if (shared <= 0) > { > /* No shared L3 cache. All we have is the L2 cache. */ > - shared = core; > + shared = level2_cache_size; > } > else if (cpu_features->basic.family < 0x17) > { > /* Account for exclusive L2 and L3 caches. */ > - shared += core; > + shared += level2_cache_size; > } > > shared_per_thread = shared; > @@ -987,6 +986,12 @@ dl_init_cacheinfo (struct cpu_features *cpu_features) > if (CPU_FEATURE_USABLE_P (cpu_features, FSRM)) > rep_movsb_threshold = 2112; > > + /* For AMD CPUs that support ERMS (Zen3+), REP MOVSB is in a lot of > + cases slower than the vectorized path (and for some alignments, > + it is really slow, check BZ #30994). */ > + if (cpu_features->basic.kind == arch_kind_amd) > + rep_movsb_threshold = non_temporal_threshold; > + > /* The default threshold to use Enhanced REP STOSB. */ > unsigned long int rep_stosb_threshold = 2048; > > @@ -1028,16 +1033,9 @@ dl_init_cacheinfo (struct cpu_features *cpu_features) > SIZE_MAX); > > unsigned long int rep_movsb_stop_threshold; > - /* ERMS feature is implemented from AMD Zen3 architecture and it is > - performing poorly for data above L2 cache size. Henceforth, adding > - an upper bound threshold parameter to limit the usage of Enhanced > - REP MOVSB operations and setting its value to L2 cache size. */ > - if (cpu_features->basic.kind == arch_kind_amd) > - rep_movsb_stop_threshold = core; > /* Setting the upper bound of ERMS to the computed value of > - non-temporal threshold for architectures other than AMD. */ > - else > - rep_movsb_stop_threshold = non_temporal_threshold; > + non-temporal threshold for all architectures. */ > + rep_movsb_stop_threshold = non_temporal_threshold; > > cpu_features->data_cache_size = data; > cpu_features->shared_cache_size = shared; > -- > 2.34.1 > LGTM. Reviewed-by: H.J. Lu <hjl.tools@gmail.com> Thanks. -- H.J. ^ permalink raw reply [flat|nested] 9+ messages in thread
* [PATCH v3 2/3] x86: Do not prefer ERMS for memset on Zen3+ 2024-02-08 13:08 [PATCH v3 0/3] x86: Improve ERMS usage on Zen3+ Adhemerval Zanella 2024-02-08 13:08 ` [PATCH v3 1/3] x86: Fix Zen3/Zen4 ERMS selection (BZ 30994) Adhemerval Zanella @ 2024-02-08 13:08 ` Adhemerval Zanella 2024-02-12 15:56 ` H.J. Lu 2024-02-08 13:08 ` [PATCH v3 3/3] x86: Expand the comment on when REP STOSB is used on memset Adhemerval Zanella 2024-03-25 15:15 ` [PATCH v3 0/3] x86: Improve ERMS usage on Zen3+ Florian Weimer 3 siblings, 1 reply; 9+ messages in thread From: Adhemerval Zanella @ 2024-02-08 13:08 UTC (permalink / raw) To: libc-alpha; +Cc: H . J . Lu, Noah Goldstein, Sajan Karumanchi, bmerry, pmallapp For AMD Zen3+ architecture, the performance of the vectorized loop is slightly better than ERMS. Checked on x86_64-linux-gnu on Zen3. --- sysdeps/x86/dl-cacheinfo.h | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h index f34d12846c..5a98f70364 100644 --- a/sysdeps/x86/dl-cacheinfo.h +++ b/sysdeps/x86/dl-cacheinfo.h @@ -1021,6 +1021,11 @@ dl_init_cacheinfo (struct cpu_features *cpu_features) minimum value is fixed. */ rep_stosb_threshold = TUNABLE_GET (x86_rep_stosb_threshold, long int, NULL); + if (cpu_features->basic.kind == arch_kind_amd + && !TUNABLE_IS_INITIALIZED (x86_rep_stosb_threshold)) + /* For AMD Zen3+ architecture, the performance of the vectorized loop is + slightly better than ERMS. */ + rep_stosb_threshold = SIZE_MAX; TUNABLE_SET_WITH_BOUNDS (x86_data_cache_size, data, 0, SIZE_MAX); TUNABLE_SET_WITH_BOUNDS (x86_shared_cache_size, shared, 0, SIZE_MAX); -- 2.34.1 ^ permalink raw reply [flat|nested] 9+ messages in thread
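The TUNABLE_IS_INITIALIZED guard means the SIZE_MAX default only applies
when the user has not set glibc.cpu.x86_rep_stosb_threshold explicitly, so
the old REP STOSB behaviour stays reachable through GLIBC_TUNABLES.  A
small sketch of that precedence follows, with the tunable machinery reduced
to an environment-variable check whose name is made up for the example.

  /* Sketch of the rep_stosb_threshold selection after this patch.  The
     real logic uses TUNABLE_GET/TUNABLE_IS_INITIALIZED in dl-cacheinfo.h;
     the environment variable below only stands in for "the user set the
     tunable explicitly".  */
  #include <stdint.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <stdbool.h>

  int
  main (void)
  {
    unsigned long rep_stosb_threshold = 2048;   /* generic default */
    bool is_amd = true;                         /* arch_kind_amd, assumed here */

    const char *user = getenv ("EXAMPLE_REP_STOSB_THRESHOLD");
    if (user != NULL)
      /* An explicit user setting always wins.  */
      rep_stosb_threshold = strtoul (user, NULL, 0);
    else if (is_amd)
      /* Otherwise never take the REP STOSB path: the vectorized loop is
         (slightly) faster on Zen3+.  */
      rep_stosb_threshold = SIZE_MAX;

    printf ("rep_stosb_threshold = %lu\n", rep_stosb_threshold);
    return 0;
  }

In real use the override is spelled
GLIBC_TUNABLES=glibc.cpu.x86_rep_stosb_threshold=<bytes>, as in the cover
letter.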
* Re: [PATCH v3 2/3] x86: Do not prefer ERMS for memset on Zen3+ 2024-02-08 13:08 ` [PATCH v3 2/3] x86: Do not prefer ERMS for memset on Zen3+ Adhemerval Zanella @ 2024-02-12 15:56 ` H.J. Lu 0 siblings, 0 replies; 9+ messages in thread From: H.J. Lu @ 2024-02-12 15:56 UTC (permalink / raw) To: Adhemerval Zanella Cc: libc-alpha, Noah Goldstein, Sajan Karumanchi, bmerry, pmallapp On Thu, Feb 8, 2024 at 5:08 AM Adhemerval Zanella <adhemerval.zanella@linaro.org> wrote: > > For AMD Zen3+ architecture, the performance of the vectorized loop is > slightly better than ERMS. > > Checked on x86_64-linux-gnu on Zen3. > --- > sysdeps/x86/dl-cacheinfo.h | 5 +++++ > 1 file changed, 5 insertions(+) > > diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h > index f34d12846c..5a98f70364 100644 > --- a/sysdeps/x86/dl-cacheinfo.h > +++ b/sysdeps/x86/dl-cacheinfo.h > @@ -1021,6 +1021,11 @@ dl_init_cacheinfo (struct cpu_features *cpu_features) > minimum value is fixed. */ > rep_stosb_threshold = TUNABLE_GET (x86_rep_stosb_threshold, > long int, NULL); > + if (cpu_features->basic.kind == arch_kind_amd > + && !TUNABLE_IS_INITIALIZED (x86_rep_stosb_threshold)) > + /* For AMD Zen3+ architecture, the performance of the vectorized loop is > + slightly better than ERMS. */ > + rep_stosb_threshold = SIZE_MAX; > > TUNABLE_SET_WITH_BOUNDS (x86_data_cache_size, data, 0, SIZE_MAX); > TUNABLE_SET_WITH_BOUNDS (x86_shared_cache_size, shared, 0, SIZE_MAX); > -- > 2.34.1 > LGTM. Reviewed-by: H.J. Lu <hjl.tools@gmail.com> Thanks. -- H.J. ^ permalink raw reply [flat|nested] 9+ messages in thread
* [PATCH v3 3/3] x86: Expand the comment on when REP STOSB is used on memset 2024-02-08 13:08 [PATCH v3 0/3] x86: Improve ERMS usage on Zen3+ Adhemerval Zanella 2024-02-08 13:08 ` [PATCH v3 1/3] x86: Fix Zen3/Zen4 ERMS selection (BZ 30994) Adhemerval Zanella 2024-02-08 13:08 ` [PATCH v3 2/3] x86: Do not prefer ERMS for memset on Zen3+ Adhemerval Zanella @ 2024-02-08 13:08 ` Adhemerval Zanella 2024-02-12 15:56 ` H.J. Lu 2024-03-25 15:15 ` [PATCH v3 0/3] x86: Improve ERMS usage on Zen3+ Florian Weimer 3 siblings, 1 reply; 9+ messages in thread From: Adhemerval Zanella @ 2024-02-08 13:08 UTC (permalink / raw) To: libc-alpha; +Cc: H . J . Lu, Noah Goldstein, Sajan Karumanchi, bmerry, pmallapp --- sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S index 9984c3ca0f..97839a2248 100644 --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S @@ -21,7 +21,9 @@ 2. If size is less than VEC, use integer register stores. 3. If size is from VEC_SIZE to 2 * VEC_SIZE, use 2 VEC stores. 4. If size is from 2 * VEC_SIZE to 4 * VEC_SIZE, use 4 VEC stores. - 5. If size is more to 4 * VEC_SIZE, align to 4 * VEC_SIZE with + 5. On machines ERMS feature, if size is greater or equal than + __x86_rep_stosb_threshold then REP STOSB will be used. + 6. If size is more to 4 * VEC_SIZE, align to 4 * VEC_SIZE with 4 VEC stores and store 4 * VEC at a time until done. */ #include <sysdep.h> -- 2.34.1 ^ permalink raw reply [flat|nested] 9+ messages in thread
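The updated comment spells out the decision order; the following is a
C-level restatement of those steps, a sketch only, with VEC_SIZE pinned to
32 bytes (an AVX2 build), step 1 (the overlapping-store trick) omitted, and
the exact boundary conditions of the assembly simplified.

  /* Sketch of the memset strategy order documented in
     memset-vec-unaligned-erms.S; VEC_SIZE = 32 assumes an AVX2 build and
     the strings stand in for the assembly paths.  */
  #include <stdio.h>
  #include <stddef.h>

  #define VEC_SIZE 32

  static const char *
  memset_strategy (size_t n, size_t rep_stosb_threshold)
  {
    if (n < VEC_SIZE)
      return "integer register stores";          /* step 2 */
    if (n <= 2 * VEC_SIZE)
      return "2 VEC stores";                     /* step 3 */
    if (n <= 4 * VEC_SIZE)
      return "4 VEC stores";                     /* step 4 */
    if (n >= rep_stosb_threshold)
      return "REP STOSB";                        /* step 5, ERMS machines */
    return "aligned loop of 4 VEC stores";       /* step 6 */
  }

  int
  main (void)
  {
    static const size_t sizes[] = { 16, 64, 128, 1024, 4096 };
    /* 2048 is the generic default; after patch 2 of this series AMD Zen3+
       effectively uses SIZE_MAX here, so step 5 is never taken.  */
    for (size_t i = 0; i < sizeof sizes / sizeof sizes[0]; i++)
      printf ("%5zu bytes -> %s\n", sizes[i], memset_strategy (sizes[i], 2048));
    return 0;
  }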
* Re: [PATCH v3 3/3] x86: Expand the comment on when REP STOSB is used on memset 2024-02-08 13:08 ` [PATCH v3 3/3] x86: Expand the comment on when REP STOSB is used on memset Adhemerval Zanella @ 2024-02-12 15:56 ` H.J. Lu 0 siblings, 0 replies; 9+ messages in thread From: H.J. Lu @ 2024-02-12 15:56 UTC (permalink / raw) To: Adhemerval Zanella Cc: libc-alpha, Noah Goldstein, Sajan Karumanchi, bmerry, pmallapp On Thu, Feb 8, 2024 at 5:08 AM Adhemerval Zanella <adhemerval.zanella@linaro.org> wrote: > > --- > sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S | 4 +++- > 1 file changed, 3 insertions(+), 1 deletion(-) > > diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S > index 9984c3ca0f..97839a2248 100644 > --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S > +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S > @@ -21,7 +21,9 @@ > 2. If size is less than VEC, use integer register stores. > 3. If size is from VEC_SIZE to 2 * VEC_SIZE, use 2 VEC stores. > 4. If size is from 2 * VEC_SIZE to 4 * VEC_SIZE, use 4 VEC stores. > - 5. If size is more to 4 * VEC_SIZE, align to 4 * VEC_SIZE with > + 5. On machines ERMS feature, if size is greater or equal than > + __x86_rep_stosb_threshold then REP STOSB will be used. > + 6. If size is more to 4 * VEC_SIZE, align to 4 * VEC_SIZE with > 4 VEC stores and store 4 * VEC at a time until done. */ > > #include <sysdep.h> > -- > 2.34.1 > LGTM. Reviewed-by: H.J. Lu <hjl.tools@gmail.com> Thanks. -- H.J. ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH v3 0/3] x86: Improve ERMS usage on Zen3+ 2024-02-08 13:08 [PATCH v3 0/3] x86: Improve ERMS usage on Zen3+ Adhemerval Zanella ` (2 preceding siblings ...) 2024-02-08 13:08 ` [PATCH v3 3/3] x86: Expand the comment on when REP STOSB is used on memset Adhemerval Zanella @ 2024-03-25 15:15 ` Florian Weimer 2024-03-25 15:19 ` H.J. Lu 3 siblings, 1 reply; 9+ messages in thread From: Florian Weimer @ 2024-03-25 15:15 UTC (permalink / raw) To: Adhemerval Zanella Cc: libc-alpha, H . J . Lu, Noah Goldstein, Sajan Karumanchi, bmerry, pmallapp, Michael Hudson-Doyle, Simon Chopin * Adhemerval Zanella: > Adhemerval Zanella (3): > x86: Fix Zen3/Zen4 ERMS selection (BZ 30994) > x86: Do not prefer ERMS for memset on Zen3+ > x86: Expand the comment on when REP STOSB is used on memset > > sysdeps/x86/dl-cacheinfo.h | 43 ++++++++++--------- > .../multiarch/memset-vec-unaligned-erms.S | 4 +- > 2 files changed, 26 insertions(+), 21 deletions(-) Should we backport this into release branches? This issue was first raised as an Ubuntu downstream bug, so maybe they'd appreciate a backport for this? We want this backport in Fedora, so it would help us, too. Thanks, Florian ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH v3 0/3] x86: Improve ERMS usage on Zen3+ 2024-03-25 15:15 ` [PATCH v3 0/3] x86: Improve ERMS usage on Zen3+ Florian Weimer @ 2024-03-25 15:19 ` H.J. Lu 0 siblings, 0 replies; 9+ messages in thread From: H.J. Lu @ 2024-03-25 15:19 UTC (permalink / raw) To: Florian Weimer Cc: Adhemerval Zanella, libc-alpha, Noah Goldstein, Sajan Karumanchi, bmerry, pmallapp, Michael Hudson-Doyle, Simon Chopin On Mon, Mar 25, 2024 at 8:15 AM Florian Weimer <fweimer@redhat.com> wrote: > > * Adhemerval Zanella: > > > Adhemerval Zanella (3): > > x86: Fix Zen3/Zen4 ERMS selection (BZ 30994) > > x86: Do not prefer ERMS for memset on Zen3+ > > x86: Expand the comment on when REP STOSB is used on memset > > > > sysdeps/x86/dl-cacheinfo.h | 43 ++++++++++--------- > > .../multiarch/memset-vec-unaligned-erms.S | 4 +- > > 2 files changed, 26 insertions(+), 21 deletions(-) > > Should we backport this into release branches? Yes. > This issue was first raised as an Ubuntu downstream bug, so maybe they'd > appreciate a backport for this? We want this backport in Fedora, so it > would help us, too. > > Thanks, > Florian > -- H.J. ^ permalink raw reply [flat|nested] 9+ messages in thread