* [Bug string/30995] Zen 4: sub-optimal memcpy on very large copies
2023-10-24 7:38 [Bug string/30995] New: Zen 4: sub-optimal memcpy on very large copies bmerry at sarao dot ac.za
@ 2023-10-24 17:56 ` sam at gentoo dot org
2023-10-25 10:16 ` bmerry at sarao dot ac.za
` (11 subsequent siblings)
12 siblings, 0 replies; 14+ messages in thread
From: sam at gentoo dot org @ 2023-10-24 17:56 UTC (permalink / raw)
To: glibc-bugs
https://sourceware.org/bugzilla/show_bug.cgi?id=30995
Sam James <sam at gentoo dot org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |sam at gentoo dot org
--
You are receiving this mail because:
You are on the CC list for the bug.
* [Bug string/30995] Zen 4: sub-optimal memcpy on very large copies
From: bmerry at sarao dot ac.za @ 2023-10-25 10:16 UTC (permalink / raw)
To: glibc-bugs
https://sourceware.org/bugzilla/show_bug.cgi?id=30995
--- Comment #1 from Bruce Merry <bmerry at sarao dot ac.za> ---
I've only got access to the Zen 4 systems until the end of the week, so if
there are any diagnostics that would be useful to capture, let me know ASAP.
There is some further information attached to #30994.
* [Bug string/30995] Zen 4: sub-optimal memcpy on very large copies
From: fweimer at redhat dot com @ 2023-10-25 12:50 UTC (permalink / raw)
To: glibc-bugs
https://sourceware.org/bugzilla/show_bug.cgi?id=30995
Florian Weimer <fweimer at redhat dot com> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |fweimer at redhat dot com
--- Comment #2 from Florian Weimer <fweimer at redhat dot com> ---
This is likely a trade-off between whole-system performance and
single-thread/single-process performance. To avoid evicting higher-level,
shared caches, it is usually beneficial to switch from REP MOVSB to
non-temporal stores at a certain point even though it impacts single-thread
performance.
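
For readers unfamiliar with the technique, here is a minimal sketch of a copy
that switches to non-temporal stores above a size threshold. It is not glibc's
implementation. It assumes x86-64 with SSE2, and the names copy_with_nt_stores
and NT_THRESHOLD (and the 1 MiB value) are hypothetical, chosen only for
illustration.

```c
#include <emmintrin.h> /* SSE2: _mm_loadu_si128, _mm_stream_si128, _mm_sfence */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical cut-over size, not the value glibc uses. */
#define NT_THRESHOLD ((size_t)1 << 20)

static void *copy_with_nt_stores(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Small copies stay on the ordinary, cache-friendly path. */
    if (n < NT_THRESHOLD)
        return memcpy(dst, src, n);

    /* Copy a short head so the destination becomes 16-byte aligned,
       which _mm_stream_si128 requires. */
    size_t head = (16 - ((uintptr_t)d & 15)) & 15;
    memcpy(d, s, head);
    d += head;
    s += head;
    n -= head;

    /* Stream whole 16-byte blocks; these stores bypass the caches. */
    for (; n >= 16; d += 16, s += 16, n -= 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)s);
        _mm_stream_si128((__m128i *)d, v);
    }

    /* Non-temporal stores are weakly ordered, so fence before returning. */
    _mm_sfence();

    memcpy(d, s, n); /* remaining tail of fewer than 16 bytes */
    return dst;
}
```

A production memcpy would use wider vectors and prefetching; the point here is
only the size-based switch and the fence after the streaming loop.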
* [Bug string/30995] Zen 4: sub-optimal memcpy on very large copies
From: bmerry at sarao dot ac.za @ 2023-10-25 13:21 UTC (permalink / raw)
To: glibc-bugs
https://sourceware.org/bugzilla/show_bug.cgi?id=30995
--- Comment #3 from Bruce Merry <bmerry at sarao dot ac.za> ---
> To avoid evicting higher-level, shared caches, it is usually beneficial to switch from REP MOVSB to non-temporal stores at a certain point even though it impacts single-thread performance.
From what I can tell, REP MOVSB on Zen 4 already does this for large copies. I
base that on using AMD uProf to read the DRAM bandwidth counters while running
the copy benchmark. When copying 1GB with a single REP MOVSB, the read and
write counters match the rate of data transfer (no read-for-ownership
overhead). When breaking the copy into smaller pieces (less than 32MB), the
read counter is roughly double the transfer rate due to read-for-ownership.
I've tried running the benchmark on all 32 cores of the CPU; in this case
glibc's memcpy is about 5% faster than using REP MOVSB (and my simple AVX512
streaming copy with a linear access pattern gets pretty much the same
performance as REP MOVSB in this case). So you're correct that there is a
trade-off, but being 5% faster when bandwidth-limited while 30% slower on a
single core (as well as using more space in icache) doesn't seem like a great
trade-off. (I appreciate that writing a memcpy that works well across a wide
range of hardware is no easy task, though.)
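
The two variants compared above (one REP MOVSB for the whole buffer versus the
same copy split into smaller pieces) can be sketched as follows. This is an
illustration, not the code from the benchmark. It assumes x86-64 and GCC or
Clang inline assembly, and the function names are hypothetical.

```c
#include <stddef.h>

/* Copy n bytes with a single REP MOVSB instruction.
   RDI = destination, RSI = source, RCX = byte count; the ABI guarantees
   the direction flag is clear, so the copy runs forward. */
static void *copy_rep_movsb(void *dst, const void *src, size_t n)
{
    void *ret = dst;
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
    return ret;
}

/* The same copy issued as separate REP MOVSB instructions of at most
   `chunk` bytes each. A chunk of 0 is treated as "do not split". */
static void *copy_rep_movsb_chunked(void *dst, const void *src, size_t n,
                                    size_t chunk)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    if (chunk == 0)
        chunk = n;
    while (n > 0) {
        size_t step = n < chunk ? n : chunk;
        copy_rep_movsb(d, s, step);
        d += step;
        s += step;
        n -= step;
    }
    return dst;
}
```

Both functions produce identical results; any difference in DRAM read traffic
between them comes from how the hardware handles each individual REP MOVSB.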
* [Bug string/30995] Zen 4: sub-optimal memcpy on very large copies
From: adhemerval.zanella at linaro dot org @ 2023-10-30 12:30 UTC (permalink / raw)
To: glibc-bugs
https://sourceware.org/bugzilla/show_bug.cgi?id=30995
Adhemerval Zanella <adhemerval.zanella at linaro dot org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |adhemerval.zanella at linaro dot org
--- Comment #4 from Adhemerval Zanella <adhemerval.zanella at linaro dot org> ---
On Zen3 I can confirm that REP MOVSB is not faster than the vectorized path,
but with an unaligned destination the results are also subpar:
# Default non-temporal stores
$ ./memcpy_loop -f memcpy -D 1
4.19552
# GLIBC_TUNABLES=glibc.cpu.x86_non_temporal_threshold=134217730
$ ./memcpy_loop -f memcpy -D 1
11.7379
# Modified glibc with tunables to force REP MOVSB
$ ./memcpy_loop -f memcpy -D 1
1.01945
With aligned stores I see ~20 GB on Zen3. I am even more convinced that REP
MOVSB is not really a good strategy for Zen3.
I still think it would be better to avoid non-temporal stores for unaligned
inputs on Zen3. Another possibility would be to avoid unaligned stores, but it
would require adding another code path that might not be optimal for all x86
cpus.
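
The tunable used in the second run above is read by glibc at process startup,
so it has to be in the environment before the benchmark is executed. Below is
a minimal sketch of a launcher that does this, assuming a POSIX system. The
helper names are hypothetical, and the launcher overwrites any GLIBC_TUNABLES
value that is already set.

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Format "glibc.cpu.x86_non_temporal_threshold=<bytes>" into buf.
   Returns 0 on success, -1 if buf is too small. */
static int format_nt_threshold(char *buf, size_t len, unsigned long bytes)
{
    int n = snprintf(buf, len,
                     "glibc.cpu.x86_non_temporal_threshold=%lu", bytes);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Replace the current process with cmd, run under the given threshold.
   Only returns (with -1) if something failed. */
static int exec_with_nt_threshold(unsigned long bytes, char *const cmd[])
{
    char value[96];

    if (format_nt_threshold(value, sizeof value, bytes) != 0)
        return -1;
    if (setenv("GLIBC_TUNABLES", value, 1) != 0)
        return -1;
    execvp(cmd[0], cmd);
    return -1;
}
```

Called with 134217730 (128 MiB plus 2 bytes) and the command
{"./memcpy_loop", "-f", "memcpy", "-D", "1", NULL}, this would reproduce the
environment of the second run.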
* [Bug string/30995] Zen 4: sub-optimal memcpy on very large copies
From: adhemerval.zanella at linaro dot org @ 2023-10-30 12:34 UTC (permalink / raw)
To: glibc-bugs
https://sourceware.org/bugzilla/show_bug.cgi?id=30995
--- Comment #5 from Adhemerval Zanella <adhemerval.zanella at linaro dot org> ---
As a side-note, the benchmark you are referring to does not have some of the
options you are using (-r, memcpy_rep_movsb).
* [Bug string/30995] Zen 4: sub-optimal memcpy on very large copies
From: bmerry at sarao dot ac.za @ 2023-10-30 14:00 UTC (permalink / raw)
To: glibc-bugs
https://sourceware.org/bugzilla/show_bug.cgi?id=30995
--- Comment #6 from Bruce Merry <bmerry at sarao dot ac.za> ---
> As a side-note, the benchmark you are referring to does not have some of the options you are using (-r, memcpy_rep_movsb).
Oh, I linked to a fixed commit and those features have since been added to
main: https://github.com/ska-sa/katgpucbf/blob/main/scratch/memcpy_loop.cpp
* [Bug string/30995] Zen 4: sub-optimal memcpy on very large copies
From: bmerry at sarao dot ac.za @ 2023-10-30 14:24 UTC (permalink / raw)
To: glibc-bugs
https://sourceware.org/bugzilla/show_bug.cgi?id=30995
--- Comment #7 from Bruce Merry <bmerry at sarao dot ac.za> ---
> I still think it would be better to avoid non-temporal stores for unaligned inputs on Zen3.
As noted in comment 2, non-temporal stores for large copies have benefits that
won't show up in a single-threaded microbenchmark: both less pollution of the
shared cache, and 1/3 less DRAM bandwidth (eliminates read-for-ownership).
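
The 1/3 figure follows from counting DRAM transfers, assuming the copy is far
larger than the caches so that every byte travels to or from memory. A regular
store path reads the source, reads the destination for ownership, and writes
the destination back; a non-temporal path skips the read-for-ownership. The
function names below are hypothetical and only encode that arithmetic.

```c
/* Bytes moved over the DRAM bus when copying n bytes with regular stores:
   source read + destination read-for-ownership + destination write-back. */
static unsigned long long dram_traffic_regular(unsigned long long n)
{
    return 3 * n;
}

/* Bytes moved when copying n bytes with non-temporal stores:
   source read + destination write, with no read-for-ownership. */
static unsigned long long dram_traffic_non_temporal(unsigned long long n)
{
    return 2 * n;
}
```

For a 1 GiB copy this model gives 3 GiB of traffic with regular stores and
2 GiB with non-temporal stores, a saving of one third.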
* [Bug string/30995] Zen 4: sub-optimal memcpy on very large copies
From: adhemerval.zanella at linaro dot org @ 2023-10-30 16:17 UTC (permalink / raw)
To: glibc-bugs
https://sourceware.org/bugzilla/show_bug.cgi?id=30995
--- Comment #8 from Adhemerval Zanella <adhemerval.zanella at linaro dot org> ---
(In reply to Bruce Merry from comment #7)
> > I still think it would be better to avoid non-temporal stores for unaligned inputs on Zen3.
>
> As noted in comment 2, non-temporal stores for large copies have benefits
> that won't show up in a single-threaded microbenchmark: both less pollution
> of the shared cache, and 1/3 less DRAM bandwidth (eliminates
> read-for-ownership).
Indeed, after some tests, the performance difference when multiple issuers are
involved does seem to show the advantage of non-temporal stores even for
unaligned arguments.
* [Bug string/30995] Zen 4: sub-optimal memcpy on very large copies
From: jamborm at gcc dot gnu.org @ 2023-11-07 13:01 UTC (permalink / raw)
To: glibc-bugs
https://sourceware.org/bugzilla/show_bug.cgi?id=30995
Martin Jambor <jamborm at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |jamborm at gcc dot gnu.org
* [Bug string/30995] Zen 4: sub-optimal memcpy on very large copies
From: gabravier at gmail dot com @ 2023-11-29 17:27 UTC (permalink / raw)
To: glibc-bugs
https://sourceware.org/bugzilla/show_bug.cgi?id=30995
Gabriel Ravier <gabravier at gmail dot com> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |gabravier at gmail dot com
* [Bug string/30995] Zen 4: sub-optimal memcpy on very large copies
From: sam at gentoo dot org @ 2023-11-29 17:30 UTC (permalink / raw)
To: glibc-bugs
https://sourceware.org/bugzilla/show_bug.cgi?id=30995
Sam James <sam at gentoo dot org> changed:
What |Removed |Added
----------------------------------------------------------------------------
See Also| |https://sourceware.org/bugzilla/show_bug.cgi?id=30994
* [Bug string/30995] Zen 4: sub-optimal memcpy on very large copies
From: pageexec at gmail dot com @ 2023-11-29 21:08 UTC (permalink / raw)
To: glibc-bugs
https://sourceware.org/bugzilla/show_bug.cgi?id=30995
PaX Team <pageexec at gmail dot com> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |pageexec at gmail dot com