public inbox for glibc-bugs@sourceware.org
* [Bug string/30995] New: Zen 4: sub-optimal memcpy on very large copies
From: bmerry at sarao dot ac.za @ 2023-10-24 7:38 UTC (permalink / raw)
To: glibc-bugs
https://sourceware.org/bugzilla/show_bug.cgi?id=30995
Bug ID: 30995
Summary: Zen 4: sub-optimal memcpy on very large copies
Product: glibc
Version: 2.38
Status: UNCONFIRMED
Severity: minor
Priority: P2
Component: string
Assignee: unassigned at sourceware dot org
Reporter: bmerry at sarao dot ac.za
Target Milestone: ---
At sizes significantly larger than 32MB, the copy strategy seems to perform
worse on Zen 4 than either REP MOVSB or a more naive AVX-512 streaming copy.
Steps to reproduce:
1. Compile the microbench at
https://github.com/ska-sa/katgpucbf/blob/6176ed2e1f5eccf7f2acc97e4779141ac794cc01/scratch/memcpy_loop.cpp
using the adjacent Makefile (or g++ -std=c++17 -Wall -O3 -pthread -o
memcpy_loop memcpy_loop.cpp)
2. Run it as ./memcpy_loop -f memcpy -r 5
3. Run it again as ./memcpy_loop -f memcpy_rep_movsb -r 5
4. Run it again as ./memcpy_loop -f memcpy_stream_avx512 -r 5
On the system I'm testing, the first reports 19.2 GB/s while the second (which
directly invokes REP MOVSB) reports 27-27.5 GB/s and the third (a
straightforward non-temporal AVX-512 implementation) reports 27.8 GB/s. This
is for a 128 MiB copy (other sizes can be passed to the benchmark with -b).
Interestingly, I don't see this regression on a similarly-configured Zen 3
system, where memcpy and memcpy_rep_movsb seem to have roughly the same
performance on large copies. This is in spite of the comment at
https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86/dl-cacheinfo.h;h=87486054f931e52f53123c672217f1903297ec76;hb=HEAD#l1031
claiming that Zen 3's REP MOVSB performs poorly on large copies.
System information: Epyc 9374F processor, Ubuntu 22.04, glibc compiled from git
glibc-2.38.9000-185-g2aa0974d25
--
You are receiving this mail because:
You are on the CC list for the bug.
* [Bug string/30995] Zen 4: sub-optimal memcpy on very large copies
From: sam at gentoo dot org @ 2023-10-24 17:56 UTC (permalink / raw)
To: glibc-bugs
https://sourceware.org/bugzilla/show_bug.cgi?id=30995
Sam James <sam at gentoo dot org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |sam at gentoo dot org
* [Bug string/30995] Zen 4: sub-optimal memcpy on very large copies
From: bmerry at sarao dot ac.za @ 2023-10-25 10:16 UTC (permalink / raw)
To: glibc-bugs
https://sourceware.org/bugzilla/show_bug.cgi?id=30995
--- Comment #1 from Bruce Merry <bmerry at sarao dot ac.za> ---
I've only got access to the Zen 4 systems until the end of the week, so if
there are any diagnostics that would be useful to capture, let me know ASAP.
There is some further information attached to #30994.
* [Bug string/30995] Zen 4: sub-optimal memcpy on very large copies
From: fweimer at redhat dot com @ 2023-10-25 12:50 UTC (permalink / raw)
To: glibc-bugs
https://sourceware.org/bugzilla/show_bug.cgi?id=30995
Florian Weimer <fweimer at redhat dot com> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |fweimer at redhat dot com
--- Comment #2 from Florian Weimer <fweimer at redhat dot com> ---
This is likely a trade-off between whole-system performance and
single-thread/single-process performance. To avoid evicting higher-level,
shared caches, it is usually beneficial to switch from REP MOVSB to
non-temporal stores at a certain point even though it impacts single-thread
performance.
* [Bug string/30995] Zen 4: sub-optimal memcpy on very large copies
From: bmerry at sarao dot ac.za @ 2023-10-25 13:21 UTC (permalink / raw)
To: glibc-bugs
https://sourceware.org/bugzilla/show_bug.cgi?id=30995
--- Comment #3 from Bruce Merry <bmerry at sarao dot ac.za> ---
> To avoid evicting higher-level, shared caches, it is usually beneficial to switch from REP MOVSB to non-temporal stores at a certain point even though it impacts single-thread performance.
From what I can tell, REP MOVSB on Zen 4 already does this for large copies. I
base that off using AMD uProf to read the DRAM bandwidth counters while running
the copy benchmark. When copying 1GB with a single REP MOVSB, the read and
write counters match the rate of data transfer (no read-for-ownership
overhead). When breaking the copy into smaller pieces (less than 32MB), the
read counter is roughly double the transfer rate due to read-for-ownership.
I've tried running the benchmark on all 32 cores of the CPU; in this case
glibc's memcpy is about 5% faster than using REP MOVSB (and my simple AVX512
streaming copy with a linear access pattern gets pretty much the same
performance as REP MOVSB in this case). So you're correct that there is a
trade-off, but being 5% faster when bandwidth-limited but 30% slower on a
single core (as well as using more space in icache) doesn't seem like a great
tradeoff (I appreciate that trying to write a memcpy that works well across a
wide range of hardware is no easy task though).
* [Bug string/30995] Zen 4: sub-optimal memcpy on very large copies
From: adhemerval.zanella at linaro dot org @ 2023-10-30 12:30 UTC (permalink / raw)
To: glibc-bugs
https://sourceware.org/bugzilla/show_bug.cgi?id=30995
Adhemerval Zanella <adhemerval.zanella at linaro dot org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |adhemerval.zanella at linaro dot org
--- Comment #4 from Adhemerval Zanella <adhemerval.zanella at linaro dot org> ---
On Zen3 I can confirm that REP MOVSB is not faster than the vectorized path,
but with an unaligned destination the results are also subpar:
# Default non-temporal stores
$ ./memcpy_loop -f memcpy -D 1
4.19552
# GLIBC_TUNABLES=glibc.cpu.x86_non_temporal_threshold=134217730
$ ./memcpy_loop -f memcpy -D 1
11.7379
# Modified glibc with tunables to force REP MOVSB
$ ./memcpy_loop -f memcpy -D 1
1.01945
With aligned stores I see ~20 GB/s on Zen3. I am even more convinced that REP
MOVSB is not really a good strategy for Zen3.
I still think it would be better to avoid non-temporal stores for unaligned
inputs on Zen3. Another possibility would be to avoid unaligned stores, but it would
require adding another code path that might not be optimal for all x86 CPUs.
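The threshold used in the second run above works out to 128 MiB plus two bytes. Presumably it was chosen to sit just above the benchmark's 128 MiB copy size so that the non-temporal path is not taken; that reading is an assumption, not something stated in this comment. The helper name non_temporal_tunable below is hypothetical; only the tunable name itself comes from the command shown.

```cpp
#include <cstdint>
#include <string>

constexpr std::uint64_t MiB = 1024ull * 1024ull;

// The value quoted above is exactly two bytes more than 128 MiB.
static_assert(128 * MiB + 2 == 134217730ull,
              "threshold is 128 MiB + 2 bytes");

// Hypothetical helper: builds the string to place in GLIBC_TUNABLES
// for a given non-temporal threshold in bytes.
inline std::string non_temporal_tunable(std::uint64_t threshold_bytes)
{
    return "glibc.cpu.x86_non_temporal_threshold=" +
           std::to_string(threshold_bytes);
}
```

The resulting string is what would be exported as GLIBC_TUNABLES before launching the benchmark.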
* [Bug string/30995] Zen 4: sub-optimal memcpy on very large copies
From: adhemerval.zanella at linaro dot org @ 2023-10-30 12:34 UTC (permalink / raw)
To: glibc-bugs
https://sourceware.org/bugzilla/show_bug.cgi?id=30995
--- Comment #5 from Adhemerval Zanella <adhemerval.zanella at linaro dot org> ---
As a side-note, the benchmark you are referring to does not have some of the
options you are using (-r, memcpy_rep_movsb).
* [Bug string/30995] Zen 4: sub-optimal memcpy on very large copies
From: bmerry at sarao dot ac.za @ 2023-10-30 14:00 UTC (permalink / raw)
To: glibc-bugs
https://sourceware.org/bugzilla/show_bug.cgi?id=30995
--- Comment #6 from Bruce Merry <bmerry at sarao dot ac.za> ---
> As a side-note, the benchmark you are referring to does not have some of the options you are using (-r, memcpy_rep_movsb).
Oh, I linked to a fixed commit and those features have since been added to
main: https://github.com/ska-sa/katgpucbf/blob/main/scratch/memcpy_loop.cpp
* [Bug string/30995] Zen 4: sub-optimal memcpy on very large copies
From: bmerry at sarao dot ac.za @ 2023-10-30 14:24 UTC (permalink / raw)
To: glibc-bugs
https://sourceware.org/bugzilla/show_bug.cgi?id=30995
--- Comment #7 from Bruce Merry <bmerry at sarao dot ac.za> ---
> I still think it would be better to avoid non-temporal stores for unaligned inputs on Zen3.
As noted in comment 2, non-temporal stores for large copies have benefits that
won't show up in a single-threaded microbenchmark: both less pollution of the
shared cache, and 1/3 less DRAM bandwidth (eliminates read-for-ownership).
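The 1/3 figure follows from simple accounting, sketched below under the assumption that both buffers are far larger than the caches, so every cache line goes to and from DRAM. With ordinary stores each destination line is first read for ownership and later written back, so copying N bytes costs 2N of reads and N of writes. With non-temporal stores the ownership read disappears. The type and function names are hypothetical.

```cpp
#include <cstdint>

// Hypothetical model of DRAM traffic for a copy of n bytes.
struct DramTraffic {
    std::uint64_t read_bytes;
    std::uint64_t write_bytes;
    constexpr std::uint64_t total() const { return read_bytes + write_bytes; }
};

// Ordinary stores: read source (n) + read-for-ownership of destination (n),
// then write destination back (n).
constexpr DramTraffic regular_copy(std::uint64_t n) { return {2 * n, n}; }

// Non-temporal stores: read source (n), write destination (n), no ownership read.
constexpr DramTraffic non_temporal_copy(std::uint64_t n) { return {n, n}; }

// 3n versus 2n total traffic: one third less with non-temporal stores.
static_assert(regular_copy(3).total() == 9, "regular copy moves 3n bytes");
static_assert(non_temporal_copy(3).total() == 6, "streaming copy moves 2n bytes");
```

The same model matches the uProf observation in comment 3: the read counter is about double the transfer rate when read-for-ownership occurs, and equal to it when it does not.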
* [Bug string/30995] Zen 4: sub-optimal memcpy on very large copies
From: adhemerval.zanella at linaro dot org @ 2023-10-30 16:17 UTC (permalink / raw)
To: glibc-bugs
https://sourceware.org/bugzilla/show_bug.cgi?id=30995
--- Comment #8 from Adhemerval Zanella <adhemerval.zanella at linaro dot org> ---
(In reply to Bruce Merry from comment #7)
> > I still think it would be better to avoid non-temporal stores for unaligned inputs on Zen3.
>
> As noted in comment 2, non-temporal stores for large copies have benefits
> that won't show up in a single-threaded microbenchmark: both less pollution
> of the shared cache, and 1/3 less DRAM bandwidth (eliminates
> read-for-ownership).
Indeed, after some tests, the performance difference when multiple issuers
are involved does seem to show the advantage of non-temporal stores even for
unaligned arguments.
* [Bug string/30995] Zen 4: sub-optimal memcpy on very large copies
From: jamborm at gcc dot gnu.org @ 2023-11-07 13:01 UTC (permalink / raw)
To: glibc-bugs
https://sourceware.org/bugzilla/show_bug.cgi?id=30995
Martin Jambor <jamborm at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |jamborm at gcc dot gnu.org
* [Bug string/30995] Zen 4: sub-optimal memcpy on very large copies
From: gabravier at gmail dot com @ 2023-11-29 17:27 UTC (permalink / raw)
To: glibc-bugs
https://sourceware.org/bugzilla/show_bug.cgi?id=30995
Gabriel Ravier <gabravier at gmail dot com> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |gabravier at gmail dot com
* [Bug string/30995] Zen 4: sub-optimal memcpy on very large copies
From: sam at gentoo dot org @ 2023-11-29 17:30 UTC (permalink / raw)
To: glibc-bugs
https://sourceware.org/bugzilla/show_bug.cgi?id=30995
Sam James <sam at gentoo dot org> changed:
What |Removed |Added
----------------------------------------------------------------------------
See Also| |https://sourceware.org/bugzilla/show_bug.cgi?id=30994
* [Bug string/30995] Zen 4: sub-optimal memcpy on very large copies
From: pageexec at gmail dot com @ 2023-11-29 21:08 UTC (permalink / raw)
To: glibc-bugs
https://sourceware.org/bugzilla/show_bug.cgi?id=30995
PaX Team <pageexec at gmail dot com> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |pageexec at gmail dot com
Thread overview: 14+ messages
2023-10-24 7:38 [Bug string/30995] New: Zen 4: sub-optimal memcpy on very large copies bmerry at sarao dot ac.za
2023-10-24 17:56 ` [Bug string/30995] " sam at gentoo dot org
2023-10-25 10:16 ` bmerry at sarao dot ac.za
2023-10-25 12:50 ` fweimer at redhat dot com
2023-10-25 13:21 ` bmerry at sarao dot ac.za
2023-10-30 12:30 ` adhemerval.zanella at linaro dot org
2023-10-30 12:34 ` adhemerval.zanella at linaro dot org
2023-10-30 14:00 ` bmerry at sarao dot ac.za
2023-10-30 14:24 ` bmerry at sarao dot ac.za
2023-10-30 16:17 ` adhemerval.zanella at linaro dot org
2023-11-07 13:01 ` jamborm at gcc dot gnu.org
2023-11-29 17:27 ` gabravier at gmail dot com
2023-11-29 17:30 ` sam at gentoo dot org
2023-11-29 21:08 ` pageexec at gmail dot com