[Bug string/30995] New: Zen 4: sub-optimal memcpy on very large copies

public inbox for glibc-bugs@sourceware.org

* [Bug string/30995] New: Zen 4: sub-optimal memcpy on very large copies
@ 2023-10-24  7:38 bmerry at sarao dot ac.za
  2023-10-24 17:56 ` [Bug string/30995] " sam at gentoo dot org
                   ` (12 more replies)
  0 siblings, 13 replies; 14+ messages in thread
From: bmerry at sarao dot ac.za @ 2023-10-24  7:38 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=30995

            Bug ID: 30995
           Summary: Zen 4: sub-optimal memcpy on very large copies
           Product: glibc
           Version: 2.38
            Status: UNCONFIRMED
          Severity: minor
          Priority: P2
         Component: string
          Assignee: unassigned at sourceware dot org
          Reporter: bmerry at sarao dot ac.za
  Target Milestone: ---

At sizes significantly larger than 32MB, the copy strategy seems to perform
worse on Zen 4 than either REP MOVSB or a more naive AVX-512 streaming copy.

Steps to reproduce:
1. Compile the microbench at
https://github.com/ska-sa/katgpucbf/blob/6176ed2e1f5eccf7f2acc97e4779141ac794cc01/scratch/memcpy_loop.cpp
using the adjacent Makefile (or g++ -std=c++17 -Wall -O3 -pthread -o
memcpy_loop memcpy_loop.cpp)
2. Run it as ./memcpy_loop -f memcpy -r 5
3. Run it again as ./memcpy_loop -f memcpy_rep_movsb -r 5
4. Run it again as ./memcpy_loop -f memcpy_stream_avx512 -r 5
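
For readers without the benchmark at hand, the kind of measurement these steps
perform can be sketched as follows. This is not the code from memcpy_loop.cpp:
the helper name copy_rate_gbps, the pass count, and the convention that 1 GB
means 10^9 bytes are all assumptions made for illustration.

```cpp
#include <chrono>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper (not taken from memcpy_loop.cpp): time `passes`
// back-to-back copies of `bytes` bytes and return the payload rate in GB/s,
// assuming 1 GB = 10^9 bytes.
double copy_rate_gbps(std::size_t bytes, int passes)
{
    // Filling both buffers touches every page before the timer starts.
    std::vector<char> src(bytes, 1), dst(bytes, 0);
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < passes; i++)
    {
        std::memcpy(dst.data(), src.data(), bytes);
        // Compiler barrier (GCC/Clang syntax) so the copies are not elided.
        asm volatile("" : : "r"(dst.data()) : "memory");
    }
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return double(bytes) * passes / elapsed.count() / 1e9;
}
```

Calling copy_rate_gbps(std::size_t(128) << 20, 5) would time five 128 MiB
copies; the real benchmark may pin threads and allocate memory differently.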

On the system I'm testing, the first reports 19.2 GB/s, while the second (which
directly invokes REP MOVSB) reports 27-27.5 GB/s and the third (a
straightforward non-temporal AVX-512 implementation) reports 27.8 GB/s. This
is for a 128 MiB copy (other sizes can be passed to the benchmark with -b).
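
To make "directly invokes REP MOVSB" concrete, a minimal sketch of such a copy
routine is shown below. It is an illustration, not the benchmark's
memcpy_rep_movsb: the function name copy_rep_movsb is hypothetical, the inline
assembly uses GCC/Clang syntax, and other targets fall back to std::memcpy.

```cpp
#include <cstddef>
#include <cstring>

// Illustrative only: copy n bytes with a single REP MOVSB instruction.
void *copy_rep_movsb(void *dst, const void *src, std::size_t n)
{
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    void *d = dst;
    // REP MOVSB copies RCX bytes from [RSI] to [RDI] and advances all three
    // registers, hence the read-write ("+") constraints. The x86-64 ABI
    // guarantees the direction flag is clear, so the copy runs forwards.
    asm volatile("rep movsb"
                 : "+D"(d), "+S"(src), "+c"(n)
                 :
                 : "memory");
    return dst;
#else
    // Fallback for other targets, so the sketch still compiles and runs.
    return std::memcpy(dst, src, n);
#endif
}
```

The signature mirrors memcpy so that it could be dropped into a timing loop in
place of the library call.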

Interestingly, I don't see this regression on a similarly-configured Zen 3
system, where memcpy and memcpy_rep_movsb seem to have roughly the same
performance on large copies. This is in spite of the comment at
https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86/dl-cacheinfo.h;h=87486054f931e52f53123c672217f1903297ec76;hb=HEAD#l1031
claiming that Zen 3's REP MOVSB performs poorly on large copies.

System information: Epyc 9374F processor, Ubuntu 22.04, glibc compiled from git
glibc-2.38.9000-185-g2aa0974d25

-- 
You are receiving this mail because:
You are on the CC list for the bug.


* [Bug string/30995] Zen 4: sub-optimal memcpy on very large copies
  2023-10-24  7:38 [Bug string/30995] New: Zen 4: sub-optimal memcpy on very large copies bmerry at sarao dot ac.za
@ 2023-10-24 17:56 ` sam at gentoo dot org
  2023-10-25 10:16 ` bmerry at sarao dot ac.za
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: sam at gentoo dot org @ 2023-10-24 17:56 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=30995

Sam James <sam at gentoo dot org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |sam at gentoo dot org


* [Bug string/30995] Zen 4: sub-optimal memcpy on very large copies
  2023-10-24  7:38 [Bug string/30995] New: Zen 4: sub-optimal memcpy on very large copies bmerry at sarao dot ac.za
  2023-10-24 17:56 ` [Bug string/30995] " sam at gentoo dot org
@ 2023-10-25 10:16 ` bmerry at sarao dot ac.za
  2023-10-25 12:50 ` fweimer at redhat dot com
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: bmerry at sarao dot ac.za @ 2023-10-25 10:16 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=30995

--- Comment #1 from Bruce Merry <bmerry at sarao dot ac.za> ---
I've only got access to the Zen 4 systems until the end of the week, so if
there are any diagnostics that would be useful to capture, let me know ASAP.
There is some further information attached to #30994.


* [Bug string/30995] Zen 4: sub-optimal memcpy on very large copies
  2023-10-24  7:38 [Bug string/30995] New: Zen 4: sub-optimal memcpy on very large copies bmerry at sarao dot ac.za
  2023-10-24 17:56 ` [Bug string/30995] " sam at gentoo dot org
  2023-10-25 10:16 ` bmerry at sarao dot ac.za
@ 2023-10-25 12:50 ` fweimer at redhat dot com
  2023-10-25 13:21 ` bmerry at sarao dot ac.za
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: fweimer at redhat dot com @ 2023-10-25 12:50 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=30995

Florian Weimer <fweimer at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |fweimer at redhat dot com

--- Comment #2 from Florian Weimer <fweimer at redhat dot com> ---
This is likely a trade-off between whole-system performance and
single-thread/single-process performance. To avoid evicting higher-level,
shared caches, it is usually beneficial to switch from REP MOVSB to
non-temporal stores at a certain point even though it impacts single-thread
performance.


* [Bug string/30995] Zen 4: sub-optimal memcpy on very large copies
  2023-10-24  7:38 [Bug string/30995] New: Zen 4: sub-optimal memcpy on very large copies bmerry at sarao dot ac.za
                   ` (2 preceding siblings ...)
  2023-10-25 12:50 ` fweimer at redhat dot com
@ 2023-10-25 13:21 ` bmerry at sarao dot ac.za
  2023-10-30 12:30 ` adhemerval.zanella at linaro dot org
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: bmerry at sarao dot ac.za @ 2023-10-25 13:21 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=30995

--- Comment #3 from Bruce Merry <bmerry at sarao dot ac.za> ---
> To avoid evicting higher-level, shared caches, it is usually beneficial to switch from REP MOVSB to non-temporal stores at a certain point even though it impacts single-thread performance.

From what I can tell, REP MOVSB on Zen 4 already does this for large copies. I
base that on reading the DRAM bandwidth counters with AMD uProf while running
the copy benchmark. When copying 1GB with a single REP MOVSB, the read and
write counters match the rate of data transfer (no read-for-ownership
overhead). When breaking the copy into smaller pieces (less than 32MB), the
read counter is roughly double the transfer rate due to read-for-ownership.

I've tried running the benchmark on all 32 cores of the CPU; in this case
glibc's memcpy is about 5% faster than using REP MOVSB (and my simple AVX-512
streaming copy with a linear access pattern gets pretty much the same
performance as REP MOVSB). So you're correct that there is a trade-off, but
being 5% faster when bandwidth-limited while being 30% slower on a single core
(and using more space in icache) doesn't seem like a great trade-off. I
appreciate that writing a memcpy that works well across a wide range of
hardware is no easy task, though.
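
For reference, a streaming copy with a linear access pattern can be sketched as
below. This is not the benchmark's memcpy_stream_avx512: to stay runnable on
any x86-64 machine it swaps in 16-byte SSE2 non-temporal stores for the 64-byte
AVX-512 ones, and the name copy_stream is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#if defined(__SSE2__)
#include <immintrin.h>
#endif

// Illustrative only: one linear pass of non-temporal stores (SSE2 stand-in
// for the AVX-512 version discussed in the thread).
void *copy_stream(void *dst, const void *src, std::size_t n)
{
#if defined(__SSE2__)
    auto *d = static_cast<unsigned char *>(dst);
    auto *s = static_cast<const unsigned char *>(src);
    // Copy a short head normally so that d becomes 16-byte aligned, which
    // _mm_stream_si128 requires.
    std::size_t head = (16 - reinterpret_cast<std::uintptr_t>(d) % 16) % 16;
    if (head > n)
        head = n;
    std::memcpy(d, s, head);
    d += head;
    s += head;
    n -= head;
    // Unaligned loads, aligned non-temporal stores, strictly ascending.
    for (; n >= 16; d += 16, s += 16, n -= 16)
    {
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i *>(s));
        _mm_stream_si128(reinterpret_cast<__m128i *>(d), v);
    }
    _mm_sfence();         // order the streaming stores before later stores
    std::memcpy(d, s, n); // remaining tail (fewer than 16 bytes)
    return dst;
#else
    // Fallback for targets without SSE2.
    return std::memcpy(dst, src, n);
#endif
}
```

Non-temporal stores bypass the cache hierarchy, which is why such a loop avoids
both cache pollution and read-for-ownership traffic on large copies.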


* [Bug string/30995] Zen 4: sub-optimal memcpy on very large copies
  2023-10-24  7:38 [Bug string/30995] New: Zen 4: sub-optimal memcpy on very large copies bmerry at sarao dot ac.za
                   ` (3 preceding siblings ...)
  2023-10-25 13:21 ` bmerry at sarao dot ac.za
@ 2023-10-30 12:30 ` adhemerval.zanella at linaro dot org
  2023-10-30 12:34 ` adhemerval.zanella at linaro dot org
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: adhemerval.zanella at linaro dot org @ 2023-10-30 12:30 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=30995

Adhemerval Zanella <adhemerval.zanella at linaro dot org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |adhemerval.zanella at linaro dot org

--- Comment #4 from Adhemerval Zanella <adhemerval.zanella at linaro dot org> ---
On Zen3 I can confirm that REP MOVSB is not faster than the vectorized path,
but with an unaligned destination the results are also subpar:

# Default non-temporal stores
$ ./memcpy_loop -f memcpy -D 1
4.19552

# GLIBC_TUNABLES=glibc.cpu.x86_non_temporal_threshold=134217730
$ ./memcpy_loop -f memcpy -D 1
11.7379

# Modified glibc with tunables to force REP MOVSB
$ ./memcpy_loop -f memcpy -D 1
1.01945

With aligned stores I see ~20 GB/s on Zen3.  I am even more convinced that REP
MOVSB is not really a good strategy for Zen3.

I still think it would be better to avoid non-temporal stores for unaligned
inputs on Zen3.  Another possibility would be to avoid unaligned stores, but
that would require adding another code path that might not be optimal for all
x86 CPUs.


* [Bug string/30995] Zen 4: sub-optimal memcpy on very large copies
  2023-10-24  7:38 [Bug string/30995] New: Zen 4: sub-optimal memcpy on very large copies bmerry at sarao dot ac.za
                   ` (4 preceding siblings ...)
  2023-10-30 12:30 ` adhemerval.zanella at linaro dot org
@ 2023-10-30 12:34 ` adhemerval.zanella at linaro dot org
  2023-10-30 14:00 ` bmerry at sarao dot ac.za
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: adhemerval.zanella at linaro dot org @ 2023-10-30 12:34 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=30995

--- Comment #5 from Adhemerval Zanella <adhemerval.zanella at linaro dot org> ---
As a side note, the benchmark you are referring to does not have some of the
options you are using (-r, memcpy_rep_movsb).


* [Bug string/30995] Zen 4: sub-optimal memcpy on very large copies
  2023-10-24  7:38 [Bug string/30995] New: Zen 4: sub-optimal memcpy on very large copies bmerry at sarao dot ac.za
                   ` (5 preceding siblings ...)
  2023-10-30 12:34 ` adhemerval.zanella at linaro dot org
@ 2023-10-30 14:00 ` bmerry at sarao dot ac.za
  2023-10-30 14:24 ` bmerry at sarao dot ac.za
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: bmerry at sarao dot ac.za @ 2023-10-30 14:00 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=30995

--- Comment #6 from Bruce Merry <bmerry at sarao dot ac.za> ---
> As a side note, the benchmark you are referring to does not have some of the options you are using (-r, memcpy_rep_movsb).

Oh, I linked to a fixed commit and those features have since been added to
main: https://github.com/ska-sa/katgpucbf/blob/main/scratch/memcpy_loop.cpp


* [Bug string/30995] Zen 4: sub-optimal memcpy on very large copies
  2023-10-24  7:38 [Bug string/30995] New: Zen 4: sub-optimal memcpy on very large copies bmerry at sarao dot ac.za
                   ` (6 preceding siblings ...)
  2023-10-30 14:00 ` bmerry at sarao dot ac.za
@ 2023-10-30 14:24 ` bmerry at sarao dot ac.za
  2023-10-30 16:17 ` adhemerval.zanella at linaro dot org
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: bmerry at sarao dot ac.za @ 2023-10-30 14:24 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=30995

--- Comment #7 from Bruce Merry <bmerry at sarao dot ac.za> ---
> I still think it would be better to avoid non-temporal stores for unaligned inputs on Zen3.

As noted in comment 2, non-temporal stores for large copies have benefits that
won't show up in a single-threaded microbenchmark: both less pollution of the
shared cache, and 1/3 less DRAM bandwidth (eliminates read-for-ownership).
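
The "1/3 less" figure follows from a simple traffic count, sketched below under
stated assumptions: nothing is served from cache, and every destination line
written with ordinary stores is first read from DRAM (read-for-ownership). The
names DramTraffic and copy_traffic are hypothetical.

```cpp
#include <cstddef>

// Bytes moved to and from DRAM for one copy, under the simplified model
// described above (an assumption, not a measurement).
struct DramTraffic
{
    std::size_t read_bytes;
    std::size_t write_bytes;
};

// Ordinary stores: read source (n) + read-for-ownership of destination (n)
// + write destination (n) = 3n in total.
// Non-temporal stores: read source (n) + write destination (n) = 2n.
constexpr DramTraffic copy_traffic(std::size_t n, bool non_temporal)
{
    return DramTraffic{non_temporal ? n : 2 * n, n};
}
```

Going from 3n to 2n bytes is the one-third saving, and the halved read_bytes
matches the read counter being roughly double the transfer rate when
read-for-ownership occurs, as reported in comment 3.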


* [Bug string/30995] Zen 4: sub-optimal memcpy on very large copies
  2023-10-24  7:38 [Bug string/30995] New: Zen 4: sub-optimal memcpy on very large copies bmerry at sarao dot ac.za
                   ` (7 preceding siblings ...)
  2023-10-30 14:24 ` bmerry at sarao dot ac.za
@ 2023-10-30 16:17 ` adhemerval.zanella at linaro dot org
  2023-11-07 13:01 ` jamborm at gcc dot gnu.org
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: adhemerval.zanella at linaro dot org @ 2023-10-30 16:17 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=30995

--- Comment #8 from Adhemerval Zanella <adhemerval.zanella at linaro dot org> ---
(In reply to Bruce Merry from comment #7)
> > I still think it would be better to avoid non-temporal stores for unaligned inputs on Zen3.
> 
> As noted in comment 2, non-temporal stores for large copies have benefits
> that won't show up in a single-threaded microbenchmark: both less pollution
> of the shared cache, and 1/3 less DRAM bandwidth (eliminates
> read-for-ownership).

Indeed, after some tests, the performance difference when multiple issuers are
involved does seem to show the advantage of non-temporal stores, even for
unaligned arguments.


* [Bug string/30995] Zen 4: sub-optimal memcpy on very large copies
  2023-10-24  7:38 [Bug string/30995] New: Zen 4: sub-optimal memcpy on very large copies bmerry at sarao dot ac.za
                   ` (8 preceding siblings ...)
  2023-10-30 16:17 ` adhemerval.zanella at linaro dot org
@ 2023-11-07 13:01 ` jamborm at gcc dot gnu.org
  2023-11-29 17:27 ` gabravier at gmail dot com
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: jamborm at gcc dot gnu.org @ 2023-11-07 13:01 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=30995

Martin Jambor <jamborm at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jamborm at gcc dot gnu.org


* [Bug string/30995] Zen 4: sub-optimal memcpy on very large copies
  2023-10-24  7:38 [Bug string/30995] New: Zen 4: sub-optimal memcpy on very large copies bmerry at sarao dot ac.za
                   ` (9 preceding siblings ...)
  2023-11-07 13:01 ` jamborm at gcc dot gnu.org
@ 2023-11-29 17:27 ` gabravier at gmail dot com
  2023-11-29 17:30 ` sam at gentoo dot org
  2023-11-29 21:08 ` pageexec at gmail dot com
  12 siblings, 0 replies; 14+ messages in thread
From: gabravier at gmail dot com @ 2023-11-29 17:27 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=30995

Gabriel Ravier <gabravier at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |gabravier at gmail dot com


* [Bug string/30995] Zen 4: sub-optimal memcpy on very large copies
  2023-10-24  7:38 [Bug string/30995] New: Zen 4: sub-optimal memcpy on very large copies bmerry at sarao dot ac.za
                   ` (10 preceding siblings ...)
  2023-11-29 17:27 ` gabravier at gmail dot com
@ 2023-11-29 17:30 ` sam at gentoo dot org
  2023-11-29 21:08 ` pageexec at gmail dot com
  12 siblings, 0 replies; 14+ messages in thread
From: sam at gentoo dot org @ 2023-11-29 17:30 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=30995

Sam James <sam at gentoo dot org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           See Also|                            |https://sourceware.org/bugzilla/show_bug.cgi?id=30994


* [Bug string/30995] Zen 4: sub-optimal memcpy on very large copies
  2023-10-24  7:38 [Bug string/30995] New: Zen 4: sub-optimal memcpy on very large copies bmerry at sarao dot ac.za
                   ` (11 preceding siblings ...)
  2023-11-29 17:30 ` sam at gentoo dot org
@ 2023-11-29 21:08 ` pageexec at gmail dot com
  12 siblings, 0 replies; 14+ messages in thread
From: pageexec at gmail dot com @ 2023-11-29 21:08 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=30995

PaX Team <pageexec at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |pageexec at gmail dot com


Thread overview: 14+ messages
2023-10-24  7:38 [Bug string/30995] New: Zen 4: sub-optimal memcpy on very large copies bmerry at sarao dot ac.za
2023-10-24 17:56 ` [Bug string/30995] " sam at gentoo dot org
2023-10-25 10:16 ` bmerry at sarao dot ac.za
2023-10-25 12:50 ` fweimer at redhat dot com
2023-10-25 13:21 ` bmerry at sarao dot ac.za
2023-10-30 12:30 ` adhemerval.zanella at linaro dot org
2023-10-30 12:34 ` adhemerval.zanella at linaro dot org
2023-10-30 14:00 ` bmerry at sarao dot ac.za
2023-10-30 14:24 ` bmerry at sarao dot ac.za
2023-10-30 16:17 ` adhemerval.zanella at linaro dot org
2023-11-07 13:01 ` jamborm at gcc dot gnu.org
2023-11-29 17:27 ` gabravier at gmail dot com
2023-11-29 17:30 ` sam at gentoo dot org
2023-11-29 21:08 ` pageexec at gmail dot com
