public inbox for glibc-bugs@sourceware.org
From: "adhemerval.zanella at linaro dot org" <sourceware-bugzilla@sourceware.org>
To: glibc-bugs@sourceware.org
Subject: [Bug string/30994] REP MOVSB performance suffers from page aliasing on Zen 4
Date: Fri, 27 Oct 2023 12:39:09 +0000	[thread overview]
Message-ID: <bug-30994-131-2AunmDusWq@http.sourceware.org/bugzilla/> (raw)
In-Reply-To: <bug-30994-131@http.sourceware.org/bugzilla/>

https://sourceware.org/bugzilla/show_bug.cgi?id=30994

Adhemerval Zanella <adhemerval.zanella at linaro dot org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |adhemerval.zanella at linaro dot org

--- Comment #6 from Adhemerval Zanella <adhemerval.zanella at linaro dot org> ---
I have access to a Zen3 core (5900X) and I can confirm that REP MOVSB always
seems to be worse than vector instructions.  ERMS is used for sizes between
2112 (rep_movsb_threshold) and 524288 (rep_movsb_stop_threshold, the L2 size
on Zen3), and the '-S 0 -D 1' performance drop really seems to be a microcode
issue, since I don't see a similar performance difference with other
alignments.

On Zen3 with REP MOVSB I see:

$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 0
84.2448 GB/s

$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 0 `seq -s' ' 0 2 23`
506.099 GB/s

$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 0 `seq -s' ' 0 23`
990.845 GB/s


$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 0
57.1122 GB/s

$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 0 `seq -s' ' 0 2 23`
325.409 GB/s

$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 0 `seq -s' ' 0 23`
510.87 GB/s


$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 15
4.43104 GB/s

$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 15 `seq -s' ' 0 2 23`
22.4551 GB/s

$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 15 `seq -s' ' 0 23`
40.4088 GB/s


$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 15
4.34671 GB/s

$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 15 `seq -s' ' 0 2 23`
22.0829 GB/s

$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 15 `seq -s' ' 0 23`

While with vectorized instructions I see:


$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 0
124.183 GB/s

$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 0 `seq -s' ' 0 2 23`
773.696 GB/s

$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 0 `seq -s' ' 0 23`
1413.02 GB/s


$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 0
58.3212 GB/s

$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 0 `seq -s' ' 0 2 23`
322.583 GB/s

$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 0 `seq -s' ' 0 23`
506.116 GB/s

$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 15
121.872 GB/s

$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 15 `seq -s' ' 0 2 23`
717.717 GB/s

$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 15 `seq -s' ' 0 23`
1318.17 GB/s

$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 15
58.5352 GB/s

$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 15 `seq -s' ' 0 2 23`
325.996 GB/s

$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 15 `seq -s' ' 0 23`
498.552 GB/s

So it seems there is no gain in using REP MOVSB on Zen3/Zen4, especially in
the size range where it was supposed to be better.  glibc 2.34 added a fix
from AMD (6e02b3e9327b7dbb063958d2b124b64fcb4bbe3f), based on the assumption
that ERMS performs poorly on data above the L2 cache size, so REP MOVSB is
limited to the L2 cache size (sizes from 2112 to 524287); but I don't think
the AMD engineers really evaluated whether ERMS is indeed faster than
vectorized instructions.

And I think BZ#30995 is the same issue, since __memcpy_avx512_unaligned_erms
uses the same tunable to decide whether to use ERMS.  I have created a patch
that simply disables ERMS usage on AMD cores [1]; can you check whether it
improves performance on Zen4 as well?

Also, I have noticed that memset shows similarly subpar performance with
ERMS, so I also disabled it on my branch.

[1]
https://sourceware.org/git/?p=glibc.git;a=shortlog;h=refs/heads/azanella/bz30944-memcpy-zen


Thread overview: 28+ messages
2023-10-24  6:18 [Bug string/30994] New: " bmerry at sarao dot ac.za
2023-10-24  6:19 ` [Bug string/30994] " bmerry at sarao dot ac.za
2023-10-24  6:20 ` bmerry at sarao dot ac.za
2023-10-24  6:21 ` bmerry at sarao dot ac.za
2023-10-24  6:21 ` bmerry at sarao dot ac.za
2023-10-24  6:32 ` bmerry at sarao dot ac.za
2023-10-24 17:57 ` sam at gentoo dot org
2023-10-25 12:40 ` fweimer at redhat dot com
2023-10-25 13:37 ` bmerry at sarao dot ac.za
2023-10-27 12:39 ` adhemerval.zanella at linaro dot org [this message]
2023-10-27 13:04 ` bmerry at sarao dot ac.za
2023-10-27 13:16 ` bmerry at sarao dot ac.za
2023-10-30  8:21 ` bmerry at sarao dot ac.za
2023-10-30 13:30 ` adhemerval.zanella at linaro dot org
2023-10-30 14:21 ` bmerry at sarao dot ac.za
2023-10-30 16:27 ` adhemerval.zanella at linaro dot org
2023-11-07 15:44 ` jamborm at gcc dot gnu.org
2023-11-29  3:08 ` lilydjwg at gmail dot com
2023-11-29 13:01 ` holger@applied-asynchrony.com
2023-11-29 15:57 ` jrmuizel at gmail dot com
2023-11-29 17:25 ` gabravier at gmail dot com
2023-11-29 17:30 ` sam at gentoo dot org
2023-11-29 19:58 ` matti.niemenmaa+sourcesbugs at iki dot fi
2023-11-29 21:08 ` pageexec at gmail dot com
2023-11-30  3:13 ` dushistov at mail dot ru
2023-12-08  8:32 ` mati865 at gmail dot com
2024-02-13 16:54 ` cvs-commit at gcc dot gnu.org
2024-04-04 10:36 ` cvs-commit at gcc dot gnu.org
