public inbox for glibc-bugs@sourceware.org
From: "adhemerval.zanella at linaro dot org" <sourceware-bugzilla@sourceware.org>
To: glibc-bugs@sourceware.org
Subject: [Bug string/30994] REP MOVSB performance suffers from page aliasing on Zen 4
Date: Fri, 27 Oct 2023 12:39:09 +0000
Message-ID: <bug-30994-131-2AunmDusWq@http.sourceware.org/bugzilla/>
In-Reply-To: <bug-30994-131@http.sourceware.org/bugzilla/>

https://sourceware.org/bugzilla/show_bug.cgi?id=30994

Adhemerval Zanella <adhemerval.zanella at linaro dot org> changed:

           What    |Removed |Added
----------------------------------------------------------------------------
                CC|        |adhemerval.zanella at linaro dot org

--- Comment #6 from Adhemerval Zanella <adhemerval.zanella at linaro dot org> ---
I have access to a Zen3 core (a 5900X) and I can confirm that using REP MOVSB
seems to be consistently worse than vector instructions.  ERMS is used for
sizes between 2112 (rep_movsb_threshold) and 524288 (rep_movsb_stop_threshold,
the L2 cache size on Zen3), and the '-S 0 -D 1' performance drop really seems
to be a microcode issue, since I don't see a similar performance difference
with other alignments.
On Zen3 with REP MOVSB I see:

$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 0
84.2448 GB/s
$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 0 `seq -s' ' 0 2 23`
506.099 GB/s
$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 0 `seq -s' ' 0 23`
990.845 GB/s
$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 0
57.1122 GB/s
$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 0 `seq -s' ' 0 2 23`
325.409 GB/s
$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 0 `seq -s' ' 0 23`
510.87 GB/s
$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 15
4.43104 GB/s
$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 15 `seq -s' ' 0 2 23`
22.4551 GB/s
$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 15 `seq -s' ' 0 23`
40.4088 GB/s
$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 15
4.34671 GB/s
$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 15 `seq -s' ' 0 2 23`
22.0829 GB/s
$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 15 `seq -s' ' 0 23`

While with vectorized instructions I see:

$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 0
124.183 GB/s
$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 0 `seq -s' ' 0 2 23`
773.696 GB/s
$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 0 `seq -s' ' 0 23`
1413.02 GB/s
$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 0
58.3212 GB/s
$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 0 `seq -s' ' 0 2 23`
322.583 GB/s
$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 0 `seq -s' ' 0 23`
506.116 GB/s
$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 15
121.872 GB/s
$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 15 `seq -s' ' 0 2 23`
717.717 GB/s
$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 15 `seq -s' ' 0 23`
1318.17 GB/s
$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 15
58.5352 GB/s
$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 15 `seq -s' ' 0 2 23`
325.996 GB/s
$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 15 `seq -s' ' 0 23`
498.552 GB/s

So it seems there is no gain in using REP MOVSB on Zen3/Zen4, especially on
the sizes where it was supposed to be better.  glibc 2.34 added a fix from
AMD (6e02b3e9327b7dbb063958d2b124b64fcb4bbe3f), where the assumption is that
ERMS performs poorly on data above the L2 cache size, so REP MOVSB is limited
to the L2 cache size (from 2113 to 524287); but I don't think the AMD
engineers really evaluated whether ERMS is indeed better than vectorized
instructions.  And I think BZ#30995 is the same issue, since
__memcpy_avx512_unaligned_erms uses the same tunable to decide whether to use
ERMS.  I have created a patch that just disables ERMS usage on AMD cores [1];
can you check if it improves performance on Zen4 as well?  Also, I have
noticed that memset also shows subpar performance with ERMS, and I have
disabled it on my branch as well.
[1] https://sourceware.org/git/?p=glibc.git;a=shortlog;h=refs/heads/azanella/bz30944-memcpy-zen

-- 
You are receiving this mail because:
You are on the CC list for the bug.
Thread overview: 28+ messages

2023-10-24  6:18 [Bug string/30994] New: REP MOVSB performance suffers from page aliasing on Zen 4 bmerry at sarao dot ac.za
2023-10-24  6:19 ` [Bug string/30994] REP MOVSB performance suffers from page aliasing on Zen 4 bmerry at sarao dot ac.za
2023-10-24  6:20 ` bmerry at sarao dot ac.za
2023-10-24  6:21 ` bmerry at sarao dot ac.za
2023-10-24  6:21 ` bmerry at sarao dot ac.za
2023-10-24  6:32 ` bmerry at sarao dot ac.za
2023-10-24 17:57 ` sam at gentoo dot org
2023-10-25 12:40 ` fweimer at redhat dot com
2023-10-25 13:37 ` bmerry at sarao dot ac.za
2023-10-27 12:39 ` adhemerval.zanella at linaro dot org [this message]
2023-10-27 13:04 ` bmerry at sarao dot ac.za
2023-10-27 13:16 ` bmerry at sarao dot ac.za
2023-10-30  8:21 ` bmerry at sarao dot ac.za
2023-10-30 13:30 ` adhemerval.zanella at linaro dot org
2023-10-30 14:21 ` bmerry at sarao dot ac.za
2023-10-30 16:27 ` adhemerval.zanella at linaro dot org
2023-11-07 15:44 ` jamborm at gcc dot gnu.org
2023-11-29  3:08 ` lilydjwg at gmail dot com
2023-11-29 13:01 ` holger@applied-asynchrony.com
2023-11-29 15:57 ` jrmuizel at gmail dot com
2023-11-29 17:25 ` gabravier at gmail dot com
2023-11-29 17:30 ` sam at gentoo dot org
2023-11-29 19:58 ` matti.niemenmaa+sourcesbugs at iki dot fi
2023-11-29 21:08 ` pageexec at gmail dot com
2023-11-30  3:13 ` dushistov at mail dot ru
2023-12-08  8:32 ` mati865 at gmail dot com
2024-02-13 16:54 ` cvs-commit at gcc dot gnu.org
2024-04-04 10:36 ` cvs-commit at gcc dot gnu.org