public inbox for gcc-bugs@sourceware.org
From: "already5chosen at yahoo dot com" <gcc-bugzilla@gcc.gnu.org> To: gcc-bugs@gcc.gnu.org Subject: [Bug tree-optimization/97832] AoSoA complex caxpy-like loops: AVX2+FMA -Ofast 7 times slower than -O3 Date: Sat, 26 Nov 2022 22:00:40 +0000 [thread overview] Message-ID: <bug-97832-4-Y36DPROdgA@http.gcc.gnu.org/bugzilla/> (raw) In-Reply-To: <bug-97832-4@http.gcc.gnu.org/bugzilla/> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832 --- Comment #22 from Michael_S <already5chosen at yahoo dot com> --- (In reply to Alexander Monakov from comment #21) > (In reply to Michael_S from comment #19) > > > Also note that 'vfnmadd231pd 32(%rdx,%rax), %ymm3, %ymm0' would be > > > 'unlaminated' (turned to 2 uops before renaming), so selecting independent > > > IVs for the two arrays actually helps on this testcase. > > > > Both 'vfnmadd231pd 32(%rdx,%rax), %ymm3, %ymm0' and 'vfnmadd231pd 32(%rdx), > > %ymm3, %ymm0' would be turned into 2 uops. > > The difference is at which point in the pipeline. The latter goes through > renaming as one fused uop. > Intel never documents such fine details in their Optimization Reference manuals. But I believe you. > > Misuse of load+op is far bigger problem in this particular test case than > > sub-optimal loop overhead. Assuming execution on Intel Skylake, it turns > > loop that can potentially run at 3 clocks per iteration into loop of 4+ > > clocks per iteration. > > Sorry, which assembler output this refers to? > gcc12 -O3 -mavx2 -mfma gcc12 -O3 -march=skylake does not suffer from this problem. I still think that RISC-style icc code will be a little faster on Skylake, but here we are arguing about 1/4th of the cycle per iteration rather than a full cycle. https://godbolt.org/z/nfa7c9se3 > > But I consider it a separate issue. I reported similar issue in 97127, but > > here it is more serious. It looks to me that the issue is not soluble within > > existing gcc optimization framework. 
> > The only chance is if you accept my old and simple advice - within
> > inner loops, pretend that AVX is RISC, i.e. generate code as if the
> > load-op forms of AVX instructions did not exist.
>
> In bug 97127 the best explanation we have so far is that we don't
> optimally handle the case where non-memory inputs of an fma are reused,
> so we can't combine a load with an fma without causing an extra register
> copy (PR 97127 comment 16 demonstrates what I mean). I cannot imagine
> such trouble arising with more common commutative operations like
> mul/add, especially with the non-destructive VEX encoding. If you hit
> such examples, I would suggest reporting them as well, because their
> root cause might be different.
>
> In general, load-op combining should be very helpful on x86, because it
> reduces the number of uops flowing through the renaming stage, which is
> one of the narrowest points in the pipeline.

If compilers were perfect, AVX load-op combining would be somewhat
helpful; I have my doubts about "very helpful". But compilers are not
perfect. For the non-AVX case, where every op is destructive and repeated
loads are on average cheaper than on AVX, combining load-ops is far more
profitable.
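[Editor's note: for readers without access to the godbolt link, a kernel of
the kind the subject line describes looks roughly like the sketch below.
This is illustrative only, not the exact testcase from the PR; the function
name, the AoSoA block width of 4 doubles, and the argument order are
assumptions.]

```c
#include <stddef.h>

/* Hypothetical AoSoA caxpy-like kernel: complex numbers stored as blocks
   of 4 real parts followed by 4 imaginary parts, computing
   y[] -= a * x[] for a complex scalar a = (ar, ai).  With
   -O3 -mavx2 -mfma each block maps onto a few 256-bit FMAs, which is
   where the load-op vs. separate-load code generation question arises. */
void caxpy_aosoa(size_t nblocks, double ar, double ai,
                 const double *restrict x, double *restrict y)
{
    for (size_t b = 0; b < nblocks; ++b) {
        const double *xr = &x[b * 8], *xi = xr + 4;
        double *yr = &y[b * 8], *yi = yr + 4;
        for (int k = 0; k < 4; ++k) {
            yr[k] -= ar * xr[k] - ai * xi[k]; /* real part of a*x */
            yi[k] -= ar * xi[k] + ai * xr[k]; /* imaginary part of a*x */
        }
    }
}
```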
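[Editor's note: the addressing-mode point above - an FMA with an indexed
memory operand like 32(%rdx,%rax) is unlaminated into 2 uops before
renaming on Skylake, while a base+displacement operand like 32(%rdx) stays
micro-fused - corresponds roughly to the choice of induction variables
sketched below. Identifiers are mine, not from the PR, and which form gcc
actually emits depends on its IV selection, not on the C spelling.]

```c
#include <stddef.h>

/* Two equivalent formulations of y[i] -= a * x[i].  Indexing both arrays
   off one induction variable tends to yield indexed memory operands
   (base + index*scale + disp); bumping two independent pointers allows
   base+displacement operands, which can stay micro-fused with the FMA
   through the renaming stage. */
void axpy_indexed(size_t n, double a,
                  const double *restrict x, double *restrict y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] -= a * x[i];        /* one IV, two indexed accesses */
}

void axpy_pointers(size_t n, double a,
                   const double *restrict x, double *restrict y)
{
    for (; n; --n)
        *y++ -= a * *x++;        /* independent pointer IVs */
}
```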
Thread overview: 28+ messages
  2020-11-14 20:44 [Bug target/97832] New: AoSoA complex caxpy-like loops: AVX2+FMA -Ofast 7 times slower than -O3 already5chosen at yahoo dot com
  2020-11-16  7:21 ` [Bug target/97832] " rguenth at gcc dot gnu.org
  2020-11-16 11:11 ` rguenth at gcc dot gnu.org
  2020-11-16 20:11 ` already5chosen at yahoo dot com
  2020-11-17  9:21 ` [Bug tree-optimization/97832] " rguenth at gcc dot gnu.org
  2020-11-17 10:18 ` rguenth at gcc dot gnu.org
  2020-11-18  8:53 ` rguenth at gcc dot gnu.org
  2020-11-18  9:15 ` rguenth at gcc dot gnu.org
  2020-11-18 13:23 ` rguenth at gcc dot gnu.org
  2020-11-18 13:39 ` rguenth at gcc dot gnu.org
  2020-11-19 19:55 ` already5chosen at yahoo dot com
  2020-11-20  7:10 ` rguenth at gcc dot gnu.org
  2021-06-09 12:41 ` cvs-commit at gcc dot gnu.org
  2021-06-09 12:54 ` rguenth at gcc dot gnu.org
  2022-01-21  0:16 ` pinskia at gcc dot gnu.org
  2022-11-24 23:22 ` already5chosen at yahoo dot com
  2022-11-25  8:16 ` rguenth at gcc dot gnu.org
  2022-11-25 13:19 ` already5chosen at yahoo dot com
  2022-11-25 20:46 ` rguenth at gcc dot gnu.org
  2022-11-25 21:27 ` amonakov at gcc dot gnu.org
  2022-11-26 18:27 ` already5chosen at yahoo dot com
  2022-11-26 18:36 ` already5chosen at yahoo dot com
  2022-11-26 19:36 ` amonakov at gcc dot gnu.org
  2022-11-26 22:00 ` already5chosen at yahoo dot com [this message]
  2022-11-28  6:29 ` crazylht at gmail dot com
  2022-11-28  6:42 ` crazylht at gmail dot com
  2022-11-28  7:21 ` rguenther at suse dot de
  2022-11-28  7:24 ` crazylht at gmail dot com