public inbox for gcc-bugs@sourceware.org
* [Bug target/97127] New: FMA3 code transformation leads to slowdown on Skylake
@ 2020-09-20 20:36 already5chosen at yahoo dot com
  2020-09-21  6:54 ` [Bug target/97127] " rguenth at gcc dot gnu.org
                   ` (17 more replies)
  0 siblings, 18 replies; 19+ messages in thread
From: already5chosen at yahoo dot com @ 2020-09-20 20:36 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97127

            Bug ID: 97127
           Summary: FMA3 code transformation leads to slowdown on Skylake
           Product: gcc
           Version: 10.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: already5chosen at yahoo dot com
  Target Milestone: ---

The following clever gcc transformation generates slower code than the
untransformed original:
 a = *mem;
 a = a + b * c;
where both b and c are reused further down, is transformed to:
 a = b;
 a = *mem + a * c;

Or, expressed in asm terms:
 vmovuxx      (mem), %ymmA
 vfnmadd231xx %ymmB, %ymmC, %ymmA
transformed to
 vmovaxx      %ymmB, %ymmA
 vfnmadd213xx (mem), %ymmC, %ymmA
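
For readers who prefer source over asm, the two shapes can be sketched in plain
C as below. This is only an illustration of the data flow, not code from
chol.cpp: the function names update_original and update_transformed are
hypothetical, and double stands in for the vector type. The minus sign follows
the vfnmadd mnemonics in the listing above, which negate the product. A
compiler is free to emit identical code for both functions, so the sketch does
not by itself reproduce the slowdown.

```c
/* Hypothetical scalar stand-ins for the two instruction sequences. */

/* Original shape: load the accumulator from memory, then update it in
   place (corresponds to the load followed by the 231-form FMA). */
static double update_original(const double *mem, double b, double c)
{
    double a = *mem;
    a = a - b * c;
    return a;
}

/* Transformed shape: copy b into the destination first, then fold the
   load into the arithmetic (corresponds to the register copy followed
   by the 213-form FMA with a memory operand). */
static double update_transformed(const double *mem, double b, double c)
{
    double a = b;
    a = *mem - a * c;
    return a;
}
```

Both functions compute the same value; the only difference is whether the
memory operand is loaded into the destination up front or consumed directly by
the multiply-add.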

You may ask "Why is the transformed variant slower?" and I can try my best to
answer (my guess is that the performance bottleneck is in the rename stage
rather than in the execution stage, and the transformed code occupies 3 rename
slots vs. 2 for the original), but that would be mostly pointless. What matters
is that on Skylake the transformed variant is slower, and I can prove it with a
benchmark. BTW, the same holds on Haswell.

You can see a comparison of the two variants at
https://github.com/already5chosen/others/tree/master/cholesky_solver/gcc-badopt-fma3
The interesting spot starts at line 367 in file chol.cpp,
or two lines below .L21: in the asm generated by gcc 10.2.0 (chol_a.s).
Run 's_chol_a 100' vs 's_chol_b 100' and see the difference in favor of the
second (de-transformed) variant.
The difference, in this particular case, is small, on the order of 2-4 percent,
but very consistent.
In tighter loops I would expect a bigger difference.


end of thread, other threads:[~2020-09-30 12:09 UTC | newest]

Thread overview: 19+ messages
2020-09-20 20:36 [Bug target/97127] New: FMA3 code transformation leads to slowdown on Skylake already5chosen at yahoo dot com
2020-09-21  6:54 ` [Bug target/97127] " rguenth at gcc dot gnu.org
2020-09-21 10:35 ` amonakov at gcc dot gnu.org
2020-09-21 13:40 ` already5chosen at yahoo dot com
2020-09-21 15:17 ` amonakov at gcc dot gnu.org
2020-09-22  8:10 ` crazylht at gmail dot com
2020-09-22 10:01 ` already5chosen at yahoo dot com
2020-09-23  1:38 ` crazylht at gmail dot com
2020-09-23 17:49 ` already5chosen at yahoo dot com
2020-09-24  3:23 ` crazylht at gmail dot com
2020-09-24  8:28 ` already5chosen at yahoo dot com
2020-09-24 10:06 ` crazylht at gmail dot com
2020-09-24 10:46 ` crazylht at gmail dot com
2020-09-24 12:38 ` already5chosen at yahoo dot com
2020-09-25  5:24 ` crazylht at gmail dot com
2020-09-25 13:21 ` already5chosen at yahoo dot com
2020-09-25 14:02 ` amonakov at gcc dot gnu.org
2020-09-25 15:55 ` amonakov at gcc dot gnu.org
2020-09-30 12:09 ` rsandifo at gcc dot gnu.org
