public inbox for gcc-bugs@sourceware.org
* [Bug target/97127] New: FMA3 code transformation leads to slowdown on Skylake
@ 2020-09-20 20:36 already5chosen at yahoo dot com
  2020-09-21  6:54 ` [Bug target/97127] " rguenth at gcc dot gnu.org
                   ` (17 more replies)
  0 siblings, 18 replies; 19+ messages in thread
From: already5chosen at yahoo dot com @ 2020-09-20 20:36 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97127

            Bug ID: 97127
           Summary: FMA3 code transformation leads to slowdown on Skylake
           Product: gcc
           Version: 10.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: already5chosen at yahoo dot com
  Target Milestone: ---

The following clever gcc transformation leads to generation of code that is
slower than the non-transformed original:
 a = *mem;
 a = a + b * c;
where both b and c are reused further down, is transformed to:
 a = b;
 a = *mem + a * c;

Or, expressing the same in asm terms
 vmovuxx      (mem), %ymmA
 vfnmadd231xx %ymmB, %ymmC, %ymmA
transformed to
 vmovaxx      %ymmB, %ymmA
 vfnmadd213xx (mem), %ymmC, %ymmA
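
A minimal C-level sketch of the pattern (my own reduction of it, with invented
names; the real code is in chol.cpp in the repository referenced below):

 /* With FMA contraction enabled (e.g. -O2 -mfma), a - b*c becomes an fnmadd.
    Because b and c are needed again after the FMA, their registers cannot be
    reused for the result, and that is where the extra register copy described
    above comes in. */
 double use(double a, double b, double c);   /* keeps b and c live */
 double f(const double *mem, double b, double c)
 {
     double a = *mem;
     a = a - b * c;
     return use(a, b, c);
 }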

You may ask "Why is the transformed variant slower?" and I can try my best to
answer (my guess is that the performance bottleneck is in the rename stage rather
than in the execution stage, and the transformed code occupies 3 rename slots vs 2
rename slots for the original), but it would be mostly pointless. What matters is
that on Skylake the transformed variant is slower, and I can prove it with a
benchmark. BTW, it is slower on Haswell too.

You can see a comparison of the two variants at
https://github.com/already5chosen/others/tree/master/cholesky_solver/gcc-badopt-fma3
The interesting spot starts at line 367 in the file chol.cpp,
or two lines below .L21: in the asm generated by gcc 10.2.0 (chol_a.s).
Run 's_chol_a 100' vs 's_chol_b 100' and see the difference in favor of the
second (de-transformed) variant.
The difference, in this particular case, is small, on the order of 2-4 percent,
but very consistent.
In tighter loops I would expect a bigger difference.


* [Bug target/97127] FMA3 code transformation leads to slowdown on Skylake
  2020-09-20 20:36 [Bug target/97127] New: FMA3 code transformation leads to slowdown on Skylake already5chosen at yahoo dot com
@ 2020-09-21  6:54 ` rguenth at gcc dot gnu.org
  2020-09-21 10:35 ` amonakov at gcc dot gnu.org
                   ` (16 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: rguenth at gcc dot gnu.org @ 2020-09-21  6:54 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97127

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Target|i386,x86-64                 |x86_64-*-* i?86-*-*
           Keywords|                            |missed-optimization

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
GCC does not model CPU pipelines in such detail (not to mention that documentation
on the CPU side is insufficient).  In principle the vmovaxx %ymmB, %ymmA should be
handled at the rename stage and be 'free', and both cases should end up with
the same number of uops.

Did you try throwing Intel IACA at it? (not that it is very reliable)


* [Bug target/97127] FMA3 code transformation leads to slowdown on Skylake
  2020-09-20 20:36 [Bug target/97127] New: FMA3 code transformation leads to slowdown on Skylake already5chosen at yahoo dot com
  2020-09-21  6:54 ` [Bug target/97127] " rguenth at gcc dot gnu.org
@ 2020-09-21 10:35 ` amonakov at gcc dot gnu.org
  2020-09-21 13:40 ` already5chosen at yahoo dot com
                   ` (15 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: amonakov at gcc dot gnu.org @ 2020-09-21 10:35 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97127

Alexander Monakov <amonakov at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amonakov at gcc dot gnu.org

--- Comment #2 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
Richard, though register moves are resolved by renaming, they still occupy a
uop in all stages except execution, and since renaming is one of the narrowest
points in the pipeline (only up to 4 uops/cycle on Intel), reducing the number of
uops generally helps.

In Michael's case the actual memory address has two operands:

<       vmovapd %ymm1, %ymm10
<       vmovapd %ymm1, %ymm11
<       vfnmadd213pd    (%rdx,%rax), %ymm9, %ymm10
<       vfnmadd213pd    (%rcx,%rax), %ymm7, %ymm11
---
> 	vmovupd	(%rdx,%rax), %ymm10
> 	vmovupd	(%rcx,%rax), %ymm11
> 	vfnmadd231pd	%ymm1, %ymm9, %ymm10
> 	vfnmadd231pd	%ymm1, %ymm7, %ymm11

The "uop" that carries operands of vfnmadd213pd gets "unlaminated" before
renaming (because otherwise there would be too many operands to handle). Hence
the original code has 4 uops after decoding, 6 uops before renaming, and the
transformed code has 4 uops before renaming. Execution handles 4 uops in both
cases.
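
For concreteness, a rough per-pair tally implied by the description above (my own
arithmetic, not a quote from any manual):

  gcc variant:      vmovapd reg,reg (1 uop) + vfnmadd213pd mem,reg,reg
                    (1 fused uop at decode, 2 after unlamination)
                    -> 2 uops at decode, 3 at rename, per pair
  load+FMA variant: vmovupd mem,reg (1 uop) + vfnmadd231pd reg,reg,reg (1 uop)
                    -> 2 uops at decode, 2 at rename, per pair

With two such pairs this matches the counts above: 4 uops at decode and 6 before
renaming for the gcc code, 4 and 4 for the load+FMA variant.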

FMA unlamination is mentioned in
https://stackoverflow.com/questions/26046634/micro-fusion-and-addressing-modes

Michael, you can probably measure it for yourself with

   perf stat -e cycles,instructions,uops_retired.all,uops_retired.retire_slots
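
For example (assuming the benchmark binaries from the repository above and a Linux
perf installation; exact event names can vary by kernel and CPU):

   perf stat -e cycles,instructions,uops_retired.all,uops_retired.retire_slots ./s_chol_a 100
   perf stat -e cycles,instructions,uops_retired.all,uops_retired.retire_slots ./s_chol_b 100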


* [Bug target/97127] FMA3 code transformation leads to slowdown on Skylake
  2020-09-20 20:36 [Bug target/97127] New: FMA3 code transformation leads to slowdown on Skylake already5chosen at yahoo dot com
  2020-09-21  6:54 ` [Bug target/97127] " rguenth at gcc dot gnu.org
  2020-09-21 10:35 ` amonakov at gcc dot gnu.org
@ 2020-09-21 13:40 ` already5chosen at yahoo dot com
  2020-09-21 15:17 ` amonakov at gcc dot gnu.org
                   ` (14 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: already5chosen at yahoo dot com @ 2020-09-21 13:40 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97127

--- Comment #3 from Michael_S <already5chosen at yahoo dot com> ---
(In reply to Alexander Monakov from comment #2)
> Richard, though register moves are resolved by renaming, they still occupy a
> uop in all stages except execution, and since renaming is one of the
> narrowest points in the pipeline (only up to 4 uops/cycle on Intel),
> reducing number of uops generally helps.
> 
> In Michael's the actual memory address has two operands:
> 
> < 	vmovapd	%ymm1, %ymm10
> < 	vmovapd	%ymm1, %ymm11
> < 	vfnmadd213pd	(%rdx,%rax), %ymm9, %ymm10
> < 	vfnmadd213pd	(%rcx,%rax), %ymm7, %ymm11
> ---
> > 	vmovupd	(%rdx,%rax), %ymm10
> > 	vmovupd	(%rcx,%rax), %ymm11
> > 	vfnmadd231pd	%ymm1, %ymm9, %ymm10
> > 	vfnmadd231pd	%ymm1, %ymm7, %ymm11
> 
> The "uop" that carries operands of vfnmadd213pd gets "unlaminated" before
> renaming (because otherwise there would be too many operands to handle).
> Hence the original code has 4 uops after decoding, 6 uops before renaming,
> and the transformed code has 4 uops before renaming. Execution handles 4
> uops in both cases.
> 
> FMA unlamination is mentioned in
> https://stackoverflow.com/questions/26046634/micro-fusion-and-addressing-
> modes
> 

That's pretty much what I assumed.
More so, the gcc variant occupies 2 reservation-station entries (2 fused uOps) vs 4
entries for the de-transformed sequence. But the capacity of the reservation
stations is not a bottleneck (and generally very rarely a bottleneck in tight FP
loops), while the rename stage is one of the bottlenecks, supposedly together with
decode and L2 read throughput; the rename limits are more severe than the other two.

Anyway, as I mentioned in the original post, the question of *why* the clever gcc
transformation makes execution slower is of lesser interest. What matters is the
fact that it makes execution slower.

> Michael, you can probably measure it for yourself with
> 
>    perf stat -e
> cycles,instructions,uops_retired.all,uops_retired.retire_slots

I am playing with this code on msys2, and I strongly suspect that 'perf stat' is
not going to work there. Also, repeating myself, I am not too interested in *why*
the clever code is slower than the simple one. For me it's enough to know that it
*is* slower.

[generalization mode on]
As a rule of thumb, I'd say that in inner loops [both on Skylake and on
Haswell/Broadwell] FMA3 with a memory source operand is almost never a measurable
win relative to a RISCy load+FMA3 sequence, even when the number of uOPs going into
the rename stage is the same. Most of the time it's a wash and sometimes a loss.
So a compiler that works overtime to produce such complex instructions could spend
its effort better elsewhere. IMHO.
Outside of inner loops it's different. There [in SSE/AVX code] most of the time
the main bottleneck is instruction fetch, so the optimizer should try to emit the
shortest sequence, as measured in bytes, and does not have to pay a lot of
attention to anything else.
[generalization mode off]


* [Bug target/97127] FMA3 code transformation leads to slowdown on Skylake
  2020-09-20 20:36 [Bug target/97127] New: FMA3 code transformation leads to slowdown on Skylake already5chosen at yahoo dot com
                   ` (2 preceding siblings ...)
  2020-09-21 13:40 ` already5chosen at yahoo dot com
@ 2020-09-21 15:17 ` amonakov at gcc dot gnu.org
  2020-09-22  8:10 ` crazylht at gmail dot com
                   ` (13 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: amonakov at gcc dot gnu.org @ 2020-09-21 15:17 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97127

--- Comment #4 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
> More so, gcc variant occupies 2 reservation station entries (2 fused uOps) vs
> 4 entries by de-transformed sequence.

I don't think this is true for the test at hand? With a base+index memory
operand the renaming stage already sees two separate uops for each fma, so the
reservation station etc. should also see two for each fma, 4 uops in total. And
they will not be fused.

It would be true if the memory operands required just one register (and then the
pressure on the renaming stage would be the same for both variants).


> For me it's enough to know that it *is* slower.

Understood, but I hope GCC developers want to understand the nature of the
slowdown before attempting to fix it.


* [Bug target/97127] FMA3 code transformation leads to slowdown on Skylake
  2020-09-20 20:36 [Bug target/97127] New: FMA3 code transformation leads to slowdown on Skylake already5chosen at yahoo dot com
                   ` (3 preceding siblings ...)
  2020-09-21 15:17 ` amonakov at gcc dot gnu.org
@ 2020-09-22  8:10 ` crazylht at gmail dot com
  2020-09-22 10:01 ` already5chosen at yahoo dot com
                   ` (12 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: crazylht at gmail dot com @ 2020-09-22  8:10 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97127

--- Comment #5 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Michael_S from comment #3)
> (In reply to Alexander Monakov from comment #2)
> > Richard, though register moves are resolved by renaming, they still occupy a
> > uop in all stages except execution, and since renaming is one of the
> > narrowest points in the pipeline (only up to 4 uops/cycle on Intel),
> > reducing number of uops generally helps.
> > 
> > In Michael's the actual memory address has two operands:
> > 
> > < 	vmovapd	%ymm1, %ymm10
> > < 	vmovapd	%ymm1, %ymm11
> > < 	vfnmadd213pd	(%rdx,%rax), %ymm9, %ymm10
> > < 	vfnmadd213pd	(%rcx,%rax), %ymm7, %ymm11
> > ---
> > > 	vmovupd	(%rdx,%rax), %ymm10
> > > 	vmovupd	(%rcx,%rax), %ymm11
> > > 	vfnmadd231pd	%ymm1, %ymm9, %ymm10
> > > 	vfnmadd231pd	%ymm1, %ymm7, %ymm11
> > 

We can add a peephole2 pattern for this particular situation (assuming the
transformation won't hurt performance when the instructions are outside of inner
loops), but I'm not sure GCC could handle it with a *global view* (handle them
differently inside/outside of a loop).


* [Bug target/97127] FMA3 code transformation leads to slowdown on Skylake
  2020-09-20 20:36 [Bug target/97127] New: FMA3 code transformation leads to slowdown on Skylake already5chosen at yahoo dot com
                   ` (4 preceding siblings ...)
  2020-09-22  8:10 ` crazylht at gmail dot com
@ 2020-09-22 10:01 ` already5chosen at yahoo dot com
  2020-09-23  1:38 ` crazylht at gmail dot com
                   ` (11 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: already5chosen at yahoo dot com @ 2020-09-22 10:01 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97127

--- Comment #6 from Michael_S <already5chosen at yahoo dot com> ---
Why do you see it as the addition of a peephole pattern?
I see it as a removal. Like, "do what's written in the source and don't try to be
tricky".
Probably, I am too removed from how compilers work :(

Or, maybe, handle it at the level of instruction costs?
I don't know how gcc works internally, but it seems that currently the cost of a
register move [on Haswell and Skylake] is underestimated.
Although it is true that a register move has no cost in terms of execution ports
and latency, it still has the same cost as, say, an integer ALU instruction, in
terms of the front end and renamer.

Also, as pointed out above by Alexander, the cost of FMA3 with (base+index) or
(index*scale) memory operands could also be underestimated.

Unlike Alexander, I am not sure that the difference between (base+index) and
(base) is really what matters. IMO, the cost of FMA3 with *any* memory operand is
underestimated, but I am not going to insist on that.


In the ideal world the compiler should reason as I do myself when coding in asm:
estimate which resource is critical in the given loop and then try to reduce
pressure on that particular resource. In this particular loop the critical resource
appears to be the renamer, so the cost of instructions should be seen as their cost
at the renamer. In other situations the critical resource could be the throughput
of a particular issue port. In yet another situation it could be the latency of
instructions that form a dependency chain across multiple iterations of the loop.
The latter is especially common in "reduction" algorithms, of which dot product is
the most common example; see the sketch below.
A single value of instruction cost simply can't cover all these different cases in
a satisfactory manner.
Maybe gcc is already there, I don't know.
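
As a concrete illustration of the reduction case (a sketch I am adding here, not
code from chol.cpp):

/* Dot product: each FMA consumes the previous accumulator value, so the
   loop-carried FMA latency, not uop throughput, limits the loop unless the
   sum is split across several accumulators. */
double dot(const double *a, const double *b, long n)
{
    double s = 0.0;
    for (long i = 0; i < n; i++)
        s = __builtin_fma(a[i], b[i], s);
    return s;
}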


* [Bug target/97127] FMA3 code transformation leads to slowdown on Skylake
  2020-09-20 20:36 [Bug target/97127] New: FMA3 code transformation leads to slowdown on Skylake already5chosen at yahoo dot com
                   ` (5 preceding siblings ...)
  2020-09-22 10:01 ` already5chosen at yahoo dot com
@ 2020-09-23  1:38 ` crazylht at gmail dot com
  2020-09-23 17:49 ` already5chosen at yahoo dot com
                   ` (10 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: crazylht at gmail dot com @ 2020-09-23  1:38 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97127

--- Comment #7 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Michael_S from comment #6)
> Why do you see it as addition of peephole pattern?
> I see it as removal. Like, "do what's written in the source and don't try to
> be tricky".
> Probably, I am too removed from how compilers work :(
> 
> Or, may be, handle it at the level of cost of instructions?

Since the costs inside/outside of a loop are different, it's too rough to use a
unified cost for the estimation.

> I don't know how gcc works internally, but it seems that currently the cost
> of register move [on Haswell and Skylake] is underestimated.
> Although it is true that register move has no cost in terms of execution
> ports and latency, it still has the same cost as, say, integer ALU
> instructions, in terms of the front end and renamer.
> 
> Also, as pointed out above by Alexander, the cost of FMA3 with (base+index)
> or (index*scale) memory operands could also be underestimated.
> 
> Unlike Alexander, I am not sure that the difference between (base+index) and
> (base) is really what matters. IMO, the cost of FMA3 with *any* memory
> operands is underestimated, but I am not going to insist on that.
> 

We can increase the FMA3 cost when it has a memory operand, but I'm not sure it
would benefit all cases, especially those not in an inner loop.

> 
> In the ideal world compiler should reason as I do it myself when coding in
> asm:
> estimate which resource is critical in the given loop and then try to reduce
> pressure on this particular resource. In this particular loop a critical
> resource appears to be renamer, so the cost of instructions should be seen
> as cost at renamer. In other situations critical resource could be
> throughput of particular issue port. In yet another situation it could be
> latency of instructions that form dependency chain across multiple

AFAIK, a resource like an issue port is only considered in the software scheduler,
but the scheduler won't do an instruction transformation like that; and even the
scheduler itself is of subtle value when the target supports out-of-order
execution.

> iterations of the loop.
> The later is especially common in "reduction" algorithms of which dot
> product is the most common example.
> A single value of instruction cost simply can't cover all these different
> cases in satisfactory manner.
> May be, gcc it's already here, I don't know.


* [Bug target/97127] FMA3 code transformation leads to slowdown on Skylake
  2020-09-20 20:36 [Bug target/97127] New: FMA3 code transformation leads to slowdown on Skylake already5chosen at yahoo dot com
                   ` (6 preceding siblings ...)
  2020-09-23  1:38 ` crazylht at gmail dot com
@ 2020-09-23 17:49 ` already5chosen at yahoo dot com
  2020-09-24  3:23 ` crazylht at gmail dot com
                   ` (9 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: already5chosen at yahoo dot com @ 2020-09-23 17:49 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97127

--- Comment #8 from Michael_S <already5chosen at yahoo dot com> ---
What are values of gcc "loop" cost of the relevant instructions now?
1. AVX256 Load
2. FMA3 ymm,ymm,ymm
3. AVX256 Regmove
4. FMA3 mem,ymm,ymm


* [Bug target/97127] FMA3 code transformation leads to slowdown on Skylake
  2020-09-20 20:36 [Bug target/97127] New: FMA3 code transformation leads to slowdown on Skylake already5chosen at yahoo dot com
                   ` (7 preceding siblings ...)
  2020-09-23 17:49 ` already5chosen at yahoo dot com
@ 2020-09-24  3:23 ` crazylht at gmail dot com
  2020-09-24  8:28 ` already5chosen at yahoo dot com
                   ` (8 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: crazylht at gmail dot com @ 2020-09-24  3:23 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97127

--- Comment #9 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Michael_S from comment #8)
> What are values of gcc "loop" cost of the relevant instructions now?
> 1. AVX256 Load
> 2. FMA3 ymm,ymm,ymm
> 3. AVX256 Regmove
> 4. FMA3 mem,ymm,ymm

For Skylake, outside of register allocation, they are:

1. AVX256 Load  ---- 10
2. FMA3 ymm,ymm,ymm --- 16
3. AVX256 Regmove  --- 2
4. FMA3 mem,ymm,ymm --- 32

In RA there is no direct cost for FMA instructions, but we can disparage the memory
alternative in FMA instructions; then again, that may hurt performance in some
cases. The costs RA does see are:

1. AVX256 Load  ---- 10
3. AVX256 Regmove  --- 2

BTW: we have done a lot of experiments with different cost models and seen no
significant performance impact on SPEC2017.
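
For reference, summing the numbers as quoted: separate AVX256 load + FMA3
ymm,ymm,ymm = 10 + 16 = 26, versus 32 for FMA3 mem,ymm,ymm, so by these loop costs
alone the split form looks cheaper. (The load cost is corrected to 16 in comment
#12 below, which makes the two forms equal.)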


* [Bug target/97127] FMA3 code transformation leads to slowdown on Skylake
  2020-09-20 20:36 [Bug target/97127] New: FMA3 code transformation leads to slowdown on Skylake already5chosen at yahoo dot com
                   ` (8 preceding siblings ...)
  2020-09-24  3:23 ` crazylht at gmail dot com
@ 2020-09-24  8:28 ` already5chosen at yahoo dot com
  2020-09-24 10:06 ` crazylht at gmail dot com
                   ` (7 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: already5chosen at yahoo dot com @ 2020-09-24  8:28 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97127

--- Comment #10 from Michael_S <already5chosen at yahoo dot com> ---
(In reply to Hongtao.liu from comment #9)
> (In reply to Michael_S from comment #8)
> > What are values of gcc "loop" cost of the relevant instructions now?
> > 1. AVX256 Load
> > 2. FMA3 ymm,ymm,ymm
> > 3. AVX256 Regmove
> > 4. FMA3 mem,ymm,ymm
> 
> For skylake, outside of register allocation.
> 
> they are
> 1. AVX256 Load  ---- 10
> 2. FMA3 ymm,ymm,ymm --- 16
> 3. AVX256 Regmove  --- 2
> 4. FMA3 mem,ymm,ymm --- 32
> 
> In RA, no direct cost for fma instrcutions, but we can disparage memory
> alternative in FMA instructions, but again, it may hurt performance in some
> cases.
> 
> 1. AVX256 Load  ---- 10
> 3. AVX256 Regmove  --- 2
> 
> BTW: we have done a lot of experiments with different cost models and no
> significant performance impact on SPEC2017.

Thank you.
With relative costs like these gcc should generate 'FMA3 mem,ymm,ymm' only under
heavy register pressure. So why does it generate it in my loop, where register
pressure in the innermost loop is light, and even at the next outer level the
pressure isn't heavy?
What am I missing?


* [Bug target/97127] FMA3 code transformation leads to slowdown on Skylake
  2020-09-20 20:36 [Bug target/97127] New: FMA3 code transformation leads to slowdown on Skylake already5chosen at yahoo dot com
                   ` (9 preceding siblings ...)
  2020-09-24  8:28 ` already5chosen at yahoo dot com
@ 2020-09-24 10:06 ` crazylht at gmail dot com
  2020-09-24 10:46 ` crazylht at gmail dot com
                   ` (6 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: crazylht at gmail dot com @ 2020-09-24 10:06 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97127

--- Comment #11 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Michael_S from comment #10)
> (In reply to Hongtao.liu from comment #9)
> > (In reply to Michael_S from comment #8)
> > > What are values of gcc "loop" cost of the relevant instructions now?
> > > 1. AVX256 Load
> > > 2. FMA3 ymm,ymm,ymm
> > > 3. AVX256 Regmove
> > > 4. FMA3 mem,ymm,ymm
> > 
> > For skylake, outside of register allocation.
> > 
> > they are
> > 1. AVX256 Load  ---- 10
> > 2. FMA3 ymm,ymm,ymm --- 16
> > 3. AVX256 Regmove  --- 2
> > 4. FMA3 mem,ymm,ymm --- 32
> > 
> > In RA, no direct cost for fma instrcutions, but we can disparage memory
> > alternative in FMA instructions, but again, it may hurt performance in some
> > cases.
> > 
> > 1. AVX256 Load  ---- 10
> > 3. AVX256 Regmove  --- 2
> > 
> > BTW: we have done a lot of experiments with different cost models and no
> > significant performance impact on SPEC2017.
> 
> Thank you.
> With relative costs like these gcc should generate 'FMA3 mem,ymm,ymm' only
> in conditions of heavy registers pressure. So, why it generates it in my
> loop, where registers pressure in the innermost loop is light and even in
> the next outer level the pressure isn't heavy?
> What am I missing?

The actual transformation gcc did is:

vmovuxx (mem1), %ymmA     pass_combine
vmovuxx (mem), %ymmD         ---->     vmovuxx   (mem1), %ymmA
vfmadd213 %ymmD,%ymmC,%ymmA            vfmadd213 (mem),%ymmC,%ymmA

then RA works like:
                            RA
vmovuxx (mem1), %ymmA      ---->  vmovaps %ymmB, %ymmA
vfmadd213 (mem),%ymmC,%ymmA       vfmadd213 (mem),%ymmC,%ymmA

It looks like the following, but is actually not this one:

 vmovuxx      (mem), %ymmA
 vfnmadd231xx %ymmB, %ymmC, %ymmA
transformed to
 vmovaxx      %ymmB, %ymmA
 vfnmadd213xx (mem), %ymmC, %ymmA

ymmB is allocated for (mem1), not (mem).
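
To restate the above in C-like form (a hedged reconstruction with invented names,
not lines from chol.cpp):

/* Shape before combining (scalar for brevity); the asm comments map to the
   diagram above. */
double combined_shape(const double *p, const double *q, double c, long i)
{
    double a = p[i];   /* vmovuxx (mem1), %ymmA           */
    double d = q[i];   /* vmovuxx (mem),  %ymmD           */
    a = c * a + d;     /* vfmadd213 %ymmD, %ymmC, %ymmA   */
    return a;
    /* pass_combine folds the load of q[i] into the FMA; RA then finds the
       value of p[i] already available in another register (%ymmB) and
       replaces the remaining load with the vmovaps copy shown above. */
}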


* [Bug target/97127] FMA3 code transformation leads to slowdown on Skylake
  2020-09-20 20:36 [Bug target/97127] New: FMA3 code transformation leads to slowdown on Skylake already5chosen at yahoo dot com
                   ` (10 preceding siblings ...)
  2020-09-24 10:06 ` crazylht at gmail dot com
@ 2020-09-24 10:46 ` crazylht at gmail dot com
  2020-09-24 12:38 ` already5chosen at yahoo dot com
                   ` (5 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: crazylht at gmail dot com @ 2020-09-24 10:46 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97127

--- Comment #12 from Hongtao.liu <crazylht at gmail dot com> ---
The correct AVX256 load cost outside of register allocation and the vectorizer is 16, so the list should read:

> they are
> 1. AVX256 Load  ---- 16
> 2. FMA3 ymm,ymm,ymm --- 16
> 3. AVX256 Regmove  --- 2
> 4. FMA3 mem,ymm,ymm --- 32

That's why pass_combine would combine *avx256 load* and *FMA3 ymm,ymm,ymm* into
*FMA3 mem,ymm,ymm*: with the load at 16, load + FMA3 = 16 + 16 = 32, which is no
cheaper than the 32 of FMA3 mem,ymm,ymm.


* [Bug target/97127] FMA3 code transformation leads to slowdown on Skylake
  2020-09-20 20:36 [Bug target/97127] New: FMA3 code transformation leads to slowdown on Skylake already5chosen at yahoo dot com
                   ` (11 preceding siblings ...)
  2020-09-24 10:46 ` crazylht at gmail dot com
@ 2020-09-24 12:38 ` already5chosen at yahoo dot com
  2020-09-25  5:24 ` crazylht at gmail dot com
                   ` (4 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: already5chosen at yahoo dot com @ 2020-09-24 12:38 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97127

--- Comment #13 from Michael_S <already5chosen at yahoo dot com> ---
(In reply to Hongtao.liu from comment #11)
> (In reply to Michael_S from comment #10)
> > (In reply to Hongtao.liu from comment #9)
> > > (In reply to Michael_S from comment #8)
> > > > What are values of gcc "loop" cost of the relevant instructions now?
> > > > 1. AVX256 Load
> > > > 2. FMA3 ymm,ymm,ymm
> > > > 3. AVX256 Regmove
> > > > 4. FMA3 mem,ymm,ymm
> > > 
> > > For skylake, outside of register allocation.
> > > 
> > > they are
> > > 1. AVX256 Load  ---- 10
> > > 2. FMA3 ymm,ymm,ymm --- 16
> > > 3. AVX256 Regmove  --- 2
> > > 4. FMA3 mem,ymm,ymm --- 32
> > > 
> > > In RA, no direct cost for fma instrcutions, but we can disparage memory
> > > alternative in FMA instructions, but again, it may hurt performance in some
> > > cases.
> > > 
> > > 1. AVX256 Load  ---- 10
> > > 3. AVX256 Regmove  --- 2
> > > 
> > > BTW: we have done a lot of experiments with different cost models and no
> > > significant performance impact on SPEC2017.
> > 
> > Thank you.
> > With relative costs like these gcc should generate 'FMA3 mem,ymm,ymm' only
> > in conditions of heavy registers pressure. So, why it generates it in my
> > loop, where registers pressure in the innermost loop is light and even in
> > the next outer level the pressure isn't heavy?
> > What am I missing?
> 
> the actual transformation gcc did is
> 
> vmovuxx (mem1), %ymmA     pass_combine     
> vmovuxx (mem), %ymmD         ---->     vmovuxx   (mem1), %ymmA
> vfmadd213 %ymmD,%ymmC,%ymmA            vfmadd213 (mem),%ymmC,%ymmA
> 
> then RA works like
>                             RA
> vmovuxx (mem1), %ymmA      ---->  %vmovaps %ymmB, %ymmA
> vfmadd213 (mem),%ymmC,%ymmA       vfmadd213 (mem),%ymmC,%ymmA
> 
> it "look like" but actually not this one.
> 
>  vmovuxx      (mem), %ymmA
>  vfnmadd231xx %ymmB, %ymmC, %ymmA
> transformed to
>  vmovaxx      %ymmB, %ymmA
>  vfnmadd213xx (mem), %ymmC, %ymmA
> 
> ymmB is allocate for (mem1) not (mem)

Thank you.
Now the compiler's reasoning is starting to make more sense.
Still, I don't understand why the compiler does not compare the cost of the full
loop body after combining to the cost before combining and come to the conclusion
that combining increased the cost.


* [Bug target/97127] FMA3 code transformation leads to slowdown on Skylake
  2020-09-20 20:36 [Bug target/97127] New: FMA3 code transformation leads to slowdown on Skylake already5chosen at yahoo dot com
                   ` (12 preceding siblings ...)
  2020-09-24 12:38 ` already5chosen at yahoo dot com
@ 2020-09-25  5:24 ` crazylht at gmail dot com
  2020-09-25 13:21 ` already5chosen at yahoo dot com
                   ` (3 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: crazylht at gmail dot com @ 2020-09-25  5:24 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97127

--- Comment #14 from Hongtao.liu <crazylht at gmail dot com> ---
> Still I don't understand why compiler does not compare the cost of full loop
> body after combining to the cost before combining and does not come to
> conclusion that combining increased the cost.

As Richard says, GCC does not model CPU pipelines in such detail.


* [Bug target/97127] FMA3 code transformation leads to slowdown on Skylake
  2020-09-20 20:36 [Bug target/97127] New: FMA3 code transformation leads to slowdown on Skylake already5chosen at yahoo dot com
                   ` (13 preceding siblings ...)
  2020-09-25  5:24 ` crazylht at gmail dot com
@ 2020-09-25 13:21 ` already5chosen at yahoo dot com
  2020-09-25 14:02 ` amonakov at gcc dot gnu.org
                   ` (2 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: already5chosen at yahoo dot com @ 2020-09-25 13:21 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97127

--- Comment #15 from Michael_S <already5chosen at yahoo dot com> ---
(In reply to Hongtao.liu from comment #14)
> > Still I don't understand why compiler does not compare the cost of full loop
> > body after combining to the cost before combining and does not come to
> > conclusion that combining increased the cost.
> 
> As Richard says, GCC does not model CPU pipelines in such detail.

I don't understand what "details" you have in mind.
The costs of the instructions that you quoted above look fine.
But for a reason I don't understand, the compiler chose the more costly "combined"
code sequence over the less costly (according to its own cost model) "RISCy"
sequence.


* [Bug target/97127] FMA3 code transformation leads to slowdown on Skylake
  2020-09-20 20:36 [Bug target/97127] New: FMA3 code transformation leads to slowdown on Skylake already5chosen at yahoo dot com
                   ` (14 preceding siblings ...)
  2020-09-25 13:21 ` already5chosen at yahoo dot com
@ 2020-09-25 14:02 ` amonakov at gcc dot gnu.org
  2020-09-25 15:55 ` amonakov at gcc dot gnu.org
  2020-09-30 12:09 ` rsandifo at gcc dot gnu.org
  17 siblings, 0 replies; 19+ messages in thread
From: amonakov at gcc dot gnu.org @ 2020-09-25 14:02 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97127

--- Comment #16 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
Mostly because prior to register allocation the compiler does not naturally see
that x = *mem + a*b will need an extra mov when both 'a' and 'b' are live (as
in that case registers allocated for them cannot be immediately reused for
'x').

(this is why RTL combine merges the two instructions, but on the isolated
testcase below we already have the combined form thanks to tree-TER)

Isolated testcase, by the way (gcc -O2 -mfma):

double f(double t, double a, double b, double c[], long i)
{
    t = __builtin_fma(a, b, c[i]);
    asm("" :: "x"(t), "x"(a), "x"(b));
    return t;
}


* [Bug target/97127] FMA3 code transformation leads to slowdown on Skylake
  2020-09-20 20:36 [Bug target/97127] New: FMA3 code transformation leads to slowdown on Skylake already5chosen at yahoo dot com
                   ` (15 preceding siblings ...)
  2020-09-25 14:02 ` amonakov at gcc dot gnu.org
@ 2020-09-25 15:55 ` amonakov at gcc dot gnu.org
  2020-09-30 12:09 ` rsandifo at gcc dot gnu.org
  17 siblings, 0 replies; 19+ messages in thread
From: amonakov at gcc dot gnu.org @ 2020-09-25 15:55 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97127

--- Comment #17 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
To me this suggests that in fact it's okay to carry the combined form in RTL up
to register allocation, but RA should decompose it to load+fma instead of
inserting a register copy that preserves the live operand.

(not a fan of "RA should deal with it" theme, but here it does look like the
most appropriate resolution)


* [Bug target/97127] FMA3 code transformation leads to slowdown on Skylake
  2020-09-20 20:36 [Bug target/97127] New: FMA3 code transformation leads to slowdown on Skylake already5chosen at yahoo dot com
                   ` (16 preceding siblings ...)
  2020-09-25 15:55 ` amonakov at gcc dot gnu.org
@ 2020-09-30 12:09 ` rsandifo at gcc dot gnu.org
  17 siblings, 0 replies; 19+ messages in thread
From: rsandifo at gcc dot gnu.org @ 2020-09-30 12:09 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97127

rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rsandifo at gcc dot gnu.org

--- Comment #18 from rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> ---
Not sure whether this is a good idea or not, but: combine could
look at the constraints of the instruction it matches,
compare that with the information it has about whether
the inputs remain live after the instruction, and cost
in an extra move if all alternatives would require a reload.

