public inbox for gcc-bugs@sourceware.org
* [Bug rtl-optimization/106553] New: pre-register allocation scheduler is not RMW aware
@ 2022-08-08 10:21 tnfchris at gcc dot gnu.org
2022-08-08 14:37 ` [Bug rtl-optimization/106553] " amonakov at gcc dot gnu.org
2022-08-08 16:25 ` tnfchris at gcc dot gnu.org
0 siblings, 2 replies; 3+ messages in thread
From: tnfchris at gcc dot gnu.org @ 2022-08-08 10:21 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106553
Bug ID: 106553
Summary: pre-register allocation scheduler is not RMW aware
Product: gcc
Version: 11.3.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: tnfchris at gcc dot gnu.org
Target Milestone: ---
Target: aarch64*
The following example is minimized from the math routines in glibc:
#include <arm_neon.h>
typedef float32x4_t v_f32_t;
static inline v_f32_t
v_fma_f32 (v_f32_t x, v_f32_t y, v_f32_t z)
{
return vfmaq_f32 (z, x, y);
}
v_f32_t
__v_sinf (v_f32_t x, v_f32_t z, v_f32_t n, v_f32_t r)
{
v_f32_t r2, y;
r2 = r * r;
y = v_fma_f32 (n, r2, x);
y = v_fma_f32 (y, r2, x);
r = v_fma_f32 (y, r2, z);
y = v_fma_f32 (y, r2, x);
y = v_fma_f32 (y * r2, r, r);
return y;
}
here we generate at -O2
__v_sinf(__Float32x4_t, __Float32x4_t, __Float32x4_t, __Float32x4_t):
fmul v3.4s, v3.4s, v3.4s
mov v5.16b, v0.16b
mov v4.16b, v0.16b
fmla v5.4s, v2.4s, v3.4s
fmla v4.4s, v5.4s, v3.4s
fmla v0.4s, v4.4s, v3.4s
mov v2.16b, v0.16b
mov v0.16b, v1.16b
fmla v0.4s, v4.4s, v3.4s
fmul v3.4s, v3.4s, v2.4s
fmla v0.4s, v3.4s, v0.4s
ret
the 3rd move is there because the pre-register allocation scheduler created a
false dependency by scheduling the fmul after the fmla. This forces reload to
create a copy to keep `v0` alive past the destructive operation.
with -O2 -fno-schedule-insns we get
__v_sinf(__Float32x4_t, __Float32x4_t, __Float32x4_t, __Float32x4_t):
fmul v3.4s, v3.4s, v3.4s
mov v4.16b, v0.16b
fmla v0.4s, v2.4s, v3.4s
mov v2.16b, v4.16b
fmla v2.4s, v0.4s, v3.4s
mov v0.16b, v1.16b
fmla v4.4s, v2.4s, v3.4s
fmla v0.4s, v2.4s, v3.4s
fmul v3.4s, v3.4s, v4.4s
fmla v0.4s, v3.4s, v0.4s
ret
In glibc these additional moves cost double-digit performance losses by
breaking up the fmla chains.
Should we perhaps use a special RMW scheduling attribute to make it treat the
last input as an output too?
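For reference, the register-allocation side of the tie is already expressed today: the aarch64 FMA pattern constrains the accumulator operand with "0" so the RA ties it to the output, but the pre-RA scheduler never sees constraints. One hypothetical shape for the suggested hint (the rmw_operand attribute is invented for illustration, and the pattern below is paraphrased from aarch64-simd.md rather than quoted exactly):

```lisp
;; Hypothetical: an insn attribute the pre-RA scheduler could consult to
;; learn that operand 3 will be tied to operand 0 after allocation, so
;; scheduling another reader of operand 3 after this insn creates a copy.
(define_attr "rmw_operand" "none,op1,op2,op3" (const_string "none"))

(define_insn "fma<mode>4"
  [(set (match_operand:VHSDF 0 "register_operand" "=w")
        (fma:VHSDF (match_operand:VHSDF 1 "register_operand" "w")
                   (match_operand:VHSDF 2 "register_operand" "w")
                   (match_operand:VHSDF 3 "register_operand" "0")))]
  "TARGET_SIMD"
  "fmla\\t%0.<Vtype>, %1.<Vtype>, %2.<Vtype>"
  [(set_attr "type" "neon_fp_mla_<stype><q>")
   (set_attr "rmw_operand" "op3")]   ; hypothetical addition
)
```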
^ permalink raw reply [flat|nested] 3+ messages in thread
* [Bug rtl-optimization/106553] pre-register allocation scheduler is not RMW aware
2022-08-08 10:21 [Bug rtl-optimization/106553] New: pre-register allocation scheduler is not RMW aware tnfchris at gcc dot gnu.org
@ 2022-08-08 14:37 ` amonakov at gcc dot gnu.org
2022-08-08 16:25 ` tnfchris at gcc dot gnu.org
1 sibling, 0 replies; 3+ messages in thread
From: amonakov at gcc dot gnu.org @ 2022-08-08 14:37 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106553
Alexander Monakov <amonakov at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |amonakov at gcc dot gnu.org
--- Comment #1 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
Are you sure the testcase is correctly reduced, i.e. does it show the same
performance degradation? Latency-wise the scheduler is making the correct
decision here: we really want to schedule second-to-last FMA
y = v_fma_f32 (y, r2, x);
earlier than its predecessor
r = v_fma_f32 (y, r2, z);
because we need to compute y*r2 before the last FMA.
* [Bug rtl-optimization/106553] pre-register allocation scheduler is not RMW aware
2022-08-08 10:21 [Bug rtl-optimization/106553] New: pre-register allocation scheduler is not RMW aware tnfchris at gcc dot gnu.org
2022-08-08 14:37 ` [Bug rtl-optimization/106553] " amonakov at gcc dot gnu.org
@ 2022-08-08 16:25 ` tnfchris at gcc dot gnu.org
1 sibling, 0 replies; 3+ messages in thread
From: tnfchris at gcc dot gnu.org @ 2022-08-08 16:25 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106553
--- Comment #2 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
(In reply to Alexander Monakov from comment #1)
> Are you sure the testcase is correctly reduced, i.e. does it show the same
> performance degradation? Latency-wise the scheduler is making the correct
> decision here: we really want to schedule second-to-last FMA
The reduction wasn't tested for performance; I'm not even claiming the final
result is optimal, since scheduling was completely disabled.
>
> y = v_fma_f32 (y, r2, x);
>
> earlier than its predecessor
>
> r = v_fma_f32 (y, r2, z);
>
> because we need to compute y*r2 before the last FMA.
The relative order of the instructions didn't change as far as I can tell in
the reduced example.
I had expected the mul to be moved earlier.
__v_sinf(__Float32x4_t, __Float32x4_t, __Float32x4_t, __Float32x4_t):
fmul v3.4s, v3.4s, v3.4s
mov v5.16b, v0.16b
mov v4.16b, v0.16b
fmla v5.4s, v2.4s, v3.4s
fmla v4.4s, v5.4s, v3.4s
fmla v0.4s, v4.4s, v3.4s
fmul v6.4s, v3.4s, v0.4s
mov v0.16b, v1.16b
fmla v0.4s, v4.4s, v3.4s
fmla v0.4s, v6.4s, v0.4s
ret
as the copy of v0 into v2 is still redundant. However, looking at the RTL of
the reduction, I don't really understand why the mov existed.
The bad case is
(insn 13 11 12 2 (set (reg:V4SF 94 [ _9 ])
(fma:V4SF (reg:V4SF 96 [ _11 ])
(reg/v:V4SF 93 [ r2 ])
(reg/v:V4SF 99 [ x ]))) "":14605:10 2206 {fmav4sf4}
(expr_list:REG_DEAD (reg/v:V4SF 99 [ x ])
(nil)))
(insn 12 13 14 2 (set (reg:V4SF 95 [ _10 ])
(fma:V4SF (reg:V4SF 96 [ _11 ])
(reg/v:V4SF 93 [ r2 ])
(reg:V4SF 106))) "":14605:10 2206 {fmav4sf4}
(expr_list:REG_DEAD (reg:V4SF 106)
(expr_list:REG_DEAD (reg:V4SF 96 [ _11 ])
(nil))))
(insn 14 12 19 2 (set (reg:V4SF 103)
(mult:V4SF (reg/v:V4SF 93 [ r2 ])
(reg:V4SF 94 [ _9 ]))) "":20:7 2186 {mulv4sf3}
(expr_list:REG_DEAD (reg:V4SF 94 [ _9 ])
(expr_list:REG_DEAD (reg/v:V4SF 93 [ r2 ])
(nil))))
and the good case
(insn 12 11 13 2 (set (reg:V4SF 95 [ _10 ])
(fma:V4SF (reg:V4SF 96 [ _11 ])
(reg/v:V4SF 93 [ r2 ])
(reg:V4SF 106))) "":14605:10 2206 {fmav4sf4}
(expr_list:REG_DEAD (reg:V4SF 106)
(nil)))
(insn 13 12 14 2 (set (reg:V4SF 94 [ _9 ])
(fma:V4SF (reg:V4SF 96 [ _11 ])
(reg/v:V4SF 93 [ r2 ])
(reg/v:V4SF 99 [ x ]))) "":14605:10 2206 {fmav4sf4}
(expr_list:REG_DEAD (reg/v:V4SF 99 [ x ])
(expr_list:REG_DEAD (reg:V4SF 96 [ _11 ])
(nil))))
(insn 14 13 15 2 (set (reg:V4SF 103)
(mult:V4SF (reg/v:V4SF 93 [ r2 ])
(reg:V4SF 94 [ _9 ]))) "":20:7 2186 {mulv4sf3}
(expr_list:REG_DEAD (reg:V4SF 94 [ _9 ])
(expr_list:REG_DEAD (reg/v:V4SF 93 [ r2 ])
(nil))))
So I don't really see why the live range of _9 was extended, so maybe this is
register allocation after all.
It's weird that -fno-schedule-insns removes the redundant moves in all cases,
though. But perhaps that's just a coincidence...