public inbox for gcc-bugs@sourceware.org
* [Bug rtl-optimization/106553] New: pre-register allocation scheduler is not RMW aware
@ 2022-08-08 10:21 tnfchris at gcc dot gnu.org
2022-08-08 14:37 ` [Bug rtl-optimization/106553] " amonakov at gcc dot gnu.org
2022-08-08 16:25 ` tnfchris at gcc dot gnu.org
0 siblings, 2 replies; 3+ messages in thread
From: tnfchris at gcc dot gnu.org @ 2022-08-08 10:21 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106553
Bug ID: 106553
Summary: pre-register allocation scheduler is not RMW aware
Product: gcc
Version: 11.3.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: tnfchris at gcc dot gnu.org
Target Milestone: ---
Target: aarch64*
The following example is minimized from the math routines in glibc:
#include <arm_neon.h>
typedef float32x4_t v_f32_t;
static inline v_f32_t
v_fma_f32 (v_f32_t x, v_f32_t y, v_f32_t z)
{
return vfmaq_f32 (z, x, y);
}
v_f32_t
__v_sinf (v_f32_t x, v_f32_t z, v_f32_t n, v_f32_t r)
{
v_f32_t r2, y;
r2 = r * r;
y = v_fma_f32 (n, r2, x);
y = v_fma_f32 (y, r2, x);
r = v_fma_f32 (y, r2, z);
y = v_fma_f32 (y, r2, x);
y = v_fma_f32 (y * r2, r, r);
return y;
}
here we generate at -O2
__v_sinf(__Float32x4_t, __Float32x4_t, __Float32x4_t, __Float32x4_t):
fmul v3.4s, v3.4s, v3.4s
mov v5.16b, v0.16b
mov v4.16b, v0.16b
fmla v5.4s, v2.4s, v3.4s
fmla v4.4s, v5.4s, v3.4s
fmla v0.4s, v4.4s, v3.4s
mov v2.16b, v0.16b
mov v0.16b, v1.16b
fmla v0.4s, v4.4s, v3.4s
fmul v3.4s, v3.4s, v2.4s
fmla v0.4s, v3.4s, v0.4s
ret
the 3rd move is there because the pre-register allocation scheduler created a
false dependency by scheduling the fmul after the fmla. This forces reload to
create a copy to keep `v0` alive past the destructive operation.
with -O2 -fno-schedule-insns we get
__v_sinf(__Float32x4_t, __Float32x4_t, __Float32x4_t, __Float32x4_t):
fmul v3.4s, v3.4s, v3.4s
mov v4.16b, v0.16b
fmla v0.4s, v2.4s, v3.4s
mov v2.16b, v4.16b
fmla v2.4s, v0.4s, v3.4s
mov v0.16b, v1.16b
fmla v4.4s, v2.4s, v3.4s
fmla v0.4s, v2.4s, v3.4s
fmul v3.4s, v3.4s, v4.4s
fmla v0.4s, v3.4s, v0.4s
ret
In glibc these additional moves cost double-digit performance losses by
breaking up the fmla chains.
Should we perhaps use a special RMW scheduling attribute to make it treat the
last input as an output too?
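For reference, the register-allocation side of the tie is already expressed today: the aarch64 FMA pattern constrains the accumulator operand with "0" so the RA ties it to the output, but the pre-RA scheduler never sees constraints. One hypothetical shape for the suggested hint (the rmw_operand attribute is invented for illustration, and the pattern below is paraphrased from aarch64-simd.md rather than quoted exactly):

```lisp
;; Hypothetical: an insn attribute the pre-RA scheduler could consult to
;; learn that operand 3 will be tied to operand 0 after allocation, so
;; scheduling another reader of operand 3 after this insn creates a copy.
(define_attr "rmw_operand" "none,op1,op2,op3" (const_string "none"))

(define_insn "fma<mode>4"
  [(set (match_operand:VHSDF 0 "register_operand" "=w")
        (fma:VHSDF (match_operand:VHSDF 1 "register_operand" "w")
                   (match_operand:VHSDF 2 "register_operand" "w")
                   (match_operand:VHSDF 3 "register_operand" "0")))]
  "TARGET_SIMD"
  "fmla\\t%0.<Vtype>, %1.<Vtype>, %2.<Vtype>"
  [(set_attr "type" "neon_fp_mla_<stype><q>")
   (set_attr "rmw_operand" "op3")]   ; hypothetical addition
)
```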
^ permalink raw reply [flat|nested] 3+ messages in thread
* [Bug rtl-optimization/106553] pre-register allocation scheduler is not RMW aware
2022-08-08 10:21 [Bug rtl-optimization/106553] New: pre-register allocation scheduler is not RMW aware tnfchris at gcc dot gnu.org
@ 2022-08-08 14:37 ` amonakov at gcc dot gnu.org
2022-08-08 16:25 ` tnfchris at gcc dot gnu.org
1 sibling, 0 replies; 3+ messages in thread
From: amonakov at gcc dot gnu.org @ 2022-08-08 14:37 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106553
Alexander Monakov <amonakov at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |amonakov at gcc dot gnu.org
--- Comment #1 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
Are you sure the testcase is correctly reduced, i.e. does it show the same
performance degradation? Latency-wise the scheduler is making the correct
decision here: we really want to schedule second-to-last FMA
y = v_fma_f32 (y, r2, x);
earlier than its predecessor
r = v_fma_f32 (y, r2, z);
because we need to compute y*r2 before the last FMA.
* [Bug rtl-optimization/106553] pre-register allocation scheduler is not RMW aware
2022-08-08 10:21 [Bug rtl-optimization/106553] New: pre-register allocation scheduler is not RMW aware tnfchris at gcc dot gnu.org
2022-08-08 14:37 ` [Bug rtl-optimization/106553] " amonakov at gcc dot gnu.org
@ 2022-08-08 16:25 ` tnfchris at gcc dot gnu.org
1 sibling, 0 replies; 3+ messages in thread
From: tnfchris at gcc dot gnu.org @ 2022-08-08 16:25 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106553
--- Comment #2 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
(In reply to Alexander Monakov from comment #1)
> Are you sure the testcase is correctly reduced, i.e. does it show the same
> performance degradation? Latency-wise the scheduler is making the correct
> decision here: we really want to schedule second-to-last FMA
The reduction wasn't tested for performance; I'm not even claiming the final
result is optimal, since scheduling was completely disabled.
>
> y = v_fma_f32 (y, r2, x);
>
> earlier than its predecessor
>
> r = v_fma_f32 (y, r2, z);
>
> because we need to compute y*r2 before the last FMA.
The relative order of the instructions didn't change as far as I can tell in
the reduced example.
I had expected the mul to be moved earlier.
__v_sinf(__Float32x4_t, __Float32x4_t, __Float32x4_t, __Float32x4_t):
fmul v3.4s, v3.4s, v3.4s
mov v5.16b, v0.16b
mov v4.16b, v0.16b
fmla v5.4s, v2.4s, v3.4s
fmla v4.4s, v5.4s, v3.4s
fmla v0.4s, v4.4s, v3.4s
fmul v6.4s, v3.4s, v0.4s
mov v0.16b, v1.16b
fmla v0.4s, v4.4s, v3.4s
fmla v0.4s, v6.4s, v0.4s
ret
as the copy of v0 into v2 is still redundant. However, looking at the RTL of
the reduction, I don't really understand why the mov existed.
The bad case is
(insn 13 11 12 2 (set (reg:V4SF 94 [ _9 ])
(fma:V4SF (reg:V4SF 96 [ _11 ])
(reg/v:V4SF 93 [ r2 ])
(reg/v:V4SF 99 [ x ]))) "":14605:10 2206 {fmav4sf4}
(expr_list:REG_DEAD (reg/v:V4SF 99 [ x ])
(nil)))
(insn 12 13 14 2 (set (reg:V4SF 95 [ _10 ])
(fma:V4SF (reg:V4SF 96 [ _11 ])
(reg/v:V4SF 93 [ r2 ])
(reg:V4SF 106))) "":14605:10 2206 {fmav4sf4}
(expr_list:REG_DEAD (reg:V4SF 106)
(expr_list:REG_DEAD (reg:V4SF 96 [ _11 ])
(nil))))
(insn 14 12 19 2 (set (reg:V4SF 103)
(mult:V4SF (reg/v:V4SF 93 [ r2 ])
(reg:V4SF 94 [ _9 ]))) "":20:7 2186 {mulv4sf3}
(expr_list:REG_DEAD (reg:V4SF 94 [ _9 ])
(expr_list:REG_DEAD (reg/v:V4SF 93 [ r2 ])
(nil))))
and the good case
(insn 12 11 13 2 (set (reg:V4SF 95 [ _10 ])
(fma:V4SF (reg:V4SF 96 [ _11 ])
(reg/v:V4SF 93 [ r2 ])
(reg:V4SF 106))) "":14605:10 2206 {fmav4sf4}
(expr_list:REG_DEAD (reg:V4SF 106)
(nil)))
(insn 13 12 14 2 (set (reg:V4SF 94 [ _9 ])
(fma:V4SF (reg:V4SF 96 [ _11 ])
(reg/v:V4SF 93 [ r2 ])
(reg/v:V4SF 99 [ x ]))) "":14605:10 2206 {fmav4sf4}
(expr_list:REG_DEAD (reg/v:V4SF 99 [ x ])
(expr_list:REG_DEAD (reg:V4SF 96 [ _11 ])
(nil))))
(insn 14 13 15 2 (set (reg:V4SF 103)
(mult:V4SF (reg/v:V4SF 93 [ r2 ])
(reg:V4SF 94 [ _9 ]))) "":20:7 2186 {mulv4sf3}
(expr_list:REG_DEAD (reg:V4SF 94 [ _9 ])
(expr_list:REG_DEAD (reg/v:V4SF 93 [ r2 ])
(nil))))
So I don't really see why the live range of _9 was extended, so maybe this is
register allocation after all.
It's weird that -fno-schedule-insns removes the redundant moves in all cases,
though. But perhaps that's just a coincidence...