From: "rguenth at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
Date: Thu, 27 Jan 2022 07:42:49 +0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178

Richard Biener changed:

           What    |Removed            |Added
----------------------------------------------------------------------------
          Component|tree-optimization  |rtl-optimization
                 CC|                   |vmakarov at gcc dot gnu.org
           Keywords|                   |ra

--- Comment #7 from Richard Biener ---
I see a lot more GPR <-> XMM moves in the 'after' case:

  1035 :   401c8b:       vaddsd %xmm1,%xmm0,%xmm0
  1953 :   401c8f:       vmovq  %rcx,%xmm1
   305 :   401c94:       vaddsd %xmm8,%xmm1,%xmm1
  3076 :   401c99:       vmovq  %xmm0,%r14
   590 :   401c9e:       vmovq  %r11,%xmm0
   267 :   401ca3:       vmovq  %xmm1,%r8
   136 :   401ca8:       vmovq  %rdx,%xmm1
   448 :   401cad:       vaddsd %xmm1,%xmm0,%xmm1
  1703 :   401cb1:       vmovq  %xmm1,%r9     (*)
   834 :   401cb6:       vmovq  %r8,%xmm1
  1719 :   401cbb:       vmovq  %r9,%xmm0     (*)
  2782 :   401cc0:       vaddsd %xmm0,%xmm1,%xmm1
 22135 :   401cc4:       vmovsd %xmm1,%xmm1,%xmm0
  1261 :   401cc8:       vmovq  %r14,%xmm1
   646 :   401ccd:       vaddsd %xmm0,%xmm1,%xmm0
 18136 :   401cd1:       vaddsd %xmm2,%xmm5,%xmm1
   629 :   401cd5:       vmovq  %xmm1,%r8
   142 :   401cda:       vaddsd %xmm6,%xmm3,%xmm1
   177 :   401cde:       vmovq  %xmm0,%r14
   288 :   401ce3:       vmovq  %xmm1,%r9
   177 :   401ce8:       vmovq  %r8,%xmm1
   174 :   401ced:       vmovq  %r9,%xmm0

Those look like RA / spilling artifacts; IIRC Hongtao posted patches in this
area (to regcprop, I think?).  The above is definitely bad, for example the
insns marked (*) seem to swap %xmm0 and %xmm1 via %r9.

The function is LBM_performStreamCollide.  The sinking pass does nothing
wrong; it moves the unconditionally executed

-  _948 = _861 + _867;
-  _957 = _944 + _948;
-  _912 = _861 + _873;
...
-  _981 = _853 + _865;
-  _989 = _977 + _981;
-  _916 = _853 + _857;
-  _924 = _912 + _916;

into a conditionally executed block.  But that increases register pressure
by 5 FP regs (if I counted correctly) in that area.  So this is the usual
issue of GIMPLE transforms not being register-pressure aware.
-fschedule-insns -fsched-pressure seems to mitigate this somewhat (though I
think EBB scheduling cannot undo such movement).
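
To make the shape of the problem concrete, here is a purely illustrative C
sketch (not taken from lbm.c; the function name 'sketch' and all variables
are made up): the sums a..d are used only inside the guarded block, so
sinking moves the four additions into it and the eight x* inputs stay live
across the branch instead of just the four sums.

/* Illustrative only: sinking the additions of a..d into the if-block keeps
   x0..x7 live across the branch, raising FP register pressure there.  */
void
sketch (double *restrict dst, const double *restrict src, int flag)
{
  double x0 = src[0], x1 = src[1], x2 = src[2], x3 = src[3];
  double x4 = src[4], x5 = src[5], x6 = src[6], x7 = src[7];

  /* The x* are all needed unconditionally ...  */
  dst[8] = x0 * x1 + x2 * x3 + x4 * x5 + x6 * x7;

  /* ... but these sums are used only in the guarded block, so sinking
     moves the additions down into it.  */
  double a = x0 + x1;
  double b = x2 + x3;
  double c = x4 + x5;
  double d = x6 + x7;

  if (flag)
    {
      dst[0] = a + b;
      dst[1] = c + d;
      dst[2] = a + d;
      dst[3] = b + c;
    }
}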
In postreload I see transforms like

-(insn 466 410 411 7 (set (reg:DF 0 ax [530])
-        (mem/u/c:DF (symbol_ref/u:DI ("*.LC10") [flags 0x2]) [0  S8 A64])) "lbm.c":241:5 141 {*movdf_internal}
-     (expr_list:REG_EQUAL (const_double:DF 9.939744999999999830464503247640095651149749755859375e-1 [0x0.fe751ce28ed5fp+0])
-        (nil)))
-(insn 411 466 467 7 (set (reg:DF 25 xmm5 [orig:123 prephitmp_643 ] [123])
+(insn 411 410 467 7 (set (reg:DF 25 xmm5 [orig:123 prephitmp_643 ] [123])
         (reg:DF 0 ax [530])) "lbm.c":241:5 141 {*movdf_internal}
      (nil))

which suggests we could have reloaded %xmm5 from .LC10 directly.  But the
spilling to GPRs is already present after LRA, and cprop_hardreg doesn't do
anything bad either.

The differences can be seen on trunk with -Ofast -march=znver2
[-fdisable-tree-sink2].

We have X86_TUNE_INTER_UNIT_MOVES_TO_VEC / X86_TUNE_INTER_UNIT_MOVES_FROM_VEC,
and the interesting thing is that when I disable them I do see some spilling
to the stack but also quite a few re-materialized constants (loads from .LC*,
as seen in the opportunity above).  It might be interesting to benchmark with
-mtune-ctrl=^inter_unit_moves_from_vec,^inter_unit_moves_to_vec and to find a
way to set the costs so that IRA/LRA prefer re-materializing constants from
the constant pool over spilling to GPRs (if that's possible at all - Vlad?)
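
For reference, a tiny made-up example (again not from lbm.c; 'weighted_sum'
and 'K' are invented names) of the decision in question: a double constant
that lives in an .LC* slot can, under pressure, either be parked in a GPR
(the vmovq traffic above) or simply re-loaded from the constant pool at each
use, which is what reloading %xmm5 from .LC10 would have amounted to.  The
compile lines in the comment sketch the suggested tuning comparison.

/* Made-up example, not from lbm.c: K is a constant-pool value (an .LC*
   slot like .LC10 above).  If it ends up spilled, re-materializing it from
   the constant pool avoids the GPR round trip.

   Hypothetical compile lines for the comparison suggested above:
     gcc -Ofast -march=znver2 lbm.c
     gcc -Ofast -march=znver2 \
         -mtune-ctrl=^inter_unit_moves_from_vec,^inter_unit_moves_to_vec lbm.c  */

#define K 0.9939745   /* stand-in for the .LC10 constant quoted above */

double
weighted_sum (const double *v, int n)
{
  double acc = 0.0;
  for (int i = 0; i < n; i++)
    acc += v[i] * K;
  return acc;
}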