public inbox for gcc-bugs@sourceware.org
* [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
@ 2021-09-02 15:38 jamborm at gcc dot gnu.org
2021-09-03 7:07 ` [Bug tree-optimization/102178] " marxin at gcc dot gnu.org
` (39 more replies)
0 siblings, 40 replies; 43+ messages in thread
From: jamborm at gcc dot gnu.org @ 2021-09-02 15:38 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
Bug ID: 102178
Summary: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after
r12-897-gde56f95afaaa22
Product: gcc
Version: 11.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: jamborm at gcc dot gnu.org
Blocks: 26163
Target Milestone: ---
Host: x86_64-linix
Target: x86_64-linux
LNT has detected an 18% regression of SPECFP 2006 benchmark 470.lbm
when it is compiled with -Ofast -march=native on a Zen2 machine:
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=421.240.0&plot.1=301.240.0&
...and similarly a 6% regression when it is run on the same machine
with -Ofast:
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=450.240.0&plot.1=24.240.0&
I have bisected both, on another Zen2 machine, to commit
r12-897-gde56f95afaaa22 (Run pass_sink_code once more before
store_merging).
A Zen1 machine has also seen a similar -march=native regression in the
same time frame:
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=450.240.0&plot.1=24.240.0&
Zen1 -march=generic seems to be unaffected, which is also the case for
the Intel machines we track.
Although lbm has been known to have weird regressions caused entirely
by code layout where the compiler was not really at fault, the fact
that both generic code-gen and Zen1 are affected seems to indicate this
is not the case.
Referenced Bugs:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163
[Bug 26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)
^ permalink raw reply [flat|nested] 43+ messages in thread
* [Bug tree-optimization/102178] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
@ 2021-09-03 7:07 ` marxin at gcc dot gnu.org
2021-09-06 6:40 ` rguenth at gcc dot gnu.org
` (38 subsequent siblings)
39 siblings, 0 replies; 43+ messages in thread
From: marxin at gcc dot gnu.org @ 2021-09-03 7:07 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
Martin Liška <marxin at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|UNCONFIRMED |NEW
Last reconfirmed| |2021-09-03
CC| |marxin at gcc dot gnu.org
Ever confirmed|0 |1
* [Bug tree-optimization/102178] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
2021-09-03 7:07 ` [Bug tree-optimization/102178] " marxin at gcc dot gnu.org
@ 2021-09-06 6:40 ` rguenth at gcc dot gnu.org
2021-09-06 6:41 ` [Bug tree-optimization/102178] [12 Regression] " rguenth at gcc dot gnu.org
` (37 subsequent siblings)
39 siblings, 0 replies; 43+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-09-06 6:40 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
Maybe related to PR102008, see the comment I made there. Martin, maybe you can
try moving late sink to before the last phiopt pass.
* [Bug tree-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
2021-09-03 7:07 ` [Bug tree-optimization/102178] " marxin at gcc dot gnu.org
2021-09-06 6:40 ` rguenth at gcc dot gnu.org
@ 2021-09-06 6:41 ` rguenth at gcc dot gnu.org
2021-09-07 2:46 ` luoxhu at gcc dot gnu.org
` (36 subsequent siblings)
39 siblings, 0 replies; 43+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-09-06 6:41 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Version|11.0 |12.0
Summary|SPECFP 2006 470.lbm |[12 Regression] SPECFP 2006
|regressions on AMD Zen CPUs |470.lbm regressions on AMD
|after |Zen CPUs after
|r12-897-gde56f95afaaa22 |r12-897-gde56f95afaaa22
Target Milestone|--- |12.0
* [Bug tree-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
` (2 preceding siblings ...)
2021-09-06 6:41 ` [Bug tree-optimization/102178] [12 Regression] " rguenth at gcc dot gnu.org
@ 2021-09-07 2:46 ` luoxhu at gcc dot gnu.org
2021-09-08 14:06 ` jamborm at gcc dot gnu.org
` (35 subsequent siblings)
39 siblings, 0 replies; 43+ messages in thread
From: luoxhu at gcc dot gnu.org @ 2021-09-07 2:46 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
--- Comment #2 from luoxhu at gcc dot gnu.org ---
Verified that 470.lbm doesn't show the regression on Power8 with -Ofast:
runtime is 141 sec for r12-897; without that patch it is 142 sec.
* [Bug tree-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
` (3 preceding siblings ...)
2021-09-07 2:46 ` luoxhu at gcc dot gnu.org
@ 2021-09-08 14:06 ` jamborm at gcc dot gnu.org
2021-09-16 16:17 ` jamborm at gcc dot gnu.org
` (34 subsequent siblings)
39 siblings, 0 replies; 43+ messages in thread
From: jamborm at gcc dot gnu.org @ 2021-09-08 14:06 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
--- Comment #3 from Martin Jambor <jamborm at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #1)
> Martin, maybe you can try moving late sink to before the last phiopt pass.
If you mean the following then unfortunately that has not helped.
diff --git a/gcc/passes.def b/gcc/passes.def
index d7a1f8c97a6..5eb70cd2cd8 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -347,10 +347,10 @@ along with GCC; see the file COPYING3. If not see
/* After late CD DCE we rewrite no longer addressed locals into SSA
form if possible. */
NEXT_PASS (pass_forwprop);
+ NEXT_PASS (pass_sink_code);
NEXT_PASS (pass_phiopt, false /* early_p */);
NEXT_PASS (pass_fold_builtins);
NEXT_PASS (pass_optimize_widening_mul);
- NEXT_PASS (pass_sink_code);
NEXT_PASS (pass_store_merging);
NEXT_PASS (pass_tail_calls);
/* If DCE is not run before checking for uninitialized uses,
...I'll have a very brief look at what is actually happening just so that I
have more reasons to believe this is not a code placement issue again.
* [Bug tree-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
` (4 preceding siblings ...)
2021-09-08 14:06 ` jamborm at gcc dot gnu.org
@ 2021-09-16 16:17 ` jamborm at gcc dot gnu.org
2022-01-20 10:20 ` rguenth at gcc dot gnu.org
` (33 subsequent siblings)
39 siblings, 0 replies; 43+ messages in thread
From: jamborm at gcc dot gnu.org @ 2021-09-16 16:17 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
--- Comment #4 from Martin Jambor <jamborm at gcc dot gnu.org> ---
(In reply to Martin Jambor from comment #3)
> ...I'll have a very brief look at what is actually happening just so that I
> have more reasons to believe this is not a code placement issue again.
The hot function is at the same address when compiled by both
revisions and the newer version looks sufficiently different. I even
tried sprinkling it with nops and it did not help. I am not saying we
are not bumping against some micro-architectural peculiarity, but it
does not seem to be a code placement issue.
* [Bug tree-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
` (5 preceding siblings ...)
2021-09-16 16:17 ` jamborm at gcc dot gnu.org
@ 2022-01-20 10:20 ` rguenth at gcc dot gnu.org
2022-01-26 15:57 ` marxin at gcc dot gnu.org
` (32 subsequent siblings)
39 siblings, 0 replies; 43+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-01-20 10:20 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |rguenth at gcc dot gnu.org
Priority|P3 |P1
Keywords| |missed-optimization
Host|x86_64-linix |
--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
Analysis is missing but the regression persists. On Haswell I do not see any
effect. I do suspect it's about cmov vs. non-cmov, but without a profile and
a look at the affected assembly that's a wild guess.
* [Bug tree-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
` (6 preceding siblings ...)
2022-01-20 10:20 ` rguenth at gcc dot gnu.org
@ 2022-01-26 15:57 ` marxin at gcc dot gnu.org
2022-01-27 7:42 ` [Bug rtl-optimization/102178] " rguenth at gcc dot gnu.org
` (31 subsequent siblings)
39 siblings, 0 replies; 43+ messages in thread
From: marxin at gcc dot gnu.org @ 2022-01-26 15:57 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
--- Comment #6 from Martin Liška <marxin at gcc dot gnu.org> ---
Created attachment 52296
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52296&action=edit
perf annotate before and after the revision
* [Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
` (7 preceding siblings ...)
2022-01-26 15:57 ` marxin at gcc dot gnu.org
@ 2022-01-27 7:42 ` rguenth at gcc dot gnu.org
2022-01-27 7:55 ` rguenth at gcc dot gnu.org
` (30 subsequent siblings)
39 siblings, 0 replies; 43+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-01-27 7:42 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Component|tree-optimization |rtl-optimization
CC| |vmakarov at gcc dot gnu.org
Keywords| |ra
--- Comment #7 from Richard Biener <rguenth at gcc dot gnu.org> ---
I see a lot more GPR <-> XMM moves in the 'after' case:
1035 : 401c8b: vaddsd %xmm1,%xmm0,%xmm0
1953 : 401c8f: vmovq %rcx,%xmm1
305 : 401c94: vaddsd %xmm8,%xmm1,%xmm1
3076 : 401c99: vmovq %xmm0,%r14
590 : 401c9e: vmovq %r11,%xmm0
267 : 401ca3: vmovq %xmm1,%r8
136 : 401ca8: vmovq %rdx,%xmm1
448 : 401cad: vaddsd %xmm1,%xmm0,%xmm1
1703 : 401cb1: vmovq %xmm1,%r9 (*)
834 : 401cb6: vmovq %r8,%xmm1
1719 : 401cbb: vmovq %r9,%xmm0 (*)
2782 : 401cc0: vaddsd %xmm0,%xmm1,%xmm1
22135 : 401cc4: vmovsd %xmm1,%xmm1,%xmm0
1261 : 401cc8: vmovq %r14,%xmm1
646 : 401ccd: vaddsd %xmm0,%xmm1,%xmm0
18136 : 401cd1: vaddsd %xmm2,%xmm5,%xmm1
629 : 401cd5: vmovq %xmm1,%r8
142 : 401cda: vaddsd %xmm6,%xmm3,%xmm1
177 : 401cde: vmovq %xmm0,%r14
288 : 401ce3: vmovq %xmm1,%r9
177 : 401ce8: vmovq %r8,%xmm1
174 : 401ced: vmovq %r9,%xmm0
those look like RA / spilling artifacts, and IIRC I saw Hongtao posting
patches in this area, to regcprop I think. The above is definitely
bad; for example (*) seems to swap %xmm0 and %xmm1 via %r9.
The function is LBM_performStreamCollide; the sinking pass does nothing wrong,
it moves the unconditionally executed
- _948 = _861 + _867;
- _957 = _944 + _948;
- _912 = _861 + _873;
...
- _981 = _853 + _865;
- _989 = _977 + _981;
- _916 = _853 + _857;
- _924 = _912 + _916;
into a conditionally executed block. But that increases register pressure
by 5 FP regs (if I counted correctly) in that area. So this would be the
usual issue of GIMPLE transforms not being register-pressure aware.
-fschedule-insns -fsched-pressure seems to be able to somewhat mitigate this
(though I think EBB scheduling cannot undo such movement).
In postreload I see transforms like
-(insn 466 410 411 7 (set (reg:DF 0 ax [530])
- (mem/u/c:DF (symbol_ref/u:DI ("*.LC10") [flags 0x2]) [0 S8 A64]))
"lbm.c":241:5 141 {*movdf_internal}
- (expr_list:REG_EQUAL (const_double:DF
9.939744999999999830464503247640095651149749755859375e-1
[0x0.fe751ce28ed5fp+0])
- (nil)))
-(insn 411 466 467 7 (set (reg:DF 25 xmm5 [orig:123 prephitmp_643 ] [123])
+(insn 411 410 467 7 (set (reg:DF 25 xmm5 [orig:123 prephitmp_643 ] [123])
(reg:DF 0 ax [530])) "lbm.c":241:5 141 {*movdf_internal}
(nil))
which seems like we could have reloaded %xmm5 from .LC10. But the spilling
to GPRs seems to be present already after LRA and cprop_hardreg doesn't
do anything bad either.
The differences can be seen on trunk with -Ofast -march=znver2
[-fdisable-tree-sink2].
We have X86_TUNE_INTER_UNIT_MOVES_TO_VEC/X86_TUNE_INTER_UNIT_MOVES_FROM_VEC
and the interesting thing is that when I disable them I do see some
spilling to the stack but also quite some re-materialized constants
(loads from .LC* as seem from the opportunity above).
It might be interesting to benchmark with
-mtune-ctrl=^inter_unit_moves_from_vec,^inter_unit_moves_to_vec and to find a
way to arrange the costs so that IRA/LRA prefers re-materialization of
constants from the constant pool over spilling to GPRs (if that's possible at
all - Vlad?)
* [Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
` (8 preceding siblings ...)
2022-01-27 7:42 ` [Bug rtl-optimization/102178] " rguenth at gcc dot gnu.org
@ 2022-01-27 7:55 ` rguenth at gcc dot gnu.org
2022-01-27 8:13 ` crazylht at gmail dot com
` (29 subsequent siblings)
39 siblings, 0 replies; 43+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-01-27 7:55 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |hubicka at gcc dot gnu.org
--- Comment #8 from Richard Biener <rguenth at gcc dot gnu.org> ---
So w/ -Ofast -march=znver2 I get a runtime of 130 seconds, when I add
-mtune-ctrl=^inter_unit_moves_from_vec,^inter_unit_moves_to_vec then
this improves to 114 seconds, with sink2 disabled I get 108 seconds
and with the tune-ctrl on top I get 113 seconds.
Note that Zen2 is quite special in that it has the ability to handle
load/store from the stack by mapping it to a register, effectively
making them zero latency (zen3 lost this ability).
So while moves between GPRs and XMM might not be bad anymore _spilling_
to a GPR (and I suppose XMM, too) is still a bad idea and the stack
should be preferred.
Not sure if it's possible to do that though.
Doing the same experiment as above on a Zen3 machine would be nice, too.
* [Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
` (9 preceding siblings ...)
2022-01-27 7:55 ` rguenth at gcc dot gnu.org
@ 2022-01-27 8:13 ` crazylht at gmail dot com
2022-01-27 8:18 ` crazylht at gmail dot com
` (28 subsequent siblings)
39 siblings, 0 replies; 43+ messages in thread
From: crazylht at gmail dot com @ 2022-01-27 8:13 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
--- Comment #9 from Hongtao.liu <crazylht at gmail dot com> ---
1703 : 401cb1: vmovq %xmm1,%r9 (*)
834 : 401cb6: vmovq %r8,%xmm1
1719 : 401cbb: vmovq %r9,%xmm0 (*)
Looks like %r9 is dead after the second (*), and it can be optimized to
1703 : 401cb1: vmovq %xmm1,%xmm0
834 : 401cb6: vmovq %r8,%xmm1
* [Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
` (10 preceding siblings ...)
2022-01-27 8:13 ` crazylht at gmail dot com
@ 2022-01-27 8:18 ` crazylht at gmail dot com
2022-01-27 8:20 ` rguenth at gcc dot gnu.org
` (27 subsequent siblings)
39 siblings, 0 replies; 43+ messages in thread
From: crazylht at gmail dot com @ 2022-01-27 8:18 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
--- Comment #10 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Richard Biener from comment #8)
> So w/ -Ofast -march=znver2 I get a runtime of 130 seconds, when I add
> -mtune-ctrl=^inter_unit_moves_from_vec,^inter_unit_moves_to_vec then
> this improves to 114 seconds, with sink2 disabled I get 108 seconds
> and with the tune-ctrl ontop I get 113 seconds.
>
> Note that Zen2 is quite special in that it has the ability to handle
> load/store from the stack by mapping it to a register, effectively
> making them zero latency (zen3 lost this ability).
>
> So while moves between GPRs and XMM might not be bad anymore _spilling_
> to a GPR (and I suppose XMM, too) is still a bad idea and the stack
> should be preferred.
>
According to znver2_cost, the cost of sse_to_integer is a little bit less than
fp_store; maybe increasing the sse_to_integer cost (to more than fp_store)
can help the RA to choose memory instead of a GPR.
* [Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
` (11 preceding siblings ...)
2022-01-27 8:18 ` crazylht at gmail dot com
@ 2022-01-27 8:20 ` rguenth at gcc dot gnu.org
2022-01-27 9:34 ` rguenth at gcc dot gnu.org
` (26 subsequent siblings)
39 siblings, 0 replies; 43+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-01-27 8:20 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
--- Comment #11 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #8)
> So w/ -Ofast -march=znver2 I get a runtime of 130 seconds, when I add
> -mtune-ctrl=^inter_unit_moves_from_vec,^inter_unit_moves_to_vec then
> this improves to 114 seconds, with sink2 disabled I get 108 seconds
> and with the tune-ctrl ontop I get 113 seconds.
With -Ofast -march=znver2 -fschedule-insns -fsched-pressure I get 113 seconds.
* [Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
` (12 preceding siblings ...)
2022-01-27 8:20 ` rguenth at gcc dot gnu.org
@ 2022-01-27 9:34 ` rguenth at gcc dot gnu.org
2022-01-27 9:55 ` Jan Hubicka
2022-01-27 9:55 ` hubicka at kam dot mff.cuni.cz
` (25 subsequent siblings)
39 siblings, 1 reply; 43+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-01-27 9:34 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
--- Comment #12 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Hongtao.liu from comment #10)
> (In reply to Richard Biener from comment #8)
> > So w/ -Ofast -march=znver2 I get a runtime of 130 seconds, when I add
> > -mtune-ctrl=^inter_unit_moves_from_vec,^inter_unit_moves_to_vec then
> > this improves to 114 seconds, with sink2 disabled I get 108 seconds
> > and with the tune-ctrl ontop I get 113 seconds.
> >
> > Note that Zen2 is quite special in that it has the ability to handle
> > load/store from the stack by mapping it to a register, effectively
> > making them zero latency (zen3 lost this ability).
> >
> > So while moves between GPRs and XMM might not be bad anymore _spilling_
> > to a GPR (and I suppose XMM, too) is still a bad idea and the stack
> > should be preferred.
> >
>
> According to znver2_cost
>
> Cost of sse_to_integer is a little bit less than fp_store, maybe increase
> sse_to_integer cost(more than fp_store) can helps RA to choose memory
> instead of GPR.
That sounds reasonable - GPR<->xmm is cheaper than GPR -> stack -> xmm
but GPR<->xmm should be more expensive than GPR/xmm<->stack. As said above
Zen2 can do reg -> mem, mem -> reg via renaming if 'mem' is somewhat special,
but modeling that doesn't seem to be necessary.
We seem to have store costs of 8 and load costs of 6; I'll try bumping the
gpr<->xmm move cost to 8.
* Re: [Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2022-01-27 9:34 ` rguenth at gcc dot gnu.org
@ 2022-01-27 9:55 ` Jan Hubicka
0 siblings, 0 replies; 43+ messages in thread
From: Jan Hubicka @ 2022-01-27 9:55 UTC (permalink / raw)
To: rguenth at gcc dot gnu.org; +Cc: gcc-bugs
> > According to znver2_cost
> >
> > Cost of sse_to_integer is a little bit less than fp_store, maybe increase
> > sse_to_integer cost(more than fp_store) can helps RA to choose memory
> > instead of GPR.
>
> That sounds reasonable - GPR<->xmm is cheaper than GPR -> stack -> xmm
> but GPR<->xmm should be more expensive than GPR/xmm<->stack. As said above
> Zen2 can do reg -> mem, mem -> reg via renaming if 'mem' is somewhat special,
> but modeling that doesn't seem to be necessary.
>
> We seem to have store costs of 8 and load costs of 6, I'll try bumping the
> gpr<->xmm move cost to 8.
I was simply following latencies here, so indeed the reg<->mem bypass is not
really modelled. I recall doing a few experiments which were kind of
inconclusive.
* [Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
` (13 preceding siblings ...)
2022-01-27 9:34 ` rguenth at gcc dot gnu.org
@ 2022-01-27 9:55 ` hubicka at kam dot mff.cuni.cz
2022-01-27 10:13 ` rguenth at gcc dot gnu.org
` (24 subsequent siblings)
39 siblings, 0 replies; 43+ messages in thread
From: hubicka at kam dot mff.cuni.cz @ 2022-01-27 9:55 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
--- Comment #13 from hubicka at kam dot mff.cuni.cz ---
> > According to znver2_cost
> >
> > Cost of sse_to_integer is a little bit less than fp_store, maybe increase
> > sse_to_integer cost(more than fp_store) can helps RA to choose memory
> > instead of GPR.
>
> That sounds reasonable - GPR<->xmm is cheaper than GPR -> stack -> xmm
> but GPR<->xmm should be more expensive than GPR/xmm<->stack. As said above
> Zen2 can do reg -> mem, mem -> reg via renaming if 'mem' is somewhat special,
> but modeling that doesn't seem to be necessary.
>
> We seem to have store costs of 8 and load costs of 6, I'll try bumping the
> gpr<->xmm move cost to 8.
I was simply following latencies here, so indeed the reg<->mem bypass is not
really modelled. I recall doing a few experiments which were kind of
inconclusive.
* [Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
` (14 preceding siblings ...)
2022-01-27 9:55 ` hubicka at kam dot mff.cuni.cz
@ 2022-01-27 10:13 ` rguenth at gcc dot gnu.org
2022-01-27 10:14 ` rguenth at gcc dot gnu.org
` (23 subsequent siblings)
39 siblings, 0 replies; 43+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-01-27 10:13 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
--- Comment #14 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Hongtao.liu from comment #9)
> 1703 : 401cb1: vmovq %xmm1,%r9 (*)
> 834 : 401cb6: vmovq %r8,%xmm1
> 1719 : 401cbb: vmovq %r9,%xmm0 (*)
>
> Look like %r9 is dead after the second (*), and it can be optimized to
>
> 1703 : 401cb1: vmovq %xmm1,%xmm0
> 834 : 401cb6: vmovq %r8,%xmm1
Yep, we also have code like
- movabsq $0x3ff03db8fde2ef4e, %r8
...
- vmovq %r8, %xmm11
or
movq .LC11(%rip), %rax
vmovq %rax, %xmm14
which is extremely odd to see ... (I didn't check how we arrive at that)
When I do
diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
index 017ffa69958..4c51358d7b6 100644
--- a/gcc/config/i386/x86-tune-costs.h
+++ b/gcc/config/i386/x86-tune-costs.h
@@ -1585,7 +1585,7 @@ struct processor_costs znver2_cost = {
in 32,64,128,256 and 512-bit. */
{8, 8, 8, 8, 16}, /* cost of storing SSE registers
in 32,64,128,256 and 512-bit. */
- 6, 6, /* SSE->integer and integer->SSE moves. */
+ 8, 8, /* SSE->integer and integer->SSE moves. */
8, 8, /* mask->integer and integer->mask moves */
{6, 6, 6}, /* cost of loading mask register
performance improves from 128 seconds to 115 seconds. The result is
a lot more stack spilling in the code but there are still cases like
movq .LC8(%rip), %rax
vmovsd .LC13(%rip), %xmm6
vmovsd .LC16(%rip), %xmm11
vmovsd .LC14(%rip), %xmm3
vmovsd .LC12(%rip), %xmm14
vmovq %rax, %xmm2
vmovq %rax, %xmm0
movq .LC9(%rip), %rax
see how we load .LC8 into %rax just to move it into xmm2 and xmm0, instead
of at least moving xmm2 to xmm0 (maybe that's for cprop_hardreg now) or
loading directly into xmm0.
In the end register pressure is the main issue but how we deal with it
is bad. It's likely caused by a combination of PRE & hoisting & sinking
which together exploit
if( ((*((unsigned int*) ((void*) (&((((srcGrid)[((FLAGS)+N_CELL_ENTRIES*((0)+
(0)*(1*(100))+(0)*(1*(100))*(1*(100))))+(i)]))))))) & (ACCEL))) {
ux = 0.005;
uy = 0.002;
uz = 0.000;
}
which makes the following computations partly compile-time resolvable.
I still think the above code generation issues need to be analyzed and
we should figure out why we emit this weird code under register pressure.
I'll attach a testcase that has the function in question split out for
easier analysis.
* [Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
` (15 preceding siblings ...)
2022-01-27 10:13 ` rguenth at gcc dot gnu.org
@ 2022-01-27 10:14 ` rguenth at gcc dot gnu.org
2022-01-27 10:23 ` hubicka at kam dot mff.cuni.cz
` (22 subsequent siblings)
39 siblings, 0 replies; 43+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-01-27 10:14 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
--- Comment #15 from Richard Biener <rguenth at gcc dot gnu.org> ---
Created attachment 52300
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52300&action=edit
LBM_performStreamCollide testcase
This is the relevant function.
* [Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
` (16 preceding siblings ...)
2022-01-27 10:14 ` rguenth at gcc dot gnu.org
@ 2022-01-27 10:23 ` hubicka at kam dot mff.cuni.cz
2022-01-27 10:32 ` rguenth at gcc dot gnu.org
` (21 subsequent siblings)
39 siblings, 0 replies; 43+ messages in thread
From: hubicka at kam dot mff.cuni.cz @ 2022-01-27 10:23 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
--- Comment #16 from hubicka at kam dot mff.cuni.cz ---
>
> Yep, we also have code like
>
> - movabsq $0x3ff03db8fde2ef4e, %r8
> ...
> - vmovq %r8, %xmm11
It is loading a random constant to xmm11. Since reg<->xmm moves are
relatively cheap it looks OK to me that we generate this. Is it faster
to load the constant from memory?
> movq .LC11(%rip), %rax
> vmovq %rax, %xmm14
This is odd indeed, and even more odd that we use both a movabs and a memory
load... The i386 FE plays some games with allowing some constants in SSE
instructions (to allow simplification and combining) and splits them out
to memory later. It may be a consequence of this.
* [Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
` (17 preceding siblings ...)
2022-01-27 10:23 ` hubicka at kam dot mff.cuni.cz
@ 2022-01-27 10:32 ` rguenth at gcc dot gnu.org
2022-01-27 11:18 ` rguenth at gcc dot gnu.org
` (20 subsequent siblings)
39 siblings, 0 replies; 43+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-01-27 10:32 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
--- Comment #17 from Richard Biener <rguenth at gcc dot gnu.org> ---
So in .reload we have (with unpatched trunk)
401: NOTE_INSN_BASIC_BLOCK 6
462: ax:DF=[`*.LC0']
REG_EQUAL 9.850689999999999724167309977929107844829559326171875e-1
407: xmm2:DF=ax:DF
463: ax:DF=[`*.LC0']
REG_EQUAL 9.850689999999999724167309977929107844829559326171875e-1
408: xmm4:DF=ax:DF
why??! We can load .LC0 into xmm4 directly. IRA sees
401: NOTE_INSN_BASIC_BLOCK 6
407: r118:DF=r482:DF
408: r119:DF=r482:DF
now I cannot really decipher IRA or LRA dumps but my guess would be that
inheritance (causing us to load from LC0) interferes badly with register
class assignment?
Changing pseudo 482 in operand 1 of insn 407 on equiv
9.850689999999999724167309977929107844829559326171875e-1
...
alt=21,overall=9,losers=1,rld_nregs=1
Choosing alt 21 in insn 407: (0) v (1) r {*movdf_internal}
Creating newreg=525, assigning class GENERAL_REGS to r525
407: r118:DF=r525:DF
Inserting insn reload before:
462: r525:DF=[`*.LC0']
REG_EQUAL 9.850689999999999724167309977929107844829559326171875e-1
we should have preferred alt 14 I think (0) v (1) m, but that has
alt=14,overall=13,losers=1,rld_nregs=0
0 Spill pseudo into memory: reject+=3
Using memory insn operand 0: reject+=3
0 Non input pseudo reload: reject++
1 Non-pseudo reload: reject+=2
1 Non input pseudo reload: reject++
alt=15,overall=28,losers=3 -- refuse
0 Costly set: reject++
alt=16: Bad operand -- refuse
0 Costly set: reject++
1 Costly loser: reject++
1 Non-pseudo reload: reject+=2
1 Non input pseudo reload: reject++
alt=17,overall=17,losers=2 -- refuse
0 Costly set: reject++
1 Spill Non-pseudo into memory: reject+=3
Using memory insn operand 1: reject+=3
1 Non input pseudo reload: reject++
alt=18,overall=14,losers=1 -- refuse
0 Spill pseudo into memory: reject+=3
Using memory insn operand 0: reject+=3
0 Non input pseudo reload: reject++
1 Costly loser: reject++
1 Non-pseudo reload: reject+=2
1 Non input pseudo reload: reject++
alt=19,overall=29,losers=3 -- refuse
0 Non-prefered reload: reject+=600
0 Non input pseudo reload: reject++
alt=20,overall=607,losers=1 -- refuse
1 Non-pseudo reload: reject+=2
1 Non input pseudo reload: reject++
I can't fully decipher the reasoning, but it doesn't seem to anticipate
the cost of reloading the GPR in the alternative it chooses.
Vlad?
^ permalink raw reply [flat|nested] 43+ messages in thread
* [Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
` (18 preceding siblings ...)
2022-01-27 10:32 ` rguenth at gcc dot gnu.org
@ 2022-01-27 11:18 ` rguenth at gcc dot gnu.org
2022-01-27 11:30 ` rguenther at suse dot de
` (19 subsequent siblings)
39 siblings, 0 replies; 43+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-01-27 11:18 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
--- Comment #18 from Richard Biener <rguenth at gcc dot gnu.org> ---
For the case of LBM, what also helps is disabling PRE or using PGO (which
sees the useless PRE), given that the path on which the expressions become
partially compile-time computable is never taken at runtime. In theory we
could isolate that path completely via -ftracer, but the pass pipeline
setup doesn't look optimal.
^ permalink raw reply [flat|nested] 43+ messages in thread
* [Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
` (19 preceding siblings ...)
2022-01-27 11:18 ` rguenth at gcc dot gnu.org
@ 2022-01-27 11:30 ` rguenther at suse dot de
2022-01-27 11:33 ` rguenther at suse dot de
` (18 subsequent siblings)
39 siblings, 0 replies; 43+ messages in thread
From: rguenther at suse dot de @ 2022-01-27 11:30 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
--- Comment #19 from rguenther at suse dot de <rguenther at suse dot de> ---
On Thu, 27 Jan 2022, hubicka at kam dot mff.cuni.cz wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
>
> --- Comment #13 from hubicka at kam dot mff.cuni.cz ---
> > > According to znver2_cost
> > >
> > > Cost of sse_to_integer is a little bit less than fp_store, maybe increase
> > > sse_to_integer cost(more than fp_store) can helps RA to choose memory
> > > instead of GPR.
> >
> > That sounds reasonable - GPR<->xmm is cheaper than GPR -> stack -> xmm
> > but GPR<->xmm should be more expensive than GPR/xmm<->stack. As said above
> > Zen2 can do reg -> mem, mem -> reg via renaming if 'mem' is somewhat special,
> > but modeling that doesn't seem to be necessary.
> >
> > We seem to have store costs of 8 and load costs of 6, I'll try bumping the
> > gpr<->xmm move cost to 8.
>
> I was simply following latencies here, so indeed reg<->mem bypass is not
> really modelled. I recall doing few experiments which was kind of
> inconclusive.
Yes, I think xmm->gpr->xmm vs. xmm->mem->xmm isn't really the issue here;
it's mem->gpr->xmm vs. mem->xmm with all the constant pool remats.
Agner lists a latency of 3 for gpr<->xmm and a latency of 4 for mem<->xmm,
but then there's store forwarding (and the clever renaming trick), which
likely makes xmm->mem->xmm cheaper than 4 + 4, while xmm->gpr->xmm will
really be 3 + 3 latency. gpr<->xmm moves also seem to be more resource
constrained.
In any case, for moving xmm to gpr it doesn't make sense to go through
memory, but it doesn't seem worthwhile to spill to xmm or gpr when we only
use gpr / xmm later.
This leaves aside the odd and bogus code we generate for the .LC0
rematerialization, which we should fix; fixing it will likely fix LBM.
^ permalink raw reply [flat|nested] 43+ messages in thread
* [Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
` (20 preceding siblings ...)
2022-01-27 11:30 ` rguenther at suse dot de
@ 2022-01-27 11:33 ` rguenther at suse dot de
2022-01-27 12:04 ` Jan Hubicka
2022-01-27 12:04 ` hubicka at kam dot mff.cuni.cz
` (17 subsequent siblings)
39 siblings, 1 reply; 43+ messages in thread
From: rguenther at suse dot de @ 2022-01-27 11:33 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
--- Comment #20 from rguenther at suse dot de <rguenther at suse dot de> ---
On Thu, 27 Jan 2022, hubicka at kam dot mff.cuni.cz wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
>
> --- Comment #16 from hubicka at kam dot mff.cuni.cz ---
> >
> > Yep, we also have code like
> >
> > - movabsq $0x3ff03db8fde2ef4e, %r8
> > ...
> > - vmovq %r8, %xmm11
>
> It is loading a random constant into xmm11. Since reg<->xmm moves are
> relatively cheap, it looks OK to me that we generate this. Is it faster
> to load the constant from memory?
I would say so. It saves code size and also uop space, unless the two
can magically fuse into an immediate-to-%xmm move (I doubt that).
> > movq .LC11(%rip), %rax
> > vmovq %rax, %xmm14
> This is odd indeed, and it is even odder that we do both a movabs and a
> memory load...
> The i386 FE plays some games with allowing some constants in SSE
> instructions (to enable simplification and combining) and splits them out
> to memory later. It may be a consequence of this.
I've pasted the LRA dump pieces I think are relevant, but I don't
understand them. The constant load isn't visible originally but is
introduced by LRA, so that may be the key to the mystery here.
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2022-01-27 11:33 ` rguenther at suse dot de
@ 2022-01-27 12:04 ` Jan Hubicka
0 siblings, 0 replies; 43+ messages in thread
From: Jan Hubicka @ 2022-01-27 12:04 UTC (permalink / raw)
To: rguenther at suse dot de; +Cc: gcc-bugs
> I would say so. It saves code size and also uop space unless the two
> can magically fuse to a immediate to %xmm move (I doubt that).
I made a simple benchmark:
double a=10;
int
main()
{
long int i;
double sum,val1,val2,val3,val4;
for (i=0;i<1000000000;i++)
{
#if 1
#if 1
asm __volatile__("movabsq $0x3ff03db8fde2ef4e, %%r8;vmovq %%r8, %0": "=x"(val1): :"r8","xmm11");
asm __volatile__("movabsq $0x3ff03db8fde2ef4e, %%r8;vmovq %%r8, %0": "=x"(val2): :"r8","xmm11");
asm __volatile__("movabsq $0x3ff03db8fde2ef4e, %%r8;vmovq %%r8, %0": "=x"(val3): :"r8","xmm11");
asm __volatile__("movabsq $0x3ff03db8fde2ef4e, %%r8;vmovq %%r8, %0": "=x"(val4): :"r8","xmm11");
#else
asm __volatile__("movq %1, %%r8;vmovq %%r8, %0": "=x"(val1):"m"(a) :"r8","xmm11");
asm __volatile__("movq %1, %%r8;vmovq %%r8, %0": "=x"(val2):"m"(a) :"r8","xmm11");
asm __volatile__("movq %1, %%r8;vmovq %%r8, %0": "=x"(val3):"m"(a) :"r8","xmm11");
asm __volatile__("movq %1, %%r8;vmovq %%r8, %0": "=x"(val4):"m"(a) :"r8","xmm11");
#endif
#else
asm __volatile__("vmovq %1, %0": "=x"(val1):"m"(a) :"r8","xmm11");
asm __volatile__("vmovq %1, %0": "=x"(val2):"m"(a) :"r8","xmm11");
asm __volatile__("vmovq %1, %0": "=x"(val3):"m"(a) :"r8","xmm11");
asm __volatile__("vmovq %1, %0": "=x"(val4):"m"(a) :"r8","xmm11");
#endif
sum+=val1+val2+val3+val4;
}
return sum;
}
and indeed the third variant runs in 1.2s while the first two both take
2.4s on my zen2 laptop.
^ permalink raw reply [flat|nested] 43+ messages in thread
* [Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
` (21 preceding siblings ...)
2022-01-27 11:33 ` rguenther at suse dot de
@ 2022-01-27 12:04 ` hubicka at kam dot mff.cuni.cz
2022-01-27 13:42 ` hjl.tools at gmail dot com
` (16 subsequent siblings)
39 siblings, 0 replies; 43+ messages in thread
From: hubicka at kam dot mff.cuni.cz @ 2022-01-27 12:04 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
--- Comment #21 from hubicka at kam dot mff.cuni.cz ---
> I would say so. It saves code size and also uop space, unless the two
> can magically fuse into an immediate-to-%xmm move (I doubt that).
I made a simple benchmark:
double a=10;
int
main()
{
long int i;
double sum,val1,val2,val3,val4;
for (i=0;i<1000000000;i++)
{
#if 1
#if 1
asm __volatile__("movabsq $0x3ff03db8fde2ef4e, %%r8;vmovq %%r8, %0": "=x"(val1): :"r8","xmm11");
asm __volatile__("movabsq $0x3ff03db8fde2ef4e, %%r8;vmovq %%r8, %0": "=x"(val2): :"r8","xmm11");
asm __volatile__("movabsq $0x3ff03db8fde2ef4e, %%r8;vmovq %%r8, %0": "=x"(val3): :"r8","xmm11");
asm __volatile__("movabsq $0x3ff03db8fde2ef4e, %%r8;vmovq %%r8, %0": "=x"(val4): :"r8","xmm11");
#else
asm __volatile__("movq %1, %%r8;vmovq %%r8, %0": "=x"(val1):"m"(a) :"r8","xmm11");
asm __volatile__("movq %1, %%r8;vmovq %%r8, %0": "=x"(val2):"m"(a) :"r8","xmm11");
asm __volatile__("movq %1, %%r8;vmovq %%r8, %0": "=x"(val3):"m"(a) :"r8","xmm11");
asm __volatile__("movq %1, %%r8;vmovq %%r8, %0": "=x"(val4):"m"(a) :"r8","xmm11");
#endif
#else
asm __volatile__("vmovq %1, %0": "=x"(val1):"m"(a) :"r8","xmm11");
asm __volatile__("vmovq %1, %0": "=x"(val2):"m"(a) :"r8","xmm11");
asm __volatile__("vmovq %1, %0": "=x"(val3):"m"(a) :"r8","xmm11");
asm __volatile__("vmovq %1, %0": "=x"(val4):"m"(a) :"r8","xmm11");
#endif
sum+=val1+val2+val3+val4;
}
return sum;
}
and indeed the third variant runs in 1.2s while the first two both take
2.4s on my zen2 laptop.
^ permalink raw reply [flat|nested] 43+ messages in thread
* [Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
` (22 preceding siblings ...)
2022-01-27 12:04 ` hubicka at kam dot mff.cuni.cz
@ 2022-01-27 13:42 ` hjl.tools at gmail dot com
2022-01-27 14:24 ` rguenth at gcc dot gnu.org
` (15 subsequent siblings)
39 siblings, 0 replies; 43+ messages in thread
From: hjl.tools at gmail dot com @ 2022-01-27 13:42 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
H.J. Lu <hjl.tools at gmail dot com> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |WAITING
--- Comment #22 from H.J. Lu <hjl.tools at gmail dot com> ---
Is this related to PR 104059? Can you try:
https://gcc.gnu.org/pipermail/gcc-patches/2022-January/589209.html
^ permalink raw reply [flat|nested] 43+ messages in thread
* [Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
` (23 preceding siblings ...)
2022-01-27 13:42 ` hjl.tools at gmail dot com
@ 2022-01-27 14:24 ` rguenth at gcc dot gnu.org
2022-01-27 16:28 ` crazylht at gmail dot com
` (14 subsequent siblings)
39 siblings, 0 replies; 43+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-01-27 14:24 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|WAITING |NEW
--- Comment #23 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to H.J. Lu from comment #22)
> Is this related to PR 104059? Can you try:
>
> https://gcc.gnu.org/pipermail/gcc-patches/2022-January/589209.html
I verified that hardreg cprop does nothing on the testcase, so no.
^ permalink raw reply [flat|nested] 43+ messages in thread
* [Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
` (24 preceding siblings ...)
2022-01-27 14:24 ` rguenth at gcc dot gnu.org
@ 2022-01-27 16:28 ` crazylht at gmail dot com
2022-01-27 16:36 ` crazylht at gmail dot com
` (13 subsequent siblings)
39 siblings, 0 replies; 43+ messages in thread
From: crazylht at gmail dot com @ 2022-01-27 16:28 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
--- Comment #24 from Hongtao.liu <crazylht at gmail dot com> ---
for
vmovq %rdi, %xmm7 # 503 [c=4 l=4] *movdf_internal/21
..
vmulsd %xmm7, %xmm4, %xmm5 # 320 [c=12 l=4] *fop_df_comm/2
..
movabsq $0x3fef85af6c69b5a6, %rdi # 409 [c=5 l=10]
*movdf_internal/11
and
7806(insn 320 319 322 8 (set (reg:DF 441)
7807 (mult:DF (reg:DF 166 [ _323 ])
7808 (reg:DF 249 [ _900 ]))) "../test.c":87:218 1072 {*fop_df_comm}
7809 (expr_list:REG_DEAD (reg:DF 249 [ _900 ])
7810 (nil)))
RA allocates rdi for r249 because the cost of a general reg is cheaper than mem.
a66(r249,l1) costs: AREG:5964,5964 DREG:5964,5964 CREG:5964,5964
BREG:5964,5964 SIREG:5964,5964 DIREG:5964,5964 AD_REGS:5964,5964
CLOBBERED_REGS:5964,5964 Q_REGS:5964,5964 NON_Q_REGS:5964,5964
TLS_GOTBASE_REGS:5964,5964 GENERAL_REGS:5964,5964 FP_TOP_REG:19546,19546
FP_SECOND_REG:19546,19546 FLOAT_REGS:19546,19546 SSE_FIRST_REG:0,0
NO_REX_SSE_REGS:0,0 SSE_REGS:0,0 FLOAT_SSE_REGS:19546,19546
FLOAT_INT_REGS:19546,19546 INT_SSE_REGS:19546,19546
FLOAT_INT_SSE_REGS:19546,19546 MEM:6294,6294
950 r249: preferred SSE_REGS, alternative GENERAL_REGS, allocno
INT_SSE_REGS
Disposition:
66:r249 l1 5
With -mtune=alderlake, the cost of general regs for r249 is more expensive
than mem, so RA allocates memory for it, and then no movabsq/vmovq is needed.
655 a66(r249,l1) costs: AREG:5964,5964 DREG:5964,5964 CREG:5964,5964
BREG:5964,5964 SIREG:5964,5964 DIREG:5964,5964 AD_REGS:5964,5964
CLOBBERED_REGS:5964,5964 Q_REGS:5964,5964 NO\
N_Q_REGS:5964,5964 TLS_GOTBASE_REGS:5964,5964 GENERAL_REGS:5964,5964
FP_TOP_REG:14908,14908 FP_SECOND_REG:14908,14908 FLOAT_REGS:14908,14908
SSE_FIRST_REG:0,0 NO_REX_SSE_REGS:0\
,0 SSE_REGS:0,0 FLOAT_SSE_REGS:14908,14908 FLOAT_INT_REGS:14908,14908
INT_SSE_REGS:14908,14908 FLOAT_INT_SSE_REGS:14908,14908 MEM:5632,5632
950 r249: preferred SSE_REGS, alternative NO_REGS, allocno SSE_REGS
66:r249 l1 mem
vmulsd -80(%rsp), %xmm2, %xmm3 # 320 [c=29 l=6] *fop_df_comm/2
I guess we need to let the RA know that mem cost is cheaper than a GPR for r249.
^ permalink raw reply [flat|nested] 43+ messages in thread
* [Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
` (25 preceding siblings ...)
2022-01-27 16:28 ` crazylht at gmail dot com
@ 2022-01-27 16:36 ` crazylht at gmail dot com
2022-01-28 15:48 ` vmakarov at gcc dot gnu.org
` (12 subsequent siblings)
39 siblings, 0 replies; 43+ messages in thread
From: crazylht at gmail dot com @ 2022-01-27 16:36 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
--- Comment #25 from Hongtao.liu <crazylht at gmail dot com> ---
> Guess we need to let RA know mem cost is cheaper than GPR for r249.
Reduce sse_store cost?
^ permalink raw reply [flat|nested] 43+ messages in thread
* [Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
` (26 preceding siblings ...)
2022-01-27 16:36 ` crazylht at gmail dot com
@ 2022-01-28 15:48 ` vmakarov at gcc dot gnu.org
2022-01-28 16:02 ` vmakarov at gcc dot gnu.org
` (11 subsequent siblings)
39 siblings, 0 replies; 43+ messages in thread
From: vmakarov at gcc dot gnu.org @ 2022-01-28 15:48 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
--- Comment #26 from Vladimir Makarov <vmakarov at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #7)
> make costs in a way that IRA/LRA prefer re-materialization of constants
> from the constant pool over spilling to GPRs (if that's possible at all -
> Vlad?)
LRA rematerialization cannot rematerialize a constant value from the
constant pool. It can only rematerialize the value of an expression
consisting of other pseudos (currently assigned to hard regs) and
constants.
I guess the rematerialization pass could be extended to work for constants
from the constant pool. It is a pretty doable project, as opposed to
rematerialization of arbitrary memory, which would require a lot of
analysis (including aliasing) and complicated cost-benefit calculation.
Maybe somebody could pick this project up.
^ permalink raw reply [flat|nested] 43+ messages in thread
* [Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
` (27 preceding siblings ...)
2022-01-28 15:48 ` vmakarov at gcc dot gnu.org
@ 2022-01-28 16:02 ` vmakarov at gcc dot gnu.org
2022-02-09 15:51 ` vmakarov at gcc dot gnu.org
` (10 subsequent siblings)
39 siblings, 0 replies; 43+ messages in thread
From: vmakarov at gcc dot gnu.org @ 2022-01-28 16:02 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
--- Comment #27 from Vladimir Makarov <vmakarov at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #17)
> So in .reload we have (with unpatched trunk)
>
> 401: NOTE_INSN_BASIC_BLOCK 6
> 462: ax:DF=[`*.LC0']
> REG_EQUAL 9.850689999999999724167309977929107844829559326171875e-1
> 407: xmm2:DF=ax:DF
> 463: ax:DF=[`*.LC0']
> REG_EQUAL 9.850689999999999724167309977929107844829559326171875e-1
> 408: xmm4:DF=ax:DF
>
> why??! We can load .LC0 into xmm4 directly. IRA sees
>
> 401: NOTE_INSN_BASIC_BLOCK 6
> 407: r118:DF=r482:DF
> 408: r119:DF=r482:DF
>
> now I cannot really decipher IRA or LRA dumps but my guess would be that
> inheritance (causing us to load from LC0) interferes badly with register
> class assignment?
>
> Changing pseudo 482 in operand 1 of insn 407 on equiv
> 9.850689999999999724167309977929107844829559326171875e-1
> ...
> alt=21,overall=9,losers=1,rld_nregs=1
> Choosing alt 21 in insn 407: (0) v (1) r {*movdf_internal}
> Creating newreg=525, assigning class GENERAL_REGS to r525
> 407: r118:DF=r525:DF
> Inserting insn reload before:
> 462: r525:DF=[`*.LC0']
> REG_EQUAL 9.850689999999999724167309977929107844829559326171875e-1
>
> we should have preferred alt 14 I think (0) v (1) m, but that has
>
> alt=14,overall=13,losers=1,rld_nregs=0
> 0 Spill pseudo into memory: reject+=3
> Using memory insn operand 0: reject+=3
> 0 Non input pseudo reload: reject++
> 1 Non-pseudo reload: reject+=2
> 1 Non input pseudo reload: reject++
> alt=15,overall=28,losers=3 -- refuse
> 0 Costly set: reject++
> alt=16: Bad operand -- refuse
> 0 Costly set: reject++
> 1 Costly loser: reject++
> 1 Non-pseudo reload: reject+=2
> 1 Non input pseudo reload: reject++
> alt=17,overall=17,losers=2 -- refuse
> 0 Costly set: reject++
> 1 Spill Non-pseudo into memory: reject+=3
> Using memory insn operand 1: reject+=3
> 1 Non input pseudo reload: reject++
> alt=18,overall=14,losers=1 -- refuse
> 0 Spill pseudo into memory: reject+=3
> Using memory insn operand 0: reject+=3
> 0 Non input pseudo reload: reject++
> 1 Costly loser: reject++
> 1 Non-pseudo reload: reject+=2
> 1 Non input pseudo reload: reject++
> alt=19,overall=29,losers=3 -- refuse
> 0 Non-prefered reload: reject+=600
> 0 Non input pseudo reload: reject++
> alt=20,overall=607,losers=1 -- refuse
> 1 Non-pseudo reload: reject+=2
> 1 Non input pseudo reload: reject++
>
> I can't fully decipher the reasoning, but it doesn't seem to anticipate
> the cost of reloading the GPR in the alternative it chooses.
>
> Vlad?
All these diagnostics are just a description of the voodoo from the old
reload pass. LRA chooses alternatives the same way as the old reload pass
(I doubt any other approach would avoid breaking all existing targets);
the old reload pass simply does not report its decisions in the dump.
The LRA code choosing the insn alternatives
(lra-constraints.cc::process_alt_operands), like the old reload pass, does
not use any memory or register move costs. Instead, the alternative is
chosen by heuristics and insn constraint hints (like ? and !). The only
case where these costs are used is when we have reg:=reg and the register
move cost is 2; in that case LRA (like reload) does not bother to check
the insn constraints.
^ permalink raw reply [flat|nested] 43+ messages in thread
* [Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
` (28 preceding siblings ...)
2022-01-28 16:02 ` vmakarov at gcc dot gnu.org
@ 2022-02-09 15:51 ` vmakarov at gcc dot gnu.org
2022-02-10 7:45 ` rguenth at gcc dot gnu.org
` (9 subsequent siblings)
39 siblings, 0 replies; 43+ messages in thread
From: vmakarov at gcc dot gnu.org @ 2022-02-09 15:51 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
--- Comment #28 from Vladimir Makarov <vmakarov at gcc dot gnu.org> ---
Could somebody benchmark the following patch on zen2 470.lbm.
diff --git a/gcc/lra-constraints.cc b/gcc/lra-constraints.cc
index 9cee17479ba..76619aca8eb 100644
--- a/gcc/lra-constraints.cc
+++ b/gcc/lra-constraints.cc
@@ -5084,7 +5089,9 @@ lra_constraints (bool first_p)
(x, lra_get_allocno_class (i)) == NO_REGS))
|| contains_symbol_ref_p (x))))
ira_reg_equiv[i].defined_p = false;
- if (contains_reg_p (x, false, true))
+ if (contains_reg_p (x, false, true)
+ || (CONST_DOUBLE_P (x)
+ && maybe_ge (GET_MODE_SIZE (GET_MODE (x)), 8)))
ira_reg_equiv[i].profitable_p = false;
if (get_equiv (reg) != reg)
bitmap_ior_into (equiv_insn_bitmap,
&lra_reg_info[i].insn_bitmap);
If it improves the performance, I'll commit this patch.
The expander unconditionally uses the constant pool for double constants.
I think analogous treatment could be applied to equivalent double
constants in LRA.
As far as I know, only x86_64 permits 64-bit constants as immediates for
moving them into general regs. As double FP operations are not done in
general regs in most cases, the constants have to be moved into FP regs,
and this is costly, as Jan wrote. So it makes sense to prohibit using
equivalent double constant values in LRA unconditionally. If in the
future we have a target which can move a double immediate into FP regs,
we can introduce a target hook to deal with equivalent double constants.
But right now I think there is no need for a new hook.
^ permalink raw reply [flat|nested] 43+ messages in thread
* [Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
` (29 preceding siblings ...)
2022-02-09 15:51 ` vmakarov at gcc dot gnu.org
@ 2022-02-10 7:45 ` rguenth at gcc dot gnu.org
2022-02-10 15:17 ` vmakarov at gcc dot gnu.org
` (8 subsequent siblings)
39 siblings, 0 replies; 43+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-02-10 7:45 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
--- Comment #29 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Vladimir Makarov from comment #28)
> Could somebody benchmark the following patch on zen2 470.lbm.
>
> diff --git a/gcc/lra-constraints.cc b/gcc/lra-constraints.cc
> index 9cee17479ba..76619aca8eb 100644
> --- a/gcc/lra-constraints.cc
> +++ b/gcc/lra-constraints.cc
> @@ -5084,7 +5089,9 @@ lra_constraints (bool first_p)
> (x, lra_get_allocno_class (i)) == NO_REGS))
> || contains_symbol_ref_p (x))))
> ira_reg_equiv[i].defined_p = false;
> - if (contains_reg_p (x, false, true))
> + if (contains_reg_p (x, false, true)
> + || (CONST_DOUBLE_P (x)
> + && maybe_ge (GET_MODE_SIZE (GET_MODE (x)), 8)))
> ira_reg_equiv[i].profitable_p = false;
> if (get_equiv (reg) != reg)
> bitmap_ior_into (equiv_insn_bitmap,
> &lra_reg_info[i].insn_bitmap);
>
> If it improves the performance, I'll commit this patch.
>
> The expander unconditionally uses memory pool for double constants. I think
> the analogous treatment could be done for equiv double constants in LRA.
>
> I know only x86_64 permits 64-bit constants as immediate for moving them
> into general regs. As double fp operations is not done in general regs in
> the most cases, they should be moved into fp regs and this is costly as Jan
> wrote. So it has sense to prohibit using equiv double constant values in
> LRA unconditionally. If in the future we have a target which can move
> double immediate into fp regs we can introduce some target hooks to deal
> with equiv double constant. But right now I think there is no need for the
> new hook.
Code generation changes quite a bit, with the patch the offending function
is 16 bytes larger. I see no large immediate moves to GPRs anymore but
there is still a lot of spilling of XMMs to GPRs. Performance is
unchanged by the patch:
470.lbm 13740 128 107 S 13740 128 107 S
470.lbm 13740 128 107 * 13740 128 107 S
470.lbm 13740 128 107 S 13740 128 107 *
I've used
diff --git a/gcc/lra-constraints.cc b/gcc/lra-constraints.cc
index 9cee17479ba..a0ec608c056 100644
--- a/gcc/lra-constraints.cc
+++ b/gcc/lra-constraints.cc
@@ -5084,7 +5084,9 @@ lra_constraints (bool first_p)
(x, lra_get_allocno_class (i)) == NO_REGS))
|| contains_symbol_ref_p (x))))
ira_reg_equiv[i].defined_p = false;
- if (contains_reg_p (x, false, true))
+ if (contains_reg_p (x, false, true)
+ || (CONST_DOUBLE_P (x)
+ && maybe_ge (GET_MODE_SIZE (GET_MODE (x)),
UNITS_PER_WORD)))
ira_reg_equiv[i].profitable_p = false;
if (get_equiv (reg) != reg)
bitmap_ior_into (equiv_insn_bitmap,
&lra_reg_info[i].insn_bitmap);
note UNITS_PER_WORD vs. literal 8.
Without knowing much of the code, I wonder if we can check whether the move
will be to a reg in GENERAL_REGS? That is, do we know whether there are
(besides some special constants like zero) immediate moves to the
destination register class?
That said, given the result on LBM I'd not change this at this point.
Honza wanted to look at the move pattern to try to mitigate the
GPR spilling of XMMs.
I do think that we need to take costs into account at some point and get
rid of the reload style hand-waving with !?* in the move patterns.
^ permalink raw reply [flat|nested] 43+ messages in thread
* [Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
` (30 preceding siblings ...)
2022-02-10 7:45 ` rguenth at gcc dot gnu.org
@ 2022-02-10 15:17 ` vmakarov at gcc dot gnu.org
2022-04-11 13:04 ` rguenth at gcc dot gnu.org
` (7 subsequent siblings)
39 siblings, 0 replies; 43+ messages in thread
From: vmakarov at gcc dot gnu.org @ 2022-02-10 15:17 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
--- Comment #30 from Vladimir Makarov <vmakarov at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #29)
> (In reply to Vladimir Makarov from comment #28)
> > Could somebody benchmark the following patch on zen2 470.lbm.
>
> Code generation changes quite a bit, with the patch the offending function
> is 16 bytes larger. I see no large immediate moves to GPRs anymore but
> there is still a lot of spilling of XMMs to GPRs. Performance is
> unchanged by the patch:
>
> 470.lbm 13740 128 107 S 13740 128 107 S
> 470.lbm 13740 128 107 * 13740 128 107 S
> 470.lbm 13740 128 107 S 13740 128 107 *
>
>
Thank you very much for testing the patch, Richard. The results mean the
patch is a no-go to me.
> Without knowing much of the code I wonder if we can check whether the move
> will be to a reg in GENERAL_REGS? That is, do we know whether there are
> (besides some special constants like zero), immediate moves to the
> destination register class?
>
There is no such info from the target code. Ideally we need to have the
cost of loading a *particular* immediate value into a register class on
the same cost basis as load/store. Still, to use this info efficiently,
the choice of alternatives should be based on costs, not on hints and
some machine-independent general heuristics (as now).
> That said, given the result on LBM I'd not change this at this point.
>
> Honza wanted to look at the move pattern to try to mitigate the
> GPR spilling of XMMs.
>
> I do think that we need to take costs into account at some point and get
> rid of the reload style hand-waving with !?* in the move patterns.
In general I agree with the direction, but it will be quite hard to do. I
know this well from my experience changing the register class cost
calculation algorithm in IRA (the experimental code can be found on the
ira-select branch). I expect a huge number of test failures and some
benchmark performance degradations for practically any target, and a big
involvement of target maintainers to fix them. Although it is possible to
try to do this for one target at a time.
^ permalink raw reply [flat|nested] 43+ messages in thread
* [Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
` (31 preceding siblings ...)
2022-02-10 15:17 ` vmakarov at gcc dot gnu.org
@ 2022-04-11 13:04 ` rguenth at gcc dot gnu.org
2022-04-25 9:45 ` rguenth at gcc dot gnu.org
` (6 subsequent siblings)
39 siblings, 0 replies; 43+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-04-11 13:04 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Last reconfirmed|2021-09-03 00:00:00 |2022-04-11
--- Comment #31 from Richard Biener <rguenth at gcc dot gnu.org> ---
For the testcase I still see 58 vmovq GPR <-> XMM at -Ofast -march=znver2,
resulting from spilling of xmms. And there's still cases like
movq .LC0(%rip), %rax
...
vmovq %rax, %xmm2
vmovq %rax, %xmm4
Honza - you said you wanted to take a look here. As I understand from Vlad,
costing isn't taken into account when choosing alternatives for reloading.
* [Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
` (32 preceding siblings ...)
2022-04-11 13:04 ` rguenth at gcc dot gnu.org
@ 2022-04-25 9:45 ` rguenth at gcc dot gnu.org
2022-04-25 12:52 ` rguenth at gcc dot gnu.org
` (5 subsequent siblings)
39 siblings, 0 replies; 43+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-04-25 9:45 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
--- Comment #32 from Richard Biener <rguenth at gcc dot gnu.org> ---
So the bad "head" can be fixed via
diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index c74edd1aaef..8f9f26e0a82 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -3580,9 +3580,9 @@
 ;; Possible store forwarding (partial memory) stall in alternatives 4, 6 and 7.
 (define_insn "*movdf_internal"
   [(set (match_operand:DF 0 "nonimmediate_operand"
-    "=Yf*f,m ,Yf*f,?r ,!o,?*r ,!o,!o,?r,?m,?r,?r,v,v,v,m,*x,*x,*x,m ,r ,v,r ,o ,r ,m")
+    "=Yf*f,m ,Yf*f,?r ,!o,?*r ,!o,!o,?r,?m,?r,?r,v,v,v,m,*x,*x,*x,m ,!r,!v,r ,o ,r ,m")
	(match_operand:DF 1 "general_operand"
-    "Yf*fm,Yf*f,G ,roF,r ,*roF,*r,F ,rm,rC,C ,F ,C,v,m,v,C ,*x,m ,*x,v,r ,roF,rF,rmF,rC"))]
+    "Yf*fm,Yf*f,G ,roF,r ,*roF,*r,F ,rm,rC,C ,F ,C,v,m,v,C ,*x,m ,*x,!v,!r,roF,rF,rmF,rC"))]
   "!(MEM_P (operands[0]) && MEM_P (operands[1]))
    && (lra_in_progress || reload_completed
        || !CONST_DOUBLE_P (operands[1])
which is adding ! to r<->v alternatives. That should eventually be done
by duplicating the alternatives and enabling one set via some enable
attribute based on some tunable. I see those alternatives are already
(set (attr "preferred_for_speed")
(cond [(eq_attr "alternative" "3,4")
(symbol_ref "TARGET_INTEGER_DFMODE_MOVES")
(eq_attr "alternative" "20")
(symbol_ref "TARGET_INTER_UNIT_MOVES_FROM_VEC")
(eq_attr "alternative" "21")
(symbol_ref "TARGET_INTER_UNIT_MOVES_TO_VEC")
]
(symbol_ref "true")))
not sure why it's preferred_for_speed here though - shouldn't that be
enabled for size if !TARGET_INTER_UNIT_MOVES_{TO,FROM}_VEC and otherwise
disabled? Not sure if combining enabled and preferred_for_speed is
reasonably possible, but we have a preferred_for_size attribute here.
The diff with the ! added is quite short; I have yet to measure any
effect on LBM:
--- streamcollide.s.orig 2022-04-25 11:37:01.638733951 +0200
+++ streamcollide.s2 2022-04-25 11:35:54.885849296 +0200
@@ -33,28 +33,24 @@
.p2align 4
.p2align 3
.L12:
- movq .LC0(%rip), %rax
- vmovsd .LC4(%rip), %xmm6
+ vmovsd .LC0(%rip), %xmm2
+ vmovsd .LC1(%rip), %xmm13
+ movabsq $0x3ff01878b7a1c25d, %rax
movabsq $0x3fef85af6c69b5a6, %rdi
+ vmovsd .LC2(%rip), %xmm12
+ vmovsd .LC3(%rip), %xmm14
movabsq $0x3ff03db8fde2ef4e, %r8
+ movabsq $0x3fefcea39c51dabe, %r9
+ vmovsd .LC4(%rip), %xmm6
vmovsd .LC5(%rip), %xmm7
movq .LC8(%rip), %r11
- movabsq $0x3fefcea39c51dabe, %r9
movq .LC6(%rip), %rdx
movq .LC7(%rip), %rcx
- vmovq %rax, %xmm2
- vmovq %rax, %xmm4
- movq .LC1(%rip), %rax
movq %r11, %rsi
movq %r11, %r12
- vmovq %rax, %xmm13
- vmovq %rax, %xmm8
- movq .LC2(%rip), %rax
- vmovq %rax, %xmm12
- vmovq %rax, %xmm5
- movq .LC3(%rip), %rax
- vmovq %rax, %xmm14
- movabsq $0x3ff01878b7a1c25d, %rax
+ vmovsd %xmm2, %xmm2, %xmm4
+ vmovsd %xmm13, %xmm13, %xmm8
+ vmovsd %xmm12, %xmm12, %xmm5
vmovsd %xmm14, -16(%rsp)
.L5:
vmulsd .LC9(%rip), %xmm0, %xmm3
* [Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
` (33 preceding siblings ...)
2022-04-25 9:45 ` rguenth at gcc dot gnu.org
@ 2022-04-25 12:52 ` rguenth at gcc dot gnu.org
2022-04-25 13:02 ` rguenth at gcc dot gnu.org
` (4 subsequent siblings)
39 siblings, 0 replies; 43+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-04-25 12:52 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
--- Comment #33 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #32)
> The diff with the ! added is quite short; I have yet to measure any
> effect on LBM:
>
> --- streamcollide.s.orig 2022-04-25 11:37:01.638733951 +0200
> +++ streamcollide.s2 2022-04-25 11:35:54.885849296 +0200
> @@ -33,28 +33,24 @@
> .p2align 4
> .p2align 3
> .L12:
> - movq .LC0(%rip), %rax
> - vmovsd .LC4(%rip), %xmm6
> + vmovsd .LC0(%rip), %xmm2
> + vmovsd .LC1(%rip), %xmm13
> + movabsq $0x3ff01878b7a1c25d, %rax
> movabsq $0x3fef85af6c69b5a6, %rdi
> + vmovsd .LC2(%rip), %xmm12
> + vmovsd .LC3(%rip), %xmm14
> movabsq $0x3ff03db8fde2ef4e, %r8
> + movabsq $0x3fefcea39c51dabe, %r9
> + vmovsd .LC4(%rip), %xmm6
> vmovsd .LC5(%rip), %xmm7
> movq .LC8(%rip), %r11
> - movabsq $0x3fefcea39c51dabe, %r9
> movq .LC6(%rip), %rdx
> movq .LC7(%rip), %rcx
> - vmovq %rax, %xmm2
> - vmovq %rax, %xmm4
> - movq .LC1(%rip), %rax
> movq %r11, %rsi
> movq %r11, %r12
> - vmovq %rax, %xmm13
> - vmovq %rax, %xmm8
> - movq .LC2(%rip), %rax
> - vmovq %rax, %xmm12
> - vmovq %rax, %xmm5
> - movq .LC3(%rip), %rax
> - vmovq %rax, %xmm14
> - movabsq $0x3ff01878b7a1c25d, %rax
> + vmovsd %xmm2, %xmm2, %xmm4
> + vmovsd %xmm13, %xmm13, %xmm8
> + vmovsd %xmm12, %xmm12, %xmm5
> vmovsd %xmm14, -16(%rsp)
> .L5:
> vmulsd .LC9(%rip), %xmm0, %xmm3
Huh, and the net effect is that the + code is 9% _slower_!?
* [Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
` (34 preceding siblings ...)
2022-04-25 12:52 ` rguenth at gcc dot gnu.org
@ 2022-04-25 13:02 ` rguenth at gcc dot gnu.org
2022-04-25 13:09 ` rguenth at gcc dot gnu.org
` (3 subsequent siblings)
39 siblings, 0 replies; 43+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-04-25 13:02 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
--- Comment #34 from Richard Biener <rguenth at gcc dot gnu.org> ---
As noted the effect of
if(...) {
ux = 0.005;
uy = 0.002;
uz = 0.000;
}
is PRE of most(!) dependent instructions, creating
# prephitmp_1099 = PHI <_1098(6),
6.49971724999999889149648879538290202617645263671875e-1(5)>
# prephitmp_1111 = PHI <_1110(6),
1.089805708333333178483570691241766326129436492919921875e-1(5)>
...
we successfully coalesce the non-constant incoming register with the result
but have to emit copies for all constants on the other edge where we have
quite a number of duplicate constants to deal with.
I've experimented with ensuring we get _full_ PRE of the dependent expressions
by more aggressively re-associating (give PHIs with a constant incoming operand
on at least one edge a rank similar to constants, 1).
This increases the number of PHIs further but reduces the follow-up
computations more. We still fail to simply tail-duplicate the merge block
(another possibility to eventually save some of the overhead); our tail
duplication code (gimple-ssa-split-paths.cc) doesn't handle this case since
the diamond is not the one immediately preceding the loop exit/latch.
The result of "full PRE" is a little bit worse than the current state (so
it's not a full solution here).
* [Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
` (35 preceding siblings ...)
2022-04-25 13:02 ` rguenth at gcc dot gnu.org
@ 2022-04-25 13:09 ` rguenth at gcc dot gnu.org
2023-04-26 6:55 ` [Bug rtl-optimization/102178] [12/13/14 " rguenth at gcc dot gnu.org
` (2 subsequent siblings)
39 siblings, 0 replies; 43+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-04-25 13:09 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Target Milestone|12.0 |13.0
Priority|P1 |P2
--- Comment #35 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #34)
> As noted the effect of
>
> if(...) {
> ux = 0.005;
> uy = 0.002;
> uz = 0.000;
> }
>
> is PRE of most(!) dependent instructions, creating
>
> # prephitmp_1099 = PHI <_1098(6),
> 6.49971724999999889149648879538290202617645263671875e-1(5)>
> # prephitmp_1111 = PHI <_1110(6),
> 1.089805708333333178483570691241766326129436492919921875e-1(5)>
> ...
>
> we successfully coalesce the non-constant incoming register with the result
> but have to emit copies for all constants on the other edge where we have
> quite a number of duplicate constants to deal with.
>
> I've experimented with ensuring we get _full_ PRE of the dependent
> expressions
> by more aggressively re-associating (give PHIs with a constant incoming
> operand
> on at least one edge a rank similar to constants, 1).
>
> This increases the number of PHIs further but reduces the follow-up
> computations more. We still fail to simply tail-duplicate the merge
> block (another possibility to eventually save some of the overhead);
> our tail duplication code (gimple-ssa-split-paths.cc) doesn't handle
> this case since the diamond is not the one immediately preceding the
> loop exit/latch.
>
> The result of "full PRE" is a little bit worse than the current state (so
> it's not a full solution here).
Btw, looking at coverage, the constant case is only an unimportant fraction
of the runtime (the branch is taken on only 55296000 of 3562383000
executions, about 1.6%), so the register pressure increase by the PRE
dominates (but the branch is predicted to be 50/50):
3562383000: 241: if( TEST_FLAG_SWEEP( srcGrid, ACCEL )) {
55296000: 242: ux = 0.005;
55296000: 243: uy = 0.002;
55296000: 244: uz = 0.000;
-: 245: }
we can also see that PGO notices this and we do _not_ perform the PRE.
So the root cause is nothing we can fix for GCC 12; tuning to avoid
spilling to GPRs can recover parts of the regression but will definitely
have effects elsewhere.
Re-targeting to GCC 13.
* [Bug rtl-optimization/102178] [12/13/14 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
` (36 preceding siblings ...)
2022-04-25 13:09 ` rguenth at gcc dot gnu.org
@ 2023-04-26 6:55 ` rguenth at gcc dot gnu.org
2023-07-27 9:22 ` rguenth at gcc dot gnu.org
2024-05-21 9:10 ` [Bug rtl-optimization/102178] [12/13/14/15 " jakub at gcc dot gnu.org
39 siblings, 0 replies; 43+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-04-26 6:55 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Target Milestone|13.0 |13.2
--- Comment #36 from Richard Biener <rguenth at gcc dot gnu.org> ---
GCC 13.1 is being released, retargeting bugs to GCC 13.2.
* [Bug rtl-optimization/102178] [12/13/14 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
` (37 preceding siblings ...)
2023-04-26 6:55 ` [Bug rtl-optimization/102178] [12/13/14 " rguenth at gcc dot gnu.org
@ 2023-07-27 9:22 ` rguenth at gcc dot gnu.org
2024-05-21 9:10 ` [Bug rtl-optimization/102178] [12/13/14/15 " jakub at gcc dot gnu.org
39 siblings, 0 replies; 43+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-07-27 9:22 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Target Milestone|13.2 |13.3
--- Comment #37 from Richard Biener <rguenth at gcc dot gnu.org> ---
GCC 13.2 is being released, retargeting bugs to GCC 13.3.
* [Bug rtl-optimization/102178] [12/13/14/15 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
` (38 preceding siblings ...)
2023-07-27 9:22 ` rguenth at gcc dot gnu.org
@ 2024-05-21 9:10 ` jakub at gcc dot gnu.org
39 siblings, 0 replies; 43+ messages in thread
From: jakub at gcc dot gnu.org @ 2024-05-21 9:10 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
Jakub Jelinek <jakub at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Target Milestone|13.3 |13.4
--- Comment #38 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
GCC 13.3 is being released, retargeting bugs to GCC 13.4.
end of thread, other threads:[~2024-05-21 9:10 UTC | newest]
Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
2021-09-03 7:07 ` [Bug tree-optimization/102178] " marxin at gcc dot gnu.org
2021-09-06 6:40 ` rguenth at gcc dot gnu.org
2021-09-06 6:41 ` [Bug tree-optimization/102178] [12 Regression] " rguenth at gcc dot gnu.org
2021-09-07 2:46 ` luoxhu at gcc dot gnu.org
2021-09-08 14:06 ` jamborm at gcc dot gnu.org
2021-09-16 16:17 ` jamborm at gcc dot gnu.org
2022-01-20 10:20 ` rguenth at gcc dot gnu.org
2022-01-26 15:57 ` marxin at gcc dot gnu.org
2022-01-27 7:42 ` [Bug rtl-optimization/102178] " rguenth at gcc dot gnu.org
2022-01-27 7:55 ` rguenth at gcc dot gnu.org
2022-01-27 8:13 ` crazylht at gmail dot com
2022-01-27 8:18 ` crazylht at gmail dot com
2022-01-27 8:20 ` rguenth at gcc dot gnu.org
2022-01-27 9:34 ` rguenth at gcc dot gnu.org
2022-01-27 9:55 ` Jan Hubicka
2022-01-27 9:55 ` hubicka at kam dot mff.cuni.cz
2022-01-27 10:13 ` rguenth at gcc dot gnu.org
2022-01-27 10:14 ` rguenth at gcc dot gnu.org
2022-01-27 10:23 ` hubicka at kam dot mff.cuni.cz
2022-01-27 10:32 ` rguenth at gcc dot gnu.org
2022-01-27 11:18 ` rguenth at gcc dot gnu.org
2022-01-27 11:30 ` rguenther at suse dot de
2022-01-27 11:33 ` rguenther at suse dot de
2022-01-27 12:04 ` Jan Hubicka
2022-01-27 12:04 ` hubicka at kam dot mff.cuni.cz
2022-01-27 13:42 ` hjl.tools at gmail dot com
2022-01-27 14:24 ` rguenth at gcc dot gnu.org
2022-01-27 16:28 ` crazylht at gmail dot com
2022-01-27 16:36 ` crazylht at gmail dot com
2022-01-28 15:48 ` vmakarov at gcc dot gnu.org
2022-01-28 16:02 ` vmakarov at gcc dot gnu.org
2022-02-09 15:51 ` vmakarov at gcc dot gnu.org
2022-02-10 7:45 ` rguenth at gcc dot gnu.org
2022-02-10 15:17 ` vmakarov at gcc dot gnu.org
2022-04-11 13:04 ` rguenth at gcc dot gnu.org
2022-04-25 9:45 ` rguenth at gcc dot gnu.org
2022-04-25 12:52 ` rguenth at gcc dot gnu.org
2022-04-25 13:02 ` rguenth at gcc dot gnu.org
2022-04-25 13:09 ` rguenth at gcc dot gnu.org
2023-04-26 6:55 ` [Bug rtl-optimization/102178] [12/13/14 " rguenth at gcc dot gnu.org
2023-07-27 9:22 ` rguenth at gcc dot gnu.org
2024-05-21 9:10 ` [Bug rtl-optimization/102178] [12/13/14/15 " jakub at gcc dot gnu.org