public inbox for gcc-bugs@sourceware.org
* [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
@ 2021-09-02 15:38 jamborm at gcc dot gnu.org
2021-09-03 7:07 ` [Bug tree-optimization/102178] " marxin at gcc dot gnu.org
` (39 more replies)
0 siblings, 40 replies; 43+ messages in thread
From: jamborm at gcc dot gnu.org @ 2021-09-02 15:38 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
Bug ID: 102178
Summary: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after
r12-897-gde56f95afaaa22
Product: gcc
Version: 11.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: jamborm at gcc dot gnu.org
Blocks: 26163
Target Milestone: ---
Host: x86_64-linix
Target: x86_64-linux
LNT has detected an 18% regression of SPECFP 2006 benchmark 470.lbm
when it is compiled with -Ofast -march=native on a Zen2 machine:
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=421.240.0&plot.1=301.240.0&
...and similarly a 6% regression when it is run on the same machine
with -Ofast:
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=450.240.0&plot.1=24.240.0&
I have bisected both, on another Zen2 machine, to commit
r12-897-gde56f95afaaa22 (Run pass_sink_code once more before
store_merging).
A Zen1 machine has also seen a similar -march=native regression in the
same time frame:
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=450.240.0&plot.1=24.240.0&
Zen1 -march=generic seems to be unaffected, which is also the case for
the Intel machines we track.
Although lbm has been known to have weird regressions caused entirely
by code layout where the compiler was not really at fault, the fact
that both generic code-gen and Zen1 are affected seems to indicate this
is not the case.
Referenced Bugs:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163
[Bug 26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)
^ permalink raw reply [flat|nested] 43+ messages in thread
* [Bug tree-optimization/102178] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
@ 2021-09-03 7:07 ` marxin at gcc dot gnu.org
2021-09-06 6:40 ` rguenth at gcc dot gnu.org
` (38 subsequent siblings)
39 siblings, 0 replies; 43+ messages in thread
From: marxin at gcc dot gnu.org @ 2021-09-03 7:07 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
Martin Liška <marxin at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|UNCONFIRMED |NEW
Last reconfirmed| |2021-09-03
CC| |marxin at gcc dot gnu.org
Ever confirmed|0 |1
* [Bug tree-optimization/102178] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
2021-09-03 7:07 ` [Bug tree-optimization/102178] " marxin at gcc dot gnu.org
@ 2021-09-06 6:40 ` rguenth at gcc dot gnu.org
2021-09-06 6:41 ` [Bug tree-optimization/102178] [12 Regression] " rguenth at gcc dot gnu.org
` (37 subsequent siblings)
39 siblings, 0 replies; 43+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-09-06 6:40 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
Maybe related to PR102008, see the comment I made there. Martin, maybe you can
try moving late sink to before the last phiopt pass.
* [Bug tree-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
2021-09-03 7:07 ` [Bug tree-optimization/102178] " marxin at gcc dot gnu.org
2021-09-06 6:40 ` rguenth at gcc dot gnu.org
@ 2021-09-06 6:41 ` rguenth at gcc dot gnu.org
2021-09-07 2:46 ` luoxhu at gcc dot gnu.org
` (36 subsequent siblings)
39 siblings, 0 replies; 43+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-09-06 6:41 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Version|11.0 |12.0
Summary|SPECFP 2006 470.lbm |[12 Regression] SPECFP 2006
|regressions on AMD Zen CPUs |470.lbm regressions on AMD
|after |Zen CPUs after
|r12-897-gde56f95afaaa22 |r12-897-gde56f95afaaa22
Target Milestone|--- |12.0
* [Bug tree-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
` (2 preceding siblings ...)
2021-09-06 6:41 ` [Bug tree-optimization/102178] [12 Regression] " rguenth at gcc dot gnu.org
@ 2021-09-07 2:46 ` luoxhu at gcc dot gnu.org
2021-09-08 14:06 ` jamborm at gcc dot gnu.org
` (35 subsequent siblings)
39 siblings, 0 replies; 43+ messages in thread
From: luoxhu at gcc dot gnu.org @ 2021-09-07 2:46 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
--- Comment #2 from luoxhu at gcc dot gnu.org ---
Verified that 470.lbm doesn't show the regression on Power8 with -Ofast:
runtime is 141 sec for r12-897; without that patch it is 142 sec.
* [Bug tree-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
` (3 preceding siblings ...)
2021-09-07 2:46 ` luoxhu at gcc dot gnu.org
@ 2021-09-08 14:06 ` jamborm at gcc dot gnu.org
2021-09-16 16:17 ` jamborm at gcc dot gnu.org
` (34 subsequent siblings)
39 siblings, 0 replies; 43+ messages in thread
From: jamborm at gcc dot gnu.org @ 2021-09-08 14:06 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
--- Comment #3 from Martin Jambor <jamborm at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #1)
> Martin, maybe you can try moving late sink to before the last phiopt pass.
If you mean the following then unfortunately that has not helped.
diff --git a/gcc/passes.def b/gcc/passes.def
index d7a1f8c97a6..5eb70cd2cd8 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -347,10 +347,10 @@ along with GCC; see the file COPYING3. If not see
/* After late CD DCE we rewrite no longer addressed locals into SSA
form if possible. */
NEXT_PASS (pass_forwprop);
+ NEXT_PASS (pass_sink_code);
NEXT_PASS (pass_phiopt, false /* early_p */);
NEXT_PASS (pass_fold_builtins);
NEXT_PASS (pass_optimize_widening_mul);
- NEXT_PASS (pass_sink_code);
NEXT_PASS (pass_store_merging);
NEXT_PASS (pass_tail_calls);
/* If DCE is not run before checking for uninitialized uses,
...I'll have a very brief look at what is actually happening just so that I
have more reasons to believe this is not a code placement issue again.
* [Bug tree-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
` (4 preceding siblings ...)
2021-09-08 14:06 ` jamborm at gcc dot gnu.org
@ 2021-09-16 16:17 ` jamborm at gcc dot gnu.org
2022-01-20 10:20 ` rguenth at gcc dot gnu.org
` (33 subsequent siblings)
39 siblings, 0 replies; 43+ messages in thread
From: jamborm at gcc dot gnu.org @ 2021-09-16 16:17 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
--- Comment #4 from Martin Jambor <jamborm at gcc dot gnu.org> ---
(In reply to Martin Jambor from comment #3)
> ...I'll have a very brief look at what is actually happening just so that I
> have more reasons to believe this is not a code placement issue again.
The hot function is at the same address when compiled by both
revisions and the newer version looks sufficiently different. I even
tried sprinkling it with nops and it did not help. I am not saying we
are not bumping against some micro-architectural peculiarity, but it
does not seem to be a code placement issue.
* [Bug tree-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
` (5 preceding siblings ...)
2021-09-16 16:17 ` jamborm at gcc dot gnu.org
@ 2022-01-20 10:20 ` rguenth at gcc dot gnu.org
2022-01-26 15:57 ` marxin at gcc dot gnu.org
` (32 subsequent siblings)
39 siblings, 0 replies; 43+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-01-20 10:20 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |rguenth at gcc dot gnu.org
Priority|P3 |P1
Keywords| |missed-optimization
Host|x86_64-linix |
--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
Analysis is missing but the regression persists. On Haswell I do not see any
effect. I do suspect it's about cmov vs. non-cmov, but without a profile and
a look at the affected assembly that's a wild guess.
* [Bug tree-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
` (6 preceding siblings ...)
2022-01-20 10:20 ` rguenth at gcc dot gnu.org
@ 2022-01-26 15:57 ` marxin at gcc dot gnu.org
2022-01-27 7:42 ` [Bug rtl-optimization/102178] " rguenth at gcc dot gnu.org
` (31 subsequent siblings)
39 siblings, 0 replies; 43+ messages in thread
From: marxin at gcc dot gnu.org @ 2022-01-26 15:57 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
--- Comment #6 from Martin Liška <marxin at gcc dot gnu.org> ---
Created attachment 52296
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52296&action=edit
perf annotate before and after the revision
* [Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
` (7 preceding siblings ...)
2022-01-26 15:57 ` marxin at gcc dot gnu.org
@ 2022-01-27 7:42 ` rguenth at gcc dot gnu.org
2022-01-27 7:55 ` rguenth at gcc dot gnu.org
` (30 subsequent siblings)
39 siblings, 0 replies; 43+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-01-27 7:42 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Component|tree-optimization |rtl-optimization
CC| |vmakarov at gcc dot gnu.org
Keywords| |ra
--- Comment #7 from Richard Biener <rguenth at gcc dot gnu.org> ---
I see a lot more GPR <-> XMM moves in the 'after' case:
1035 : 401c8b: vaddsd %xmm1,%xmm0,%xmm0
1953 : 401c8f: vmovq %rcx,%xmm1
305 : 401c94: vaddsd %xmm8,%xmm1,%xmm1
3076 : 401c99: vmovq %xmm0,%r14
590 : 401c9e: vmovq %r11,%xmm0
267 : 401ca3: vmovq %xmm1,%r8
136 : 401ca8: vmovq %rdx,%xmm1
448 : 401cad: vaddsd %xmm1,%xmm0,%xmm1
1703 : 401cb1: vmovq %xmm1,%r9 (*)
834 : 401cb6: vmovq %r8,%xmm1
1719 : 401cbb: vmovq %r9,%xmm0 (*)
2782 : 401cc0: vaddsd %xmm0,%xmm1,%xmm1
22135 : 401cc4: vmovsd %xmm1,%xmm1,%xmm0
1261 : 401cc8: vmovq %r14,%xmm1
646 : 401ccd: vaddsd %xmm0,%xmm1,%xmm0
18136 : 401cd1: vaddsd %xmm2,%xmm5,%xmm1
629 : 401cd5: vmovq %xmm1,%r8
142 : 401cda: vaddsd %xmm6,%xmm3,%xmm1
177 : 401cde: vmovq %xmm0,%r14
288 : 401ce3: vmovq %xmm1,%r9
177 : 401ce8: vmovq %r8,%xmm1
174 : 401ced: vmovq %r9,%xmm0
those look like RA / spilling artifacts, and IIRC I saw Hongtao posting
patches in this area, to regcprop I think. The above is definitely
bad; for example (*) seems to swap %xmm0 and %xmm1 via %r9.
The function is LBM_performStreamCollide; the sinking pass does nothing wrong,
it moves the unconditionally executed
- _948 = _861 + _867;
- _957 = _944 + _948;
- _912 = _861 + _873;
...
- _981 = _853 + _865;
- _989 = _977 + _981;
- _916 = _853 + _857;
- _924 = _912 + _916;
into a conditionally executed block. But that increases register pressure
by 5 FP regs (if I counted correctly) in that area. So this would be the
usual issue of GIMPLE transforms not being register-pressure aware.
-fschedule-insns -fsched-pressure seems to be able to somewhat mitigate this
(though I think EBB scheduling cannot undo such movement).
In postreload I see transforms like
-(insn 466 410 411 7 (set (reg:DF 0 ax [530])
- (mem/u/c:DF (symbol_ref/u:DI ("*.LC10") [flags 0x2]) [0 S8 A64]))
"lbm.c":241:5 141 {*movdf_internal}
- (expr_list:REG_EQUAL (const_double:DF
9.939744999999999830464503247640095651149749755859375e-1
[0x0.fe751ce28ed5fp+0])
- (nil)))
-(insn 411 466 467 7 (set (reg:DF 25 xmm5 [orig:123 prephitmp_643 ] [123])
+(insn 411 410 467 7 (set (reg:DF 25 xmm5 [orig:123 prephitmp_643 ] [123])
(reg:DF 0 ax [530])) "lbm.c":241:5 141 {*movdf_internal}
(nil))
which seems like we could have reloaded %xmm5 from .LC10. But the spilling
to GPRs seems to be present already after LRA and cprop_hardreg doesn't
do anything bad either.
The differences can be seen on trunk with -Ofast -march=znver2
[-fdisable-tree-sink2].
We have X86_TUNE_INTER_UNIT_MOVES_TO_VEC/X86_TUNE_INTER_UNIT_MOVES_FROM_VEC
and the interesting thing is that when I disable them I do see some
spilling to the stack but also quite some re-materialized constants
(loads from .LC* as seem from the opportunity above).
It might be interesting to benchmark with
-mtune-ctrl=^inter_unit_moves_from_vec,^inter_unit_moves_to_vec and to find a
way to arrange the costs so that IRA/LRA prefers re-materialization of
constants from the constant pool over spilling to GPRs (if that's possible at
all - Vlad?)
* [Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
` (8 preceding siblings ...)
2022-01-27 7:42 ` [Bug rtl-optimization/102178] " rguenth at gcc dot gnu.org
@ 2022-01-27 7:55 ` rguenth at gcc dot gnu.org
2022-01-27 8:13 ` crazylht at gmail dot com
` (29 subsequent siblings)
39 siblings, 0 replies; 43+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-01-27 7:55 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |hubicka at gcc dot gnu.org
--- Comment #8 from Richard Biener <rguenth at gcc dot gnu.org> ---
So w/ -Ofast -march=znver2 I get a runtime of 130 seconds, when I add
-mtune-ctrl=^inter_unit_moves_from_vec,^inter_unit_moves_to_vec then
this improves to 114 seconds, with sink2 disabled I get 108 seconds
and with the tune-ctrl on top I get 113 seconds.
Note that Zen2 is quite special in that it has the ability to handle
load/store from the stack by mapping it to a register, effectively
making them zero latency (zen3 lost this ability).
So while moves between GPRs and XMM might not be bad anymore _spilling_
to a GPR (and I suppose XMM, too) is still a bad idea and the stack
should be preferred.
Not sure if it's possible to do that though.
Doing the same experiment as above on a Zen3 machine would be nice, too.
* [Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
` (9 preceding siblings ...)
2022-01-27 7:55 ` rguenth at gcc dot gnu.org
@ 2022-01-27 8:13 ` crazylht at gmail dot com
2022-01-27 8:18 ` crazylht at gmail dot com
` (28 subsequent siblings)
39 siblings, 0 replies; 43+ messages in thread
From: crazylht at gmail dot com @ 2022-01-27 8:13 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
--- Comment #9 from Hongtao.liu <crazylht at gmail dot com> ---
1703 : 401cb1: vmovq %xmm1,%r9 (*)
834 : 401cb6: vmovq %r8,%xmm1
1719 : 401cbb: vmovq %r9,%xmm0 (*)
Looks like %r9 is dead after the second (*), and it can be optimized to
1703 : 401cb1: vmovq %xmm1,%xmm0
834 : 401cb6: vmovq %r8,%xmm1
* [Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
` (10 preceding siblings ...)
2022-01-27 8:13 ` crazylht at gmail dot com
@ 2022-01-27 8:18 ` crazylht at gmail dot com
2022-01-27 8:20 ` rguenth at gcc dot gnu.org
` (27 subsequent siblings)
39 siblings, 0 replies; 43+ messages in thread
From: crazylht at gmail dot com @ 2022-01-27 8:18 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
--- Comment #10 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Richard Biener from comment #8)
> So w/ -Ofast -march=znver2 I get a runtime of 130 seconds, when I add
> -mtune-ctrl=^inter_unit_moves_from_vec,^inter_unit_moves_to_vec then
> this improves to 114 seconds, with sink2 disabled I get 108 seconds
> and with the tune-ctrl ontop I get 113 seconds.
>
> Note that Zen2 is quite special in that it has the ability to handle
> load/store from the stack by mapping it to a register, effectively
> making them zero latency (zen3 lost this ability).
>
> So while moves between GPRs and XMM might not be bad anymore _spilling_
> to a GPR (and I suppose XMM, too) is still a bad idea and the stack
> should be preferred.
>
According to znver2_cost, the cost of sse_to_integer is a little bit less than
fp_store; maybe increasing the sse_to_integer cost (to more than fp_store)
can help the RA to choose memory instead of a GPR.
* [Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
` (11 preceding siblings ...)
2022-01-27 8:18 ` crazylht at gmail dot com
@ 2022-01-27 8:20 ` rguenth at gcc dot gnu.org
2022-01-27 9:34 ` rguenth at gcc dot gnu.org
` (26 subsequent siblings)
39 siblings, 0 replies; 43+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-01-27 8:20 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
--- Comment #11 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #8)
> So w/ -Ofast -march=znver2 I get a runtime of 130 seconds, when I add
> -mtune-ctrl=^inter_unit_moves_from_vec,^inter_unit_moves_to_vec then
> this improves to 114 seconds, with sink2 disabled I get 108 seconds
> and with the tune-ctrl ontop I get 113 seconds.
With -Ofast -march=znver2 -fschedule-insns -fsched-pressure I get 113 seconds.
* [Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
` (12 preceding siblings ...)
2022-01-27 8:20 ` rguenth at gcc dot gnu.org
@ 2022-01-27 9:34 ` rguenth at gcc dot gnu.org
2022-01-27 9:55 ` Jan Hubicka
2022-01-27 9:55 ` hubicka at kam dot mff.cuni.cz
` (25 subsequent siblings)
39 siblings, 1 reply; 43+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-01-27 9:34 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
--- Comment #12 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Hongtao.liu from comment #10)
> (In reply to Richard Biener from comment #8)
> > So w/ -Ofast -march=znver2 I get a runtime of 130 seconds, when I add
> > -mtune-ctrl=^inter_unit_moves_from_vec,^inter_unit_moves_to_vec then
> > this improves to 114 seconds, with sink2 disabled I get 108 seconds
> > and with the tune-ctrl ontop I get 113 seconds.
> >
> > Note that Zen2 is quite special in that it has the ability to handle
> > load/store from the stack by mapping it to a register, effectively
> > making them zero latency (zen3 lost this ability).
> >
> > So while moves between GPRs and XMM might not be bad anymore _spilling_
> > to a GPR (and I suppose XMM, too) is still a bad idea and the stack
> > should be preferred.
> >
>
> According to znver2_cost
>
> Cost of sse_to_integer is a little bit less than fp_store, maybe increase
> sse_to_integer cost(more than fp_store) can helps RA to choose memory
> instead of GPR.
That sounds reasonable - GPR<->xmm is cheaper than GPR -> stack -> xmm
but GPR<->xmm should be more expensive than GPR/xmm<->stack. As said above
Zen2 can do reg -> mem, mem -> reg via renaming if 'mem' is somewhat special,
but modeling that doesn't seem to be necessary.
We seem to have store costs of 8 and load costs of 6; I'll try bumping the
gpr<->xmm move cost to 8.
* Re: [Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2022-01-27 9:34 ` rguenth at gcc dot gnu.org
@ 2022-01-27 9:55 ` Jan Hubicka
0 siblings, 0 replies; 43+ messages in thread
From: Jan Hubicka @ 2022-01-27 9:55 UTC (permalink / raw)
To: rguenth at gcc dot gnu.org; +Cc: gcc-bugs
> > According to znver2_cost
> >
> > Cost of sse_to_integer is a little bit less than fp_store, maybe increase
> > sse_to_integer cost(more than fp_store) can helps RA to choose memory
> > instead of GPR.
>
> That sounds reasonable - GPR<->xmm is cheaper than GPR -> stack -> xmm
> but GPR<->xmm should be more expensive than GPR/xmm<->stack. As said above
> Zen2 can do reg -> mem, mem -> reg via renaming if 'mem' is somewhat special,
> but modeling that doesn't seem to be necessary.
>
> We seem to have store costs of 8 and load costs of 6, I'll try bumping the
> gpr<->xmm move cost to 8.
I was simply following latencies here, so indeed the reg<->mem bypass is not
really modelled. I recall doing a few experiments which were kind of
inconclusive.
* [Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
` (13 preceding siblings ...)
2022-01-27 9:34 ` rguenth at gcc dot gnu.org
@ 2022-01-27 9:55 ` hubicka at kam dot mff.cuni.cz
2022-01-27 10:13 ` rguenth at gcc dot gnu.org
` (24 subsequent siblings)
39 siblings, 0 replies; 43+ messages in thread
From: hubicka at kam dot mff.cuni.cz @ 2022-01-27 9:55 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
--- Comment #13 from hubicka at kam dot mff.cuni.cz ---
> > According to znver2_cost
> >
> > Cost of sse_to_integer is a little bit less than fp_store, maybe increase
> > sse_to_integer cost(more than fp_store) can helps RA to choose memory
> > instead of GPR.
>
> That sounds reasonable - GPR<->xmm is cheaper than GPR -> stack -> xmm
> but GPR<->xmm should be more expensive than GPR/xmm<->stack. As said above
> Zen2 can do reg -> mem, mem -> reg via renaming if 'mem' is somewhat special,
> but modeling that doesn't seem to be necessary.
>
> We seem to have store costs of 8 and load costs of 6, I'll try bumping the
> gpr<->xmm move cost to 8.
I was simply following latencies here, so indeed the reg<->mem bypass is not
really modelled. I recall doing a few experiments which were kind of
inconclusive.
* [Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
` (14 preceding siblings ...)
2022-01-27 9:55 ` hubicka at kam dot mff.cuni.cz
@ 2022-01-27 10:13 ` rguenth at gcc dot gnu.org
2022-01-27 10:14 ` rguenth at gcc dot gnu.org
` (23 subsequent siblings)
39 siblings, 0 replies; 43+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-01-27 10:13 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
--- Comment #14 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Hongtao.liu from comment #9)
> 1703 : 401cb1: vmovq %xmm1,%r9 (*)
> 834 : 401cb6: vmovq %r8,%xmm1
> 1719 : 401cbb: vmovq %r9,%xmm0 (*)
>
> Look like %r9 is dead after the second (*), and it can be optimized to
>
> 1703 : 401cb1: vmovq %xmm1,%xmm0
> 834 : 401cb6: vmovq %r8,%xmm1
Yep, we also have code like
- movabsq $0x3ff03db8fde2ef4e, %r8
...
- vmovq %r8, %xmm11
or
movq .LC11(%rip), %rax
vmovq %rax, %xmm14
which is extremely odd to see ... (I didn't check how we arrive at that)
When I do
diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
index 017ffa69958..4c51358d7b6 100644
--- a/gcc/config/i386/x86-tune-costs.h
+++ b/gcc/config/i386/x86-tune-costs.h
@@ -1585,7 +1585,7 @@ struct processor_costs znver2_cost = {
in 32,64,128,256 and 512-bit. */
{8, 8, 8, 8, 16}, /* cost of storing SSE registers
in 32,64,128,256 and 512-bit. */
- 6, 6, /* SSE->integer and integer->SSE moves. */
+ 8, 8, /* SSE->integer and integer->SSE moves. */
8, 8, /* mask->integer and integer->mask moves */
{6, 6, 6}, /* cost of loading mask register
performance improves from 128 seconds to 115 seconds. The result is
a lot more stack spilling in the code but there are still cases like
movq .LC8(%rip), %rax
vmovsd .LC13(%rip), %xmm6
vmovsd .LC16(%rip), %xmm11
vmovsd .LC14(%rip), %xmm3
vmovsd .LC12(%rip), %xmm14
vmovq %rax, %xmm2
vmovq %rax, %xmm0
movq .LC9(%rip), %rax
see how we load .LC8 into %rax just to move it into xmm2 and xmm0, instead
of at least moving xmm2 to xmm0 (maybe that's for cprop_hardreg now) or
loading directly into xmm0.
In the end register pressure is the main issue but how we deal with it
is bad. It's likely caused by a combination of PRE & hoisting & sinking
which together exploit
if( ((*((unsigned int*) ((void*) (&((((srcGrid)[((FLAGS)+N_CELL_ENTRIES*((0)+
(0)*(1*(100))+(0)*(1*(100))*(1*(100))))+(i)]))))))) & (ACCEL))) {
ux = 0.005;
uy = 0.002;
uz = 0.000;
}
which makes the following computations partly compile-time resolvable.
I still think the above code generation issues need to be analyzed and
we should figure out why we emit this weird code under register pressure.
I'll attach a testcase that has the function in question split out for
easier analysis.
* [Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
` (15 preceding siblings ...)
2022-01-27 10:13 ` rguenth at gcc dot gnu.org
@ 2022-01-27 10:14 ` rguenth at gcc dot gnu.org
2022-01-27 10:23 ` hubicka at kam dot mff.cuni.cz
` (22 subsequent siblings)
39 siblings, 0 replies; 43+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-01-27 10:14 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
--- Comment #15 from Richard Biener <rguenth at gcc dot gnu.org> ---
Created attachment 52300
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52300&action=edit
LBM_performStreamCollide testcase
This is the relevant function.
* [Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
` (16 preceding siblings ...)
2022-01-27 10:14 ` rguenth at gcc dot gnu.org
@ 2022-01-27 10:23 ` hubicka at kam dot mff.cuni.cz
2022-01-27 10:32 ` rguenth at gcc dot gnu.org
` (21 subsequent siblings)
39 siblings, 0 replies; 43+ messages in thread
From: hubicka at kam dot mff.cuni.cz @ 2022-01-27 10:23 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
--- Comment #16 from hubicka at kam dot mff.cuni.cz ---
>
> Yep, we also have code like
>
> - movabsq $0x3ff03db8fde2ef4e, %r8
> ...
> - vmovq %r8, %xmm11
It is loading a random constant to xmm11. Since reg<->xmm moves are
relatively cheap it looks OK to me that we generate this. Is it faster
to load the constant from memory?
> movq .LC11(%rip), %rax
> vmovq %rax, %xmm14
This is odd indeed, and even more odd that we use both a movabs and a memory
load... The i386 FE plays some games with allowing some constants in SSE
instructions (to allow simplification and combining) and splits them out
to memory later. It may be a consequence of this.
* [Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
` (17 preceding siblings ...)
2022-01-27 10:23 ` hubicka at kam dot mff.cuni.cz
@ 2022-01-27 10:32 ` rguenth at gcc dot gnu.org
2022-01-27 11:18 ` rguenth at gcc dot gnu.org
` (20 subsequent siblings)
39 siblings, 0 replies; 43+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-01-27 10:32 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
--- Comment #17 from Richard Biener <rguenth at gcc dot gnu.org> ---
So in .reload we have (with unpatched trunk)
401: NOTE_INSN_BASIC_BLOCK 6
462: ax:DF=[`*.LC0']
REG_EQUAL 9.850689999999999724167309977929107844829559326171875e-1
407: xmm2:DF=ax:DF
463: ax:DF=[`*.LC0']
REG_EQUAL 9.850689999999999724167309977929107844829559326171875e-1
408: xmm4:DF=ax:DF
why??! We can load .LC0 into xmm4 directly. IRA sees
401: NOTE_INSN_BASIC_BLOCK 6
407: r118:DF=r482:DF
408: r119:DF=r482:DF
now I cannot really decipher IRA or LRA dumps but my guess would be that
inheritance (causing us to load from LC0) interferes badly with register
class assignment?
Changing pseudo 482 in operand 1 of insn 407 on equiv
9.850689999999999724167309977929107844829559326171875e-1
...
alt=21,overall=9,losers=1,rld_nregs=1
Choosing alt 21 in insn 407: (0) v (1) r {*movdf_internal}
Creating newreg=525, assigning class GENERAL_REGS to r525
407: r118:DF=r525:DF
Inserting insn reload before:
462: r525:DF=[`*.LC0']
REG_EQUAL 9.850689999999999724167309977929107844829559326171875e-1
we should have preferred alt 14 I think (0) v (1) m, but that has
alt=14,overall=13,losers=1,rld_nregs=0
0 Spill pseudo into memory: reject+=3
Using memory insn operand 0: reject+=3
0 Non input pseudo reload: reject++
1 Non-pseudo reload: reject+=2
1 Non input pseudo reload: reject++
alt=15,overall=28,losers=3 -- refuse
0 Costly set: reject++
alt=16: Bad operand -- refuse
0 Costly set: reject++
1 Costly loser: reject++
1 Non-pseudo reload: reject+=2
1 Non input pseudo reload: reject++
alt=17,overall=17,losers=2 -- refuse
0 Costly set: reject++
1 Spill Non-pseudo into memory: reject+=3
Using memory insn operand 1: reject+=3
1 Non input pseudo reload: reject++
alt=18,overall=14,losers=1 -- refuse
0 Spill pseudo into memory: reject+=3
Using memory insn operand 0: reject+=3
0 Non input pseudo reload: reject++
1 Costly loser: reject++
1 Non-pseudo reload: reject+=2
1 Non input pseudo reload: reject++
alt=19,overall=29,losers=3 -- refuse
0 Non-prefered reload: reject+=600
0 Non input pseudo reload: reject++
alt=20,overall=607,losers=1 -- refuse
1 Non-pseudo reload: reject+=2
1 Non input pseudo reload: reject++
I can't fully decipher the reasoning, but it doesn't seem to anticipate
the cost of reloading the GPR in the alternative it chooses.
Vlad?
^ permalink raw reply [flat|nested] 43+ messages in thread
* [Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
` (18 preceding siblings ...)
2022-01-27 10:32 ` rguenth at gcc dot gnu.org
@ 2022-01-27 11:18 ` rguenth at gcc dot gnu.org
2022-01-27 11:30 ` rguenther at suse dot de
` (19 subsequent siblings)
39 siblings, 0 replies; 43+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-01-27 11:18 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
--- Comment #18 from Richard Biener <rguenth at gcc dot gnu.org> ---
For the case of LBM, what also helps is disabling PRE or using PGO (which
sees the useless PRE), given that the path on which the expressions become
partially compile-time computable is never taken at runtime. In theory we
could isolate that path completely via -ftracer, but the pass pipeline
setup doesn't look optimal.
^ permalink raw reply [flat|nested] 43+ messages in thread
* [Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
` (19 preceding siblings ...)
2022-01-27 11:18 ` rguenth at gcc dot gnu.org
@ 2022-01-27 11:30 ` rguenther at suse dot de
2022-01-27 11:33 ` rguenther at suse dot de
` (18 subsequent siblings)
39 siblings, 0 replies; 43+ messages in thread
From: rguenther at suse dot de @ 2022-01-27 11:30 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
--- Comment #19 from rguenther at suse dot de <rguenther at suse dot de> ---
On Thu, 27 Jan 2022, hubicka at kam dot mff.cuni.cz wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
>
> --- Comment #13 from hubicka at kam dot mff.cuni.cz ---
> > > According to znver2_cost
> > >
> > > Cost of sse_to_integer is a little bit less than fp_store, maybe increase
> > > sse_to_integer cost(more than fp_store) can helps RA to choose memory
> > > instead of GPR.
> >
> > That sounds reasonable - GPR<->xmm is cheaper than GPR -> stack -> xmm
> > but GPR<->xmm should be more expensive than GPR/xmm<->stack. As said above
> > Zen2 can do reg -> mem, mem -> reg via renaming if 'mem' is somewhat special,
> > but modeling that doesn't seem to be necessary.
> >
> > We seem to have store costs of 8 and load costs of 6, I'll try bumping the
> > gpr<->xmm move cost to 8.
>
> I was simply following latencies here, so indeed reg<->mem bypass is not
> really modelled. I recall doing few experiments which was kind of
> inconclusive.
Yes, I think xmm->gpr->xmm vs. xmm->mem->xmm isn't really the issue here;
it's mem->gpr->xmm vs. mem->xmm with all the constant pool remats.
Agner lists a latency of 3 for gpr<->xmm and a latency of 4 for mem<->xmm,
but then there's store forwarding (and the clever renaming trick), which
likely makes xmm->mem->xmm cheaper than 4 + 4, while xmm->gpr->xmm will
really be 3 + 3 latency. gpr<->xmm moves also seem to be more resource
constrained.
In any case, for moving xmm to gpr it doesn't make sense to go through
memory, but it doesn't seem worthwhile to spill to xmm or gpr when we only
use gpr / xmm later.
This leaves aside the odd and bogus code we generate for the .LC0
rematerialization, which we should fix; fixing it will likely fix LBM.
^ permalink raw reply [flat|nested] 43+ messages in thread
* [Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
` (20 preceding siblings ...)
2022-01-27 11:30 ` rguenther at suse dot de
@ 2022-01-27 11:33 ` rguenther at suse dot de
2022-01-27 12:04 ` Jan Hubicka
2022-01-27 12:04 ` hubicka at kam dot mff.cuni.cz
` (17 subsequent siblings)
39 siblings, 1 reply; 43+ messages in thread
From: rguenther at suse dot de @ 2022-01-27 11:33 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
--- Comment #20 from rguenther at suse dot de <rguenther at suse dot de> ---
On Thu, 27 Jan 2022, hubicka at kam dot mff.cuni.cz wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
>
> --- Comment #16 from hubicka at kam dot mff.cuni.cz ---
> >
> > Yep, we also have code like
> >
> > - movabsq $0x3ff03db8fde2ef4e, %r8
> > ...
> > - vmovq %r8, %xmm11
>
> It is loading a random constant into xmm11. Since reg<->xmm moves are
> relatively cheap, it looks OK to me that we generate this. Is it faster
> to load the constant from memory?
I would say so. It saves code size and also uop space, unless the two
can magically fuse into an immediate-to-%xmm move (I doubt that).
> > movq .LC11(%rip), %rax
> > vmovq %rax, %xmm14
> This is odd indeed, and it is even odder that we do both a movabs and a
> memory load...
> The i386 FE plays some games with allowing some constants in SSE
> instructions (to enable simplification and combining) and splits them out
> to memory later. It may be a consequence of this.
I've pasted the LRA dump pieces I think are relevant, but I don't
understand them. The constant load isn't visible originally but is
introduced by LRA, so that may be the key to the mystery here.
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2022-01-27 11:33 ` rguenther at suse dot de
@ 2022-01-27 12:04 ` Jan Hubicka
0 siblings, 0 replies; 43+ messages in thread
From: Jan Hubicka @ 2022-01-27 12:04 UTC (permalink / raw)
To: rguenther at suse dot de; +Cc: gcc-bugs
> I would say so. It saves code size and also uop space unless the two
> can magically fuse to a immediate to %xmm move (I doubt that).
I made a simple benchmark:
double a=10;
int
main()
{
long int i;
double sum,val1,val2,val3,val4;
for (i=0;i<1000000000;i++)
{
#if 1
#if 1
asm __volatile__("movabsq $0x3ff03db8fde2ef4e, %%r8;vmovq %%r8, %0": "=x"(val1): :"r8","xmm11");
asm __volatile__("movabsq $0x3ff03db8fde2ef4e, %%r8;vmovq %%r8, %0": "=x"(val2): :"r8","xmm11");
asm __volatile__("movabsq $0x3ff03db8fde2ef4e, %%r8;vmovq %%r8, %0": "=x"(val3): :"r8","xmm11");
asm __volatile__("movabsq $0x3ff03db8fde2ef4e, %%r8;vmovq %%r8, %0": "=x"(val4): :"r8","xmm11");
#else
asm __volatile__("movq %1, %%r8;vmovq %%r8, %0": "=x"(val1):"m"(a) :"r8","xmm11");
asm __volatile__("movq %1, %%r8;vmovq %%r8, %0": "=x"(val2):"m"(a) :"r8","xmm11");
asm __volatile__("movq %1, %%r8;vmovq %%r8, %0": "=x"(val3):"m"(a) :"r8","xmm11");
asm __volatile__("movq %1, %%r8;vmovq %%r8, %0": "=x"(val4):"m"(a) :"r8","xmm11");
#endif
#else
asm __volatile__("vmovq %1, %0": "=x"(val1):"m"(a) :"r8","xmm11");
asm __volatile__("vmovq %1, %0": "=x"(val2):"m"(a) :"r8","xmm11");
asm __volatile__("vmovq %1, %0": "=x"(val3):"m"(a) :"r8","xmm11");
asm __volatile__("vmovq %1, %0": "=x"(val4):"m"(a) :"r8","xmm11");
#endif
sum+=val1+val2+val3+val4;
}
return sum;
}
and indeed the third variant runs in 1.2s while the first two both take
2.4s on my zen2 laptop.
^ permalink raw reply [flat|nested] 43+ messages in thread
* [Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
` (21 preceding siblings ...)
2022-01-27 11:33 ` rguenther at suse dot de
@ 2022-01-27 12:04 ` hubicka at kam dot mff.cuni.cz
2022-01-27 13:42 ` hjl.tools at gmail dot com
` (16 subsequent siblings)
39 siblings, 0 replies; 43+ messages in thread
From: hubicka at kam dot mff.cuni.cz @ 2022-01-27 12:04 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
--- Comment #21 from hubicka at kam dot mff.cuni.cz ---
> I would say so. It saves code size and also uop space, unless the two
> can magically fuse into an immediate-to-%xmm move (I doubt that).
I made a simple benchmark:
double a=10;
int
main()
{
long int i;
double sum,val1,val2,val3,val4;
for (i=0;i<1000000000;i++)
{
#if 1
#if 1
asm __volatile__("movabsq $0x3ff03db8fde2ef4e, %%r8;vmovq %%r8, %0": "=x"(val1): :"r8","xmm11");
asm __volatile__("movabsq $0x3ff03db8fde2ef4e, %%r8;vmovq %%r8, %0": "=x"(val2): :"r8","xmm11");
asm __volatile__("movabsq $0x3ff03db8fde2ef4e, %%r8;vmovq %%r8, %0": "=x"(val3): :"r8","xmm11");
asm __volatile__("movabsq $0x3ff03db8fde2ef4e, %%r8;vmovq %%r8, %0": "=x"(val4): :"r8","xmm11");
#else
asm __volatile__("movq %1, %%r8;vmovq %%r8, %0": "=x"(val1):"m"(a) :"r8","xmm11");
asm __volatile__("movq %1, %%r8;vmovq %%r8, %0": "=x"(val2):"m"(a) :"r8","xmm11");
asm __volatile__("movq %1, %%r8;vmovq %%r8, %0": "=x"(val3):"m"(a) :"r8","xmm11");
asm __volatile__("movq %1, %%r8;vmovq %%r8, %0": "=x"(val4):"m"(a) :"r8","xmm11");
#endif
#else
asm __volatile__("vmovq %1, %0": "=x"(val1):"m"(a) :"r8","xmm11");
asm __volatile__("vmovq %1, %0": "=x"(val2):"m"(a) :"r8","xmm11");
asm __volatile__("vmovq %1, %0": "=x"(val3):"m"(a) :"r8","xmm11");
asm __volatile__("vmovq %1, %0": "=x"(val4):"m"(a) :"r8","xmm11");
#endif
sum+=val1+val2+val3+val4;
}
return sum;
}
and indeed the third variant runs in 1.2s while the first two both take
2.4s on my zen2 laptop.
^ permalink raw reply [flat|nested] 43+ messages in thread
* [Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
` (22 preceding siblings ...)
2022-01-27 12:04 ` hubicka at kam dot mff.cuni.cz
@ 2022-01-27 13:42 ` hjl.tools at gmail dot com
2022-01-27 14:24 ` rguenth at gcc dot gnu.org
` (15 subsequent siblings)
39 siblings, 0 replies; 43+ messages in thread
From: hjl.tools at gmail dot com @ 2022-01-27 13:42 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
H.J. Lu <hjl.tools at gmail dot com> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |WAITING
--- Comment #22 from H.J. Lu <hjl.tools at gmail dot com> ---
Is this related to PR 104059? Can you try:
https://gcc.gnu.org/pipermail/gcc-patches/2022-January/589209.html
^ permalink raw reply [flat|nested] 43+ messages in thread
* [Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
` (23 preceding siblings ...)
2022-01-27 13:42 ` hjl.tools at gmail dot com
@ 2022-01-27 14:24 ` rguenth at gcc dot gnu.org
2022-01-27 16:28 ` crazylht at gmail dot com
` (14 subsequent siblings)
39 siblings, 0 replies; 43+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-01-27 14:24 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|WAITING |NEW
--- Comment #23 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to H.J. Lu from comment #22)
> Is this related to PR 104059? Can you try:
>
> https://gcc.gnu.org/pipermail/gcc-patches/2022-January/589209.html
I verified that hardreg cprop does nothing on the testcase, so no.
^ permalink raw reply [flat|nested] 43+ messages in thread
* [Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
` (24 preceding siblings ...)
2022-01-27 14:24 ` rguenth at gcc dot gnu.org
@ 2022-01-27 16:28 ` crazylht at gmail dot com
2022-01-27 16:36 ` crazylht at gmail dot com
` (13 subsequent siblings)
39 siblings, 0 replies; 43+ messages in thread
From: crazylht at gmail dot com @ 2022-01-27 16:28 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
--- Comment #24 from Hongtao.liu <crazylht at gmail dot com> ---
for
vmovq %rdi, %xmm7 # 503 [c=4 l=4] *movdf_internal/21
..
vmulsd %xmm7, %xmm4, %xmm5 # 320 [c=12 l=4] *fop_df_comm/2
..
movabsq $0x3fef85af6c69b5a6, %rdi # 409 [c=5 l=10]
*movdf_internal/11
and
7806(insn 320 319 322 8 (set (reg:DF 441)
7807 (mult:DF (reg:DF 166 [ _323 ])
7808 (reg:DF 249 [ _900 ]))) "../test.c":87:218 1072 {*fop_df_comm}
7809 (expr_list:REG_DEAD (reg:DF 249 [ _900 ])
7810 (nil)))
RA allocates rdi for r249 because the cost of a general reg is cheaper than mem.
a66(r249,l1) costs: AREG:5964,5964 DREG:5964,5964 CREG:5964,5964
BREG:5964,5964 SIREG:5964,5964 DIREG:5964,5964 AD_REGS:5964,5964
CLOBBERED_REGS:5964,5964 Q_REGS:5964,5964 NON_Q_REGS:5964,5964
TLS_GOTBASE_REGS:5964,5964 GENERAL_REGS:5964,5964 FP_TOP_REG:19546,19546
FP_SECOND_REG:19546,19546 FLOAT_REGS:19546,19546 SSE_FIRST_REG:0,0
NO_REX_SSE_REGS:0,0 SSE_REGS:0,0 FLOAT_SSE_REGS:19546,19546
FLOAT_INT_REGS:19546,19546 INT_SSE_REGS:19546,19546
FLOAT_INT_SSE_REGS:19546,19546 MEM:6294,6294
950 r249: preferred SSE_REGS, alternative GENERAL_REGS, allocno
INT_SSE_REGS
Disposition:
66:r249 l1 5
With -mtune=alderlake, the cost of general regs for r249 is more expensive
than mem, so RA allocates memory for it, and then no movabsq/vmovq is needed.
655 a66(r249,l1) costs: AREG:5964,5964 DREG:5964,5964 CREG:5964,5964
BREG:5964,5964 SIREG:5964,5964 DIREG:5964,5964 AD_REGS:5964,5964
CLOBBERED_REGS:5964,5964 Q_REGS:5964,5964 NO\
N_Q_REGS:5964,5964 TLS_GOTBASE_REGS:5964,5964 GENERAL_REGS:5964,5964
FP_TOP_REG:14908,14908 FP_SECOND_REG:14908,14908 FLOAT_REGS:14908,14908
SSE_FIRST_REG:0,0 NO_REX_SSE_REGS:0\
,0 SSE_REGS:0,0 FLOAT_SSE_REGS:14908,14908 FLOAT_INT_REGS:14908,14908
INT_SSE_REGS:14908,14908 FLOAT_INT_SSE_REGS:14908,14908 MEM:5632,5632
950 r249: preferred SSE_REGS, alternative NO_REGS, allocno SSE_REGS
66:r249 l1 mem
vmulsd -80(%rsp), %xmm2, %xmm3 # 320 [c=29 l=6] *fop_df_comm/2
I guess we need to let the RA know that mem cost is cheaper than a GPR for r249.
^ permalink raw reply [flat|nested] 43+ messages in thread
* [Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
` (25 preceding siblings ...)
2022-01-27 16:28 ` crazylht at gmail dot com
@ 2022-01-27 16:36 ` crazylht at gmail dot com
2022-01-28 15:48 ` vmakarov at gcc dot gnu.org
` (12 subsequent siblings)
39 siblings, 0 replies; 43+ messages in thread
From: crazylht at gmail dot com @ 2022-01-27 16:36 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
--- Comment #25 from Hongtao.liu <crazylht at gmail dot com> ---
> Guess we need to let RA know mem cost is cheaper than GPR for r249.
Reduce sse_store cost?
^ permalink raw reply [flat|nested] 43+ messages in thread
* [Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
` (26 preceding siblings ...)
2022-01-27 16:36 ` crazylht at gmail dot com
@ 2022-01-28 15:48 ` vmakarov at gcc dot gnu.org
2022-01-28 16:02 ` vmakarov at gcc dot gnu.org
` (11 subsequent siblings)
39 siblings, 0 replies; 43+ messages in thread
From: vmakarov at gcc dot gnu.org @ 2022-01-28 15:48 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
--- Comment #26 from Vladimir Makarov <vmakarov at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #7)
> make costs in a way that IRA/LRA prefer re-materialization of constants
> from the constant pool over spilling to GPRs (if that's possible at all -
> Vlad?)
LRA rematerialization cannot rematerialize a constant value from the
constant pool. It can only rematerialize the value of an expression
consisting of other pseudos (currently assigned to hard regs) and
constants.
I guess the rematerialization pass could be extended to work for constants
from the constant pool. It is a pretty doable project, as opposed to
rematerialization of arbitrary memory, which would require a lot of
analysis (including aliasing) and complicated cost-benefit calculation.
Maybe somebody could pick this project up.
^ permalink raw reply [flat|nested] 43+ messages in thread
* [Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
` (27 preceding siblings ...)
2022-01-28 15:48 ` vmakarov at gcc dot gnu.org
@ 2022-01-28 16:02 ` vmakarov at gcc dot gnu.org
2022-02-09 15:51 ` vmakarov at gcc dot gnu.org
` (10 subsequent siblings)
39 siblings, 0 replies; 43+ messages in thread
From: vmakarov at gcc dot gnu.org @ 2022-01-28 16:02 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
--- Comment #27 from Vladimir Makarov <vmakarov at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #17)
> So in .reload we have (with unpatched trunk)
>
> 401: NOTE_INSN_BASIC_BLOCK 6
> 462: ax:DF=[`*.LC0']
> REG_EQUAL 9.850689999999999724167309977929107844829559326171875e-1
> 407: xmm2:DF=ax:DF
> 463: ax:DF=[`*.LC0']
> REG_EQUAL 9.850689999999999724167309977929107844829559326171875e-1
> 408: xmm4:DF=ax:DF
>
> why??! We can load .LC0 into xmm4 directly. IRA sees
>
> 401: NOTE_INSN_BASIC_BLOCK 6
> 407: r118:DF=r482:DF
> 408: r119:DF=r482:DF
>
> now I cannot really decipher IRA or LRA dumps but my guess would be that
> inheritance (causing us to load from LC0) interferes badly with register
> class assignment?
>
> Changing pseudo 482 in operand 1 of insn 407 on equiv
> 9.850689999999999724167309977929107844829559326171875e-1
> ...
> alt=21,overall=9,losers=1,rld_nregs=1
> Choosing alt 21 in insn 407: (0) v (1) r {*movdf_internal}
> Creating newreg=525, assigning class GENERAL_REGS to r525
> 407: r118:DF=r525:DF
> Inserting insn reload before:
> 462: r525:DF=[`*.LC0']
> REG_EQUAL 9.850689999999999724167309977929107844829559326171875e-1
>
> we should have preferred alt 14 I think (0) v (1) m, but that has
>
> alt=14,overall=13,losers=1,rld_nregs=0
> 0 Spill pseudo into memory: reject+=3
> Using memory insn operand 0: reject+=3
> 0 Non input pseudo reload: reject++
> 1 Non-pseudo reload: reject+=2
> 1 Non input pseudo reload: reject++
> alt=15,overall=28,losers=3 -- refuse
> 0 Costly set: reject++
> alt=16: Bad operand -- refuse
> 0 Costly set: reject++
> 1 Costly loser: reject++
> 1 Non-pseudo reload: reject+=2
> 1 Non input pseudo reload: reject++
> alt=17,overall=17,losers=2 -- refuse
> 0 Costly set: reject++
> 1 Spill Non-pseudo into memory: reject+=3
> Using memory insn operand 1: reject+=3
> 1 Non input pseudo reload: reject++
> alt=18,overall=14,losers=1 -- refuse
> 0 Spill pseudo into memory: reject+=3
> Using memory insn operand 0: reject+=3
> 0 Non input pseudo reload: reject++
> 1 Costly loser: reject++
> 1 Non-pseudo reload: reject+=2
> 1 Non input pseudo reload: reject++
> alt=19,overall=29,losers=3 -- refuse
> 0 Non-prefered reload: reject+=600
> 0 Non input pseudo reload: reject++
> alt=20,overall=607,losers=1 -- refuse
> 1 Non-pseudo reload: reject+=2
> 1 Non input pseudo reload: reject++
>
> I can't fully decipher the reasoning, but it doesn't seem to anticipate
> the cost of reloading the GPR in the alternative it chooses.
>
> Vlad?
All these diagnostics are just a description of the voodoo from the old
reload pass. LRA chooses alternatives the same way as the old reload pass
(I doubt any other approach would avoid breaking all existing targets);
the old reload pass simply does not report its decisions in the dump.
The LRA code choosing the insn alternatives
(lra-constraints.cc::process_alt_operands), like the old reload pass, does
not use any memory or register move costs. Instead, the alternative is
chosen by heuristics and insn constraint hints (like ? and !). The only
case where these costs are used is when we have reg:=reg and the register
move cost is 2; in that case LRA (like reload) does not bother to check
the insn constraints.
^ permalink raw reply [flat|nested] 43+ messages in thread
* [Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
` (28 preceding siblings ...)
2022-01-28 16:02 ` vmakarov at gcc dot gnu.org
@ 2022-02-09 15:51 ` vmakarov at gcc dot gnu.org
2022-02-10 7:45 ` rguenth at gcc dot gnu.org
` (9 subsequent siblings)
39 siblings, 0 replies; 43+ messages in thread
From: vmakarov at gcc dot gnu.org @ 2022-02-09 15:51 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
--- Comment #28 from Vladimir Makarov <vmakarov at gcc dot gnu.org> ---
Could somebody benchmark the following patch on zen2 470.lbm.
diff --git a/gcc/lra-constraints.cc b/gcc/lra-constraints.cc
index 9cee17479ba..76619aca8eb 100644
--- a/gcc/lra-constraints.cc
+++ b/gcc/lra-constraints.cc
@@ -5084,7 +5089,9 @@ lra_constraints (bool first_p)
(x, lra_get_allocno_class (i)) == NO_REGS))
|| contains_symbol_ref_p (x))))
ira_reg_equiv[i].defined_p = false;
- if (contains_reg_p (x, false, true))
+ if (contains_reg_p (x, false, true)
+ || (CONST_DOUBLE_P (x)
+ && maybe_ge (GET_MODE_SIZE (GET_MODE (x)), 8)))
ira_reg_equiv[i].profitable_p = false;
if (get_equiv (reg) != reg)
bitmap_ior_into (equiv_insn_bitmap,
&lra_reg_info[i].insn_bitmap);
If it improves the performance, I'll commit this patch.
The expander unconditionally uses the constant pool for double constants.
I think analogous treatment could be applied to equivalent double
constants in LRA.
As far as I know, only x86_64 permits 64-bit constants as immediates for
moving them into general regs. As double FP operations are not done in
general regs in most cases, the constants have to be moved into FP regs,
and this is costly, as Jan wrote. So it makes sense to prohibit using
equivalent double constant values in LRA unconditionally. If in the
future we have a target which can move a double immediate into FP regs,
we can introduce a target hook to deal with equivalent double constants.
But right now I think there is no need for a new hook.
^ permalink raw reply [flat|nested] 43+ messages in thread
* [Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
` (29 preceding siblings ...)
2022-02-09 15:51 ` vmakarov at gcc dot gnu.org
@ 2022-02-10 7:45 ` rguenth at gcc dot gnu.org
2022-02-10 15:17 ` vmakarov at gcc dot gnu.org
` (8 subsequent siblings)
39 siblings, 0 replies; 43+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-02-10 7:45 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
--- Comment #29 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Vladimir Makarov from comment #28)
> Could somebody benchmark the following patch on zen2 470.lbm.
>
> diff --git a/gcc/lra-constraints.cc b/gcc/lra-constraints.cc
> index 9cee17479ba..76619aca8eb 100644
> --- a/gcc/lra-constraints.cc
> +++ b/gcc/lra-constraints.cc
> @@ -5084,7 +5089,9 @@ lra_constraints (bool first_p)
> (x, lra_get_allocno_class (i)) == NO_REGS))
> || contains_symbol_ref_p (x))))
> ira_reg_equiv[i].defined_p = false;
> - if (contains_reg_p (x, false, true))
> + if (contains_reg_p (x, false, true)
> + || (CONST_DOUBLE_P (x)
> + && maybe_ge (GET_MODE_SIZE (GET_MODE (x)), 8)))
> ira_reg_equiv[i].profitable_p = false;
> if (get_equiv (reg) != reg)
> bitmap_ior_into (equiv_insn_bitmap,
> &lra_reg_info[i].insn_bitmap);
>
> If it improves the performance, I'll commit this patch.
>
> The expander unconditionally uses memory pool for double constants. I think
> the analogous treatment could be done for equiv double constants in LRA.
>
> I know only x86_64 permits 64-bit constants as immediate for moving them
> into general regs. As double fp operations is not done in general regs in
> the most cases, they should be moved into fp regs and this is costly as Jan
> wrote. So it has sense to prohibit using equiv double constant values in
> LRA unconditionally. If in the future we have a target which can move
> double immediate into fp regs we can introduce some target hooks to deal
> with equiv double constant. But right now I think there is no need for the
> new hook.
Code generation changes quite a bit, with the patch the offending function
is 16 bytes larger. I see no large immediate moves to GPRs anymore but
there is still a lot of spilling of XMMs to GPRs. Performance is
unchanged by the patch:
470.lbm 13740 128 107 S 13740 128 107 S
470.lbm 13740 128 107 * 13740 128 107 S
470.lbm 13740 128 107 S 13740 128 107 *
I've used
diff --git a/gcc/lra-constraints.cc b/gcc/lra-constraints.cc
index 9cee17479ba..a0ec608c056 100644
--- a/gcc/lra-constraints.cc
+++ b/gcc/lra-constraints.cc
@@ -5084,7 +5084,9 @@ lra_constraints (bool first_p)
(x, lra_get_allocno_class (i)) == NO_REGS))
|| contains_symbol_ref_p (x))))
ira_reg_equiv[i].defined_p = false;
- if (contains_reg_p (x, false, true))
+ if (contains_reg_p (x, false, true)
+ || (CONST_DOUBLE_P (x)
+ && maybe_ge (GET_MODE_SIZE (GET_MODE (x)),
UNITS_PER_WORD)))
ira_reg_equiv[i].profitable_p = false;
if (get_equiv (reg) != reg)
bitmap_ior_into (equiv_insn_bitmap,
&lra_reg_info[i].insn_bitmap);
note UNITS_PER_WORD vs. literal 8.
Without knowing much of the code, I wonder if we can check whether the move
will be to a reg in GENERAL_REGS? That is, do we know whether there are
(besides some special constants like zero) immediate moves to the
destination register class?
That said, given the result on LBM I'd not change this at this point.
Honza wanted to look at the move pattern to try to mitigate the
GPR spilling of XMMs.
I do think that we need to take costs into account at some point and get
rid of the reload style hand-waving with !?* in the move patterns.
^ permalink raw reply [flat|nested] 43+ messages in thread
* [Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
` (30 preceding siblings ...)
2022-02-10 7:45 ` rguenth at gcc dot gnu.org
@ 2022-02-10 15:17 ` vmakarov at gcc dot gnu.org
2022-04-11 13:04 ` rguenth at gcc dot gnu.org
` (7 subsequent siblings)
39 siblings, 0 replies; 43+ messages in thread
From: vmakarov at gcc dot gnu.org @ 2022-02-10 15:17 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
--- Comment #30 from Vladimir Makarov <vmakarov at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #29)
> (In reply to Vladimir Makarov from comment #28)
> > Could somebody benchmark the following patch on zen2 470.lbm.
>
> Code generation changes quite a bit, with the patch the offending function
> is 16 bytes larger. I see no large immediate moves to GPRs anymore but
> there is still a lot of spilling of XMMs to GPRs. Performance is
> unchanged by the patch:
>
> 470.lbm 13740 128 107 S 13740 128 107 S
> 470.lbm 13740 128 107 * 13740 128 107 S
> 470.lbm 13740 128 107 S 13740 128 107 *
>
>
Thank you very much for testing the patch, Richard. The results mean the
patch is a no-go to me.
> Without knowing much of the code I wonder if we can check whether the move
> will be to a reg in GENERAL_REGS? That is, do we know whether there are
> (besides some special constants like zero), immediate moves to the
> destination register class?
>
There is no such info from the target code. Ideally we need to have the
cost of loading a *particular* immediate value into a register class on
the same cost basis as load/store. Still, to use this info efficiently,
the choice of alternatives should be based on costs, not on hints and
some machine-independent general heuristics (as now).
> That said, given the result on LBM I'd not change this at this point.
>
> Honza wanted to look at the move pattern to try to mitigate the
> GPR spilling of XMMs.
>
> I do think that we need to take costs into account at some point and get
> rid of the reload style hand-waving with !?* in the move patterns.
In general I agree with the direction, but it will be quite hard to do. I
know this well from my experience changing the register class cost
calculation algorithm in IRA (the experimental code can be found on the
ira-select branch). I expect a huge number of test failures and some
benchmark performance degradations for practically any target, and a big
involvement of target maintainers to fix them. Although it is possible to
try to do this for one target at a time.
^ permalink raw reply [flat|nested] 43+ messages in thread
* [Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
` (31 preceding siblings ...)
2022-02-10 15:17 ` vmakarov at gcc dot gnu.org
@ 2022-04-11 13:04 ` rguenth at gcc dot gnu.org
2022-04-25 9:45 ` rguenth at gcc dot gnu.org
` (6 subsequent siblings)
39 siblings, 0 replies; 43+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-04-11 13:04 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Last reconfirmed|2021-09-03 00:00:00 |2022-04-11
--- Comment #31 from Richard Biener <rguenth at gcc dot gnu.org> ---
For the testcase I still see 58 vmovq GPR <-> XMM at -Ofast -march=znver2,
resulting from spilling of xmms. And there's still cases like
movq .LC0(%rip), %rax
...
vmovq %rax, %xmm2
vmovq %rax, %xmm4
Honza - you said you wanted to take a look here. As I understand from Vlad,
costing isn't taken into account when choosing alternatives for reloading.
* [Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
` (32 preceding siblings ...)
2022-04-11 13:04 ` rguenth at gcc dot gnu.org
@ 2022-04-25 9:45 ` rguenth at gcc dot gnu.org
2022-04-25 12:52 ` rguenth at gcc dot gnu.org
` (5 subsequent siblings)
39 siblings, 0 replies; 43+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-04-25 9:45 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
--- Comment #32 from Richard Biener <rguenth at gcc dot gnu.org> ---
So the bad "head" can be fixed via
diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index c74edd1aaef..8f9f26e0a82 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -3580,9 +3580,9 @@
 ;; Possible store forwarding (partial memory) stall in alternatives 4, 6 and 7.
 (define_insn "*movdf_internal"
   [(set (match_operand:DF 0 "nonimmediate_operand"
-    "=Yf*f,m ,Yf*f,?r ,!o,?*r ,!o,!o,?r,?m,?r,?r,v,v,v,m,*x,*x,*x,m ,r ,v,r ,o ,r ,m")
+    "=Yf*f,m ,Yf*f,?r ,!o,?*r ,!o,!o,?r,?m,?r,?r,v,v,v,m,*x,*x,*x,m ,!r,!v,r ,o ,r ,m")
	(match_operand:DF 1 "general_operand"
-    "Yf*fm,Yf*f,G ,roF,r ,*roF,*r,F ,rm,rC,C ,F ,C,v,m,v,C ,*x,m ,*x,v,r ,roF,rF,rmF,rC"))]
+    "Yf*fm,Yf*f,G ,roF,r ,*roF,*r,F ,rm,rC,C ,F ,C,v,m,v,C ,*x,m ,*x,!v,!r,roF,rF,rmF,rC"))]
   "!(MEM_P (operands[0]) && MEM_P (operands[1]))
    && (lra_in_progress || reload_completed
        || !CONST_DOUBLE_P (operands[1])
which is adding ! to r<->v alternatives. That should eventually be done
by duplicating the alternatives and enabling one set via some enable
attribute based on some tunable. I see those alternatives are already
(set (attr "preferred_for_speed")
(cond [(eq_attr "alternative" "3,4")
(symbol_ref "TARGET_INTEGER_DFMODE_MOVES")
(eq_attr "alternative" "20")
(symbol_ref "TARGET_INTER_UNIT_MOVES_FROM_VEC")
(eq_attr "alternative" "21")
(symbol_ref "TARGET_INTER_UNIT_MOVES_TO_VEC")
]
(symbol_ref "true")))
not sure why it's preferred_for_speed here though - shouldn't that be
enabled for size if !TARGET_INTER_UNIT_MOVES_{TO,FROM}_VEC and otherwise
disabled? Not sure if combining enabled and preferred_for_speed is
reasonably possible, but we have a preferred_for_size attribute here.
The diff with the ! added is quite short; I have yet to measure any
effect on LBM:
--- streamcollide.s.orig 2022-04-25 11:37:01.638733951 +0200
+++ streamcollide.s2 2022-04-25 11:35:54.885849296 +0200
@@ -33,28 +33,24 @@
.p2align 4
.p2align 3
.L12:
- movq .LC0(%rip), %rax
- vmovsd .LC4(%rip), %xmm6
+ vmovsd .LC0(%rip), %xmm2
+ vmovsd .LC1(%rip), %xmm13
+ movabsq $0x3ff01878b7a1c25d, %rax
movabsq $0x3fef85af6c69b5a6, %rdi
+ vmovsd .LC2(%rip), %xmm12
+ vmovsd .LC3(%rip), %xmm14
movabsq $0x3ff03db8fde2ef4e, %r8
+ movabsq $0x3fefcea39c51dabe, %r9
+ vmovsd .LC4(%rip), %xmm6
vmovsd .LC5(%rip), %xmm7
movq .LC8(%rip), %r11
- movabsq $0x3fefcea39c51dabe, %r9
movq .LC6(%rip), %rdx
movq .LC7(%rip), %rcx
- vmovq %rax, %xmm2
- vmovq %rax, %xmm4
- movq .LC1(%rip), %rax
movq %r11, %rsi
movq %r11, %r12
- vmovq %rax, %xmm13
- vmovq %rax, %xmm8
- movq .LC2(%rip), %rax
- vmovq %rax, %xmm12
- vmovq %rax, %xmm5
- movq .LC3(%rip), %rax
- vmovq %rax, %xmm14
- movabsq $0x3ff01878b7a1c25d, %rax
+ vmovsd %xmm2, %xmm2, %xmm4
+ vmovsd %xmm13, %xmm13, %xmm8
+ vmovsd %xmm12, %xmm12, %xmm5
vmovsd %xmm14, -16(%rsp)
.L5:
vmulsd .LC9(%rip), %xmm0, %xmm3
* [Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
` (33 preceding siblings ...)
2022-04-25 9:45 ` rguenth at gcc dot gnu.org
@ 2022-04-25 12:52 ` rguenth at gcc dot gnu.org
2022-04-25 13:02 ` rguenth at gcc dot gnu.org
` (4 subsequent siblings)
39 siblings, 0 replies; 43+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-04-25 12:52 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
--- Comment #33 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #32)
> The diff with the ! added is quite short; I have yet to measure any
> effect on LBM:
>
> --- streamcollide.s.orig 2022-04-25 11:37:01.638733951 +0200
> +++ streamcollide.s2 2022-04-25 11:35:54.885849296 +0200
> @@ -33,28 +33,24 @@
> .p2align 4
> .p2align 3
> .L12:
> - movq .LC0(%rip), %rax
> - vmovsd .LC4(%rip), %xmm6
> + vmovsd .LC0(%rip), %xmm2
> + vmovsd .LC1(%rip), %xmm13
> + movabsq $0x3ff01878b7a1c25d, %rax
> movabsq $0x3fef85af6c69b5a6, %rdi
> + vmovsd .LC2(%rip), %xmm12
> + vmovsd .LC3(%rip), %xmm14
> movabsq $0x3ff03db8fde2ef4e, %r8
> + movabsq $0x3fefcea39c51dabe, %r9
> + vmovsd .LC4(%rip), %xmm6
> vmovsd .LC5(%rip), %xmm7
> movq .LC8(%rip), %r11
> - movabsq $0x3fefcea39c51dabe, %r9
> movq .LC6(%rip), %rdx
> movq .LC7(%rip), %rcx
> - vmovq %rax, %xmm2
> - vmovq %rax, %xmm4
> - movq .LC1(%rip), %rax
> movq %r11, %rsi
> movq %r11, %r12
> - vmovq %rax, %xmm13
> - vmovq %rax, %xmm8
> - movq .LC2(%rip), %rax
> - vmovq %rax, %xmm12
> - vmovq %rax, %xmm5
> - movq .LC3(%rip), %rax
> - vmovq %rax, %xmm14
> - movabsq $0x3ff01878b7a1c25d, %rax
> + vmovsd %xmm2, %xmm2, %xmm4
> + vmovsd %xmm13, %xmm13, %xmm8
> + vmovsd %xmm12, %xmm12, %xmm5
> vmovsd %xmm14, -16(%rsp)
> .L5:
> vmulsd .LC9(%rip), %xmm0, %xmm3
Huh, and the net effect is that the + code is 9% _slower_!?
* [Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
` (34 preceding siblings ...)
2022-04-25 12:52 ` rguenth at gcc dot gnu.org
@ 2022-04-25 13:02 ` rguenth at gcc dot gnu.org
2022-04-25 13:09 ` rguenth at gcc dot gnu.org
` (3 subsequent siblings)
39 siblings, 0 replies; 43+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-04-25 13:02 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
--- Comment #34 from Richard Biener <rguenth at gcc dot gnu.org> ---
As noted the effect of
if(...) {
ux = 0.005;
uy = 0.002;
uz = 0.000;
}
is PRE of most(!) dependent instructions, creating
# prephitmp_1099 = PHI <_1098(6),
6.49971724999999889149648879538290202617645263671875e-1(5)>
# prephitmp_1111 = PHI <_1110(6),
1.089805708333333178483570691241766326129436492919921875e-1(5)>
...
we successfully coalesce the non-constant incoming register with the result
but have to emit copies for all constants on the other edge where we have
quite a number of duplicate constants to deal with.
I've experimented with ensuring we get _full_ PRE of the dependent expressions
by more aggressively re-associating (give PHIs with a constant incoming operand
on at least one edge a rank similar to constants, 1).
This increases the number of PHIs further but reduces the follow-up
computations more. We still fail to simply tail-duplicate the merge block
(another possibility to eventually save some of the overhead); our tail
duplication code (gimple-ssa-split-paths.cc) doesn't handle this case since
the diamond is not the one immediately preceding the loop exit/latch.
The result of "full PRE" is a little bit worse than the current state (so
it's not a full solution here).
* [Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
` (35 preceding siblings ...)
2022-04-25 13:02 ` rguenth at gcc dot gnu.org
@ 2022-04-25 13:09 ` rguenth at gcc dot gnu.org
2023-04-26 6:55 ` [Bug rtl-optimization/102178] [12/13/14 " rguenth at gcc dot gnu.org
` (2 subsequent siblings)
39 siblings, 0 replies; 43+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-04-25 13:09 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Target Milestone|12.0 |13.0
Priority|P1 |P2
--- Comment #35 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #34)
> As noted the effect of
>
> if(...) {
> ux = 0.005;
> uy = 0.002;
> uz = 0.000;
> }
>
> is PRE of most(!) dependent instructions, creating
>
> # prephitmp_1099 = PHI <_1098(6),
> 6.49971724999999889149648879538290202617645263671875e-1(5)>
> # prephitmp_1111 = PHI <_1110(6),
> 1.089805708333333178483570691241766326129436492919921875e-1(5)>
> ...
>
> we successfully coalesce the non-constant incoming register with the result
> but have to emit copies for all constants on the other edge where we have
> quite a number of duplicate constants to deal with.
>
> I've experimented with ensuring we get _full_ PRE of the dependent
> expressions
> by more aggressively re-associating (give PHIs with a constant incoming
> operand
> on at least one edge a rank similar to constants, 1).
>
> This increases the number of PHIs further but reduces the follow-up
> computations more. We still fail to simply tail-duplicate the merge
> block (another possibility to eventually save some of the overhead);
> our tail duplication code (gimple-ssa-split-paths.cc) doesn't handle
> this case since the diamond is not the one immediately preceding the
> loop exit/latch.
>
> The result of "full PRE" is a little bit worse than the current state (so
> it's not a full solution here).
Btw, looking at coverage, the constant case is only an unimportant fraction
of the runtime (the branch is taken on only 55296000 of 3562383000
executions, about 1.6%), so the register pressure increase by the PRE
dominates (but the branch is predicted to be 50/50):
3562383000: 241: if( TEST_FLAG_SWEEP( srcGrid, ACCEL )) {
55296000: 242: ux = 0.005;
55296000: 243: uy = 0.002;
55296000: 244: uz = 0.000;
-: 245: }
we can also see that PGO notices this and we do _not_ perform the PRE.
So the root cause is nothing we can fix for GCC 12; tuning to avoid
spilling to GPRs can recover parts of the regression but will definitely
have effects elsewhere.
Re-targeting to GCC 13.
* [Bug rtl-optimization/102178] [12/13/14 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
` (36 preceding siblings ...)
2022-04-25 13:09 ` rguenth at gcc dot gnu.org
@ 2023-04-26 6:55 ` rguenth at gcc dot gnu.org
2023-07-27 9:22 ` rguenth at gcc dot gnu.org
2024-05-21 9:10 ` [Bug rtl-optimization/102178] [12/13/14/15 " jakub at gcc dot gnu.org
39 siblings, 0 replies; 43+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-04-26 6:55 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Target Milestone|13.0 |13.2
--- Comment #36 from Richard Biener <rguenth at gcc dot gnu.org> ---
GCC 13.1 is being released, retargeting bugs to GCC 13.2.
* [Bug rtl-optimization/102178] [12/13/14 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
` (37 preceding siblings ...)
2023-04-26 6:55 ` [Bug rtl-optimization/102178] [12/13/14 " rguenth at gcc dot gnu.org
@ 2023-07-27 9:22 ` rguenth at gcc dot gnu.org
2024-05-21 9:10 ` [Bug rtl-optimization/102178] [12/13/14/15 " jakub at gcc dot gnu.org
39 siblings, 0 replies; 43+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-07-27 9:22 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Target Milestone|13.2 |13.3
--- Comment #37 from Richard Biener <rguenth at gcc dot gnu.org> ---
GCC 13.2 is being released, retargeting bugs to GCC 13.3.
* [Bug rtl-optimization/102178] [12/13/14/15 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
` (38 preceding siblings ...)
2023-07-27 9:22 ` rguenth at gcc dot gnu.org
@ 2024-05-21 9:10 ` jakub at gcc dot gnu.org
39 siblings, 0 replies; 43+ messages in thread
From: jakub at gcc dot gnu.org @ 2024-05-21 9:10 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
Jakub Jelinek <jakub at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Target Milestone|13.3 |13.4
--- Comment #38 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
GCC 13.3 is being released, retargeting bugs to GCC 13.4.
end of thread, other threads:[~2024-05-21 9:10 UTC | newest]
Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-09-02 15:38 [Bug tree-optimization/102178] New: SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22 jamborm at gcc dot gnu.org
2021-09-03 7:07 ` [Bug tree-optimization/102178] " marxin at gcc dot gnu.org
2021-09-06 6:40 ` rguenth at gcc dot gnu.org
2021-09-06 6:41 ` [Bug tree-optimization/102178] [12 Regression] " rguenth at gcc dot gnu.org
2021-09-07 2:46 ` luoxhu at gcc dot gnu.org
2021-09-08 14:06 ` jamborm at gcc dot gnu.org
2021-09-16 16:17 ` jamborm at gcc dot gnu.org
2022-01-20 10:20 ` rguenth at gcc dot gnu.org
2022-01-26 15:57 ` marxin at gcc dot gnu.org
2022-01-27 7:42 ` [Bug rtl-optimization/102178] " rguenth at gcc dot gnu.org
2022-01-27 7:55 ` rguenth at gcc dot gnu.org
2022-01-27 8:13 ` crazylht at gmail dot com
2022-01-27 8:18 ` crazylht at gmail dot com
2022-01-27 8:20 ` rguenth at gcc dot gnu.org
2022-01-27 9:34 ` rguenth at gcc dot gnu.org
2022-01-27 9:55 ` Jan Hubicka
2022-01-27 9:55 ` hubicka at kam dot mff.cuni.cz
2022-01-27 10:13 ` rguenth at gcc dot gnu.org
2022-01-27 10:14 ` rguenth at gcc dot gnu.org
2022-01-27 10:23 ` hubicka at kam dot mff.cuni.cz
2022-01-27 10:32 ` rguenth at gcc dot gnu.org
2022-01-27 11:18 ` rguenth at gcc dot gnu.org
2022-01-27 11:30 ` rguenther at suse dot de
2022-01-27 11:33 ` rguenther at suse dot de
2022-01-27 12:04 ` Jan Hubicka
2022-01-27 12:04 ` hubicka at kam dot mff.cuni.cz
2022-01-27 13:42 ` hjl.tools at gmail dot com
2022-01-27 14:24 ` rguenth at gcc dot gnu.org
2022-01-27 16:28 ` crazylht at gmail dot com
2022-01-27 16:36 ` crazylht at gmail dot com
2022-01-28 15:48 ` vmakarov at gcc dot gnu.org
2022-01-28 16:02 ` vmakarov at gcc dot gnu.org
2022-02-09 15:51 ` vmakarov at gcc dot gnu.org
2022-02-10 7:45 ` rguenth at gcc dot gnu.org
2022-02-10 15:17 ` vmakarov at gcc dot gnu.org
2022-04-11 13:04 ` rguenth at gcc dot gnu.org
2022-04-25 9:45 ` rguenth at gcc dot gnu.org
2022-04-25 12:52 ` rguenth at gcc dot gnu.org
2022-04-25 13:02 ` rguenth at gcc dot gnu.org
2022-04-25 13:09 ` rguenth at gcc dot gnu.org
2023-04-26 6:55 ` [Bug rtl-optimization/102178] [12/13/14 " rguenth at gcc dot gnu.org
2023-07-27 9:22 ` rguenth at gcc dot gnu.org
2024-05-21 9:10 ` [Bug rtl-optimization/102178] [12/13/14/15 " jakub at gcc dot gnu.org