public inbox for gcc-bugs@sourceware.org
* [Bug tree-optimization/114269] New: [14 Regression] Multiple 3-27% exec time regressions of 434.zeusmp since r14-9193-ga0b1798042d033
@ 2024-03-07 15:30 pheeck at gcc dot gnu.org
2024-03-08 8:18 ` [Bug tree-optimization/114269] " rguenth at gcc dot gnu.org
` (9 more replies)
0 siblings, 10 replies; 11+ messages in thread
From: pheeck at gcc dot gnu.org @ 2024-03-07 15:30 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114269
Bug ID: 114269
Summary: [14 Regression] Multiple 3-27% exec time regressions
of 434.zeusmp since r14-9193-ga0b1798042d033
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: pheeck at gcc dot gnu.org
CC: rguenth at gcc dot gnu.org
Blocks: 26163
Target Milestone: ---
Host: x86_64-linux
Target: x86_64-linux
Our LNT instance has detected that runtime of benchmark 434.zeusmp from the
SPEC 2006 suite regressed on x86_64 machines on most configurations by 3-27%.
Some examples:
AMD Zen2 -Ofast -march=native -flto (22% regression)
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=289.80.0
Intel Skylake -Ofast -march=native PGO (11% regression)
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=791.80.0
AMD Zen4 -O2 -flto PGO (7% regression)
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=960.80.0
AMD Zen2 -Ofast -march=native (26% regression)
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=301.80.0
I have bisected the following configurations to the same commit:
- AMD Zen2 -Ofast
- AMD Zen2 -Ofast -march=native -flto PGO
- AMD Zen4 -O2 -flto PGO
The commit is r14-9193-ga0b1798042d033 (Richard Biener:
tree-optimization/114074 - CHREC multiplication and undefined overflow).
I haven't seen this slowdown on our AArch64 machine. I have seen it on our Intel
machine though, so this is not an AMD-specific issue.
There is another PR where a SPEC benchmark slowed down that was also bisected to
the same commit: PR114238. Maybe these two PRs are actually the same bug.
Referenced Bugs:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163
[Bug 26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)
* [Bug tree-optimization/114269] [14 Regression] Multiple 3-27% exec time regressions of 434.zeusmp since r14-9193-ga0b1798042d033
From: rguenth at gcc dot gnu.org @ 2024-03-08 8:18 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114269
Richard Biener <rguenth at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |ASSIGNED
           Assignee|unassigned at gcc dot gnu.org|rguenth at gcc dot gnu.org
   Target Milestone|---                         |14.0
   Last reconfirmed|                            |2024-03-08
     Ever confirmed|0                           |1
--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
I will see whether I can find a nice testcase for x86_64 here.
* [Bug tree-optimization/114269] [14 Regression] Multiple 3-27% exec time regressions of 434.zeusmp since r14-9193-ga0b1798042d033
From: xry111 at gcc dot gnu.org @ 2024-03-08 10:08 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114269
--- Comment #2 from Xi Ruoyao <xry111 at gcc dot gnu.org> ---
*** Bug 114281 has been marked as a duplicate of this bug. ***
* [Bug tree-optimization/114269] [14 Regression] Multiple 3-27% exec time regressions of 434.zeusmp since r14-9193-ga0b1798042d033
From: rguenth at gcc dot gnu.org @ 2024-03-08 10:25 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114269
--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
good (base) vs. bad (peak) on Zen2 with -Ofast -march=native shows
Samples: 654K of event 'cycles', Event count (approx.): 743149709374
Overhead  Samples  Command          Shared Object               Symbol
  16.71%   109793  zeusmp_peak.amd  zeusmp_peak.amd64-m64-mine  [.] hsmoc_
  14.37%    94016  zeusmp_base.amd  zeusmp_base.amd64-m64-mine  [.] hsmoc_
   8.82%    57979  zeusmp_peak.amd  zeusmp_peak.amd64-m64-mine  [.] lorentz_
   8.48%    55451  zeusmp_base.amd  zeusmp_base.amd64-m64-mine  [.] lorentz_
   4.84%    31575  zeusmp_peak.amd  zeusmp_peak.amd64-m64-mine  [.] momx3_
   4.68%    30456  zeusmp_base.amd  zeusmp_base.amd64-m64-mine  [.] momx3_
   4.08%    26675  zeusmp_peak.amd  zeusmp_peak.amd64-m64-mine  [.] tranx3_
   3.56%    23145  zeusmp_base.amd  zeusmp_base.amd64-m64-mine  [.] tranx3_
for hsmoc_ it looks like a difference in transformations done:
-hsmoc.f:826:19: optimized: loop vectorized using 32 byte vectors
(there are a lot more missed vectorizations).
      subroutine hsmoc ( emf1, emf2, emf3 )
      integer is, ie, js, je, ks, ke
      common /gridcomi/
     &   is, ie, js, je, ks, ke
      integer in, jn, kn, ijkn
      integer i , j , k
      parameter(in = 128+5
     &   , jn = 128+5
     &   , kn = 128+5)
      parameter(ijkn = 128+5)
      real*8 emf1 ( in, jn, kn), emf2 ( in, jn, kn)
      real*8 vint (ijkn), bint (ijkn)
      do 199 j=js,je+1
        do 59 i=is,ie
          do 858 k=ks,ke+1
            vint(k)= k
            bint(k)= k
858       continue
          do 58 k=ks,ke+1
            emf1(i,j,k) = vint(k)
            emf2(i,j,k) = bint(k)
58        continue
59      continue
199   continue
      return
      end
doesn't reproduce it though. The actual difference for the whole testcase
is of course failed data-ref analysis:
 Creating dr for (*emf2_1966(D))[_402]
-analyze_innermost: success.
-       base_address: emf2_1966(D)
-       offset from base address: (ssizetype) ((((sizetype) _1928 * 17689 + (sizetype) j_2705 * 133) + (sizetype) i_2672) * 8)
-       constant offset from base address: -142584
-       step: 141512
-       base alignment: 8
+analyze_innermost: hsmoc.f:828:72: missed: failed: evolution of offset is not affine.
+       base_address:
+       offset from base address:
+       constant offset from base address:
+       step:
+       base alignment: 0
and then
hsmoc.f:826:19: note: === vect_analyze_data_ref_accesses ===
-hsmoc.f:826:19: missed: not consecutive access (*emf1_1964(D))[_402] = _403;
-hsmoc.f:826:19: note: using strided accesses
-hsmoc.f:826:19: missed: not consecutive access (*emf2_1966(D))[_402] = _404;
-hsmoc.f:826:19: note: using strided accesses
and we use gather and fail because of costs.
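As a cross-check of the numbers in the "good" data-ref dump above, here is a minimal C sketch that recomputes them from the array extents in the reduced testcase (in = jn = kn = 133, REAL*8 elements, column-major and 1-based as in Fortran). The helper names are made up for illustration and are not GCC internals.

```c
#include <assert.h>

/* Array extents from the reduced testcase: in = jn = kn = 128+5.  */
enum { IN = 133, JN = 133 };

/* Byte offset of the 1-based, column-major REAL*8 element emf(i,j,k),
   relative to the array base (hypothetical helper).  */
static long emf_byte_offset (long i, long j, long k)
{
  return (((k - 1) * (long) JN + (j - 1)) * (long) IN + (i - 1)) * 8;
}

/* Step in bytes when only k advances by one iteration.  */
static long emf_k_step (void)
{
  return emf_byte_offset (1, 1, 2) - emf_byte_offset (1, 1, 1);
}

/* Constant part of the offset: what is left with i = j = k = 0.  */
static long emf_constant_offset (void)
{
  return emf_byte_offset (0, 0, 0);
}

/* Compare against the values printed by analyze_innermost.  */
static void check_against_dump (void)
{
  assert (emf_k_step () == 141512);           /* 133 * 133 * 8 */
  assert (emf_constant_offset () == -142584); /* -(17689 + 133 + 1) * 8 */
}
```

Calling check_against_dump () passes, which matches "step: 141512" and "constant offset from base address: -142584": the innermost k loop walks the slowest-varying dimension, so the access is affine but strided rather than consecutive.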
I suspect that relying on global ranges (that could save us here) is quite
fragile when there's a lot of other code around and thus opportunity for
random transforms "trashing" them.
Using the patch from PR114151 and enabling ranger during vectorization oddly
enough doesn't help (even when wiping the SCEV cache).
The odd thing is with the testcase above we get
Access function 0: (integer(kind=8)) {(((unsigned long) _30 * 17689 + (unsigned long) _10) + (unsigned long) _66) + 18446744073709533793, +, 17689}_4;
where you can see some of the unsigned promotion being done, but we
still succeed.
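The large constant in that access function is easier to read once the unsigned wraparound is undone. A small sketch, assuming a 64-bit unsigned long as on x86_64-linux and assuming the constant is the element offset of index (0,0,0) for the 1-based 133x133x133 array (the function name is hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* -(133*133 + 133 + 1) converted to a 64-bit unsigned type.
   Conversion to unsigned is defined modulo 2^64.  */
static uint64_t wrapped_constant (void)
{
  int64_t c = -(INT64_C (17689) + 133 + 1);   /* -17823 */
  return (uint64_t) c;
}

/* The value matches the constant printed in the access function.  */
static void check_wrapped_constant (void)
{
  assert (wrapped_constant () == UINT64_C (18446744073709533793));
}
```

So 18446744073709533793 is simply -17823 after the promotion to unsigned long, i.e. the same constant offset as before, expressed in elements instead of bytes.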
As I'm lacking a smaller testcase right now it's difficult to understand why
we fail in one case but not the other.
* [Bug tree-optimization/114269] [14 Regression] Multiple 3-27% exec time regressions of 434.zeusmp since r14-9193-ga0b1798042d033
From: rguenth at gcc dot gnu.org @ 2024-03-08 12:23 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114269
--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
The following is a C testcase for a case where ranges will not help:
void foo (int *a, long js, long je, long is, long ie, long ks, long ke,
          long xi, long xj)
{
  for (long j = js; j < je; ++j)
    for (long i = is; i < ie; ++i)
      for (long k = ks; k < ke; ++k)
        a[i + j*xi + k*xi*xj] = 5;
}
The SCEV analysis result before/after shows issues. When you re-order the loops
so the fast increment goes innermost, this doesn't make a difference for
vectorization though. In the order above we now require (emulated) gather,
which didn't work out with SSE; previously we used strided stores.
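To make the loop-order remark concrete, here is a sketch of both orders. foo_k_inner mirrors the testcase as posted; foo_i_inner is a hypothetical reordering (not taken from this thread) in which the unit-stride index i is innermost, so consecutive iterations store to adjacent elements.

```c
#include <assert.h>

/* Loop nest as in the testcase: k is innermost, so consecutive
   iterations are xi*xj elements apart (strided access).  */
static void foo_k_inner (int *a, long js, long je, long is, long ie,
                         long ks, long ke, long xi, long xj)
{
  for (long j = js; j < je; ++j)
    for (long i = is; i < ie; ++i)
      for (long k = ks; k < ke; ++k)
        a[i + j*xi + k*xi*xj] = 5;
}

/* Hypothetical reordering: i is innermost, so consecutive iterations
   touch adjacent elements (unit-stride access).  */
static void foo_i_inner (int *a, long js, long je, long is, long ie,
                         long ks, long ke, long xi, long xj)
{
  for (long k = ks; k < ke; ++k)
    for (long j = js; j < je; ++j)
      for (long i = is; i < ie; ++i)
        a[i + j*xi + k*xi*xj] = 5;
}
```

Both functions store to the same set of elements; only the order of the stores differs, which is what decides whether the vectorizer sees a consecutive or a strided access in the innermost loop.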
The reason seems to be that when analyzing k*xi*xj the first multiply
yields
(long int) {(unsigned long) ks_21(D) * (unsigned long) xi_24(D), +, (unsigned long) xi_24(D)}_3
but when then asking to fold the multiply by xj we fail as we run into
tree
chrec_fold_multiply (tree type,
                     tree op0,
                     tree op1)
{
...
    CASE_CONVERT:
      if (tree_contains_chrecs (op0, NULL))
        return chrec_dont_know;
      /* FALLTHRU */
but this case is somewhat odd as all other unhandled cases simply run into
fold_build2. This possibly means we'd never build other ops with
CHREC operands. This was added for PR42326.
I think we can handle sign-conversions from unsigned just fine; chrec_fold_plus
already does such a thing (but it misses one case).
Doing this restores things to some extent.
I'm testing this as an intermediate step before considering reversion of the
change.
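The idea of folding the multiply in the unsigned type can be illustrated outside of GCC. The sketch below is not the tree-chrec.cc code; it models the CHREC {ks*xi, +, xi} with plain unsigned long arithmetic (names are hypothetical) and checks that multiplying the evaluated CHREC by xj gives the same value as evaluating the folded CHREC {ks*xi*xj, +, xi*xj}. Because unsigned arithmetic wraps, the identity holds for every input without relying on undefined signed overflow.

```c
#include <assert.h>

/* Value of the CHREC {base, +, step} at iteration n, in unsigned long
   so that wraparound is well defined.  */
static unsigned long chrec_eval (unsigned long base, unsigned long step,
                                 unsigned long n)
{
  return base + step * n;
}

/* Unfolded: evaluate {ks*xi, +, xi} first, then multiply by xj.  */
static unsigned long mul_after_eval (long ks, long xi, long xj,
                                     unsigned long n)
{
  unsigned long uks = (unsigned long) ks;
  unsigned long uxi = (unsigned long) xi;
  unsigned long uxj = (unsigned long) xj;
  return chrec_eval (uks * uxi, uxi, n) * uxj;
}

/* Folded: evaluate {ks*xi*xj, +, xi*xj} directly.  */
static unsigned long mul_folded (long ks, long xi, long xj, unsigned long n)
{
  unsigned long uks = (unsigned long) ks;
  unsigned long uxi = (unsigned long) xi;
  unsigned long uxj = (unsigned long) xj;
  return chrec_eval (uks * uxi * uxj, uxi * uxj, n);
}

/* Both forms agree for the first few iterations.  */
static void check_fold (long ks, long xi, long xj)
{
  for (unsigned long n = 0; n < 16; ++n)
    assert (mul_after_eval (ks, xi, xj, n) == mul_folded (ks, xi, xj, n));
}
```

For example, check_fold (-3, 7, -11) and check_fold (1000000007L, 998244353L, 12345) both pass, the second one with intermediate products that would overflow a signed long.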
* [Bug tree-optimization/114269] [14 Regression] Multiple 3-27% exec time regressions of 434.zeusmp since r14-9193-ga0b1798042d033
From: cvs-commit at gcc dot gnu.org @ 2024-03-08 13:41 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114269
--- Comment #5 from GCC Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:
https://gcc.gnu.org/g:018ddc86b928514d7dfee024dcdeb204d5dcdd61
commit r14-9391-g018ddc86b928514d7dfee024dcdeb204d5dcdd61
Author: Richard Biener <rguenther@suse.de>
Date: Fri Mar 8 13:27:12 2024 +0100
tree-optimization/114269 - 434.zeusmp regression after SCEV analysis fix
The following addresses a performance regression caused by the recent
SCEV analysis fix with regard to folding multiplications and undefined
behavior on overflow. We do not handle (T) { a, +, b } * c but can
treat sign-conversions from unsigned by performing the multiplication
in the unsigned type. That's what we already do for additions (but
that misses one case that turns out important).
This fixes the 434.zeusmp regression for me.
PR tree-optimization/114269
PR tree-optimization/114074
* tree-chrec.cc (chrec_fold_plus_1): Handle sign-conversions
in the third CASE_CONVERT case as well.
(chrec_fold_multiply): Handle sign-conversions from unsigned
by performing the operation in the unsigned type.
* [Bug tree-optimization/114269] [14 Regression] Multiple 3-27% exec time regressions of 434.zeusmp since r14-9193-ga0b1798042d033
From: rguenth at gcc dot gnu.org @ 2024-03-08 13:41 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114269
Richard Biener <rguenth at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
             Status|ASSIGNED                    |RESOLVED
--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
For me this is fixed on Zen2 with -Ofast -march=native.
* [Bug tree-optimization/114269] [14 Regression] Multiple 3-6% exec time regressions of 434.zeusmp since r14-9193-ga0b1798042d033
From: pheeck at gcc dot gnu.org @ 2024-03-12 12:48 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114269
Filip Kastl <pheeck at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|FIXED                       |---
             Status|RESOLVED                    |REOPENED
            Summary|[14 Regression] Multiple    |[14 Regression] Multiple
                   |3-27% exec time regressions |3-6% exec time regressions
                   |of 434.zeusmp since         |of 434.zeusmp since
                   |r14-9193-ga0b1798042d033    |r14-9193-ga0b1798042d033
--- Comment #7 from Filip Kastl <pheeck at gcc dot gnu.org> ---
Nice, our LNT instance already shows that the patch sped up 434.zeusmp. However,
I'd like to reopen this bug since the benchmark didn't return to its original
values. At least some configurations are still slower than before
r14-9193-ga0b1798042d033:
Intel Skylake -Ofast -march=native (~5% regression)
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=798.80.0
Intel Skylake -Ofast -march=native -flto PGO (~5% regression)
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=785.80.0
AMD Zen3 -Ofast -march=native -flto (~6% regression)
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=476.80.0
AMD Zen3 -Ofast -march=native -flto PGO (~4% regression)
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=474.80.0
AMD Zen2 -Ofast -march=native -flto (~3% regression)
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=289.80.0
* [Bug tree-optimization/114269] [14 Regression] Multiple 3-6% exec time regressions of 434.zeusmp since r14-9193-ga0b1798042d033
From: law at gcc dot gnu.org @ 2024-03-15 15:01 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114269
Jeffrey A. Law <law at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |law at gcc dot gnu.org
           Priority|P3                          |P2
* [Bug tree-optimization/114269] [14 Regression] Multiple 3-6% exec time regressions of 434.zeusmp since r14-9193-ga0b1798042d033
From: cvs-commit at gcc dot gnu.org @ 2024-03-19 12:12 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114269
--- Comment #8 from GCC Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:
https://gcc.gnu.org/g:e0e9499aeffdaca88f0f29334384aa5f710a81a4
commit r14-9540-ge0e9499aeffdaca88f0f29334384aa5f710a81a4
Author: Richard Biener <rguenther@suse.de>
Date: Tue Mar 19 12:24:08 2024 +0100
tree-optimization/114151 - revert PR114074 fix
The following reverts the chrec_fold_multiply fix and only keeps
handling of constant overflow which keeps the original testcase
fixed. A better solution might involve ranger improvements or
tracking of assumptions during SCEV analysis similar to what niter
analysis does.
PR tree-optimization/114151
PR tree-optimization/114269
PR tree-optimization/114322
PR tree-optimization/114074
* tree-chrec.cc (chrec_fold_multiply): Restrict the use of
unsigned arithmetic when actual overflow on constant operands
is observed.
* gcc.dg/pr68317.c: Revert last change.
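As a rough illustration of "use unsigned arithmetic only when actual overflow on constant operands is observed", the sketch below multiplies two constants in the signed type and switches to the wrapping unsigned product only if that multiplication overflows. This is an assumption-laden stand-in, not the code in tree-chrec.cc: the function name and interface are hypothetical, and it relies on the GCC/Clang builtin __builtin_mul_overflow.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical helper: multiply constants A and B.  Returns false and
   stores the exact product in *SIGNED_RES when no overflow occurs.
   Returns true and stores the wrapped product in *UNSIGNED_RES when
   the signed multiplication would overflow.  */
static bool fold_const_mul (long a, long b, long *signed_res,
                            unsigned long *unsigned_res)
{
  if (!__builtin_mul_overflow (a, b, signed_res))
    return false;   /* no overflow: keep the signed result */
  *unsigned_res = (unsigned long) a * (unsigned long) b;
  return true;      /* overflow observed: use unsigned arithmetic */
}

/* Small self-check covering both outcomes.  */
static void check_fold_const_mul (void)
{
  long s = 0;
  unsigned long u = 0;
  assert (!fold_const_mul (3, 4, &s, &u) && s == 12);
  assert (fold_const_mul (LONG_MAX, 2, &s, &u));
  assert (u == (unsigned long) LONG_MAX * 2UL);
}
```

The point of the restriction is that the common non-overflowing case keeps its signed type, so later analyses such as data-ref analysis still see an affine evolution.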
* [Bug tree-optimization/114269] [14 Regression] Multiple 3-6% exec time regressions of 434.zeusmp since r14-9193-ga0b1798042d033
From: rguenth at gcc dot gnu.org @ 2024-03-19 12:15 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114269
Richard Biener <rguenth at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
             Status|REOPENED                    |RESOLVED
--- Comment #9 from Richard Biener <rguenth at gcc dot gnu.org> ---
So the change has been reverted, we should be back to normal.