* [Bug target/104912] [12 Regression] 416.gamess regression after r12-7612
2022-03-14 11:06 [Bug target/104912] New: [12 Regression] 416.gamess regression after r12-7612 rguenth at gcc dot gnu.org
@ 2022-03-14 11:06 ` rguenth at gcc dot gnu.org
2022-03-14 11:55 ` [Bug target/104912] [12 Regression] 416.gamess regression after r12-7612-g69619acd8d9b58 rguenth at gcc dot gnu.org
` (14 subsequent siblings)
15 siblings, 0 replies; 17+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-03-14 11:06 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104912
Richard Biener <rguenth at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
             Target|                            |x86_64-*-*
   Target Milestone|---                         |12.0
             Blocks|                            |26163
Referenced Bugs:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163
[Bug 26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)
* [Bug target/104912] [12 Regression] 416.gamess regression after r12-7612-g69619acd8d9b58
From: rguenth at gcc dot gnu.org @ 2022-03-14 11:55 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104912
--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
+mccas.fppized.f:3160:21: optimized: loop vectorized using 16 byte vectors
+mccas.fppized.f:3160:21: optimized: loop versioned for vectorization because of possible aliasing
+mccas.fppized.f:3195:21: optimized: loop vectorized using 16 byte vectors
+mccas.fppized.f:3195:21: optimized: loop versioned for vectorization because of possible aliasing
+mccas.fppized.f:3259:21: optimized: loop vectorized using 16 byte vectors
+mccas.fppized.f:3259:21: optimized: loop versioned for vectorization because of possible aliasing
+mccas.fppized.f:3304:21: optimized: loop vectorized using 16 byte vectors
+mccas.fppized.f:3304:21: optimized: loop versioned for vectorization because of possible aliasing
 mccas.fppized.f:2576:18: optimized: loop vectorized using 16 byte vectors
 mccas.fppized.f:2524:17: optimized: loop vectorized using 16 byte vectors
 mccas.fppized.f:3055:22: optimized: loop vectorized using 16 byte vectors
@@ -147,9 +155,11 @@
 mccas.fppized.f:1890:25: optimized: loop vectorized using 16 byte vectors
 mccas.fppized.f:1859:20: optimized: loop vectorized using 16 byte vectors
 mccas.fppized.f:1843:19: optimized: loop vectorized using 16 byte vectors
+mccas.fppized.f:1843:19: optimized: loop vectorized using 16 byte vectors
 mccas.fppized.f:1737:17: optimized: loop vectorized using 16 byte vectors
 mccas.fppized.f:1727:20: optimized: loop vectorized using 16 byte vectors
 mccas.fppized.f:1714:19: optimized: loop vectorized using 16 byte vectors
+mccas.fppized.f:1714:19: optimized: loop vectorized using 16 byte vectors
 mccas.fppized.f:884:24: optimized: loop vectorized using 16 byte vectors
 mccas.fppized.f:904:33: optimized: basic block part vectorized using 16 byte vectors
 mccas.fppized.f:653:17: optimized: loop vectorized using 16 byte vectors
@@ -159,8 +169,11 @@
 mccas.fppized.f:1188:14: optimized: loop vectorized using 16 byte vectors
 mccas.fppized.f:1188:14: optimized: loop versioned for vectorization because of possible aliasing
 mccas.fppized.f:522:72: optimized: basic block part vectorized using 16 byte vectors
+mccas.fppized.f:522:72: optimized: basic block part vectorized using 16 byte vectors
 mccas.fppized.f:2399:14: optimized: loop vectorized using 16 byte vectors
 mccas.fppized.f:2399:14: optimized: loop versioned for vectorization because of possible aliasing
 mccas.fppized.f:2130:14: optimized: loop vectorized using 16 byte vectors
 mccas.fppized.f:2261:72: optimized: basic block part vectorized using 16 byte vectors
+mccas.fppized.f:2261:72: optimized: basic block part vectorized using 16 byte vectors
+mccas.fppized.f:2261:72: optimized: basic block part vectorized using 16 byte vectors
are the vectorization differences; the performance difference happens entirely
in TWOTFF (lines 3209 and following).
+mccas.fppized.f:3304:21: optimized: loop vectorized using 16 byte vectors
+mccas.fppized.f:3304:21: optimized: loop versioned for vectorization because of possible aliasing
are the inner loops of
      DO 30 MK=1,NOC
      DO 30 ML=1,MK
        MKL = MKL+1
        XPQKL(MPQ,MKL) = XPQKL(MPQ,MKL) +
     *  VAL1*(CO(MS,MK)*CO(MR,ML)+CO(MS,ML)*CO(MR,MK))
        XPQKL(MRS,MKL) = XPQKL(MRS,MKL) +
     *  VAL3*(CO(MQ,MK)*CO(MP,ML)+CO(MQ,ML)*CO(MP,MK))
   30 CONTINUE
and the other similar copy.
All loads and stores here are strided, but the vectorized code never
executes; instead we just pay the overhead of the runtime alias test each
time the inner loop is entered (ideally we'd formulate it in a way that
includes the outer iteration, so we could version the outer loop instead). The
runtime alias check is XPQKL(MPQ,MKL) vs. XPQKL(MRS,MKL) - an index check
on MPQ should be invariant, but I guess the situation is more complicated
than that.
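To illustrate why an invariant index check would in principle be enough: Fortran arrays are column-major, so both references step through memory with the same stride as MKL advances and can only collide when the row indices are equal. The sketch below checks that claim by brute force. The leading dimension `ld` and the helper names are assumptions for illustration; the declaration of XPQKL is not shown in this report, and the check GCC actually emits is more general.

```python
# Sketch only: models the element offsets touched by XPQKL(MPQ,MKL) and
# XPQKL(MRS,MKL) during one run of the inner loop. `ld` (leading dimension)
# and all helper names are assumptions made for illustration.

def offsets(row, mkl_start, n_iters, ld):
    """Linear offsets of XPQKL(row, MKL) for MKL = mkl_start+1 .. mkl_start+n_iters."""
    # Column-major layout: element (i, j) lives at (i - 1) + (j - 1) * ld.
    return {(row - 1) + (mkl - 1) * ld
            for mkl in range(mkl_start + 1, mkl_start + n_iters + 1)}


def accesses_overlap(mpq, mrs, mkl_start, n_iters, ld):
    """True if the two references ever touch the same element."""
    return bool(offsets(mpq, mkl_start, n_iters, ld)
                & offsets(mrs, mkl_start, n_iters, ld))


def invariant_check(mpq, mrs):
    """Loop-invariant test: rows within 1..ld can only collide when equal."""
    return mpq == mrs


if __name__ == "__main__":
    ld = 15
    for mpq in range(1, ld + 1):
        for mrs in range(1, ld + 1):
            assert accesses_overlap(mpq, mrs, 0, 5, ld) == invariant_check(mpq, mrs)
    print("invariant row check agrees with the element-wise overlap test")
```

The equivalence relies on both row indices being within the array bounds, which is exactly the kind of fact the compiler may not be able to prove.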
The cost model differences for this are, before r12-7612,
mccas.fppized.f:3304:21: note: Cost model analysis:
  Vector inside of loop cost: 552
  Vector prologue cost: 48
  Vector epilogue cost: 280
  Scalar iteration cost: 264
  Scalar outside cost: 8
  Vector outside cost: 328
  prologue iterations: 0
  epilogue iterations: 1
mccas.fppized.f:3304:21: missed: cost model: the vector iteration cost = 552 divided by the scalar iteration cost = 264 is greater or equal to the vectorization factor = 2.
mccas.fppized.f:3304:21: missed: not vectorized: vectorization not profitable.
mccas.fppized.f:3304:21: missed: not vectorized: vector version will never be profitable.
mccas.fppized.f:3304:21: missed: Loop costings may not be worthwhile.
vs., after it,
mccas.fppized.f:3304:21: note: Cost model analysis:
  Vector inside of loop cost: 480
  Vector prologue cost: 48
  Vector epilogue cost: 280
  Scalar iteration cost: 264
  Scalar outside cost: 8
  Vector outside cost: 328
  prologue iterations: 0
  epilogue iterations: 1
  Calculated minimum iters for profitability: 4
where the V2DF vec_construct costs are reduced from 24 to 12 which I think
is reasonable since we're replacing two scalar loads with one scalar load
and one movhpd from memory.
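The two verdicts can be re-derived from the dumped numbers. The sketch below applies the ratio test quoted in the "missed:" line and a simplified break-even model that I read off the dump fields. It reproduces the reported minimum of 4 iterations, but it is not a copy of the vectorizer's own computation. The observation that 552 - 480 equals six constructs saving 12 each (matching the six strided loads counted in comment #5) is likewise an inference.

```python
# Sketch only: re-derives the two verdicts from the numbers in the dumps.
# The break-even model is a simplified reading of the dump fields, not a
# copy of the vectorizer's code.

VF = 2                 # vectorization factor (V2DF)
SCALAR_ITER = 264      # "Scalar iteration cost"
SCALAR_OUTSIDE = 8     # "Scalar outside cost"
VECTOR_OUTSIDE = 328   # "Vector outside cost" (prologue 48 + epilogue 280)
EPILOGUE_ITERS = 1     # "epilogue iterations"


def never_profitable(vector_inside):
    """The ratio test quoted in the 'missed:' line of the first dump."""
    return vector_inside >= SCALAR_ITER * VF


def min_profitable_iters(vector_inside):
    """Smallest scalar trip count n for which the vector version is cheaper.

    Scalar cost:  SCALAR_OUTSIDE + SCALAR_ITER * n
    Vector cost:  VECTOR_OUTSIDE + vector_inside * (n - EPILOGUE_ITERS) / VF
    Both sides are multiplied by VF to stay in integer arithmetic.
    """
    if never_profitable(vector_inside):
        return None
    n = 1
    while (SCALAR_OUTSIDE + SCALAR_ITER * n) * VF <= (
            VECTOR_OUTSIDE * VF + vector_inside * (n - EPILOGUE_ITERS)):
        n += 1
    return n


if __name__ == "__main__":
    print("inside cost 552:", min_profitable_iters(552))  # never profitable
    print("inside cost 480:", min_profitable_iters(480))
    print("cost delta:", 552 - 480, "=", 6, "*", 24 - 12)
```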
* [Bug target/104912] [12 Regression] 416.gamess regression after r12-7612-g69619acd8d9b58
From: rguenth at gcc dot gnu.org @ 2022-03-14 11:58 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104912
Richard Biener <rguenth at gcc dot gnu.org> changed:
           What    |Removed                       |Added
----------------------------------------------------------------------------
           Assignee|unassigned at gcc dot gnu.org |rguenth at gcc dot gnu.org
   Last reconfirmed|                              |2022-03-14
     Ever confirmed|0                             |1
             Status|UNCONFIRMED                   |ASSIGNED
--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
PR87561 has a testcase and for it we pessimized strided loads & stores "a bit
more" in r9-6581-g7d7d1ce83889ee and r9-6580-g0538ed1d3602ec
* [Bug target/104912] [12 Regression] 416.gamess regression after r12-7612-g69619acd8d9b58
From: rguenth at gcc dot gnu.org @ 2022-03-14 12:17 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104912
--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #2)
> PR87561 has a testcase and for it we pessimized strided loads & stores "a
> bit more" in r9-6581-g7d7d1ce83889ee and r9-6580-g0538ed1d3602ec
We're entering this CTOR cost pessimization with a cost of 4 now (down from 8),
aka one sse_op, and multiply that by 3. I think it would be better to add
TYPE_VECTOR_SUBPARTS times ->lea cost, though that would obviously not help
here.
This cost pessimization is a (bad) workaround for the inability to handle
PR89754 and PR89755.
Since we halved the CTOR cost, we'd now need to apply that factor of two
on top of the pessimization for strided loads/stores to recover. Since
we only halved the CTOR case but not vec_to_scalar, we get away with
just doing that for load_vec_info_type.
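The arithmetic behind this can be spelled out in a few lines. The names below are illustrative, not GCC identifiers. That the products 24 and 12 are the same V2DF vec_construct costs quoted in comment #1 is an inference from the numbers matching, not something the dumps state.

```python
# Sketch only: names are illustrative, not GCC identifiers.

PESSIMIZATION = 3  # factor applied to the CTOR cost for strided accesses


def ctor_cost(base_cost, extra_load_factor=1):
    """Pessimized cost of building a vector from scalars for a strided load."""
    return base_cost * PESSIMIZATION * extra_load_factor


if __name__ == "__main__":
    print("before r12-7612:      ", ctor_cost(8))
    print("after r12-7612:       ", ctor_cost(4))
    print("with extra factor two:", ctor_cost(4, extra_load_factor=2))
```

Applying the extra factor only to loads brings the construct cost back to its old value, which is the recovery described above.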
* [Bug target/104912] [12 Regression] 416.gamess regression after r12-7612-g69619acd8d9b58
From: rguenth at gcc dot gnu.org @ 2022-03-14 13:49 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104912
--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
I think for the case at hand no runtime alias checking is needed, since we have
      DO 30 MK=1,NOC
      DO 30 ML=1,MK
        MKL = MKL+1
        XPQKL(MPQ,MKL) = XPQKL(MPQ,MKL) +
     *  VAL1*(CO(MS,MK)*CO(MR,ML)+CO(MS,ML)*CO(MR,MK))
        XPQKL(MRS,MKL) = XPQKL(MRS,MKL) +
     *  VAL3*(CO(MQ,MK)*CO(MP,ML)+CO(MQ,ML)*CO(MP,MK))
   30 CONTINUE
so we're dealing with reductions which we can interleave (with -Ofast).
Editing the source with !GCC$ ivdep reduces the vectorization penalty to 5%
(we still need the niter/epilogue checks). It also shows that only fixing
PR89755 isn't the solution we're looking for.
In the end the vectorization is unlikely to pay off, since V2DF is
usually handled well by dual issue capabilities for DFmode arithmetic on
modern archs.
The only mitigation I can think of is realizing that the inner loop niter,
as it evolves over the outer iterations, is 0, 1, 2, ..., NOC - 1, and thus
inner loop vectorization will not be profitable for the first outer
iterations. But the question is what to do with this
(not knowing the actual runtime values of NOC). As PR87561 says
"Note for 416.gamess it looks like NOC is just 5 but MPQ and MRS are so
that there is no runtime aliasing between iterations most of the time
(sometimes they are indeed equal). The cost model check skips the
vector loop for MK == 2 and 3 and only will execute it for MK == 4 and 5.
An alternative for this kind of loop nest would be to cost-model for
MK % 2 == 0, thus requiring no epilogue loop."
In general, applying no vectorization to this kind of loop looks wrong.
Versioning also the outer loop in addition to the inner loop in case the
number of iterations evolves in the outer loop looks excessive (but would
eventually help 416.gamess). Implementation-wise it's also non-trivial.
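The effect of the triangular loop nest on the cost model check can be modelled in a few lines. The sketch assumes the minimum of 4 iterations from the cost dump in comment #1 and the NOC value of 5 quoted from PR87561; function names are made up for illustration.

```python
# Sketch only: models the triangular loop nest DO MK=1,NOC / DO ML=1,MK.
# MIN_ITERS comes from "Calculated minimum iters for profitability: 4";
# everything else is an illustrative assumption.

MIN_ITERS = 4


def inner_trip_counts(noc):
    """Inner loop trip count for each outer iteration MK = 1..NOC."""
    return {mk: mk for mk in range(1, noc + 1)}


def vectorized_outer_iterations(noc, min_iters=MIN_ITERS):
    """Outer iterations whose inner loop passes the niter cost model check."""
    return [mk for mk, trips in inner_trip_counts(noc).items()
            if trips >= min_iters]


def even_trip_outer_iterations(noc):
    """The PR87561 alternative: only MK % 2 == 0, so no epilogue is needed."""
    return [mk for mk in range(1, noc + 1) if mk % 2 == 0]


if __name__ == "__main__":
    noc = 5
    print("vector loop entered for MK in", vectorized_outer_iterations(noc))
    print("MK % 2 == 0 alternative:     ", even_trip_outer_iterations(noc))
```

With NOC = 5 only two of the five inner loop executions reach the vector body, while all five pay for the versioning checks.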
* [Bug target/104912] [12 Regression] 416.gamess regression after r12-7612-g69619acd8d9b58
From: rguenth at gcc dot gnu.org @ 2022-03-14 14:25 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104912
--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
Another thing to notice is that the loop performs no vector loads/stores at
all; all of them are strided. If we improved SLP analysis we could get equal
(but VF==1) basic-block vectorization - but with the caveat of having to deal
with the possible aliasing of XPQKL(MPQ,MKL) and XPQKL(MRS,MKL). Still, in a
case where there's no aliasing, doing BB vectorization will eventually be a
better solution.
That said - an x86 backend specific thing could be to count the number of
vector loads/stores as well as the number of strided loads/stores and
apply the biasing based on that at finish_cost time, not on the individual
case.
We can also count the number of "other" stmts in the loop body so as to weigh
the ratio between them. For gamess it's 10 vector stmts vs. 6 strided
loads + 2 strided stores. We could simply sum vector stmts (including
vector loads and stores), subtract the "emulated scalar" ones (maybe weighting
the variably strided cases with a factor of two) and require the outcome
to be > 0 for vectorization to be worthwhile. Eventually the finish_cost hook
should get a bool result to indicate that, independent of the cost of the
scalar loop, we do not want this vectorization (that's nicer than returning
an arbitrarily high number, for example).
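One possible reading of this counting proposal, as a sketch: the function and parameter names are invented, and it assumes the 10 vector stmts quoted for gamess do not already include the 8 strided accesses, which the comment leaves open.

```python
# Sketch only: one possible reading of the proposal in comment #5.
# Function and parameter names are made up for illustration.

def worthwhile(n_vector_stmts, n_strided, n_variably_strided=0,
               variable_weight=2):
    """Vector stmts minus "emulated scalar" accesses must stay positive.

    n_strided counts strided loads/stores with a known stride;
    n_variably_strided counts those with a runtime stride, which are
    weighted by variable_weight.
    """
    score = n_vector_stmts - n_strided - variable_weight * n_variably_strided
    return score > 0


if __name__ == "__main__":
    # gamess: 10 vector stmts vs. 6 strided loads + 2 strided stores.
    print("treated as constant strides:", worthwhile(10, 8))
    print("treated as variable strides:", worthwhile(10, 0, 8))
```

Under this reading the gamess loop is rejected only if its accesses count as variably strided, so the choice of weight decides the outcome.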
* [Bug target/104912] [12 Regression] 416.gamess regression after r12-7612-g69619acd8d9b58
From: rguenth at gcc dot gnu.org @ 2022-03-17 12:31 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104912
--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
Created attachment 52640
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52640&action=edit
patch
Like this - this counts the number of vector stmts and the number of strided
loads/stores and then when finishing up:
+void
+ix86_vector_costs::finish_cost (const vector_costs *scalar_costs)
+{
+  m_finished = true;
+  if (m_costing_for_scalar)
+    return;
+
+  /* When we have more than one strided load or store and the
+     number of strided stores is high compared to all vector
+     stmts in the body we require at least an estimated
+     improvement due to the vectorization of a factor of two.  */
+  if (m_n_body_strided_load_store > 1
+      && m_n_body_stmts / m_n_body_strided_load_store < 4)
+    {
+      unsigned vf = 1;
+      if (is_a <loop_vec_info> (m_vinfo))
+        vf = vect_vf_for_cost (as_a <loop_vec_info> (m_vinfo));
+      if (scalar_costs->prologue_cost () * vf < 2 * body_cost ())
+        m_costs[vect_body] *= 2;
+    }
+}
The scaling of m_costs[vect_body] will make the vectorization unprofitable.
Instead of a hard limit like this we could also scale the strided load
cost based on the overall number of them, for example by adding
m_n_body_strided_load_store squared to the cost.
Note that the "true" cost would only be visible when doing a scheduling
model with dependences in mind. Note also that for this particular case this
is all hand-waving, since the true cost is the versioning/branching overhead,
not the vectorized loop body, and the low number of iterations makes this
particularly visible. So for 416.gamess it will all be a hack...
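To see what the quoted hook does to the gamess numbers, here is a Python transcription of its logic fed with the cost dump from comment #1. Two inputs are assumptions: the body statement count of 18 (10 vector stmts plus 8 strided accesses from comment #5), and using the scalar iteration cost of 264 for scalar_costs->prologue_cost ().

```python
# Sketch only: a Python transcription of the quoted finish_cost logic.
# The statement count of 18 and the mapping of the scalar cost to 264
# are assumptions, not values taken from the patch.

def finish_cost(n_body_stmts, n_strided, scalar_cost, body_cost, vf):
    """Return the vector body cost, doubled when the heuristic fires."""
    # "//" mirrors C++ integer division on non-negative operands.
    if n_strided > 1 and n_body_stmts // n_strided < 4:
        # Require an estimated improvement of at least a factor of two.
        if scalar_cost * vf < 2 * body_cost:
            body_cost *= 2
    return body_cost


if __name__ == "__main__":
    vf, scalar_iter = 2, 264
    new_body = finish_cost(18, 8, scalar_iter, 480, vf)
    print("body cost after finish_cost:", new_body)
    # Same ratio test as in the cost dump: inside cost vs. scalar cost * VF.
    print("never profitable:", new_body >= scalar_iter * vf)
```

With these inputs the body cost goes from 480 to 960, which exceeds 264 * 2 and so trips the "never profitable" test from the dump.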
* [Bug target/104912] [12 Regression] 416.gamess regression after r12-7612-g69619acd8d9b58
From: rguenth at gcc dot gnu.org @ 2022-03-21 13:08 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104912
--- Comment #7 from Richard Biener <rguenth at gcc dot gnu.org> ---
I'm noting that for skylake cost we have
_28 * _33 1 times scalar_stmt costs 16 in prologue
and
_28 * _33 1 times vector_stmt costs 16 in body
but the load/store costs are just 12. Compared to znver2, this tips the
balance towards vectorization, while for znver2 I currently see no
vectorization. For generic I also see vectorization.
Note that costing currently assumes that the cost model niter check is
performed first and short-cuts all the versioning conditions. But since
we emit
_248 = (unsigned int) mk_113;
_247 = _248 + 4294967295;
_246 = _247 > 2;
_245 = stride.4_74 != 0;
_244 = _245 & _246;
...
_183 = _184 | _211;
_182 = _183 & _244;
if (_182 != 0)
goto <bb 27>; [80.00%]
else
goto <bb 28>; [20.00%]
on GIMPLE, how things are expanded depends on some luck, and with the
standalone testcase at -Ofast with generic tuning we emit the > 2 cost model
check quite late:
addq $1, %rdi
imulq %r13, %rdi
leaq (%rax,%rdi), %rcx
movq 32(%rsp), %rax
leaq (%rax,%rcx), %rsi
movq (%rsp), %rax
leaq 0(,%rsi,8), %rdx
addq %rax, %rcx
leaq 0(,%rcx,8), %rax
addq %r13, %rcx
salq $3, %rcx
cmpq %rcx, %rdx
setg %cl
addq %r13, %rsi
salq $3, %rsi
cmpq %rsi, %rax
setg %sil
orb %cl, %sil
je .L8
movl -100(%rsp), %esi
leal -1(%rsi), %ecx
cmpl $2, %ecx <-----
movl 112(%rsp), %ecx
seta %sil
testl %ecx, %ecx
setg %cl
testb %cl, %sil
je .L8
Let me try to hack^Wfix this.
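For readers decoding the check: the GIMPLE statements _248, _247 and _246 compute mk - 1 > 2 in 32-bit unsigned arithmetic, and the leal/cmpl/seta sequence marked in the assembly is the same unsigned comparison. A small emulation follows; the function name is made up.

```python
# Sketch only: emulates the 32-bit unsigned arithmetic of the GIMPLE
# statements _248, _247 and _246 quoted above.

MASK32 = 0xFFFFFFFF


def niter_check(mk):
    """_246 = ((unsigned int) mk + 4294967295) > 2, with 32-bit wraparound."""
    u = mk & MASK32                # _248 = (unsigned int) mk
    t = (u + 4294967295) & MASK32  # _247 = _248 + 4294967295, i.e. mk - 1
    return t > 2                   # _246 = _247 > 2 (unsigned compare)


if __name__ == "__main__":
    for mk in range(1, 6):
        print("MK =", mk, "passes cost model check:", niter_check(mk))
```

For MK = 1..5 only MK = 4 and MK = 5 pass, which agrees with the minimum of 4 iterations from the cost dump.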
* [Bug target/104912] [12 Regression] 416.gamess regression after r12-7612-g69619acd8d9b58
From: cvs-commit at gcc dot gnu.org @ 2022-04-13 7:54 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104912
--- Comment #8 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:
https://gcc.gnu.org/g:ac8340ee4d1e65f3fd41c547b16895875f4aefa7
commit r12-8132-gac8340ee4d1e65f3fd41c547b16895875f4aefa7
Author: Richard Biener <rguenther@suse.de>
Date: Mon Mar 21 14:08:25 2022 +0100
tree-optimization/104912 - ensure cost model is checked first
The following makes sure that when we build the versioning condition
for vectorization including the cost model check, we check for the
cost model and branch over other versioning checks. That is what
the cost modeling assumes, since the cost model check is the only
one accounted for in the scalar outside cost. Currently we emit
all checks as straight-line code combined with bitwise ops which
can result in surprising ordering of checks in the final assembly.
Since loop_version accepts only a single versioning condition
the splitting is done after the fact.
The result is a 1.5% speedup of 416.gamess on x86_64 when compiling
with -Ofast and tuning for generic or skylake. That's not enough
to recover from the slowdown when vectorizing but it now cuts off
the expensive alias versioning test.
2022-03-21 Richard Biener <rguenther@suse.de>
PR tree-optimization/104912
* tree-vect-loop-manip.cc (vect_loop_versioning): Split
the cost model check to a separate BB to make sure it is
checked first and not combined with other version checks.
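A toy model of why the ordering matters for this loop nest: it counts how often the alias test is evaluated when all checks are computed as one combined expression, versus when the cost model check branches over the rest. The threshold of 4 and NOC = 5 are taken from earlier comments; the function name and structure are illustrative and say nothing about how vect_loop_versioning is implemented.

```python
# Sketch only: counts evaluations of the (expensive) alias test for the
# triangular loop nest. All names are illustrative.

def alias_tests_evaluated(noc, cost_model_first, min_iters=4):
    """Number of alias tests run over all entries to the inner loop."""
    alias_tests = 0
    for mk in range(1, noc + 1):    # one versioning decision per inner loop entry
        niter_ok = mk >= min_iters  # cheap cost model check
        if cost_model_first and not niter_ok:
            continue                # branch over the remaining checks
        alias_tests += 1            # the combined form always computes this
    return alias_tests


if __name__ == "__main__":
    print("combined condition:", alias_tests_evaluated(5, cost_model_first=False))
    print("cost model first:  ", alias_tests_evaluated(5, cost_model_first=True))
```

The combined case models the unlucky expansion where every check is computed before the branch, as in the assembly shown in comment #7.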
* [Bug target/104912] [12 Regression] 416.gamess regression after r12-7612-g69619acd8d9b58
From: rguenth at gcc dot gnu.org @ 2022-04-13 8:02 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104912
--- Comment #9 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to CVS Commits from comment #8)
> The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:
>
> https://gcc.gnu.org/g:ac8340ee4d1e65f3fd41c547b16895875f4aefa7
>
> commit r12-8132-gac8340ee4d1e65f3fd41c547b16895875f4aefa7
> Author: Richard Biener <rguenther@suse.de>
> Date: Mon Mar 21 14:08:25 2022 +0100
>
> tree-optimization/104912 - ensure cost model is checked first
>
> The following makes sure that when we build the versioning condition
> for vectorization including the cost model check, we check for the
> cost model and branch over other versioning checks. That is what
> the cost modeling assumes, since the cost model check is the only
> one accounted for in the scalar outside cost. Currently we emit
> all checks as straight-line code combined with bitwise ops which
> can result in surprising ordering of checks in the final assembly.
>
> Since loop_version accepts only a single versioning condition
> the splitting is done after the fact.
>
> The result is a 1.5% speedup of 416.gamess on x86_64 when compiling
> with -Ofast and tuning for generic or skylake. That's not enough
> to recover from the slowdown when vectorizing but it now cuts off
> the expensive alias versioning test.
>
> 2022-03-21 Richard Biener <rguenther@suse.de>
>
> PR tree-optimization/104912
> * tree-vect-loop-manip.cc (vect_loop_versioning): Split
> the cost model check to a separate BB to make sure it is
> checked first and not combined with other version checks.
I pushed this by mistake and have reverted it again.
* [Bug target/104912] [12 Regression] 416.gamess regression after r12-7612-g69619acd8d9b58
From: cvs-commit at gcc dot gnu.org @ 2022-04-20 11:28 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104912
--- Comment #10 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:
https://gcc.gnu.org/g:4d4eaa77a7d27bf7b220c3be79615410c873fcf1
commit r12-8205-g4d4eaa77a7d27bf7b220c3be79615410c873fcf1
Author: Richard Biener <rguenther@suse.de>
Date: Mon Mar 21 14:08:25 2022 +0100
tree-optimization/104912 - ensure cost model is checked first
The following makes sure that when we build the versioning condition
for vectorization including the cost model check, we check for the
cost model and branch over other versioning checks. That is what
the cost modeling assumes, since the cost model check is the only
one accounted for in the scalar outside cost. Currently we emit
all checks as straight-line code combined with bitwise ops which
can result in surprising ordering of checks in the final assembly.
Since loop_version accepts only a single versioning condition
the splitting is done after the fact.
The result is a 1.5% speedup of 416.gamess on x86_64 when compiling
with -Ofast and tuning for generic or skylake. That's not enough
to recover from the slowdown when vectorizing but it now cuts off
the expensive alias versioning test.
2022-03-21 Richard Biener <rguenther@suse.de>
PR tree-optimization/104912
* tree-vect-loop-manip.cc (vect_loop_versioning): Split
the cost model check to a separate BB to make sure it is
checked first and not combined with other version checks.
* [Bug target/104912] [12/13 Regression] 416.gamess regression after r12-7612-g69619acd8d9b58
From: jakub at gcc dot gnu.org @ 2022-05-06 8:33 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104912
Jakub Jelinek <jakub at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|12.0                        |12.2
--- Comment #11 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
GCC 12.1 is being released, retargeting bugs to GCC 12.2.
* [Bug target/104912] [12/13 Regression] 416.gamess regression after r12-7612-g69619acd8d9b58
From: rguenth at gcc dot gnu.org @ 2022-07-26 12:42 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104912
Richard Biener <rguenth at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P3                          |P2
* [Bug target/104912] [12/13 Regression] 416.gamess regression after r12-7612-g69619acd8d9b58
From: jamborm at gcc dot gnu.org @ 2023-01-31 11:22 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104912
Martin Jambor <jamborm at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jamborm at gcc dot gnu.org
--- Comment #13 from Martin Jambor <jamborm at gcc dot gnu.org> ---
There is still a 5% regression against gcc-10 on zen2 at (just) -Ofast:
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=405.50.0&plot.1=414.50.0&plot.2=754.50.0&plot.3=300.50.0&
4% at -Ofast -march=native:
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=406.50.0&plot.1=421.50.0&plot.2=758.50.0&plot.3=301.50.0&
13% on an Intel IceLake machine at -Ofast -march=native:
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=805.50.0&plot.1=845.50.0&plot.2=820.50.0&plot.3=798.50.0&
5.8% on zen3 at just -Ofast (targeting generic x86_64 ISA):
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=524.50.0&plot.1=504.50.0&plot.2=741.50.0&plot.3=466.50.0&
* [Bug target/104912] [12/13 Regression] 416.gamess regression after r12-7612-g69619acd8d9b58
From: rguenth at gcc dot gnu.org @ 2023-01-31 12:02 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104912
Richard Biener <rguenth at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|2022-03-14 00:00:00         |2023-1-31
--- Comment #14 from Richard Biener <rguenth at gcc dot gnu.org> ---
Yep, the situation is complicated.
* [Bug target/104912] [12/13/14 Regression] 416.gamess regression after r12-7612-g69619acd8d9b58
From: rguenth at gcc dot gnu.org @ 2023-05-08 12:24 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104912
Richard Biener <rguenth at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|12.3                        |12.4
--- Comment #15 from Richard Biener <rguenth at gcc dot gnu.org> ---
GCC 12.3 is being released, retargeting bugs to GCC 12.4.