[Bug target/104912] New: [12 Regression] 416.gamess regression after r12-7612

public inbox for gcc-bugs@sourceware.org

* [Bug target/104912] New: [12 Regression] 416.gamess regression after r12-7612
@ 2022-03-14 11:06 rguenth at gcc dot gnu.org
  2022-03-14 11:06 ` [Bug target/104912] " rguenth at gcc dot gnu.org
                   ` (15 more replies)
  0 siblings, 16 replies; 17+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-03-14 11:06 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104912

            Bug ID: 104912
           Summary: [12 Regression] 416.gamess regression after r12-7612
           Product: gcc
           Version: 12.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: rguenth at gcc dot gnu.org
  Target Milestone: ---

There's an 8% regression in runtime for 416.gamess when built with -Ofast
-mtune=generic on Zen2 (less pronounced when using -march=native).  It seems
to be triggered by r12-7612.


* [Bug target/104912] [12 Regression] 416.gamess regression after r12-7612
  2022-03-14 11:06 [Bug target/104912] New: [12 Regression] 416.gamess regression after r12-7612 rguenth at gcc dot gnu.org
@ 2022-03-14 11:06 ` rguenth at gcc dot gnu.org
  2022-03-14 11:55 ` [Bug target/104912] [12 Regression] 416.gamess regression after r12-7612-g69619acd8d9b58 rguenth at gcc dot gnu.org
                   ` (14 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-03-14 11:06 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104912

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
             Target|                            |x86_64-*-*
   Target Milestone|---                         |12.0
             Blocks|                            |26163


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163
[Bug 26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)


* [Bug target/104912] [12 Regression] 416.gamess regression after r12-7612-g69619acd8d9b58
  2022-03-14 11:06 [Bug target/104912] New: [12 Regression] 416.gamess regression after r12-7612 rguenth at gcc dot gnu.org
  2022-03-14 11:06 ` [Bug target/104912] " rguenth at gcc dot gnu.org
@ 2022-03-14 11:55 ` rguenth at gcc dot gnu.org
  2022-03-14 11:58 ` rguenth at gcc dot gnu.org
                   ` (13 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-03-14 11:55 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104912

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
+mccas.fppized.f:3160:21: optimized: loop vectorized using 16 byte vectors
+mccas.fppized.f:3160:21: optimized:  loop versioned for vectorization because of possible aliasing
+mccas.fppized.f:3195:21: optimized: loop vectorized using 16 byte vectors
+mccas.fppized.f:3195:21: optimized:  loop versioned for vectorization because of possible aliasing
+mccas.fppized.f:3259:21: optimized: loop vectorized using 16 byte vectors
+mccas.fppized.f:3259:21: optimized:  loop versioned for vectorization because of possible aliasing
+mccas.fppized.f:3304:21: optimized: loop vectorized using 16 byte vectors
+mccas.fppized.f:3304:21: optimized:  loop versioned for vectorization because of possible aliasing
 mccas.fppized.f:2576:18: optimized: loop vectorized using 16 byte vectors
 mccas.fppized.f:2524:17: optimized: loop vectorized using 16 byte vectors
 mccas.fppized.f:3055:22: optimized: loop vectorized using 16 byte vectors
@@ -147,9 +155,11 @@
 mccas.fppized.f:1890:25: optimized: loop vectorized using 16 byte vectors
 mccas.fppized.f:1859:20: optimized: loop vectorized using 16 byte vectors
 mccas.fppized.f:1843:19: optimized: loop vectorized using 16 byte vectors
+mccas.fppized.f:1843:19: optimized: loop vectorized using 16 byte vectors
 mccas.fppized.f:1737:17: optimized: loop vectorized using 16 byte vectors
 mccas.fppized.f:1727:20: optimized: loop vectorized using 16 byte vectors
 mccas.fppized.f:1714:19: optimized: loop vectorized using 16 byte vectors
+mccas.fppized.f:1714:19: optimized: loop vectorized using 16 byte vectors
 mccas.fppized.f:884:24: optimized: loop vectorized using 16 byte vectors
 mccas.fppized.f:904:33: optimized: basic block part vectorized using 16 byte vectors
 mccas.fppized.f:653:17: optimized: loop vectorized using 16 byte vectors
@@ -159,8 +169,11 @@
 mccas.fppized.f:1188:14: optimized: loop vectorized using 16 byte vectors
 mccas.fppized.f:1188:14: optimized:  loop versioned for vectorization because of possible aliasing
 mccas.fppized.f:522:72: optimized: basic block part vectorized using 16 byte vectors
+mccas.fppized.f:522:72: optimized: basic block part vectorized using 16 byte vectors
 mccas.fppized.f:2399:14: optimized: loop vectorized using 16 byte vectors
 mccas.fppized.f:2399:14: optimized:  loop versioned for vectorization because of possible aliasing
 mccas.fppized.f:2130:14: optimized: loop vectorized using 16 byte vectors
 mccas.fppized.f:2261:72: optimized: basic block part vectorized using 16 byte vectors
+mccas.fppized.f:2261:72: optimized: basic block part vectorized using 16 byte vectors
+mccas.fppized.f:2261:72: optimized: basic block part vectorized using 16 byte vectors


are the vectorization differences; the performance difference happens entirely
in TWOTFF (lines 3209 and following).

+mccas.fppized.f:3304:21: optimized: loop vectorized using 16 byte vectors
+mccas.fppized.f:3304:21: optimized:  loop versioned for vectorization because of possible aliasing

are the inner loops of

            DO 30 MK=1,NOC
            DO 30 ML=1,MK
               MKL = MKL+1
               XPQKL(MPQ,MKL) = XPQKL(MPQ,MKL) +
     *               VAL1*(CO(MS,MK)*CO(MR,ML)+CO(MS,ML)*CO(MR,MK))
               XPQKL(MRS,MKL) = XPQKL(MRS,MKL) +
     *               VAL3*(CO(MQ,MK)*CO(MP,ML)+CO(MQ,ML)*CO(MP,MK))
   30       CONTINUE

and the other similar copy.

We are doing all strided loads and stores here, but the vectorized code never
executes; instead we just pay the overhead of the runtime alias test for
each inner iteration (we'd ideally formulate it in a way including the
outer iteration so we could version the outer loop instead).  The
runtime alias check is XPQKL(MPQ,MKL) vs. XPQKL(MRS,MKL); an index check
on MPQ should be invariant, but I guess the situation is more complicated
than that.

The cost model differences for this are

mccas.fppized.f:3304:21: note:  Cost model analysis:
  Vector inside of loop cost: 552
  Vector prologue cost: 48
  Vector epilogue cost: 280
  Scalar iteration cost: 264
  Scalar outside cost: 8
  Vector outside cost: 328
  prologue iterations: 0
  epilogue iterations: 1
mccas.fppized.f:3304:21: missed:  cost model: the vector iteration cost = 552 divided by the scalar iteration cost = 264 is greater or equal to the vectorization factor = 2.
mccas.fppized.f:3304:21: missed:  not vectorized: vectorization not profitable.
mccas.fppized.f:3304:21: missed:  not vectorized: vector version will never be profitable.
mccas.fppized.f:3304:21: missed:  Loop costings may not be worthwhile.

vs.

mccas.fppized.f:3304:21: note:  Cost model analysis:
  Vector inside of loop cost: 480
  Vector prologue cost: 48
  Vector epilogue cost: 280
  Scalar iteration cost: 264
  Scalar outside cost: 8
  Vector outside cost: 328
  prologue iterations: 0
  epilogue iterations: 1
  Calculated minimum iters for profitability: 4

where the V2DF vec_construct costs are reduced from 24 to 12 which I think
is reasonable since we're replacing two scalar loads with one scalar load
and one movhpd from memory.
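
The two dumps differ only in the vector inside-of-loop cost.  As a rough
sketch (not GCC's actual implementation), the test reported in the "missed"
message can be modeled as comparing the vector iteration cost against the
scalar iteration cost times the vectorization factor.  The function names
below are made up for illustration; the numbers are the ones from the dumps,
and the count of six matches the six strided loads counted later in this
thread.

```cpp
#include <cassert>

// Simplified model of the test quoted in the dump: the vector loop body
// is rejected when
//   vector iteration cost / scalar iteration cost >= vectorization factor.
// Illustration only; this is not the code GCC uses.
constexpr bool
vector_body_may_pay_off (int vec_inside_cost, int scalar_iter_cost, int vf)
{
  // Multiply instead of dividing so that no rounding is involved.
  return vec_inside_cost < scalar_iter_cost * vf;
}

// Call this from main () to replay the two dumps.
inline void
replay_cost_dumps ()
{
  // 552 >= 264 * 2 = 528: "vector version will never be profitable".
  assert (!vector_body_may_pay_off (552, 264, 2));
  // 480 < 528 once the vec_construct cost is halved.
  assert (vector_body_may_pay_off (480, 264, 2));
  // The difference of 72 equals six reductions from 24 to 12.
  assert (552 - 480 == 6 * (24 - 12));
}
```

This deliberately ignores the prologue/epilogue costs, so it says nothing
about how the "minimum iters for profitability" value is derived.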


* [Bug target/104912] [12 Regression] 416.gamess regression after r12-7612-g69619acd8d9b58
  2022-03-14 11:06 [Bug target/104912] New: [12 Regression] 416.gamess regression after r12-7612 rguenth at gcc dot gnu.org
  2022-03-14 11:06 ` [Bug target/104912] " rguenth at gcc dot gnu.org
  2022-03-14 11:55 ` [Bug target/104912] [12 Regression] 416.gamess regression after r12-7612-g69619acd8d9b58 rguenth at gcc dot gnu.org
@ 2022-03-14 11:58 ` rguenth at gcc dot gnu.org
  2022-03-14 12:17 ` rguenth at gcc dot gnu.org
                   ` (12 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-03-14 11:58 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104912

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Assignee|unassigned at gcc dot gnu.org      |rguenth at gcc dot gnu.org
   Last reconfirmed|                            |2022-03-14
     Ever confirmed|0                           |1
             Status|UNCONFIRMED                 |ASSIGNED

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
PR87561 has a testcase and for it we pessimized strided loads & stores "a bit
more" in r9-6581-g7d7d1ce83889ee and r9-6580-g0538ed1d3602ec


* [Bug target/104912] [12 Regression] 416.gamess regression after r12-7612-g69619acd8d9b58
  2022-03-14 11:06 [Bug target/104912] New: [12 Regression] 416.gamess regression after r12-7612 rguenth at gcc dot gnu.org
                   ` (2 preceding siblings ...)
  2022-03-14 11:58 ` rguenth at gcc dot gnu.org
@ 2022-03-14 12:17 ` rguenth at gcc dot gnu.org
  2022-03-14 13:49 ` rguenth at gcc dot gnu.org
                   ` (11 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-03-14 12:17 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104912

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #2)
> PR87561 has a testcase and for it we pessimized strided loads & stores "a
> bit more" in r9-6581-g7d7d1ce83889ee and r9-6580-g0538ed1d3602ec

We're entering this CTOR cost pessimization with a cost of 4 now (down from 8),
aka one sse_op, and multiply that by 3.  I think it would be better to add
TYPE_VECTOR_SUBPARTS times ->lea cost, though that would not help here
obviously.

This cost pessimization is a (bad) workaround for the inability to handle
PR89754 and PR89755.

Since we halved the CTOR cost we'd now need to apply that factor of two
on top of the pessimization for strided loads/stores to recover.  Since
we only halved the CTOR case but not vec_to_scalar we get away with
just doing that for load_vec_info_type.
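
The arithmetic in this comment can be written out as a small sketch.  The
function and parameter names are hypothetical and do not correspond to GCC
internals; only the numbers (a base cost of 8 before, 4 now, and the factor
of 3) are taken from the comment.

```cpp
#include <cassert>

// Cost of a strided-access vector CTOR after the pessimization:
// base cost times the pessimization factor, optionally times an extra
// factor applied to loads only (hypothetical parameter).
constexpr int
pessimized_ctor_cost (int base_cost, int pessimization_factor,
                      int extra_load_factor = 1)
{
  return base_cost * pessimization_factor * extra_load_factor;
}

// Call this from main () to check the numbers.
inline void
check_ctor_costs ()
{
  assert (pessimized_ctor_cost (8, 3) == 24);    // previous CTOR cost
  assert (pessimized_ctor_cost (4, 3) == 12);    // halved CTOR cost
  // An additional factor of two for loads restores the previous value.
  assert (pessimized_ctor_cost (4, 3, 2) == 24);
}
```

The results 24 and 12 agree with the V2DF vec_construct costs reported in
comment #1.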


* [Bug target/104912] [12 Regression] 416.gamess regression after r12-7612-g69619acd8d9b58
  2022-03-14 11:06 [Bug target/104912] New: [12 Regression] 416.gamess regression after r12-7612 rguenth at gcc dot gnu.org
                   ` (3 preceding siblings ...)
  2022-03-14 12:17 ` rguenth at gcc dot gnu.org
@ 2022-03-14 13:49 ` rguenth at gcc dot gnu.org
  2022-03-14 14:25 ` rguenth at gcc dot gnu.org
                   ` (10 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-03-14 13:49 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104912

--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
I think for the case at hand no runtime alias checking is needed, since we have

            DO 30 MK=1,NOC
            DO 30 ML=1,MK
               MKL = MKL+1
               XPQKL(MPQ,MKL) = XPQKL(MPQ,MKL) +
     *               VAL1*(CO(MS,MK)*CO(MR,ML)+CO(MS,ML)*CO(MR,MK))
               XPQKL(MRS,MKL) = XPQKL(MRS,MKL) +
     *               VAL3*(CO(MQ,MK)*CO(MP,ML)+CO(MQ,ML)*CO(MP,MK))
   30       CONTINUE

so we're dealing with reductions which we can interleave (with -Ofast).
Editing the source with !GCC$ ivdep reduces the vectorization penalty to 5%
(we still need the niter/epilogue checks).  It also shows that only fixing
PR89755 isn't the solution we're looking for.

In the end the vectorization is unlikely to pay off since V2DF is
usually handled well by dual issue capabilities for DFmode arithmetic on
modern archs.

The only mitigation I can think of is realizing that the inner loop niter
is 0, 1, 2, .., NOC - 1 over the outer iterations and thus for the first
outer iterations inner loop vectorization is not profitable.  But the
question is what to do with this (not knowing the actual runtime values of
NOC).  As PR87561 says

"Note for 416.gamess it looks like NOC is just 5 but MPQ and MRS are so
that there is no runtime aliasing between iterations most of the time
(sometimes they are indeed equal).  The cost model check skips the
vector loop for MK == 2 and 3 and only will execute it for MK == 4 and 5.
An alternative for this kind of loop nest would be to cost-model for
MK % 2 == 0, thus requiring no epilogue loop."

In general, applying no vectorization to this kind of loop looks wrong.
Versioning the outer loop in addition to the inner loop in case the
number of iterations evolves in the outer loop looks excessive (but would
eventually help 416.gamess).  Implementation-wise it's also non-trivial.
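
To make the PR87561 quote concrete, here is a minimal sketch that lists the
outer iterations whose inner loop is long enough to enter the vector loop.
It assumes the inner loop `DO 30 ML=1,MK` runs MK iterations and uses the
"minimum iters for profitability" of 4 from the cost model dump; the function
and parameter names are hypothetical, and the runtime alias check is ignored.

```cpp
#include <cassert>
#include <vector>

// Return the values of MK (1..noc) for which the inner trip count reaches
// the profitability threshold, i.e. for which the versioned vector loop
// would be entered if the alias check also passed.
std::vector<int>
outer_iterations_using_vector_loop (int noc, int min_profitable_iters)
{
  std::vector<int> result;
  for (int mk = 1; mk <= noc; ++mk)
    if (mk >= min_profitable_iters)
      result.push_back (mk);
  return result;
}

// Call this from main () to check the NOC = 5 case.
inline void
check_gamess_case ()
{
  const std::vector<int> mks = outer_iterations_using_vector_loop (5, 4);
  assert (mks.size () == 2);
  assert (mks[0] == 4 && mks[1] == 5);
}
```

For NOC = 5 only MK == 4 and MK == 5 qualify, so the remaining outer
iterations pay for the versioning checks without ever running vector code.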


* [Bug target/104912] [12 Regression] 416.gamess regression after r12-7612-g69619acd8d9b58
  2022-03-14 11:06 [Bug target/104912] New: [12 Regression] 416.gamess regression after r12-7612 rguenth at gcc dot gnu.org
                   ` (4 preceding siblings ...)
  2022-03-14 13:49 ` rguenth at gcc dot gnu.org
@ 2022-03-14 14:25 ` rguenth at gcc dot gnu.org
  2022-03-17 12:31 ` rguenth at gcc dot gnu.org
                   ` (9 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-03-14 14:25 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104912

--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
Another thing to notice is that the loop performs no vector loads/stores at
all; all of them are strided.  If we'd improve SLP analysis we could get equal
(but VF==1) basic-block vectorization, but with the caveat of having to deal
with the possible aliasing of XPQKL(MPQ,MKL) and XPQKL(MRS,MKL).  Still, in a
case where there's no aliasing, doing BB vectorization will eventually be a
better solution.

That said - an x86 backend-specific thing could be to count the number of
vector loads/stores as well as the number of strided loads/stores and
apply the biasing based on that at finish_cost time, not on the individual
case.  We can also count the number of "other" stmts in the loop body so as
to weigh the ratio between them.  For gamess it's 10 vector stmts vs. 6 strided
loads + 2 strided stores.  We could simply sum vector stmts (including
vector loads and stores), subtract the "emulated scalar" ones (maybe weight
the variably strided cases with a factor of two) and require the outcome
to be > 0 to be worthwhile to vectorize.  Eventually the finish_cost hook
should get a bool result to indicate that independent of the cost of the
scalar loop we do not want this vectorization (that's nicer than returning
an arbitrary high number for example).
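
A minimal sketch of the balance described here, under stated assumptions: all
names are hypothetical (no such function exists in GCC), and the comment does
not say whether the 10 vector stmts already include the loads and stores, so
the inputs below are only one possible reading.

```cpp
#include <cassert>

// Vector statements minus the (optionally weighted) strided accesses that
// are emulated with scalar code.
constexpr int
vectorization_balance (int n_vector_stmts, int n_strided_accesses,
                       int strided_weight)
{
  return n_vector_stmts - strided_weight * n_strided_accesses;
}

// The proposal requires a positive balance for vectorization.
constexpr bool
worthwhile_to_vectorize (int balance)
{
  return balance > 0;
}

// Call this from main () to try the 416.gamess counts.
inline void
check_gamess_balance ()
{
  // 10 vector stmts vs. 6 strided loads + 2 strided stores.
  assert (vectorization_balance (10, 6 + 2, 1) == 2);
  assert (worthwhile_to_vectorize (vectorization_balance (10, 6 + 2, 1)));
  // Weighting the strided accesses by two flips the decision.
  assert (vectorization_balance (10, 6 + 2, 2) == -6);
  assert (!worthwhile_to_vectorize (vectorization_balance (10, 6 + 2, 2)));
}
```

With these inputs the outcome depends entirely on the weight, which shows how
sensitive such a rule would be to its tuning.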


* [Bug target/104912] [12 Regression] 416.gamess regression after r12-7612-g69619acd8d9b58
  2022-03-14 11:06 [Bug target/104912] New: [12 Regression] 416.gamess regression after r12-7612 rguenth at gcc dot gnu.org
                   ` (5 preceding siblings ...)
  2022-03-14 14:25 ` rguenth at gcc dot gnu.org
@ 2022-03-17 12:31 ` rguenth at gcc dot gnu.org
  2022-03-21 13:08 ` rguenth at gcc dot gnu.org
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-03-17 12:31 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104912

--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
Created attachment 52640
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52640&action=edit
patch

Like this - this counts the number of vector stmts and the number of strided
loads/stores and then when finishing up:

+void
+ix86_vector_costs::finish_cost (const vector_costs *scalar_costs)
+{
+  m_finished = true;
+  if (m_costing_for_scalar)
+    return;
+
+  /* When we have more than one strided load or store and the
+     number of strided stores is high compared to all vector
+     stmts in the body we require at least an estimated
+     improvement due to the vectorization of a factor of two.  */
+  if (m_n_body_strided_load_store > 1
+      && m_n_body_stmts / m_n_body_strided_load_store < 4)
+    {
+      unsigned vf = 1;
+      if (is_a <loop_vec_info> (m_vinfo))
+       vf = vect_vf_for_cost (as_a <loop_vec_info> (m_vinfo));
+      if (scalar_costs->prologue_cost () * vf < 2 * body_cost ())
+       m_costs[vect_body] *= 2;
+    }
+}

The scaling of m_costs[vect_body] will make the vectorization unprofitable.
Instead of a hard limit like this we could also scale the strided load
cost based on the overall number of them, for example by adding
m_n_body_strided_load_store squared to the cost.

Note that the "true" cost would only be visible when doing a scheduling
model with dependences in mind.  Note that for this particular case this
is all hand-waving since the true cost is the versioning/branching overhead,
not the vectorized loop body, and the low number of iterations makes this
particularly visible.  So for 416.gamess it will be all a hack...
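
The heuristic in the patch can be tried outside of GCC with a standalone
model.  The struct and its field names are stand-ins for the
ix86_vector_costs members, and the statement count of 18 used in the check is
an assumption made for illustration, not a value from the dumps.

```cpp
#include <cassert>

// Stand-in for the counters the patch keeps while costing the loop body.
struct body_stats
{
  unsigned n_body_stmts;               // vector stmts in the loop body
  unsigned n_body_strided_load_store;  // strided loads plus strided stores
  unsigned body_cost;                  // accumulated vector body cost
};

// Double the body cost when strided accesses dominate and the estimated
// improvement (scalar cost * VF vs. vector body cost) is below a factor
// of two; otherwise return the body cost unchanged.
constexpr unsigned
adjusted_body_cost (body_stats s, unsigned scalar_cost, unsigned vf)
{
  return (s.n_body_strided_load_store > 1
          && s.n_body_stmts / s.n_body_strided_load_store < 4
          && scalar_cost * vf < 2 * s.body_cost)
         ? s.body_cost * 2
         : s.body_cost;
}

// Call this from main () to try the numbers.
inline void
check_adjusted_body_cost ()
{
  // 18 / 8 == 2 < 4 and 264 * 2 = 528 < 2 * 480, so the cost doubles.
  assert (adjusted_body_cost (body_stats{18, 8, 480}, 264, 2) == 960);
  // With few strided accesses relative to the body nothing changes.
  assert (adjusted_body_cost (body_stats{40, 8, 480}, 264, 2) == 480);
}
```

Under these assumed inputs the doubled cost of 960 exceeds 264 * 2 = 528, so
a ratio test like the one in the cost model dump would reject the loop.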


* [Bug target/104912] [12 Regression] 416.gamess regression after r12-7612-g69619acd8d9b58
  2022-03-14 11:06 [Bug target/104912] New: [12 Regression] 416.gamess regression after r12-7612 rguenth at gcc dot gnu.org
                   ` (6 preceding siblings ...)
  2022-03-17 12:31 ` rguenth at gcc dot gnu.org
@ 2022-03-21 13:08 ` rguenth at gcc dot gnu.org
  2022-04-13  7:54 ` cvs-commit at gcc dot gnu.org
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-03-21 13:08 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104912

--- Comment #7 from Richard Biener <rguenth at gcc dot gnu.org> ---
I'm noting that for skylake cost we have

  _28 * _33 1 times scalar_stmt costs 16 in prologue

and

  _28 * _33 1 times vector_stmt costs 16 in body

but the load/store costs are just 12.  Compared to znver2 this tips the bias
over to allow vectorization, while for znver2 I currently see no vectorization.
For generic I also see vectorization.

Note that costing currently assumes that the cost model niter check is
performed first and short-cuts all the versioning conditions.  But since
we emit

  _248 = (unsigned int) mk_113;
  _247 = _248 + 4294967295;
  _246 = _247 > 2;
  _245 = stride.4_74 != 0;
  _244 = _245 & _246;
...
  _183 = _184 | _211;
  _182 = _183 & _244;
  if (_182 != 0)
    goto <bb 27>; [80.00%]
  else
    goto <bb 28>; [20.00%]

on GIMPLE, how things are expanded depends on some luck, and with the
standalone testcase and -Ofast with generic tuning we emit the > 2 cost model
check quite late:

        addq    $1, %rdi
        imulq   %r13, %rdi
        leaq    (%rax,%rdi), %rcx
        movq    32(%rsp), %rax
        leaq    (%rax,%rcx), %rsi
        movq    (%rsp), %rax
        leaq    0(,%rsi,8), %rdx
        addq    %rax, %rcx
        leaq    0(,%rcx,8), %rax
        addq    %r13, %rcx
        salq    $3, %rcx
        cmpq    %rcx, %rdx
        setg    %cl
        addq    %r13, %rsi
        salq    $3, %rsi
        cmpq    %rsi, %rax
        setg    %sil
        orb     %cl, %sil
        je      .L8
        movl    -100(%rsp), %esi
        leal    -1(%rsi), %ecx
        cmpl    $2, %ecx             <-----
        movl    112(%rsp), %ecx
        seta    %sil
        testl   %ecx, %ecx
        setg    %cl
        testb   %cl, %sil
        je      .L8

let me try to hack^Wfix this.
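
The ordering that the costing assumes can be illustrated with a small sketch:
the cheap trip count test runs first and branches over the expensive alias
test.  The function and parameter names are hypothetical; the threshold
mirrors `_247 > 2` in the GIMPLE above, and the alias test is passed in as a
callback so that the example can count how often it runs.

```cpp
#include <cassert>
#include <functional>

// Decide whether to enter the vector loop, checking the cost model first.
bool
take_vector_loop (unsigned mk, long stride,
                  const std::function<bool ()> &no_alias)
{
  // Same unsigned arithmetic as `_248 + 4294967295` followed by `> 2`.
  if (mk - 1u <= 2u)
    return false;       // too few iterations: skip all other checks
  if (stride == 0)
    return false;
  return no_alias ();   // the expensive check is evaluated last
}

// Call this from main () to observe the short-circuiting.
inline void
check_ordering ()
{
  int alias_checks = 0;
  auto no_alias = [&alias_checks] { ++alias_checks; return true; };

  assert (!take_vector_loop (2, 1, no_alias));
  assert (alias_checks == 0);   // the alias test was never evaluated
  assert (take_vector_loop (4, 1, no_alias));
  assert (alias_checks == 1);
}
```

When all conditions are instead combined with bitwise ops into one value, as
in the GIMPLE above, nothing guarantees that the trip count test is evaluated
before the alias test.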


* [Bug target/104912] [12 Regression] 416.gamess regression after r12-7612-g69619acd8d9b58
  2022-03-14 11:06 [Bug target/104912] New: [12 Regression] 416.gamess regression after r12-7612 rguenth at gcc dot gnu.org
                   ` (7 preceding siblings ...)
  2022-03-21 13:08 ` rguenth at gcc dot gnu.org
@ 2022-04-13  7:54 ` cvs-commit at gcc dot gnu.org
  2022-04-13  8:02 ` rguenth at gcc dot gnu.org
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2022-04-13  7:54 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104912

--- Comment #8 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:ac8340ee4d1e65f3fd41c547b16895875f4aefa7

commit r12-8132-gac8340ee4d1e65f3fd41c547b16895875f4aefa7
Author: Richard Biener <rguenther@suse.de>
Date:   Mon Mar 21 14:08:25 2022 +0100

    tree-optimization/104912 - ensure cost model is checked first

    The following makes sure that when we build the versioning condition
    for vectorization including the cost model check, we check for the
    cost model and branch over other versioning checks.  That is what
    the cost modeling assumes, since the cost model check is the only
    one accounted for in the scalar outside cost.  Currently we emit
    all checks as straight-line code combined with bitwise ops which
    can result in surprising ordering of checks in the final assembly.

    Since loop_version accepts only a single versioning condition
    the splitting is done after the fact.

    The result is a 1.5% speedup of 416.gamess on x86_64 when compiling
    with -Ofast and tuning for generic or skylake.  That's not enough
    to recover from the slowdown when vectorizing but it now cuts off
    the expensive alias versioning test.

    2022-03-21  Richard Biener  <rguenther@suse.de>

            PR tree-optimization/104912
            * tree-vect-loop-manip.cc (vect_loop_versioning): Split
            the cost model check to a separate BB to make sure it is
            checked first and not combined with other version checks.


* [Bug target/104912] [12 Regression] 416.gamess regression after r12-7612-g69619acd8d9b58
  2022-03-14 11:06 [Bug target/104912] New: [12 Regression] 416.gamess regression after r12-7612 rguenth at gcc dot gnu.org
                   ` (8 preceding siblings ...)
  2022-04-13  7:54 ` cvs-commit at gcc dot gnu.org
@ 2022-04-13  8:02 ` rguenth at gcc dot gnu.org
  2022-04-20 11:28 ` cvs-commit at gcc dot gnu.org
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-04-13  8:02 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104912

--- Comment #9 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to CVS Commits from comment #8)
> The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:
> 
> https://gcc.gnu.org/g:ac8340ee4d1e65f3fd41c547b16895875f4aefa7
> 
> commit r12-8132-gac8340ee4d1e65f3fd41c547b16895875f4aefa7
> Author: Richard Biener <rguenther@suse.de>
> Date:   Mon Mar 21 14:08:25 2022 +0100
> 
>     tree-optimization/104912 - ensure cost model is checked first
>     
>     The following makes sure that when we build the versioning condition
>     for vectorization including the cost model check, we check for the
>     cost model and branch over other versioning checks.  That is what
>     the cost modeling assumes, since the cost model check is the only
>     one accounted for in the scalar outside cost.  Currently we emit
>     all checks as straight-line code combined with bitwise ops which
>     can result in surprising ordering of checks in the final assembly.
>     
>     Since loop_version accepts only a single versioning condition
>     the splitting is done after the fact.
>     
>     The result is a 1.5% speedup of 416.gamess on x86_64 when compiling
>     with -Ofast and tuning for generic or skylake.  That's not enough
>     to recover from the slowdown when vectorizing but it now cuts off
>     the expensive alias versioning test.
>     
>     2022-03-21  Richard Biener  <rguenther@suse.de>
>     
>             PR tree-optimization/104912
>             * tree-vect-loop-manip.cc (vect_loop_versioning): Split
>             the cost model check to a separate BB to make sure it is
>             checked first and not combined with other version checks.

I've wrongly pushed this and reverted it again.


* [Bug target/104912] [12 Regression] 416.gamess regression after r12-7612-g69619acd8d9b58
  2022-03-14 11:06 [Bug target/104912] New: [12 Regression] 416.gamess regression after r12-7612 rguenth at gcc dot gnu.org
                   ` (9 preceding siblings ...)
  2022-04-13  8:02 ` rguenth at gcc dot gnu.org
@ 2022-04-20 11:28 ` cvs-commit at gcc dot gnu.org
  2022-05-06  8:33 ` [Bug target/104912] [12/13 " jakub at gcc dot gnu.org
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2022-04-20 11:28 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104912

--- Comment #10 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:4d4eaa77a7d27bf7b220c3be79615410c873fcf1

commit r12-8205-g4d4eaa77a7d27bf7b220c3be79615410c873fcf1
Author: Richard Biener <rguenther@suse.de>
Date:   Mon Mar 21 14:08:25 2022 +0100

    tree-optimization/104912 - ensure cost model is checked first

    The following makes sure that when we build the versioning condition
    for vectorization including the cost model check, we check for the
    cost model and branch over other versioning checks.  That is what
    the cost modeling assumes, since the cost model check is the only
    one accounted for in the scalar outside cost.  Currently we emit
    all checks as straight-line code combined with bitwise ops which
    can result in surprising ordering of checks in the final assembly.

    Since loop_version accepts only a single versioning condition
    the splitting is done after the fact.

    The result is a 1.5% speedup of 416.gamess on x86_64 when compiling
    with -Ofast and tuning for generic or skylake.  That's not enough
    to recover from the slowdown when vectorizing but it now cuts off
    the expensive alias versioning test.

    2022-03-21  Richard Biener  <rguenther@suse.de>

            PR tree-optimization/104912
            * tree-vect-loop-manip.cc (vect_loop_versioning): Split
            the cost model check to a separate BB to make sure it is
            checked first and not combined with other version checks.


* [Bug target/104912] [12/13 Regression] 416.gamess regression after r12-7612-g69619acd8d9b58
  2022-03-14 11:06 [Bug target/104912] New: [12 Regression] 416.gamess regression after r12-7612 rguenth at gcc dot gnu.org
                   ` (10 preceding siblings ...)
  2022-04-20 11:28 ` cvs-commit at gcc dot gnu.org
@ 2022-05-06  8:33 ` jakub at gcc dot gnu.org
  2022-07-26 12:42 ` rguenth at gcc dot gnu.org
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: jakub at gcc dot gnu.org @ 2022-05-06  8:33 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104912

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|12.0                        |12.2

--- Comment #11 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
GCC 12.1 is being released, retargeting bugs to GCC 12.2.


* [Bug target/104912] [12/13 Regression] 416.gamess regression after r12-7612-g69619acd8d9b58
  2022-03-14 11:06 [Bug target/104912] New: [12 Regression] 416.gamess regression after r12-7612 rguenth at gcc dot gnu.org
                   ` (11 preceding siblings ...)
  2022-05-06  8:33 ` [Bug target/104912] [12/13 " jakub at gcc dot gnu.org
@ 2022-07-26 12:42 ` rguenth at gcc dot gnu.org
  2023-01-31 11:22 ` jamborm at gcc dot gnu.org
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-07-26 12:42 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104912

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P3                          |P2


* [Bug target/104912] [12/13 Regression] 416.gamess regression after r12-7612-g69619acd8d9b58
  2022-03-14 11:06 [Bug target/104912] New: [12 Regression] 416.gamess regression after r12-7612 rguenth at gcc dot gnu.org
                   ` (12 preceding siblings ...)
  2022-07-26 12:42 ` rguenth at gcc dot gnu.org
@ 2023-01-31 11:22 ` jamborm at gcc dot gnu.org
  2023-01-31 12:02 ` rguenth at gcc dot gnu.org
  2023-05-08 12:24 ` [Bug target/104912] [12/13/14 " rguenth at gcc dot gnu.org
  15 siblings, 0 replies; 17+ messages in thread
From: jamborm at gcc dot gnu.org @ 2023-01-31 11:22 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104912

Martin Jambor <jamborm at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jamborm at gcc dot gnu.org

--- Comment #13 from Martin Jambor <jamborm at gcc dot gnu.org> ---
There is still a 5% regression against gcc-10 on zen2 at (just) -Ofast:
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=405.50.0&plot.1=414.50.0&plot.2=754.50.0&plot.3=300.50.0&

4% at -Ofast -march=native:
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=406.50.0&plot.1=421.50.0&plot.2=758.50.0&plot.3=301.50.0&

13% on an Intel IceLake machine at -Ofast -march=native:
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=805.50.0&plot.1=845.50.0&plot.2=820.50.0&plot.3=798.50.0&

5.8% on zen3 at just -Ofast (targeting generic x86_64 ISA):
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=524.50.0&plot.1=504.50.0&plot.2=741.50.0&plot.3=466.50.0&


* [Bug target/104912] [12/13 Regression] 416.gamess regression after r12-7612-g69619acd8d9b58
  2022-03-14 11:06 [Bug target/104912] New: [12 Regression] 416.gamess regression after r12-7612 rguenth at gcc dot gnu.org
                   ` (13 preceding siblings ...)
  2023-01-31 11:22 ` jamborm at gcc dot gnu.org
@ 2023-01-31 12:02 ` rguenth at gcc dot gnu.org
  2023-05-08 12:24 ` [Bug target/104912] [12/13/14 " rguenth at gcc dot gnu.org
  15 siblings, 0 replies; 17+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-01-31 12:02 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104912

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|2022-03-14 00:00:00         |2023-1-31

--- Comment #14 from Richard Biener <rguenth at gcc dot gnu.org> ---
Yep, the situation is complicated.


* [Bug target/104912] [12/13/14 Regression] 416.gamess regression after r12-7612-g69619acd8d9b58
  2022-03-14 11:06 [Bug target/104912] New: [12 Regression] 416.gamess regression after r12-7612 rguenth at gcc dot gnu.org
                   ` (14 preceding siblings ...)
  2023-01-31 12:02 ` rguenth at gcc dot gnu.org
@ 2023-05-08 12:24 ` rguenth at gcc dot gnu.org
  15 siblings, 0 replies; 17+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-05-08 12:24 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104912

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|12.3                        |12.4

--- Comment #15 from Richard Biener <rguenth at gcc dot gnu.org> ---
GCC 12.3 is being released, retargeting bugs to GCC 12.4.
