public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug tree-optimization/114435] New: Bad code generated when SSA and PCOM are enabled.
@ 2024-03-22 17:17 jchrist at linux dot ibm.com
  2024-03-25  8:52 ` [Bug tree-optimization/114435] PCOM messes up vectorization some times rguenth at gcc dot gnu.org
                   ` (9 more replies)
  0 siblings, 10 replies; 11+ messages in thread
From: jchrist at linux dot ibm.com @ 2024-03-22 17:17 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114435

            Bug ID: 114435
           Summary: Bad code generated when SSA and PCOM are enabled.
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jchrist at linux dot ibm.com
  Target Milestone: ---

Created attachment 57783
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57783&action=edit
Reproducer

While investigating a preformance difference between clang and gcc on
imagemagick I discovered that attached test case gets badly vectorized due to
pcom pass.  If I disable pcom and set the vector cost model to unlimited, SLP
produces exactly what I would expect.  With pcom, however, the code becomes
considerably bigger and, depending on the target, even mixes scalar and
vectorized operations while the whole body of the loop should be vectorizable
via SLP.

Difference is observed between the output of

```
gcc -O3 -fvect-cost-model=unlimited fma.c -c
```
and
```
gcc -O3 -fvect-cost-model=unlimited fma.c -c -fdisable-tree-pcom
```

I am wondering if the pcom pass should be after SLP vectorization?

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [Bug tree-optimization/114435] PCOM messes up vectorization some times
  2024-03-22 17:17 [Bug tree-optimization/114435] New: Bad code generated when SSA and PCOM are enabled jchrist at linux dot ibm.com
@ 2024-03-25  8:52 ` rguenth at gcc dot gnu.org
  2024-05-29  8:03 ` jchrist at linux dot ibm.com
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-03-25  8:52 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114435

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rguenth at gcc dot gnu.org
   Last reconfirmed|                            |2024-03-25
     Ever confirmed|0                           |1
             Status|UNCONFIRMED                 |NEW

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
We've moved pcom after loop vectorization because of this.  But yes,
moving pcom further down is a possibility although we lack DCE after SLP
(prefetching and IVOPTs might be also confused because of that, so it
seems like a separate problem).

So yes, moving pcom after SLP and before loop prefetching sounds reasonable
to me.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [Bug tree-optimization/114435] PCOM messes up vectorization some times
  2024-03-22 17:17 [Bug tree-optimization/114435] New: Bad code generated when SSA and PCOM are enabled jchrist at linux dot ibm.com
  2024-03-25  8:52 ` [Bug tree-optimization/114435] PCOM messes up vectorization some times rguenth at gcc dot gnu.org
@ 2024-05-29  8:03 ` jchrist at linux dot ibm.com
  2024-05-29  8:23 ` rguenth at gcc dot gnu.org
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: jchrist at linux dot ibm.com @ 2024-05-29  8:03 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114435

--- Comment #2 from jchrist at linux dot ibm.com ---
I tried this, but it seems like pcom does not handle vectors at all:  In the
gimple input I have

  vectp.5_32 = r_26(D);
  # VUSE <.MEM_52>
  vect__51.6_1 = MEM <vector(2) doubleD.32> [(doubleD.32 *)vectp.5_32];
  # PT = nonlocal null 
  # ALIGN = 8, MISALIGN = 0
  vectp.5_2 = vectp.5_32 + 16;
  # VUSE <.MEM_52>
  vect__51.7_3 = MEM <vector(2) doubleD.32> [(doubleD.32 *)vectp.5_2];
[...]
  vectp.15_12 = r_26(D);
  # .MEM_13 = VDEF <.MEM_52>
  MEM <vector(2) doubleD.32> [(doubleD.32 *)vectp.15_12] = vect__45.13_11;
  # PT = nonlocal null 
  # ALIGN = 8, MISALIGN = 0
  vectp.15_14 = vectp.15_12 + 16;
  # .MEM_15 = VDEF <.MEM_13>
  MEM <vector(2) doubleD.32> [(doubleD.32 *)vectp.15_14] = vect__45.13_29;

But the analyzed data dependencies are:

(Data Dep: 
#(Data Ref: 
#  bb: 9 
#  stmt: vect__51.6_1 = MEM <vector(2) double> [(double *)vectp.5_32];
#  ref: MEM <vector(2) double> [(double *)vectp.5_32];
#  base_object: MEM <vector(2) double> [(double *)vectp.5_32];
#)
#(Data Ref: 
#  bb: 9 
#  stmt: MEM <vector(2) double> [(double *)vectp.15_12] = vect__45.13_11;
#  ref: MEM <vector(2) double> [(double *)vectp.15_12];
#  base_object: MEM <vector(2) double> [(double *)vectp.15_12];
#)
    (don't know)
)

(Data Dep: 
#(Data Ref: 
#  bb: 9 
#  stmt: vect__51.7_3 = MEM <vector(2) double> [(double *)vectp.5_2];
#  ref: MEM <vector(2) double> [(double *)vectp.5_2];
#  base_object: MEM <vector(2) double> [(double *)vectp.5_2];
#)
#(Data Ref: 
#  bb: 9 
#  stmt: MEM <vector(2) double> [(double *)vectp.15_14] = vect__45.13_29;
#  ref: MEM <vector(2) double> [(double *)vectp.15_14];
#  base_object: MEM <vector(2) double> [(double *)vectp.15_14];
#)
    (don't know)
)

Is this expected?  Because I think this is the reason why the generated code is
still not optimal.  In every loop iteration, we still load the two accumulation
vectors on s390x, just to use them for fma and then store them.  If I
understand commoning correctly, this would be one case where it should solve
this problem and improve the code by loading and storing the accumulator
outside of the loop.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [Bug tree-optimization/114435] PCOM messes up vectorization some times
  2024-03-22 17:17 [Bug tree-optimization/114435] New: Bad code generated when SSA and PCOM are enabled jchrist at linux dot ibm.com
  2024-03-25  8:52 ` [Bug tree-optimization/114435] PCOM messes up vectorization some times rguenth at gcc dot gnu.org
  2024-05-29  8:03 ` jchrist at linux dot ibm.com
@ 2024-05-29  8:23 ` rguenth at gcc dot gnu.org
  2024-05-29  8:34 ` rguenth at gcc dot gnu.org
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-05-29  8:23 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114435

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
Looking again the reason for the "bad" vectorization with pcom applied is

t.c:23:23: missed:   Build SLP failed: operation unsupported _51 =
r__r0_lsm0.7_7;

that is, pcom leaves around SSA name copies which we do not handle.  We
could probably somehow ignore those during SLP build (but we've most of
the time just fixed whoever leaves those around).  Maybe it's time to
do this.  Note we do not want a plain copy in the SLP tree, instead
when looking for the def of the operands of the PHI.  Note it would be
better to avoid the SSA copy generated by predcom.

Sneaking in a copy_prop pass after pcom just for checking vectorizes
the thing just fine, including the added recurrence:

  <bb 4> [local count: 70429947]:
  _12 = {k_25(D), k_25(D)};
  vect__8.10_28 = MEM <vector(2) double> [(double *)r_26(D)];
  vect__8.11_5 = MEM <vector(2) double> [(double *)r_26(D) + 16B];
  ivtmp.24_58 = (unsigned long) in_27(D);
  _65 = (unsigned long) sz_24(D);
  _66 = _65 * 32;
  _68 = ivtmp.24_58 + _66;

  <bb 5> [local count: 640272252]:
  # vect_r__r0_lsm0.17_15 = PHI <vect__8.10_28(4), vect__45.18_16(5)>
  # vect_r__r0_lsm0.17_30 = PHI <vect__8.11_5(4), vect__45.18_17(5)>
  # ivtmp.24_51 = PHI <ivtmp.24_58(4), ivtmp.24_54(5)>
  _20 = (void *) ivtmp.24_51;
  vect__47.14_9 = MEM <vector(2) double> [(double *)_20];
  vect__47.15_11 = MEM <vector(2) double> [(double *)_20 + 16B];
  vect__46.16_13 = vect__47.14_9 * _12;
  vect__46.16_14 = vect__47.15_11 * _12;
  vect__45.18_16 = vect__46.16_13 + vect_r__r0_lsm0.17_15;
  vect__45.18_17 = vect__46.16_14 + vect_r__r0_lsm0.17_30;
  MEM <vector(2) double> [(double *)r_26(D)] = vect__45.18_16;
  MEM <vector(2) double> [(double *)r_26(D) + 16B] = vect__45.18_17;
  ivtmp.24_54 = ivtmp.24_51 + 32;
  if (ivtmp.24_54 != _68)
    goto <bb 5>; [89.00%]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [Bug tree-optimization/114435] PCOM messes up vectorization some times
  2024-03-22 17:17 [Bug tree-optimization/114435] New: Bad code generated when SSA and PCOM are enabled jchrist at linux dot ibm.com
                   ` (2 preceding siblings ...)
  2024-05-29  8:23 ` rguenth at gcc dot gnu.org
@ 2024-05-29  8:34 ` rguenth at gcc dot gnu.org
  2024-05-29  9:25 ` jchrist at linux dot ibm.com
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-05-29  8:34 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114435

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED
           Assignee|unassigned at gcc dot gnu.org      |rguenth at gcc dot gnu.org

--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #3)
[..]
> Note it would be
> better to avoid the SSA copy generated by predcom.

that looks a bit difficult with the way it operates.  pcom could
set PENDING_TODO_force_next_scalar_cleanup, this does the trick for me.

diff --git a/gcc/tree-predcom.cc b/gcc/tree-predcom.cc
index 75a4c85164c..9844fee1e97 100644
--- a/gcc/tree-predcom.cc
+++ b/gcc/tree-predcom.cc
@@ -3522,6 +3522,9 @@ tree_predictive_commoning (bool allow_unroll_p)
        }
     }

+  if (ret != 0)
+    cfun->pending_TODOs |= PENDING_TODO_force_next_scalar_cleanup;
+
   return ret;
 }

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [Bug tree-optimization/114435] PCOM messes up vectorization some times
  2024-03-22 17:17 [Bug tree-optimization/114435] New: Bad code generated when SSA and PCOM are enabled jchrist at linux dot ibm.com
                   ` (3 preceding siblings ...)
  2024-05-29  8:34 ` rguenth at gcc dot gnu.org
@ 2024-05-29  9:25 ` jchrist at linux dot ibm.com
  2024-05-29 10:43 ` rguenth at gcc dot gnu.org
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: jchrist at linux dot ibm.com @ 2024-05-29  9:25 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114435

--- Comment #5 from jchrist at linux dot ibm.com ---
I tried your patch and it leaves an awful amount of dead stores to the
accumulator within the loop.  I also still see the stores inside the loop in
gimple.  Is this really desired?  Or is this an artifact of our unrolling
setting on s390x?  But even in the gimple I see the store inside the loop.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [Bug tree-optimization/114435] PCOM messes up vectorization some times
  2024-03-22 17:17 [Bug tree-optimization/114435] New: Bad code generated when SSA and PCOM are enabled jchrist at linux dot ibm.com
                   ` (4 preceding siblings ...)
  2024-05-29  9:25 ` jchrist at linux dot ibm.com
@ 2024-05-29 10:43 ` rguenth at gcc dot gnu.org
  2024-05-29 10:55 ` rguenth at gcc dot gnu.org
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-05-29 10:43 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114435

--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to jchrist from comment #5)
> I tried your patch and it leaves an awful amount of dead stores to the
> accumulator within the loop.  I also still see the stores inside the loop in
> gimple.  Is this really desired?  Or is this an artifact of our unrolling
> setting on s390x?  But even in the gimple I see the store inside the loop.

The main issue is that we cannot do store-motion in the loop during
invariant motion.  I have not checked why.

So the (vector) accumulator update stays in the loop and if you unroll
this say during RTL then you'll see the duplicates.  Note that then
appearantly RTL DSE also cannot remove them (likely due to the same reason,
all memory accesses use alias-set 1).

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [Bug tree-optimization/114435] PCOM messes up vectorization some times
  2024-03-22 17:17 [Bug tree-optimization/114435] New: Bad code generated when SSA and PCOM are enabled jchrist at linux dot ibm.com
                   ` (5 preceding siblings ...)
  2024-05-29 10:43 ` rguenth at gcc dot gnu.org
@ 2024-05-29 10:55 ` rguenth at gcc dot gnu.org
  2024-05-29 10:58 ` cvs-commit at gcc dot gnu.org
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-05-29 10:55 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114435

--- Comment #7 from Richard Biener <rguenth at gcc dot gnu.org> ---
Ah, the missed store motion is because of the IS_NAN (k) check which
makes the memory accesses only conditional executed and thus possibly
trap.  We "fix" that only during loop unswitching which hoists the
invariant check.  But there's no store-motion after unswitching.
Removing this check shows we can apply store-motion.

  /* If it can trap, it must be always executed in LOOP.
     Readonly memory locations may trap when storing to them, but
     tree_could_trap_p is a predicate for rvalues, so check that
     explicitly.  */
  base = get_base_address (ref->mem.ref);
  if ((tree_could_trap_p (ref->mem.ref)
       || (DECL_P (base) && TREE_READONLY (base)))
      /* ???  We can at least use false here, allowing loads?  We
         are forcing conditional stores if the ref is not always
         stored to later anyway.  So this would only guard
         the load we need to emit.  Thus when the ref is not
         loaded we can elide this completely?  */
      && !ref_always_accessed_p (loop, ref, true))
    return false;

as the comment explains we cannot do "conditional" initialization, thus
guard the load with the duplicated invariant condition.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [Bug tree-optimization/114435] PCOM messes up vectorization some times
  2024-03-22 17:17 [Bug tree-optimization/114435] New: Bad code generated when SSA and PCOM are enabled jchrist at linux dot ibm.com
                   ` (6 preceding siblings ...)
  2024-05-29 10:55 ` rguenth at gcc dot gnu.org
@ 2024-05-29 10:58 ` cvs-commit at gcc dot gnu.org
  2024-05-29 11:07 ` rguenth at gcc dot gnu.org
  2024-05-29 11:10 ` rguenth at gcc dot gnu.org
  9 siblings, 0 replies; 11+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2024-05-29 10:58 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114435

--- Comment #8 from GCC Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:1065a7db6f2a69770a85b4d53b9123b090dd1771

commit r15-895-g1065a7db6f2a69770a85b4d53b9123b090dd1771
Author: Richard Biener <rguenther@suse.de>
Date:   Wed May 29 10:41:51 2024 +0200

    tree-optimization/114435 - pcom left around copies confusing SLP

    The following arranges for the pre-SLP vectorization scalar cleanup
    to be run when predictive commoning was applied to a loop in the
    function.  This is similar to the complete unroll situation and
    facilitating SLP vectorization.  Avoiding the SSA copies in predictive
    commoning itself isn't easy (and predcom also sometimes unrolls,
    asking for scalar cleanup).

            PR tree-optimization/114435
            * tree-predcom.cc (tree_predictive_commoning): Queue
            the next scalar cleanup sub-pipeline to be run when we
            did something.

            * gcc.dg/vect/bb-slp-pr114435.c: New testcase.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [Bug tree-optimization/114435] PCOM messes up vectorization some times
  2024-03-22 17:17 [Bug tree-optimization/114435] New: Bad code generated when SSA and PCOM are enabled jchrist at linux dot ibm.com
                   ` (7 preceding siblings ...)
  2024-05-29 10:58 ` cvs-commit at gcc dot gnu.org
@ 2024-05-29 11:07 ` rguenth at gcc dot gnu.org
  2024-05-29 11:10 ` rguenth at gcc dot gnu.org
  9 siblings, 0 replies; 11+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-05-29 11:07 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114435

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
             Status|ASSIGNED                    |RESOLVED

--- Comment #9 from Richard Biener <rguenth at gcc dot gnu.org> ---
So the "pcom messes up SLP" part should be fixed now.  The pass dependence of
invariant/store motion and unswitching (and likely also loop splitting) is
something different.  We may want to track this in a seprate bug.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [Bug tree-optimization/114435] PCOM messes up vectorization some times
  2024-03-22 17:17 [Bug tree-optimization/114435] New: Bad code generated when SSA and PCOM are enabled jchrist at linux dot ibm.com
                   ` (8 preceding siblings ...)
  2024-05-29 11:07 ` rguenth at gcc dot gnu.org
@ 2024-05-29 11:10 ` rguenth at gcc dot gnu.org
  9 siblings, 0 replies; 11+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-05-29 11:10 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114435

--- Comment #10 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #9)
> So the "pcom messes up SLP" part should be fixed now.  The pass dependence
> of invariant/store motion and unswitching (and likely also loop splitting) is
> something different.  We may want to track this in a seprate bug.

Note there's a conditional (on graphite) LIM pass after high-level loop opts,
it might be an option to turn it into an unconditional instance.

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2024-05-29 11:10 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-03-22 17:17 [Bug tree-optimization/114435] New: Bad code generated when SSA and PCOM are enabled jchrist at linux dot ibm.com
2024-03-25  8:52 ` [Bug tree-optimization/114435] PCOM messes up vectorization some times rguenth at gcc dot gnu.org
2024-05-29  8:03 ` jchrist at linux dot ibm.com
2024-05-29  8:23 ` rguenth at gcc dot gnu.org
2024-05-29  8:34 ` rguenth at gcc dot gnu.org
2024-05-29  9:25 ` jchrist at linux dot ibm.com
2024-05-29 10:43 ` rguenth at gcc dot gnu.org
2024-05-29 10:55 ` rguenth at gcc dot gnu.org
2024-05-29 10:58 ` cvs-commit at gcc dot gnu.org
2024-05-29 11:07 ` rguenth at gcc dot gnu.org
2024-05-29 11:10 ` rguenth at gcc dot gnu.org

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).