public inbox for gcc-bugs@sourceware.org
* [Bug tree-optimization/114435] New: Bad code generated when SSA and PCOM are enabled.
@ 2024-03-22 17:17 jchrist at linux dot ibm.com
  2024-03-25  8:52 ` [Bug tree-optimization/114435] PCOM messes up vectorization some times rguenth at gcc dot gnu.org
  2024-05-29  8:03 ` jchrist at linux dot ibm.com
  0 siblings, 2 replies; 3+ messages in thread
From: jchrist at linux dot ibm.com @ 2024-03-22 17:17 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114435

            Bug ID: 114435
           Summary: Bad code generated when SSA and PCOM are enabled.
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jchrist at linux dot ibm.com
  Target Milestone: ---

Created attachment 57783
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57783&action=edit
Reproducer

While investigating a performance difference between clang and gcc on
imagemagick, I discovered that the attached test case gets badly vectorized
due to the pcom pass.  If I disable pcom and set the vector cost model to
unlimited, SLP produces exactly what I would expect.  With pcom, however, the
code becomes considerably bigger and, depending on the target, even mixes
scalar and vectorized operations, although the whole body of the loop should
be vectorizable via SLP.

The difference can be observed between the output of

```
gcc -O3 -fvect-cost-model=unlimited fma.c -c
```
and
```
gcc -O3 -fvect-cost-model=unlimited fma.c -c -fdisable-tree-pcom
```

I am wondering if the pcom pass should be after SLP vectorization?


* [Bug tree-optimization/114435] PCOM messes up vectorization some times
From: rguenth at gcc dot gnu.org @ 2024-03-25  8:52 UTC
  To: gcc-bugs

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rguenth at gcc dot gnu.org
   Last reconfirmed|                            |2024-03-25
     Ever confirmed|0                           |1
             Status|UNCONFIRMED                 |NEW

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
We've moved pcom after loop vectorization because of this.  But yes,
moving pcom further down is a possibility, although we lack DCE after SLP
(prefetching and IVOPTs might also be confused because of that, so it
seems like a separate problem).

So yes, moving pcom after SLP and before loop prefetching sounds reasonable
to me.


* [Bug tree-optimization/114435] PCOM messes up vectorization some times
From: jchrist at linux dot ibm.com @ 2024-05-29  8:03 UTC
  To: gcc-bugs

--- Comment #2 from jchrist at linux dot ibm.com ---
I tried this, but it seems that pcom does not handle vectors at all.  In the
GIMPLE input I have:

  vectp.5_32 = r_26(D);
  # VUSE <.MEM_52>
  vect__51.6_1 = MEM <vector(2) doubleD.32> [(doubleD.32 *)vectp.5_32];
  # PT = nonlocal null 
  # ALIGN = 8, MISALIGN = 0
  vectp.5_2 = vectp.5_32 + 16;
  # VUSE <.MEM_52>
  vect__51.7_3 = MEM <vector(2) doubleD.32> [(doubleD.32 *)vectp.5_2];
[...]
  vectp.15_12 = r_26(D);
  # .MEM_13 = VDEF <.MEM_52>
  MEM <vector(2) doubleD.32> [(doubleD.32 *)vectp.15_12] = vect__45.13_11;
  # PT = nonlocal null 
  # ALIGN = 8, MISALIGN = 0
  vectp.15_14 = vectp.15_12 + 16;
  # .MEM_15 = VDEF <.MEM_13>
  MEM <vector(2) doubleD.32> [(doubleD.32 *)vectp.15_14] = vect__45.13_29;

But the analyzed data dependencies are:

(Data Dep: 
#(Data Ref: 
#  bb: 9 
#  stmt: vect__51.6_1 = MEM <vector(2) double> [(double *)vectp.5_32];
#  ref: MEM <vector(2) double> [(double *)vectp.5_32];
#  base_object: MEM <vector(2) double> [(double *)vectp.5_32];
#)
#(Data Ref: 
#  bb: 9 
#  stmt: MEM <vector(2) double> [(double *)vectp.15_12] = vect__45.13_11;
#  ref: MEM <vector(2) double> [(double *)vectp.15_12];
#  base_object: MEM <vector(2) double> [(double *)vectp.15_12];
#)
    (don't know)
)

(Data Dep: 
#(Data Ref: 
#  bb: 9 
#  stmt: vect__51.7_3 = MEM <vector(2) double> [(double *)vectp.5_2];
#  ref: MEM <vector(2) double> [(double *)vectp.5_2];
#  base_object: MEM <vector(2) double> [(double *)vectp.5_2];
#)
#(Data Ref: 
#  bb: 9 
#  stmt: MEM <vector(2) double> [(double *)vectp.15_14] = vect__45.13_29;
#  ref: MEM <vector(2) double> [(double *)vectp.15_14];
#  base_object: MEM <vector(2) double> [(double *)vectp.15_14];
#)
    (don't know)
)

Is this expected?  I think this is why the generated code is still not
optimal.  On s390x we still load the two accumulation vectors in every loop
iteration, just to use them for the fma and then store them back.  If I
understand predictive commoning correctly, this is a case it should handle,
improving the code by loading the accumulators before the loop and storing
them after it.
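
To make the expectation concrete, here is a minimal scalar sketch of the
transformation in question.  It is not the attached reproducer: the function
names, the four-element accumulator and the array layout are hypothetical,
chosen only to mirror the "load accumulator, fma, store accumulator" pattern
visible in the GIMPLE above.  The second function is what hoisting the
accumulator loads and stores out of the loop would look like if written by
hand; it is only equivalent when r does not overlap a or b.

```c
#include <stddef.h>

/* Hypothetical kernel (not the attachment): the accumulators live in
   memory at r[0..3] and are reloaded and stored on every iteration.  */
void accumulate_in_memory(double *r, const double *a, const double *b,
                          size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (int k = 0; k < 4; k++)
            r[k] = r[k] + a[4 * i + k] * b[4 * i + k];
}

/* Hand-hoisted form: load the accumulators once before the loop, keep
   them in locals, store them once afterwards.  Assumes r does not alias
   a or b; with overlap the two functions can give different results.  */
void accumulate_in_registers(double *r, const double *a, const double *b,
                             size_t n)
{
    double acc[4] = { r[0], r[1], r[2], r[3] };

    for (size_t i = 0; i < n; i++)
        for (int k = 0; k < 4; k++)
            acc[k] = acc[k] + a[4 * i + k] * b[4 * i + k];

    for (int k = 0; k < 4; k++)
        r[k] = acc[k];
}
```

Whether pcom or another pass is the right place to get from the first form to
the second on the vectorized code is exactly the open question here; the
sketch only shows the before and after shapes.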
