[Bug tree-optimization/110935] New: Missed BB reduction vectorization because of missed eliding of a permute

public inbox for gcc-bugs@sourceware.org

* [Bug tree-optimization/110935] New: Missed BB reduction vectorization because of missed eliding of a permute
From: rguenth at gcc dot gnu.org @ 2023-08-07 13:28 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110935

            Bug ID: 110935
           Summary: Missed BB reduction vectorization because of missed
                    eliding of a permute
           Product: gcc
           Version: 13.1.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: rguenth at gcc dot gnu.org
  Target Milestone: ---

double vals[16];
double test ()
{
  vals[0]++;
  return vals[2] + vals[4] + vals[1] + vals[3];
}

With -ffast-math, the reduction is not vectorized because

t.c:5:38: note:   === vect_slp_analyze_operations ===
t.c:5:38: note:   ==> examining statement: _8 = vals[3];
t.c:5:38: missed:   BB vectorization with gaps at the end of a load is not supported
t.c:5:44: missed:   not vectorized: relevant stmt not supported: _8 = vals[3];
t.c:5:38: note:   removing SLP instance operations starting from: _11 = _7 + _8;
t.c:5:38: missed:  not vectorized: bad operation in basic block.

we fail to elide the load permutation (BB vect allows a consecutive
sub-set):

t.c:5:38: note:   Final SLP tree for instance 0x51c8d60:
t.c:5:38: note:   node 0x5285860 (max_nunits=2, refcnt=2) vector(2) double
t.c:5:38: note:   op template: _8 = vals[3];
t.c:5:38: note:         stmt 0 _8 = vals[3];
t.c:5:38: note:         stmt 1 _6 = vals[1];
t.c:5:38: note:         stmt 2 _3 = vals[2];
t.c:5:38: note:         stmt 3 _4 = vals[4];
t.c:5:38: note:         load permutation { 3 1 2 4 }
t.c:5:38: note:    === vect_match_slp_patterns ===
t.c:5:38: note:    Analyzing SLP tree 0x5285860 for patterns
t.c:5:38: note:  SLP optimize permutations:
t.c:5:38: note:    1: { 2, 0, 1, 3 }
t.c:5:38: note:  SLP optimize partitions:
t.c:5:38: note:    -------------
t.c:5:38: note:    partition 0 (layout 0):
t.c:5:38: note:      nodes:
t.c:5:38: note:        - 0x5285860:
t.c:5:38: note:            weight: 1.000000
t.c:5:38: note:            op template: _8 = vals[3];
t.c:5:38: note:      edges:
t.c:5:38: note:      layout 0: (*)
t.c:5:38: note:          {depth: 0.000000, total: 0.000000}
t.c:5:38: note:        + {depth: 1.000000, total: 1.000000}
t.c:5:38: note:        + {depth: 0.000000, total: 0.000000}
t.c:5:38: note:        = {depth: 1.000000, total: 1.000000}
t.c:5:38: note:      layout 1:
t.c:5:38: note:          {depth: 0.000000, total: 0.000000}
t.c:5:38: note:        + {depth: 1.000000, total: 1.000000}
t.c:5:38: note:        + {depth: 0.000000, total: 0.000000}
t.c:5:38: note:        = {depth: 1.000000, total: 1.000000}
t.c:5:38: note:  recording new base alignment for &vals
  alignment:    32
  misalignment: 0
  based on:     _1 = vals[0];
t.c:5:38: note:   === vect_slp_analyze_instance_alignment ===
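
To illustrate what eliding the load permutation would mean for this
testcase, here is a minimal scalar sketch.  It assumes that reassociation
is allowed (as with -ffast-math), so the order of the lanes feeding the sum
does not matter.  The function names sum_permuted and sum_contiguous are
made up for this illustration, and the lanes[] array only stands in for a
vector register; this is not what the vectorizer emits.

```c
#include <stddef.h>

double vals[16];

/* Scalar form from the report: lanes are summed in the order 2, 4, 1, 3.  */
double sum_permuted (void)
{
  return vals[2] + vals[4] + vals[1] + vals[3];
}

/* With the permute elided: read vals[1..4] in memory order as one
   consecutive group, then reduce the four lanes.  */
double sum_contiguous (void)
{
  double lanes[4];
  for (size_t i = 0; i < 4; i++)
    lanes[i] = vals[1 + i];
  /* Pairwise reduction; a different association than the scalar code,
     which is only valid when reassociation is permitted.  */
  return (lanes[0] + lanes[2]) + (lanes[1] + lanes[3]);
}
```

Both functions read the same four elements, so they agree whenever the
additions are exact; in general they may differ by rounding, which is why
the transformation needs -ffast-math (or at least reassociation).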


* [Bug tree-optimization/110935] Missed BB reduction vectorization because of missed eliding of a permute
From: rguenth at gcc dot gnu.org @ 2023-08-07 13:30 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110935

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Target|                            |x86_64-*-*
      Known to fail|                            |13.2.1, 14.0
                 CC|                            |rsandifo at gcc dot gnu.org
           Keywords|                            |missed-optimization

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
I didn't find where we make sure to elide the "outgoing" permute of a
reduction, but I think we only have testcases for the loop vectorization case.
Can you suggest where we'd do this?  Note we do not represent the plus
reduction operation but the whole SLP instance has just a single node (with
load permutation).


* [Bug tree-optimization/110935] Missed BB reduction vectorization because of missed eliding of a permute
From: rsandifo at gcc dot gnu.org @ 2023-09-05  9:01 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110935

--- Comment #2 from rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> ---
If we were going to do this in vect_optimize_slp_pass, I think
we'd need a node for the reduction in the pass's internal graph.
We could then record that all input layouts have zero cost.

What's the reason for not having an SLP node for the reduction?
Isn't it a similar kind of sink to a store or constructor?
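
As a rough illustration of the idea of a sink node for which all input
layouts have zero cost, here is a toy cost model.  Everything in it is an
assumption made for the example: the names (apply_layout, load_cost,
sink_cost, total_cost), the unit costs, and the convention that lane i
moves to position layout[i] are not taken from vect_optimize_slp_pass.

```c
#include <stdbool.h>
#include <stddef.h>

#define LANES 4

/* Toy model only: none of these names exist in GCC.  */

/* Apply a layout to a load permutation.  Assumed convention: the element
   in lane i moves to position layout[i].  */
static void apply_layout (const unsigned *load_perm, const unsigned *layout,
                          unsigned *out)
{
  for (size_t i = 0; i < LANES; i++)
    out[layout[i]] = load_perm[i];
}

/* A load needs no permute when its lanes are consecutive and increasing.  */
static bool is_consecutive (const unsigned *perm)
{
  for (size_t i = 1; i < LANES; i++)
    if (perm[i] != perm[i - 1] + 1)
      return false;
  return true;
}

/* Cost of the load node under a layout: 0 without a permute, else 1.  */
static unsigned load_cost (const unsigned *load_perm, const unsigned *layout)
{
  unsigned permuted[LANES];
  apply_layout (load_perm, layout, permuted);
  return is_consecutive (permuted) ? 0 : 1;
}

/* Cost of the sink.  A reduction accepts any lane order for free; an
   order-sensitive sink (such as a store) pays 1 for a non-identity layout.  */
static unsigned sink_cost (bool is_reduction, const unsigned *layout)
{
  if (is_reduction)
    return 0;
  for (size_t i = 0; i < LANES; i++)
    if (layout[i] != i)
      return 1;
  return 0;
}

/* Total cost of choosing LAYOUT for a single load feeding a single sink.  */
unsigned total_cost (bool is_reduction, const unsigned *load_perm,
                     const unsigned *layout)
{
  return load_cost (load_perm, layout) + sink_cost (is_reduction, layout);
}
```

With the load permutation { 3 1 2 4 } and the candidate layout
{ 2, 0, 1, 3 } from the dump, this toy model gives the reduction sink a
total cost of 0 for the candidate layout and 1 for the identity layout,
while an order-sensitive sink costs 1 either way.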


* [Bug tree-optimization/110935] Missed BB reduction vectorization because of missed eliding of a permute
From: rguenther at suse dot de @ 2023-09-12  7:43 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110935

--- Comment #3 from rguenther at suse dot de <rguenther at suse dot de> ---
On Tue, 5 Sep 2023, rsandifo at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110935
> 
> --- Comment #2 from rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> ---
> If we were going to do this in vect_optimize_slp_pass, I think
> we'd need a node for the reduction in the pass's internal graph.
> We could then record that all input layouts have zero cost.
> 
> What's the reason for not having an SLP node for the reduction?
> Isn't it a similar kind of sink to a store or constructor?

The difference is that the reduction reduces the number of incoming
lanes (to one).  For a loop SLP reduction chain we also do not have an SLP
node for that part (because it's in the epilog).  For a loop SLP
reduction there isn't a reduction operation.  For both cases we manage
to elide permutes into them - I wondered how we do that in the new code
and if we can leverage that for the BB reduction case.

I did think of representing the reduction op but wondered how to do
that in the most sensible way.  It's kind of a permute node with
an associated operation.  Or, if we use .REDUC_*_SCAL, a regular
node with a scalar vectype?  I'm not sure we want to overload
the VEC_PERM_EXPR SLP node further.  But for example with x86
we have a SAD operation with 4 incoming lanes in op0, 16 incoming
lanes in op1 and 4 outgoing lanes.

That said, currently the reduction node is implicit in the
instance root stmt and can be identified by the SLP instance kind only.


* [Bug tree-optimization/110935] Missed BB reduction vectorization because of missed eliding of a permute
From: pinskia at gcc dot gnu.org @ 2024-01-21  2:00 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110935

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2024-01-21
     Ever confirmed|0                           |1
                 CC|                            |pinskia at gcc dot gnu.org
             Status|UNCONFIRMED                 |NEW

--- Comment #4 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Confirmed.  I noticed the PERM issue when I was working on adding V4HI support
to the aarch64 backend.  I had to add PERM support for V4HI but it was not
obvious at the time why. This explains it.


* [Bug tree-optimization/110935] Missed BB reduction vectorization because of missed eliding of a permute
From: rguenth at gcc dot gnu.org @ 2024-04-15 13:33 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110935

--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
So ideally we could special-case the "output" of the SLP instance root.  It
might be possible to insert the node just into the digraph.
