[Bug tree-optimization/114192] New: scalar code left around following early break vectorization of reduction

From: acoplan at gcc dot gnu.org @ 2024-03-01 16:44 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114192

            Bug ID: 114192
           Summary: scalar code left around following early break
                    vectorization of reduction
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: acoplan at gcc dot gnu.org
  Target Milestone: ---

For the following testcase:

int a[1024];
int f4(int *x, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++)
    {
        sum += a[i];
        if (a[i] == 42)
            break;
    }
    return sum;
}

at -O3 on aarch64 we vectorize it and get the following vector loop:

.L4:
        cmp     x7, x2
        beq     .L23
.L6:
        ubfiz   x3, x2, 4, 32
        ldr     w6, [x4, x2, lsl 2]    // scalar load
        mov     v27.16b, v30.16b
        mov     w0, w5
        add     v30.4s, v30.4s, v25.4s
        add     w5, w5, w6             // scalar add
        ldr     q29, [x4, x3]
        add     x2, x2, 1
        cmeq    v31.4s, v29.4s, v26.4s
        add     v28.4s, v28.4s, v29.4s
        umaxp   v31.4s, v31.4s, v31.4s
        fmov    x3, d31
        cbz     x3, .L4

but here the old scalar code (the ldr/add pair marked above) has been left in
the loop.  If we remove the early exit from the loop, the vectorizer still
leaves the scalar code around, but the DCE pass that follows immediately
removes it.

Without the early exit, in the vectorizer dump we have:

  <bb 3> [local count: 860067200]:
  # sum_10 = PHI <sum_6(6), 0(9)>
  # i_12 = PHI <i_7(6), 0(9)>
  # vect_sum_10.8_25 = PHI <vect_sum_6.12_29(6), { 0, 0, 0, 0 }(9)>
  # vectp_a.9_26 = PHI <vectp_a.9_27(6), &a(9)>
  # ivtmp_32 = PHI <ivtmp_33(6), 0(9)>
  vect__1.11_28 = MEM <vector(4) int> [(int *)vectp_a.9_26];
  _1 = a[i_12]; // scalar load
  vect_sum_6.12_29 = vect__1.11_28 + vect_sum_10.8_25;
  sum_6 = _1 + sum_10;
  i_7 = i_12 + 1;
  vectp_a.9_27 = vectp_a.9_26 + 16;
  ivtmp_33 = ivtmp_32 + 1;
  if (ivtmp_33 < bnd.5_22)
    goto <bb 6>; [89.00%]
  else
    goto <bb 11>; [11.00%]

i.e. the scalar load is left around, but it seems to get cleaned up by the
(immediately following) dce pass:

  <bb 3> [local count: 860067200]:
  # vect_sum_10.8_25 = PHI <vect_sum_6.12_29(6), { 0, 0, 0, 0 }(9)>
  # vectp_a.9_26 = PHI <vectp_a.9_27(6), &a(9)>
  # ivtmp_32 = PHI <ivtmp_33(6), 0(9)>
  vect__1.11_28 = MEM <vector(4) int> [(int *)vectp_a.9_26];
  vect_sum_6.12_29 = vect__1.11_28 + vect_sum_10.8_25;
  vectp_a.9_27 = vectp_a.9_26 + 16;
  ivtmp_33 = ivtmp_32 + 1;
  if (ivtmp_33 < bnd.5_22)
    goto <bb 6>; [89.00%]
  else
    goto <bb 11>; [11.00%]

Perhaps DCE needs improving so that it also cleans up the dead scalar code in
the early exit case.


* [Bug tree-optimization/114192] scalar code left around following early break vectorization of reduction
From: tnfchris at gcc dot gnu.org @ 2024-03-01 17:00 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114192

Tamar Christina <tnfchris at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
     Ever confirmed|0                           |1
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2024-03-01

--- Comment #1 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
Confirmed.

It looks like DCE6 no longer considers the following dead after vectorization:

  # sum_10 = PHI <sum_7(7), 0(11)>

  _1 = aD.4432[i_12];
  sum_7 = _1 + sum_11;

It removes the only dead consumer of sum_7, a PHI node left over in the
guard block that becomes unused once the reduction is vectorized.

DCE says:

marking necessary through sum_11 stmt sum_11 = PHI <sum_7(7), 0(11)>
processing: sum_11 = PHI <sum_7(7), 0(11)>

marking necessary through sum_7 stmt sum_7 = _1 + sum_11;
processing: sum_7 = _1 + sum_11;

marking necessary through _1 stmt _1 = a[i_12];
processing: _1 = a[i_12];

so it thinks the closed definition is needed?
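
The "marking necessary through" lines above are the backward walk of a
mark-and-sweep DCE: anything reachable from a necessary statement through
use->def links stays.  A minimal sketch of that idea, as a toy model only and
not GCC's implementation: the names toy_stmt, toy_mark_necessary and
toy_scalar_add_survives are made up for illustration, and the "epilog_init"
and "red" statements are assumed stand-ins for whatever the epilog actually
consumes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy statement: defines one value and uses up to two others.
   Operands are indices into the statement array, or -1 for none.  */
struct toy_stmt {
    const char *text;   /* printable form, for the reader only */
    int use[2];         /* indices of the defining statements */
    bool root;          /* inherently necessary, e.g. a return */
};

/* Mark every statement reachable backwards from a root through
   use->def links.  Returns the number of statements marked.  */
size_t toy_mark_necessary(const struct toy_stmt *s, size_t n, bool *necessary)
{
    int worklist[64];
    size_t top = 0, marked = 0;

    assert(n <= 64);    /* each statement is pushed at most once */
    for (size_t i = 0; i < n; i++) {
        necessary[i] = s[i].root;
        if (s[i].root) {
            worklist[top++] = (int)i;
            marked++;
        }
    }
    while (top > 0) {
        int cur = worklist[--top];
        for (int k = 0; k < 2; k++) {
            int def = s[cur].use[k];
            if (def >= 0 && !necessary[def]) {
                necessary[def] = true;   /* "marking necessary through ..." */
                worklist[top++] = def;
                marked++;
            }
        }
    }
    return marked;
}

/* Returns true if the scalar add sum_7 survives in the toy model.
   The flag selects what the (hypothetical) epilog statement consumes:
   the scalar PHI sum_11, or a value reduced from the vector accumulator.  */
bool toy_scalar_add_survives(bool epilog_uses_scalar_phi)
{
    struct toy_stmt s[] = {
        /* 0 */ { "_1 = a[i_12]",                  { -1, -1 }, false },
        /* 1 */ { "sum_11 = PHI <sum_7, 0>",        {  2, -1 }, false },
        /* 2 */ { "sum_7 = _1 + sum_11",            {  0,  1 }, false },
        /* 3 */ { "red = reduce (vector accum)",    { -1, -1 }, false },
        /* 4 */ { "epilog_init = <sum_11 or red>",
                  { epilog_uses_scalar_phi ? 1 : 3, -1 }, false },
        /* 5 */ { "return epilog_init",             {  4, -1 }, true  },
    };
    bool necessary[6];

    toy_mark_necessary(s, 6, necessary);
    return necessary[2];
}
```

In this model a single remaining use of sum_11 keeps the whole scalar chain
(PHI, add and load) alive, which matches the order of the dump lines above;
once that use is redirected to the reduced vector value, nothing marks the
chain and it can be swept.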

This seems to happen only with reductions; other live operations look fine:

extern int a[1024];
int f4(int *x, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++)
    {
        sum = a[i];
        if (a[i] == 42)
            break;
    }
    return sum;
}
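
For comparison, a small harness that puts the two variants side by side.
This is only a sketch of their source-level behaviour: the unused x parameter
is dropped, the array is defined locally rather than extern, and fill_ones is
a hypothetical helper that is not part of either testcase.

```c
#include <assert.h>

#define N 1024
int a[N];

/* Reduction variant from the original report: the result depends on
   every iteration executed before the break.  */
int f4_reduction(int n)
{
    assert(n <= N);
    int sum = 0;
    for (int i = 0; i < n; i++) {
        sum += a[i];
        if (a[i] == 42)
            break;
    }
    return sum;
}

/* Live-operation variant from comment #1: the result depends only on
   the last iteration executed, there is no loop-carried accumulation.  */
int f4_last_value(int n)
{
    assert(n <= N);
    int sum = 0;
    for (int i = 0; i < n; i++) {
        sum = a[i];
        if (a[i] == 42)
            break;
    }
    return sum;
}

/* Hypothetical helper: fill a[] with 1s and, if pos is a valid index,
   store 42 at a[pos].  */
void fill_ones(int pos)
{
    for (int i = 0; i < N; i++)
        a[i] = 1;
    if (pos >= 0 && pos < N)
        a[pos] = 42;
}
```

With 42 stored at index 10, f4_reduction(N) returns 10 * 1 + 42 while
f4_last_value(N) returns just 42, so only the first variant carries a value
from one iteration to the next.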


* [Bug tree-optimization/114192] scalar code left around following early break vectorization of reduction
From: rguenth at gcc dot gnu.org @ 2024-03-04  8:29 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114192

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Assignee|unassigned at gcc dot gnu.org      |rguenth at gcc dot gnu.org
             Status|NEW                         |ASSIGNED

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
The issue is that the scalar reduction value is live from the main loop to
the epilog.  We don't seem to use the vector .REDUC_PLUS value on both paths.
This is likely a failure of reduction epilog generation.  Let me have a look.


* [Bug tree-optimization/114192] scalar code left around following early break vectorization of reduction
From: rguenth at gcc dot gnu.org @ 2024-03-04  9:00 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114192

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
Created attachment 57600
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57600&action=edit
patch

Ah, so the issue is that we only replace the LHS.  In this case we pass the
reduction stmt for the early exit, but the live value is defined by the PHI
(we re-start the iteration), and that confuses the replacement process.
It looks like it might also be wrong for the peeled case on the main edge?

The attached patch fixes it for me.  I didn't check what happens for the
peeled case with a reduction.
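
To illustrate why the early exit wants the PHI value rather than the result
of the reduction stmt, here is a hand-written C model of the vectorized loop.
It is a sketch under assumptions, not what the vectorizer emits: VF = 4 is
assumed, the lanes[] array stands in for the vector accumulator, the unused
x parameter is dropped, and f4_scalar, f4_blocked and fill_ones are
hypothetical names.

```c
#include <assert.h>

#define VF 4            /* assumed vectorization factor (4 x int) */
#define N 1024
int a[N];

/* Scalar reference, as in the original report.  */
int f4_scalar(int n)
{
    assert(n <= N);
    int sum = 0;
    for (int i = 0; i < n; i++) {
        sum += a[i];
        if (a[i] == 42)
            break;
    }
    return sum;
}

/* Model of the early-break vectorized loop plus scalar epilog.  */
int f4_blocked(int n)
{
    assert(n <= N);
    int lanes[VF] = { 0, 0, 0, 0 };   /* stands in for the vector accumulator */
    int i = 0;

    /* "Vector" loop: each trip handles VF elements.  */
    for (; i + VF <= n; i += VF) {
        int hit = 0;
        for (int l = 0; l < VF; l++)
            hit |= (a[i + l] == 42);
        if (hit)
            break;   /* early exit: lanes[] still holds the value from the
                        start of this trip (the PHI), not the result of
                        this trip's add */
        for (int l = 0; l < VF; l++)
            lanes[l] += a[i + l];
    }

    /* Both exits reduce the vector accumulator, so no scalar sum has to
       be carried through the vector loop.  */
    int sum = 0;
    for (int l = 0; l < VF; l++)
        sum += lanes[l];

    /* Scalar epilog: re-runs the interrupted block from its first
       element, or handles the n % VF tail.  */
    for (; i < n; i++) {
        sum += a[i];
        if (a[i] == 42)
            break;
    }
    return sum;
}

/* Hypothetical helper: fill a[] with 1s and, if pos is a valid index,
   store 42 at a[pos].  */
void fill_ones(int pos)
{
    for (int i = 0; i < N; i++)
        a[i] = 1;
    if (pos >= 0 && pos < N)
        a[pos] = 42;
}
```

Because the epilog re-starts the interrupted block, seeding it with the
accumulator after the add would count that block twice; seeding it with the
value from the start of the trip gives the same result as f4_scalar for
inputs that do not overflow.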


* [Bug tree-optimization/114192] scalar code left around following early break vectorization of reduction
From: rguenth at gcc dot gnu.org @ 2024-03-04  9:02 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114192

--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
vect-early-break_104-pr113373.c might be such a case, which then ICEs.


* [Bug tree-optimization/114192] scalar code left around following early break vectorization of reduction
From: cvs-commit at gcc dot gnu.org @ 2024-03-04 10:45 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114192

--- Comment #5 from GCC Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:324d2907c86f05e40dc52d226940308f53a956c2

commit r14-9292-g324d2907c86f05e40dc52d226940308f53a956c2
Author: Richard Biener <rguenther@suse.de>
Date:   Mon Mar 4 09:46:13 2024 +0100

    tree-optimization/114192 - scalar reduction kept live with early break vect

    The following fixes a missing replacement of the reduction value
    used in the epilog, causing the scalar reduction to be kept live
    across the early break exit path.

            PR tree-optimization/114192
            * tree-vect-loop.cc (vect_create_epilog_for_reduction): Use the
            appropriate def for the live out stmt in case of an alternate
            exit.


* [Bug tree-optimization/114192] scalar code left around following early break vectorization of reduction
From: rguenth at gcc dot gnu.org @ 2024-03-04 10:45 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114192

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |FIXED

--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
Fixed.
