[Bug tree-optimization/49955] New: Fails to do partial basic-block SLP

public inbox for gcc-bugs@sourceware.org

* [Bug tree-optimization/49955] New: Fails to do partial basic-block SLP
@ 2011-08-03  9:08 rguenth at gcc dot gnu.org
  2011-08-03 15:17 ` [Bug tree-optimization/49955] " rguenth at gcc dot gnu.org
                   ` (6 more replies)
  0 siblings, 7 replies; 8+ messages in thread
From: rguenth at gcc dot gnu.org @ 2011-08-03  9:08 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49955

           Summary: Fails to do partial basic-block SLP
           Product: gcc
           Version: 4.7.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: enhancement
          Priority: P3
         Component: tree-optimization
        AssignedTo: unassigned@gcc.gnu.org
        ReportedBy: rguenth@gcc.gnu.org
                CC: irar@gcc.gnu.org

410.bwaves in shell_lam.f has a lot of arrays with inner dimension 5 operated
on in loops that are either unrolled by early unrolling or manually unrolled
in source.  All but one loop in shell_lam.f are not vectorized.

One reason is that basic-block vectorization gives up if it sees an
interleaving size that is not a multiple of a supported vectorization
factor.  Testcase:

double a[1024], b[1024];

void foo (int k)
{
  int j;
  a[k*5 + 0] = a[k*5 + 0] + b[k*5 + 0];
  a[k*5 + 1] = a[k*5 + 1] + b[k*5 + 1];
  a[k*5 + 2] = a[k*5 + 2] + b[k*5 + 2];
  a[k*5 + 3] = a[k*5 + 3] + b[k*5 + 3];
  a[k*5 + 4] = a[k*5 + 4] + b[k*5 + 4];
}

taken from the last loop in shell_lam.f which has its innermost loop unrolled
(and loop SLP refuses to vectorize as well, see separate bug).

For the above we get:

t.c:6: note: === vect_analyze_data_ref_accesses ===
t.c:6: note: Detected interleaving of size 5
t.c:6: note: Detected interleaving of size 5
t.c:6: note: Detected interleaving of size 5
t.c:6: note: Vectorizing an unaligned access.
t.c:6: note: Vectorizing an unaligned access.
t.c:6: note: Vectorizing an unaligned access.
t.c:6: note: === vect_analyze_slp ===
t.c:6: note: get vectype with 2 units of type double
t.c:6: note: vectype: vector(2) double
t.c:6: note: Build SLP failed: unrolling required in basic block SLP
t.c:6: note: Failed to SLP the basic block.
t.c:6: note: not vectorized: failed to find SLP opportunities in basic block.

but of course we could simply vectorize with an interleaving size of 4
leaving the excess operations unvectorized (with optimization opportunity
if we can pick a properly sized and aligned set of accesses).
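A rough C sketch of both points above, under stated assumptions: it models
the "unrolling required" rejection as lcm (nunits, group_size) / group_size
being different from 1, and it writes out by hand what a partial
vectorization of foo with group size 4 could look like.  The names
unrolling_factor, foo_partial and v2df are made up for illustration and are
not GCC internals; the vector type relies on the GCC/Clang vector extension.

```c
#include <string.h>

double a[1024], b[1024];

/* Two-lane double vector, like the "vector(2) double" in the dump.
   GCC/Clang vector extension; the typedef name is made up here.  */
typedef double v2df __attribute__ ((vector_size (16)));

/* Greatest common divisor (Euclid).  */
static unsigned
gcd_u (unsigned x, unsigned y)
{
  while (y != 0)
    {
      unsigned t = x % y;
      x = y;
      y = t;
    }
  return x;
}

/* Number of copies of the group needed before its lanes fill whole
   vectors: lcm (nunits, group_size) / group_size.  A basic block
   cannot be unrolled, so anything other than 1 means failure.
   Simplified model only, not the actual GCC check.  */
unsigned
unrolling_factor (unsigned nunits, unsigned group_size)
{
  return nunits / gcd_u (nunits, group_size);
}

/* Hand-written equivalent of foo with only the first four lanes
   vectorized and the fifth left scalar.  */
void
foo_partial (int k)
{
  double *pa = &a[k * 5];
  const double *pb = &b[k * 5];
  int i;

  for (i = 0; i < 4; i += 2)
    {
      v2df va, vb;
      /* memcpy stands in for an unaligned vector load/store.  */
      memcpy (&va, pa + i, sizeof va);
      memcpy (&vb, pb + i, sizeof vb);
      va = va + vb;
      memcpy (pa + i, &va, sizeof va);
    }

  /* The excess operation stays scalar.  */
  pa[4] = pa[4] + pb[4];
}
```

With these definitions unrolling_factor (2, 5) is 2, which corresponds to
the rejection in the dump, while unrolling_factor (2, 4) is 1.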

* [Bug tree-optimization/49955] Fails to do partial basic-block SLP
  2011-08-03  9:08 [Bug tree-optimization/49955] New: Fails to do partial basic-block SLP rguenth at gcc dot gnu.org
@ 2011-08-03 15:17 ` rguenth at gcc dot gnu.org
  2011-08-05 10:41 ` irar at il dot ibm.com
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: rguenth at gcc dot gnu.org @ 2011-08-03 15:17 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49955

Richard Guenther <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2011.08.03 15:12:42
     Ever Confirmed|0                           |1

--- Comment #1 from Richard Guenther <rguenth at gcc dot gnu.org> 2011-08-03 15:12:42 UTC ---
The loop that remains after fixing PR49957 in 410.bwaves is the following,
which loop SLP does not handle (well, I'm not exactly sure) because

t.f:18: note: ==> examining statement: t1_62 = *q_61(D)[D.1645_60];

t.f:18: note: num. args = 4 (not unary/binary/ternary op).
t.f:18: note: vect_is_simple_use: operand *q_61(D)[D.1645_60]
t.f:18: note: not ssa-name.
t.f:18: note: use not simple.
t.f:18: note: no array mode for V2DF[5]
t.f:18: note: the size of the group of strided accesses is not a power of 2
t.f:18: note: not vectorized: relevant stmt not supported: t1_62 = *q_61(D)[D.1645_60];

t.f:18: note: bad operation or unsupported loop bound.
t.f:1: note: vectorized 0 loops in function.

probably the issue that we can't handle this kind of "invariants" in the
SLP group?  Thus, the SLP group should be q(2,..), q(3,...) ... q(5, ...)
which is size 4, q(1,..) should be treated as invariant. 


      subroutine shell(nx,ny,nz,q,dt,cfl,dx,dy,dz)
      implicit none
      integer nx,ny,nz,n,i,j,k
      real*8 cfl,dx,dy,dz,dt
      real*8 gm,Re,Pr,cfll,t1,t2,t3,t4,t5,t6,t7,t8,mu

      real*8 q(5,nx,ny,nz)

C       This particular problem is periodic only


      cfll=0.1d0+(n-1.0d0)*cfl/20.0d0
      if (cfll.ge.cfl) cfll=cfl
      t8=0.0d0

      do k=1,nz
         do j=1,ny
            do i=1,nx
               t1=q(1,i,j,k)
               t2=q(2,i,j,k)/t1
               t3=q(3,i,j,k)/t1
               t4=q(4,i,j,k)/t1
               t5=(gm-1.0d0)*(q(5,i,j,k)-0.5d0*t1*(t2*t2+t3*t3+t4*t4))
               t6=dSQRT(gm*t5/t1)
               mu=gm*Pr*(gm*t5/t1)**0.75d0*2.0d0/Re/t1
               t7=((dabs(t2)+t6)/dx+mu/dx**2)**2 +
     1            ((dabs(t3)+t6)/dy+mu/dy**2)**2 +
     2            ((dabs(t4)+t6)/dz+mu/dz**2)**2
               t7=DSQRT(t7)
               t8=max(t8,t7)
            enddo
         enddo
      enddo
      dt=cfll / t8

      return
      end


* [Bug tree-optimization/49955] Fails to do partial basic-block SLP
  2011-08-03  9:08 [Bug tree-optimization/49955] New: Fails to do partial basic-block SLP rguenth at gcc dot gnu.org
  2011-08-03 15:17 ` [Bug tree-optimization/49955] " rguenth at gcc dot gnu.org
@ 2011-08-05 10:41 ` irar at il dot ibm.com
  2011-08-05 10:51 ` irar at il dot ibm.com
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: irar at il dot ibm.com @ 2011-08-05 10:41 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49955

Ira Rosen <irar at il dot ibm.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |irar at il dot ibm.com

--- Comment #2 from Ira Rosen <irar at il dot ibm.com> 2011-08-05 10:38:53 UTC ---
(In reply to comment #0)

> but of course we could simply vectorize with an interleaving size of 4
> leaving the excess operations unvectorized (with optimization opportunity
> if we can pick a properly sized and aligned set of accesses).

Right. I even had a patch for this some time ago. I can try to bring it to
life.

Ira


* [Bug tree-optimization/49955] Fails to do partial basic-block SLP
  2011-08-03  9:08 [Bug tree-optimization/49955] New: Fails to do partial basic-block SLP rguenth at gcc dot gnu.org
  2011-08-03 15:17 ` [Bug tree-optimization/49955] " rguenth at gcc dot gnu.org
  2011-08-05 10:41 ` irar at il dot ibm.com
@ 2011-08-05 10:51 ` irar at il dot ibm.com
  2023-08-04 20:05 ` pinskia at gcc dot gnu.org
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: irar at il dot ibm.com @ 2011-08-05 10:51 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49955

--- Comment #3 from Ira Rosen <irar at il dot ibm.com> 2011-08-05 10:50:27 UTC ---
(In reply to comment #1)
> The loop that remains after fixing PR49957 in 410.bwaves is the following,
> which loop SLP does not handle (well, I'm not exactly sure) because
> 
> t.f:18: note: ==> examining statement: t1_62 = *q_61(D)[D.1645_60];
> 
> t.f:18: note: num. args = 4 (not unary/binary/ternary op).
> t.f:18: note: vect_is_simple_use: operand *q_61(D)[D.1645_60]
> t.f:18: note: not ssa-name.
> t.f:18: note: use not simple.
> t.f:18: note: no array mode for V2DF[5]
> t.f:18: note: the size of the group of strided accesses is not a power of 2
> t.f:18: note: not vectorized: relevant stmt not supported: t1_62 = *q_61(D)[D.1645_60];
> 
> t.f:18: note: bad operation or unsupported loop bound.
> t.f:1: note: vectorized 0 loops in function.
> 
> probably the issue that we can't handle this kind of "invariants" in the
> SLP group?  Thus, the SLP group should be q(2,..), q(3,...) ... q(5, ...)
> which is size 4, q(1,..) should be treated as invariant. 
> 

This loop is not SLPed because there is no SLP opportunity here besides the
loads. The only isomorphism after that is 
               t2=q(2,i,j,k)/t1
               t3=q(3,i,j,k)/t1
               t4=q(4,i,j,k)/t1

and somewhat here
              t7=((dabs(t2)+t6)/dx+mu/dx**2)**2 +
     1            ((dabs(t3)+t6)/dy+mu/dy**2)**2 +
     2            ((dabs(t4)+t6)/dz+mu/dz**2)**2

but these are groups of 3.

Moreover, the current implementation starts building SLP tree from a group of
strided stores, or a group of reductions, or a reduction chain. None of these
exist here.
But, again, even if we could start from a group of loads, it wouldn't help us
much here anyway.

Ira


* [Bug tree-optimization/49955] Fails to do partial basic-block SLP
  2011-08-03  9:08 [Bug tree-optimization/49955] New: Fails to do partial basic-block SLP rguenth at gcc dot gnu.org
                   ` (2 preceding siblings ...)
  2011-08-05 10:51 ` irar at il dot ibm.com
@ 2023-08-04 20:05 ` pinskia at gcc dot gnu.org
  2023-08-07  9:10 ` rguenth at gcc dot gnu.org
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: pinskia at gcc dot gnu.org @ 2023-08-04 20:05 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49955

--- Comment #4 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
The testcase in comment #0 started to be vectorized in GCC 13 ....

* [Bug tree-optimization/49955] Fails to do partial basic-block SLP
  2011-08-03  9:08 [Bug tree-optimization/49955] New: Fails to do partial basic-block SLP rguenth at gcc dot gnu.org
                   ` (3 preceding siblings ...)
  2023-08-04 20:05 ` pinskia at gcc dot gnu.org
@ 2023-08-07  9:10 ` rguenth at gcc dot gnu.org
  2023-08-08 12:38 ` cvs-commit at gcc dot gnu.org
  2023-08-08 12:38 ` rguenth at gcc dot gnu.org
  6 siblings, 0 replies; 8+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-08-07  9:10 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49955

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED
           Assignee|unassigned at gcc dot gnu.org |rguenth at gcc dot gnu.org

--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
The loop in comment#1 isn't vectorized because we do not have interleaving
support for a group size of 5:

t.f:18:17: missed:   the size of the group of accesses is not a power of 2 or not equal to 3
t.f:18:17: missed:   not falling back to elementwise accesses
t.f:19:72: missed:   not vectorized: relevant stmt not supported: t1_83 = (*q_82(D))[_21];
t.f:18:17: missed:  bad operation or unsupported loop bound.
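The group-size restriction reported in this dump can be modelled as a small
predicate.  This is only a sketch derived from the wording of the message;
the function names are hypothetical, and the real check in GCC presumably
depends on what the target supports as well.

```c
#include <stdbool.h>

/* True if COUNT is a nonzero power of two.  */
static bool
is_pow2 (unsigned count)
{
  return count != 0 && (count & (count - 1)) == 0;
}

/* Hypothetical predicate mirroring the dump message: a group of
   interleaved accesses is handled only if its size is 3 or a power
   of two.  */
bool
interleave_group_size_ok (unsigned count)
{
  return count == 3 || is_pow2 (count);
}
```

For the q(5,nx,ny,nz) accesses the group size is 5, so this predicate
returns false.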

we don't try to SLP this because there's just a single lane reduction.  There's
not really a loop vectorization opportunity and as comment#3 says there's at
most a BB reduction opportunity.  We try to analyze that now:

  _58 = powmult_9 + powmult_107;
  t7_108 = _58 + powmult_88;
  t7_109 = __builtin_sqrt (t7_108);
  M.7_110 = MAX_EXPR <t7_109, t8_126>;

and

t.f:28:72: note:   Starting SLP discovery for
t.f:28:72: note:     powmult_88 = _106 * _106;
t.f:28:72: note:     powmult_9 = _101 * _101;
t.f:28:72: note:     powmult_107 = _96 * _96;
t.f:28:72: note:   starting SLP discovery for node 0x50ef8a0
t.f:28:72: note:   Build SLP for powmult_88 = _106 * _106;
t.f:28:72: note:   get vectype for scalar type (group size 3): real(kind=8)
t.f:28:72: note:   vectype: vector(2) real(kind=8)
t.f:28:72: note:   nunits = 2
t.f:28:72: missed:   Build SLP failed: unrolling required in basic block SLP

we do not yet have code to limit a BB reduction vectorization to a subset
of lanes (in this case it's uniform so choosing any power-of-two elements
would work but ideally we'd let SLP discovery figure out the "best"
lane combination to vectorize - there's more missing support for BB
reduction vectorization).

* [Bug tree-optimization/49955] Fails to do partial basic-block SLP
  2011-08-03  9:08 [Bug tree-optimization/49955] New: Fails to do partial basic-block SLP rguenth at gcc dot gnu.org
                   ` (4 preceding siblings ...)
  2023-08-07  9:10 ` rguenth at gcc dot gnu.org
@ 2023-08-08 12:38 ` cvs-commit at gcc dot gnu.org
  2023-08-08 12:38 ` rguenth at gcc dot gnu.org
  6 siblings, 0 replies; 8+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2023-08-08 12:38 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49955

--- Comment #6 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:d9f3ea61fe36e2de3354b90b65ff8245099114c9

commit r14-3078-gd9f3ea61fe36e2de3354b90b65ff8245099114c9
Author: Richard Biener <rguenther@suse.de>
Date:   Mon Aug 7 14:44:20 2023 +0200

    tree-optimization/49955 - BB reduction with odd number of lanes

    The following enhances BB reduction vectorization to support
    vectorizing only a subset of the lanes, keeping the rest as
    scalar ops.  For now we try to make the number of lanes even
    by leaving alone the "last" lane.  That's because SLP discovery
    with all lanes will fail too soon to get us any hint on which
    lane to strip and likewise we don't know what vector modes the
    target supports so restricting ourselves to power-of-two or
    other cases isn't easy.

    This is enough to get at the vectorization opportunity for the
    testcase in the PR - albeit with the chosen lanes not optimal
    but at least vectorizable.

            PR tree-optimization/49955
            * tree-vectorizer.h (_slp_instance::remain_stmts): New.
            (SLP_INSTANCE_REMAIN_STMTS): Likewise.
            * tree-vect-slp.cc (vect_free_slp_instance): Release
            SLP_INSTANCE_REMAIN_STMTS.
            (vect_build_slp_instance): Make the number of lanes of
            a BB reduction even.
            (vectorize_slp_instance_root_stmt): Handle unvectorized
            defs of a BB reduction.

            * gfortran.dg/vect/pr49955.f: New testcase.
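As an illustration of the idea in the commit message, here is a
hand-written C sketch of a three-lane sum-of-squares reduction where two
lanes are computed as one vector operation and the remaining lane stays
scalar.  The function and type names are invented for this example, the
vector type relies on the GCC/Clang vector extension, and which source
operand ends up as the "last" lane is not stated in the thread.  The partial
form changes the order of the floating-point additions only trivially here,
but in general reassociating them is not exact.

```c
/* Two-lane double vector; GCC/Clang vector extension.  */
typedef double v2df __attribute__ ((vector_size (16)));

/* Scalar form of the three-lane BB reduction.  */
double
sum_sq3_scalar (double x0, double x1, double x2)
{
  return x0 * x0 + x1 * x1 + x2 * x2;
}

/* Same reduction with the first two lanes done as one vector
   multiply and the last lane kept as a scalar op.  */
double
sum_sq3_partial (double x0, double x1, double x2)
{
  v2df v = { x0, x1 };
  v2df sq = v * v;          /* vectorized lanes */
  double remain = x2 * x2;  /* unvectorized remaining lane */

  /* Reduce the vector lanes, then add the scalar remainder.  */
  return sq[0] + sq[1] + remain;
}
```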

* [Bug tree-optimization/49955] Fails to do partial basic-block SLP
  2011-08-03  9:08 [Bug tree-optimization/49955] New: Fails to do partial basic-block SLP rguenth at gcc dot gnu.org
                   ` (5 preceding siblings ...)
  2023-08-08 12:38 ` cvs-commit at gcc dot gnu.org
@ 2023-08-08 12:38 ` rguenth at gcc dot gnu.org
  6 siblings, 0 replies; 8+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-08-08 12:38 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49955

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
             Status|ASSIGNED                    |RESOLVED
   Target Milestone|---                         |14.0

--- Comment #7 from Richard Biener <rguenth at gcc dot gnu.org> ---
This is fixed now.
