public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug tree-optimization/101842] New: Vectorizer doesn't vectorize when loop bound depends on two independent variables that are unknown
@ 2021-08-10  9:41 tnfchris at gcc dot gnu.org
  2021-08-10 10:06 ` [Bug tree-optimization/101842] " rguenth at gcc dot gnu.org
                   ` (4 more replies)
  0 siblings, 5 replies; 6+ messages in thread
From: tnfchris at gcc dot gnu.org @ 2021-08-10  9:41 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101842

            Bug ID: 101842
           Summary: Vectorizer doesn't vectorize when loop bound depends
                    on two independent variables that are unknown
           Product: gcc
           Version: 12.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: tnfchris at gcc dot gnu.org
  Target Milestone: ---

The following example

float f(float *p, float d, int len, float lim)
{
  float m[4];
  for (int i = 0; i < len && d >= lim; i += 4)
  {
    m[0] = p[0] * p[0];
    m[1] = p[1] * p[1];
    m[2] = p[2] * p[2];
    m[3] = p[3] * p[3];
    d = d - m[0];
    d = d - m[1];
    d = d - m[2];
    d = d - m[3];
    p += 4;
  }

  return d;
}

isn't vectorized at -Ofast because

```
missed: not vectorized: number of iterations cannot be computed.
```

which seems odd because I would expect that it would be treated as just any
other loop with unbounded iterations.  Commenting out this check results in it
bailing out because of it not knowing how to deal with the reduction.

This loop should be easy to vectorize with vectorizing the multiplications of m
and then reducing the changes of `d - sum (m[0..3])`.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [Bug tree-optimization/101842] Vectorizer doesn't vectorize when loop bound depends on two independent variables that are unknown
  2021-08-10  9:41 [Bug tree-optimization/101842] New: Vectorizer doesn't vectorize when loop bound depends on two independent variables that are unknown tnfchris at gcc dot gnu.org
@ 2021-08-10 10:06 ` rguenth at gcc dot gnu.org
  2021-08-10 10:22 ` rguenth at gcc dot gnu.org
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-08-10 10:06 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101842

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |ASSIGNED
             Blocks|                            |53947
   Last reconfirmed|                            |2021-08-10
     Ever confirmed|0                           |1
           Assignee|unassigned at gcc dot gnu.org      |rguenth at gcc dot gnu.org

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
The issue is that there's no symbolic expression to compute the number of
iterations since the number of iterations depends on data computed inside the
loop.  We require a symbolic number of iterations in various places since
we're using a canonical IV for loop control.  There's also the dynamic cost
check which depends on the number of vector iterations - I suppose for this
kind of loop we'd have to statically assert the vectorization is always
profitable.

But confirmed, we can't vectorize this loop.  But we should vectorize the
basic-block eventually.  We currently don't because the reduction handling
has the mixed +- case not implemented yet and we see

  _41 = powmult_3 + powmult_5;
  _42 = powmult_7 + _41;
  _43 = powmult_9 + _42;
  d_25 = d_35 - _43;

we detect this as reduction of 5 lanes and fail to see the opportunity to
reduce the 4 lanes with PLUS and then do the final minus with the remaining
(unvectorized) scalar.

diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
index f9ca24415a2..33b21c8c247 100644
--- a/gcc/tree-vect-slp.c
+++ b/gcc/tree-vect-slp.c
@@ -5666,10 +5666,12 @@ vect_slp_check_for_constructors (bb_vec_info bb_vinfo)
                {
                  if (chain[i].dt != vect_internal_def)
                    invalid_cst = true;
-                 else if (chain[i].code != code)
-                   invalid_op = true;
                  else
-                   valid_lanes++;
+                   {
+                     valid_lanes++;
+                     if (chain[i].code != code)
+                       invalid_op = true;
+                   }
                }
              if (!invalid_op && !invalid_cst)
                {

then properly prints:

t.c:4:27: optimized:  BB reduction missed with 5 lanes

The one different op lane could be handled similar as to the yet unsupported
constant - we need to record this operand and apply the it to the reduction
int the epilogue.

Let me try sth.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [Bug tree-optimization/101842] Vectorizer doesn't vectorize when loop bound depends on two independent variables that are unknown
  2021-08-10  9:41 [Bug tree-optimization/101842] New: Vectorizer doesn't vectorize when loop bound depends on two independent variables that are unknown tnfchris at gcc dot gnu.org
  2021-08-10 10:06 ` [Bug tree-optimization/101842] " rguenth at gcc dot gnu.org
@ 2021-08-10 10:22 ` rguenth at gcc dot gnu.org
  2021-08-10 10:44 ` tnfchris at gcc dot gnu.org
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-08-10 10:22 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101842

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
OK, so with a hack like the following we vectorize the BB as

  vect__1.10_62 = MEM <vector(4) float> [(float *)p_34];
  vect_powmult_9.11_61 = vect__1.10_62 * vect__1.10_62;
  _60 = .REDUC_PLUS (vect_powmult_9.11_61);
  d_25 = d_35 - _60;
  p_26 = p_34 + 16;
  i_27 = i_37 + 4;
  _10 = len_20(D) > i_27;
  _11 = lim_21(D) <= d_25;
  _12 = _10 & _11;
  if (_12 != 0)

and on x86_64 we get

.L3:
        movups  (%rdi), %xmm2
        addl    $4, %eax
        addq    $16, %rdi
        mulps   %xmm2, %xmm2
        movaps  %xmm2, %xmm3
        movhlps %xmm2, %xmm3
        addps   %xmm2, %xmm3
        movaps  %xmm3, %xmm2
        shufps  $85, %xmm3, %xmm2
        addps   %xmm3, %xmm2
        subss   %xmm2, %xmm0
        cmpl    %eax, %esi
        jle     .L2
        comiss  %xmm1, %xmm0
        jnb     .L3
.L2:
        ret

or with AVX

.L3:
        vmovups (%rdi), %xmm4
        addl    $4, %eax
        addq    $16, %rdi
        vmulps  %xmm4, %xmm4, %xmm2
        vmovhlps        %xmm2, %xmm2, %xmm3
        vaddps  %xmm2, %xmm3, %xmm3
        vshufps $85, %xmm3, %xmm3, %xmm2
        vaddps  %xmm3, %xmm2, %xmm2
        vsubss  %xmm2, %xmm0, %xmm0
        cmpl    %eax, %esi
        jle     .L2
        vcomiss %xmm1, %xmm0
        jnb     .L3
.L2:
        ret


diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
index f9ca24415a2..0e14c164635 100644
--- a/gcc/tree-vect-slp.c
+++ b/gcc/tree-vect-slp.c
@@ -5637,6 +5637,11 @@ vect_slp_check_for_constructors (bb_vec_info bb_vinfo)
                           || (gimple_assign_rhs_code (use_stmt)
                               != (code == PLUS_EXPR ? MINUS_EXPR :
PLUS_EXPR))))))
        {
+         gassign *next_stmt = assign;
+         while (next_stmt)
+           {
+             assign = next_stmt;
+             next_stmt = NULL;
          /* We start the match at the end of a possible association
             chain.  */
          auto_vec<chain_op_t> chain;
@@ -5666,10 +5671,12 @@ vect_slp_check_for_constructors (bb_vec_info bb_vinfo)
                {
                  if (chain[i].dt != vect_internal_def)
                    invalid_cst = true;
-                 else if (chain[i].code != code)
-                   invalid_op = true;
                  else
-                   valid_lanes++;
+                   {
+                     valid_lanes++;
+                     if (chain[i].code != code)
+                       invalid_op = true;
+                   }
                }
              if (!invalid_op && !invalid_cst)
                {
@@ -5707,8 +5714,13 @@ vect_slp_check_for_constructors (bb_vec_info bb_vinfo)
                    statistics_counter_event (cfun, "BB reduction missed
(cst)", 1);
                  statistics_histogram_event (cfun, "BB reduction missed
lanes",
                                              valid_lanes);
+
+                 /* Try again.  */
+                 if (valid_lanes > 2)
+                   next_stmt = as_a <gassign *> (chain_stmts[1]);
                }
            }
+           }
        }
     }
 }


the hack simply re-starts reduction discovery at the "previous" stmt
(this breaks down after skipping the first stmt eventually).  As said,
it's a hack.  But is that the kind of vectorization you expect?

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [Bug tree-optimization/101842] Vectorizer doesn't vectorize when loop bound depends on two independent variables that are unknown
  2021-08-10  9:41 [Bug tree-optimization/101842] New: Vectorizer doesn't vectorize when loop bound depends on two independent variables that are unknown tnfchris at gcc dot gnu.org
  2021-08-10 10:06 ` [Bug tree-optimization/101842] " rguenth at gcc dot gnu.org
  2021-08-10 10:22 ` rguenth at gcc dot gnu.org
@ 2021-08-10 10:44 ` tnfchris at gcc dot gnu.org
  2021-08-10 10:54 ` rguenth at gcc dot gnu.org
  2021-08-10 11:16 ` tnfchris at gcc dot gnu.org
  4 siblings, 0 replies; 6+ messages in thread
From: tnfchris at gcc dot gnu.org @ 2021-08-10 10:44 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101842

--- Comment #3 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #2)
> OK, so with a hack like the following we vectorize the BB as
> 
>   vect__1.10_62 = MEM <vector(4) float> [(float *)p_34];
>   vect_powmult_9.11_61 = vect__1.10_62 * vect__1.10_62;
>   _60 = .REDUC_PLUS (vect_powmult_9.11_61);
>   d_25 = d_35 - _60;
>   p_26 = p_34 + 16;
>   i_27 = i_37 + 4;
>   _10 = len_20(D) > i_27;
>   _11 = lim_21(D) <= d_25;
>   _12 = _10 & _11;
>   if (_12 != 0)
> 

Ah awesome!

> 
> the hack simply re-starts reduction discovery at the "previous" stmt
> (this breaks down after skipping the first stmt eventually).  As said,
> it's a hack.  But is that the kind of vectorization you expect?

Yeah that looks perfect, the patch seems to be based on a different code than
upstream so couldn't apply it to test the full loop, but this looks perfect!
(We already vectorize a similar loop without the `&& d >= lim` condition).

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [Bug tree-optimization/101842] Vectorizer doesn't vectorize when loop bound depends on two independent variables that are unknown
  2021-08-10  9:41 [Bug tree-optimization/101842] New: Vectorizer doesn't vectorize when loop bound depends on two independent variables that are unknown tnfchris at gcc dot gnu.org
                   ` (2 preceding siblings ...)
  2021-08-10 10:44 ` tnfchris at gcc dot gnu.org
@ 2021-08-10 10:54 ` rguenth at gcc dot gnu.org
  2021-08-10 11:16 ` tnfchris at gcc dot gnu.org
  4 siblings, 0 replies; 6+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-08-10 10:54 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101842

--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Tamar Christina from comment #3)
> (In reply to Richard Biener from comment #2)
> > OK, so with a hack like the following we vectorize the BB as
> > 
> >   vect__1.10_62 = MEM <vector(4) float> [(float *)p_34];
> >   vect_powmult_9.11_61 = vect__1.10_62 * vect__1.10_62;
> >   _60 = .REDUC_PLUS (vect_powmult_9.11_61);
> >   d_25 = d_35 - _60;
> >   p_26 = p_34 + 16;
> >   i_27 = i_37 + 4;
> >   _10 = len_20(D) > i_27;
> >   _11 = lim_21(D) <= d_25;
> >   _12 = _10 & _11;
> >   if (_12 != 0)
> > 
> 
> Ah awesome!
> 
> > 
> > the hack simply re-starts reduction discovery at the "previous" stmt
> > (this breaks down after skipping the first stmt eventually).  As said,
> > it's a hack.  But is that the kind of vectorization you expect?
> 
> Yeah that looks perfect, the patch seems to be based on a different code
> than upstream so couldn't apply it to test the full loop, but this looks
> perfect! (We already vectorize a similar loop without the `&& d >= lim`
> condition).

It's applied to my working tree so that's possible.  Note it doesn't
vectorize the loop but the loop body in basic-block vectorization.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [Bug tree-optimization/101842] Vectorizer doesn't vectorize when loop bound depends on two independent variables that are unknown
  2021-08-10  9:41 [Bug tree-optimization/101842] New: Vectorizer doesn't vectorize when loop bound depends on two independent variables that are unknown tnfchris at gcc dot gnu.org
                   ` (3 preceding siblings ...)
  2021-08-10 10:54 ` rguenth at gcc dot gnu.org
@ 2021-08-10 11:16 ` tnfchris at gcc dot gnu.org
  4 siblings, 0 replies; 6+ messages in thread
From: tnfchris at gcc dot gnu.org @ 2021-08-10 11:16 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101842

--- Comment #5 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #4)
> (In reply to Tamar Christina from comment #3)
> > (In reply to Richard Biener from comment #2)
> > > OK, so with a hack like the following we vectorize the BB as
> 
> It's applied to my working tree so that's possible.  Note it doesn't
> vectorize the loop but the loop body in basic-block vectorization.

I think that's fine, the actual loop also doesn't use any cross iteration
information so vectorizing the BB is good enough.

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2021-08-10 11:16 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-10  9:41 [Bug tree-optimization/101842] New: Vectorizer doesn't vectorize when loop bound depends on two independent variables that are unknown tnfchris at gcc dot gnu.org
2021-08-10 10:06 ` [Bug tree-optimization/101842] " rguenth at gcc dot gnu.org
2021-08-10 10:22 ` rguenth at gcc dot gnu.org
2021-08-10 10:44 ` tnfchris at gcc dot gnu.org
2021-08-10 10:54 ` rguenth at gcc dot gnu.org
2021-08-10 11:16 ` tnfchris at gcc dot gnu.org

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).