public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug tree-optimization/101842] New: Vectorizer doesn't vectorize when loop bound depends on two independent variables that are unknown
@ 2021-08-10 9:41 tnfchris at gcc dot gnu.org
2021-08-10 10:06 ` [Bug tree-optimization/101842] " rguenth at gcc dot gnu.org
` (4 more replies)
0 siblings, 5 replies; 6+ messages in thread
From: tnfchris at gcc dot gnu.org @ 2021-08-10 9:41 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101842
Bug ID: 101842
Summary: Vectorizer doesn't vectorize when loop bound depends
on two independent variables that are unknown
Product: gcc
Version: 12.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: tnfchris at gcc dot gnu.org
Target Milestone: ---
The following example
float f(float *p, float d, int len, float lim)
{
float m[4];
for (int i = 0; i < len && d >= lim; i += 4)
{
m[0] = p[0] * p[0];
m[1] = p[1] * p[1];
m[2] = p[2] * p[2];
m[3] = p[3] * p[3];
d = d - m[0];
d = d - m[1];
d = d - m[2];
d = d - m[3];
p += 4;
}
return d;
}
isn't vectorized at -Ofast because
```
missed: not vectorized: number of iterations cannot be computed.
```
which seems odd because I would expect that it would be treated as just any
other loop with unbounded iterations. Commenting out this check results in it
bailing out because of it not knowing how to deal with the reduction.
This loop should be easy to vectorize with vectorizing the multiplications of m
and then reducing the changes of `d - sum (m[0..3])`.
^ permalink raw reply [flat|nested] 6+ messages in thread
* [Bug tree-optimization/101842] Vectorizer doesn't vectorize when loop bound depends on two independent variables that are unknown
2021-08-10 9:41 [Bug tree-optimization/101842] New: Vectorizer doesn't vectorize when loop bound depends on two independent variables that are unknown tnfchris at gcc dot gnu.org
@ 2021-08-10 10:06 ` rguenth at gcc dot gnu.org
2021-08-10 10:22 ` rguenth at gcc dot gnu.org
` (3 subsequent siblings)
4 siblings, 0 replies; 6+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-08-10 10:06 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101842
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|UNCONFIRMED |ASSIGNED
Blocks| |53947
Last reconfirmed| |2021-08-10
Ever confirmed|0 |1
Assignee|unassigned at gcc dot gnu.org |rguenth at gcc dot gnu.org
--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
The issue is that there's no symbolic expression to compute the number of
iterations since the number of iterations depends on data computed inside the
loop. We require a symbolic number of iterations in various places since
we're using a canonical IV for loop control. There's also the dynamic cost
check which depends on the number of vector iterations - I suppose for this
kind of loop we'd have to statically assert the vectorization is always
profitable.
But confirmed, we can't vectorize this loop. But we should vectorize the
basic-block eventually. We currently don't because the reduction handling
has the mixed +- case not implemented yet and we see
_41 = powmult_3 + powmult_5;
_42 = powmult_7 + _41;
_43 = powmult_9 + _42;
d_25 = d_35 - _43;
we detect this as reduction of 5 lanes and fail to see the opportunity to
reduce the 4 lanes with PLUS and then do the final minus with the remaining
(unvectorized) scalar.
diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
index f9ca24415a2..33b21c8c247 100644
--- a/gcc/tree-vect-slp.c
+++ b/gcc/tree-vect-slp.c
@@ -5666,10 +5666,12 @@ vect_slp_check_for_constructors (bb_vec_info bb_vinfo)
{
if (chain[i].dt != vect_internal_def)
invalid_cst = true;
- else if (chain[i].code != code)
- invalid_op = true;
else
- valid_lanes++;
+ {
+ valid_lanes++;
+ if (chain[i].code != code)
+ invalid_op = true;
+ }
}
if (!invalid_op && !invalid_cst)
{
then properly prints:
t.c:4:27: optimized: BB reduction missed with 5 lanes
The one different op lane could be handled similar as to the yet unsupported
constant - we need to record this operand and apply the it to the reduction
int the epilogue.
Let me try sth.
Referenced Bugs:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations
^ permalink raw reply [flat|nested] 6+ messages in thread
* [Bug tree-optimization/101842] Vectorizer doesn't vectorize when loop bound depends on two independent variables that are unknown
2021-08-10 9:41 [Bug tree-optimization/101842] New: Vectorizer doesn't vectorize when loop bound depends on two independent variables that are unknown tnfchris at gcc dot gnu.org
2021-08-10 10:06 ` [Bug tree-optimization/101842] " rguenth at gcc dot gnu.org
@ 2021-08-10 10:22 ` rguenth at gcc dot gnu.org
2021-08-10 10:44 ` tnfchris at gcc dot gnu.org
` (2 subsequent siblings)
4 siblings, 0 replies; 6+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-08-10 10:22 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101842
--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
OK, so with a hack like the following we vectorize the BB as
vect__1.10_62 = MEM <vector(4) float> [(float *)p_34];
vect_powmult_9.11_61 = vect__1.10_62 * vect__1.10_62;
_60 = .REDUC_PLUS (vect_powmult_9.11_61);
d_25 = d_35 - _60;
p_26 = p_34 + 16;
i_27 = i_37 + 4;
_10 = len_20(D) > i_27;
_11 = lim_21(D) <= d_25;
_12 = _10 & _11;
if (_12 != 0)
and on x86_64 we get
.L3:
movups (%rdi), %xmm2
addl $4, %eax
addq $16, %rdi
mulps %xmm2, %xmm2
movaps %xmm2, %xmm3
movhlps %xmm2, %xmm3
addps %xmm2, %xmm3
movaps %xmm3, %xmm2
shufps $85, %xmm3, %xmm2
addps %xmm3, %xmm2
subss %xmm2, %xmm0
cmpl %eax, %esi
jle .L2
comiss %xmm1, %xmm0
jnb .L3
.L2:
ret
or with AVX
.L3:
vmovups (%rdi), %xmm4
addl $4, %eax
addq $16, %rdi
vmulps %xmm4, %xmm4, %xmm2
vmovhlps %xmm2, %xmm2, %xmm3
vaddps %xmm2, %xmm3, %xmm3
vshufps $85, %xmm3, %xmm3, %xmm2
vaddps %xmm3, %xmm2, %xmm2
vsubss %xmm2, %xmm0, %xmm0
cmpl %eax, %esi
jle .L2
vcomiss %xmm1, %xmm0
jnb .L3
.L2:
ret
diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
index f9ca24415a2..0e14c164635 100644
--- a/gcc/tree-vect-slp.c
+++ b/gcc/tree-vect-slp.c
@@ -5637,6 +5637,11 @@ vect_slp_check_for_constructors (bb_vec_info bb_vinfo)
|| (gimple_assign_rhs_code (use_stmt)
!= (code == PLUS_EXPR ? MINUS_EXPR :
PLUS_EXPR))))))
{
+ gassign *next_stmt = assign;
+ while (next_stmt)
+ {
+ assign = next_stmt;
+ next_stmt = NULL;
/* We start the match at the end of a possible association
chain. */
auto_vec<chain_op_t> chain;
@@ -5666,10 +5671,12 @@ vect_slp_check_for_constructors (bb_vec_info bb_vinfo)
{
if (chain[i].dt != vect_internal_def)
invalid_cst = true;
- else if (chain[i].code != code)
- invalid_op = true;
else
- valid_lanes++;
+ {
+ valid_lanes++;
+ if (chain[i].code != code)
+ invalid_op = true;
+ }
}
if (!invalid_op && !invalid_cst)
{
@@ -5707,8 +5714,13 @@ vect_slp_check_for_constructors (bb_vec_info bb_vinfo)
statistics_counter_event (cfun, "BB reduction missed
(cst)", 1);
statistics_histogram_event (cfun, "BB reduction missed
lanes",
valid_lanes);
+
+ /* Try again. */
+ if (valid_lanes > 2)
+ next_stmt = as_a <gassign *> (chain_stmts[1]);
}
}
+ }
}
}
}
the hack simply re-starts reduction discovery at the "previous" stmt
(this breaks down after skipping the first stmt eventually). As said,
it's a hack. But is that the kind of vectorization you expect?
^ permalink raw reply [flat|nested] 6+ messages in thread
* [Bug tree-optimization/101842] Vectorizer doesn't vectorize when loop bound depends on two independent variables that are unknown
2021-08-10 9:41 [Bug tree-optimization/101842] New: Vectorizer doesn't vectorize when loop bound depends on two independent variables that are unknown tnfchris at gcc dot gnu.org
2021-08-10 10:06 ` [Bug tree-optimization/101842] " rguenth at gcc dot gnu.org
2021-08-10 10:22 ` rguenth at gcc dot gnu.org
@ 2021-08-10 10:44 ` tnfchris at gcc dot gnu.org
2021-08-10 10:54 ` rguenth at gcc dot gnu.org
2021-08-10 11:16 ` tnfchris at gcc dot gnu.org
4 siblings, 0 replies; 6+ messages in thread
From: tnfchris at gcc dot gnu.org @ 2021-08-10 10:44 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101842
--- Comment #3 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #2)
> OK, so with a hack like the following we vectorize the BB as
>
> vect__1.10_62 = MEM <vector(4) float> [(float *)p_34];
> vect_powmult_9.11_61 = vect__1.10_62 * vect__1.10_62;
> _60 = .REDUC_PLUS (vect_powmult_9.11_61);
> d_25 = d_35 - _60;
> p_26 = p_34 + 16;
> i_27 = i_37 + 4;
> _10 = len_20(D) > i_27;
> _11 = lim_21(D) <= d_25;
> _12 = _10 & _11;
> if (_12 != 0)
>
Ah awesome!
>
> the hack simply re-starts reduction discovery at the "previous" stmt
> (this breaks down after skipping the first stmt eventually). As said,
> it's a hack. But is that the kind of vectorization you expect?
Yeah that looks perfect, the patch seems to be based on a different code than
upstream so couldn't apply it to test the full loop, but this looks perfect!
(We already vectorize a similar loop without the `&& d >= lim` condition).
^ permalink raw reply [flat|nested] 6+ messages in thread
* [Bug tree-optimization/101842] Vectorizer doesn't vectorize when loop bound depends on two independent variables that are unknown
2021-08-10 9:41 [Bug tree-optimization/101842] New: Vectorizer doesn't vectorize when loop bound depends on two independent variables that are unknown tnfchris at gcc dot gnu.org
` (2 preceding siblings ...)
2021-08-10 10:44 ` tnfchris at gcc dot gnu.org
@ 2021-08-10 10:54 ` rguenth at gcc dot gnu.org
2021-08-10 11:16 ` tnfchris at gcc dot gnu.org
4 siblings, 0 replies; 6+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-08-10 10:54 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101842
--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Tamar Christina from comment #3)
> (In reply to Richard Biener from comment #2)
> > OK, so with a hack like the following we vectorize the BB as
> >
> > vect__1.10_62 = MEM <vector(4) float> [(float *)p_34];
> > vect_powmult_9.11_61 = vect__1.10_62 * vect__1.10_62;
> > _60 = .REDUC_PLUS (vect_powmult_9.11_61);
> > d_25 = d_35 - _60;
> > p_26 = p_34 + 16;
> > i_27 = i_37 + 4;
> > _10 = len_20(D) > i_27;
> > _11 = lim_21(D) <= d_25;
> > _12 = _10 & _11;
> > if (_12 != 0)
> >
>
> Ah awesome!
>
> >
> > the hack simply re-starts reduction discovery at the "previous" stmt
> > (this breaks down after skipping the first stmt eventually). As said,
> > it's a hack. But is that the kind of vectorization you expect?
>
> Yeah that looks perfect, the patch seems to be based on a different code
> than upstream so couldn't apply it to test the full loop, but this looks
> perfect! (We already vectorize a similar loop without the `&& d >= lim`
> condition).
It's applied to my working tree so that's possible. Note it doesn't
vectorize the loop but the loop body in basic-block vectorization.
^ permalink raw reply [flat|nested] 6+ messages in thread
* [Bug tree-optimization/101842] Vectorizer doesn't vectorize when loop bound depends on two independent variables that are unknown
2021-08-10 9:41 [Bug tree-optimization/101842] New: Vectorizer doesn't vectorize when loop bound depends on two independent variables that are unknown tnfchris at gcc dot gnu.org
` (3 preceding siblings ...)
2021-08-10 10:54 ` rguenth at gcc dot gnu.org
@ 2021-08-10 11:16 ` tnfchris at gcc dot gnu.org
4 siblings, 0 replies; 6+ messages in thread
From: tnfchris at gcc dot gnu.org @ 2021-08-10 11:16 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101842
--- Comment #5 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #4)
> (In reply to Tamar Christina from comment #3)
> > (In reply to Richard Biener from comment #2)
> > > OK, so with a hack like the following we vectorize the BB as
>
> It's applied to my working tree so that's possible. Note it doesn't
> vectorize the loop but the loop body in basic-block vectorization.
I think that's fine, the actual loop also doesn't use any cross iteration
information so vectorizing the BB is good enough.
^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2021-08-10 11:16 UTC | newest]
Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-10 9:41 [Bug tree-optimization/101842] New: Vectorizer doesn't vectorize when loop bound depends on two independent variables that are unknown tnfchris at gcc dot gnu.org
2021-08-10 10:06 ` [Bug tree-optimization/101842] " rguenth at gcc dot gnu.org
2021-08-10 10:22 ` rguenth at gcc dot gnu.org
2021-08-10 10:44 ` tnfchris at gcc dot gnu.org
2021-08-10 10:54 ` rguenth at gcc dot gnu.org
2021-08-10 11:16 ` tnfchris at gcc dot gnu.org
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).