public inbox for gcc-bugs@sourceware.org
* [Bug tree-optimization/101842] New: Vectorizer doesn't vectorize when loop bound depends on two independent variables that are unknown
From: tnfchris at gcc dot gnu.org @ 2021-08-10  9:41 UTC
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101842

            Bug ID: 101842
           Summary: Vectorizer doesn't vectorize when loop bound depends on
                    two independent variables that are unknown
           Product: gcc
           Version: 12.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: tnfchris at gcc dot gnu.org
  Target Milestone: ---

The following example

float f(float *p, float d, int len, float lim) {
  float m[4];

  for (int i = 0; i < len && d >= lim; i += 4) {
    m[0] = p[0] * p[0];
    m[1] = p[1] * p[1];
    m[2] = p[2] * p[2];
    m[3] = p[3] * p[3];
    d = d - m[0];
    d = d - m[1];
    d = d - m[2];
    d = d - m[3];
    p += 4;
  }
  return d;
}

isn't vectorized at -Ofast because

```
missed:   not vectorized: number of iterations cannot be computed.
```

which seems odd because I would expect it to be treated just like any other
loop with unbounded iterations.

Commenting out this check results in it bailing out because it doesn't know
how to deal with the reduction.

This loop should be easy to vectorize by vectorizing the multiplications into
m and then reducing the changes as `d - sum (m[0..3])`.
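
[Editorial illustration, not part of the original report.] The transformation the report asks for can be sketched in plain C. `f_scalar` restates the reported loop; `f_grouped` is a hypothetical hand-written equivalent (the name is an assumption) that computes the four squares as independent lanes, sums them, and applies a single subtraction to `d`. Regrouping the subtractions this way can change floating-point rounding in general, so a compiler may only do it when reassociation is permitted, as it is under -Ofast.

```c
/* The loop from the report, unchanged apart from the name and const.  */
float f_scalar(const float *p, float d, int len, float lim)
{
    float m[4];

    for (int i = 0; i < len && d >= lim; i += 4) {
        m[0] = p[0] * p[0];
        m[1] = p[1] * p[1];
        m[2] = p[2] * p[2];
        m[3] = p[3] * p[3];
        d = d - m[0];
        d = d - m[1];
        d = d - m[2];
        d = d - m[3];
        p += 4;
    }
    return d;
}

/* Hypothetical regrouped form: d - sum (m[0..3]).  The four products are
   independent of each other, so they map onto one 4-lane multiply; only
   the sum and the final subtraction touch d.  */
float f_grouped(const float *p, float d, int len, float lim)
{
    for (int i = 0; i < len && d >= lim; i += 4) {
        float m[4];

        for (int k = 0; k < 4; k++)
            m[k] = p[k] * p[k];

        float sum = (m[0] + m[1]) + (m[2] + m[3]);
        d = d - sum;
        p += 4;
    }
    return d;
}
```

Both functions still evaluate `d >= lim` once per group of four elements, so the early exit of the original loop is preserved.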

* [Bug tree-optimization/101842] Vectorizer doesn't vectorize when loop bound depends on two independent variables that are unknown
From: rguenth at gcc dot gnu.org @ 2021-08-10 10:06 UTC
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101842

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                       |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                   |ASSIGNED
             Blocks|                              |53947
   Last reconfirmed|                              |2021-08-10
     Ever confirmed|0                             |1
           Assignee|unassigned at gcc dot gnu.org |rguenth at gcc dot gnu.org

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
The issue is that there's no symbolic expression to compute the number of
iterations since the number of iterations depends on data computed inside the
loop.  We require a symbolic number of iterations in various places since
we're using a canonical IV for loop control.  There's also the dynamic cost
check which depends on the number of vector iterations - I suppose for this
kind of loop we'd have to statically assert the vectorization is always
profitable.

But confirmed, we can't vectorize this loop.

But we should vectorize the basic-block eventually.  We currently don't
because the reduction handling has the mixed +- case not implemented yet and
we see

  _41 = powmult_3 + powmult_5;
  _42 = powmult_7 + _41;
  _43 = powmult_9 + _42;
  d_25 = d_35 - _43;

we detect this as a reduction of 5 lanes and fail to see the opportunity to
reduce the 4 lanes with PLUS and then do the final minus with the remaining
(unvectorized) scalar.

diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
index f9ca24415a2..33b21c8c247 100644
--- a/gcc/tree-vect-slp.c
+++ b/gcc/tree-vect-slp.c
@@ -5666,10 +5666,12 @@ vect_slp_check_for_constructors (bb_vec_info bb_vinfo)
                    {
                      if (chain[i].dt != vect_internal_def)
                        invalid_cst = true;
-                     else if (chain[i].code != code)
-                       invalid_op = true;
                      else
-                       valid_lanes++;
+                       {
+                         valid_lanes++;
+                         if (chain[i].code != code)
+                           invalid_op = true;
+                       }
                    }
                  if (!invalid_op && !invalid_cst)
                    {

then properly prints:

t.c:4:27: optimized: BB reduction missed with 5 lanes

The one different op lane could be handled similarly to the not yet supported
constant - we need to record this operand and apply it to the reduction in
the epilogue.

Let me try something.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations
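
[Editorial illustration, not part of the original message.] The effect of the diff on lane counting can be modelled in a few lines of C. Everything here is a simplified stand-in: `struct chain_op`, `struct lane_check`, the two `count_lanes_*` functions and `example_chain` are hypothetical names, not GCC's types, and which lane carries which operation code is an assumption chosen to match "the one different op lane" in the comment.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-ins; these are not GCC's real types.  */
enum op_code { OP_PLUS, OP_MINUS };

struct chain_op {
    enum op_code code;  /* operation with which the operand enters the chain */
    bool internal_def;  /* false models a constant or external operand */
};

struct lane_check {
    int valid_lanes;
    bool invalid_op;
    bool invalid_cst;
};

/* Counting as in the removed lines: a lane whose code differs is flagged
   but does not count towards valid_lanes.  */
struct lane_check count_lanes_before(const struct chain_op *chain, int n,
                                     enum op_code code)
{
    struct lane_check r = { 0, false, false };

    for (int i = 0; i < n; i++) {
        if (!chain[i].internal_def)
            r.invalid_cst = true;
        else if (chain[i].code != code)
            r.invalid_op = true;
        else
            r.valid_lanes++;
    }
    return r;
}

/* Counting as in the added lines: every internal-def lane counts, and a
   differing code is flagged in addition.  */
struct lane_check count_lanes_after(const struct chain_op *chain, int n,
                                    enum op_code code)
{
    struct lane_check r = { 0, false, false };

    for (int i = 0; i < n; i++) {
        if (!chain[i].internal_def)
            r.invalid_cst = true;
        else {
            r.valid_lanes++;
            if (chain[i].code != code)
                r.invalid_op = true;
        }
    }
    return r;
}

/* Illustrative 5-lane chain: four lanes agree with the reduction code and
   one differs.  The assignment of codes to lanes is an assumption.  */
static const struct chain_op example_chain[5] = {
    { OP_MINUS, true },
    { OP_PLUS, true },
    { OP_PLUS, true },
    { OP_PLUS, true },
    { OP_PLUS, true },
};
```

In this model the reduction is rejected in both variants because `invalid_op` is set; the only difference is the lane count reported in the diagnostic, 4 before and 5 after.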

* [Bug tree-optimization/101842] Vectorizer doesn't vectorize when loop bound depends on two independent variables that are unknown
From: rguenth at gcc dot gnu.org @ 2021-08-10 10:22 UTC
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101842

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
OK, so with a hack like the following we vectorize the BB as

  vect__1.10_62 = MEM <vector(4) float> [(float *)p_34];
  vect_powmult_9.11_61 = vect__1.10_62 * vect__1.10_62;
  _60 = .REDUC_PLUS (vect_powmult_9.11_61);
  d_25 = d_35 - _60;
  p_26 = p_34 + 16;
  i_27 = i_37 + 4;
  _10 = len_20(D) > i_27;
  _11 = lim_21(D) <= d_25;
  _12 = _10 & _11;
  if (_12 != 0)

and on x86_64 we get

.L3:
        movups  (%rdi), %xmm2
        addl    $4, %eax
        addq    $16, %rdi
        mulps   %xmm2, %xmm2
        movaps  %xmm2, %xmm3
        movhlps %xmm2, %xmm3
        addps   %xmm2, %xmm3
        movaps  %xmm3, %xmm2
        shufps  $85, %xmm3, %xmm2
        addps   %xmm3, %xmm2
        subss   %xmm2, %xmm0
        cmpl    %eax, %esi
        jle     .L2
        comiss  %xmm1, %xmm0
        jnb     .L3
.L2:
        ret

or with AVX

.L3:
        vmovups (%rdi), %xmm4
        addl    $4, %eax
        addq    $16, %rdi
        vmulps  %xmm4, %xmm4, %xmm2
        vmovhlps        %xmm2, %xmm2, %xmm3
        vaddps  %xmm2, %xmm3, %xmm3
        vshufps $85, %xmm3, %xmm3, %xmm2
        vaddps  %xmm3, %xmm2, %xmm2
        vsubss  %xmm2, %xmm0, %xmm0
        cmpl    %eax, %esi
        jle     .L2
        vcomiss %xmm1, %xmm0
        jnb     .L3
.L2:
        ret

diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
index f9ca24415a2..0e14c164635 100644
--- a/gcc/tree-vect-slp.c
+++ b/gcc/tree-vect-slp.c
@@ -5637,6 +5637,11 @@ vect_slp_check_for_constructors (bb_vec_info bb_vinfo)
                          || (gimple_assign_rhs_code (use_stmt)
                              != (code == PLUS_EXPR ? MINUS_EXPR : PLUS_EXPR))))))
            {
+             gassign *next_stmt = assign;
+             while (next_stmt)
+               {
+                 assign = next_stmt;
+                 next_stmt = NULL;
              /* We start the match at the end of a possible association
                 chain.  */
              auto_vec<chain_op_t> chain;
@@ -5666,10 +5671,12 @@ vect_slp_check_for_constructors (bb_vec_info bb_vinfo)
                    {
                      if (chain[i].dt != vect_internal_def)
                        invalid_cst = true;
-                     else if (chain[i].code != code)
-                       invalid_op = true;
                      else
-                       valid_lanes++;
+                       {
+                         valid_lanes++;
+                         if (chain[i].code != code)
+                           invalid_op = true;
+                       }
                    }
                  if (!invalid_op && !invalid_cst)
                    {
@@ -5707,8 +5714,13 @@ vect_slp_check_for_constructors (bb_vec_info bb_vinfo)
                        statistics_counter_event (cfun, "BB reduction missed (cst)", 1);
                      statistics_histogram_event (cfun, "BB reduction missed lanes",
                                                  valid_lanes);
+
+                     /* Try again.  */
+                     if (valid_lanes > 2)
+                       next_stmt = as_a <gassign *> (chain_stmts[1]);
                    }
                }
+               }
            }
        }
     }

the hack simply re-starts reduction discovery at the "previous" stmt
(this breaks down after skipping the first stmt eventually).  As said,
it's a hack.  But is that the kind of vectorization you expect?
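
[Editorial illustration, not part of the original message.] The vectorized basic block above can be mirrored in C with the GCC/Clang `vector_size` extension. This is a sketch under assumptions, not what the compiler emits: `v4sf`, `reduc_plus` and `f_bb` are made-up names, and `reduc_plus` merely follows the pairing that the SSE listing suggests (high half added to low half, then lane 1 added to lane 0). The loop control stays scalar, as in the GIMPLE.

```c
#include <string.h>

/* GCC/Clang vector extension: four packed floats in 16 bytes.  */
typedef float v4sf __attribute__((vector_size(16)));

/* Horizontal sum in the order the SSE sequence appears to use.  */
static float reduc_plus(v4sf v)
{
    float lane0 = v[0] + v[2];  /* movhlps + addps, lane 0 */
    float lane1 = v[1] + v[3];  /* movhlps + addps, lane 1 */
    return lane1 + lane0;       /* shufps $85 + addps */
}

/* Loop body as one vector load, one vector multiply, one reduction and
   one scalar subtraction; the exit test is unchanged.  */
float f_bb(const float *p, float d, int len, float lim)
{
    int i = 0;

    while (i < len && d >= lim) {
        v4sf x;

        memcpy(&x, p, sizeof x);    /* unaligned load, like movups */
        v4sf sq = x * x;            /* like mulps */
        d = d - reduc_plus(sq);     /* like subss */
        p += 4;
        i += 4;
    }
    return d;
}
```

Because the sum is formed in a different order than in the source loop, the result can differ in the last bits from the scalar version for general inputs.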

* [Bug tree-optimization/101842] Vectorizer doesn't vectorize when loop bound depends on two independent variables that are unknown
From: tnfchris at gcc dot gnu.org @ 2021-08-10 10:44 UTC
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101842

--- Comment #3 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #2)
> OK, so with a hack like the following we vectorize the BB as
>
>   vect__1.10_62 = MEM <vector(4) float> [(float *)p_34];
>   vect_powmult_9.11_61 = vect__1.10_62 * vect__1.10_62;
>   _60 = .REDUC_PLUS (vect_powmult_9.11_61);
>   d_25 = d_35 - _60;
>   p_26 = p_34 + 16;
>   i_27 = i_37 + 4;
>   _10 = len_20(D) > i_27;
>   _11 = lim_21(D) <= d_25;
>   _12 = _10 & _11;
>   if (_12 != 0)
>

Ah awesome!

>
> the hack simply re-starts reduction discovery at the "previous" stmt
> (this breaks down after skipping the first stmt eventually).  As said,
> it's a hack.  But is that the kind of vectorization you expect?

Yeah that looks perfect. The patch seems to be based on different code than
upstream, so I couldn't apply it to test the full loop, but this looks
perfect! (We already vectorize a similar loop without the `&& d >= lim`
condition.)

* [Bug tree-optimization/101842] Vectorizer doesn't vectorize when loop bound depends on two independent variables that are unknown
From: rguenth at gcc dot gnu.org @ 2021-08-10 10:54 UTC
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101842

--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Tamar Christina from comment #3)
> (In reply to Richard Biener from comment #2)
> > OK, so with a hack like the following we vectorize the BB as
> >
> >   vect__1.10_62 = MEM <vector(4) float> [(float *)p_34];
> >   vect_powmult_9.11_61 = vect__1.10_62 * vect__1.10_62;
> >   _60 = .REDUC_PLUS (vect_powmult_9.11_61);
> >   d_25 = d_35 - _60;
> >   p_26 = p_34 + 16;
> >   i_27 = i_37 + 4;
> >   _10 = len_20(D) > i_27;
> >   _11 = lim_21(D) <= d_25;
> >   _12 = _10 & _11;
> >   if (_12 != 0)
> >
>
> Ah awesome!
>
> >
> > the hack simply re-starts reduction discovery at the "previous" stmt
> > (this breaks down after skipping the first stmt eventually).  As said,
> > it's a hack.  But is that the kind of vectorization you expect?
>
> Yeah that looks perfect, the patch seems to be based on a different code
> than upstream so couldn't apply it to test the full loop, but this looks
> perfect! (We already vectorize a similar loop without the `&& d >= lim`
> condition).

It's applied to my working tree so that's possible.  Note it doesn't
vectorize the loop but the loop body in basic-block vectorization.

* [Bug tree-optimization/101842] Vectorizer doesn't vectorize when loop bound depends on two independent variables that are unknown
From: tnfchris at gcc dot gnu.org @ 2021-08-10 11:16 UTC
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101842

--- Comment #5 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #4)
> (In reply to Tamar Christina from comment #3)
> > (In reply to Richard Biener from comment #2)
> > > OK, so with a hack like the following we vectorize the BB as
>
> It's applied to my working tree so that's possible.  Note it doesn't
> vectorize the loop but the loop body in basic-block vectorization.

I think that's fine, the actual loop also doesn't use any cross iteration
information so vectorizing the BB is good enough.