* [Bug c/109088] GCC fail auto-vectorization
2023-03-10 9:24 [Bug c/109088] New: GCC fail auto-vectorization juzhe.zhong at rivai dot ai
@ 2023-03-10 10:39 ` ubizjak at gmail dot com
2023-03-10 12:39 ` [Bug tree-optimization/109088] " rguenth at gcc dot gnu.org
` (20 subsequent siblings)
21 siblings, 0 replies; 23+ messages in thread
From: ubizjak at gmail dot com @ 2023-03-10 10:39 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109088
--- Comment #1 from Uroš Bizjak <ubizjak at gmail dot com> ---
Please read https://gcc.gnu.org/bugs/ on how to correctly report a bug.
^ permalink raw reply [flat|nested] 23+ messages in thread
* [Bug tree-optimization/109088] GCC fail auto-vectorization
2023-03-10 9:24 [Bug c/109088] New: GCC fail auto-vectorization juzhe.zhong at rivai dot ai
2023-03-10 10:39 ` [Bug c/109088] " ubizjak at gmail dot com
@ 2023-03-10 12:39 ` rguenth at gcc dot gnu.org
2023-03-10 13:16 ` juzhe.zhong at rivai dot ai
` (19 subsequent siblings)
21 siblings, 0 replies; 23+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-03-10 12:39 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109088
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Keywords| |missed-optimization
Component|c |tree-optimization
Blocks| |53947
--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
I think we can do conditional reduction but maybe not for variable-length
vectors? Otherwise I'm sure there's a duplicate bug about this.
Referenced Bugs:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations
^ permalink raw reply [flat|nested] 23+ messages in thread
* [Bug tree-optimization/109088] GCC fail auto-vectorization
2023-03-10 9:24 [Bug c/109088] New: GCC fail auto-vectorization juzhe.zhong at rivai dot ai
2023-03-10 10:39 ` [Bug c/109088] " ubizjak at gmail dot com
2023-03-10 12:39 ` [Bug tree-optimization/109088] " rguenth at gcc dot gnu.org
@ 2023-03-10 13:16 ` juzhe.zhong at rivai dot ai
2023-03-10 14:04 ` pinskia at gcc dot gnu.org
` (18 subsequent siblings)
21 siblings, 0 replies; 23+ messages in thread
From: juzhe.zhong at rivai dot ai @ 2023-03-10 13:16 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109088
--- Comment #3 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
(In reply to Richard Biener from comment #2)
> I think we can do conditional reduction but maybe not for variable-length
> vectors? Otherwise I'm sure there's a duplicate bug about this.
No, GCC still can not vectorize when I specify the fixed-length vectors
(-msve-vector-bits=512).
If it's a known issue, do you have any suggestion? I am willing to fix it when
I finished the RVV auto-vectorization.
^ permalink raw reply [flat|nested] 23+ messages in thread
* [Bug tree-optimization/109088] GCC fail auto-vectorization
2023-03-10 9:24 [Bug c/109088] New: GCC fail auto-vectorization juzhe.zhong at rivai dot ai
` (2 preceding siblings ...)
2023-03-10 13:16 ` juzhe.zhong at rivai dot ai
@ 2023-03-10 14:04 ` pinskia at gcc dot gnu.org
2023-03-10 14:09 ` [Bug tree-optimization/109088] GCC does not always vectorize conditional reduction pinskia at gcc dot gnu.org
` (17 subsequent siblings)
21 siblings, 0 replies; 23+ messages in thread
From: pinskia at gcc dot gnu.org @ 2023-03-10 14:04 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109088
--- Comment #4 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Created attachment 54635
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=54635&action=edit
testcase
^ permalink raw reply [flat|nested] 23+ messages in thread
* [Bug tree-optimization/109088] GCC does not always vectorize conditional reduction
2023-03-10 9:24 [Bug c/109088] New: GCC fail auto-vectorization juzhe.zhong at rivai dot ai
` (3 preceding siblings ...)
2023-03-10 14:04 ` pinskia at gcc dot gnu.org
@ 2023-03-10 14:09 ` pinskia at gcc dot gnu.org
2023-09-26 12:14 ` juzhe.zhong at rivai dot ai
` (16 subsequent siblings)
21 siblings, 0 replies; 23+ messages in thread
From: pinskia at gcc dot gnu.org @ 2023-03-10 14:09 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109088
Andrew Pinski <pinskia at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Last reconfirmed| |2023-03-10
Severity|normal |enhancement
Status|UNCONFIRMED |NEW
Ever confirmed|0 |1
Summary|GCC fail auto-vectorization |GCC does not always
| |vectorize conditional
| |reduction
--- Comment #5 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
GCC does vectorize:
if (a[i] > b[i]) {
result += a[i];
}
Even for variable-length vectors.
Just we don't vectorize when there is an extra conditional operation.
GCC will even vectorize with variable-length vectors:
#include <stdint.h>
uint64_t single_loop_with_if_condition(
uint64_t * restrict a,
uint64_t * restrict b,
uint64_t * restrict c,
int loop_size) {
uint64_t result = 0;
for (int i = 0; i < loop_size; i++) {
c[i] = a[i] + 1;
if (a[i] > b[i]) {
result += c[i];
}
}
return result;
}
^ permalink raw reply [flat|nested] 23+ messages in thread
* [Bug tree-optimization/109088] GCC does not always vectorize conditional reduction
2023-03-10 9:24 [Bug c/109088] New: GCC fail auto-vectorization juzhe.zhong at rivai dot ai
` (4 preceding siblings ...)
2023-03-10 14:09 ` [Bug tree-optimization/109088] GCC does not always vectorize conditional reduction pinskia at gcc dot gnu.org
@ 2023-09-26 12:14 ` juzhe.zhong at rivai dot ai
2023-09-27 2:45 ` juzhe.zhong at rivai dot ai
` (15 subsequent siblings)
21 siblings, 0 replies; 23+ messages in thread
From: juzhe.zhong at rivai dot ai @ 2023-09-26 12:14 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109088
--- Comment #6 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
After investigations:
GCC failed to vectorize reduction with multiple conditional operations:
ifcvt dump:
# result_20 = PHI <result_9(8), 0(18)>
...
_11 = result_20 + 10;
result_17 = _4 + _11;
_23 = _4 > _7;
result_9 = _23 ? result_17 : result_20;
It's odd that GCC failed to vectorize it since they are not complicate
statements.
In LLVM, it will vectorize them into:
vector_ssa_2 = <vector_ssa_result, 0>
...
vector_ssa_1 = vector_ssa_2 + 10;
vector_ssa_3 = vector_ssa_1 + 10;
mask_ssa_1 = vector_ssa_4 > vector_ssa_5;
vector_ssa_result = select <mask_ssa_1, vector_ssa_3, vector_ssa_2>
I think GCC should be able to vectorize it like LLVM:
vector_ssa_2 = <vector_ssa_result, 0>
...
vector_ssa_1 = vector_ssa_2 + 10;
vector_ssa_3 = vector_ssa_1 + 10;
mask_ssa_1 = vector_ssa_4 > vector_ssa_5;
vector_ssa_result = VCOND_MASK <mask_ssa_1, vector_ssa_3, vector_ssa_2>
I saw this code disable the vectorization:
else if (!bbs.is_empty ()
&& bb->loop_father->header == bb
&& bb->loop_father->dont_vectorize)
{
if (dump_enabled_p ())
dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
"splitting region at dont-vectorize loop %d "
"entry at bb%d\n",
bb->loop_father->num, bb->index);
split = true;
}
I am not familiar with these codes, any ideas ? Thanks.
^ permalink raw reply [flat|nested] 23+ messages in thread
* [Bug tree-optimization/109088] GCC does not always vectorize conditional reduction
2023-03-10 9:24 [Bug c/109088] New: GCC fail auto-vectorization juzhe.zhong at rivai dot ai
` (5 preceding siblings ...)
2023-09-26 12:14 ` juzhe.zhong at rivai dot ai
@ 2023-09-27 2:45 ` juzhe.zhong at rivai dot ai
2023-09-27 2:58 ` juzhe.zhong at rivai dot ai
` (14 subsequent siblings)
21 siblings, 0 replies; 23+ messages in thread
From: juzhe.zhong at rivai dot ai @ 2023-09-27 2:45 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109088
--- Comment #7 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
Update the analysis:
We failed to recognize it as reduction because, the PHI result has 2 uses:
# result_20 = PHI <result_9(8), 0(18)>
...
_11 = result_20 + 10; --------> first use
result_17 = _4 + _11;
_23 = _4 > _7;
result_9 = _23 ? result_17 : result_20; -----> second use
It seems that it is the if-conversion issue which makes loop vectorizer failed
to vectorize it.
I have checked LLVM implementation, their "result" ssa always has single use no
matter how I modify the codes (for example, result += a[i] + b[i] + c[i]).
^ permalink raw reply [flat|nested] 23+ messages in thread
* [Bug tree-optimization/109088] GCC does not always vectorize conditional reduction
2023-03-10 9:24 [Bug c/109088] New: GCC fail auto-vectorization juzhe.zhong at rivai dot ai
` (6 preceding siblings ...)
2023-09-27 2:45 ` juzhe.zhong at rivai dot ai
@ 2023-09-27 2:58 ` juzhe.zhong at rivai dot ai
2023-09-27 7:15 ` rguenth at gcc dot gnu.org
` (13 subsequent siblings)
21 siblings, 0 replies; 23+ messages in thread
From: juzhe.zhong at rivai dot ai @ 2023-09-27 2:58 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109088
--- Comment #8 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
It's because the order of the operations we are doing:
For code as follows:
result += mask ? a[i] + x : 0;
GCC:
result_ssa_1 = PHI <result_ssa_2, 0>
...
STMT 1. tmp = a[i] + x;
STMT 2. tmp2 = tmp + result_ssa_1;
STMT 3. result_ssa_2 = mask ? tmp2 : result_ssa_1;
Here we can see both STMT 2 and STMT 3 are using 'result_ssa_1',
we end up with 2 uses of the PHI result. Then, we failed to vectorize.
Wheras LLVM:
result_ssa_1 = PHI <result_ssa_2, 0>
...
IR 1. tmp = a[i] + x;
IR 2. tmp2 = mask ? tmp : 0;
IR 3. result_ssa_2 = tmp2 + result_ssa_1.
LLVM only has 1 use.
Is it reasonable to swap the order in match.pd ?
^ permalink raw reply [flat|nested] 23+ messages in thread
* [Bug tree-optimization/109088] GCC does not always vectorize conditional reduction
2023-03-10 9:24 [Bug c/109088] New: GCC fail auto-vectorization juzhe.zhong at rivai dot ai
` (7 preceding siblings ...)
2023-09-27 2:58 ` juzhe.zhong at rivai dot ai
@ 2023-09-27 7:15 ` rguenth at gcc dot gnu.org
2023-09-27 7:34 ` juzhe.zhong at rivai dot ai
` (12 subsequent siblings)
21 siblings, 0 replies; 23+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-09-27 7:15 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109088
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |rdapp at gcc dot gnu.org
--- Comment #9 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to JuzheZhong from comment #8)
> It's because the order of the operations we are doing:
>
> For code as follows:
>
> result += mask ? a[i] + x : 0;
>
> GCC:
> result_ssa_1 = PHI <result_ssa_2, 0>
> ...
> STMT 1. tmp = a[i] + x;
> STMT 2. tmp2 = tmp + result_ssa_1;
> STMT 3. result_ssa_2 = mask ? tmp2 : result_ssa_1;
>
> Here we can see both STMT 2 and STMT 3 are using 'result_ssa_1',
> we end up with 2 uses of the PHI result. Then, we failed to vectorize.
>
> Wheras LLVM:
>
> result_ssa_1 = PHI <result_ssa_2, 0>
> ...
> IR 1. tmp = a[i] + x;
> IR 2. tmp2 = mask ? tmp : 0;
> IR 3. result_ssa_2 = tmp2 + result_ssa_1.
For floating point these are not equivalent (adding zero isn't a no-op).
> LLVM only has 1 use.
>
> Is it reasonable to swap the order in match.pd ?
if-conversion could be teached to swap this (it's if-conversion creating
the IL for conditional reductions) when valid. IIRC Robin Dapp also has
a patch to make if-conversion emit .COND_ADD instead which should make
it even better to vectorize.
^ permalink raw reply [flat|nested] 23+ messages in thread
* [Bug tree-optimization/109088] GCC does not always vectorize conditional reduction
2023-03-10 9:24 [Bug c/109088] New: GCC fail auto-vectorization juzhe.zhong at rivai dot ai
` (8 preceding siblings ...)
2023-09-27 7:15 ` rguenth at gcc dot gnu.org
@ 2023-09-27 7:34 ` juzhe.zhong at rivai dot ai
2023-09-27 9:06 ` rguenth at gcc dot gnu.org
` (11 subsequent siblings)
21 siblings, 0 replies; 23+ messages in thread
From: juzhe.zhong at rivai dot ai @ 2023-09-27 7:34 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109088
--- Comment #10 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
(In reply to Richard Biener from comment #9)
> (In reply to JuzheZhong from comment #8)
> > It's because the order of the operations we are doing:
> >
> > For code as follows:
> >
> > result += mask ? a[i] + x : 0;
> >
> > GCC:
> > result_ssa_1 = PHI <result_ssa_2, 0>
> > ...
> > STMT 1. tmp = a[i] + x;
> > STMT 2. tmp2 = tmp + result_ssa_1;
> > STMT 3. result_ssa_2 = mask ? tmp2 : result_ssa_1;
> >
> > Here we can see both STMT 2 and STMT 3 are using 'result_ssa_1',
> > we end up with 2 uses of the PHI result. Then, we failed to vectorize.
> >
> > Wheras LLVM:
> >
> > result_ssa_1 = PHI <result_ssa_2, 0>
> > ...
> > IR 1. tmp = a[i] + x;
> > IR 2. tmp2 = mask ? tmp : 0;
> > IR 3. result_ssa_2 = tmp2 + result_ssa_1.
>
> For floating point these are not equivalent (adding zero isn't a no-op).
Yes, I agree these are not equivalent for floating-point.
But I they are equivalent if we specify -ffast-math.
I have double checked LLVM, they failed to vectorize conditionl
floating-point reduction too by default.
However, if we specify LLVM -ffast-math, it will generate the same
if-conversion IR sequence as integer, then vectorization succeed.
>
> > LLVM only has 1 use.
> >
> > Is it reasonable to swap the order in match.pd ?
>
> if-conversion could be teached to swap this (it's if-conversion creating
> the IL for conditional reductions) when valid. IIRC Robin Dapp also has
> a patch to make if-conversion emit .COND_ADD instead which should make
> it even better to vectorize.
I knew that patch, Robin is trying fixing the issue (in-order reduction)that I
posted.
I have confirm that patch can't help since it didn't modify the code for this
case, we will end up with multiple use in conditional reduction.
The reduction failed since:
/* If this isn't a nested cycle or if the nested cycle reduction value
is used ouside of the inner loop we cannot handle uses of the reduction
value. */
if (nlatch_def_loop_uses > 1 || nphi_def_loop_uses > 1)
{
if (dump_enabled_p ())
dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
"reduction used in loop.\n");
return NULL;
}
when nphi_def_loop_uses > 1, we failed to vectorize.
I have checked LLVM codes, and I think we can extend this function:
strip_nop_cond_scalar_reduction
We should be able to strip all the statement until we can reach the
use of PHI result, like this:
LLVM is able to handle this case:
for ()
if (cond)
result += a[i] + b[i] + c[i] + ....
No matter how many variables are added in the condition reduction.
They well handle that since they keep iterating all the statement until
reach the result:
result_ssa_1 = PHI <>
tmp1 = result_ssa_1 + a[i];
tmp2 = tmp1 + b[i];
tmp3 = tmp2 + c[i];
....
We keep iterating until find the result_ssa_1 to hold the reduction variable.
Is this LLVM's approach reasonable to GCC?
If yes, I can translate LLVM code into GCC.
Thanks.
^ permalink raw reply [flat|nested] 23+ messages in thread
* [Bug tree-optimization/109088] GCC does not always vectorize conditional reduction
2023-03-10 9:24 [Bug c/109088] New: GCC fail auto-vectorization juzhe.zhong at rivai dot ai
` (9 preceding siblings ...)
2023-09-27 7:34 ` juzhe.zhong at rivai dot ai
@ 2023-09-27 9:06 ` rguenth at gcc dot gnu.org
2023-09-27 9:27 ` juzhe.zhong at rivai dot ai
` (10 subsequent siblings)
21 siblings, 0 replies; 23+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-09-27 9:06 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109088
--- Comment #11 from Richard Biener <rguenth at gcc dot gnu.org> ---
I don't think strip_nop_cond_scalar_reduction is the place to adjust here,
maybe it's the caller. I don't have time to dig into the specific issue right
now but if we require scalar code adjustments then we need to perform those in
if-conversion.
But to me it looks like allowing
> > STMT 1. tmp = a[i] + x;
> > STMT 2. tmp2 = tmp + result_ssa_1;
> > STMT 3. result_ssa_2 = mask ? tmp2 : result_ssa_1;
in vect_is_simple_reduction might also be a reasonable approach. The
use in the COND_EXPR isn't really a use - it's a conditional update.
^ permalink raw reply [flat|nested] 23+ messages in thread
* [Bug tree-optimization/109088] GCC does not always vectorize conditional reduction
2023-03-10 9:24 [Bug c/109088] New: GCC fail auto-vectorization juzhe.zhong at rivai dot ai
` (10 preceding siblings ...)
2023-09-27 9:06 ` rguenth at gcc dot gnu.org
@ 2023-09-27 9:27 ` juzhe.zhong at rivai dot ai
2023-09-27 14:11 ` juzhe.zhong at rivai dot ai
` (9 subsequent siblings)
21 siblings, 0 replies; 23+ messages in thread
From: juzhe.zhong at rivai dot ai @ 2023-09-27 9:27 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109088
--- Comment #12 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
(In reply to Richard Biener from comment #11)
> I don't think strip_nop_cond_scalar_reduction is the place to adjust here,
> maybe it's the caller. I don't have time to dig into the specific issue
> right now but if we require scalar code adjustments then we need to perform
> those in if-conversion.
>
> But to me it looks like allowing
>
> > > STMT 1. tmp = a[i] + x;
> > > STMT 2. tmp2 = tmp + result_ssa_1;
> > > STMT 3. result_ssa_2 = mask ? tmp2 : result_ssa_1;
>
> in vect_is_simple_reduction might also be a reasonable approach. The
> use in the COND_EXPR isn't really a use - it's a conditional update.
Thanks Richi.
Enhancing vect_is_simple_reduction in loop vectorizer is also a good approach.
But I think it's better to recognize the scalar condition reduction
(if-conversion) as early as possible. Obviously, current if-conversion failed
to
recognize it as a feasible conditional reduction.
I think enhancing vect_is_simple_reduction is the approach that it's unlikely
we
can simplify the scalar code in if-converison to fit current loop vectorizer.
I believe we will eventually have to enhance both if-converison and loop
vectorizer in the future. And I prefer improving the if-conversion and working
on it. Will keep you posted.
Thanks a lot!
^ permalink raw reply [flat|nested] 23+ messages in thread
* [Bug tree-optimization/109088] GCC does not always vectorize conditional reduction
2023-03-10 9:24 [Bug c/109088] New: GCC fail auto-vectorization juzhe.zhong at rivai dot ai
` (11 preceding siblings ...)
2023-09-27 9:27 ` juzhe.zhong at rivai dot ai
@ 2023-09-27 14:11 ` juzhe.zhong at rivai dot ai
2023-10-06 9:44 ` rguenth at gcc dot gnu.org
` (8 subsequent siblings)
21 siblings, 0 replies; 23+ messages in thread
From: juzhe.zhong at rivai dot ai @ 2023-09-27 14:11 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109088
--- Comment #13 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
Hi, Richi. This is my draft approach to enhance the finding more potential
condtional reduction.
diff --git a/gcc/tree-if-conv.cc b/gcc/tree-if-conv.cc
index a8c915913ae..c25d2038f16 100644
--- a/gcc/tree-if-conv.cc
+++ b/gcc/tree-if-conv.cc
@@ -1790,8 +1790,72 @@ is_cond_scalar_reduction (gimple *phi, gimple **reduc,
tree arg_0, tree arg_1,
std::swap (r_op1, r_op2);
std::swap (r_nop1, r_nop2);
}
- else if (r_nop1 != PHI_RESULT (header_phi))
- return false;
+ else if (r_nop1 == PHI_RESULT (header_phi))
+ ;
+ else
+ {
+ /* Analyze the statement chain of STMT so that we could teach generate
+ better if-converison code sequence. We are trying to catch this
+ following situation:
+
+ loop-header:
+ reduc_1 = PHI <..., reduc_2>
+ ...
+ if (...)
+ tmp1 = reduc_1 + rhs1;
+ tmp2 = tmp1 + rhs2;
+ tmp3 = tmp2 + rhs3;
+ ...
+ reduc_3 = tmpN-1 + rhsN-1;
+
+ reduc_2 = PHI <reduc_1, reduc_3>
+
+ and convert to
+
+ reduc_2 = PHI <0, reduc_1>
+ tmp1 = rhs1 + rhs2;
+ tmp2 = tmp1 + rhs3;
+ tmp3 = tmp2 + rhs4;
+ ...
+ tmpN-1 = tmpN-2 + rhsN;
+ ifcvt = cond_expr ? tmpN-1 : 0
+ reduc_1 = tmpN-1 +/- ifcvt; */
+ if (num_imm_uses (PHI_RESULT (header_phi)) != 2)
+ return false;
+ FOR_EACH_IMM_USE_FAST (use_p, imm_iter, PHI_RESULT (header_phi))
+ {
+ gimple *use_stmt = USE_STMT (use_p);
+ if (is_gimple_assign (use_stmt))
+ {
+ if (gimple_assign_rhs_code (use_stmt) != reduction_op)
+ return false;
+ if (TREE_CODE (gimple_assign_lhs (use_stmt)) != SSA_NAME)
+ return false;
+
+ bool visited_p = false;
+ while (!visited_p)
+ {
+ use_operand_p use;
+ if (!single_imm_use (gimple_assign_lhs (use_stmt), &use,
+ &use_stmt)
+ || gimple_bb (use_stmt) != gimple_bb (stmt)
+ || !is_gimple_assign (use_stmt)
+ || TREE_CODE (gimple_assign_lhs (use_stmt)) != SSA_NAME
+ || gimple_assign_rhs_code (use_stmt) != reduction_op)
+ return false;
+
+ if (gimple_assign_lhs (use_stmt) == gimple_assign_lhs (stmt))
+ {
+ r_op2 = r_op1;
+ r_op1 = PHI_RESULT (header_phi);
+ visited_p = true;
+ }
+ }
+ }
+ else if (use_stmt != phi)
+ return false;
+ }
+ }
My approach is doing the check as follows:
tmp1 = reduc_1 + rhs1;
tmp2 = tmp1 + rhs2;
tmp3 = tmp2 + rhs3;
...
reduc_3 = tmpN-1 + rhsN-1;
Start the iteration check from "tmp1 = reduc_1 + rhs1;" until "reduc_3 = tmpN-1
+ rhsN-1;"
Make sure each statement are PLUS_EXPR for reduction sum.
Does it look reasonable ?
It succeed on vectorization.
^ permalink raw reply [flat|nested] 23+ messages in thread
* [Bug tree-optimization/109088] GCC does not always vectorize conditional reduction
2023-03-10 9:24 [Bug c/109088] New: GCC fail auto-vectorization juzhe.zhong at rivai dot ai
` (12 preceding siblings ...)
2023-09-27 14:11 ` juzhe.zhong at rivai dot ai
@ 2023-10-06 9:44 ` rguenth at gcc dot gnu.org
2023-11-10 12:02 ` juzhe.zhong at rivai dot ai
` (7 subsequent siblings)
21 siblings, 0 replies; 23+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-10-06 9:44 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109088
--- Comment #14 from Richard Biener <rguenth at gcc dot gnu.org> ---
Sorry for the delay - but this looks exactly like Robins transform to COND_ADD,
no? But sure, the current code doesn't handle a reduction path through
multiple stmts but when if-conversion would convert the final add to a COND_ADD
then
it should be a matter of teaching tree-vect-loop.cc:check_reduction_path
about conditional operations?
^ permalink raw reply [flat|nested] 23+ messages in thread
* [Bug tree-optimization/109088] GCC does not always vectorize conditional reduction
2023-03-10 9:24 [Bug c/109088] New: GCC fail auto-vectorization juzhe.zhong at rivai dot ai
` (13 preceding siblings ...)
2023-10-06 9:44 ` rguenth at gcc dot gnu.org
@ 2023-11-10 12:02 ` juzhe.zhong at rivai dot ai
2023-11-10 13:10 ` rguenth at gcc dot gnu.org
` (6 subsequent siblings)
21 siblings, 0 replies; 23+ messages in thread
From: juzhe.zhong at rivai dot ai @ 2023-11-10 12:02 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109088
--- Comment #15 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
Hi,Richard.
Confirmed Robin's patch doesn't help with this issue.
The root cause of this issue is failed to recognize it as possible
vectorization of conditional reduction which means is_cond_scalar_reduction
is FALSE.
I have this following patch which bootstrap on X86 and regtest passed
also passed on aarch64. This following patch can enhance if-conv
conditional reduction recognition.
diff --git a/gcc/tree-if-conv.cc b/gcc/tree-if-conv.cc
index a8c915913ae..2bdd3710a65 100644
--- a/gcc/tree-if-conv.cc
+++ b/gcc/tree-if-conv.cc
@@ -1784,14 +1784,119 @@ is_cond_scalar_reduction (gimple *phi, gimple **reduc,
tree arg_0, tree arg_1,
r_nop2 = strip_nop_cond_scalar_reduction (*has_nop, r_op2);
/* Make R_OP1 to hold reduction variable. */
+ gimple *nonphi_use_stmt = NULL;
if (r_nop2 == PHI_RESULT (header_phi)
&& commutative_tree_code (reduction_op))
{
std::swap (r_op1, r_op2);
std::swap (r_nop1, r_nop2);
}
- else if (r_nop1 != PHI_RESULT (header_phi))
- return false;
+ else if (r_nop1 == PHI_RESULT (header_phi))
+ ;
+ else
+ {
+ /* Analyze the statement chain of STMT so that we could teach generate
+ better if-converison code sequence. We are trying to catch this
+ following situation:
+
+ loop-header:
+ reduc_1 = PHI <..., reduc_2>
+ ...
+ if (...)
+ tmp1 = reduc_1 + rhs1;
+ tmp2 = tmp1 + rhs2;
+ tmp3 = tmp2 + rhs3;
+ ...
+ reduc_3 = tmpN-1 + rhsN-1;
+
+ reduc_2 = PHI <reduc_1, reduc_3>
+
+ and convert to
+
+ reduc_2 = PHI <0, reduc_1>
+ tmp1 = rhs1;
+ tmp2 = tmp1 + rhs2;
+ tmp3 = tmp2 + rhs3;
+ ...
+ reduc_3 = tmpN-1 + rhsN-1;
+ ifcvt = cond_expr ? reduc_3 : 0;
+ reduc_1 = reduc_1 +/- ifcvt; */
+ if (num_imm_uses (PHI_RESULT (header_phi)) != 2)
+ return false;
+ if (!ANY_INTEGRAL_TYPE_P (TREE_TYPE (PHI_RESULT (phi)))
+ && !(FLOAT_TYPE_P (TREE_TYPE (PHI_RESULT (phi)))
+ && !HONOR_SIGNED_ZEROS (TREE_TYPE (PHI_RESULT (phi)))
+ && !HONOR_SIGN_DEPENDENT_ROUNDING (TREE_TYPE (PHI_RESULT (phi)))
+ && !HONOR_NANS (TREE_TYPE (PHI_RESULT (phi)))))
+ return false;
+ gimple *prev_use_stmt, *curr_use_stmt;
+ use_operand_p use;
+ FOR_EACH_IMM_USE_FAST (use_p, imm_iter, PHI_RESULT (header_phi))
+ {
+ prev_use_stmt = curr_use_stmt = USE_STMT (use_p);
+ if (is_gimple_assign (curr_use_stmt))
+ {
+ if (TREE_CODE (gimple_assign_lhs (curr_use_stmt)) != SSA_NAME)
+ return false;
+ if (*has_nop)
+ {
+ if (!CONVERT_EXPR_CODE_P (
+ gimple_assign_rhs_code (curr_use_stmt)))
+ return false;
+ }
+ else
+ {
+ if (gimple_assign_rhs_code (curr_use_stmt) != reduction_op)
+ return false;
+ }
+
+ bool visited_p = false;
+ nonphi_use_stmt = curr_use_stmt;
+ while (!visited_p)
+ {
+ if (!single_imm_use (gimple_assign_lhs (prev_use_stmt), &use,
+ &curr_use_stmt)
+ || gimple_bb (curr_use_stmt) != gimple_bb (stmt)
+ || !is_gimple_assign (curr_use_stmt)
+ || TREE_CODE (gimple_assign_lhs (curr_use_stmt))
+ != SSA_NAME
+ || gimple_assign_rhs_code (curr_use_stmt) !=
reduction_op)
+ return false;
+ if (curr_use_stmt == stmt)
+ {
+ if (*has_nop)
+ {
+ if (!single_imm_use (gimple_assign_lhs (
+ nonphi_use_stmt),
+ &use, &curr_use_stmt))
+ return false;
+ r_op1 = gimple_assign_lhs (nonphi_use_stmt);
+ r_nop1 = PHI_RESULT (header_phi);
+ nonphi_use_stmt = curr_use_stmt;
+ }
+ else
+ r_op1 = PHI_RESULT (header_phi);
+
+ if (*has_nop)
+ {
+ if (!single_imm_use (gimple_assign_lhs (stmt), &use,
+ &curr_use_stmt))
+ return false;
+ r_op2 = gimple_assign_lhs (stmt);
+ r_nop2 = gimple_assign_lhs (curr_use_stmt);
+ }
+ else
+ r_op2 = gimple_assign_lhs (stmt);
+ visited_p = true;
+ }
+ else
+ prev_use_stmt = curr_use_stmt;
+ }
+ }
+ else if (curr_use_stmt != phi)
+ return false;
+ }
+ }
if (*has_nop)
{
@@ -1816,12 +1921,41 @@ is_cond_scalar_reduction (gimple *phi, gimple **reduc,
tree arg_0, tree arg_1,
continue;
if (use_stmt == stmt)
continue;
+ if (use_stmt == nonphi_use_stmt)
+ continue;
if (gimple_code (use_stmt) != GIMPLE_PHI)
return false;
}
*op0 = r_op1; *op1 = r_op2;
*reduc = stmt;
+
+ if (nonphi_use_stmt)
+ {
+ /* Transform:
+
+ if (...)
+ tmp1 = reduc_1 + rhs1;
+ tmp2 = tmp1 + rhs2;
+ tmp3 = tmp2 + rhs3;
+
+ into:
+
+ tmp1 = rhs1 + 0; ---> We replace reduc_1 into '0'
+ tmp2 = tmp1 + rhs2;
+ tmp3 = tmp2 + rhs3;
+ ...
+ reduc_3 = tmpN-1 + rhsN-1;
+ ifcvt = cond_expr ? reduc_3 : 0; */
+ gcc_assert (gimple_assign_rhs_code (nonphi_use_stmt) == reduction_op);
+ if (gimple_assign_rhs1 (nonphi_use_stmt) == r_op1)
+ gimple_assign_set_rhs1 (nonphi_use_stmt,
+ build_zero_cst (TREE_TYPE (r_op1)));
+ else if (gimple_assign_rhs2 (nonphi_use_stmt) == r_op1)
+ gimple_assign_set_rhs2 (nonphi_use_stmt,
+ build_zero_cst (TREE_TYPE (r_op1)));
+ update_stmt (nonphi_use_stmt);
+ }
return true;
}
@@ -1886,12 +2020,17 @@ convert_scalar_cond_reduction (gimple *reduc,
gimple_stmt_iterator *gsi,
gsi_remove (&stmt_it, true);
release_defs (nop_reduc);
}
+
gsi_insert_seq_before (gsi, stmts, GSI_SAME_STMT);
/* Delete original reduction stmt. */
- stmt_it = gsi_for_stmt (reduc);
- gsi_remove (&stmt_it, true);
- release_defs (reduc);
+ if (op1 != gimple_assign_lhs (reduc))
+ {
+ stmt_it = gsi_for_stmt (reduc);
+ gsi_remove (&stmt_it, true);
+ release_defs (reduc);
+ }
+
return rhs;
}
I have fully tested it with different kinds of condtional reduction.
All can be vectorized.
I am not sure whether it is a feasible solution.
^ permalink raw reply [flat|nested] 23+ messages in thread
* [Bug tree-optimization/109088] GCC does not always vectorize conditional reduction
2023-03-10 9:24 [Bug c/109088] New: GCC fail auto-vectorization juzhe.zhong at rivai dot ai
` (14 preceding siblings ...)
2023-11-10 12:02 ` juzhe.zhong at rivai dot ai
@ 2023-11-10 13:10 ` rguenth at gcc dot gnu.org
2023-11-10 13:42 ` juzhe.zhong at rivai dot ai
` (5 subsequent siblings)
21 siblings, 0 replies; 23+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-11-10 13:10 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109088
--- Comment #16 from Richard Biener <rguenth at gcc dot gnu.org> ---
it's not exactly clear what the transform is you propose. it looks like a
re-association but SSA names do not match up but the transform only replaces
a single op with 0!?
^ permalink raw reply [flat|nested] 23+ messages in thread
* [Bug tree-optimization/109088] GCC does not always vectorize conditional reduction
2023-03-10 9:24 [Bug c/109088] New: GCC fail auto-vectorization juzhe.zhong at rivai dot ai
` (15 preceding siblings ...)
2023-11-10 13:10 ` rguenth at gcc dot gnu.org
@ 2023-11-10 13:42 ` juzhe.zhong at rivai dot ai
2023-11-15 14:09 ` rguenth at gcc dot gnu.org
` (4 subsequent siblings)
21 siblings, 0 replies; 23+ messages in thread
From: juzhe.zhong at rivai dot ai @ 2023-11-10 13:42 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109088
--- Comment #17 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
Sorry for confusing and not enough information.
I am trying to transform:
+ reduc_1 = PHI <..., reduc_2>
+ ...
+ if (...)
+ tmp1 = reduc_1 + rhs1;
+ tmp2 = tmp1 + rhs2;
+ tmp3 = tmp2 + rhs3;
+ ...
+ reduc_3 = tmpN-1 + rhsN-1;
+
+ reduc_2 = PHI <reduc_1, reduc_3>
First transform the first statement:
tmp1 = reduc_1 + rhs1; into tmp1 = rhs1 + 0;
Then it will become bogus data move assignment: tmp1 = rhs1.
The later PASS will eliminate it.
Then, transform the reduction PHI:
reduc_1 = PHI <..., reduc_2>
into if-convert statement:
reduc_1 = PHI <_ifc__35(8), 0(18)>
Thid, transform
+ reduc_3 = tmpN-1 + rhsN-1;
+
+ reduc_2 = PHI <reduc_1, reduc_3>
into :
reduc_3 = tmpN-1 + rhsN-1;
_ifc__35 = .COND_ADD (condition, reduc_1, reduc_3, reduc_1);
So finally:
result_1 = PHI <_ifc__35(8), 0(18)>
...
tmp1 = rhs1;
tmp2 = tmp1 + rhs2;
tmp3 = tmp2 + rhs3;
...
reduc_3 = tmpN-1 + rhsN-1;
_ifc__35 = .COND_ADD (condition, reduc_1, reduc_3, reduc_1);
^ permalink raw reply [flat|nested] 23+ messages in thread
* [Bug tree-optimization/109088] GCC does not always vectorize conditional reduction
2023-03-10 9:24 [Bug c/109088] New: GCC fail auto-vectorization juzhe.zhong at rivai dot ai
` (16 preceding siblings ...)
2023-11-10 13:42 ` juzhe.zhong at rivai dot ai
@ 2023-11-15 14:09 ` rguenth at gcc dot gnu.org
2023-11-15 14:38 ` juzhe.zhong at rivai dot ai
` (3 subsequent siblings)
21 siblings, 0 replies; 23+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-11-15 14:09 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109088
--- Comment #18 from Richard Biener <rguenth at gcc dot gnu.org> ---
Ah, so this is for
int foo (int *a, int x)
{
int sum = 0;
for (int i = 0; i < 1024; ++i)
if (a[i] < 10)
sum = sum + a[i] + x;
return sum;
}
transforming it to
int foo (int *a, int x)
{
int sum = 0;
for (int i = 0; i < 1024; ++i)
{
tem = a[i] + x;
if (a[i] < 10)
sum = sum + tem;
}
return sum;
}
note this is re-associating and thus not always valid. It also executes
stmts unconditionally (also not always valid, for FP spurious exceptions
might be raised).
^ permalink raw reply [flat|nested] 23+ messages in thread
* [Bug tree-optimization/109088] GCC does not always vectorize conditional reduction
2023-03-10 9:24 [Bug c/109088] New: GCC fail auto-vectorization juzhe.zhong at rivai dot ai
` (17 preceding siblings ...)
2023-11-15 14:09 ` rguenth at gcc dot gnu.org
@ 2023-11-15 14:38 ` juzhe.zhong at rivai dot ai
2023-11-15 14:42 ` rguenther at suse dot de
` (2 subsequent siblings)
21 siblings, 0 replies; 23+ messages in thread
From: juzhe.zhong at rivai dot ai @ 2023-11-15 14:38 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109088
--- Comment #19 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
I have added:
+ if (!ANY_INTEGRAL_TYPE_P (TREE_TYPE (PHI_RESULT (phi)))
+ && !(FLOAT_TYPE_P (TREE_TYPE (PHI_RESULT (phi)))
+ && !HONOR_SIGNED_ZEROS (TREE_TYPE (PHI_RESULT (phi)))
+ && !HONOR_SIGN_DEPENDENT_ROUNDING (TREE_TYPE (PHI_RESULT (phi)))
+ && !HONOR_NANS (TREE_TYPE (PHI_RESULT (phi)))))
+ return false;
for floating-point. I failed to see which situation will cause FP exceptions ?
^ permalink raw reply [flat|nested] 23+ messages in thread
* [Bug tree-optimization/109088] GCC does not always vectorize conditional reduction
2023-03-10 9:24 [Bug c/109088] New: GCC fail auto-vectorization juzhe.zhong at rivai dot ai
` (18 preceding siblings ...)
2023-11-15 14:38 ` juzhe.zhong at rivai dot ai
@ 2023-11-15 14:42 ` rguenther at suse dot de
2023-11-16 1:06 ` juzhe.zhong at rivai dot ai
2023-11-16 6:50 ` rguenther at suse dot de
21 siblings, 0 replies; 23+ messages in thread
From: rguenther at suse dot de @ 2023-11-15 14:42 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109088
--- Comment #20 from rguenther at suse dot de <rguenther at suse dot de> ---
On Wed, 15 Nov 2023, juzhe.zhong at rivai dot ai wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109088
>
> --- Comment #19 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
> I have added:
>
> + if (!ANY_INTEGRAL_TYPE_P (TREE_TYPE (PHI_RESULT (phi)))
For TYPE_OVERFLOW_UNDEFINED you have to convert the ops to unsigned
to avoid spurious undefined overflow.
> + && !(FLOAT_TYPE_P (TREE_TYPE (PHI_RESULT (phi)))
> + && !HONOR_SIGNED_ZEROS (TREE_TYPE (PHI_RESULT (phi)))
> + && !HONOR_SIGN_DEPENDENT_ROUNDING (TREE_TYPE (PHI_RESULT (phi)))
> + && !HONOR_NANS (TREE_TYPE (PHI_RESULT (phi)))))
You should check flag_associative_math which covers one half
and !flag_trapping_math which covers spurious FP exceptions.
> + return false;
>
> for floating-point. I failed to see which situation will cause FP exceptions ?
Ops with NaN cause INVALID, but there's also INEXACT which can be
set differently after re-association.
^ permalink raw reply [flat|nested] 23+ messages in thread
* [Bug tree-optimization/109088] GCC does not always vectorize conditional reduction
2023-03-10 9:24 [Bug c/109088] New: GCC fail auto-vectorization juzhe.zhong at rivai dot ai
` (19 preceding siblings ...)
2023-11-15 14:42 ` rguenther at suse dot de
@ 2023-11-16 1:06 ` juzhe.zhong at rivai dot ai
2023-11-16 6:50 ` rguenther at suse dot de
21 siblings, 0 replies; 23+ messages in thread
From: juzhe.zhong at rivai dot ai @ 2023-11-16 1:06 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109088
--- Comment #21 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
Thanks Richi.
Does re-associating (with eliminating exceptions) in if-convert is a reasonable
approach ?
If yes, I am gonna to revisit this PR on GCC-15 and refine the codes.
^ permalink raw reply [flat|nested] 23+ messages in thread
* [Bug tree-optimization/109088] GCC does not always vectorize conditional reduction
2023-03-10 9:24 [Bug c/109088] New: GCC fail auto-vectorization juzhe.zhong at rivai dot ai
` (20 preceding siblings ...)
2023-11-16 1:06 ` juzhe.zhong at rivai dot ai
@ 2023-11-16 6:50 ` rguenther at suse dot de
21 siblings, 0 replies; 23+ messages in thread
From: rguenther at suse dot de @ 2023-11-16 6:50 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109088
--- Comment #22 from rguenther at suse dot de <rguenther at suse dot de> ---
On Thu, 16 Nov 2023, juzhe.zhong at rivai dot ai wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109088
>
> --- Comment #21 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
> Thanks Richi.
>
> Does re-associating (with eliminating exceptions) in if-convert is a reasonable
> approach ?
Yeah, I don't think we have a much better place at the moment.
^ permalink raw reply [flat|nested] 23+ messages in thread