* [PATCH V2] VECT: Support CALL vectorization for COND_LEN_*
@ 2023-07-28 7:10 Juzhe-Zhong
2023-07-31 9:26 ` Richard Biener
2023-07-31 10:19 ` Richard Sandiford
0 siblings, 2 replies; 18+ messages in thread
From: Juzhe-Zhong @ 2023-07-28 7:10 UTC (permalink / raw)
To: gcc-patches; +Cc: richard.sandiford, rguenther, Juzhe-Zhong
Hi, Richard and Richi.
Based on the suggestions from Richard:
https://gcc.gnu.org/pipermail/gcc-patches/2023-July/625396.html
This patch takes approach (1) that Richard suggested, meaning:
RVV implements cond_* optabs as expanders. RVV therefore supports
both IFN_COND_ADD and IFN_COND_LEN_ADD. No dummy length arguments
are needed at the gimple level.
This approach makes the code much cleaner and more reasonable.
Consider the following case:
void foo (float * __restrict a, float * __restrict b, int * __restrict cond, int n)
{
  for (int i = 0; i < n; i++)
    if (cond[i])
      a[i] = b[i] + a[i];
}
Output of RISC-V (32-bits) gcc (trunk) (Compiler #3)
<source>:5:21: missed: couldn't vectorize loop
<source>:5:21: missed: not vectorized: control flow in loop.
ARM SVE:
...
mask__27.10_51 = vect__4.9_49 != { 0, ... };
...
vec_mask_and_55 = loop_mask_49 & mask__27.10_51;
...
vect__9.17_62 = .COND_ADD (vec_mask_and_55, vect__6.13_56, vect__8.16_60, vect__6.13_56);
For RVV, we want IR as follows:
...
_68 = .SELECT_VL (ivtmp_66, POLY_INT_CST [4, 4]);
...
mask__27.10_51 = vect__4.9_49 != { 0, ... };
...
vect__9.17_60 = .COND_LEN_ADD (mask__27.10_51, vect__6.13_55, vect__8.16_59, vect__6.13_55, _68, 0);
...
Both the len and the mask of COND_LEN_ADD are real, not dummy.
This patch has been fully tested on the RISC-V port, which supports both COND_* and COND_LEN_*.
Bootstrap and regression on x86 also passed.
OK for trunk?
gcc/ChangeLog:
* internal-fn.cc (FOR_EACH_LEN_FN_PAIR): New macro.
(get_len_internal_fn): New function.
(CASE): Ditto.
* internal-fn.h (get_len_internal_fn): Ditto.
* tree-vect-stmts.cc (vectorizable_call): Support CALL vectorization with COND_LEN_*.
---
gcc/internal-fn.cc | 46 ++++++++++++++++++++++
gcc/internal-fn.h | 1 +
gcc/tree-vect-stmts.cc | 87 +++++++++++++++++++++++++++++++++++++-----
3 files changed, 125 insertions(+), 9 deletions(-)
diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
index 8e294286388..379220bebc7 100644
--- a/gcc/internal-fn.cc
+++ b/gcc/internal-fn.cc
@@ -4443,6 +4443,52 @@ get_conditional_internal_fn (internal_fn fn)
}
}
+/* Invoke T(IFN) for each internal function IFN that also has an
+ IFN_COND_LEN_* or IFN_MASK_LEN_* form. */
+#define FOR_EACH_LEN_FN_PAIR(T) \
+ T (MASK_LOAD, MASK_LEN_LOAD) \
+ T (MASK_STORE, MASK_LEN_STORE) \
+ T (MASK_GATHER_LOAD, MASK_LEN_GATHER_LOAD) \
+ T (MASK_SCATTER_STORE, MASK_LEN_SCATTER_STORE) \
+ T (COND_ADD, COND_LEN_ADD) \
+ T (COND_SUB, COND_LEN_SUB) \
+ T (COND_MUL, COND_LEN_MUL) \
+ T (COND_DIV, COND_LEN_DIV) \
+ T (COND_MOD, COND_LEN_MOD) \
+ T (COND_RDIV, COND_LEN_RDIV) \
+ T (COND_FMIN, COND_LEN_FMIN) \
+ T (COND_FMAX, COND_LEN_FMAX) \
+ T (COND_MIN, COND_LEN_MIN) \
+ T (COND_MAX, COND_LEN_MAX) \
+ T (COND_AND, COND_LEN_AND) \
+ T (COND_IOR, COND_LEN_IOR) \
+ T (COND_XOR, COND_LEN_XOR) \
+ T (COND_SHL, COND_LEN_SHL) \
+ T (COND_SHR, COND_LEN_SHR) \
+ T (COND_NEG, COND_LEN_NEG) \
+ T (COND_FMA, COND_LEN_FMA) \
+ T (COND_FMS, COND_LEN_FMS) \
+ T (COND_FNMA, COND_LEN_FNMA) \
+ T (COND_FNMS, COND_LEN_FNMS)
+
+/* If there exists an internal function like IFN that operates on vectors,
+ but with additional length and bias parameters, return the internal_fn
+ for that function, otherwise return IFN_LAST. */
+internal_fn
+get_len_internal_fn (internal_fn fn)
+{
+ switch (fn)
+ {
+#define CASE(NAME, LEN_NAME) \
+ case IFN_##NAME: \
+ return IFN_##LEN_NAME;
+ FOR_EACH_LEN_FN_PAIR (CASE)
+#undef CASE
+ default:
+ return IFN_LAST;
+ }
+}
+
/* If IFN implements the conditional form of an unconditional internal
function, return that unconditional function, otherwise return IFN_LAST. */
diff --git a/gcc/internal-fn.h b/gcc/internal-fn.h
index a5c3f4765ff..410c1b623d6 100644
--- a/gcc/internal-fn.h
+++ b/gcc/internal-fn.h
@@ -224,6 +224,7 @@ extern bool set_edom_supported_p (void);
extern internal_fn get_conditional_internal_fn (tree_code);
extern internal_fn get_conditional_internal_fn (internal_fn);
+extern internal_fn get_len_internal_fn (internal_fn);
extern internal_fn get_conditional_len_internal_fn (tree_code);
extern tree_code conditional_internal_fn_code (internal_fn);
extern internal_fn get_unconditional_internal_fn (internal_fn);
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index 6a4e8fce126..ae5b0b09c08 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -3540,7 +3540,10 @@ vectorizable_call (vec_info *vinfo,
int reduc_idx = STMT_VINFO_REDUC_IDX (stmt_info);
internal_fn cond_fn = get_conditional_internal_fn (ifn);
+ internal_fn cond_len_fn = get_len_internal_fn (ifn);
+ int len_opno = internal_fn_len_index (cond_len_fn);
vec_loop_masks *masks = (loop_vinfo ? &LOOP_VINFO_MASKS (loop_vinfo) : NULL);
+ vec_loop_lens *lens = (loop_vinfo ? &LOOP_VINFO_LENS (loop_vinfo) : NULL);
if (!vec_stmt) /* transformation not required. */
{
if (slp_node)
@@ -3586,8 +3589,14 @@ vectorizable_call (vec_info *vinfo,
tree scalar_mask = NULL_TREE;
if (mask_opno >= 0)
scalar_mask = gimple_call_arg (stmt_info->stmt, mask_opno);
- vect_record_loop_mask (loop_vinfo, masks, nvectors,
- vectype_out, scalar_mask);
+ if (cond_len_fn != IFN_LAST
+ && direct_internal_fn_supported_p (cond_len_fn, vectype_out,
+ OPTIMIZE_FOR_SPEED))
+ vect_record_loop_len (loop_vinfo, lens, nvectors, vectype_out,
+ 1);
+ else
+ vect_record_loop_mask (loop_vinfo, masks, nvectors, vectype_out,
+ scalar_mask);
}
}
return true;
@@ -3603,8 +3612,24 @@ vectorizable_call (vec_info *vinfo,
vec_dest = vect_create_destination_var (scalar_dest, vectype_out);
bool masked_loop_p = loop_vinfo && LOOP_VINFO_FULLY_MASKED_P (loop_vinfo);
+ bool len_loop_p = loop_vinfo && LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo);
unsigned int vect_nargs = nargs;
- if (masked_loop_p && reduc_idx >= 0)
+ if (len_loop_p)
+ {
+ if (len_opno >= 0)
+ {
+ ifn = cond_len_fn;
+ /* COND_* -> COND_LEN_* takes 2 extra arguments:LEN,BIAS. */
+ vect_nargs += 2;
+ }
+ else if (reduc_idx >= 0)
+ {
+ /* FMA -> COND_LEN_FMA takes 4 extra arguments:MASK,ELSE,LEN,BIAS. */
+ ifn = get_len_internal_fn (cond_fn);
+ vect_nargs += 4;
+ }
+ }
+ else if (masked_loop_p && reduc_idx >= 0)
{
ifn = cond_fn;
vect_nargs += 2;
@@ -3629,7 +3654,18 @@ vectorizable_call (vec_info *vinfo,
FOR_EACH_VEC_ELT (vec_oprnds0, i, vec_oprnd0)
{
int varg = 0;
- if (masked_loop_p && reduc_idx >= 0)
+ if (len_loop_p && reduc_idx >= 0)
+ {
+ /* Always true for SLP. */
+ gcc_assert (ncopies == 1);
+ /* For COND_LEN_* operations used by reduction of
+ CALL vectorization, the LEN argument is the real
+ loop len produced by SELECT_VL or MIN whereas the
+ MASK argument here is the dummy mask. */
+ vargs[varg++]
+ = build_minus_one_cst (truth_type_for (vectype_out));
+ }
+ else if (masked_loop_p && reduc_idx >= 0)
{
unsigned int vec_num = vec_oprnds0.length ();
/* Always true for SLP. */
@@ -3644,7 +3680,7 @@ vectorizable_call (vec_info *vinfo,
vec<tree> vec_oprndsk = vec_defs[k];
vargs[varg++] = vec_oprndsk[i];
}
- if (masked_loop_p && reduc_idx >= 0)
+ if ((masked_loop_p || len_loop_p) && reduc_idx >= 0)
vargs[varg++] = vargs[reduc_idx + 1];
gimple *new_stmt;
if (modifier == NARROW)
@@ -3671,7 +3707,21 @@ vectorizable_call (vec_info *vinfo,
}
else
{
- if (mask_opno >= 0 && masked_loop_p)
+ if (len_opno >= 0 && len_loop_p)
+ {
+ unsigned int vec_num = vec_oprnds0.length ();
+ /* Always true for SLP. */
+ gcc_assert (ncopies == 1);
+ tree len
+ = vect_get_loop_len (loop_vinfo, gsi, lens, vec_num,
+ vectype_out, i, 1);
+ signed char biasval
+ = LOOP_VINFO_PARTIAL_LOAD_STORE_BIAS (loop_vinfo);
+ tree bias = build_int_cst (intQI_type_node, biasval);
+ vargs[len_opno] = len;
+ vargs[len_opno + 1] = bias;
+ }
+ else if (mask_opno >= 0 && masked_loop_p)
{
unsigned int vec_num = vec_oprnds0.length ();
/* Always true for SLP. */
@@ -3701,7 +3751,16 @@ vectorizable_call (vec_info *vinfo,
}
int varg = 0;
- if (masked_loop_p && reduc_idx >= 0)
+ if (len_loop_p && reduc_idx >= 0)
+ {
+ /* For COND_LEN_* operations used by reduction of
+ CALL vectorization, the LEN argument is the real
+ loop len produced by SELECT_VL or MIN whereas the
+ MASK argument here is the dummy mask. */
+ vargs[varg++]
+ = build_minus_one_cst (truth_type_for (vectype_out));
+ }
+ else if (masked_loop_p && reduc_idx >= 0)
vargs[varg++] = vect_get_loop_mask (loop_vinfo, gsi, masks, ncopies,
vectype_out, j);
for (i = 0; i < nargs; i++)
@@ -3716,10 +3775,20 @@ vectorizable_call (vec_info *vinfo,
}
vargs[varg++] = vec_defs[i][j];
}
- if (masked_loop_p && reduc_idx >= 0)
+ if ((masked_loop_p || len_loop_p) && reduc_idx >= 0)
vargs[varg++] = vargs[reduc_idx + 1];
- if (mask_opno >= 0 && masked_loop_p)
+ if (len_opno >= 0 && len_loop_p)
+ {
+ tree len = vect_get_loop_len (loop_vinfo, gsi, lens, ncopies,
+ vectype_out, j, 1);
+ signed char biasval
+ = LOOP_VINFO_PARTIAL_LOAD_STORE_BIAS (loop_vinfo);
+ tree bias = build_int_cst (intQI_type_node, biasval);
+ vargs[len_opno] = len;
+ vargs[len_opno + 1] = bias;
+ }
+ else if (mask_opno >= 0 && masked_loop_p)
{
tree mask = vect_get_loop_mask (loop_vinfo, gsi, masks, ncopies,
vectype_out, j);
--
2.36.1
* Re: [PATCH V2] VECT: Support CALL vectorization for COND_LEN_*
2023-07-28 7:10 [PATCH V2] VECT: Support CALL vectorization for COND_LEN_* Juzhe-Zhong
@ 2023-07-31 9:26 ` Richard Biener
2023-07-31 9:52 ` juzhe.zhong
2023-07-31 10:19 ` Richard Sandiford
1 sibling, 1 reply; 18+ messages in thread
From: Richard Biener @ 2023-07-31 9:26 UTC (permalink / raw)
To: Juzhe-Zhong; +Cc: gcc-patches, richard.sandiford
On Fri, 28 Jul 2023, Juzhe-Zhong wrote:
> Hi, Richard and Richi.
>
> Base on the suggestions from Richard:
> https://gcc.gnu.org/pipermail/gcc-patches/2023-July/625396.html
>
> This patch choose (1) approach that Richard provided, meaning:
>
> RVV implements cond_* optabs as expanders. RVV therefore supports
> both IFN_COND_ADD and IFN_COND_LEN_ADD. No dummy length arguments
> are needed at the gimple level.
>
> Such approach can make codes much cleaner and reasonable.
>
> Consider this following case:
> void foo (float * __restrict a, float * __restrict b, int * __restrict cond, int n)
> {
> for (int i = 0; i < n; i++)
> if (cond[i])
> a[i] = b[i] + a[i];
> }
>
>
> Output of RISC-V (32-bits) gcc (trunk) (Compiler #3)
> <source>:5:21: missed: couldn't vectorize loop
> <source>:5:21: missed: not vectorized: control flow in loop.
>
> ARM SVE:
>
> ...
> mask__27.10_51 = vect__4.9_49 != { 0, ... };
> ...
> vec_mask_and_55 = loop_mask_49 & mask__27.10_51;
> ...
> vect__9.17_62 = .COND_ADD (vec_mask_and_55, vect__6.13_56, vect__8.16_60, vect__6.13_56);
>
> For RVV, we want IR as follows:
>
> ...
> _68 = .SELECT_VL (ivtmp_66, POLY_INT_CST [4, 4]);
> ...
> mask__27.10_51 = vect__4.9_49 != { 0, ... };
> ...
> vect__9.17_60 = .COND_LEN_ADD (mask__27.10_51, vect__6.13_55, vect__8.16_59, vect__6.13_55, _68, 0);
> ...
>
> Both len and mask of COND_LEN_ADD are real not dummy.
>
> This patch has been fully tested in RISC-V port with supporting both COND_* and COND_LEN_*.
>
> And also, Bootstrap and Regression on X86 passed.
>
> OK for trunk?
>
> gcc/ChangeLog:
>
> * internal-fn.cc (FOR_EACH_LEN_FN_PAIR): New macro.
> (get_len_internal_fn): New function.
> (CASE): Ditto.
> * internal-fn.h (get_len_internal_fn): Ditto.
> * tree-vect-stmts.cc (vectorizable_call): Support CALL vectorization with COND_LEN_*.
>
> ---
> gcc/internal-fn.cc | 46 ++++++++++++++++++++++
> gcc/internal-fn.h | 1 +
> gcc/tree-vect-stmts.cc | 87 +++++++++++++++++++++++++++++++++++++-----
> 3 files changed, 125 insertions(+), 9 deletions(-)
>
> diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
> index 8e294286388..379220bebc7 100644
> --- a/gcc/internal-fn.cc
> +++ b/gcc/internal-fn.cc
> @@ -4443,6 +4443,52 @@ get_conditional_internal_fn (internal_fn fn)
> }
> }
>
> +/* Invoke T(IFN) for each internal function IFN that also has an
> + IFN_COND_LEN_* or IFN_MASK_LEN_* form. */
> +#define FOR_EACH_LEN_FN_PAIR(T) \
> + T (MASK_LOAD, MASK_LEN_LOAD) \
> + T (MASK_STORE, MASK_LEN_STORE) \
> + T (MASK_GATHER_LOAD, MASK_LEN_GATHER_LOAD) \
> + T (MASK_SCATTER_STORE, MASK_LEN_SCATTER_STORE) \
> + T (COND_ADD, COND_LEN_ADD) \
> + T (COND_SUB, COND_LEN_SUB) \
> + T (COND_MUL, COND_LEN_MUL) \
> + T (COND_DIV, COND_LEN_DIV) \
> + T (COND_MOD, COND_LEN_MOD) \
> + T (COND_RDIV, COND_LEN_RDIV) \
> + T (COND_FMIN, COND_LEN_FMIN) \
> + T (COND_FMAX, COND_LEN_FMAX) \
> + T (COND_MIN, COND_LEN_MIN) \
> + T (COND_MAX, COND_LEN_MAX) \
> + T (COND_AND, COND_LEN_AND) \
> + T (COND_IOR, COND_LEN_IOR) \
> + T (COND_XOR, COND_LEN_XOR) \
> + T (COND_SHL, COND_LEN_SHL) \
> + T (COND_SHR, COND_LEN_SHR) \
> + T (COND_NEG, COND_LEN_NEG) \
> + T (COND_FMA, COND_LEN_FMA) \
> + T (COND_FMS, COND_LEN_FMS) \
> + T (COND_FNMA, COND_LEN_FNMA) \
> + T (COND_FNMS, COND_LEN_FNMS)
> +
> +/* If there exists an internal function like IFN that operates on vectors,
> + but with additional length and bias parameters, return the internal_fn
> + for that function, otherwise return IFN_LAST. */
> +internal_fn
> +get_len_internal_fn (internal_fn fn)
> +{
> + switch (fn)
> + {
> +#define CASE(NAME, LEN_NAME) \
> + case IFN_##NAME: \
> + return IFN_##LEN_NAME;
> + FOR_EACH_LEN_FN_PAIR (CASE)
> +#undef CASE
> + default:
> + return IFN_LAST;
> + }
> +}
> +
> /* If IFN implements the conditional form of an unconditional internal
> function, return that unconditional function, otherwise return IFN_LAST. */
>
> diff --git a/gcc/internal-fn.h b/gcc/internal-fn.h
> index a5c3f4765ff..410c1b623d6 100644
> --- a/gcc/internal-fn.h
> +++ b/gcc/internal-fn.h
> @@ -224,6 +224,7 @@ extern bool set_edom_supported_p (void);
>
> extern internal_fn get_conditional_internal_fn (tree_code);
> extern internal_fn get_conditional_internal_fn (internal_fn);
> +extern internal_fn get_len_internal_fn (internal_fn);
> extern internal_fn get_conditional_len_internal_fn (tree_code);
> extern tree_code conditional_internal_fn_code (internal_fn);
> extern internal_fn get_unconditional_internal_fn (internal_fn);
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index 6a4e8fce126..ae5b0b09c08 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -3540,7 +3540,10 @@ vectorizable_call (vec_info *vinfo,
>
> int reduc_idx = STMT_VINFO_REDUC_IDX (stmt_info);
> internal_fn cond_fn = get_conditional_internal_fn (ifn);
> + internal_fn cond_len_fn = get_len_internal_fn (ifn);
> + int len_opno = internal_fn_len_index (cond_len_fn);
> vec_loop_masks *masks = (loop_vinfo ? &LOOP_VINFO_MASKS (loop_vinfo) : NULL);
> + vec_loop_lens *lens = (loop_vinfo ? &LOOP_VINFO_LENS (loop_vinfo) : NULL);
> if (!vec_stmt) /* transformation not required. */
> {
> if (slp_node)
> @@ -3586,8 +3589,14 @@ vectorizable_call (vec_info *vinfo,
Above, for reduc_idx >= 0 there's a check whether cond_fn is supported;
don't you need to amend that with a check for cond_len_fn?
> tree scalar_mask = NULL_TREE;
> if (mask_opno >= 0)
> scalar_mask = gimple_call_arg (stmt_info->stmt, mask_opno);
> - vect_record_loop_mask (loop_vinfo, masks, nvectors,
> - vectype_out, scalar_mask);
> + if (cond_len_fn != IFN_LAST
> + && direct_internal_fn_supported_p (cond_len_fn, vectype_out,
> + OPTIMIZE_FOR_SPEED))
> + vect_record_loop_len (loop_vinfo, lens, nvectors, vectype_out,
> + 1);
> + else
> + vect_record_loop_mask (loop_vinfo, masks, nvectors, vectype_out,
> + scalar_mask);
> }
> }
> return true;
> @@ -3603,8 +3612,24 @@ vectorizable_call (vec_info *vinfo,
> vec_dest = vect_create_destination_var (scalar_dest, vectype_out);
>
> bool masked_loop_p = loop_vinfo && LOOP_VINFO_FULLY_MASKED_P (loop_vinfo);
> + bool len_loop_p = loop_vinfo && LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo);
> unsigned int vect_nargs = nargs;
> - if (masked_loop_p && reduc_idx >= 0)
> + if (len_loop_p)
> + {
> + if (len_opno >= 0)
> + {
> + ifn = cond_len_fn;
> + /* COND_* -> COND_LEN_* takes 2 extra arguments:LEN,BIAS. */
> + vect_nargs += 2;
> + }
> + else if (reduc_idx >= 0)
> + {
> + /* FMA -> COND_LEN_FMA takes 4 extra arguments:MASK,ELSE,LEN,BIAS. */
> + ifn = get_len_internal_fn (cond_fn);
> + vect_nargs += 4;
I'm a bit confused (but also by the existing mask code): whether
vect_nargs needs adjustment depends on the IFN in the IL we analyze.
If if-conversion recognizes a .COND_ADD then we need to add nothing
for masking (that is, ifn == cond_fn already). In your code above
you either use cond_len_fn or get_len_internal_fn (cond_fn), but
isn't that the very same?! So how come you add two args in one case
and four in the other?
Please make sure to place gcc_unreachable () in each arm and check
you have test coverage. I believe that the else arm is unreachable
but when you vectorize .FMA you will need to add 4 and when you
vectorize .COND_FMA you will need to add two arguments (as said,
no idea why we special case reduc_idx >= 0 at the moment).
Otherwise the patch looks OK to me.
Thanks,
Richard.
> + }
> + }
> + else if (masked_loop_p && reduc_idx >= 0)
> {
> ifn = cond_fn;
> vect_nargs += 2;
> @@ -3629,7 +3654,18 @@ vectorizable_call (vec_info *vinfo,
> FOR_EACH_VEC_ELT (vec_oprnds0, i, vec_oprnd0)
> {
> int varg = 0;
> - if (masked_loop_p && reduc_idx >= 0)
> + if (len_loop_p && reduc_idx >= 0)
> + {
> + /* Always true for SLP. */
> + gcc_assert (ncopies == 1);
> + /* For COND_LEN_* operations used by reduction of
> + CALL vectorization, the LEN argument is the real
> + loop len produced by SELECT_VL or MIN wheras the
> + MASK argument here is the dummy mask. */
> + vargs[varg++]
> + = build_minus_one_cst (truth_type_for (vectype_out));
> + }
> + else if (masked_loop_p && reduc_idx >= 0)
> {
> unsigned int vec_num = vec_oprnds0.length ();
> /* Always true for SLP. */
> @@ -3644,7 +3680,7 @@ vectorizable_call (vec_info *vinfo,
> vec<tree> vec_oprndsk = vec_defs[k];
> vargs[varg++] = vec_oprndsk[i];
> }
> - if (masked_loop_p && reduc_idx >= 0)
> + if ((masked_loop_p || len_loop_p) && reduc_idx >= 0)
> vargs[varg++] = vargs[reduc_idx + 1];
> gimple *new_stmt;
> if (modifier == NARROW)
> @@ -3671,7 +3707,21 @@ vectorizable_call (vec_info *vinfo,
> }
> else
> {
> - if (mask_opno >= 0 && masked_loop_p)
> + if (len_opno >= 0 && len_loop_p)
> + {
> + unsigned int vec_num = vec_oprnds0.length ();
> + /* Always true for SLP. */
> + gcc_assert (ncopies == 1);
> + tree len
> + = vect_get_loop_len (loop_vinfo, gsi, lens, vec_num,
> + vectype_out, i, 1);
> + signed char biasval
> + = LOOP_VINFO_PARTIAL_LOAD_STORE_BIAS (loop_vinfo);
> + tree bias = build_int_cst (intQI_type_node, biasval);
> + vargs[len_opno] = len;
> + vargs[len_opno + 1] = bias;
> + }
> + else if (mask_opno >= 0 && masked_loop_p)
> {
> unsigned int vec_num = vec_oprnds0.length ();
> /* Always true for SLP. */
> @@ -3701,7 +3751,16 @@ vectorizable_call (vec_info *vinfo,
> }
>
> int varg = 0;
> - if (masked_loop_p && reduc_idx >= 0)
> + if (len_loop_p && reduc_idx >= 0)
> + {
> + /* For COND_LEN_* operations used by reduction of
> + CALL vectorization, the LEN argument is the real
> + loop len produced by SELECT_VL or MIN wheras the
> + MASK argument here is the dummy mask. */
> + vargs[varg++]
> + = build_minus_one_cst (truth_type_for (vectype_out));
> + }
> + else if (masked_loop_p && reduc_idx >= 0)
> vargs[varg++] = vect_get_loop_mask (loop_vinfo, gsi, masks, ncopies,
> vectype_out, j);
> for (i = 0; i < nargs; i++)
> @@ -3716,10 +3775,20 @@ vectorizable_call (vec_info *vinfo,
> }
> vargs[varg++] = vec_defs[i][j];
> }
> - if (masked_loop_p && reduc_idx >= 0)
> + if ((masked_loop_p || len_loop_p) && reduc_idx >= 0)
> vargs[varg++] = vargs[reduc_idx + 1];
>
> - if (mask_opno >= 0 && masked_loop_p)
> + if (len_opno >= 0 && len_loop_p)
> + {
> + tree len = vect_get_loop_len (loop_vinfo, gsi, lens, ncopies,
> + vectype_out, j, 1);
> + signed char biasval
> + = LOOP_VINFO_PARTIAL_LOAD_STORE_BIAS (loop_vinfo);
> + tree bias = build_int_cst (intQI_type_node, biasval);
> + vargs[len_opno] = len;
> + vargs[len_opno + 1] = bias;
> + }
> + else if (mask_opno >= 0 && masked_loop_p)
> {
> tree mask = vect_get_loop_mask (loop_vinfo, gsi, masks, ncopies,
> vectype_out, j);
>
--
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
* Re: Re: [PATCH V2] VECT: Support CALL vectorization for COND_LEN_*
2023-07-31 9:26 ` Richard Biener
@ 2023-07-31 9:52 ` juzhe.zhong
2023-07-31 10:45 ` Richard Biener
0 siblings, 1 reply; 18+ messages in thread
From: juzhe.zhong @ 2023-07-31 9:52 UTC (permalink / raw)
To: rguenther; +Cc: gcc-patches, richard.sandiford
Hi, Richard. Thanks a lot for the comment
>> In your code above
>> you either use cond_len_fn or get_len_internal_fn (cond_fn) but
>> isn't that the very same?! So how come you in one case add two
>> and in the other add four args?
cond_len_fn is not the same as get_len_internal_fn (cond_fn) when vectorizing FMA into COND_LEN_FMA.
since "internal_fn cond_len_fn = get_len_internal_fn (ifn);"
and the iterators:
> +#define FOR_EACH_LEN_FN_PAIR(T) \
> + T (MASK_LOAD, MASK_LEN_LOAD) \
> + T (MASK_STORE, MASK_LEN_STORE) \
> + T (MASK_GATHER_LOAD, MASK_LEN_GATHER_LOAD) \
> + T (MASK_SCATTER_STORE, MASK_LEN_SCATTER_STORE) \
> + T (COND_ADD, COND_LEN_ADD) \
> + T (COND_SUB, COND_LEN_SUB) \
> + T (COND_MUL, COND_LEN_MUL) \
> + T (COND_DIV, COND_LEN_DIV) \
> + T (COND_MOD, COND_LEN_MOD) \
> + T (COND_RDIV, COND_LEN_RDIV) \
> + T (COND_FMIN, COND_LEN_FMIN) \
> + T (COND_FMAX, COND_LEN_FMAX) \
> + T (COND_MIN, COND_LEN_MIN) \
> + T (COND_MAX, COND_LEN_MAX) \
> + T (COND_AND, COND_LEN_AND) \
> + T (COND_IOR, COND_LEN_IOR) \
> + T (COND_XOR, COND_LEN_XOR) \
> + T (COND_SHL, COND_LEN_SHL) \
> + T (COND_SHR, COND_LEN_SHR) \
> + T (COND_NEG, COND_LEN_NEG) \
> + T (COND_FMA, COND_LEN_FMA) \
> + T (COND_FMS, COND_LEN_FMS) \
> + T (COND_FNMA, COND_LEN_FNMA) \
> + T (COND_FNMS, COND_LEN_FNMS)
So, cond_len_fn will be IFN_LAST when ifn = FMA.
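In other words, spelling out the two lookups with the pairs above (just a sketch):

  get_len_internal_fn (IFN_COND_ADD) -> IFN_COND_LEN_ADD  (len_opno >= 0, so the COND_* -> COND_LEN_* arm is taken)
  get_len_internal_fn (IFN_FMA)      -> IFN_LAST          (no direct LEN form, so we fall back to the reduc_idx >= 0 arm)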
Would it be reasonable for me to add 4 more pairs here?
> + T (FMA, COND_LEN_FMA) \
> + T (FMS, COND_LEN_FMS) \
> + T (FNMA, COND_LEN_FNMA) \
> + T (FNMS, COND_LEN_FNMS)
That way we wouldn't need "get_len_internal_fn (cond_fn)".
When vectorizing COND_ADD into COND_LEN_ADD we already have the "mask" and "else" values,
so we only need to add 2 arguments.
But when vectorizing FMA into COND_LEN_FMA, we need to add 4 arguments (mask, else, len, bias).
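To make the argument counting concrete, the call shapes look roughly like this (the operand names are only illustrative):

  .COND_ADD (MASK, A, B, ELSE)  ->  .COND_LEN_ADD (MASK, A, B, ELSE, LEN, BIAS)     (+2: LEN, BIAS)
  .FMA (A, B, C)                ->  .COND_LEN_FMA (MASK, A, B, C, ELSE, LEN, BIAS)  (+4: MASK, ELSE, LEN, BIAS)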
>>as said,
>>no idea why we special case reduc_idx >= 0 at the moment
I also want to vectorize FMA into COND_LEN_FMA even when reduc_idx < 0.
Could I relax this condition for COND_LEN_*, since it would improve RVV codegen a lot?
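For a plain (non-reduction) FMA loop that would mean emitting something along these lines (only a sketch, the vect_* names are made up; the mask is the dummy all-ones constant and only the len is real):

  _68 = .SELECT_VL (ivtmp_66, POLY_INT_CST [4, 4]);
  ...
  vect_res = .COND_LEN_FMA ({ -1, ... }, vect_a, vect_b, vect_c, vect_c, _68, 0);

The else value hardly matters here since the mask is all ones and lanes beyond _68 are inactive.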
Thanks.
juzhe.zhong@rivai.ai
From: Richard Biener
Date: 2023-07-31 17:26
To: Juzhe-Zhong
CC: gcc-patches; richard.sandiford
Subject: Re: [PATCH V2] VECT: Support CALL vectorization for COND_LEN_*
On Fri, 28 Jul 2023, Juzhe-Zhong wrote:
> Hi, Richard and Richi.
>
> Base on the suggestions from Richard:
> https://gcc.gnu.org/pipermail/gcc-patches/2023-July/625396.html
>
> This patch choose (1) approach that Richard provided, meaning:
>
> RVV implements cond_* optabs as expanders. RVV therefore supports
> both IFN_COND_ADD and IFN_COND_LEN_ADD. No dummy length arguments
> are needed at the gimple level.
>
> Such approach can make codes much cleaner and reasonable.
>
> Consider this following case:
> void foo (float * __restrict a, float * __restrict b, int * __restrict cond, int n)
> {
> for (int i = 0; i < n; i++)
> if (cond[i])
> a[i] = b[i] + a[i];
> }
>
>
> Output of RISC-V (32-bits) gcc (trunk) (Compiler #3)
> <source>:5:21: missed: couldn't vectorize loop
> <source>:5:21: missed: not vectorized: control flow in loop.
>
> ARM SVE:
>
> ...
> mask__27.10_51 = vect__4.9_49 != { 0, ... };
> ...
> vec_mask_and_55 = loop_mask_49 & mask__27.10_51;
> ...
> vect__9.17_62 = .COND_ADD (vec_mask_and_55, vect__6.13_56, vect__8.16_60, vect__6.13_56);
>
> For RVV, we want IR as follows:
>
> ...
> _68 = .SELECT_VL (ivtmp_66, POLY_INT_CST [4, 4]);
> ...
> mask__27.10_51 = vect__4.9_49 != { 0, ... };
> ...
> vect__9.17_60 = .COND_LEN_ADD (mask__27.10_51, vect__6.13_55, vect__8.16_59, vect__6.13_55, _68, 0);
> ...
>
> Both len and mask of COND_LEN_ADD are real not dummy.
>
> This patch has been fully tested in RISC-V port with supporting both COND_* and COND_LEN_*.
>
> And also, Bootstrap and Regression on X86 passed.
>
> OK for trunk?
>
> gcc/ChangeLog:
>
> * internal-fn.cc (FOR_EACH_LEN_FN_PAIR): New macro.
> (get_len_internal_fn): New function.
> (CASE): Ditto.
> * internal-fn.h (get_len_internal_fn): Ditto.
> * tree-vect-stmts.cc (vectorizable_call): Support CALL vectorization with COND_LEN_*.
>
> ---
> gcc/internal-fn.cc | 46 ++++++++++++++++++++++
> gcc/internal-fn.h | 1 +
> gcc/tree-vect-stmts.cc | 87 +++++++++++++++++++++++++++++++++++++-----
> 3 files changed, 125 insertions(+), 9 deletions(-)
>
> diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
> index 8e294286388..379220bebc7 100644
> --- a/gcc/internal-fn.cc
> +++ b/gcc/internal-fn.cc
> @@ -4443,6 +4443,52 @@ get_conditional_internal_fn (internal_fn fn)
> }
> }
>
> +/* Invoke T(IFN) for each internal function IFN that also has an
> + IFN_COND_LEN_* or IFN_MASK_LEN_* form. */
> +#define FOR_EACH_LEN_FN_PAIR(T) \
> + T (MASK_LOAD, MASK_LEN_LOAD) \
> + T (MASK_STORE, MASK_LEN_STORE) \
> + T (MASK_GATHER_LOAD, MASK_LEN_GATHER_LOAD) \
> + T (MASK_SCATTER_STORE, MASK_LEN_SCATTER_STORE) \
> + T (COND_ADD, COND_LEN_ADD) \
> + T (COND_SUB, COND_LEN_SUB) \
> + T (COND_MUL, COND_LEN_MUL) \
> + T (COND_DIV, COND_LEN_DIV) \
> + T (COND_MOD, COND_LEN_MOD) \
> + T (COND_RDIV, COND_LEN_RDIV) \
> + T (COND_FMIN, COND_LEN_FMIN) \
> + T (COND_FMAX, COND_LEN_FMAX) \
> + T (COND_MIN, COND_LEN_MIN) \
> + T (COND_MAX, COND_LEN_MAX) \
> + T (COND_AND, COND_LEN_AND) \
> + T (COND_IOR, COND_LEN_IOR) \
> + T (COND_XOR, COND_LEN_XOR) \
> + T (COND_SHL, COND_LEN_SHL) \
> + T (COND_SHR, COND_LEN_SHR) \
> + T (COND_NEG, COND_LEN_NEG) \
> + T (COND_FMA, COND_LEN_FMA) \
> + T (COND_FMS, COND_LEN_FMS) \
> + T (COND_FNMA, COND_LEN_FNMA) \
> + T (COND_FNMS, COND_LEN_FNMS)
> +
> +/* If there exists an internal function like IFN that operates on vectors,
> + but with additional length and bias parameters, return the internal_fn
> + for that function, otherwise return IFN_LAST. */
> +internal_fn
> +get_len_internal_fn (internal_fn fn)
> +{
> + switch (fn)
> + {
> +#define CASE(NAME, LEN_NAME) \
> + case IFN_##NAME: \
> + return IFN_##LEN_NAME;
> + FOR_EACH_LEN_FN_PAIR (CASE)
> +#undef CASE
> + default:
> + return IFN_LAST;
> + }
> +}
> +
> /* If IFN implements the conditional form of an unconditional internal
> function, return that unconditional function, otherwise return IFN_LAST. */
>
> diff --git a/gcc/internal-fn.h b/gcc/internal-fn.h
> index a5c3f4765ff..410c1b623d6 100644
> --- a/gcc/internal-fn.h
> +++ b/gcc/internal-fn.h
> @@ -224,6 +224,7 @@ extern bool set_edom_supported_p (void);
>
> extern internal_fn get_conditional_internal_fn (tree_code);
> extern internal_fn get_conditional_internal_fn (internal_fn);
> +extern internal_fn get_len_internal_fn (internal_fn);
> extern internal_fn get_conditional_len_internal_fn (tree_code);
> extern tree_code conditional_internal_fn_code (internal_fn);
> extern internal_fn get_unconditional_internal_fn (internal_fn);
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index 6a4e8fce126..ae5b0b09c08 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -3540,7 +3540,10 @@ vectorizable_call (vec_info *vinfo,
>
> int reduc_idx = STMT_VINFO_REDUC_IDX (stmt_info);
> internal_fn cond_fn = get_conditional_internal_fn (ifn);
> + internal_fn cond_len_fn = get_len_internal_fn (ifn);
> + int len_opno = internal_fn_len_index (cond_len_fn);
> vec_loop_masks *masks = (loop_vinfo ? &LOOP_VINFO_MASKS (loop_vinfo) : NULL);
> + vec_loop_lens *lens = (loop_vinfo ? &LOOP_VINFO_LENS (loop_vinfo) : NULL);
> if (!vec_stmt) /* transformation not required. */
> {
> if (slp_node)
> @@ -3586,8 +3589,14 @@ vectorizable_call (vec_info *vinfo,
Above for reduc_idx >= 0 there's a check whether cond_fn is supported,
don't you need to amend that with a check for cond_len_fn?
> tree scalar_mask = NULL_TREE;
> if (mask_opno >= 0)
> scalar_mask = gimple_call_arg (stmt_info->stmt, mask_opno);
> - vect_record_loop_mask (loop_vinfo, masks, nvectors,
> - vectype_out, scalar_mask);
> + if (cond_len_fn != IFN_LAST
> + && direct_internal_fn_supported_p (cond_len_fn, vectype_out,
> + OPTIMIZE_FOR_SPEED))
> + vect_record_loop_len (loop_vinfo, lens, nvectors, vectype_out,
> + 1);
> + else
> + vect_record_loop_mask (loop_vinfo, masks, nvectors, vectype_out,
> + scalar_mask);
> }
> }
> return true;
> @@ -3603,8 +3612,24 @@ vectorizable_call (vec_info *vinfo,
> vec_dest = vect_create_destination_var (scalar_dest, vectype_out);
>
> bool masked_loop_p = loop_vinfo && LOOP_VINFO_FULLY_MASKED_P (loop_vinfo);
> + bool len_loop_p = loop_vinfo && LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo);
> unsigned int vect_nargs = nargs;
> - if (masked_loop_p && reduc_idx >= 0)
> + if (len_loop_p)
> + {
> + if (len_opno >= 0)
> + {
> + ifn = cond_len_fn;
> + /* COND_* -> COND_LEN_* takes 2 extra arguments:LEN,BIAS. */
> + vect_nargs += 2;
> + }
> + else if (reduc_idx >= 0)
> + {
> + /* FMA -> COND_LEN_FMA takes 4 extra arguments:MASK,ELSE,LEN,BIAS. */
> + ifn = get_len_internal_fn (cond_fn);
> + vect_nargs += 4;
I'm a bit confused (but also by the existing mask code), whether
vect_nargs needs adjustment depends on the IFN in the IL we analyze.
If if-conversion recognizes a .COND_ADD then we need to add nothing
for masking (that is, ifn == cond_fn already). In your code above
you either use cond_len_fn or get_len_internal_fn (cond_fn) but
isn't that the very same?! So how come you in one case add two
and in the other add four args?
Please make sure to place gcc_unreachable () in each arm and check
you have test coverage. I believe that the else arm is unreachable
but when you vectorize .FMA you will need to add 4 and when you
vectorize .COND_FMA you will need to add two arguments (as said,
no idea why we special case reduc_idx >= 0 at the moment).
Otherwise the patch looks OK to me.
Thanks,
Richard.
> + }
> + }
> + else if (masked_loop_p && reduc_idx >= 0)
> {
> ifn = cond_fn;
> vect_nargs += 2;
> @@ -3629,7 +3654,18 @@ vectorizable_call (vec_info *vinfo,
> FOR_EACH_VEC_ELT (vec_oprnds0, i, vec_oprnd0)
> {
> int varg = 0;
> - if (masked_loop_p && reduc_idx >= 0)
> + if (len_loop_p && reduc_idx >= 0)
> + {
> + /* Always true for SLP. */
> + gcc_assert (ncopies == 1);
> + /* For COND_LEN_* operations used by reduction of
> + CALL vectorization, the LEN argument is the real
> + loop len produced by SELECT_VL or MIN wheras the
> + MASK argument here is the dummy mask. */
> + vargs[varg++]
> + = build_minus_one_cst (truth_type_for (vectype_out));
> + }
> + else if (masked_loop_p && reduc_idx >= 0)
> {
> unsigned int vec_num = vec_oprnds0.length ();
> /* Always true for SLP. */
> @@ -3644,7 +3680,7 @@ vectorizable_call (vec_info *vinfo,
> vec<tree> vec_oprndsk = vec_defs[k];
> vargs[varg++] = vec_oprndsk[i];
> }
> - if (masked_loop_p && reduc_idx >= 0)
> + if ((masked_loop_p || len_loop_p) && reduc_idx >= 0)
> vargs[varg++] = vargs[reduc_idx + 1];
> gimple *new_stmt;
> if (modifier == NARROW)
> @@ -3671,7 +3707,21 @@ vectorizable_call (vec_info *vinfo,
> }
> else
> {
> - if (mask_opno >= 0 && masked_loop_p)
> + if (len_opno >= 0 && len_loop_p)
> + {
> + unsigned int vec_num = vec_oprnds0.length ();
> + /* Always true for SLP. */
> + gcc_assert (ncopies == 1);
> + tree len
> + = vect_get_loop_len (loop_vinfo, gsi, lens, vec_num,
> + vectype_out, i, 1);
> + signed char biasval
> + = LOOP_VINFO_PARTIAL_LOAD_STORE_BIAS (loop_vinfo);
> + tree bias = build_int_cst (intQI_type_node, biasval);
> + vargs[len_opno] = len;
> + vargs[len_opno + 1] = bias;
> + }
> + else if (mask_opno >= 0 && masked_loop_p)
> {
> unsigned int vec_num = vec_oprnds0.length ();
> /* Always true for SLP. */
> @@ -3701,7 +3751,16 @@ vectorizable_call (vec_info *vinfo,
> }
>
> int varg = 0;
> - if (masked_loop_p && reduc_idx >= 0)
> + if (len_loop_p && reduc_idx >= 0)
> + {
> + /* For COND_LEN_* operations used by reduction of
> + CALL vectorization, the LEN argument is the real
> + loop len produced by SELECT_VL or MIN wheras the
> + MASK argument here is the dummy mask. */
> + vargs[varg++]
> + = build_minus_one_cst (truth_type_for (vectype_out));
> + }
> + else if (masked_loop_p && reduc_idx >= 0)
> vargs[varg++] = vect_get_loop_mask (loop_vinfo, gsi, masks, ncopies,
> vectype_out, j);
> for (i = 0; i < nargs; i++)
> @@ -3716,10 +3775,20 @@ vectorizable_call (vec_info *vinfo,
> }
> vargs[varg++] = vec_defs[i][j];
> }
> - if (masked_loop_p && reduc_idx >= 0)
> + if ((masked_loop_p || len_loop_p) && reduc_idx >= 0)
> vargs[varg++] = vargs[reduc_idx + 1];
>
> - if (mask_opno >= 0 && masked_loop_p)
> + if (len_opno >= 0 && len_loop_p)
> + {
> + tree len = vect_get_loop_len (loop_vinfo, gsi, lens, ncopies,
> + vectype_out, j, 1);
> + signed char biasval
> + = LOOP_VINFO_PARTIAL_LOAD_STORE_BIAS (loop_vinfo);
> + tree bias = build_int_cst (intQI_type_node, biasval);
> + vargs[len_opno] = len;
> + vargs[len_opno + 1] = bias;
> + }
> + else if (mask_opno >= 0 && masked_loop_p)
> {
> tree mask = vect_get_loop_mask (loop_vinfo, gsi, masks, ncopies,
> vectype_out, j);
>
--
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
* Re: [PATCH V2] VECT: Support CALL vectorization for COND_LEN_*
2023-07-28 7:10 [PATCH V2] VECT: Support CALL vectorization for COND_LEN_* Juzhe-Zhong
2023-07-31 9:26 ` Richard Biener
@ 2023-07-31 10:19 ` Richard Sandiford
2023-07-31 11:52 ` juzhe.zhong
1 sibling, 1 reply; 18+ messages in thread
From: Richard Sandiford @ 2023-07-31 10:19 UTC (permalink / raw)
To: Juzhe-Zhong; +Cc: gcc-patches, rguenther
Juzhe-Zhong <juzhe.zhong@rivai.ai> writes:
> Hi, Richard and Richi.
>
> Base on the suggestions from Richard:
> https://gcc.gnu.org/pipermail/gcc-patches/2023-July/625396.html
>
> This patch choose (1) approach that Richard provided, meaning:
>
> RVV implements cond_* optabs as expanders. RVV therefore supports
> both IFN_COND_ADD and IFN_COND_LEN_ADD. No dummy length arguments
> are needed at the gimple level.
>
> Such approach can make codes much cleaner and reasonable.
>
> Consider this following case:
> void foo (float * __restrict a, float * __restrict b, int * __restrict cond, int n)
> {
> for (int i = 0; i < n; i++)
> if (cond[i])
> a[i] = b[i] + a[i];
> }
>
>
> Output of RISC-V (32-bits) gcc (trunk) (Compiler #3)
> <source>:5:21: missed: couldn't vectorize loop
> <source>:5:21: missed: not vectorized: control flow in loop.
>
> ARM SVE:
>
> ...
> mask__27.10_51 = vect__4.9_49 != { 0, ... };
> ...
> vec_mask_and_55 = loop_mask_49 & mask__27.10_51;
> ...
> vect__9.17_62 = .COND_ADD (vec_mask_and_55, vect__6.13_56, vect__8.16_60, vect__6.13_56);
>
> For RVV, we want IR as follows:
>
> ...
> _68 = .SELECT_VL (ivtmp_66, POLY_INT_CST [4, 4]);
> ...
> mask__27.10_51 = vect__4.9_49 != { 0, ... };
> ...
> vect__9.17_60 = .COND_LEN_ADD (mask__27.10_51, vect__6.13_55, vect__8.16_59, vect__6.13_55, _68, 0);
> ...
>
> Both len and mask of COND_LEN_ADD are real not dummy.
>
> This patch has been fully tested in RISC-V port with supporting both COND_* and COND_LEN_*.
>
> And also, Bootstrap and Regression on X86 passed.
>
> OK for trunk?
>
> gcc/ChangeLog:
>
> * internal-fn.cc (FOR_EACH_LEN_FN_PAIR): New macro.
> (get_len_internal_fn): New function.
> (CASE): Ditto.
> * internal-fn.h (get_len_internal_fn): Ditto.
> * tree-vect-stmts.cc (vectorizable_call): Support CALL vectorization with COND_LEN_*.
>
> ---
> gcc/internal-fn.cc | 46 ++++++++++++++++++++++
> gcc/internal-fn.h | 1 +
> gcc/tree-vect-stmts.cc | 87 +++++++++++++++++++++++++++++++++++++-----
> 3 files changed, 125 insertions(+), 9 deletions(-)
>
> diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
> index 8e294286388..379220bebc7 100644
> --- a/gcc/internal-fn.cc
> +++ b/gcc/internal-fn.cc
> @@ -4443,6 +4443,52 @@ get_conditional_internal_fn (internal_fn fn)
> }
> }
>
> +/* Invoke T(IFN) for each internal function IFN that also has an
> + IFN_COND_LEN_* or IFN_MASK_LEN_* form. */
> +#define FOR_EACH_LEN_FN_PAIR(T) \
> + T (MASK_LOAD, MASK_LEN_LOAD) \
> + T (MASK_STORE, MASK_LEN_STORE) \
> + T (MASK_GATHER_LOAD, MASK_LEN_GATHER_LOAD) \
> + T (MASK_SCATTER_STORE, MASK_LEN_SCATTER_STORE) \
> + T (COND_ADD, COND_LEN_ADD) \
> + T (COND_SUB, COND_LEN_SUB) \
> + T (COND_MUL, COND_LEN_MUL) \
> + T (COND_DIV, COND_LEN_DIV) \
> + T (COND_MOD, COND_LEN_MOD) \
> + T (COND_RDIV, COND_LEN_RDIV) \
> + T (COND_FMIN, COND_LEN_FMIN) \
> + T (COND_FMAX, COND_LEN_FMAX) \
> + T (COND_MIN, COND_LEN_MIN) \
> + T (COND_MAX, COND_LEN_MAX) \
> + T (COND_AND, COND_LEN_AND) \
> + T (COND_IOR, COND_LEN_IOR) \
> + T (COND_XOR, COND_LEN_XOR) \
> + T (COND_SHL, COND_LEN_SHL) \
> + T (COND_SHR, COND_LEN_SHR) \
> + T (COND_NEG, COND_LEN_NEG) \
> + T (COND_FMA, COND_LEN_FMA) \
> + T (COND_FMS, COND_LEN_FMS) \
> + T (COND_FNMA, COND_LEN_FNMA) \
> + T (COND_FNMS, COND_LEN_FNMS)
With the earlier patch to add DEF_INTERNAL_COND_FN and
DEF_INTERNAL_SIGNED_COND_FN, I think we should use those to handle
the COND_* cases, rather than putting them in this macro.
Thanks,
Richard
* Re: Re: [PATCH V2] VECT: Support CALL vectorization for COND_LEN_*
2023-07-31 9:52 ` juzhe.zhong
@ 2023-07-31 10:45 ` Richard Biener
2023-07-31 10:59 ` juzhe.zhong
0 siblings, 1 reply; 18+ messages in thread
From: Richard Biener @ 2023-07-31 10:45 UTC (permalink / raw)
To: juzhe.zhong; +Cc: gcc-patches, richard.sandiford
On Mon, 31 Jul 2023, juzhe.zhong@rivai.ai wrote:
> Hi, Richard. Thanks a lot for the comment
>
> >> In your code above
> >> you either use cond_len_fn or get_len_internal_fn (cond_fn) but
> >> isn't that the very same?! So how come you in one case add two
> >> and in the other add four args?
>
> cond_len_fn is not the same as get_len_internal_fn (cond_fn) when vectorizing FMA into COND_LEN_FMA.
>
> since "internal_fn cond_len_fn = get_len_internal_fn (ifn);"
>
> and the iterators:
> > +#define FOR_EACH_LEN_FN_PAIR(T) \
> > + T (MASK_LOAD, MASK_LEN_LOAD) \
> > + T (MASK_STORE, MASK_LEN_STORE) \
> > + T (MASK_GATHER_LOAD, MASK_LEN_GATHER_LOAD) \
> > + T (MASK_SCATTER_STORE, MASK_LEN_SCATTER_STORE) \
> > + T (COND_ADD, COND_LEN_ADD) \
> > + T (COND_SUB, COND_LEN_SUB) \
> > + T (COND_MUL, COND_LEN_MUL) \
> > + T (COND_DIV, COND_LEN_DIV) \
> > + T (COND_MOD, COND_LEN_MOD) \
> > + T (COND_RDIV, COND_LEN_RDIV) \
> > + T (COND_FMIN, COND_LEN_FMIN) \
> > + T (COND_FMAX, COND_LEN_FMAX) \
> > + T (COND_MIN, COND_LEN_MIN) \
> > + T (COND_MAX, COND_LEN_MAX) \
> > + T (COND_AND, COND_LEN_AND) \
> > + T (COND_IOR, COND_LEN_IOR) \
> > + T (COND_XOR, COND_LEN_XOR) \
> > + T (COND_SHL, COND_LEN_SHL) \
> > + T (COND_SHR, COND_LEN_SHR) \
> > + T (COND_NEG, COND_LEN_NEG) \
> > + T (COND_FMA, COND_LEN_FMA) \
> > + T (COND_FMS, COND_LEN_FMS) \
> > + T (COND_FNMA, COND_LEN_FNMA) \
> > + T (COND_FNMS, COND_LEN_FNMS)
>
> So, cond_len_fn will be IFN_LAST when ifn = FMA.
Ah. So then just feed it cond_fn? I mean, we don't have
LEN_FMA; the only LEN-but-not-MASK ifns are those used by
power/s390, LEN_LOAD and LEN_STORE?
> Maybe is it reasonable that I add 4 more iterators here?
> > + T (FMA, COND_LEN_FMA) \
> > + T (FMS, COND_LEN_FMS) \
> > + T (FNMA, COND_LEN_FNMA) \
> > + T (FNMS, COND_LEN_FNMS)
>
> So that we won't need to have "get_len_internal_fn (cond_fn)"
No, as said we don't have LEN_FMA.
> When vectorizing COND_ADD into COND_LEN_ADD we already have "mask" and "else" value.
> So we only need to add 2 arguments.
>
> But when vectorizing FMA into COND_LEN_FMA, we need to add 4 arguments (mask,else,len,bias).
Yes, but all of this depends on what the original ifn is, no?
> >>as said,
> >>no idea why we special case reduc_idx >= 0 at the moment
>
> I also want to vectorize FMA into COND_LEN_FMA even reduc_idx < 0.
> Could I relax this condition for COND_LEN_* since it will improve RVV codegen a lot.
reduc_idx < 0 means this stmt isn't part of a reduction. So yes,
you can vectorize FMA as COND_LEN_FMA with dummy mask and len if you
don't have FMA expanders?
Richard.
> Thanks.
>
>
> juzhe.zhong@rivai.ai
>
> From: Richard Biener
> Date: 2023-07-31 17:26
> To: Juzhe-Zhong
> CC: gcc-patches; richard.sandiford
> Subject: Re: [PATCH V2] VECT: Support CALL vectorization for COND_LEN_*
> On Fri, 28 Jul 2023, Juzhe-Zhong wrote:
>
> > Hi, Richard and Richi.
> >
> > Base on the suggestions from Richard:
> > https://gcc.gnu.org/pipermail/gcc-patches/2023-July/625396.html
> >
> > This patch choose (1) approach that Richard provided, meaning:
> >
> > RVV implements cond_* optabs as expanders. RVV therefore supports
> > both IFN_COND_ADD and IFN_COND_LEN_ADD. No dummy length arguments
> > are needed at the gimple level.
> >
> > Such approach can make codes much cleaner and reasonable.
> >
> > Consider this following case:
> > void foo (float * __restrict a, float * __restrict b, int * __restrict cond, int n)
> > {
> > for (int i = 0; i < n; i++)
> > if (cond[i])
> > a[i] = b[i] + a[i];
> > }
> >
> >
> > Output of RISC-V (32-bits) gcc (trunk) (Compiler #3)
> > <source>:5:21: missed: couldn't vectorize loop
> > <source>:5:21: missed: not vectorized: control flow in loop.
> >
> > ARM SVE:
> >
> > ...
> > mask__27.10_51 = vect__4.9_49 != { 0, ... };
> > ...
> > vec_mask_and_55 = loop_mask_49 & mask__27.10_51;
> > ...
> > vect__9.17_62 = .COND_ADD (vec_mask_and_55, vect__6.13_56, vect__8.16_60, vect__6.13_56);
> >
> > For RVV, we want IR as follows:
> >
> > ...
> > _68 = .SELECT_VL (ivtmp_66, POLY_INT_CST [4, 4]);
> > ...
> > mask__27.10_51 = vect__4.9_49 != { 0, ... };
> > ...
> > vect__9.17_60 = .COND_LEN_ADD (mask__27.10_51, vect__6.13_55, vect__8.16_59, vect__6.13_55, _68, 0);
> > ...
> >
> > Both len and mask of COND_LEN_ADD are real not dummy.
> >
> > This patch has been fully tested in RISC-V port with supporting both COND_* and COND_LEN_*.
> >
> > And also, Bootstrap and Regression on X86 passed.
> >
> > OK for trunk?
> >
> > gcc/ChangeLog:
> >
> > * internal-fn.cc (FOR_EACH_LEN_FN_PAIR): New macro.
> > (get_len_internal_fn): New function.
> > (CASE): Ditto.
> > * internal-fn.h (get_len_internal_fn): Ditto.
> > * tree-vect-stmts.cc (vectorizable_call): Support CALL vectorization with COND_LEN_*.
> >
> > ---
> > gcc/internal-fn.cc | 46 ++++++++++++++++++++++
> > gcc/internal-fn.h | 1 +
> > gcc/tree-vect-stmts.cc | 87 +++++++++++++++++++++++++++++++++++++-----
> > 3 files changed, 125 insertions(+), 9 deletions(-)
> >
> > diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
> > index 8e294286388..379220bebc7 100644
> > --- a/gcc/internal-fn.cc
> > +++ b/gcc/internal-fn.cc
> > @@ -4443,6 +4443,52 @@ get_conditional_internal_fn (internal_fn fn)
> > }
> > }
> >
> > +/* Invoke T(IFN) for each internal function IFN that also has an
> > + IFN_COND_LEN_* or IFN_MASK_LEN_* form. */
> > +#define FOR_EACH_LEN_FN_PAIR(T) \
> > + T (MASK_LOAD, MASK_LEN_LOAD) \
> > + T (MASK_STORE, MASK_LEN_STORE) \
> > + T (MASK_GATHER_LOAD, MASK_LEN_GATHER_LOAD) \
> > + T (MASK_SCATTER_STORE, MASK_LEN_SCATTER_STORE) \
> > + T (COND_ADD, COND_LEN_ADD) \
> > + T (COND_SUB, COND_LEN_SUB) \
> > + T (COND_MUL, COND_LEN_MUL) \
> > + T (COND_DIV, COND_LEN_DIV) \
> > + T (COND_MOD, COND_LEN_MOD) \
> > + T (COND_RDIV, COND_LEN_RDIV) \
> > + T (COND_FMIN, COND_LEN_FMIN) \
> > + T (COND_FMAX, COND_LEN_FMAX) \
> > + T (COND_MIN, COND_LEN_MIN) \
> > + T (COND_MAX, COND_LEN_MAX) \
> > + T (COND_AND, COND_LEN_AND) \
> > + T (COND_IOR, COND_LEN_IOR) \
> > + T (COND_XOR, COND_LEN_XOR) \
> > + T (COND_SHL, COND_LEN_SHL) \
> > + T (COND_SHR, COND_LEN_SHR) \
> > + T (COND_NEG, COND_LEN_NEG) \
> > + T (COND_FMA, COND_LEN_FMA) \
> > + T (COND_FMS, COND_LEN_FMS) \
> > + T (COND_FNMA, COND_LEN_FNMA) \
> > + T (COND_FNMS, COND_LEN_FNMS)
> > +
> > +/* If there exists an internal function like IFN that operates on vectors,
> > + but with additional length and bias parameters, return the internal_fn
> > + for that function, otherwise return IFN_LAST. */
> > +internal_fn
> > +get_len_internal_fn (internal_fn fn)
> > +{
> > + switch (fn)
> > + {
> > +#define CASE(NAME, LEN_NAME) \
> > + case IFN_##NAME: \
> > + return IFN_##LEN_NAME;
> > + FOR_EACH_LEN_FN_PAIR (CASE)
> > +#undef CASE
> > + default:
> > + return IFN_LAST;
> > + }
> > +}
> > +
> > /* If IFN implements the conditional form of an unconditional internal
> > function, return that unconditional function, otherwise return IFN_LAST. */
> >
> > diff --git a/gcc/internal-fn.h b/gcc/internal-fn.h
> > index a5c3f4765ff..410c1b623d6 100644
> > --- a/gcc/internal-fn.h
> > +++ b/gcc/internal-fn.h
> > @@ -224,6 +224,7 @@ extern bool set_edom_supported_p (void);
> >
> > extern internal_fn get_conditional_internal_fn (tree_code);
> > extern internal_fn get_conditional_internal_fn (internal_fn);
> > +extern internal_fn get_len_internal_fn (internal_fn);
> > extern internal_fn get_conditional_len_internal_fn (tree_code);
> > extern tree_code conditional_internal_fn_code (internal_fn);
> > extern internal_fn get_unconditional_internal_fn (internal_fn);
> > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> > index 6a4e8fce126..ae5b0b09c08 100644
> > --- a/gcc/tree-vect-stmts.cc
> > +++ b/gcc/tree-vect-stmts.cc
> > @@ -3540,7 +3540,10 @@ vectorizable_call (vec_info *vinfo,
> >
> > int reduc_idx = STMT_VINFO_REDUC_IDX (stmt_info);
> > internal_fn cond_fn = get_conditional_internal_fn (ifn);
> > + internal_fn cond_len_fn = get_len_internal_fn (ifn);
> > + int len_opno = internal_fn_len_index (cond_len_fn);
> > vec_loop_masks *masks = (loop_vinfo ? &LOOP_VINFO_MASKS (loop_vinfo) : NULL);
> > + vec_loop_lens *lens = (loop_vinfo ? &LOOP_VINFO_LENS (loop_vinfo) : NULL);
> > if (!vec_stmt) /* transformation not required. */
> > {
> > if (slp_node)
> > @@ -3586,8 +3589,14 @@ vectorizable_call (vec_info *vinfo,
>
> Above for reduc_idx >= 0 there's a check whether cond_fn is supported,
> don't you need to amend that with a check for cond_len_fn?
>
> > tree scalar_mask = NULL_TREE;
> > if (mask_opno >= 0)
> > scalar_mask = gimple_call_arg (stmt_info->stmt, mask_opno);
> > - vect_record_loop_mask (loop_vinfo, masks, nvectors,
> > - vectype_out, scalar_mask);
> > + if (cond_len_fn != IFN_LAST
> > + && direct_internal_fn_supported_p (cond_len_fn, vectype_out,
> > + OPTIMIZE_FOR_SPEED))
> > + vect_record_loop_len (loop_vinfo, lens, nvectors, vectype_out,
> > + 1);
> > + else
> > + vect_record_loop_mask (loop_vinfo, masks, nvectors, vectype_out,
> > + scalar_mask);
> > }
> > }
> > return true;
> > @@ -3603,8 +3612,24 @@ vectorizable_call (vec_info *vinfo,
> > vec_dest = vect_create_destination_var (scalar_dest, vectype_out);
> >
> > bool masked_loop_p = loop_vinfo && LOOP_VINFO_FULLY_MASKED_P (loop_vinfo);
> > + bool len_loop_p = loop_vinfo && LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo);
> > unsigned int vect_nargs = nargs;
> > - if (masked_loop_p && reduc_idx >= 0)
> > + if (len_loop_p)
> > + {
> > + if (len_opno >= 0)
> > + {
> > + ifn = cond_len_fn;
> > + /* COND_* -> COND_LEN_* takes 2 extra arguments:LEN,BIAS. */
> > + vect_nargs += 2;
> > + }
> > + else if (reduc_idx >= 0)
> > + {
> > + /* FMA -> COND_LEN_FMA takes 4 extra arguments:MASK,ELSE,LEN,BIAS. */
> > + ifn = get_len_internal_fn (cond_fn);
> > + vect_nargs += 4;
>
> I'm a bit confused (but also by the existing mask code), whether
> vect_nargs needs adjustment depends on the IFN in the IL we analyze.
> If if-conversion recognizes a .COND_ADD then we need to add nothing
> for masking (that is, ifn == cond_fn already). In your code above
> you either use cond_len_fn or get_len_internal_fn (cond_fn) but
> isn't that the very same?! So how come you in one case add two
> and in the other add four args?
>
> Please make sure to place gcc_unreachable () in each arm and check
> you have test coverage. I believe that the else arm is unreachable
> but when you vectorize .FMA you will need to add 4 and when you
> vectorize .COND_FMA you will need to add two arguments (as said,
> no idea why we special case reduc_idx >= 0 at the moment).
>
> Otherwise the patch looks OK to me.
>
> Thanks,
> Richard.
>
> > + }
> > + }
> > + else if (masked_loop_p && reduc_idx >= 0)
> > {
> > ifn = cond_fn;
> > vect_nargs += 2;
> > @@ -3629,7 +3654,18 @@ vectorizable_call (vec_info *vinfo,
> > FOR_EACH_VEC_ELT (vec_oprnds0, i, vec_oprnd0)
> > {
> > int varg = 0;
> > - if (masked_loop_p && reduc_idx >= 0)
> > + if (len_loop_p && reduc_idx >= 0)
> > + {
> > + /* Always true for SLP. */
> > + gcc_assert (ncopies == 1);
> > + /* For COND_LEN_* operations used by reduction of
> > + CALL vectorization, the LEN argument is the real
> > + loop len produced by SELECT_VL or MIN wheras the
> > + MASK argument here is the dummy mask. */
> > + vargs[varg++]
> > + = build_minus_one_cst (truth_type_for (vectype_out));
> > + }
> > + else if (masked_loop_p && reduc_idx >= 0)
> > {
> > unsigned int vec_num = vec_oprnds0.length ();
> > /* Always true for SLP. */
> > @@ -3644,7 +3680,7 @@ vectorizable_call (vec_info *vinfo,
> > vec<tree> vec_oprndsk = vec_defs[k];
> > vargs[varg++] = vec_oprndsk[i];
> > }
> > - if (masked_loop_p && reduc_idx >= 0)
> > + if ((masked_loop_p || len_loop_p) && reduc_idx >= 0)
> > vargs[varg++] = vargs[reduc_idx + 1];
> > gimple *new_stmt;
> > if (modifier == NARROW)
> > @@ -3671,7 +3707,21 @@ vectorizable_call (vec_info *vinfo,
> > }
> > else
> > {
> > - if (mask_opno >= 0 && masked_loop_p)
> > + if (len_opno >= 0 && len_loop_p)
> > + {
> > + unsigned int vec_num = vec_oprnds0.length ();
> > + /* Always true for SLP. */
> > + gcc_assert (ncopies == 1);
> > + tree len
> > + = vect_get_loop_len (loop_vinfo, gsi, lens, vec_num,
> > + vectype_out, i, 1);
> > + signed char biasval
> > + = LOOP_VINFO_PARTIAL_LOAD_STORE_BIAS (loop_vinfo);
> > + tree bias = build_int_cst (intQI_type_node, biasval);
> > + vargs[len_opno] = len;
> > + vargs[len_opno + 1] = bias;
> > + }
> > + else if (mask_opno >= 0 && masked_loop_p)
> > {
> > unsigned int vec_num = vec_oprnds0.length ();
> > /* Always true for SLP. */
> > @@ -3701,7 +3751,16 @@ vectorizable_call (vec_info *vinfo,
> > }
> >
> > int varg = 0;
> > - if (masked_loop_p && reduc_idx >= 0)
> > + if (len_loop_p && reduc_idx >= 0)
> > + {
> > + /* For COND_LEN_* operations used by reduction of
> > + CALL vectorization, the LEN argument is the real
> > + loop len produced by SELECT_VL or MIN wheras the
> > + MASK argument here is the dummy mask. */
> > + vargs[varg++]
> > + = build_minus_one_cst (truth_type_for (vectype_out));
> > + }
> > + else if (masked_loop_p && reduc_idx >= 0)
> > vargs[varg++] = vect_get_loop_mask (loop_vinfo, gsi, masks, ncopies,
> > vectype_out, j);
> > for (i = 0; i < nargs; i++)
> > @@ -3716,10 +3775,20 @@ vectorizable_call (vec_info *vinfo,
> > }
> > vargs[varg++] = vec_defs[i][j];
> > }
> > - if (masked_loop_p && reduc_idx >= 0)
> > + if ((masked_loop_p || len_loop_p) && reduc_idx >= 0)
> > vargs[varg++] = vargs[reduc_idx + 1];
> >
> > - if (mask_opno >= 0 && masked_loop_p)
> > + if (len_opno >= 0 && len_loop_p)
> > + {
> > + tree len = vect_get_loop_len (loop_vinfo, gsi, lens, ncopies,
> > + vectype_out, j, 1);
> > + signed char biasval
> > + = LOOP_VINFO_PARTIAL_LOAD_STORE_BIAS (loop_vinfo);
> > + tree bias = build_int_cst (intQI_type_node, biasval);
> > + vargs[len_opno] = len;
> > + vargs[len_opno + 1] = bias;
> > + }
> > + else if (mask_opno >= 0 && masked_loop_p)
> > {
> > tree mask = vect_get_loop_mask (loop_vinfo, gsi, masks, ncopies,
> > vectype_out, j);
> >
>
>
--
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
* Re: Re: [PATCH V2] VECT: Support CALL vectorization for COND_LEN_*
2023-07-31 10:45 ` Richard Biener
@ 2023-07-31 10:59 ` juzhe.zhong
0 siblings, 0 replies; 18+ messages in thread
From: juzhe.zhong @ 2023-07-31 10:59 UTC (permalink / raw)
To: rguenther; +Cc: gcc-patches, richard.sandiford
>> Ah. So then just feed it cond_fn? I mean, we don't have
>> LEN_FMA, the only LEN-but-not-MASK ifns are those used by
>> power/s390, LEN_LOAD and LEN_STORE?
Yes, that's why I feed cond_fn to it, i.e. get_len_internal_fn (cond_fn).
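For the FMA case the lookup then goes through the conditional ifn first, roughly:

  get_conditional_internal_fn (IFN_FMA) -> IFN_COND_FMA
  get_len_internal_fn (IFN_COND_FMA)    -> IFN_COND_LEN_FMA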
>> Yes, but all of this depends on what the original ifn is, no?
Yes.
>> reduc_idx < 0 means this stmt isn't part of a reduction. So yes,
>> you can vectorize FMA as COND_LEN_FMA with dummy mask and len if you
>> don't have FMA expanders?
Could you give me an example where reduc_idx >= 0 when vectorizing FMA into COND_LEN_FMA?
Actually, I failed to produce such a circumstance with this patch:
https://gcc.gnu.org/pipermail/gcc-patches/2023-July/625697.html
I have only fully tested vectorizing COND_* into COND_LEN_*;
I failed to produce a case where FMA is vectorized into COND_LEN_FMA.
juzhe.zhong@rivai.ai
From: Richard Biener
Date: 2023-07-31 18:45
To: juzhe.zhong@rivai.ai
CC: gcc-patches; richard.sandiford
Subject: Re: Re: [PATCH V2] VECT: Support CALL vectorization for COND_LEN_*
On Mon, 31 Jul 2023, juzhe.zhong@rivai.ai wrote:
> Hi, Richard. Thanks a lot for the comment
>
> >> In your code above
> >> you either use cond_len_fn or get_len_internal_fn (cond_fn) but
> >> isn't that the very same?! So how come you in one case add two
> >> and in the other add four args?
>
> cond_len_fn is not the same as get_len_internal_fn (cond_fn) when vectorizing FMA into COND_LEN_FMA.
>
> since "internal_fn cond_len_fn = get_len_internal_fn (ifn);"
>
> and the iterators:
> > +#define FOR_EACH_LEN_FN_PAIR(T) \
> > + T (MASK_LOAD, MASK_LEN_LOAD) \
> > + T (MASK_STORE, MASK_LEN_STORE) \
> > + T (MASK_GATHER_LOAD, MASK_LEN_GATHER_LOAD) \
> > + T (MASK_SCATTER_STORE, MASK_LEN_SCATTER_STORE) \
> > + T (COND_ADD, COND_LEN_ADD) \
> > + T (COND_SUB, COND_LEN_SUB) \
> > + T (COND_MUL, COND_LEN_MUL) \
> > + T (COND_DIV, COND_LEN_DIV) \
> > + T (COND_MOD, COND_LEN_MOD) \
> > + T (COND_RDIV, COND_LEN_RDIV) \
> > + T (COND_FMIN, COND_LEN_FMIN) \
> > + T (COND_FMAX, COND_LEN_FMAX) \
> > + T (COND_MIN, COND_LEN_MIN) \
> > + T (COND_MAX, COND_LEN_MAX) \
> > + T (COND_AND, COND_LEN_AND) \
> > + T (COND_IOR, COND_LEN_IOR) \
> > + T (COND_XOR, COND_LEN_XOR) \
> > + T (COND_SHL, COND_LEN_SHL) \
> > + T (COND_SHR, COND_LEN_SHR) \
> > + T (COND_NEG, COND_LEN_NEG) \
> > + T (COND_FMA, COND_LEN_FMA) \
> > + T (COND_FMS, COND_LEN_FMS) \
> > + T (COND_FNMA, COND_LEN_FNMA) \
> > + T (COND_FNMS, COND_LEN_FNMS)
>
> So, cond_len_fn will be IFN_LAST when ifn = FMA.
Ah. So then just feed it cond_fn? I mean, we don't have
LEN_FMA, the only LEN-but-not-MASK ifns are those used by
power/s390, LEN_LOAD and LEN_STORE?
> Maybe is it reasonable that I add 4 more iterators here?
> > + T (FMA, COND_LEN_FMA) \
> > + T (FMS, COND_LEN_FMS) \
> > + T (FNMA, COND_LEN_FNMA) \
> > + T (FNMS, COND_LEN_FNMS)
>
> So that we won't need to have "get_len_internal_fn (cond_fn)"
No, as said we don't have LEN_FMA.
> When vectorizing COND_ADD into COND_LEN_ADD we already have "mask" and "else" value.
> So we only need to add 2 arguments.
>
> But when vectorizing FMA into COND_LEN_FMA, we need to add 4 arguments (mask,else,len,bias).
Yes, but all of this depends on what the original ifn is, no?
> >>as said,
> >>no idea why we special case reduc_idx >= 0 at the moment
>
> I also want to vectorize FMA into COND_LEN_FMA even reduc_idx < 0.
> Could I relax this condition for COND_LEN_* since it will improve RVV codegen a lot.
reduc_idx < 0 means this stmt isn't part of a reduction. So yes,
you can vectorize FMA as COND_LEN_FMA with dummy mask and len if you
don't have FMA expanders?
Richard.
> Thanks.
>
>
> juzhe.zhong@rivai.ai
>
> From: Richard Biener
> Date: 2023-07-31 17:26
> To: Juzhe-Zhong
> CC: gcc-patches; richard.sandiford
> Subject: Re: [PATCH V2] VECT: Support CALL vectorization for COND_LEN_*
> On Fri, 28 Jul 2023, Juzhe-Zhong wrote:
>
> > Hi, Richard and Richi.
> >
> > Base on the suggestions from Richard:
> > https://gcc.gnu.org/pipermail/gcc-patches/2023-July/625396.html
> >
> > This patch choose (1) approach that Richard provided, meaning:
> >
> > RVV implements cond_* optabs as expanders. RVV therefore supports
> > both IFN_COND_ADD and IFN_COND_LEN_ADD. No dummy length arguments
> > are needed at the gimple level.
> >
> > Such approach can make codes much cleaner and reasonable.
> >
> > Consider this following case:
> > void foo (float * __restrict a, float * __restrict b, int * __restrict cond, int n)
> > {
> > for (int i = 0; i < n; i++)
> > if (cond[i])
> > a[i] = b[i] + a[i];
> > }
> >
> >
> > Output of RISC-V (32-bits) gcc (trunk) (Compiler #3)
> > <source>:5:21: missed: couldn't vectorize loop
> > <source>:5:21: missed: not vectorized: control flow in loop.
> >
> > ARM SVE:
> >
> > ...
> > mask__27.10_51 = vect__4.9_49 != { 0, ... };
> > ...
> > vec_mask_and_55 = loop_mask_49 & mask__27.10_51;
> > ...
> > vect__9.17_62 = .COND_ADD (vec_mask_and_55, vect__6.13_56, vect__8.16_60, vect__6.13_56);
> >
> > For RVV, we want IR as follows:
> >
> > ...
> > _68 = .SELECT_VL (ivtmp_66, POLY_INT_CST [4, 4]);
> > ...
> > mask__27.10_51 = vect__4.9_49 != { 0, ... };
> > ...
> > vect__9.17_60 = .COND_LEN_ADD (mask__27.10_51, vect__6.13_55, vect__8.16_59, vect__6.13_55, _68, 0);
> > ...
> >
> > Both len and mask of COND_LEN_ADD are real not dummy.
> >
> > This patch has been fully tested in RISC-V port with supporting both COND_* and COND_LEN_*.
> >
> > And also, Bootstrap and Regression on X86 passed.
> >
> > OK for trunk?
> >
> > gcc/ChangeLog:
> >
> > * internal-fn.cc (FOR_EACH_LEN_FN_PAIR): New macro.
> > (get_len_internal_fn): New function.
> > (CASE): Ditto.
> > * internal-fn.h (get_len_internal_fn): Ditto.
> > * tree-vect-stmts.cc (vectorizable_call): Support CALL vectorization with COND_LEN_*.
> >
> > ---
> > gcc/internal-fn.cc | 46 ++++++++++++++++++++++
> > gcc/internal-fn.h | 1 +
> > gcc/tree-vect-stmts.cc | 87 +++++++++++++++++++++++++++++++++++++-----
> > 3 files changed, 125 insertions(+), 9 deletions(-)
> >
> > diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
> > index 8e294286388..379220bebc7 100644
> > --- a/gcc/internal-fn.cc
> > +++ b/gcc/internal-fn.cc
> > @@ -4443,6 +4443,52 @@ get_conditional_internal_fn (internal_fn fn)
> > }
> > }
> >
> > +/* Invoke T(IFN) for each internal function IFN that also has an
> > + IFN_COND_LEN_* or IFN_MASK_LEN_* form. */
> > +#define FOR_EACH_LEN_FN_PAIR(T) \
> > + T (MASK_LOAD, MASK_LEN_LOAD) \
> > + T (MASK_STORE, MASK_LEN_STORE) \
> > + T (MASK_GATHER_LOAD, MASK_LEN_GATHER_LOAD) \
> > + T (MASK_SCATTER_STORE, MASK_LEN_SCATTER_STORE) \
> > + T (COND_ADD, COND_LEN_ADD) \
> > + T (COND_SUB, COND_LEN_SUB) \
> > + T (COND_MUL, COND_LEN_MUL) \
> > + T (COND_DIV, COND_LEN_DIV) \
> > + T (COND_MOD, COND_LEN_MOD) \
> > + T (COND_RDIV, COND_LEN_RDIV) \
> > + T (COND_FMIN, COND_LEN_FMIN) \
> > + T (COND_FMAX, COND_LEN_FMAX) \
> > + T (COND_MIN, COND_LEN_MIN) \
> > + T (COND_MAX, COND_LEN_MAX) \
> > + T (COND_AND, COND_LEN_AND) \
> > + T (COND_IOR, COND_LEN_IOR) \
> > + T (COND_XOR, COND_LEN_XOR) \
> > + T (COND_SHL, COND_LEN_SHL) \
> > + T (COND_SHR, COND_LEN_SHR) \
> > + T (COND_NEG, COND_LEN_NEG) \
> > + T (COND_FMA, COND_LEN_FMA) \
> > + T (COND_FMS, COND_LEN_FMS) \
> > + T (COND_FNMA, COND_LEN_FNMA) \
> > + T (COND_FNMS, COND_LEN_FNMS)
> > +
> > +/* If there exists an internal function like IFN that operates on vectors,
> > + but with additional length and bias parameters, return the internal_fn
> > + for that function, otherwise return IFN_LAST. */
> > +internal_fn
> > +get_len_internal_fn (internal_fn fn)
> > +{
> > + switch (fn)
> > + {
> > +#define CASE(NAME, LEN_NAME) \
> > + case IFN_##NAME: \
> > + return IFN_##LEN_NAME;
> > + FOR_EACH_LEN_FN_PAIR (CASE)
> > +#undef CASE
> > + default:
> > + return IFN_LAST;
> > + }
> > +}
> > +
> > /* If IFN implements the conditional form of an unconditional internal
> > function, return that unconditional function, otherwise return IFN_LAST. */
> >
> > diff --git a/gcc/internal-fn.h b/gcc/internal-fn.h
> > index a5c3f4765ff..410c1b623d6 100644
> > --- a/gcc/internal-fn.h
> > +++ b/gcc/internal-fn.h
> > @@ -224,6 +224,7 @@ extern bool set_edom_supported_p (void);
> >
> > extern internal_fn get_conditional_internal_fn (tree_code);
> > extern internal_fn get_conditional_internal_fn (internal_fn);
> > +extern internal_fn get_len_internal_fn (internal_fn);
> > extern internal_fn get_conditional_len_internal_fn (tree_code);
> > extern tree_code conditional_internal_fn_code (internal_fn);
> > extern internal_fn get_unconditional_internal_fn (internal_fn);
> > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> > index 6a4e8fce126..ae5b0b09c08 100644
> > --- a/gcc/tree-vect-stmts.cc
> > +++ b/gcc/tree-vect-stmts.cc
> > @@ -3540,7 +3540,10 @@ vectorizable_call (vec_info *vinfo,
> >
> > int reduc_idx = STMT_VINFO_REDUC_IDX (stmt_info);
> > internal_fn cond_fn = get_conditional_internal_fn (ifn);
> > + internal_fn cond_len_fn = get_len_internal_fn (ifn);
> > + int len_opno = internal_fn_len_index (cond_len_fn);
> > vec_loop_masks *masks = (loop_vinfo ? &LOOP_VINFO_MASKS (loop_vinfo) : NULL);
> > + vec_loop_lens *lens = (loop_vinfo ? &LOOP_VINFO_LENS (loop_vinfo) : NULL);
> > if (!vec_stmt) /* transformation not required. */
> > {
> > if (slp_node)
> > @@ -3586,8 +3589,14 @@ vectorizable_call (vec_info *vinfo,
>
> Above for reduc_idx >= 0 there's a check whether cond_fn is supported,
> don't you need to amend that with a check for cond_len_fn?
>
> > tree scalar_mask = NULL_TREE;
> > if (mask_opno >= 0)
> > scalar_mask = gimple_call_arg (stmt_info->stmt, mask_opno);
> > - vect_record_loop_mask (loop_vinfo, masks, nvectors,
> > - vectype_out, scalar_mask);
> > + if (cond_len_fn != IFN_LAST
> > + && direct_internal_fn_supported_p (cond_len_fn, vectype_out,
> > + OPTIMIZE_FOR_SPEED))
> > + vect_record_loop_len (loop_vinfo, lens, nvectors, vectype_out,
> > + 1);
> > + else
> > + vect_record_loop_mask (loop_vinfo, masks, nvectors, vectype_out,
> > + scalar_mask);
> > }
> > }
> > return true;
> > @@ -3603,8 +3612,24 @@ vectorizable_call (vec_info *vinfo,
> > vec_dest = vect_create_destination_var (scalar_dest, vectype_out);
> >
> > bool masked_loop_p = loop_vinfo && LOOP_VINFO_FULLY_MASKED_P (loop_vinfo);
> > + bool len_loop_p = loop_vinfo && LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo);
> > unsigned int vect_nargs = nargs;
> > - if (masked_loop_p && reduc_idx >= 0)
> > + if (len_loop_p)
> > + {
> > + if (len_opno >= 0)
> > + {
> > + ifn = cond_len_fn;
> > + /* COND_* -> COND_LEN_* takes 2 extra arguments:LEN,BIAS. */
> > + vect_nargs += 2;
> > + }
> > + else if (reduc_idx >= 0)
> > + {
> > + /* FMA -> COND_LEN_FMA takes 4 extra arguments:MASK,ELSE,LEN,BIAS. */
> > + ifn = get_len_internal_fn (cond_fn);
> > + vect_nargs += 4;
>
> I'm a bit confused (but also by the existing mask code), whether
> vect_nargs needs adjustment depends on the IFN in the IL we analyze.
> If if-conversion recognizes a .COND_ADD then we need to add nothing
> for masking (that is, ifn == cond_fn already). In your code above
> you either use cond_len_fn or get_len_internal_fn (cond_fn) but
> isn't that the very same?! So how come you in one case add two
> and in the other add four args?
>
> Please make sure to place gcc_unreachable () in each arm and check
> you have test coverage. I believe that the else arm is unreachable
> but when you vectorize .FMA you will need to add 4 and when you
> vectorize .COND_FMA you will need to add two arguments (as said,
> no idea why we special case reduc_idx >= 0 at the moment).
>
> Otherwise the patch looks OK to me.
>
> Thanks,
> Richard.
>
> > + }
> > + }
> > + else if (masked_loop_p && reduc_idx >= 0)
> > {
> > ifn = cond_fn;
> > vect_nargs += 2;
> > @@ -3629,7 +3654,18 @@ vectorizable_call (vec_info *vinfo,
> > FOR_EACH_VEC_ELT (vec_oprnds0, i, vec_oprnd0)
> > {
> > int varg = 0;
> > - if (masked_loop_p && reduc_idx >= 0)
> > + if (len_loop_p && reduc_idx >= 0)
> > + {
> > + /* Always true for SLP. */
> > + gcc_assert (ncopies == 1);
> > + /* For COND_LEN_* operations used by reduction of
> > + CALL vectorization, the LEN argument is the real
> > loop len produced by SELECT_VL or MIN whereas the
> > + MASK argument here is the dummy mask. */
> > + vargs[varg++]
> > + = build_minus_one_cst (truth_type_for (vectype_out));
> > + }
> > + else if (masked_loop_p && reduc_idx >= 0)
> > {
> > unsigned int vec_num = vec_oprnds0.length ();
> > /* Always true for SLP. */
> > @@ -3644,7 +3680,7 @@ vectorizable_call (vec_info *vinfo,
> > vec<tree> vec_oprndsk = vec_defs[k];
> > vargs[varg++] = vec_oprndsk[i];
> > }
> > - if (masked_loop_p && reduc_idx >= 0)
> > + if ((masked_loop_p || len_loop_p) && reduc_idx >= 0)
> > vargs[varg++] = vargs[reduc_idx + 1];
> > gimple *new_stmt;
> > if (modifier == NARROW)
> > @@ -3671,7 +3707,21 @@ vectorizable_call (vec_info *vinfo,
> > }
> > else
> > {
> > - if (mask_opno >= 0 && masked_loop_p)
> > + if (len_opno >= 0 && len_loop_p)
> > + {
> > + unsigned int vec_num = vec_oprnds0.length ();
> > + /* Always true for SLP. */
> > + gcc_assert (ncopies == 1);
> > + tree len
> > + = vect_get_loop_len (loop_vinfo, gsi, lens, vec_num,
> > + vectype_out, i, 1);
> > + signed char biasval
> > + = LOOP_VINFO_PARTIAL_LOAD_STORE_BIAS (loop_vinfo);
> > + tree bias = build_int_cst (intQI_type_node, biasval);
> > + vargs[len_opno] = len;
> > + vargs[len_opno + 1] = bias;
> > + }
> > + else if (mask_opno >= 0 && masked_loop_p)
> > {
> > unsigned int vec_num = vec_oprnds0.length ();
> > /* Always true for SLP. */
> > @@ -3701,7 +3751,16 @@ vectorizable_call (vec_info *vinfo,
> > }
> >
> > int varg = 0;
> > - if (masked_loop_p && reduc_idx >= 0)
> > + if (len_loop_p && reduc_idx >= 0)
> > + {
> > + /* For COND_LEN_* operations used by reduction of
> > + CALL vectorization, the LEN argument is the real
> > loop len produced by SELECT_VL or MIN whereas the
> > + MASK argument here is the dummy mask. */
> > + vargs[varg++]
> > + = build_minus_one_cst (truth_type_for (vectype_out));
> > + }
> > + else if (masked_loop_p && reduc_idx >= 0)
> > vargs[varg++] = vect_get_loop_mask (loop_vinfo, gsi, masks, ncopies,
> > vectype_out, j);
> > for (i = 0; i < nargs; i++)
> > @@ -3716,10 +3775,20 @@ vectorizable_call (vec_info *vinfo,
> > }
> > vargs[varg++] = vec_defs[i][j];
> > }
> > - if (masked_loop_p && reduc_idx >= 0)
> > + if ((masked_loop_p || len_loop_p) && reduc_idx >= 0)
> > vargs[varg++] = vargs[reduc_idx + 1];
> >
> > - if (mask_opno >= 0 && masked_loop_p)
> > + if (len_opno >= 0 && len_loop_p)
> > + {
> > + tree len = vect_get_loop_len (loop_vinfo, gsi, lens, ncopies,
> > + vectype_out, j, 1);
> > + signed char biasval
> > + = LOOP_VINFO_PARTIAL_LOAD_STORE_BIAS (loop_vinfo);
> > + tree bias = build_int_cst (intQI_type_node, biasval);
> > + vargs[len_opno] = len;
> > + vargs[len_opno + 1] = bias;
> > + }
> > + else if (mask_opno >= 0 && masked_loop_p)
> > {
> > tree mask = vect_get_loop_mask (loop_vinfo, gsi, masks, ncopies,
> > vectype_out, j);
> >
>
>
--
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
* Re: Re: [PATCH V2] VECT: Support CALL vectorization for COND_LEN_*
2023-07-31 10:19 ` Richard Sandiford
@ 2023-07-31 11:52 ` juzhe.zhong
2023-07-31 12:00 ` Richard Biener
0 siblings, 1 reply; 18+ messages in thread
From: juzhe.zhong @ 2023-07-31 11:52 UTC (permalink / raw)
To: richard.sandiford; +Cc: gcc-patches, rguenther
OK. Thanks Richard.
Could you give me a case where SVE can vectorize a reduction with FMA?
Meaning it will go into vectorizable_call and vectorize FMA into COND_FMA?
I tried many times to reproduce such a case but I failed.
Thanks.
juzhe.zhong@rivai.ai
From: Richard Sandiford
Date: 2023-07-31 18:19
To: Juzhe-Zhong
CC: gcc-patches; rguenther
Subject: Re: [PATCH V2] VECT: Support CALL vectorization for COND_LEN_*
Juzhe-Zhong <juzhe.zhong@rivai.ai> writes:
> Hi, Richard and Richi.
>
> Base on the suggestions from Richard:
> https://gcc.gnu.org/pipermail/gcc-patches/2023-July/625396.html
>
> This patch choose (1) approach that Richard provided, meaning:
>
> RVV implements cond_* optabs as expanders. RVV therefore supports
> both IFN_COND_ADD and IFN_COND_LEN_ADD. No dummy length arguments
> are needed at the gimple level.
>
> Such approach can make codes much cleaner and reasonable.
>
> Consider this following case:
> void foo (float * __restrict a, float * __restrict b, int * __restrict cond, int n)
> {
> for (int i = 0; i < n; i++)
> if (cond[i])
> a[i] = b[i] + a[i];
> }
>
>
> Output of RISC-V (32-bits) gcc (trunk) (Compiler #3)
> <source>:5:21: missed: couldn't vectorize loop
> <source>:5:21: missed: not vectorized: control flow in loop.
>
> ARM SVE:
>
> ...
> mask__27.10_51 = vect__4.9_49 != { 0, ... };
> ...
> vec_mask_and_55 = loop_mask_49 & mask__27.10_51;
> ...
> vect__9.17_62 = .COND_ADD (vec_mask_and_55, vect__6.13_56, vect__8.16_60, vect__6.13_56);
>
> For RVV, we want IR as follows:
>
> ...
> _68 = .SELECT_VL (ivtmp_66, POLY_INT_CST [4, 4]);
> ...
> mask__27.10_51 = vect__4.9_49 != { 0, ... };
> ...
> vect__9.17_60 = .COND_LEN_ADD (mask__27.10_51, vect__6.13_55, vect__8.16_59, vect__6.13_55, _68, 0);
> ...
>
> Both len and mask of COND_LEN_ADD are real not dummy.
>
> This patch has been fully tested in RISC-V port with supporting both COND_* and COND_LEN_*.
>
> And also, Bootstrap and Regression on X86 passed.
>
> OK for trunk?
>
> gcc/ChangeLog:
>
> * internal-fn.cc (FOR_EACH_LEN_FN_PAIR): New macro.
> (get_len_internal_fn): New function.
> (CASE): Ditto.
> * internal-fn.h (get_len_internal_fn): Ditto.
> * tree-vect-stmts.cc (vectorizable_call): Support CALL vectorization with COND_LEN_*.
>
> ---
> gcc/internal-fn.cc | 46 ++++++++++++++++++++++
> gcc/internal-fn.h | 1 +
> gcc/tree-vect-stmts.cc | 87 +++++++++++++++++++++++++++++++++++++-----
> 3 files changed, 125 insertions(+), 9 deletions(-)
>
> diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
> index 8e294286388..379220bebc7 100644
> --- a/gcc/internal-fn.cc
> +++ b/gcc/internal-fn.cc
> @@ -4443,6 +4443,52 @@ get_conditional_internal_fn (internal_fn fn)
> }
> }
>
> +/* Invoke T(IFN) for each internal function IFN that also has an
> + IFN_COND_LEN_* or IFN_MASK_LEN_* form. */
> +#define FOR_EACH_LEN_FN_PAIR(T) \
> + T (MASK_LOAD, MASK_LEN_LOAD) \
> + T (MASK_STORE, MASK_LEN_STORE) \
> + T (MASK_GATHER_LOAD, MASK_LEN_GATHER_LOAD) \
> + T (MASK_SCATTER_STORE, MASK_LEN_SCATTER_STORE) \
> + T (COND_ADD, COND_LEN_ADD) \
> + T (COND_SUB, COND_LEN_SUB) \
> + T (COND_MUL, COND_LEN_MUL) \
> + T (COND_DIV, COND_LEN_DIV) \
> + T (COND_MOD, COND_LEN_MOD) \
> + T (COND_RDIV, COND_LEN_RDIV) \
> + T (COND_FMIN, COND_LEN_FMIN) \
> + T (COND_FMAX, COND_LEN_FMAX) \
> + T (COND_MIN, COND_LEN_MIN) \
> + T (COND_MAX, COND_LEN_MAX) \
> + T (COND_AND, COND_LEN_AND) \
> + T (COND_IOR, COND_LEN_IOR) \
> + T (COND_XOR, COND_LEN_XOR) \
> + T (COND_SHL, COND_LEN_SHL) \
> + T (COND_SHR, COND_LEN_SHR) \
> + T (COND_NEG, COND_LEN_NEG) \
> + T (COND_FMA, COND_LEN_FMA) \
> + T (COND_FMS, COND_LEN_FMS) \
> + T (COND_FNMA, COND_LEN_FNMA) \
> + T (COND_FNMS, COND_LEN_FNMS)
With the earlier patch to add DEF_INTERNAL_COND_FN and
DEF_INTERNAL_SIGNED_COND_FN, I think we should use those to handle
the COND_* cases, rather than putting them in this macro.
Thanks,
Richard
* Re: Re: [PATCH V2] VECT: Support CALL vectorization for COND_LEN_*
2023-07-31 11:52 ` juzhe.zhong
@ 2023-07-31 12:00 ` Richard Biener
2023-07-31 12:09 ` juzhe.zhong
0 siblings, 1 reply; 18+ messages in thread
From: Richard Biener @ 2023-07-31 12:00 UTC (permalink / raw)
To: juzhe.zhong; +Cc: richard.sandiford, gcc-patches
On Mon, 31 Jul 2023, juzhe.zhong@rivai.ai wrote:
> Ok . Thanks Richard.
>
> Could you give me a case that SVE can vectorize a reduction with FMA?
> Meaning it will go into vectorize_call and vectorize FMA into COND_FMA ?
>
> I tried many times to reproduce such cases but I failed.
I think you need to use fma from math.h together with -ffast-math
to get fma.
Richard.
> Thanks.
>
>
> juzhe.zhong@rivai.ai
>
> From: Richard Sandiford
> Date: 2023-07-31 18:19
> To: Juzhe-Zhong
> CC: gcc-patches; rguenther
> Subject: Re: [PATCH V2] VECT: Support CALL vectorization for COND_LEN_*
> Juzhe-Zhong <juzhe.zhong@rivai.ai> writes:
> > Hi, Richard and Richi.
> >
> > Base on the suggestions from Richard:
> > https://gcc.gnu.org/pipermail/gcc-patches/2023-July/625396.html
> >
> > This patch choose (1) approach that Richard provided, meaning:
> >
> > RVV implements cond_* optabs as expanders. RVV therefore supports
> > both IFN_COND_ADD and IFN_COND_LEN_ADD. No dummy length arguments
> > are needed at the gimple level.
> >
> > Such approach can make codes much cleaner and reasonable.
> >
> > Consider this following case:
> > void foo (float * __restrict a, float * __restrict b, int * __restrict cond, int n)
> > {
> > for (int i = 0; i < n; i++)
> > if (cond[i])
> > a[i] = b[i] + a[i];
> > }
> >
> >
> > Output of RISC-V (32-bits) gcc (trunk) (Compiler #3)
> > <source>:5:21: missed: couldn't vectorize loop
> > <source>:5:21: missed: not vectorized: control flow in loop.
> >
> > ARM SVE:
> >
> > ...
> > mask__27.10_51 = vect__4.9_49 != { 0, ... };
> > ...
> > vec_mask_and_55 = loop_mask_49 & mask__27.10_51;
> > ...
> > vect__9.17_62 = .COND_ADD (vec_mask_and_55, vect__6.13_56, vect__8.16_60, vect__6.13_56);
> >
> > For RVV, we want IR as follows:
> >
> > ...
> > _68 = .SELECT_VL (ivtmp_66, POLY_INT_CST [4, 4]);
> > ...
> > mask__27.10_51 = vect__4.9_49 != { 0, ... };
> > ...
> > vect__9.17_60 = .COND_LEN_ADD (mask__27.10_51, vect__6.13_55, vect__8.16_59, vect__6.13_55, _68, 0);
> > ...
> >
> > Both len and mask of COND_LEN_ADD are real not dummy.
> >
> > This patch has been fully tested in RISC-V port with supporting both COND_* and COND_LEN_*.
> >
> > And also, Bootstrap and Regression on X86 passed.
> >
> > OK for trunk?
> >
> > gcc/ChangeLog:
> >
> > * internal-fn.cc (FOR_EACH_LEN_FN_PAIR): New macro.
> > (get_len_internal_fn): New function.
> > (CASE): Ditto.
> > * internal-fn.h (get_len_internal_fn): Ditto.
> > * tree-vect-stmts.cc (vectorizable_call): Support CALL vectorization with COND_LEN_*.
> >
> > ---
> > gcc/internal-fn.cc | 46 ++++++++++++++++++++++
> > gcc/internal-fn.h | 1 +
> > gcc/tree-vect-stmts.cc | 87 +++++++++++++++++++++++++++++++++++++-----
> > 3 files changed, 125 insertions(+), 9 deletions(-)
> >
> > diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
> > index 8e294286388..379220bebc7 100644
> > --- a/gcc/internal-fn.cc
> > +++ b/gcc/internal-fn.cc
> > @@ -4443,6 +4443,52 @@ get_conditional_internal_fn (internal_fn fn)
> > }
> > }
> >
> > +/* Invoke T(IFN) for each internal function IFN that also has an
> > + IFN_COND_LEN_* or IFN_MASK_LEN_* form. */
> > +#define FOR_EACH_LEN_FN_PAIR(T) \
> > + T (MASK_LOAD, MASK_LEN_LOAD) \
> > + T (MASK_STORE, MASK_LEN_STORE) \
> > + T (MASK_GATHER_LOAD, MASK_LEN_GATHER_LOAD) \
> > + T (MASK_SCATTER_STORE, MASK_LEN_SCATTER_STORE) \
> > + T (COND_ADD, COND_LEN_ADD) \
> > + T (COND_SUB, COND_LEN_SUB) \
> > + T (COND_MUL, COND_LEN_MUL) \
> > + T (COND_DIV, COND_LEN_DIV) \
> > + T (COND_MOD, COND_LEN_MOD) \
> > + T (COND_RDIV, COND_LEN_RDIV) \
> > + T (COND_FMIN, COND_LEN_FMIN) \
> > + T (COND_FMAX, COND_LEN_FMAX) \
> > + T (COND_MIN, COND_LEN_MIN) \
> > + T (COND_MAX, COND_LEN_MAX) \
> > + T (COND_AND, COND_LEN_AND) \
> > + T (COND_IOR, COND_LEN_IOR) \
> > + T (COND_XOR, COND_LEN_XOR) \
> > + T (COND_SHL, COND_LEN_SHL) \
> > + T (COND_SHR, COND_LEN_SHR) \
> > + T (COND_NEG, COND_LEN_NEG) \
> > + T (COND_FMA, COND_LEN_FMA) \
> > + T (COND_FMS, COND_LEN_FMS) \
> > + T (COND_FNMA, COND_LEN_FNMA) \
> > + T (COND_FNMS, COND_LEN_FNMS)
>
> With the earlier patch to add DEF_INTERNAL_COND_FN and
> DEF_INTERNAL_SIGNED_COND_FN, I think we should use those to handle
> the COND_* cases, rather than putting them in this macro.
>
> Thanks,
> Richard
>
>
--
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
* Re: Re: [PATCH V2] VECT: Support CALL vectorization for COND_LEN_*
2023-07-31 12:00 ` Richard Biener
@ 2023-07-31 12:09 ` juzhe.zhong
2023-07-31 13:33 ` Richard Biener
0 siblings, 1 reply; 18+ messages in thread
From: juzhe.zhong @ 2023-07-31 12:09 UTC (permalink / raw)
To: rguenther; +Cc: richard.sandiford, gcc-patches
Hi, Richi.
>> I think you need to use fma from math.h together with -ffast-math
>>to get fma.
As you said, this is one of the cases I tried:
https://godbolt.org/z/xMzrrv5dT
GCC failed to vectorize.
Could you help me with this?
Thanks.
juzhe.zhong@rivai.ai
From: Richard Biener
Date: 2023-07-31 20:00
To: juzhe.zhong@rivai.ai
CC: richard.sandiford; gcc-patches
Subject: Re: Re: [PATCH V2] VECT: Support CALL vectorization for COND_LEN_*
On Mon, 31 Jul 2023, juzhe.zhong@rivai.ai wrote:
> Ok . Thanks Richard.
>
> Could you give me a case that SVE can vectorize a reduction with FMA?
> Meaning it will go into vectorize_call and vectorize FMA into COND_FMA ?
>
> I tried many times to reproduce such cases but I failed.
I think you need to use fma from math.h together with -ffast-math
to get fma.
Richard.
> Thanks.
>
>
> juzhe.zhong@rivai.ai
>
> From: Richard Sandiford
> Date: 2023-07-31 18:19
> To: Juzhe-Zhong
> CC: gcc-patches; rguenther
> Subject: Re: [PATCH V2] VECT: Support CALL vectorization for COND_LEN_*
> Juzhe-Zhong <juzhe.zhong@rivai.ai> writes:
> > Hi, Richard and Richi.
> >
> > Base on the suggestions from Richard:
> > https://gcc.gnu.org/pipermail/gcc-patches/2023-July/625396.html
> >
> > This patch choose (1) approach that Richard provided, meaning:
> >
> > RVV implements cond_* optabs as expanders. RVV therefore supports
> > both IFN_COND_ADD and IFN_COND_LEN_ADD. No dummy length arguments
> > are needed at the gimple level.
> >
> > Such approach can make codes much cleaner and reasonable.
> >
> > Consider this following case:
> > void foo (float * __restrict a, float * __restrict b, int * __restrict cond, int n)
> > {
> > for (int i = 0; i < n; i++)
> > if (cond[i])
> > a[i] = b[i] + a[i];
> > }
> >
> >
> > Output of RISC-V (32-bits) gcc (trunk) (Compiler #3)
> > <source>:5:21: missed: couldn't vectorize loop
> > <source>:5:21: missed: not vectorized: control flow in loop.
> >
> > ARM SVE:
> >
> > ...
> > mask__27.10_51 = vect__4.9_49 != { 0, ... };
> > ...
> > vec_mask_and_55 = loop_mask_49 & mask__27.10_51;
> > ...
> > vect__9.17_62 = .COND_ADD (vec_mask_and_55, vect__6.13_56, vect__8.16_60, vect__6.13_56);
> >
> > For RVV, we want IR as follows:
> >
> > ...
> > _68 = .SELECT_VL (ivtmp_66, POLY_INT_CST [4, 4]);
> > ...
> > mask__27.10_51 = vect__4.9_49 != { 0, ... };
> > ...
> > vect__9.17_60 = .COND_LEN_ADD (mask__27.10_51, vect__6.13_55, vect__8.16_59, vect__6.13_55, _68, 0);
> > ...
> >
> > Both len and mask of COND_LEN_ADD are real not dummy.
> >
> > This patch has been fully tested in RISC-V port with supporting both COND_* and COND_LEN_*.
> >
> > And also, Bootstrap and Regression on X86 passed.
> >
> > OK for trunk?
> >
> > gcc/ChangeLog:
> >
> > * internal-fn.cc (FOR_EACH_LEN_FN_PAIR): New macro.
> > (get_len_internal_fn): New function.
> > (CASE): Ditto.
> > * internal-fn.h (get_len_internal_fn): Ditto.
> > * tree-vect-stmts.cc (vectorizable_call): Support CALL vectorization with COND_LEN_*.
> >
> > ---
> > gcc/internal-fn.cc | 46 ++++++++++++++++++++++
> > gcc/internal-fn.h | 1 +
> > gcc/tree-vect-stmts.cc | 87 +++++++++++++++++++++++++++++++++++++-----
> > 3 files changed, 125 insertions(+), 9 deletions(-)
> >
> > diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
> > index 8e294286388..379220bebc7 100644
> > --- a/gcc/internal-fn.cc
> > +++ b/gcc/internal-fn.cc
> > @@ -4443,6 +4443,52 @@ get_conditional_internal_fn (internal_fn fn)
> > }
> > }
> >
> > +/* Invoke T(IFN) for each internal function IFN that also has an
> > + IFN_COND_LEN_* or IFN_MASK_LEN_* form. */
> > +#define FOR_EACH_LEN_FN_PAIR(T) \
> > + T (MASK_LOAD, MASK_LEN_LOAD) \
> > + T (MASK_STORE, MASK_LEN_STORE) \
> > + T (MASK_GATHER_LOAD, MASK_LEN_GATHER_LOAD) \
> > + T (MASK_SCATTER_STORE, MASK_LEN_SCATTER_STORE) \
> > + T (COND_ADD, COND_LEN_ADD) \
> > + T (COND_SUB, COND_LEN_SUB) \
> > + T (COND_MUL, COND_LEN_MUL) \
> > + T (COND_DIV, COND_LEN_DIV) \
> > + T (COND_MOD, COND_LEN_MOD) \
> > + T (COND_RDIV, COND_LEN_RDIV) \
> > + T (COND_FMIN, COND_LEN_FMIN) \
> > + T (COND_FMAX, COND_LEN_FMAX) \
> > + T (COND_MIN, COND_LEN_MIN) \
> > + T (COND_MAX, COND_LEN_MAX) \
> > + T (COND_AND, COND_LEN_AND) \
> > + T (COND_IOR, COND_LEN_IOR) \
> > + T (COND_XOR, COND_LEN_XOR) \
> > + T (COND_SHL, COND_LEN_SHL) \
> > + T (COND_SHR, COND_LEN_SHR) \
> > + T (COND_NEG, COND_LEN_NEG) \
> > + T (COND_FMA, COND_LEN_FMA) \
> > + T (COND_FMS, COND_LEN_FMS) \
> > + T (COND_FNMA, COND_LEN_FNMA) \
> > + T (COND_FNMS, COND_LEN_FNMS)
>
> With the earlier patch to add DEF_INTERNAL_COND_FN and
> DEF_INTERNAL_SIGNED_COND_FN, I think we should use those to handle
> the COND_* cases, rather than putting them in this macro.
>
> Thanks,
> Richard
>
>
--
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
* Re: Re: [PATCH V2] VECT: Support CALL vectorization for COND_LEN_*
2023-07-31 12:09 ` juzhe.zhong
@ 2023-07-31 13:33 ` Richard Biener
2023-07-31 13:43 ` 钟居哲
0 siblings, 1 reply; 18+ messages in thread
From: Richard Biener @ 2023-07-31 13:33 UTC (permalink / raw)
To: juzhe.zhong; +Cc: richard.sandiford, gcc-patches
On Mon, 31 Jul 2023, juzhe.zhong@rivai.ai wrote:
> Hi, Richi.
>
> >> I think you need to use fma from math.h together with -ffast-math
> >>to get fma.
>
> As you said, this is one of the case I tried:
> https://godbolt.org/z/xMzrrv5dT
> GCC failed to vectorize.
>
> Could you help me with this?
double foo (double *a, double *b, double *c)
{
  double result = 0.0;
  for (int i = 0; i < 1024; ++i)
    result += __builtin_fma (a[i], b[i], c[i]);
  return result;
}
with -mavx2 -mfma -Ofast this is vectorized on x86_64 to
...
vect__9.13_27 = MEM <vector(4) double> [(double *)vectp_a.11_29];
_9 = *_8;
vect__10.14_26 = .FMA (vect__7.10_30, vect__9.13_27, vect__4.7_33);
vect_result_17.15_25 = vect__10.14_26 + vect_result_20.4_36;
...
but ifcvt still shows
_9 = *_8;
_10 = __builtin_fma (_7, _9, _4);
result_17 = _10 + result_20;
still vectorizable_call has IFN_FMA with
  /* First try using an internal function.  */
  code_helper convert_code = MAX_TREE_CODES;
  if (cfn != CFN_LAST
      && (modifier == NONE
          || (modifier == NARROW
              && simple_integer_narrowing (vectype_out, vectype_in,
                                           &convert_code))))
    ifn = vectorizable_internal_function (cfn, callee, vectype_out,
                                          vectype_in);
from CFN_BUILT_IN_FMA
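(One way to look at this yourself is to compile the above with something
like

  gcc -c -Ofast -mavx2 -mfma -fdump-tree-ifcvt -fdump-tree-vect-details t.c

and inspect the ifcvt and vect dumps; the dump flags and file name here
are only for illustration.)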
> Thanks.
>
>
> juzhe.zhong@rivai.ai
>
> From: Richard Biener
> Date: 2023-07-31 20:00
> To: juzhe.zhong@rivai.ai
> CC: richard.sandiford; gcc-patches
> Subject: Re: Re: [PATCH V2] VECT: Support CALL vectorization for COND_LEN_*
> On Mon, 31 Jul 2023, juzhe.zhong@rivai.ai wrote:
>
> > Ok . Thanks Richard.
> >
> > Could you give me a case that SVE can vectorize a reduction with FMA?
> > Meaning it will go into vectorize_call and vectorize FMA into COND_FMA ?
> >
> > I tried many times to reproduce such cases but I failed.
>
> I think you need to use fma from math.h together with -ffast-math
> to get fma.
>
> Richard.
>
> > Thanks.
> >
> >
> > juzhe.zhong@rivai.ai
> >
> > From: Richard Sandiford
> > Date: 2023-07-31 18:19
> > To: Juzhe-Zhong
> > CC: gcc-patches; rguenther
> > Subject: Re: [PATCH V2] VECT: Support CALL vectorization for COND_LEN_*
> > Juzhe-Zhong <juzhe.zhong@rivai.ai> writes:
> > > Hi, Richard and Richi.
> > >
> > > Base on the suggestions from Richard:
> > > https://gcc.gnu.org/pipermail/gcc-patches/2023-July/625396.html
> > >
> > > This patch choose (1) approach that Richard provided, meaning:
> > >
> > > RVV implements cond_* optabs as expanders. RVV therefore supports
> > > both IFN_COND_ADD and IFN_COND_LEN_ADD. No dummy length arguments
> > > are needed at the gimple level.
> > >
> > > Such approach can make codes much cleaner and reasonable.
> > >
> > > Consider this following case:
> > > void foo (float * __restrict a, float * __restrict b, int * __restrict cond, int n)
> > > {
> > > for (int i = 0; i < n; i++)
> > > if (cond[i])
> > > a[i] = b[i] + a[i];
> > > }
> > >
> > >
> > > Output of RISC-V (32-bits) gcc (trunk) (Compiler #3)
> > > <source>:5:21: missed: couldn't vectorize loop
> > > <source>:5:21: missed: not vectorized: control flow in loop.
> > >
> > > ARM SVE:
> > >
> > > ...
> > > mask__27.10_51 = vect__4.9_49 != { 0, ... };
> > > ...
> > > vec_mask_and_55 = loop_mask_49 & mask__27.10_51;
> > > ...
> > > vect__9.17_62 = .COND_ADD (vec_mask_and_55, vect__6.13_56, vect__8.16_60, vect__6.13_56);
> > >
> > > For RVV, we want IR as follows:
> > >
> > > ...
> > > _68 = .SELECT_VL (ivtmp_66, POLY_INT_CST [4, 4]);
> > > ...
> > > mask__27.10_51 = vect__4.9_49 != { 0, ... };
> > > ...
> > > vect__9.17_60 = .COND_LEN_ADD (mask__27.10_51, vect__6.13_55, vect__8.16_59, vect__6.13_55, _68, 0);
> > > ...
> > >
> > > Both len and mask of COND_LEN_ADD are real not dummy.
> > >
> > > This patch has been fully tested in RISC-V port with supporting both COND_* and COND_LEN_*.
> > >
> > > And also, Bootstrap and Regression on X86 passed.
> > >
> > > OK for trunk?
> > >
> > > gcc/ChangeLog:
> > >
> > > * internal-fn.cc (FOR_EACH_LEN_FN_PAIR): New macro.
> > > (get_len_internal_fn): New function.
> > > (CASE): Ditto.
> > > * internal-fn.h (get_len_internal_fn): Ditto.
> > > * tree-vect-stmts.cc (vectorizable_call): Support CALL vectorization with COND_LEN_*.
> > >
> > > ---
> > > gcc/internal-fn.cc | 46 ++++++++++++++++++++++
> > > gcc/internal-fn.h | 1 +
> > > gcc/tree-vect-stmts.cc | 87 +++++++++++++++++++++++++++++++++++++-----
> > > 3 files changed, 125 insertions(+), 9 deletions(-)
> > >
> > > diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
> > > index 8e294286388..379220bebc7 100644
> > > --- a/gcc/internal-fn.cc
> > > +++ b/gcc/internal-fn.cc
> > > @@ -4443,6 +4443,52 @@ get_conditional_internal_fn (internal_fn fn)
> > > }
> > > }
> > >
> > > +/* Invoke T(IFN) for each internal function IFN that also has an
> > > + IFN_COND_LEN_* or IFN_MASK_LEN_* form. */
> > > +#define FOR_EACH_LEN_FN_PAIR(T) \
> > > + T (MASK_LOAD, MASK_LEN_LOAD) \
> > > + T (MASK_STORE, MASK_LEN_STORE) \
> > > + T (MASK_GATHER_LOAD, MASK_LEN_GATHER_LOAD) \
> > > + T (MASK_SCATTER_STORE, MASK_LEN_SCATTER_STORE) \
> > > + T (COND_ADD, COND_LEN_ADD) \
> > > + T (COND_SUB, COND_LEN_SUB) \
> > > + T (COND_MUL, COND_LEN_MUL) \
> > > + T (COND_DIV, COND_LEN_DIV) \
> > > + T (COND_MOD, COND_LEN_MOD) \
> > > + T (COND_RDIV, COND_LEN_RDIV) \
> > > + T (COND_FMIN, COND_LEN_FMIN) \
> > > + T (COND_FMAX, COND_LEN_FMAX) \
> > > + T (COND_MIN, COND_LEN_MIN) \
> > > + T (COND_MAX, COND_LEN_MAX) \
> > > + T (COND_AND, COND_LEN_AND) \
> > > + T (COND_IOR, COND_LEN_IOR) \
> > > + T (COND_XOR, COND_LEN_XOR) \
> > > + T (COND_SHL, COND_LEN_SHL) \
> > > + T (COND_SHR, COND_LEN_SHR) \
> > > + T (COND_NEG, COND_LEN_NEG) \
> > > + T (COND_FMA, COND_LEN_FMA) \
> > > + T (COND_FMS, COND_LEN_FMS) \
> > > + T (COND_FNMA, COND_LEN_FNMA) \
> > > + T (COND_FNMS, COND_LEN_FNMS)
> >
> > With the earlier patch to add DEF_INTERNAL_COND_FN and
> > DEF_INTERNAL_SIGNED_COND_FN, I think we should use those to handle
> > the COND_* cases, rather than putting them in this macro.
> >
> > Thanks,
> > Richard
> >
> >
>
>
--
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
* Re: Re: [PATCH V2] VECT: Support CALL vectorization for COND_LEN_*
2023-07-31 13:33 ` Richard Biener
@ 2023-07-31 13:43 ` 钟居哲
2023-07-31 13:58 ` Richard Biener
0 siblings, 1 reply; 18+ messages in thread
From: 钟居哲 @ 2023-07-31 13:43 UTC (permalink / raw)
To: rguenther; +Cc: richard.sandiford, gcc-patches
Yeah. I have tried this case too.
But this case doesn't need to be vectorized as COND_FMA, am I right?
The thing I wonder is whether this condition:
if (mask_opno >= 0 && reduc_idx >= 0)
or the similar one for len:
if (len_opno >= 0 && reduc_idx >= 0)
is redundant in vectorizable_call.
juzhe.zhong@rivai.ai
From: Richard Biener
Date: 2023-07-31 21:33
To: juzhe.zhong@rivai.ai
CC: richard.sandiford; gcc-patches
Subject: Re: Re: [PATCH V2] VECT: Support CALL vectorization for COND_LEN_*
On Mon, 31 Jul 2023, juzhe.zhong@rivai.ai wrote:
> Hi, Richi.
>
> >> I think you need to use fma from math.h together with -ffast-math
> >>to get fma.
>
> As you said, this is one of the case I tried:
> https://godbolt.org/z/xMzrrv5dT
> GCC failed to vectorize.
>
> Could you help me with this?
double foo (double *a, double *b, double *c)
{
double result = 0.0;
for (int i = 0; i < 1024; ++i)
result += __builtin_fma (a[i], b[i], c[i]);
return result;
}
with -mavx2 -mfma -Ofast this is vectorized on x86_64 to
...
vect__9.13_27 = MEM <vector(4) double> [(double *)vectp_a.11_29];
_9 = *_8;
vect__10.14_26 = .FMA (vect__7.10_30, vect__9.13_27, vect__4.7_33);
vect_result_17.15_25 = vect__10.14_26 + vect_result_20.4_36;
...
but ifcvt still shows
_9 = *_8;
_10 = __builtin_fma (_7, _9, _4);
result_17 = _10 + result_20;
still vectorizable_call has IFN_FMA with
/* First try using an internal function. */
code_helper convert_code = MAX_TREE_CODES;
if (cfn != CFN_LAST
&& (modifier == NONE
|| (modifier == NARROW
&& simple_integer_narrowing (vectype_out, vectype_in,
&convert_code))))
ifn = vectorizable_internal_function (cfn, callee, vectype_out,
vectype_in);
from CFN_BUILT_IN_FMA
> Thanks.
>
>
> juzhe.zhong@rivai.ai
>
> From: Richard Biener
> Date: 2023-07-31 20:00
> To: juzhe.zhong@rivai.ai
> CC: richard.sandiford; gcc-patches
> Subject: Re: Re: [PATCH V2] VECT: Support CALL vectorization for COND_LEN_*
> On Mon, 31 Jul 2023, juzhe.zhong@rivai.ai wrote:
>
> > Ok . Thanks Richard.
> >
> > Could you give me a case that SVE can vectorize a reduction with FMA?
> > Meaning it will go into vectorize_call and vectorize FMA into COND_FMA ?
> >
> > I tried many times to reproduce such cases but I failed.
>
> I think you need to use fma from math.h together with -ffast-math
> to get fma.
>
> Richard.
>
> > Thanks.
> >
> >
> > juzhe.zhong@rivai.ai
> >
> > From: Richard Sandiford
> > Date: 2023-07-31 18:19
> > To: Juzhe-Zhong
> > CC: gcc-patches; rguenther
> > Subject: Re: [PATCH V2] VECT: Support CALL vectorization for COND_LEN_*
> > Juzhe-Zhong <juzhe.zhong@rivai.ai> writes:
> > > Hi, Richard and Richi.
> > >
> > > Base on the suggestions from Richard:
> > > https://gcc.gnu.org/pipermail/gcc-patches/2023-July/625396.html
> > >
> > > This patch choose (1) approach that Richard provided, meaning:
> > >
> > > RVV implements cond_* optabs as expanders. RVV therefore supports
> > > both IFN_COND_ADD and IFN_COND_LEN_ADD. No dummy length arguments
> > > are needed at the gimple level.
> > >
> > > Such approach can make codes much cleaner and reasonable.
> > >
> > > Consider this following case:
> > > void foo (float * __restrict a, float * __restrict b, int * __restrict cond, int n)
> > > {
> > > for (int i = 0; i < n; i++)
> > > if (cond[i])
> > > a[i] = b[i] + a[i];
> > > }
> > >
> > >
> > > Output of RISC-V (32-bits) gcc (trunk) (Compiler #3)
> > > <source>:5:21: missed: couldn't vectorize loop
> > > <source>:5:21: missed: not vectorized: control flow in loop.
> > >
> > > ARM SVE:
> > >
> > > ...
> > > mask__27.10_51 = vect__4.9_49 != { 0, ... };
> > > ...
> > > vec_mask_and_55 = loop_mask_49 & mask__27.10_51;
> > > ...
> > > vect__9.17_62 = .COND_ADD (vec_mask_and_55, vect__6.13_56, vect__8.16_60, vect__6.13_56);
> > >
> > > For RVV, we want IR as follows:
> > >
> > > ...
> > > _68 = .SELECT_VL (ivtmp_66, POLY_INT_CST [4, 4]);
> > > ...
> > > mask__27.10_51 = vect__4.9_49 != { 0, ... };
> > > ...
> > > vect__9.17_60 = .COND_LEN_ADD (mask__27.10_51, vect__6.13_55, vect__8.16_59, vect__6.13_55, _68, 0);
> > > ...
> > >
> > > Both len and mask of COND_LEN_ADD are real not dummy.
> > >
> > > This patch has been fully tested in RISC-V port with supporting both COND_* and COND_LEN_*.
> > >
> > > And also, Bootstrap and Regression on X86 passed.
> > >
> > > OK for trunk?
> > >
> > > gcc/ChangeLog:
> > >
> > > * internal-fn.cc (FOR_EACH_LEN_FN_PAIR): New macro.
> > > (get_len_internal_fn): New function.
> > > (CASE): Ditto.
> > > * internal-fn.h (get_len_internal_fn): Ditto.
> > > * tree-vect-stmts.cc (vectorizable_call): Support CALL vectorization with COND_LEN_*.
> > >
> > > ---
> > > gcc/internal-fn.cc | 46 ++++++++++++++++++++++
> > > gcc/internal-fn.h | 1 +
> > > gcc/tree-vect-stmts.cc | 87 +++++++++++++++++++++++++++++++++++++-----
> > > 3 files changed, 125 insertions(+), 9 deletions(-)
> > >
> > > diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
> > > index 8e294286388..379220bebc7 100644
> > > --- a/gcc/internal-fn.cc
> > > +++ b/gcc/internal-fn.cc
> > > @@ -4443,6 +4443,52 @@ get_conditional_internal_fn (internal_fn fn)
> > > }
> > > }
> > >
> > > +/* Invoke T(IFN) for each internal function IFN that also has an
> > > + IFN_COND_LEN_* or IFN_MASK_LEN_* form. */
> > > +#define FOR_EACH_LEN_FN_PAIR(T) \
> > > + T (MASK_LOAD, MASK_LEN_LOAD) \
> > > + T (MASK_STORE, MASK_LEN_STORE) \
> > > + T (MASK_GATHER_LOAD, MASK_LEN_GATHER_LOAD) \
> > > + T (MASK_SCATTER_STORE, MASK_LEN_SCATTER_STORE) \
> > > + T (COND_ADD, COND_LEN_ADD) \
> > > + T (COND_SUB, COND_LEN_SUB) \
> > > + T (COND_MUL, COND_LEN_MUL) \
> > > + T (COND_DIV, COND_LEN_DIV) \
> > > + T (COND_MOD, COND_LEN_MOD) \
> > > + T (COND_RDIV, COND_LEN_RDIV) \
> > > + T (COND_FMIN, COND_LEN_FMIN) \
> > > + T (COND_FMAX, COND_LEN_FMAX) \
> > > + T (COND_MIN, COND_LEN_MIN) \
> > > + T (COND_MAX, COND_LEN_MAX) \
> > > + T (COND_AND, COND_LEN_AND) \
> > > + T (COND_IOR, COND_LEN_IOR) \
> > > + T (COND_XOR, COND_LEN_XOR) \
> > > + T (COND_SHL, COND_LEN_SHL) \
> > > + T (COND_SHR, COND_LEN_SHR) \
> > > + T (COND_NEG, COND_LEN_NEG) \
> > > + T (COND_FMA, COND_LEN_FMA) \
> > > + T (COND_FMS, COND_LEN_FMS) \
> > > + T (COND_FNMA, COND_LEN_FNMA) \
> > > + T (COND_FNMS, COND_LEN_FNMS)
> >
> > With the earlier patch to add DEF_INTERNAL_COND_FN and
> > DEF_INTERNAL_SIGNED_COND_FN, I think we should use those to handle
> > the COND_* cases, rather than putting them in this macro.
> >
> > Thanks,
> > Richard
> >
> >
>
>
--
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
* Re: Re: [PATCH V2] VECT: Support CALL vectorization for COND_LEN_*
2023-07-31 13:43 ` 钟居哲
@ 2023-07-31 13:58 ` Richard Biener
2023-07-31 14:29 ` 钟居哲
0 siblings, 1 reply; 18+ messages in thread
From: Richard Biener @ 2023-07-31 13:58 UTC (permalink / raw)
To: 钟居哲; +Cc: richard.sandiford, gcc-patches
On Mon, 31 Jul 2023, 钟居哲 wrote:
> Yeah. I have tried this case too.
>
> But this case doesn't need to be vectorized as COND_FMA, am I right?
Only when you enable loop masking. Alternatively use
double foo (double *a, double *b, double *c)
{
  double result = 0.0;
  for (int i = 0; i < 1024; ++i)
    result += i & 1 ? __builtin_fma (a[i], b[i], c[i]) : 0.0;
  return result;
}
but then for me if-conversion produces
iftmp.0_18 = __builtin_fma (_8, _10, _5);
_ifc__43 = _26 ? iftmp.0_18 : 0.0;
with -ffast-math (probably rightfully so). I then get .FMAs
vectorized and .COND_FMA folded.
> The thing I wonder is that whether this condtion:
>
> if (mask_opno >= 0 && reduc_idx >= 0)
>
> or similar as len
> if (len_opno >= 0 && reduc_idx >= 0)
>
> Whether they are redundant in vectorizable_call ?
>
>
> juzhe.zhong@rivai.ai
>
> From: Richard Biener
> Date: 2023-07-31 21:33
> To: juzhe.zhong@rivai.ai
> CC: richard.sandiford; gcc-patches
> Subject: Re: Re: [PATCH V2] VECT: Support CALL vectorization for COND_LEN_*
> On Mon, 31 Jul 2023, juzhe.zhong@rivai.ai wrote:
>
> > Hi, Richi.
> >
> > >> I think you need to use fma from math.h together with -ffast-math
> > >>to get fma.
> >
> > As you said, this is one of the case I tried:
> > https://godbolt.org/z/xMzrrv5dT
> > GCC failed to vectorize.
> >
> > Could you help me with this?
>
> double foo (double *a, double *b, double *c)
> {
> double result = 0.0;
> for (int i = 0; i < 1024; ++i)
> result += __builtin_fma (a[i], b[i], c[i]);
> return result;
> }
>
> with -mavx2 -mfma -Ofast this is vectorized on x86_64 to
>
> ...
> vect__9.13_27 = MEM <vector(4) double> [(double *)vectp_a.11_29];
> _9 = *_8;
> vect__10.14_26 = .FMA (vect__7.10_30, vect__9.13_27, vect__4.7_33);
> vect_result_17.15_25 = vect__10.14_26 + vect_result_20.4_36;
> ...
>
> but ifcvt still shows
>
> _9 = *_8;
> _10 = __builtin_fma (_7, _9, _4);
> result_17 = _10 + result_20;
>
> still vectorizable_call has IFN_FMA with
>
> /* First try using an internal function. */
> code_helper convert_code = MAX_TREE_CODES;
> if (cfn != CFN_LAST
> && (modifier == NONE
> || (modifier == NARROW
> && simple_integer_narrowing (vectype_out, vectype_in,
> &convert_code))))
> ifn = vectorizable_internal_function (cfn, callee, vectype_out,
> vectype_in);
>
> from CFN_BUILT_IN_FMA
>
>
>
> > Thanks.
> >
> >
> > juzhe.zhong@rivai.ai
> >
> > From: Richard Biener
> > Date: 2023-07-31 20:00
> > To: juzhe.zhong@rivai.ai
> > CC: richard.sandiford; gcc-patches
> > Subject: Re: Re: [PATCH V2] VECT: Support CALL vectorization for COND_LEN_*
> > On Mon, 31 Jul 2023, juzhe.zhong@rivai.ai wrote:
> >
> > > Ok . Thanks Richard.
> > >
> > > Could you give me a case that SVE can vectorize a reduction with FMA?
> > > Meaning it will go into vectorize_call and vectorize FMA into COND_FMA ?
> > >
> > > I tried many times to reproduce such cases but I failed.
> >
> > I think you need to use fma from math.h together with -ffast-math
> > to get fma.
> >
> > Richard.
> >
> > > Thanks.
> > >
> > >
> > > juzhe.zhong@rivai.ai
> > >
> > > From: Richard Sandiford
> > > Date: 2023-07-31 18:19
> > > To: Juzhe-Zhong
> > > CC: gcc-patches; rguenther
> > > Subject: Re: [PATCH V2] VECT: Support CALL vectorization for COND_LEN_*
> > > Juzhe-Zhong <juzhe.zhong@rivai.ai> writes:
> > > > Hi, Richard and Richi.
> > > >
> > > > Base on the suggestions from Richard:
> > > > https://gcc.gnu.org/pipermail/gcc-patches/2023-July/625396.html
> > > >
> > > > This patch choose (1) approach that Richard provided, meaning:
> > > >
> > > > RVV implements cond_* optabs as expanders. RVV therefore supports
> > > > both IFN_COND_ADD and IFN_COND_LEN_ADD. No dummy length arguments
> > > > are needed at the gimple level.
> > > >
> > > > Such approach can make codes much cleaner and reasonable.
> > > >
> > > > Consider this following case:
> > > > void foo (float * __restrict a, float * __restrict b, int * __restrict cond, int n)
> > > > {
> > > > for (int i = 0; i < n; i++)
> > > > if (cond[i])
> > > > a[i] = b[i] + a[i];
> > > > }
> > > >
> > > >
> > > > Output of RISC-V (32-bits) gcc (trunk) (Compiler #3)
> > > > <source>:5:21: missed: couldn't vectorize loop
> > > > <source>:5:21: missed: not vectorized: control flow in loop.
> > > >
> > > > ARM SVE:
> > > >
> > > > ...
> > > > mask__27.10_51 = vect__4.9_49 != { 0, ... };
> > > > ...
> > > > vec_mask_and_55 = loop_mask_49 & mask__27.10_51;
> > > > ...
> > > > vect__9.17_62 = .COND_ADD (vec_mask_and_55, vect__6.13_56, vect__8.16_60, vect__6.13_56);
> > > >
> > > > For RVV, we want IR as follows:
> > > >
> > > > ...
> > > > _68 = .SELECT_VL (ivtmp_66, POLY_INT_CST [4, 4]);
> > > > ...
> > > > mask__27.10_51 = vect__4.9_49 != { 0, ... };
> > > > ...
> > > > vect__9.17_60 = .COND_LEN_ADD (mask__27.10_51, vect__6.13_55, vect__8.16_59, vect__6.13_55, _68, 0);
> > > > ...
> > > >
> > > > Both len and mask of COND_LEN_ADD are real not dummy.
> > > >
> > > > This patch has been fully tested in RISC-V port with supporting both COND_* and COND_LEN_*.
> > > >
> > > > And also, Bootstrap and Regression on X86 passed.
> > > >
> > > > OK for trunk?
> > > >
> > > > gcc/ChangeLog:
> > > >
> > > > * internal-fn.cc (FOR_EACH_LEN_FN_PAIR): New macro.
> > > > (get_len_internal_fn): New function.
> > > > (CASE): Ditto.
> > > > * internal-fn.h (get_len_internal_fn): Ditto.
> > > > * tree-vect-stmts.cc (vectorizable_call): Support CALL vectorization with COND_LEN_*.
> > > >
> > > > ---
> > > > gcc/internal-fn.cc | 46 ++++++++++++++++++++++
> > > > gcc/internal-fn.h | 1 +
> > > > gcc/tree-vect-stmts.cc | 87 +++++++++++++++++++++++++++++++++++++-----
> > > > 3 files changed, 125 insertions(+), 9 deletions(-)
> > > >
> > > > diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
> > > > index 8e294286388..379220bebc7 100644
> > > > --- a/gcc/internal-fn.cc
> > > > +++ b/gcc/internal-fn.cc
> > > > @@ -4443,6 +4443,52 @@ get_conditional_internal_fn (internal_fn fn)
> > > > }
> > > > }
> > > >
> > > > +/* Invoke T(IFN) for each internal function IFN that also has an
> > > > + IFN_COND_LEN_* or IFN_MASK_LEN_* form. */
> > > > +#define FOR_EACH_LEN_FN_PAIR(T) \
> > > > + T (MASK_LOAD, MASK_LEN_LOAD) \
> > > > + T (MASK_STORE, MASK_LEN_STORE) \
> > > > + T (MASK_GATHER_LOAD, MASK_LEN_GATHER_LOAD) \
> > > > + T (MASK_SCATTER_STORE, MASK_LEN_SCATTER_STORE) \
> > > > + T (COND_ADD, COND_LEN_ADD) \
> > > > + T (COND_SUB, COND_LEN_SUB) \
> > > > + T (COND_MUL, COND_LEN_MUL) \
> > > > + T (COND_DIV, COND_LEN_DIV) \
> > > > + T (COND_MOD, COND_LEN_MOD) \
> > > > + T (COND_RDIV, COND_LEN_RDIV) \
> > > > + T (COND_FMIN, COND_LEN_FMIN) \
> > > > + T (COND_FMAX, COND_LEN_FMAX) \
> > > > + T (COND_MIN, COND_LEN_MIN) \
> > > > + T (COND_MAX, COND_LEN_MAX) \
> > > > + T (COND_AND, COND_LEN_AND) \
> > > > + T (COND_IOR, COND_LEN_IOR) \
> > > > + T (COND_XOR, COND_LEN_XOR) \
> > > > + T (COND_SHL, COND_LEN_SHL) \
> > > > + T (COND_SHR, COND_LEN_SHR) \
> > > > + T (COND_NEG, COND_LEN_NEG) \
> > > > + T (COND_FMA, COND_LEN_FMA) \
> > > > + T (COND_FMS, COND_LEN_FMS) \
> > > > + T (COND_FNMA, COND_LEN_FNMA) \
> > > > + T (COND_FNMS, COND_LEN_FNMS)
> > >
> > > With the earlier patch to add DEF_INTERNAL_COND_FN and
> > > DEF_INTERNAL_SIGNED_COND_FN, I think we should use those to handle
> > > the COND_* cases, rather than putting them in this macro.
> > >
> > > Thanks,
> > > Richard
> > >
> > >
> >
> >
>
>
--
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
* Re: Re: [PATCH V2] VECT: Support CALL vectorization for COND_LEN_*
2023-07-31 13:58 ` Richard Biener
@ 2023-07-31 14:29 ` 钟居哲
2023-08-02 7:49 ` Richard Biener
0 siblings, 1 reply; 18+ messages in thread
From: 钟居哲 @ 2023-07-31 14:29 UTC (permalink / raw)
To: rguenther; +Cc: richard.sandiford, gcc-patches
Oh, thanks a lot.
I can test it in the RISC-V backend now.
But I have another question:
>> I'm a bit confused (but also by the existing mask code), whether
>>vect_nargs needs adjustment depends on the IFN in the IL we analyze.
>>If if-conversion recognizes a .COND_ADD then we need to add nothing
>>for masking (that is, ifn == cond_fn already). In your code above
>>you either use cond_len_fn or get_len_internal_fn (cond_fn) but
>>isn't that the very same?! So how come you in one case add two
>>and in the other add four args?
>>Please make sure to place gcc_unreachable () in each arm and check
>>you have test coverage. I believe that the else arm is unreachable
>>but when you vectorize .FMA you will need to add 4 and when you
>>vectorize .COND_FMA you will need to add two arguments (as said,
>>no idea why we special case reduc_idx >= 0 at the moment).
Do you mean I should add a gcc_unreachable in the else arm, like this:
  if (len_loop_p)
    {
      if (len_opno >= 0)
        {
          ifn = cond_len_fn;
          /* COND_* -> COND_LEN_* takes 2 extra arguments:LEN,BIAS. */
          vect_nargs += 2;
        }
      else if (reduc_idx >= 0)
        {
          /* FMA -> COND_LEN_FMA takes 4 extra arguments:MASK,ELSE,LEN,BIAS. */
          ifn = get_len_internal_fn (cond_fn);
          vect_nargs += 4;
        }
      else
        gcc_unreachable ();
    }
Thanks.
juzhe.zhong@rivai.ai
From: Richard Biener
Date: 2023-07-31 21:58
To: 钟居哲
CC: richard.sandiford; gcc-patches
Subject: Re: Re: [PATCH V2] VECT: Support CALL vectorization for COND_LEN_*
On Mon, 31 Jul 2023, 钟居哲 wrote:
> Yeah. I have tried this case too.
>
> But this case doesn't need to be vectorized as COND_FMA, am I right?
Only when you enable loop masking. Alternatively use
double foo (double *a, double *b, double *c)
{
double result = 0.0;
for (int i = 0; i < 1024; ++i)
result += i & 1 ? __builtin_fma (a[i], b[i], c[i]) : 0.0;
return result;
}
but then for me if-conversion produces
iftmp.0_18 = __builtin_fma (_8, _10, _5);
_ifc__43 = _26 ? iftmp.0_18 : 0.0;
with -ffast-math (probably rightfully so). I then get .FMAs
vectorized and .COND_FMA folded.
> The thing I wonder is that whether this condtion:
>
> if (mask_opno >= 0 && reduc_idx >= 0)
>
> or similar as len
> if (len_opno >= 0 && reduc_idx >= 0)
>
> Whether they are redundant in vectorizable_call ?
>
>
> juzhe.zhong@rivai.ai
>
> From: Richard Biener
> Date: 2023-07-31 21:33
> To: juzhe.zhong@rivai.ai
> CC: richard.sandiford; gcc-patches
> Subject: Re: Re: [PATCH V2] VECT: Support CALL vectorization for COND_LEN_*
> On Mon, 31 Jul 2023, juzhe.zhong@rivai.ai wrote:
>
> > Hi, Richi.
> >
> > >> I think you need to use fma from math.h together with -ffast-math
> > >>to get fma.
> >
> > As you said, this is one of the case I tried:
> > https://godbolt.org/z/xMzrrv5dT
> > GCC failed to vectorize.
> >
> > Could you help me with this?
>
> double foo (double *a, double *b, double *c)
> {
> double result = 0.0;
> for (int i = 0; i < 1024; ++i)
> result += __builtin_fma (a[i], b[i], c[i]);
> return result;
> }
>
> with -mavx2 -mfma -Ofast this is vectorized on x86_64 to
>
> ...
> vect__9.13_27 = MEM <vector(4) double> [(double *)vectp_a.11_29];
> _9 = *_8;
> vect__10.14_26 = .FMA (vect__7.10_30, vect__9.13_27, vect__4.7_33);
> vect_result_17.15_25 = vect__10.14_26 + vect_result_20.4_36;
> ...
>
> but ifcvt still shows
>
> _9 = *_8;
> _10 = __builtin_fma (_7, _9, _4);
> result_17 = _10 + result_20;
>
> still vectorizable_call has IFN_FMA with
>
> /* First try using an internal function. */
> code_helper convert_code = MAX_TREE_CODES;
> if (cfn != CFN_LAST
> && (modifier == NONE
> || (modifier == NARROW
> && simple_integer_narrowing (vectype_out, vectype_in,
> &convert_code))))
> ifn = vectorizable_internal_function (cfn, callee, vectype_out,
> vectype_in);
>
> from CFN_BUILT_IN_FMA
>
>
>
> > Thanks.
> >
> >
> > juzhe.zhong@rivai.ai
> >
> > From: Richard Biener
> > Date: 2023-07-31 20:00
> > To: juzhe.zhong@rivai.ai
> > CC: richard.sandiford; gcc-patches
> > Subject: Re: Re: [PATCH V2] VECT: Support CALL vectorization for COND_LEN_*
> > On Mon, 31 Jul 2023, juzhe.zhong@rivai.ai wrote:
> >
> > > Ok . Thanks Richard.
> > >
> > > Could you give me a case that SVE can vectorize a reduction with FMA?
> > > Meaning it will go into vectorize_call and vectorize FMA into COND_FMA ?
> > >
> > > I tried many times to reproduce such cases but I failed.
> >
> > I think you need to use fma from math.h together with -ffast-math
> > to get fma.
> >
> > Richard.
> >
> > > Thanks.
> > >
> > >
> > > juzhe.zhong@rivai.ai
> > >
> > > From: Richard Sandiford
> > > Date: 2023-07-31 18:19
> > > To: Juzhe-Zhong
> > > CC: gcc-patches; rguenther
> > > Subject: Re: [PATCH V2] VECT: Support CALL vectorization for COND_LEN_*
> > > Juzhe-Zhong <juzhe.zhong@rivai.ai> writes:
> > > > Hi, Richard and Richi.
> > > >
> > > > Base on the suggestions from Richard:
> > > > https://gcc.gnu.org/pipermail/gcc-patches/2023-July/625396.html
> > > >
> > > > This patch choose (1) approach that Richard provided, meaning:
> > > >
> > > > RVV implements cond_* optabs as expanders. RVV therefore supports
> > > > both IFN_COND_ADD and IFN_COND_LEN_ADD. No dummy length arguments
> > > > are needed at the gimple level.
> > > >
> > > > Such approach can make codes much cleaner and reasonable.
> > > >
> > > > Consider this following case:
> > > > void foo (float * __restrict a, float * __restrict b, int * __restrict cond, int n)
> > > > {
> > > > for (int i = 0; i < n; i++)
> > > > if (cond[i])
> > > > a[i] = b[i] + a[i];
> > > > }
> > > >
> > > >
> > > > Output of RISC-V (32-bits) gcc (trunk) (Compiler #3)
> > > > <source>:5:21: missed: couldn't vectorize loop
> > > > <source>:5:21: missed: not vectorized: control flow in loop.
> > > >
> > > > ARM SVE:
> > > >
> > > > ...
> > > > mask__27.10_51 = vect__4.9_49 != { 0, ... };
> > > > ...
> > > > vec_mask_and_55 = loop_mask_49 & mask__27.10_51;
> > > > ...
> > > > vect__9.17_62 = .COND_ADD (vec_mask_and_55, vect__6.13_56, vect__8.16_60, vect__6.13_56);
> > > >
> > > > For RVV, we want IR as follows:
> > > >
> > > > ...
> > > > _68 = .SELECT_VL (ivtmp_66, POLY_INT_CST [4, 4]);
> > > > ...
> > > > mask__27.10_51 = vect__4.9_49 != { 0, ... };
> > > > ...
> > > > vect__9.17_60 = .COND_LEN_ADD (mask__27.10_51, vect__6.13_55, vect__8.16_59, vect__6.13_55, _68, 0);
> > > > ...
> > > >
> > > > Both len and mask of COND_LEN_ADD are real not dummy.
> > > >
> > > > This patch has been fully tested in RISC-V port with supporting both COND_* and COND_LEN_*.
> > > >
> > > > And also, Bootstrap and Regression on X86 passed.
> > > >
> > > > OK for trunk?
> > > >
> > > > gcc/ChangeLog:
> > > >
> > > > * internal-fn.cc (FOR_EACH_LEN_FN_PAIR): New macro.
> > > > (get_len_internal_fn): New function.
> > > > (CASE): Ditto.
> > > > * internal-fn.h (get_len_internal_fn): Ditto.
> > > > * tree-vect-stmts.cc (vectorizable_call): Support CALL vectorization with COND_LEN_*.
> > > >
> > > > ---
> > > > gcc/internal-fn.cc | 46 ++++++++++++++++++++++
> > > > gcc/internal-fn.h | 1 +
> > > > gcc/tree-vect-stmts.cc | 87 +++++++++++++++++++++++++++++++++++++-----
> > > > 3 files changed, 125 insertions(+), 9 deletions(-)
> > > >
> > > > diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
> > > > index 8e294286388..379220bebc7 100644
> > > > --- a/gcc/internal-fn.cc
> > > > +++ b/gcc/internal-fn.cc
> > > > @@ -4443,6 +4443,52 @@ get_conditional_internal_fn (internal_fn fn)
> > > > }
> > > > }
> > > >
> > > > +/* Invoke T(IFN) for each internal function IFN that also has an
> > > > + IFN_COND_LEN_* or IFN_MASK_LEN_* form. */
> > > > +#define FOR_EACH_LEN_FN_PAIR(T) \
> > > > + T (MASK_LOAD, MASK_LEN_LOAD) \
> > > > + T (MASK_STORE, MASK_LEN_STORE) \
> > > > + T (MASK_GATHER_LOAD, MASK_LEN_GATHER_LOAD) \
> > > > + T (MASK_SCATTER_STORE, MASK_LEN_SCATTER_STORE) \
> > > > + T (COND_ADD, COND_LEN_ADD) \
> > > > + T (COND_SUB, COND_LEN_SUB) \
> > > > + T (COND_MUL, COND_LEN_MUL) \
> > > > + T (COND_DIV, COND_LEN_DIV) \
> > > > + T (COND_MOD, COND_LEN_MOD) \
> > > > + T (COND_RDIV, COND_LEN_RDIV) \
> > > > + T (COND_FMIN, COND_LEN_FMIN) \
> > > > + T (COND_FMAX, COND_LEN_FMAX) \
> > > > + T (COND_MIN, COND_LEN_MIN) \
> > > > + T (COND_MAX, COND_LEN_MAX) \
> > > > + T (COND_AND, COND_LEN_AND) \
> > > > + T (COND_IOR, COND_LEN_IOR) \
> > > > + T (COND_XOR, COND_LEN_XOR) \
> > > > + T (COND_SHL, COND_LEN_SHL) \
> > > > + T (COND_SHR, COND_LEN_SHR) \
> > > > + T (COND_NEG, COND_LEN_NEG) \
> > > > + T (COND_FMA, COND_LEN_FMA) \
> > > > + T (COND_FMS, COND_LEN_FMS) \
> > > > + T (COND_FNMA, COND_LEN_FNMA) \
> > > > + T (COND_FNMS, COND_LEN_FNMS)
> > >
> > > With the earlier patch to add DEF_INTERNAL_COND_FN and
> > > DEF_INTERNAL_SIGNED_COND_FN, I think we should use those to handle
> > > the COND_* cases, rather than putting them in this macro.
> > >
> > > Thanks,
> > > Richard
> > >
> > >
> >
> >
>
>
--
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
* Re: Re: [PATCH V2] VECT: Support CALL vectorization for COND_LEN_*
2023-07-31 14:29 ` 钟居哲
@ 2023-08-02 7:49 ` Richard Biener
2023-08-02 8:29 ` juzhe.zhong
0 siblings, 1 reply; 18+ messages in thread
From: Richard Biener @ 2023-08-02 7:49 UTC (permalink / raw)
To: 钟居哲; +Cc: richard.sandiford, gcc-patches
On Mon, 31 Jul 2023, 钟居哲 wrote:
> Oh, Thanks a lot.
> I can test it in RISC-V backend now.
>
> But I have another questions:
> >> I'm a bit confused (but also by the existing mask code), whether
> >>vect_nargs needs adjustment depends on the IFN in the IL we analyze.
> >>If if-conversion recognizes a .COND_ADD then we need to add nothing
> >>for masking (that is, ifn == cond_fn already). In your code above
> >>you either use cond_len_fn or get_len_internal_fn (cond_fn) but
> >>isn't that the very same?! So how come you in one case add two
> >>and in the other add four args?
> >>Please make sure to place gcc_unreachable () in each arm and check
> >>you have test coverage. I believe that the else arm is unreachable
> >>but when you vectorize .FMA you will need to add 4 and when you
> >>vectorize .COND_FMA you will need to add two arguments (as said,
> >>no idea why we special case reduc_idx >= 0 at the moment).
>
> Do you mean I add gcc_unreachable in else like this:
>
> if (len_loop_p)
> {
> if (len_opno >= 0)
> {
> ifn = cond_len_fn;
> /* COND_* -> COND_LEN_* takes 2 extra arguments:LEN,BIAS. */
> vect_nargs += 2;
> }
> else if (reduc_idx >= 0)
> {
> /* FMA -> COND_LEN_FMA takes 4 extra arguments:MASK,ELSE,LEN,BIAS. */
> ifn = get_len_internal_fn (cond_fn);
> vect_nargs += 4;
no, a gcc_unreachable () here. That is, make sure you have test coverage
for the above two cases (to me the len_opno >= 0 case is obvious)
> }
> else
> gcc_unreachable ();
> }
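For reference, the argument arithmetic being discussed, with the .COND_FMA
and .COND_LEN_FMA shapes inferred by analogy with the .COND_ADD and
.COND_LEN_ADD calls shown earlier in the thread (treat them as an assumption,
not a quote from the patch):

  .FMA          (a, b, c)                         /* 3 operands */
  .COND_FMA     (mask, a, b, c, else)             /* + MASK, ELSE */
  .COND_LEN_FMA (mask, a, b, c, else, len, bias)  /* + MASK, ELSE, LEN, BIAS */

So if if-conversion already produced a .COND_* call, only LEN and BIAS need to
be appended (+2); turning a plain .FMA into a length-controlled call needs
MASK, ELSE, LEN and BIAS (+4), which is the distinction behind the two arms
above.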
>
> Thanks.
>
>
> juzhe.zhong@rivai.ai
>
> From: Richard Biener
> Date: 2023-07-31 21:58
> To: ???
> CC: richard.sandiford; gcc-patches
> Subject: Re: Re: [PATCH V2] VECT: Support CALL vectorization for COND_LEN_*
> On Mon, 31 Jul 2023, ??? wrote:
>
> > Yeah. I have tried this case too.
> >
> > But this case doesn't need to be vectorized as COND_FMA, am I right?
>
> Only when you enable loop masking. Alternatively use
>
> double foo (double *a, double *b, double *c)
> {
> double result = 0.0;
> for (int i = 0; i < 1024; ++i)
> result += i & 1 ? __builtin_fma (a[i], b[i], c[i]) : 0.0;
> return result;
> }
>
> but then for me if-conversion produces
>
> iftmp.0_18 = __builtin_fma (_8, _10, _5);
> _ifc__43 = _26 ? iftmp.0_18 : 0.0;
>
> with -ffast-math (probably rightfully so). I then get .FMAs
> vectorized and .COND_FMA folded.
>
> > The thing I wonder is that whether this condtion:
> >
> > if (mask_opno >= 0 && reduc_idx >= 0)
> >
> > or similar as len
> > if (len_opno >= 0 && reduc_idx >= 0)
> >
> > Whether they are redundant in vectorizable_call ?
> >
> >
> > juzhe.zhong@rivai.ai
> >
> > From: Richard Biener
> > Date: 2023-07-31 21:33
> > To: juzhe.zhong@rivai.ai
> > CC: richard.sandiford; gcc-patches
> > Subject: Re: Re: [PATCH V2] VECT: Support CALL vectorization for COND_LEN_*
> > On Mon, 31 Jul 2023, juzhe.zhong@rivai.ai wrote:
> >
> > > Hi, Richi.
> > >
> > > >> I think you need to use fma from math.h together with -ffast-math
> > > >>to get fma.
> > >
> > > As you said, this is one of the case I tried:
> > > https://godbolt.org/z/xMzrrv5dT
> > > GCC failed to vectorize.
> > >
> > > Could you help me with this?
> >
> > double foo (double *a, double *b, double *c)
> > {
> > double result = 0.0;
> > for (int i = 0; i < 1024; ++i)
> > result += __builtin_fma (a[i], b[i], c[i]);
> > return result;
> > }
> >
> > with -mavx2 -mfma -Ofast this is vectorized on x86_64 to
> >
> > ...
> > vect__9.13_27 = MEM <vector(4) double> [(double *)vectp_a.11_29];
> > _9 = *_8;
> > vect__10.14_26 = .FMA (vect__7.10_30, vect__9.13_27, vect__4.7_33);
> > vect_result_17.15_25 = vect__10.14_26 + vect_result_20.4_36;
> > ...
> >
> > but ifcvt still shows
> >
> > _9 = *_8;
> > _10 = __builtin_fma (_7, _9, _4);
> > result_17 = _10 + result_20;
> >
> > still vectorizable_call has IFN_FMA with
> >
> > /* First try using an internal function. */
> > code_helper convert_code = MAX_TREE_CODES;
> > if (cfn != CFN_LAST
> > && (modifier == NONE
> > || (modifier == NARROW
> > && simple_integer_narrowing (vectype_out, vectype_in,
> > &convert_code))))
> > ifn = vectorizable_internal_function (cfn, callee, vectype_out,
> > vectype_in);
> >
> > from CFN_BUILT_IN_FMA
> >
> >
> >
> > > Thanks.
> > >
> > >
> > > juzhe.zhong@rivai.ai
> > >
> > > From: Richard Biener
> > > Date: 2023-07-31 20:00
> > > To: juzhe.zhong@rivai.ai
> > > CC: richard.sandiford; gcc-patches
> > > Subject: Re: Re: [PATCH V2] VECT: Support CALL vectorization for COND_LEN_*
> > > On Mon, 31 Jul 2023, juzhe.zhong@rivai.ai wrote:
> > >
> > > > Ok . Thanks Richard.
> > > >
> > > > Could you give me a case that SVE can vectorize a reduction with FMA?
> > > > Meaning it will go into vectorize_call and vectorize FMA into COND_FMA ?
> > > >
> > > > I tried many times to reproduce such cases but I failed.
> > >
> > > I think you need to use fma from math.h together with -ffast-math
> > > to get fma.
> > >
> > > Richard.
> > >
> > > > Thanks.
> > > >
> > > >
> > > > juzhe.zhong@rivai.ai
> > > >
> > > > From: Richard Sandiford
> > > > Date: 2023-07-31 18:19
> > > > To: Juzhe-Zhong
> > > > CC: gcc-patches; rguenther
> > > > Subject: Re: [PATCH V2] VECT: Support CALL vectorization for COND_LEN_*
> > > > Juzhe-Zhong <juzhe.zhong@rivai.ai> writes:
> > > > > Hi, Richard and Richi.
> > > > >
> > > > > Base on the suggestions from Richard:
> > > > > https://gcc.gnu.org/pipermail/gcc-patches/2023-July/625396.html
> > > > >
> > > > > This patch choose (1) approach that Richard provided, meaning:
> > > > >
> > > > > RVV implements cond_* optabs as expanders. RVV therefore supports
> > > > > both IFN_COND_ADD and IFN_COND_LEN_ADD. No dummy length arguments
> > > > > are needed at the gimple level.
> > > > >
> > > > > Such approach can make codes much cleaner and reasonable.
> > > > >
> > > > > Consider this following case:
> > > > > void foo (float * __restrict a, float * __restrict b, int * __restrict cond, int n)
> > > > > {
> > > > > for (int i = 0; i < n; i++)
> > > > > if (cond[i])
> > > > > a[i] = b[i] + a[i];
> > > > > }
> > > > >
> > > > >
> > > > > Output of RISC-V (32-bits) gcc (trunk) (Compiler #3)
> > > > > <source>:5:21: missed: couldn't vectorize loop
> > > > > <source>:5:21: missed: not vectorized: control flow in loop.
> > > > >
> > > > > ARM SVE:
> > > > >
> > > > > ...
> > > > > mask__27.10_51 = vect__4.9_49 != { 0, ... };
> > > > > ...
> > > > > vec_mask_and_55 = loop_mask_49 & mask__27.10_51;
> > > > > ...
> > > > > vect__9.17_62 = .COND_ADD (vec_mask_and_55, vect__6.13_56, vect__8.16_60, vect__6.13_56);
> > > > >
> > > > > For RVV, we want IR as follows:
> > > > >
> > > > > ...
> > > > > _68 = .SELECT_VL (ivtmp_66, POLY_INT_CST [4, 4]);
> > > > > ...
> > > > > mask__27.10_51 = vect__4.9_49 != { 0, ... };
> > > > > ...
> > > > > vect__9.17_60 = .COND_LEN_ADD (mask__27.10_51, vect__6.13_55, vect__8.16_59, vect__6.13_55, _68, 0);
> > > > > ...
> > > > >
> > > > > Both len and mask of COND_LEN_ADD are real not dummy.
> > > > >
> > > > > This patch has been fully tested in RISC-V port with supporting both COND_* and COND_LEN_*.
> > > > >
> > > > > And also, Bootstrap and Regression on X86 passed.
> > > > >
> > > > > OK for trunk?
> > > > >
> > > > > gcc/ChangeLog:
> > > > >
> > > > > * internal-fn.cc (FOR_EACH_LEN_FN_PAIR): New macro.
> > > > > (get_len_internal_fn): New function.
> > > > > (CASE): Ditto.
> > > > > * internal-fn.h (get_len_internal_fn): Ditto.
> > > > > * tree-vect-stmts.cc (vectorizable_call): Support CALL vectorization with COND_LEN_*.
> > > > >
> > > > > ---
> > > > > gcc/internal-fn.cc | 46 ++++++++++++++++++++++
> > > > > gcc/internal-fn.h | 1 +
> > > > > gcc/tree-vect-stmts.cc | 87 +++++++++++++++++++++++++++++++++++++-----
> > > > > 3 files changed, 125 insertions(+), 9 deletions(-)
> > > > >
> > > > > diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
> > > > > index 8e294286388..379220bebc7 100644
> > > > > --- a/gcc/internal-fn.cc
> > > > > +++ b/gcc/internal-fn.cc
> > > > > @@ -4443,6 +4443,52 @@ get_conditional_internal_fn (internal_fn fn)
> > > > > }
> > > > > }
> > > > >
> > > > > +/* Invoke T(IFN) for each internal function IFN that also has an
> > > > > + IFN_COND_LEN_* or IFN_MASK_LEN_* form. */
> > > > > +#define FOR_EACH_LEN_FN_PAIR(T) \
> > > > > + T (MASK_LOAD, MASK_LEN_LOAD) \
> > > > > + T (MASK_STORE, MASK_LEN_STORE) \
> > > > > + T (MASK_GATHER_LOAD, MASK_LEN_GATHER_LOAD) \
> > > > > + T (MASK_SCATTER_STORE, MASK_LEN_SCATTER_STORE) \
> > > > > + T (COND_ADD, COND_LEN_ADD) \
> > > > > + T (COND_SUB, COND_LEN_SUB) \
> > > > > + T (COND_MUL, COND_LEN_MUL) \
> > > > > + T (COND_DIV, COND_LEN_DIV) \
> > > > > + T (COND_MOD, COND_LEN_MOD) \
> > > > > + T (COND_RDIV, COND_LEN_RDIV) \
> > > > > + T (COND_FMIN, COND_LEN_FMIN) \
> > > > > + T (COND_FMAX, COND_LEN_FMAX) \
> > > > > + T (COND_MIN, COND_LEN_MIN) \
> > > > > + T (COND_MAX, COND_LEN_MAX) \
> > > > > + T (COND_AND, COND_LEN_AND) \
> > > > > + T (COND_IOR, COND_LEN_IOR) \
> > > > > + T (COND_XOR, COND_LEN_XOR) \
> > > > > + T (COND_SHL, COND_LEN_SHL) \
> > > > > + T (COND_SHR, COND_LEN_SHR) \
> > > > > + T (COND_NEG, COND_LEN_NEG) \
> > > > > + T (COND_FMA, COND_LEN_FMA) \
> > > > > + T (COND_FMS, COND_LEN_FMS) \
> > > > > + T (COND_FNMA, COND_LEN_FNMA) \
> > > > > + T (COND_FNMS, COND_LEN_FNMS)
> > > >
> > > > With the earlier patch to add DEF_INTERNAL_COND_FN and
> > > > DEF_INTERNAL_SIGNED_COND_FN, I think we should use those to handle
> > > > the COND_* cases, rather than putting them in this macro.
> > > >
> > > > Thanks,
> > > > Richard
> > > >
> > > >
> > >
> > >
> >
> >
>
>
--
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Re: [PATCH V2] VECT: Support CALL vectorization for COND_LEN_*
2023-08-02 7:49 ` Richard Biener
@ 2023-08-02 8:29 ` juzhe.zhong
2023-08-02 8:33 ` Richard Biener
0 siblings, 1 reply; 18+ messages in thread
From: juzhe.zhong @ 2023-08-02 8:29 UTC (permalink / raw)
To: rguenther; +Cc: richard.sandiford, gcc-patches
Thanks so much, Richard.
Forgive me for asking yet another question :)
Does the following code look correct to you?
+      if (len_loop_p)
+        {
+          if (len_opno >= 0)
+            {
+              ifn = cond_len_fn;
+              /* COND_* -> COND_LEN_* takes 2 extra arguments:LEN,BIAS. */
+              vect_nargs += 2;
+            }
+          else if (reduc_idx >= 0)
+            gcc_unreachable ();
+        }
Thanks.
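For the test coverage Richi asks for on the len_opno >= 0 arm, a minimal
dejagnu-style sketch could look like the one below; the options and the dump
scan are illustrative assumptions (extra flags may be needed to enable RVV
autovectorization on the tree being tested), not part of the posted patch:

/* { dg-do compile } */
/* { dg-options "-march=rv64gcv -mabi=lp64d -O3 -fdump-tree-vect-details" } */

void
foo (float *__restrict a, float *__restrict b, int *__restrict cond, int n)
{
  for (int i = 0; i < n; i++)
    if (cond[i])
      a[i] = b[i] + a[i];
}

/* { dg-final { scan-tree-dump {\.COND_LEN_ADD} "vect" } } */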
juzhe.zhong@rivai.ai
From: Richard Biener
Date: 2023-08-02 15:49
To: 钟居哲
CC: richard.sandiford; gcc-patches
Subject: Re: Re: [PATCH V2] VECT: Support CALL vectorization for COND_LEN_*
On Mon, 31 Jul 2023, ??? wrote:
> Oh, Thanks a lot.
> I can test it in RISC-V backend now.
>
> But I have another questions:
> >> I'm a bit confused (but also by the existing mask code), whether
> >>vect_nargs needs adjustment depends on the IFN in the IL we analyze.
> >>If if-conversion recognizes a .COND_ADD then we need to add nothing
> >>for masking (that is, ifn == cond_fn already). In your code above
> >>you either use cond_len_fn or get_len_internal_fn (cond_fn) but
> >>isn't that the very same?! So how come you in one case add two
> >>and in the other add four args?
> >>Please make sure to place gcc_unreachable () in each arm and check
> >>you have test coverage. I believe that the else arm is unreachable
> >>but when you vectorize .FMA you will need to add 4 and when you
> >>vectorize .COND_FMA you will need to add two arguments (as said,
> >>no idea why we special case reduc_idx >= 0 at the moment).
>
> Do you mean I add gcc_unreachable in else like this:
>
> if (len_loop_p)
> {
> if (len_opno >= 0)
> {
> ifn = cond_len_fn;
> /* COND_* -> COND_LEN_* takes 2 extra arguments:LEN,BIAS. */
> vect_nargs += 2;
> }
> else if (reduc_idx >= 0)
> {
> /* FMA -> COND_LEN_FMA takes 4 extra arguments:MASK,ELSE,LEN,BIAS. */
> ifn = get_len_internal_fn (cond_fn);
> vect_nargs += 4;
no, a gcc_unreachable () here. That is, make sure you have test coverage
for the above two cases (to me the len_opno >= 0 case is obvious)
> }
> else
> gcc_unreachable ();
> }
>
> Thanks.
>
>
> juzhe.zhong@rivai.ai
>
> From: Richard Biener
> Date: 2023-07-31 21:58
> To: ???
> CC: richard.sandiford; gcc-patches
> Subject: Re: Re: [PATCH V2] VECT: Support CALL vectorization for COND_LEN_*
> On Mon, 31 Jul 2023, ??? wrote:
>
> > Yeah. I have tried this case too.
> >
> > But this case doesn't need to be vectorized as COND_FMA, am I right?
>
> Only when you enable loop masking. Alternatively use
>
> double foo (double *a, double *b, double *c)
> {
> double result = 0.0;
> for (int i = 0; i < 1024; ++i)
> result += i & 1 ? __builtin_fma (a[i], b[i], c[i]) : 0.0;
> return result;
> }
>
> but then for me if-conversion produces
>
> iftmp.0_18 = __builtin_fma (_8, _10, _5);
> _ifc__43 = _26 ? iftmp.0_18 : 0.0;
>
> with -ffast-math (probably rightfully so). I then get .FMAs
> vectorized and .COND_FMA folded.
>
> > The thing I wonder is that whether this condtion:
> >
> > if (mask_opno >= 0 && reduc_idx >= 0)
> >
> > or similar as len
> > if (len_opno >= 0 && reduc_idx >= 0)
> >
> > Whether they are redundant in vectorizable_call ?
> >
> >
> > juzhe.zhong@rivai.ai
> >
> > From: Richard Biener
> > Date: 2023-07-31 21:33
> > To: juzhe.zhong@rivai.ai
> > CC: richard.sandiford; gcc-patches
> > Subject: Re: Re: [PATCH V2] VECT: Support CALL vectorization for COND_LEN_*
> > On Mon, 31 Jul 2023, juzhe.zhong@rivai.ai wrote:
> >
> > > Hi, Richi.
> > >
> > > >> I think you need to use fma from math.h together with -ffast-math
> > > >>to get fma.
> > >
> > > As you said, this is one of the case I tried:
> > > https://godbolt.org/z/xMzrrv5dT
> > > GCC failed to vectorize.
> > >
> > > Could you help me with this?
> >
> > double foo (double *a, double *b, double *c)
> > {
> > double result = 0.0;
> > for (int i = 0; i < 1024; ++i)
> > result += __builtin_fma (a[i], b[i], c[i]);
> > return result;
> > }
> >
> > with -mavx2 -mfma -Ofast this is vectorized on x86_64 to
> >
> > ...
> > vect__9.13_27 = MEM <vector(4) double> [(double *)vectp_a.11_29];
> > _9 = *_8;
> > vect__10.14_26 = .FMA (vect__7.10_30, vect__9.13_27, vect__4.7_33);
> > vect_result_17.15_25 = vect__10.14_26 + vect_result_20.4_36;
> > ...
> >
> > but ifcvt still shows
> >
> > _9 = *_8;
> > _10 = __builtin_fma (_7, _9, _4);
> > result_17 = _10 + result_20;
> >
> > still vectorizable_call has IFN_FMA with
> >
> > /* First try using an internal function. */
> > code_helper convert_code = MAX_TREE_CODES;
> > if (cfn != CFN_LAST
> > && (modifier == NONE
> > || (modifier == NARROW
> > && simple_integer_narrowing (vectype_out, vectype_in,
> > &convert_code))))
> > ifn = vectorizable_internal_function (cfn, callee, vectype_out,
> > vectype_in);
> >
> > from CFN_BUILT_IN_FMA
> >
> >
> >
> > > Thanks.
> > >
> > >
> > > juzhe.zhong@rivai.ai
> > >
> > > From: Richard Biener
> > > Date: 2023-07-31 20:00
> > > To: juzhe.zhong@rivai.ai
> > > CC: richard.sandiford; gcc-patches
> > > Subject: Re: Re: [PATCH V2] VECT: Support CALL vectorization for COND_LEN_*
> > > On Mon, 31 Jul 2023, juzhe.zhong@rivai.ai wrote:
> > >
> > > > Ok . Thanks Richard.
> > > >
> > > > Could you give me a case that SVE can vectorize a reduction with FMA?
> > > > Meaning it will go into vectorize_call and vectorize FMA into COND_FMA ?
> > > >
> > > > I tried many times to reproduce such cases but I failed.
> > >
> > > I think you need to use fma from math.h together with -ffast-math
> > > to get fma.
> > >
> > > Richard.
> > >
> > > > Thanks.
> > > >
> > > >
> > > > juzhe.zhong@rivai.ai
> > > >
> > > > From: Richard Sandiford
> > > > Date: 2023-07-31 18:19
> > > > To: Juzhe-Zhong
> > > > CC: gcc-patches; rguenther
> > > > Subject: Re: [PATCH V2] VECT: Support CALL vectorization for COND_LEN_*
> > > > Juzhe-Zhong <juzhe.zhong@rivai.ai> writes:
> > > > > Hi, Richard and Richi.
> > > > >
> > > > > Base on the suggestions from Richard:
> > > > > https://gcc.gnu.org/pipermail/gcc-patches/2023-July/625396.html
> > > > >
> > > > > This patch choose (1) approach that Richard provided, meaning:
> > > > >
> > > > > RVV implements cond_* optabs as expanders. RVV therefore supports
> > > > > both IFN_COND_ADD and IFN_COND_LEN_ADD. No dummy length arguments
> > > > > are needed at the gimple level.
> > > > >
> > > > > Such approach can make codes much cleaner and reasonable.
> > > > >
> > > > > Consider this following case:
> > > > > void foo (float * __restrict a, float * __restrict b, int * __restrict cond, int n)
> > > > > {
> > > > > for (int i = 0; i < n; i++)
> > > > > if (cond[i])
> > > > > a[i] = b[i] + a[i];
> > > > > }
> > > > >
> > > > >
> > > > > Output of RISC-V (32-bits) gcc (trunk) (Compiler #3)
> > > > > <source>:5:21: missed: couldn't vectorize loop
> > > > > <source>:5:21: missed: not vectorized: control flow in loop.
> > > > >
> > > > > ARM SVE:
> > > > >
> > > > > ...
> > > > > mask__27.10_51 = vect__4.9_49 != { 0, ... };
> > > > > ...
> > > > > vec_mask_and_55 = loop_mask_49 & mask__27.10_51;
> > > > > ...
> > > > > vect__9.17_62 = .COND_ADD (vec_mask_and_55, vect__6.13_56, vect__8.16_60, vect__6.13_56);
> > > > >
> > > > > For RVV, we want IR as follows:
> > > > >
> > > > > ...
> > > > > _68 = .SELECT_VL (ivtmp_66, POLY_INT_CST [4, 4]);
> > > > > ...
> > > > > mask__27.10_51 = vect__4.9_49 != { 0, ... };
> > > > > ...
> > > > > vect__9.17_60 = .COND_LEN_ADD (mask__27.10_51, vect__6.13_55, vect__8.16_59, vect__6.13_55, _68, 0);
> > > > > ...
> > > > >
> > > > > Both len and mask of COND_LEN_ADD are real not dummy.
> > > > >
> > > > > This patch has been fully tested in RISC-V port with supporting both COND_* and COND_LEN_*.
> > > > >
> > > > > And also, Bootstrap and Regression on X86 passed.
> > > > >
> > > > > OK for trunk?
> > > > >
> > > > > gcc/ChangeLog:
> > > > >
> > > > > * internal-fn.cc (FOR_EACH_LEN_FN_PAIR): New macro.
> > > > > (get_len_internal_fn): New function.
> > > > > (CASE): Ditto.
> > > > > * internal-fn.h (get_len_internal_fn): Ditto.
> > > > > * tree-vect-stmts.cc (vectorizable_call): Support CALL vectorization with COND_LEN_*.
> > > > >
> > > > > ---
> > > > > gcc/internal-fn.cc | 46 ++++++++++++++++++++++
> > > > > gcc/internal-fn.h | 1 +
> > > > > gcc/tree-vect-stmts.cc | 87 +++++++++++++++++++++++++++++++++++++-----
> > > > > 3 files changed, 125 insertions(+), 9 deletions(-)
> > > > >
> > > > > diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
> > > > > index 8e294286388..379220bebc7 100644
> > > > > --- a/gcc/internal-fn.cc
> > > > > +++ b/gcc/internal-fn.cc
> > > > > @@ -4443,6 +4443,52 @@ get_conditional_internal_fn (internal_fn fn)
> > > > > }
> > > > > }
> > > > >
> > > > > +/* Invoke T(IFN) for each internal function IFN that also has an
> > > > > + IFN_COND_LEN_* or IFN_MASK_LEN_* form. */
> > > > > +#define FOR_EACH_LEN_FN_PAIR(T) \
> > > > > + T (MASK_LOAD, MASK_LEN_LOAD) \
> > > > > + T (MASK_STORE, MASK_LEN_STORE) \
> > > > > + T (MASK_GATHER_LOAD, MASK_LEN_GATHER_LOAD) \
> > > > > + T (MASK_SCATTER_STORE, MASK_LEN_SCATTER_STORE) \
> > > > > + T (COND_ADD, COND_LEN_ADD) \
> > > > > + T (COND_SUB, COND_LEN_SUB) \
> > > > > + T (COND_MUL, COND_LEN_MUL) \
> > > > > + T (COND_DIV, COND_LEN_DIV) \
> > > > > + T (COND_MOD, COND_LEN_MOD) \
> > > > > + T (COND_RDIV, COND_LEN_RDIV) \
> > > > > + T (COND_FMIN, COND_LEN_FMIN) \
> > > > > + T (COND_FMAX, COND_LEN_FMAX) \
> > > > > + T (COND_MIN, COND_LEN_MIN) \
> > > > > + T (COND_MAX, COND_LEN_MAX) \
> > > > > + T (COND_AND, COND_LEN_AND) \
> > > > > + T (COND_IOR, COND_LEN_IOR) \
> > > > > + T (COND_XOR, COND_LEN_XOR) \
> > > > > + T (COND_SHL, COND_LEN_SHL) \
> > > > > + T (COND_SHR, COND_LEN_SHR) \
> > > > > + T (COND_NEG, COND_LEN_NEG) \
> > > > > + T (COND_FMA, COND_LEN_FMA) \
> > > > > + T (COND_FMS, COND_LEN_FMS) \
> > > > > + T (COND_FNMA, COND_LEN_FNMA) \
> > > > > + T (COND_FNMS, COND_LEN_FNMS)
> > > >
> > > > With the earlier patch to add DEF_INTERNAL_COND_FN and
> > > > DEF_INTERNAL_SIGNED_COND_FN, I think we should use those to handle
> > > > the COND_* cases, rather than putting them in this macro.
> > > >
> > > > Thanks,
> > > > Richard
> > > >
> > > >
> > >
> > >
> >
> >
>
>
--
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Re: [PATCH V2] VECT: Support CALL vectorization for COND_LEN_*
2023-08-02 8:29 ` juzhe.zhong
@ 2023-08-02 8:33 ` Richard Biener
2023-08-02 8:46 ` juzhe.zhong
2023-08-03 2:38 ` juzhe.zhong
0 siblings, 2 replies; 18+ messages in thread
From: Richard Biener @ 2023-08-02 8:33 UTC (permalink / raw)
To: juzhe.zhong; +Cc: richard.sandiford, gcc-patches
On Wed, 2 Aug 2023, juzhe.zhong@rivai.ai wrote:
> Thanks Richard so much.
>
> Forgive me asking question again :)
>
> Is this following code correct for you ?
Well, I wonder what kind of testcase runs into the reduc_idx >= 0 case.
The point is I don't _know_ whether the code is correct, in fact it looked
suspicious ;)
> + if (len_loop_p)
> + {
> + if (len_opno >= 0)
> + {
> + ifn = cond_len_fn;
> + /* COND_* -> COND_LEN_* takes 2 extra arguments:LEN,BIAS. */
> + vect_nargs += 2;
> + }
> + else if (reduc_idx >= 0)
> + gcc_unreachable ();
> + }
>
> Thanks.
>
>
> juzhe.zhong@rivai.ai
>
> From: Richard Biener
> Date: 2023-08-02 15:49
> To: ???
> CC: richard.sandiford; gcc-patches
> Subject: Re: Re: [PATCH V2] VECT: Support CALL vectorization for COND_LEN_*
> On Mon, 31 Jul 2023, ??? wrote:
>
> > Oh, Thanks a lot.
> > I can test it in RISC-V backend now.
> >
> > But I have another questions:
> > >> I'm a bit confused (but also by the existing mask code), whether
> > >>vect_nargs needs adjustment depends on the IFN in the IL we analyze.
> > >>If if-conversion recognizes a .COND_ADD then we need to add nothing
> > >>for masking (that is, ifn == cond_fn already). In your code above
> > >>you either use cond_len_fn or get_len_internal_fn (cond_fn) but
> > >>isn't that the very same?! So how come you in one case add two
> > >>and in the other add four args?
> > >>Please make sure to place gcc_unreachable () in each arm and check
> > >>you have test coverage. I believe that the else arm is unreachable
> > >>but when you vectorize .FMA you will need to add 4 and when you
> > >>vectorize .COND_FMA you will need to add two arguments (as said,
> > >>no idea why we special case reduc_idx >= 0 at the moment).
> >
> > Do you mean I add gcc_unreachable in else like this:
> >
> > if (len_loop_p)
> > {
> > if (len_opno >= 0)
> > {
> > ifn = cond_len_fn;
> > /* COND_* -> COND_LEN_* takes 2 extra arguments:LEN,BIAS. */
> > vect_nargs += 2;
> > }
> > else if (reduc_idx >= 0)
> > {
> > /* FMA -> COND_LEN_FMA takes 4 extra arguments:MASK,ELSE,LEN,BIAS. */
> > ifn = get_len_internal_fn (cond_fn);
> > vect_nargs += 4;
>
> no, a gcc_unreachable () here. That is, make sure you have test coverage
> for the above two cases (to me the len_opno >= 0 case is obvious)
>
> > }
> > else
> > gcc_unreachable ();
> > }
> >
> > Thanks.
> >
> >
> > juzhe.zhong@rivai.ai
> >
> > From: Richard Biener
> > Date: 2023-07-31 21:58
> > To: ???
> > CC: richard.sandiford; gcc-patches
> > Subject: Re: Re: [PATCH V2] VECT: Support CALL vectorization for COND_LEN_*
> > On Mon, 31 Jul 2023, ??? wrote:
> >
> > > Yeah. I have tried this case too.
> > >
> > > But this case doesn't need to be vectorized as COND_FMA, am I right?
> >
> > Only when you enable loop masking. Alternatively use
> >
> > double foo (double *a, double *b, double *c)
> > {
> > double result = 0.0;
> > for (int i = 0; i < 1024; ++i)
> > result += i & 1 ? __builtin_fma (a[i], b[i], c[i]) : 0.0;
> > return result;
> > }
> >
> > but then for me if-conversion produces
> >
> > iftmp.0_18 = __builtin_fma (_8, _10, _5);
> > _ifc__43 = _26 ? iftmp.0_18 : 0.0;
> >
> > with -ffast-math (probably rightfully so). I then get .FMAs
> > vectorized and .COND_FMA folded.
> >
> > > The thing I wonder is that whether this condtion:
> > >
> > > if (mask_opno >= 0 && reduc_idx >= 0)
> > >
> > > or similar as len
> > > if (len_opno >= 0 && reduc_idx >= 0)
> > >
> > > Whether they are redundant in vectorizable_call ?
> > >
> > >
> > > juzhe.zhong@rivai.ai
> > >
> > > From: Richard Biener
> > > Date: 2023-07-31 21:33
> > > To: juzhe.zhong@rivai.ai
> > > CC: richard.sandiford; gcc-patches
> > > Subject: Re: Re: [PATCH V2] VECT: Support CALL vectorization for COND_LEN_*
> > > On Mon, 31 Jul 2023, juzhe.zhong@rivai.ai wrote:
> > >
> > > > Hi, Richi.
> > > >
> > > > >> I think you need to use fma from math.h together with -ffast-math
> > > > >>to get fma.
> > > >
> > > > As you said, this is one of the case I tried:
> > > > https://godbolt.org/z/xMzrrv5dT
> > > > GCC failed to vectorize.
> > > >
> > > > Could you help me with this?
> > >
> > > double foo (double *a, double *b, double *c)
> > > {
> > > double result = 0.0;
> > > for (int i = 0; i < 1024; ++i)
> > > result += __builtin_fma (a[i], b[i], c[i]);
> > > return result;
> > > }
> > >
> > > with -mavx2 -mfma -Ofast this is vectorized on x86_64 to
> > >
> > > ...
> > > vect__9.13_27 = MEM <vector(4) double> [(double *)vectp_a.11_29];
> > > _9 = *_8;
> > > vect__10.14_26 = .FMA (vect__7.10_30, vect__9.13_27, vect__4.7_33);
> > > vect_result_17.15_25 = vect__10.14_26 + vect_result_20.4_36;
> > > ...
> > >
> > > but ifcvt still shows
> > >
> > > _9 = *_8;
> > > _10 = __builtin_fma (_7, _9, _4);
> > > result_17 = _10 + result_20;
> > >
> > > still vectorizable_call has IFN_FMA with
> > >
> > > /* First try using an internal function. */
> > > code_helper convert_code = MAX_TREE_CODES;
> > > if (cfn != CFN_LAST
> > > && (modifier == NONE
> > > || (modifier == NARROW
> > > && simple_integer_narrowing (vectype_out, vectype_in,
> > > &convert_code))))
> > > ifn = vectorizable_internal_function (cfn, callee, vectype_out,
> > > vectype_in);
> > >
> > > from CFN_BUILT_IN_FMA
> > >
> > >
> > >
> > > > Thanks.
> > > >
> > > >
> > > > juzhe.zhong@rivai.ai
> > > >
> > > > From: Richard Biener
> > > > Date: 2023-07-31 20:00
> > > > To: juzhe.zhong@rivai.ai
> > > > CC: richard.sandiford; gcc-patches
> > > > Subject: Re: Re: [PATCH V2] VECT: Support CALL vectorization for COND_LEN_*
> > > > On Mon, 31 Jul 2023, juzhe.zhong@rivai.ai wrote:
> > > >
> > > > > Ok . Thanks Richard.
> > > > >
> > > > > Could you give me a case that SVE can vectorize a reduction with FMA?
> > > > > Meaning it will go into vectorize_call and vectorize FMA into COND_FMA ?
> > > > >
> > > > > I tried many times to reproduce such cases but I failed.
> > > >
> > > > I think you need to use fma from math.h together with -ffast-math
> > > > to get fma.
> > > >
> > > > Richard.
> > > >
> > > > > Thanks.
> > > > >
> > > > >
> > > > > juzhe.zhong@rivai.ai
> > > > >
> > > > > From: Richard Sandiford
> > > > > Date: 2023-07-31 18:19
> > > > > To: Juzhe-Zhong
> > > > > CC: gcc-patches; rguenther
> > > > > Subject: Re: [PATCH V2] VECT: Support CALL vectorization for COND_LEN_*
> > > > > Juzhe-Zhong <juzhe.zhong@rivai.ai> writes:
> > > > > > Hi, Richard and Richi.
> > > > > >
> > > > > > Base on the suggestions from Richard:
> > > > > > https://gcc.gnu.org/pipermail/gcc-patches/2023-July/625396.html
> > > > > >
> > > > > > This patch choose (1) approach that Richard provided, meaning:
> > > > > >
> > > > > > RVV implements cond_* optabs as expanders. RVV therefore supports
> > > > > > both IFN_COND_ADD and IFN_COND_LEN_ADD. No dummy length arguments
> > > > > > are needed at the gimple level.
> > > > > >
> > > > > > Such approach can make codes much cleaner and reasonable.
> > > > > >
> > > > > > Consider this following case:
> > > > > > void foo (float * __restrict a, float * __restrict b, int * __restrict cond, int n)
> > > > > > {
> > > > > > for (int i = 0; i < n; i++)
> > > > > > if (cond[i])
> > > > > > a[i] = b[i] + a[i];
> > > > > > }
> > > > > >
> > > > > >
> > > > > > Output of RISC-V (32-bits) gcc (trunk) (Compiler #3)
> > > > > > <source>:5:21: missed: couldn't vectorize loop
> > > > > > <source>:5:21: missed: not vectorized: control flow in loop.
> > > > > >
> > > > > > ARM SVE:
> > > > > >
> > > > > > ...
> > > > > > mask__27.10_51 = vect__4.9_49 != { 0, ... };
> > > > > > ...
> > > > > > vec_mask_and_55 = loop_mask_49 & mask__27.10_51;
> > > > > > ...
> > > > > > vect__9.17_62 = .COND_ADD (vec_mask_and_55, vect__6.13_56, vect__8.16_60, vect__6.13_56);
> > > > > >
> > > > > > For RVV, we want IR as follows:
> > > > > >
> > > > > > ...
> > > > > > _68 = .SELECT_VL (ivtmp_66, POLY_INT_CST [4, 4]);
> > > > > > ...
> > > > > > mask__27.10_51 = vect__4.9_49 != { 0, ... };
> > > > > > ...
> > > > > > vect__9.17_60 = .COND_LEN_ADD (mask__27.10_51, vect__6.13_55, vect__8.16_59, vect__6.13_55, _68, 0);
> > > > > > ...
> > > > > >
> > > > > > Both len and mask of COND_LEN_ADD are real not dummy.
> > > > > >
> > > > > > This patch has been fully tested in RISC-V port with supporting both COND_* and COND_LEN_*.
> > > > > >
> > > > > > And also, Bootstrap and Regression on X86 passed.
> > > > > >
> > > > > > OK for trunk?
> > > > > >
> > > > > > gcc/ChangeLog:
> > > > > >
> > > > > > * internal-fn.cc (FOR_EACH_LEN_FN_PAIR): New macro.
> > > > > > (get_len_internal_fn): New function.
> > > > > > (CASE): Ditto.
> > > > > > * internal-fn.h (get_len_internal_fn): Ditto.
> > > > > > * tree-vect-stmts.cc (vectorizable_call): Support CALL vectorization with COND_LEN_*.
> > > > > >
> > > > > > ---
> > > > > > gcc/internal-fn.cc | 46 ++++++++++++++++++++++
> > > > > > gcc/internal-fn.h | 1 +
> > > > > > gcc/tree-vect-stmts.cc | 87 +++++++++++++++++++++++++++++++++++++-----
> > > > > > 3 files changed, 125 insertions(+), 9 deletions(-)
> > > > > >
> > > > > > diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
> > > > > > index 8e294286388..379220bebc7 100644
> > > > > > --- a/gcc/internal-fn.cc
> > > > > > +++ b/gcc/internal-fn.cc
> > > > > > @@ -4443,6 +4443,52 @@ get_conditional_internal_fn (internal_fn fn)
> > > > > > }
> > > > > > }
> > > > > >
> > > > > > +/* Invoke T(IFN) for each internal function IFN that also has an
> > > > > > + IFN_COND_LEN_* or IFN_MASK_LEN_* form. */
> > > > > > +#define FOR_EACH_LEN_FN_PAIR(T) \
> > > > > > + T (MASK_LOAD, MASK_LEN_LOAD) \
> > > > > > + T (MASK_STORE, MASK_LEN_STORE) \
> > > > > > + T (MASK_GATHER_LOAD, MASK_LEN_GATHER_LOAD) \
> > > > > > + T (MASK_SCATTER_STORE, MASK_LEN_SCATTER_STORE) \
> > > > > > + T (COND_ADD, COND_LEN_ADD) \
> > > > > > + T (COND_SUB, COND_LEN_SUB) \
> > > > > > + T (COND_MUL, COND_LEN_MUL) \
> > > > > > + T (COND_DIV, COND_LEN_DIV) \
> > > > > > + T (COND_MOD, COND_LEN_MOD) \
> > > > > > + T (COND_RDIV, COND_LEN_RDIV) \
> > > > > > + T (COND_FMIN, COND_LEN_FMIN) \
> > > > > > + T (COND_FMAX, COND_LEN_FMAX) \
> > > > > > + T (COND_MIN, COND_LEN_MIN) \
> > > > > > + T (COND_MAX, COND_LEN_MAX) \
> > > > > > + T (COND_AND, COND_LEN_AND) \
> > > > > > + T (COND_IOR, COND_LEN_IOR) \
> > > > > > + T (COND_XOR, COND_LEN_XOR) \
> > > > > > + T (COND_SHL, COND_LEN_SHL) \
> > > > > > + T (COND_SHR, COND_LEN_SHR) \
> > > > > > + T (COND_NEG, COND_LEN_NEG) \
> > > > > > + T (COND_FMA, COND_LEN_FMA) \
> > > > > > + T (COND_FMS, COND_LEN_FMS) \
> > > > > > + T (COND_FNMA, COND_LEN_FNMA) \
> > > > > > + T (COND_FNMS, COND_LEN_FNMS)
> > > > >
> > > > > With the earlier patch to add DEF_INTERNAL_COND_FN and
> > > > > DEF_INTERNAL_SIGNED_COND_FN, I think we should use those to handle
> > > > > the COND_* cases, rather than putting them in this macro.
> > > > >
> > > > > Thanks,
> > > > > Richard
> > > > >
> > > > >
> > > >
> > > >
> > >
> > >
> >
> >
>
>
--
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Re: [PATCH V2] VECT: Support CALL vectorization for COND_LEN_*
2023-08-02 8:33 ` Richard Biener
@ 2023-08-02 8:46 ` juzhe.zhong
2023-08-03 2:38 ` juzhe.zhong
1 sibling, 0 replies; 18+ messages in thread
From: juzhe.zhong @ 2023-08-02 8:46 UTC (permalink / raw)
To: rguenther; +Cc: richard.sandiford, gcc-patches
Yes. I also doubt that we can ever run into reduc_idx >= 0 here.
I will add gcc_unreachable () and add full test coverage for it.
After I have fully tested it in the RISC-V port, I will send V4.
Thank you so much.
juzhe.zhong@rivai.ai
From: Richard Biener
Date: 2023-08-02 16:33
To: juzhe.zhong@rivai.ai
CC: richard.sandiford; gcc-patches
Subject: Re: Re: [PATCH V2] VECT: Support CALL vectorization for COND_LEN_*
On Wed, 2 Aug 2023, juzhe.zhong@rivai.ai wrote:
> Thanks Richard so much.
>
> Forgive me asking question again :)
>
> Is this following code correct for you ?
Well, I wonder what kind of testcase runs into the reduc_idx >= 0 case.
The point is I don't _know_ whether the code is correct, in fact it looked
suspicious ;)
> + if (len_loop_p)
> + {
> + if (len_opno >= 0)
> + {
> + ifn = cond_len_fn;
> + /* COND_* -> COND_LEN_* takes 2 extra arguments:LEN,BIAS. */
> + vect_nargs += 2;
> + }
> + else if (reduc_idx >= 0)
> + gcc_unreachable ();
> + }
>
> Thanks.
>
>
> juzhe.zhong@rivai.ai
>
> From: Richard Biener
> Date: 2023-08-02 15:49
> To: ???
> CC: richard.sandiford; gcc-patches
> Subject: Re: Re: [PATCH V2] VECT: Support CALL vectorization for COND_LEN_*
> On Mon, 31 Jul 2023, ??? wrote:
>
> > Oh, Thanks a lot.
> > I can test it in RISC-V backend now.
> >
> > But I have another questions:
> > >> I'm a bit confused (but also by the existing mask code), whether
> > >>vect_nargs needs adjustment depends on the IFN in the IL we analyze.
> > >>If if-conversion recognizes a .COND_ADD then we need to add nothing
> > >>for masking (that is, ifn == cond_fn already). In your code above
> > >>you either use cond_len_fn or get_len_internal_fn (cond_fn) but
> > >>isn't that the very same?! So how come you in one case add two
> > >>and in the other add four args?
> > >>Please make sure to place gcc_unreachable () in each arm and check
> > >>you have test coverage. I believe that the else arm is unreachable
> > >>but when you vectorize .FMA you will need to add 4 and when you
> > >>vectorize .COND_FMA you will need to add two arguments (as said,
> > >>no idea why we special case reduc_idx >= 0 at the moment).
> >
> > Do you mean I add gcc_unreachable in else like this:
> >
> > if (len_loop_p)
> > {
> > if (len_opno >= 0)
> > {
> > ifn = cond_len_fn;
> > /* COND_* -> COND_LEN_* takes 2 extra arguments:LEN,BIAS. */
> > vect_nargs += 2;
> > }
> > else if (reduc_idx >= 0)
> > {
> > /* FMA -> COND_LEN_FMA takes 4 extra arguments:MASK,ELSE,LEN,BIAS. */
> > ifn = get_len_internal_fn (cond_fn);
> > vect_nargs += 4;
>
> no, a gcc_unreachable () here. That is, make sure you have test coverage
> for the above two cases (to me the len_opno >= 0 case is obvious)
>
> > }
> > else
> > gcc_unreachable ();
> > }
> >
> > Thanks.
> >
> >
> > juzhe.zhong@rivai.ai
> >
> > From: Richard Biener
> > Date: 2023-07-31 21:58
> > To: ???
> > CC: richard.sandiford; gcc-patches
> > Subject: Re: Re: [PATCH V2] VECT: Support CALL vectorization for COND_LEN_*
> > On Mon, 31 Jul 2023, ??? wrote:
> >
> > > Yeah. I have tried this case too.
> > >
> > > But this case doesn't need to be vectorized as COND_FMA, am I right?
> >
> > Only when you enable loop masking. Alternatively use
> >
> > double foo (double *a, double *b, double *c)
> > {
> > double result = 0.0;
> > for (int i = 0; i < 1024; ++i)
> > result += i & 1 ? __builtin_fma (a[i], b[i], c[i]) : 0.0;
> > return result;
> > }
> >
> > but then for me if-conversion produces
> >
> > iftmp.0_18 = __builtin_fma (_8, _10, _5);
> > _ifc__43 = _26 ? iftmp.0_18 : 0.0;
> >
> > with -ffast-math (probably rightfully so). I then get .FMAs
> > vectorized and .COND_FMA folded.
> >
> > > The thing I wonder is that whether this condtion:
> > >
> > > if (mask_opno >= 0 && reduc_idx >= 0)
> > >
> > > or similar as len
> > > if (len_opno >= 0 && reduc_idx >= 0)
> > >
> > > Whether they are redundant in vectorizable_call ?
> > >
> > >
> > > juzhe.zhong@rivai.ai
> > >
> > > From: Richard Biener
> > > Date: 2023-07-31 21:33
> > > To: juzhe.zhong@rivai.ai
> > > CC: richard.sandiford; gcc-patches
> > > Subject: Re: Re: [PATCH V2] VECT: Support CALL vectorization for COND_LEN_*
> > > On Mon, 31 Jul 2023, juzhe.zhong@rivai.ai wrote:
> > >
> > > > Hi, Richi.
> > > >
> > > > >> I think you need to use fma from math.h together with -ffast-math
> > > > >>to get fma.
> > > >
> > > > As you said, this is one of the case I tried:
> > > > https://godbolt.org/z/xMzrrv5dT
> > > > GCC failed to vectorize.
> > > >
> > > > Could you help me with this?
> > >
> > > double foo (double *a, double *b, double *c)
> > > {
> > > double result = 0.0;
> > > for (int i = 0; i < 1024; ++i)
> > > result += __builtin_fma (a[i], b[i], c[i]);
> > > return result;
> > > }
> > >
> > > with -mavx2 -mfma -Ofast this is vectorized on x86_64 to
> > >
> > > ...
> > > vect__9.13_27 = MEM <vector(4) double> [(double *)vectp_a.11_29];
> > > _9 = *_8;
> > > vect__10.14_26 = .FMA (vect__7.10_30, vect__9.13_27, vect__4.7_33);
> > > vect_result_17.15_25 = vect__10.14_26 + vect_result_20.4_36;
> > > ...
> > >
> > > but ifcvt still shows
> > >
> > > _9 = *_8;
> > > _10 = __builtin_fma (_7, _9, _4);
> > > result_17 = _10 + result_20;
> > >
> > > still vectorizable_call has IFN_FMA with
> > >
> > > /* First try using an internal function. */
> > > code_helper convert_code = MAX_TREE_CODES;
> > > if (cfn != CFN_LAST
> > > && (modifier == NONE
> > > || (modifier == NARROW
> > > && simple_integer_narrowing (vectype_out, vectype_in,
> > > &convert_code))))
> > > ifn = vectorizable_internal_function (cfn, callee, vectype_out,
> > > vectype_in);
> > >
> > > from CFN_BUILT_IN_FMA
> > >
> > >
> > >
> > > > Thanks.
> > > >
> > > >
> > > > juzhe.zhong@rivai.ai
> > > >
> > > > From: Richard Biener
> > > > Date: 2023-07-31 20:00
> > > > To: juzhe.zhong@rivai.ai
> > > > CC: richard.sandiford; gcc-patches
> > > > Subject: Re: Re: [PATCH V2] VECT: Support CALL vectorization for COND_LEN_*
> > > > On Mon, 31 Jul 2023, juzhe.zhong@rivai.ai wrote:
> > > >
> > > > > Ok . Thanks Richard.
> > > > >
> > > > > Could you give me a case that SVE can vectorize a reduction with FMA?
> > > > > Meaning it will go into vectorize_call and vectorize FMA into COND_FMA ?
> > > > >
> > > > > I tried many times to reproduce such cases but I failed.
> > > >
> > > > I think you need to use fma from math.h together with -ffast-math
> > > > to get fma.
> > > >
> > > > Richard.
> > > >
> > > > > Thanks.
> > > > >
> > > > >
> > > > > juzhe.zhong@rivai.ai
> > > > >
> > > > > From: Richard Sandiford
> > > > > Date: 2023-07-31 18:19
> > > > > To: Juzhe-Zhong
> > > > > CC: gcc-patches; rguenther
> > > > > Subject: Re: [PATCH V2] VECT: Support CALL vectorization for COND_LEN_*
> > > > > Juzhe-Zhong <juzhe.zhong@rivai.ai> writes:
> > > > > > Hi, Richard and Richi.
> > > > > >
> > > > > > Base on the suggestions from Richard:
> > > > > > https://gcc.gnu.org/pipermail/gcc-patches/2023-July/625396.html
> > > > > >
> > > > > > This patch choose (1) approach that Richard provided, meaning:
> > > > > >
> > > > > > RVV implements cond_* optabs as expanders. RVV therefore supports
> > > > > > both IFN_COND_ADD and IFN_COND_LEN_ADD. No dummy length arguments
> > > > > > are needed at the gimple level.
> > > > > >
> > > > > > Such approach can make codes much cleaner and reasonable.
> > > > > >
> > > > > > Consider this following case:
> > > > > > void foo (float * __restrict a, float * __restrict b, int * __restrict cond, int n)
> > > > > > {
> > > > > > for (int i = 0; i < n; i++)
> > > > > > if (cond[i])
> > > > > > a[i] = b[i] + a[i];
> > > > > > }
> > > > > >
> > > > > >
> > > > > > Output of RISC-V (32-bits) gcc (trunk) (Compiler #3)
> > > > > > <source>:5:21: missed: couldn't vectorize loop
> > > > > > <source>:5:21: missed: not vectorized: control flow in loop.
> > > > > >
> > > > > > ARM SVE:
> > > > > >
> > > > > > ...
> > > > > > mask__27.10_51 = vect__4.9_49 != { 0, ... };
> > > > > > ...
> > > > > > vec_mask_and_55 = loop_mask_49 & mask__27.10_51;
> > > > > > ...
> > > > > > vect__9.17_62 = .COND_ADD (vec_mask_and_55, vect__6.13_56, vect__8.16_60, vect__6.13_56);
> > > > > >
> > > > > > For RVV, we want IR as follows:
> > > > > >
> > > > > > ...
> > > > > > _68 = .SELECT_VL (ivtmp_66, POLY_INT_CST [4, 4]);
> > > > > > ...
> > > > > > mask__27.10_51 = vect__4.9_49 != { 0, ... };
> > > > > > ...
> > > > > > vect__9.17_60 = .COND_LEN_ADD (mask__27.10_51, vect__6.13_55, vect__8.16_59, vect__6.13_55, _68, 0);
> > > > > > ...
> > > > > >
> > > > > > Both len and mask of COND_LEN_ADD are real not dummy.
> > > > > >
> > > > > > This patch has been fully tested in RISC-V port with supporting both COND_* and COND_LEN_*.
> > > > > >
> > > > > > And also, Bootstrap and Regression on X86 passed.
> > > > > >
> > > > > > OK for trunk?
> > > > > >
> > > > > > gcc/ChangeLog:
> > > > > >
> > > > > > * internal-fn.cc (FOR_EACH_LEN_FN_PAIR): New macro.
> > > > > > (get_len_internal_fn): New function.
> > > > > > (CASE): Ditto.
> > > > > > * internal-fn.h (get_len_internal_fn): Ditto.
> > > > > > * tree-vect-stmts.cc (vectorizable_call): Support CALL vectorization with COND_LEN_*.
> > > > > >
> > > > > > ---
> > > > > > gcc/internal-fn.cc | 46 ++++++++++++++++++++++
> > > > > > gcc/internal-fn.h | 1 +
> > > > > > gcc/tree-vect-stmts.cc | 87 +++++++++++++++++++++++++++++++++++++-----
> > > > > > 3 files changed, 125 insertions(+), 9 deletions(-)
> > > > > >
> > > > > > diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
> > > > > > index 8e294286388..379220bebc7 100644
> > > > > > --- a/gcc/internal-fn.cc
> > > > > > +++ b/gcc/internal-fn.cc
> > > > > > @@ -4443,6 +4443,52 @@ get_conditional_internal_fn (internal_fn fn)
> > > > > > }
> > > > > > }
> > > > > >
> > > > > > +/* Invoke T(IFN) for each internal function IFN that also has an
> > > > > > + IFN_COND_LEN_* or IFN_MASK_LEN_* form. */
> > > > > > +#define FOR_EACH_LEN_FN_PAIR(T) \
> > > > > > + T (MASK_LOAD, MASK_LEN_LOAD) \
> > > > > > + T (MASK_STORE, MASK_LEN_STORE) \
> > > > > > + T (MASK_GATHER_LOAD, MASK_LEN_GATHER_LOAD) \
> > > > > > + T (MASK_SCATTER_STORE, MASK_LEN_SCATTER_STORE) \
> > > > > > + T (COND_ADD, COND_LEN_ADD) \
> > > > > > + T (COND_SUB, COND_LEN_SUB) \
> > > > > > + T (COND_MUL, COND_LEN_MUL) \
> > > > > > + T (COND_DIV, COND_LEN_DIV) \
> > > > > > + T (COND_MOD, COND_LEN_MOD) \
> > > > > > + T (COND_RDIV, COND_LEN_RDIV) \
> > > > > > + T (COND_FMIN, COND_LEN_FMIN) \
> > > > > > + T (COND_FMAX, COND_LEN_FMAX) \
> > > > > > + T (COND_MIN, COND_LEN_MIN) \
> > > > > > + T (COND_MAX, COND_LEN_MAX) \
> > > > > > + T (COND_AND, COND_LEN_AND) \
> > > > > > + T (COND_IOR, COND_LEN_IOR) \
> > > > > > + T (COND_XOR, COND_LEN_XOR) \
> > > > > > + T (COND_SHL, COND_LEN_SHL) \
> > > > > > + T (COND_SHR, COND_LEN_SHR) \
> > > > > > + T (COND_NEG, COND_LEN_NEG) \
> > > > > > + T (COND_FMA, COND_LEN_FMA) \
> > > > > > + T (COND_FMS, COND_LEN_FMS) \
> > > > > > + T (COND_FNMA, COND_LEN_FNMA) \
> > > > > > + T (COND_FNMS, COND_LEN_FNMS)
> > > > >
> > > > > With the earlier patch to add DEF_INTERNAL_COND_FN and
> > > > > DEF_INTERNAL_SIGNED_COND_FN, I think we should use those to handle
> > > > > the COND_* cases, rather than putting them in this macro.
> > > > >
> > > > > Thanks,
> > > > > Richard
> > > > >
> > > > >
> > > >
> > > >
> > >
> > >
> >
> >
>
>
--
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Re: [PATCH V2] VECT: Support CALL vectorization for COND_LEN_*
2023-08-02 8:33 ` Richard Biener
2023-08-02 8:46 ` juzhe.zhong
@ 2023-08-03 2:38 ` juzhe.zhong
1 sibling, 0 replies; 18+ messages in thread
From: juzhe.zhong @ 2023-08-03 2:38 UTC (permalink / raw)
To: rguenther; +Cc: richard.sandiford, gcc-patches
Hi, Richi.
I have fully tested the V4 patch, which adds the gcc_unreachable (), in the RISC-V port:
https://gcc.gnu.org/pipermail/gcc-patches/2023-August/626133.html
Bootstrap and regression on X86 passed.
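One review comment from Richard Sandiford, quoted near the end of this
message, is still open for a follow-up: generating the COND_* rows of
FOR_EACH_LEN_FN_PAIR from internal-fn.def instead of listing them by hand.
A sketch of that direction, under the assumption that DEF_INTERNAL_COND_FN
and DEF_INTERNAL_SIGNED_COND_FN take the IFN base name as their first
argument (their exact parameter lists are not shown in this thread):

internal_fn
get_len_internal_fn (internal_fn fn)
{
  switch (fn)
    {
#undef DEF_INTERNAL_COND_FN
#undef DEF_INTERNAL_SIGNED_COND_FN
#define DEF_INTERNAL_COND_FN(NAME, ...) \
  case IFN_COND_##NAME:                 \
    return IFN_COND_LEN_##NAME;
#define DEF_INTERNAL_SIGNED_COND_FN(NAME, ...) \
  case IFN_COND_##NAME:                        \
    return IFN_COND_LEN_##NAME;
#include "internal-fn.def"
#undef DEF_INTERNAL_COND_FN
#undef DEF_INTERNAL_SIGNED_COND_FN
    default:
      return IFN_LAST;
    }
}

The MASK_LOAD/MASK_STORE and gather/scatter pairs would then either stay in a
small hand-written list or be dropped from this helper if vectorizable_call
turns out not to need them.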
juzhe.zhong@rivai.ai
From: Richard Biener
Date: 2023-08-02 16:33
To: juzhe.zhong@rivai.ai
CC: richard.sandiford; gcc-patches
Subject: Re: Re: [PATCH V2] VECT: Support CALL vectorization for COND_LEN_*
On Wed, 2 Aug 2023, juzhe.zhong@rivai.ai wrote:
> Thanks Richard so much.
>
> Forgive me asking question again :)
>
> Is this following code correct for you ?
Well, I wonder what kind of testcase runs into the reduc_idx >= 0 case.
The point is I don't _know_ whether the code is correct, in fact it looked
suspicious ;)
> + if (len_loop_p)
> + {
> + if (len_opno >= 0)
> + {
> + ifn = cond_len_fn;
> + /* COND_* -> COND_LEN_* takes 2 extra arguments:LEN,BIAS. */
> + vect_nargs += 2;
> + }
> + else if (reduc_idx >= 0)
> + gcc_unreachable ();
> + }
>
> Thanks.
>
>
> juzhe.zhong@rivai.ai
>
> From: Richard Biener
> Date: 2023-08-02 15:49
> To: ???
> CC: richard.sandiford; gcc-patches
> Subject: Re: Re: [PATCH V2] VECT: Support CALL vectorization for COND_LEN_*
> On Mon, 31 Jul 2023, ??? wrote:
>
> > Oh, Thanks a lot.
> > I can test it in RISC-V backend now.
> >
> > But I have another questions:
> > >> I'm a bit confused (but also by the existing mask code), whether
> > >>vect_nargs needs adjustment depends on the IFN in the IL we analyze.
> > >>If if-conversion recognizes a .COND_ADD then we need to add nothing
> > >>for masking (that is, ifn == cond_fn already). In your code above
> > >>you either use cond_len_fn or get_len_internal_fn (cond_fn) but
> > >>isn't that the very same?! So how come you in one case add two
> > >>and in the other add four args?
> > >>Please make sure to place gcc_unreachable () in each arm and check
> > >>you have test coverage. I believe that the else arm is unreachable
> > >>but when you vectorize .FMA you will need to add 4 and when you
> > >>vectorize .COND_FMA you will need to add two arguments (as said,
> > >>no idea why we special case reduc_idx >= 0 at the moment).
> >
> > Do you mean I add gcc_unreachable in else like this:
> >
> > if (len_loop_p)
> > {
> > if (len_opno >= 0)
> > {
> > ifn = cond_len_fn;
> > /* COND_* -> COND_LEN_* takes 2 extra arguments:LEN,BIAS. */
> > vect_nargs += 2;
> > }
> > else if (reduc_idx >= 0)
> > {
> > /* FMA -> COND_LEN_FMA takes 4 extra arguments:MASK,ELSE,LEN,BIAS. */
> > ifn = get_len_internal_fn (cond_fn);
> > vect_nargs += 4;
>
> no, a gcc_unreachable () here. That is, make sure you have test coverage
> for the above two cases (to me the len_opno >= 0 case is obvious)
>
> > }
> > else
> > gcc_unreachable ();
> > }
> >
> > Thanks.
> >
> >
> > juzhe.zhong@rivai.ai
> >
> > From: Richard Biener
> > Date: 2023-07-31 21:58
> > To: ???
> > CC: richard.sandiford; gcc-patches
> > Subject: Re: Re: [PATCH V2] VECT: Support CALL vectorization for COND_LEN_*
> > On Mon, 31 Jul 2023, ??? wrote:
> >
> > > Yeah. I have tried this case too.
> > >
> > > But this case doesn't need to be vectorized as COND_FMA, am I right?
> >
> > Only when you enable loop masking. Alternatively use
> >
> > double foo (double *a, double *b, double *c)
> > {
> > double result = 0.0;
> > for (int i = 0; i < 1024; ++i)
> > result += i & 1 ? __builtin_fma (a[i], b[i], c[i]) : 0.0;
> > return result;
> > }
> >
> > but then for me if-conversion produces
> >
> > iftmp.0_18 = __builtin_fma (_8, _10, _5);
> > _ifc__43 = _26 ? iftmp.0_18 : 0.0;
> >
> > with -ffast-math (probably rightfully so). I then get .FMAs
> > vectorized and .COND_FMA folded.
> >
> > > The thing I wonder is that whether this condtion:
> > >
> > > if (mask_opno >= 0 && reduc_idx >= 0)
> > >
> > > or similar as len
> > > if (len_opno >= 0 && reduc_idx >= 0)
> > >
> > > Whether they are redundant in vectorizable_call ?
> > >
> > >
> > > juzhe.zhong@rivai.ai
> > >
> > > From: Richard Biener
> > > Date: 2023-07-31 21:33
> > > To: juzhe.zhong@rivai.ai
> > > CC: richard.sandiford; gcc-patches
> > > Subject: Re: Re: [PATCH V2] VECT: Support CALL vectorization for COND_LEN_*
> > > On Mon, 31 Jul 2023, juzhe.zhong@rivai.ai wrote:
> > >
> > > > Hi, Richi.
> > > >
> > > > >> I think you need to use fma from math.h together with -ffast-math
> > > > >>to get fma.
> > > >
> > > > As you said, this is one of the case I tried:
> > > > https://godbolt.org/z/xMzrrv5dT
> > > > GCC failed to vectorize.
> > > >
> > > > Could you help me with this?
> > >
> > > double foo (double *a, double *b, double *c)
> > > {
> > > double result = 0.0;
> > > for (int i = 0; i < 1024; ++i)
> > > result += __builtin_fma (a[i], b[i], c[i]);
> > > return result;
> > > }
> > >
> > > with -mavx2 -mfma -Ofast this is vectorized on x86_64 to
> > >
> > > ...
> > > vect__9.13_27 = MEM <vector(4) double> [(double *)vectp_a.11_29];
> > > _9 = *_8;
> > > vect__10.14_26 = .FMA (vect__7.10_30, vect__9.13_27, vect__4.7_33);
> > > vect_result_17.15_25 = vect__10.14_26 + vect_result_20.4_36;
> > > ...
> > >
> > > but ifcvt still shows
> > >
> > > _9 = *_8;
> > > _10 = __builtin_fma (_7, _9, _4);
> > > result_17 = _10 + result_20;
> > >
> > > still vectorizable_call has IFN_FMA with
> > >
> > > /* First try using an internal function. */
> > > code_helper convert_code = MAX_TREE_CODES;
> > > if (cfn != CFN_LAST
> > > && (modifier == NONE
> > > || (modifier == NARROW
> > > && simple_integer_narrowing (vectype_out, vectype_in,
> > > &convert_code))))
> > > ifn = vectorizable_internal_function (cfn, callee, vectype_out,
> > > vectype_in);
> > >
> > > from CFN_BUILT_IN_FMA
> > >
> > >
> > >
> > > > Thanks.
> > > >
> > > >
> > > > juzhe.zhong@rivai.ai
> > > >
> > > > From: Richard Biener
> > > > Date: 2023-07-31 20:00
> > > > To: juzhe.zhong@rivai.ai
> > > > CC: richard.sandiford; gcc-patches
> > > > Subject: Re: Re: [PATCH V2] VECT: Support CALL vectorization for COND_LEN_*
> > > > On Mon, 31 Jul 2023, juzhe.zhong@rivai.ai wrote:
> > > >
> > > > > Ok . Thanks Richard.
> > > > >
> > > > > Could you give me a case that SVE can vectorize a reduction with FMA?
> > > > > Meaning it will go into vectorize_call and vectorize FMA into COND_FMA ?
> > > > >
> > > > > I tried many times to reproduce such cases but I failed.
> > > >
> > > > I think you need to use fma from math.h together with -ffast-math
> > > > to get fma.
> > > >
> > > > Richard.
> > > >
> > > > > Thanks.
> > > > >
> > > > >
> > > > > juzhe.zhong@rivai.ai
> > > > >
> > > > > From: Richard Sandiford
> > > > > Date: 2023-07-31 18:19
> > > > > To: Juzhe-Zhong
> > > > > CC: gcc-patches; rguenther
> > > > > Subject: Re: [PATCH V2] VECT: Support CALL vectorization for COND_LEN_*
> > > > > Juzhe-Zhong <juzhe.zhong@rivai.ai> writes:
> > > > > > Hi, Richard and Richi.
> > > > > >
> > > > > > Base on the suggestions from Richard:
> > > > > > https://gcc.gnu.org/pipermail/gcc-patches/2023-July/625396.html
> > > > > >
> > > > > > This patch choose (1) approach that Richard provided, meaning:
> > > > > >
> > > > > > RVV implements cond_* optabs as expanders. RVV therefore supports
> > > > > > both IFN_COND_ADD and IFN_COND_LEN_ADD. No dummy length arguments
> > > > > > are needed at the gimple level.
> > > > > >
> > > > > > Such approach can make codes much cleaner and reasonable.
> > > > > >
> > > > > > Consider this following case:
> > > > > > void foo (float * __restrict a, float * __restrict b, int * __restrict cond, int n)
> > > > > > {
> > > > > > for (int i = 0; i < n; i++)
> > > > > > if (cond[i])
> > > > > > a[i] = b[i] + a[i];
> > > > > > }
> > > > > >
> > > > > >
> > > > > > Output of RISC-V (32-bits) gcc (trunk) (Compiler #3)
> > > > > > <source>:5:21: missed: couldn't vectorize loop
> > > > > > <source>:5:21: missed: not vectorized: control flow in loop.
> > > > > >
> > > > > > ARM SVE:
> > > > > >
> > > > > > ...
> > > > > > mask__27.10_51 = vect__4.9_49 != { 0, ... };
> > > > > > ...
> > > > > > vec_mask_and_55 = loop_mask_49 & mask__27.10_51;
> > > > > > ...
> > > > > > vect__9.17_62 = .COND_ADD (vec_mask_and_55, vect__6.13_56, vect__8.16_60, vect__6.13_56);
> > > > > >
> > > > > > For RVV, we want IR as follows:
> > > > > >
> > > > > > ...
> > > > > > _68 = .SELECT_VL (ivtmp_66, POLY_INT_CST [4, 4]);
> > > > > > ...
> > > > > > mask__27.10_51 = vect__4.9_49 != { 0, ... };
> > > > > > ...
> > > > > > vect__9.17_60 = .COND_LEN_ADD (mask__27.10_51, vect__6.13_55, vect__8.16_59, vect__6.13_55, _68, 0);
> > > > > > ...
> > > > > >
> > > > > > Both len and mask of COND_LEN_ADD are real not dummy.
> > > > > >
> > > > > > This patch has been fully tested in RISC-V port with supporting both COND_* and COND_LEN_*.
> > > > > >
> > > > > > And also, Bootstrap and Regression on X86 passed.
> > > > > >
> > > > > > OK for trunk?
> > > > > >
> > > > > > gcc/ChangeLog:
> > > > > >
> > > > > > * internal-fn.cc (FOR_EACH_LEN_FN_PAIR): New macro.
> > > > > > (get_len_internal_fn): New function.
> > > > > > (CASE): Ditto.
> > > > > > * internal-fn.h (get_len_internal_fn): Ditto.
> > > > > > * tree-vect-stmts.cc (vectorizable_call): Support CALL vectorization with COND_LEN_*.
> > > > > >
> > > > > > ---
> > > > > > gcc/internal-fn.cc | 46 ++++++++++++++++++++++
> > > > > > gcc/internal-fn.h | 1 +
> > > > > > gcc/tree-vect-stmts.cc | 87 +++++++++++++++++++++++++++++++++++++-----
> > > > > > 3 files changed, 125 insertions(+), 9 deletions(-)
> > > > > >
> > > > > > diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
> > > > > > index 8e294286388..379220bebc7 100644
> > > > > > --- a/gcc/internal-fn.cc
> > > > > > +++ b/gcc/internal-fn.cc
> > > > > > @@ -4443,6 +4443,52 @@ get_conditional_internal_fn (internal_fn fn)
> > > > > > }
> > > > > > }
> > > > > >
> > > > > > +/* Invoke T(IFN) for each internal function IFN that also has an
> > > > > > + IFN_COND_LEN_* or IFN_MASK_LEN_* form. */
> > > > > > +#define FOR_EACH_LEN_FN_PAIR(T) \
> > > > > > + T (MASK_LOAD, MASK_LEN_LOAD) \
> > > > > > + T (MASK_STORE, MASK_LEN_STORE) \
> > > > > > + T (MASK_GATHER_LOAD, MASK_LEN_GATHER_LOAD) \
> > > > > > + T (MASK_SCATTER_STORE, MASK_LEN_SCATTER_STORE) \
> > > > > > + T (COND_ADD, COND_LEN_ADD) \
> > > > > > + T (COND_SUB, COND_LEN_SUB) \
> > > > > > + T (COND_MUL, COND_LEN_MUL) \
> > > > > > + T (COND_DIV, COND_LEN_DIV) \
> > > > > > + T (COND_MOD, COND_LEN_MOD) \
> > > > > > + T (COND_RDIV, COND_LEN_RDIV) \
> > > > > > + T (COND_FMIN, COND_LEN_FMIN) \
> > > > > > + T (COND_FMAX, COND_LEN_FMAX) \
> > > > > > + T (COND_MIN, COND_LEN_MIN) \
> > > > > > + T (COND_MAX, COND_LEN_MAX) \
> > > > > > + T (COND_AND, COND_LEN_AND) \
> > > > > > + T (COND_IOR, COND_LEN_IOR) \
> > > > > > + T (COND_XOR, COND_LEN_XOR) \
> > > > > > + T (COND_SHL, COND_LEN_SHL) \
> > > > > > + T (COND_SHR, COND_LEN_SHR) \
> > > > > > + T (COND_NEG, COND_LEN_NEG) \
> > > > > > + T (COND_FMA, COND_LEN_FMA) \
> > > > > > + T (COND_FMS, COND_LEN_FMS) \
> > > > > > + T (COND_FNMA, COND_LEN_FNMA) \
> > > > > > + T (COND_FNMS, COND_LEN_FNMS)
> > > > >
> > > > > With the earlier patch to add DEF_INTERNAL_COND_FN and
> > > > > DEF_INTERNAL_SIGNED_COND_FN, I think we should use those to handle
> > > > > the COND_* cases, rather than putting them in this macro.
> > > > >
> > > > > Thanks,
> > > > > Richard
> > > > >
> > > > >
> > > >
> > > >
> > >
> > >
> >
> >
>
>
--
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
^ permalink raw reply [flat|nested] 18+ messages in thread
end of thread, other threads:[~2023-08-03 2:38 UTC | newest]
Thread overview: 18+ messages
2023-07-28 7:10 [PATCH V2] VECT: Support CALL vectorization for COND_LEN_* Juzhe-Zhong
2023-07-31 9:26 ` Richard Biener
2023-07-31 9:52 ` juzhe.zhong
2023-07-31 10:45 ` Richard Biener
2023-07-31 10:59 ` juzhe.zhong
2023-07-31 10:19 ` Richard Sandiford
2023-07-31 11:52 ` juzhe.zhong
2023-07-31 12:00 ` Richard Biener
2023-07-31 12:09 ` juzhe.zhong
2023-07-31 13:33 ` Richard Biener
2023-07-31 13:43 ` 钟居哲
2023-07-31 13:58 ` Richard Biener
2023-07-31 14:29 ` 钟居哲
2023-08-02 7:49 ` Richard Biener
2023-08-02 8:29 ` juzhe.zhong
2023-08-02 8:33 ` Richard Biener
2023-08-02 8:46 ` juzhe.zhong
2023-08-03 2:38 ` juzhe.zhong