[PATCH V2] Middle-end: Fix bug of induction variable vectorization for RVV

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

* [PATCH V2] Middle-end: Fix bug of induction variable vectorization for RVV
@ 2023-11-10 12:20 Juzhe-Zhong
  2023-11-10 13:35 ` Richard Biener
  0 siblings, 1 reply; 2+ messages in thread
From: Juzhe-Zhong @ 2023-11-10 12:20 UTC (permalink / raw)
  To: gcc-patches; +Cc: richard.sandiford, rguenther, Juzhe-Zhong

PR: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112438

1. Since SELECT_VL result is not necessary always VF in non-final iteration.

Current GIMPLE IR is wrong:

# vect_vec_iv_.8_22 = PHI <_21(4), { 0, 1, 2, ... }(3)>
...
_35 = .SELECT_VL (ivtmp_33, VF);
_21 = vect_vec_iv_.8_22 + { VF, ... };

E.g. Consider the total iterations N = 6, the VF = 4.
Since SELECT_VL output is defined as not always to be VF in non-final iteration
which needs to depend on hardware implementation.

Suppose we have a RVV CPU core with vsetvl doing even distribution workload optimization.
It may process 3 elements at the 1st iteration and 3 elements at the last iteration.
Then the induction variable here: _21 = vect_vec_iv_.8_22 + { POLY_INT_CST [4, 4], ... }; 
is wrong which is adding VF, which is 4, actually, we didn't process 4 elements.

It should be adding 3 elements which is the result of SELECT_VL.
So, here the correct IR should be:

  _36 = .SELECT_VL (ivtmp_34, VF);
  _22 = (int) _36;
  vect_cst__21 = [vec_duplicate_expr] _22;

2. This issue only happens on non-SLP vectorization single rgroup since:
   
     if (LOOP_VINFO_USING_DECREMENTING_IV_P (loop_vinfo))
    {
      tree iv_type = LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo);
      if (direct_internal_fn_supported_p (IFN_SELECT_VL, iv_type,
					  OPTIMIZE_FOR_SPEED)
	  && LOOP_VINFO_LENS (loop_vinfo).length () == 1
	  && LOOP_VINFO_LENS (loop_vinfo)[0].factor == 1 && !slp
	  && (!LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
	      || !LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant ()))
	LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo) = true;
    }

3. This issue doesn't appears on nested loop no matter LOOP_VINFO_USING_SELECT_VL_P is true or false.

Since:

  # vect_vec_iv_.6_5 = PHI <_19(3), { 0, ... }(5)>
  # vect_diff_15.7_20 = PHI <vect_diff_9.8_22(3), vect_diff_18.5_11(5)>
  _19 = vect_vec_iv_.6_5 + { 1, ... };
  vect_diff_9.8_22 = .COND_LEN_ADD ({ -1, ... }, vect_vec_iv_.6_5, vect_diff_15.7_20, vect_diff_15.7_20, _28, 0);
  ivtmp_1 = ivtmp_4 + 4294967295;
  ....
  <bb 5> [local count: 6549826]:
  # vect_diff_18.5_11 = PHI <vect_diff_9.8_22(4), { 0, ... }(2)>
  # ivtmp_26 = PHI <ivtmp_27(4), 40(2)>
  _28 = .SELECT_VL (ivtmp_26, POLY_INT_CST [4, 4]);
  goto <bb 3>; [100.00%]

Note the induction variable IR: _21 = vect_vec_iv_.8_22 + { POLY_INT_CST [4, 4], ... }; update induction variable
independent on VF (or don't care about how many elements are processed in the iteration).

The update is loop invariant. So it won't be the problem even if LOOP_VINFO_USING_SELECT_VL_P is true.
   
Testing passed, Ok for trunk ?

	PR tree-optimization/112438

gcc/ChangeLog:

	* tree-vect-loop.cc (vectorizable_induction):

gcc/testsuite/ChangeLog:

	* gcc.target/riscv/rvv/autovec/pr112438.c: New test.

---
 .../gcc.target/riscv/rvv/autovec/pr112438.c   | 33 +++++++++++++++++++
 gcc/tree-vect-loop.cc                         | 30 ++++++++++++++++-
 2 files changed, 62 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/pr112438.c

diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr112438.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr112438.c
new file mode 100644
index 00000000000..51f90df38a0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr112438.c
@@ -0,0 +1,33 @@
+/* { dg-do compile } */
+/* { dg-options "-march=rv64gcv -mabi=lp64d -O3 -fno-vect-cost-model -ffast-math -fdump-tree-optimized-details" } */
+
+void
+foo (int n, int *__restrict in, int *__restrict out)
+{
+  for (int i = 0; i < n; i += 1)
+    {
+      out[i] = in[i] + i;
+    }
+}
+
+void
+foo2 (int n, float * __restrict in, 
+float * __restrict out)
+{
+  for (int i = 0; i < n; i += 1)
+    {
+      out[i] = in[i] + i;
+    }
+}
+
+void
+foo3 (int n, float * __restrict in, 
+float * __restrict out, float x)
+{
+  for (int i = 0; i < n; i += 1)
+    {
+      out[i] = in[i] + i* i;
+    }
+}
+
+/* We don't want to see vect_vec_iv_.21_25 + { POLY_INT_CST [4, 4], ... }.  */
diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 8abc1937d74..b152072c969 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -10306,10 +10306,36 @@ vectorizable_induction (loop_vec_info loop_vinfo,
 
 
   /* Create the vector that holds the step of the induction.  */
+  gimple_stmt_iterator *step_iv_si = NULL;
   if (nested_in_vect_loop)
     /* iv_loop is nested in the loop to be vectorized. Generate:
        vec_step = [S, S, S, S]  */
     new_name = step_expr;
+  else if (LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo))
+    {
+      /* When we're using loop_len produced by SELEC_VL, the non-final
+	 iterations are not always processing VF elements.  So vectorize
+	 induction variable instead of
+
+	   _21 = vect_vec_iv_.6_22 + { VF, ... };
+
+	 We should generate:
+
+	   _35 = .SELECT_VL (ivtmp_33, VF);
+	   vect_cst__22 = [vec_duplicate_expr] _35;
+	   _21 = vect_vec_iv_.6_22 + vect_cst__22;  */
+      gcc_assert (!slp_node);
+      gimple_seq seq = NULL;
+      vec_loop_lens *lens = &LOOP_VINFO_LENS (loop_vinfo);
+      tree len = vect_get_loop_len (loop_vinfo, NULL, lens, 1, vectype, 0, 0);
+      expr = force_gimple_operand (fold_convert (TREE_TYPE (step_expr),
+						 unshare_expr (len)),
+				   &seq, true, NULL_TREE);
+      new_name = gimple_build (&seq, MULT_EXPR, TREE_TYPE (step_expr), expr,
+			       step_expr);
+      gsi_insert_seq_before (&si, seq, GSI_SAME_STMT);
+      step_iv_si = &si;
+    }
   else
     {
       /* iv_loop is the loop to be vectorized. Generate:
@@ -10336,7 +10362,7 @@ vectorizable_induction (loop_vec_info loop_vinfo,
 	      || TREE_CODE (new_name) == SSA_NAME);
   new_vec = build_vector_from_val (step_vectype, t);
   vec_step = vect_init_vector (loop_vinfo, stmt_info,
-			       new_vec, step_vectype, NULL);
+			       new_vec, step_vectype, step_iv_si);
 
 
   /* Create the following def-use cycle:
@@ -10382,6 +10408,8 @@ vectorizable_induction (loop_vec_info loop_vinfo,
       gimple_seq seq = NULL;
       /* FORNOW. This restriction should be relaxed.  */
       gcc_assert (!nested_in_vect_loop);
+      /* We expect LOOP_VINFO_USING_SELECT_VL_P to be false if ncopies > 1.  */
+      gcc_assert (!LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo));
 
       /* Create the vector that holds the step of the induction.  */
       if (SCALAR_FLOAT_TYPE_P (TREE_TYPE (step_expr)))
-- 
2.36.3


^ permalink raw reply	[flat|nested] 2+ messages in thread

* Re: [PATCH V2] Middle-end: Fix bug of induction variable vectorization for RVV
  2023-11-10 12:20 [PATCH V2] Middle-end: Fix bug of induction variable vectorization for RVV Juzhe-Zhong
@ 2023-11-10 13:35 ` Richard Biener
  0 siblings, 0 replies; 2+ messages in thread
From: Richard Biener @ 2023-11-10 13:35 UTC (permalink / raw)
  To: Juzhe-Zhong; +Cc: gcc-patches, richard.sandiford

On Fri, 10 Nov 2023, Juzhe-Zhong wrote:

> PR: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112438
> 
> 1. Since SELECT_VL result is not necessary always VF in non-final iteration.
> 
> Current GIMPLE IR is wrong:
> 
> # vect_vec_iv_.8_22 = PHI <_21(4), { 0, 1, 2, ... }(3)>
> ...
> _35 = .SELECT_VL (ivtmp_33, VF);
> _21 = vect_vec_iv_.8_22 + { VF, ... };
> 
> E.g. Consider the total iterations N = 6, the VF = 4.
> Since SELECT_VL output is defined as not always to be VF in non-final iteration
> which needs to depend on hardware implementation.
> 
> Suppose we have a RVV CPU core with vsetvl doing even distribution workload optimization.
> It may process 3 elements at the 1st iteration and 3 elements at the last iteration.
> Then the induction variable here: _21 = vect_vec_iv_.8_22 + { POLY_INT_CST [4, 4], ... }; 
> is wrong which is adding VF, which is 4, actually, we didn't process 4 elements.
> 
> It should be adding 3 elements which is the result of SELECT_VL.
> So, here the correct IR should be:
> 
>   _36 = .SELECT_VL (ivtmp_34, VF);
>   _22 = (int) _36;
>   vect_cst__21 = [vec_duplicate_expr] _22;
> 
> 2. This issue only happens on non-SLP vectorization single rgroup since:
>    
>      if (LOOP_VINFO_USING_DECREMENTING_IV_P (loop_vinfo))
>     {
>       tree iv_type = LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo);
>       if (direct_internal_fn_supported_p (IFN_SELECT_VL, iv_type,
> 					  OPTIMIZE_FOR_SPEED)
> 	  && LOOP_VINFO_LENS (loop_vinfo).length () == 1
> 	  && LOOP_VINFO_LENS (loop_vinfo)[0].factor == 1 && !slp
> 	  && (!LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> 	      || !LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant ()))
> 	LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo) = true;
>     }
> 
> 3. This issue doesn't appears on nested loop no matter LOOP_VINFO_USING_SELECT_VL_P is true or false.
> 
> Since:
> 
>   # vect_vec_iv_.6_5 = PHI <_19(3), { 0, ... }(5)>
>   # vect_diff_15.7_20 = PHI <vect_diff_9.8_22(3), vect_diff_18.5_11(5)>
>   _19 = vect_vec_iv_.6_5 + { 1, ... };
>   vect_diff_9.8_22 = .COND_LEN_ADD ({ -1, ... }, vect_vec_iv_.6_5, vect_diff_15.7_20, vect_diff_15.7_20, _28, 0);
>   ivtmp_1 = ivtmp_4 + 4294967295;
>   ....
>   <bb 5> [local count: 6549826]:
>   # vect_diff_18.5_11 = PHI <vect_diff_9.8_22(4), { 0, ... }(2)>
>   # ivtmp_26 = PHI <ivtmp_27(4), 40(2)>
>   _28 = .SELECT_VL (ivtmp_26, POLY_INT_CST [4, 4]);
>   goto <bb 3>; [100.00%]
> 
> Note the induction variable IR: _21 = vect_vec_iv_.8_22 + { POLY_INT_CST [4, 4], ... }; update induction variable
> independent on VF (or don't care about how many elements are processed in the iteration).
> 
> The update is loop invariant. So it won't be the problem even if LOOP_VINFO_USING_SELECT_VL_P is true.
>    
> Testing passed, Ok for trunk ?

OK.

Richard.

> 	PR tree-optimization/112438
> 
> gcc/ChangeLog:
> 
> 	* tree-vect-loop.cc (vectorizable_induction):
> 
> gcc/testsuite/ChangeLog:
> 
> 	* gcc.target/riscv/rvv/autovec/pr112438.c: New test.
> 
> ---
>  .../gcc.target/riscv/rvv/autovec/pr112438.c   | 33 +++++++++++++++++++
>  gcc/tree-vect-loop.cc                         | 30 ++++++++++++++++-
>  2 files changed, 62 insertions(+), 1 deletion(-)
>  create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/pr112438.c
> 
> diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr112438.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr112438.c
> new file mode 100644
> index 00000000000..51f90df38a0
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr112438.c
> @@ -0,0 +1,33 @@
> +/* { dg-do compile } */
> +/* { dg-options "-march=rv64gcv -mabi=lp64d -O3 -fno-vect-cost-model -ffast-math -fdump-tree-optimized-details" } */
> +
> +void
> +foo (int n, int *__restrict in, int *__restrict out)
> +{
> +  for (int i = 0; i < n; i += 1)
> +    {
> +      out[i] = in[i] + i;
> +    }
> +}
> +
> +void
> +foo2 (int n, float * __restrict in, 
> +float * __restrict out)
> +{
> +  for (int i = 0; i < n; i += 1)
> +    {
> +      out[i] = in[i] + i;
> +    }
> +}
> +
> +void
> +foo3 (int n, float * __restrict in, 
> +float * __restrict out, float x)
> +{
> +  for (int i = 0; i < n; i += 1)
> +    {
> +      out[i] = in[i] + i* i;
> +    }
> +}
> +
> +/* We don't want to see vect_vec_iv_.21_25 + { POLY_INT_CST [4, 4], ... }.  */
> diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> index 8abc1937d74..b152072c969 100644
> --- a/gcc/tree-vect-loop.cc
> +++ b/gcc/tree-vect-loop.cc
> @@ -10306,10 +10306,36 @@ vectorizable_induction (loop_vec_info loop_vinfo,
>  
>  
>    /* Create the vector that holds the step of the induction.  */
> +  gimple_stmt_iterator *step_iv_si = NULL;
>    if (nested_in_vect_loop)
>      /* iv_loop is nested in the loop to be vectorized. Generate:
>         vec_step = [S, S, S, S]  */
>      new_name = step_expr;
> +  else if (LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo))
> +    {
> +      /* When we're using loop_len produced by SELEC_VL, the non-final
> +	 iterations are not always processing VF elements.  So vectorize
> +	 induction variable instead of
> +
> +	   _21 = vect_vec_iv_.6_22 + { VF, ... };
> +
> +	 We should generate:
> +
> +	   _35 = .SELECT_VL (ivtmp_33, VF);
> +	   vect_cst__22 = [vec_duplicate_expr] _35;
> +	   _21 = vect_vec_iv_.6_22 + vect_cst__22;  */
> +      gcc_assert (!slp_node);
> +      gimple_seq seq = NULL;
> +      vec_loop_lens *lens = &LOOP_VINFO_LENS (loop_vinfo);
> +      tree len = vect_get_loop_len (loop_vinfo, NULL, lens, 1, vectype, 0, 0);
> +      expr = force_gimple_operand (fold_convert (TREE_TYPE (step_expr),
> +						 unshare_expr (len)),
> +				   &seq, true, NULL_TREE);
> +      new_name = gimple_build (&seq, MULT_EXPR, TREE_TYPE (step_expr), expr,
> +			       step_expr);
> +      gsi_insert_seq_before (&si, seq, GSI_SAME_STMT);
> +      step_iv_si = &si;
> +    }
>    else
>      {
>        /* iv_loop is the loop to be vectorized. Generate:
> @@ -10336,7 +10362,7 @@ vectorizable_induction (loop_vec_info loop_vinfo,
>  	      || TREE_CODE (new_name) == SSA_NAME);
>    new_vec = build_vector_from_val (step_vectype, t);
>    vec_step = vect_init_vector (loop_vinfo, stmt_info,
> -			       new_vec, step_vectype, NULL);
> +			       new_vec, step_vectype, step_iv_si);
>  
>  
>    /* Create the following def-use cycle:
> @@ -10382,6 +10408,8 @@ vectorizable_induction (loop_vec_info loop_vinfo,
>        gimple_seq seq = NULL;
>        /* FORNOW. This restriction should be relaxed.  */
>        gcc_assert (!nested_in_vect_loop);
> +      /* We expect LOOP_VINFO_USING_SELECT_VL_P to be false if ncopies > 1.  */
> +      gcc_assert (!LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo));
>  
>        /* Create the vector that holds the step of the induction.  */
>        if (SCALAR_FLOAT_TYPE_P (TREE_TYPE (step_expr)))
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)

^ permalink raw reply	[flat|nested] 2+ messages in thread

end of thread, other threads:[~2023-11-10 13:35 UTC | newest]

Thread overview: 2+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-11-10 12:20 [PATCH V2] Middle-end: Fix bug of induction variable vectorization for RVV Juzhe-Zhong
2023-11-10 13:35 ` Richard Biener

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).