From: Richard Sandiford <richard.sandiford@arm.com>
To: juzhe.zhong@rivai.ai
Cc: gcc-patches@gcc.gnu.org, rguenther@suse.de
Subject: Re: [PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support
Date: Wed, 10 May 2023 17:45:13 +0100
In-Reply-To: <20230504132540.286148-1-juzhe.zhong@rivai.ai> (juzhe zhong's message of "Thu, 4 May 2023 21:25:40 +0800")

In addition to Jeff's comments:

juzhe.zhong@rivai.ai writes:
> [...]
> diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
> index cc4a93a8763..99cf0cdbdca 100644
> --- a/gcc/doc/md.texi
> +++ b/gcc/doc/md.texi
> @@ -4974,6 +4974,40 @@ for (i = 1; i < operand3; i++)
>    operand0[i] = operand0[i - 1] && (operand1 + i < operand2);
> @end smallexample
>
> +@cindex @code{select_vl@var{m}} instruction pattern
> +@item @code{select_vl@var{m}}
> +Set operand 0 to the number of active elements in vector will be updated value.
> +operand 1 is the total elements need to be updated value.
> +operand 2 is the vectorization factor.
> +The value of operand 0 is target dependent and flexible in each iteration.
> +The operation of this pattern can be:
> +
> +@smallexample
> +Case 1:
> +operand0 = MIN (operand1, operand2);
> +operand2 can be const_poly_int or poly_int related to vector mode size.
> +Some target like RISC-V has a standalone instruction to get MIN (n, MODE SIZE) so
> +that we can reduce a use of general purpose register.
> +
> +In this case, only the last iteration of the loop is partial iteration.
> +@end smallexample
> +
> +@smallexample
> +Case 2:
> +if (operand1 <= operand2)
> +  operand0 = operand1;
> +else if (operand1 < 2 * operand2)
> +  operand0 = IN_RANGE (ceil (operand1 / 2), operand2);

GCC's IN_RANGE is a predicate, so it would be best to avoid that here.
Why isn't it simply ceil (operand1 / 2), which must be <= operand2?

> +else
> +  operand0 = operand2;
> +
> +This case will evenly distribute work over the last 2 iterations of a stripmine loop.
> +@end smallexample
> +
> +The output of this pattern is not only used as IV of loop control counter, but also
> +is used as the IV of address calculation with multiply/shift operation.  This allow
> +us dynamic adjust the number of elements is processed in each iteration of the loop.
> +
>  @cindex @code{check_raw_ptrs@var{m}} instruction pattern
>  @item @samp{check_raw_ptrs@var{m}}
>  Check whether, given two pointers @var{a} and @var{b} and a length @var{len},
> [...]
> diff --git a/gcc/tree-ssa-loop-manip.cc b/gcc/tree-ssa-loop-manip.cc
> index 909b705d00d..5abca64379e 100644
> --- a/gcc/tree-ssa-loop-manip.cc
> +++ b/gcc/tree-ssa-loop-manip.cc
> @@ -47,7 +47,9 @@ along with GCC; see the file COPYING3.  If not see
>     so that we can free them all at once.  */
>  static bitmap_obstack loop_renamer_obstack;
>
> -/* Creates an induction variable with value BASE + STEP * iteration in LOOP.
> +/* Creates an induction variable with value BASE (+/-) STEP * iteration in LOOP.
> +   If CODE is PLUS_EXPR, the induction variable is BASE + STEP * iteration.
> +   If CODE is MINUS_EXPR, the induction variable is BASE - STEP * iteration.
>     It is expected that neither BASE nor STEP are shared with other expressions
>     (unless the sharing rules allow this).  Use VAR as a base var_decl for it
>     (if NULL, a new temporary will be created).  The increment will occur at
> @@ -57,8 +59,8 @@ static bitmap_obstack loop_renamer_obstack;
>     VAR_AFTER (unless they are NULL).  */
>
>  void
> -create_iv (tree base, tree step, tree var, class loop *loop,
> -	   gimple_stmt_iterator *incr_pos, bool after,
> +create_iv (tree base, tree_code code, tree step, tree var,
> +	   class loop *loop, gimple_stmt_iterator *incr_pos, bool after,
>  	   tree *var_before, tree *var_after)
>  {
>    gassign *stmt;
> @@ -66,7 +68,9 @@ create_iv (tree base, tree step, tree var, class loop *loop,
>    tree initial, step1;
>    gimple_seq stmts;
>    tree vb, va;
> -  enum tree_code incr_op = PLUS_EXPR;
> +  /* The code can only be PLUS_EXPR or MINUS_EXPR.  */
> +  gcc_assert (code == PLUS_EXPR || code == MINUS_EXPR);
> +  tree_code incr_op = code;

As Richard said, we should be able to get rid of incr_op, probably by
calling the parameter incr_op.  I think the later:

  incr_op = MINUS_EXPR;

stmts need to be updated to something that flips between PLUS_EXPR and
MINUS_EXPR (with updates to the comments).

It would probably make sense to split the create_iv part out as a
separate prepatch.
>    edge pe = loop_preheader_edge (loop);
>
>    if (var != NULL_TREE)
> @@ -1365,7 +1369,7 @@ tree_transform_and_unroll_loop (class loop *loop, unsigned factor,
>        tree ctr_before, ctr_after;
>        gimple_stmt_iterator bsi = gsi_last_nondebug_bb (new_exit->src);
>        exit_if = as_a <gcond *> (gsi_stmt (bsi));
> -      create_iv (exit_base, exit_step, NULL_TREE, loop,
> +      create_iv (exit_base, PLUS_EXPR, exit_step, NULL_TREE, loop,
>  		 &bsi, false, &ctr_before, &ctr_after);
>        gimple_cond_set_code (exit_if, exit_cmp);
>        gimple_cond_set_lhs (exit_if, ctr_after);
> @@ -1580,8 +1584,8 @@ canonicalize_loop_ivs (class loop *loop, tree *nit, bool bump_in_latch)
>      gsi = gsi_last_bb (loop->latch);
>    else
>      gsi = gsi_last_nondebug_bb (loop->header);
> -  create_iv (build_int_cst_type (type, 0), build_int_cst (type, 1), NULL_TREE,
> -	     loop, &gsi, bump_in_latch, &var_before, NULL);
> +  create_iv (build_int_cst_type (type, 0), PLUS_EXPR, build_int_cst (type, 1),
> +	     NULL_TREE, loop, &gsi, bump_in_latch, &var_before, NULL);
>
>    rewrite_all_phi_nodes_with_iv (loop, var_before);
>
> [...]
> diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc
> index 44bd5f2c805..d63ded5d4f0 100644
> --- a/gcc/tree-vect-loop-manip.cc
> +++ b/gcc/tree-vect-loop-manip.cc
> @@ -385,6 +385,48 @@ vect_maybe_permute_loop_masks (gimple_seq *seq, rgroup_controls *dest_rgm,
>    return false;
>  }
>
> +/* Try to use permutes to define the lens in DEST_RGM using the lens
> +   in SRC_RGM, given that the former has twice as many lens as the
> +   latter.  Return true on success, adding any new statements to SEQ.  */

Nit: I think it's better to talk about "lengths" rather than "lens"
(which ordinarily means something different).

Like Jeff said, this operation is no longer a permute when applied to
lengths (as opposed to masks).  And because of that, I don't think
there's anything special about DEST_RGM having twice as many controls
as SRC_RGM.  (And the implementation does seem to be more general than
just a factor of two.)
> +
> +static bool
> +vect_maybe_permute_loop_lens (tree iv_type, gimple_seq *seq,
> +			      rgroup_controls *dest_rgm,
> +			      rgroup_controls *src_rgm)
> +{
> +  tree ctrl_type = dest_rgm->type;
> +  poly_uint64 nitems_per_ctrl
> +    = TYPE_VECTOR_SUBPARTS (ctrl_type) * dest_rgm->factor;
> +  if (dest_rgm->max_nscalars_per_iter <= src_rgm->max_nscalars_per_iter)
> +    {
> +      for (unsigned int i = 0; i < dest_rgm->controls.length (); ++i)
> +	{
> +	  tree src = src_rgm->controls[i / dest_rgm->controls.length ()];
> +	  tree dest = dest_rgm->controls[i];
> +	  gassign *stmt;
> +	  if (i == 0)
> +	    {
> +	      /* MIN (X, VF*I/N) capped to the range [0, VF/N].  */
> +	      tree factor = build_int_cst (iv_type, nitems_per_ctrl);
> +	      stmt = gimple_build_assign (dest, MIN_EXPR, src, factor);
> +	      gimple_seq_add_stmt (seq, stmt);
> +	    }
> +	  else
> +	    {
> +	      /* (X - MIN (X, VF*I/N)) capped to the range [0, VF/N].  */
> +	      tree factor = build_int_cst (iv_type, nitems_per_ctrl * i);
> +	      tree temp = make_ssa_name (iv_type);
> +	      stmt = gimple_build_assign (temp, MIN_EXPR, src, factor);
> +	      gimple_seq_add_stmt (seq, stmt);
> +	      stmt = gimple_build_assign (dest, MINUS_EXPR, src, temp);
> +	      gimple_seq_add_stmt (seq, stmt);
> +	    }
> +	}
> +      return true;
> +    }
> +  return false;
> +}
> +

This return false thing looks odd.  I don't think we can decide on a
group-by-group basis whether to use this approach or not.  Instead it
has to be a per-loop decision.  Either (1) we use SELECT_VL for the
single-control group, and use this function to derive all other groups
from that, or (2) we just use MIN for all groups.  No mixing and
matching is allowed.

In other words, I think we need to decide at the top level whether to
use SELECT_VL or not.  As things stand, this decision should be "no"
if nitems_per_iter != 1 for *any* group.  If the decision is to use
SELECT_VL, we should use it for the single-control group only and then
unconditionally use the function above for the other groups, with no
failure path.
The decision about whether to use SELECT_VL or not should be cached if
necessary, so that it's available...

>  /* Helper for vect_set_loop_condition_partial_vectors.  Generate definitions
>     for all the rgroup controls in RGC and return a control that is nonzero
>     when the loop needs to iterate.  Add any new preheader statements to
> [...]
> @@ -10361,15 +10361,18 @@ vect_record_loop_len (loop_vec_info loop_vinfo, vec_loop_lens *lens,
>  }
>
>  /* Given a complete set of length LENS, extract length number INDEX for an
> -   rgroup that operates on NVECTORS vectors, where 0 <= INDEX < NVECTORS.  */
> +   rgroup that operates on NVECTORS vectors, where 0 <= INDEX < NVECTORS.
> +   Insert any set-up statements before GSI.  */
>
>  tree
> -vect_get_loop_len (loop_vec_info loop_vinfo, vec_loop_lens *lens,
> -		   unsigned int nvectors, unsigned int index)
> +vect_get_loop_len (loop_vec_info loop_vinfo, gimple_stmt_iterator *gsi,
> +		   vec_loop_lens *lens, unsigned int nvectors, tree vectype,
> +		   unsigned int index)
>  {
>    rgroup_controls *rgl = &(*lens)[nvectors - 1];
>    bool use_bias_adjusted_len =
>      LOOP_VINFO_PARTIAL_LOAD_STORE_BIAS (loop_vinfo) != 0;
> +  tree iv_type = LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo);
>
>    /* Populate the rgroup's len array, if this is the first time we've
>       used it.  */
> @@ -10400,6 +10403,27 @@ vect_get_loop_len (loop_vec_info loop_vinfo, vec_loop_lens *lens,
>
>    if (use_bias_adjusted_len)
>      return rgl->bias_adjusted_ctrl;
> +  else if (direct_internal_fn_supported_p (IFN_SELECT_VL, iv_type,
> +					   OPTIMIZE_FOR_SPEED))
> +    {
> +      tree loop_len = rgl->controls[index];
> +      poly_int64 nunits1 = TYPE_VECTOR_SUBPARTS (rgl->type);
> +      poly_int64 nunits2 = TYPE_VECTOR_SUBPARTS (vectype);
> +      if (maybe_ne (nunits1, nunits2))
> +	{
> +	  /* A loop len for data type X can be reused for data type Y
> +	     if X has N times more elements than Y and if Y's elements
> +	     are N times bigger than X's.  */
> +	  gcc_assert (multiple_p (nunits1, nunits2));
> +	  unsigned int factor = exact_div (nunits1, nunits2).to_constant ();
> +	  gimple_seq seq = NULL;
> +	  loop_len = gimple_build (&seq, RDIV_EXPR, iv_type, loop_len,
> +				   build_int_cst (iv_type, factor));
> +	  if (seq)
> +	    gsi_insert_seq_before (gsi, seq, GSI_SAME_STMT);
> +	}
> +      return loop_len;
> +    }
>    else
>      return rgl->controls[index];
>  }

...here.  That is, the key isn't whether SELECT_VL is available, but
instead whether we've decided to use it for this loop (unless I'm
missing something).

> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index 3ad6a7d28d7..26813ab56e9 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -3147,6 +3147,61 @@ vect_get_data_ptr_increment (vec_info *vinfo,
>    return iv_step;
>  }
>
> +/* Prepare the pointer IVs which needs to be updated by a variable amount.
> +   Such variable amount is the outcome of .SELECT_VL. In this case, we can
> +   allow each iteration process the flexible number of elements as long as
> +   the number <= vf elments.
> +
> +   Return data reference according to SELECT_VL.
> +   If new statements are needed, insert them before GSI.  */
> +
> +static tree
> +get_select_vl_data_ref_ptr (vec_info *vinfo, stmt_vec_info stmt_info,
> +			    tree aggr_type, class loop *at_loop, tree offset,
> +			    tree *dummy, gimple_stmt_iterator *gsi,
> +			    bool simd_lane_access_p, vec_loop_lens *loop_lens,
> +			    dr_vec_info *dr_info,
> +			    vect_memory_access_type memory_access_type)
> +{
> +  loop_vec_info loop_vinfo = dyn_cast <loop_vec_info> (vinfo);
> +  tree step = vect_dr_behavior (vinfo, dr_info)->step;
> +
> +  /* TODO: We don't support gather/scatter or load_lanes/store_lanes for pointer
> +     IVs are updated by variable amount but we will support them in the future.
> +   */
> +  gcc_assert (memory_access_type != VMAT_GATHER_SCATTER
> +	      && memory_access_type != VMAT_LOAD_STORE_LANES);
> +
> +  /* When we support SELECT_VL pattern, we dynamic adjust
> +     the memory address by .SELECT_VL result.
> +
> +     The result of .SELECT_VL is the number of elements to
> +     be processed of each iteration. So the memory address
> +     adjustment operation should be:
> +
> +     bytesize = GET_MODE_SIZE (element_mode (aggr_type));
> +     addr = addr + .SELECT_VL (ARG..) * bytesize;
> +  */
> +  gimple *ptr_incr;
> +  tree loop_len
> +    = vect_get_loop_len (loop_vinfo, gsi, loop_lens, 1, aggr_type, 0);
> +  tree len_type = TREE_TYPE (loop_len);
> +  poly_uint64 bytesize = GET_MODE_SIZE (element_mode (aggr_type));
> +  /* Since the outcome of .SELECT_VL is element size, we should adjust
> +     it into bytesize so that it can be used in address pointer variable
> +     amount IVs adjustment.  */
> +  tree tmp = fold_build2 (MULT_EXPR, len_type, loop_len,
> +			  build_int_cst (len_type, bytesize));
> +  if (tree_int_cst_sgn (step) == -1)
> +    tmp = fold_build1 (NEGATE_EXPR, len_type, tmp);
> +  tree bump = make_temp_ssa_name (len_type, NULL, "ivtmp");
> +  gassign *assign = gimple_build_assign (bump, tmp);
> +  gsi_insert_before (gsi, assign, GSI_SAME_STMT);

It seems a little wasteful to recompute the temporary results (into gsi)
for every pointer, but I guess it's not worth complicating the code to
do on-the-fly DCE.  Later passes can clean it up for us.  (Unless Richi
feels differently.  Can't remember who's said what now.)

> +  return vect_create_data_ref_ptr (vinfo, stmt_info, aggr_type, at_loop, offset,
> +				   dummy, gsi, &ptr_incr, simd_lane_access_p,
> +				   bump);
> +}
> +
>  /* Check and perform vectorization of BUILT_IN_BSWAP{16,32,64,128}.  */
>
>  static bool
> @@ -8279,7 +8334,7 @@ vectorizable_store (vec_info *vinfo,
>
>        stride_base = cse_and_gimplify_to_preheader (loop_vinfo, stride_base);
>        ivstep = cse_and_gimplify_to_preheader (loop_vinfo, ivstep);
> -      create_iv (stride_base, ivstep, NULL,
> +      create_iv (stride_base, PLUS_EXPR, ivstep, NULL,
>  		 loop, &incr_gsi, insert_after,
>  		 &offvar, NULL);
>        incr = gsi_stmt (incr_gsi);
> @@ -8540,6 +8595,17 @@ vectorizable_store (vec_info *vinfo,
>  	    vect_get_gather_scatter_ops (loop_vinfo, loop, stmt_info,
>  					 slp_node, &gs_info, &dataref_ptr,
>  					 &vec_offsets);
> +	  else if (loop_lens && loop_lens->length () == 1
> +		   && direct_internal_fn_supported_p (
> +		     IFN_SELECT_VL, LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo),
> +		     OPTIMIZE_FOR_SPEED)
> +		   && memory_access_type != VMAT_INVARIANT)
> +	    dataref_ptr
> +	      = get_select_vl_data_ref_ptr (vinfo, stmt_info, aggr_type,
> +					    simd_lane_access_p ? loop : NULL,
> +					    offset, &dummy, gsi,
> +					    simd_lane_access_p, loop_lens,
> +					    dr_info, memory_access_type);

Here too I think the check should be based on a cached record of whether
we're using SELECT_VL for this particular loop.

What happens for loop_lens->length () != 1?
Thanks,
Richard

>  	  else
>  	    dataref_ptr
>  	      = vect_create_data_ref_ptr (vinfo, first_stmt_info, aggr_type,
> @@ -8788,8 +8854,9 @@ vectorizable_store (vec_info *vinfo,
>  	      else if (loop_lens)
>  		{
>  		  tree final_len
> -		    = vect_get_loop_len (loop_vinfo, loop_lens,
> -					 vec_num * ncopies, vec_num * j + i);
> +		    = vect_get_loop_len (loop_vinfo, gsi, loop_lens,
> +					 vec_num * ncopies, vectype,
> +					 vec_num * j + i);
>  		  tree ptr = build_int_cst (ref_type, align * BITS_PER_UNIT);
>  		  machine_mode vmode = TYPE_MODE (vectype);
>  		  opt_machine_mode new_ovmode
> @@ -9450,7 +9517,7 @@ vectorizable_load (vec_info *vinfo,
>
>        stride_base = cse_and_gimplify_to_preheader (loop_vinfo, stride_base);
>        ivstep = cse_and_gimplify_to_preheader (loop_vinfo, ivstep);
> -      create_iv (stride_base, ivstep, NULL,
> +      create_iv (stride_base, PLUS_EXPR, ivstep, NULL,
>  		 loop, &incr_gsi, insert_after,
>  		 &offvar, NULL);
>
> @@ -9928,6 +9995,16 @@ vectorizable_load (vec_info *vinfo,
>  					 slp_node, &gs_info, &dataref_ptr,
>  					 &vec_offsets);
>  	    }
> +	  else if (loop_lens && loop_lens->length () == 1
> +		   && direct_internal_fn_supported_p (
> +		     IFN_SELECT_VL, LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo),
> +		     OPTIMIZE_FOR_SPEED)
> +		   && memory_access_type != VMAT_INVARIANT)
> +	    dataref_ptr
> +	      = get_select_vl_data_ref_ptr (vinfo, stmt_info, aggr_type,
> +					    at_loop, offset, &dummy, gsi,
> +					    simd_lane_access_p, loop_lens,
> +					    dr_info, memory_access_type);
>  	  else
>  	    dataref_ptr
>  	      = vect_create_data_ref_ptr (vinfo, first_stmt_info, aggr_type,
> @@ -10144,8 +10221,8 @@ vectorizable_load (vec_info *vinfo,
>  		    else if (loop_lens && memory_access_type != VMAT_INVARIANT)
>  		      {
>  			tree final_len
> -			  = vect_get_loop_len (loop_vinfo, loop_lens,
> -					       vec_num * ncopies,
> +			  = vect_get_loop_len (loop_vinfo, gsi, loop_lens,
> +					       vec_num * ncopies, vectype,
>  					       vec_num * j + i);
>  			tree ptr = build_int_cst (ref_type,
>  						  align * BITS_PER_UNIT);
> diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
> index 9cf2fb23fe3..357bc5c7315 100644
> --- a/gcc/tree-vectorizer.h
> +++ b/gcc/tree-vectorizer.h
> @@ -2293,8 +2293,8 @@ extern tree vect_get_loop_mask (gimple_stmt_iterator *, vec_loop_masks *,
>  				unsigned int, tree, unsigned int);
>  extern void vect_record_loop_len (loop_vec_info, vec_loop_lens *, unsigned int,
>  				  tree, unsigned int);
> -extern tree vect_get_loop_len (loop_vec_info, vec_loop_lens *, unsigned int,
> -			       unsigned int);
> +extern tree vect_get_loop_len (loop_vec_info, gimple_stmt_iterator *, vec_loop_lens *,
> +			       unsigned int, tree, unsigned int);
>  extern gimple_seq vect_gen_len (tree, tree, tree, tree);
>  extern stmt_vec_info info_for_reduction (vec_info *, stmt_vec_info);
>  extern bool reduction_fn_for_scalar_code (code_helper, internal_fn *);