Re: [PATCH] [RFC] lower SLP load permutation to interleaving

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

* Re: [PATCH] [RFC] lower SLP load permutation to interleaving
       [not found] <c6f14930-59e5-4cc8-a38e-7f0452b67dba@DB5PEPF00014B89.eurprd02.prod.outlook.com>
@ 2024-06-04 19:58 ` Richard Sandiford
  2024-06-05 11:54   ` Richard Biener
  2024-06-05  9:00 ` Tamar Christina
  1 sibling, 1 reply; 7+ messages in thread
From: Richard Sandiford @ 2024-06-04 19:58 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches, tamar.christina

Richard Biener <rguenther@suse.de> writes:
> The following emulates classical interleaving for SLP load permutes
> that we are unlikely handling natively.  This is to handle cases
> where interleaving (or load/store-lanes) is the optimal choice for
> vectorizing even when we are doing that within SLP.  An example
> would be
>
> void foo (int * __restrict a, int * b)
> {
>   for (int i = 0; i < 16; ++i)
>     {
>       a[4*i + 0] = b[4*i + 0] * 3;
>       a[4*i + 1] = b[4*i + 1] + 3;
>       a[4*i + 2] = (b[4*i + 2] * 3 + 3);
>       a[4*i + 3] = b[4*i + 3] * 3;
>     }
> }
>
> where currently the SLP store is merging four single-lane SLP
> sub-graphs but none of the loads in it can be code-generated
> with V4SImode vectors and a VF of four as the permutes would need
> three vectors.

Nice!

> The patch introduces a lowering phase after SLP discovery but
> before SLP pattern recognition or permute optimization that
> analyzes all loads from the same dataref group and creates an
> interleaving scheme starting from an unpermuted load.
>
> What can be handled is quite restrictive, matching only a subset
> of the non-SLP interleaving cases (the power-of-two group size
> ones, in addition only cases without gaps).  The interleaving
> vectorization in addition can handle size 3 and 5 - but I am not
> sure if it's possible to do that in a VL agnostic way.  It
> should be still possible to set up the SLP graph in a way that
> a load-lane could be matched from SLP pattern recognition.

Yeah, I don't think it would be possible to decompose a 3- or
5-lane grouped load into a series of VLA 2-input permutes.
But (as I think you're saying) it seems like a load-3-lanes would just
be a load with a LANE_PERMUTATION of N, N+3, N+6, N+9, ... for lane N.
Is that right?

> As said gaps are currently not handled - for SLP we have a
> representational issue that SLP_TREE_SCALAR_STMTS for "gap lanes"
> would need to be filled in some way (even if we just push NULL).
>
> The patch misses multi-level even/odd handling as well as CSEing
> intermediate generated permutes.  Both is quite straight-forward
> to add, but eventually there's a better or more general strategy
> for lowering?  The main goal of the patch is to avoid falling
> back to non-SLP for cases the interleaving code handles.

Does the multi-level thing including examples like:

int a[2 * 16];
int b[8 * 16];
void f()
{
  for (int i = 0; i < 16; ++i)
    {
      a[i * 2 + 0] += b[i * 8 + 0] + b[i * 8 + 1] + b[i * 8 + 2] + b[i * 8 + 3];
      a[i * 2 + 1] += b[i * 8 + 4] + b[i * 8 + 5] + b[i * 8 + 6] + b[i * 8 + 7];
    }
}

?  For that we generate:

  _45 = VEC_PERM_EXPR <vect__4.9_63, vect__4.10_61, { 1, 3, 5, 7 }>;
  _44 = VEC_PERM_EXPR <vect__4.11_59, vect__4.12_57, { 1, 3, 5, 7 }>;
  _43 = VEC_PERM_EXPR <_45, _44, { 1, 3, 5, 7 }>;
  _49 = VEC_PERM_EXPR <vect__4.9_63, vect__4.10_61, { 0, 2, 4, 6 }>;
  _48 = VEC_PERM_EXPR <vect__4.11_59, vect__4.12_57, { 0, 2, 4, 6 }>;
  _47 = VEC_PERM_EXPR <_49, _48, { 1, 3, 5, 7 }>;
  _53 = VEC_PERM_EXPR <vect__4.9_63, vect__4.10_61, { 0, 1, 4, 5 }>;
  _52 = VEC_PERM_EXPR <vect__4.11_59, vect__4.12_57, { 0, 1, 4, 5 }>;
  _51 = VEC_PERM_EXPR <_53, _52, { 1, 3, 5, 7 }>;
  _54 = VEC_PERM_EXPR <_49, _48, { 0, 2, 4, 6 }>;

(two even level 1, one even level 2, one odd level 1), whereas
preferring 2xeven + 2xodd would avoid the third set of first-level
permutes:

  _45 = VEC_PERM_EXPR <vect__4.9_63, vect__4.10_61, { 1, 3, 5, 7 }>;
  _44 = VEC_PERM_EXPR <vect__4.11_59, vect__4.12_57, { 1, 3, 5, 7 }>;
  _43 = VEC_PERM_EXPR <_45, _44, { 1, 3, 5, 7 }>;
  _49 = VEC_PERM_EXPR <vect__4.9_63, vect__4.10_61, { 0, 2, 4, 6 }>;
  _48 = VEC_PERM_EXPR <vect__4.11_59, vect__4.12_57, { 0, 2, 4, 6 }>;
  _47 = VEC_PERM_EXPR <_49, _48, { 1, 3, 5, 7 }>;
  _51 = VEC_PERM_EXPR <_45, _44, { 0, 2, 4, 6 }>;
  _54 = VEC_PERM_EXPR <_49, _48, { 0, 2, 4, 6 }>;

> Comments and suggestions welcome, esp. what representation
> you'd think is suitable for SLP pattern matching to
> load/store-lane and how to represent that?  Maybe this lowering
> should happen directly in vect_lower_load_permutations?

If the load-lanes representation is as simple as above, it sounds like
it could be deferred to pattern matching.  Not sure what the result
would look like though.  It would be nice if (at least for costing
purposes) we could have a single node for all lanes of the load-lanes,
rather than create a separate node for each lane and rely on later CSE.
(Or do we already have a good representation for this?  It's been too
long, sorry.)

Bit of trivia below:

> Thanks,
> Richard.
>
> 	* tree-vect-slp.cc (vllp_cmp): New function.
> 	(vect_lower_load_permutations): Likewise.
> 	(vect_analyze_slp): Call it.
> ---
>  gcc/tree-vect-slp.cc | 279 +++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 279 insertions(+)
>
> diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
> index 7e3d0107b4e..766b773452f 100644
> --- a/gcc/tree-vect-slp.cc
> +++ b/gcc/tree-vect-slp.cc
> @@ -3839,6 +3839,279 @@ vect_analyze_slp_instance (vec_info *vinfo,
>    return res;
>  }
>  
> +/* qsort comparator ordering SLP load nodes.  */
> +
> +static int
> +vllp_cmp (const void *a_, const void *b_)
> +{
> +  const slp_tree a = *(const slp_tree *)a_;
> +  const slp_tree b = *(const slp_tree *)b_;
> +  stmt_vec_info a0 = SLP_TREE_SCALAR_STMTS (a)[0];
> +  stmt_vec_info b0 = SLP_TREE_SCALAR_STMTS (b)[0];
> +  if (STMT_VINFO_GROUPED_ACCESS (a0)
> +      && STMT_VINFO_GROUPED_ACCESS (b0)
> +      && DR_GROUP_FIRST_ELEMENT (a0) == DR_GROUP_FIRST_ELEMENT (b0))
> +    {
> +      /* Same group, order after lanes used.  */
> +      if (SLP_TREE_LANES (a) < SLP_TREE_LANES (b))
> +	return 1;
> +      else if (SLP_TREE_LANES (a) > SLP_TREE_LANES (b))
> +	return -1;
> +      else
> +	{
> +	  /* Try to order loads using the same lanes together, breaking
> +	     the tie with the lane number that first differs.  */
> +	  if (!SLP_TREE_LOAD_PERMUTATION (a).exists ()
> +	      && !SLP_TREE_LOAD_PERMUTATION (b).exists ())
> +	    return 0;

Does the comparison need to be "stable", with a further tie-breaker
when a != b?  Or does our qsort not rely on that?

> +	  else if (SLP_TREE_LOAD_PERMUTATION (a).exists ()
> +		   && !SLP_TREE_LOAD_PERMUTATION (b).exists ())
> +	    return 1;
> +	  else if (!SLP_TREE_LOAD_PERMUTATION (a).exists ()
> +		   && SLP_TREE_LOAD_PERMUTATION (b).exists ())
> +	    return -1;
> +	  else
> +	    {
> +	      for (unsigned i = 0; i < SLP_TREE_LANES (a); ++i)
> +		if (SLP_TREE_LOAD_PERMUTATION (a)[i]
> +		    != SLP_TREE_LOAD_PERMUTATION (b)[i])
> +		  {
> +		    /* In-order lane first, that's what the above case for
> +		       no permutation does.  */
> +		    if (SLP_TREE_LOAD_PERMUTATION (a)[i] == i)
> +		      return -1;
> +		    else if (SLP_TREE_LOAD_PERMUTATION (b)[i] == i)
> +		      return 1;
> +		    else if (SLP_TREE_LOAD_PERMUTATION (a)[i]
> +			     < SLP_TREE_LOAD_PERMUTATION (b)[i])
> +		      return -1;
> +		    else
> +		      return 1;
> +		  }
> +	      return 0;
> +	    }
> +	}
> +    }
> +  else /* Different groups or non-groups.  */
> +    {
> +      /* Order groups as their first element to keep them together.  */
> +      if (STMT_VINFO_GROUPED_ACCESS (a0))
> +	a0 = DR_GROUP_FIRST_ELEMENT (a0);
> +      if (STMT_VINFO_GROUPED_ACCESS (b0))
> +	b0 = DR_GROUP_FIRST_ELEMENT (b0);
> +      if (a0 == b0)
> +	return 0;
> +      /* Tie using UID.  */
> +      else if (gimple_uid (STMT_VINFO_STMT (a0))
> +	       < gimple_uid (STMT_VINFO_STMT (b0)))
> +	return -1;
> +      else
> +	{
> +	  gcc_assert (gimple_uid (STMT_VINFO_STMT (a0))
> +		      != gimple_uid (STMT_VINFO_STMT (b0)));
> +	  return 1;
> +	}
> +    }
> +}
> +
> +/* Process the set of LOADS that are all from the same dataref group.  */
> +
> +static void
> +vect_lower_load_permutations (loop_vec_info loop_vinfo,
> +			      scalar_stmts_to_slp_tree_map_t *bst_map,
> +			      const array_slice<slp_tree> &loads)
> +{
> +  /* We at this point want to lower without a fixed VF or vector
> +     size in mind which means we cannot actually compute whether we
> +     need three or more vectors for a load permutation yet.  So always
> +     lower.  */
> +  stmt_vec_info first
> +    = DR_GROUP_FIRST_ELEMENT (SLP_TREE_SCALAR_STMTS (loads[0])[0]);
> +
> +  /* ???  In principle we have to consider a gap up to the next full
> +     vector, but we have to actually represent a scalar stmt for the
> +     gaps value so delay handling this.  The same is true for
> +     inbetween gaps which the load places in the load-permutation
> +     represent.  It's probably not worth trying an intermediate packing
> +     to vectors without gap even if that might handle some more cases.
> +     Instead get the gap case correct in some way.  */
> +  unsigned group_lanes = 0;
> +  for (stmt_vec_info s = first; s; s = DR_GROUP_NEXT_ELEMENT (s))
> +    {
> +      if ((s == first && DR_GROUP_GAP (s) != 0)
> +	  || (s != first && DR_GROUP_GAP (s) != 1))
> +	return;
> +      group_lanes++;
> +    }
> +  /* Only a power-of-two number of lanes matches interleaving.  */
> +  if (exact_log2 (group_lanes) == -1)
> +    return;
> +
> +  for (slp_tree load : loads)
> +    {
> +      /* Leave masked or gather loads alone for now.  */
> +      if (!SLP_TREE_CHILDREN (load).is_empty ())
> +	continue;
> +
> +      /* We need to lower only loads of less than half of the groups
> +	 lanes, including duplicate lanes.  */
> +      if (SLP_TREE_LANES (load) >= group_lanes / 2)
> +	continue;
> +
> +      /* Lower by reducing the group to half its size using an
> +	 interleaving scheme.  For this try to compute whether all
> +	 elements needed for this loads are in even or odd elements of

this load (or these loads)

> +	 a even/odd decomposition with N consecutive elements.

an even/odd

> +	 Thus { e, e, o, o, e, e, o, o } woud be an even/odd decomposition
> +	 with N == 2.  */
> +      unsigned even = (1 << ceil_log2 (DR_GROUP_SIZE (first))) - 1;

Is this DR_GROUP_SIZE (first) conceptually different from group_lanes
above?  If not, I think it'd be a bit easier to follow if this line reused
the exact_log2 result above.

> +      unsigned odd = even;
> +      for (unsigned l : SLP_TREE_LOAD_PERMUTATION (load))
> +	{
> +	  even &= ~l;
> +	  odd &= l;
> +	}
> +      /* Give up when this doesn't match up with an interleaving scheme.  */
> +      if (!even && !odd)
> +	continue;
> +
> +      /* First build (and possibly re-use) a load node for the
> +	 unpermuted group.  */
> +      vec<stmt_vec_info> stmts;
> +      stmts.create (group_lanes);
> +      for (stmt_vec_info s = first; s; s = DR_GROUP_NEXT_ELEMENT (s))
> +	stmts.quick_push (s);
> +      poly_uint64 max_nunits;
> +      bool *matches = XALLOCAVEC (bool, group_lanes);
> +      unsigned limit = 1;
> +      unsigned tree_size = 0;
> +      slp_tree l0 = vect_build_slp_tree (loop_vinfo, stmts,
> +					 group_lanes,
> +					 &max_nunits, matches, &limit,
> +					 &tree_size, bst_map);
> +
> +      /* Build the permute to get the original load permutation order.  */
> +      lane_permutation_t final_perm;
> +      final_perm.create (SLP_TREE_LANES (load));
> +      for (unsigned i = 0; i < SLP_TREE_LANES (load); ++i)
> +	final_perm.quick_push
> +	  (std::make_pair (0, SLP_TREE_LOAD_PERMUTATION (load)[i]));
> +
> +      /* Now build a even or odd extraction from the unpermuted load.  */

an even

Thanks,
Richard

> +      lane_permutation_t perm;
> +      perm.create (group_lanes / 2);
> +      unsigned level;
> +      if (even
> +	  && ((level = 1 << ctz_hwi (even)), true)
> +	  && group_lanes % (2 * level) == 0)
> +	{
> +	  /* { 0, 1, ... 4, 5 ..., } */
> +	  unsigned level = 1 << ctz_hwi (even);
> +	  for (unsigned i = 0; i < group_lanes / 2 / level; ++i)
> +	    for (unsigned j = 0; j < level; ++j)
> +	      perm.quick_push (std::make_pair (0, 2 * i * level + j));
> +	}
> +      else if (odd)
> +	{
> +	  /* { ..., 2, 3, ... 6, 7 } */
> +	  unsigned level = 1 << ctz_hwi (odd);
> +	  gcc_assert (group_lanes % (2 * level) == 0);
> +	  for (unsigned i = 0; i < group_lanes / 2 / level; ++i)
> +	    for (unsigned j = 0; j < level; ++j)
> +	      perm.quick_push (std::make_pair (0, (2 * i + 1) * level + j));
> +	}
> +      else
> +	gcc_unreachable ();
> +
> +      /* And update final_perm.  */
> +      for (unsigned i = 0; i < SLP_TREE_LANES (load); ++i)
> +	{
> +	  unsigned l = final_perm[i].second;
> +	  unsigned j;
> +	  for (j = 0; j < perm.length (); ++j)
> +	    if (perm[j].second == l)
> +	      {
> +		final_perm[i].second = j;
> +		break;
> +	      }
> +	  gcc_assert (j < perm.length ());
> +	}
> +
> +      slp_tree p = vect_create_new_slp_node (1, VEC_PERM_EXPR);
> +      SLP_TREE_CHILDREN (p).quick_push (l0);
> +      SLP_TREE_LANE_PERMUTATION (p) = perm;
> +      SLP_TREE_VECTYPE (p) = SLP_TREE_VECTYPE (load);
> +      SLP_TREE_LANES (p) = perm.length ();
> +      SLP_TREE_REPRESENTATIVE (p) = SLP_TREE_REPRESENTATIVE (load);
> +      /* ???  We should have scalar stmts for this and use bst_map
> +	 to CSE.  But we do not want to pick up original SLP load
> +	 nodes with a load-permutation here.  */
> +      /* ???  We need to iterate if group_lanes / 2 is still too large.  */
> +      /* ???  Ideally pick the best even/odd scheme usable for
> +	 most of the loads.  -> do a multi-step scheme?  */
> +
> +      /* And finally from the ordered reduction node create the
> +	 permute to shuffle the lanes into the original load-permutation
> +	 order.  We replace the original load node with this.  */
> +      SLP_TREE_CODE (load) = VEC_PERM_EXPR;
> +      SLP_TREE_LOAD_PERMUTATION (load).release ();
> +      SLP_TREE_LANE_PERMUTATION (load) = final_perm;
> +      SLP_TREE_CHILDREN (load).create (1);
> +      SLP_TREE_CHILDREN (load).quick_push (p);
> +    }
> +}
> +
> +/* Transform SLP loads in the SLP graph created by SLP discovery to
> +   group loads from the same group and lower load permutations that
> +   are unlikely to be supported into a series of permutes.
> +   In the degenerate case of having only single-lane SLP instances
> +   this should result in a series of permute nodes emulating an
> +   interleaving scheme.  */
> +
> +static void
> +vect_lower_load_permutations (loop_vec_info loop_vinfo,
> +			      scalar_stmts_to_slp_tree_map_t *bst_map)
> +{
> +  /* Gather and sort loads across all instances.  */
> +  hash_set<slp_tree> visited;
> +  auto_vec<slp_tree> loads;
> +  for (auto inst : loop_vinfo->slp_instances)
> +    vect_gather_slp_loads (loads, SLP_INSTANCE_TREE (inst), visited);
> +  if (loads.is_empty ())
> +    return;
> +  loads.qsort (vllp_cmp);
> +
> +  /* Now process each dataref group separately.  */
> +  unsigned firsti = 0;
> +  for (unsigned i = 1; i < loads.length (); ++i)
> +    {
> +      slp_tree first = loads[firsti];
> +      slp_tree next = loads[i];
> +      stmt_vec_info a0 = SLP_TREE_SCALAR_STMTS (first)[0];
> +      stmt_vec_info b0 = SLP_TREE_SCALAR_STMTS (next)[0];
> +      if (STMT_VINFO_GROUPED_ACCESS (a0)
> +	  && STMT_VINFO_GROUPED_ACCESS (b0)
> +	  && DR_GROUP_FIRST_ELEMENT (a0) == DR_GROUP_FIRST_ELEMENT (b0))
> +	continue;
> +      /* Just one SLP load of a possible group, leave those alone.  */
> +      if (i == firsti + 1)
> +	{
> +	  firsti = i;
> +	  continue;
> +	}
> +      /* Now we have multiple SLP loads of the same group from
> +	 firsti to i - 1.  */
> +      vect_lower_load_permutations (loop_vinfo, bst_map,
> +				    make_array_slice (&loads[firsti],
> +						      i - firsti));
> +      firsti = i;
> +    }
> +  if (firsti < loads.length () - 1)
> +    vect_lower_load_permutations (loop_vinfo, bst_map,
> +				  make_array_slice (&loads[firsti],
> +						    loads.length () - firsti));
> +}
> +
>  /* Check if there are stmts in the loop can be vectorized using SLP.  Build SLP
>     trees of packed scalar stmts if SLP is possible.  */
>  
> @@ -3982,6 +4255,12 @@ vect_analyze_slp (vec_info *vinfo, unsigned max_tree_size)
>  	}
>      }
>  
> +  /* When we end up with load permutations that we cannot possibly handle,
> +     like those requiring three vector inputs, lower them using interleaving
> +     like schemes.  */
> +  if (loop_vec_info loop_vinfo = dyn_cast <loop_vec_info> (vinfo))
> +    vect_lower_load_permutations (loop_vinfo, bst_map);
> +
>    hash_set<slp_tree> visited_patterns;
>    slp_tree_to_load_perm_map_t perm_cache;
>    slp_compat_nodes_map_t compat_cache;

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] [RFC] lower SLP load permutation to interleaving
  2024-06-04 19:58 ` [PATCH] [RFC] lower SLP load permutation to interleaving Richard Sandiford
@ 2024-06-05 11:54   ` Richard Biener
  2024-06-05 12:10     ` Richard Biener
  2024-06-05 12:29     ` Richard Sandiford
  0 siblings, 2 replies; 7+ messages in thread
From: Richard Biener @ 2024-06-05 11:54 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: gcc-patches, tamar.christina

On Tue, 4 Jun 2024, Richard Sandiford wrote:

> Richard Biener <rguenther@suse.de> writes:
> > The following emulates classical interleaving for SLP load permutes
> > that we are unlikely handling natively.  This is to handle cases
> > where interleaving (or load/store-lanes) is the optimal choice for
> > vectorizing even when we are doing that within SLP.  An example
> > would be
> >
> > void foo (int * __restrict a, int * b)
> > {
> >   for (int i = 0; i < 16; ++i)
> >     {
> >       a[4*i + 0] = b[4*i + 0] * 3;
> >       a[4*i + 1] = b[4*i + 1] + 3;
> >       a[4*i + 2] = (b[4*i + 2] * 3 + 3);
> >       a[4*i + 3] = b[4*i + 3] * 3;
> >     }
> > }
> >
> > where currently the SLP store is merging four single-lane SLP
> > sub-graphs but none of the loads in it can be code-generated
> > with V4SImode vectors and a VF of four as the permutes would need
> > three vectors.
> 
> Nice!
> 
> > The patch introduces a lowering phase after SLP discovery but
> > before SLP pattern recognition or permute optimization that
> > analyzes all loads from the same dataref group and creates an
> > interleaving scheme starting from an unpermuted load.
> >
> > What can be handled is quite restrictive, matching only a subset
> > of the non-SLP interleaving cases (the power-of-two group size
> > ones, in addition only cases without gaps).  The interleaving
> > vectorization in addition can handle size 3 and 5 - but I am not
> > sure if it's possible to do that in a VL agnostic way.  It
> > should be still possible to set up the SLP graph in a way that
> > a load-lane could be matched from SLP pattern recognition.
> 
> Yeah, I don't think it would be possible to decompose a 3- or
> 5-lane grouped load into a series of VLA 2-input permutes.
> But (as I think you're saying) it seems like a load-3-lanes would just
> be a load with a LANE_PERMUTATION of N, N+3, N+6, N+9, ... for lane N.
> Is that right?

Yes, that's how it looks without this patch.  I think we'd need
a load node loading N, N+1, N+2, ... and then permute nodes
with N, N+3, ... and N+1, N+4, ... and N+2, N+5 ... so we generate
one .LOAD_LANES from the load node and the permutes pick up the
correct vector defs?  I'm not sure yet how classification and
code generation would work for this.

The store side is already on trunk with the single SLP store node
getting lanes via permutes.

It might be we want a load/store node with N inputs/outputs as the
best representation and use lane_permutation to indicate the
input (for stores) and output (for loads) "permute".

> > As said gaps are currently not handled - for SLP we have a
> > representational issue that SLP_TREE_SCALAR_STMTS for "gap lanes"
> > would need to be filled in some way (even if we just push NULL).
> >
> > The patch misses multi-level even/odd handling as well as CSEing
> > intermediate generated permutes.  Both is quite straight-forward
> > to add, but eventually there's a better or more general strategy
> > for lowering?  The main goal of the patch is to avoid falling
> > back to non-SLP for cases the interleaving code handles.
> 
> Does the multi-level thing including examples like:
> 
> int a[2 * 16];
> int b[8 * 16];
> void f()
> {
>   for (int i = 0; i < 16; ++i)
>     {
>       a[i * 2 + 0] += b[i * 8 + 0] + b[i * 8 + 1] + b[i * 8 + 2] + b[i * 8 + 3];
>       a[i * 2 + 1] += b[i * 8 + 4] + b[i * 8 + 5] + b[i * 8 + 6] + b[i * 8 + 7];
>     }
> }
> 
> ?  For that we generate:
> 
>   _45 = VEC_PERM_EXPR <vect__4.9_63, vect__4.10_61, { 1, 3, 5, 7 }>;
>   _44 = VEC_PERM_EXPR <vect__4.11_59, vect__4.12_57, { 1, 3, 5, 7 }>;
>   _43 = VEC_PERM_EXPR <_45, _44, { 1, 3, 5, 7 }>;
>   _49 = VEC_PERM_EXPR <vect__4.9_63, vect__4.10_61, { 0, 2, 4, 6 }>;
>   _48 = VEC_PERM_EXPR <vect__4.11_59, vect__4.12_57, { 0, 2, 4, 6 }>;
>   _47 = VEC_PERM_EXPR <_49, _48, { 1, 3, 5, 7 }>;
>   _53 = VEC_PERM_EXPR <vect__4.9_63, vect__4.10_61, { 0, 1, 4, 5 }>;
>   _52 = VEC_PERM_EXPR <vect__4.11_59, vect__4.12_57, { 0, 1, 4, 5 }>;
>   _51 = VEC_PERM_EXPR <_53, _52, { 1, 3, 5, 7 }>;
>   _54 = VEC_PERM_EXPR <_49, _48, { 0, 2, 4, 6 }>;
> 
> (two even level 1, one even level 2, one odd level 1), whereas
> preferring 2xeven + 2xodd would avoid the third set of first-level
> permutes:
> 
>   _45 = VEC_PERM_EXPR <vect__4.9_63, vect__4.10_61, { 1, 3, 5, 7 }>;
>   _44 = VEC_PERM_EXPR <vect__4.11_59, vect__4.12_57, { 1, 3, 5, 7 }>;
>   _43 = VEC_PERM_EXPR <_45, _44, { 1, 3, 5, 7 }>;
>   _49 = VEC_PERM_EXPR <vect__4.9_63, vect__4.10_61, { 0, 2, 4, 6 }>;
>   _48 = VEC_PERM_EXPR <vect__4.11_59, vect__4.12_57, { 0, 2, 4, 6 }>;
>   _47 = VEC_PERM_EXPR <_49, _48, { 1, 3, 5, 7 }>;
>   _51 = VEC_PERM_EXPR <_45, _44, { 0, 2, 4, 6 }>;
>   _54 = VEC_PERM_EXPR <_49, _48, { 0, 2, 4, 6 }>;

The multi-level issue is more when a single reduction to N/2
inputs still doesn't get you to the point where you can do
the permute with two inputs.  I think the above is more because
each load is handled individually, not taking into account
redundancies across loads when you have freedom in the
even/odd, level combinations (which you usually have).

I suppose it should be possible to handle this to some extent,
not sure what the best strategy is when trying to avoid brute-force
searching for an optimal set (esp. when multi-level interleaving
will be involved).

> > Comments and suggestions welcome, esp. what representation
> > you'd think is suitable for SLP pattern matching to
> > load/store-lane and how to represent that?  Maybe this lowering
> > should happen directly in vect_lower_load_permutations?
> 
> If the load-lanes representation is as simple as above, it sounds like
> it could be deferred to pattern matching.  Not sure what the result
> would look like though.  It would be nice if (at least for costing
> purposes) we could have a single node for all lanes of the load-lanes,
> rather than create a separate node for each lane and rely on later CSE.
> (Or do we already have a good representation for this?  It's been too
> long, sorry.)

Yeah, as said above having a load-lane node with multiple outputs
would be the best match, similar for store-lane.  It's probably
easiest to generate those right from this lowering until we
re-implement SLP discovery from scratch.

> Bit of trivia below:
> 
> > Thanks,
> > Richard.
> >
> > 	* tree-vect-slp.cc (vllp_cmp): New function.
> > 	(vect_lower_load_permutations): Likewise.
> > 	(vect_analyze_slp): Call it.
> > ---
> >  gcc/tree-vect-slp.cc | 279 +++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 279 insertions(+)
> >
> > diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
> > index 7e3d0107b4e..766b773452f 100644
> > --- a/gcc/tree-vect-slp.cc
> > +++ b/gcc/tree-vect-slp.cc
> > @@ -3839,6 +3839,279 @@ vect_analyze_slp_instance (vec_info *vinfo,
> >    return res;
> >  }
> >  
> > +/* qsort comparator ordering SLP load nodes.  */
> > +
> > +static int
> > +vllp_cmp (const void *a_, const void *b_)
> > +{
> > +  const slp_tree a = *(const slp_tree *)a_;
> > +  const slp_tree b = *(const slp_tree *)b_;
> > +  stmt_vec_info a0 = SLP_TREE_SCALAR_STMTS (a)[0];
> > +  stmt_vec_info b0 = SLP_TREE_SCALAR_STMTS (b)[0];
> > +  if (STMT_VINFO_GROUPED_ACCESS (a0)
> > +      && STMT_VINFO_GROUPED_ACCESS (b0)
> > +      && DR_GROUP_FIRST_ELEMENT (a0) == DR_GROUP_FIRST_ELEMENT (b0))
> > +    {
> > +      /* Same group, order after lanes used.  */
> > +      if (SLP_TREE_LANES (a) < SLP_TREE_LANES (b))
> > +	return 1;
> > +      else if (SLP_TREE_LANES (a) > SLP_TREE_LANES (b))
> > +	return -1;
> > +      else
> > +	{
> > +	  /* Try to order loads using the same lanes together, breaking
> > +	     the tie with the lane number that first differs.  */
> > +	  if (!SLP_TREE_LOAD_PERMUTATION (a).exists ()
> > +	      && !SLP_TREE_LOAD_PERMUTATION (b).exists ())
> > +	    return 0;
> 
> Does the comparison need to be "stable", with a further tie-breaker
> when a != b?  Or does our qsort not rely on that?

It doesn't, there's stable_sort if we'd rely on a specific order
on the consumer side.

> > +	  else if (SLP_TREE_LOAD_PERMUTATION (a).exists ()
> > +		   && !SLP_TREE_LOAD_PERMUTATION (b).exists ())
> > +	    return 1;
> > +	  else if (!SLP_TREE_LOAD_PERMUTATION (a).exists ()
> > +		   && SLP_TREE_LOAD_PERMUTATION (b).exists ())
> > +	    return -1;
> > +	  else
> > +	    {
> > +	      for (unsigned i = 0; i < SLP_TREE_LANES (a); ++i)
> > +		if (SLP_TREE_LOAD_PERMUTATION (a)[i]
> > +		    != SLP_TREE_LOAD_PERMUTATION (b)[i])
> > +		  {
> > +		    /* In-order lane first, that's what the above case for
> > +		       no permutation does.  */
> > +		    if (SLP_TREE_LOAD_PERMUTATION (a)[i] == i)
> > +		      return -1;
> > +		    else if (SLP_TREE_LOAD_PERMUTATION (b)[i] == i)
> > +		      return 1;
> > +		    else if (SLP_TREE_LOAD_PERMUTATION (a)[i]
> > +			     < SLP_TREE_LOAD_PERMUTATION (b)[i])
> > +		      return -1;
> > +		    else
> > +		      return 1;
> > +		  }
> > +	      return 0;
> > +	    }
> > +	}
> > +    }
> > +  else /* Different groups or non-groups.  */
> > +    {
> > +      /* Order groups as their first element to keep them together.  */
> > +      if (STMT_VINFO_GROUPED_ACCESS (a0))
> > +	a0 = DR_GROUP_FIRST_ELEMENT (a0);
> > +      if (STMT_VINFO_GROUPED_ACCESS (b0))
> > +	b0 = DR_GROUP_FIRST_ELEMENT (b0);
> > +      if (a0 == b0)
> > +	return 0;
> > +      /* Tie using UID.  */
> > +      else if (gimple_uid (STMT_VINFO_STMT (a0))
> > +	       < gimple_uid (STMT_VINFO_STMT (b0)))
> > +	return -1;
> > +      else
> > +	{
> > +	  gcc_assert (gimple_uid (STMT_VINFO_STMT (a0))
> > +		      != gimple_uid (STMT_VINFO_STMT (b0)));
> > +	  return 1;
> > +	}
> > +    }
> > +}
> > +
> > +/* Process the set of LOADS that are all from the same dataref group.  */
> > +
> > +static void
> > +vect_lower_load_permutations (loop_vec_info loop_vinfo,
> > +			      scalar_stmts_to_slp_tree_map_t *bst_map,
> > +			      const array_slice<slp_tree> &loads)
> > +{
> > +  /* We at this point want to lower without a fixed VF or vector
> > +     size in mind which means we cannot actually compute whether we
> > +     need three or more vectors for a load permutation yet.  So always
> > +     lower.  */
> > +  stmt_vec_info first
> > +    = DR_GROUP_FIRST_ELEMENT (SLP_TREE_SCALAR_STMTS (loads[0])[0]);
> > +
> > +  /* ???  In principle we have to consider a gap up to the next full
> > +     vector, but we have to actually represent a scalar stmt for the
> > +     gaps value so delay handling this.  The same is true for
> > +     inbetween gaps which the load places in the load-permutation
> > +     represent.  It's probably not worth trying an intermediate packing
> > +     to vectors without gap even if that might handle some more cases.
> > +     Instead get the gap case correct in some way.  */
> > +  unsigned group_lanes = 0;
> > +  for (stmt_vec_info s = first; s; s = DR_GROUP_NEXT_ELEMENT (s))
> > +    {
> > +      if ((s == first && DR_GROUP_GAP (s) != 0)
> > +	  || (s != first && DR_GROUP_GAP (s) != 1))
> > +	return;
> > +      group_lanes++;
> > +    }
> > +  /* Only a power-of-two number of lanes matches interleaving.  */
> > +  if (exact_log2 (group_lanes) == -1)
> > +    return;
> > +
> > +  for (slp_tree load : loads)
> > +    {
> > +      /* Leave masked or gather loads alone for now.  */
> > +      if (!SLP_TREE_CHILDREN (load).is_empty ())
> > +	continue;
> > +
> > +      /* We need to lower only loads of less than half of the groups
> > +	 lanes, including duplicate lanes.  */
> > +      if (SLP_TREE_LANES (load) >= group_lanes / 2)
> > +	continue;
> > +
> > +      /* Lower by reducing the group to half its size using an
> > +	 interleaving scheme.  For this try to compute whether all
> > +	 elements needed for this loads are in even or odd elements of
> 
> this load (or these loads)
> 
> > +	 a even/odd decomposition with N consecutive elements.
> 
> an even/odd
> 
> > +	 Thus { e, e, o, o, e, e, o, o } woud be an even/odd decomposition
> > +	 with N == 2.  */
> > +      unsigned even = (1 << ceil_log2 (DR_GROUP_SIZE (first))) - 1;
> 
> Is this DR_GROUP_SIZE (first) conceptually different from group_lanes
> above?  If not, I think it'd be a bit easier to follow if this line reused
> the exact_log2 result above.

Once we look at groups with gaps it's different - the load permutation
lane indices have gaps represented, so a a[0], a[2], a[3] group
would have a load permutation of { 0, 2, 3 }.  group_lanes is the
number of lanes in the output of the load which has unused/gap lanes
stripped.

I've short-cut handling of groups with intermediate gaps and also with
gaps at the end for simplicity as I have to decide what to put into
SLP_TREE_SCALAR_STMTS for the unpermuted SLP load node which would
have those gaps "represented" (I'm quite sure a NULL ICEs left and
right, duplicating the previous lane sounds appealing even though
it's wrong ...).  As said, this is more a proof-of-concept ;)

For the next iteration I'm going to add some test coverage, esp.
also for the multi-level case and will see to handle gaps.

> > +      unsigned odd = even;
> > +      for (unsigned l : SLP_TREE_LOAD_PERMUTATION (load))
> > +	{
> > +	  even &= ~l;
> > +	  odd &= l;
> > +	}
> > +      /* Give up when this doesn't match up with an interleaving scheme.  */
> > +      if (!even && !odd)
> > +	continue;
> > +
> > +      /* First build (and possibly re-use) a load node for the
> > +	 unpermuted group.  */
> > +      vec<stmt_vec_info> stmts;
> > +      stmts.create (group_lanes);
> > +      for (stmt_vec_info s = first; s; s = DR_GROUP_NEXT_ELEMENT (s))
> > +	stmts.quick_push (s);
> > +      poly_uint64 max_nunits;
> > +      bool *matches = XALLOCAVEC (bool, group_lanes);
> > +      unsigned limit = 1;
> > +      unsigned tree_size = 0;
> > +      slp_tree l0 = vect_build_slp_tree (loop_vinfo, stmts,
> > +					 group_lanes,
> > +					 &max_nunits, matches, &limit,
> > +					 &tree_size, bst_map);
> > +
> > +      /* Build the permute to get the original load permutation order.  */
> > +      lane_permutation_t final_perm;
> > +      final_perm.create (SLP_TREE_LANES (load));
> > +      for (unsigned i = 0; i < SLP_TREE_LANES (load); ++i)
> > +	final_perm.quick_push
> > +	  (std::make_pair (0, SLP_TREE_LOAD_PERMUTATION (load)[i]));
> > +
> > +      /* Now build a even or odd extraction from the unpermuted load.  */
> 
> an even

Thanks,
Richard.

> Thanks,
> Richard
> 
> > +      lane_permutation_t perm;
> > +      perm.create (group_lanes / 2);
> > +      unsigned level;
> > +      if (even
> > +	  && ((level = 1 << ctz_hwi (even)), true)
> > +	  && group_lanes % (2 * level) == 0)
> > +	{
> > +	  /* { 0, 1, ... 4, 5 ..., } */
> > +	  unsigned level = 1 << ctz_hwi (even);
> > +	  for (unsigned i = 0; i < group_lanes / 2 / level; ++i)
> > +	    for (unsigned j = 0; j < level; ++j)
> > +	      perm.quick_push (std::make_pair (0, 2 * i * level + j));
> > +	}
> > +      else if (odd)
> > +	{
> > +	  /* { ..., 2, 3, ... 6, 7 } */
> > +	  unsigned level = 1 << ctz_hwi (odd);
> > +	  gcc_assert (group_lanes % (2 * level) == 0);
> > +	  for (unsigned i = 0; i < group_lanes / 2 / level; ++i)
> > +	    for (unsigned j = 0; j < level; ++j)
> > +	      perm.quick_push (std::make_pair (0, (2 * i + 1) * level + j));
> > +	}
> > +      else
> > +	gcc_unreachable ();
> > +
> > +      /* And update final_perm.  */
> > +      for (unsigned i = 0; i < SLP_TREE_LANES (load); ++i)
> > +	{
> > +	  unsigned l = final_perm[i].second;
> > +	  unsigned j;
> > +	  for (j = 0; j < perm.length (); ++j)
> > +	    if (perm[j].second == l)
> > +	      {
> > +		final_perm[i].second = j;
> > +		break;
> > +	      }
> > +	  gcc_assert (j < perm.length ());
> > +	}
> > +
> > +      slp_tree p = vect_create_new_slp_node (1, VEC_PERM_EXPR);
> > +      SLP_TREE_CHILDREN (p).quick_push (l0);
> > +      SLP_TREE_LANE_PERMUTATION (p) = perm;
> > +      SLP_TREE_VECTYPE (p) = SLP_TREE_VECTYPE (load);
> > +      SLP_TREE_LANES (p) = perm.length ();
> > +      SLP_TREE_REPRESENTATIVE (p) = SLP_TREE_REPRESENTATIVE (load);
> > +      /* ???  We should have scalar stmts for this and use bst_map
> > +	 to CSE.  But we do not want to pick up original SLP load
> > +	 nodes with a load-permutation here.  */
> > +      /* ???  We need to iterate if group_lanes / 2 is still too large.  */
> > +      /* ???  Ideally pick the best even/odd scheme usable for
> > +	 most of the loads.  -> do a multi-step scheme?  */
> > +
> > +      /* And finally from the ordered reduction node create the
> > +	 permute to shuffle the lanes into the original load-permutation
> > +	 order.  We replace the original load node with this.  */
> > +      SLP_TREE_CODE (load) = VEC_PERM_EXPR;
> > +      SLP_TREE_LOAD_PERMUTATION (load).release ();
> > +      SLP_TREE_LANE_PERMUTATION (load) = final_perm;
> > +      SLP_TREE_CHILDREN (load).create (1);
> > +      SLP_TREE_CHILDREN (load).quick_push (p);
> > +    }
> > +}
> > +
> > +/* Transform SLP loads in the SLP graph created by SLP discovery to
> > +   group loads from the same group and lower load permutations that
> > +   are unlikely to be supported into a series of permutes.
> > +   In the degenerate case of having only single-lane SLP instances
> > +   this should result in a series of permute nodes emulating an
> > +   interleaving scheme.  */
> > +
> > +static void
> > +vect_lower_load_permutations (loop_vec_info loop_vinfo,
> > +			      scalar_stmts_to_slp_tree_map_t *bst_map)
> > +{
> > +  /* Gather and sort loads across all instances.  */
> > +  hash_set<slp_tree> visited;
> > +  auto_vec<slp_tree> loads;
> > +  for (auto inst : loop_vinfo->slp_instances)
> > +    vect_gather_slp_loads (loads, SLP_INSTANCE_TREE (inst), visited);
> > +  if (loads.is_empty ())
> > +    return;
> > +  loads.qsort (vllp_cmp);
> > +
> > +  /* Now process each dataref group separately.  */
> > +  unsigned firsti = 0;
> > +  for (unsigned i = 1; i < loads.length (); ++i)
> > +    {
> > +      slp_tree first = loads[firsti];
> > +      slp_tree next = loads[i];
> > +      stmt_vec_info a0 = SLP_TREE_SCALAR_STMTS (first)[0];
> > +      stmt_vec_info b0 = SLP_TREE_SCALAR_STMTS (next)[0];
> > +      if (STMT_VINFO_GROUPED_ACCESS (a0)
> > +	  && STMT_VINFO_GROUPED_ACCESS (b0)
> > +	  && DR_GROUP_FIRST_ELEMENT (a0) == DR_GROUP_FIRST_ELEMENT (b0))
> > +	continue;
> > +      /* Just one SLP load of a possible group, leave those alone.  */
> > +      if (i == firsti + 1)
> > +	{
> > +	  firsti = i;
> > +	  continue;
> > +	}
> > +      /* Now we have multiple SLP loads of the same group from
> > +	 firsti to i - 1.  */
> > +      vect_lower_load_permutations (loop_vinfo, bst_map,
> > +				    make_array_slice (&loads[firsti],
> > +						      i - firsti));
> > +      firsti = i;
> > +    }
> > +  if (firsti < loads.length () - 1)
> > +    vect_lower_load_permutations (loop_vinfo, bst_map,
> > +				  make_array_slice (&loads[firsti],
> > +						    loads.length () - firsti));
> > +}
> > +
> >  /* Check if there are stmts in the loop can be vectorized using SLP.  Build SLP
> >     trees of packed scalar stmts if SLP is possible.  */
> >  
> > @@ -3982,6 +4255,12 @@ vect_analyze_slp (vec_info *vinfo, unsigned max_tree_size)
> >  	}
> >      }
> >  
> > +  /* When we end up with load permutations that we cannot possibly handle,
> > +     like those requiring three vector inputs, lower them using interleaving
> > +     like schemes.  */
> > +  if (loop_vec_info loop_vinfo = dyn_cast <loop_vec_info> (vinfo))
> > +    vect_lower_load_permutations (loop_vinfo, bst_map);
> > +
> >    hash_set<slp_tree> visited_patterns;
> >    slp_tree_to_load_perm_map_t perm_cache;
> >    slp_compat_nodes_map_t compat_cache;
> 

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] [RFC] lower SLP load permutation to interleaving
  2024-06-05 11:54   ` Richard Biener
@ 2024-06-05 12:10     ` Richard Biener
  2024-06-05 12:29     ` Richard Sandiford
  1 sibling, 0 replies; 7+ messages in thread
From: Richard Biener @ 2024-06-05 12:10 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: gcc-patches, tamar.christina

On Wed, 5 Jun 2024, Richard Biener wrote:

> On Tue, 4 Jun 2024, Richard Sandiford wrote:
> 
> > Richard Biener <rguenther@suse.de> writes:
> > > The following emulates classical interleaving for SLP load permutes
> > > that we are unlikely handling natively.  This is to handle cases
> > > where interleaving (or load/store-lanes) is the optimal choice for
> > > vectorizing even when we are doing that within SLP.  An example
> > > would be
> > >
> > > void foo (int * __restrict a, int * b)
> > > {
> > >   for (int i = 0; i < 16; ++i)
> > >     {
> > >       a[4*i + 0] = b[4*i + 0] * 3;
> > >       a[4*i + 1] = b[4*i + 1] + 3;
> > >       a[4*i + 2] = (b[4*i + 2] * 3 + 3);
> > >       a[4*i + 3] = b[4*i + 3] * 3;
> > >     }
> > > }
> > >
> > > where currently the SLP store is merging four single-lane SLP
> > > sub-graphs but none of the loads in it can be code-generated
> > > with V4SImode vectors and a VF of four as the permutes would need
> > > three vectors.
> > 
> > Nice!
> > 
> > > The patch introduces a lowering phase after SLP discovery but
> > > before SLP pattern recognition or permute optimization that
> > > analyzes all loads from the same dataref group and creates an
> > > interleaving scheme starting from an unpermuted load.
> > >
> > > What can be handled is quite restrictive, matching only a subset
> > > of the non-SLP interleaving cases (the power-of-two group size
> > > ones, in addition only cases without gaps).  The interleaving
> > > vectorization in addition can handle size 3 and 5 - but I am not
> > > sure if it's possible to do that in a VL agnostic way.  It
> > > should be still possible to set up the SLP graph in a way that
> > > a load-lane could be matched from SLP pattern recognition.
> > 
> > Yeah, I don't think it would be possible to decompose a 3- or
> > 5-lane grouped load into a series of VLA 2-input permutes.
> > But (as I think you're saying) it seems like a load-3-lanes would just
> > be a load with a LANE_PERMUTATION of N, N+3, N+6, N+9, ... for lane N.
> > Is that right?
> 
> Yes, that's how it looks without this patch.  I think we'd need
> a load node loading N, N+1, N+2, ... and then permute nodes
> with N, N+3, ... and N+1, N+4, ... and N+2, N+5 ... so we generate
> one .LOAD_LANES from the load node and the permutes pick up the
> correct vector defs?  I'm not sure yet how classification and
> code generation would work for this.
> 
> The store side is already on trunk with the single SLP store node
> getting lanes via permutes.
> 
> It might be we want a load/store node with N inputs/outputs as the
> best representation and use lane_permutation to indicate the
> input (for stores) and output (for loads) "permute".
> 
> > > As said gaps are currently not handled - for SLP we have a
> > > representational issue that SLP_TREE_SCALAR_STMTS for "gap lanes"
> > > would need to be filled in some way (even if we just push NULL).
> > >
> > > The patch misses multi-level even/odd handling as well as CSEing
> > > intermediate generated permutes.  Both is quite straight-forward
> > > to add, but eventually there's a better or more general strategy
> > > for lowering?  The main goal of the patch is to avoid falling
> > > back to non-SLP for cases the interleaving code handles.
> > 
> > Does the multi-level thing including examples like:
> > 
> > int a[2 * 16];
> > int b[8 * 16];
> > void f()
> > {
> >   for (int i = 0; i < 16; ++i)
> >     {
> >       a[i * 2 + 0] += b[i * 8 + 0] + b[i * 8 + 1] + b[i * 8 + 2] + b[i * 8 + 3];
> >       a[i * 2 + 1] += b[i * 8 + 4] + b[i * 8 + 5] + b[i * 8 + 6] + b[i * 8 + 7];
> >     }
> > }
> > 
> > ?  For that we generate:
> > 
> >   _45 = VEC_PERM_EXPR <vect__4.9_63, vect__4.10_61, { 1, 3, 5, 7 }>;
> >   _44 = VEC_PERM_EXPR <vect__4.11_59, vect__4.12_57, { 1, 3, 5, 7 }>;
> >   _43 = VEC_PERM_EXPR <_45, _44, { 1, 3, 5, 7 }>;
> >   _49 = VEC_PERM_EXPR <vect__4.9_63, vect__4.10_61, { 0, 2, 4, 6 }>;
> >   _48 = VEC_PERM_EXPR <vect__4.11_59, vect__4.12_57, { 0, 2, 4, 6 }>;
> >   _47 = VEC_PERM_EXPR <_49, _48, { 1, 3, 5, 7 }>;
> >   _53 = VEC_PERM_EXPR <vect__4.9_63, vect__4.10_61, { 0, 1, 4, 5 }>;
> >   _52 = VEC_PERM_EXPR <vect__4.11_59, vect__4.12_57, { 0, 1, 4, 5 }>;
> >   _51 = VEC_PERM_EXPR <_53, _52, { 1, 3, 5, 7 }>;
> >   _54 = VEC_PERM_EXPR <_49, _48, { 0, 2, 4, 6 }>;
> > 
> > (two even level 1, one even level 2, one odd level 1), whereas
> > preferring 2xeven + 2xodd would avoid the third set of first-level
> > permutes:
> > 
> >   _45 = VEC_PERM_EXPR <vect__4.9_63, vect__4.10_61, { 1, 3, 5, 7 }>;
> >   _44 = VEC_PERM_EXPR <vect__4.11_59, vect__4.12_57, { 1, 3, 5, 7 }>;
> >   _43 = VEC_PERM_EXPR <_45, _44, { 1, 3, 5, 7 }>;
> >   _49 = VEC_PERM_EXPR <vect__4.9_63, vect__4.10_61, { 0, 2, 4, 6 }>;
> >   _48 = VEC_PERM_EXPR <vect__4.11_59, vect__4.12_57, { 0, 2, 4, 6 }>;
> >   _47 = VEC_PERM_EXPR <_49, _48, { 1, 3, 5, 7 }>;
> >   _51 = VEC_PERM_EXPR <_45, _44, { 0, 2, 4, 6 }>;
> >   _54 = VEC_PERM_EXPR <_49, _48, { 0, 2, 4, 6 }>;

Btw, I just checked GCC 14 which uses interleaving for this
(SLP needs a three vector permute) and that gets us double the VF
and overall 28 permutes compared to 10 permutes for the above
(and 8 for your optimal variant).  So it should already be an
improvement at least.

Richard.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] [RFC] lower SLP load permutation to interleaving
  2024-06-05 11:54   ` Richard Biener
  2024-06-05 12:10     ` Richard Biener
@ 2024-06-05 12:29     ` Richard Sandiford
  1 sibling, 0 replies; 7+ messages in thread
From: Richard Sandiford @ 2024-06-05 12:29 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches, tamar.christina

Richard Biener <rguenther@suse.de> writes:
> On Tue, 4 Jun 2024, Richard Sandiford wrote:
>
>> Richard Biener <rguenther@suse.de> writes:
>> > The following emulates classical interleaving for SLP load permutes
>> > that we are unlikely handling natively.  This is to handle cases
>> > where interleaving (or load/store-lanes) is the optimal choice for
>> > vectorizing even when we are doing that within SLP.  An example
>> > would be
>> >
>> > void foo (int * __restrict a, int * b)
>> > {
>> >   for (int i = 0; i < 16; ++i)
>> >     {
>> >       a[4*i + 0] = b[4*i + 0] * 3;
>> >       a[4*i + 1] = b[4*i + 1] + 3;
>> >       a[4*i + 2] = (b[4*i + 2] * 3 + 3);
>> >       a[4*i + 3] = b[4*i + 3] * 3;
>> >     }
>> > }
>> >
>> > where currently the SLP store is merging four single-lane SLP
>> > sub-graphs but none of the loads in it can be code-generated
>> > with V4SImode vectors and a VF of four as the permutes would need
>> > three vectors.
>> 
>> Nice!
>> 
>> > The patch introduces a lowering phase after SLP discovery but
>> > before SLP pattern recognition or permute optimization that
>> > analyzes all loads from the same dataref group and creates an
>> > interleaving scheme starting from an unpermuted load.
>> >
>> > What can be handled is quite restrictive, matching only a subset
>> > of the non-SLP interleaving cases (the power-of-two group size
>> > ones, in addition only cases without gaps).  The interleaving
>> > vectorization in addition can handle size 3 and 5 - but I am not
>> > sure if it's possible to do that in a VL agnostic way.  It
>> > should be still possible to set up the SLP graph in a way that
>> > a load-lane could be matched from SLP pattern recognition.
>> 
>> Yeah, I don't think it would be possible to decompose a 3- or
>> 5-lane grouped load into a series of VLA 2-input permutes.
>> But (as I think you're saying) it seems like a load-3-lanes would just
>> be a load with a LANE_PERMUTATION of N, N+3, N+6, N+9, ... for lane N.
>> Is that right?
>
> Yes, that's how it looks without this patch.  I think we'd need
> a load node loading N, N+1, N+2, ... and then permute nodes
> with N, N+3, ... and N+1, N+4, ... and N+2, N+5 ... so we generate
> one .LOAD_LANES from the load node and the permutes pick up the
> correct vector defs?  I'm not sure yet how classification and
> code generation would work for this.
>
> The store side is already on trunk with the single SLP store node
> getting lanes via permutes.
>
> It might be we want a load/store node with N inputs/outputs as the
> best representation and use lane_permutation to indicate the
> input (for stores) and output (for loads) "permute".
>
>> > As said gaps are currently not handled - for SLP we have a
>> > representational issue that SLP_TREE_SCALAR_STMTS for "gap lanes"
>> > would need to be filled in some way (even if we just push NULL).
>> >
>> > The patch misses multi-level even/odd handling as well as CSEing
>> > intermediate generated permutes.  Both is quite straight-forward
>> > to add, but eventually there's a better or more general strategy
>> > for lowering?  The main goal of the patch is to avoid falling
>> > back to non-SLP for cases the interleaving code handles.
>> 
>> Does the multi-level thing including examples like:
>> 
>> int a[2 * 16];
>> int b[8 * 16];
>> void f()
>> {
>>   for (int i = 0; i < 16; ++i)
>>     {
>>       a[i * 2 + 0] += b[i * 8 + 0] + b[i * 8 + 1] + b[i * 8 + 2] + b[i * 8 + 3];
>>       a[i * 2 + 1] += b[i * 8 + 4] + b[i * 8 + 5] + b[i * 8 + 6] + b[i * 8 + 7];
>>     }
>> }
>> 
>> ?  For that we generate:
>> 
>>   _45 = VEC_PERM_EXPR <vect__4.9_63, vect__4.10_61, { 1, 3, 5, 7 }>;
>>   _44 = VEC_PERM_EXPR <vect__4.11_59, vect__4.12_57, { 1, 3, 5, 7 }>;
>>   _43 = VEC_PERM_EXPR <_45, _44, { 1, 3, 5, 7 }>;
>>   _49 = VEC_PERM_EXPR <vect__4.9_63, vect__4.10_61, { 0, 2, 4, 6 }>;
>>   _48 = VEC_PERM_EXPR <vect__4.11_59, vect__4.12_57, { 0, 2, 4, 6 }>;
>>   _47 = VEC_PERM_EXPR <_49, _48, { 1, 3, 5, 7 }>;
>>   _53 = VEC_PERM_EXPR <vect__4.9_63, vect__4.10_61, { 0, 1, 4, 5 }>;
>>   _52 = VEC_PERM_EXPR <vect__4.11_59, vect__4.12_57, { 0, 1, 4, 5 }>;
>>   _51 = VEC_PERM_EXPR <_53, _52, { 1, 3, 5, 7 }>;
>>   _54 = VEC_PERM_EXPR <_49, _48, { 0, 2, 4, 6 }>;
>> 
>> (two even level 1, one even level 2, one odd level 1), whereas
>> preferring 2xeven + 2xodd would avoid the third set of first-level
>> permutes:
>> 
>>   _45 = VEC_PERM_EXPR <vect__4.9_63, vect__4.10_61, { 1, 3, 5, 7 }>;
>>   _44 = VEC_PERM_EXPR <vect__4.11_59, vect__4.12_57, { 1, 3, 5, 7 }>;
>>   _43 = VEC_PERM_EXPR <_45, _44, { 1, 3, 5, 7 }>;
>>   _49 = VEC_PERM_EXPR <vect__4.9_63, vect__4.10_61, { 0, 2, 4, 6 }>;
>>   _48 = VEC_PERM_EXPR <vect__4.11_59, vect__4.12_57, { 0, 2, 4, 6 }>;
>>   _47 = VEC_PERM_EXPR <_49, _48, { 1, 3, 5, 7 }>;
>>   _51 = VEC_PERM_EXPR <_45, _44, { 0, 2, 4, 6 }>;
>>   _54 = VEC_PERM_EXPR <_49, _48, { 0, 2, 4, 6 }>;
>
> The multi-level issue is more when a single reduction to N/2
> inputs still doesn't get you to the point where you can do
> the permute with two inputs.  I think the above is more because
> each load is handled individually, not taking into account
> redundancies across loads when you have freedom in the
> even/odd, level combinations (which you usually have).
>
> I suppose it should be possible to handle this to some extent,
> not sure what the best strategy is when trying to avoid brute-force
> searching for an optimal set (esp. when multi-level interleaving
> will be involved).

Was wondering about picking the smaller of ctz_hwi (even) and
ctz_hwi (odd) (when both are nonzero).  I.e. something like:

      // HOST_BITS_PER_WIDE_INT for zero.
      auto even_bits = ctz_hwi (even);
      auto odd_bits = ctz_hwi (odd);
      if (even_bits <= odd_bits)
        {
	  unsigned level = 1 << even_bits;
	  gcc_assert (group_lanes % (2 * level) == 0);
          ...
        }
      else if (odd)
        {
	  unsigned level = 1 << odd_bits;
	  gcc_assert (group_lanes % (2 * level) == 0);
          ...
        }
      else
        gcc_unreachable ();

Picking the smaller level should guarantee that it fits.

That handles this case (and is how I generated the above),
but maybe it's worse for something else.

(And yeah, like you said in the follow-up, it's definitely better
than what we generated for GCC 14. :) )

>> > Comments and suggestions welcome, esp. what representation
>> > you'd think is suitable for SLP pattern matching to
>> > load/store-lane and how to represent that?  Maybe this lowering
>> > should happen directly in vect_lower_load_permutations?
>> 
>> If the load-lanes representation is as simple as above, it sounds like
>> it could be deferred to pattern matching.  Not sure what the result
>> would look like though.  It would be nice if (at least for costing
>> purposes) we could have a single node for all lanes of the load-lanes,
>> rather than create a separate node for each lane and rely on later CSE.
>> (Or do we already have a good representation for this?  It's been too
>> long, sorry.)
>
> Yeah, as said above having a load-lane node with multiple outputs
> would be the best match, similar for store-lane.  It's probably
> easiest to generate those right from this lowering until we
> re-implement SLP discovery from scratch.

OK.

>> Bit of trivia below:
>> 
>> > Thanks,
>> > Richard.
>> >
>> > 	* tree-vect-slp.cc (vllp_cmp): New function.
>> > 	(vect_lower_load_permutations): Likewise.
>> > 	(vect_analyze_slp): Call it.
>> > ---
>> >  gcc/tree-vect-slp.cc | 279 +++++++++++++++++++++++++++++++++++++++++++
>> >  1 file changed, 279 insertions(+)
>> >
>> > diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
>> > index 7e3d0107b4e..766b773452f 100644
>> > --- a/gcc/tree-vect-slp.cc
>> > +++ b/gcc/tree-vect-slp.cc
>> > @@ -3839,6 +3839,279 @@ vect_analyze_slp_instance (vec_info *vinfo,
>> >    return res;
>> >  }
>> >  
>> > +/* qsort comparator ordering SLP load nodes.  */
>> > +
>> > +static int
>> > +vllp_cmp (const void *a_, const void *b_)
>> > +{
>> > +  const slp_tree a = *(const slp_tree *)a_;
>> > +  const slp_tree b = *(const slp_tree *)b_;
>> > +  stmt_vec_info a0 = SLP_TREE_SCALAR_STMTS (a)[0];
>> > +  stmt_vec_info b0 = SLP_TREE_SCALAR_STMTS (b)[0];
>> > +  if (STMT_VINFO_GROUPED_ACCESS (a0)
>> > +      && STMT_VINFO_GROUPED_ACCESS (b0)
>> > +      && DR_GROUP_FIRST_ELEMENT (a0) == DR_GROUP_FIRST_ELEMENT (b0))
>> > +    {
>> > +      /* Same group, order after lanes used.  */
>> > +      if (SLP_TREE_LANES (a) < SLP_TREE_LANES (b))
>> > +	return 1;
>> > +      else if (SLP_TREE_LANES (a) > SLP_TREE_LANES (b))
>> > +	return -1;
>> > +      else
>> > +	{
>> > +	  /* Try to order loads using the same lanes together, breaking
>> > +	     the tie with the lane number that first differs.  */
>> > +	  if (!SLP_TREE_LOAD_PERMUTATION (a).exists ()
>> > +	      && !SLP_TREE_LOAD_PERMUTATION (b).exists ())
>> > +	    return 0;
>> 
>> Does the comparison need to be "stable", with a further tie-breaker
>> when a != b?  Or does our qsort not rely on that?
>
> It doesn't, there's stable_sort if we'd rely on a specific order
> on the consumer side.
>
>> > +	  else if (SLP_TREE_LOAD_PERMUTATION (a).exists ()
>> > +		   && !SLP_TREE_LOAD_PERMUTATION (b).exists ())
>> > +	    return 1;
>> > +	  else if (!SLP_TREE_LOAD_PERMUTATION (a).exists ()
>> > +		   && SLP_TREE_LOAD_PERMUTATION (b).exists ())
>> > +	    return -1;
>> > +	  else
>> > +	    {
>> > +	      for (unsigned i = 0; i < SLP_TREE_LANES (a); ++i)
>> > +		if (SLP_TREE_LOAD_PERMUTATION (a)[i]
>> > +		    != SLP_TREE_LOAD_PERMUTATION (b)[i])
>> > +		  {
>> > +		    /* In-order lane first, that's what the above case for
>> > +		       no permutation does.  */
>> > +		    if (SLP_TREE_LOAD_PERMUTATION (a)[i] == i)
>> > +		      return -1;
>> > +		    else if (SLP_TREE_LOAD_PERMUTATION (b)[i] == i)
>> > +		      return 1;
>> > +		    else if (SLP_TREE_LOAD_PERMUTATION (a)[i]
>> > +			     < SLP_TREE_LOAD_PERMUTATION (b)[i])
>> > +		      return -1;
>> > +		    else
>> > +		      return 1;
>> > +		  }
>> > +	      return 0;
>> > +	    }
>> > +	}
>> > +    }
>> > +  else /* Different groups or non-groups.  */
>> > +    {
>> > +      /* Order groups as their first element to keep them together.  */
>> > +      if (STMT_VINFO_GROUPED_ACCESS (a0))
>> > +	a0 = DR_GROUP_FIRST_ELEMENT (a0);
>> > +      if (STMT_VINFO_GROUPED_ACCESS (b0))
>> > +	b0 = DR_GROUP_FIRST_ELEMENT (b0);
>> > +      if (a0 == b0)
>> > +	return 0;
>> > +      /* Tie using UID.  */
>> > +      else if (gimple_uid (STMT_VINFO_STMT (a0))
>> > +	       < gimple_uid (STMT_VINFO_STMT (b0)))
>> > +	return -1;
>> > +      else
>> > +	{
>> > +	  gcc_assert (gimple_uid (STMT_VINFO_STMT (a0))
>> > +		      != gimple_uid (STMT_VINFO_STMT (b0)));
>> > +	  return 1;
>> > +	}
>> > +    }
>> > +}
>> > +
>> > +/* Process the set of LOADS that are all from the same dataref group.  */
>> > +
>> > +static void
>> > +vect_lower_load_permutations (loop_vec_info loop_vinfo,
>> > +			      scalar_stmts_to_slp_tree_map_t *bst_map,
>> > +			      const array_slice<slp_tree> &loads)
>> > +{
>> > +  /* We at this point want to lower without a fixed VF or vector
>> > +     size in mind which means we cannot actually compute whether we
>> > +     need three or more vectors for a load permutation yet.  So always
>> > +     lower.  */
>> > +  stmt_vec_info first
>> > +    = DR_GROUP_FIRST_ELEMENT (SLP_TREE_SCALAR_STMTS (loads[0])[0]);
>> > +
>> > +  /* ???  In principle we have to consider a gap up to the next full
>> > +     vector, but we have to actually represent a scalar stmt for the
>> > +     gaps value so delay handling this.  The same is true for
>> > +     inbetween gaps which the load places in the load-permutation
>> > +     represent.  It's probably not worth trying an intermediate packing
>> > +     to vectors without gap even if that might handle some more cases.
>> > +     Instead get the gap case correct in some way.  */
>> > +  unsigned group_lanes = 0;
>> > +  for (stmt_vec_info s = first; s; s = DR_GROUP_NEXT_ELEMENT (s))
>> > +    {
>> > +      if ((s == first && DR_GROUP_GAP (s) != 0)
>> > +	  || (s != first && DR_GROUP_GAP (s) != 1))
>> > +	return;
>> > +      group_lanes++;
>> > +    }
>> > +  /* Only a power-of-two number of lanes matches interleaving.  */
>> > +  if (exact_log2 (group_lanes) == -1)
>> > +    return;
>> > +
>> > +  for (slp_tree load : loads)
>> > +    {
>> > +      /* Leave masked or gather loads alone for now.  */
>> > +      if (!SLP_TREE_CHILDREN (load).is_empty ())
>> > +	continue;
>> > +
>> > +      /* We need to lower only loads of less than half of the groups
>> > +	 lanes, including duplicate lanes.  */
>> > +      if (SLP_TREE_LANES (load) >= group_lanes / 2)
>> > +	continue;
>> > +
>> > +      /* Lower by reducing the group to half its size using an
>> > +	 interleaving scheme.  For this try to compute whether all
>> > +	 elements needed for this loads are in even or odd elements of
>> 
>> this load (or these loads)
>> 
>> > +	 a even/odd decomposition with N consecutive elements.
>> 
>> an even/odd
>> 
>> > +	 Thus { e, e, o, o, e, e, o, o } woud be an even/odd decomposition
>> > +	 with N == 2.  */
>> > +      unsigned even = (1 << ceil_log2 (DR_GROUP_SIZE (first))) - 1;
>> 
>> Is this DR_GROUP_SIZE (first) conceptually different from group_lanes
>> above?  If not, I think it'd be a bit easier to follow if this line reused
>> the exact_log2 result above.
>
> Once we look at groups with gaps it's different - the load permutation
> lane indices have gaps represented, so a a[0], a[2], a[3] group
> would have a load permutation of { 0, 2, 3 }.  group_lanes is the
> number of lanes in the output of the load which has unused/gap lanes
> stripped.
>
> I've short-cut handling of groups with intermediate gaps and also with
> gaps at the end for simplicity as I have to decide what to put into
> SLP_TREE_SCALAR_STMTS for the unpermuted SLP load node which would
> have those gaps "represented" (I'm quite sure a NULL ICEs left and
> right, duplicating the previous lane sounds appealing even though
> it's wrong ...).  As said, this is more a proof-of-concept ;)
>
> For the next iteration I'm going to add some test coverage, esp.
> also for the multi-level case and will see to handle gaps.

OK, thanks, sounds good.

Richard

^ permalink raw reply	[flat|nested] 7+ messages in thread

* RE: [PATCH] [RFC] lower SLP load permutation to interleaving
       [not found] <c6f14930-59e5-4cc8-a38e-7f0452b67dba@DB5PEPF00014B89.eurprd02.prod.outlook.com>
  2024-06-04 19:58 ` [PATCH] [RFC] lower SLP load permutation to interleaving Richard Sandiford
@ 2024-06-05  9:00 ` Tamar Christina
  2024-06-05 11:59   ` Richard Biener
  1 sibling, 1 reply; 7+ messages in thread
From: Tamar Christina @ 2024-06-05  9:00 UTC (permalink / raw)
  To: Richard Biener, gcc-patches; +Cc: Richard Sandiford

> -----Original Message-----
> From: Richard Biener <rguenther@suse.de>
> Sent: Tuesday, June 4, 2024 3:33 PM
> To: gcc-patches@gcc.gnu.org
> Cc: Richard Sandiford <Richard.Sandiford@arm.com>; Tamar Christina
> <Tamar.Christina@arm.com>
> Subject: [PATCH] [RFC] lower SLP load permutation to interleaving
> 
> The following emulates classical interleaving for SLP load permutes
> that we are unlikely handling natively.  This is to handle cases
> where interleaving (or load/store-lanes) is the optimal choice for
> vectorizing even when we are doing that within SLP.  An example
> would be
> 
> void foo (int * __restrict a, int * b)
> {
>   for (int i = 0; i < 16; ++i)
>     {
>       a[4*i + 0] = b[4*i + 0] * 3;
>       a[4*i + 1] = b[4*i + 1] + 3;
>       a[4*i + 2] = (b[4*i + 2] * 3 + 3);
>       a[4*i + 3] = b[4*i + 3] * 3;
>     }
> }
> 
> where currently the SLP store is merging four single-lane SLP
> sub-graphs but none of the loads in it can be code-generated
> with V4SImode vectors and a VF of four as the permutes would need
> three vectors.
> 
> The patch introduces a lowering phase after SLP discovery but
> before SLP pattern recognition or permute optimization that
> analyzes all loads from the same dataref group and creates an
> interleaving scheme starting from an unpermuted load.
> 
> What can be handled is quite restrictive, matching only a subset
> of the non-SLP interleaving cases (the power-of-two group size
> ones, in addition only cases without gaps).  The interleaving
> vectorization in addition can handle size 3 and 5 - but I am not
> sure if it's possible to do that in a VL agnostic way.  It
> should be still possible to set up the SLP graph in a way that
> a load-lane could be matched from SLP pattern recognition.
> 
> As said gaps are currently not handled - for SLP we have a
> representational issue that SLP_TREE_SCALAR_STMTS for "gap lanes"
> would need to be filled in some way (even if we just push NULL).
> 
> The patch misses multi-level even/odd handling as well as CSEing
> intermediate generated permutes.  Both is quite straight-forward
> to add, but eventually there's a better or more general strategy
> for lowering?  The main goal of the patch is to avoid falling
> back to non-SLP for cases the interleaving code handles.

I guess not handling CSEing the intermediate permutes only really
matter for pattern matching? Those could be eliminated in optimize_slp?

> 
> Comments and suggestions welcome, esp. what representation
> you'd think is suitable for SLP pattern matching to
> load/store-lane and how to represent that?  Maybe this lowering
> should happen directly in vect_lower_load_permutations?

I like this representation personally, I'd say having the permute explicit,
at least until optimize_slp would make pattern matching easier.

We wouldn't need hacks such as optimize_load_redistribution.
In that sense, does it make sense to eventually just lower all permuted
loads?

Cheers,
Tamar

> 
> Thanks,
> Richard.
> 
> 	* tree-vect-slp.cc (vllp_cmp): New function.
> 	(vect_lower_load_permutations): Likewise.
> 	(vect_analyze_slp): Call it.
> ---
>  gcc/tree-vect-slp.cc | 279
> +++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 279 insertions(+)
> 
> diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
> index 7e3d0107b4e..766b773452f 100644
> --- a/gcc/tree-vect-slp.cc
> +++ b/gcc/tree-vect-slp.cc
> @@ -3839,6 +3839,279 @@ vect_analyze_slp_instance (vec_info *vinfo,
>    return res;
>  }
> 
> +/* qsort comparator ordering SLP load nodes.  */
> +
> +static int
> +vllp_cmp (const void *a_, const void *b_)
> +{
> +  const slp_tree a = *(const slp_tree *)a_;
> +  const slp_tree b = *(const slp_tree *)b_;
> +  stmt_vec_info a0 = SLP_TREE_SCALAR_STMTS (a)[0];
> +  stmt_vec_info b0 = SLP_TREE_SCALAR_STMTS (b)[0];
> +  if (STMT_VINFO_GROUPED_ACCESS (a0)
> +      && STMT_VINFO_GROUPED_ACCESS (b0)
> +      && DR_GROUP_FIRST_ELEMENT (a0) == DR_GROUP_FIRST_ELEMENT (b0))
> +    {
> +      /* Same group, order after lanes used.  */
> +      if (SLP_TREE_LANES (a) < SLP_TREE_LANES (b))
> +	return 1;
> +      else if (SLP_TREE_LANES (a) > SLP_TREE_LANES (b))
> +	return -1;
> +      else
> +	{
> +	  /* Try to order loads using the same lanes together, breaking
> +	     the tie with the lane number that first differs.  */
> +	  if (!SLP_TREE_LOAD_PERMUTATION (a).exists ()
> +	      && !SLP_TREE_LOAD_PERMUTATION (b).exists ())
> +	    return 0;
> +	  else if (SLP_TREE_LOAD_PERMUTATION (a).exists ()
> +		   && !SLP_TREE_LOAD_PERMUTATION (b).exists ())
> +	    return 1;
> +	  else if (!SLP_TREE_LOAD_PERMUTATION (a).exists ()
> +		   && SLP_TREE_LOAD_PERMUTATION (b).exists ())
> +	    return -1;
> +	  else
> +	    {
> +	      for (unsigned i = 0; i < SLP_TREE_LANES (a); ++i)
> +		if (SLP_TREE_LOAD_PERMUTATION (a)[i]
> +		    != SLP_TREE_LOAD_PERMUTATION (b)[i])
> +		  {
> +		    /* In-order lane first, that's what the above case for
> +		       no permutation does.  */
> +		    if (SLP_TREE_LOAD_PERMUTATION (a)[i] == i)
> +		      return -1;
> +		    else if (SLP_TREE_LOAD_PERMUTATION (b)[i] == i)
> +		      return 1;
> +		    else if (SLP_TREE_LOAD_PERMUTATION (a)[i]
> +			     < SLP_TREE_LOAD_PERMUTATION (b)[i])
> +		      return -1;
> +		    else
> +		      return 1;
> +		  }
> +	      return 0;
> +	    }
> +	}
> +    }
> +  else /* Different groups or non-groups.  */
> +    {
> +      /* Order groups as their first element to keep them together.  */
> +      if (STMT_VINFO_GROUPED_ACCESS (a0))
> +	a0 = DR_GROUP_FIRST_ELEMENT (a0);
> +      if (STMT_VINFO_GROUPED_ACCESS (b0))
> +	b0 = DR_GROUP_FIRST_ELEMENT (b0);
> +      if (a0 == b0)
> +	return 0;
> +      /* Tie using UID.  */
> +      else if (gimple_uid (STMT_VINFO_STMT (a0))
> +	       < gimple_uid (STMT_VINFO_STMT (b0)))
> +	return -1;
> +      else
> +	{
> +	  gcc_assert (gimple_uid (STMT_VINFO_STMT (a0))
> +		      != gimple_uid (STMT_VINFO_STMT (b0)));
> +	  return 1;
> +	}
> +    }
> +}
> +
> +/* Process the set of LOADS that are all from the same dataref group.  */
> +
> +static void
> +vect_lower_load_permutations (loop_vec_info loop_vinfo,
> +			      scalar_stmts_to_slp_tree_map_t *bst_map,
> +			      const array_slice<slp_tree> &loads)
> +{
> +  /* We at this point want to lower without a fixed VF or vector
> +     size in mind which means we cannot actually compute whether we
> +     need three or more vectors for a load permutation yet.  So always
> +     lower.  */
> +  stmt_vec_info first
> +    = DR_GROUP_FIRST_ELEMENT (SLP_TREE_SCALAR_STMTS (loads[0])[0]);
> +
> +  /* ???  In principle we have to consider a gap up to the next full
> +     vector, but we have to actually represent a scalar stmt for the
> +     gaps value so delay handling this.  The same is true for
> +     inbetween gaps which the load places in the load-permutation
> +     represent.  It's probably not worth trying an intermediate packing
> +     to vectors without gap even if that might handle some more cases.
> +     Instead get the gap case correct in some way.  */
> +  unsigned group_lanes = 0;
> +  for (stmt_vec_info s = first; s; s = DR_GROUP_NEXT_ELEMENT (s))
> +    {
> +      if ((s == first && DR_GROUP_GAP (s) != 0)
> +	  || (s != first && DR_GROUP_GAP (s) != 1))
> +	return;
> +      group_lanes++;
> +    }
> +  /* Only a power-of-two number of lanes matches interleaving.  */
> +  if (exact_log2 (group_lanes) == -1)
> +    return;
> +
> +  for (slp_tree load : loads)
> +    {
> +      /* Leave masked or gather loads alone for now.  */
> +      if (!SLP_TREE_CHILDREN (load).is_empty ())
> +	continue;
> +
> +      /* We need to lower only loads of less than half of the groups
> +	 lanes, including duplicate lanes.  */
> +      if (SLP_TREE_LANES (load) >= group_lanes / 2)
> +	continue;
> +
> +      /* Lower by reducing the group to half its size using an
> +	 interleaving scheme.  For this try to compute whether all
> +	 elements needed for this loads are in even or odd elements of
> +	 a even/odd decomposition with N consecutive elements.
> +	 Thus { e, e, o, o, e, e, o, o } woud be an even/odd decomposition
> +	 with N == 2.  */
> +      unsigned even = (1 << ceil_log2 (DR_GROUP_SIZE (first))) - 1;
> +      unsigned odd = even;
> +      for (unsigned l : SLP_TREE_LOAD_PERMUTATION (load))
> +	{
> +	  even &= ~l;
> +	  odd &= l;
> +	}
> +      /* Give up when this doesn't match up with an interleaving scheme.  */
> +      if (!even && !odd)
> +	continue;
> +
> +      /* First build (and possibly re-use) a load node for the
> +	 unpermuted group.  */
> +      vec<stmt_vec_info> stmts;
> +      stmts.create (group_lanes);
> +      for (stmt_vec_info s = first; s; s = DR_GROUP_NEXT_ELEMENT (s))
> +	stmts.quick_push (s);
> +      poly_uint64 max_nunits;
> +      bool *matches = XALLOCAVEC (bool, group_lanes);
> +      unsigned limit = 1;
> +      unsigned tree_size = 0;
> +      slp_tree l0 = vect_build_slp_tree (loop_vinfo, stmts,
> +					 group_lanes,
> +					 &max_nunits, matches, &limit,
> +					 &tree_size, bst_map);
> +
> +      /* Build the permute to get the original load permutation order.  */
> +      lane_permutation_t final_perm;
> +      final_perm.create (SLP_TREE_LANES (load));
> +      for (unsigned i = 0; i < SLP_TREE_LANES (load); ++i)
> +	final_perm.quick_push
> +	  (std::make_pair (0, SLP_TREE_LOAD_PERMUTATION (load)[i]));
> +
> +      /* Now build a even or odd extraction from the unpermuted load.  */
> +      lane_permutation_t perm;
> +      perm.create (group_lanes / 2);
> +      unsigned level;
> +      if (even
> +	  && ((level = 1 << ctz_hwi (even)), true)
> +	  && group_lanes % (2 * level) == 0)
> +	{
> +	  /* { 0, 1, ... 4, 5 ..., } */
> +	  unsigned level = 1 << ctz_hwi (even);
> +	  for (unsigned i = 0; i < group_lanes / 2 / level; ++i)
> +	    for (unsigned j = 0; j < level; ++j)
> +	      perm.quick_push (std::make_pair (0, 2 * i * level + j));
> +	}
> +      else if (odd)
> +	{
> +	  /* { ..., 2, 3, ... 6, 7 } */
> +	  unsigned level = 1 << ctz_hwi (odd);
> +	  gcc_assert (group_lanes % (2 * level) == 0);
> +	  for (unsigned i = 0; i < group_lanes / 2 / level; ++i)
> +	    for (unsigned j = 0; j < level; ++j)
> +	      perm.quick_push (std::make_pair (0, (2 * i + 1) * level + j));
> +	}
> +      else
> +	gcc_unreachable ();
> +
> +      /* And update final_perm.  */
> +      for (unsigned i = 0; i < SLP_TREE_LANES (load); ++i)
> +	{
> +	  unsigned l = final_perm[i].second;
> +	  unsigned j;
> +	  for (j = 0; j < perm.length (); ++j)
> +	    if (perm[j].second == l)
> +	      {
> +		final_perm[i].second = j;
> +		break;
> +	      }
> +	  gcc_assert (j < perm.length ());
> +	}
> +
> +      slp_tree p = vect_create_new_slp_node (1, VEC_PERM_EXPR);
> +      SLP_TREE_CHILDREN (p).quick_push (l0);
> +      SLP_TREE_LANE_PERMUTATION (p) = perm;
> +      SLP_TREE_VECTYPE (p) = SLP_TREE_VECTYPE (load);
> +      SLP_TREE_LANES (p) = perm.length ();
> +      SLP_TREE_REPRESENTATIVE (p) = SLP_TREE_REPRESENTATIVE (load);
> +      /* ???  We should have scalar stmts for this and use bst_map
> +	 to CSE.  But we do not want to pick up original SLP load
> +	 nodes with a load-permutation here.  */
> +      /* ???  We need to iterate if group_lanes / 2 is still too large.  */
> +      /* ???  Ideally pick the best even/odd scheme usable for
> +	 most of the loads.  -> do a multi-step scheme?  */
> +
> +      /* And finally from the ordered reduction node create the
> +	 permute to shuffle the lanes into the original load-permutation
> +	 order.  We replace the original load node with this.  */
> +      SLP_TREE_CODE (load) = VEC_PERM_EXPR;
> +      SLP_TREE_LOAD_PERMUTATION (load).release ();
> +      SLP_TREE_LANE_PERMUTATION (load) = final_perm;
> +      SLP_TREE_CHILDREN (load).create (1);
> +      SLP_TREE_CHILDREN (load).quick_push (p);
> +    }
> +}
> +
> +/* Transform SLP loads in the SLP graph created by SLP discovery to
> +   group loads from the same group and lower load permutations that
> +   are unlikely to be supported into a series of permutes.
> +   In the degenerate case of having only single-lane SLP instances
> +   this should result in a series of permute nodes emulating an
> +   interleaving scheme.  */
> +
> +static void
> +vect_lower_load_permutations (loop_vec_info loop_vinfo,
> +			      scalar_stmts_to_slp_tree_map_t *bst_map)
> +{
> +  /* Gather and sort loads across all instances.  */
> +  hash_set<slp_tree> visited;
> +  auto_vec<slp_tree> loads;
> +  for (auto inst : loop_vinfo->slp_instances)
> +    vect_gather_slp_loads (loads, SLP_INSTANCE_TREE (inst), visited);
> +  if (loads.is_empty ())
> +    return;
> +  loads.qsort (vllp_cmp);
> +
> +  /* Now process each dataref group separately.  */
> +  unsigned firsti = 0;
> +  for (unsigned i = 1; i < loads.length (); ++i)
> +    {
> +      slp_tree first = loads[firsti];
> +      slp_tree next = loads[i];
> +      stmt_vec_info a0 = SLP_TREE_SCALAR_STMTS (first)[0];
> +      stmt_vec_info b0 = SLP_TREE_SCALAR_STMTS (next)[0];
> +      if (STMT_VINFO_GROUPED_ACCESS (a0)
> +	  && STMT_VINFO_GROUPED_ACCESS (b0)
> +	  && DR_GROUP_FIRST_ELEMENT (a0) == DR_GROUP_FIRST_ELEMENT
> (b0))
> +	continue;
> +      /* Just one SLP load of a possible group, leave those alone.  */
> +      if (i == firsti + 1)
> +	{
> +	  firsti = i;
> +	  continue;
> +	}
> +      /* Now we have multiple SLP loads of the same group from
> +	 firsti to i - 1.  */
> +      vect_lower_load_permutations (loop_vinfo, bst_map,
> +				    make_array_slice (&loads[firsti],
> +						      i - firsti));
> +      firsti = i;
> +    }
> +  if (firsti < loads.length () - 1)
> +    vect_lower_load_permutations (loop_vinfo, bst_map,
> +				  make_array_slice (&loads[firsti],
> +						    loads.length () - firsti));
> +}
> +
>  /* Check if there are stmts in the loop can be vectorized using SLP.  Build SLP
>     trees of packed scalar stmts if SLP is possible.  */
> 
> @@ -3982,6 +4255,12 @@ vect_analyze_slp (vec_info *vinfo, unsigned
> max_tree_size)
>  	}
>      }
> 
> +  /* When we end up with load permutations that we cannot possibly handle,
> +     like those requiring three vector inputs, lower them using interleaving
> +     like schemes.  */
> +  if (loop_vec_info loop_vinfo = dyn_cast <loop_vec_info> (vinfo))
> +    vect_lower_load_permutations (loop_vinfo, bst_map);
> +
>    hash_set<slp_tree> visited_patterns;
>    slp_tree_to_load_perm_map_t perm_cache;
>    slp_compat_nodes_map_t compat_cache;
> --
> 2.35.3

^ permalink raw reply	[flat|nested] 7+ messages in thread

* RE: [PATCH] [RFC] lower SLP load permutation to interleaving
  2024-06-05  9:00 ` Tamar Christina
@ 2024-06-05 11:59   ` Richard Biener
  0 siblings, 0 replies; 7+ messages in thread
From: Richard Biener @ 2024-06-05 11:59 UTC (permalink / raw)
  To: Tamar Christina; +Cc: gcc-patches, Richard Sandiford

On Wed, 5 Jun 2024, Tamar Christina wrote:

> > -----Original Message-----
> > From: Richard Biener <rguenther@suse.de>
> > Sent: Tuesday, June 4, 2024 3:33 PM
> > To: gcc-patches@gcc.gnu.org
> > Cc: Richard Sandiford <Richard.Sandiford@arm.com>; Tamar Christina
> > <Tamar.Christina@arm.com>
> > Subject: [PATCH] [RFC] lower SLP load permutation to interleaving
> > 
> > The following emulates classical interleaving for SLP load permutes
> > that we are unlikely handling natively.  This is to handle cases
> > where interleaving (or load/store-lanes) is the optimal choice for
> > vectorizing even when we are doing that within SLP.  An example
> > would be
> > 
> > void foo (int * __restrict a, int * b)
> > {
> >   for (int i = 0; i < 16; ++i)
> >     {
> >       a[4*i + 0] = b[4*i + 0] * 3;
> >       a[4*i + 1] = b[4*i + 1] + 3;
> >       a[4*i + 2] = (b[4*i + 2] * 3 + 3);
> >       a[4*i + 3] = b[4*i + 3] * 3;
> >     }
> > }
> > 
> > where currently the SLP store is merging four single-lane SLP
> > sub-graphs but none of the loads in it can be code-generated
> > with V4SImode vectors and a VF of four as the permutes would need
> > three vectors.
> > 
> > The patch introduces a lowering phase after SLP discovery but
> > before SLP pattern recognition or permute optimization that
> > analyzes all loads from the same dataref group and creates an
> > interleaving scheme starting from an unpermuted load.
> > 
> > What can be handled is quite restrictive, matching only a subset
> > of the non-SLP interleaving cases (the power-of-two group size
> > ones, in addition only cases without gaps).  The interleaving
> > vectorization in addition can handle size 3 and 5 - but I am not
> > sure if it's possible to do that in a VL agnostic way.  It
> > should be still possible to set up the SLP graph in a way that
> > a load-lane could be matched from SLP pattern recognition.
> > 
> > As said gaps are currently not handled - for SLP we have a
> > representational issue that SLP_TREE_SCALAR_STMTS for "gap lanes"
> > would need to be filled in some way (even if we just push NULL).
> > 
> > The patch misses multi-level even/odd handling as well as CSEing
> > intermediate generated permutes.  Both is quite straight-forward
> > to add, but eventually there's a better or more general strategy
> > for lowering?  The main goal of the patch is to avoid falling
> > back to non-SLP for cases the interleaving code handles.
> 
> I guess not handling CSEing the intermediate permutes only really
> matter for pattern matching? Those could be eliminated in optimize_slp?
> 
> > 
> > Comments and suggestions welcome, esp. what representation
> > you'd think is suitable for SLP pattern matching to
> > load/store-lane and how to represent that?  Maybe this lowering
> > should happen directly in vect_lower_load_permutations?
> 
> I like this representation personally, I'd say having the permute explicit,
> at least until optimize_slp would make pattern matching easier.
> 
> We wouldn't need hacks such as optimize_load_redistribution.
> In that sense, does it make sense to eventually just lower all permuted
> loads?

Yes.  One of my ideas was that when we got rid of non-SLP, the
vectorizable_* routines themselves can apply transforms to their
SLP node, say turn a vector load with load-permutation into a
scalar load + splat.

Note that all the current patches try to preserve as much as possible
in (missed) capabilities of the current mix of SLP and non-SLP.  This
is why I didn't try changing SLP discovery for loads as I did in
earlier attempts to arrive at similar results but do this as followup
lowering.

The one thing that permuted loads (SLP loads with load_permutation)
are good for is to avoid needing to represent "unused" lanes.

Richard.


> 
> Cheers,
> Tamar
> 
> > 
> > Thanks,
> > Richard.
> > 
> > 	* tree-vect-slp.cc (vllp_cmp): New function.
> > 	(vect_lower_load_permutations): Likewise.
> > 	(vect_analyze_slp): Call it.
> > ---
> >  gcc/tree-vect-slp.cc | 279
> > +++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 279 insertions(+)
> > 
> > diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
> > index 7e3d0107b4e..766b773452f 100644
> > --- a/gcc/tree-vect-slp.cc
> > +++ b/gcc/tree-vect-slp.cc
> > @@ -3839,6 +3839,279 @@ vect_analyze_slp_instance (vec_info *vinfo,
> >    return res;
> >  }
> > 
> > +/* qsort comparator ordering SLP load nodes.  */
> > +
> > +static int
> > +vllp_cmp (const void *a_, const void *b_)
> > +{
> > +  const slp_tree a = *(const slp_tree *)a_;
> > +  const slp_tree b = *(const slp_tree *)b_;
> > +  stmt_vec_info a0 = SLP_TREE_SCALAR_STMTS (a)[0];
> > +  stmt_vec_info b0 = SLP_TREE_SCALAR_STMTS (b)[0];
> > +  if (STMT_VINFO_GROUPED_ACCESS (a0)
> > +      && STMT_VINFO_GROUPED_ACCESS (b0)
> > +      && DR_GROUP_FIRST_ELEMENT (a0) == DR_GROUP_FIRST_ELEMENT (b0))
> > +    {
> > +      /* Same group, order after lanes used.  */
> > +      if (SLP_TREE_LANES (a) < SLP_TREE_LANES (b))
> > +	return 1;
> > +      else if (SLP_TREE_LANES (a) > SLP_TREE_LANES (b))
> > +	return -1;
> > +      else
> > +	{
> > +	  /* Try to order loads using the same lanes together, breaking
> > +	     the tie with the lane number that first differs.  */
> > +	  if (!SLP_TREE_LOAD_PERMUTATION (a).exists ()
> > +	      && !SLP_TREE_LOAD_PERMUTATION (b).exists ())
> > +	    return 0;
> > +	  else if (SLP_TREE_LOAD_PERMUTATION (a).exists ()
> > +		   && !SLP_TREE_LOAD_PERMUTATION (b).exists ())
> > +	    return 1;
> > +	  else if (!SLP_TREE_LOAD_PERMUTATION (a).exists ()
> > +		   && SLP_TREE_LOAD_PERMUTATION (b).exists ())
> > +	    return -1;
> > +	  else
> > +	    {
> > +	      for (unsigned i = 0; i < SLP_TREE_LANES (a); ++i)
> > +		if (SLP_TREE_LOAD_PERMUTATION (a)[i]
> > +		    != SLP_TREE_LOAD_PERMUTATION (b)[i])
> > +		  {
> > +		    /* In-order lane first, that's what the above case for
> > +		       no permutation does.  */
> > +		    if (SLP_TREE_LOAD_PERMUTATION (a)[i] == i)
> > +		      return -1;
> > +		    else if (SLP_TREE_LOAD_PERMUTATION (b)[i] == i)
> > +		      return 1;
> > +		    else if (SLP_TREE_LOAD_PERMUTATION (a)[i]
> > +			     < SLP_TREE_LOAD_PERMUTATION (b)[i])
> > +		      return -1;
> > +		    else
> > +		      return 1;
> > +		  }
> > +	      return 0;
> > +	    }
> > +	}
> > +    }
> > +  else /* Different groups or non-groups.  */
> > +    {
> > +      /* Order groups as their first element to keep them together.  */
> > +      if (STMT_VINFO_GROUPED_ACCESS (a0))
> > +	a0 = DR_GROUP_FIRST_ELEMENT (a0);
> > +      if (STMT_VINFO_GROUPED_ACCESS (b0))
> > +	b0 = DR_GROUP_FIRST_ELEMENT (b0);
> > +      if (a0 == b0)
> > +	return 0;
> > +      /* Tie using UID.  */
> > +      else if (gimple_uid (STMT_VINFO_STMT (a0))
> > +	       < gimple_uid (STMT_VINFO_STMT (b0)))
> > +	return -1;
> > +      else
> > +	{
> > +	  gcc_assert (gimple_uid (STMT_VINFO_STMT (a0))
> > +		      != gimple_uid (STMT_VINFO_STMT (b0)));
> > +	  return 1;
> > +	}
> > +    }
> > +}
> > +
> > +/* Process the set of LOADS that are all from the same dataref group.  */
> > +
> > +static void
> > +vect_lower_load_permutations (loop_vec_info loop_vinfo,
> > +			      scalar_stmts_to_slp_tree_map_t *bst_map,
> > +			      const array_slice<slp_tree> &loads)
> > +{
> > +  /* We at this point want to lower without a fixed VF or vector
> > +     size in mind which means we cannot actually compute whether we
> > +     need three or more vectors for a load permutation yet.  So always
> > +     lower.  */
> > +  stmt_vec_info first
> > +    = DR_GROUP_FIRST_ELEMENT (SLP_TREE_SCALAR_STMTS (loads[0])[0]);
> > +
> > +  /* ???  In principle we have to consider a gap up to the next full
> > +     vector, but we have to actually represent a scalar stmt for the
> > +     gaps value so delay handling this.  The same is true for
> > +     inbetween gaps which the load places in the load-permutation
> > +     represent.  It's probably not worth trying an intermediate packing
> > +     to vectors without gap even if that might handle some more cases.
> > +     Instead get the gap case correct in some way.  */
> > +  unsigned group_lanes = 0;
> > +  for (stmt_vec_info s = first; s; s = DR_GROUP_NEXT_ELEMENT (s))
> > +    {
> > +      if ((s == first && DR_GROUP_GAP (s) != 0)
> > +	  || (s != first && DR_GROUP_GAP (s) != 1))
> > +	return;
> > +      group_lanes++;
> > +    }
> > +  /* Only a power-of-two number of lanes matches interleaving.  */
> > +  if (exact_log2 (group_lanes) == -1)
> > +    return;
> > +
> > +  for (slp_tree load : loads)
> > +    {
> > +      /* Leave masked or gather loads alone for now.  */
> > +      if (!SLP_TREE_CHILDREN (load).is_empty ())
> > +	continue;
> > +
> > +      /* We need to lower only loads of less than half of the groups
> > +	 lanes, including duplicate lanes.  */
> > +      if (SLP_TREE_LANES (load) >= group_lanes / 2)
> > +	continue;
> > +
> > +      /* Lower by reducing the group to half its size using an
> > +	 interleaving scheme.  For this try to compute whether all
> > +	 elements needed for this loads are in even or odd elements of
> > +	 a even/odd decomposition with N consecutive elements.
> > +	 Thus { e, e, o, o, e, e, o, o } woud be an even/odd decomposition
> > +	 with N == 2.  */
> > +      unsigned even = (1 << ceil_log2 (DR_GROUP_SIZE (first))) - 1;
> > +      unsigned odd = even;
> > +      for (unsigned l : SLP_TREE_LOAD_PERMUTATION (load))
> > +	{
> > +	  even &= ~l;
> > +	  odd &= l;
> > +	}
> > +      /* Give up when this doesn't match up with an interleaving scheme.  */
> > +      if (!even && !odd)
> > +	continue;
> > +
> > +      /* First build (and possibly re-use) a load node for the
> > +	 unpermuted group.  */
> > +      vec<stmt_vec_info> stmts;
> > +      stmts.create (group_lanes);
> > +      for (stmt_vec_info s = first; s; s = DR_GROUP_NEXT_ELEMENT (s))
> > +	stmts.quick_push (s);
> > +      poly_uint64 max_nunits;
> > +      bool *matches = XALLOCAVEC (bool, group_lanes);
> > +      unsigned limit = 1;
> > +      unsigned tree_size = 0;
> > +      slp_tree l0 = vect_build_slp_tree (loop_vinfo, stmts,
> > +					 group_lanes,
> > +					 &max_nunits, matches, &limit,
> > +					 &tree_size, bst_map);
> > +
> > +      /* Build the permute to get the original load permutation order.  */
> > +      lane_permutation_t final_perm;
> > +      final_perm.create (SLP_TREE_LANES (load));
> > +      for (unsigned i = 0; i < SLP_TREE_LANES (load); ++i)
> > +	final_perm.quick_push
> > +	  (std::make_pair (0, SLP_TREE_LOAD_PERMUTATION (load)[i]));
> > +
> > +      /* Now build a even or odd extraction from the unpermuted load.  */
> > +      lane_permutation_t perm;
> > +      perm.create (group_lanes / 2);
> > +      unsigned level;
> > +      if (even
> > +	  && ((level = 1 << ctz_hwi (even)), true)
> > +	  && group_lanes % (2 * level) == 0)
> > +	{
> > +	  /* { 0, 1, ... 4, 5 ..., } */
> > +	  unsigned level = 1 << ctz_hwi (even);
> > +	  for (unsigned i = 0; i < group_lanes / 2 / level; ++i)
> > +	    for (unsigned j = 0; j < level; ++j)
> > +	      perm.quick_push (std::make_pair (0, 2 * i * level + j));
> > +	}
> > +      else if (odd)
> > +	{
> > +	  /* { ..., 2, 3, ... 6, 7 } */
> > +	  unsigned level = 1 << ctz_hwi (odd);
> > +	  gcc_assert (group_lanes % (2 * level) == 0);
> > +	  for (unsigned i = 0; i < group_lanes / 2 / level; ++i)
> > +	    for (unsigned j = 0; j < level; ++j)
> > +	      perm.quick_push (std::make_pair (0, (2 * i + 1) * level + j));
> > +	}
> > +      else
> > +	gcc_unreachable ();
> > +
> > +      /* And update final_perm.  */
> > +      for (unsigned i = 0; i < SLP_TREE_LANES (load); ++i)
> > +	{
> > +	  unsigned l = final_perm[i].second;
> > +	  unsigned j;
> > +	  for (j = 0; j < perm.length (); ++j)
> > +	    if (perm[j].second == l)
> > +	      {
> > +		final_perm[i].second = j;
> > +		break;
> > +	      }
> > +	  gcc_assert (j < perm.length ());
> > +	}
> > +
> > +      slp_tree p = vect_create_new_slp_node (1, VEC_PERM_EXPR);
> > +      SLP_TREE_CHILDREN (p).quick_push (l0);
> > +      SLP_TREE_LANE_PERMUTATION (p) = perm;
> > +      SLP_TREE_VECTYPE (p) = SLP_TREE_VECTYPE (load);
> > +      SLP_TREE_LANES (p) = perm.length ();
> > +      SLP_TREE_REPRESENTATIVE (p) = SLP_TREE_REPRESENTATIVE (load);
> > +      /* ???  We should have scalar stmts for this and use bst_map
> > +	 to CSE.  But we do not want to pick up original SLP load
> > +	 nodes with a load-permutation here.  */
> > +      /* ???  We need to iterate if group_lanes / 2 is still too large.  */
> > +      /* ???  Ideally pick the best even/odd scheme usable for
> > +	 most of the loads.  -> do a multi-step scheme?  */
> > +
> > +      /* And finally from the ordered reduction node create the
> > +	 permute to shuffle the lanes into the original load-permutation
> > +	 order.  We replace the original load node with this.  */
> > +      SLP_TREE_CODE (load) = VEC_PERM_EXPR;
> > +      SLP_TREE_LOAD_PERMUTATION (load).release ();
> > +      SLP_TREE_LANE_PERMUTATION (load) = final_perm;
> > +      SLP_TREE_CHILDREN (load).create (1);
> > +      SLP_TREE_CHILDREN (load).quick_push (p);
> > +    }
> > +}
> > +
> > +/* Transform SLP loads in the SLP graph created by SLP discovery to
> > +   group loads from the same group and lower load permutations that
> > +   are unlikely to be supported into a series of permutes.
> > +   In the degenerate case of having only single-lane SLP instances
> > +   this should result in a series of permute nodes emulating an
> > +   interleaving scheme.  */
> > +
> > +static void
> > +vect_lower_load_permutations (loop_vec_info loop_vinfo,
> > +			      scalar_stmts_to_slp_tree_map_t *bst_map)
> > +{
> > +  /* Gather and sort loads across all instances.  */
> > +  hash_set<slp_tree> visited;
> > +  auto_vec<slp_tree> loads;
> > +  for (auto inst : loop_vinfo->slp_instances)
> > +    vect_gather_slp_loads (loads, SLP_INSTANCE_TREE (inst), visited);
> > +  if (loads.is_empty ())
> > +    return;
> > +  loads.qsort (vllp_cmp);
> > +
> > +  /* Now process each dataref group separately.  */
> > +  unsigned firsti = 0;
> > +  for (unsigned i = 1; i < loads.length (); ++i)
> > +    {
> > +      slp_tree first = loads[firsti];
> > +      slp_tree next = loads[i];
> > +      stmt_vec_info a0 = SLP_TREE_SCALAR_STMTS (first)[0];
> > +      stmt_vec_info b0 = SLP_TREE_SCALAR_STMTS (next)[0];
> > +      if (STMT_VINFO_GROUPED_ACCESS (a0)
> > +	  && STMT_VINFO_GROUPED_ACCESS (b0)
> > +	  && DR_GROUP_FIRST_ELEMENT (a0) == DR_GROUP_FIRST_ELEMENT
> > (b0))
> > +	continue;
> > +      /* Just one SLP load of a possible group, leave those alone.  */
> > +      if (i == firsti + 1)
> > +	{
> > +	  firsti = i;
> > +	  continue;
> > +	}
> > +      /* Now we have multiple SLP loads of the same group from
> > +	 firsti to i - 1.  */
> > +      vect_lower_load_permutations (loop_vinfo, bst_map,
> > +				    make_array_slice (&loads[firsti],
> > +						      i - firsti));
> > +      firsti = i;
> > +    }
> > +  if (firsti < loads.length () - 1)
> > +    vect_lower_load_permutations (loop_vinfo, bst_map,
> > +				  make_array_slice (&loads[firsti],
> > +						    loads.length () - firsti));
> > +}
> > +
> >  /* Check if there are stmts in the loop can be vectorized using SLP.  Build SLP
> >     trees of packed scalar stmts if SLP is possible.  */
> > 
> > @@ -3982,6 +4255,12 @@ vect_analyze_slp (vec_info *vinfo, unsigned
> > max_tree_size)
> >  	}
> >      }
> > 
> > +  /* When we end up with load permutations that we cannot possibly handle,
> > +     like those requiring three vector inputs, lower them using interleaving
> > +     like schemes.  */
> > +  if (loop_vec_info loop_vinfo = dyn_cast <loop_vec_info> (vinfo))
> > +    vect_lower_load_permutations (loop_vinfo, bst_map);
> > +
> >    hash_set<slp_tree> visited_patterns;
> >    slp_tree_to_load_perm_map_t perm_cache;
> >    slp_compat_nodes_map_t compat_cache;
> > --
> > 2.35.3
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH] [RFC] lower SLP load permutation to interleaving
@ 2024-06-04 14:32 Richard Biener
  0 siblings, 0 replies; 7+ messages in thread
From: Richard Biener @ 2024-06-04 14:32 UTC (permalink / raw)
  To: gcc-patches; +Cc: richard.sandiford, tamar.christina

The following emulates classical interleaving for SLP load permutes
that we are unlikely handling natively.  This is to handle cases
where interleaving (or load/store-lanes) is the optimal choice for
vectorizing even when we are doing that within SLP.  An example
would be

void foo (int * __restrict a, int * b)
{
  for (int i = 0; i < 16; ++i)
    {
      a[4*i + 0] = b[4*i + 0] * 3;
      a[4*i + 1] = b[4*i + 1] + 3;
      a[4*i + 2] = (b[4*i + 2] * 3 + 3);
      a[4*i + 3] = b[4*i + 3] * 3;
    }
}

where currently the SLP store is merging four single-lane SLP
sub-graphs but none of the loads in it can be code-generated
with V4SImode vectors and a VF of four as the permutes would need
three vectors.

The patch introduces a lowering phase after SLP discovery but
before SLP pattern recognition or permute optimization that
analyzes all loads from the same dataref group and creates an
interleaving scheme starting from an unpermuted load.

What can be handled is quite restrictive, matching only a subset
of the non-SLP interleaving cases (the power-of-two group size
ones, in addition only cases without gaps).  The interleaving
vectorization in addition can handle size 3 and 5 - but I am not
sure if it's possible to do that in a VL agnostic way.  It
should be still possible to set up the SLP graph in a way that
a load-lane could be matched from SLP pattern recognition.

As said gaps are currently not handled - for SLP we have a
representational issue that SLP_TREE_SCALAR_STMTS for "gap lanes"
would need to be filled in some way (even if we just push NULL).

The patch misses multi-level even/odd handling as well as CSEing
intermediate generated permutes.  Both is quite straight-forward
to add, but eventually there's a better or more general strategy
for lowering?  The main goal of the patch is to avoid falling
back to non-SLP for cases the interleaving code handles.

Comments and suggestions welcome, esp. what representation
you'd think is suitable for SLP pattern matching to
load/store-lane and how to represent that?  Maybe this lowering
should happen directly in vect_lower_load_permutations?

Thanks,
Richard.

	* tree-vect-slp.cc (vllp_cmp): New function.
	(vect_lower_load_permutations): Likewise.
	(vect_analyze_slp): Call it.
---
 gcc/tree-vect-slp.cc | 279 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 279 insertions(+)

diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index 7e3d0107b4e..766b773452f 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -3839,6 +3839,279 @@ vect_analyze_slp_instance (vec_info *vinfo,
   return res;
 }
 
+/* qsort comparator ordering SLP load nodes.  */
+
+static int
+vllp_cmp (const void *a_, const void *b_)
+{
+  const slp_tree a = *(const slp_tree *)a_;
+  const slp_tree b = *(const slp_tree *)b_;
+  stmt_vec_info a0 = SLP_TREE_SCALAR_STMTS (a)[0];
+  stmt_vec_info b0 = SLP_TREE_SCALAR_STMTS (b)[0];
+  if (STMT_VINFO_GROUPED_ACCESS (a0)
+      && STMT_VINFO_GROUPED_ACCESS (b0)
+      && DR_GROUP_FIRST_ELEMENT (a0) == DR_GROUP_FIRST_ELEMENT (b0))
+    {
+      /* Same group, order after lanes used.  */
+      if (SLP_TREE_LANES (a) < SLP_TREE_LANES (b))
+	return 1;
+      else if (SLP_TREE_LANES (a) > SLP_TREE_LANES (b))
+	return -1;
+      else
+	{
+	  /* Try to order loads using the same lanes together, breaking
+	     the tie with the lane number that first differs.  */
+	  if (!SLP_TREE_LOAD_PERMUTATION (a).exists ()
+	      && !SLP_TREE_LOAD_PERMUTATION (b).exists ())
+	    return 0;
+	  else if (SLP_TREE_LOAD_PERMUTATION (a).exists ()
+		   && !SLP_TREE_LOAD_PERMUTATION (b).exists ())
+	    return 1;
+	  else if (!SLP_TREE_LOAD_PERMUTATION (a).exists ()
+		   && SLP_TREE_LOAD_PERMUTATION (b).exists ())
+	    return -1;
+	  else
+	    {
+	      for (unsigned i = 0; i < SLP_TREE_LANES (a); ++i)
+		if (SLP_TREE_LOAD_PERMUTATION (a)[i]
+		    != SLP_TREE_LOAD_PERMUTATION (b)[i])
+		  {
+		    /* In-order lane first, that's what the above case for
+		       no permutation does.  */
+		    if (SLP_TREE_LOAD_PERMUTATION (a)[i] == i)
+		      return -1;
+		    else if (SLP_TREE_LOAD_PERMUTATION (b)[i] == i)
+		      return 1;
+		    else if (SLP_TREE_LOAD_PERMUTATION (a)[i]
+			     < SLP_TREE_LOAD_PERMUTATION (b)[i])
+		      return -1;
+		    else
+		      return 1;
+		  }
+	      return 0;
+	    }
+	}
+    }
+  else /* Different groups or non-groups.  */
+    {
+      /* Order groups as their first element to keep them together.  */
+      if (STMT_VINFO_GROUPED_ACCESS (a0))
+	a0 = DR_GROUP_FIRST_ELEMENT (a0);
+      if (STMT_VINFO_GROUPED_ACCESS (b0))
+	b0 = DR_GROUP_FIRST_ELEMENT (b0);
+      if (a0 == b0)
+	return 0;
+      /* Tie using UID.  */
+      else if (gimple_uid (STMT_VINFO_STMT (a0))
+	       < gimple_uid (STMT_VINFO_STMT (b0)))
+	return -1;
+      else
+	{
+	  gcc_assert (gimple_uid (STMT_VINFO_STMT (a0))
+		      != gimple_uid (STMT_VINFO_STMT (b0)));
+	  return 1;
+	}
+    }
+}
+
+/* Process the set of LOADS that are all from the same dataref group.  */
+
+static void
+vect_lower_load_permutations (loop_vec_info loop_vinfo,
+			      scalar_stmts_to_slp_tree_map_t *bst_map,
+			      const array_slice<slp_tree> &loads)
+{
+  /* We at this point want to lower without a fixed VF or vector
+     size in mind which means we cannot actually compute whether we
+     need three or more vectors for a load permutation yet.  So always
+     lower.  */
+  stmt_vec_info first
+    = DR_GROUP_FIRST_ELEMENT (SLP_TREE_SCALAR_STMTS (loads[0])[0]);
+
+  /* ???  In principle we have to consider a gap up to the next full
+     vector, but we have to actually represent a scalar stmt for the
+     gaps value so delay handling this.  The same is true for
+     inbetween gaps which the load places in the load-permutation
+     represent.  It's probably not worth trying an intermediate packing
+     to vectors without gap even if that might handle some more cases.
+     Instead get the gap case correct in some way.  */
+  unsigned group_lanes = 0;
+  for (stmt_vec_info s = first; s; s = DR_GROUP_NEXT_ELEMENT (s))
+    {
+      if ((s == first && DR_GROUP_GAP (s) != 0)
+	  || (s != first && DR_GROUP_GAP (s) != 1))
+	return;
+      group_lanes++;
+    }
+  /* Only a power-of-two number of lanes matches interleaving.  */
+  if (exact_log2 (group_lanes) == -1)
+    return;
+
+  for (slp_tree load : loads)
+    {
+      /* Leave masked or gather loads alone for now.  */
+      if (!SLP_TREE_CHILDREN (load).is_empty ())
+	continue;
+
+      /* We need to lower only loads of less than half of the groups
+	 lanes, including duplicate lanes.  */
+      if (SLP_TREE_LANES (load) >= group_lanes / 2)
+	continue;
+
+      /* Lower by reducing the group to half its size using an
+	 interleaving scheme.  For this try to compute whether all
+	 elements needed for this loads are in even or odd elements of
+	 a even/odd decomposition with N consecutive elements.
+	 Thus { e, e, o, o, e, e, o, o } woud be an even/odd decomposition
+	 with N == 2.  */
+      unsigned even = (1 << ceil_log2 (DR_GROUP_SIZE (first))) - 1;
+      unsigned odd = even;
+      for (unsigned l : SLP_TREE_LOAD_PERMUTATION (load))
+	{
+	  even &= ~l;
+	  odd &= l;
+	}
+      /* Give up when this doesn't match up with an interleaving scheme.  */
+      if (!even && !odd)
+	continue;
+
+      /* First build (and possibly re-use) a load node for the
+	 unpermuted group.  */
+      vec<stmt_vec_info> stmts;
+      stmts.create (group_lanes);
+      for (stmt_vec_info s = first; s; s = DR_GROUP_NEXT_ELEMENT (s))
+	stmts.quick_push (s);
+      poly_uint64 max_nunits;
+      bool *matches = XALLOCAVEC (bool, group_lanes);
+      unsigned limit = 1;
+      unsigned tree_size = 0;
+      slp_tree l0 = vect_build_slp_tree (loop_vinfo, stmts,
+					 group_lanes,
+					 &max_nunits, matches, &limit,
+					 &tree_size, bst_map);
+
+      /* Build the permute to get the original load permutation order.  */
+      lane_permutation_t final_perm;
+      final_perm.create (SLP_TREE_LANES (load));
+      for (unsigned i = 0; i < SLP_TREE_LANES (load); ++i)
+	final_perm.quick_push
+	  (std::make_pair (0, SLP_TREE_LOAD_PERMUTATION (load)[i]));
+
+      /* Now build a even or odd extraction from the unpermuted load.  */
+      lane_permutation_t perm;
+      perm.create (group_lanes / 2);
+      unsigned level;
+      if (even
+	  && ((level = 1 << ctz_hwi (even)), true)
+	  && group_lanes % (2 * level) == 0)
+	{
+	  /* { 0, 1, ... 4, 5 ..., } */
+	  unsigned level = 1 << ctz_hwi (even);
+	  for (unsigned i = 0; i < group_lanes / 2 / level; ++i)
+	    for (unsigned j = 0; j < level; ++j)
+	      perm.quick_push (std::make_pair (0, 2 * i * level + j));
+	}
+      else if (odd)
+	{
+	  /* { ..., 2, 3, ... 6, 7 } */
+	  unsigned level = 1 << ctz_hwi (odd);
+	  gcc_assert (group_lanes % (2 * level) == 0);
+	  for (unsigned i = 0; i < group_lanes / 2 / level; ++i)
+	    for (unsigned j = 0; j < level; ++j)
+	      perm.quick_push (std::make_pair (0, (2 * i + 1) * level + j));
+	}
+      else
+	gcc_unreachable ();
+
+      /* And update final_perm.  */
+      for (unsigned i = 0; i < SLP_TREE_LANES (load); ++i)
+	{
+	  unsigned l = final_perm[i].second;
+	  unsigned j;
+	  for (j = 0; j < perm.length (); ++j)
+	    if (perm[j].second == l)
+	      {
+		final_perm[i].second = j;
+		break;
+	      }
+	  gcc_assert (j < perm.length ());
+	}
+
+      slp_tree p = vect_create_new_slp_node (1, VEC_PERM_EXPR);
+      SLP_TREE_CHILDREN (p).quick_push (l0);
+      SLP_TREE_LANE_PERMUTATION (p) = perm;
+      SLP_TREE_VECTYPE (p) = SLP_TREE_VECTYPE (load);
+      SLP_TREE_LANES (p) = perm.length ();
+      SLP_TREE_REPRESENTATIVE (p) = SLP_TREE_REPRESENTATIVE (load);
+      /* ???  We should have scalar stmts for this and use bst_map
+	 to CSE.  But we do not want to pick up original SLP load
+	 nodes with a load-permutation here.  */
+      /* ???  We need to iterate if group_lanes / 2 is still too large.  */
+      /* ???  Ideally pick the best even/odd scheme usable for
+	 most of the loads.  -> do a multi-step scheme?  */
+
+      /* And finally from the ordered reduction node create the
+	 permute to shuffle the lanes into the original load-permutation
+	 order.  We replace the original load node with this.  */
+      SLP_TREE_CODE (load) = VEC_PERM_EXPR;
+      SLP_TREE_LOAD_PERMUTATION (load).release ();
+      SLP_TREE_LANE_PERMUTATION (load) = final_perm;
+      SLP_TREE_CHILDREN (load).create (1);
+      SLP_TREE_CHILDREN (load).quick_push (p);
+    }
+}
+
+/* Transform SLP loads in the SLP graph created by SLP discovery to
+   group loads from the same group and lower load permutations that
+   are unlikely to be supported into a series of permutes.
+   In the degenerate case of having only single-lane SLP instances
+   this should result in a series of permute nodes emulating an
+   interleaving scheme.  */
+
+static void
+vect_lower_load_permutations (loop_vec_info loop_vinfo,
+			      scalar_stmts_to_slp_tree_map_t *bst_map)
+{
+  /* Gather and sort loads across all instances.  */
+  hash_set<slp_tree> visited;
+  auto_vec<slp_tree> loads;
+  for (auto inst : loop_vinfo->slp_instances)
+    vect_gather_slp_loads (loads, SLP_INSTANCE_TREE (inst), visited);
+  if (loads.is_empty ())
+    return;
+  loads.qsort (vllp_cmp);
+
+  /* Now process each dataref group separately.  */
+  unsigned firsti = 0;
+  for (unsigned i = 1; i < loads.length (); ++i)
+    {
+      slp_tree first = loads[firsti];
+      slp_tree next = loads[i];
+      stmt_vec_info a0 = SLP_TREE_SCALAR_STMTS (first)[0];
+      stmt_vec_info b0 = SLP_TREE_SCALAR_STMTS (next)[0];
+      if (STMT_VINFO_GROUPED_ACCESS (a0)
+	  && STMT_VINFO_GROUPED_ACCESS (b0)
+	  && DR_GROUP_FIRST_ELEMENT (a0) == DR_GROUP_FIRST_ELEMENT (b0))
+	continue;
+      /* Just one SLP load of a possible group, leave those alone.  */
+      if (i == firsti + 1)
+	{
+	  firsti = i;
+	  continue;
+	}
+      /* Now we have multiple SLP loads of the same group from
+	 firsti to i - 1.  */
+      vect_lower_load_permutations (loop_vinfo, bst_map,
+				    make_array_slice (&loads[firsti],
+						      i - firsti));
+      firsti = i;
+    }
+  if (firsti < loads.length () - 1)
+    vect_lower_load_permutations (loop_vinfo, bst_map,
+				  make_array_slice (&loads[firsti],
+						    loads.length () - firsti));
+}
+
 /* Check if there are stmts in the loop can be vectorized using SLP.  Build SLP
    trees of packed scalar stmts if SLP is possible.  */
 
@@ -3982,6 +4255,12 @@ vect_analyze_slp (vec_info *vinfo, unsigned max_tree_size)
 	}
     }
 
+  /* When we end up with load permutations that we cannot possibly handle,
+     like those requiring three vector inputs, lower them using interleaving
+     like schemes.  */
+  if (loop_vec_info loop_vinfo = dyn_cast <loop_vec_info> (vinfo))
+    vect_lower_load_permutations (loop_vinfo, bst_map);
+
   hash_set<slp_tree> visited_patterns;
   slp_tree_to_load_perm_map_t perm_cache;
   slp_compat_nodes_map_t compat_cache;
-- 
2.35.3

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2024-06-05 12:29 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <c6f14930-59e5-4cc8-a38e-7f0452b67dba@DB5PEPF00014B89.eurprd02.prod.outlook.com>
2024-06-04 19:58 ` [PATCH] [RFC] lower SLP load permutation to interleaving Richard Sandiford
2024-06-05 11:54   ` Richard Biener
2024-06-05 12:10     ` Richard Biener
2024-06-05 12:29     ` Richard Sandiford
2024-06-05  9:00 ` Tamar Christina
2024-06-05 11:59   ` Richard Biener
2024-06-04 14:32 Richard Biener

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).