Date: Wed, 5 Jun 2024 13:59:09 +0200 (CEST)
From: Richard Biener
To: Tamar Christina
Cc: gcc-patches@gcc.gnu.org, Richard Sandiford
Subject: RE: [PATCH] [RFC] lower SLP load permutation to interleaving
Message-ID: <38s868o8-583r-qs0n-qprr-o5o639po3r60@fhfr.qr>

On Wed, 5 Jun 2024, Tamar Christina wrote:

> > -----Original Message-----
> > From: Richard Biener
> > Sent: Tuesday, June 4, 2024 3:33 PM
> > To: gcc-patches@gcc.gnu.org
> > Cc: Richard Sandiford; Tamar Christina
> > Subject: [PATCH] [RFC] lower SLP load permutation to interleaving
> >
> > The following emulates classical interleaving for SLP load permutes
> > that we are unlikely to handle natively.  This is to handle cases
> > where interleaving (or load/store-lanes) is the optimal choice for
> > vectorizing even when we are doing that within SLP.  An example
> > would be
> >
> > void foo (int * __restrict a, int * b)
> > {
> >   for (int i = 0; i < 16; ++i)
> >     {
> >       a[4*i + 0] = b[4*i + 0] * 3;
> >       a[4*i + 1] = b[4*i + 1] + 3;
> >       a[4*i + 2] = (b[4*i + 2] * 3 + 3);
> >       a[4*i + 3] = b[4*i + 3] * 3;
> >     }
> > }
> >
> > where currently the SLP store is merging four single-lane SLP
> > sub-graphs but none of the loads in it can be code-generated
> > with V4SImode vectors and a VF of four as the permutes would need
> > three vectors.
> >
> > The patch introduces a lowering phase after SLP discovery but
> > before SLP pattern recognition or permute optimization that
> > analyzes all loads from the same dataref group and creates an
> > interleaving scheme starting from an unpermuted load.
> >
> > What can be handled is quite restrictive, matching only a subset
> > of the non-SLP interleaving cases (the power-of-two group size
> > ones, and in addition only cases without gaps).  The interleaving
> > vectorization can in addition handle sizes 3 and 5 - but I am not
> > sure if it's possible to do that in a VL-agnostic way.  It
> > should still be possible to set up the SLP graph in a way that
> > a load-lane could be matched from SLP pattern recognition.
> >
> > As said, gaps are currently not handled - for SLP we have a
> > representational issue that SLP_TREE_SCALAR_STMTS for "gap lanes"
> > would need to be filled in some way (even if we just push NULL).
> >
> > The patch misses multi-level even/odd handling as well as CSEing
> > intermediate generated permutes.  Both are quite straightforward
> > to add, but eventually there's a better or more general strategy
> > for lowering?  The main goal of the patch is to avoid falling
> > back to non-SLP for cases the interleaving code handles.
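To make that concrete for the example above: the lowering amounts to
loading the group unpermuted and recovering each lane with two levels of
even/odd extraction.  Roughly, in terms of GCC's generic vector extensions
(a sketch only, not what the vectorizer actually emits; it assumes
__builtin_shufflevector from GCC 12+, assumes b is suitably aligned, keeps
the store side scalar, and the names even/odd/foo_lowered are just for
illustration):

typedef int v4si __attribute__ ((vector_size (16)));

static v4si even (v4si a, v4si b)
{ return __builtin_shufflevector (a, b, 0, 2, 4, 6); }

static v4si odd (v4si a, v4si b)
{ return __builtin_shufflevector (a, b, 1, 3, 5, 7); }

void foo_lowered (int * __restrict a, int * b)
{
  for (int i = 0; i < 16; i += 4)
    {
      /* Unpermuted loads covering four groups (16 contiguous ints).  */
      v4si v0 = *(v4si *)(b + 4*i + 0);
      v4si v1 = *(v4si *)(b + 4*i + 4);
      v4si v2 = *(v4si *)(b + 4*i + 8);
      v4si v3 = *(v4si *)(b + 4*i + 12);
      /* Two levels of even/odd extraction yield one vector per lane,
         e.g. l0 = { b[4*i+0], b[4*(i+1)+0], b[4*(i+2)+0], b[4*(i+3)+0] }.  */
      v4si l0 = even (even (v0, v1), even (v2, v3));
      v4si l1 = even (odd (v0, v1), odd (v2, v3));
      v4si l2 = odd (even (v0, v1), even (v2, v3));
      v4si l3 = odd (odd (v0, v1), odd (v2, v3));
      /* The store side is kept scalar here for brevity.  */
      for (int j = 0; j < 4; ++j)
        {
          a[4*(i+j) + 0] = l0[j] * 3;
          a[4*(i+j) + 1] = l1[j] + 3;
          a[4*(i+j) + 2] = l2[j] * 3 + 3;
          a[4*(i+j) + 3] = l3[j] * 3;
        }
    }
}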
> 
> I guess not handling CSEing the intermediate permutes only really
> matters for pattern matching?  Those could be eliminated in optimize_slp?
> 
> > 
> > Comments and suggestions welcome, esp. what representation
> > you'd think is suitable for SLP pattern matching to
> > load/store-lane and how to represent that?  Maybe this lowering
> > should happen directly in vect_lower_load_permutations?
> 
> I like this representation personally; I'd say having the permute
> explicit, at least until optimize_slp, would make pattern matching
> easier.
> 
> We wouldn't need hacks such as optimize_load_redistribution.
> In that sense, does it make sense to eventually just lower all permuted
> loads?

Yes.  One of my ideas was that when we get rid of non-SLP, the
vectorizable_* routines themselves can apply transforms to their SLP
node, say turn a vector load with a load-permutation into a scalar
load + splat.
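For the splat case that transform is tiny; as a rough sketch in plain
GNU C vector extensions (an illustration only, not vectorizer IL, with a
hypothetical helper name), a load whose load-permutation merely
duplicates one lane of the group becomes a scalar load plus a splat:

typedef int v4si __attribute__ ((vector_size (16)));

/* An SLP load node with load-permutation { 3, 3, 3, 3 } does not need
   a vector load of the whole group; the single used element suffices.  */
v4si splat_lane_3 (const int *b)
{
  int t = b[3];                  /* scalar load of the one used lane */
  return (v4si) { t, t, t, t };  /* splat */
}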
Note that all the current patches try to preserve as much as possible
of the (missed) capabilities of the current mix of SLP and non-SLP.
This is why I didn't try changing SLP discovery for loads, as I did in
earlier attempts to arrive at similar results, but do this as a
followup lowering.

The one thing that permuted loads (SLP loads with load_permutation)
are good for is to avoid needing to represent "unused" lanes.

Richard.

> 
> Cheers,
> Tamar
> 
> > 
> > Thanks,
> > Richard.
> > 
> > 	* tree-vect-slp.cc (vllp_cmp): New function.
> > 	(vect_lower_load_permutations): Likewise.
> > 	(vect_analyze_slp): Call it.
> > ---
> >  gcc/tree-vect-slp.cc | 279 +++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 279 insertions(+)
> > 
> > diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
> > index 7e3d0107b4e..766b773452f 100644
> > --- a/gcc/tree-vect-slp.cc
> > +++ b/gcc/tree-vect-slp.cc
> > @@ -3839,6 +3839,279 @@ vect_analyze_slp_instance (vec_info *vinfo,
> >    return res;
> >  }
> > 
> > +/* qsort comparator ordering SLP load nodes.  */
> > +
> > +static int
> > +vllp_cmp (const void *a_, const void *b_)
> > +{
> > +  const slp_tree a = *(const slp_tree *)a_;
> > +  const slp_tree b = *(const slp_tree *)b_;
> > +  stmt_vec_info a0 = SLP_TREE_SCALAR_STMTS (a)[0];
> > +  stmt_vec_info b0 = SLP_TREE_SCALAR_STMTS (b)[0];
> > +  if (STMT_VINFO_GROUPED_ACCESS (a0)
> > +      && STMT_VINFO_GROUPED_ACCESS (b0)
> > +      && DR_GROUP_FIRST_ELEMENT (a0) == DR_GROUP_FIRST_ELEMENT (b0))
> > +    {
> > +      /* Same group, order after lanes used.  */
> > +      if (SLP_TREE_LANES (a) < SLP_TREE_LANES (b))
> > +        return 1;
> > +      else if (SLP_TREE_LANES (a) > SLP_TREE_LANES (b))
> > +        return -1;
> > +      else
> > +        {
> > +          /* Try to order loads using the same lanes together, breaking
> > +             the tie with the lane number that first differs.  */
> > +          if (!SLP_TREE_LOAD_PERMUTATION (a).exists ()
> > +              && !SLP_TREE_LOAD_PERMUTATION (b).exists ())
> > +            return 0;
> > +          else if (SLP_TREE_LOAD_PERMUTATION (a).exists ()
> > +                   && !SLP_TREE_LOAD_PERMUTATION (b).exists ())
> > +            return 1;
> > +          else if (!SLP_TREE_LOAD_PERMUTATION (a).exists ()
> > +                   && SLP_TREE_LOAD_PERMUTATION (b).exists ())
> > +            return -1;
> > +          else
> > +            {
> > +              for (unsigned i = 0; i < SLP_TREE_LANES (a); ++i)
> > +                if (SLP_TREE_LOAD_PERMUTATION (a)[i]
> > +                    != SLP_TREE_LOAD_PERMUTATION (b)[i])
> > +                  {
> > +                    /* In-order lane first, that's what the above case for
> > +                       no permutation does.  */
> > +                    if (SLP_TREE_LOAD_PERMUTATION (a)[i] == i)
> > +                      return -1;
> > +                    else if (SLP_TREE_LOAD_PERMUTATION (b)[i] == i)
> > +                      return 1;
> > +                    else if (SLP_TREE_LOAD_PERMUTATION (a)[i]
> > +                             < SLP_TREE_LOAD_PERMUTATION (b)[i])
> > +                      return -1;
> > +                    else
> > +                      return 1;
> > +                  }
> > +              return 0;
> > +            }
> > +        }
> > +    }
> > +  else /* Different groups or non-groups.  */
> > +    {
> > +      /* Order groups as their first element to keep them together.  */
> > +      if (STMT_VINFO_GROUPED_ACCESS (a0))
> > +        a0 = DR_GROUP_FIRST_ELEMENT (a0);
> > +      if (STMT_VINFO_GROUPED_ACCESS (b0))
> > +        b0 = DR_GROUP_FIRST_ELEMENT (b0);
> > +      if (a0 == b0)
> > +        return 0;
> > +      /* Tie using UID.  */
> > +      else if (gimple_uid (STMT_VINFO_STMT (a0))
> > +               < gimple_uid (STMT_VINFO_STMT (b0)))
> > +        return -1;
> > +      else
> > +        {
> > +          gcc_assert (gimple_uid (STMT_VINFO_STMT (a0))
> > +                      != gimple_uid (STMT_VINFO_STMT (b0)));
> > +          return 1;
> > +        }
> > +    }
> > +}
> > +
> > +/* Process the set of LOADS that are all from the same dataref group.  */
> > +
> > +static void
> > +vect_lower_load_permutations (loop_vec_info loop_vinfo,
> > +                              scalar_stmts_to_slp_tree_map_t *bst_map,
> > +                              const array_slice<slp_tree> &loads)
> > +{
> > +  /* We at this point want to lower without a fixed VF or vector
> > +     size in mind which means we cannot actually compute whether we
> > +     need three or more vectors for a load permutation yet.  So always
> > +     lower.  */
> > +  stmt_vec_info first
> > +    = DR_GROUP_FIRST_ELEMENT (SLP_TREE_SCALAR_STMTS (loads[0])[0]);
> > +
> > +  /* ??? In principle we have to consider a gap up to the next full
> > +     vector, but we have to actually represent a scalar stmt for the
> > +     gap's value so delay handling this.  The same is true for
> > +     in-between gaps which the load places in the load-permutation
> > +     represent.  It's probably not worth trying an intermediate packing
> > +     to vectors without gap even if that might handle some more cases.
> > +     Instead get the gap case correct in some way.  */
> > +  unsigned group_lanes = 0;
> > +  for (stmt_vec_info s = first; s; s = DR_GROUP_NEXT_ELEMENT (s))
> > +    {
> > +      if ((s == first && DR_GROUP_GAP (s) != 0)
> > +          || (s != first && DR_GROUP_GAP (s) != 1))
> > +        return;
> > +      group_lanes++;
> > +    }
> > +  /* Only a power-of-two number of lanes matches interleaving.  */
> > +  if (exact_log2 (group_lanes) == -1)
> > +    return;
> > +
> > +  for (slp_tree load : loads)
> > +    {
> > +      /* Leave masked or gather loads alone for now.  */
> > +      if (!SLP_TREE_CHILDREN (load).is_empty ())
> > +        continue;
> > +
> > +      /* We need to lower only loads of less than half of the group's
> > +         lanes, including duplicate lanes.  */
> > +      if (SLP_TREE_LANES (load) >= group_lanes / 2)
> > +        continue;
> > +
> > +      /* Lower by reducing the group to half its size using an
> > +         interleaving scheme.  For this try to compute whether all
> > +         elements needed for this load are in even or odd elements of
> > +         an even/odd decomposition with N consecutive elements.
> > +         Thus { e, e, o, o, e, e, o, o } would be an even/odd decomposition
> > +         with N == 2.  */
> > +      unsigned even = (1 << ceil_log2 (DR_GROUP_SIZE (first))) - 1;
> > +      unsigned odd = even;
> > +      for (unsigned l : SLP_TREE_LOAD_PERMUTATION (load))
> > +        {
> > +          even &= ~l;
> > +          odd &= l;
> > +        }
> > +      /* Give up when this doesn't match up with an interleaving scheme.  */
> > +      if (!even && !odd)
> > +        continue;
> > +
> > +      /* First build (and possibly re-use) a load node for the
> > +         unpermuted group.  */
> > +      vec<stmt_vec_info> stmts;
> > +      stmts.create (group_lanes);
> > +      for (stmt_vec_info s = first; s; s = DR_GROUP_NEXT_ELEMENT (s))
> > +        stmts.quick_push (s);
> > +      poly_uint64 max_nunits;
> > +      bool *matches = XALLOCAVEC (bool, group_lanes);
> > +      unsigned limit = 1;
> > +      unsigned tree_size = 0;
> > +      slp_tree l0 = vect_build_slp_tree (loop_vinfo, stmts,
> > +                                         group_lanes,
> > +                                         &max_nunits, matches, &limit,
> > +                                         &tree_size, bst_map);
> > +
> > +      /* Build the permute to get the original load permutation order.  */
> > +      lane_permutation_t final_perm;
> > +      final_perm.create (SLP_TREE_LANES (load));
> > +      for (unsigned i = 0; i < SLP_TREE_LANES (load); ++i)
> > +        final_perm.quick_push
> > +          (std::make_pair (0, SLP_TREE_LOAD_PERMUTATION (load)[i]));
> > +
> > +      /* Now build an even or odd extraction from the unpermuted load.  */
> > +      lane_permutation_t perm;
> > +      perm.create (group_lanes / 2);
> > +      unsigned level;
> > +      if (even
> > +          && ((level = 1 << ctz_hwi (even)), true)
> > +          && group_lanes % (2 * level) == 0)
> > +        {
> > +          /* { 0, 1, ... 4, 5 ..., } */
> > +          unsigned level = 1 << ctz_hwi (even);
> > +          for (unsigned i = 0; i < group_lanes / 2 / level; ++i)
> > +            for (unsigned j = 0; j < level; ++j)
> > +              perm.quick_push (std::make_pair (0, 2 * i * level + j));
> > +        }
> > +      else if (odd)
> > +        {
> > +          /* { ..., 2, 3, ... 6, 7 } */
> > +          unsigned level = 1 << ctz_hwi (odd);
> > +          gcc_assert (group_lanes % (2 * level) == 0);
> > +          for (unsigned i = 0; i < group_lanes / 2 / level; ++i)
> > +            for (unsigned j = 0; j < level; ++j)
> > +              perm.quick_push (std::make_pair (0, (2 * i + 1) * level + j));
> > +        }
> > +      else
> > +        gcc_unreachable ();
> > +
> > +      /* And update final_perm.  */
> > +      for (unsigned i = 0; i < SLP_TREE_LANES (load); ++i)
> > +        {
> > +          unsigned l = final_perm[i].second;
> > +          unsigned j;
> > +          for (j = 0; j < perm.length (); ++j)
> > +            if (perm[j].second == l)
> > +              {
> > +                final_perm[i].second = j;
> > +                break;
> > +              }
> > +          gcc_assert (j < perm.length ());
> > +        }
> > +
> > +      slp_tree p = vect_create_new_slp_node (1, VEC_PERM_EXPR);
> > +      SLP_TREE_CHILDREN (p).quick_push (l0);
> > +      SLP_TREE_LANE_PERMUTATION (p) = perm;
> > +      SLP_TREE_VECTYPE (p) = SLP_TREE_VECTYPE (load);
> > +      SLP_TREE_LANES (p) = perm.length ();
> > +      SLP_TREE_REPRESENTATIVE (p) = SLP_TREE_REPRESENTATIVE (load);
> > +      /* ??? We should have scalar stmts for this and use bst_map
> > +         to CSE.  But we do not want to pick up original SLP load
> > +         nodes with a load-permutation here.  */
> > +      /* ??? We need to iterate if group_lanes / 2 is still too large.  */
> > +      /* ??? Ideally pick the best even/odd scheme usable for
> > +         most of the loads.  -> do a multi-step scheme?  */
> > +
> > +      /* And finally from the ordered reduction node create the
> > +         permute to shuffle the lanes into the original load-permutation
> > +         order.  We replace the original load node with this.  */
> > +      SLP_TREE_CODE (load) = VEC_PERM_EXPR;
> > +      SLP_TREE_LOAD_PERMUTATION (load).release ();
> > +      SLP_TREE_LANE_PERMUTATION (load) = final_perm;
> > +      SLP_TREE_CHILDREN (load).create (1);
> > +      SLP_TREE_CHILDREN (load).quick_push (p);
> > +    }
> > +}
> > +
> > +/* Transform SLP loads in the SLP graph created by SLP discovery to
> > +   group loads from the same group and lower load permutations that
> > +   are unlikely to be supported into a series of permutes.
> > +   In the degenerate case of having only single-lane SLP instances
> > +   this should result in a series of permute nodes emulating an
> > +   interleaving scheme.  */
> > +
> > +static void
> > +vect_lower_load_permutations (loop_vec_info loop_vinfo,
> > +                              scalar_stmts_to_slp_tree_map_t *bst_map)
> > +{
> > +  /* Gather and sort loads across all instances.  */
> > +  hash_set<slp_tree> visited;
> > +  auto_vec<slp_tree> loads;
> > +  for (auto inst : loop_vinfo->slp_instances)
> > +    vect_gather_slp_loads (loads, SLP_INSTANCE_TREE (inst), visited);
> > +  if (loads.is_empty ())
> > +    return;
> > +  loads.qsort (vllp_cmp);
> > +
> > +  /* Now process each dataref group separately.  */
> > +  unsigned firsti = 0;
> > +  for (unsigned i = 1; i < loads.length (); ++i)
> > +    {
> > +      slp_tree first = loads[firsti];
> > +      slp_tree next = loads[i];
> > +      stmt_vec_info a0 = SLP_TREE_SCALAR_STMTS (first)[0];
> > +      stmt_vec_info b0 = SLP_TREE_SCALAR_STMTS (next)[0];
> > +      if (STMT_VINFO_GROUPED_ACCESS (a0)
> > +          && STMT_VINFO_GROUPED_ACCESS (b0)
> > +          && DR_GROUP_FIRST_ELEMENT (a0) == DR_GROUP_FIRST_ELEMENT (b0))
> > +        continue;
> > +      /* Just one SLP load of a possible group, leave those alone.  */
> > +      if (i == firsti + 1)
> > +        {
> > +          firsti = i;
> > +          continue;
> > +        }
> > +      /* Now we have multiple SLP loads of the same group from
> > +         firsti to i - 1.  */
> > +      vect_lower_load_permutations (loop_vinfo, bst_map,
> > +                                    make_array_slice (&loads[firsti],
> > +                                                      i - firsti));
> > +      firsti = i;
> > +    }
> > +  if (firsti < loads.length () - 1)
> > +    vect_lower_load_permutations (loop_vinfo, bst_map,
> > +                                  make_array_slice (&loads[firsti],
> > +                                                    loads.length () - firsti));
> > +}
> > +
> >  /* Check if there are stmts in the loop can be vectorized using SLP.  Build SLP
> >     trees of packed scalar stmts if SLP is possible.  */
> > 
> > @@ -3982,6 +4255,12 @@ vect_analyze_slp (vec_info *vinfo, unsigned max_tree_size)
> >  	}
> >      }
> > 
> > +  /* When we end up with load permutations that we cannot possibly handle,
> > +     like those requiring three vector inputs, lower them using
> > +     interleaving-like schemes.  */
> > +  if (loop_vec_info loop_vinfo = dyn_cast <loop_vec_info> (vinfo))
> > +    vect_lower_load_permutations (loop_vinfo, bst_map);
> > +
> >    hash_set<slp_tree> visited_patterns;
> >    slp_tree_to_load_perm_map_t perm_cache;
> >    slp_compat_nodes_map_t compat_cache;
> > --
> > 2.35.3
> 

-- 
Richard Biener
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)