Date: Wed, 5 Jun 2024 13:59:09 +0200 (CEST)
From: Richard Biener
To: Tamar Christina
Cc: gcc-patches@gcc.gnu.org, Richard Sandiford
Subject: RE: [PATCH] [RFC] lower SLP load permutation to interleaving
Message-ID: <38s868o8-583r-qs0n-qprr-o5o639po3r60@fhfr.qr>

On Wed, 5 Jun 2024, Tamar Christina wrote:

> > -----Original Message-----
> > From: Richard Biener
> > Sent: Tuesday, June 4, 2024 3:33 PM
> > To: gcc-patches@gcc.gnu.org
> > Cc: Richard Sandiford; Tamar Christina
> > Subject: [PATCH] [RFC] lower SLP load permutation to interleaving
> >
> > The following emulates classical interleaving for SLP load permutes
> > that we are unlikely to handle natively.  This is to handle cases
> > where interleaving (or load/store-lanes) is the optimal choice for
> > vectorizing even when we are doing that within SLP.  An example
> > would be
> >
> > void foo (int * __restrict a, int * b)
> > {
> >   for (int i = 0; i < 16; ++i)
> >     {
> >       a[4*i + 0] = b[4*i + 0] * 3;
> >       a[4*i + 1] = b[4*i + 1] + 3;
> >       a[4*i + 2] = (b[4*i + 2] * 3 + 3);
> >       a[4*i + 3] = b[4*i + 3] * 3;
> >     }
> > }
> >
> > where currently the SLP store is merging four single-lane SLP
> > sub-graphs but none of the loads in it can be code-generated
> > with V4SImode vectors and a VF of four as the permutes would need
> > three vectors.
> >
> > The patch introduces a lowering phase after SLP discovery but
> > before SLP pattern recognition or permute optimization that
> > analyzes all loads from the same dataref group and creates an
> > interleaving scheme starting from an unpermuted load.
> >
> > What can be handled is quite restrictive, matching only a subset
> > of the non-SLP interleaving cases (the power-of-two group size
> > ones, and in addition only cases without gaps).  The interleaving
> > vectorization can in addition handle sizes 3 and 5 - but I am not
> > sure if it's possible to do that in a VL-agnostic way.  It
> > should still be possible to set up the SLP graph in a way that
> > a load-lane could be matched from SLP pattern recognition.
> >
> > As said, gaps are currently not handled - for SLP we have a
> > representational issue that SLP_TREE_SCALAR_STMTS for "gap lanes"
> > would need to be filled in some way (even if we just push NULL).
> >
> > The patch misses multi-level even/odd handling as well as CSEing
> > intermediate generated permutes.  Both are quite straightforward
> > to add, but eventually there's a better or more general strategy
> > for lowering?  The main goal of the patch is to avoid falling
> > back to non-SLP for cases the interleaving code handles.
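To make that concrete for the example above: the lowering amounts to
loading the group unpermuted and recovering each lane with two levels of
even/odd extraction.  Roughly, in terms of GCC's generic vector extensions
(a sketch only, not what the vectorizer actually emits; it assumes
__builtin_shufflevector from GCC 12+, assumes b is suitably aligned, keeps
the store side scalar, and the names even/odd/foo_lowered are just for
illustration):

typedef int v4si __attribute__ ((vector_size (16)));

static v4si even (v4si a, v4si b)
{ return __builtin_shufflevector (a, b, 0, 2, 4, 6); }

static v4si odd (v4si a, v4si b)
{ return __builtin_shufflevector (a, b, 1, 3, 5, 7); }

void foo_lowered (int * __restrict a, int * b)
{
  for (int i = 0; i < 16; i += 4)
    {
      /* Unpermuted loads covering four groups (16 contiguous ints).  */
      v4si v0 = *(v4si *)(b + 4*i + 0);
      v4si v1 = *(v4si *)(b + 4*i + 4);
      v4si v2 = *(v4si *)(b + 4*i + 8);
      v4si v3 = *(v4si *)(b + 4*i + 12);
      /* Two levels of even/odd extraction yield one vector per lane,
         e.g. l0 = { b[4*i+0], b[4*(i+1)+0], b[4*(i+2)+0], b[4*(i+3)+0] }.  */
      v4si l0 = even (even (v0, v1), even (v2, v3));
      v4si l1 = even (odd (v0, v1), odd (v2, v3));
      v4si l2 = odd (even (v0, v1), even (v2, v3));
      v4si l3 = odd (odd (v0, v1), odd (v2, v3));
      /* The store side is kept scalar here for brevity.  */
      for (int j = 0; j < 4; ++j)
        {
          a[4*(i+j) + 0] = l0[j] * 3;
          a[4*(i+j) + 1] = l1[j] + 3;
          a[4*(i+j) + 2] = l2[j] * 3 + 3;
          a[4*(i+j) + 3] = l3[j] * 3;
        }
    }
}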
> 
> I guess not handling CSEing the intermediate permutes only really
> matters for pattern matching?  Those could be eliminated in optimize_slp?
> 
> > 
> > Comments and suggestions welcome, esp. what representation
> > you'd think is suitable for SLP pattern matching to
> > load/store-lane and how to represent that?  Maybe this lowering
> > should happen directly in vect_lower_load_permutations?
> 
> I like this representation personally; I'd say having the permute
> explicit, at least until optimize_slp, would make pattern matching
> easier.
> 
> We wouldn't need hacks such as optimize_load_redistribution.
> In that sense, does it make sense to eventually just lower all permuted
> loads?

Yes.  One of my ideas was that when we get rid of non-SLP, the
vectorizable_* routines themselves can apply transforms to their SLP
node, say turn a vector load with a load-permutation into a scalar
load + splat.
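For the splat case that transform is tiny; as a rough sketch in plain
GNU C vector extensions (an illustration only, not vectorizer IL, with a
hypothetical helper name), a load whose load-permutation merely
duplicates one lane of the group becomes a scalar load plus a splat:

typedef int v4si __attribute__ ((vector_size (16)));

/* An SLP load node with load-permutation { 3, 3, 3, 3 } does not need
   a vector load of the whole group; the single used element suffices.  */
v4si splat_lane_3 (const int *b)
{
  int t = b[3];                  /* scalar load of the one used lane */
  return (v4si) { t, t, t, t };  /* splat */
}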
Note that all the current patches try to preserve as much as possible
of the (missed) capabilities of the current mix of SLP and non-SLP.
This is why I didn't try changing SLP discovery for loads, as I did in
earlier attempts to arrive at similar results, but do this as a
followup lowering.

The one thing that permuted loads (SLP loads with load_permutation)
are good for is to avoid needing to represent "unused" lanes.

Richard.

> 
> Cheers,
> Tamar
> 
> > 
> > Thanks,
> > Richard.
> > 
> > 	* tree-vect-slp.cc (vllp_cmp): New function.
> > 	(vect_lower_load_permutations): Likewise.
> > 	(vect_analyze_slp): Call it.
> > ---
> >  gcc/tree-vect-slp.cc | 279 +++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 279 insertions(+)
> > 
> > diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
> > index 7e3d0107b4e..766b773452f 100644
> > --- a/gcc/tree-vect-slp.cc
> > +++ b/gcc/tree-vect-slp.cc
> > @@ -3839,6 +3839,279 @@ vect_analyze_slp_instance (vec_info *vinfo,
> >    return res;
> >  }
> > 
> > +/* qsort comparator ordering SLP load nodes.  */
> > +
> > +static int
> > +vllp_cmp (const void *a_, const void *b_)
> > +{
> > +  const slp_tree a = *(const slp_tree *)a_;
> > +  const slp_tree b = *(const slp_tree *)b_;
> > +  stmt_vec_info a0 = SLP_TREE_SCALAR_STMTS (a)[0];
> > +  stmt_vec_info b0 = SLP_TREE_SCALAR_STMTS (b)[0];
> > +  if (STMT_VINFO_GROUPED_ACCESS (a0)
> > +      && STMT_VINFO_GROUPED_ACCESS (b0)
> > +      && DR_GROUP_FIRST_ELEMENT (a0) == DR_GROUP_FIRST_ELEMENT (b0))
> > +    {
> > +      /* Same group, order after lanes used.  */
> > +      if (SLP_TREE_LANES (a) < SLP_TREE_LANES (b))
> > +        return 1;
> > +      else if (SLP_TREE_LANES (a) > SLP_TREE_LANES (b))
> > +        return -1;
> > +      else
> > +        {
> > +          /* Try to order loads using the same lanes together, breaking
> > +             the tie with the lane number that first differs.  */
> > +          if (!SLP_TREE_LOAD_PERMUTATION (a).exists ()
> > +              && !SLP_TREE_LOAD_PERMUTATION (b).exists ())
> > +            return 0;
> > +          else if (SLP_TREE_LOAD_PERMUTATION (a).exists ()
> > +                   && !SLP_TREE_LOAD_PERMUTATION (b).exists ())
> > +            return 1;
> > +          else if (!SLP_TREE_LOAD_PERMUTATION (a).exists ()
> > +                   && SLP_TREE_LOAD_PERMUTATION (b).exists ())
> > +            return -1;
> > +          else
> > +            {
> > +              for (unsigned i = 0; i < SLP_TREE_LANES (a); ++i)
> > +                if (SLP_TREE_LOAD_PERMUTATION (a)[i]
> > +                    != SLP_TREE_LOAD_PERMUTATION (b)[i])
> > +                  {
> > +                    /* In-order lane first, that's what the above case for
> > +                       no permutation does.  */
> > +                    if (SLP_TREE_LOAD_PERMUTATION (a)[i] == i)
> > +                      return -1;
> > +                    else if (SLP_TREE_LOAD_PERMUTATION (b)[i] == i)
> > +                      return 1;
> > +                    else if (SLP_TREE_LOAD_PERMUTATION (a)[i]
> > +                             < SLP_TREE_LOAD_PERMUTATION (b)[i])
> > +                      return -1;
> > +                    else
> > +                      return 1;
> > +                  }
> > +              return 0;
> > +            }
> > +        }
> > +    }
> > +  else /* Different groups or non-groups.  */
> > +    {
> > +      /* Order groups as their first element to keep them together.  */
> > +      if (STMT_VINFO_GROUPED_ACCESS (a0))
> > +        a0 = DR_GROUP_FIRST_ELEMENT (a0);
> > +      if (STMT_VINFO_GROUPED_ACCESS (b0))
> > +        b0 = DR_GROUP_FIRST_ELEMENT (b0);
> > +      if (a0 == b0)
> > +        return 0;
> > +      /* Tie using UID.  */
> > +      else if (gimple_uid (STMT_VINFO_STMT (a0))
> > +               < gimple_uid (STMT_VINFO_STMT (b0)))
> > +        return -1;
> > +      else
> > +        {
> > +          gcc_assert (gimple_uid (STMT_VINFO_STMT (a0))
> > +                      != gimple_uid (STMT_VINFO_STMT (b0)));
> > +          return 1;
> > +        }
> > +    }
> > +}
> > +
> > +/* Process the set of LOADS that are all from the same dataref group.  */
> > +
> > +static void
> > +vect_lower_load_permutations (loop_vec_info loop_vinfo,
> > +                              scalar_stmts_to_slp_tree_map_t *bst_map,
> > +                              const array_slice<slp_tree> &loads)
> > +{
> > +  /* We at this point want to lower without a fixed VF or vector
> > +     size in mind which means we cannot actually compute whether we
> > +     need three or more vectors for a load permutation yet.  So always
> > +     lower.  */
> > +  stmt_vec_info first
> > +    = DR_GROUP_FIRST_ELEMENT (SLP_TREE_SCALAR_STMTS (loads[0])[0]);
> > +
> > +  /* ??? In principle we have to consider a gap up to the next full
> > +     vector, but we have to actually represent a scalar stmt for the
> > +     gap's value so delay handling this.  The same is true for
> > +     in-between gaps which the load places in the load-permutation
> > +     represent.  It's probably not worth trying an intermediate packing
> > +     to vectors without gap even if that might handle some more cases.
> > +     Instead get the gap case correct in some way.  */
> > +  unsigned group_lanes = 0;
> > +  for (stmt_vec_info s = first; s; s = DR_GROUP_NEXT_ELEMENT (s))
> > +    {
> > +      if ((s == first && DR_GROUP_GAP (s) != 0)
> > +          || (s != first && DR_GROUP_GAP (s) != 1))
> > +        return;
> > +      group_lanes++;
> > +    }
> > +  /* Only a power-of-two number of lanes matches interleaving.  */
> > +  if (exact_log2 (group_lanes) == -1)
> > +    return;
> > +
> > +  for (slp_tree load : loads)
> > +    {
> > +      /* Leave masked or gather loads alone for now.  */
> > +      if (!SLP_TREE_CHILDREN (load).is_empty ())
> > +        continue;
> > +
> > +      /* We need to lower only loads of less than half of the group's
> > +         lanes, including duplicate lanes.  */
> > +      if (SLP_TREE_LANES (load) >= group_lanes / 2)
> > +        continue;
> > +
> > +      /* Lower by reducing the group to half its size using an
> > +         interleaving scheme.  For this try to compute whether all
> > +         elements needed for this load are in even or odd elements of
> > +         an even/odd decomposition with N consecutive elements.
> > +         Thus { e, e, o, o, e, e, o, o } would be an even/odd decomposition
> > +         with N == 2.  */
> > +      unsigned even = (1 << ceil_log2 (DR_GROUP_SIZE (first))) - 1;
> > +      unsigned odd = even;
> > +      for (unsigned l : SLP_TREE_LOAD_PERMUTATION (load))
> > +        {
> > +          even &= ~l;
> > +          odd &= l;
> > +        }
> > +      /* Give up when this doesn't match up with an interleaving scheme.  */
> > +      if (!even && !odd)
> > +        continue;
> > +
> > +      /* First build (and possibly re-use) a load node for the
> > +         unpermuted group.  */
> > +      vec<stmt_vec_info> stmts;
> > +      stmts.create (group_lanes);
> > +      for (stmt_vec_info s = first; s; s = DR_GROUP_NEXT_ELEMENT (s))
> > +        stmts.quick_push (s);
> > +      poly_uint64 max_nunits;
> > +      bool *matches = XALLOCAVEC (bool, group_lanes);
> > +      unsigned limit = 1;
> > +      unsigned tree_size = 0;
> > +      slp_tree l0 = vect_build_slp_tree (loop_vinfo, stmts,
> > +                                         group_lanes,
> > +                                         &max_nunits, matches, &limit,
> > +                                         &tree_size, bst_map);
> > +
> > +      /* Build the permute to get the original load permutation order.  */
> > +      lane_permutation_t final_perm;
> > +      final_perm.create (SLP_TREE_LANES (load));
> > +      for (unsigned i = 0; i < SLP_TREE_LANES (load); ++i)
> > +        final_perm.quick_push
> > +          (std::make_pair (0, SLP_TREE_LOAD_PERMUTATION (load)[i]));
> > +
> > +      /* Now build an even or odd extraction from the unpermuted load.  */
> > +      lane_permutation_t perm;
> > +      perm.create (group_lanes / 2);
> > +      unsigned level;
> > +      if (even
> > +          && ((level = 1 << ctz_hwi (even)), true)
> > +          && group_lanes % (2 * level) == 0)
> > +        {
> > +          /* { 0, 1, ... 4, 5 ..., } */
> > +          unsigned level = 1 << ctz_hwi (even);
> > +          for (unsigned i = 0; i < group_lanes / 2 / level; ++i)
> > +            for (unsigned j = 0; j < level; ++j)
> > +              perm.quick_push (std::make_pair (0, 2 * i * level + j));
> > +        }
> > +      else if (odd)
> > +        {
> > +          /* { ..., 2, 3, ... 6, 7 } */
> > +          unsigned level = 1 << ctz_hwi (odd);
> > +          gcc_assert (group_lanes % (2 * level) == 0);
> > +          for (unsigned i = 0; i < group_lanes / 2 / level; ++i)
> > +            for (unsigned j = 0; j < level; ++j)
> > +              perm.quick_push (std::make_pair (0, (2 * i + 1) * level + j));
> > +        }
> > +      else
> > +        gcc_unreachable ();
> > +
> > +      /* And update final_perm.  */
> > +      for (unsigned i = 0; i < SLP_TREE_LANES (load); ++i)
> > +        {
> > +          unsigned l = final_perm[i].second;
> > +          unsigned j;
> > +          for (j = 0; j < perm.length (); ++j)
> > +            if (perm[j].second == l)
> > +              {
> > +                final_perm[i].second = j;
> > +                break;
> > +              }
> > +          gcc_assert (j < perm.length ());
> > +        }
> > +
> > +      slp_tree p = vect_create_new_slp_node (1, VEC_PERM_EXPR);
> > +      SLP_TREE_CHILDREN (p).quick_push (l0);
> > +      SLP_TREE_LANE_PERMUTATION (p) = perm;
> > +      SLP_TREE_VECTYPE (p) = SLP_TREE_VECTYPE (load);
> > +      SLP_TREE_LANES (p) = perm.length ();
> > +      SLP_TREE_REPRESENTATIVE (p) = SLP_TREE_REPRESENTATIVE (load);
> > +      /* ??? We should have scalar stmts for this and use bst_map
> > +         to CSE.  But we do not want to pick up original SLP load
> > +         nodes with a load-permutation here.  */
> > +      /* ??? We need to iterate if group_lanes / 2 is still too large.  */
> > +      /* ??? Ideally pick the best even/odd scheme usable for
> > +         most of the loads.  -> do a multi-step scheme?  */
> > +
> > +      /* And finally from the ordered reduction node create the
> > +         permute to shuffle the lanes into the original load-permutation
> > +         order.  We replace the original load node with this.  */
> > +      SLP_TREE_CODE (load) = VEC_PERM_EXPR;
> > +      SLP_TREE_LOAD_PERMUTATION (load).release ();
> > +      SLP_TREE_LANE_PERMUTATION (load) = final_perm;
> > +      SLP_TREE_CHILDREN (load).create (1);
> > +      SLP_TREE_CHILDREN (load).quick_push (p);
> > +    }
> > +}
> > +
> > +/* Transform SLP loads in the SLP graph created by SLP discovery to
> > +   group loads from the same group and lower load permutations that
> > +   are unlikely to be supported into a series of permutes.
> > +   In the degenerate case of having only single-lane SLP instances
> > +   this should result in a series of permute nodes emulating an
> > +   interleaving scheme.  */
> > +
> > +static void
> > +vect_lower_load_permutations (loop_vec_info loop_vinfo,
> > +                              scalar_stmts_to_slp_tree_map_t *bst_map)
> > +{
> > +  /* Gather and sort loads across all instances.  */
> > +  hash_set<slp_tree> visited;
> > +  auto_vec<slp_tree> loads;
> > +  for (auto inst : loop_vinfo->slp_instances)
> > +    vect_gather_slp_loads (loads, SLP_INSTANCE_TREE (inst), visited);
> > +  if (loads.is_empty ())
> > +    return;
> > +  loads.qsort (vllp_cmp);
> > +
> > +  /* Now process each dataref group separately.  */
> > +  unsigned firsti = 0;
> > +  for (unsigned i = 1; i < loads.length (); ++i)
> > +    {
> > +      slp_tree first = loads[firsti];
> > +      slp_tree next = loads[i];
> > +      stmt_vec_info a0 = SLP_TREE_SCALAR_STMTS (first)[0];
> > +      stmt_vec_info b0 = SLP_TREE_SCALAR_STMTS (next)[0];
> > +      if (STMT_VINFO_GROUPED_ACCESS (a0)
> > +          && STMT_VINFO_GROUPED_ACCESS (b0)
> > +          && DR_GROUP_FIRST_ELEMENT (a0) == DR_GROUP_FIRST_ELEMENT (b0))
> > +        continue;
> > +      /* Just one SLP load of a possible group, leave those alone.  */
> > +      if (i == firsti + 1)
> > +        {
> > +          firsti = i;
> > +          continue;
> > +        }
> > +      /* Now we have multiple SLP loads of the same group from
> > +         firsti to i - 1.  */
> > +      vect_lower_load_permutations (loop_vinfo, bst_map,
> > +                                    make_array_slice (&loads[firsti],
> > +                                                      i - firsti));
> > +      firsti = i;
> > +    }
> > +  if (firsti < loads.length () - 1)
> > +    vect_lower_load_permutations (loop_vinfo, bst_map,
> > +                                  make_array_slice (&loads[firsti],
> > +                                                    loads.length () - firsti));
> > +}
> > +
> >  /* Check if there are stmts in the loop can be vectorized using SLP.  Build SLP
> >     trees of packed scalar stmts if SLP is possible.  */
> > 
> > @@ -3982,6 +4255,12 @@ vect_analyze_slp (vec_info *vinfo, unsigned max_tree_size)
> >  	}
> >      }
> > 
> > +  /* When we end up with load permutations that we cannot possibly handle,
> > +     like those requiring three vector inputs, lower them using
> > +     interleaving-like schemes.  */
> > +  if (loop_vec_info loop_vinfo = dyn_cast <loop_vec_info> (vinfo))
> > +    vect_lower_load_permutations (loop_vinfo, bst_map);
> > +
> >    hash_set<slp_tree> visited_patterns;
> >    slp_tree_to_load_perm_map_t perm_cache;
> >    slp_compat_nodes_map_t compat_cache;
> > --
> > 2.35.3
> 

-- 
Richard Biener
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)