Date: Wed, 3 Aug 2022 11:23:37 +0000 (UTC)
From: Richard Biener
To: Richard Sandiford
cc: gcc-patches@gcc.gnu.org
Subject: Re: [PATCH] RFC: Extend SLP permutation optimisations

On Tue, 2 Aug 2022, Richard Sandiford wrote:

> Currently SLP tries to force permute operations "down" the graph
> from loads in the hope of reducing the total number of permutes
> needed or (in the best case) removing the need for the permutes
> entirely.  This patch tries to extend it as follows:
>
> - Allow loads to take a different permutation from the one they
>   started with, rather than choosing between "original permutation"
>   and "no permutation".
>
> - Allow changes in both directions, if the target supports the
>   reverse permute operation.
>
> - Treat the placement of permute operations as a two-way dataflow
>   problem: after propagating information from leaves to roots (as now),
>   propagate information back up the graph.
>
> - Take execution frequency into account when optimising for speed,
>   so that (for example) permutes inside loops have a higher cost
>   than permutes outside loops.
>
> - Try to reduce the total number of permutes when optimising for
>   size, even if that increases the number of permutes on a given
>   execution path.
>
> See the big block comment above vect_optimize_slp_pass for
> a detailed description.
>
> A while back I posted a patch to extend the existing optimisation
> to non-consecutive loads.  This patch doesn't include that change
> (although it's a possible future addition).
>
> The original motivation for doing this was to add a framework that would
> allow other layout differences in future.  The two main ones are:
>
> - Make it easier to represent predicated operations, including
>   predicated operations with gaps.  E.g.:
>
>      a[0] += 1;
>      a[1] += 1;
>      a[3] += 1;
>
>   could be a single load/add/store for SVE.  We could handle this
>   by representing a layout such as { 0, 1, _, 2 } or { 0, 1, _, 3 }
>   (depending on what's being counted).  We might need to move
>   elements between lanes at various points, like with permutes.
>
>   (This would first mean adding support for stores with gaps.)
> > - Make it easier to switch between an even/odd and unpermuted layout > when switching between wide and narrow elements. E.g. if a widening > operation produces an even vector and an odd vector, we should try > to keep operations on the wide elements in that order rather than > force them to be permuted back "in order". > > To give some examples of what the current patch does: > > int f1(int *__restrict a, int *__restrict b, int *__restrict c, > int *__restrict d) > { > a[0] = (b[1] << c[3]) - d[1]; > a[1] = (b[0] << c[2]) - d[0]; > a[2] = (b[3] << c[1]) - d[3]; > a[3] = (b[2] << c[0]) - d[2]; > } > > continues to produce the same code as before when optimising for > speed: b, c and d are permuted at load time. But when optimising > for size we instead permute c into the same order as b+d and then > permute the result of the arithmetic into the same order as a: > > ldr q1, [x2] > ldr q0, [x1] > ext v1.16b, v1.16b, v1.16b, #8 // <------ > sshl v0.4s, v0.4s, v1.4s > ldr q1, [x3] > sub v0.4s, v0.4s, v1.4s > rev64 v0.4s, v0.4s // <------ > str q0, [x0] > ret > > The following function: > > int f2(int *__restrict a, int *__restrict b, int *__restrict c, > int *__restrict d) > { > a[0] = (b[3] << c[3]) - d[3]; > a[1] = (b[2] << c[2]) - d[2]; > a[2] = (b[1] << c[1]) - d[1]; > a[3] = (b[0] << c[0]) - d[0]; > } > > continues to push the reverse down to just before the store, > like the current code does. > > In: > > int f3(int *__restrict a, int *__restrict b, int *__restrict c, > int *__restrict d) > { > for (int i = 0; i < 100; ++i) > { > a[0] = (a[0] + c[3]); > a[1] = (a[1] + c[2]); > a[2] = (a[2] + c[1]); > a[3] = (a[3] + c[0]); > c += 4; > } > } > > the loads of a are hoisted and the stores of a are sunk, so that > only the load from c happens in the loop. When optimising for > speed, we prefer to have the loop operate on the reversed layout, > changing on entry and exit from the loop: > > mov x3, x0 > adrp x0, .LC0 > add x1, x2, 1600 > ldr q2, [x0, #:lo12:.LC0] > ldr q0, [x3] > mov v1.16b, v0.16b > tbl v0.16b, {v0.16b - v1.16b}, v2.16b // <-------- > .p2align 3,,7 > .L6: > ldr q1, [x2], 16 > add v0.4s, v0.4s, v1.4s > cmp x2, x1 > bne .L6 > mov v1.16b, v0.16b > adrp x0, .LC0 > ldr q2, [x0, #:lo12:.LC0] > tbl v0.16b, {v0.16b - v1.16b}, v2.16b // <-------- > str q0, [x3] > ret > > Similarly, for the very artificial testcase: > > int f4(int *__restrict a, int *__restrict b, int *__restrict c, > int *__restrict d) > { > int a0 = a[0]; > int a1 = a[1]; > int a2 = a[2]; > int a3 = a[3]; > for (int i = 0; i < 100; ++i) > { > a0 ^= c[0]; > a1 ^= c[1]; > a2 ^= c[2]; > a3 ^= c[3]; > c += 4; > for (int j = 0; j < 100; ++j) > { > a0 += d[1]; > a1 += d[0]; > a2 += d[3]; > a3 += d[2]; > d += 4; > } > b[0] = a0; > b[1] = a1; > b[2] = a2; > b[3] = a3; > b += 4; > } > a[0] = a0; > a[1] = a1; > a[2] = a2; > a[3] = a3; > } > > the a vector in the inner loop maintains the order { 1, 0, 3, 2 }, > even though it's part of an SCC that includes the outer loop. > In other words, this is a motivating case for not assigning > permutes at SCC granularity. 
> The code we get is:
>
>         ldr     q0, [x0]
>         mov     x4, x1
>         mov     x5, x0
>         add     x1, x3, 1600
>         add     x3, x4, 1600
>         .p2align 3,,7
> .L11:
>         ldr     q1, [x2], 16
>         sub     x0, x1, #1600
>         eor     v0.16b, v1.16b, v0.16b
>         rev64   v0.4s, v0.4s              // <---
>         .p2align 3,,7
> .L10:
>         ldr     q1, [x0], 16
>         add     v0.4s, v0.4s, v1.4s
>         cmp     x0, x1
>         bne     .L10
>         rev64   v0.4s, v0.4s              // <---
>         add     x1, x0, 1600
>         str     q0, [x4], 16
>         cmp     x3, x4
>         bne     .L11
>         str     q0, [x5]
>         ret
>
> The patch is still incomplete, hence the FIXMEs, but it's at the stage
> where it passes bootstrap & regtest on aarch64-linux-gnu.  (There's one
> regression in slp-11b.c, but I think it's arguably restoring the
> expected load-lanes behaviour.)  Does this look like a plausible
> direction, before I go filling in the FIXME parts?

Yes, it looks nice!  I wonder if you have spent much thought on layout
optimizing for SLP patterns and/or how to integrate the two?

Thanks,
Richard.

> Thanks,
> Richard
>
> ---
>  gcc/graphds.cc        |   13 +-
>  gcc/graphds.h         |    3 +-
>  gcc/hash-map-traits.h |   74 +-
>  gcc/hash-traits.h     |   96 ++-
>  gcc/tree-vect-slp.cc  | 1870 ++++++++++++++++++++++++++++++++---------
>  5 files changed, 1599 insertions(+), 457 deletions(-)
>
> *** /tmp/iBbApg_graphds.cc	2022-08-02 08:16:23.975354155 +0100
> --- gcc/graphds.cc	2022-08-01 20:44:22.287691305 +0100
> ***************
> *** 281,287 ****
>      numbers assigned by the previous pass.  If SUBGRAPH is not NULL, it
>      specifies the subgraph of G whose strongly connected components we want
>      to determine.  If SKIP_EDGE_P is not NULL, it points to a callback function.
> !    Edge E will be skipped if callback function returns true.
>   
>      After running this function, v->component is the number of the strongly
>      connected component for each vertex of G.  Returns the number of the
> --- 281,294 ----
>      numbers assigned by the previous pass.  If SUBGRAPH is not NULL, it
>      specifies the subgraph of G whose strongly connected components we want
>      to determine.  If SKIP_EDGE_P is not NULL, it points to a callback function.
> !    Edge E will be skipped if callback function returns true.  If SCC_GROUPING
> !    is not null, the nodes will be added to it in the following order:
> ! 
> !    - If SCC A is a direct or indirect predecessor of SCC B in the SCC dag,
> !      A's nodes come before B's nodes.
> ! 
> !    - All of an SCC's nodes are listed consecutively, although the order
> !      of the nodes within an SCC is not really meaningful.
>   
>      After running this function, v->component is the number of the strongly
>      connected component for each vertex of G.  Returns the number of the
> ***************
> *** 289,295 ****
>   
>   int
>   graphds_scc (struct graph *g, bitmap subgraph,
> ! 	     skip_edge_callback skip_edge_p)
>   {
>     int *queue = XNEWVEC (int, g->n_vertices);
>     vec<int> postorder = vNULL;
> --- 296,302 ----
>   
>   int
>   graphds_scc (struct graph *g, bitmap subgraph,
> ! 	     skip_edge_callback skip_edge_p, vec<int> *scc_grouping)
>   {
>     int *queue = XNEWVEC (int, g->n_vertices);
>     vec<int> postorder = vNULL;
> ***************
> *** 317,323 ****
>   
>     for (i = 0; i < nq; i++)
>       queue[i] = postorder[nq - i - 1];
> !   comp = graphds_dfs (g, queue, nq, NULL, true, subgraph, skip_edge_p);
>   
>     free (queue);
>     postorder.release ();
> --- 324,330 ----
>   
>     for (i = 0; i < nq; i++)
>       queue[i] = postorder[nq - i - 1];
> !   comp = graphds_dfs (g, queue, nq, scc_grouping, true, subgraph, skip_edge_p);
>   
>     free (queue);
>     postorder.release ();
> *** /tmp/eRUAEk_graphds.h	2022-08-02 08:16:23.987354123 +0100
> --- gcc/graphds.h	2022-08-01 20:44:18.835700593 +0100
> ***************
> *** 58,64 ****
>   typedef bool (*skip_edge_callback) (struct graph_edge *);
>   int graphds_dfs (struct graph *, int *, int,
>   		 vec<int> *, bool, bitmap, skip_edge_callback = NULL);
> ! int graphds_scc (struct graph *, bitmap, skip_edge_callback = NULL);
>   void graphds_domtree (struct graph *, int, int *, int *, int *);
>   typedef void (*graphds_edge_callback) (struct graph *,
>   				       struct graph_edge *, void *);
> --- 58,65 ----
>   typedef bool (*skip_edge_callback) (struct graph_edge *);
>   int graphds_dfs (struct graph *, int *, int,
>   		 vec<int> *, bool, bitmap, skip_edge_callback = NULL);
> ! int graphds_scc (struct graph *, bitmap, skip_edge_callback = NULL,
> ! 		 vec<int> * = NULL);
>   void graphds_domtree (struct graph *, int, int *, int *, int *);
>   typedef void (*graphds_edge_callback) (struct graph *,
>   				       struct graph_edge *, void *);
> *** /tmp/4zWIek_hash-map-traits.h	2022-08-02 08:16:23.995354102 +0100
> --- gcc/hash-map-traits.h	2022-08-01 20:44:22.287691305 +0100
> ***************
> *** 105,118 ****
>     static const bool maybe_mx = false;
>   };
>   
> ! /* Implement traits for a hash_map with values of type Value for cases
> !    in which the key cannot represent empty and deleted slots.  Instead
> !    record empty and deleted entries in Value.  Derived classes must
> !    implement the hash and equal_keys functions.  */
>   
> ! template <typename Value>
>   struct unbounded_hashmap_traits
>   {
>     template <typename T> static inline void remove (T &);
>     static const bool empty_zero_p = default_hash_traits <Value>::empty_zero_p;
>     template <typename T> static inline bool is_empty (const T &);
> --- 105,123 ----
>     static const bool maybe_mx = false;
>   };
>   
> ! /* Implement traits for a hash_map with keys of type Key and values of
> !    type Value for cases in which the key cannot represent empty and
> !    deleted slots.  Instead record empty and deleted entries in Value.  */
>   
> ! template <typename Key, typename Value>
>   struct unbounded_hashmap_traits
>   {
> +   typedef typename Key::value_type key_type;
> + 
> +   static hashval_t hash (const typename Key::value_type &);
> +   static bool equal_keys (const typename Key::value_type &,
> + 			  const typename Key::compare_type &);
> + 
>     template <typename T> static inline void remove (T &);
>     static const bool empty_zero_p = default_hash_traits <Value>::empty_zero_p;
>     template <typename T> static inline bool is_empty (const T &);
> ***************
> *** 121,162 ****
>     template <typename T> static inline void mark_deleted (T &);
>   };
>   
> ! template <typename Value>
>   template <typename T>
>   inline void
> ! unbounded_hashmap_traits <Value>::remove (T &entry)
>   {
>     default_hash_traits <Value>::remove (entry.m_value);
>   }
>   
> ! template <typename Value>
>   template <typename T>
>   inline bool
> ! unbounded_hashmap_traits <Value>::is_empty (const T &entry)
>   {
>     return default_hash_traits <Value>::is_empty (entry.m_value);
>   }
>   
> ! template <typename Value>
>   template <typename T>
>   inline bool
> ! unbounded_hashmap_traits <Value>::is_deleted (const T &entry)
>   {
>     return default_hash_traits <Value>::is_deleted (entry.m_value);
>   }
>   
> ! template <typename Value>
>   template <typename T>
>   inline void
> ! unbounded_hashmap_traits <Value>::mark_empty (T &entry)
>   {
>     default_hash_traits <Value>::mark_empty (entry.m_value);
>   }
>   
> ! template <typename Value>
>   template <typename T>
>   inline void
> ! unbounded_hashmap_traits <Value>::mark_deleted (T &entry)
>   {
>     default_hash_traits <Value>::mark_deleted (entry.m_value);
>   }
> --- 126,184 ----
>     template <typename T> static inline void mark_deleted (T &);
>   };
>   
> ! template <typename Key, typename Value>
> ! inline hashval_t
> ! unbounded_hashmap_traits <Key, Value>
> ! ::hash (const typename Key::value_type &key)
> ! {
> !   return Key::hash (key);
> ! }
> ! 
> ! template <typename Key, typename Value>
> ! inline bool
> ! unbounded_hashmap_traits <Key, Value>
> ! ::equal_keys (const typename Key::value_type &x,
> ! 	      const typename Key::compare_type &y)
> ! {
> !   return Key::equal (x, y);
> ! }
> ! 
> ! template <typename Key, typename Value>
>   template <typename T>
>   inline void
> ! unbounded_hashmap_traits <Key, Value>::remove (T &entry)
>   {
>     default_hash_traits <Value>::remove (entry.m_value);
>   }
>   
> ! template <typename Key, typename Value>
>   template <typename T>
>   inline bool
> ! unbounded_hashmap_traits <Key, Value>::is_empty (const T &entry)
>   {
>     return default_hash_traits <Value>::is_empty (entry.m_value);
>   }
>   
> ! template <typename Key, typename Value>
>   template <typename T>
>   inline bool
> ! unbounded_hashmap_traits <Key, Value>::is_deleted (const T &entry)
>   {
>     return default_hash_traits <Value>::is_deleted (entry.m_value);
>   }
>   
> ! template <typename Key, typename Value>
>   template <typename T>
>   inline void
> ! unbounded_hashmap_traits <Key, Value>::mark_empty (T &entry)
>   {
>     default_hash_traits <Value>::mark_empty (entry.m_value);
>   }
>   
> ! template <typename Key, typename Value>
>   template <typename T>
>   inline void
> ! unbounded_hashmap_traits <Key, Value>::mark_deleted (T &entry)
>   {
>     default_hash_traits <Value>::mark_deleted (entry.m_value);
>   }
> ***************
> *** 166,190 ****
>      slots.  */
>   
>   template <typename Key, typename Value>
> ! struct unbounded_int_hashmap_traits : unbounded_hashmap_traits <Value>
> ! {
> !   typedef Key key_type;
> !   static inline hashval_t hash (Key);
> !   static inline bool equal_keys (Key, Key);
> ! };
> ! 
> ! template <typename Key, typename Value>
> ! inline hashval_t
> ! unbounded_int_hashmap_traits <Key, Value>::hash (Key k)
> ! {
> !   return k;
> ! }
> ! 
> ! template <typename Key, typename Value>
> ! inline bool
> ! unbounded_int_hashmap_traits <Key, Value>::equal_keys (Key k1, Key k2)
> ! {
> !   return k1 == k2;
> ! }
>   
>   #endif // HASH_MAP_TRAITS_H
> --- 188,194 ----
>      slots.  */
>   
>   template <typename Key, typename Value>
> ! using unbounded_int_hashmap_traits
> !   = unbounded_hashmap_traits <int_hash_base <Key>, Value>;
>   
>   #endif // HASH_MAP_TRAITS_H
> *** /tmp/FL7vGb_hash-traits.h	2022-08-02 08:16:24.011354061 +0100
> --- gcc/hash-traits.h	2022-08-01 20:44:22.287691305 +0100
> ***************
> *** 85,125 ****
>   {
>   }
>   
>   
> ! /* Hasher for integer type Type in which Empty is a spare value that can be
> !    used to mark empty slots.  If Deleted != Empty then Deleted is another
> !    spare value that can be used for deleted slots; if Deleted == Empty then
> !    hash table entries cannot be deleted.  */
> ! 
> ! template <typename Type, Type Empty, Type Deleted = Empty>
> ! struct int_hash : typed_noop_remove <Type>
>   {
>     typedef Type value_type;
>     typedef Type compare_type;
>   
>     static inline hashval_t hash (value_type);
>     static inline bool equal (value_type existing, value_type candidate);
> -   static inline void mark_deleted (Type &);
> -   static const bool empty_zero_p = Empty == 0;
> -   static inline void mark_empty (Type &);
> -   static inline bool is_deleted (Type);
> -   static inline bool is_empty (Type);
>   };
>   
> ! template <typename Type, Type Empty, Type Deleted>
>   inline hashval_t
> ! int_hash <Type, Empty, Deleted>::hash (value_type x)
>   {
>     return x;
>   }
>   
> ! template <typename Type, Type Empty, Type Deleted>
>   inline bool
> ! int_hash <Type, Empty, Deleted>::equal (value_type x, value_type y)
>   {
>     return x == y;
>   }
>   
>   template <typename Type, Type Empty, Type Deleted>
>   inline void
>   int_hash <Type, Empty, Deleted>::mark_deleted (Type &x)
> --- 85,135 ----
>   {
>   }
>   
> + /* Base traits for integer type Type, providing just the hash and
> +    comparison functionality.  */
>   
> ! template <typename Type>
> ! struct int_hash_base : typed_noop_remove <Type>
>   {
>     typedef Type value_type;
>     typedef Type compare_type;
>   
>     static inline hashval_t hash (value_type);
>     static inline bool equal (value_type existing, value_type candidate);
>   };
>   
> ! template <typename Type>
>   inline hashval_t
> ! int_hash_base <Type>::hash (value_type x)
>   {
>     return x;
>   }
>   
> ! template <typename Type>
>   inline bool
> ! int_hash_base <Type>::equal (value_type x, value_type y)
>   {
>     return x == y;
>   }
>   
> + /* Hasher for integer type Type in which Empty is a spare value that can be
> +    used to mark empty slots.  If Deleted != Empty then Deleted is another
> +    spare value that can be used for deleted slots; if Deleted == Empty then
> +    hash table entries cannot be deleted.  */
> + 
> + template <typename Type, Type Empty, Type Deleted = Empty>
> + struct int_hash : int_hash_base <Type>
> + {
> +   typedef Type value_type;
> +   typedef Type compare_type;
> + 
> +   static inline void mark_deleted (Type &);
> +   static const bool empty_zero_p = Empty == 0;
> +   static inline void mark_empty (Type &);
> +   static inline bool is_deleted (Type);
> +   static inline bool is_empty (Type);
> + };
> + 
>   template <typename Type, Type Empty, Type Deleted>
>   inline void
>   int_hash <Type, Empty, Deleted>::mark_deleted (Type &x)
> ***************
> *** 398,403 ****
> --- 408,467 ----
>       return T1::is_empty (x.first);
>   }
>   
> + /* Base traits for vectors, providing just the hash and comparison
> +    functionality.  Type gives the traits for the element type.  */
> + 
> + template <typename Type>
> + struct vec_hash_base
> + {
> +   typedef vec<typename Type::value_type> value_type;
> +   typedef vec<typename Type::compare_type> compare_type;
> + 
> +   static inline hashval_t hash (value_type);
> +   static inline bool equal (value_type, compare_type);
> + };
> + 
> + template <typename Type>
> + inline hashval_t
> + vec_hash_base <Type>::hash (value_type x)
> + {
> +   inchash::hash hstate;
> +   hstate.add_int (x.length ());
> +   for (auto &value : x)
> +     hstate.merge_hash (Type::hash (value));
> +   return hstate.end ();
> + }
> + 
> + template <typename Type>
> + inline bool
> + vec_hash_base <Type>::equal (value_type x, compare_type y)
> + {
> +   if (x.length () != y.length ())
> +     return false;
> +   for (unsigned int i = 0; i < x.length (); ++i)
> +     if (!Type::equal (x[i], y[i]))
> +       return false;
> +   return true;
> + }
> + 
> + /* Traits for vectors whose contents should be freed normally.  */
> + 
> + template <typename Type>
> + struct vec_free_hash_base : vec_hash_base <Type>
> + {
> +   static void remove (typename vec_hash_base <Type>::value_type &);
> + };
> + 
> + template <typename Type>
> + void
> + vec_free_hash_base <Type>
> + ::remove (typename vec_hash_base <Type>::value_type &x)
> + {
> +   for (auto &value : x)
> +     Type::remove (value);
> +   x.release ();
> + }
> + 
>   template <typename T> struct default_hash_traits : T {};
>   
>   template <typename T>
> *** /tmp/5ek0ya_tree-vect-slp.cc	2022-08-02 08:16:24.027354017 +0100
> --- gcc/tree-vect-slp.cc	2022-08-01 20:44:22.631690380 +0100
> ***************
> *** 20,25 ****
> --- 20,26 ----
>      <http://www.gnu.org/licenses/>.  */
>   
>   #include "config.h"
> + #define INCLUDE_ALGORITHM
>   #include "system.h"
>   #include "coretypes.h"
>   #include "backend.h"
> ***************
> *** 49,54 ****
> --- 50,57 ----
>   #include "tree-eh.h"
>   #include "tree-cfg.h"
>   #include "alloc-pool.h"
> + #include "sreal.h"
> + #include "predict.h"
>   
>   static bool vectorizable_slp_permutation (vec_info *, gimple_stmt_iterator *,
>   					  slp_tree, stmt_vector_for_cost *);
> ***************
> *** 304,309 ****
> --- 307,322 ----
>     oprnds_info.release ();
>   }
>   
> + /* Return the execution frequency of NODE (so that a higher value indicates
> +    a "more important" node when optimizing for speed).  */
> + 
> + static sreal
> + vect_slp_node_weight (slp_tree node)
> + {
> +   stmt_vec_info stmt_info = vect_orig_stmt (SLP_TREE_REPRESENTATIVE (node));
> +   basic_block bb = gimple_bb (stmt_info->stmt);
> +   return bb->count.to_sreal_scale (ENTRY_BLOCK_PTR_FOR_FN (cfun)->count);
> + }
>   
>   /* Return true if STMTS contains a pattern statement.  */
>   
> ***************
> *** 3531,3559 ****
>     return opt_result::success ();
>   }
>   
> ! struct slpg_vertex
>   {
> !   slpg_vertex (slp_tree node_)
> !     : node (node_), perm_in (-1), perm_out (-1) {}
> ! 
> !   int get_perm_materialized () const
> !   { return perm_in != perm_out ? perm_in : 0; }
> ! 
>     slp_tree node;
> !   /* The common permutation on the incoming lanes (towards SLP children).  */
> !   int perm_in;
> !   /* The permutation on the outgoing lanes (towards SLP parents).  When
> !      the node is a materialization point for a permute this differs
> !      from perm_in (and is then usually zero).  Materialization happens
> !      on the input side.  */
> !   int perm_out;
>   };
>   
>   /* Fill the vertices and leafs vector with all nodes in the SLP graph.  */
>   
> ! static void
> ! vect_slp_build_vertices (hash_set<slp_tree> &visited, slp_tree node,
> ! 			 vec<slpg_vertex> &vertices, vec<int> &leafs)
>   {
>     unsigned i;
>     slp_tree child;
> --- 3544,3994 ----
>     return opt_result::success ();
>   }
>   
> ! /* Estimates the cost of inserting layout changes into the SLP graph.  */
> ! 
> ! struct slpg_layout_cost
> ! {
> !   slpg_layout_cost () = default;
> !   slpg_layout_cost (sreal, bool);
> ! 
> !   bool operator== (const slpg_layout_cost &) const;
> !   bool operator!= (const slpg_layout_cost &) const;
> ! 
> !   bool is_better_than (const slpg_layout_cost &, bool) const;
> ! 
> !   void add_parallel_cost (const slpg_layout_cost &);
> !   void add_serial_cost (const slpg_layout_cost &);
> !   void split (unsigned int);
> ! 
> !   /* The longest sequence of layout changes needed during any traversal
> !      of the partition dag, weighted by execution frequency.
> ! 
> !      This is the most important metric when optimizing for speed, since
> !      it helps to ensure that we keep the number of operations on
> !      critical paths to a minimum.  */
> !   sreal depth = 0;
> ! 
> !   /* An estimate of the total number of operations needed.  It is weighted by
> !      execution frequency when optimizing for speed but not when optimizing for
> !      size.  In order to avoid double-counting, a node with a fanout of N will
> !      distribute 1/N of its total cost to each successor.
> ! 
> !      This is the most important metric when optimizing for size, since
> !      it helps to keep the total number of operations to a minimum.  */
> !   sreal total = 0;
> ! };
> ! 
> ! /* Construct costs for a node with weight WEIGHT.  A higher weight
> !    indicates more frequent execution.  IS_FOR_SIZE is true if we are
> !    optimizing for size rather than speed.  */
> ! 
> ! slpg_layout_cost::slpg_layout_cost (sreal weight, bool is_for_size)
> !   : depth (weight), total (is_for_size && weight > 0 ? 1 : weight)
> ! {
> ! }
> ! 
> ! bool
> ! slpg_layout_cost::operator== (const slpg_layout_cost &other) const
> ! {
> !   return depth == other.depth && total == other.total;
> ! }
> ! 
> ! bool
> ! slpg_layout_cost::operator!= (const slpg_layout_cost &other) const
> ! {
> !   return !operator== (other);
> ! }
> ! 
> ! /* Return true if these costs are better than OTHER.  IS_FOR_SIZE is
> !    true if we are optimizing for size rather than speed.  */
> ! 
> ! bool
> ! slpg_layout_cost::is_better_than (const slpg_layout_cost &other,
> ! 				  bool is_for_size) const
> ! {
> !   if (is_for_size)
> !     {
> !       if (total != other.total)
> ! 	return total < other.total;
> !       return depth < other.depth;
> !     }
> !   else
> !     {
> !       if (depth != other.depth)
> ! 	return depth < other.depth;
> !       return total < other.total;
> !     }
> ! }
> ! 
> ! /* Increase the costs to account for something with cost INPUT_COST
> !    happening in parallel with the current costs.  */
> ! 
> ! void
> ! slpg_layout_cost::add_parallel_cost (const slpg_layout_cost &input_cost)
>   {
> !   depth = std::max (depth, input_cost.depth);
> !   total += input_cost.total;
> ! }
> ! 
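[An editorial illustration of the depth/total bookkeeping above; a minimal
standalone sketch, not part of the patch.  Plain double stands in for sreal
and the helper names are made up, but the combination rules follow the
comments and code above: parallel paths take the maximum depth and sum the
totals, while serial composition sums both fields.

    #include <algorithm>

    struct cost { double depth, total; };

    // Parallel combination, as in add_parallel_cost: only the deepest
    // path contributes latency, but every change counts in the total.
    cost parallel (const cost &a, const cost &b)
    { return { std::max (a.depth, b.depth), a.total + b.total }; }

    // Serial combination, as in add_serial_cost: both fields accumulate.
    cost serial (const cost &a, const cost &b)
    { return { a.depth + b.depth, a.total + b.total }; }

For example, two permutes of weight 10 on parallel paths combine to
{ 10, 20 }; putting that in series with a weight-1 permute outside the
loop gives { 11, 21 }.  is_better_than then compares depth first when
optimizing for speed and total first when optimizing for size.]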
> ! /* Increase the costs to account for something with cost OTHER
> !    happening in series with the current costs.  */
> ! 
> ! void
> ! slpg_layout_cost::add_serial_cost (const slpg_layout_cost &other)
> ! {
> !   depth += other.depth;
> !   total += other.total;
> ! }
> ! 
> ! /* Split the total cost among TIMES successors or predecessors.  */
>   
> ! void
> ! slpg_layout_cost::split (unsigned int times)
> ! {
> !   if (times > 1)
> !     total /= times;
> ! }
> ! 
> ! /* Information about one node in the SLP graph, for use during
> !    vect_optimize_slp_pass.  */
> ! 
> ! struct slpg_vertex
> ! {
> !   slpg_vertex (slp_tree node_) : node (node_) {}
>   
> +   /* The node itself.  */
>     slp_tree node;
> ! 
> !   /* Which partition the node belongs to, or -1 if none.  */
> !   int partition = -1;
> ! 
> !   /* The number of nodes that directly use the result of this one
> !      (i.e. the number of nodes that count this one as a child).  */
> !   unsigned int out_degree = 0;
> ! 
> !   /* The execution frequency of the node.  */
> !   sreal weight = 0;
> ! 
> !   /* The total execution frequency of all nodes that directly use the
> !      result of this one.  */
> !   sreal out_weight = 0;
> ! };
> ! 
> ! /* Information about one partition of the SLP graph, for use during
> !    vect_optimize_slp_pass.  */
> ! 
> ! struct slpg_partition_info
> ! {
> !   /* The nodes in the partition occupy indices [NODE_BEGIN, NODE_END)
> !      of m_partitioned_nodes.  */
> !   unsigned int node_begin = 0;
> !   unsigned int node_end = 0;
> ! 
> !   /* Which layout we've chosen to use for this partition, or -1 if
> !      we haven't picked one yet.  */
> !   int layout = -1;
> ! 
> !   /* The number of predecessors and successors in the partition dag.
> !      The predecessors always have lower partition numbers and the
> !      successors always have higher partition numbers.
> ! 
> !      Note that the directions of these edges are not necessarily the
> !      same as in the data flow graph.  For example, if an SCC contains
> !      an inner and an outer loop, the inner loop's partition will have
> !      at least two incoming edges from the outer loop's partition:
> !      one for a live-in value and one for a live-out value.  In data
> !      flow terms, the live-in edge would also be from the outer loop
> !      to the inner loop, but the live-out edge would be from the inner
> !      loop to the outer loop.  */
> !   unsigned int in_degree = 0;
> !   unsigned int out_degree = 0;
> ! };
> ! 
> ! /* Information about the costs of using a particular layout for a
> !    particular partition.  It can also say that the combination is
> !    invalid.  */
> ! 
> ! struct slpg_partition_layout_costs
> ! {
> !   bool is_possible () const { return cost.total != sreal::min (); }
> !   void mark_impossible () { cost.total = sreal::min (); }
> ! 
> !   /* The costs inherited from predecessor partitions plus the inherent
> !      cost of the layout within the node itself.  */
> !   slpg_layout_cost in_cost;
> ! 
> !   /* The costs inherited from successor partitions.  */
> !   slpg_layout_cost out_cost;
> ! 
> !   /* The sum of the above, when used.  */
> !   slpg_layout_cost cost;
> ! };
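[A second editorial sketch, again not from the patch: the pass keeps one
slpg_partition_layout_costs entry per (partition, layout) pair in a flat
array indexed by P * num-layouts + L, and marks an invalid combination with
an sreal::min () sentinel that no legitimate cost computation can produce.
With double standing in for sreal and invented names:

    #include <vector>
    #include <limits>

    struct layout_costs
    {
      double cost = 0;  // stand-in for the in_cost/out_cost/cost sum
      bool is_possible () const
      { return cost != -std::numeric_limits<double>::infinity (); }
      void mark_impossible ()
      { cost = -std::numeric_limits<double>::infinity (); }
    };

    struct cost_table
    {
      unsigned int num_layouts;
      std::vector<layout_costs> entries;  // num-partitions * num-layouts

      // The entry for (partition P, layout L) lives at P * num_layouts + L.
      layout_costs &get (unsigned int p, unsigned int l)
      { return entries[p * num_layouts + l]; }
    };
]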
> ! 
> ! /* This class tries to optimize the layout of vectors in order to avoid
> !    unnecessary shuffling.  At the moment, the set of possible layouts is
> !    restricted to bijective permutes.
> ! 
> !    The goal of the pass depends on whether we're optimizing for size or
> !    for speed.  When optimizing for size, the goal is to reduce the overall
> !    number of layout changes (including layout changes implied by things
> !    like load permutes).  When optimizing for speed, the goal is to
> !    reduce the maximum latency attributable to layout changes on any
> !    non-cyclic path through the data flow graph.
> ! 
> !    For example, when optimizing a loop nest for speed, we will prefer
> !    to make layout changes outside of a loop rather than inside of a loop,
> !    and will prefer to make layout changes in parallel rather than serially,
> !    even if that increases the overall number of layout changes.
> ! 
> !    The high-level procedure is:
> ! 
> !    (1) Build a graph in which edges go from uses (parents) to definitions
> !        (children).
> ! 
> !    (2) Divide the graph into a dag of strongly-connected components (SCCs).
> ! 
> !    (3) Partition the nodes in each SCC based on their containing cfg loop,
> !        creating a dag of partitions.  The goal is now to assign a layout
> !        to each partition.
> ! 
> !    (4) Construct a set of vector layouts that are worth considering and
> !        calculate the local cost of using each layout in each partition.
> !        Some partitions might not allow a given layout at all.
> ! 
> !    (5) Perform a forward walk over the partition dag (from loads to stores)
> !        accumulating the "forward" cost of using each layout.  When visiting
> !        each partition, assign a tentative choice of layout to the partition
> !        and use that choice when calculating the cost of using a different
> !        layout in successor partitions.
> ! 
> !    (6) Perform a backward walk over the partition dag (from stores to loads),
> !        accumulating the "backward" cost of using each layout.  When visiting
> !        each partition, make a final choice of layout for that partition based
> !        on the accumulated forward costs (from (5)) and backward costs
> !        (from (6)).
> ! 
> !    (7) Apply the chosen layouts to the SLP graph.
> ! 
> !    For example, consider the SLP statements:
> ! 
> !      S1: a_1 = load
> !    loop:
> !      S2: a_2 = PHI<a_1, a_3>
> !      S3: b_1 = load
> !      S4: a_3 = a_2 + b_1
> !    exit:
> !      S5: a_4 = PHI<a_3>
> !      S6: store a_4
> ! 
> !    S2 and S4 form an SCC and are part of the same loop.  Every other
> !    statement is in a singleton SCC.  In this example there is a one-to-one
> !    mapping between SCCs and partitions and the partition dag looks like this:
> ! 
> ! 	S1     S3
> ! 	  \   /
> ! 	  S2+S4
> ! 	    |
> ! 	   S5
> ! 	    |
> ! 	   S6
> ! 
> !    S2, S3 and S4 will have a higher execution frequency than the other
> !    statements, so when optimizing for speed, the goal is to avoid any
> !    layout changes:
> ! 
> !    - within S3
> !    - within S2+S4
> !    - on the S3->S2+S4 edge
> ! 
> !    For example, if S3 was originally a reversing load, the goal of the
> !    pass is to make it an unreversed load and change the layout on the
> !    S1->S2+S4 and S2+S4->S5 edges to compensate.  (Changing the layout
> !    on S1->S2+S4 and S5->S6 would also be acceptable.)
> ! 
> !    The difference between SCCs and partitions becomes important if we
> !    add an outer loop:
> ! 
> !      S1: a_1 = ...
> !    loop1:
> !      S2: a_2 = PHI<a_1, a_6>
> !      S3: b_1 = load
> !      S4: a_3 = a_2 + b_1
> !    loop2:
> !      S5: a_4 = PHI<a_3, a_5>
> !      S6: c_1 = load
> !      S7: a_5 = a_4 + c_1
> !    exit2:
> !      S8: a_6 = PHI<a_5>
> !      S9: store a_6
> !    exit1:
> ! 
> !    Here, S2, S4, S5, S7 and S8 form a single SCC.  However, we usually
> !    do not want restrictions in the outer loop to "infect" the decision
> !    for the inner loop.  For example, if an outer-loop node in the SCC
> !    contains a statement with a fixed layout, that should not prevent the
> !    inner loop from using a different layout.  Conversely, the inner loop
> !    should not dictate a layout to the outer loop: if the outer loop does
> !    a lot of computation, then it may not be efficient to do all of that
> ! 
computation in the inner loop's preferred layout. > ! > ! We therefore partition the SCC into S2+S4+S8 (outer) and S5+S7 (inner). > ! We also try to arrange partitions so that: > ! > ! - the partition for an outer loop comes before the partition for > ! an inner loop > ! > ! - if a sibling loop A dominates a sibling loop B, A's partition > ! comes before B's > ! > ! So in this case, the partition dag is: > ! > ! S1 S3 > ! \ / > ! S2+S4+S8 S6 > ! | \\ / > ! | S5+S7 > ! | > ! S9 > ! > ! There are two edges from S2+S4+S8 to S5+S7: one for the edge S4->S5 and > ! one for a reversal of the edge S7->S8. > ! > ! The backward walk picks a layout for S5+S7 before S2+S4+S8. The choice > ! for S2+S4+S8 therefore has to balance the cost of using the outer loop's > ! preferred layout against the cost of changing the layout on entry to the > ! inner loop (S4->S5) and on exit from the inner loop (S7->S8 reversed). > ! > ! ??? Although this works well when optimizing for speed, it has the > ! downside when optimizing for size that the choice of layout for S5+S7 > ! is completely independent of S9. Perhaps we should not partition SCCs > ! when optimizing for size? > ! > ! To give a concrete example of the difference between optimizing > ! for size and speed, consider: > ! > ! a[0] = (b[1] << c[3]) - d[1]; > ! a[1] = (b[0] << c[2]) - d[0]; > ! a[2] = (b[3] << c[1]) - d[3]; > ! a[3] = (b[2] << c[0]) - d[2]; > ! > ! There are three different layouts here: one for a, one for b and d, > ! and one for c. When optimizing for speed it is better to permute > ! each of b, c and d into the order required by a, since those permutes > ! happen in parallel. But when optimizing for size, it is better to: > ! > ! - permute c into the same order as b > ! - do the arithmetic > ! - permute the result into the order required by a > ! > ! This gives 2 permutes rather than 3. */ > ! > ! class vect_optimize_slp_pass > ! { > ! public: > ! vect_optimize_slp_pass (vec_info *vinfo) : m_vinfo (vinfo) {} > ! void run (); > ! > ! private: > ! /* Graph building. */ > ! struct loop *containing_loop (slp_tree); > ! bool is_cfg_latch_edge (graph_edge *); > ! void build_vertices (hash_set &, slp_tree); > ! void build_vertices (); > ! void build_graph (); > ! > ! /* Partitioning. */ > ! void create_partitions (); > ! template void for_each_partition_edge (unsigned int, T); > ! > ! /* Layout selection. */ > ! bool can_change_layouts (unsigned int, unsigned int); > ! bool layout_is_reversible (unsigned int); > ! unsigned int layout_cost (unsigned int, unsigned int); > ! slpg_partition_layout_costs &partition_layout_costs (unsigned int, > ! unsigned int); > ! bool partition_can_absorb_permutes (unsigned int); > ! int node_absorption_cost (slp_tree, unsigned int); > ! void start_choosing_layouts (); > ! > ! /* Cost propagation. */ > ! slpg_layout_cost edge_layout_cost (graph_edge *, unsigned int, > ! unsigned int, unsigned int); > ! slpg_layout_cost forward_cost (graph_edge *, unsigned int, unsigned int); > ! slpg_layout_cost backward_cost (graph_edge *, unsigned int, unsigned int); > ! void forward_pass (); > ! void backward_pass (); > ! > ! /* Rematerialization. */ > ! slp_tree get_result_with_layout (slp_tree, unsigned int); > ! void materialize (); > ! > ! /* Clean-up. */ > ! void remove_redundant_permutes (); > ! > ! void dump (); > ! > ! vec_info *m_vinfo; > ! > ! /* True if we should optimize the graph for size, false if we should > ! optimize it for speed. (It wouldn't be easy to make this decision > ! more locally.) */ > ! 
> !   bool m_optimize_size;
> ! 
> !   /* A graph of all SLP nodes, with edges leading from uses to definitions.
> !      In other words, a node's predecessors are its slp_tree parents and
> !      a node's successors are its slp_tree children.  */
> !   graph *m_slpg = nullptr;
> ! 
> !   /* The vertices of M_SLPG, indexed by slp_tree::vertex.  */
> !   auto_vec<slpg_vertex> m_vertices;
> ! 
> !   /* The list of all leaves of M_SLPG, such as external definitions, constants,
> !      and loads.  */
> !   auto_vec<int> m_leafs;
> ! 
> !   /* This array has one entry for every vector layout that we're considering.
> !      Element 0 is null and indicates "no change".  Other entries describe
> !      permutes that are inherent in the current graph and that we would
> !      like to reverse if possible.
> ! 
> !      For example, a permute { 1, 2, 3, 0 } means that something has
> !      effectively been permuted in that way, such as a load group
> !      { a[1], a[2], a[3], a[0] } (viewed as a permutation of a[0:3]).
> !      We'd then like to apply the reverse permute { 3, 0, 1, 2 }
> !      in order to put things "back" in order.  */
> !   auto_vec<vec<unsigned> > m_perms;
> ! 
> !   /* A partitioning of the nodes for which a layout must be chosen.
> !      Each partition represents an <SCC, cfg loop> pair; that is,
> !      nodes in different SCCs belong to different partitions, and nodes
> !      within an SCC are further partitioned according to the containing
> !      cfg loop.  Partition <SCC1, L1> comes before <SCC2, L2> if:
> ! 
> !      - SCC1 != SCC2 and SCC1 is a predecessor of SCC2 in a forward walk
> !        from leaves (such as loads) to roots (such as stores).
> ! 
> !      - SCC1 == SCC2 and L1's header strictly dominates L2's header.  */
> !   auto_vec<slpg_partition_info> m_partitions;
> ! 
> !   /* The list of all nodes for which a layout must be chosen.  Nodes for
> !      partition P come before the nodes for partition P+1.  Nodes within a
> !      partition are in reverse postorder.  */
> !   auto_vec<unsigned int> m_partitioned_nodes;
> ! 
> !   /* Index P * num-layouts + L contains the cost of using layout L
> !      for partition P.  */
> !   auto_vec<slpg_partition_layout_costs> m_partition_layout_costs;
> ! 
> !   /* Index N * num-layouts + L, if nonnull, is a node that provides the
> !      original output of node N adjusted to have layout L.  */
> !   auto_vec<slp_tree> m_node_layouts;
>   };
>   
>   /* Fill the vertices and leafs vector with all nodes in the SLP graph.  */
>   
> ! void
> ! vect_optimize_slp_pass::build_vertices (hash_set<slp_tree> &visited,
> ! 					slp_tree node)
>   {
>     unsigned i;
>     slp_tree child;
> ***************
> *** 3561,3568 ****
>     if (visited.add (node))
>       return;
>   
> !   node->vertex = vertices.length ();
> !   vertices.safe_push (slpg_vertex (node));
>   
>     bool leaf = true;
>     bool force_leaf = false;
> --- 3996,4003 ----
>     if (visited.add (node))
>       return;
>   
> !   node->vertex = m_vertices.length ();
> !   m_vertices.safe_push (slpg_vertex (node));
>   
>     bool leaf = true;
>     bool force_leaf = false;
> ***************
> *** 3570,3576 ****
>       if (child)
>         {
>   	leaf = false;
> ! 	vect_slp_build_vertices (visited, child, vertices, leafs);
>         }
>       else
>         force_leaf = true;
> --- 4005,4011 ----
>       if (child)
>         {
>   	leaf = false;
> ! 	build_vertices (visited, child);
>         }
>       else
>         force_leaf = true;
> ***************
> *** 3580,3600 ****
>      and inductions.  Force those SLP PHIs to act as leafs to make them
>      backwards reachable.  */
>     if (leaf || force_leaf)
> !     leafs.safe_push (node->vertex);
>   }
>   
>   /* Fill the vertices and leafs vector with all nodes in the SLP graph.  */
>   
> ! static void
> ! vect_slp_build_vertices (vec_info *info, vec<slpg_vertex> &vertices,
> ! 			 vec<int> &leafs)
>   {
>     hash_set<slp_tree> visited;
>     unsigned i;
>     slp_instance instance;
> !   FOR_EACH_VEC_ELT (info->slp_instances, i, instance)
> !     vect_slp_build_vertices (visited, SLP_INSTANCE_TREE (instance), vertices,
> ! 			     leafs);
>   }
>   
>   /* Apply (reverse) bijective PERM to VEC.  */
> --- 4015,4033 ----
>      and inductions.  Force those SLP PHIs to act as leafs to make them
>      backwards reachable.  */
>     if (leaf || force_leaf)
> !     m_leafs.safe_push (node->vertex);
>   }
>   
>   /* Fill the vertices and leafs vector with all nodes in the SLP graph.  */
>   
> ! void
> ! vect_optimize_slp_pass::build_vertices ()
>   {
>     hash_set<slp_tree> visited;
>     unsigned i;
>     slp_instance instance;
> !   FOR_EACH_VEC_ELT (m_vinfo->slp_instances, i, instance)
> !     build_vertices (visited, SLP_INSTANCE_TREE (instance));
>   }
>   
>   /* Apply (reverse) bijective PERM to VEC.  */
> ***************
> *** 3625,3699 ****
>       }
>   }
>   
> ! /* Return whether permutations PERM_A and PERM_B as recorded in the
> !    PERMS vector are equal.  */
>   
> ! static bool
> ! vect_slp_perms_eq (const vec<vec<unsigned> > &perms,
> ! 		   int perm_a, int perm_b)
>   {
> !   return (perm_a == perm_b
> ! 	  || (perm_a != -1 && perm_b != -1
> ! 	      && perms[perm_a].length () == perms[perm_b].length ()
> ! 	      && memcmp (&perms[perm_a][0], &perms[perm_b][0],
> ! 			 sizeof (unsigned) * perms[perm_a].length ()) == 0));
>   }
>   
> ! /* Optimize the SLP graph of VINFO.  */
>   
> ! void
> ! vect_optimize_slp (vec_info *vinfo)
>   {
> !   if (vinfo->slp_instances.is_empty ())
> !     return;
>   
> !   slp_tree node;
> !   unsigned i;
> !   auto_vec<slpg_vertex> vertices;
> !   auto_vec<int> leafs;
> !   vect_slp_build_vertices (vinfo, vertices, leafs);
>   
> !   struct graph *slpg = new_graph (vertices.length ());
> !   for (slpg_vertex &v : vertices)
>       for (slp_tree child : SLP_TREE_CHILDREN (v.node))
>         if (child)
> ! 	add_edge (slpg, v.node->vertex, child->vertex);
>   
> !   /* Compute (reverse) postorder on the inverted graph.  */
> !   auto_vec<int> ipo;
> !   graphds_dfs (slpg, &leafs[0], leafs.length (), &ipo, false, NULL, NULL);
> ! 
> !   auto_vec<vec<unsigned> > perms;
> !   perms.safe_push (vNULL); /* zero is no permute */
> ! 
> !   /* Produce initial permutations.  */
> !   for (i = 0; i < leafs.length (); ++i)
> !     {
> !       int idx = leafs[i];
> !       slp_tree node = vertices[idx].node;
> ! 
> !       /* Handle externals and constants optimistically throughout the
> ! 	 iteration.  But treat existing vectors as fixed since we
> ! 	 do not handle permuting them below.  */
> !       if ((SLP_TREE_DEF_TYPE (node) == vect_external_def
> ! 	   && !SLP_TREE_VEC_DEFS (node).exists ())
> ! 	  || SLP_TREE_DEF_TYPE (node) == vect_constant_def)
> ! 	continue;
>   
> !       /* Leafs do not change across iterations.  Note leafs also double
> ! 	 as entries to the reverse graph.  */
> !       if (!slpg->vertices[idx].succ)
>   	{
> ! 	  vertices[idx].perm_in = 0;
> ! 	  vertices[idx].perm_out = 0;
>   	}
>   
>         /* Loads are the only thing generating permutes.  */
>         if (!SLP_TREE_LOAD_PERMUTATION (node).exists ())
>   	continue;
>   
>         /* If splitting out a SLP_TREE_LANE_PERMUTATION can make the
> !          node unpermuted, record this permute.  */
>         stmt_vec_info dr_stmt = SLP_TREE_REPRESENTATIVE (node);
>         if (!STMT_VINFO_GROUPED_ACCESS (dr_stmt))
>   	continue;
> --- 4058,4416 ----
>       }
>   }
>   
> ! /* Return the cfg loop that contains NODE.  */
>   
> ! struct loop *
> ! vect_optimize_slp_pass::containing_loop (slp_tree node)
>   {
> !   stmt_vec_info rep = SLP_TREE_REPRESENTATIVE (node);
> !   if (!rep)
> !     return ENTRY_BLOCK_PTR_FOR_FN (cfun)->loop_father;
> !   return gimple_bb (vect_orig_stmt (rep)->stmt)->loop_father;
>   }
>   
> ! /* Return true if UD (an edge from a use to a definition) is associated
> !    with a loop latch edge in the cfg.  */
>   
> ! bool
> ! vect_optimize_slp_pass::is_cfg_latch_edge (graph_edge *ud)
>   {
> ! 
slp_tree use = m_vertices[ud->src].node; > ! slp_tree def = m_vertices[ud->dest].node; > ! if (SLP_TREE_DEF_TYPE (use) != vect_internal_def > ! || SLP_TREE_DEF_TYPE (def) != vect_internal_def) > ! return false; > > ! stmt_vec_info use_rep = vect_orig_stmt (SLP_TREE_REPRESENTATIVE (use)); > ! return (is_a (use_rep->stmt) > ! && bb_loop_header_p (gimple_bb (use_rep->stmt)) > ! && containing_loop (def) == containing_loop (use)); > ! } > ! > ! /* Build the graph. Mark edges that correspond to cfg loop latch edges with > ! a nonnull data field. */ > > ! void > ! vect_optimize_slp_pass::build_graph () > ! { > ! build_vertices (); > ! > ! m_slpg = new_graph (m_vertices.length ()); > ! for (slpg_vertex &v : m_vertices) > for (slp_tree child : SLP_TREE_CHILDREN (v.node)) > if (child) > ! { > ! graph_edge *ud = add_edge (m_slpg, v.node->vertex, child->vertex); > ! if (is_cfg_latch_edge (ud)) > ! ud->data = this; > ! } > ! } > > ! /* Return true if E corresponds to a loop latch edge in the cfg. */ > > ! static bool > ! skip_cfg_latch_edges (graph_edge *e) > ! { > ! return e->data; > ! } > ! > ! /* Create the node partitions. */ > ! > ! void > ! vect_optimize_slp_pass::create_partitions () > ! { > ! /* Calculate a postorder of the graph, ignoring edges that correspond > ! to natural latch edges in the cfg. Reading the vector from the > ! end to the beginning gives the reverse postorder. */ > ! auto_vec initial_rpo; > ! graphds_dfs (m_slpg, &m_leafs[0], m_leafs.length (), &initial_rpo, > ! false, NULL, skip_cfg_latch_edges); > ! gcc_assert (initial_rpo.length () == m_vertices.length ()); > ! > ! /* Calculate the strongly connected components of the graph. */ > ! auto_vec scc_grouping; > ! unsigned int num_sccs = graphds_scc (m_slpg, NULL, NULL, &scc_grouping); > ! > ! /* Create a new index order in which all nodes from the same SCC are > ! consecutive. Use scc_pos to record the index of the first node in > ! each SCC. */ > ! auto_vec scc_pos (num_sccs); > ! int last_component = -1; > ! unsigned int node_count = 0; > ! for (unsigned int node_i : scc_grouping) > ! { > ! if (last_component != m_slpg->vertices[node_i].component) > ! { > ! last_component = m_slpg->vertices[node_i].component; > ! gcc_assert (last_component == int (scc_pos.length ())); > ! scc_pos.quick_push (node_count); > ! } > ! node_count += 1; > ! } > ! gcc_assert (node_count == initial_rpo.length () > ! && last_component + 1 == int (num_sccs)); > ! > ! /* Use m_partitioned_nodes to group nodes into SCC order, with the nodes > ! inside each SCC following the RPO we calculated above. The fact that > ! we ignored natural latch edges when calculating the RPO should ensure > ! that, for natural loop nests: > ! > ! - the first node that we encounter in a cfg loop is the loop header phi > ! - the loop header phis are in dominance order > ! > ! Arranging for this is an optimization (see below) rather than a > ! correctness issue. Unnatural loops with a tangled mess of backedges > ! will still work correctly, but might give poorer results. > ! > ! Also update scc_pos so that it gives 1 + the index of the last node > ! in the SCC. */ > ! m_partitioned_nodes.safe_grow (node_count); > ! for (unsigned int old_i = initial_rpo.length (); old_i-- > 0;) > ! { > ! unsigned int node_i = initial_rpo[old_i]; > ! unsigned int new_i = scc_pos[m_slpg->vertices[node_i].component]++; > ! m_partitioned_nodes[new_i] = node_i; > ! } > ! > ! /* Partition each SCC based on the containing cfg loop. The order we > ! 
constructed above should ensure that, for natural cfg loops, we'll > ! create sub-SCC partitions for outer loops before the corresponding > ! sub-SCC partitions for inner loops. Similarly, when one sibling > ! loop A dominates another sibling loop B, we should create a sub-SCC > ! partition for A before a sub-SCC partition for B. > ! > ! As above, nothing depends for correctness on whether this achieves > ! a natural nesting, but we should get better results when it does. */ > ! m_partitions.reserve (m_vertices.length ()); > ! unsigned int next_partition_i = 0; > ! hash_map loop_partitions; > ! unsigned int rpo_begin = 0; > ! unsigned int num_partitioned_nodes = 0; > ! for (unsigned int rpo_end : scc_pos) > ! { > ! loop_partitions.empty (); > ! for (unsigned int rpo_i = rpo_begin; rpo_i < rpo_end; ++rpo_i) > ! { > ! /* Handle externals and constants optimistically throughout. > ! But treat existing vectors as fixed since we do not handle > ! permuting them. */ > ! unsigned int node_i = m_partitioned_nodes[rpo_i]; > ! slp_tree node = m_vertices[node_i].node; > ! if ((SLP_TREE_DEF_TYPE (node) == vect_external_def > ! && !SLP_TREE_VEC_DEFS (node).exists ()) > ! || SLP_TREE_DEF_TYPE (node) == vect_constant_def) > ! m_vertices[node_i].partition = -1; > ! else > ! { > ! struct loop *loop = containing_loop (node); > ! bool existed; > ! auto &partition_i > ! = loop_partitions.get_or_insert (loop, &existed); > ! if (!existed) > ! { > ! partition_i = next_partition_i++; > ! m_partitions.quick_push (slpg_partition_info ()); > ! } > ! m_vertices[node_i].partition = partition_i; > ! num_partitioned_nodes += 1; > ! m_partitions[partition_i].node_end += 1; > ! } > ! } > ! rpo_begin = rpo_end; > ! } > ! > ! /* Assign ranges of consecutive node indices to each partition, > ! in partition order. Start with node_end being the same as > ! node_begin so that the next loop can use it as a counter. */ > ! unsigned int node_begin = 0; > ! for (auto &partition : m_partitions) > ! { > ! partition.node_begin = node_begin; > ! node_begin += partition.node_end; > ! partition.node_end = partition.node_begin; > ! } > ! gcc_assert (node_begin == num_partitioned_nodes); > ! > ! /* Finally build the list of nodes in partition order. */ > ! m_partitioned_nodes.truncate (num_partitioned_nodes); > ! for (unsigned int node_i = 0; node_i < m_vertices.length (); ++node_i) > ! { > ! int partition_i = m_vertices[node_i].partition; > ! if (partition_i >= 0) > { > ! unsigned int order_i = m_partitions[partition_i].node_end++; > ! m_partitioned_nodes[order_i] = node_i; > } > + } > + } > + > + /* Look for edges from earlier partitions into node NODE_I and edges from > + node NODE_I into later partitions. Call: > + > + FN (ud, other_node_i, other_partition_i) > + > + for each such use-to-def edge ud, where other_node_i and other_partition_i > + are the node and partition at the other end of the edge. 
> +    */
> + 
> + template<typename T>
> + void
> + vect_optimize_slp_pass::for_each_partition_edge (unsigned int node_i, T fn)
> + {
> +   int partition_i = m_vertices[node_i].partition;
> +   for (graph_edge *pred = m_slpg->vertices[node_i].pred;
> +        pred; pred = pred->pred_next)
> +     {
> +       int src_partition_i = m_vertices[pred->src].partition;
> +       if (src_partition_i >= 0 && src_partition_i != partition_i)
> + 	fn (pred, pred->src, src_partition_i);
> +     }
> +   for (graph_edge *succ = m_slpg->vertices[node_i].succ;
> +        succ; succ = succ->succ_next)
> +     {
> +       int dest_partition_i = m_vertices[succ->dest].partition;
> +       if (dest_partition_i >= 0 && dest_partition_i != partition_i)
> + 	fn (succ, succ->dest, dest_partition_i);
> +     }
> + }
> + 
> + /* Return true if the target is able to change a vector from layout
> +    FROM_LAYOUT_I to layout TO_LAYOUT_I.  */
> + 
> + bool
> + vect_optimize_slp_pass::can_change_layouts (unsigned int from_layout_i,
> + 					    unsigned int to_layout_i)
> + {
> +   /* FIXME: test what the target can support.  */
> +   (void) from_layout_i;
> +   (void) to_layout_i;
> +   return true;
> + }
> + 
> + /* We already know that the target can perform permute m_perms[LAYOUT_I].
> +    In our terms this means that we know that the target can go from a
> +    layout with m_perms[LAYOUT_I] reversed (layout LAYOUT_I) to one with
> +    it applied (layout 0, the original layout of the SLP graph).  Return
> +    true if the target can also go in the opposite direction: from the
> +    original layout to one in which m_perms[LAYOUT_I] has been reversed.  */
> + 
> + bool
> + vect_optimize_slp_pass::layout_is_reversible (unsigned int layout_i)
> + {
> +   return can_change_layouts (0, layout_i);
> + }
> + 
> + /* Return the cost (in arbitrary units) of going from layout FROM_LAYOUT_I
> +    to layout TO_LAYOUT_I.  */
> + 
> + unsigned int
> + vect_optimize_slp_pass::layout_cost (unsigned int from_layout_i,
> + 				     unsigned int to_layout_i)
> + {
> +   if (from_layout_i == to_layout_i)
> +     return 0;
> + 
> +   if (can_change_layouts (from_layout_i, to_layout_i))
> +     return 1;
> + 
> +   /* Change to layout 0 and then out again.  */
> +   gcc_assert (layout_is_reversible (from_layout_i));
> +   return 2;
> + }
> + 
> + /* Return the costs of assigning layout LAYOUT_I to partition PARTITION_I.  */
> + 
> + inline slpg_partition_layout_costs &
> + vect_optimize_slp_pass::partition_layout_costs (unsigned int partition_i,
> + 						unsigned int layout_i)
> + {
> +   return m_partition_layout_costs[partition_i * m_perms.length () + layout_i];
> + }
> + 
> + /* Return true if partition PARTITION_I is structurally allowed to absorb
> +    neighboring permutes.  */
> + 
> + bool
> + vect_optimize_slp_pass::
> + partition_can_absorb_permutes (unsigned int partition_i)
> + {
> +   auto &partition = m_partitions[partition_i];
> +   if (partition.node_end != partition.node_begin + 1)
> +     return false;
> + 
> +   slp_tree node = m_vertices[m_partitioned_nodes[partition.node_begin]].node;
> +   if (SLP_TREE_CODE (node) == VEC_PERM_EXPR)
> +     return true;
> + 
> +   stmt_vec_info rep = SLP_TREE_REPRESENTATIVE (node);
> +   if (rep
> +       && STMT_VINFO_DATA_REF (rep)
> +       && DR_IS_READ (STMT_VINFO_DATA_REF (rep)))
> +     return true;
> + 
> +   return false;
> + }
> + 
> + /* Check whether NODE (which is a VEC_PERM_EXPR or a load) can absorb the
> +    permute that would be required to go from layout 0 to layout LAYOUT_I.
> +    Return the cost of the change if so, in the same arbitrary units as
> +    for layout_cost.  Return -1 otherwise. 
*/ > + > + int > + vect_optimize_slp_pass::node_absorption_cost (slp_tree node, > + unsigned int layout_i) > + { > + if (SLP_TREE_CODE (node) == VEC_PERM_EXPR) > + /* FIXME: check whether NODE can handle cases in which both the inputs > + and outputs have layout LAYOUT_I. Return -1 if not. */ > + return 0; > + > + auto &vertex = m_vertices[node->vertex]; > + auto &partition = m_partitions[vertex.partition]; > + if (partition.layout == int (layout_i)) > + return 0; > + /* FIXME: check whether NODE can be changed from its current layout > + to layout LAYOUT_I. Return -1 if not. */ > + return 1; > + } > + > + /* Decide which element layouts we should consider using. Mark which > + partitions can potentially use which layouts and the local costs of > + each choice. */ > + > + void > + vect_optimize_slp_pass::start_choosing_layouts () > + { > + /* Used to assign unique permute indices. */ > + using perm_hash = unbounded_hashmap_traits< > + vec_free_hash_base>, > + int_hash > + >; > + hash_map, int, perm_hash> layout_ids; > + > + /* Layout 0 is "no change". */ > + m_perms.safe_push (vNULL); > + > + /* Create layouts from existing permutations. */ > + /* FIXME: cap the maximum number of layouts, to avoid quadraticness. */ > + for (unsigned int node_i : m_leafs) > + { > + int partition_i = m_vertices[node_i].partition; > + if (partition_i < 0) > + continue; > + > + /* Leafs also double as entries to the reverse graph. Allow the > + layout of those to be changed. */ > + if (!m_slpg->vertices[node_i].succ) > + m_partitions[partition_i].layout = 0; > > /* Loads are the only thing generating permutes. */ > + slp_tree node = m_vertices[node_i].node; > if (!SLP_TREE_LOAD_PERMUTATION (node).exists ()) > continue; > > /* If splitting out a SLP_TREE_LANE_PERMUTATION can make the > ! node unpermuted, record a layout that reverses this permute. */ > ! gcc_assert (m_partitions[partition_i].layout == 0); > stmt_vec_info dr_stmt = SLP_TREE_REPRESENTATIVE (node); > if (!STMT_VINFO_GROUPED_ACCESS (dr_stmt)) > continue; > *************** > *** 3708,3720 **** > if (idx - SLP_TREE_LOAD_PERMUTATION (node)[0] != j) > any_permute = true; > } > - /* If there's no permute no need to split one out. */ > - if (!any_permute) > - continue; > /* If the span doesn't match we'd disrupt VF computation, avoid > that for now. */ > if (imax - imin + 1 != SLP_TREE_LANES (node)) > continue; > > /* For now only handle true permutes, like > vect_attempt_slp_rearrange_stmts did. This allows us to be lazy > --- 4425,4442 ---- > if (idx - SLP_TREE_LOAD_PERMUTATION (node)[0] != j) > any_permute = true; > } > /* If the span doesn't match we'd disrupt VF computation, avoid > that for now. */ > if (imax - imin + 1 != SLP_TREE_LANES (node)) > continue; > + /* If there's no permute no need to split one out. In this case > + we can consider turning the load into a permuted load, if that > + turns out to be cheaper than alternatives. */ > + if (!any_permute) > + { > + m_partitions[partition_i].layout = -1; > + continue; > + } > > /* For now only handle true permutes, like > vect_attempt_slp_rearrange_stmts did. This allows us to be lazy > *************** > *** 3735,3985 **** > perm.safe_grow (SLP_TREE_LANES (node), true); > for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j) > perm[j] = SLP_TREE_LOAD_PERMUTATION (node)[j] - imin; > ! perms.safe_push (perm); > ! vertices[idx].perm_in = perms.length () - 1; > ! vertices[idx].perm_out = perms.length () - 1; > } > > ! /* In addition to the above we have to mark outgoing permutes facing > ! 
non-reduction graph entries that are not represented as to be > ! materialized. */ > ! for (slp_instance instance : vinfo->slp_instances) > if (SLP_INSTANCE_KIND (instance) == slp_inst_kind_ctor) > { > ! /* Just setting perm_out isn't enough for the propagation to > ! pick this up. */ > ! vertices[SLP_INSTANCE_TREE (instance)->vertex].perm_in = 0; > ! vertices[SLP_INSTANCE_TREE (instance)->vertex].perm_out = 0; > } > > ! /* Propagate permutes along the graph and compute materialization points. */ > ! bool changed; > ! bool do_materialization = false; > ! unsigned iteration = 0; > ! do > ! { > ! changed = false; > ! ++iteration; > ! > ! if (dump_enabled_p ()) > ! dump_printf_loc (MSG_NOTE, vect_location, > ! "SLP optimize iteration %d\n", iteration); > > ! for (i = vertices.length (); i > 0 ; --i) > { > ! int idx = ipo[i-1]; > ! slp_tree node = vertices[idx].node; > > ! /* Handle externals and constants optimistically throughout the > ! iteration. */ > ! if (SLP_TREE_DEF_TYPE (node) == vect_external_def > ! || SLP_TREE_DEF_TYPE (node) == vect_constant_def) > ! continue; > > ! /* We still eventually have failed backedge SLP nodes in the > ! graph, those are only cancelled when analyzing operations. > ! Simply treat them as transparent ops, propagating permutes > ! through them. */ > ! if (SLP_TREE_DEF_TYPE (node) == vect_internal_def) > ! { > ! /* We do not handle stores with a permutation, so all > ! incoming permutes must have been materialized. */ > ! stmt_vec_info rep = SLP_TREE_REPRESENTATIVE (node); > ! if (STMT_VINFO_DATA_REF (rep) > ! && DR_IS_WRITE (STMT_VINFO_DATA_REF (rep))) > ! { > ! /* ??? We're forcing materialization in place > ! of the child here, we'd need special handling > ! in materialization to leave perm_in -1 here. */ > ! vertices[idx].perm_in = 0; > ! vertices[idx].perm_out = 0; > ! } > ! /* We cannot move a permute across an operation that is > ! not independent on lanes. Note this is an explicit > ! negative list since that's much shorter than the respective > ! positive one but it's critical to keep maintaining it. */ > ! if (is_gimple_call (STMT_VINFO_STMT (rep))) > ! switch (gimple_call_combined_fn (STMT_VINFO_STMT (rep))) > ! { > ! case CFN_COMPLEX_ADD_ROT90: > ! case CFN_COMPLEX_ADD_ROT270: > ! case CFN_COMPLEX_MUL: > ! case CFN_COMPLEX_MUL_CONJ: > ! case CFN_VEC_ADDSUB: > ! case CFN_VEC_FMADDSUB: > ! case CFN_VEC_FMSUBADD: > ! vertices[idx].perm_in = 0; > ! vertices[idx].perm_out = 0; > ! default:; > ! } > } > > ! if (!slpg->vertices[idx].succ) > ! /* Pick up pre-computed leaf values. */ > ! ; > ! else > { > ! bool any_succ_perm_out_m1 = false; > ! int perm_in = vertices[idx].perm_in; > ! for (graph_edge *succ = slpg->vertices[idx].succ; > ! succ; succ = succ->succ_next) > { > ! int succ_idx = succ->dest; > ! int succ_perm = vertices[succ_idx].perm_out; > ! /* Handle unvisited (and constant) nodes optimistically. */ > ! /* ??? But for constants once we want to handle > ! non-bijective permutes we have to verify the permute, > ! when unifying lanes, will not unify different constants. > ! For example see gcc.dg/vect/bb-slp-14.c for a case > ! that would break. */ > ! if (succ_perm == -1) > ! { > ! /* When we handled a non-leaf optimistically, note > ! that so we can adjust its outgoing permute below. */ > ! slp_tree succ_node = vertices[succ_idx].node; > ! if (SLP_TREE_DEF_TYPE (succ_node) != vect_external_def > ! && SLP_TREE_DEF_TYPE (succ_node) != vect_constant_def) > ! any_succ_perm_out_m1 = true; > ! continue; > ! } > ! if (perm_in == -1) > ! 
> !                    perm_in = succ_perm;
> !                  else if (succ_perm == 0
> !                           || !vect_slp_perms_eq (perms, perm_in, succ_perm))
>                     {
> !                      perm_in = 0;
> !                      break;
>                     }
>                 }
> 
> !              /* Adjust any incoming permutes we treated optimistically.  */
> !              if (perm_in != -1 && any_succ_perm_out_m1)
>                 {
> !                  for (graph_edge *succ = slpg->vertices[idx].succ;
> !                       succ; succ = succ->succ_next)
>                     {
> !                      slp_tree succ_node = vertices[succ->dest].node;
> !                      if (vertices[succ->dest].perm_out == -1
> !                          && SLP_TREE_DEF_TYPE (succ_node) != vect_external_def
> !                          && SLP_TREE_DEF_TYPE (succ_node) != vect_constant_def)
> !                        {
> !                          vertices[succ->dest].perm_out = perm_in;
> !                          /* And ensure this propagates.  */
> !                          if (vertices[succ->dest].perm_in == -1)
> !                            vertices[succ->dest].perm_in = perm_in;
> !                        }
>                     }
> !                  changed = true;
> !                }
> 
> !              if (!vect_slp_perms_eq (perms, perm_in,
> !                                      vertices[idx].perm_in))
> !                {
> !                  /* Make sure we eventually converge.  */
> !                  gcc_checking_assert (vertices[idx].perm_in == -1
> !                                       || perm_in == 0);
> !                  vertices[idx].perm_in = perm_in;
> !
> !                  /* While we can handle VEC_PERM nodes as transparent
> !                     pass-through they can be a cheap materialization
> !                     point as well.  In addition they can act as source
> !                     of a random permutation as well.
> !                     The following ensures that former materialization
> !                     points that now have zero incoming permutes no
> !                     longer appear as such and that former "any" permutes
> !                     get pass-through.  We keep VEC_PERM nodes optimistic
> !                     as "any" outgoing permute though.  */
> !                  if (vertices[idx].perm_out != 0
> !                      && SLP_TREE_CODE (node) != VEC_PERM_EXPR)
> !                    vertices[idx].perm_out = perm_in;
> !                  changed = true;
> !                }
>             }
> 
> !          /* Elide pruning at materialization points in the first
> !             iteration phase.  */
> !          if (!do_materialization)
> !            continue;
> 
> !          int perm = vertices[idx].perm_out;
> !          if (perm == 0 || perm == -1)
>             continue;
> 
> !          /* Decide on permute materialization.  Look whether there's
> !             a use (pred) edge that is permuted differently than us.
> !             In that case mark ourselves so the permutation is applied.  */
> !          bool all_preds_permuted = slpg->vertices[idx].pred != NULL;
> !          if (all_preds_permuted)
> !            for (graph_edge *pred = slpg->vertices[idx].pred;
> !                 pred; pred = pred->pred_next)
> !              {
> !                int pred_perm = vertices[pred->src].perm_in;
> !                gcc_checking_assert (pred_perm != -1);
> !                if (!vect_slp_perms_eq (perms, perm, pred_perm))
> !                  {
> !                    all_preds_permuted = false;
> !                    break;
> !                  }
> !              }
> !          if (!all_preds_permuted)
> !            {
> !              vertices[idx].perm_out = 0;
> !              changed = true;
>             }
> -       }
> 
> !      /* If the initial propagation converged, switch on materialization
> !         and re-propagate.  */
> !      if (!changed && !do_materialization)
>         {
> !          do_materialization = true;
> !          changed = true;
>         }
>     }
> !  while (changed);
> !  statistics_histogram_event (cfun, "SLP optimize perm iterations", iteration);
> 
> !  /* Materialize.  */
> !  for (i = 0; i < vertices.length (); ++i)
>     {
> !      int perm_in = vertices[i].perm_in;
> !      slp_tree node = vertices[i].node;
> 
> !      /* First permute invariant/external original successors, we handle
> !         those optimistically during propagation and duplicate them if
> !         they are used with different permutations.  */
> !      unsigned j;
> !      slp_tree child;
> !      if (perm_in > 0)
> !        FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), j, child)
> !          {
> !            if (!child
> !                || (SLP_TREE_DEF_TYPE (child) != vect_constant_def
> !                    && SLP_TREE_DEF_TYPE (child) != vect_external_def))
> !              continue;
> 
> !            /* If the vector is uniform there's nothing to do.  */
> !            if (vect_slp_tree_uniform_p (child))
> !              continue;
> 
> !            /* We can end up sharing some externals via two_operator
> !               handling.  Be prepared to unshare those.  */
> !            if (child->refcnt != 1)
> !              {
> !                gcc_assert (slpg->vertices[child->vertex].pred->pred_next);
> !                SLP_TREE_CHILDREN (node)[j] = child
> !                  = vect_create_new_slp_node
> !                      (SLP_TREE_SCALAR_OPS (child).copy ());
> !              }
> !            vect_slp_permute (perms[perm_in],
> !                              SLP_TREE_SCALAR_OPS (child), true);
> !          }
> 
>        if (SLP_TREE_CODE (node) == VEC_PERM_EXPR)
>          {
>            /* Apply the common permutes to the input vectors.  */
> !          if (perm_in > 0)
> !            {
>               /* If the node is already a permute node we can apply
>                  the permutation to the lane selection, effectively
>                  materializing it on the incoming vectors.  */
> --- 4457,5092 ----
>        perm.safe_grow (SLP_TREE_LANES (node), true);
>        for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j)
>          perm[j] = SLP_TREE_LOAD_PERMUTATION (node)[j] - imin;
> !      bool existed;
> !      int &layout_i = layout_ids.get_or_insert (perm, &existed);
> !      if (existed)
> !        perm.release ();
> !      else
> !        {
> !          layout_i = m_perms.length ();
> !          m_perms.safe_push (perm);
> !        }
> !      m_partitions[partition_i].layout = layout_i;
>      }
> 
> !  /* Initially assume that every layout is possible and has zero cost
> !     in every partition.  */
> !  m_partition_layout_costs.safe_grow_cleared (m_partitions.length ()
> !                                              * m_perms.length ());
> !
> !  /* We have to mark outgoing permutes facing non-reduction graph
> !     entries that are not represented as to be materialized.  */
> !  for (slp_instance instance : m_vinfo->slp_instances)
>     if (SLP_INSTANCE_KIND (instance) == slp_inst_kind_ctor)
>       {
> !       unsigned int node_i = SLP_INSTANCE_TREE (instance)->vertex;
> !       m_partitions[m_vertices[node_i].partition].layout = 0;
>       }
> 
> !  /* Check which layouts each node and partition can handle.  Calculate the
> !     weights associated with inserting layout changes on edges.  */
> !  m_optimize_size = true;
> !  for (unsigned int node_i : m_partitioned_nodes)
> !    {
> !      auto &vertex = m_vertices[node_i];
> !      unsigned int partition_i = vertex.partition;
> !      auto &partition = m_partitions[partition_i];
> !      slp_tree node = vertex.node;
> !
> !      if (stmt_vec_info rep = SLP_TREE_REPRESENTATIVE (node))
> !        {
> !          basic_block bb = gimple_bb (vect_orig_stmt (rep)->stmt);
> !          if (optimize_bb_for_speed_p (bb))
> !            m_optimize_size = false;
> !
> !          vertex.weight = vect_slp_node_weight (node);
> !
> !          /* We do not handle stores with a permutation, so all
> !             incoming permutes must have been materialized.  */
> !          if (STMT_VINFO_DATA_REF (rep)
> !              && DR_IS_WRITE (STMT_VINFO_DATA_REF (rep)))
> !            /* ??? We're forcing materialization in place
> !               of the child here, we'd need special handling
> !               in materialization to leave layout -1 here.  */
> !            partition.layout = 0;
> !
> !          /* We cannot change the layout of an operation that is
> !             not independent on lanes.  Note this is an explicit
> !             negative list since that's much shorter than the respective
> !             positive one but it's critical to keep maintaining it.  */
> !          if (is_gimple_call (STMT_VINFO_STMT (rep)))
> !            switch (gimple_call_combined_fn (STMT_VINFO_STMT (rep)))
> !              {
> !              case CFN_COMPLEX_ADD_ROT90:
> !              case CFN_COMPLEX_ADD_ROT270:
> !              case CFN_COMPLEX_MUL:
> !              case CFN_COMPLEX_MUL_CONJ:
> !              case CFN_VEC_ADDSUB:
> !              case CFN_VEC_FMADDSUB:
> !              case CFN_VEC_FMSUBADD:
> !                partition.layout = 0;
> !              default:;
> !              }
> !        }
> 
> !      if (partition.layout == 0)
>         {
> !          /* The node must stay in its current order.  */
> !          for (unsigned int layout_i = 1; layout_i < m_perms.length ();
> !               ++layout_i)
> !            partition_layout_costs (partition_i, layout_i).mark_impossible ();
> !        }
> !      else if (partition_can_absorb_permutes (partition_i))
> !        /* Calculate the cost of converting NODE to use each permute.
> !           If PARTITION.LAYOUT > 0 then that layout's permute will be free.  */
> !        for (unsigned int layout_i = 0; layout_i < m_perms.length ();
> !             ++layout_i)
> !          {
> !            auto &layout_costs = partition_layout_costs (partition_i,
> !                                                         layout_i);
> !            int factor = node_absorption_cost (node, layout_i);
> !            if (factor < 0)
> !              {
> !                layout_costs.mark_impossible ();
> !                continue;
> !              }
> !            gcc_assert (layout_costs.is_possible ());
> !            layout_costs.in_cost = { vertex.weight * factor, m_optimize_size };
> !          }
> 
> !      auto process_edge = [&](graph_edge *ud, unsigned int node2_i,
> !                              unsigned int partition2_i)
> !        {
> !          /* Count the number of edges from earlier partitions and the number
> !             of edges to later partitions.  */
> !          if (partition2_i < partition_i)
> !            partition.in_degree += 1;
> !          else
> !            partition.out_degree += 1;
> 
> !          /* If the current node uses the result of NODE2_I, accumulate
> !             the effects of that.  */
> !          if (ud->src == int (node_i))
> !            {
> !              m_vertices[node2_i].out_weight += vertex.weight;
> !              m_vertices[node2_i].out_degree += 1;
>             }
> +        };
> +      for_each_partition_edge (node_i, process_edge);
> +    }
> 
> !  /* Layout 0 represents the original order and layout LAYOUT_I represents
> !     the effect of reversing m_perms[LAYOUT_I].  If a predecessor partition
> !     does not support layout LAYOUT_I and if it also isn't possible to go
> !     from layout 0 to layout LAYOUT_I, then it is very difficult for a
> !     successor partition to support layout LAYOUT_I.  (The only alternative
> !     would be to go to layout LAYOUT_I from a different nonzero layout, but
> !     that's a niche case and not really worth supporting.)  */
> !  for (unsigned int layout_i = 1; layout_i < m_perms.length (); ++layout_i)
> !    {
> !      if (layout_is_reversible (layout_i))
> !        continue;
> !
> !      bool changed;
> !      do
> !        {
> !          changed = false;
> !          for (unsigned int node_i : m_partitioned_nodes)
>             {
> !              unsigned int partition_i = m_vertices[node_i].partition;
> !              if (partition_layout_costs (partition_i,
> !                                          layout_i).is_possible ())
> !                continue;
> !              for (graph_edge *pred = m_slpg->vertices[node_i].pred;
> !                   pred; pred = pred->pred_next)
>                 {
> !                  unsigned int partition2_i = m_vertices[pred->src].partition;
> !                  auto &costs = partition_layout_costs (partition2_i,
> !                                                        layout_i);
> !                  if (costs.is_possible ())
>                     {
> !                      costs.mark_impossible ();
> !                      changed = true;
>                     }
>                 }
> +            }
> +        }
> +      while (changed);
> +    }
> + }
> + 
> + /* Return the cost of switching between layout LAYOUT1_I (at node NODE1_I)
> +    and layout LAYOUT2_I on cross-partition use-to-def edge UD.  */
> + 
> + slpg_layout_cost
> + vect_optimize_slp_pass::edge_layout_cost (graph_edge *ud,
> +                                           unsigned int node1_i,
> +                                           unsigned int layout1_i,
> +                                           unsigned int layout2_i)
> + {
> +   auto def_layout_i = ud->dest == int (node1_i) ? layout1_i : layout2_i;
> +   auto use_layout_i = ud->dest == int (node1_i) ? layout2_i : layout1_i;
> +   auto factor = layout_cost (def_layout_i, use_layout_i);
> + 
> +   /* We have a choice of putting the layout change at the site of the
> +      definition or at the site of the use.  Prefer the former when
> +      optimizing for size or when the execution frequency of the
> +      definition is no greater than the combined execution frequencies of
> +      the uses.  When putting the layout change at the site of the
> +      definition, divvy up the cost among all consumers.
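> + 
> +      For example (with purely illustrative numbers): if the definition
> +      has weight 8 and FACTOR is 1, and its uses have a combined weight
> +      of 20 spread over 4 outgoing edges, each use-def edge is charged
> +      8/4 = 2.  If the uses instead had a combined weight of 6, the
> +      definition would be the more frequently executed site, and each
> +      edge would instead be charged the full weight of its use.  */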
> +   slpg_vertex &def_vertex = m_vertices[ud->dest];
> +   slpg_vertex &use_vertex = m_vertices[ud->src];
> +   if (m_optimize_size || def_vertex.weight <= def_vertex.out_weight)
> +     {
> +       slpg_layout_cost cost = { def_vertex.weight * factor, m_optimize_size };
> +       cost.split (def_vertex.out_degree);
> +       return cost;
> +     }
> +   return { use_vertex.weight * factor, m_optimize_size };
> + }
> + 
> + /* UD represents a use-def link between FROM_NODE_I and a node in a later
> +    partition; FROM_NODE_I could be the definition node or the use node.
> +    The node at the other end of the link wants to use layout TO_LAYOUT_I;
> +    return the cost of any necessary fix-ups on edge UD.
> + 
> +    At this point, FROM_NODE_I's partition has chosen the cheapest
> +    layout based on the information available so far, but this choice
> +    is only provisional.  */
> + 
> + slpg_layout_cost
> + vect_optimize_slp_pass::forward_cost (graph_edge *ud,
> +                                       unsigned int from_node_i,
> +                                       unsigned int to_layout_i)
> + {
> +   slpg_vertex &from_vertex = m_vertices[from_node_i];
> +   unsigned int from_partition_i = from_vertex.partition;
> +   slpg_partition_info &from_partition = m_partitions[from_partition_i];
> +   gcc_assert (from_partition.layout >= 0);
> + 
> +   /* First calculate the cost on the assumption that FROM_PARTITION sticks
> +      with its current layout preference.  */
> +   slpg_layout_cost cost
> +     = partition_layout_costs (from_partition_i,
> +                               from_partition.layout).in_cost;
> +   cost.split (from_partition.out_degree);
> +   cost.add_serial_cost (edge_layout_cost (ud, from_node_i,
> +                                           from_partition.layout, to_layout_i));
> + 
> +   /* Take the minimum of that cost and the cost that applies if
> +      FROM_PARTITION instead switches to TO_LAYOUT_I.  */
> +   auto &direct_layout_costs = partition_layout_costs (from_partition_i,
> +                                                       to_layout_i);
> +   if (direct_layout_costs.is_possible ())
> +     {
> +       slpg_layout_cost direct_cost = direct_layout_costs.in_cost;
> +       direct_cost.split (from_partition.out_degree);
> +       if (direct_cost.is_better_than (cost, m_optimize_size))
> +         cost = direct_cost;
> +     }
> + 
> +   /* If FROM_PARTITION is likely to be able to absorb the required outgoing
> +      permute, optimistically assume that that is what will happen.  */
> +   if (partition_can_absorb_permutes (from_partition_i))
> +     {
> +       int factor = node_absorption_cost (from_vertex.node, to_layout_i);
> +       if (factor >= 0)
> +         {
> +           sreal weight = from_vertex.weight * factor;
> +           slpg_layout_cost absorb_cost = { weight, m_optimize_size };
> +           if (absorb_cost.is_better_than (cost, m_optimize_size))
> +             cost = absorb_cost;
> +         }
> +     }
> + 
> +   return cost;
> + }
> + 
> + /* UD represents a use-def link between TO_NODE_I and a node in an earlier
> +    partition; TO_NODE_I could be the definition node or the use node.
> +    The node at the other end of the link wants to use layout FROM_LAYOUT_I;
> +    return the cost of any necessary fix-ups on edge UD.
> + 
> +    At this point, TO_NODE_I's partition has a fixed choice of layout.
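> + 
> +    For example (again with illustrative numbers): if TO_NODE_I's
> +    partition settled on layout 2 and accumulated an output cost that
> +    is shared between 4 incoming edges, edge UD is charged a quarter
> +    of that output cost plus the cost of converting between layout 2
> +    and FROM_LAYOUT_I on the edge itself, unless absorbing the permute
> +    as described below turns out to be cheaper.  */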
> + 
> + slpg_layout_cost
> + vect_optimize_slp_pass::backward_cost (graph_edge *ud,
> +                                        unsigned int to_node_i,
> +                                        unsigned int from_layout_i)
> + {
> +   slpg_vertex &to_vertex = m_vertices[to_node_i];
> +   unsigned int to_partition_i = to_vertex.partition;
> +   slpg_partition_info &to_partition = m_partitions[to_partition_i];
> +   gcc_assert (to_partition.layout >= 0);
> + 
> +   slpg_layout_cost cost
> +     = partition_layout_costs (to_partition_i,
> +                               to_partition.layout).out_cost;
> +   cost.split (to_partition.in_degree);
> +   cost.add_serial_cost (edge_layout_cost (ud, to_node_i, to_partition.layout,
> +                                           from_layout_i));
> + 
> +   /* If TO_PARTITION is likely to be able to absorb the required input permute,
> +      optimistically assume that that is what will happen.  */
> +   if (partition_can_absorb_permutes (to_partition_i))
> +     {
> +       int factor = node_absorption_cost (to_vertex.node, from_layout_i);
> +       if (factor >= 0)
> +         {
> +           sreal weight = to_vertex.weight * factor;
> +           slpg_layout_cost absorb_cost = { weight, m_optimize_size };
> +           if (absorb_cost.is_better_than (cost, m_optimize_size))
> +             cost = absorb_cost;
> +         }
> +     }
> + 
> +   return cost;
> + }
> 
> ! /* Make a forward pass through the partitions, accumulating input costs.
> !    Make a tentative (provisional) choice of layout for each partition.  */
> !
> ! void
> ! vect_optimize_slp_pass::forward_pass ()
> ! {
> !   for (unsigned int partition_i = 0; partition_i < m_partitions.length ();
> !        ++partition_i)
> !     {
> !       auto &partition = m_partitions[partition_i];
> !       unsigned int min_layout_i = 0;
> !       slpg_layout_cost *min_layout_cost = nullptr;
> !       for (unsigned int layout_i = 0; layout_i < m_perms.length (); ++layout_i)
> !         {
> !           auto &layout_costs = partition_layout_costs (partition_i, layout_i);
> !           if (!layout_costs.is_possible ())
> !             continue;
> !
> !           /* Accumulate the costs from predecessor partitions.  */
> !           for (unsigned int order_i = partition.node_begin;
> !                order_i < partition.node_end; ++order_i)
> !             {
> !               unsigned int node_i = m_partitioned_nodes[order_i];
> !               auto add_cost = [&](graph_edge *ud, unsigned int from_node_i,
> !                                   unsigned int from_partition_i)
>                   {
> !                   if (from_partition_i < partition_i)
>                       {
> !                       auto cost = forward_cost (ud, from_node_i, layout_i);
> !                       layout_costs.in_cost.add_parallel_cost (cost);
>                       }
> !                 };
> !               for_each_partition_edge (node_i, add_cost);
> !             }
> 
> !           /* Record the layout with the lowest cost.  Prefer layout 0 in
> !              the event of a tie between it and another layout.  */
> !           if (!min_layout_cost
> !               || layout_costs.in_cost.is_better_than (*min_layout_cost,
> !                                                       m_optimize_size))
> !             {
> !               min_layout_i = layout_i;
> !               min_layout_cost = &layout_costs.in_cost;
>             }
> +         }
> +       gcc_assert (min_layout_cost);
> +       partition.layout = min_layout_i;
> +     }
> + }
> 
> ! /* Make a backward pass through the partitions, accumulating output costs.
> !    Make a final choice of layout for each partition.  */
> 
> ! void
> ! vect_optimize_slp_pass::backward_pass ()
> ! {
> !   for (unsigned int partition_i = m_partitions.length (); partition_i-- > 0;)
> !     {
> !       auto &partition = m_partitions[partition_i];
> !       unsigned int min_layout_i = 0;
> !       slpg_layout_cost *min_layout_cost = nullptr;
> !       unsigned int seen_min_layout_i = 0;
> !       slpg_layout_cost *seen_min_layout_cost = nullptr;
> !       for (unsigned int layout_i = 0; layout_i < m_perms.length (); ++layout_i)
> !         {
> !           auto &layout_costs = partition_layout_costs (partition_i, layout_i);
> !           if (!layout_costs.is_possible ())
>               continue;
> 
> !           /* Accumulate the costs from successor partitions.  Record whether
> !              any of them actually uses this layout.  */
> !           bool seen = false;
> !           for (unsigned int order_i = partition.node_begin;
> !                order_i < partition.node_end; ++order_i)
> !             {
> !               unsigned int node_i = m_partitioned_nodes[order_i];
> !               auto add_cost = [&](graph_edge *ud, unsigned int to_node_i,
> !                                   unsigned int to_partition_i)
> !                 {
> !                   if (to_partition_i > partition_i)
> !                     {
> !                       auto cost = backward_cost (ud, to_node_i, layout_i);
> !                       layout_costs.out_cost.add_parallel_cost (cost);
> !                       seen |= (m_partitions[to_partition_i].layout
> !                                == int (layout_i));
> !                     }
> !                 };
> !               for_each_partition_edge (node_i, add_cost);
>             }
> 
> !           /* Locally combine the costs from the forward and backward passes.
> !              (This combined cost is not passed on, since that would lead
> !              to double counting.)  */
> !           layout_costs.cost = layout_costs.in_cost;
> !           layout_costs.cost.add_serial_cost (layout_costs.out_cost);
> !
> !           /* Record the layout with the lowest cost.  Prefer layout 0 in
> !              the event of a tie between it and another layout.  */
> !           if (!min_layout_cost
> !               || layout_costs.cost.is_better_than (*min_layout_cost,
> !                                                    m_optimize_size))
> !             {
> !               min_layout_i = layout_i;
> !               min_layout_cost = &layout_costs.cost;
> !             }
> !           /* Likewise, but restrict the choice of layouts to those that
> !              are used by successor partitions.  */
> !           if (seen
> !               && (!seen_min_layout_cost
> !                   || layout_costs.cost.is_better_than (*seen_min_layout_cost,
> !                                                        m_optimize_size)))
> !             {
> !               seen_min_layout_i = layout_i;
> !               seen_min_layout_cost = &layout_costs.cost;
> !             }
> !         }
> !       gcc_assert (min_layout_cost);
> !
> !       /* If the partition can absorb permutes, prefer a layout that has
> !          actually been used.  */
> !       if (seen_min_layout_cost
> !           && partition_can_absorb_permutes (partition_i))
> !         partition.layout = seen_min_layout_i;
> !       else
> !         partition.layout = min_layout_i;
> !     }
> ! }
> !
> ! /* Return a node that applies layout TO_LAYOUT_I to the original form of NODE.
> !    NODE already has the layout that was selected for its partition.  */
> !
> ! slp_tree
> ! vect_optimize_slp_pass::get_result_with_layout (slp_tree node,
> !                                                 unsigned int to_layout_i)
> ! {
> !   unsigned int result_i = node->vertex * m_perms.length () + to_layout_i;
> !   slp_tree result = m_node_layouts[result_i];
> !   if (result)
> !     return result;
> !
> !   if (SLP_TREE_DEF_TYPE (node) == vect_constant_def
> !       || SLP_TREE_DEF_TYPE (node) == vect_external_def)
> !     {
> !       /* If the vector is uniform or unchanged, there's nothing to do.  */
> !       if (to_layout_i == 0 || vect_slp_tree_uniform_p (node))
> !         result = node;
> !       else
>           {
> !           auto scalar_ops = SLP_TREE_SCALAR_OPS (node).copy ();
> !           result = vect_create_new_slp_node (scalar_ops);
> !           vect_slp_permute (m_perms[to_layout_i], scalar_ops, true);
>           }
>       }
> !   else
> !     {
> !       unsigned int partition_i = m_vertices[node->vertex].partition;
> !       unsigned int from_layout_i = m_partitions[partition_i].layout;
> !       if (from_layout_i == to_layout_i)
> !         return node;
> !
> !       /* FIXME: if the result of a pre-existing VEC_PERM_EXPR is used
> !          twice, with two different layouts, then we could create
> !          parallel VEC_PERM_EXPRs rather than add a serial one.  */
> !
> !       /* FIXME: When calculating the costs, we considered the possibility
> !          of duplicating a VEC_PERM_EXPR at each use site, rather than
> !          having one at the definition site.  Do that here if it turns
> !          out to be cheaper.  */
> !
> !       if (dump_enabled_p ())
> !         dump_printf_loc (MSG_NOTE, vect_location,
"inserting permute node in place of %p\n", > ! node); > ! > ! unsigned int num_lanes = SLP_TREE_LANES (node); > ! result = vect_create_new_slp_node (1, VEC_PERM_EXPR); > ! if (SLP_TREE_SCALAR_STMTS (node).length ()) > ! { > ! auto &stmts = SLP_TREE_SCALAR_STMTS (result); > ! stmts.safe_splice (SLP_TREE_SCALAR_STMTS (result)); > ! if (from_layout_i != 0) > ! vect_slp_permute (m_perms[from_layout_i], stmts, false); > ! if (to_layout_i != 0) > ! vect_slp_permute (m_perms[to_layout_i], stmts, true); > ! } > ! SLP_TREE_REPRESENTATIVE (result) = SLP_TREE_REPRESENTATIVE (node); > ! SLP_TREE_LANES (result) = num_lanes; > ! SLP_TREE_VECTYPE (result) = SLP_TREE_VECTYPE (node); > ! result->vertex = -1; > ! > ! auto &lane_perm = SLP_TREE_LANE_PERMUTATION (result); > ! lane_perm.create (num_lanes); > ! for (unsigned j = 0; j < num_lanes; ++j) > ! lane_perm.quick_push ({ 0, j }); > ! if (from_layout_i != 0) > ! vect_slp_permute (m_perms[from_layout_i], lane_perm, false); > ! if (to_layout_i != 0) > ! vect_slp_permute (m_perms[to_layout_i], lane_perm, true); > ! > ! node->refcnt++; > ! SLP_TREE_CHILDREN (result).safe_push (node); > ! } > ! m_node_layouts[result_i] = result; > ! return result; > ! } > ! > ! /* Print the partition graph and layout information to the dump file. */ > > ! void > ! vect_optimize_slp_pass::dump () > ! { > ! dump_printf_loc (MSG_NOTE, vect_location, > ! "SLP optimize permutes:\n"); > ! for (unsigned int layout_i = 1; layout_i < m_perms.length (); ++layout_i) > { > ! dump_printf_loc (MSG_NOTE, vect_location, " %d: { ", layout_i); > ! const char *sep = ""; > ! for (unsigned int idx : m_perms[layout_i]) > ! { > ! dump_printf (MSG_NOTE, "%s%d", sep, idx); > ! sep = ", "; > ! } > ! dump_printf (MSG_NOTE, " }\n"); > ! } > ! dump_printf_loc (MSG_NOTE, vect_location, > ! "SLP optimize partitions:\n"); > ! for (unsigned int partition_i = 0; partition_i < m_partitions.length (); > ! ++partition_i) > ! { > ! auto &partition = m_partitions[partition_i]; > ! dump_printf_loc (MSG_NOTE, vect_location, " -------------\n"); > ! dump_printf_loc (MSG_NOTE, vect_location, > ! " partition %d (layout %d):\n", > ! partition_i, partition.layout); > ! dump_printf_loc (MSG_NOTE, vect_location, " nodes:\n"); > ! for (unsigned int order_i = partition.node_begin; > ! order_i < partition.node_end; ++order_i) > ! { > ! unsigned int node_i = m_partitioned_nodes[order_i]; > ! slp_tree node = m_vertices[node_i].node; > ! dump_printf_loc (MSG_NOTE, vect_location, " - %p:\n", > ! (void *) m_vertices[node_i].node); > ! dump_printf_loc (MSG_NOTE, vect_location, > ! " weight: %f\n", > ! m_vertices[node_i].weight.to_double ()); > ! if (m_vertices[node_i].out_degree) > ! dump_printf_loc (MSG_NOTE, vect_location, > ! " out weight: %f (degree %d)\n", > ! m_vertices[node_i].out_weight.to_double (), > ! m_vertices[node_i].out_degree); > ! if (SLP_TREE_CODE (node) == VEC_PERM_EXPR) > ! dump_printf_loc (MSG_NOTE, vect_location, > ! " op: VEC_PERM_EXPR\n"); > ! else if (auto rep = SLP_TREE_REPRESENTATIVE (node)) > ! dump_printf_loc (MSG_NOTE, vect_location, > ! " op template: %G", rep->stmt); > ! } > ! dump_printf_loc (MSG_NOTE, vect_location, " edges:\n"); > ! for (unsigned int order_i = partition.node_begin; > ! order_i < partition.node_end; ++order_i) > ! { > ! unsigned int node_i = m_partitioned_nodes[order_i]; > ! auto print_edge = [&](graph_edge *, unsigned int node2_i, > ! unsigned int partition2_i) > ! { > ! if (partition2_i < partition_i) > ! dump_printf_loc (MSG_NOTE, vect_location, > ! " - %p [%d] --> %p\n", > ! 
> !                                  m_vertices[node2_i].node, partition2_i,
> !                                  m_vertices[node_i].node);
> !               else
> !                 dump_printf_loc (MSG_NOTE, vect_location,
> !                                  "      - %p --> [%d] %p\n",
> !                                  m_vertices[node_i].node, partition2_i,
> !                                  m_vertices[node2_i].node);
> !             };
> !           for_each_partition_edge (node_i, print_edge);
> !         }
> 
> !       for (unsigned int layout_i = 0; layout_i < m_perms.length (); ++layout_i)
> !         {
> !           auto &layout_costs = partition_layout_costs (partition_i, layout_i);
> !           if (layout_costs.is_possible ())
> !             {
> !               dump_printf_loc (MSG_NOTE, vect_location,
> !                                "    layout %d:%s\n", layout_i,
> !                                partition.layout == int (layout_i)
> !                                ? " (*)" : "");
> ! #define TEMPLATE "{depth: %f, total: %f}"
> !               dump_printf_loc (MSG_NOTE, vect_location,
> !                                "        " TEMPLATE "\n",
> !                                layout_costs.in_cost.depth.to_double (),
> !                                layout_costs.in_cost.total.to_double ());
> !               dump_printf_loc (MSG_NOTE, vect_location,
> !                                "      + " TEMPLATE "\n",
> !                                layout_costs.out_cost.depth.to_double (),
> !                                layout_costs.out_cost.total.to_double ());
> !               dump_printf_loc (MSG_NOTE, vect_location,
> !                                "      = " TEMPLATE "\n",
> !                                layout_costs.cost.depth.to_double (),
> !                                layout_costs.cost.total.to_double ());
> ! #undef TEMPLATE
> !             }
> !           else
> !             dump_printf_loc (MSG_NOTE, vect_location,
> !                              "    layout %d: rejected\n", layout_i);
> !         }
> !     }
> ! }
> 
> ! /* Apply the chosen vector layouts to the SLP graph.  */
> 
> ! void
> ! vect_optimize_slp_pass::materialize ()
> ! {
> !   m_partition_layout_costs.release ();
> !   m_node_layouts.safe_grow_cleared (m_vertices.length () * m_perms.length ());
> 
> +   for (unsigned int node_i : m_partitioned_nodes)
> +     {
> +       /* Rearrange the scalar statements to match the chosen layout.  */
> +       slp_tree node = m_vertices[node_i].node;
> +       int layout_i = m_partitions[m_vertices[node_i].partition].layout;
> +       gcc_assert (layout_i >= 0);
> +       if (layout_i > 0)
> +         vect_slp_permute (m_perms[layout_i],
> +                           SLP_TREE_SCALAR_STMTS (node), true);
> + 
> +       /* Update load and lane permutes.  */
>        if (SLP_TREE_CODE (node) == VEC_PERM_EXPR)
>          {
>            /* Apply the common permutes to the input vectors.  */
> !           /* FIXME: check whether the target accepts this folding and
> !              handle it like a normal node if not.  That would include
> !              adjusting the follow-up loop below.  */
> !           unsigned int opno;
> !           slp_tree child;
> !           FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), opno, child)
> !             {
> !               unsigned int partition2_i = m_vertices[child->vertex].partition;
> !               int layout2_i = m_partitions[partition2_i].layout;
> !               if (layout2_i == 0)
> !                 continue;
> !
>                /* If the node is already a permute node we can apply
>                   the permutation to the lane selection, effectively
>                   materializing it on the incoming vectors.  */
> ***************
> *** 3987,4141 ****
>            dump_printf_loc (MSG_NOTE, vect_location,
>                             "simplifying permute node %p\n",
>                             node);
> !        for (unsigned k = 0;
> !             k < SLP_TREE_LANE_PERMUTATION (node).length (); ++k)
> !          SLP_TREE_LANE_PERMUTATION (node)[k].second
> !            = perms[perm_in][SLP_TREE_LANE_PERMUTATION (node)[k].second];
> !      }
> !    /* Apply the anticipated output permute to the permute and
> !       stmt vectors.  */
> !    int perm_out = vertices[i].perm_out;
> !    if (perm_out > 0)
> !      {
> !        vect_slp_permute (perms[perm_out],
> !                          SLP_TREE_SCALAR_STMTS (node), true);
> !        vect_slp_permute (perms[perm_out],
> !                          SLP_TREE_LANE_PERMUTATION (node), true);
> !      }
> !  }
> ! else if (vertices[i].get_perm_materialized () != 0)
> !   {
> !     if (SLP_TREE_LOAD_PERMUTATION (node).exists ())
> !       /* For loads simply drop the permutation, the load permutation
> !          already performs the desired permutation.  */
> !       ;
> !     else if (SLP_TREE_LANE_PERMUTATION (node).exists ())
> !       gcc_unreachable ();
> !     else
> !       {
> !         if (dump_enabled_p ())
> !           dump_printf_loc (MSG_NOTE, vect_location,
> !                            "inserting permute node in place of %p\n",
> !                            node);
> !
> !         /* Make a copy of NODE and in-place change it to a
> !            VEC_PERM node to permute the lanes of the copy.  */
> !         slp_tree copy = new _slp_tree;
> !         SLP_TREE_CHILDREN (copy) = SLP_TREE_CHILDREN (node);
> !         SLP_TREE_CHILDREN (node) = vNULL;
> !         SLP_TREE_SCALAR_STMTS (copy)
> !           = SLP_TREE_SCALAR_STMTS (node).copy ();
> !         vect_slp_permute (perms[perm_in],
> !                           SLP_TREE_SCALAR_STMTS (copy), true);
> !         gcc_assert (!SLP_TREE_SCALAR_OPS (node).exists ());
> !         SLP_TREE_REPRESENTATIVE (copy) = SLP_TREE_REPRESENTATIVE (node);
> !         gcc_assert (!SLP_TREE_LOAD_PERMUTATION (node).exists ());
> !         SLP_TREE_LANE_PERMUTATION (copy)
> !           = SLP_TREE_LANE_PERMUTATION (node);
> !         SLP_TREE_LANE_PERMUTATION (node) = vNULL;
> !         SLP_TREE_VECTYPE (copy) = SLP_TREE_VECTYPE (node);
> !         copy->refcnt = 1;
> !         copy->max_nunits = node->max_nunits;
> !         SLP_TREE_DEF_TYPE (copy) = SLP_TREE_DEF_TYPE (node);
> !         SLP_TREE_LANES (copy) = SLP_TREE_LANES (node);
> !         SLP_TREE_CODE (copy) = SLP_TREE_CODE (node);
> !
> !         /* Now turn NODE into a VEC_PERM.  */
> !         SLP_TREE_CHILDREN (node).safe_push (copy);
> !         SLP_TREE_LANE_PERMUTATION (node).create (SLP_TREE_LANES (node));
> !         for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j)
> !           SLP_TREE_LANE_PERMUTATION (node)
> !             .quick_push (std::make_pair (0, perms[perm_in][j]));
> !         SLP_TREE_CODE (node) = VEC_PERM_EXPR;
> !       }
> !   }
> ! else if (perm_in > 0) /* perm_in == perm_out */
> !   {
> !     /* Apply the reverse permutation to our stmts.  */
> !     vect_slp_permute (perms[perm_in],
> !                       SLP_TREE_SCALAR_STMTS (node), true);
> !     /* And to the lane/load permutation, which we can simply
> !        make regular by design.  */
> !     if (SLP_TREE_LOAD_PERMUTATION (node).exists ())
> !       {
> !         gcc_assert (!SLP_TREE_LANE_PERMUTATION (node).exists ());
> !         /* ??? When we handle non-bijective permutes the idea
> !            is that we can force the load-permutation to be
> !            { min, min + 1, min + 2, ... max }.  But then the
> !            scalar defs might no longer match the lane content
> !            which means wrong-code with live lane vectorization.
> !            So we possibly have to have NULL entries for those.  */
> !         vect_slp_permute (perms[perm_in],
> !                           SLP_TREE_LOAD_PERMUTATION (node), true);
> !       }
> !     else if (SLP_TREE_LANE_PERMUTATION (node).exists ())
> !       gcc_unreachable ();
>       }
>     }
> 
> !   /* Elide any permutations at BB reduction roots.  */
> !   if (is_a <bb_vec_info> (vinfo))
>       {
> !       for (slp_instance instance : vinfo->slp_instances)
>         {
> !         if (SLP_INSTANCE_KIND (instance) != slp_inst_kind_bb_reduc)
>             continue;
> !         slp_tree old = SLP_INSTANCE_TREE (instance);
> !         if (SLP_TREE_CODE (old) == VEC_PERM_EXPR
> !             && SLP_TREE_CHILDREN (old).length () == 1)
>             {
> !             slp_tree child = SLP_TREE_CHILDREN (old)[0];
> !             if (SLP_TREE_DEF_TYPE (child) == vect_external_def)
> !               {
> !                 /* Preserve the special VEC_PERM we use to shield existing
> !                    vector defs from the rest.  But make it a no-op.  */
> !                 unsigned i = 0;
> !                 for (std::pair<unsigned, unsigned> &p
> !                        : SLP_TREE_LANE_PERMUTATION (old))
> !                   p.second = i++;
> !               }
> !             else
> !               {
> !                 SLP_INSTANCE_TREE (instance) = child;
> !                 SLP_TREE_REF_COUNT (child)++;
> !                 vect_free_slp_tree (old);
> !               }
> !           }
> !         else if (SLP_TREE_LOAD_PERMUTATION (old).exists ()
> !                  && SLP_TREE_REF_COUNT (old) == 1
> !                  && vertices[old->vertex].get_perm_materialized () != 0)
> !           {
> !             /* ??? For loads the situation is more complex since
> !                we can't modify the permute in place in case the
> !                node is used multiple times.  In fact for loads this
> !                should be somehow handled in the propagation engine.  */
> !             /* Apply the reverse permutation to our stmts.  */
> !             int perm = vertices[old->vertex].get_perm_materialized ();
> !             vect_slp_permute (perms[perm],
> !                               SLP_TREE_SCALAR_STMTS (old), true);
> !             vect_slp_permute (perms[perm],
> !                               SLP_TREE_LOAD_PERMUTATION (old), true);
>             }
>         }
>       }
> 
> !   /* Free the perms vector used for propagation.  */
> !   while (!perms.is_empty ())
> !     perms.pop ().release ();
> !   free_graph (slpg);
> !
> 
> !   /* Now elide load permutations that are not necessary.  */
> !   for (i = 0; i < leafs.length (); ++i)
>       {
> !       node = vertices[leafs[i]].node;
>         if (!SLP_TREE_LOAD_PERMUTATION (node).exists ())
>           continue;
> 
>         /* In basic block vectorization we allow any subchain of an interleaving
>            chain.
>            FORNOW: not in loop SLP because of realignment complications.  */
> !       if (is_a <bb_vec_info> (vinfo))
>         {
>           bool subchain_p = true;
>           stmt_vec_info next_load_info = NULL;
> --- 5094,5173 ----
>            dump_printf_loc (MSG_NOTE, vect_location,
>                             "simplifying permute node %p\n",
>                             node);
> !           for (auto &entry : SLP_TREE_LANE_PERMUTATION (node))
> !             if (entry.first == opno)
> !               entry.second = m_perms[layout2_i][entry.second];
> !         }
> !       /* Reverse the permutation that we chose to reverse.  */
> !       if (layout_i > 0)
> !         vect_slp_permute (m_perms[layout_i],
> !                           SLP_TREE_LANE_PERMUTATION (node), true);
> !     }
> !   else
> !     {
> !       gcc_assert (!SLP_TREE_LANE_PERMUTATION (node).exists ());
> !       auto &load_perm = SLP_TREE_LOAD_PERMUTATION (node);
> !       if (layout_i > 0)
> !         /* ??? When we handle non-bijective permutes the idea
> !            is that we can force the load-permutation to be
> !            { min, min + 1, min + 2, ... max }.  But then the
> !            scalar defs might no longer match the lane content
> !            which means wrong-code with live lane vectorization.
> !            So we possibly have to have NULL entries for those.  */
> !         vect_slp_permute (m_perms[layout_i], load_perm, true);
>       }
>     }
> 
> !   /* Do this before any nodes disappear, since it involves a walk
> !      over the leaves.  */
> !   remove_redundant_permutes ();
> !
> !   /* Replace each child with a correctly laid-out version.  */
> !   for (unsigned int node_i : m_partitioned_nodes)
>       {
> !       slp_tree node = m_vertices[node_i].node;
> !
> !       /* The loop above adjusted permutes for each child's chosen layout,
> !          so we don't need to adjust them here.  */
> !       if (SLP_TREE_CODE (node) == VEC_PERM_EXPR)
> !         continue;
> !
> !       int layout_i = m_partitions[m_vertices[node_i].partition].layout;
> !       gcc_assert (layout_i >= 0);
> !       unsigned j;
> !       slp_tree child;
> !       FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), j, child)
>         {
> !           if (!child)
>             continue;
> !
> !           slp_tree new_child = get_result_with_layout (child, layout_i);
> !           if (new_child != child)
>             {
> !               vect_free_slp_tree (child);
> !               SLP_TREE_CHILDREN (node)[j] = new_child;
> !               new_child->refcnt += 1;
>             }
>         }
>       }
> + }
> 
> ! /* Elide load permutations that are not necessary.  Such permutations might
> !    be pre-existing, rather than created by the layout optimizations.  */
> 
> ! void
> ! vect_optimize_slp_pass::remove_redundant_permutes ()
> ! {
> !   for (unsigned int node_i : m_leafs)
>       {
> !       slp_tree node = m_vertices[node_i].node;
>         if (!SLP_TREE_LOAD_PERMUTATION (node).exists ())
>           continue;
> 
>         /* In basic block vectorization we allow any subchain of an interleaving
>            chain.
>            FORNOW: not in loop SLP because of realignment complications.  */
> !       if (is_a <bb_vec_info> (m_vinfo))
>         {
>           bool subchain_p = true;
>           stmt_vec_info next_load_info = NULL;
> ***************
> *** 4160,4165 ****
> --- 5192,5198 ----
>         }
>       else
>         {
> +         loop_vec_info loop_vinfo = as_a <loop_vec_info> (m_vinfo);
>           stmt_vec_info load_info;
>           bool this_load_permuted = false;
>           unsigned j;
> ***************
> *** 4175,4182 ****
>            /* The load requires permutation when unrolling exposes
>               a gap either because the group is larger than the SLP
>               group-size or because there is a gap between the groups.  */
> !          && (known_eq (LOOP_VINFO_VECT_FACTOR
> !                        (as_a <loop_vec_info> (vinfo)), 1U)
>              || ((SLP_TREE_LANES (node) == DR_GROUP_SIZE (first_stmt_info))
>                  && DR_GROUP_GAP (first_stmt_info) == 0)))
>            {
> --- 5208,5214 ----
>            /* The load requires permutation when unrolling exposes
>               a gap either because the group is larger than the SLP
>               group-size or because there is a gap between the groups.  */
> !          && (known_eq (LOOP_VINFO_VECT_FACTOR (loop_vinfo), 1U)
>              || ((SLP_TREE_LANES (node) == DR_GROUP_SIZE (first_stmt_info))
>                  && DR_GROUP_GAP (first_stmt_info) == 0)))
>            {
> ***************
> *** 4187,4192 ****
> --- 5219,5258 ----
>         }
>       }
> 
> + /* Main entry point for the optimization pass.  */
> + 
> + void
> + vect_optimize_slp_pass::run ()
> + {
> +   build_graph ();
> +   create_partitions ();
> +   start_choosing_layouts ();
> +   if (m_perms.length () > 1)
> +     {
> +       forward_pass ();
> +       backward_pass ();
> +       if (dump_enabled_p ())
> +         dump ();
> +       materialize ();
> +     }
> +   else
> +     remove_redundant_permutes ();
> +   /* Free the perms vector used for propagation.  */
> +   while (!m_perms.is_empty ())
> +     m_perms.pop ().release ();
> +   free_graph (m_slpg);
> + }
> + 
> + /* Optimize the SLP graph of VINFO.  */
> + 
> + void
> + vect_optimize_slp (vec_info *vinfo)
> + {
> +   if (vinfo->slp_instances.is_empty ())
> +     return;
> +   vect_optimize_slp_pass (vinfo).run ();
> + }
> + 
>   /* Gather loads reachable from the individual SLP graph entries.  */
> 
>   void
> ***************
> *** 6647,6653 ****
>   {
>     stmt_vec_info stmt_info = SLP_TREE_SCALAR_STMTS (node)[0];
>     int vec_index = 0;
> !   tree vectype = STMT_VINFO_VECTYPE (stmt_info);
>     unsigned int group_size = SLP_TREE_SCALAR_STMTS (node).length ();
>     unsigned int mask_element;
>     machine_mode mode;
> --- 7713,7719 ----
>   {
>     stmt_vec_info stmt_info = SLP_TREE_SCALAR_STMTS (node)[0];
>     int vec_index = 0;
> !   tree vectype = SLP_TREE_VECTYPE (node);
>     unsigned int group_size = SLP_TREE_SCALAR_STMTS (node).length ();
>     unsigned int mask_element;
>     machine_mode mode;
> 

-- 
Richard Biener
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
HRB 36809 (AG Nuernberg)