From: Richard Biener <rguenther@suse.de>
To: Richard Sandiford <richard.sandiford@arm.com>
Cc: gcc-patches@gcc.gnu.org
Subject: Re: [PATCH] RFC: Extend SLP permutation optimisations
Date: Wed, 3 Aug 2022 11:23:37 +0000 (UTC)
Message-ID: <nycvar.YFH.7.77.849.2208031118240.4208@jbgna.fhfr.qr>
In-Reply-To: <mpt7d3rru04.fsf@arm.com>

On Tue, 2 Aug 2022, Richard Sandiford wrote:
> Currently SLP tries to force permute operations "down" the graph
> from loads in the hope of reducing the total number of permutes
> needed or (in the best case) removing the need for the permutes
> entirely. This patch tries to extend it as follows:
>
> - Allow loads to take a different permutation from the one they
> started with, rather than choosing between "original permutation"
> and "no permutation".
>
> - Allow changes in both directions, if the target supports the
> reverse permute operation.
>
> - Treat the placement of permute operations as a two-way dataflow
> problem: after propagating information from leaves to roots (as now),
> propagate information back up the graph.
>
> - Take execution frequency into account when optimising for speed,
> so that (for example) permutes inside loops have a higher cost
> than permutes outside loops.
>
> - Try to reduce the total number of permutes when optimising for
> size, even if that increases the number of permutes on a given
> execution path.
>
> See the big block comment above vect_optimize_slp_pass for
> a detailed description.
>
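> As a deliberately simplified toy illustration of the "two-way dataflow"
> idea (not the patch's code; every name in it is made up for the
> example), here is a straight chain of partitions with two candidate
> layouts, a unit cost for changing layout between neighbours, and a
> forward pass followed by a backward pass that together pick a layout
> for each partition.  The real pass works on a partition dag and uses
> the weighted depth/total costs described in the block comment:
>
>   #include <stdio.h>
>
>   #define N 4 /* partitions, in dataflow order (loads ... stores) */
>   #define L 2 /* candidate layouts: 0 = original, 1 = reversed */
>
>   int
>   main (void)
>   {
>     /* Partition 0 is a permuted load (cheaper if it may use layout 1);
>        partition 3 is a store that must keep layout 0.  */
>     int local[N][L] = { { 1, 0 }, { 0, 0 }, { 0, 0 }, { 0, 1000 } };
>     int fwd[N][L], bwd[N][L];
>
>     /* Forward pass: cheapest cost of everything up to and including
>        partition P, assuming P uses layout A.  */
>     for (int a = 0; a < L; ++a)
>       fwd[0][a] = local[0][a];
>     for (int p = 1; p < N; ++p)
>       for (int a = 0; a < L; ++a)
>         {
>           fwd[p][a] = local[p][a] + fwd[p - 1][0] + (a != 0);
>           for (int b = 1; b < L; ++b)
>             if (local[p][a] + fwd[p - 1][b] + (a != b) < fwd[p][a])
>               fwd[p][a] = local[p][a] + fwd[p - 1][b] + (a != b);
>         }
>
>     /* Backward pass: cheapest cost of everything strictly after
>        partition P, assuming P uses layout A.  */
>     for (int a = 0; a < L; ++a)
>       bwd[N - 1][a] = 0;
>     for (int p = N - 2; p >= 0; --p)
>       for (int a = 0; a < L; ++a)
>         {
>           bwd[p][a] = local[p + 1][0] + bwd[p + 1][0] + (a != 0);
>           for (int b = 1; b < L; ++b)
>             if (local[p + 1][b] + bwd[p + 1][b] + (a != b) < bwd[p][a])
>               bwd[p][a] = local[p + 1][b] + bwd[p + 1][b] + (a != b);
>         }
>
>     /* Combine both directions to make the final choice per partition.  */
>     for (int p = 0; p < N; ++p)
>       {
>         int best = 0;
>         for (int a = 1; a < L; ++a)
>           if (fwd[p][a] + bwd[p][a] < fwd[p][best] + bwd[p][best])
>             best = a;
>         printf ("partition %d: layout %d\n", p, best);
>       }
>     return 0;
>   }
>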
> A while back I posted a patch to extend the existing optimisation
> to non-consecutive loads. This patch doesn't include that change
> (although it's a possible future addition).
>
> The original motivation for doing this was to add a framework that would
> allow other layout differences in future. The two main ones are:
>
> - Make it easier to represent predicated operations, including
> predicated operations with gaps. E.g.:
>
> a[0] += 1;
> a[1] += 1;
> a[3] += 1;
>
> could be a single load/add/store for SVE. We could handle this
> by representing a layout such as { 0, 1, _, 2 } or { 0, 1, _, 3 }
> (depending on what's being counted). We might need to move
> elements between lanes at various points, like with permutes.
>
> (This would first mean adding support for stores with gaps. A rough
> intrinsics sketch of this idea appears after these bullet points.)
>
> - Make it easier to switch between an even/odd and unpermuted layout
> when switching between wide and narrow elements. E.g. if a widening
> operation produces an even vector and an odd vector, we should try
> to keep operations on the wide elements in that order rather than
> force them to be permuted back "in order".
>
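> To make the first item above concrete, here is a rough sketch using SVE
> ACLE intrinsics (illustration only, not something the patch generates).
> It assumes a 128-bit vector; for longer vectors the svdupq predicate
> pattern repeats, so it would no longer match the scalar code exactly:
>
>   #include <arm_sve.h>
>
>   void
>   add1_with_gap (int32_t *a)
>   {
>     /* Active lanes 0, 1 and 3; lane 2 is the gap.  */
>     svbool_t pg = svdupq_n_b32 (1, 1, 0, 1);
>     svint32_t v = svld1_s32 (pg, a);   /* predicated load */
>     v = svadd_n_s32_m (pg, v, 1);      /* predicated add */
>     svst1_s32 (pg, a, v);              /* predicated store */
>   }
>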
> To give some examples of what the current patch does:
>
> int f1(int *__restrict a, int *__restrict b, int *__restrict c,
> int *__restrict d)
> {
> a[0] = (b[1] << c[3]) - d[1];
> a[1] = (b[0] << c[2]) - d[0];
> a[2] = (b[3] << c[1]) - d[3];
> a[3] = (b[2] << c[0]) - d[2];
> }
>
> continues to produce the same code as before when optimising for
> speed: b, c and d are permuted at load time. But when optimising
> for size we instead permute c into the same order as b+d and then
> permute the result of the arithmetic into the same order as a:
>
> ldr q1, [x2]
> ldr q0, [x1]
> ext v1.16b, v1.16b, v1.16b, #8 // <------
> sshl v0.4s, v0.4s, v1.4s
> ldr q1, [x3]
> sub v0.4s, v0.4s, v1.4s
> rev64 v0.4s, v0.4s // <------
> str q0, [x0]
> ret
>
> The following function:
>
> int f2(int *__restrict a, int *__restrict b, int *__restrict c,
> int *__restrict d)
> {
> a[0] = (b[3] << c[3]) - d[3];
> a[1] = (b[2] << c[2]) - d[2];
> a[2] = (b[1] << c[1]) - d[1];
> a[3] = (b[0] << c[0]) - d[0];
> }
>
> continues to push the reverse down to just before the store,
> like the current code does.
>
> In:
>
> int f3(int *__restrict a, int *__restrict b, int *__restrict c,
> int *__restrict d)
> {
> for (int i = 0; i < 100; ++i)
> {
> a[0] = (a[0] + c[3]);
> a[1] = (a[1] + c[2]);
> a[2] = (a[2] + c[1]);
> a[3] = (a[3] + c[0]);
> c += 4;
> }
> }
>
> the loads of a are hoisted and the stores of a are sunk, so that
> only the load from c happens in the loop. When optimising for
> speed, we prefer to have the loop operate on the reversed layout,
> changing on entry and exit from the loop:
>
> mov x3, x0
> adrp x0, .LC0
> add x1, x2, 1600
> ldr q2, [x0, #:lo12:.LC0]
> ldr q0, [x3]
> mov v1.16b, v0.16b
> tbl v0.16b, {v0.16b - v1.16b}, v2.16b // <--------
> .p2align 3,,7
> .L6:
> ldr q1, [x2], 16
> add v0.4s, v0.4s, v1.4s
> cmp x2, x1
> bne .L6
> mov v1.16b, v0.16b
> adrp x0, .LC0
> ldr q2, [x0, #:lo12:.LC0]
> tbl v0.16b, {v0.16b - v1.16b}, v2.16b // <--------
> str q0, [x3]
> ret
>
> Similarly, for the very artificial testcase:
>
> int f4(int *__restrict a, int *__restrict b, int *__restrict c,
> int *__restrict d)
> {
> int a0 = a[0];
> int a1 = a[1];
> int a2 = a[2];
> int a3 = a[3];
> for (int i = 0; i < 100; ++i)
> {
> a0 ^= c[0];
> a1 ^= c[1];
> a2 ^= c[2];
> a3 ^= c[3];
> c += 4;
> for (int j = 0; j < 100; ++j)
> {
> a0 += d[1];
> a1 += d[0];
> a2 += d[3];
> a3 += d[2];
> d += 4;
> }
> b[0] = a0;
> b[1] = a1;
> b[2] = a2;
> b[3] = a3;
> b += 4;
> }
> a[0] = a0;
> a[1] = a1;
> a[2] = a2;
> a[3] = a3;
> }
>
> the a vector in the inner loop maintains the order { 1, 0, 3, 2 },
> even though it's part of an SCC that includes the outer loop.
> In other words, this is a motivating case for not assigning
> permutes at SCC granularity. The code we get is:
>
> ldr q0, [x0]
> mov x4, x1
> mov x5, x0
> add x1, x3, 1600
> add x3, x4, 1600
> .p2align 3,,7
> .L11:
> ldr q1, [x2], 16
> sub x0, x1, #1600
> eor v0.16b, v1.16b, v0.16b
> rev64 v0.4s, v0.4s // <---
> .p2align 3,,7
> .L10:
> ldr q1, [x0], 16
> add v0.4s, v0.4s, v1.4s
> cmp x0, x1
> bne .L10
> rev64 v0.4s, v0.4s // <---
> add x1, x0, 1600
> str q0, [x4], 16
> cmp x3, x4
> bne .L11
> str q0, [x5]
> ret
>
> The patch is still incomplete, hence the FIXMEs, but it's at the stage
> where it passes bootstrap & regtest on aarch64-linux-gnu. (There's one
> regression in slp-11b.c, but I think it's arguably restoring the
> expected load-lanes behaviour.) Does this look like a plausible
> direction, before I go filling in the FIXME parts?

Yes, it looks nice! I wonder if you have spent much thought on
layout optimization for SLP patterns and/or how to integrate the two?
Thanks,
Richard.
> Thanks,
> Richard
>
> ---
> gcc/graphds.cc | 13 +-
> gcc/graphds.h | 3 +-
> gcc/hash-map-traits.h | 74 +-
> gcc/hash-traits.h | 96 ++-
> gcc/tree-vect-slp.cc | 1870 ++++++++++++++++++++++++++++++++---------
> 5 files changed, 1599 insertions(+), 457 deletions(-)
>
> *** /tmp/iBbApg_graphds.cc 2022-08-02 08:16:23.975354155 +0100
> --- gcc/graphds.cc 2022-08-01 20:44:22.287691305 +0100
> ***************
> *** 281,287 ****
> numbers assigned by the previous pass. If SUBGRAPH is not NULL, it
> specifies the subgraph of G whose strongly connected components we want
> to determine. If SKIP_EDGE_P is not NULL, it points to a callback function.
> ! Edge E will be skipped if callback function returns true.
>
> After running this function, v->component is the number of the strongly
> connected component for each vertex of G. Returns the number of the
> --- 281,294 ----
> numbers assigned by the previous pass. If SUBGRAPH is not NULL, it
> specifies the subgraph of G whose strongly connected components we want
> to determine. If SKIP_EDGE_P is not NULL, it points to a callback function.
> ! Edge E will be skipped if callback function returns true. If SCC_GROUPING
> ! is not null, the nodes will be added to it in the following order:
> !
> ! - If SCC A is a direct or indirect predecessor of SCC B in the SCC dag,
> ! A's nodes come before B's nodes.
> !
> ! - All of an SCC's nodes are listed consecutively, although the order
> ! of the nodes within an SCC is not really meaningful.
>
> After running this function, v->component is the number of the strongly
> connected component for each vertex of G. Returns the number of the
> ***************
> *** 289,295 ****
>
> int
> graphds_scc (struct graph *g, bitmap subgraph,
> ! skip_edge_callback skip_edge_p)
> {
> int *queue = XNEWVEC (int, g->n_vertices);
> vec<int> postorder = vNULL;
> --- 296,302 ----
>
> int
> graphds_scc (struct graph *g, bitmap subgraph,
> ! skip_edge_callback skip_edge_p, vec<int> *scc_grouping)
> {
> int *queue = XNEWVEC (int, g->n_vertices);
> vec<int> postorder = vNULL;
> ***************
> *** 317,323 ****
>
> for (i = 0; i < nq; i++)
> queue[i] = postorder[nq - i - 1];
> ! comp = graphds_dfs (g, queue, nq, NULL, true, subgraph, skip_edge_p);
>
> free (queue);
> postorder.release ();
> --- 324,330 ----
>
> for (i = 0; i < nq; i++)
> queue[i] = postorder[nq - i - 1];
> ! comp = graphds_dfs (g, queue, nq, scc_grouping, true, subgraph, skip_edge_p);
>
> free (queue);
> postorder.release ();
> *** /tmp/eRUAEk_graphds.h 2022-08-02 08:16:23.987354123 +0100
> --- gcc/graphds.h 2022-08-01 20:44:18.835700593 +0100
> ***************
> *** 58,64 ****
> typedef bool (*skip_edge_callback) (struct graph_edge *);
> int graphds_dfs (struct graph *, int *, int,
> vec<int> *, bool, bitmap, skip_edge_callback = NULL);
> ! int graphds_scc (struct graph *, bitmap, skip_edge_callback = NULL);
> void graphds_domtree (struct graph *, int, int *, int *, int *);
> typedef void (*graphds_edge_callback) (struct graph *,
> struct graph_edge *, void *);
> --- 58,65 ----
> typedef bool (*skip_edge_callback) (struct graph_edge *);
> int graphds_dfs (struct graph *, int *, int,
> vec<int> *, bool, bitmap, skip_edge_callback = NULL);
> ! int graphds_scc (struct graph *, bitmap, skip_edge_callback = NULL,
> ! vec<int> * = NULL);
> void graphds_domtree (struct graph *, int, int *, int *, int *);
> typedef void (*graphds_edge_callback) (struct graph *,
> struct graph_edge *, void *);
> *** /tmp/4zWIek_hash-map-traits.h 2022-08-02 08:16:23.995354102 +0100
> --- gcc/hash-map-traits.h 2022-08-01 20:44:22.287691305 +0100
> ***************
> *** 105,118 ****
> static const bool maybe_mx = false;
> };
>
> ! /* Implement traits for a hash_map with values of type Value for cases
> ! in which the key cannot represent empty and deleted slots. Instead
> ! record empty and deleted entries in Value. Derived classes must
> ! implement the hash and equal_keys functions. */
>
> ! template <typename Value>
> struct unbounded_hashmap_traits
> {
> template <typename T> static inline void remove (T &);
> static const bool empty_zero_p = default_hash_traits <Value>::empty_zero_p;
> template <typename T> static inline bool is_empty (const T &);
> --- 105,123 ----
> static const bool maybe_mx = false;
> };
>
> ! /* Implement traits for a hash_map with keys of type Key and values of
> ! type Value for cases in which the key cannot represent empty and
> ! deleted slots. Instead record empty and deleted entries in Value. */
>
> ! template <typename Key, typename Value>
> struct unbounded_hashmap_traits
> {
> + typedef typename Key::value_type key_type;
> +
> + static hashval_t hash (const typename Key::value_type &);
> + static bool equal_keys (const typename Key::value_type &,
> + const typename Key::compare_type &);
> +
> template <typename T> static inline void remove (T &);
> static const bool empty_zero_p = default_hash_traits <Value>::empty_zero_p;
> template <typename T> static inline bool is_empty (const T &);
> ***************
> *** 121,162 ****
> template <typename T> static inline void mark_deleted (T &);
> };
>
> ! template <typename Value>
> template <typename T>
> inline void
> ! unbounded_hashmap_traits <Value>::remove (T &entry)
> {
> default_hash_traits <Value>::remove (entry.m_value);
> }
>
> ! template <typename Value>
> template <typename T>
> inline bool
> ! unbounded_hashmap_traits <Value>::is_empty (const T &entry)
> {
> return default_hash_traits <Value>::is_empty (entry.m_value);
> }
>
> ! template <typename Value>
> template <typename T>
> inline bool
> ! unbounded_hashmap_traits <Value>::is_deleted (const T &entry)
> {
> return default_hash_traits <Value>::is_deleted (entry.m_value);
> }
>
> ! template <typename Value>
> template <typename T>
> inline void
> ! unbounded_hashmap_traits <Value>::mark_empty (T &entry)
> {
> default_hash_traits <Value>::mark_empty (entry.m_value);
> }
>
> ! template <typename Value>
> template <typename T>
> inline void
> ! unbounded_hashmap_traits <Value>::mark_deleted (T &entry)
> {
> default_hash_traits <Value>::mark_deleted (entry.m_value);
> }
> --- 126,184 ----
> template <typename T> static inline void mark_deleted (T &);
> };
>
> ! template <typename Key, typename Value>
> ! inline hashval_t
> ! unbounded_hashmap_traits <Key, Value>
> ! ::hash (const typename Key::value_type &key)
> ! {
> ! return Key::hash (key);
> ! }
> !
> ! template <typename Key, typename Value>
> ! inline bool
> ! unbounded_hashmap_traits <Key, Value>
> ! ::equal_keys (const typename Key::value_type &x,
> ! const typename Key::compare_type &y)
> ! {
> ! return Key::equal (x, y);
> ! }
> !
> ! template <typename Key, typename Value>
> template <typename T>
> inline void
> ! unbounded_hashmap_traits <Key, Value>::remove (T &entry)
> {
> default_hash_traits <Value>::remove (entry.m_value);
> }
>
> ! template <typename Key, typename Value>
> template <typename T>
> inline bool
> ! unbounded_hashmap_traits <Key, Value>::is_empty (const T &entry)
> {
> return default_hash_traits <Value>::is_empty (entry.m_value);
> }
>
> ! template <typename Key, typename Value>
> template <typename T>
> inline bool
> ! unbounded_hashmap_traits <Key, Value>::is_deleted (const T &entry)
> {
> return default_hash_traits <Value>::is_deleted (entry.m_value);
> }
>
> ! template <typename Key, typename Value>
> template <typename T>
> inline void
> ! unbounded_hashmap_traits <Key, Value>::mark_empty (T &entry)
> {
> default_hash_traits <Value>::mark_empty (entry.m_value);
> }
>
> ! template <typename Key, typename Value>
> template <typename T>
> inline void
> ! unbounded_hashmap_traits <Key, Value>::mark_deleted (T &entry)
> {
> default_hash_traits <Value>::mark_deleted (entry.m_value);
> }
> ***************
> *** 166,190 ****
> slots. */
>
> template <typename Key, typename Value>
> ! struct unbounded_int_hashmap_traits : unbounded_hashmap_traits <Value>
> ! {
> ! typedef Key key_type;
> ! static inline hashval_t hash (Key);
> ! static inline bool equal_keys (Key, Key);
> ! };
> !
> ! template <typename Key, typename Value>
> ! inline hashval_t
> ! unbounded_int_hashmap_traits <Key, Value>::hash (Key k)
> ! {
> ! return k;
> ! }
> !
> ! template <typename Key, typename Value>
> ! inline bool
> ! unbounded_int_hashmap_traits <Key, Value>::equal_keys (Key k1, Key k2)
> ! {
> ! return k1 == k2;
> ! }
>
> #endif // HASH_MAP_TRAITS_H
> --- 188,194 ----
> slots. */
>
> template <typename Key, typename Value>
> ! using unbounded_int_hashmap_traits
> ! = unbounded_hashmap_traits <int_hash_base <Key>, Value>;
>
> #endif // HASH_MAP_TRAITS_H
> *** /tmp/FL7vGb_hash-traits.h 2022-08-02 08:16:24.011354061 +0100
> --- gcc/hash-traits.h 2022-08-01 20:44:22.287691305 +0100
> ***************
> *** 85,125 ****
> {
> }
>
>
> ! /* Hasher for integer type Type in which Empty is a spare value that can be
> ! used to mark empty slots. If Deleted != Empty then Deleted is another
> ! spare value that can be used for deleted slots; if Deleted == Empty then
> ! hash table entries cannot be deleted. */
> !
> ! template <typename Type, Type Empty, Type Deleted = Empty>
> ! struct int_hash : typed_noop_remove <Type>
> {
> typedef Type value_type;
> typedef Type compare_type;
>
> static inline hashval_t hash (value_type);
> static inline bool equal (value_type existing, value_type candidate);
> - static inline void mark_deleted (Type &);
> - static const bool empty_zero_p = Empty == 0;
> - static inline void mark_empty (Type &);
> - static inline bool is_deleted (Type);
> - static inline bool is_empty (Type);
> };
>
> ! template <typename Type, Type Empty, Type Deleted>
> inline hashval_t
> ! int_hash <Type, Empty, Deleted>::hash (value_type x)
> {
> return x;
> }
>
> ! template <typename Type, Type Empty, Type Deleted>
> inline bool
> ! int_hash <Type, Empty, Deleted>::equal (value_type x, value_type y)
> {
> return x == y;
> }
>
> template <typename Type, Type Empty, Type Deleted>
> inline void
> int_hash <Type, Empty, Deleted>::mark_deleted (Type &x)
> --- 85,135 ----
> {
> }
>
> + /* Base traits for integer type Type, providing just the hash and
> + comparison functionality. */
>
> ! template <typename Type>
> ! struct int_hash_base : typed_noop_remove <Type>
> {
> typedef Type value_type;
> typedef Type compare_type;
>
> static inline hashval_t hash (value_type);
> static inline bool equal (value_type existing, value_type candidate);
> };
>
> ! template <typename Type>
> inline hashval_t
> ! int_hash_base <Type>::hash (value_type x)
> {
> return x;
> }
>
> ! template <typename Type>
> inline bool
> ! int_hash_base <Type>::equal (value_type x, value_type y)
> {
> return x == y;
> }
>
> + /* Hasher for integer type Type in which Empty is a spare value that can be
> + used to mark empty slots. If Deleted != Empty then Deleted is another
> + spare value that can be used for deleted slots; if Deleted == Empty then
> + hash table entries cannot be deleted. */
> +
> + template <typename Type, Type Empty, Type Deleted = Empty>
> + struct int_hash : int_hash_base <Type>
> + {
> + typedef Type value_type;
> + typedef Type compare_type;
> +
> + static inline void mark_deleted (Type &);
> + static const bool empty_zero_p = Empty == 0;
> + static inline void mark_empty (Type &);
> + static inline bool is_deleted (Type);
> + static inline bool is_empty (Type);
> + };
> +
> template <typename Type, Type Empty, Type Deleted>
> inline void
> int_hash <Type, Empty, Deleted>::mark_deleted (Type &x)
> ***************
> *** 398,403 ****
> --- 408,467 ----
> return T1::is_empty (x.first);
> }
>
> + /* Base traits for vectors, providing just the hash and comparison
> + functionality. Type gives the traits for the element type. */
> +
> + template <typename Type>
> + struct vec_hash_base
> + {
> + typedef vec<typename Type::value_type> value_type;
> + typedef vec<typename Type::compare_type> compare_type;
> +
> + static inline hashval_t hash (value_type);
> + static inline bool equal (value_type, compare_type);
> + };
> +
> + template <typename Type>
> + inline hashval_t
> + vec_hash_base <Type>::hash (value_type x)
> + {
> + inchash::hash hstate;
> + hstate.add_int (x.length ());
> + for (auto &value : x)
> + hstate.merge_hash (Type::hash (value));
> + return hstate.end ();
> + }
> +
> + template <typename Type>
> + inline bool
> + vec_hash_base <Type>::equal (value_type x, compare_type y)
> + {
> + if (x.length () != y.length ())
> + return false;
> + for (unsigned int i = 0; i < x.length (); ++i)
> + if (!Type::equal (x[i], y[i]))
> + return false;
> + return true;
> + }
> +
> + /* Traits for vectors whose contents should be freed normally. */
> +
> + template <typename Type>
> + struct vec_free_hash_base : vec_hash_base <Type>
> + {
> + static void remove (typename vec_hash_base <Type>::value_type &);
> + };
> +
> + template <typename Type>
> + void
> + vec_free_hash_base <Type>
> + ::remove (typename vec_hash_base <Type>::value_type &x)
> + {
> + for (auto &value : x)
> + Type::remove (value);
> + x.release ();
> + }
> +
> template <typename T> struct default_hash_traits : T {};
>
> template <typename T>
> *** /tmp/5ek0ya_tree-vect-slp.cc 2022-08-02 08:16:24.027354017 +0100
> --- gcc/tree-vect-slp.cc 2022-08-01 20:44:22.631690380 +0100
> ***************
> *** 20,25 ****
> --- 20,26 ----
> <http://www.gnu.org/licenses/>. */
>
> #include "config.h"
> + #define INCLUDE_ALGORITHM
> #include "system.h"
> #include "coretypes.h"
> #include "backend.h"
> ***************
> *** 49,54 ****
> --- 50,57 ----
> #include "tree-eh.h"
> #include "tree-cfg.h"
> #include "alloc-pool.h"
> + #include "sreal.h"
> + #include "predict.h"
>
> static bool vectorizable_slp_permutation (vec_info *, gimple_stmt_iterator *,
> slp_tree, stmt_vector_for_cost *);
> ***************
> *** 304,309 ****
> --- 307,322 ----
> oprnds_info.release ();
> }
>
> + /* Return the execution frequency of NODE (so that a higher value indicates
> + a "more important" node when optimizing for speed). */
> +
> + static sreal
> + vect_slp_node_weight (slp_tree node)
> + {
> + stmt_vec_info stmt_info = vect_orig_stmt (SLP_TREE_REPRESENTATIVE (node));
> + basic_block bb = gimple_bb (stmt_info->stmt);
> + return bb->count.to_sreal_scale (ENTRY_BLOCK_PTR_FOR_FN (cfun)->count);
> + }
>
> /* Return true if STMTS contains a pattern statement. */
>
> ***************
> *** 3531,3559 ****
> return opt_result::success ();
> }
>
> ! struct slpg_vertex
> {
> ! slpg_vertex (slp_tree node_)
> ! : node (node_), perm_in (-1), perm_out (-1) {}
>
> ! int get_perm_materialized () const
> ! { return perm_in != perm_out ? perm_in : 0; }
>
> slp_tree node;
> ! /* The common permutation on the incoming lanes (towards SLP children). */
> ! int perm_in;
> ! /* The permutation on the outgoing lanes (towards SLP parents). When
> ! the node is a materialization point for a permute this differs
> ! from perm_in (and is then usually zero). Materialization happens
> ! on the input side. */
> ! int perm_out;
> };
>
> /* Fill the vertices and leafs vector with all nodes in the SLP graph. */
>
> ! static void
> ! vect_slp_build_vertices (hash_set<slp_tree> &visited, slp_tree node,
> ! vec<slpg_vertex> &vertices, vec<int> &leafs)
> {
> unsigned i;
> slp_tree child;
> --- 3544,3994 ----
> return opt_result::success ();
> }
>
> ! /* Estimates the cost of inserting layout changes into the SLP graph. */
> !
> ! struct slpg_layout_cost
> ! {
> ! slpg_layout_cost () = default;
> ! slpg_layout_cost (sreal, bool);
> !
> ! bool operator== (const slpg_layout_cost &) const;
> ! bool operator!= (const slpg_layout_cost &) const;
> !
> ! bool is_better_than (const slpg_layout_cost &, bool) const;
> !
> ! void add_parallel_cost (const slpg_layout_cost &);
> ! void add_serial_cost (const slpg_layout_cost &);
> ! void split (unsigned int);
> !
> ! /* The longest sequence of layout changes needed during any traversal
> ! of the partition dag, weighted by execution frequency.
> !
> ! This is the most important metric when optimizing for speed, since
> ! it helps to ensure that we keep the number of operations on
> ! critical paths to a minimum. */
> ! sreal depth = 0;
> !
> ! /* An estimate of the total number of operations needed. It is weighted by
> ! execution frequency when optimizing for speed but not when optimizing for
> ! size. In order to avoid double-counting, a node with a fanout of N will
> ! distribute 1/N of its total cost to each successor.
> !
> ! This is the most important metric when optimizing for size, since
> ! it helps to keep the total number of operations to a minimum. */
> ! sreal total = 0;
> ! };
> !
> ! /* Construct costs for a node with weight WEIGHT. A higher weight
> ! indicates more frequent execution. IS_FOR_SIZE is true if we are
> ! optimizing for size rather than speed. */
> !
> ! slpg_layout_cost::slpg_layout_cost (sreal weight, bool is_for_size)
> ! : depth (weight), total (is_for_size && weight > 0 ? 1 : weight)
> ! {
> ! }
> !
> ! bool
> ! slpg_layout_cost::operator== (const slpg_layout_cost &other) const
> ! {
> ! return depth == other.depth && total == other.total;
> ! }
> !
> ! bool
> ! slpg_layout_cost::operator!= (const slpg_layout_cost &other) const
> ! {
> ! return !operator== (other);
> ! }
> !
> ! /* Return true if these costs are better than OTHER. IS_FOR_SIZE is
> ! true if we are optimizing for size rather than speed. */
> !
> ! bool
> ! slpg_layout_cost::is_better_than (const slpg_layout_cost &other,
> ! bool is_for_size) const
> ! {
> ! if (is_for_size)
> ! {
> ! if (total != other.total)
> ! return total < other.total;
> ! return depth < other.depth;
> ! }
> ! else
> ! {
> ! if (depth != other.depth)
> ! return depth < other.depth;
> ! return total < other.total;
> ! }
> ! }
> !
> ! /* Increase the costs to account for something with cost INPUT_COST
> ! happening in parallel with the current costs. */
> !
> ! void
> ! slpg_layout_cost::add_parallel_cost (const slpg_layout_cost &input_cost)
> {
> ! depth = std::max (depth, input_cost.depth);
> ! total += input_cost.total;
> ! }
> !
> ! /* Increase the costs to account for something with cost INPUT_COST
> ! happening in series with the current costs. */
> !
> ! void
> ! slpg_layout_cost::add_serial_cost (const slpg_layout_cost &other)
> ! {
> ! depth += other.depth;
> ! total += other.total;
> ! }
> !
> ! /* Split the total cost among TIMES successors or predecessors. */
>
> ! void
> ! slpg_layout_cost::split (unsigned int times)
> ! {
> ! if (times > 1)
> ! total /= times;
> ! }
> !
> ! /* Information about one node in the SLP graph, for use during
> ! vect_optimize_slp_pass. */
> !
> ! struct slpg_vertex
> ! {
> ! slpg_vertex (slp_tree node_) : node (node_) {}
>
> + /* The node itself. */
> slp_tree node;
> !
> ! /* Which partition the node belongs to, or -1 if none. */
> ! int partition = -1;
> !
> ! /* The number of nodes that directly use the result of this one
> ! (i.e. the number of nodes that count this one as a child). */
> ! unsigned int out_degree = 0;
> !
> ! /* The execution frequency of the node. */
> ! sreal weight = 0;
> !
> ! /* The total execution frequency of all nodes that directly use the
> ! result of this one. */
> ! sreal out_weight = 0;
> ! };
> !
> ! /* Information about one partition of the SLP graph, for use during
> ! vect_optimize_slp_pass. */
> !
> ! struct slpg_partition_info
> ! {
> ! /* The nodes in the partition occupy indices [NODE_BEGIN, NODE_END)
> ! of m_partitioned_nodes. */
> ! unsigned int node_begin = 0;
> ! unsigned int node_end = 0;
> !
> ! /* Which layout we've chosen to use for this partition, or -1 if
> ! we haven't picked one yet. */
> ! int layout = -1;
> !
> ! /* The number of predecessors and successors in the partition dag.
> ! The predecessors always have lower partition numbers and the
> ! successors always have higher partition numbers.
> !
> ! Note that the directions of these edges are not necessarily the
> ! same as in the data flow graph. For example, if an SCC contains
> ! an inner and an outer loop, the inner loop's partition will have
> ! at least two incoming edges from the outer loop's partition:
> ! one for a live-in value and one for a live-out value. In data
> ! flow terms, the live-in edge would also be from the outer loop
> ! to the inner loop, but the live-out edge would be from the inner
> ! loop to the outer loop. */
> ! unsigned int in_degree = 0;
> ! unsigned int out_degree = 0;
> ! };
> !
> ! /* Information about the costs of using a particular layout for a
> ! particular partition. It can also say that the combination is
> ! invalid. */
> !
> ! struct slpg_partition_layout_costs
> ! {
> ! bool is_possible () const { return cost.total != sreal::min (); }
> ! void mark_impossible () { cost.total = sreal::min (); }
> !
> ! /* The costs inherited from predecessor partitions plus the inherent
> ! cost of the layout within the node itself. */
> ! slpg_layout_cost in_cost;
> !
> ! /* The costs inherited from successor partitions. */
> ! slpg_layout_cost out_cost;
> !
> ! /* The sum of the above, when used. */
> ! slpg_layout_cost cost;
> ! };
> !
> ! /* This class tries to optimize the layout of vectors in order to avoid
> ! unnecessary shuffling. At the moment, the set of possible layouts is
> ! restricted to bijective permutes.
> !
> ! The goal of the pass depends on whether we're optimizing for size or
> ! for speed. When optimizing for size, the goal is to reduce the overall
> ! number of layout changes (including layout changes implied by things
> ! like load permutes). When optimizing for speed, the goal is to
> ! reduce the maximum latency attributable to layout changes on any
> ! non-cyclic path through the data flow graph.
> !
> ! For example, when optimizing a loop nest for speed, we will prefer
> ! to make layout changes outside of a loop rather than inside of a loop,
> ! and will prefer to make layout changes in parallel rather than serially,
> ! even if that increases the overall number of layout changes.
> !
> ! The high-level procedure is:
> !
> ! (1) Build a graph in which edges go from uses (parents) to definitions
> ! (children).
> !
> ! (2) Divide the graph into a dag of strongly-connected components (SCCs).
> !
> ! (3) Partition the nodes in each SCC based on their containing cfg loop,
> ! creating a dag of partitions. The goal is now to assign a layout
> ! to each partition.
> !
> ! (4) Construct a set of vector layouts that are worth considering and
> ! calculate the local cost of using each layout in each partition.
> ! Some partitions might not allow a given layout at all.
> !
> ! (5) Perform a forward walk over the partition dag (from loads to stores)
> ! accumulating the "forward" cost of using each layout. When visiting
> ! each partition, assign a tentative choice of layout to the partition
> ! and use that choice when calculating the cost of using a different
> ! layout in successor partitions.
> !
> ! (6) Perform a backward walk over the partition dag (from stores to loads),
> ! accumulating the "backward" cost of using each layout. When visiting
> ! each partition, make a final choice of layout for that partition based
> ! on the accumulated forward costs (from (5)) and backward costs
> ! (from (6)).
> !
> ! (7) Apply the chosen layouts to the SLP graph.
> !
> ! For example, consider the SLP statements:
> !
> ! S1: a_1 = load
> ! loop:
> ! S2: a_2 = PHI<a_1, a_3>
> ! S3: b_1 = load
> ! S4: a_3 = a_2 + b_1
> ! exit:
> ! S5: a_4 = PHI<a_3>
> ! S6: store a_4
> !
> ! S2 and S4 form an SCC and are part of the same loop. Every other
> ! statement is in a singleton SCC. In this example there is a one-to-one
> ! mapping between SCCs and partitions and the partition dag looks like this:
> !
> ! S1 S3
> ! \ /
> ! S2+S4
> ! |
> ! S5
> ! |
> ! S6
> !
> ! S2, S3 and S4 will have a higher execution frequency than the other
> ! statements, so when optimizing for speed, the goal is to avoid any
> ! layout changes:
> !
> ! - within S3
> ! - within S2+S4
> ! - on the S3->S2+S4 edge
> !
> ! For example, if S3 was originally a reversing load, the goal of the
> ! pass is to make it an unreversed load and change the layout on the
> ! S1->S2+S4 and S2+S4->S5 edges to compensate. (Changing the layout
> ! on S1->S2+S4 and S5->S6 would also be acceptable.)
> !
> ! The difference between SCCs and partitions becomes important if we
> ! add an outer loop:
> !
> ! S1: a_1 = ...
> ! loop1:
> ! S2: a_2 = PHI<a_1, a_6>
> ! S3: b_1 = load
> ! S4: a_3 = a_2 + b_1
> ! loop2:
> ! S5: a_4 = PHI<a_3, a_5>
> ! S6: c_1 = load
> ! S7: a_5 = a_4 + c_1
> ! exit2:
> ! S8: a_6 = PHI<a_5>
> ! S9: store a_6
> ! exit1:
> !
> ! Here, S2, S4, S5, S7 and S8 form a single SCC. However, we usually
> ! do not want restrictions in the outer loop to "infect" the decision
> ! for the inner loop. For example, if an outer-loop node in the SCC
> ! contains a statement with a fixed layout, that should not prevent the
> ! inner loop from using a different layout. Conversely, the inner loop
> ! should not dictate a layout to the outer loop: if the outer loop does
> ! a lot of computation, then it may not be efficient to do all of that
> ! computation in the inner loop's preferred layout.
> !
> ! We therefore partition the SCC into S2+S4+S8 (outer) and S5+S7 (inner).
> ! We also try to arrange partitions so that:
> !
> ! - the partition for an outer loop comes before the partition for
> ! an inner loop
> !
> ! - if a sibling loop A dominates a sibling loop B, A's partition
> ! comes before B's
> !
> ! So in this case, the partition dag is:
> !
> ! S1 S3
> ! \ /
> ! S2+S4+S8 S6
> ! | \\ /
> ! | S5+S7
> ! |
> ! S9
> !
> ! There are two edges from S2+S4+S8 to S5+S7: one for the edge S4->S5 and
> ! one for a reversal of the edge S7->S8.
> !
> ! The backward walk picks a layout for S5+S7 before S2+S4+S8. The choice
> ! for S2+S4+S8 therefore has to balance the cost of using the outer loop's
> ! preferred layout against the cost of changing the layout on entry to the
> ! inner loop (S4->S5) and on exit from the inner loop (S7->S8 reversed).
> !
> ! ??? Although this works well when optimizing for speed, it has the
> ! downside when optimizing for size that the choice of layout for S5+S7
> ! is completely independent of S9. Perhaps we should not partition SCCs
> ! when optimizing for size?
> !
> ! To give a concrete example of the difference between optimizing
> ! for size and speed, consider:
> !
> ! a[0] = (b[1] << c[3]) - d[1];
> ! a[1] = (b[0] << c[2]) - d[0];
> ! a[2] = (b[3] << c[1]) - d[3];
> ! a[3] = (b[2] << c[0]) - d[2];
> !
> ! There are three different layouts here: one for a, one for b and d,
> ! and one for c. When optimizing for speed it is better to permute
> ! each of b, c and d into the order required by a, since those permutes
> ! happen in parallel. But when optimizing for size, it is better to:
> !
> ! - permute c into the same order as b
> ! - do the arithmetic
> ! - permute the result into the order required by a
> !
> ! This gives 2 permutes rather than 3. */
> !
> ! class vect_optimize_slp_pass
> ! {
> ! public:
> ! vect_optimize_slp_pass (vec_info *vinfo) : m_vinfo (vinfo) {}
> ! void run ();
> !
> ! private:
> ! /* Graph building. */
> ! struct loop *containing_loop (slp_tree);
> ! bool is_cfg_latch_edge (graph_edge *);
> ! void build_vertices (hash_set<slp_tree> &, slp_tree);
> ! void build_vertices ();
> ! void build_graph ();
> !
> ! /* Partitioning. */
> ! void create_partitions ();
> ! template<typename T> void for_each_partition_edge (unsigned int, T);
> !
> ! /* Layout selection. */
> ! bool can_change_layouts (unsigned int, unsigned int);
> ! bool layout_is_reversible (unsigned int);
> ! unsigned int layout_cost (unsigned int, unsigned int);
> ! slpg_partition_layout_costs &partition_layout_costs (unsigned int,
> ! unsigned int);
> ! bool partition_can_absorb_permutes (unsigned int);
> ! int node_absorption_cost (slp_tree, unsigned int);
> ! void start_choosing_layouts ();
> !
> ! /* Cost propagation. */
> ! slpg_layout_cost edge_layout_cost (graph_edge *, unsigned int,
> ! unsigned int, unsigned int);
> ! slpg_layout_cost forward_cost (graph_edge *, unsigned int, unsigned int);
> ! slpg_layout_cost backward_cost (graph_edge *, unsigned int, unsigned int);
> ! void forward_pass ();
> ! void backward_pass ();
> !
> ! /* Rematerialization. */
> ! slp_tree get_result_with_layout (slp_tree, unsigned int);
> ! void materialize ();
> !
> ! /* Clean-up. */
> ! void remove_redundant_permutes ();
> !
> ! void dump ();
> !
> ! vec_info *m_vinfo;
> !
> ! /* True if we should optimize the graph for size, false if we should
> ! optimize it for speed. (It wouldn't be easy to make this decision
> ! more locally.) */
> ! bool m_optimize_size;
> !
> ! /* A graph of all SLP nodes, with edges leading from uses to definitions.
> ! In other words, a node's predecessors are its slp_tree parents and
> ! a node's successors are its slp_tree children. */
> ! graph *m_slpg = nullptr;
> !
> ! /* The vertices of M_SLPG, indexed by slp_tree::vertex. */
> ! auto_vec<slpg_vertex> m_vertices;
> !
> ! /* The list of all leaves of M_SLPG, such as external definitions, constants,
> ! and loads. */
> ! auto_vec<int> m_leafs;
> !
> ! /* This array has one entry for every vector layout that we're considering.
> ! Element 0 is null and indicates "no change". Other entries describe
> ! permutes that are inherent in the current graph and that we would
> ! like to reverse if possible.
> !
> ! For example, a permute { 1, 2, 3, 0 } means that something has
> ! effectively been permuted in that way, such as a load group
> ! { a[1], a[2], a[3], a[0] } (viewed as a permutation of a[0:3]).
> ! We'd then like to apply the reverse permute { 3, 0, 1, 2 }
> ! in order to put things "back" in order. */
> ! auto_vec<vec<unsigned> > m_perms;
> !
> ! /* A partitioning of the nodes for which a layout must be chosen.
> ! Each partition represents an <SCC, cfg loop> pair; that is,
> ! nodes in different SCCs belong to different partitions, and nodes
> ! within an SCC are further partitioned according to the containing
> ! cfg loop. Partition <SCC1, L1> comes before <SCC2, L2> if:
> !
> ! - SCC1 != SCC2 and SCC1 is a predecessor of SCC2 in a forward walk
> ! from leaves (such as loads) to roots (such as stores).
> !
> ! - SCC1 == SCC2 and L1's header strictly dominates L2's header. */
> ! auto_vec<slpg_partition_info> m_partitions;
> !
> ! /* The list of all nodes for which a layout must be chosen. Nodes for
> ! partition P come before the nodes for partition P+1. Nodes within a
> ! partition are in reverse postorder. */
> ! auto_vec<unsigned int> m_partitioned_nodes;
> !
> ! /* Index P * num-layouts + L contains the cost of using layout L
> ! for partition P. */
> ! auto_vec<slpg_partition_layout_costs> m_partition_layout_costs;
> !
> ! /* Index N * num-layouts + L, if nonnull, is a node that provides the
> ! original output of node N adjusted to have layout L. */
> ! auto_vec<slp_tree> m_node_layouts;
> };
>
> /* Fill the vertices and leafs vector with all nodes in the SLP graph. */
>
> ! void
> ! vect_optimize_slp_pass::build_vertices (hash_set<slp_tree> &visited,
> ! slp_tree node)
> {
> unsigned i;
> slp_tree child;
> ***************
> *** 3561,3568 ****
> if (visited.add (node))
> return;
>
> ! node->vertex = vertices.length ();
> ! vertices.safe_push (slpg_vertex (node));
>
> bool leaf = true;
> bool force_leaf = false;
> --- 3996,4003 ----
> if (visited.add (node))
> return;
>
> ! node->vertex = m_vertices.length ();
> ! m_vertices.safe_push (slpg_vertex (node));
>
> bool leaf = true;
> bool force_leaf = false;
> ***************
> *** 3570,3576 ****
> if (child)
> {
> leaf = false;
> ! vect_slp_build_vertices (visited, child, vertices, leafs);
> }
> else
> force_leaf = true;
> --- 4005,4011 ----
> if (child)
> {
> leaf = false;
> ! build_vertices (visited, child);
> }
> else
> force_leaf = true;
> ***************
> *** 3580,3600 ****
> and inductions. Force those SLP PHIs to act as leafs to make them
> backwards reachable. */
> if (leaf || force_leaf)
> ! leafs.safe_push (node->vertex);
> }
>
> /* Fill the vertices and leafs vector with all nodes in the SLP graph. */
>
> ! static void
> ! vect_slp_build_vertices (vec_info *info, vec<slpg_vertex> &vertices,
> ! vec<int> &leafs)
> {
> hash_set<slp_tree> visited;
> unsigned i;
> slp_instance instance;
> ! FOR_EACH_VEC_ELT (info->slp_instances, i, instance)
> ! vect_slp_build_vertices (visited, SLP_INSTANCE_TREE (instance), vertices,
> ! leafs);
> }
>
> /* Apply (reverse) bijectite PERM to VEC. */
> --- 4015,4033 ----
> and inductions. Force those SLP PHIs to act as leafs to make them
> backwards reachable. */
> if (leaf || force_leaf)
> ! m_leafs.safe_push (node->vertex);
> }
>
> /* Fill the vertices and leafs vector with all nodes in the SLP graph. */
>
> ! void
> ! vect_optimize_slp_pass::build_vertices ()
> {
> hash_set<slp_tree> visited;
> unsigned i;
> slp_instance instance;
> ! FOR_EACH_VEC_ELT (m_vinfo->slp_instances, i, instance)
> ! build_vertices (visited, SLP_INSTANCE_TREE (instance));
> }
>
> /* Apply (reverse) bijectite PERM to VEC. */
> ***************
> *** 3625,3699 ****
> }
> }
>
> ! /* Return whether permutations PERM_A and PERM_B as recorded in the
> ! PERMS vector are equal. */
>
> ! static bool
> ! vect_slp_perms_eq (const vec<vec<unsigned> > &perms,
> ! int perm_a, int perm_b)
> {
> ! return (perm_a == perm_b
> ! || (perm_a != -1 && perm_b != -1
> ! && perms[perm_a].length () == perms[perm_b].length ()
> ! && memcmp (&perms[perm_a][0], &perms[perm_b][0],
> ! sizeof (unsigned) * perms[perm_a].length ()) == 0));
> }
>
> ! /* Optimize the SLP graph of VINFO. */
>
> ! void
> ! vect_optimize_slp (vec_info *vinfo)
> {
> ! if (vinfo->slp_instances.is_empty ())
> ! return;
>
> ! slp_tree node;
> ! unsigned i;
> ! auto_vec<slpg_vertex> vertices;
> ! auto_vec<int> leafs;
> ! vect_slp_build_vertices (vinfo, vertices, leafs);
>
> ! struct graph *slpg = new_graph (vertices.length ());
> ! for (slpg_vertex &v : vertices)
> for (slp_tree child : SLP_TREE_CHILDREN (v.node))
> if (child)
> ! add_edge (slpg, v.node->vertex, child->vertex);
>
> ! /* Compute (reverse) postorder on the inverted graph. */
> ! auto_vec<int> ipo;
> ! graphds_dfs (slpg, &leafs[0], leafs.length (), &ipo, false, NULL, NULL);
> !
> ! auto_vec<vec<unsigned> > perms;
> ! perms.safe_push (vNULL); /* zero is no permute */
> !
> ! /* Produce initial permutations. */
> ! for (i = 0; i < leafs.length (); ++i)
> ! {
> ! int idx = leafs[i];
> ! slp_tree node = vertices[idx].node;
> !
> ! /* Handle externals and constants optimistically throughout the
> ! iteration. But treat existing vectors as fixed since we
> ! do not handle permuting them below. */
> ! if ((SLP_TREE_DEF_TYPE (node) == vect_external_def
> ! && !SLP_TREE_VEC_DEFS (node).exists ())
> ! || SLP_TREE_DEF_TYPE (node) == vect_constant_def)
> ! continue;
>
> ! /* Leafs do not change across iterations. Note leafs also double
> ! as entries to the reverse graph. */
> ! if (!slpg->vertices[idx].succ)
> {
> ! vertices[idx].perm_in = 0;
> ! vertices[idx].perm_out = 0;
> }
>
> /* Loads are the only thing generating permutes. */
> if (!SLP_TREE_LOAD_PERMUTATION (node).exists ())
> continue;
>
> /* If splitting out a SLP_TREE_LANE_PERMUTATION can make the
> ! node unpermuted, record this permute. */
> stmt_vec_info dr_stmt = SLP_TREE_REPRESENTATIVE (node);
> if (!STMT_VINFO_GROUPED_ACCESS (dr_stmt))
> continue;
> --- 4058,4416 ----
> }
> }
>
> ! /* Return the cfg loop that contains NODE. */
>
> ! struct loop *
> ! vect_optimize_slp_pass::containing_loop (slp_tree node)
> {
> ! stmt_vec_info rep = SLP_TREE_REPRESENTATIVE (node);
> ! if (!rep)
> ! return ENTRY_BLOCK_PTR_FOR_FN (cfun)->loop_father;
> ! return gimple_bb (vect_orig_stmt (rep)->stmt)->loop_father;
> }
>
> ! /* Return true if UD (an edge from a use to a definition) is associated
> ! with a loop latch edge in the cfg. */
>
> ! bool
> ! vect_optimize_slp_pass::is_cfg_latch_edge (graph_edge *ud)
> {
> ! slp_tree use = m_vertices[ud->src].node;
> ! slp_tree def = m_vertices[ud->dest].node;
> ! if (SLP_TREE_DEF_TYPE (use) != vect_internal_def
> ! || SLP_TREE_DEF_TYPE (def) != vect_internal_def)
> ! return false;
>
> ! stmt_vec_info use_rep = vect_orig_stmt (SLP_TREE_REPRESENTATIVE (use));
> ! return (is_a<gphi *> (use_rep->stmt)
> ! && bb_loop_header_p (gimple_bb (use_rep->stmt))
> ! && containing_loop (def) == containing_loop (use));
> ! }
> !
> ! /* Build the graph. Mark edges that correspond to cfg loop latch edges with
> ! a nonnull data field. */
>
> ! void
> ! vect_optimize_slp_pass::build_graph ()
> ! {
> ! build_vertices ();
> !
> ! m_slpg = new_graph (m_vertices.length ());
> ! for (slpg_vertex &v : m_vertices)
> for (slp_tree child : SLP_TREE_CHILDREN (v.node))
> if (child)
> ! {
> ! graph_edge *ud = add_edge (m_slpg, v.node->vertex, child->vertex);
> ! if (is_cfg_latch_edge (ud))
> ! ud->data = this;
> ! }
> ! }
>
> ! /* Return true if E corresponds to a loop latch edge in the cfg. */
>
> ! static bool
> ! skip_cfg_latch_edges (graph_edge *e)
> ! {
> ! return e->data;
> ! }
> !
> ! /* Create the node partitions. */
> !
> ! void
> ! vect_optimize_slp_pass::create_partitions ()
> ! {
> ! /* Calculate a postorder of the graph, ignoring edges that correspond
> ! to natural latch edges in the cfg. Reading the vector from the
> ! end to the beginning gives the reverse postorder. */
> ! auto_vec<int> initial_rpo;
> ! graphds_dfs (m_slpg, &m_leafs[0], m_leafs.length (), &initial_rpo,
> ! false, NULL, skip_cfg_latch_edges);
> ! gcc_assert (initial_rpo.length () == m_vertices.length ());
> !
> ! /* Calculate the strongly connected components of the graph. */
> ! auto_vec<int> scc_grouping;
> ! unsigned int num_sccs = graphds_scc (m_slpg, NULL, NULL, &scc_grouping);
> !
> ! /* Create a new index order in which all nodes from the same SCC are
> ! consecutive. Use scc_pos to record the index of the first node in
> ! each SCC. */
> ! auto_vec<unsigned int> scc_pos (num_sccs);
> ! int last_component = -1;
> ! unsigned int node_count = 0;
> ! for (unsigned int node_i : scc_grouping)
> ! {
> ! if (last_component != m_slpg->vertices[node_i].component)
> ! {
> ! last_component = m_slpg->vertices[node_i].component;
> ! gcc_assert (last_component == int (scc_pos.length ()));
> ! scc_pos.quick_push (node_count);
> ! }
> ! node_count += 1;
> ! }
> ! gcc_assert (node_count == initial_rpo.length ()
> ! && last_component + 1 == int (num_sccs));
> !
> ! /* Use m_partitioned_nodes to group nodes into SCC order, with the nodes
> ! inside each SCC following the RPO we calculated above. The fact that
> ! we ignored natural latch edges when calculating the RPO should ensure
> ! that, for natural loop nests:
> !
> ! - the first node that we encounter in a cfg loop is the loop header phi
> ! - the loop header phis are in dominance order
> !
> ! Arranging for this is an optimization (see below) rather than a
> ! correctness issue. Unnatural loops with a tangled mess of backedges
> ! will still work correctly, but might give poorer results.
> !
> ! Also update scc_pos so that it gives 1 + the index of the last node
> ! in the SCC. */
> ! m_partitioned_nodes.safe_grow (node_count);
> ! for (unsigned int old_i = initial_rpo.length (); old_i-- > 0;)
> ! {
> ! unsigned int node_i = initial_rpo[old_i];
> ! unsigned int new_i = scc_pos[m_slpg->vertices[node_i].component]++;
> ! m_partitioned_nodes[new_i] = node_i;
> ! }
> !
> ! /* Partition each SCC based on the containing cfg loop. The order we
> ! constructed above should ensure that, for natural cfg loops, we'll
> ! create sub-SCC partitions for outer loops before the corresponding
> ! sub-SCC partitions for inner loops. Similarly, when one sibling
> ! loop A dominates another sibling loop B, we should create a sub-SCC
> ! partition for A before a sub-SCC partition for B.
> !
> ! As above, nothing depends for correctness on whether this achieves
> ! a natural nesting, but we should get better results when it does. */
> ! m_partitions.reserve (m_vertices.length ());
> ! unsigned int next_partition_i = 0;
> ! hash_map<struct loop *, int> loop_partitions;
> ! unsigned int rpo_begin = 0;
> ! unsigned int num_partitioned_nodes = 0;
> ! for (unsigned int rpo_end : scc_pos)
> ! {
> ! loop_partitions.empty ();
> ! for (unsigned int rpo_i = rpo_begin; rpo_i < rpo_end; ++rpo_i)
> ! {
> ! /* Handle externals and constants optimistically throughout.
> ! But treat existing vectors as fixed since we do not handle
> ! permuting them. */
> ! unsigned int node_i = m_partitioned_nodes[rpo_i];
> ! slp_tree node = m_vertices[node_i].node;
> ! if ((SLP_TREE_DEF_TYPE (node) == vect_external_def
> ! && !SLP_TREE_VEC_DEFS (node).exists ())
> ! || SLP_TREE_DEF_TYPE (node) == vect_constant_def)
> ! m_vertices[node_i].partition = -1;
> ! else
> ! {
> ! struct loop *loop = containing_loop (node);
> ! bool existed;
> ! auto &partition_i
> ! = loop_partitions.get_or_insert (loop, &existed);
> ! if (!existed)
> ! {
> ! partition_i = next_partition_i++;
> ! m_partitions.quick_push (slpg_partition_info ());
> ! }
> ! m_vertices[node_i].partition = partition_i;
> ! num_partitioned_nodes += 1;
> ! m_partitions[partition_i].node_end += 1;
> ! }
> ! }
> ! rpo_begin = rpo_end;
> ! }
> !
> ! /* Assign ranges of consecutive node indices to each partition,
> ! in partition order. Start with node_end being the same as
> ! node_begin so that the next loop can use it as a counter. */
> ! unsigned int node_begin = 0;
> ! for (auto &partition : m_partitions)
> ! {
> ! partition.node_begin = node_begin;
> ! node_begin += partition.node_end;
> ! partition.node_end = partition.node_begin;
> ! }
> ! gcc_assert (node_begin == num_partitioned_nodes);
> !
> ! /* Finally build the list of nodes in partition order. */
> ! m_partitioned_nodes.truncate (num_partitioned_nodes);
> ! for (unsigned int node_i = 0; node_i < m_vertices.length (); ++node_i)
> ! {
> ! int partition_i = m_vertices[node_i].partition;
> ! if (partition_i >= 0)
> {
> ! unsigned int order_i = m_partitions[partition_i].node_end++;
> ! m_partitioned_nodes[order_i] = node_i;
> }
> + }
> + }
> +
> + /* Look for edges from earlier partitions into node NODE_I and edges from
> + node NODE_I into later partitions. Call:
> +
> + FN (ud, other_node_i, other_partition_i)
> +
> + for each such use-to-def edge ud, where other_node_i and other_partition_i
> + are the node and partition at the other end of the edge. */
> +
> + template<typename T>
> + void
> + vect_optimize_slp_pass::for_each_partition_edge (unsigned int node_i, T fn)
> + {
> + int partition_i = m_vertices[node_i].partition;
> + for (graph_edge *pred = m_slpg->vertices[node_i].pred;
> + pred; pred = pred->pred_next)
> + {
> + int src_partition_i = m_vertices[pred->src].partition;
> + if (src_partition_i >= 0 && src_partition_i != partition_i)
> + fn (pred, pred->src, src_partition_i);
> + }
> + for (graph_edge *succ = m_slpg->vertices[node_i].succ;
> + succ; succ = succ->succ_next)
> + {
> + int dest_partition_i = m_vertices[succ->dest].partition;
> + if (dest_partition_i >= 0 && dest_partition_i != partition_i)
> + fn (succ, succ->dest, dest_partition_i);
> + }
> + }
> +
> + /* Return true if the target is able to change a vector from layout
> + FROM_LAYOUT_I to layout TO_LAYOUT_I. */
> +
> + bool
> + vect_optimize_slp_pass::can_change_layouts (unsigned int from_layout_i,
> + unsigned int to_layout_i)
> + {
> + /* FIXME: test what the target can support. */
> + (void) from_layout_i;
> + (void) to_layout_i;
> + return true;
> + }
> +
> + /* We already know that the target can perform permute m_perms[LAYOUT_I].
> + In our terms this means that we know that the target can go from a
> + layout with m_perms[LAYOUT_I] reversed (layout LAYOUT_I) to one with
> + it applied (layout 0, the original layout of the SLP graph). Return
> + true if the target can also go in the opposite direction: from the
> + original layout to one in which m_perms[LAYOUT_I] has been reversed. */
> +
> + bool
> + vect_optimize_slp_pass::layout_is_reversible (unsigned int layout_i)
> + {
> + return can_change_layouts (0, layout_i);
> + }
> +
> + /* Return the cost (in arbitrary units) of going from layout FROM_LAYOUT_I
> + to layout TO_LAYOUT_I. */
> +
> + unsigned int
> + vect_optimize_slp_pass::layout_cost (unsigned int from_layout_i,
> + unsigned int to_layout_i)
> + {
> + if (from_layout_i == to_layout_i)
> + return 0;
> +
> + if (can_change_layouts (from_layout_i, to_layout_i))
> + return 1;
> +
> + /* Change to layout 0 and then out again. */
> + gcc_assert (layout_is_reversible (from_layout_i));
> + return 2;
> + }
> +
> + /* Return the costs of assigning layout LAYOUT_I to partition PARTITION_I. */
> +
> + inline slpg_partition_layout_costs &
> + vect_optimize_slp_pass::partition_layout_costs (unsigned int partition_i,
> + unsigned int layout_i)
> + {
> + return m_partition_layout_costs[partition_i * m_perms.length () + layout_i];
> + }
> +
> + /* Return true if partition PARTITION_I is structurally allowed to absorb
> + neighboring permutes. */
> +
> + bool
> + vect_optimize_slp_pass::
> + partition_can_absorb_permutes (unsigned int partition_i)
> + {
> + auto &partition = m_partitions[partition_i];
> + if (partition.node_end != partition.node_begin + 1)
> + return false;
> +
> + slp_tree node = m_vertices[m_partitioned_nodes[partition.node_begin]].node;
> + if (SLP_TREE_CODE (node) == VEC_PERM_EXPR)
> + return true;
> +
> + stmt_vec_info rep = SLP_TREE_REPRESENTATIVE (node);
> + if (rep
> + && STMT_VINFO_DATA_REF (rep)
> + && DR_IS_READ (STMT_VINFO_DATA_REF (rep)))
> + return true;
> +
> + return false;
> + }
> +
> + /* Check whether NODE (which is a VEC_PERM_EXPR or a load) can absorb the
> + permute that would be required to go from layout 0 to layout LAYOUT_I.
> + Return the cost of the change if so, in the same arbitrary units as
> + for layout_cost. Return -1 otherwise. */
> +
> + int
> + vect_optimize_slp_pass::node_absorption_cost (slp_tree node,
> + unsigned int layout_i)
> + {
> + if (SLP_TREE_CODE (node) == VEC_PERM_EXPR)
> + /* FIXME: check whether NODE can handle cases in which both the inputs
> + and outputs have layout LAYOUT_I. Return -1 if not. */
> + return 0;
> +
> + auto &vertex = m_vertices[node->vertex];
> + auto &partition = m_partitions[vertex.partition];
> + if (partition.layout == int (layout_i))
> + return 0;
> + /* FIXME: check whether NODE can be changed from its current layout
> + to layout LAYOUT_I. Return -1 if not. */
> + return 1;
> + }
> +
> + /* Decide which element layouts we should consider using. Record which
> + partitions can potentially use which layouts and the local cost of
> + each choice. */
> +
> + void
> + vect_optimize_slp_pass::start_choosing_layouts ()
> + {
> + /* Used to assign unique permute indices. */
> + using perm_hash = unbounded_hashmap_traits<
> + vec_free_hash_base<int_hash_base<unsigned>>,
> + int_hash<int, -1, -2>
> + >;
> + hash_map<vec<unsigned>, int, perm_hash> layout_ids;
> +
> + /* Layout 0 is "no change". */
> + m_perms.safe_push (vNULL);
> +
> + /* Create layouts from existing permutations. */
> + /* FIXME: cap the maximum number of layouts, to avoid quadraticness. */
> + for (unsigned int node_i : m_leafs)
> + {
> + int partition_i = m_vertices[node_i].partition;
> + if (partition_i < 0)
> + continue;
> +
> + /* Leafs also double as entries to the reverse graph. Allow the
> + layout of those to be changed. */
> + if (!m_slpg->vertices[node_i].succ)
> + m_partitions[partition_i].layout = 0;
>
> /* Loads are the only thing generating permutes. */
> + slp_tree node = m_vertices[node_i].node;
> if (!SLP_TREE_LOAD_PERMUTATION (node).exists ())
> continue;
>
> /* If splitting out a SLP_TREE_LANE_PERMUTATION can make the
> ! node unpermuted, record a layout that reverses this permute. */
> ! gcc_assert (m_partitions[partition_i].layout == 0);
> stmt_vec_info dr_stmt = SLP_TREE_REPRESENTATIVE (node);
> if (!STMT_VINFO_GROUPED_ACCESS (dr_stmt))
> continue;
> ***************
> *** 3708,3720 ****
> if (idx - SLP_TREE_LOAD_PERMUTATION (node)[0] != j)
> any_permute = true;
> }
> - /* If there's no permute no need to split one out. */
> - if (!any_permute)
> - continue;
> /* If the span doesn't match we'd disrupt VF computation, avoid
> that for now. */
> if (imax - imin + 1 != SLP_TREE_LANES (node))
> continue;
>
> /* For now only handle true permutes, like
> vect_attempt_slp_rearrange_stmts did. This allows us to be lazy
> --- 4425,4442 ----
> if (idx - SLP_TREE_LOAD_PERMUTATION (node)[0] != j)
> any_permute = true;
> }
> /* If the span doesn't match we'd disrupt VF computation, avoid
> that for now. */
> if (imax - imin + 1 != SLP_TREE_LANES (node))
> continue;
> + /* If there's no permute no need to split one out. In this case
> + we can consider turning the load into a permuted load, if that
> + turns out to be cheaper than alternatives. */
> + if (!any_permute)
> + {
> + m_partitions[partition_i].layout = -1;
> + continue;
> + }
>
> /* For now only handle true permutes, like
> vect_attempt_slp_rearrange_stmts did. This allows us to be lazy
> ***************
> *** 3735,3985 ****
> perm.safe_grow (SLP_TREE_LANES (node), true);
> for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j)
> perm[j] = SLP_TREE_LOAD_PERMUTATION (node)[j] - imin;
> ! perms.safe_push (perm);
> ! vertices[idx].perm_in = perms.length () - 1;
> ! vertices[idx].perm_out = perms.length () - 1;
> }
>
> ! /* In addition to the above we have to mark outgoing permutes facing
> ! non-reduction graph entries that are not represented as to be
> ! materialized. */
> ! for (slp_instance instance : vinfo->slp_instances)
> if (SLP_INSTANCE_KIND (instance) == slp_inst_kind_ctor)
> {
> ! /* Just setting perm_out isn't enough for the propagation to
> ! pick this up. */
> ! vertices[SLP_INSTANCE_TREE (instance)->vertex].perm_in = 0;
> ! vertices[SLP_INSTANCE_TREE (instance)->vertex].perm_out = 0;
> }
>
> ! /* Propagate permutes along the graph and compute materialization points. */
> ! bool changed;
> ! bool do_materialization = false;
> ! unsigned iteration = 0;
> ! do
> ! {
> ! changed = false;
> ! ++iteration;
> !
> ! if (dump_enabled_p ())
> ! dump_printf_loc (MSG_NOTE, vect_location,
> ! "SLP optimize iteration %d\n", iteration);
>
> ! for (i = vertices.length (); i > 0 ; --i)
> {
> ! int idx = ipo[i-1];
> ! slp_tree node = vertices[idx].node;
>
> ! /* Handle externals and constants optimistically throughout the
> ! iteration. */
> ! if (SLP_TREE_DEF_TYPE (node) == vect_external_def
> ! || SLP_TREE_DEF_TYPE (node) == vect_constant_def)
> ! continue;
>
> ! /* We still eventually have failed backedge SLP nodes in the
> ! graph, those are only cancelled when analyzing operations.
> ! Simply treat them as transparent ops, propagating permutes
> ! through them. */
> ! if (SLP_TREE_DEF_TYPE (node) == vect_internal_def)
> ! {
> ! /* We do not handle stores with a permutation, so all
> ! incoming permutes must have been materialized. */
> ! stmt_vec_info rep = SLP_TREE_REPRESENTATIVE (node);
> ! if (STMT_VINFO_DATA_REF (rep)
> ! && DR_IS_WRITE (STMT_VINFO_DATA_REF (rep)))
> ! {
> ! /* ??? We're forcing materialization in place
> ! of the child here, we'd need special handling
> ! in materialization to leave perm_in -1 here. */
> ! vertices[idx].perm_in = 0;
> ! vertices[idx].perm_out = 0;
> ! }
> ! /* We cannot move a permute across an operation that is
> ! not independent on lanes. Note this is an explicit
> ! negative list since that's much shorter than the respective
> ! positive one but it's critical to keep maintaining it. */
> ! if (is_gimple_call (STMT_VINFO_STMT (rep)))
> ! switch (gimple_call_combined_fn (STMT_VINFO_STMT (rep)))
> ! {
> ! case CFN_COMPLEX_ADD_ROT90:
> ! case CFN_COMPLEX_ADD_ROT270:
> ! case CFN_COMPLEX_MUL:
> ! case CFN_COMPLEX_MUL_CONJ:
> ! case CFN_VEC_ADDSUB:
> ! case CFN_VEC_FMADDSUB:
> ! case CFN_VEC_FMSUBADD:
> ! vertices[idx].perm_in = 0;
> ! vertices[idx].perm_out = 0;
> ! default:;
> ! }
> }
>
> ! if (!slpg->vertices[idx].succ)
> ! /* Pick up pre-computed leaf values. */
> ! ;
> ! else
> {
> ! bool any_succ_perm_out_m1 = false;
> ! int perm_in = vertices[idx].perm_in;
> ! for (graph_edge *succ = slpg->vertices[idx].succ;
> ! succ; succ = succ->succ_next)
> {
> ! int succ_idx = succ->dest;
> ! int succ_perm = vertices[succ_idx].perm_out;
> ! /* Handle unvisited (and constant) nodes optimistically. */
> ! /* ??? But for constants once we want to handle
> ! non-bijective permutes we have to verify the permute,
> ! when unifying lanes, will not unify different constants.
> ! For example see gcc.dg/vect/bb-slp-14.c for a case
> ! that would break. */
> ! if (succ_perm == -1)
> ! {
> ! /* When we handled a non-leaf optimistically, note
> ! that so we can adjust its outgoing permute below. */
> ! slp_tree succ_node = vertices[succ_idx].node;
> ! if (SLP_TREE_DEF_TYPE (succ_node) != vect_external_def
> ! && SLP_TREE_DEF_TYPE (succ_node) != vect_constant_def)
> ! any_succ_perm_out_m1 = true;
> ! continue;
> ! }
> ! if (perm_in == -1)
> ! perm_in = succ_perm;
> ! else if (succ_perm == 0
> ! || !vect_slp_perms_eq (perms, perm_in, succ_perm))
> {
> ! perm_in = 0;
> ! break;
> }
> }
>
> ! /* Adjust any incoming permutes we treated optimistically. */
> ! if (perm_in != -1 && any_succ_perm_out_m1)
> {
> ! for (graph_edge *succ = slpg->vertices[idx].succ;
> ! succ; succ = succ->succ_next)
> {
> ! slp_tree succ_node = vertices[succ->dest].node;
> ! if (vertices[succ->dest].perm_out == -1
> ! && SLP_TREE_DEF_TYPE (succ_node) != vect_external_def
> ! && SLP_TREE_DEF_TYPE (succ_node) != vect_constant_def)
> ! {
> ! vertices[succ->dest].perm_out = perm_in;
> ! /* And ensure this propagates. */
> ! if (vertices[succ->dest].perm_in == -1)
> ! vertices[succ->dest].perm_in = perm_in;
> ! }
> }
> ! changed = true;
> ! }
>
> ! if (!vect_slp_perms_eq (perms, perm_in,
> ! vertices[idx].perm_in))
> ! {
> ! /* Make sure we eventually converge. */
> ! gcc_checking_assert (vertices[idx].perm_in == -1
> ! || perm_in == 0);
> ! vertices[idx].perm_in = perm_in;
> !
> ! /* While we can handle VEC_PERM nodes as transparent
> ! pass-through they can be a cheap materialization
> ! point as well. In addition they can act as source
> ! of a random permutation as well.
> ! The following ensures that former materialization
> ! points that now have zero incoming permutes no
> ! longer appear as such and that former "any" permutes
> ! get pass-through. We keep VEC_PERM nodes optimistic
> ! as "any" outgoing permute though. */
> ! if (vertices[idx].perm_out != 0
> ! && SLP_TREE_CODE (node) != VEC_PERM_EXPR)
> ! vertices[idx].perm_out = perm_in;
> ! changed = true;
> ! }
> }
>
> ! /* Elide pruning at materialization points in the first
> ! iteration phase. */
> ! if (!do_materialization)
> ! continue;
>
> ! int perm = vertices[idx].perm_out;
> ! if (perm == 0 || perm == -1)
> continue;
>
> ! /* Decide on permute materialization. Look whether there's
> ! a use (pred) edge that is permuted differently than us.
> ! In that case mark ourselves so the permutation is applied. */
> ! bool all_preds_permuted = slpg->vertices[idx].pred != NULL;
> ! if (all_preds_permuted)
> ! for (graph_edge *pred = slpg->vertices[idx].pred;
> ! pred; pred = pred->pred_next)
> ! {
> ! int pred_perm = vertices[pred->src].perm_in;
> ! gcc_checking_assert (pred_perm != -1);
> ! if (!vect_slp_perms_eq (perms, perm, pred_perm))
> ! {
> ! all_preds_permuted = false;
> ! break;
> ! }
> ! }
> ! if (!all_preds_permuted)
> ! {
> ! vertices[idx].perm_out = 0;
> ! changed = true;
> }
> - }
>
> ! /* If the initial propagation converged, switch on materialization
> ! and re-propagate. */
> ! if (!changed && !do_materialization)
> {
> ! do_materialization = true;
> ! changed = true;
> }
> }
> ! while (changed);
> ! statistics_histogram_event (cfun, "SLP optimize perm iterations", iteration);
>
> ! /* Materialize. */
> ! for (i = 0; i < vertices.length (); ++i)
> {
> ! int perm_in = vertices[i].perm_in;
> ! slp_tree node = vertices[i].node;
>
> ! /* First permute invariant/external original successors, we handle
> ! those optimistically during propagation and duplicate them if
> ! they are used with different permutations. */
> ! unsigned j;
> ! slp_tree child;
> ! if (perm_in > 0)
> ! FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), j, child)
> ! {
> ! if (!child
> ! || (SLP_TREE_DEF_TYPE (child) != vect_constant_def
> ! && SLP_TREE_DEF_TYPE (child) != vect_external_def))
> ! continue;
>
> ! /* If the vector is uniform there's nothing to do. */
> ! if (vect_slp_tree_uniform_p (child))
> ! continue;
>
> ! /* We can end up sharing some externals via two_operator
> ! handling. Be prepared to unshare those. */
> ! if (child->refcnt != 1)
> ! {
> ! gcc_assert (slpg->vertices[child->vertex].pred->pred_next);
> ! SLP_TREE_CHILDREN (node)[j] = child
> ! = vect_create_new_slp_node
> ! (SLP_TREE_SCALAR_OPS (child).copy ());
> ! }
> ! vect_slp_permute (perms[perm_in],
> ! SLP_TREE_SCALAR_OPS (child), true);
> ! }
>
> if (SLP_TREE_CODE (node) == VEC_PERM_EXPR)
> {
> /* Apply the common permutes to the input vectors. */
> ! if (perm_in > 0)
> ! {
> /* If the node is already a permute node we can apply
> the permutation to the lane selection, effectively
> materializing it on the incoming vectors. */
> --- 4457,5092 ----
> perm.safe_grow (SLP_TREE_LANES (node), true);
> for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j)
> perm[j] = SLP_TREE_LOAD_PERMUTATION (node)[j] - imin;
> ! bool existed;
> ! int &layout_i = layout_ids.get_or_insert (perm, &existed);
> ! if (existed)
> ! perm.release ();
> ! else
> ! {
> ! layout_i = m_perms.length ();
> ! m_perms.safe_push (perm);
> ! }
> ! m_partitions[partition_i].layout = layout_i;
> }
>
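To make the layout bookkeeping above concrete, here is a minimal
standalone sketch of the interning step -- purely illustrative, using
std::map instead of the pass's hash_map and invented names throughout;
layout 0 is reserved for "keep the original order":

  #include <cstdio>
  #include <map>
  #include <vector>

  int main ()
  {
    /* layouts[0] is the identity layout; every other entry is a
       permutation that some load would like to reverse.  */
    std::vector<std::vector<int>> layouts = { {} };
    std::map<std::vector<int>, int> layout_ids;

    auto intern = [&] (const std::vector<int> &perm) -> int
    {
      auto it = layout_ids.find (perm);
      if (it != layout_ids.end ())
        return it->second;          /* reuse the existing layout id  */
      int id = layouts.size ();
      layouts.push_back (perm);
      layout_ids.emplace (perm, id);
      return id;
    };

    /* Two loads with the same permutation share one layout id.  */
    printf ("%d\n", intern ({ 1, 0, 3, 2 }));   /* 1  */
    printf ("%d\n", intern ({ 3, 2, 1, 0 }));   /* 2  */
    printf ("%d\n", intern ({ 1, 0, 3, 2 }));   /* 1 again  */
    return 0;
  }
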
> ! /* Initially assume that every layout is possible and has zero cost
> ! in every partition. */
> ! m_partition_layout_costs.safe_grow_cleared (m_partitions.length ()
> ! * m_perms.length ());
> !
> ! /* We have to mark outgoing permutes facing non-reduction graph
> ! entries that are not represented as to be materialized. */
> ! for (slp_instance instance : m_vinfo->slp_instances)
> if (SLP_INSTANCE_KIND (instance) == slp_inst_kind_ctor)
> {
> ! unsigned int node_i = SLP_INSTANCE_TREE (instance)->vertex;
> ! m_partitions[m_vertices[node_i].partition].layout = 0;
> }
>
> ! /* Check which layouts each node and partition can handle. Calculate the
> ! weights associated with inserting layout changes on edges. */
> ! m_optimize_size = true;
> ! for (unsigned int node_i : m_partitioned_nodes)
> ! {
> ! auto &vertex = m_vertices[node_i];
> ! unsigned int partition_i = vertex.partition;
> ! auto &partition = m_partitions[partition_i];
> ! slp_tree node = vertex.node;
> !
> ! if (stmt_vec_info rep = SLP_TREE_REPRESENTATIVE (node))
> ! {
> ! basic_block bb = gimple_bb (vect_orig_stmt (rep)->stmt);
> ! if (optimize_bb_for_speed_p (bb))
> ! m_optimize_size = false;
> !
> ! vertex.weight = vect_slp_node_weight (node);
> !
> ! /* We do not handle stores with a permutation, so all
> ! incoming permutes must have been materialized. */
> ! if (STMT_VINFO_DATA_REF (rep)
> ! && DR_IS_WRITE (STMT_VINFO_DATA_REF (rep)))
> ! /* ??? We're forcing materialization in place
> ! of the child here, we'd need special handling
> ! in materialization to leave layout -1 here. */
> ! partition.layout = 0;
> !
> ! /* We cannot change the layout of an operation that is
> ! not independent on lanes. Note this is an explicit
> ! negative list since that's much shorter than the respective
> ! positive one but it's critical to keep maintaining it. */
> ! if (is_gimple_call (STMT_VINFO_STMT (rep)))
> ! switch (gimple_call_combined_fn (STMT_VINFO_STMT (rep)))
> ! {
> ! case CFN_COMPLEX_ADD_ROT90:
> ! case CFN_COMPLEX_ADD_ROT270:
> ! case CFN_COMPLEX_MUL:
> ! case CFN_COMPLEX_MUL_CONJ:
> ! case CFN_VEC_ADDSUB:
> ! case CFN_VEC_FMADDSUB:
> ! case CFN_VEC_FMSUBADD:
> ! partition.layout = 0;
> ! default:;
> ! }
> ! }
>
> ! if (partition.layout == 0)
> {
> ! /* The node must stay in its current order. */
> ! for (unsigned int layout_i = 1; layout_i < m_perms.length ();
> ! ++layout_i)
> ! partition_layout_costs (partition_i, layout_i).mark_impossible ();
> ! }
> ! else if (partition_can_absorb_permutes (partition_i))
> ! /* Calculate the cost of converting NODE to use each permute.
> ! If PARTITION.LAYOUT > 0 then that layout's permute will be free. */
> ! for (unsigned int layout_i = 0; layout_i < m_perms.length ();
> ! ++layout_i)
> ! {
> ! auto &layout_costs = partition_layout_costs (partition_i,
> ! layout_i);
> ! int factor = node_absorption_cost (node, layout_i);
> ! if (factor < 0)
> ! {
> ! layout_costs.mark_impossible ();
> ! continue;
> ! }
> ! gcc_assert (layout_costs.is_possible ());
> ! layout_costs.in_cost = { vertex.weight * factor, m_optimize_size };
> ! }
>
> ! auto process_edge = [&](graph_edge *ud, unsigned int node2_i,
> ! unsigned int partition2_i)
> ! {
> ! /* Count the number of edges from earlier partitions and the number
> ! of edges to later partitions. */
> ! if (partition2_i < partition_i)
> ! partition.in_degree += 1;
> ! else
> ! partition.out_degree += 1;
>
> ! /* If the current node uses the result of NODE2_I, accumulate
> ! the effects of that. */
> ! if (ud->src == int (node_i))
> ! {
> ! m_vertices[node2_i].out_weight += vertex.weight;
> ! m_vertices[node2_i].out_degree += 1;
> }
> + };
> + for_each_partition_edge (node_i, process_edge);
> + }
>
> ! /* Layout 0 represents the original order and layout LAYOUT_I represents
> ! the effect of reversing m_perms[LAYOUT_I]. If a predecessor partition
> ! does not support layout LAYOUT_I and if it also isn't possible to go
> ! from layout 0 to layout LAYOUT_I, then it is very difficult for a
> ! successor partition to support layout LAYOUT_I. (The only alternative
> ! would be to go to layout LAYOUT_I from a different nonzero layout, but
> ! that's a niche case and not really worth supporting.) */
> ! for (unsigned int layout_i = 1; layout_i < m_perms.length (); ++layout_i)
> ! {
> ! if (layout_is_reversible (layout_i))
> ! continue;
> !
> ! bool changed;
> ! do
> ! {
> ! changed = false;
> ! for (unsigned int node_i : m_partitioned_nodes)
> {
> ! unsigned int partition_i = m_vertices[node_i].partition;
> ! if (partition_layout_costs (partition_i,
> ! layout_i).is_possible ())
> ! continue;
> ! for (graph_edge *pred = m_slpg->vertices[node_i].pred;
> ! pred; pred = pred->pred_next)
> {
> ! unsigned int partition2_i = m_vertices[pred->src].partition;
> ! auto &costs = partition_layout_costs (partition2_i,
> ! layout_i);
> ! if (costs.is_possible ())
> {
> ! costs.mark_impossible ();
> ! changed = true;
> }
> }
> + }
> + }
> + while (changed);
> + }
> + }
> +
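The loop just above is a straightforward fixed-point propagation of
"this layout is impossible" flags across partition edges.  A throwaway
sketch of that shape, with the graph, the edge direction and all names
invented for illustration:

  #include <cstdio>
  #include <utility>
  #include <vector>

  int main ()
  {
    /* possible[p] says whether partition p can still use the layout.  */
    std::vector<bool> possible = { true, false, true, true };
    /* Edges from a producing partition to a consuming partition.  */
    std::vector<std::pair<int, int>> edges
      = { { 1, 2 }, { 2, 3 }, { 0, 2 } };

    bool changed;
    do
      {
        changed = false;
        for (const auto &e : edges)
          if (!possible[e.first] && possible[e.second])
            {
              /* A consumer can hardly keep a layout that its producer
                 cannot supply and that cannot be created en route.  */
              possible[e.second] = false;
              changed = true;
            }
      }
    while (changed);

    for (unsigned p = 0; p < possible.size (); ++p)
      printf ("partition %u: %s\n", p,
              possible[p] ? "possible" : "impossible");
    return 0;
  }
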
> + /* Return the cost of switching between layout LAYOUT1_I (at node NODE1_I)
> + and layout LAYOUT2_I on cross-partition use-to-def edge UD. */
> +
> + slpg_layout_cost
> + vect_optimize_slp_pass::edge_layout_cost (graph_edge *ud,
> + unsigned int node1_i,
> + unsigned int layout1_i,
> + unsigned int layout2_i)
> + {
> + auto def_layout_i = ud->dest == int (node1_i) ? layout1_i : layout2_i;
> + auto use_layout_i = ud->dest == int (node1_i) ? layout2_i : layout1_i;
> + auto factor = layout_cost (def_layout_i, use_layout_i);
> +
> + /* We have a choice of putting the layout change at the site of the
> + definition or at the site of the use. Prefer the former when
> + optimizing for size or when the execution frequency of the
> + definition is no greater than the combined execution frequencies of
> + the uses. When putting the layout change at the site of the definition,
> + divvy up the cost among all consumers. */
> + slpg_vertex &def_vertex = m_vertices[ud->dest];
> + slpg_vertex &use_vertex = m_vertices[ud->src];
> + if (m_optimize_size || def_vertex.weight <= def_vertex.out_weight)
> + {
> + slpg_layout_cost cost = { def_vertex.weight * factor, m_optimize_size };
> + cost.split (def_vertex.out_degree);
> + return cost;
> + }
> + return { use_vertex.weight * factor, m_optimize_size };
> + }
> +
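To put illustrative numbers on the placement rule above: if the
definition has weight 2 (say it sits outside the loop) while its uses
have a combined out_weight of 30 over out_degree 3, then 2 <= 30, so
the layout change is costed at the definition as 2 * factor and split
into 2 * factor / 3 per consuming edge.  If instead the definition sits
inside the loop (weight 30) and the use on this edge has weight 2, the
change is costed at the use as 2 * factor.  (Numbers purely made up;
the real weights come from block frequencies.)
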
> + /* UD represents a use-def link between FROM_NODE_I and a node in a later
> + partition; FROM_NODE_I could be the definition node or the use node.
> + The node at the other end of the link wants to use layout TO_LAYOUT_I;
> + return the cost of any necessary fix-ups on edge UD.
> +
> + At this point, FROM_NODE_I's partition has chosen the cheapest
> + layout based on the information available so far, but this choice
> + is only provisional. */
> +
> + slpg_layout_cost
> + vect_optimize_slp_pass::forward_cost (graph_edge *ud,
> + unsigned int from_node_i,
> + unsigned int to_layout_i)
> + {
> + slpg_vertex &from_vertex = m_vertices[from_node_i];
> + unsigned int from_partition_i = from_vertex.partition;
> + slpg_partition_info &from_partition = m_partitions[from_partition_i];
> + gcc_assert (from_partition.layout >= 0);
> +
> + /* First calculate the cost on the assumption that FROM_PARTITION sticks
> + with its current layout preference. */
> + slpg_layout_cost cost
> + = partition_layout_costs (from_partition_i,
> + from_partition.layout).in_cost;
> + cost.split (from_partition.out_degree);
> + cost.add_serial_cost (edge_layout_cost (ud, from_node_i,
> + from_partition.layout, to_layout_i));
> +
> + /* Take the minimum of that cost and the cost that applies if
> + FROM_PARTITION instead switches to TO_LAYOUT_I. */
> + auto &direct_layout_costs = partition_layout_costs (from_partition_i,
> + to_layout_i);
> + if (direct_layout_costs.is_possible ())
> + {
> + slpg_layout_cost direct_cost = direct_layout_costs.in_cost;
> + direct_cost.split (from_partition.out_degree);
> + if (direct_cost.is_better_than (cost, m_optimize_size))
> + cost = direct_cost;
> + }
> +
> + /* If FROM_PARTITION is likely to be able to absorb the required outgoing
> + permute, optimistically assume that that is what will happen. */
> + if (partition_can_absorb_permutes (from_partition_i))
> + {
> + int factor = node_absorption_cost (from_vertex.node, to_layout_i);
> + if (factor >= 0)
> + {
> + sreal weight = from_vertex.weight * factor;
> + slpg_layout_cost absorb_cost = { weight, m_optimize_size };
> + if (absorb_cost.is_better_than (cost, m_optimize_size))
> + cost = absorb_cost;
> + }
> + }
> +
> + return cost;
> + }
> +
> + /* UD represents a use-def link between TO_NODE_I and a node in an earlier
> + partition; TO_NODE_I could be the definition node or the use node.
> + The node at the other end of the link wants to use layout FROM_LAYOUT_I;
> + return the cost of any necessary fix-ups on edge UD.
> +
> + At this point, TO_NODE_I's partition has a fixed choice of layout. */
> +
> + slpg_layout_cost
> + vect_optimize_slp_pass::backward_cost (graph_edge *ud,
> + unsigned int to_node_i,
> + unsigned int from_layout_i)
> + {
> + slpg_vertex &to_vertex = m_vertices[to_node_i];
> + unsigned int to_partition_i = to_vertex.partition;
> + slpg_partition_info &to_partition = m_partitions[to_partition_i];
> + gcc_assert (to_partition.layout >= 0);
> +
> + slpg_layout_cost cost
> + = partition_layout_costs (to_partition_i,
> + to_partition.layout).out_cost;
> + cost.split (to_partition.in_degree);
> + cost.add_serial_cost (edge_layout_cost (ud, to_node_i, to_partition.layout,
> + from_layout_i));
> +
> + /* If TO_PARTITION is likely to be able to absorb the required input permute,
> + optimistically assume that that is what will happen. */
> + if (partition_can_absorb_permutes (to_partition_i))
> + {
> + int factor = node_absorption_cost (to_vertex.node, from_layout_i);
> + if (factor >= 0)
> + {
> + sreal weight = to_vertex.weight * factor;
> + slpg_layout_cost absorb_cost = { weight, m_optimize_size };
> + if (absorb_cost.is_better_than (cost, m_optimize_size))
> + cost = absorb_cost;
> + }
> + }
> +
> + return cost;
> + }
>
> ! /* Make a forward pass through the partitions, accumulating input costs.
> ! Make a tentative (provisional) choice of layout for each partition. */
> !
> ! void
> ! vect_optimize_slp_pass::forward_pass ()
> ! {
> ! for (unsigned int partition_i = 0; partition_i < m_partitions.length ();
> ! ++partition_i)
> ! {
> ! auto &partition = m_partitions[partition_i];
> ! unsigned int min_layout_i = 0;
> ! slpg_layout_cost *min_layout_cost = nullptr;
> ! for (unsigned int layout_i = 0; layout_i < m_perms.length (); ++layout_i)
> ! {
> ! auto &layout_costs = partition_layout_costs (partition_i, layout_i);
> ! if (!layout_costs.is_possible ())
> ! continue;
> !
> ! /* Accumulate the costs from predecessor partitions. */
> ! for (unsigned int order_i = partition.node_begin;
> ! order_i < partition.node_end; ++order_i)
> ! {
> ! unsigned int node_i = m_partitioned_nodes[order_i];
> ! auto add_cost = [&](graph_edge *ud, unsigned int from_node_i,
> ! unsigned int from_partition_i)
> {
> ! if (from_partition_i < partition_i)
> {
> ! auto cost = forward_cost (ud, from_node_i, layout_i);
> ! layout_costs.in_cost.add_parallel_cost (cost);
> }
> ! };
> ! for_each_partition_edge (node_i, add_cost);
> ! }
>
> ! /* Record the layout with the lowest cost. Prefer layout 0 in
> ! the event of a tie between it and another layout. */
> ! if (!min_layout_cost
> ! || layout_costs.in_cost.is_better_than (*min_layout_cost,
> ! m_optimize_size))
> ! {
> ! min_layout_i = layout_i;
> ! min_layout_cost = &layout_costs.in_cost;
> }
> + }
> + gcc_assert (min_layout_cost);
> + partition.layout = min_layout_i;
> + }
> + }
>
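For intuition, here is a toy forward sweep over a straight-line chain
of three partitions that can each use layout 0 or 1.  It is only a
sketch: the numbers are invented, the real pass works on a DAG with
slpg_layout_cost objects rather than plain ints, and it keeps just a
provisional per-partition choice rather than a full table of paths:

  #include <algorithm>
  #include <cstdio>

  int main ()
  {
    const int num_partitions = 3, num_layouts = 2;
    /* Cost of changing from layout a to layout b on an edge.  */
    int change[2][2] = { { 0, 1 }, { 1, 0 } };
    /* Cost each partition pays internally for using a layout.  */
    int local[3][2] = { { 0, 2 }, { 3, 0 }, { 1, 1 } };

    int in_cost[3][2];
    for (int p = 0; p < num_partitions; ++p)
      for (int l = 0; l < num_layouts; ++l)
        {
          in_cost[p][l] = local[p][l];
          if (p > 0)
            {
              /* Cheapest way of being fed by the previous partition.  */
              int best = in_cost[p - 1][0] + change[0][l];
              for (int l2 = 1; l2 < num_layouts; ++l2)
                best = std::min (best, in_cost[p - 1][l2] + change[l2][l]);
              in_cost[p][l] += best;
            }
        }

    for (int p = 0; p < num_partitions; ++p)
      printf ("partition %d: provisional layout %d (costs %d vs %d)\n",
              p, in_cost[p][0] <= in_cost[p][1] ? 0 : 1,
              in_cost[p][0], in_cost[p][1]);
    return 0;
  }
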
> ! /* Make a backward pass through the partitions, accumulating output costs.
> ! Make a final choice of layout for each partition. */
>
> ! void
> ! vect_optimize_slp_pass::backward_pass ()
> ! {
> ! for (unsigned int partition_i = m_partitions.length (); partition_i-- > 0;)
> ! {
> ! auto &partition = m_partitions[partition_i];
> ! unsigned int min_layout_i = 0;
> ! slpg_layout_cost *min_layout_cost = nullptr;
> ! unsigned int seen_min_layout_i = 0;
> ! slpg_layout_cost *seen_min_layout_cost = nullptr;
> ! for (unsigned int layout_i = 0; layout_i < m_perms.length (); ++layout_i)
> ! {
> ! auto &layout_costs = partition_layout_costs (partition_i, layout_i);
> ! if (!layout_costs.is_possible ())
> continue;
>
> ! /* Accumulate the costs from successor partitions. Record whether
> ! any of them actually uses this layout. */
> ! bool seen = false;
> ! for (unsigned int order_i = partition.node_begin;
> ! order_i < partition.node_end; ++order_i)
> ! {
> ! unsigned int node_i = m_partitioned_nodes[order_i];
> ! auto add_cost = [&](graph_edge *ud, unsigned int to_node_i,
> ! unsigned int to_partition_i)
> ! {
> ! if (to_partition_i > partition_i)
> ! {
> ! auto cost = backward_cost (ud, to_node_i, layout_i);
> ! layout_costs.out_cost.add_parallel_cost (cost);
> ! seen |= (m_partitions[to_partition_i].layout
> ! == int (layout_i));
> ! }
> ! };
> ! for_each_partition_edge (node_i, add_cost);
> }
>
> ! /* Locally combine the costs from the forward and backward passes.
> ! (This combined cost is not passed on, since that would lead
> ! to double counting.) */
> ! layout_costs.cost = layout_costs.in_cost;
> ! layout_costs.cost.add_serial_cost (layout_costs.out_cost);
> !
> ! /* Record the layout with the lowest cost. Prefer layout 0 in
> ! the event of a tie between it and another layout. */
> ! if (!min_layout_cost
> ! || layout_costs.cost.is_better_than (*min_layout_cost,
> ! m_optimize_size))
> ! {
> ! min_layout_i = layout_i;
> ! min_layout_cost = &layout_costs.cost;
> ! }
> ! /* Likewise, but restrict the choice of layouts to those that
> ! are used by successor partitions. */
> ! if (seen
> ! && (!seen_min_layout_cost
> ! || layout_costs.cost.is_better_than (*seen_min_layout_cost,
> ! m_optimize_size)))
> ! {
> ! seen_min_layout_i = layout_i;
> ! seen_min_layout_cost = &layout_costs.cost;
> ! }
> ! }
> ! gcc_assert (min_layout_cost);
> !
> ! /* If the partition can absorb permutes, prefer a layout that has
> ! actually been used. */
> ! if (seen_min_layout_cost
> ! && partition_can_absorb_permutes (partition_i))
> ! partition.layout = seen_min_layout_i;
> ! else
> ! partition.layout = min_layout_i;
> ! }
> ! }
> !
> ! /* Return a node that applies layout TO_LAYOUT_I to the original form of NODE.
> ! NODE already has the layout that was selected for its partition. */
> !
> ! slp_tree
> ! vect_optimize_slp_pass::get_result_with_layout (slp_tree node,
> ! unsigned int to_layout_i)
> ! {
> ! unsigned int result_i = node->vertex * m_perms.length () + to_layout_i;
> ! slp_tree result = m_node_layouts[result_i];
> ! if (result)
> ! return result;
> !
> ! if (SLP_TREE_DEF_TYPE (node) == vect_constant_def
> ! || SLP_TREE_DEF_TYPE (node) == vect_external_def)
> ! {
> ! /* If the vector is uniform or unchanged, there's nothing to do. */
> ! if (to_layout_i == 0 || vect_slp_tree_uniform_p (node))
> ! result = node;
> ! else
> {
> ! auto scalar_ops = SLP_TREE_SCALAR_OPS (node).copy ();
> ! result = vect_create_new_slp_node (scalar_ops);
> ! vect_slp_permute (m_perms[to_layout_i], scalar_ops, true);
> }
> }
> ! else
> ! {
> ! unsigned int partition_i = m_vertices[node->vertex].partition;
> ! unsigned int from_layout_i = m_partitions[partition_i].layout;
> ! if (from_layout_i == to_layout_i)
> ! return node;
> !
> ! /* FIXME: if the result of a pre-existing VEC_PERM_EXPR is used
> ! twice, with two different layouts, then we could create
> ! parallel VEC_PERM_EXPRs rather than add a serial one. */
> !
> ! /* FIXME: When calculating the costs, we considered the possibility
> ! of duplicating a VEC_PERM_EXPR at each use site, rather than
> ! having one at the definition site. Do that here if it turns
> ! out to be cheaper. */
> !
> ! if (dump_enabled_p ())
> ! dump_printf_loc (MSG_NOTE, vect_location,
> ! "inserting permute node in place of %p\n",
> ! node);
> !
> ! unsigned int num_lanes = SLP_TREE_LANES (node);
> ! result = vect_create_new_slp_node (1, VEC_PERM_EXPR);
> ! if (SLP_TREE_SCALAR_STMTS (node).length ())
> ! {
> ! auto &stmts = SLP_TREE_SCALAR_STMTS (result);
> ! stmts.safe_splice (SLP_TREE_SCALAR_STMTS (node));
> ! if (from_layout_i != 0)
> ! vect_slp_permute (m_perms[from_layout_i], stmts, false);
> ! if (to_layout_i != 0)
> ! vect_slp_permute (m_perms[to_layout_i], stmts, true);
> ! }
> ! SLP_TREE_REPRESENTATIVE (result) = SLP_TREE_REPRESENTATIVE (node);
> ! SLP_TREE_LANES (result) = num_lanes;
> ! SLP_TREE_VECTYPE (result) = SLP_TREE_VECTYPE (node);
> ! result->vertex = -1;
> !
> ! auto &lane_perm = SLP_TREE_LANE_PERMUTATION (result);
> ! lane_perm.create (num_lanes);
> ! for (unsigned j = 0; j < num_lanes; ++j)
> ! lane_perm.quick_push ({ 0, j });
> ! if (from_layout_i != 0)
> ! vect_slp_permute (m_perms[from_layout_i], lane_perm, false);
> ! if (to_layout_i != 0)
> ! vect_slp_permute (m_perms[to_layout_i], lane_perm, true);
> !
> ! node->refcnt++;
> ! SLP_TREE_CHILDREN (result).safe_push (node);
> ! }
> ! m_node_layouts[result_i] = result;
> ! return result;
> ! }
> !
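The lane permutation built above for the new VEC_PERM node is in effect
"undo the layout NODE already has, then apply the layout the user
wants".  A throwaway sketch of that composition -- invert/compose are
invented helpers with a simplified convention (applying a permutation
sends input lane i to output lane perm[i]), not vect_slp_permute itself:

  #include <cstdio>
  #include <vector>

  static std::vector<int> invert (const std::vector<int> &perm)
  {
    std::vector<int> inv (perm.size ());
    for (unsigned i = 0; i < perm.size (); ++i)
      inv[perm[i]] = i;
    return inv;
  }

  /* Apply FIRST, then SECOND, as a single permutation.  */
  static std::vector<int> compose (const std::vector<int> &first,
                                   const std::vector<int> &second)
  {
    std::vector<int> result (first.size ());
    for (unsigned i = 0; i < first.size (); ++i)
      result[i] = second[first[i]];
    return result;
  }

  int main ()
  {
    std::vector<int> from_layout = { 1, 0, 3, 2 };  /* layout NODE has  */
    std::vector<int> to_layout = { 3, 2, 1, 0 };    /* layout wanted    */
    for (int lane : compose (invert (from_layout), to_layout))
      printf ("%d ", lane);                         /* prints 2 3 0 1   */
    printf ("\n");
    return 0;
  }
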
> ! /* Print the partition graph and layout information to the dump file. */
>
> ! void
> ! vect_optimize_slp_pass::dump ()
> ! {
> ! dump_printf_loc (MSG_NOTE, vect_location,
> ! "SLP optimize permutes:\n");
> ! for (unsigned int layout_i = 1; layout_i < m_perms.length (); ++layout_i)
> {
> ! dump_printf_loc (MSG_NOTE, vect_location, " %d: { ", layout_i);
> ! const char *sep = "";
> ! for (unsigned int idx : m_perms[layout_i])
> ! {
> ! dump_printf (MSG_NOTE, "%s%d", sep, idx);
> ! sep = ", ";
> ! }
> ! dump_printf (MSG_NOTE, " }\n");
> ! }
> ! dump_printf_loc (MSG_NOTE, vect_location,
> ! "SLP optimize partitions:\n");
> ! for (unsigned int partition_i = 0; partition_i < m_partitions.length ();
> ! ++partition_i)
> ! {
> ! auto &partition = m_partitions[partition_i];
> ! dump_printf_loc (MSG_NOTE, vect_location, " -------------\n");
> ! dump_printf_loc (MSG_NOTE, vect_location,
> ! " partition %d (layout %d):\n",
> ! partition_i, partition.layout);
> ! dump_printf_loc (MSG_NOTE, vect_location, " nodes:\n");
> ! for (unsigned int order_i = partition.node_begin;
> ! order_i < partition.node_end; ++order_i)
> ! {
> ! unsigned int node_i = m_partitioned_nodes[order_i];
> ! slp_tree node = m_vertices[node_i].node;
> ! dump_printf_loc (MSG_NOTE, vect_location, " - %p:\n",
> ! (void *) m_vertices[node_i].node);
> ! dump_printf_loc (MSG_NOTE, vect_location,
> ! " weight: %f\n",
> ! m_vertices[node_i].weight.to_double ());
> ! if (m_vertices[node_i].out_degree)
> ! dump_printf_loc (MSG_NOTE, vect_location,
> ! " out weight: %f (degree %d)\n",
> ! m_vertices[node_i].out_weight.to_double (),
> ! m_vertices[node_i].out_degree);
> ! if (SLP_TREE_CODE (node) == VEC_PERM_EXPR)
> ! dump_printf_loc (MSG_NOTE, vect_location,
> ! " op: VEC_PERM_EXPR\n");
> ! else if (auto rep = SLP_TREE_REPRESENTATIVE (node))
> ! dump_printf_loc (MSG_NOTE, vect_location,
> ! " op template: %G", rep->stmt);
> ! }
> ! dump_printf_loc (MSG_NOTE, vect_location, " edges:\n");
> ! for (unsigned int order_i = partition.node_begin;
> ! order_i < partition.node_end; ++order_i)
> ! {
> ! unsigned int node_i = m_partitioned_nodes[order_i];
> ! auto print_edge = [&](graph_edge *, unsigned int node2_i,
> ! unsigned int partition2_i)
> ! {
> ! if (partition2_i < partition_i)
> ! dump_printf_loc (MSG_NOTE, vect_location,
> ! " - %p [%d] --> %p\n",
> ! m_vertices[node2_i].node, partition2_i,
> ! m_vertices[node_i].node);
> ! else
> ! dump_printf_loc (MSG_NOTE, vect_location,
> ! " - %p --> [%d] %p\n",
> ! m_vertices[node_i].node, partition2_i,
> ! m_vertices[node2_i].node);
> ! };
> ! for_each_partition_edge (node_i, print_edge);
> ! }
>
> ! for (unsigned int layout_i = 0; layout_i < m_perms.length (); ++layout_i)
> ! {
> ! auto &layout_costs = partition_layout_costs (partition_i, layout_i);
> ! if (layout_costs.is_possible ())
> ! {
> ! dump_printf_loc (MSG_NOTE, vect_location,
> ! " layout %d:%s\n", layout_i,
> ! partition.layout == int (layout_i)
> ! ? " (*)" : "");
> ! #define TEMPLATE "{depth: %f, total: %f}"
> ! dump_printf_loc (MSG_NOTE, vect_location,
> ! " " TEMPLATE "\n", layout_i,
> ! layout_costs.in_cost.depth.to_double (),
> ! layout_costs.in_cost.total.to_double ());
> ! dump_printf_loc (MSG_NOTE, vect_location,
> ! " + " TEMPLATE "\n",
> ! layout_costs.out_cost.depth.to_double (),
> ! layout_costs.out_cost.total.to_double ());
> ! dump_printf_loc (MSG_NOTE, vect_location,
> ! " = " TEMPLATE "\n",
> ! layout_costs.cost.depth.to_double (),
> ! layout_costs.cost.total.to_double ());
> ! #undef TEMPLATE
> ! }
> ! else
> ! dump_printf_loc (MSG_NOTE, vect_location,
> ! " layout %d: rejected\n", layout_i);
> ! }
> ! }
> ! }
>
> ! /* Apply the chosen vector layouts to the SLP graph. */
>
> ! void
> ! vect_optimize_slp_pass::materialize ()
> ! {
> ! m_partition_layout_costs.release ();
> ! m_node_layouts.safe_grow_cleared (m_vertices.length () * m_perms.length ());
>
> + for (unsigned int node_i : m_partitioned_nodes)
> + {
> + /* Rearrange the scalar statements to match the chosen layout. */
> + slp_tree node = m_vertices[node_i].node;
> + int layout_i = m_partitions[m_vertices[node_i].partition].layout;
> + gcc_assert (layout_i >= 0);
> + if (layout_i > 0)
> + vect_slp_permute (m_perms[layout_i],
> + SLP_TREE_SCALAR_STMTS (node), true);
> +
> + /* Update load and lane permutes. */
> if (SLP_TREE_CODE (node) == VEC_PERM_EXPR)
> {
> /* Apply the common permutes to the input vectors. */
> ! /* FIXME: check whether the target accepts this folding and
> ! handle it like a normal node if not. That would include
> ! adjusting the follow-up loop below. */
> ! unsigned int opno;
> ! slp_tree child;
> ! FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), opno, child)
> ! {
> ! unsigned int partition2_i = m_vertices[child->vertex].partition;
> ! int layout2_i = m_partitions[partition2_i].layout;
> ! if (layout2_i == 0)
> ! continue;
> !
> /* If the node is already a permute node we can apply
> the permutation to the lane selection, effectively
> materializing it on the incoming vectors. */
> ***************
> *** 3987,4141 ****
> dump_printf_loc (MSG_NOTE, vect_location,
> "simplifying permute node %p\n",
> node);
> ! for (unsigned k = 0;
> ! k < SLP_TREE_LANE_PERMUTATION (node).length (); ++k)
> ! SLP_TREE_LANE_PERMUTATION (node)[k].second
> ! = perms[perm_in][SLP_TREE_LANE_PERMUTATION (node)[k].second];
> ! }
> ! /* Apply the anticipated output permute to the permute and
> ! stmt vectors. */
> ! int perm_out = vertices[i].perm_out;
> ! if (perm_out > 0)
> ! {
> ! vect_slp_permute (perms[perm_out],
> ! SLP_TREE_SCALAR_STMTS (node), true);
> ! vect_slp_permute (perms[perm_out],
> ! SLP_TREE_LANE_PERMUTATION (node), true);
> ! }
> ! }
> ! else if (vertices[i].get_perm_materialized () != 0)
> ! {
> ! if (SLP_TREE_LOAD_PERMUTATION (node).exists ())
> ! /* For loads simply drop the permutation, the load permutation
> ! already performs the desired permutation. */
> ! ;
> ! else if (SLP_TREE_LANE_PERMUTATION (node).exists ())
> ! gcc_unreachable ();
> ! else
> ! {
> ! if (dump_enabled_p ())
> ! dump_printf_loc (MSG_NOTE, vect_location,
> ! "inserting permute node in place of %p\n",
> ! node);
> !
> ! /* Make a copy of NODE and in-place change it to a
> ! VEC_PERM node to permute the lanes of the copy. */
> ! slp_tree copy = new _slp_tree;
> ! SLP_TREE_CHILDREN (copy) = SLP_TREE_CHILDREN (node);
> ! SLP_TREE_CHILDREN (node) = vNULL;
> ! SLP_TREE_SCALAR_STMTS (copy)
> ! = SLP_TREE_SCALAR_STMTS (node).copy ();
> ! vect_slp_permute (perms[perm_in],
> ! SLP_TREE_SCALAR_STMTS (copy), true);
> ! gcc_assert (!SLP_TREE_SCALAR_OPS (node).exists ());
> ! SLP_TREE_REPRESENTATIVE (copy) = SLP_TREE_REPRESENTATIVE (node);
> ! gcc_assert (!SLP_TREE_LOAD_PERMUTATION (node).exists ());
> ! SLP_TREE_LANE_PERMUTATION (copy)
> ! = SLP_TREE_LANE_PERMUTATION (node);
> ! SLP_TREE_LANE_PERMUTATION (node) = vNULL;
> ! SLP_TREE_VECTYPE (copy) = SLP_TREE_VECTYPE (node);
> ! copy->refcnt = 1;
> ! copy->max_nunits = node->max_nunits;
> ! SLP_TREE_DEF_TYPE (copy) = SLP_TREE_DEF_TYPE (node);
> ! SLP_TREE_LANES (copy) = SLP_TREE_LANES (node);
> ! SLP_TREE_CODE (copy) = SLP_TREE_CODE (node);
> !
> ! /* Now turn NODE into a VEC_PERM. */
> ! SLP_TREE_CHILDREN (node).safe_push (copy);
> ! SLP_TREE_LANE_PERMUTATION (node).create (SLP_TREE_LANES (node));
> ! for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j)
> ! SLP_TREE_LANE_PERMUTATION (node)
> ! .quick_push (std::make_pair (0, perms[perm_in][j]));
> ! SLP_TREE_CODE (node) = VEC_PERM_EXPR;
> ! }
> ! }
> ! else if (perm_in > 0) /* perm_in == perm_out */
> ! {
> ! /* Apply the reverse permutation to our stmts. */
> ! vect_slp_permute (perms[perm_in],
> ! SLP_TREE_SCALAR_STMTS (node), true);
> ! /* And to the lane/load permutation, which we can simply
> ! make regular by design. */
> ! if (SLP_TREE_LOAD_PERMUTATION (node).exists ())
> ! {
> ! gcc_assert (!SLP_TREE_LANE_PERMUTATION (node).exists ());
> ! /* ??? When we handle non-bijective permutes the idea
> ! is that we can force the load-permutation to be
> ! { min, min + 1, min + 2, ... max }. But then the
> ! scalar defs might no longer match the lane content
> ! which means wrong-code with live lane vectorization.
> ! So we possibly have to have NULL entries for those. */
> ! vect_slp_permute (perms[perm_in],
> ! SLP_TREE_LOAD_PERMUTATION (node), true);
> ! }
> ! else if (SLP_TREE_LANE_PERMUTATION (node).exists ())
> ! gcc_unreachable ();
> }
> }
>
> ! /* Elide any permutations at BB reduction roots. */
> ! if (is_a <bb_vec_info> (vinfo))
> {
> ! for (slp_instance instance : vinfo->slp_instances)
> {
> ! if (SLP_INSTANCE_KIND (instance) != slp_inst_kind_bb_reduc)
> continue;
> ! slp_tree old = SLP_INSTANCE_TREE (instance);
> ! if (SLP_TREE_CODE (old) == VEC_PERM_EXPR
> ! && SLP_TREE_CHILDREN (old).length () == 1)
> {
> ! slp_tree child = SLP_TREE_CHILDREN (old)[0];
> ! if (SLP_TREE_DEF_TYPE (child) == vect_external_def)
> ! {
> ! /* Preserve the special VEC_PERM we use to shield existing
> ! vector defs from the rest. But make it a no-op. */
> ! unsigned i = 0;
> ! for (std::pair<unsigned, unsigned> &p
> ! : SLP_TREE_LANE_PERMUTATION (old))
> ! p.second = i++;
> ! }
> ! else
> ! {
> ! SLP_INSTANCE_TREE (instance) = child;
> ! SLP_TREE_REF_COUNT (child)++;
> ! vect_free_slp_tree (old);
> ! }
> ! }
> ! else if (SLP_TREE_LOAD_PERMUTATION (old).exists ()
> ! && SLP_TREE_REF_COUNT (old) == 1
> ! && vertices[old->vertex].get_perm_materialized () != 0)
> ! {
> ! /* ??? For loads the situation is more complex since
> ! we can't modify the permute in place in case the
> ! node is used multiple times. In fact for loads this
> ! should be somehow handled in the propagation engine. */
> ! /* Apply the reverse permutation to our stmts. */
> ! int perm = vertices[old->vertex].get_perm_materialized ();
> ! vect_slp_permute (perms[perm],
> ! SLP_TREE_SCALAR_STMTS (old), true);
> ! vect_slp_permute (perms[perm],
> ! SLP_TREE_LOAD_PERMUTATION (old), true);
> }
> }
> }
>
> ! /* Free the perms vector used for propagation. */
> ! while (!perms.is_empty ())
> ! perms.pop ().release ();
> ! free_graph (slpg);
> !
>
> ! /* Now elide load permutations that are not necessary. */
> ! for (i = 0; i < leafs.length (); ++i)
> {
> ! node = vertices[leafs[i]].node;
> if (!SLP_TREE_LOAD_PERMUTATION (node).exists ())
> continue;
>
> /* In basic block vectorization we allow any subchain of an interleaving
> chain.
> FORNOW: not in loop SLP because of realignment complications. */
> ! if (is_a <bb_vec_info> (vinfo))
> {
> bool subchain_p = true;
> stmt_vec_info next_load_info = NULL;
> --- 5094,5173 ----
> dump_printf_loc (MSG_NOTE, vect_location,
> "simplifying permute node %p\n",
> node);
> ! for (auto &entry : SLP_TREE_LANE_PERMUTATION (node))
> ! if (entry.first == opno)
> ! entry.second = m_perms[layout2_i][entry.second];
> ! }
> ! /* Reverse the permutation that we chose to reverse. */
> ! if (layout_i > 0)
> ! vect_slp_permute (m_perms[layout_i],
> ! SLP_TREE_LANE_PERMUTATION (node), true);
> ! }
> ! else
> ! {
> ! gcc_assert (!SLP_TREE_LANE_PERMUTATION (node).exists ());
> ! auto &load_perm = SLP_TREE_LOAD_PERMUTATION (node);
> ! if (layout_i > 0)
> ! /* ??? When we handle non-bijective permutes the idea
> ! is that we can force the load-permutation to be
> ! { min, min + 1, min + 2, ... max }. But then the
> ! scalar defs might no longer match the lane content
> ! which means wrong-code with live lane vectorization.
> ! So we possibly have to have NULL entries for those. */
> ! vect_slp_permute (m_perms[layout_i], load_perm, true);
> }
> }
>
> ! /* Do this before any nodes disappear, since it involves a walk
> ! over the leaves. */
> ! remove_redundant_permutes ();
> !
> ! /* Replace each child with a correctly laid-out version. */
> ! for (unsigned int node_i : m_partitioned_nodes)
> {
> ! slp_tree node = m_vertices[node_i].node;
> !
> ! /* The loop above adjusted permutes for each child's chosen layout,
> ! so we don't need to adjust them here. */
> ! if (SLP_TREE_CODE (node) == VEC_PERM_EXPR)
> ! continue;
> !
> ! int layout_i = m_partitions[m_vertices[node_i].partition].layout;
> ! gcc_assert (layout_i >= 0);
> ! unsigned j;
> ! slp_tree child;
> ! FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), j, child)
> {
> ! if (!child)
> continue;
> !
> ! slp_tree new_child = get_result_with_layout (child, layout_i);
> ! if (new_child != child)
> {
> ! vect_free_slp_tree (child);
> ! SLP_TREE_CHILDREN (node)[j] = new_child;
> ! new_child->refcnt += 1;
> }
> }
> }
> + }
>
> ! /* Elide load permutations that are not necessary. Such permutations might
> ! be pre-existing, rather than created by the layout optimizations. */
>
> ! void
> ! vect_optimize_slp_pass::remove_redundant_permutes ()
> ! {
> ! for (unsigned int node_i : m_leafs)
> {
> ! slp_tree node = m_vertices[node_i].node;
> if (!SLP_TREE_LOAD_PERMUTATION (node).exists ())
> continue;
>
> /* In basic block vectorization we allow any subchain of an interleaving
> chain.
> FORNOW: not in loop SLP because of realignment complications. */
> ! if (is_a <bb_vec_info> (m_vinfo))
> {
> bool subchain_p = true;
> stmt_vec_info next_load_info = NULL;
> ***************
> *** 4160,4165 ****
> --- 5192,5198 ----
> }
> else
> {
> + loop_vec_info loop_vinfo = as_a<loop_vec_info> (m_vinfo);
> stmt_vec_info load_info;
> bool this_load_permuted = false;
> unsigned j;
> ***************
> *** 4175,4182 ****
> /* The load requires permutation when unrolling exposes
> a gap either because the group is larger than the SLP
> group-size or because there is a gap between the groups. */
> ! && (known_eq (LOOP_VINFO_VECT_FACTOR
> ! (as_a <loop_vec_info> (vinfo)), 1U)
> || ((SLP_TREE_LANES (node) == DR_GROUP_SIZE (first_stmt_info))
> && DR_GROUP_GAP (first_stmt_info) == 0)))
> {
> --- 5208,5214 ----
> /* The load requires permutation when unrolling exposes
> a gap either because the group is larger than the SLP
> group-size or because there is a gap between the groups. */
> ! && (known_eq (LOOP_VINFO_VECT_FACTOR (loop_vinfo), 1U)
> || ((SLP_TREE_LANES (node) == DR_GROUP_SIZE (first_stmt_info))
> && DR_GROUP_GAP (first_stmt_info) == 0)))
> {
> ***************
> *** 4187,4192 ****
> --- 5219,5258 ----
> }
> }
>
> + /* Main entry point for the optimization pass. */
> +
> + void
> + vect_optimize_slp_pass::run ()
> + {
> + build_graph ();
> + create_partitions ();
> + start_choosing_layouts ();
> + if (m_perms.length () > 1)
> + {
> + forward_pass ();
> + backward_pass ();
> + if (dump_enabled_p ())
> + dump ();
> + materialize ();
> + }
> + else
> + remove_redundant_permutes ();
> + /* Free the perms vector used for propagation. */
> + while (!m_perms.is_empty ())
> + m_perms.pop ().release ();
> + free_graph (m_slpg);
> + }
> +
> + /* Optimize the SLP graph of VINFO. */
> +
> + void
> + vect_optimize_slp (vec_info *vinfo)
> + {
> + if (vinfo->slp_instances.is_empty ())
> + return;
> + vect_optimize_slp_pass (vinfo).run ();
> + }
> +
> /* Gather loads reachable from the individual SLP graph entries. */
>
> void
> ***************
> *** 6647,6653 ****
> {
> stmt_vec_info stmt_info = SLP_TREE_SCALAR_STMTS (node)[0];
> int vec_index = 0;
> ! tree vectype = STMT_VINFO_VECTYPE (stmt_info);
> unsigned int group_size = SLP_TREE_SCALAR_STMTS (node).length ();
> unsigned int mask_element;
> machine_mode mode;
> --- 7713,7719 ----
> {
> stmt_vec_info stmt_info = SLP_TREE_SCALAR_STMTS (node)[0];
> int vec_index = 0;
> ! tree vectype = SLP_TREE_VECTYPE (node);
> unsigned int group_size = SLP_TREE_SCALAR_STMTS (node).length ();
> unsigned int mask_element;
> machine_mode mode;
>
--
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
HRB 36809 (AG Nuernberg)