public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
From: Richard Biener <rguenther@suse.de>
To: Richard Sandiford <richard.sandiford@arm.com>
Cc: gcc-patches@gcc.gnu.org
Subject: Re: [PATCH] RFC: Extend SLP permutation optimisations
Date: Wed, 3 Aug 2022 11:23:37 +0000 (UTC)	[thread overview]
Message-ID: <nycvar.YFH.7.77.849.2208031118240.4208@jbgna.fhfr.qr> (raw)
In-Reply-To: <mpt7d3rru04.fsf@arm.com>

On Tue, 2 Aug 2022, Richard Sandiford wrote:

> Currently SLP tries to force permute operations "down" the graph
> from loads in the hope of reducing the total number of permutes
> needed or (in the best case) removing the need for the permutes
> entirely.  This patch tries to extend it as follows:
> 
> - Allow loads to take a different permutation from the one they
>   started with, rather than choosing between "original permutation"
>   and "no permutation".
> 
> - Allow changes in both directions, if the target supports the
>   reverse permute operation.
> 
> - Treat the placement of permute operations as a two-way dataflow
>   problem: after propagating information from leaves to roots (as now),
>   propagate information back up the graph.
> 
> - Take execution frequency into account when optimising for speed,
>   so that (for example) permutes inside loops have a higher cost
>   than permutes outside loops.
> 
> - Try to reduce the total number of permutes when optimising for
>   size, even if that increases the number of permutes on a given
>   execution path.
> 
> See the big block comment above vect_optimize_slp_pass for
> a detailed description.
> 
> A while back I posted a patch to extend the existing optimisation
> to non-consecutive loads.  This patch doesn't include that change
> (although it's a possible future addition).
> 
> The original motivation for doing this was to add a framework that would
> allow other layout differences in future.  The two main ones are:
> 
> - Make it easier to represent predicated operations, including
>   predicated operations with gaps.  E.g.:
> 
>      a[0] += 1;
>      a[1] += 1;
>      a[3] += 1;
> 
>   could be a single load/add/store for SVE.  We could handle this
>   by representing a layout such as { 0, 1, _, 2 } or { 0, 1, _, 3 }
>   (depending on what's being counted).  We might need to move
>   elements between lanes at various points, like with permutes.
> 
>   (This would first mean adding support for stores with gaps.)
> 
> - Make it easier to switch between an even/odd and unpermuted layout
>   when switching between wide and narrow elements.  E.g. if a widening
>   operation produces an even vector and an odd vector, we should try
>   to keep operations on the wide elements in that order rather than
>   force them to be permuted back "in order".
> 
> To give some examples of what the current patch does:
> 
> int f1(int *__restrict a, int *__restrict b, int *__restrict c,
>        int *__restrict d)
> {
>   a[0] = (b[1] << c[3]) - d[1];
>   a[1] = (b[0] << c[2]) - d[0];
>   a[2] = (b[3] << c[1]) - d[3];
>   a[3] = (b[2] << c[0]) - d[2];
> }
> 
> continues to produce the same code as before when optimising for
> speed: b, c and d are permuted at load time.  But when optimising
> for size we instead permute c into the same order as b+d and then
> permute the result of the arithmetic into the same order as a:
> 
>         ldr     q1, [x2]
>         ldr     q0, [x1]
>         ext     v1.16b, v1.16b, v1.16b, #8     // <------
>         sshl    v0.4s, v0.4s, v1.4s
>         ldr     q1, [x3]
>         sub     v0.4s, v0.4s, v1.4s
>         rev64   v0.4s, v0.4s                   // <------
>         str     q0, [x0]
>         ret
> 
> The following function:
> 
> int f2(int *__restrict a, int *__restrict b, int *__restrict c,
>        int *__restrict d)
> {
>   a[0] = (b[3] << c[3]) - d[3];
>   a[1] = (b[2] << c[2]) - d[2];
>   a[2] = (b[1] << c[1]) - d[1];
>   a[3] = (b[0] << c[0]) - d[0];
> }
> 
> continues to push the reverse down to just before the store,
> like the current code does.
> 
> In:
> 
> int f3(int *__restrict a, int *__restrict b, int *__restrict c,
>        int *__restrict d)
> {
>   for (int i = 0; i < 100; ++i)
>     {
>       a[0] = (a[0] + c[3]);
>       a[1] = (a[1] + c[2]);
>       a[2] = (a[2] + c[1]);
>       a[3] = (a[3] + c[0]);
>       c += 4;
>     }
> }
> 
> the loads of a are hoisted and the stores of a are sunk, so that
> only the load from c happens in the loop.  When optimising for
> speed, we prefer to have the loop operate on the reversed layout,
> changing on entry and exit from the loop:
> 
>         mov     x3, x0
>         adrp    x0, .LC0
>         add     x1, x2, 1600
>         ldr     q2, [x0, #:lo12:.LC0]
>         ldr     q0, [x3]
>         mov     v1.16b, v0.16b
>         tbl     v0.16b, {v0.16b - v1.16b}, v2.16b    // <--------
>         .p2align 3,,7
> .L6:
>         ldr     q1, [x2], 16
>         add     v0.4s, v0.4s, v1.4s
>         cmp     x2, x1
>         bne     .L6
>         mov     v1.16b, v0.16b
>         adrp    x0, .LC0
>         ldr     q2, [x0, #:lo12:.LC0]
>         tbl     v0.16b, {v0.16b - v1.16b}, v2.16b    // <--------
>         str     q0, [x3]
>         ret
> 
> Similarly, for the very artificial testcase:
> 
> int f4(int *__restrict a, int *__restrict b, int *__restrict c,
>        int *__restrict d)
> {
>   int a0 = a[0];
>   int a1 = a[1];
>   int a2 = a[2];
>   int a3 = a[3];
>   for (int i = 0; i < 100; ++i)
>     {
>       a0 ^= c[0];
>       a1 ^= c[1];
>       a2 ^= c[2];
>       a3 ^= c[3];
>       c += 4;
>       for (int j = 0; j < 100; ++j)
> 	{
> 	  a0 += d[1];
> 	  a1 += d[0];
> 	  a2 += d[3];
> 	  a3 += d[2];
> 	  d += 4;
> 	}
>       b[0] = a0;
>       b[1] = a1;
>       b[2] = a2;
>       b[3] = a3;
>       b += 4;
>     }
>   a[0] = a0;
>   a[1] = a1;
>   a[2] = a2;
>   a[3] = a3;
> }
> 
> the a vector in the inner loop maintains the order { 1, 0, 3, 2 },
> even though it's part of an SCC that includes the outer loop.
> In other words, this is a motivating case for not assigning
> permutes at SCC granularity.  The code we get is:
> 
>         ldr     q0, [x0]
>         mov     x4, x1
>         mov     x5, x0
>         add     x1, x3, 1600
>         add     x3, x4, 1600
>         .p2align 3,,7
> .L11:
>         ldr     q1, [x2], 16
>         sub     x0, x1, #1600
>         eor     v0.16b, v1.16b, v0.16b
>         rev64   v0.4s, v0.4s              // <---
>         .p2align 3,,7
> .L10:
>         ldr     q1, [x0], 16
>         add     v0.4s, v0.4s, v1.4s
>         cmp     x0, x1
>         bne     .L10
>         rev64   v0.4s, v0.4s              // <---
>         add     x1, x0, 1600
>         str     q0, [x4], 16
>         cmp     x3, x4
>         bne     .L11
>         str     q0, [x5]
>         ret
> 
> The patch is still incomplete, hence the FIXMEs, but it's at the stage
> where it passes bootstrap & regtest on aarch64-linux-gnu.  (There's one
> regression in slp-11b.c, but I think it's arguably restoring the
> expected load-lanes behaviour.)  Does this look like a plausible
> direction, before I go filling in the FIXME parts?

Yes, it looks nice!  I wonder if you spent much thoughts on
layout optimizing for SLP patterns and/or how to integrate both?

Thanks,
Richard.

> Thanks,
> Richard
> 
> ---
>  gcc/graphds.cc        |   13 +-
>  gcc/graphds.h         |    3 +-
>  gcc/hash-map-traits.h |   74 +-
>  gcc/hash-traits.h     |   96 ++-
>  gcc/tree-vect-slp.cc  | 1870 ++++++++++++++++++++++++++++++++---------
>  5 files changed, 1599 insertions(+), 457 deletions(-)
> 
> *** /tmp/iBbApg_graphds.cc	2022-08-02 08:16:23.975354155 +0100
> --- gcc/graphds.cc	2022-08-01 20:44:22.287691305 +0100
> ***************
> *** 281,287 ****
>      numbers assigned by the previous pass.  If SUBGRAPH is not NULL, it
>      specifies the subgraph of G whose strongly connected components we want
>      to determine.  If SKIP_EDGE_P is not NULL, it points to a callback function.
> !    Edge E will be skipped if callback function returns true.
>   
>      After running this function, v->component is the number of the strongly
>      connected component for each vertex of G.  Returns the number of the
> --- 281,294 ----
>      numbers assigned by the previous pass.  If SUBGRAPH is not NULL, it
>      specifies the subgraph of G whose strongly connected components we want
>      to determine.  If SKIP_EDGE_P is not NULL, it points to a callback function.
> !    Edge E will be skipped if callback function returns true.  If SCC_GROUPING
> !    is not null, the nodes will be added to it in the following order:
> ! 
> !    - If SCC A is a direct or indirect predecessor of SCC B in the SCC dag,
> !      A's nodes come before B's nodes.
> ! 
> !    - All of an SCC's nodes are listed consecutively, although the order
> !      of the nodes within an SCC is not really meaningful.
>   
>      After running this function, v->component is the number of the strongly
>      connected component for each vertex of G.  Returns the number of the
> ***************
> *** 289,295 ****
>   
>   int
>   graphds_scc (struct graph *g, bitmap subgraph,
> ! 	     skip_edge_callback skip_edge_p)
>   {
>     int *queue = XNEWVEC (int, g->n_vertices);
>     vec<int> postorder = vNULL;
> --- 296,302 ----
>   
>   int
>   graphds_scc (struct graph *g, bitmap subgraph,
> ! 	     skip_edge_callback skip_edge_p, vec<int> *scc_grouping)
>   {
>     int *queue = XNEWVEC (int, g->n_vertices);
>     vec<int> postorder = vNULL;
> ***************
> *** 317,323 ****
>   
>     for (i = 0; i < nq; i++)
>       queue[i] = postorder[nq - i - 1];
> !   comp = graphds_dfs (g, queue, nq, NULL, true, subgraph, skip_edge_p);
>   
>     free (queue);
>     postorder.release ();
> --- 324,330 ----
>   
>     for (i = 0; i < nq; i++)
>       queue[i] = postorder[nq - i - 1];
> !   comp = graphds_dfs (g, queue, nq, scc_grouping, true, subgraph, skip_edge_p);
>   
>     free (queue);
>     postorder.release ();
> *** /tmp/eRUAEk_graphds.h	2022-08-02 08:16:23.987354123 +0100
> --- gcc/graphds.h	2022-08-01 20:44:18.835700593 +0100
> ***************
> *** 58,64 ****
>   typedef bool (*skip_edge_callback) (struct graph_edge *);
>   int graphds_dfs (struct graph *, int *, int,
>   		 vec<int> *, bool, bitmap, skip_edge_callback = NULL);
> ! int graphds_scc (struct graph *, bitmap, skip_edge_callback = NULL);
>   void graphds_domtree (struct graph *, int, int *, int *, int *);
>   typedef void (*graphds_edge_callback) (struct graph *,
>   				       struct graph_edge *, void *);
> --- 58,65 ----
>   typedef bool (*skip_edge_callback) (struct graph_edge *);
>   int graphds_dfs (struct graph *, int *, int,
>   		 vec<int> *, bool, bitmap, skip_edge_callback = NULL);
> ! int graphds_scc (struct graph *, bitmap, skip_edge_callback = NULL,
> ! 		 vec<int> * = NULL);
>   void graphds_domtree (struct graph *, int, int *, int *, int *);
>   typedef void (*graphds_edge_callback) (struct graph *,
>   				       struct graph_edge *, void *);
> *** /tmp/4zWIek_hash-map-traits.h	2022-08-02 08:16:23.995354102 +0100
> --- gcc/hash-map-traits.h	2022-08-01 20:44:22.287691305 +0100
> ***************
> *** 105,118 ****
>     static const bool maybe_mx = false;
>   };
>   
> ! /* Implement traits for a hash_map with values of type Value for cases
> !    in which the key cannot represent empty and deleted slots.  Instead
> !    record empty and deleted entries in Value.  Derived classes must
> !    implement the hash and equal_keys functions.  */
>   
> ! template <typename Value>
>   struct unbounded_hashmap_traits
>   {
>     template <typename T> static inline void remove (T &);
>     static const bool empty_zero_p = default_hash_traits <Value>::empty_zero_p;
>     template <typename T> static inline bool is_empty (const T &);
> --- 105,123 ----
>     static const bool maybe_mx = false;
>   };
>   
> ! /* Implement traits for a hash_map with keys of type Key and values of
> !    type Value for cases in which the key cannot represent empty and
> !    deleted slots.  Instead record empty and deleted entries in Value.  */
>   
> ! template <typename Key, typename Value>
>   struct unbounded_hashmap_traits
>   {
> +   typedef typename Key::value_type key_type;
> + 
> +   static hashval_t hash (const typename Key::value_type &);
> +   static bool equal_keys (const typename Key::value_type &,
> + 			  const typename Key::compare_type &);
> + 
>     template <typename T> static inline void remove (T &);
>     static const bool empty_zero_p = default_hash_traits <Value>::empty_zero_p;
>     template <typename T> static inline bool is_empty (const T &);
> ***************
> *** 121,162 ****
>     template <typename T> static inline void mark_deleted (T &);
>   };
>   
> ! template <typename Value>
>   template <typename T>
>   inline void
> ! unbounded_hashmap_traits <Value>::remove (T &entry)
>   {
>     default_hash_traits <Value>::remove (entry.m_value);
>   }
>   
> ! template <typename Value>
>   template <typename T>
>   inline bool
> ! unbounded_hashmap_traits <Value>::is_empty (const T &entry)
>   {
>     return default_hash_traits <Value>::is_empty (entry.m_value);
>   }
>   
> ! template <typename Value>
>   template <typename T>
>   inline bool
> ! unbounded_hashmap_traits <Value>::is_deleted (const T &entry)
>   {
>     return default_hash_traits <Value>::is_deleted (entry.m_value);
>   }
>   
> ! template <typename Value>
>   template <typename T>
>   inline void
> ! unbounded_hashmap_traits <Value>::mark_empty (T &entry)
>   {
>     default_hash_traits <Value>::mark_empty (entry.m_value);
>   }
>   
> ! template <typename Value>
>   template <typename T>
>   inline void
> ! unbounded_hashmap_traits <Value>::mark_deleted (T &entry)
>   {
>     default_hash_traits <Value>::mark_deleted (entry.m_value);
>   }
> --- 126,184 ----
>     template <typename T> static inline void mark_deleted (T &);
>   };
>   
> ! template <typename Key, typename Value>
> ! inline hashval_t
> ! unbounded_hashmap_traits <Key, Value>
> ! ::hash (const typename Key::value_type &key)
> ! {
> !   return Key::hash (key);
> ! }
> ! 
> ! template <typename Key, typename Value>
> ! inline bool
> ! unbounded_hashmap_traits <Key, Value>
> ! ::equal_keys (const typename Key::value_type &x,
> ! 	      const typename Key::compare_type &y)
> ! {
> !   return Key::equal (x, y);
> ! }
> ! 
> ! template <typename Key, typename Value>
>   template <typename T>
>   inline void
> ! unbounded_hashmap_traits <Key, Value>::remove (T &entry)
>   {
>     default_hash_traits <Value>::remove (entry.m_value);
>   }
>   
> ! template <typename Key, typename Value>
>   template <typename T>
>   inline bool
> ! unbounded_hashmap_traits <Key, Value>::is_empty (const T &entry)
>   {
>     return default_hash_traits <Value>::is_empty (entry.m_value);
>   }
>   
> ! template <typename Key, typename Value>
>   template <typename T>
>   inline bool
> ! unbounded_hashmap_traits <Key, Value>::is_deleted (const T &entry)
>   {
>     return default_hash_traits <Value>::is_deleted (entry.m_value);
>   }
>   
> ! template <typename Key, typename Value>
>   template <typename T>
>   inline void
> ! unbounded_hashmap_traits <Key, Value>::mark_empty (T &entry)
>   {
>     default_hash_traits <Value>::mark_empty (entry.m_value);
>   }
>   
> ! template <typename Key, typename Value>
>   template <typename T>
>   inline void
> ! unbounded_hashmap_traits <Key, Value>::mark_deleted (T &entry)
>   {
>     default_hash_traits <Value>::mark_deleted (entry.m_value);
>   }
> ***************
> *** 166,190 ****
>      slots.  */
>   
>   template <typename Key, typename Value>
> ! struct unbounded_int_hashmap_traits : unbounded_hashmap_traits <Value>
> ! {
> !   typedef Key key_type;
> !   static inline hashval_t hash (Key);
> !   static inline bool equal_keys (Key, Key);
> ! };
> ! 
> ! template <typename Key, typename Value>
> ! inline hashval_t
> ! unbounded_int_hashmap_traits <Key, Value>::hash (Key k)
> ! {
> !   return k;
> ! }
> ! 
> ! template <typename Key, typename Value>
> ! inline bool
> ! unbounded_int_hashmap_traits <Key, Value>::equal_keys (Key k1, Key k2)
> ! {
> !   return k1 == k2;
> ! }
>   
>   #endif // HASH_MAP_TRAITS_H
> --- 188,194 ----
>      slots.  */
>   
>   template <typename Key, typename Value>
> ! using unbounded_int_hashmap_traits
> !   = unbounded_hashmap_traits <int_hash_base <Key>, Value>;
>   
>   #endif // HASH_MAP_TRAITS_H
> *** /tmp/FL7vGb_hash-traits.h	2022-08-02 08:16:24.011354061 +0100
> --- gcc/hash-traits.h	2022-08-01 20:44:22.287691305 +0100
> ***************
> *** 85,125 ****
>   {
>   }
>   
>   
> ! /* Hasher for integer type Type in which Empty is a spare value that can be
> !    used to mark empty slots.  If Deleted != Empty then Deleted is another
> !    spare value that can be used for deleted slots; if Deleted == Empty then
> !    hash table entries cannot be deleted.  */
> ! 
> ! template <typename Type, Type Empty, Type Deleted = Empty>
> ! struct int_hash : typed_noop_remove <Type>
>   {
>     typedef Type value_type;
>     typedef Type compare_type;
>   
>     static inline hashval_t hash (value_type);
>     static inline bool equal (value_type existing, value_type candidate);
> -   static inline void mark_deleted (Type &);
> -   static const bool empty_zero_p = Empty == 0;
> -   static inline void mark_empty (Type &);
> -   static inline bool is_deleted (Type);
> -   static inline bool is_empty (Type);
>   };
>   
> ! template <typename Type, Type Empty, Type Deleted>
>   inline hashval_t
> ! int_hash <Type, Empty, Deleted>::hash (value_type x)
>   {
>     return x;
>   }
>   
> ! template <typename Type, Type Empty, Type Deleted>
>   inline bool
> ! int_hash <Type, Empty, Deleted>::equal (value_type x, value_type y)
>   {
>     return x == y;
>   }
>   
>   template <typename Type, Type Empty, Type Deleted>
>   inline void
>   int_hash <Type, Empty, Deleted>::mark_deleted (Type &x)
> --- 85,135 ----
>   {
>   }
>   
> + /* Base traits for integer type Type, providing just the hash and
> +    comparison functionality.  */
>   
> ! template <typename Type>
> ! struct int_hash_base : typed_noop_remove <Type>
>   {
>     typedef Type value_type;
>     typedef Type compare_type;
>   
>     static inline hashval_t hash (value_type);
>     static inline bool equal (value_type existing, value_type candidate);
>   };
>   
> ! template <typename Type>
>   inline hashval_t
> ! int_hash_base <Type>::hash (value_type x)
>   {
>     return x;
>   }
>   
> ! template <typename Type>
>   inline bool
> ! int_hash_base <Type>::equal (value_type x, value_type y)
>   {
>     return x == y;
>   }
>   
> + /* Hasher for integer type Type in which Empty is a spare value that can be
> +    used to mark empty slots.  If Deleted != Empty then Deleted is another
> +    spare value that can be used for deleted slots; if Deleted == Empty then
> +    hash table entries cannot be deleted.  */
> + 
> + template <typename Type, Type Empty, Type Deleted = Empty>
> + struct int_hash : int_hash_base <Type>
> + {
> +   typedef Type value_type;
> +   typedef Type compare_type;
> + 
> +   static inline void mark_deleted (Type &);
> +   static const bool empty_zero_p = Empty == 0;
> +   static inline void mark_empty (Type &);
> +   static inline bool is_deleted (Type);
> +   static inline bool is_empty (Type);
> + };
> + 
>   template <typename Type, Type Empty, Type Deleted>
>   inline void
>   int_hash <Type, Empty, Deleted>::mark_deleted (Type &x)
> ***************
> *** 398,403 ****
> --- 408,467 ----
>     return T1::is_empty (x.first);
>   }
>   
> + /* Base traits for vectors, providing just the hash and comparison
> +    functionality.  Type gives the traits for the element type.  */
> + 
> + template <typename Type>
> + struct vec_hash_base
> + {
> +   typedef vec<typename Type::value_type> value_type;
> +   typedef vec<typename Type::compare_type> compare_type;
> + 
> +   static inline hashval_t hash (value_type);
> +   static inline bool equal (value_type, compare_type);
> + };
> + 
> + template <typename Type>
> + inline hashval_t
> + vec_hash_base <Type>::hash (value_type x)
> + {
> +   inchash::hash hstate;
> +   hstate.add_int (x.length ());
> +   for (auto &value : x)
> +     hstate.merge_hash (Type::hash (value));
> +   return hstate.end ();
> + }
> + 
> + template <typename Type>
> + inline bool
> + vec_hash_base <Type>::equal (value_type x, compare_type y)
> + {
> +   if (x.length () != y.length ())
> +     return false;
> +   for (unsigned int i = 0; i < x.length (); ++i)
> +     if (!Type::equal (x[i], y[i]))
> +       return false;
> +   return true;
> + }
> + 
> + /* Traits for vectors whose contents should be freed normally.  */
> + 
> + template <typename Type>
> + struct vec_free_hash_base : vec_hash_base <Type>
> + {
> +   static void remove (typename vec_hash_base <Type>::value_type &);
> + };
> + 
> + template <typename Type>
> + void
> + vec_free_hash_base <Type>
> + ::remove (typename vec_hash_base <Type>::value_type &x)
> + {
> +   for (auto &value : x)
> +     Type::remove (x);
> +   x.release ();
> + }
> + 
>   template <typename T> struct default_hash_traits : T {};
>   
>   template <typename T>
> *** /tmp/5ek0ya_tree-vect-slp.cc	2022-08-02 08:16:24.027354017 +0100
> --- gcc/tree-vect-slp.cc	2022-08-01 20:44:22.631690380 +0100
> ***************
> *** 20,25 ****
> --- 20,26 ----
>   <http://www.gnu.org/licenses/>.  */
>   
>   #include "config.h"
> + #define INCLUDE_ALGORITHM
>   #include "system.h"
>   #include "coretypes.h"
>   #include "backend.h"
> ***************
> *** 49,54 ****
> --- 50,57 ----
>   #include "tree-eh.h"
>   #include "tree-cfg.h"
>   #include "alloc-pool.h"
> + #include "sreal.h"
> + #include "predict.h"
>   
>   static bool vectorizable_slp_permutation (vec_info *, gimple_stmt_iterator *,
>   					  slp_tree, stmt_vector_for_cost *);
> ***************
> *** 304,309 ****
> --- 307,322 ----
>     oprnds_info.release ();
>   }
>   
> + /* Return the execution frequency of NODE (so that a higher value indicates
> +    a "more important" node when optimizing for speed).  */
> + 
> + static sreal
> + vect_slp_node_weight (slp_tree node)
> + {
> +   stmt_vec_info stmt_info = vect_orig_stmt (SLP_TREE_REPRESENTATIVE (node));
> +   basic_block bb = gimple_bb (stmt_info->stmt);
> +   return bb->count.to_sreal_scale (ENTRY_BLOCK_PTR_FOR_FN (cfun)->count);
> + }
>   
>   /* Return true if STMTS contains a pattern statement.  */
>   
> ***************
> *** 3531,3559 ****
>     return opt_result::success ();
>   }
>   
> ! struct slpg_vertex
>   {
> !   slpg_vertex (slp_tree node_)
> !     : node (node_), perm_in (-1), perm_out (-1) {}
>   
> !   int get_perm_materialized () const
> !     { return perm_in != perm_out ? perm_in : 0; }
>   
>     slp_tree node;
> !   /* The common permutation on the incoming lanes (towards SLP children).  */
> !   int perm_in;
> !   /* The permutation on the outgoing lanes (towards SLP parents).  When
> !      the node is a materialization point for a permute this differs
> !      from perm_in (and is then usually zero).  Materialization happens
> !      on the input side.  */
> !   int perm_out;
>   };
>   
>   /* Fill the vertices and leafs vector with all nodes in the SLP graph.  */
>   
> ! static void
> ! vect_slp_build_vertices (hash_set<slp_tree> &visited, slp_tree node,
> ! 			 vec<slpg_vertex> &vertices, vec<int> &leafs)
>   {
>     unsigned i;
>     slp_tree child;
> --- 3544,3994 ----
>     return opt_result::success ();
>   }
>   
> ! /* Estimates the cost of inserting layout changes into the SLP graph.  */
> ! 
> ! struct slpg_layout_cost
> ! {
> !   slpg_layout_cost () = default;
> !   slpg_layout_cost (sreal, bool);
> ! 
> !   bool operator== (const slpg_layout_cost &) const;
> !   bool operator!= (const slpg_layout_cost &) const;
> ! 
> !   bool is_better_than (const slpg_layout_cost &, bool) const;
> ! 
> !   void add_parallel_cost (const slpg_layout_cost &);
> !   void add_serial_cost (const slpg_layout_cost &);
> !   void split (unsigned int);
> ! 
> !   /* The longest sequence of layout changes needed during any traversal
> !      of the partition dag, weighted by execution frequency.
> ! 
> !      This is the most important metric when optimizing for speed, since
> !      it helps to ensure that we keep the number of operations on
> !      critical paths to a minimum.  */
> !   sreal depth = 0;
> ! 
> !   /* An estimate of the total number of operations needed.  IT is weighted by
> !      execution frequency when optimizing for speed but not when optimizing for
> !      size.  In order to avoid double-counting, a node with a fanout of N will
> !      distribute 1/N of its total cost to each successor.
> ! 
> !      This is the most important metric when optimizing for size, since
> !      it helps to keep the total number of operations to a minimum,  */
> !   sreal total = 0;
> ! };
> ! 
> ! /* Construct costs for a node with weight WEIGHT.  A higher weight
> !    indicates more frequent execution.  IS_FOR_SIZE is true if we are
> !    optimizing for size rather than speed.  */
> ! 
> ! slpg_layout_cost::slpg_layout_cost (sreal weight, bool is_for_size)
> !   : depth (weight), total (is_for_size && weight > 0 ? 1 : weight)
> ! {
> ! }
> ! 
> ! bool
> ! slpg_layout_cost::operator== (const slpg_layout_cost &other) const
> ! {
> !   return depth == other.depth && total == other.total;
> ! }
> ! 
> ! bool
> ! slpg_layout_cost::operator!= (const slpg_layout_cost &other) const
> ! {
> !   return !operator== (other);
> ! }
> ! 
> ! /* Return true if these costs are better than OTHER.  IS_FOR_SIZE is
> !    true if we are optimizing for size rather than speed.  */
> ! 
> ! bool
> ! slpg_layout_cost::is_better_than (const slpg_layout_cost &other,
> ! 				  bool is_for_size) const
> ! {
> !   if (is_for_size)
> !     {
> !       if (total != other.total)
> ! 	return total < other.total;
> !       return depth < other.depth;
> !     }
> !   else
> !     {
> !       if (depth != other.depth)
> ! 	return depth < other.depth;
> !       return total < other.total;
> !     }
> ! }
> ! 
> ! /* Increase the costs to account for something with cost INPUT_COST
> !    happening in parallel with the current costs.  */
> ! 
> ! void
> ! slpg_layout_cost::add_parallel_cost (const slpg_layout_cost &input_cost)
>   {
> !   depth = std::max (depth, input_cost.depth);
> !   total += input_cost.total;
> ! }
> ! 
> ! /* Increase the costs to account for something with cost INPUT_COST
> !    happening in series with the current costs.  */
> ! 
> ! void
> ! slpg_layout_cost::add_serial_cost (const slpg_layout_cost &other)
> ! {
> !   depth += other.depth;
> !   total += other.total;
> ! }
> ! 
> ! /* Split the total cost among TIMES successors or predecessors.  */
>   
> ! void
> ! slpg_layout_cost::split (unsigned int times)
> ! {
> !   if (times > 1)
> !     total /= times;
> ! }
> ! 
> ! /* Information about one node in the SLP graph, for use during
> !    vect_optimize_slp_pass.  */
> ! 
> ! struct slpg_vertex
> ! {
> !   slpg_vertex (slp_tree node_) : node (node_) {}
>   
> +   /* The node itself.  */
>     slp_tree node;
> ! 
> !   /* Which partition the node belongs to, or -1 if none.  */
> !   int partition = -1;
> ! 
> !   /* The number of nodes that directly use the result of this one
> !      (i.e. the number of nodes that count this one as a child).  */
> !   unsigned int out_degree = 0;
> ! 
> !   /* The execution frequency of the node.  */
> !   sreal weight = 0;
> ! 
> !   /* The total execution frequency of all nodes that directly use the
> !      result of this one.  */
> !   sreal out_weight = 0;
> ! };
> ! 
> ! /* Information about one partition of the SLP graph, for use during
> !    vect_optimize_slp_pass.  */
> ! 
> ! struct slpg_partition_info
> ! {
> !   /* The nodes in the partition occupy indices [NODE_BEGIN, NODE_END)
> !      of m_partitioned_nodes.  */
> !   unsigned int node_begin = 0;
> !   unsigned int node_end = 0;
> ! 
> !   /* Which layout we've chosen to use for this partition, or -1 if
> !      we haven't picked one yet.  */
> !   int layout = -1;
> ! 
> !   /* The number of predecessors and successors in the partition dag.
> !      The predecessors always have lower partition numbers and the
> !      successors always have higher partition numbers.
> ! 
> !      Note that the directions of these edges are not necessarily the
> !      same as in the data flow graph.  For example, if an SCC contains
> !      an inner and an outer loop, the inner loop's partition will have
> !      at least two incoming edges from the outer loop's partition:
> !      one for a live-in value and one for a live-out value.  In data
> !      flow terms, the live-in edge would also be from the outer loop
> !      to the inner loop, but the live-out edge would be from the inner
> !      loop to the outer loop.  */
> !   unsigned int in_degree = 0;
> !   unsigned int out_degree = 0;
> ! };
> ! 
> ! /* Information about the costs of using a particular layout for a
> !    particular partition.  It can also say that the combination is
> !    invalid.  */
> ! 
> ! struct slpg_partition_layout_costs
> ! {
> !   bool is_possible () const { return cost.total != sreal::min (); }
> !   void mark_impossible () { cost.total = sreal::min (); }
> ! 
> !   /* The costs inherited from predecessor partitions plus the inherent
> !      cost of the layout within the node itself.  */
> !   slpg_layout_cost in_cost;
> ! 
> !   /* The costs inherited from successor partitions.  */
> !   slpg_layout_cost out_cost;
> ! 
> !   /* The sum of the above, when used.  */
> !   slpg_layout_cost cost;
> ! };
> ! 
> ! /* This class tries to optimize the layout of vectors in order to avoid
> !    unnecessary shuffling.  At the moment, the set of possible layouts are
> !    restricted to bijective permutes.
> ! 
> !    The goal of the pass depends on whether we're optimizing for size or
> !    for speed.  When optimizing for size, the goal is to reduce the overall
> !    number of layout changes (including layout changes implied by things
> !    like load permutes).  When optimizing for speed, the goal is to
> !    reduce the maximum latency attributable to layout changes on any
> !    non-cyclic path through the data flow graph.
> ! 
> !    For example, when optimizing a loop nest for speed, we will prefer
> !    to make layout changes outside of a loop rather than inside of a loop,
> !    and will prefer to make layout changes in parallel rather than serially,
> !    even if that increases the overall number of layout changes.
> ! 
> !    The high-level procedure is:
> ! 
> !    (1) Build a graph in which edges go from uses (parents) to definitions
> !        (children).
> ! 
> !    (2) Divide the graph into a dag of strongly-connected components (SCCs).
> ! 
> !    (3) Partition the nodes in each SCC based on their containing cfg loop,
> !        creating a dag of partitions.  The goal is now to assign a layout
> !        to each partition.
> ! 
> !    (4) Construct a set of vector layouts that are worth considering and
> !        calculate the local cost of using each layout in each partition.
> !        Some partitions might not allow a given layout at all.
> ! 
> !    (5) Perform a forward walk over the partition dag (from loads to stores)
> !        accumulating the "forward" cost of using each layout.  When visiting
> !        each partition, assign a tentative choice of layout to the partition
> !        and use that choice when calculating the cost of using a different
> !        layout in successor partitions.
> ! 
> !    (6) Perform a backward walk over the partition dag (from stores to loads),
> !        accumulating the "backward" cost of using each layout.  When visiting
> !        each partition, make a final choice of layout for that partition based
> !        on the accumulated forward costs (from (5)) and backward costs
> !        (from (6)).
> ! 
> !    (7) Apply the chosen layouts to the SLP graph.
> ! 
> !    For example, consider the SLP statements:
> ! 
> !    S1:      a_1 = load
> !        loop:
> !    S2:      a_2 = PHI<a_1, a_3>
> !    S3:      b_1 = load
> !    S4:      a_3 = a_2 + b_1
> !        exit:
> !    S5:      a_4 = PHI<a_3>
> !    S6:      store a_4
> ! 
> !    S2 and S4 form an SCC and are part of the same loop.  Every other
> !    statement is in a singleton SCC.  In this example there is a one-to-one
> !    mapping between SCCs and partitions and the partition dag looks like this;
> ! 
> ! 	S1     S3
> ! 	 \     /
> ! 	  S2+S4
> ! 	    |
> ! 	   S5
> ! 	    |
> ! 	   S6
> ! 
> !    S2, S3 and S4 will have a higher execution frequency than the other
> !    statements, so when optimizing for speed, the goal is to avoid any
> !    layout changes:
> ! 
> !    - within S3
> !    - within S2+S4
> !    - on the S3->S2+S4 edge
> ! 
> !    For example, if S3 was originally a reversing load, the goal of the
> !    pass is to make it an unreversed load and change the layout on the
> !    S1->S2+S4 and S2+S4->S5 edges to compensate.  (Changing the layout
> !    on S1->S2+S4 and S5->S6 would also be acceptable.)
> ! 
> !    The difference between SCCs and partitions becomes important if we
> !    add an outer loop:
> ! 
> !    S1:      a_1 = ...
> !        loop1:
> !    S2:      a_2 = PHI<a_1, a_6>
> !    S3:      b_1 = load
> !    S4:      a_3 = a_2 + b_1
> !        loop2:
> !    S5:      a_4 = PHI<a_3, a_5>
> !    S6:      c_1 = load
> !    S7:      a_5 = a_4 + c_1
> !        exit2:
> !    S8:      a_6 = PHI<a_5>
> !    S9:      store a_6
> !        exit1:
> ! 
> !    Here, S2, S4, S5, S7 and S8 form a single SCC.  However, we usually
> !    do not want restrictions in the outer loop to "infect" the decision
> !    for the inner loop.  For example, if an outer-loop node in the SCC
> !    contains a statement with a fixed layout, that should not prevent the
> !    inner loop from using a different layout.  Conversely, the inner loop
> !    should not dictate a layout to the outer loop: if the outer loop does
> !    a lot of computation, then it may not be efficient to do all of that
> !    computation in the inner loop's preferred layout.
> ! 
> !    We therefore partition the SCC into S2+S4+S8 (outer) and S5+S7 (inner).
> !    We also try to arrange partitions so that:
> ! 
> !    - the partition for an outer loop comes before the partition for
> !      an inner loop
> ! 
> !    - if a sibling loop A dominates a sibling loop B, A's partition
> !      comes before B's
> ! 
> !    So in this case, the partition dag is:
> ! 
> ! 	S1        S3
> ! 	 \        /
> ! 	  S2+S4+S8   S6
> ! 	   |   \\    /
> ! 	   |    S5+S7
> ! 	   |
> ! 	  S9
> ! 
> !    There are two edges from S2+S4+S8 to S5+S7: one for the edge S4->S5 and
> !    one for a reversal of the edge S7->S8.
> ! 
> !    The backward walk picks a layout for S5+S7 before S2+S4+S8.  The choice
> !    for S2+S4+S8 therefore has to balance the cost of using the outer loop's
> !    preferred layout against the cost of changing the layout on entry to the
> !    inner loop (S4->S5) and on exit from the inner loop (S7->S8 reversed).
> ! 
> !    ??? Although this works well when optimizing for speed, it has the
> !    downside when optimizing for size that the choice of layout for S5+S7
> !    is completely independent of S9.  Perhaps we should not partition SCCs
> !    when optimizing for size?
> ! 
> !    To give a concrete example of the difference between optimizing
> !    for size and speed, consider:
> ! 
> !    a[0] = (b[1] << c[3]) - d[1];
> !    a[1] = (b[0] << c[2]) - d[0];
> !    a[2] = (b[3] << c[1]) - d[3];
> !    a[3] = (b[2] << c[0]) - d[2];
> ! 
> !    There are three different layouts here: one for a, one for b and d,
> !    and one for c.  When optimizing for speed it is better to permute
> !    each of b, c and d into the order required by a, since those permutes
> !    happen in parallel.  But when optimizing for size, it is better to:
> ! 
> !    - permute c into the same order as b
> !    - do the arithmetic
> !    - permute the result into the order required by a
> ! 
> !    This gives 2 permutes rather than 3.  */
> ! 
> ! class vect_optimize_slp_pass
> ! {
> ! public:
> !   vect_optimize_slp_pass (vec_info *vinfo) : m_vinfo (vinfo) {}
> !   void run ();
> ! 
> ! private:
> !   /* Graph building.  */
> !   struct loop *containing_loop (slp_tree);
> !   bool is_cfg_latch_edge (graph_edge *);
> !   void build_vertices (hash_set<slp_tree> &, slp_tree);
> !   void build_vertices ();
> !   void build_graph ();
> ! 
> !   /* Partitioning.  */
> !   void create_partitions ();
> !   template<typename T> void for_each_partition_edge (unsigned int, T);
> ! 
> !   /* Layout selection.  */
> !   bool can_change_layouts (unsigned int, unsigned int);
> !   bool layout_is_reversible (unsigned int);
> !   unsigned int layout_cost (unsigned int, unsigned int);
> !   slpg_partition_layout_costs &partition_layout_costs (unsigned int,
> ! 						       unsigned int);
> !   bool partition_can_absorb_permutes (unsigned int);
> !   int node_absorption_cost (slp_tree, unsigned int);
> !   void start_choosing_layouts ();
> ! 
> !   /* Cost propagation.  */
> !   slpg_layout_cost edge_layout_cost (graph_edge *, unsigned int,
> ! 				     unsigned int, unsigned int);
> !   slpg_layout_cost forward_cost (graph_edge *, unsigned int, unsigned int);
> !   slpg_layout_cost backward_cost (graph_edge *, unsigned int, unsigned int);
> !   void forward_pass ();
> !   void backward_pass ();
> ! 
> !   /* Rematerialization.  */
> !   slp_tree get_result_with_layout (slp_tree, unsigned int);
> !   void materialize ();
> ! 
> !   /* Clean-up.  */
> !   void remove_redundant_permutes ();
> ! 
> !   void dump ();
> ! 
> !   vec_info *m_vinfo;
> ! 
> !   /* True if we should optimize the graph for size, false if we should
> !      optimize it for speed.  (It wouldn't be easy to make this decision
> !      more locally.)  */
> !   bool m_optimize_size;
> ! 
> !   /* A graph of all SLP nodes, with edges leading from uses to definitions.
> !      In other words, a node's predecessors are its slp_tree parents and
> !      a node's successors are its slp_tree children.  */
> !   graph *m_slpg = nullptr;
> ! 
> !   /* The vertices of M_SLPG, indexed by slp_tree::vertex.  */
> !   auto_vec<slpg_vertex> m_vertices;
> ! 
> !   /* The list of all leaves of M_SLPG. such as external definitions, constants,
> !      and loads.  */
> !   auto_vec<int> m_leafs;
> ! 
> !   /* This array has one entry for every vector layout that we're considering.
> !      Element 0 is null and indicates "no change".  Other entries describe
> !      permutes that are inherent in the current graph and that we would
> !      like to reverse if possible.
> ! 
> !      For example, a permute { 1, 2, 3, 0 } means that something has
> !      effectively been permuted in that way, such as a load group
> !      { a[1], a[2], a[3], a[0] } (viewed as a permutation of a[0:3]).
> !      We'd then like to apply the reverse permute { 3, 0, 1, 2 }
> !      in order to put things "back" in order.  */
> !   auto_vec<vec<unsigned> > m_perms;
> ! 
> !   /* A partitioning of the nodes for which a layout must be chosen.
> !      Each partition represents an <SCC, cfg loop> pair; that is,
> !      nodes in different SCCs belong to different partitions, and nodes
> !      within an SCC are further partitioned according to the containing
> !      cfg loop.  Partition <SCC1, L1> comes before <SCC2, L2> if:
> ! 
> !      - SCC1 != SCC2 and SCC1 is a predecessor of SCC2 in a forward walk
> !        from leaves (such as loads) to roots (such as stores).
> ! 
> !      - SCC1 == SCC2 and L1's header strictly dominates L2's header.  */
> !   auto_vec<slpg_partition_info> m_partitions;
> ! 
> !   /* The list of all nodes for which a layout must be chosen.  Nodes for
> !      partition P come before the nodes for partition P+1.  Nodes within a
> !      partition are in reverse postorder.  */
> !   auto_vec<unsigned int> m_partitioned_nodes;
> ! 
> !   /* Index P * num-layouts + L contains the cost of using layout L
> !      for partition P.  */
> !   auto_vec<slpg_partition_layout_costs> m_partition_layout_costs;
> ! 
> !   /* Index N * num-layouts + L, if nonnull, is a node that provides the
> !      original output of node N adjusted to have layout L.  */
> !   auto_vec<slp_tree> m_node_layouts;
>   };
>   
>   /* Fill the vertices and leafs vector with all nodes in the SLP graph.  */
>   
> ! void
> ! vect_optimize_slp_pass::build_vertices (hash_set<slp_tree> &visited,
> ! 					slp_tree node)
>   {
>     unsigned i;
>     slp_tree child;
> ***************
> *** 3561,3568 ****
>     if (visited.add (node))
>       return;
>   
> !   node->vertex = vertices.length ();
> !   vertices.safe_push (slpg_vertex (node));
>   
>     bool leaf = true;
>     bool force_leaf = false;
> --- 3996,4003 ----
>     if (visited.add (node))
>       return;
>   
> !   node->vertex = m_vertices.length ();
> !   m_vertices.safe_push (slpg_vertex (node));
>   
>     bool leaf = true;
>     bool force_leaf = false;
> ***************
> *** 3570,3576 ****
>       if (child)
>         {
>   	leaf = false;
> ! 	vect_slp_build_vertices (visited, child, vertices, leafs);
>         }
>       else
>         force_leaf = true;
> --- 4005,4011 ----
>       if (child)
>         {
>   	leaf = false;
> ! 	build_vertices (visited, child);
>         }
>       else
>         force_leaf = true;
> ***************
> *** 3580,3600 ****
>        and inductions.  Force those SLP PHIs to act as leafs to make them
>        backwards reachable.  */
>     if (leaf || force_leaf)
> !     leafs.safe_push (node->vertex);
>   }
>   
>   /* Fill the vertices and leafs vector with all nodes in the SLP graph.  */
>   
> ! static void
> ! vect_slp_build_vertices (vec_info *info, vec<slpg_vertex> &vertices,
> ! 			 vec<int> &leafs)
>   {
>     hash_set<slp_tree> visited;
>     unsigned i;
>     slp_instance instance;
> !   FOR_EACH_VEC_ELT (info->slp_instances, i, instance)
> !     vect_slp_build_vertices (visited, SLP_INSTANCE_TREE (instance), vertices,
> ! 			     leafs);
>   }
>   
>   /* Apply (reverse) bijectite PERM to VEC.  */
> --- 4015,4033 ----
>        and inductions.  Force those SLP PHIs to act as leafs to make them
>        backwards reachable.  */
>     if (leaf || force_leaf)
> !     m_leafs.safe_push (node->vertex);
>   }
>   
>   /* Fill the vertices and leafs vector with all nodes in the SLP graph.  */
>   
> ! void
> ! vect_optimize_slp_pass::build_vertices ()
>   {
>     hash_set<slp_tree> visited;
>     unsigned i;
>     slp_instance instance;
> !   FOR_EACH_VEC_ELT (m_vinfo->slp_instances, i, instance)
> !     build_vertices (visited, SLP_INSTANCE_TREE (instance));
>   }
>   
>   /* Apply (reverse) bijectite PERM to VEC.  */
> ***************
> *** 3625,3699 ****
>       }
>   }
>   
> ! /* Return whether permutations PERM_A and PERM_B as recorded in the
> !    PERMS vector are equal.  */
>   
> ! static bool
> ! vect_slp_perms_eq (const vec<vec<unsigned> > &perms,
> ! 		   int perm_a, int perm_b)
>   {
> !   return (perm_a == perm_b
> ! 	  || (perm_a != -1 && perm_b != -1
> ! 	      && perms[perm_a].length () == perms[perm_b].length ()
> ! 	      && memcmp (&perms[perm_a][0], &perms[perm_b][0],
> ! 			 sizeof (unsigned) * perms[perm_a].length ()) == 0));
>   }
>   
> ! /* Optimize the SLP graph of VINFO.  */
>   
> ! void
> ! vect_optimize_slp (vec_info *vinfo)
>   {
> !   if (vinfo->slp_instances.is_empty ())
> !     return;
>   
> !   slp_tree node;
> !   unsigned i;
> !   auto_vec<slpg_vertex> vertices;
> !   auto_vec<int> leafs;
> !   vect_slp_build_vertices (vinfo, vertices, leafs);
>   
> !   struct graph *slpg = new_graph (vertices.length ());
> !   for (slpg_vertex &v : vertices)
>       for (slp_tree child : SLP_TREE_CHILDREN (v.node))
>         if (child)
> ! 	add_edge (slpg, v.node->vertex, child->vertex);
>   
> !   /* Compute (reverse) postorder on the inverted graph.  */
> !   auto_vec<int> ipo;
> !   graphds_dfs (slpg, &leafs[0], leafs.length (), &ipo, false, NULL, NULL);
> ! 
> !   auto_vec<vec<unsigned> > perms;
> !   perms.safe_push (vNULL); /* zero is no permute */
> ! 
> !   /* Produce initial permutations.  */
> !   for (i = 0; i < leafs.length (); ++i)
> !     {
> !       int idx = leafs[i];
> !       slp_tree node = vertices[idx].node;
> ! 
> !       /* Handle externals and constants optimistically throughout the
> ! 	 iteration.  But treat existing vectors as fixed since we
> ! 	 do not handle permuting them below.  */
> !       if ((SLP_TREE_DEF_TYPE (node) == vect_external_def
> ! 	   && !SLP_TREE_VEC_DEFS (node).exists ())
> ! 	  || SLP_TREE_DEF_TYPE (node) == vect_constant_def)
> ! 	continue;
>   
> !       /* Leafs do not change across iterations.  Note leafs also double
> ! 	 as entries to the reverse graph.  */
> !       if (!slpg->vertices[idx].succ)
>   	{
> ! 	  vertices[idx].perm_in = 0;
> ! 	  vertices[idx].perm_out = 0;
>   	}
>   
>         /* Loads are the only thing generating permutes.  */
>         if (!SLP_TREE_LOAD_PERMUTATION (node).exists ())
>   	continue;
>   
>         /* If splitting out a SLP_TREE_LANE_PERMUTATION can make the
> ! 	 node unpermuted, record this permute.  */
>         stmt_vec_info dr_stmt = SLP_TREE_REPRESENTATIVE (node);
>         if (!STMT_VINFO_GROUPED_ACCESS (dr_stmt))
>   	continue;
> --- 4058,4416 ----
>       }
>   }
>   
> ! /* Return the cfg loop that contains NODE.  */
>   
> ! struct loop *
> ! vect_optimize_slp_pass::containing_loop (slp_tree node)
>   {
> !   stmt_vec_info rep = SLP_TREE_REPRESENTATIVE (node);
> !   if (!rep)
> !     return ENTRY_BLOCK_PTR_FOR_FN (cfun)->loop_father;
> !   return gimple_bb (vect_orig_stmt (rep)->stmt)->loop_father;
>   }
>   
> ! /* Return true if UD (an edge from a use to a definition) is associated
> !    with a loop latch edge in the cfg.  */
>   
> ! bool
> ! vect_optimize_slp_pass::is_cfg_latch_edge (graph_edge *ud)
>   {
> !   slp_tree use = m_vertices[ud->src].node;
> !   slp_tree def = m_vertices[ud->dest].node;
> !   if (SLP_TREE_DEF_TYPE (use) != vect_internal_def
> !       || SLP_TREE_DEF_TYPE (def) != vect_internal_def)
> !     return false;
>   
> !   stmt_vec_info use_rep = vect_orig_stmt (SLP_TREE_REPRESENTATIVE (use));
> !   return (is_a<gphi *> (use_rep->stmt)
> ! 	  && bb_loop_header_p (gimple_bb (use_rep->stmt))
> ! 	  && containing_loop (def) == containing_loop (use));
> ! }
> ! 
> ! /* Build the graph.  Mark edges that correspond to cfg loop latch edges with
> !    a nonnull data field.  */
>   
> ! void
> ! vect_optimize_slp_pass::build_graph ()
> ! {
> !   build_vertices ();
> ! 
> !   m_slpg = new_graph (m_vertices.length ());
> !   for (slpg_vertex &v : m_vertices)
>       for (slp_tree child : SLP_TREE_CHILDREN (v.node))
>         if (child)
> ! 	{
> ! 	  graph_edge *ud = add_edge (m_slpg, v.node->vertex, child->vertex);
> ! 	  if (is_cfg_latch_edge (ud))
> ! 	    ud->data = this;
> ! 	}
> ! }
>   
> ! /* Return true if E corresponds to a loop latch edge in the cfg.  */
>   
> ! static bool
> ! skip_cfg_latch_edges (graph_edge *e)
> ! {
> !   return e->data;
> ! }
> ! 
> ! /* Create the node partitions.  */
> ! 
> ! void
> ! vect_optimize_slp_pass::create_partitions ()
> ! {
> !   /* Calculate a postorder of the graph, ignoring edges that correspond
> !      to natural latch edges in the cfg.  Reading the vector from the
> !      end to the beginning gives the reverse postorder.  */
> !   auto_vec<int> initial_rpo;
> !   graphds_dfs (m_slpg, &m_leafs[0], m_leafs.length (), &initial_rpo,
> ! 	       false, NULL, skip_cfg_latch_edges);
> !   gcc_assert (initial_rpo.length () == m_vertices.length ());
> ! 
> !   /* Calculate the strongly connected components of the graph.  */
> !   auto_vec<int> scc_grouping;
> !   unsigned int num_sccs = graphds_scc (m_slpg, NULL, NULL, &scc_grouping);
> ! 
> !   /* Create a new index order in which all nodes from the same SCC are
> !      consecutive.  Use scc_pos to record the index of the first node in
> !      each SCC.  */
> !   auto_vec<unsigned int> scc_pos (num_sccs);
> !   int last_component = -1;
> !   unsigned int node_count = 0;
> !   for (unsigned int node_i : scc_grouping)
> !     {
> !       if (last_component != m_slpg->vertices[node_i].component)
> ! 	{
> ! 	  last_component = m_slpg->vertices[node_i].component;
> ! 	  gcc_assert (last_component == int (scc_pos.length ()));
> ! 	  scc_pos.quick_push (node_count);
> ! 	}
> !       node_count += 1;
> !     }
> !   gcc_assert (node_count == initial_rpo.length ()
> ! 	      && last_component + 1 == int (num_sccs));
> ! 
> !   /* Use m_partitioned_nodes to group nodes into SCC order, with the nodes
> !      inside each SCC following the RPO we calculated above.  The fact that
> !      we ignored natural latch edges when calculating the RPO should ensure
> !      that, for natural loop nests:
> ! 
> !      - the first node that we encounter in a cfg loop is the loop header phi
> !      - the loop header phis are in dominance order
> ! 
> !      Arranging for this is an optimization (see below) rather than a
> !      correctness issue.  Unnatural loops with a tangled mess of backedges
> !      will still work correctly, but might give poorer results.
> ! 
> !      Also update scc_pos so that it gives 1 + the index of the last node
> !      in the SCC.  */
> !   m_partitioned_nodes.safe_grow (node_count);
> !   for (unsigned int old_i = initial_rpo.length (); old_i-- > 0;)
> !     {
> !       unsigned int node_i = initial_rpo[old_i];
> !       unsigned int new_i = scc_pos[m_slpg->vertices[node_i].component]++;
> !       m_partitioned_nodes[new_i] = node_i;
> !     }
> ! 
> !   /* Partition each SCC based on the containing cfg loop.  The order we
> !      constructed above should ensure that, for natural cfg loops, we'll
> !      create sub-SCC partitions for outer loops before the corresponding
> !      sub-SCC partitions for inner loops.  Similarly, when one sibling
> !      loop A dominates another sibling loop B, we should create a sub-SCC
> !      partition for A before a sub-SCC partition for B.
> ! 
> !      As above, nothing depends for correctness on whether this achieves
> !      a natural nesting, but we should get better results when it does.  */
> !   m_partitions.reserve (m_vertices.length ());
> !   unsigned int next_partition_i = 0;
> !   hash_map<struct loop *, int> loop_partitions;
> !   unsigned int rpo_begin = 0;
> !   unsigned int num_partitioned_nodes = 0;
> !   for (unsigned int rpo_end : scc_pos)
> !     {
> !       loop_partitions.empty ();
> !       for (unsigned int rpo_i = rpo_begin; rpo_i < rpo_end; ++rpo_i)
> ! 	{
> ! 	  /* Handle externals and constants optimistically throughout.
> ! 	     But treat existing vectors as fixed since we do not handle
> ! 	     permuting them.  */
> ! 	  unsigned int node_i = m_partitioned_nodes[rpo_i];
> ! 	  slp_tree node = m_vertices[node_i].node;
> ! 	  if ((SLP_TREE_DEF_TYPE (node) == vect_external_def
> ! 	       && !SLP_TREE_VEC_DEFS (node).exists ())
> ! 	      || SLP_TREE_DEF_TYPE (node) == vect_constant_def)
> ! 	    m_vertices[node_i].partition = -1;
> ! 	  else
> ! 	    {
> ! 	      struct loop *loop = containing_loop (node);
> ! 	      bool existed;
> ! 	      auto &partition_i
> ! 		= loop_partitions.get_or_insert (loop, &existed);
> ! 	      if (!existed)
> ! 		{
> ! 		  partition_i = next_partition_i++;
> ! 		  m_partitions.quick_push (slpg_partition_info ());
> ! 		}
> ! 	      m_vertices[node_i].partition = partition_i;
> ! 	      num_partitioned_nodes += 1;
> ! 	      m_partitions[partition_i].node_end += 1;
> ! 	    }
> ! 	}
> !       rpo_begin = rpo_end;
> !     }
> ! 
> !   /* Assign ranges of consecutive node indices to each partition,
> !      in partition order.  Start with node_end being the same as
> !      node_begin so that the next loop can use it as a counter.  */
> !   unsigned int node_begin = 0;
> !   for (auto &partition : m_partitions)
> !     {
> !       partition.node_begin = node_begin;
> !       node_begin += partition.node_end;
> !       partition.node_end = partition.node_begin;
> !     }
> !   gcc_assert (node_begin == num_partitioned_nodes);
> ! 
> !   /* Finally build the list of nodes in partition order.  */
> !   m_partitioned_nodes.truncate (num_partitioned_nodes);
> !   for (unsigned int node_i = 0; node_i < m_vertices.length (); ++node_i)
> !     {
> !       int partition_i = m_vertices[node_i].partition;
> !       if (partition_i >= 0)
>   	{
> ! 	  unsigned int order_i = m_partitions[partition_i].node_end++;
> ! 	  m_partitioned_nodes[order_i] = node_i;
>   	}
> +     }
> + }
> + 
> + /* Look for edges from earlier partitions into node NODE_I and edges from
> +    node NODE_I into later partitions.  Call:
> + 
> +       FN (ud, other_node_i, other_partition_i)
> + 
> +    for each such use-to-def edge ud, where other_node_i and other_partition_i
> +    are the node and partition at the other end of the edge.  */
> + 
> + template<typename T>
> + void
> + vect_optimize_slp_pass::for_each_partition_edge (unsigned int node_i, T fn)
> + {
> +   int partition_i = m_vertices[node_i].partition;
> +   for (graph_edge *pred = m_slpg->vertices[node_i].pred;
> +        pred; pred = pred->pred_next)
> +     {
> +       int src_partition_i = m_vertices[pred->src].partition;
> +       if (src_partition_i >= 0 && src_partition_i != partition_i)
> + 	fn (pred, pred->src, src_partition_i);
> +     }
> +   for (graph_edge *succ = m_slpg->vertices[node_i].succ;
> +        succ; succ = succ->succ_next)
> +     {
> +       int dest_partition_i = m_vertices[succ->dest].partition;
> +       if (dest_partition_i >= 0 && dest_partition_i != partition_i)
> + 	fn (succ, succ->dest, dest_partition_i);
> +     }
> + }
> + 
> + /* Return true if the target is able to change a vector from layout
> +    FROM_LAYOUT_I to layout TO_LAYOUT_I.  */
> + 
> + bool
> + vect_optimize_slp_pass::can_change_layouts (unsigned int from_layout_i,
> + 					    unsigned int to_layout_i)
> + {
> +   /* FIXME: test what the target can support.  */
> +   (void) from_layout_i;
> +   (void) to_layout_i;
> +   return true;
> + }
> + 
> + /* We already know that the target can perform permute m_perms[LAYOUT_I].
> +    In our terms this means that we know that the target can go from a
> +    layout with m_perms[LAYOUT_I] reversed (layout LAYOUT_I) to one with
> +    it applied (layout 0, the original layout of the SLP graph).  Return
> +    true if the target can also go in the opposite direction: from the
> +    original layout to one in which m_perms[LAYOUT_I] has been reversed.  */
> + 
> + bool
> + vect_optimize_slp_pass::layout_is_reversible (unsigned int layout_i)
> + {
> +   return can_change_layouts (0, layout_i);
> + }
> + 
> + /* Return the cost (in arbtirary units) of going from layout FROM_LAYOUT_I
> +    to layout TO_LAYOUT_I.  */
> + 
> + unsigned int
> + vect_optimize_slp_pass::layout_cost (unsigned int from_layout_i,
> + 				     unsigned int to_layout_i)
> + {
> +   if (from_layout_i == to_layout_i)
> +     return 0;
> + 
> +   if (can_change_layouts (from_layout_i, to_layout_i))
> +     return 1;
> + 
> +   /* Change to layout 0 and then out again.  */
> +   gcc_assert (layout_is_reversible (from_layout_i));
> +   return 2;
> + }
> + 
> + /* Return the costs of assigning layout LAYOUT_I to partition PARTITION_I.  */
> + 
> + inline slpg_partition_layout_costs &
> + vect_optimize_slp_pass::partition_layout_costs (unsigned int partition_i,
> + 						unsigned int layout_i)
> + {
> +   return m_partition_layout_costs[partition_i * m_perms.length () + layout_i];
> + }
> + 
> + /* Return true if partition PARTITION_I is structually allowed to absorb
> +    neighboring permutes.  */
> + 
> + bool
> + vect_optimize_slp_pass::
> + partition_can_absorb_permutes (unsigned int partition_i)
> + {
> +   auto &partition = m_partitions[partition_i];
> +   if (partition.node_end != partition.node_begin + 1)
> +     return false;
> + 
> +   slp_tree node = m_vertices[m_partitioned_nodes[partition.node_begin]].node;
> +   if (SLP_TREE_CODE (node) == VEC_PERM_EXPR)
> +     return true;
> + 
> +   stmt_vec_info rep = SLP_TREE_REPRESENTATIVE (node);
> +   if (rep
> +       && STMT_VINFO_DATA_REF (rep)
> +       && DR_IS_READ (STMT_VINFO_DATA_REF (rep)))
> +     return true;
> + 
> +   return false;
> + }
> + 
> + /* Check whether NODE (which is a VEC_PERM_EXPR or a load) can absorb the
> +    permute that would be required to go from layout 0 to layout LAYOUT_I.
> +    Return the cost of the change if so, in the same arbitrary units as
> +    for layout_cost.  Return -1 otherwise.  */
> + 
> + int
> + vect_optimize_slp_pass::node_absorption_cost (slp_tree node,
> + 					      unsigned int layout_i)
> + {
> +   if (SLP_TREE_CODE (node) == VEC_PERM_EXPR)
> +     /* FIXME: check whether NODE can handle cases in which both the inputs
> +        and outputs have layout LAYOUT_I.  Return -1 if not.  */
> +     return 0;
> + 
> +   auto &vertex = m_vertices[node->vertex];
> +   auto &partition = m_partitions[vertex.partition];
> +   if (partition.layout == int (layout_i))
> +     return 0;
> +   /* FIXME: check whether NODE can be changed from its current layout
> +      to layout LAYOUT_I.  Return -1 if not.  */
> +   return 1;
> + }
> + 
> + /* Decide which element layouts we should consider using.  Mark which
> +    partitions can potentially use which layouts and the local costs of
> +    each choice.  */
> + 
> + void
> + vect_optimize_slp_pass::start_choosing_layouts ()
> + {
> +   /* Used to assign unique permute indices.  */
> +   using perm_hash = unbounded_hashmap_traits<
> +     vec_free_hash_base<int_hash_base<unsigned>>,
> +     int_hash<int, -1, -2>
> +   >;
> +   hash_map<vec<unsigned>, int, perm_hash> layout_ids;
> + 
> +   /* Layout 0 is "no change".  */
> +   m_perms.safe_push (vNULL);
> + 
> +   /* Create layouts from existing permutations.  */
> +   /* FIXME: cap the maximum number of layouts, to avoid quadraticness.  */
> +   for (unsigned int node_i : m_leafs)
> +     {
> +       int partition_i = m_vertices[node_i].partition;
> +       if (partition_i < 0)
> + 	continue;
> + 
> +       /* Leafs also double as entries to the reverse graph.  Allow the
> + 	 layout of those to be changed.  */
> +       if (!m_slpg->vertices[node_i].succ)
> + 	m_partitions[partition_i].layout = 0;
>   
>         /* Loads are the only thing generating permutes.  */
> +       slp_tree node = m_vertices[node_i].node;
>         if (!SLP_TREE_LOAD_PERMUTATION (node).exists ())
>   	continue;
>   
>         /* If splitting out a SLP_TREE_LANE_PERMUTATION can make the
> ! 	 node unpermuted, record a layout that reverses this permute.  */
> !       gcc_assert (m_partitions[partition_i].layout == 0);
>         stmt_vec_info dr_stmt = SLP_TREE_REPRESENTATIVE (node);
>         if (!STMT_VINFO_GROUPED_ACCESS (dr_stmt))
>   	continue;
> ***************
> *** 3708,3720 ****
>   	  if (idx - SLP_TREE_LOAD_PERMUTATION (node)[0] != j)
>   	    any_permute = true;
>   	}
> -       /* If there's no permute no need to split one out.  */
> -       if (!any_permute)
> - 	continue;
>         /* If the span doesn't match we'd disrupt VF computation, avoid
>   	 that for now.  */
>         if (imax - imin + 1 != SLP_TREE_LANES (node))
>   	continue;
>   
>         /* For now only handle true permutes, like
>   	 vect_attempt_slp_rearrange_stmts did.  This allows us to be lazy
> --- 4425,4442 ----
>   	  if (idx - SLP_TREE_LOAD_PERMUTATION (node)[0] != j)
>   	    any_permute = true;
>   	}
>         /* If the span doesn't match we'd disrupt VF computation, avoid
>   	 that for now.  */
>         if (imax - imin + 1 != SLP_TREE_LANES (node))
>   	continue;
> +       /* If there's no permute no need to split one out.  In this case
> + 	 we can consider turning the load into a permuted load, if that
> + 	 turns out to be cheaper than alternatives.  */
> +       if (!any_permute)
> + 	{
> + 	  m_partitions[partition_i].layout = -1;
> + 	  continue;
> + 	}
>   
>         /* For now only handle true permutes, like
>   	 vect_attempt_slp_rearrange_stmts did.  This allows us to be lazy
> ***************
> *** 3735,3985 ****
>         perm.safe_grow (SLP_TREE_LANES (node), true);
>         for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j)
>   	perm[j] = SLP_TREE_LOAD_PERMUTATION (node)[j] - imin;
> !       perms.safe_push (perm);
> !       vertices[idx].perm_in = perms.length () - 1;
> !       vertices[idx].perm_out = perms.length () - 1;
>       }
>   
> !   /* In addition to the above we have to mark outgoing permutes facing
> !      non-reduction graph entries that are not represented as to be
> !      materialized.  */
> !   for (slp_instance instance : vinfo->slp_instances)
>       if (SLP_INSTANCE_KIND (instance) == slp_inst_kind_ctor)
>         {
> ! 	/* Just setting perm_out isn't enough for the propagation to
> ! 	   pick this up.  */
> ! 	vertices[SLP_INSTANCE_TREE (instance)->vertex].perm_in = 0;
> ! 	vertices[SLP_INSTANCE_TREE (instance)->vertex].perm_out = 0;
>         }
>   
> !   /* Propagate permutes along the graph and compute materialization points.  */
> !   bool changed;
> !   bool do_materialization = false;
> !   unsigned iteration = 0;
> !   do
> !     {
> !       changed = false;
> !       ++iteration;
> ! 
> !       if (dump_enabled_p ())
> ! 	dump_printf_loc (MSG_NOTE, vect_location,
> ! 			 "SLP optimize iteration %d\n", iteration);
>   
> !       for (i = vertices.length (); i > 0 ; --i)
>   	{
> ! 	  int idx = ipo[i-1];
> ! 	  slp_tree node = vertices[idx].node;
>   
> ! 	  /* Handle externals and constants optimistically throughout the
> ! 	     iteration.  */
> ! 	  if (SLP_TREE_DEF_TYPE (node) == vect_external_def
> ! 	      || SLP_TREE_DEF_TYPE (node) == vect_constant_def)
> ! 	    continue;
>   
> ! 	  /* We still eventually have failed backedge SLP nodes in the
> ! 	     graph, those are only cancelled when analyzing operations.
> ! 	     Simply treat them as transparent ops, propagating permutes
> ! 	     through them.  */
> ! 	  if (SLP_TREE_DEF_TYPE (node) == vect_internal_def)
> ! 	    {
> ! 	      /* We do not handle stores with a permutation, so all
> ! 		 incoming permutes must have been materialized.  */
> ! 	      stmt_vec_info rep = SLP_TREE_REPRESENTATIVE (node);
> ! 	      if (STMT_VINFO_DATA_REF (rep)
> ! 		  && DR_IS_WRITE (STMT_VINFO_DATA_REF (rep)))
> ! 		{
> ! 		  /* ???  We're forcing materialization in place
> ! 		     of the child here, we'd need special handling
> ! 		     in materialization to leave perm_in -1 here.  */
> ! 		  vertices[idx].perm_in = 0;
> ! 		  vertices[idx].perm_out = 0;
> ! 		}
> ! 	      /* We cannot move a permute across an operation that is
> ! 		 not independent on lanes.  Note this is an explicit
> ! 		 negative list since that's much shorter than the respective
> ! 		 positive one but it's critical to keep maintaining it.  */
> ! 	      if (is_gimple_call (STMT_VINFO_STMT (rep)))
> ! 		switch (gimple_call_combined_fn (STMT_VINFO_STMT (rep)))
> ! 		  {
> ! 		  case CFN_COMPLEX_ADD_ROT90:
> ! 		  case CFN_COMPLEX_ADD_ROT270:
> ! 		  case CFN_COMPLEX_MUL:
> ! 		  case CFN_COMPLEX_MUL_CONJ:
> ! 		  case CFN_VEC_ADDSUB:
> ! 		  case CFN_VEC_FMADDSUB:
> ! 		  case CFN_VEC_FMSUBADD:
> ! 		    vertices[idx].perm_in = 0;
> ! 		    vertices[idx].perm_out = 0;
> ! 		  default:;
> ! 		  }
>   	    }
>   
> ! 	  if (!slpg->vertices[idx].succ)
> ! 	    /* Pick up pre-computed leaf values.  */
> ! 	    ;
> ! 	  else
>   	    {
> ! 	      bool any_succ_perm_out_m1 = false;
> ! 	      int perm_in = vertices[idx].perm_in;
> ! 	      for (graph_edge *succ = slpg->vertices[idx].succ;
> ! 		   succ; succ = succ->succ_next)
>   		{
> ! 		  int succ_idx = succ->dest;
> ! 		  int succ_perm = vertices[succ_idx].perm_out;
> ! 		  /* Handle unvisited (and constant) nodes optimistically.  */
> ! 		  /* ???  But for constants once we want to handle
> ! 		     non-bijective permutes we have to verify the permute,
> ! 		     when unifying lanes, will not unify different constants.
> ! 		     For example see gcc.dg/vect/bb-slp-14.c for a case
> ! 		     that would break.  */
> ! 		  if (succ_perm == -1)
> ! 		    {
> ! 		      /* When we handled a non-leaf optimistically, note
> ! 			 that so we can adjust its outgoing permute below.  */
> ! 		      slp_tree succ_node = vertices[succ_idx].node;
> ! 		      if (SLP_TREE_DEF_TYPE (succ_node) != vect_external_def
> ! 			  && SLP_TREE_DEF_TYPE (succ_node) != vect_constant_def)
> ! 			any_succ_perm_out_m1 = true;
> ! 		      continue;
> ! 		    }
> ! 		  if (perm_in == -1)
> ! 		    perm_in = succ_perm;
> ! 		  else if (succ_perm == 0
> ! 			   || !vect_slp_perms_eq (perms, perm_in, succ_perm))
>   		    {
> ! 		      perm_in = 0;
> ! 		      break;
>   		    }
>   		}
>   
> ! 	      /* Adjust any incoming permutes we treated optimistically.  */
> ! 	      if (perm_in != -1 && any_succ_perm_out_m1)
>   		{
> ! 		  for (graph_edge *succ = slpg->vertices[idx].succ;
> ! 		       succ; succ = succ->succ_next)
>   		    {
> ! 		      slp_tree succ_node = vertices[succ->dest].node;
> ! 		      if (vertices[succ->dest].perm_out == -1
> ! 			  && SLP_TREE_DEF_TYPE (succ_node) != vect_external_def
> ! 			  && SLP_TREE_DEF_TYPE (succ_node) != vect_constant_def)
> ! 			{
> ! 			  vertices[succ->dest].perm_out = perm_in;
> ! 			  /* And ensure this propagates.  */
> ! 			  if (vertices[succ->dest].perm_in == -1)
> ! 			    vertices[succ->dest].perm_in = perm_in;
> ! 			}
>   		    }
> ! 		  changed = true;
> ! 		}
>   
> ! 	      if (!vect_slp_perms_eq (perms, perm_in,
> ! 				      vertices[idx].perm_in))
> ! 		{
> ! 		  /* Make sure we eventually converge.  */
> ! 		  gcc_checking_assert (vertices[idx].perm_in == -1
> ! 				       || perm_in == 0);
> ! 		  vertices[idx].perm_in = perm_in;
> ! 
> ! 		  /* While we can handle VEC_PERM nodes as transparent
> ! 		     pass-through they can be a cheap materialization
> ! 		     point as well.  In addition they can act as source
> ! 		     of a random permutation as well.
> ! 		     The following ensures that former materialization
> ! 		     points that now have zero incoming permutes no
> ! 		     longer appear as such and that former "any" permutes
> ! 		     get pass-through.  We keep VEC_PERM nodes optimistic
> ! 		     as "any" outgoing permute though.  */
> ! 		  if (vertices[idx].perm_out != 0
> ! 		      && SLP_TREE_CODE (node) != VEC_PERM_EXPR)
> ! 		    vertices[idx].perm_out = perm_in;
> ! 		  changed = true;
> ! 		}
>   	    }
>   
> ! 	  /* Elide pruning at materialization points in the first
> ! 	     iteration phase.  */
> ! 	  if (!do_materialization)
> ! 	    continue;
>   
> ! 	  int perm = vertices[idx].perm_out;
> ! 	  if (perm == 0 || perm == -1)
>   	    continue;
>   
> ! 	  /* Decide on permute materialization.  Look whether there's
> ! 	     a use (pred) edge that is permuted differently than us.
> ! 	     In that case mark ourselves so the permutation is applied.  */
> ! 	  bool all_preds_permuted = slpg->vertices[idx].pred != NULL;
> ! 	  if (all_preds_permuted)
> ! 	    for (graph_edge *pred = slpg->vertices[idx].pred;
> ! 		 pred; pred = pred->pred_next)
> ! 	      {
> ! 		int pred_perm = vertices[pred->src].perm_in;
> ! 		gcc_checking_assert (pred_perm != -1);
> ! 		if (!vect_slp_perms_eq (perms, perm, pred_perm))
> ! 		  {
> ! 		    all_preds_permuted = false;
> ! 		    break;
> ! 		  }
> ! 	      }
> ! 	  if (!all_preds_permuted)
> ! 	    {
> ! 	      vertices[idx].perm_out = 0;
> ! 	      changed = true;
>   	    }
> - 	}
>   
> !       /* If the initial propagation converged, switch on materialization
> ! 	 and re-propagate.  */
> !       if (!changed && !do_materialization)
>   	{
> ! 	  do_materialization = true;
> ! 	  changed = true;
>   	}
>       }
> !   while (changed);
> !   statistics_histogram_event (cfun, "SLP optimize perm iterations", iteration);
>   
> !   /* Materialize.  */
> !   for (i = 0; i < vertices.length (); ++i)
>       {
> !       int perm_in = vertices[i].perm_in;
> !       slp_tree node = vertices[i].node;
>   
> !       /* First permute invariant/external original successors, we handle
> ! 	 those optimistically during propagation and duplicate them if
> ! 	 they are used with different permutations.  */
> !       unsigned j;
> !       slp_tree child;
> !       if (perm_in > 0)
> ! 	FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), j, child)
> ! 	  {
> ! 	    if (!child
> ! 		|| (SLP_TREE_DEF_TYPE (child) != vect_constant_def
> ! 		    && SLP_TREE_DEF_TYPE (child) != vect_external_def))
> ! 	      continue;
>   
> ! 	    /* If the vector is uniform there's nothing to do.  */
> ! 	    if (vect_slp_tree_uniform_p (child))
> ! 	      continue;
>   
> ! 	    /* We can end up sharing some externals via two_operator
> ! 	       handling.  Be prepared to unshare those.  */
> ! 	    if (child->refcnt != 1)
> ! 	      {
> ! 		gcc_assert (slpg->vertices[child->vertex].pred->pred_next);
> ! 		SLP_TREE_CHILDREN (node)[j] = child
> ! 		  = vect_create_new_slp_node
> ! 		      (SLP_TREE_SCALAR_OPS (child).copy ());
> ! 	      }
> ! 	    vect_slp_permute (perms[perm_in],
> ! 			      SLP_TREE_SCALAR_OPS (child), true);
> ! 	  }
>   
>         if (SLP_TREE_CODE (node) == VEC_PERM_EXPR)
>   	{
>   	  /* Apply the common permutes to the input vectors.  */
> ! 	  if (perm_in > 0)
> ! 	    {
>   	      /* If the node is already a permute node we can apply
>   		 the permutation to the lane selection, effectively
>   		 materializing it on the incoming vectors.  */
> --- 4457,5092 ----
>         perm.safe_grow (SLP_TREE_LANES (node), true);
>         for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j)
>   	perm[j] = SLP_TREE_LOAD_PERMUTATION (node)[j] - imin;
> !       bool existed;
> !       int &layout_i = layout_ids.get_or_insert (perm, &existed);
> !       if (existed)
> ! 	perm.release ();
> !       else
> ! 	{
> ! 	  layout_i = m_perms.length ();
> ! 	  m_perms.safe_push (perm);
> ! 	}
> !       m_partitions[partition_i].layout = layout_i;
>       }
>   
> !   /* Initially assume that every layout is possible and has zero cost
> !      in every partition.  */
> !   m_partition_layout_costs.safe_grow_cleared (m_partitions.length ()
> ! 					      * m_perms.length ());
> ! 
> !   /* We have to mark outgoing permutes facing non-reduction graph
> !      entries that are not represented as to be materialized.  */
> !   for (slp_instance instance : m_vinfo->slp_instances)
>       if (SLP_INSTANCE_KIND (instance) == slp_inst_kind_ctor)
>         {
> ! 	unsigned int node_i = SLP_INSTANCE_TREE (instance)->vertex;
> ! 	m_partitions[m_vertices[node_i].partition].layout = 0;
>         }
>   
> !   /* Check which layouts each node and partition can handle.  Calculate the
> !      weights associated with inserting layout changes on edges.  */
> !   m_optimize_size = true;
> !   for (unsigned int node_i : m_partitioned_nodes)
> !     {
> !       auto &vertex = m_vertices[node_i];
> !       unsigned int partition_i = vertex.partition;
> !       auto &partition = m_partitions[partition_i];
> !       slp_tree node = vertex.node;
> ! 
> !       if (stmt_vec_info rep = SLP_TREE_REPRESENTATIVE (node))
> ! 	{
> ! 	  basic_block bb = gimple_bb (vect_orig_stmt (rep)->stmt);
> ! 	  if (optimize_bb_for_speed_p (bb))
> ! 	    m_optimize_size = false;
> ! 
> ! 	  vertex.weight = vect_slp_node_weight (node);
> ! 
> ! 	  /* We do not handle stores with a permutation, so all
> ! 	     incoming permutes must have been materialized.  */
> ! 	  if (STMT_VINFO_DATA_REF (rep)
> ! 	      && DR_IS_WRITE (STMT_VINFO_DATA_REF (rep)))
> ! 	    /* ???  We're forcing materialization in place
> ! 	       of the child here, we'd need special handling
> ! 	       in materialization to leave layout -1 here.  */
> ! 	    partition.layout = 0;
> ! 
> ! 	  /* We cannot change the layout of an operation that is
> ! 	     not independent on lanes.  Note this is an explicit
> ! 	     negative list since that's much shorter than the respective
> ! 	     positive one but it's critical to keep maintaining it.  */
> ! 	  if (is_gimple_call (STMT_VINFO_STMT (rep)))
> ! 	    switch (gimple_call_combined_fn (STMT_VINFO_STMT (rep)))
> ! 	      {
> ! 	      case CFN_COMPLEX_ADD_ROT90:
> ! 	      case CFN_COMPLEX_ADD_ROT270:
> ! 	      case CFN_COMPLEX_MUL:
> ! 	      case CFN_COMPLEX_MUL_CONJ:
> ! 	      case CFN_VEC_ADDSUB:
> ! 	      case CFN_VEC_FMADDSUB:
> ! 	      case CFN_VEC_FMSUBADD:
> ! 		partition.layout = 0;
> ! 	      default:;
> ! 	      }
> ! 	}
>   
> !       if (partition.layout == 0)
>   	{
> ! 	  /* The node must stay in its current order.  */
> ! 	  for (unsigned int layout_i = 1; layout_i < m_perms.length ();
> ! 	       ++layout_i)
> ! 	    partition_layout_costs (partition_i, layout_i).mark_impossible ();
> ! 	}
> !       else if (partition_can_absorb_permutes (partition_i))
> ! 	/* Calculate the cost of converting NODE to use each permute.
> ! 	   If PARTITION.LAYOUT > 0 then that layout's permute will be free.  */
> ! 	for (unsigned int layout_i = 0; layout_i < m_perms.length ();
> ! 	     ++layout_i)
> ! 	  {
> ! 	    auto &layout_costs = partition_layout_costs (partition_i,
> ! 							 layout_i);
> ! 	    int factor = node_absorption_cost (node, layout_i);
> ! 	    if (factor < 0)
> ! 	      {
> ! 		layout_costs.mark_impossible ();
> ! 		continue;
> ! 	      }
> ! 	    gcc_assert (layout_costs.is_possible ());
> ! 	    layout_costs.in_cost = { vertex.weight * factor, m_optimize_size };
> ! 	  }
>   
> !       auto process_edge = [&](graph_edge *ud, unsigned int node2_i,
> ! 			      unsigned int partition2_i)
> ! 	{
> ! 	  /* Count the number of edges from earlier partitions and the number
> ! 	     of edges to later partitions.  */
> ! 	  if (partition2_i < partition_i)
> ! 	    partition.in_degree += 1;
> ! 	  else
> ! 	    partition.out_degree += 1;
>   
> ! 	  /* If the current node uses the result of NODE2_I, accumulate
> ! 	     the effects of that.  */
> ! 	  if (ud->src == int (node_i))
> ! 	    {
> ! 	      m_vertices[node2_i].out_weight += vertex.weight;
> ! 	      m_vertices[node2_i].out_degree += 1;
>   	    }
> + 	};
> +       for_each_partition_edge (node_i, process_edge);
> +     }
>   
> !   /* Layout 0 represents the original order and layout LAYOUT_I represents
> !      the effect of reversing m_perms[LAYOUT_I].  If a predecessor partition
> !      does not support layout LAYOUT_I and if it also isn't possible to go
> !      from layout 0 to layout LAYOUT_I, then it is very difficult for a
> !      successor partition to support layout LAYOUT_I.  (The only alternative
> !      would go to layout LAYOUT_I from a different nonzero layout, but
> !      that's a niche case and not really worth supporting.)  */
> !   for (unsigned int layout_i = 1; layout_i < m_perms.length (); ++layout_i)
> !     {
> !       if (layout_is_reversible (layout_i))
> ! 	continue;
> ! 
> !       bool changed;
> !       do
> ! 	{
> ! 	  changed = false;
> ! 	  for (unsigned int node_i : m_partitioned_nodes)
>   	    {
> ! 	      unsigned int partition_i = m_vertices[node_i].partition;
> ! 	      if (partition_layout_costs (partition_i,
> ! 					  layout_i).is_possible ())
> ! 		continue;
> ! 	      for (graph_edge *pred = m_slpg->vertices[node_i].pred;
> ! 		   pred; pred = pred->pred_next)
>   		{
> ! 		  unsigned int partition2_i = m_vertices[pred->src].partition;
> ! 		  auto &costs = partition_layout_costs (partition2_i,
> ! 							layout_i);
> ! 		  if (costs.is_possible ())
>   		    {
> ! 		      costs.mark_impossible ();
> ! 		      changed = true;
>   		    }
>   		}
> + 	    }
> + 	}
> +       while (changed);
> +     }
> + }
> + 
> + /* Return the cost of switching between layout LAYOUT1_I (at node NODE1_I)
> +    and layout LAYOUT2_I on cross-partition use-to-def edge UD.  */
> + 
> + slpg_layout_cost
> + vect_optimize_slp_pass::edge_layout_cost (graph_edge *ud,
> + 					  unsigned int node1_i,
> + 					  unsigned int layout1_i,
> + 					  unsigned int layout2_i)
> + {
> +   auto def_layout_i = ud->dest == int (node1_i) ? layout1_i : layout2_i;
> +   auto use_layout_i = ud->dest == int (node1_i) ? layout2_i : layout1_i;
> +   auto factor = layout_cost (def_layout_i, use_layout_i);
> + 
> +   /* We have a choice of putting the layout change at the site of the
> +      definition or at the site of the use.  Prefer the former when
> +      optimizing for size or when the execution frequency of the
> +      definition is no greater than the combined execution frequencies of
> +      the uses.  When putting the layout change at the site of the definition,
> +      divvy up the cost among all consumers.  */
> +   slpg_vertex &def_vertex = m_vertices[ud->dest];
> +   slpg_vertex &use_vertex = m_vertices[ud->src];
> +   if (m_optimize_size || def_vertex.weight <= def_vertex.out_weight)
> +     {
> +       slpg_layout_cost cost = { def_vertex.weight * factor, m_optimize_size };
> +       cost.split (def_vertex.out_degree);
> +       return cost;
> +     }
> +   return { use_vertex.weight * factor, m_optimize_size };
> + }
> + 
> + /* UD represents a use-def link between FROM_NODE_I and a node in a later
> +    partition; FROM_NODE_I could be the definition node or the use node.
> +    The node at the other end of the link wants to use layout TO_LAYOUT_I;
> +    return the cost of any necessary fix-ups on edge UD.
> + 
> +    At this point, FROM_NODE_I's partition has chosen the cheapest
> +    layout based on the information available so far, but this choice
> +    is only provisional.  */
> + 
> + slpg_layout_cost
> + vect_optimize_slp_pass::forward_cost (graph_edge *ud,
> + 				      unsigned int from_node_i,
> + 				      unsigned int to_layout_i)
> + {
> +   slpg_vertex &from_vertex = m_vertices[from_node_i];
> +   unsigned int from_partition_i = from_vertex.partition;
> +   slpg_partition_info &from_partition = m_partitions[from_partition_i];
> +   gcc_assert (from_partition.layout >= 0);
> + 
> +   /* First calculate the cost on the assumption that FROM_PARTITION sticks
> +      with its current layout preference.  */
> +   slpg_layout_cost cost
> +     = partition_layout_costs (from_partition_i,
> + 			      from_partition.layout).in_cost;
> +   cost.split (from_partition.out_degree);
> +   cost.add_serial_cost (edge_layout_cost (ud, from_node_i,
> + 					  from_partition.layout, to_layout_i));
> + 
> +   /* Take the minimum of that cost and the cost that applies if
> +      FROM_PARTITION instead switches to TO_LAYOUT_I.  */
> +   auto &direct_layout_costs = partition_layout_costs (from_partition_i,
> + 						      to_layout_i);
> +   if (direct_layout_costs.is_possible ())
> +     {
> +       slpg_layout_cost direct_cost = direct_layout_costs.in_cost;
> +       direct_cost.split (from_partition.out_degree);
> +       if (direct_cost.is_better_than (cost, m_optimize_size))
> + 	cost = direct_cost;
> +     }
> + 
> +   /* If FROM_PARTITION is likely to be able to absorb the required outgoing
> +      permute, optimistically assume that that is what will happen.  */
> +   if (partition_can_absorb_permutes (from_partition_i))
> +     {
> +       int factor = node_absorption_cost (from_vertex.node, to_layout_i);
> +       if (factor >= 0)
> + 	{
> + 	  sreal weight = from_vertex.weight * factor;
> + 	  slpg_layout_cost absorb_cost = { weight, m_optimize_size };
> + 	  if (absorb_cost.is_better_than (cost, m_optimize_size))
> + 	    cost = absorb_cost;
> + 	}
> +     }
> + 
> +   return cost;
> + }
> + 
> + /* UD represents a use-def link between TO_NODE_I and a node in an earlier
> +    partition; TO_NODE_I could be the definition node or the use node.
> +    The node at the other end of the link wants to use layout FROM_LAYOUT_I;
> +    return the cost of any necessary fix-ups on edge UD.
> + 
> +    At this point, TO_NODE_I's partition has a fixed choice of layout.  */
> + 
> + slpg_layout_cost
> + vect_optimize_slp_pass::backward_cost (graph_edge *ud,
> + 				       unsigned int to_node_i,
> + 				       unsigned int from_layout_i)
> + {
> +   slpg_vertex &to_vertex = m_vertices[to_node_i];
> +   unsigned int to_partition_i = to_vertex.partition;
> +   slpg_partition_info &to_partition = m_partitions[to_partition_i];
> +   gcc_assert (to_partition.layout >= 0);
> + 
> +   slpg_layout_cost cost
> +     = partition_layout_costs (to_partition_i,
> + 			      to_partition.layout).out_cost;
> +   cost.split (to_partition.in_degree);
> +   cost.add_serial_cost (edge_layout_cost (ud, to_node_i, to_partition.layout,
> + 					  from_layout_i));
> + 
> +   /* If TO_PARTITION is likely to be able to absorb the required input permute,
> +      optimistically assume that that is what will happen.  */
> +   if (partition_can_absorb_permutes (to_partition_i))
> +     {
> +       int factor = node_absorption_cost (to_vertex.node, from_layout_i);
> +       if (factor >= 0)
> + 	{
> + 	  sreal weight = to_vertex.weight * factor;
> + 	  slpg_layout_cost absorb_cost = { weight, m_optimize_size };
> + 	  if (absorb_cost.is_better_than (cost, m_optimize_size))
> + 	    cost = absorb_cost;
> + 	}
> +     }
> + 
> +   return cost;
> + }
>   
> ! /* Make a forward pass through the partitions, accumulating input costs.
> !    Make a tentative (provisional) choice of layout for each partition.  */
> ! 
> ! void
> ! vect_optimize_slp_pass::forward_pass ()
> ! {
> !   for (unsigned int partition_i = 0; partition_i < m_partitions.length ();
> !        ++partition_i)
> !     {
> !       auto &partition = m_partitions[partition_i];
> !       unsigned int min_layout_i = 0;
> !       slpg_layout_cost *min_layout_cost = nullptr;
> !       for (unsigned int layout_i = 0; layout_i < m_perms.length (); ++layout_i)
> ! 	{
> ! 	  auto &layout_costs = partition_layout_costs (partition_i, layout_i);
> ! 	  if (!layout_costs.is_possible ())
> ! 	    continue;
> ! 
> ! 	  /* Accumulate the costs from predecessor partitions.  */
> ! 	  for (unsigned int order_i = partition.node_begin;
> ! 	       order_i < partition.node_end; ++order_i)
> ! 	    {
> ! 	      unsigned int node_i = m_partitioned_nodes[order_i];
> ! 	      auto add_cost = [&](graph_edge *ud, unsigned int from_node_i,
> ! 				  unsigned int from_partition_i)
>   		{
> ! 		  if (from_partition_i < partition_i)
>   		    {
> ! 		      auto cost = forward_cost (ud, from_node_i, layout_i);
> ! 		      layout_costs.in_cost.add_parallel_cost (cost);
>   		    }
> ! 		};
> ! 	      for_each_partition_edge (node_i, add_cost);
> ! 	    }
>   
> ! 	  /* Record the layout with the lowest cost.  Prefer layout 0 in
> ! 	     the event of a tie between it and another layout.  */
> ! 	  if (!min_layout_cost
> ! 	      || layout_costs.in_cost.is_better_than (*min_layout_cost,
> ! 						      m_optimize_size))
> ! 	    {
> ! 	      min_layout_i = layout_i;
> ! 	      min_layout_cost = &layout_costs.in_cost;
>   	    }
> + 	}
> +       gcc_assert (min_layout_cost);
> +       partition.layout = min_layout_i;
> +     }
> + }
>   
> ! /* Make a backward pass through the partitions, accumulating output costs.
> !    Make a final choice of layout for each partition.  */
>   
> ! void
> ! vect_optimize_slp_pass::backward_pass ()
> ! {
> !   for (unsigned int partition_i = m_partitions.length (); partition_i-- > 0;)
> !     {
> !       auto &partition = m_partitions[partition_i];
> !       unsigned int min_layout_i = 0;
> !       slpg_layout_cost *min_layout_cost = nullptr;
> !       unsigned int seen_min_layout_i = 0;
> !       slpg_layout_cost *seen_min_layout_cost = nullptr;
> !       for (unsigned int layout_i = 0; layout_i < m_perms.length (); ++layout_i)
> ! 	{
> ! 	  auto &layout_costs = partition_layout_costs (partition_i, layout_i);
> ! 	  if (!layout_costs.is_possible ())
>   	    continue;
>   
> ! 	  /* Accumulate the costs from successor partitions.  Record whether
> ! 	     any of them actually uses this layout. */
> ! 	  bool seen = false;
> ! 	  for (unsigned int order_i = partition.node_begin;
> ! 	       order_i < partition.node_end; ++order_i)
> ! 	    {
> ! 	      unsigned int node_i = m_partitioned_nodes[order_i];
> ! 	      auto add_cost = [&](graph_edge *ud, unsigned int to_node_i,
> ! 				  unsigned int to_partition_i)
> ! 		{
> ! 		  if (to_partition_i > partition_i)
> ! 		    {
> ! 		      auto cost = backward_cost (ud, to_node_i, layout_i);
> ! 		      layout_costs.out_cost.add_parallel_cost (cost);
> ! 		      seen |= (m_partitions[to_partition_i].layout
> ! 			       == int (layout_i));
> ! 		    }
> ! 		};
> ! 	      for_each_partition_edge (node_i, add_cost);
>   	    }
>   
> ! 	  /* Locally combine the costs from the forward and backward passes.
> ! 	     (This combined cost is not passed on, since that would lead
> ! 	     to double counting.)  */
> ! 	  layout_costs.cost = layout_costs.in_cost;
> ! 	  layout_costs.cost.add_serial_cost (layout_costs.out_cost);
> ! 
> ! 	  /* Record the layout with the lowest cost.  Prefer layout 0 in
> ! 	     the event of a tie between it and another layout.  */
> ! 	  if (!min_layout_cost
> ! 	      || layout_costs.cost.is_better_than (*min_layout_cost,
> ! 						   m_optimize_size))
> ! 	    {
> ! 	      min_layout_i = layout_i;
> ! 	      min_layout_cost = &layout_costs.cost;
> ! 	    }
> ! 	  /* Likewise, but restrict the choice of layouts to those that
> ! 	     are used by successor partitions.  */
> ! 	  if (seen
> ! 	      && (!seen_min_layout_cost
> ! 		  || layout_costs.cost.is_better_than (*seen_min_layout_cost,
> ! 						       m_optimize_size)))
> ! 	    {
> ! 	      seen_min_layout_i = min_layout_i;
> ! 	      seen_min_layout_cost = &layout_costs.cost;
> ! 	    }
> ! 	}
> !       gcc_assert (min_layout_cost);
> ! 
> !       /* If the partition can absorb permutes, prefer a layout that has
> ! 	 actually been used.  */
> !       if (seen_min_layout_cost
> ! 	  && partition_can_absorb_permutes (partition_i))
> ! 	partition.layout = seen_min_layout_i;
> !       else
> ! 	partition.layout = min_layout_i;
> !     }
> ! }
> ! 
> ! /* Return a node that applies layout TO_LAYOUT_I to the original form of NODE.
> !    NODE already has the layout that was selected for its partition.  */
> ! 
> ! slp_tree
> ! vect_optimize_slp_pass::get_result_with_layout (slp_tree node,
> ! 						unsigned int to_layout_i)
> ! {
> !   unsigned int result_i = node->vertex * m_perms.length () + to_layout_i;
> !   slp_tree result = m_node_layouts[result_i];
> !   if (result)
> !     return result;
> ! 
> !   if (SLP_TREE_DEF_TYPE (node) == vect_constant_def
> !       || SLP_TREE_DEF_TYPE (node) == vect_external_def)
> !     {
> !       /* If the vector is uniform or unchanged, there's nothing to do.  */
> !       if (to_layout_i == 0 || vect_slp_tree_uniform_p (node))
> ! 	result = node;
> !       else
>   	{
> ! 	  auto scalar_ops = SLP_TREE_SCALAR_OPS (node).copy ();
> ! 	  result = vect_create_new_slp_node (scalar_ops);
> ! 	  vect_slp_permute (m_perms[to_layout_i], scalar_ops, true);
>   	}
>       }
> !   else
> !     {
> !       unsigned int partition_i = m_vertices[node->vertex].partition;
> !       unsigned int from_layout_i = m_partitions[partition_i].layout;
> !       if (from_layout_i == to_layout_i)
> ! 	return node;
> ! 
> !       /* FIXME: if the result of a pre-existing VEC_PERM_EXPR is used
> ! 	 twice, with two different layouts, then we could create
> ! 	 parallel VEC_PERM_EXPRs rather than add a serial one.  */
> ! 
> !       /* FIXME: When calculating the costs, we considered the possiblity
> ! 	 of duplicating a VEC_PERM_EXPR at each use site, rather than
> ! 	 having one at the definition site.  Do that here if it turns
> ! 	 out to be cheaper.  */
> ! 
> !       if (dump_enabled_p ())
> ! 	dump_printf_loc (MSG_NOTE, vect_location,
> ! 			 "inserting permute node in place of %p\n",
> ! 			 node);
> ! 
> !       unsigned int num_lanes = SLP_TREE_LANES (node);
> !       result = vect_create_new_slp_node (1, VEC_PERM_EXPR);
> !       if (SLP_TREE_SCALAR_STMTS (node).length ())
> ! 	{
> ! 	  auto &stmts = SLP_TREE_SCALAR_STMTS (result);
> ! 	  stmts.safe_splice (SLP_TREE_SCALAR_STMTS (result));
> ! 	  if (from_layout_i != 0)
> ! 	    vect_slp_permute (m_perms[from_layout_i], stmts, false);
> ! 	  if (to_layout_i != 0)
> ! 	    vect_slp_permute (m_perms[to_layout_i], stmts, true);
> ! 	}
> !       SLP_TREE_REPRESENTATIVE (result) = SLP_TREE_REPRESENTATIVE (node);
> !       SLP_TREE_LANES (result) = num_lanes;
> !       SLP_TREE_VECTYPE (result) = SLP_TREE_VECTYPE (node);
> !       result->vertex = -1;
> ! 
> !       auto &lane_perm = SLP_TREE_LANE_PERMUTATION (result);
> !       lane_perm.create (num_lanes);
> !       for (unsigned j = 0; j < num_lanes; ++j)
> ! 	lane_perm.quick_push ({ 0, j });
> !       if (from_layout_i != 0)
> ! 	vect_slp_permute (m_perms[from_layout_i], lane_perm, false);
> !       if (to_layout_i != 0)
> ! 	vect_slp_permute (m_perms[to_layout_i], lane_perm, true);
> ! 
> !       node->refcnt++;
> !       SLP_TREE_CHILDREN (result).safe_push (node);
> !     }
> !   m_node_layouts[result_i] = result;
> !   return result;
> ! }
> ! 
> ! /* Print the partition graph and layout information to the dump file.  */
>   
> ! void
> ! vect_optimize_slp_pass::dump ()
> ! {
> !   dump_printf_loc (MSG_NOTE, vect_location,
> ! 		   "SLP optimize permutes:\n");
> !   for (unsigned int layout_i = 1; layout_i < m_perms.length (); ++layout_i)
>       {
> !       dump_printf_loc (MSG_NOTE, vect_location, "  %d: { ", layout_i);
> !       const char *sep = "";
> !       for (unsigned int idx : m_perms[layout_i])
> ! 	{
> ! 	  dump_printf (MSG_NOTE, "%s%d", sep, idx);
> ! 	  sep = ", ";
> ! 	}
> !       dump_printf (MSG_NOTE, " }\n");
> !     }
> !   dump_printf_loc (MSG_NOTE, vect_location,
> ! 		   "SLP optimize partitions:\n");
> !   for (unsigned int partition_i = 0; partition_i < m_partitions.length ();
> !        ++partition_i)
> !     {
> !       auto &partition = m_partitions[partition_i];
> !       dump_printf_loc (MSG_NOTE, vect_location,  "  -------------\n");
> !       dump_printf_loc (MSG_NOTE, vect_location,
> ! 		       "  partition %d (layout %d):\n",
> ! 		       partition_i, partition.layout);
> !       dump_printf_loc (MSG_NOTE, vect_location, "    nodes:\n");
> !       for (unsigned int order_i = partition.node_begin;
> ! 	   order_i < partition.node_end; ++order_i)
> ! 	{
> ! 	  unsigned int node_i = m_partitioned_nodes[order_i];
> ! 	  slp_tree node = m_vertices[node_i].node;
> ! 	  dump_printf_loc (MSG_NOTE, vect_location, "      - %p:\n",
> ! 			   (void *) m_vertices[node_i].node);
> ! 	  dump_printf_loc (MSG_NOTE, vect_location,
> ! 			   "          weight: %f\n",
> ! 			   m_vertices[node_i].weight.to_double ());
> ! 	  if (m_vertices[node_i].out_degree)
> ! 	    dump_printf_loc (MSG_NOTE, vect_location,
> ! 			     "          out weight: %f (degree %d)\n",
> ! 			     m_vertices[node_i].out_weight.to_double (),
> ! 			     m_vertices[node_i].out_degree);
> ! 	  if (SLP_TREE_CODE (node) == VEC_PERM_EXPR)
> ! 	    dump_printf_loc (MSG_NOTE, vect_location,
> ! 			     "          op: VEC_PERM_EXPR\n");
> ! 	  else if (auto rep = SLP_TREE_REPRESENTATIVE (node))
> ! 	    dump_printf_loc (MSG_NOTE, vect_location,
> ! 			     "          op template: %G", rep->stmt);
> ! 	}
> !       dump_printf_loc (MSG_NOTE, vect_location, "    edges:\n");
> !       for (unsigned int order_i = partition.node_begin;
> ! 	   order_i < partition.node_end; ++order_i)
> ! 	{
> ! 	  unsigned int node_i = m_partitioned_nodes[order_i];
> ! 	  auto print_edge = [&](graph_edge *, unsigned int node2_i,
> ! 				unsigned int partition2_i)
> ! 	    {
> ! 	      if (partition2_i < partition_i)
> ! 		dump_printf_loc (MSG_NOTE, vect_location,
> ! 				 "      - %p [%d] --> %p\n",
> ! 				 m_vertices[node2_i].node, partition2_i,
> ! 				 m_vertices[node_i].node);
> ! 	      else
> ! 		dump_printf_loc (MSG_NOTE, vect_location,
> ! 				 "      - %p --> [%d] %p\n",
> ! 				 m_vertices[node_i].node, partition2_i,
> ! 				 m_vertices[node2_i].node);
> ! 	    };
> ! 	  for_each_partition_edge (node_i, print_edge);
> ! 	}
>   
> !       for (unsigned int layout_i = 0; layout_i < m_perms.length (); ++layout_i)
> ! 	{
> ! 	  auto &layout_costs = partition_layout_costs (partition_i, layout_i);
> ! 	  if (layout_costs.is_possible ())
> ! 	    {
> ! 	      dump_printf_loc (MSG_NOTE, vect_location,
> ! 			       "    layout %d:%s\n", layout_i,
> ! 			       partition.layout == int (layout_i)
> ! 			       ? " (*)" : "");
> ! #define TEMPLATE "{depth: %f, total: %f}"
> ! 	      dump_printf_loc (MSG_NOTE, vect_location,
> ! 			       "        " TEMPLATE "\n", layout_i,
> ! 			       layout_costs.in_cost.depth.to_double (),
> ! 			       layout_costs.in_cost.total.to_double ());
> ! 	      dump_printf_loc (MSG_NOTE, vect_location,
> ! 			       "      + " TEMPLATE "\n",
> ! 			       layout_costs.out_cost.depth.to_double (),
> ! 			       layout_costs.out_cost.total.to_double ());
> ! 	      dump_printf_loc (MSG_NOTE, vect_location,
> ! 			       "      = " TEMPLATE "\n",
> ! 			       layout_costs.cost.depth.to_double (),
> ! 			       layout_costs.cost.total.to_double ());
> ! #undef TEMPLATE
> ! 	    }
> ! 	  else
> ! 	    dump_printf_loc (MSG_NOTE, vect_location,
> ! 			     "    layout %d: rejected\n", layout_i);
> ! 	}
> !     }
> ! }
>   
> ! /* Apply the chosen vector layouts to the SLP graph.  */
>   
> ! void
> ! vect_optimize_slp_pass::materialize ()
> ! {
> !   m_partition_layout_costs.release ();
> !   m_node_layouts.safe_grow_cleared (m_vertices.length () * m_perms.length ());
>   
> +   for (unsigned int node_i : m_partitioned_nodes)
> +     {
> +       /* Rearrange the scalar statements to match the chosen layout.  */
> +       slp_tree node = m_vertices[node_i].node;
> +       int layout_i = m_partitions[m_vertices[node_i].partition].layout;
> +       gcc_assert (layout_i >= 0);
> +       if (layout_i > 0)
> + 	vect_slp_permute (m_perms[layout_i],
> + 			  SLP_TREE_SCALAR_STMTS (node), true);
> + 
> +       /* Update load and lane permutes.  */
>         if (SLP_TREE_CODE (node) == VEC_PERM_EXPR)
>   	{
>   	  /* Apply the common permutes to the input vectors.  */
> ! 	  /* FIXME: check whether the target accepts this folding and
> ! 	     handle it like a normal node if not.  That would include
> ! 	     adjusting the follow-up loop below.  */
> ! 	  unsigned int opno;
> ! 	  slp_tree child;
> ! 	  FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), opno, child)
> ! 	    {
> ! 	      unsigned int partition2_i = m_vertices[child->vertex].partition;
> ! 	      int layout2_i = m_partitions[partition2_i].layout;
> ! 	      if (layout2_i == 0)
> ! 		continue;
> ! 
>   	      /* If the node is already a permute node we can apply
>   		 the permutation to the lane selection, effectively
>   		 materializing it on the incoming vectors.  */
> ***************
> *** 3987,4141 ****
>   		dump_printf_loc (MSG_NOTE, vect_location,
>   				 "simplifying permute node %p\n",
>   				 node);
> ! 	      for (unsigned k = 0;
> ! 		   k < SLP_TREE_LANE_PERMUTATION (node).length (); ++k)
> ! 		SLP_TREE_LANE_PERMUTATION (node)[k].second
> ! 		  = perms[perm_in][SLP_TREE_LANE_PERMUTATION (node)[k].second];
> ! 	    }
> ! 	  /* Apply the anticipated output permute to the permute and
> ! 	     stmt vectors.  */
> ! 	  int perm_out = vertices[i].perm_out;
> ! 	  if (perm_out > 0)
> ! 	    {
> ! 	      vect_slp_permute (perms[perm_out],
> ! 				SLP_TREE_SCALAR_STMTS (node), true);
> ! 	      vect_slp_permute (perms[perm_out],
> ! 				SLP_TREE_LANE_PERMUTATION (node), true);
> ! 	    }
> ! 	}
> !       else if (vertices[i].get_perm_materialized () != 0)
> ! 	{
> ! 	  if (SLP_TREE_LOAD_PERMUTATION (node).exists ())
> ! 	    /* For loads simply drop the permutation, the load permutation
> ! 	       already performs the desired permutation.  */
> ! 	    ;
> ! 	  else if (SLP_TREE_LANE_PERMUTATION (node).exists ())
> ! 	    gcc_unreachable ();
> ! 	  else
> ! 	    {
> ! 	      if (dump_enabled_p ())
> ! 		dump_printf_loc (MSG_NOTE, vect_location,
> ! 				 "inserting permute node in place of %p\n",
> ! 				 node);
> ! 
> ! 	      /* Make a copy of NODE and in-place change it to a
> ! 		 VEC_PERM node to permute the lanes of the copy.  */
> ! 	      slp_tree copy = new _slp_tree;
> ! 	      SLP_TREE_CHILDREN (copy) = SLP_TREE_CHILDREN (node);
> ! 	      SLP_TREE_CHILDREN (node) = vNULL;
> ! 	      SLP_TREE_SCALAR_STMTS (copy)
> ! 		= SLP_TREE_SCALAR_STMTS (node).copy ();
> ! 	      vect_slp_permute (perms[perm_in],
> ! 				SLP_TREE_SCALAR_STMTS (copy), true);
> ! 	      gcc_assert (!SLP_TREE_SCALAR_OPS (node).exists ());
> ! 	      SLP_TREE_REPRESENTATIVE (copy) = SLP_TREE_REPRESENTATIVE (node);
> ! 	      gcc_assert (!SLP_TREE_LOAD_PERMUTATION (node).exists ());
> ! 	      SLP_TREE_LANE_PERMUTATION (copy)
> ! 		= SLP_TREE_LANE_PERMUTATION (node);
> ! 	      SLP_TREE_LANE_PERMUTATION (node) = vNULL;
> ! 	      SLP_TREE_VECTYPE (copy) = SLP_TREE_VECTYPE (node);
> ! 	      copy->refcnt = 1;
> ! 	      copy->max_nunits = node->max_nunits;
> ! 	      SLP_TREE_DEF_TYPE (copy) = SLP_TREE_DEF_TYPE (node);
> ! 	      SLP_TREE_LANES (copy) = SLP_TREE_LANES (node);
> ! 	      SLP_TREE_CODE (copy) = SLP_TREE_CODE (node);
> ! 
> ! 	      /* Now turn NODE into a VEC_PERM.  */
> ! 	      SLP_TREE_CHILDREN (node).safe_push (copy);
> ! 	      SLP_TREE_LANE_PERMUTATION (node).create (SLP_TREE_LANES (node));
> ! 	      for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j)
> ! 		SLP_TREE_LANE_PERMUTATION (node)
> ! 		  .quick_push (std::make_pair (0, perms[perm_in][j]));
> ! 	      SLP_TREE_CODE (node) = VEC_PERM_EXPR;
> ! 	    }
> ! 	}
> !       else if (perm_in > 0) /* perm_in == perm_out */
> ! 	{
> ! 	  /* Apply the reverse permutation to our stmts.  */
> ! 	  vect_slp_permute (perms[perm_in],
> ! 			    SLP_TREE_SCALAR_STMTS (node), true);
> ! 	  /* And to the lane/load permutation, which we can simply
> ! 	     make regular by design.  */
> ! 	  if (SLP_TREE_LOAD_PERMUTATION (node).exists ())
> ! 	    {
> ! 	      gcc_assert (!SLP_TREE_LANE_PERMUTATION (node).exists ());
> ! 	      /* ???  When we handle non-bijective permutes the idea
> ! 		 is that we can force the load-permutation to be
> ! 		 { min, min + 1, min + 2, ... max }.  But then the
> ! 		 scalar defs might no longer match the lane content
> ! 		 which means wrong-code with live lane vectorization.
> ! 		 So we possibly have to have NULL entries for those.  */
> ! 	      vect_slp_permute (perms[perm_in],
> ! 				SLP_TREE_LOAD_PERMUTATION (node), true);
> ! 	    }
> ! 	  else if (SLP_TREE_LANE_PERMUTATION (node).exists ())
> ! 	    gcc_unreachable ();
>   	}
>       }
>   
> !   /* Elide any permutations at BB reduction roots.  */
> !   if (is_a <bb_vec_info> (vinfo))
>       {
> !       for (slp_instance instance : vinfo->slp_instances)
>   	{
> ! 	  if (SLP_INSTANCE_KIND (instance) != slp_inst_kind_bb_reduc)
>   	    continue;
> ! 	  slp_tree old = SLP_INSTANCE_TREE (instance);
> ! 	  if (SLP_TREE_CODE (old) == VEC_PERM_EXPR
> ! 	      && SLP_TREE_CHILDREN (old).length () == 1)
>   	    {
> ! 	      slp_tree child = SLP_TREE_CHILDREN (old)[0];
> ! 	      if (SLP_TREE_DEF_TYPE (child) == vect_external_def)
> ! 		{
> ! 		  /* Preserve the special VEC_PERM we use to shield existing
> ! 		     vector defs from the rest.  But make it a no-op.  */
> ! 		  unsigned i = 0;
> ! 		  for (std::pair<unsigned, unsigned> &p
> ! 		       : SLP_TREE_LANE_PERMUTATION (old))
> ! 		    p.second = i++;
> ! 		}
> ! 	      else
> ! 		{
> ! 		  SLP_INSTANCE_TREE (instance) = child;
> ! 		  SLP_TREE_REF_COUNT (child)++;
> ! 		  vect_free_slp_tree (old);
> ! 		}
> ! 	    }
> ! 	  else if (SLP_TREE_LOAD_PERMUTATION (old).exists ()
> ! 		   && SLP_TREE_REF_COUNT (old) == 1
> ! 		   && vertices[old->vertex].get_perm_materialized () != 0)
> ! 	    {
> ! 	      /* ???  For loads the situation is more complex since
> ! 		 we can't modify the permute in place in case the
> ! 		 node is used multiple times.  In fact for loads this
> ! 		 should be somehow handled in the propagation engine.  */
> ! 	      /* Apply the reverse permutation to our stmts.  */
> ! 	      int perm = vertices[old->vertex].get_perm_materialized ();
> ! 	      vect_slp_permute (perms[perm],
> ! 				SLP_TREE_SCALAR_STMTS (old), true);
> ! 	      vect_slp_permute (perms[perm],
> ! 				SLP_TREE_LOAD_PERMUTATION (old), true);
>   	    }
>   	}
>       }
>   
> !   /* Free the perms vector used for propagation.  */
> !   while (!perms.is_empty ())
> !     perms.pop ().release ();
> !   free_graph (slpg);
> ! 
>   
> !   /* Now elide load permutations that are not necessary.  */
> !   for (i = 0; i < leafs.length (); ++i)
>       {
> !       node = vertices[leafs[i]].node;
>         if (!SLP_TREE_LOAD_PERMUTATION (node).exists ())
>   	continue;
>   
>         /* In basic block vectorization we allow any subchain of an interleaving
>   	 chain.
>   	 FORNOW: not in loop SLP because of realignment complications.  */
> !       if (is_a <bb_vec_info> (vinfo))
>   	{
>   	  bool subchain_p = true;
>   	  stmt_vec_info next_load_info = NULL;
> --- 5094,5173 ----
>   		dump_printf_loc (MSG_NOTE, vect_location,
>   				 "simplifying permute node %p\n",
>   				 node);
> ! 	      for (auto &entry : SLP_TREE_LANE_PERMUTATION (node))
> ! 		if (entry.first == opno)
> ! 		  entry.second = m_perms[layout2_i][entry.second];
> ! 	    }
> ! 	  /* Reverse the permutation that we chose to reverse.  */
> ! 	  if (layout_i > 0)
> ! 	    vect_slp_permute (m_perms[layout_i],
> ! 			      SLP_TREE_LANE_PERMUTATION (node), true);
> ! 	}
> !       else
> ! 	{
> ! 	  gcc_assert (!SLP_TREE_LANE_PERMUTATION (node).exists ());
> ! 	  auto &load_perm = SLP_TREE_LOAD_PERMUTATION (node);
> ! 	  if (layout_i > 0)
> ! 	    /* ???  When we handle non-bijective permutes the idea
> ! 	       is that we can force the load-permutation to be
> ! 	       { min, min + 1, min + 2, ... max }.  But then the
> ! 	       scalar defs might no longer match the lane content
> ! 	       which means wrong-code with live lane vectorization.
> ! 	       So we possibly have to have NULL entries for those.  */
> ! 	    vect_slp_permute (m_perms[layout_i], load_perm, true);
>   	}
>       }
>   
> !   /* Do this before any nodes disappear, since it involves a walk
> !      over the leaves.  */
> !   remove_redundant_permutes ();
> ! 
> !   /* Replace each child with a correctly laid-out version.  */
> !   for (unsigned int node_i : m_partitioned_nodes)
>       {
> !       slp_tree node = m_vertices[node_i].node;
> ! 
> !       /* The loop above adjusted permutes for each child's chosen layout,
> ! 	 so we don't need to adjust them here.  */
> !       if (SLP_TREE_CODE (node) == VEC_PERM_EXPR)
> ! 	continue;
> ! 
> !       int layout_i = m_partitions[m_vertices[node_i].partition].layout;
> !       gcc_assert (layout_i >= 0);
> !       unsigned j;
> !       slp_tree child;
> !       FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), j, child)
>   	{
> ! 	  if (!child)
>   	    continue;
> ! 
> ! 	  slp_tree new_child = get_result_with_layout (child, layout_i);
> ! 	  if (new_child != child)
>   	    {
> ! 	      vect_free_slp_tree (child);
> ! 	      SLP_TREE_CHILDREN (node)[j] = new_child;
> ! 	      new_child->refcnt += 1;
>   	    }
>   	}
>       }
> + }
>   
> ! /* Elide load permutations that are not necessary.  Such permutations might
> !    be pre-existing, rather than created by the layout optimizations.  */
>   
> ! void
> ! vect_optimize_slp_pass::remove_redundant_permutes ()
> ! {
> !   for (unsigned int node_i : m_leafs)
>       {
> !       slp_tree node = m_vertices[node_i].node;
>         if (!SLP_TREE_LOAD_PERMUTATION (node).exists ())
>   	continue;
>   
>         /* In basic block vectorization we allow any subchain of an interleaving
>   	 chain.
>   	 FORNOW: not in loop SLP because of realignment complications.  */
> !       if (is_a <bb_vec_info> (m_vinfo))
>   	{
>   	  bool subchain_p = true;
>   	  stmt_vec_info next_load_info = NULL;
> ***************
> *** 4160,4165 ****
> --- 5192,5198 ----
>   	}
>         else
>   	{
> + 	  loop_vec_info loop_vinfo = as_a<loop_vec_info> (m_vinfo);
>   	  stmt_vec_info load_info;
>   	  bool this_load_permuted = false;
>   	  unsigned j;
> ***************
> *** 4175,4182 ****
>   	      /* The load requires permutation when unrolling exposes
>   		 a gap either because the group is larger than the SLP
>   		 group-size or because there is a gap between the groups.  */
> ! 	      && (known_eq (LOOP_VINFO_VECT_FACTOR
> ! 			      (as_a <loop_vec_info> (vinfo)), 1U)
>   		  || ((SLP_TREE_LANES (node) == DR_GROUP_SIZE (first_stmt_info))
>   		      && DR_GROUP_GAP (first_stmt_info) == 0)))
>   	    {
> --- 5208,5214 ----
>   	      /* The load requires permutation when unrolling exposes
>   		 a gap either because the group is larger than the SLP
>   		 group-size or because there is a gap between the groups.  */
> ! 	      && (known_eq (LOOP_VINFO_VECT_FACTOR (loop_vinfo), 1U)
>   		  || ((SLP_TREE_LANES (node) == DR_GROUP_SIZE (first_stmt_info))
>   		      && DR_GROUP_GAP (first_stmt_info) == 0)))
>   	    {
> ***************
> *** 4187,4192 ****
> --- 5219,5258 ----
>       }
>   }
>   
> + /* Main entry point for the optimization pass.  */
> + 
> + void
> + vect_optimize_slp_pass::run ()
> + {
> +   build_graph ();
> +   create_partitions ();
> +   start_choosing_layouts ();
> +   if (m_perms.length () > 1)
> +     {
> +       forward_pass ();
> +       backward_pass ();
> +       if (dump_enabled_p ())
> + 	dump ();
> +       materialize ();
> +     }
> +   else
> +     remove_redundant_permutes ();
> +   /* Free the perms vector used for propagation.  */
> +   while (!m_perms.is_empty ())
> +     m_perms.pop ().release ();
> +   free_graph (m_slpg);
> + }
> + 
> + /* Optimize the SLP graph of VINFO.  */
> + 
> + void
> + vect_optimize_slp (vec_info *vinfo)
> + {
> +   if (vinfo->slp_instances.is_empty ())
> +     return;
> +   vect_optimize_slp_pass (vinfo).run ();
> + }
> + 
>   /* Gather loads reachable from the individual SLP graph entries.  */
>   
>   void
> ***************
> *** 6647,6653 ****
>   {
>     stmt_vec_info stmt_info = SLP_TREE_SCALAR_STMTS (node)[0];
>     int vec_index = 0;
> !   tree vectype = STMT_VINFO_VECTYPE (stmt_info);
>     unsigned int group_size = SLP_TREE_SCALAR_STMTS (node).length ();
>     unsigned int mask_element;
>     machine_mode mode;
> --- 7713,7719 ----
>   {
>     stmt_vec_info stmt_info = SLP_TREE_SCALAR_STMTS (node)[0];
>     int vec_index = 0;
> !   tree vectype = SLP_TREE_VECTYPE (node);
>     unsigned int group_size = SLP_TREE_SCALAR_STMTS (node).length ();
>     unsigned int mask_element;
>     machine_mode mode;
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
HRB 36809 (AG Nuernberg)

  reply	other threads:[~2022-08-03 11:23 UTC|newest]

Thread overview: 3+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-08-02  7:20 Richard Sandiford
2022-08-03 11:23 ` Richard Biener [this message]
2022-08-04  9:06   ` Richard Sandiford

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=nycvar.YFH.7.77.849.2208031118240.4208@jbgna.fhfr.qr \
    --to=rguenther@suse.de \
    --cc=gcc-patches@gcc.gnu.org \
    --cc=richard.sandiford@arm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).