public inbox for gcc-patches@gcc.gnu.org
* [PATCH] RFC: Extend SLP permutation optimisations
@ 2022-08-02  7:20 Richard Sandiford
From: Richard Sandiford @ 2022-08-02  7:20 UTC
  To: gcc-patches; +Cc: rguenther

Currently SLP tries to force permute operations "down" the graph
from loads in the hope of reducing the total number of permutes
needed or (in the best case) removing the need for the permutes
entirely.  This patch tries to extend that optimisation as follows:

- Allow loads to take a different permutation from the one they
  started with, rather than choosing between "original permutation"
  and "no permutation".

- Allow changes in both directions, if the target supports the
  reverse permute operation.

- Treat the placement of permute operations as a two-way dataflow
  problem: after propagating information from leaves to roots (as now),
  propagate information back up the graph (see the sketch below).

- Take execution frequency into account when optimising for speed,
  so that (for example) permutes inside loops have a higher cost
  than permutes outside loops.

- Try to reduce the total number of permutes when optimising for
  size, even if that increases the number of permutes on a given
  execution path.

See the big block comment above vect_optimize_slp_pass for
a detailed description.
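
To illustrate the two-way dataflow idea, here is a minimal,
self-contained sketch.  It uses hypothetical names and a toy
constant-cost model; it is not the patch's real code, which works on
slpg_layout_cost values and has to cope with impossible
partition/layout combinations:

#include <cstddef>
#include <vector>
#include <algorithm>
#include <limits>

/* Partitions are numbered so that predecessors come first, as in the
   patch's partition dag.  */
struct toy_partition { std::vector<int> preds, succs; };

/* Toy model: changing layout on an edge costs 1, keeping it costs 0.  */
static int
edge_cost (int from_layout, int to_layout)
{
  return from_layout == to_layout ? 0 : 1;
}

std::vector<int>
choose_layouts (const std::vector<toy_partition> &parts, int num_layouts)
{
  std::size_t n = parts.size ();
  std::vector<std::vector<int>> in_cost (n, std::vector<int> (num_layouts));
  std::vector<std::vector<int>> out_cost (in_cost);
  std::vector<int> choice (n);

  /* Forward walk (leaves to roots): accumulate the cost of reaching
     each partition with each candidate layout.  */
  for (std::size_t p = 0; p < n; ++p)
    for (int l = 0; l < num_layouts; ++l)
      for (int q : parts[p].preds)
	{
	  int best = std::numeric_limits<int>::max ();
	  for (int l2 = 0; l2 < num_layouts; ++l2)
	    best = std::min (best, in_cost[q][l2] + edge_cost (l2, l));
	  in_cost[p][l] += best;
	}

  /* Backward walk (roots to leaves): accumulate the successor costs,
     then make the final choice for each partition.  */
  for (std::size_t p = n; p-- > 0;)
    {
      for (int l = 0; l < num_layouts; ++l)
	for (int q : parts[p].succs)
	  {
	    int best = std::numeric_limits<int>::max ();
	    for (int l2 = 0; l2 < num_layouts; ++l2)
	      best = std::min (best, out_cost[q][l2] + edge_cost (l, l2));
	    out_cost[p][l] += best;
	  }
      choice[p] = 0;
      for (int l = 1; l < num_layouts; ++l)
	if (in_cost[p][l] + out_cost[p][l]
	    < in_cost[p][choice[p]] + out_cost[p][choice[p]])
	  choice[p] = l;
    }
  return choice;
}

The real forward pass also makes a tentative layout choice for each
partition as it goes, and uses that choice when costing successor
partitions; both passes weight costs by execution frequency when
optimising for speed.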

A while back I posted a patch to extend the existing optimisation
to non-consecutive loads.  This patch doesn't include that change
(although it's a possible future addition).

The original motivation for doing this was to add a framework that would
allow other layout differences in future.  The two main ones are:

- Make it easier to represent predicated operations, including
  predicated operations with gaps.  E.g.:

     a[0] += 1;
     a[1] += 1;
     a[3] += 1;

  could be a single load/add/store for SVE.  We could handle this
  by representing a layout such as { 0, 1, _, 2 } or { 0, 1, _, 3 }
  (depending on what's being counted).  We might need to move
  elements between lanes at various points, like with permutes.

  (This would first mean adding support for stores with gaps.)

- Make it easier to switch between an even/odd and unpermuted layout
  when switching between wide and narrow elements.  E.g. if a widening
  operation produces an even vector and an odd vector, we should try
  to keep operations on the wide elements in that order rather than
  force them to be permuted back "in order".
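
  For example, in a (hypothetical) case like:

     void f(long long *__restrict a, int *__restrict b, int *__restrict c)
     {
       a[0] += (long long) b[0] * c[0];
       a[1] += (long long) b[1] * c[1];
       a[2] += (long long) b[2] * c[2];
       a[3] += (long long) b[3] * c[3];
     }

  if the target's widening multiplication produces the even products
  { 0, 2 } in one vector and the odd products { 1, 3 } in another,
  it would be better to keep the a values in layout { 0, 2, 1, 3 }
  for the additions than to permute each product vector back into
  index order.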

To give some examples of what the current patch does:

void f1(int *__restrict a, int *__restrict b, int *__restrict c,
        int *__restrict d)
{
  a[0] = (b[1] << c[3]) - d[1];
  a[1] = (b[0] << c[2]) - d[0];
  a[2] = (b[3] << c[1]) - d[3];
  a[3] = (b[2] << c[0]) - d[2];
}

continues to produce the same code as before when optimising for
speed: b, c and d are permuted at load time.  But when optimising
for size we instead permute c into the same order as b and d and
then permute the result of the arithmetic into the order required
by a, giving two permutes rather than three:

        ldr     q1, [x2]
        ldr     q0, [x1]
        ext     v1.16b, v1.16b, v1.16b, #8     // <------
        sshl    v0.4s, v0.4s, v1.4s
        ldr     q1, [x3]
        sub     v0.4s, v0.4s, v1.4s
        rev64   v0.4s, v0.4s                   // <------
        str     q0, [x0]
        ret

The following function:

void f2(int *__restrict a, int *__restrict b, int *__restrict c,
        int *__restrict d)
{
  a[0] = (b[3] << c[3]) - d[3];
  a[1] = (b[2] << c[2]) - d[2];
  a[2] = (b[1] << c[1]) - d[1];
  a[3] = (b[0] << c[0]) - d[0];
}

continues to push the reverse down to just before the store,
like the current code does.

In:

void f3(int *__restrict a, int *__restrict b, int *__restrict c,
        int *__restrict d)
{
  for (int i = 0; i < 100; ++i)
    {
      a[0] = (a[0] + c[3]);
      a[1] = (a[1] + c[2]);
      a[2] = (a[2] + c[1]);
      a[3] = (a[3] + c[0]);
      c += 4;
    }
}

the loads of a are hoisted and the stores of a are sunk, so that
only the load from c happens in the loop.  When optimising for
speed, we prefer to have the loop operate on the reversed layout,
changing on entry and exit from the loop:

        mov     x3, x0
        adrp    x0, .LC0
        add     x1, x2, 1600
        ldr     q2, [x0, #:lo12:.LC0]
        ldr     q0, [x3]
        mov     v1.16b, v0.16b
        tbl     v0.16b, {v0.16b - v1.16b}, v2.16b    // <--------
        .p2align 3,,7
.L6:
        ldr     q1, [x2], 16
        add     v0.4s, v0.4s, v1.4s
        cmp     x2, x1
        bne     .L6
        mov     v1.16b, v0.16b
        adrp    x0, .LC0
        ldr     q2, [x0, #:lo12:.LC0]
        tbl     v0.16b, {v0.16b - v1.16b}, v2.16b    // <--------
        str     q0, [x3]
        ret

Similarly, for the very artificial testcase:

void f4(int *__restrict a, int *__restrict b, int *__restrict c,
        int *__restrict d)
{
  int a0 = a[0];
  int a1 = a[1];
  int a2 = a[2];
  int a3 = a[3];
  for (int i = 0; i < 100; ++i)
    {
      a0 ^= c[0];
      a1 ^= c[1];
      a2 ^= c[2];
      a3 ^= c[3];
      c += 4;
      for (int j = 0; j < 100; ++j)
	{
	  a0 += d[1];
	  a1 += d[0];
	  a2 += d[3];
	  a3 += d[2];
	  d += 4;
	}
      b[0] = a0;
      b[1] = a1;
      b[2] = a2;
      b[3] = a3;
      b += 4;
    }
  a[0] = a0;
  a[1] = a1;
  a[2] = a2;
  a[3] = a3;
}

the a vector in the inner loop maintains the order { 1, 0, 3, 2 },
even though it's part of an SCC that includes the outer loop.
In other words, this is a motivating case for not assigning
permutes at SCC granularity.  The code we get is:

        ldr     q0, [x0]
        mov     x4, x1
        mov     x5, x0
        add     x1, x3, 1600
        add     x3, x4, 1600
        .p2align 3,,7
.L11:
        ldr     q1, [x2], 16
        sub     x0, x1, #1600
        eor     v0.16b, v1.16b, v0.16b
        rev64   v0.4s, v0.4s              // <---
        .p2align 3,,7
.L10:
        ldr     q1, [x0], 16
        add     v0.4s, v0.4s, v1.4s
        cmp     x0, x1
        bne     .L10
        rev64   v0.4s, v0.4s              // <---
        add     x1, x0, 1600
        str     q0, [x4], 16
        cmp     x3, x4
        bne     .L11
        str     q0, [x5]
        ret

The patch is still incomplete, hence the FIXMEs, but it's at the stage
where it passes bootstrap & regtest on aarch64-linux-gnu.  (There's one
regression in slp-11b.c, but I think it's arguably restoring the
expected load-lanes behaviour.)  Does this look like a plausible
direction, before I go filling in the FIXME parts?

Thanks,
Richard

---
 gcc/graphds.cc        |   13 +-
 gcc/graphds.h         |    3 +-
 gcc/hash-map-traits.h |   74 +-
 gcc/hash-traits.h     |   96 ++-
 gcc/tree-vect-slp.cc  | 1870 ++++++++++++++++++++++++++++++++---------
 5 files changed, 1599 insertions(+), 457 deletions(-)

*** /tmp/iBbApg_graphds.cc	2022-08-02 08:16:23.975354155 +0100
--- gcc/graphds.cc	2022-08-01 20:44:22.287691305 +0100
***************
*** 281,287 ****
     numbers assigned by the previous pass.  If SUBGRAPH is not NULL, it
     specifies the subgraph of G whose strongly connected components we want
     to determine.  If SKIP_EDGE_P is not NULL, it points to a callback function.
!    Edge E will be skipped if callback function returns true.
  
     After running this function, v->component is the number of the strongly
     connected component for each vertex of G.  Returns the number of the
--- 281,294 ----
     numbers assigned by the previous pass.  If SUBGRAPH is not NULL, it
     specifies the subgraph of G whose strongly connected components we want
     to determine.  If SKIP_EDGE_P is not NULL, it points to a callback function.
!    Edge E will be skipped if callback function returns true.  If SCC_GROUPING
!    is not null, the nodes will be added to it in the following order:
! 
!    - If SCC A is a direct or indirect predecessor of SCC B in the SCC dag,
!      A's nodes come before B's nodes.
! 
!    - All of an SCC's nodes are listed consecutively, although the order
!      of the nodes within an SCC is not really meaningful.
  
     After running this function, v->component is the number of the strongly
     connected component for each vertex of G.  Returns the number of the
***************
*** 289,295 ****
  
  int
  graphds_scc (struct graph *g, bitmap subgraph,
! 	     skip_edge_callback skip_edge_p)
  {
    int *queue = XNEWVEC (int, g->n_vertices);
    vec<int> postorder = vNULL;
--- 296,302 ----
  
  int
  graphds_scc (struct graph *g, bitmap subgraph,
! 	     skip_edge_callback skip_edge_p, vec<int> *scc_grouping)
  {
    int *queue = XNEWVEC (int, g->n_vertices);
    vec<int> postorder = vNULL;
***************
*** 317,323 ****
  
    for (i = 0; i < nq; i++)
      queue[i] = postorder[nq - i - 1];
!   comp = graphds_dfs (g, queue, nq, NULL, true, subgraph, skip_edge_p);
  
    free (queue);
    postorder.release ();
--- 324,330 ----
  
    for (i = 0; i < nq; i++)
      queue[i] = postorder[nq - i - 1];
!   comp = graphds_dfs (g, queue, nq, scc_grouping, true, subgraph, skip_edge_p);
  
    free (queue);
    postorder.release ();
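
(For reference, the new SCC_GROUPING argument is used by the
tree-vect-slp.cc changes below along these lines:

   auto_vec<int> scc_grouping;
   unsigned int num_sccs = graphds_scc (m_slpg, NULL, NULL, &scc_grouping);

after which walking scc_grouping in order visits whole SCCs at a time,
predecessor SCCs first.)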
*** /tmp/eRUAEk_graphds.h	2022-08-02 08:16:23.987354123 +0100
--- gcc/graphds.h	2022-08-01 20:44:18.835700593 +0100
***************
*** 58,64 ****
  typedef bool (*skip_edge_callback) (struct graph_edge *);
  int graphds_dfs (struct graph *, int *, int,
  		 vec<int> *, bool, bitmap, skip_edge_callback = NULL);
! int graphds_scc (struct graph *, bitmap, skip_edge_callback = NULL);
  void graphds_domtree (struct graph *, int, int *, int *, int *);
  typedef void (*graphds_edge_callback) (struct graph *,
  				       struct graph_edge *, void *);
--- 58,65 ----
  typedef bool (*skip_edge_callback) (struct graph_edge *);
  int graphds_dfs (struct graph *, int *, int,
  		 vec<int> *, bool, bitmap, skip_edge_callback = NULL);
! int graphds_scc (struct graph *, bitmap, skip_edge_callback = NULL,
! 		 vec<int> * = NULL);
  void graphds_domtree (struct graph *, int, int *, int *, int *);
  typedef void (*graphds_edge_callback) (struct graph *,
  				       struct graph_edge *, void *);
*** /tmp/4zWIek_hash-map-traits.h	2022-08-02 08:16:23.995354102 +0100
--- gcc/hash-map-traits.h	2022-08-01 20:44:22.287691305 +0100
***************
*** 105,118 ****
    static const bool maybe_mx = false;
  };
  
! /* Implement traits for a hash_map with values of type Value for cases
!    in which the key cannot represent empty and deleted slots.  Instead
!    record empty and deleted entries in Value.  Derived classes must
!    implement the hash and equal_keys functions.  */
  
! template <typename Value>
  struct unbounded_hashmap_traits
  {
    template <typename T> static inline void remove (T &);
    static const bool empty_zero_p = default_hash_traits <Value>::empty_zero_p;
    template <typename T> static inline bool is_empty (const T &);
--- 105,123 ----
    static const bool maybe_mx = false;
  };
  
! /* Implement traits for a hash_map with keys of type Key and values of
!    type Value for cases in which the key cannot represent empty and
!    deleted slots.  Instead record empty and deleted entries in Value.  */
  
! template <typename Key, typename Value>
  struct unbounded_hashmap_traits
  {
+   typedef typename Key::value_type key_type;
+ 
+   static hashval_t hash (const typename Key::value_type &);
+   static bool equal_keys (const typename Key::value_type &,
+ 			  const typename Key::compare_type &);
+ 
    template <typename T> static inline void remove (T &);
    static const bool empty_zero_p = default_hash_traits <Value>::empty_zero_p;
    template <typename T> static inline bool is_empty (const T &);
***************
*** 121,162 ****
    template <typename T> static inline void mark_deleted (T &);
  };
  
! template <typename Value>
  template <typename T>
  inline void
! unbounded_hashmap_traits <Value>::remove (T &entry)
  {
    default_hash_traits <Value>::remove (entry.m_value);
  }
  
! template <typename Value>
  template <typename T>
  inline bool
! unbounded_hashmap_traits <Value>::is_empty (const T &entry)
  {
    return default_hash_traits <Value>::is_empty (entry.m_value);
  }
  
! template <typename Value>
  template <typename T>
  inline bool
! unbounded_hashmap_traits <Value>::is_deleted (const T &entry)
  {
    return default_hash_traits <Value>::is_deleted (entry.m_value);
  }
  
! template <typename Value>
  template <typename T>
  inline void
! unbounded_hashmap_traits <Value>::mark_empty (T &entry)
  {
    default_hash_traits <Value>::mark_empty (entry.m_value);
  }
  
! template <typename Value>
  template <typename T>
  inline void
! unbounded_hashmap_traits <Value>::mark_deleted (T &entry)
  {
    default_hash_traits <Value>::mark_deleted (entry.m_value);
  }
--- 126,184 ----
    template <typename T> static inline void mark_deleted (T &);
  };
  
! template <typename Key, typename Value>
! inline hashval_t
! unbounded_hashmap_traits <Key, Value>
! ::hash (const typename Key::value_type &key)
! {
!   return Key::hash (key);
! }
! 
! template <typename Key, typename Value>
! inline bool
! unbounded_hashmap_traits <Key, Value>
! ::equal_keys (const typename Key::value_type &x,
! 	      const typename Key::compare_type &y)
! {
!   return Key::equal (x, y);
! }
! 
! template <typename Key, typename Value>
  template <typename T>
  inline void
! unbounded_hashmap_traits <Key, Value>::remove (T &entry)
  {
    default_hash_traits <Value>::remove (entry.m_value);
  }
  
! template <typename Key, typename Value>
  template <typename T>
  inline bool
! unbounded_hashmap_traits <Key, Value>::is_empty (const T &entry)
  {
    return default_hash_traits <Value>::is_empty (entry.m_value);
  }
  
! template <typename Key, typename Value>
  template <typename T>
  inline bool
! unbounded_hashmap_traits <Key, Value>::is_deleted (const T &entry)
  {
    return default_hash_traits <Value>::is_deleted (entry.m_value);
  }
  
! template <typename Key, typename Value>
  template <typename T>
  inline void
! unbounded_hashmap_traits <Key, Value>::mark_empty (T &entry)
  {
    default_hash_traits <Value>::mark_empty (entry.m_value);
  }
  
! template <typename Key, typename Value>
  template <typename T>
  inline void
! unbounded_hashmap_traits <Key, Value>::mark_deleted (T &entry)
  {
    default_hash_traits <Value>::mark_deleted (entry.m_value);
  }
***************
*** 166,190 ****
     slots.  */
  
  template <typename Key, typename Value>
! struct unbounded_int_hashmap_traits : unbounded_hashmap_traits <Value>
! {
!   typedef Key key_type;
!   static inline hashval_t hash (Key);
!   static inline bool equal_keys (Key, Key);
! };
! 
! template <typename Key, typename Value>
! inline hashval_t
! unbounded_int_hashmap_traits <Key, Value>::hash (Key k)
! {
!   return k;
! }
! 
! template <typename Key, typename Value>
! inline bool
! unbounded_int_hashmap_traits <Key, Value>::equal_keys (Key k1, Key k2)
! {
!   return k1 == k2;
! }
  
  #endif // HASH_MAP_TRAITS_H
--- 188,194 ----
     slots.  */
  
  template <typename Key, typename Value>
! using unbounded_int_hashmap_traits
!   = unbounded_hashmap_traits <int_hash_base <Key>, Value>;
  
  #endif // HASH_MAP_TRAITS_H
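
(Together with the int_hash_base and vec_free_hash_base traits added
to hash-traits.h below, this is used by the tree-vect-slp.cc changes
to map permutation vectors to layout indices:

   using perm_hash = unbounded_hashmap_traits<
     vec_free_hash_base<int_hash_base<unsigned>>,
     int_hash<int, -1, -2>>;
   hash_map<vec<unsigned>, int, perm_hash> layout_ids;

where -1 and -2 are the spare "empty" and "deleted" values recorded in
the int payload.)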
*** /tmp/FL7vGb_hash-traits.h	2022-08-02 08:16:24.011354061 +0100
--- gcc/hash-traits.h	2022-08-01 20:44:22.287691305 +0100
***************
*** 85,125 ****
  {
  }
  
  
! /* Hasher for integer type Type in which Empty is a spare value that can be
!    used to mark empty slots.  If Deleted != Empty then Deleted is another
!    spare value that can be used for deleted slots; if Deleted == Empty then
!    hash table entries cannot be deleted.  */
! 
! template <typename Type, Type Empty, Type Deleted = Empty>
! struct int_hash : typed_noop_remove <Type>
  {
    typedef Type value_type;
    typedef Type compare_type;
  
    static inline hashval_t hash (value_type);
    static inline bool equal (value_type existing, value_type candidate);
-   static inline void mark_deleted (Type &);
-   static const bool empty_zero_p = Empty == 0;
-   static inline void mark_empty (Type &);
-   static inline bool is_deleted (Type);
-   static inline bool is_empty (Type);
  };
  
! template <typename Type, Type Empty, Type Deleted>
  inline hashval_t
! int_hash <Type, Empty, Deleted>::hash (value_type x)
  {
    return x;
  }
  
! template <typename Type, Type Empty, Type Deleted>
  inline bool
! int_hash <Type, Empty, Deleted>::equal (value_type x, value_type y)
  {
    return x == y;
  }
  
  template <typename Type, Type Empty, Type Deleted>
  inline void
  int_hash <Type, Empty, Deleted>::mark_deleted (Type &x)
--- 85,135 ----
  {
  }
  
+ /* Base traits for integer type Type, providing just the hash and
+    comparison functionality.  */
  
! template <typename Type>
! struct int_hash_base : typed_noop_remove <Type>
  {
    typedef Type value_type;
    typedef Type compare_type;
  
    static inline hashval_t hash (value_type);
    static inline bool equal (value_type existing, value_type candidate);
  };
  
! template <typename Type>
  inline hashval_t
! int_hash_base <Type>::hash (value_type x)
  {
    return x;
  }
  
! template <typename Type>
  inline bool
! int_hash_base <Type>::equal (value_type x, value_type y)
  {
    return x == y;
  }
  
+ /* Hasher for integer type Type in which Empty is a spare value that can be
+    used to mark empty slots.  If Deleted != Empty then Deleted is another
+    spare value that can be used for deleted slots; if Deleted == Empty then
+    hash table entries cannot be deleted.  */
+ 
+ template <typename Type, Type Empty, Type Deleted = Empty>
+ struct int_hash : int_hash_base <Type>
+ {
+   typedef Type value_type;
+   typedef Type compare_type;
+ 
+   static inline void mark_deleted (Type &);
+   static const bool empty_zero_p = Empty == 0;
+   static inline void mark_empty (Type &);
+   static inline bool is_deleted (Type);
+   static inline bool is_empty (Type);
+ };
+ 
  template <typename Type, Type Empty, Type Deleted>
  inline void
  int_hash <Type, Empty, Deleted>::mark_deleted (Type &x)
***************
*** 398,403 ****
--- 408,467 ----
    return T1::is_empty (x.first);
  }
  
+ /* Base traits for vectors, providing just the hash and comparison
+    functionality.  Type gives the traits for the element type.  */
+ 
+ template <typename Type>
+ struct vec_hash_base
+ {
+   typedef vec<typename Type::value_type> value_type;
+   typedef vec<typename Type::compare_type> compare_type;
+ 
+   static inline hashval_t hash (value_type);
+   static inline bool equal (value_type, compare_type);
+ };
+ 
+ template <typename Type>
+ inline hashval_t
+ vec_hash_base <Type>::hash (value_type x)
+ {
+   inchash::hash hstate;
+   hstate.add_int (x.length ());
+   for (auto &value : x)
+     hstate.merge_hash (Type::hash (value));
+   return hstate.end ();
+ }
+ 
+ template <typename Type>
+ inline bool
+ vec_hash_base <Type>::equal (value_type x, compare_type y)
+ {
+   if (x.length () != y.length ())
+     return false;
+   for (unsigned int i = 0; i < x.length (); ++i)
+     if (!Type::equal (x[i], y[i]))
+       return false;
+   return true;
+ }
+ 
+ /* Traits for vectors whose contents should be freed normally.  */
+ 
+ template <typename Type>
+ struct vec_free_hash_base : vec_hash_base <Type>
+ {
+   static void remove (typename vec_hash_base <Type>::value_type &);
+ };
+ 
+ template <typename Type>
+ void
+ vec_free_hash_base <Type>
+ ::remove (typename vec_hash_base <Type>::value_type &x)
+ {
+   for (auto &value : x)
+     Type::remove (value);
+   x.release ();
+ }
+ 
  template <typename T> struct default_hash_traits : T {};
  
  template <typename T>
*** /tmp/5ek0ya_tree-vect-slp.cc	2022-08-02 08:16:24.027354017 +0100
--- gcc/tree-vect-slp.cc	2022-08-01 20:44:22.631690380 +0100
***************
*** 20,25 ****
--- 20,26 ----
  <http://www.gnu.org/licenses/>.  */
  
  #include "config.h"
+ #define INCLUDE_ALGORITHM
  #include "system.h"
  #include "coretypes.h"
  #include "backend.h"
***************
*** 49,54 ****
--- 50,57 ----
  #include "tree-eh.h"
  #include "tree-cfg.h"
  #include "alloc-pool.h"
+ #include "sreal.h"
+ #include "predict.h"
  
  static bool vectorizable_slp_permutation (vec_info *, gimple_stmt_iterator *,
  					  slp_tree, stmt_vector_for_cost *);
***************
*** 304,309 ****
--- 307,322 ----
    oprnds_info.release ();
  }
  
+ /* Return the execution frequency of NODE (so that a higher value indicates
+    a "more important" node when optimizing for speed).  */
+ 
+ static sreal
+ vect_slp_node_weight (slp_tree node)
+ {
+   stmt_vec_info stmt_info = vect_orig_stmt (SLP_TREE_REPRESENTATIVE (node));
+   basic_block bb = gimple_bb (stmt_info->stmt);
+   return bb->count.to_sreal_scale (ENTRY_BLOCK_PTR_FOR_FN (cfun)->count);
+ }
  
  /* Return true if STMTS contains a pattern statement.  */
  
***************
*** 3531,3559 ****
    return opt_result::success ();
  }
  
! struct slpg_vertex
  {
!   slpg_vertex (slp_tree node_)
!     : node (node_), perm_in (-1), perm_out (-1) {}
  
!   int get_perm_materialized () const
!     { return perm_in != perm_out ? perm_in : 0; }
  
    slp_tree node;
!   /* The common permutation on the incoming lanes (towards SLP children).  */
!   int perm_in;
!   /* The permutation on the outgoing lanes (towards SLP parents).  When
!      the node is a materialization point for a permute this differs
!      from perm_in (and is then usually zero).  Materialization happens
!      on the input side.  */
!   int perm_out;
  };
  
  /* Fill the vertices and leafs vector with all nodes in the SLP graph.  */
  
! static void
! vect_slp_build_vertices (hash_set<slp_tree> &visited, slp_tree node,
! 			 vec<slpg_vertex> &vertices, vec<int> &leafs)
  {
    unsigned i;
    slp_tree child;
--- 3544,3994 ----
    return opt_result::success ();
  }
  
! /* Estimates the cost of inserting layout changes into the SLP graph.  */
! 
! struct slpg_layout_cost
! {
!   slpg_layout_cost () = default;
!   slpg_layout_cost (sreal, bool);
! 
!   bool operator== (const slpg_layout_cost &) const;
!   bool operator!= (const slpg_layout_cost &) const;
! 
!   bool is_better_than (const slpg_layout_cost &, bool) const;
! 
!   void add_parallel_cost (const slpg_layout_cost &);
!   void add_serial_cost (const slpg_layout_cost &);
!   void split (unsigned int);
! 
!   /* The longest sequence of layout changes needed during any traversal
!      of the partition dag, weighted by execution frequency.
! 
!      This is the most important metric when optimizing for speed, since
!      it helps to ensure that we keep the number of operations on
!      critical paths to a minimum.  */
!   sreal depth = 0;
! 
!   /* An estimate of the total number of operations needed.  It is weighted by
!      execution frequency when optimizing for speed but not when optimizing for
!      size.  In order to avoid double-counting, a node with a fanout of N will
!      distribute 1/N of its total cost to each successor.
! 
!      This is the most important metric when optimizing for size, since
!      it helps to keep the total number of operations to a minimum.  */
!   sreal total = 0;
! };
! 
! /* Construct costs for a node with weight WEIGHT.  A higher weight
!    indicates more frequent execution.  IS_FOR_SIZE is true if we are
!    optimizing for size rather than speed.  */
! 
! slpg_layout_cost::slpg_layout_cost (sreal weight, bool is_for_size)
!   : depth (weight), total (is_for_size && weight > 0 ? 1 : weight)
! {
! }
! 
! bool
! slpg_layout_cost::operator== (const slpg_layout_cost &other) const
! {
!   return depth == other.depth && total == other.total;
! }
! 
! bool
! slpg_layout_cost::operator!= (const slpg_layout_cost &other) const
! {
!   return !operator== (other);
! }
! 
! /* Return true if these costs are better than OTHER.  IS_FOR_SIZE is
!    true if we are optimizing for size rather than speed.  */
! 
! bool
! slpg_layout_cost::is_better_than (const slpg_layout_cost &other,
! 				  bool is_for_size) const
! {
!   if (is_for_size)
!     {
!       if (total != other.total)
! 	return total < other.total;
!       return depth < other.depth;
!     }
!   else
!     {
!       if (depth != other.depth)
! 	return depth < other.depth;
!       return total < other.total;
!     }
! }
! 
! /* Increase the costs to account for something with cost INPUT_COST
!    happening in parallel with the current costs.  */
! 
! void
! slpg_layout_cost::add_parallel_cost (const slpg_layout_cost &input_cost)
  {
!   depth = std::max (depth, input_cost.depth);
!   total += input_cost.total;
! }
! 
! /* Increase the costs to account for something with cost INPUT_COST
!    happening in series with the current costs.  */
! 
! void
! slpg_layout_cost::add_serial_cost (const slpg_layout_cost &other)
! {
!   depth += other.depth;
!   total += other.total;
! }
! 
! /* Split the total cost among TIMES successors or predecessors.  */
  
! void
! slpg_layout_cost::split (unsigned int times)
! {
!   if (times > 1)
!     total /= times;
! }
! 
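(To make the depth/total metrics concrete, a worked example rather than
part of the patch: suppose one input to a node has accumulated cost
{ depth 2, total 2 } and another { depth 3, total 1 }.  add_parallel_cost
combines them into { depth 3, total 3 }, and adding a serial
layout-change cost of { 1, 1 } then gives { depth 4, total 4 }.  If the
node's result has two users, split (2) halves the total to 2, so that
the two outgoing edges together still account for a total of 4 while
each sees the full depth of 4.)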
! /* Information about one node in the SLP graph, for use during
!    vect_optimize_slp_pass.  */
! 
! struct slpg_vertex
! {
!   slpg_vertex (slp_tree node_) : node (node_) {}
  
+   /* The node itself.  */
    slp_tree node;
! 
!   /* Which partition the node belongs to, or -1 if none.  */
!   int partition = -1;
! 
!   /* The number of nodes that directly use the result of this one
!      (i.e. the number of nodes that count this one as a child).  */
!   unsigned int out_degree = 0;
! 
!   /* The execution frequency of the node.  */
!   sreal weight = 0;
! 
!   /* The total execution frequency of all nodes that directly use the
!      result of this one.  */
!   sreal out_weight = 0;
! };
! 
! /* Information about one partition of the SLP graph, for use during
!    vect_optimize_slp_pass.  */
! 
! struct slpg_partition_info
! {
!   /* The nodes in the partition occupy indices [NODE_BEGIN, NODE_END)
!      of m_partitioned_nodes.  */
!   unsigned int node_begin = 0;
!   unsigned int node_end = 0;
! 
!   /* Which layout we've chosen to use for this partition, or -1 if
!      we haven't picked one yet.  */
!   int layout = -1;
! 
!   /* The number of predecessors and successors in the partition dag.
!      The predecessors always have lower partition numbers and the
!      successors always have higher partition numbers.
! 
!      Note that the directions of these edges are not necessarily the
!      same as in the data flow graph.  For example, if an SCC contains
!      an inner and an outer loop, the inner loop's partition will have
!      at least two incoming edges from the outer loop's partition:
!      one for a live-in value and one for a live-out value.  In data
!      flow terms, the live-in edge would also be from the outer loop
!      to the inner loop, but the live-out edge would be from the inner
!      loop to the outer loop.  */
!   unsigned int in_degree = 0;
!   unsigned int out_degree = 0;
! };
! 
! /* Information about the costs of using a particular layout for a
!    particular partition.  It can also say that the combination is
!    invalid.  */
! 
! struct slpg_partition_layout_costs
! {
!   bool is_possible () const { return cost.total != sreal::min (); }
!   void mark_impossible () { cost.total = sreal::min (); }
! 
!   /* The costs inherited from predecessor partitions plus the inherent
!      cost of the layout within the node itself.  */
!   slpg_layout_cost in_cost;
! 
!   /* The costs inherited from successor partitions.  */
!   slpg_layout_cost out_cost;
! 
!   /* The sum of the above, when used.  */
!   slpg_layout_cost cost;
! };
! 
! /* This class tries to optimize the layout of vectors in order to avoid
!    unnecessary shuffling.  At the moment, the set of possible layouts is
!    restricted to bijective permutes.
! 
!    The goal of the pass depends on whether we're optimizing for size or
!    for speed.  When optimizing for size, the goal is to reduce the overall
!    number of layout changes (including layout changes implied by things
!    like load permutes).  When optimizing for speed, the goal is to
!    reduce the maximum latency attributable to layout changes on any
!    non-cyclic path through the data flow graph.
! 
!    For example, when optimizing a loop nest for speed, we will prefer
!    to make layout changes outside of a loop rather than inside of a loop,
!    and will prefer to make layout changes in parallel rather than serially,
!    even if that increases the overall number of layout changes.
! 
!    The high-level procedure is:
! 
!    (1) Build a graph in which edges go from uses (parents) to definitions
!        (children).
! 
!    (2) Divide the graph into a dag of strongly-connected components (SCCs).
! 
!    (3) Partition the nodes in each SCC based on their containing cfg loop,
!        creating a dag of partitions.  The goal is now to assign a layout
!        to each partition.
! 
!    (4) Construct a set of vector layouts that are worth considering and
!        calculate the local cost of using each layout in each partition.
!        Some partitions might not allow a given layout at all.
! 
!    (5) Perform a forward walk over the partition dag (from loads to stores)
!        accumulating the "forward" cost of using each layout.  When visiting
!        each partition, assign a tentative choice of layout to the partition
!        and use that choice when calculating the cost of using a different
!        layout in successor partitions.
! 
!    (6) Perform a backward walk over the partition dag (from stores to loads),
!        accumulating the "backward" cost of using each layout.  When visiting
!        each partition, make a final choice of layout for that partition based
!        on the accumulated forward costs (from (5)) and backward costs
!        (from (6)).
! 
!    (7) Apply the chosen layouts to the SLP graph.
! 
!    For example, consider the SLP statements:
! 
!    S1:      a_1 = load
!        loop:
!    S2:      a_2 = PHI<a_1, a_3>
!    S3:      b_1 = load
!    S4:      a_3 = a_2 + b_1
!        exit:
!    S5:      a_4 = PHI<a_3>
!    S6:      store a_4
! 
!    S2 and S4 form an SCC and are part of the same loop.  Every other
!    statement is in a singleton SCC.  In this example there is a one-to-one
!    mapping between SCCs and partitions and the partition dag looks like this:
! 
! 	S1     S3
! 	 \     /
! 	  S2+S4
! 	    |
! 	   S5
! 	    |
! 	   S6
! 
!    S2, S3 and S4 will have a higher execution frequency than the other
!    statements, so when optimizing for speed, the goal is to avoid any
!    layout changes:
! 
!    - within S3
!    - within S2+S4
!    - on the S3->S2+S4 edge
! 
!    For example, if S3 was originally a reversing load, the goal of the
!    pass is to make it an unreversed load and change the layout on the
!    S1->S2+S4 and S2+S4->S5 edges to compensate.  (Changing the layout
!    on S1->S2+S4 and S5->S6 would also be acceptable.)
! 
!    The difference between SCCs and partitions becomes important if we
!    add an outer loop:
! 
!    S1:      a_1 = ...
!        loop1:
!    S2:      a_2 = PHI<a_1, a_6>
!    S3:      b_1 = load
!    S4:      a_3 = a_2 + b_1
!        loop2:
!    S5:      a_4 = PHI<a_3, a_5>
!    S6:      c_1 = load
!    S7:      a_5 = a_4 + c_1
!        exit2:
!    S8:      a_6 = PHI<a_5>
!    S9:      store a_6
!        exit1:
! 
!    Here, S2, S4, S5, S7 and S8 form a single SCC.  However, we usually
!    do not want restrictions in the outer loop to "infect" the decision
!    for the inner loop.  For example, if an outer-loop node in the SCC
!    contains a statement with a fixed layout, that should not prevent the
!    inner loop from using a different layout.  Conversely, the inner loop
!    should not dictate a layout to the outer loop: if the outer loop does
!    a lot of computation, then it may not be efficient to do all of that
!    computation in the inner loop's preferred layout.
! 
!    We therefore partition the SCC into S2+S4+S8 (outer) and S5+S7 (inner).
!    We also try to arrange partitions so that:
! 
!    - the partition for an outer loop comes before the partition for
!      an inner loop
! 
!    - if a sibling loop A dominates a sibling loop B, A's partition
!      comes before B's
! 
!    So in this case, the partition dag is:
! 
! 	S1        S3
! 	 \        /
! 	  S2+S4+S8   S6
! 	   |   \\    /
! 	   |    S5+S7
! 	   |
! 	  S9
! 
!    There are two edges from S2+S4+S8 to S5+S7: one for the edge S4->S5 and
!    one for a reversal of the edge S7->S8.
! 
!    The backward walk picks a layout for S5+S7 before S2+S4+S8.  The choice
!    for S2+S4+S8 therefore has to balance the cost of using the outer loop's
!    preferred layout against the cost of changing the layout on entry to the
!    inner loop (S4->S5) and on exit from the inner loop (S7->S8 reversed).
! 
!    ??? Although this works well when optimizing for speed, it has the
!    downside when optimizing for size that the choice of layout for S5+S7
!    is completely independent of S9.  Perhaps we should not partition SCCs
!    when optimizing for size?
! 
!    To give a concrete example of the difference between optimizing
!    for size and speed, consider:
! 
!    a[0] = (b[1] << c[3]) - d[1];
!    a[1] = (b[0] << c[2]) - d[0];
!    a[2] = (b[3] << c[1]) - d[3];
!    a[3] = (b[2] << c[0]) - d[2];
! 
!    There are three different layouts here: one for a, one for b and d,
!    and one for c.  When optimizing for speed it is better to permute
!    each of b, c and d into the order required by a, since those permutes
!    happen in parallel.  But when optimizing for size, it is better to:
! 
!    - permute c into the same order as b
!    - do the arithmetic
!    - permute the result into the order required by a
! 
!    This gives 2 permutes rather than 3.  */
! 
! class vect_optimize_slp_pass
! {
! public:
!   vect_optimize_slp_pass (vec_info *vinfo) : m_vinfo (vinfo) {}
!   void run ();
! 
! private:
!   /* Graph building.  */
!   struct loop *containing_loop (slp_tree);
!   bool is_cfg_latch_edge (graph_edge *);
!   void build_vertices (hash_set<slp_tree> &, slp_tree);
!   void build_vertices ();
!   void build_graph ();
! 
!   /* Partitioning.  */
!   void create_partitions ();
!   template<typename T> void for_each_partition_edge (unsigned int, T);
! 
!   /* Layout selection.  */
!   bool can_change_layouts (unsigned int, unsigned int);
!   bool layout_is_reversible (unsigned int);
!   unsigned int layout_cost (unsigned int, unsigned int);
!   slpg_partition_layout_costs &partition_layout_costs (unsigned int,
! 						       unsigned int);
!   bool partition_can_absorb_permutes (unsigned int);
!   int node_absorption_cost (slp_tree, unsigned int);
!   void start_choosing_layouts ();
! 
!   /* Cost propagation.  */
!   slpg_layout_cost edge_layout_cost (graph_edge *, unsigned int,
! 				     unsigned int, unsigned int);
!   slpg_layout_cost forward_cost (graph_edge *, unsigned int, unsigned int);
!   slpg_layout_cost backward_cost (graph_edge *, unsigned int, unsigned int);
!   void forward_pass ();
!   void backward_pass ();
! 
!   /* Rematerialization.  */
!   slp_tree get_result_with_layout (slp_tree, unsigned int);
!   void materialize ();
! 
!   /* Clean-up.  */
!   void remove_redundant_permutes ();
! 
!   void dump ();
! 
!   vec_info *m_vinfo;
! 
!   /* True if we should optimize the graph for size, false if we should
!      optimize it for speed.  (It wouldn't be easy to make this decision
!      more locally.)  */
!   bool m_optimize_size;
! 
!   /* A graph of all SLP nodes, with edges leading from uses to definitions.
!      In other words, a node's predecessors are its slp_tree parents and
!      a node's successors are its slp_tree children.  */
!   graph *m_slpg = nullptr;
! 
!   /* The vertices of M_SLPG, indexed by slp_tree::vertex.  */
!   auto_vec<slpg_vertex> m_vertices;
! 
!   /* The list of all leaves of M_SLPG, such as external definitions, constants,
!      and loads.  */
!   auto_vec<int> m_leafs;
! 
!   /* This array has one entry for every vector layout that we're considering.
!      Element 0 is null and indicates "no change".  Other entries describe
!      permutes that are inherent in the current graph and that we would
!      like to reverse if possible.
! 
!      For example, a permute { 1, 2, 3, 0 } means that something has
!      effectively been permuted in that way, such as a load group
!      { a[1], a[2], a[3], a[0] } (viewed as a permutation of a[0:3]).
!      We'd then like to apply the reverse permute { 3, 0, 1, 2 }
!      in order to put things "back" in order.  */
!   auto_vec<vec<unsigned> > m_perms;
! 
!   /* A partitioning of the nodes for which a layout must be chosen.
!      Each partition represents an <SCC, cfg loop> pair; that is,
!      nodes in different SCCs belong to different partitions, and nodes
!      within an SCC are further partitioned according to the containing
!      cfg loop.  Partition <SCC1, L1> comes before <SCC2, L2> if:
! 
!      - SCC1 != SCC2 and SCC1 is a predecessor of SCC2 in a forward walk
!        from leaves (such as loads) to roots (such as stores).
! 
!      - SCC1 == SCC2 and L1's header strictly dominates L2's header.  */
!   auto_vec<slpg_partition_info> m_partitions;
! 
!   /* The list of all nodes for which a layout must be chosen.  Nodes for
!      partition P come before the nodes for partition P+1.  Nodes within a
!      partition are in reverse postorder.  */
!   auto_vec<unsigned int> m_partitioned_nodes;
! 
!   /* Index P * num-layouts + L contains the cost of using layout L
!      for partition P.  */
!   auto_vec<slpg_partition_layout_costs> m_partition_layout_costs;
! 
!   /* Index N * num-layouts + L, if nonnull, is a node that provides the
!      original output of node N adjusted to have layout L.  */
!   auto_vec<slp_tree> m_node_layouts;
  };
  
  /* Fill the vertices and leafs vector with all nodes in the SLP graph.  */
  
! void
! vect_optimize_slp_pass::build_vertices (hash_set<slp_tree> &visited,
! 					slp_tree node)
  {
    unsigned i;
    slp_tree child;
***************
*** 3561,3568 ****
    if (visited.add (node))
      return;
  
!   node->vertex = vertices.length ();
!   vertices.safe_push (slpg_vertex (node));
  
    bool leaf = true;
    bool force_leaf = false;
--- 3996,4003 ----
    if (visited.add (node))
      return;
  
!   node->vertex = m_vertices.length ();
!   m_vertices.safe_push (slpg_vertex (node));
  
    bool leaf = true;
    bool force_leaf = false;
***************
*** 3570,3576 ****
      if (child)
        {
  	leaf = false;
! 	vect_slp_build_vertices (visited, child, vertices, leafs);
        }
      else
        force_leaf = true;
--- 4005,4011 ----
      if (child)
        {
  	leaf = false;
! 	build_vertices (visited, child);
        }
      else
        force_leaf = true;
***************
*** 3580,3600 ****
       and inductions.  Force those SLP PHIs to act as leafs to make them
       backwards reachable.  */
    if (leaf || force_leaf)
!     leafs.safe_push (node->vertex);
  }
  
  /* Fill the vertices and leafs vector with all nodes in the SLP graph.  */
  
! static void
! vect_slp_build_vertices (vec_info *info, vec<slpg_vertex> &vertices,
! 			 vec<int> &leafs)
  {
    hash_set<slp_tree> visited;
    unsigned i;
    slp_instance instance;
!   FOR_EACH_VEC_ELT (info->slp_instances, i, instance)
!     vect_slp_build_vertices (visited, SLP_INSTANCE_TREE (instance), vertices,
! 			     leafs);
  }
  
  /* Apply (reverse) bijective PERM to VEC.  */
--- 4015,4033 ----
       and inductions.  Force those SLP PHIs to act as leafs to make them
       backwards reachable.  */
    if (leaf || force_leaf)
!     m_leafs.safe_push (node->vertex);
  }
  
  /* Fill the vertices and leafs vector with all nodes in the SLP graph.  */
  
! void
! vect_optimize_slp_pass::build_vertices ()
  {
    hash_set<slp_tree> visited;
    unsigned i;
    slp_instance instance;
!   FOR_EACH_VEC_ELT (m_vinfo->slp_instances, i, instance)
!     build_vertices (visited, SLP_INSTANCE_TREE (instance));
  }
  
  /* Apply (reverse) bijective PERM to VEC.  */
***************
*** 3625,3699 ****
      }
  }
  
! /* Return whether permutations PERM_A and PERM_B as recorded in the
!    PERMS vector are equal.  */
  
! static bool
! vect_slp_perms_eq (const vec<vec<unsigned> > &perms,
! 		   int perm_a, int perm_b)
  {
!   return (perm_a == perm_b
! 	  || (perm_a != -1 && perm_b != -1
! 	      && perms[perm_a].length () == perms[perm_b].length ()
! 	      && memcmp (&perms[perm_a][0], &perms[perm_b][0],
! 			 sizeof (unsigned) * perms[perm_a].length ()) == 0));
  }
  
! /* Optimize the SLP graph of VINFO.  */
  
! void
! vect_optimize_slp (vec_info *vinfo)
  {
!   if (vinfo->slp_instances.is_empty ())
!     return;
  
!   slp_tree node;
!   unsigned i;
!   auto_vec<slpg_vertex> vertices;
!   auto_vec<int> leafs;
!   vect_slp_build_vertices (vinfo, vertices, leafs);
  
!   struct graph *slpg = new_graph (vertices.length ());
!   for (slpg_vertex &v : vertices)
      for (slp_tree child : SLP_TREE_CHILDREN (v.node))
        if (child)
! 	add_edge (slpg, v.node->vertex, child->vertex);
  
!   /* Compute (reverse) postorder on the inverted graph.  */
!   auto_vec<int> ipo;
!   graphds_dfs (slpg, &leafs[0], leafs.length (), &ipo, false, NULL, NULL);
! 
!   auto_vec<vec<unsigned> > perms;
!   perms.safe_push (vNULL); /* zero is no permute */
! 
!   /* Produce initial permutations.  */
!   for (i = 0; i < leafs.length (); ++i)
!     {
!       int idx = leafs[i];
!       slp_tree node = vertices[idx].node;
! 
!       /* Handle externals and constants optimistically throughout the
! 	 iteration.  But treat existing vectors as fixed since we
! 	 do not handle permuting them below.  */
!       if ((SLP_TREE_DEF_TYPE (node) == vect_external_def
! 	   && !SLP_TREE_VEC_DEFS (node).exists ())
! 	  || SLP_TREE_DEF_TYPE (node) == vect_constant_def)
! 	continue;
  
!       /* Leafs do not change across iterations.  Note leafs also double
! 	 as entries to the reverse graph.  */
!       if (!slpg->vertices[idx].succ)
  	{
! 	  vertices[idx].perm_in = 0;
! 	  vertices[idx].perm_out = 0;
  	}
  
        /* Loads are the only thing generating permutes.  */
        if (!SLP_TREE_LOAD_PERMUTATION (node).exists ())
  	continue;
  
        /* If splitting out a SLP_TREE_LANE_PERMUTATION can make the
! 	 node unpermuted, record this permute.  */
        stmt_vec_info dr_stmt = SLP_TREE_REPRESENTATIVE (node);
        if (!STMT_VINFO_GROUPED_ACCESS (dr_stmt))
  	continue;
--- 4058,4416 ----
      }
  }
  
! /* Return the cfg loop that contains NODE.  */
  
! struct loop *
! vect_optimize_slp_pass::containing_loop (slp_tree node)
  {
!   stmt_vec_info rep = SLP_TREE_REPRESENTATIVE (node);
!   if (!rep)
!     return ENTRY_BLOCK_PTR_FOR_FN (cfun)->loop_father;
!   return gimple_bb (vect_orig_stmt (rep)->stmt)->loop_father;
  }
  
! /* Return true if UD (an edge from a use to a definition) is associated
!    with a loop latch edge in the cfg.  */
  
! bool
! vect_optimize_slp_pass::is_cfg_latch_edge (graph_edge *ud)
  {
!   slp_tree use = m_vertices[ud->src].node;
!   slp_tree def = m_vertices[ud->dest].node;
!   if (SLP_TREE_DEF_TYPE (use) != vect_internal_def
!       || SLP_TREE_DEF_TYPE (def) != vect_internal_def)
!     return false;
  
!   stmt_vec_info use_rep = vect_orig_stmt (SLP_TREE_REPRESENTATIVE (use));
!   return (is_a<gphi *> (use_rep->stmt)
! 	  && bb_loop_header_p (gimple_bb (use_rep->stmt))
! 	  && containing_loop (def) == containing_loop (use));
! }
! 
! /* Build the graph.  Mark edges that correspond to cfg loop latch edges with
!    a nonnull data field.  */
  
! void
! vect_optimize_slp_pass::build_graph ()
! {
!   build_vertices ();
! 
!   m_slpg = new_graph (m_vertices.length ());
!   for (slpg_vertex &v : m_vertices)
      for (slp_tree child : SLP_TREE_CHILDREN (v.node))
        if (child)
! 	{
! 	  graph_edge *ud = add_edge (m_slpg, v.node->vertex, child->vertex);
! 	  if (is_cfg_latch_edge (ud))
! 	    ud->data = this;
! 	}
! }
  
! /* Return true if E corresponds to a loop latch edge in the cfg.  */
  
! static bool
! skip_cfg_latch_edges (graph_edge *e)
! {
!   return e->data;
! }
! 
! /* Create the node partitions.  */
! 
! void
! vect_optimize_slp_pass::create_partitions ()
! {
!   /* Calculate a postorder of the graph, ignoring edges that correspond
!      to natural latch edges in the cfg.  Reading the vector from the
!      end to the beginning gives the reverse postorder.  */
!   auto_vec<int> initial_rpo;
!   graphds_dfs (m_slpg, &m_leafs[0], m_leafs.length (), &initial_rpo,
! 	       false, NULL, skip_cfg_latch_edges);
!   gcc_assert (initial_rpo.length () == m_vertices.length ());
! 
!   /* Calculate the strongly connected components of the graph.  */
!   auto_vec<int> scc_grouping;
!   unsigned int num_sccs = graphds_scc (m_slpg, NULL, NULL, &scc_grouping);
! 
!   /* Create a new index order in which all nodes from the same SCC are
!      consecutive.  Use scc_pos to record the index of the first node in
!      each SCC.  */
!   auto_vec<unsigned int> scc_pos (num_sccs);
!   int last_component = -1;
!   unsigned int node_count = 0;
!   for (unsigned int node_i : scc_grouping)
!     {
!       if (last_component != m_slpg->vertices[node_i].component)
! 	{
! 	  last_component = m_slpg->vertices[node_i].component;
! 	  gcc_assert (last_component == int (scc_pos.length ()));
! 	  scc_pos.quick_push (node_count);
! 	}
!       node_count += 1;
!     }
!   gcc_assert (node_count == initial_rpo.length ()
! 	      && last_component + 1 == int (num_sccs));
! 
!   /* Use m_partitioned_nodes to group nodes into SCC order, with the nodes
!      inside each SCC following the RPO we calculated above.  The fact that
!      we ignored natural latch edges when calculating the RPO should ensure
!      that, for natural loop nests:
! 
!      - the first node that we encounter in a cfg loop is the loop header phi
!      - the loop header phis are in dominance order
! 
!      Arranging for this is an optimization (see below) rather than a
!      correctness issue.  Unnatural loops with a tangled mess of backedges
!      will still work correctly, but might give poorer results.
! 
!      Also update scc_pos so that it gives 1 + the index of the last node
!      in the SCC.  */
!   m_partitioned_nodes.safe_grow (node_count);
!   for (unsigned int old_i = initial_rpo.length (); old_i-- > 0;)
!     {
!       unsigned int node_i = initial_rpo[old_i];
!       unsigned int new_i = scc_pos[m_slpg->vertices[node_i].component]++;
!       m_partitioned_nodes[new_i] = node_i;
!     }
! 
!   /* Partition each SCC based on the containing cfg loop.  The order we
!      constructed above should ensure that, for natural cfg loops, we'll
!      create sub-SCC partitions for outer loops before the corresponding
!      sub-SCC partitions for inner loops.  Similarly, when one sibling
!      loop A dominates another sibling loop B, we should create a sub-SCC
!      partition for A before a sub-SCC partition for B.
! 
!      As above, nothing depends for correctness on whether this achieves
!      a natural nesting, but we should get better results when it does.  */
!   m_partitions.reserve (m_vertices.length ());
!   unsigned int next_partition_i = 0;
!   hash_map<struct loop *, int> loop_partitions;
!   unsigned int rpo_begin = 0;
!   unsigned int num_partitioned_nodes = 0;
!   for (unsigned int rpo_end : scc_pos)
!     {
!       loop_partitions.empty ();
!       for (unsigned int rpo_i = rpo_begin; rpo_i < rpo_end; ++rpo_i)
! 	{
! 	  /* Handle externals and constants optimistically throughout.
! 	     But treat existing vectors as fixed since we do not handle
! 	     permuting them.  */
! 	  unsigned int node_i = m_partitioned_nodes[rpo_i];
! 	  slp_tree node = m_vertices[node_i].node;
! 	  if ((SLP_TREE_DEF_TYPE (node) == vect_external_def
! 	       && !SLP_TREE_VEC_DEFS (node).exists ())
! 	      || SLP_TREE_DEF_TYPE (node) == vect_constant_def)
! 	    m_vertices[node_i].partition = -1;
! 	  else
! 	    {
! 	      struct loop *loop = containing_loop (node);
! 	      bool existed;
! 	      auto &partition_i
! 		= loop_partitions.get_or_insert (loop, &existed);
! 	      if (!existed)
! 		{
! 		  partition_i = next_partition_i++;
! 		  m_partitions.quick_push (slpg_partition_info ());
! 		}
! 	      m_vertices[node_i].partition = partition_i;
! 	      num_partitioned_nodes += 1;
! 	      m_partitions[partition_i].node_end += 1;
! 	    }
! 	}
!       rpo_begin = rpo_end;
!     }
! 
!   /* Assign ranges of consecutive node indices to each partition,
!      in partition order.  Start with node_end being the same as
!      node_begin so that the next loop can use it as a counter.  */
!   unsigned int node_begin = 0;
!   for (auto &partition : m_partitions)
!     {
!       partition.node_begin = node_begin;
!       node_begin += partition.node_end;
!       partition.node_end = partition.node_begin;
!     }
!   gcc_assert (node_begin == num_partitioned_nodes);
! 
!   /* Finally build the list of nodes in partition order.  */
!   m_partitioned_nodes.truncate (num_partitioned_nodes);
!   for (unsigned int node_i = 0; node_i < m_vertices.length (); ++node_i)
!     {
!       int partition_i = m_vertices[node_i].partition;
!       if (partition_i >= 0)
  	{
! 	  unsigned int order_i = m_partitions[partition_i].node_end++;
! 	  m_partitioned_nodes[order_i] = node_i;
  	}
+     }
+ }
+ 
+ /* Look for edges from earlier partitions into node NODE_I and edges from
+    node NODE_I into later partitions.  Call:
+ 
+       FN (ud, other_node_i, other_partition_i)
+ 
+    for each such use-to-def edge ud, where other_node_i and other_partition_i
+    are the node and partition at the other end of the edge.  */
+ 
+ template<typename T>
+ void
+ vect_optimize_slp_pass::for_each_partition_edge (unsigned int node_i, T fn)
+ {
+   int partition_i = m_vertices[node_i].partition;
+   for (graph_edge *pred = m_slpg->vertices[node_i].pred;
+        pred; pred = pred->pred_next)
+     {
+       int src_partition_i = m_vertices[pred->src].partition;
+       if (src_partition_i >= 0 && src_partition_i != partition_i)
+ 	fn (pred, pred->src, src_partition_i);
+     }
+   for (graph_edge *succ = m_slpg->vertices[node_i].succ;
+        succ; succ = succ->succ_next)
+     {
+       int dest_partition_i = m_vertices[succ->dest].partition;
+       if (dest_partition_i >= 0 && dest_partition_i != partition_i)
+ 	fn (succ, succ->dest, dest_partition_i);
+     }
+ }
+ 
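(Callers would pass a callback for FN, e.g. hypothetically a lambda:

   for_each_partition_edge (node_i,
			    [&](graph_edge *ud, unsigned int node2_i,
				unsigned int partition2_i)
     {
       /* Examine one cross-partition edge.  */
     });

with UD running from use to definition, as elsewhere in the graph.)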
+ /* Return true if the target is able to change a vector from layout
+    FROM_LAYOUT_I to layout TO_LAYOUT_I.  */
+ 
+ bool
+ vect_optimize_slp_pass::can_change_layouts (unsigned int from_layout_i,
+ 					    unsigned int to_layout_i)
+ {
+   /* FIXME: test what the target can support.  */
+   (void) from_layout_i;
+   (void) to_layout_i;
+   return true;
+ }
+ 
+ /* We already know that the target can perform permute m_perms[LAYOUT_I].
+    In our terms this means that we know that the target can go from a
+    layout with m_perms[LAYOUT_I] reversed (layout LAYOUT_I) to one with
+    it applied (layout 0, the original layout of the SLP graph).  Return
+    true if the target can also go in the opposite direction: from the
+    original layout to one in which m_perms[LAYOUT_I] has been reversed.  */
+ 
+ bool
+ vect_optimize_slp_pass::layout_is_reversible (unsigned int layout_i)
+ {
+   return can_change_layouts (0, layout_i);
+ }
+ 
+ /* Return the cost (in arbitrary units) of going from layout FROM_LAYOUT_I
+    to layout TO_LAYOUT_I.  */
+ 
+ unsigned int
+ vect_optimize_slp_pass::layout_cost (unsigned int from_layout_i,
+ 				     unsigned int to_layout_i)
+ {
+   if (from_layout_i == to_layout_i)
+     return 0;
+ 
+   if (can_change_layouts (from_layout_i, to_layout_i))
+     return 1;
+ 
+   /* Change to layout 0 and then out again.  */
+   gcc_assert (layout_is_reversible (from_layout_i));
+   return 2;
+ }
+ 
+ /* Return the costs of assigning layout LAYOUT_I to partition PARTITION_I.  */
+ 
+ inline slpg_partition_layout_costs &
+ vect_optimize_slp_pass::partition_layout_costs (unsigned int partition_i,
+ 						unsigned int layout_i)
+ {
+   return m_partition_layout_costs[partition_i * m_perms.length () + layout_i];
+ }
+ 
+ /* Return true if partition PARTITION_I is structurally allowed to absorb
+    neighboring permutes.  */
+ 
+ bool
+ vect_optimize_slp_pass::
+ partition_can_absorb_permutes (unsigned int partition_i)
+ {
+   auto &partition = m_partitions[partition_i];
+   if (partition.node_end != partition.node_begin + 1)
+     return false;
+ 
+   slp_tree node = m_vertices[m_partitioned_nodes[partition.node_begin]].node;
+   if (SLP_TREE_CODE (node) == VEC_PERM_EXPR)
+     return true;
+ 
+   stmt_vec_info rep = SLP_TREE_REPRESENTATIVE (node);
+   if (rep
+       && STMT_VINFO_DATA_REF (rep)
+       && DR_IS_READ (STMT_VINFO_DATA_REF (rep)))
+     return true;
+ 
+   return false;
+ }
+ 
+ /* Check whether NODE (which is a VEC_PERM_EXPR or a load) can absorb the
+    permute that would be required to go from layout 0 to layout LAYOUT_I.
+    Return the cost of the change if so, in the same arbitrary units as
+    for layout_cost.  Return -1 otherwise.  */
+ 
+ int
+ vect_optimize_slp_pass::node_absorption_cost (slp_tree node,
+ 					      unsigned int layout_i)
+ {
+   if (SLP_TREE_CODE (node) == VEC_PERM_EXPR)
+     /* FIXME: check whether NODE can handle cases in which both the inputs
+        and outputs have layout LAYOUT_I.  Return -1 if not.  */
+     return 0;
+ 
+   auto &vertex = m_vertices[node->vertex];
+   auto &partition = m_partitions[vertex.partition];
+   if (partition.layout == int (layout_i))
+     return 0;
+   /* FIXME: check whether NODE can be changed from its current layout
+      to layout LAYOUT_I.  Return -1 if not.  */
+   return 1;
+ }
+ 
+ /* Decide which element layouts we should consider using.  Mark which
+    partitions can potentially use which layouts and the local costs of
+    each choice.  */
+ 
+ void
+ vect_optimize_slp_pass::start_choosing_layouts ()
+ {
+   /* Used to assign unique permute indices.  */
+   using perm_hash = unbounded_hashmap_traits<
+     vec_free_hash_base<int_hash_base<unsigned>>,
+     int_hash<int, -1, -2>
+   >;
+   hash_map<vec<unsigned>, int, perm_hash> layout_ids;
+ 
+   /* Layout 0 is "no change".  */
+   m_perms.safe_push (vNULL);
+ 
+   /* Create layouts from existing permutations.  */
+   /* FIXME: cap the maximum number of layouts, to avoid quadraticness.  */
+   for (unsigned int node_i : m_leafs)
+     {
+       int partition_i = m_vertices[node_i].partition;
+       if (partition_i < 0)
+ 	continue;
+ 
+       /* Leafs also double as entries to the reverse graph.  Allow the
+ 	 layout of those to be changed.  */
+       if (!m_slpg->vertices[node_i].succ)
+ 	m_partitions[partition_i].layout = 0;
  
        /* Loads are the only thing generating permutes.  */
+       slp_tree node = m_vertices[node_i].node;
        if (!SLP_TREE_LOAD_PERMUTATION (node).exists ())
  	continue;
  
        /* If splitting out a SLP_TREE_LANE_PERMUTATION can make the
! 	 node unpermuted, record a layout that reverses this permute.  */
!       gcc_assert (m_partitions[partition_i].layout == 0);
        stmt_vec_info dr_stmt = SLP_TREE_REPRESENTATIVE (node);
        if (!STMT_VINFO_GROUPED_ACCESS (dr_stmt))
  	continue;
***************
*** 3708,3720 ****
  	  if (idx - SLP_TREE_LOAD_PERMUTATION (node)[0] != j)
  	    any_permute = true;
  	}
-       /* If there's no permute no need to split one out.  */
-       if (!any_permute)
- 	continue;
        /* If the span doesn't match we'd disrupt VF computation, avoid
  	 that for now.  */
        if (imax - imin + 1 != SLP_TREE_LANES (node))
  	continue;
  
        /* For now only handle true permutes, like
  	 vect_attempt_slp_rearrange_stmts did.  This allows us to be lazy
--- 4425,4442 ----
  	  if (idx - SLP_TREE_LOAD_PERMUTATION (node)[0] != j)
  	    any_permute = true;
  	}
        /* If the span doesn't match we'd disrupt VF computation, avoid
  	 that for now.  */
        if (imax - imin + 1 != SLP_TREE_LANES (node))
  	continue;
+       /* If there's no permute, no need to split one out.  In this case
+ 	 we can consider turning the load into a permuted load, if that
+ 	 turns out to be cheaper than alternatives.  */
+       if (!any_permute)
+ 	{
+ 	  m_partitions[partition_i].layout = -1;
+ 	  continue;
+ 	}
  
        /* For now only handle true permutes, like
  	 vect_attempt_slp_rearrange_stmts did.  This allows us to be lazy
***************
*** 3735,3985 ****
        perm.safe_grow (SLP_TREE_LANES (node), true);
        for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j)
  	perm[j] = SLP_TREE_LOAD_PERMUTATION (node)[j] - imin;
!       perms.safe_push (perm);
!       vertices[idx].perm_in = perms.length () - 1;
!       vertices[idx].perm_out = perms.length () - 1;
      }
  
!   /* In addition to the above we have to mark outgoing permutes facing
!      non-reduction graph entries that are not represented as to be
!      materialized.  */
!   for (slp_instance instance : vinfo->slp_instances)
      if (SLP_INSTANCE_KIND (instance) == slp_inst_kind_ctor)
        {
! 	/* Just setting perm_out isn't enough for the propagation to
! 	   pick this up.  */
! 	vertices[SLP_INSTANCE_TREE (instance)->vertex].perm_in = 0;
! 	vertices[SLP_INSTANCE_TREE (instance)->vertex].perm_out = 0;
        }
  
!   /* Propagate permutes along the graph and compute materialization points.  */
!   bool changed;
!   bool do_materialization = false;
!   unsigned iteration = 0;
!   do
!     {
!       changed = false;
!       ++iteration;
! 
!       if (dump_enabled_p ())
! 	dump_printf_loc (MSG_NOTE, vect_location,
! 			 "SLP optimize iteration %d\n", iteration);
  
!       for (i = vertices.length (); i > 0 ; --i)
  	{
! 	  int idx = ipo[i-1];
! 	  slp_tree node = vertices[idx].node;
  
! 	  /* Handle externals and constants optimistically throughout the
! 	     iteration.  */
! 	  if (SLP_TREE_DEF_TYPE (node) == vect_external_def
! 	      || SLP_TREE_DEF_TYPE (node) == vect_constant_def)
! 	    continue;
  
! 	  /* We still eventually have failed backedge SLP nodes in the
! 	     graph, those are only cancelled when analyzing operations.
! 	     Simply treat them as transparent ops, propagating permutes
! 	     through them.  */
! 	  if (SLP_TREE_DEF_TYPE (node) == vect_internal_def)
! 	    {
! 	      /* We do not handle stores with a permutation, so all
! 		 incoming permutes must have been materialized.  */
! 	      stmt_vec_info rep = SLP_TREE_REPRESENTATIVE (node);
! 	      if (STMT_VINFO_DATA_REF (rep)
! 		  && DR_IS_WRITE (STMT_VINFO_DATA_REF (rep)))
! 		{
! 		  /* ???  We're forcing materialization in place
! 		     of the child here, we'd need special handling
! 		     in materialization to leave perm_in -1 here.  */
! 		  vertices[idx].perm_in = 0;
! 		  vertices[idx].perm_out = 0;
! 		}
! 	      /* We cannot move a permute across an operation that is
! 		 not independent on lanes.  Note this is an explicit
! 		 negative list since that's much shorter than the respective
! 		 positive one but it's critical to keep maintaining it.  */
! 	      if (is_gimple_call (STMT_VINFO_STMT (rep)))
! 		switch (gimple_call_combined_fn (STMT_VINFO_STMT (rep)))
! 		  {
! 		  case CFN_COMPLEX_ADD_ROT90:
! 		  case CFN_COMPLEX_ADD_ROT270:
! 		  case CFN_COMPLEX_MUL:
! 		  case CFN_COMPLEX_MUL_CONJ:
! 		  case CFN_VEC_ADDSUB:
! 		  case CFN_VEC_FMADDSUB:
! 		  case CFN_VEC_FMSUBADD:
! 		    vertices[idx].perm_in = 0;
! 		    vertices[idx].perm_out = 0;
! 		  default:;
! 		  }
  	    }
  
! 	  if (!slpg->vertices[idx].succ)
! 	    /* Pick up pre-computed leaf values.  */
! 	    ;
! 	  else
  	    {
! 	      bool any_succ_perm_out_m1 = false;
! 	      int perm_in = vertices[idx].perm_in;
! 	      for (graph_edge *succ = slpg->vertices[idx].succ;
! 		   succ; succ = succ->succ_next)
  		{
! 		  int succ_idx = succ->dest;
! 		  int succ_perm = vertices[succ_idx].perm_out;
! 		  /* Handle unvisited (and constant) nodes optimistically.  */
! 		  /* ???  But for constants once we want to handle
! 		     non-bijective permutes we have to verify the permute,
! 		     when unifying lanes, will not unify different constants.
! 		     For example see gcc.dg/vect/bb-slp-14.c for a case
! 		     that would break.  */
! 		  if (succ_perm == -1)
! 		    {
! 		      /* When we handled a non-leaf optimistically, note
! 			 that so we can adjust its outgoing permute below.  */
! 		      slp_tree succ_node = vertices[succ_idx].node;
! 		      if (SLP_TREE_DEF_TYPE (succ_node) != vect_external_def
! 			  && SLP_TREE_DEF_TYPE (succ_node) != vect_constant_def)
! 			any_succ_perm_out_m1 = true;
! 		      continue;
! 		    }
! 		  if (perm_in == -1)
! 		    perm_in = succ_perm;
! 		  else if (succ_perm == 0
! 			   || !vect_slp_perms_eq (perms, perm_in, succ_perm))
  		    {
! 		      perm_in = 0;
! 		      break;
  		    }
  		}
  
! 	      /* Adjust any incoming permutes we treated optimistically.  */
! 	      if (perm_in != -1 && any_succ_perm_out_m1)
  		{
! 		  for (graph_edge *succ = slpg->vertices[idx].succ;
! 		       succ; succ = succ->succ_next)
  		    {
! 		      slp_tree succ_node = vertices[succ->dest].node;
! 		      if (vertices[succ->dest].perm_out == -1
! 			  && SLP_TREE_DEF_TYPE (succ_node) != vect_external_def
! 			  && SLP_TREE_DEF_TYPE (succ_node) != vect_constant_def)
! 			{
! 			  vertices[succ->dest].perm_out = perm_in;
! 			  /* And ensure this propagates.  */
! 			  if (vertices[succ->dest].perm_in == -1)
! 			    vertices[succ->dest].perm_in = perm_in;
! 			}
  		    }
! 		  changed = true;
! 		}
  
! 	      if (!vect_slp_perms_eq (perms, perm_in,
! 				      vertices[idx].perm_in))
! 		{
! 		  /* Make sure we eventually converge.  */
! 		  gcc_checking_assert (vertices[idx].perm_in == -1
! 				       || perm_in == 0);
! 		  vertices[idx].perm_in = perm_in;
! 
! 		  /* While we can handle VEC_PERM nodes as transparent
! 		     pass-through they can be a cheap materialization
! 		     point as well.  In addition they can act as source
! 		     of a random permutation as well.
! 		     The following ensures that former materialization
! 		     points that now have zero incoming permutes no
! 		     longer appear as such and that former "any" permutes
! 		     get pass-through.  We keep VEC_PERM nodes optimistic
! 		     as "any" outgoing permute though.  */
! 		  if (vertices[idx].perm_out != 0
! 		      && SLP_TREE_CODE (node) != VEC_PERM_EXPR)
! 		    vertices[idx].perm_out = perm_in;
! 		  changed = true;
! 		}
  	    }
  
! 	  /* Elide pruning at materialization points in the first
! 	     iteration phase.  */
! 	  if (!do_materialization)
! 	    continue;
  
! 	  int perm = vertices[idx].perm_out;
! 	  if (perm == 0 || perm == -1)
  	    continue;
  
! 	  /* Decide on permute materialization.  Look whether there's
! 	     a use (pred) edge that is permuted differently than us.
! 	     In that case mark ourselves so the permutation is applied.  */
! 	  bool all_preds_permuted = slpg->vertices[idx].pred != NULL;
! 	  if (all_preds_permuted)
! 	    for (graph_edge *pred = slpg->vertices[idx].pred;
! 		 pred; pred = pred->pred_next)
! 	      {
! 		int pred_perm = vertices[pred->src].perm_in;
! 		gcc_checking_assert (pred_perm != -1);
! 		if (!vect_slp_perms_eq (perms, perm, pred_perm))
! 		  {
! 		    all_preds_permuted = false;
! 		    break;
! 		  }
! 	      }
! 	  if (!all_preds_permuted)
! 	    {
! 	      vertices[idx].perm_out = 0;
! 	      changed = true;
  	    }
- 	}
  
!       /* If the initial propagation converged, switch on materialization
! 	 and re-propagate.  */
!       if (!changed && !do_materialization)
  	{
! 	  do_materialization = true;
! 	  changed = true;
  	}
      }
!   while (changed);
!   statistics_histogram_event (cfun, "SLP optimize perm iterations", iteration);
  
!   /* Materialize.  */
!   for (i = 0; i < vertices.length (); ++i)
      {
!       int perm_in = vertices[i].perm_in;
!       slp_tree node = vertices[i].node;
  
!       /* First permute invariant/external original successors, we handle
! 	 those optimistically during propagation and duplicate them if
! 	 they are used with different permutations.  */
!       unsigned j;
!       slp_tree child;
!       if (perm_in > 0)
! 	FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), j, child)
! 	  {
! 	    if (!child
! 		|| (SLP_TREE_DEF_TYPE (child) != vect_constant_def
! 		    && SLP_TREE_DEF_TYPE (child) != vect_external_def))
! 	      continue;
  
! 	    /* If the vector is uniform there's nothing to do.  */
! 	    if (vect_slp_tree_uniform_p (child))
! 	      continue;
  
! 	    /* We can end up sharing some externals via two_operator
! 	       handling.  Be prepared to unshare those.  */
! 	    if (child->refcnt != 1)
! 	      {
! 		gcc_assert (slpg->vertices[child->vertex].pred->pred_next);
! 		SLP_TREE_CHILDREN (node)[j] = child
! 		  = vect_create_new_slp_node
! 		      (SLP_TREE_SCALAR_OPS (child).copy ());
! 	      }
! 	    vect_slp_permute (perms[perm_in],
! 			      SLP_TREE_SCALAR_OPS (child), true);
! 	  }
  
        if (SLP_TREE_CODE (node) == VEC_PERM_EXPR)
  	{
  	  /* Apply the common permutes to the input vectors.  */
! 	  if (perm_in > 0)
! 	    {
  	      /* If the node is already a permute node we can apply
  		 the permutation to the lane selection, effectively
  		 materializing it on the incoming vectors.  */
--- 4457,5092 ----
        perm.safe_grow (SLP_TREE_LANES (node), true);
        for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j)
  	perm[j] = SLP_TREE_LOAD_PERMUTATION (node)[j] - imin;
!       bool existed;
!       int &layout_i = layout_ids.get_or_insert (perm, &existed);
!       if (existed)
! 	perm.release ();
!       else
! 	{
! 	  layout_i = m_perms.length ();
! 	  m_perms.safe_push (perm);
! 	}
!       m_partitions[partition_i].layout = layout_i;
      }
  
!   /* Initially assume that every layout is possible and has zero cost
!      in every partition.  */
!   m_partition_layout_costs.safe_grow_cleared (m_partitions.length ()
! 					      * m_perms.length ());
! 
+   /* We have to mark outgoing permutes facing non-reduction graph
+      entries, which are not themselves represented, as needing to be
+      materialized.  */
!   for (slp_instance instance : m_vinfo->slp_instances)
      if (SLP_INSTANCE_KIND (instance) == slp_inst_kind_ctor)
        {
! 	unsigned int node_i = SLP_INSTANCE_TREE (instance)->vertex;
! 	m_partitions[m_vertices[node_i].partition].layout = 0;
        }
  
!   /* Check which layouts each node and partition can handle.  Calculate the
!      weights associated with inserting layout changes on edges.  */
!   m_optimize_size = true;
!   for (unsigned int node_i : m_partitioned_nodes)
!     {
!       auto &vertex = m_vertices[node_i];
!       unsigned int partition_i = vertex.partition;
!       auto &partition = m_partitions[partition_i];
!       slp_tree node = vertex.node;
! 
!       if (stmt_vec_info rep = SLP_TREE_REPRESENTATIVE (node))
! 	{
! 	  basic_block bb = gimple_bb (vect_orig_stmt (rep)->stmt);
! 	  if (optimize_bb_for_speed_p (bb))
! 	    m_optimize_size = false;
! 
! 	  vertex.weight = vect_slp_node_weight (node);
! 
! 	  /* We do not handle stores with a permutation, so all
! 	     incoming permutes must have been materialized.  */
! 	  if (STMT_VINFO_DATA_REF (rep)
! 	      && DR_IS_WRITE (STMT_VINFO_DATA_REF (rep)))
! 	    /* ???  We're forcing materialization in place
! 	       of the child here, we'd need special handling
! 	       in materialization to leave layout -1 here.  */
! 	    partition.layout = 0;
! 
! 	  /* We cannot change the layout of an operation that is
! 	     not independent of lanes.  Note this is an explicit
! 	     negative list, since that's much shorter than the respective
! 	     positive one, but it's critical to keep it maintained.  */
! 	  if (is_gimple_call (STMT_VINFO_STMT (rep)))
! 	    switch (gimple_call_combined_fn (STMT_VINFO_STMT (rep)))
! 	      {
! 	      case CFN_COMPLEX_ADD_ROT90:
! 	      case CFN_COMPLEX_ADD_ROT270:
! 	      case CFN_COMPLEX_MUL:
! 	      case CFN_COMPLEX_MUL_CONJ:
! 	      case CFN_VEC_ADDSUB:
! 	      case CFN_VEC_FMADDSUB:
! 	      case CFN_VEC_FMSUBADD:
! 		partition.layout = 0;
! 	      default:;
! 	      }
! 	}
  
!       if (partition.layout == 0)
  	{
! 	  /* The node must stay in its current order.  */
! 	  for (unsigned int layout_i = 1; layout_i < m_perms.length ();
! 	       ++layout_i)
! 	    partition_layout_costs (partition_i, layout_i).mark_impossible ();
! 	}
!       else if (partition_can_absorb_permutes (partition_i))
! 	/* Calculate the cost of converting NODE to use each layout.
! 	   If PARTITION.LAYOUT > 0 then that layout's permute will be free.  */
! 	for (unsigned int layout_i = 0; layout_i < m_perms.length ();
! 	     ++layout_i)
! 	  {
! 	    auto &layout_costs = partition_layout_costs (partition_i,
! 							 layout_i);
! 	    int factor = node_absorption_cost (node, layout_i);
! 	    if (factor < 0)
! 	      {
! 		layout_costs.mark_impossible ();
! 		continue;
! 	      }
! 	    gcc_assert (layout_costs.is_possible ());
! 	    layout_costs.in_cost = { vertex.weight * factor, m_optimize_size };
! 	  }
  
!       auto process_edge = [&](graph_edge *ud, unsigned int node2_i,
! 			      unsigned int partition2_i)
! 	{
! 	  /* Count the number of edges from earlier partitions and the number
! 	     of edges to later partitions.  */
! 	  if (partition2_i < partition_i)
! 	    partition.in_degree += 1;
! 	  else
! 	    partition.out_degree += 1;
  
! 	  /* If the current node uses the result of NODE2_I, add the
! 	     current node's weight to NODE2_I's outgoing weight and
! 	     degree.  */
! 	  if (ud->src == int (node_i))
! 	    {
! 	      m_vertices[node2_i].out_weight += vertex.weight;
! 	      m_vertices[node2_i].out_degree += 1;
  	    }
+ 	};
+       for_each_partition_edge (node_i, process_edge);
+     }
  
!   /* Layout 0 represents the original order and layout LAYOUT_I represents
!      the effect of reversing m_perms[LAYOUT_I].  If a predecessor partition
!      does not support layout LAYOUT_I and if it also isn't possible to go
!      from layout 0 to layout LAYOUT_I, then it is very difficult for a
!      successor partition to support layout LAYOUT_I.  (The only alternative
!      would be to go to layout LAYOUT_I from a different nonzero layout,
!      but that's a niche case and not really worth supporting.)  */
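!   /* For example, if a partition is pinned to layout 0 (say because it
!      contains one of the calls on the negative list above) and layout 1
!      is not reversible, then no partition that consumes its results can
!      use layout 1 either: there is no way of creating layout 1 from
!      layout 0.  The loop below propagates this transitively.  */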
!   for (unsigned int layout_i = 1; layout_i < m_perms.length (); ++layout_i)
!     {
!       if (layout_is_reversible (layout_i))
! 	continue;
! 
!       bool changed;
!       do
! 	{
! 	  changed = false;
! 	  for (unsigned int node_i : m_partitioned_nodes)
  	    {
! 	      unsigned int partition_i = m_vertices[node_i].partition;
! 	      if (partition_layout_costs (partition_i,
! 					  layout_i).is_possible ())
! 		continue;
! 	      for (graph_edge *pred = m_slpg->vertices[node_i].pred;
! 		   pred; pred = pred->pred_next)
  		{
! 		  unsigned int partition2_i = m_vertices[pred->src].partition;
! 		  auto &costs = partition_layout_costs (partition2_i,
! 							layout_i);
! 		  if (costs.is_possible ())
  		    {
! 		      costs.mark_impossible ();
! 		      changed = true;
  		    }
  		}
+ 	    }
+ 	}
+       while (changed);
+     }
+ }
+ 
+ /* Return the cost of switching between layout LAYOUT1_I (at node NODE1_I)
+    and layout LAYOUT2_I on cross-partition use-to-def edge UD.  */
+ 
+ slpg_layout_cost
+ vect_optimize_slp_pass::edge_layout_cost (graph_edge *ud,
+ 					  unsigned int node1_i,
+ 					  unsigned int layout1_i,
+ 					  unsigned int layout2_i)
+ {
+   auto def_layout_i = ud->dest == int (node1_i) ? layout1_i : layout2_i;
+   auto use_layout_i = ud->dest == int (node1_i) ? layout2_i : layout1_i;
+   auto factor = layout_cost (def_layout_i, use_layout_i);
+ 
+   /* We have a choice of putting the layout change at the site of the
+      definition or at the site of the use.  Prefer the former when
+      optimizing for size or when the execution frequency of the
+      definition is no greater than the combined execution frequencies of
+      the uses.  When putting the layout change at the site of the definition,
+      divvy up the cost among all consumers.  */
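+   /* For example (illustrative numbers): if the definition has weight 8
+      and its uses have a combined weight of 9, the layout change is
+      costed at the definition and each use edge is charged an equal
+      share of that cost.  */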
+   slpg_vertex &def_vertex = m_vertices[ud->dest];
+   slpg_vertex &use_vertex = m_vertices[ud->src];
+   if (m_optimize_size || def_vertex.weight <= def_vertex.out_weight)
+     {
+       slpg_layout_cost cost = { def_vertex.weight * factor, m_optimize_size };
+       cost.split (def_vertex.out_degree);
+       return cost;
+     }
+   return { use_vertex.weight * factor, m_optimize_size };
+ }
+ 
+ /* UD represents a use-def link between FROM_NODE_I and a node in a later
+    partition; FROM_NODE_I could be the definition node or the use node.
+    The node at the other end of the link wants to use layout TO_LAYOUT_I;
+    return the cost of any necessary fix-ups on edge UD.
+ 
+    At this point, FROM_NODE_I's partition has chosen the cheapest
+    layout based on the information available so far, but this choice
+    is only provisional.  */
+ 
+ slpg_layout_cost
+ vect_optimize_slp_pass::forward_cost (graph_edge *ud,
+ 				      unsigned int from_node_i,
+ 				      unsigned int to_layout_i)
+ {
+   slpg_vertex &from_vertex = m_vertices[from_node_i];
+   unsigned int from_partition_i = from_vertex.partition;
+   slpg_partition_info &from_partition = m_partitions[from_partition_i];
+   gcc_assert (from_partition.layout >= 0);
+ 
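+   /* The cost computed below is the minimum of three alternatives: keep
+      FROM_PARTITION's provisional layout and fix things up on the edge,
+      switch FROM_PARTITION to TO_LAYOUT_I outright, or (where the node
+      allows it) absorb the required permute into the node itself.  */
+ 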
+   /* First calculate the cost on the assumption that FROM_PARTITION sticks
+      with its current layout preference.  */
+   slpg_layout_cost cost
+     = partition_layout_costs (from_partition_i,
+ 			      from_partition.layout).in_cost;
+   cost.split (from_partition.out_degree);
+   cost.add_serial_cost (edge_layout_cost (ud, from_node_i,
+ 					  from_partition.layout, to_layout_i));
+ 
+   /* Take the minimum of that cost and the cost that applies if
+      FROM_PARTITION instead switches to TO_LAYOUT_I.  */
+   auto &direct_layout_costs = partition_layout_costs (from_partition_i,
+ 						      to_layout_i);
+   if (direct_layout_costs.is_possible ())
+     {
+       slpg_layout_cost direct_cost = direct_layout_costs.in_cost;
+       direct_cost.split (from_partition.out_degree);
+       if (direct_cost.is_better_than (cost, m_optimize_size))
+ 	cost = direct_cost;
+     }
+ 
+   /* If FROM_PARTITION is likely to be able to absorb the required outgoing
+      permute, optimistically assume that that is what will happen.  */
+   if (partition_can_absorb_permutes (from_partition_i))
+     {
+       int factor = node_absorption_cost (from_vertex.node, to_layout_i);
+       if (factor >= 0)
+ 	{
+ 	  sreal weight = from_vertex.weight * factor;
+ 	  slpg_layout_cost absorb_cost = { weight, m_optimize_size };
+ 	  if (absorb_cost.is_better_than (cost, m_optimize_size))
+ 	    cost = absorb_cost;
+ 	}
+     }
+ 
+   return cost;
+ }
+ 
+ /* UD represents a use-def link between TO_NODE_I and a node in an earlier
+    partition; TO_NODE_I could be the definition node or the use node.
+    The node at the other end of the link wants to use layout FROM_LAYOUT_I;
+    return the cost of any necessary fix-ups on edge UD.
+ 
+    At this point, TO_NODE_I's partition has a fixed choice of layout.  */
+ 
+ slpg_layout_cost
+ vect_optimize_slp_pass::backward_cost (graph_edge *ud,
+ 				       unsigned int to_node_i,
+ 				       unsigned int from_layout_i)
+ {
+   slpg_vertex &to_vertex = m_vertices[to_node_i];
+   unsigned int to_partition_i = to_vertex.partition;
+   slpg_partition_info &to_partition = m_partitions[to_partition_i];
+   gcc_assert (to_partition.layout >= 0);
+ 
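+   /* Unlike in forward_cost, TO_PARTITION's layout is already fixed here,
+      so the cost below is the minimum of two alternatives: fix things up
+      on the edge itself, or absorb the required permute into the node.  */
+ 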
+   slpg_layout_cost cost
+     = partition_layout_costs (to_partition_i,
+ 			      to_partition.layout).out_cost;
+   cost.split (to_partition.in_degree);
+   cost.add_serial_cost (edge_layout_cost (ud, to_node_i, to_partition.layout,
+ 					  from_layout_i));
+ 
+   /* If TO_PARTITION is likely to be able to absorb the required input permute,
+      optimistically assume that that is what will happen.  */
+   if (partition_can_absorb_permutes (to_partition_i))
+     {
+       int factor = node_absorption_cost (to_vertex.node, from_layout_i);
+       if (factor >= 0)
+ 	{
+ 	  sreal weight = to_vertex.weight * factor;
+ 	  slpg_layout_cost absorb_cost = { weight, m_optimize_size };
+ 	  if (absorb_cost.is_better_than (cost, m_optimize_size))
+ 	    cost = absorb_cost;
+ 	}
+     }
+ 
+   return cost;
+ }
  
! /* Make a forward pass through the partitions, accumulating input costs.
!    Make a provisional choice of layout for each partition.  */
! 
! void
! vect_optimize_slp_pass::forward_pass ()
! {
!   for (unsigned int partition_i = 0; partition_i < m_partitions.length ();
!        ++partition_i)
!     {
!       auto &partition = m_partitions[partition_i];
!       unsigned int min_layout_i = 0;
!       slpg_layout_cost *min_layout_cost = nullptr;
!       for (unsigned int layout_i = 0; layout_i < m_perms.length (); ++layout_i)
! 	{
! 	  auto &layout_costs = partition_layout_costs (partition_i, layout_i);
! 	  if (!layout_costs.is_possible ())
! 	    continue;
! 
! 	  /* Accumulate the costs from predecessor partitions.  */
! 	  for (unsigned int order_i = partition.node_begin;
! 	       order_i < partition.node_end; ++order_i)
! 	    {
! 	      unsigned int node_i = m_partitioned_nodes[order_i];
! 	      auto add_cost = [&](graph_edge *ud, unsigned int from_node_i,
! 				  unsigned int from_partition_i)
  		{
! 		  if (from_partition_i < partition_i)
  		    {
! 		      auto cost = forward_cost (ud, from_node_i, layout_i);
! 		      layout_costs.in_cost.add_parallel_cost (cost);
  		    }
! 		};
! 	      for_each_partition_edge (node_i, add_cost);
! 	    }
  
! 	  /* Record the layout with the lowest cost.  Prefer layout 0 in
! 	     the event of a tie between it and another layout.  */
! 	  if (!min_layout_cost
! 	      || layout_costs.in_cost.is_better_than (*min_layout_cost,
! 						      m_optimize_size))
! 	    {
! 	      min_layout_i = layout_i;
! 	      min_layout_cost = &layout_costs.in_cost;
  	    }
+ 	}
+       gcc_assert (min_layout_cost);
+       partition.layout = min_layout_i;
+     }
+ }
  
! /* Make a backward pass through the partitions, accumulating output costs.
!    Make a final choice of layout for each partition.  */
  
! void
! vect_optimize_slp_pass::backward_pass ()
! {
!   for (unsigned int partition_i = m_partitions.length (); partition_i-- > 0;)
!     {
!       auto &partition = m_partitions[partition_i];
!       unsigned int min_layout_i = 0;
!       slpg_layout_cost *min_layout_cost = nullptr;
!       unsigned int seen_min_layout_i = 0;
!       slpg_layout_cost *seen_min_layout_cost = nullptr;
!       for (unsigned int layout_i = 0; layout_i < m_perms.length (); ++layout_i)
! 	{
! 	  auto &layout_costs = partition_layout_costs (partition_i, layout_i);
! 	  if (!layout_costs.is_possible ())
  	    continue;
  
! 	  /* Accumulate the costs from successor partitions.  Record whether
! 	     any of them actually uses this layout. */
! 	  bool seen = false;
! 	  for (unsigned int order_i = partition.node_begin;
! 	       order_i < partition.node_end; ++order_i)
! 	    {
! 	      unsigned int node_i = m_partitioned_nodes[order_i];
! 	      auto add_cost = [&](graph_edge *ud, unsigned int to_node_i,
! 				  unsigned int to_partition_i)
! 		{
! 		  if (to_partition_i > partition_i)
! 		    {
! 		      auto cost = backward_cost (ud, to_node_i, layout_i);
! 		      layout_costs.out_cost.add_parallel_cost (cost);
! 		      seen |= (m_partitions[to_partition_i].layout
! 			       == int (layout_i));
! 		    }
! 		};
! 	      for_each_partition_edge (node_i, add_cost);
  	    }
  
! 	  /* Locally combine the costs from the forward and backward passes.
! 	     (This combined cost is not passed on, since that would lead
! 	     to double counting.)  */
! 	  layout_costs.cost = layout_costs.in_cost;
! 	  layout_costs.cost.add_serial_cost (layout_costs.out_cost);
! 
! 	  /* Record the layout with the lowest cost.  Prefer layout 0 in
! 	     the event of a tie between it and another layout.  */
! 	  if (!min_layout_cost
! 	      || layout_costs.cost.is_better_than (*min_layout_cost,
! 						   m_optimize_size))
! 	    {
! 	      min_layout_i = layout_i;
! 	      min_layout_cost = &layout_costs.cost;
! 	    }
! 	  /* Likewise, but restrict the choice of layouts to those that
! 	     are used by successor partitions.  */
! 	  if (seen
! 	      && (!seen_min_layout_cost
! 		  || layout_costs.cost.is_better_than (*seen_min_layout_cost,
! 						       m_optimize_size)))
! 	    {
! 	      seen_min_layout_i = layout_i;
! 	      seen_min_layout_cost = &layout_costs.cost;
! 	    }
! 	}
!       gcc_assert (min_layout_cost);
! 
!       /* If the partition can absorb permutes, prefer a layout that has
! 	 actually been used.  */
!       if (seen_min_layout_cost
! 	  && partition_can_absorb_permutes (partition_i))
! 	partition.layout = seen_min_layout_i;
!       else
! 	partition.layout = min_layout_i;
!     }
! }
! 
! /* Return a node that applies layout TO_LAYOUT_I to the original form of NODE.
!    NODE already has the layout that was selected for its partition.  */
! 
! slp_tree
! vect_optimize_slp_pass::get_result_with_layout (slp_tree node,
! 						unsigned int to_layout_i)
! {
!   unsigned int result_i = node->vertex * m_perms.length () + to_layout_i;
!   slp_tree result = m_node_layouts[result_i];
!   if (result)
!     return result;
! 
!   if (SLP_TREE_DEF_TYPE (node) == vect_constant_def
!       || SLP_TREE_DEF_TYPE (node) == vect_external_def)
!     {
!       /* If the vector is uniform or unchanged, there's nothing to do.  */
!       if (to_layout_i == 0 || vect_slp_tree_uniform_p (node))
! 	result = node;
!       else
  	{
! 	  auto scalar_ops = SLP_TREE_SCALAR_OPS (node).copy ();
! 	  result = vect_create_new_slp_node (scalar_ops);
! 	  vect_slp_permute (m_perms[to_layout_i], scalar_ops, true);
  	}
      }
!   else
!     {
!       unsigned int partition_i = m_vertices[node->vertex].partition;
!       unsigned int from_layout_i = m_partitions[partition_i].layout;
!       if (from_layout_i == to_layout_i)
! 	return node;
! 
!       /* FIXME: if the result of a pre-existing VEC_PERM_EXPR is used
! 	 twice, with two different layouts, then we could create
! 	 parallel VEC_PERM_EXPRs rather than add a serial one.  */
! 
!       /* FIXME: When calculating the costs, we considered the possibility
! 	 of duplicating a VEC_PERM_EXPR at each use site, rather than
! 	 having one at the definition site.  Do that here if it turns
! 	 out to be cheaper.  */
! 
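!       /* Otherwise wrap NODE in a VEC_PERM_EXPR that converts from the
! 	 layout chosen for NODE's partition to TO_LAYOUT_I.  As an
! 	 illustrative example: if the partition chose layout { 1, 0, 3, 2 }
! 	 and TO_LAYOUT_I is 0, the new node selects lanes { 1, 0, 3, 2 }
! 	 of NODE, undoing the chosen layout.  */
! 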
!       if (dump_enabled_p ())
! 	dump_printf_loc (MSG_NOTE, vect_location,
! 			 "inserting permute node in place of %p\n",
! 			 (void *) node);
! 
!       unsigned int num_lanes = SLP_TREE_LANES (node);
!       result = vect_create_new_slp_node (1, VEC_PERM_EXPR);
!       if (SLP_TREE_SCALAR_STMTS (node).length ())
! 	{
! 	  auto &stmts = SLP_TREE_SCALAR_STMTS (result);
! 	  stmts.safe_splice (SLP_TREE_SCALAR_STMTS (node));
! 	  if (from_layout_i != 0)
! 	    vect_slp_permute (m_perms[from_layout_i], stmts, false);
! 	  if (to_layout_i != 0)
! 	    vect_slp_permute (m_perms[to_layout_i], stmts, true);
! 	}
!       SLP_TREE_REPRESENTATIVE (result) = SLP_TREE_REPRESENTATIVE (node);
!       SLP_TREE_LANES (result) = num_lanes;
!       SLP_TREE_VECTYPE (result) = SLP_TREE_VECTYPE (node);
!       result->vertex = -1;
! 
!       auto &lane_perm = SLP_TREE_LANE_PERMUTATION (result);
!       lane_perm.create (num_lanes);
!       for (unsigned j = 0; j < num_lanes; ++j)
! 	lane_perm.quick_push ({ 0, j });
!       if (from_layout_i != 0)
! 	vect_slp_permute (m_perms[from_layout_i], lane_perm, false);
!       if (to_layout_i != 0)
! 	vect_slp_permute (m_perms[to_layout_i], lane_perm, true);
! 
!       node->refcnt++;
!       SLP_TREE_CHILDREN (result).safe_push (node);
!     }
!   m_node_layouts[result_i] = result;
!   return result;
! }
! 
! /* Print the partition graph and layout information to the dump file.  */
  
! void
! vect_optimize_slp_pass::dump ()
! {
!   dump_printf_loc (MSG_NOTE, vect_location,
! 		   "SLP optimize permutes:\n");
!   for (unsigned int layout_i = 1; layout_i < m_perms.length (); ++layout_i)
      {
!       dump_printf_loc (MSG_NOTE, vect_location, "  %d: { ", layout_i);
!       const char *sep = "";
!       for (unsigned int idx : m_perms[layout_i])
! 	{
! 	  dump_printf (MSG_NOTE, "%s%d", sep, idx);
! 	  sep = ", ";
! 	}
!       dump_printf (MSG_NOTE, " }\n");
!     }
!   dump_printf_loc (MSG_NOTE, vect_location,
! 		   "SLP optimize partitions:\n");
!   for (unsigned int partition_i = 0; partition_i < m_partitions.length ();
!        ++partition_i)
!     {
!       auto &partition = m_partitions[partition_i];
!       dump_printf_loc (MSG_NOTE, vect_location,  "  -------------\n");
!       dump_printf_loc (MSG_NOTE, vect_location,
! 		       "  partition %d (layout %d):\n",
! 		       partition_i, partition.layout);
!       dump_printf_loc (MSG_NOTE, vect_location, "    nodes:\n");
!       for (unsigned int order_i = partition.node_begin;
! 	   order_i < partition.node_end; ++order_i)
! 	{
! 	  unsigned int node_i = m_partitioned_nodes[order_i];
! 	  slp_tree node = m_vertices[node_i].node;
! 	  dump_printf_loc (MSG_NOTE, vect_location, "      - %p:\n",
! 			   (void *) m_vertices[node_i].node);
! 	  dump_printf_loc (MSG_NOTE, vect_location,
! 			   "          weight: %f\n",
! 			   m_vertices[node_i].weight.to_double ());
! 	  if (m_vertices[node_i].out_degree)
! 	    dump_printf_loc (MSG_NOTE, vect_location,
! 			     "          out weight: %f (degree %d)\n",
! 			     m_vertices[node_i].out_weight.to_double (),
! 			     m_vertices[node_i].out_degree);
! 	  if (SLP_TREE_CODE (node) == VEC_PERM_EXPR)
! 	    dump_printf_loc (MSG_NOTE, vect_location,
! 			     "          op: VEC_PERM_EXPR\n");
! 	  else if (auto rep = SLP_TREE_REPRESENTATIVE (node))
! 	    dump_printf_loc (MSG_NOTE, vect_location,
! 			     "          op template: %G", rep->stmt);
! 	}
!       dump_printf_loc (MSG_NOTE, vect_location, "    edges:\n");
!       for (unsigned int order_i = partition.node_begin;
! 	   order_i < partition.node_end; ++order_i)
! 	{
! 	  unsigned int node_i = m_partitioned_nodes[order_i];
! 	  auto print_edge = [&](graph_edge *, unsigned int node2_i,
! 				unsigned int partition2_i)
! 	    {
! 	      if (partition2_i < partition_i)
! 		dump_printf_loc (MSG_NOTE, vect_location,
! 				 "      - %p [%d] --> %p\n",
! 				 (void *) m_vertices[node2_i].node,
! 				 partition2_i,
! 				 (void *) m_vertices[node_i].node);
! 	      else
! 		dump_printf_loc (MSG_NOTE, vect_location,
! 				 "      - %p --> [%d] %p\n",
! 				 (void *) m_vertices[node_i].node,
! 				 partition2_i,
! 				 (void *) m_vertices[node2_i].node);
! 	    };
! 	  for_each_partition_edge (node_i, print_edge);
! 	}
  
!       for (unsigned int layout_i = 0; layout_i < m_perms.length (); ++layout_i)
! 	{
! 	  auto &layout_costs = partition_layout_costs (partition_i, layout_i);
! 	  if (layout_costs.is_possible ())
! 	    {
! 	      dump_printf_loc (MSG_NOTE, vect_location,
! 			       "    layout %d:%s\n", layout_i,
! 			       partition.layout == int (layout_i)
! 			       ? " (*)" : "");
! #define TEMPLATE "{depth: %f, total: %f}"
! 	      dump_printf_loc (MSG_NOTE, vect_location,
! 			       "        " TEMPLATE "\n",
! 			       layout_costs.in_cost.depth.to_double (),
! 			       layout_costs.in_cost.total.to_double ());
! 	      dump_printf_loc (MSG_NOTE, vect_location,
! 			       "      + " TEMPLATE "\n",
! 			       layout_costs.out_cost.depth.to_double (),
! 			       layout_costs.out_cost.total.to_double ());
! 	      dump_printf_loc (MSG_NOTE, vect_location,
! 			       "      = " TEMPLATE "\n",
! 			       layout_costs.cost.depth.to_double (),
! 			       layout_costs.cost.total.to_double ());
! #undef TEMPLATE
! 	    }
! 	  else
! 	    dump_printf_loc (MSG_NOTE, vect_location,
! 			     "    layout %d: rejected\n", layout_i);
! 	}
!     }
! }
  
! /* Apply the chosen vector layouts to the SLP graph.  */
  
! void
! vect_optimize_slp_pass::materialize ()
! {
!   m_partition_layout_costs.release ();
!   m_node_layouts.safe_grow_cleared (m_vertices.length () * m_perms.length ());
  
+   for (unsigned int node_i : m_partitioned_nodes)
+     {
+       /* Rearrange the scalar statements to match the chosen layout.  */
+       slp_tree node = m_vertices[node_i].node;
+       int layout_i = m_partitions[m_vertices[node_i].partition].layout;
+       gcc_assert (layout_i >= 0);
+       if (layout_i > 0)
+ 	vect_slp_permute (m_perms[layout_i],
+ 			  SLP_TREE_SCALAR_STMTS (node), true);
+ 
+       /* Update load and lane permutes.  */
        if (SLP_TREE_CODE (node) == VEC_PERM_EXPR)
  	{
  	  /* Apply the common permutes to the input vectors.  */
! 	  /* FIXME: check whether the target accepts this folding and
! 	     handle it like a normal node if not.  That would include
! 	     adjusting the follow-up loop below.  */
! 	  unsigned int opno;
! 	  slp_tree child;
! 	  FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), opno, child)
! 	    {
! 	      unsigned int partition2_i = m_vertices[child->vertex].partition;
! 	      int layout2_i = m_partitions[partition2_i].layout;
! 	      if (layout2_i == 0)
! 		continue;
! 
  	      /* If the node is already a permute node we can apply
  		 the permutation to the lane selection, effectively
  		 materializing it on the incoming vectors.  */
***************
*** 3987,4141 ****
  		dump_printf_loc (MSG_NOTE, vect_location,
  				 "simplifying permute node %p\n",
  				 node);
! 	      for (unsigned k = 0;
! 		   k < SLP_TREE_LANE_PERMUTATION (node).length (); ++k)
! 		SLP_TREE_LANE_PERMUTATION (node)[k].second
! 		  = perms[perm_in][SLP_TREE_LANE_PERMUTATION (node)[k].second];
! 	    }
! 	  /* Apply the anticipated output permute to the permute and
! 	     stmt vectors.  */
! 	  int perm_out = vertices[i].perm_out;
! 	  if (perm_out > 0)
! 	    {
! 	      vect_slp_permute (perms[perm_out],
! 				SLP_TREE_SCALAR_STMTS (node), true);
! 	      vect_slp_permute (perms[perm_out],
! 				SLP_TREE_LANE_PERMUTATION (node), true);
! 	    }
! 	}
!       else if (vertices[i].get_perm_materialized () != 0)
! 	{
! 	  if (SLP_TREE_LOAD_PERMUTATION (node).exists ())
! 	    /* For loads simply drop the permutation, the load permutation
! 	       already performs the desired permutation.  */
! 	    ;
! 	  else if (SLP_TREE_LANE_PERMUTATION (node).exists ())
! 	    gcc_unreachable ();
! 	  else
! 	    {
! 	      if (dump_enabled_p ())
! 		dump_printf_loc (MSG_NOTE, vect_location,
! 				 "inserting permute node in place of %p\n",
! 				 node);
! 
! 	      /* Make a copy of NODE and in-place change it to a
! 		 VEC_PERM node to permute the lanes of the copy.  */
! 	      slp_tree copy = new _slp_tree;
! 	      SLP_TREE_CHILDREN (copy) = SLP_TREE_CHILDREN (node);
! 	      SLP_TREE_CHILDREN (node) = vNULL;
! 	      SLP_TREE_SCALAR_STMTS (copy)
! 		= SLP_TREE_SCALAR_STMTS (node).copy ();
! 	      vect_slp_permute (perms[perm_in],
! 				SLP_TREE_SCALAR_STMTS (copy), true);
! 	      gcc_assert (!SLP_TREE_SCALAR_OPS (node).exists ());
! 	      SLP_TREE_REPRESENTATIVE (copy) = SLP_TREE_REPRESENTATIVE (node);
! 	      gcc_assert (!SLP_TREE_LOAD_PERMUTATION (node).exists ());
! 	      SLP_TREE_LANE_PERMUTATION (copy)
! 		= SLP_TREE_LANE_PERMUTATION (node);
! 	      SLP_TREE_LANE_PERMUTATION (node) = vNULL;
! 	      SLP_TREE_VECTYPE (copy) = SLP_TREE_VECTYPE (node);
! 	      copy->refcnt = 1;
! 	      copy->max_nunits = node->max_nunits;
! 	      SLP_TREE_DEF_TYPE (copy) = SLP_TREE_DEF_TYPE (node);
! 	      SLP_TREE_LANES (copy) = SLP_TREE_LANES (node);
! 	      SLP_TREE_CODE (copy) = SLP_TREE_CODE (node);
! 
! 	      /* Now turn NODE into a VEC_PERM.  */
! 	      SLP_TREE_CHILDREN (node).safe_push (copy);
! 	      SLP_TREE_LANE_PERMUTATION (node).create (SLP_TREE_LANES (node));
! 	      for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j)
! 		SLP_TREE_LANE_PERMUTATION (node)
! 		  .quick_push (std::make_pair (0, perms[perm_in][j]));
! 	      SLP_TREE_CODE (node) = VEC_PERM_EXPR;
! 	    }
! 	}
!       else if (perm_in > 0) /* perm_in == perm_out */
! 	{
! 	  /* Apply the reverse permutation to our stmts.  */
! 	  vect_slp_permute (perms[perm_in],
! 			    SLP_TREE_SCALAR_STMTS (node), true);
! 	  /* And to the lane/load permutation, which we can simply
! 	     make regular by design.  */
! 	  if (SLP_TREE_LOAD_PERMUTATION (node).exists ())
! 	    {
! 	      gcc_assert (!SLP_TREE_LANE_PERMUTATION (node).exists ());
! 	      /* ???  When we handle non-bijective permutes the idea
! 		 is that we can force the load-permutation to be
! 		 { min, min + 1, min + 2, ... max }.  But then the
! 		 scalar defs might no longer match the lane content
! 		 which means wrong-code with live lane vectorization.
! 		 So we possibly have to have NULL entries for those.  */
! 	      vect_slp_permute (perms[perm_in],
! 				SLP_TREE_LOAD_PERMUTATION (node), true);
! 	    }
! 	  else if (SLP_TREE_LANE_PERMUTATION (node).exists ())
! 	    gcc_unreachable ();
  	}
      }
  
!   /* Elide any permutations at BB reduction roots.  */
!   if (is_a <bb_vec_info> (vinfo))
      {
!       for (slp_instance instance : vinfo->slp_instances)
  	{
! 	  if (SLP_INSTANCE_KIND (instance) != slp_inst_kind_bb_reduc)
  	    continue;
! 	  slp_tree old = SLP_INSTANCE_TREE (instance);
! 	  if (SLP_TREE_CODE (old) == VEC_PERM_EXPR
! 	      && SLP_TREE_CHILDREN (old).length () == 1)
  	    {
! 	      slp_tree child = SLP_TREE_CHILDREN (old)[0];
! 	      if (SLP_TREE_DEF_TYPE (child) == vect_external_def)
! 		{
! 		  /* Preserve the special VEC_PERM we use to shield existing
! 		     vector defs from the rest.  But make it a no-op.  */
! 		  unsigned i = 0;
! 		  for (std::pair<unsigned, unsigned> &p
! 		       : SLP_TREE_LANE_PERMUTATION (old))
! 		    p.second = i++;
! 		}
! 	      else
! 		{
! 		  SLP_INSTANCE_TREE (instance) = child;
! 		  SLP_TREE_REF_COUNT (child)++;
! 		  vect_free_slp_tree (old);
! 		}
! 	    }
! 	  else if (SLP_TREE_LOAD_PERMUTATION (old).exists ()
! 		   && SLP_TREE_REF_COUNT (old) == 1
! 		   && vertices[old->vertex].get_perm_materialized () != 0)
! 	    {
! 	      /* ???  For loads the situation is more complex since
! 		 we can't modify the permute in place in case the
! 		 node is used multiple times.  In fact for loads this
! 		 should be somehow handled in the propagation engine.  */
! 	      /* Apply the reverse permutation to our stmts.  */
! 	      int perm = vertices[old->vertex].get_perm_materialized ();
! 	      vect_slp_permute (perms[perm],
! 				SLP_TREE_SCALAR_STMTS (old), true);
! 	      vect_slp_permute (perms[perm],
! 				SLP_TREE_LOAD_PERMUTATION (old), true);
  	    }
  	}
      }
  
!   /* Free the perms vector used for propagation.  */
!   while (!perms.is_empty ())
!     perms.pop ().release ();
!   free_graph (slpg);
! 
  
!   /* Now elide load permutations that are not necessary.  */
!   for (i = 0; i < leafs.length (); ++i)
      {
!       node = vertices[leafs[i]].node;
        if (!SLP_TREE_LOAD_PERMUTATION (node).exists ())
  	continue;
  
        /* In basic block vectorization we allow any subchain of an interleaving
  	 chain.
  	 FORNOW: not in loop SLP because of realignment complications.  */
!       if (is_a <bb_vec_info> (vinfo))
  	{
  	  bool subchain_p = true;
  	  stmt_vec_info next_load_info = NULL;
--- 5094,5173 ----
  		dump_printf_loc (MSG_NOTE, vect_location,
  				 "simplifying permute node %p\n",
  				 node);
! 	      for (auto &entry : SLP_TREE_LANE_PERMUTATION (node))
! 		if (entry.first == opno)
! 		  entry.second = m_perms[layout2_i][entry.second];
! 	    }
! 	  /* Materialize the reversal of m_perms[layout_i] that the
! 	     chosen layout represents.  */
! 	  if (layout_i > 0)
! 	    vect_slp_permute (m_perms[layout_i],
! 			      SLP_TREE_LANE_PERMUTATION (node), true);
! 	}
!       else
! 	{
! 	  gcc_assert (!SLP_TREE_LANE_PERMUTATION (node).exists ());
! 	  auto &load_perm = SLP_TREE_LOAD_PERMUTATION (node);
! 	  if (layout_i > 0)
! 	    /* ???  When we handle non-bijective permutes the idea
! 	       is that we can force the load-permutation to be
! 	       { min, min + 1, min + 2, ... max }.  But then the
! 	       scalar defs might no longer match the lane content
! 	       which means wrong-code with live lane vectorization.
! 	       So we possibly have to have NULL entries for those.  */
! 	    vect_slp_permute (m_perms[layout_i], load_perm, true);
  	}
      }
  
!   /* Do this before any nodes disappear, since it involves a walk
!      over the leaves.  */
!   remove_redundant_permutes ();
! 
!   /* Replace each child with a correctly laid-out version.  */
!   for (unsigned int node_i : m_partitioned_nodes)
      {
!       slp_tree node = m_vertices[node_i].node;
! 
!       /* The loop above adjusted permutes for each child's chosen layout,
! 	 so we don't need to adjust them here.  */
!       if (SLP_TREE_CODE (node) == VEC_PERM_EXPR)
! 	continue;
! 
!       int layout_i = m_partitions[m_vertices[node_i].partition].layout;
!       gcc_assert (layout_i >= 0);
!       unsigned j;
!       slp_tree child;
!       FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), j, child)
  	{
! 	  if (!child)
  	    continue;
! 
! 	  slp_tree new_child = get_result_with_layout (child, layout_i);
! 	  if (new_child != child)
  	    {
! 	      vect_free_slp_tree (child);
! 	      SLP_TREE_CHILDREN (node)[j] = new_child;
! 	      new_child->refcnt += 1;
  	    }
  	}
      }
+ }
  
! /* Elide load permutations that are not necessary.  Such permutations might
!    be pre-existing, rather than created by the layout optimizations.  */
  
! void
! vect_optimize_slp_pass::remove_redundant_permutes ()
! {
!   for (unsigned int node_i : m_leafs)
      {
!       slp_tree node = m_vertices[node_i].node;
        if (!SLP_TREE_LOAD_PERMUTATION (node).exists ())
  	continue;
  
        /* In basic block vectorization we allow any subchain of an interleaving
  	 chain.
  	 FORNOW: not in loop SLP because of realignment complications.  */
!       if (is_a <bb_vec_info> (m_vinfo))
  	{
  	  bool subchain_p = true;
  	  stmt_vec_info next_load_info = NULL;
***************
*** 4160,4165 ****
--- 5192,5198 ----
  	}
        else
  	{
+ 	  loop_vec_info loop_vinfo = as_a <loop_vec_info> (m_vinfo);
  	  stmt_vec_info load_info;
  	  bool this_load_permuted = false;
  	  unsigned j;
***************
*** 4175,4182 ****
  	      /* The load requires permutation when unrolling exposes
  		 a gap either because the group is larger than the SLP
  		 group-size or because there is a gap between the groups.  */
! 	      && (known_eq (LOOP_VINFO_VECT_FACTOR
! 			      (as_a <loop_vec_info> (vinfo)), 1U)
  		  || ((SLP_TREE_LANES (node) == DR_GROUP_SIZE (first_stmt_info))
  		      && DR_GROUP_GAP (first_stmt_info) == 0)))
  	    {
--- 5208,5214 ----
  	      /* The load requires permutation when unrolling exposes
  		 a gap either because the group is larger than the SLP
  		 group-size or because there is a gap between the groups.  */
! 	      && (known_eq (LOOP_VINFO_VECT_FACTOR (loop_vinfo), 1U)
  		  || ((SLP_TREE_LANES (node) == DR_GROUP_SIZE (first_stmt_info))
  		      && DR_GROUP_GAP (first_stmt_info) == 0)))
  	    {
***************
*** 4187,4192 ****
--- 5219,5258 ----
      }
  }
  
+ /* Main entry point for the optimization pass.  */
+ 
+ void
+ vect_optimize_slp_pass::run ()
+ {
+   build_graph ();
+   create_partitions ();
+   start_choosing_layouts ();
+   if (m_perms.length () > 1)
+     {
+       forward_pass ();
+       backward_pass ();
+       if (dump_enabled_p ())
+ 	dump ();
+       materialize ();
+     }
+   else
+     remove_redundant_permutes ();
+   /* Free the perms vector used for propagation.  */
+   while (!m_perms.is_empty ())
+     m_perms.pop ().release ();
+   free_graph (m_slpg);
+ }
+ 
+ /* Optimize the SLP graph of VINFO.  */
+ 
+ void
+ vect_optimize_slp (vec_info *vinfo)
+ {
+   if (vinfo->slp_instances.is_empty ())
+     return;
+   vect_optimize_slp_pass (vinfo).run ();
+ }
+ 
  /* Gather loads reachable from the individual SLP graph entries.  */
  
  void
***************
*** 6647,6653 ****
  {
    stmt_vec_info stmt_info = SLP_TREE_SCALAR_STMTS (node)[0];
    int vec_index = 0;
!   tree vectype = STMT_VINFO_VECTYPE (stmt_info);
    unsigned int group_size = SLP_TREE_SCALAR_STMTS (node).length ();
    unsigned int mask_element;
    machine_mode mode;
--- 7713,7719 ----
  {
    stmt_vec_info stmt_info = SLP_TREE_SCALAR_STMTS (node)[0];
    int vec_index = 0;
!   tree vectype = SLP_TREE_VECTYPE (node);
    unsigned int group_size = SLP_TREE_SCALAR_STMTS (node).length ();
    unsigned int mask_element;
    machine_mode mode;


* Re: [PATCH] RFC: Extend SLP permutation optimisations
  2022-08-02  7:20 [PATCH] RFC: Extend SLP permutation optimisations Richard Sandiford
@ 2022-08-03 11:23 ` Richard Biener
  2022-08-04  9:06   ` Richard Sandiford
  0 siblings, 1 reply; 3+ messages in thread
From: Richard Biener @ 2022-08-03 11:23 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: gcc-patches

On Tue, 2 Aug 2022, Richard Sandiford wrote:

> Currently SLP tries to force permute operations "down" the graph
> from loads in the hope of reducing the total number of permutes
> needed or (in the best case) removing the need for the permutes
> entirely.  This patch tries to extend it as follows:
> 
> - Allow loads to take a different permutation from the one they
>   started with, rather than choosing between "original permutation"
>   and "no permutation".
> 
> - Allow changes in both directions, if the target supports the
>   reverse permute operation.
> 
> - Treat the placement of permute operations as a two-way dataflow
>   problem: after propagating information from leaves to roots (as now),
>   propagate information back up the graph.
> 
> - Take execution frequency into account when optimising for speed,
>   so that (for example) permutes inside loops have a higher cost
>   than permutes outside loops.
> 
> - Try to reduce the total number of permutes when optimising for
>   size, even if that increases the number of permutes on a given
>   execution path.
> 
> See the big block comment above vect_optimize_slp_pass for
> a detailed description.
> 
> A while back I posted a patch to extend the existing optimisation
> to non-consecutive loads.  This patch doesn't include that change
> (although it's a possible future addition).
> 
> The original motivation for doing this was to add a framework that would
> allow other layout differences in future.  The two main ones are:
> 
> - Make it easier to represent predicated operations, including
>   predicated operations with gaps.  E.g.:
> 
>      a[0] += 1;
>      a[1] += 1;
>      a[3] += 1;
> 
>   could be a single load/add/store for SVE.  We could handle this
>   by representing a layout such as { 0, 1, _, 2 } or { 0, 1, _, 3 }
>   (depending on what's being counted).  We might need to move
>   elements between lanes at various points, like with permutes.
> 
>   (This would first mean adding support for stores with gaps.)
> 
> - Make it easier to switch between an even/odd and unpermuted layout
>   when switching between wide and narrow elements.  E.g. if a widening
>   operation produces an even vector and an odd vector, we should try
>   to keep operations on the wide elements in that order rather than
>   force them to be permuted back "in order".
> 
> To give some examples of what the current patch does:
> 
> int f1(int *__restrict a, int *__restrict b, int *__restrict c,
>        int *__restrict d)
> {
>   a[0] = (b[1] << c[3]) - d[1];
>   a[1] = (b[0] << c[2]) - d[0];
>   a[2] = (b[3] << c[1]) - d[3];
>   a[3] = (b[2] << c[0]) - d[2];
> }
> 
> continues to produce the same code as before when optimising for
> speed: b, c and d are permuted at load time.  But when optimising
> for size we instead permute c into the same order as b+d and then
> permute the result of the arithmetic into the same order as a:
> 
>         ldr     q1, [x2]
>         ldr     q0, [x1]
>         ext     v1.16b, v1.16b, v1.16b, #8     // <------
>         sshl    v0.4s, v0.4s, v1.4s
>         ldr     q1, [x3]
>         sub     v0.4s, v0.4s, v1.4s
>         rev64   v0.4s, v0.4s                   // <------
>         str     q0, [x0]
>         ret
> 
> The following function:
> 
> int f2(int *__restrict a, int *__restrict b, int *__restrict c,
>        int *__restrict d)
> {
>   a[0] = (b[3] << c[3]) - d[3];
>   a[1] = (b[2] << c[2]) - d[2];
>   a[2] = (b[1] << c[1]) - d[1];
>   a[3] = (b[0] << c[0]) - d[0];
> }
> 
> continues to push the reverse down to just before the store,
> like the current code does.
> 
> In:
> 
> int f3(int *__restrict a, int *__restrict b, int *__restrict c,
>        int *__restrict d)
> {
>   for (int i = 0; i < 100; ++i)
>     {
>       a[0] = (a[0] + c[3]);
>       a[1] = (a[1] + c[2]);
>       a[2] = (a[2] + c[1]);
>       a[3] = (a[3] + c[0]);
>       c += 4;
>     }
> }
> 
> the loads of a are hoisted and the stores of a are sunk, so that
> only the load from c happens in the loop.  When optimising for
> speed, we prefer to have the loop operate on the reversed layout,
> changing on entry and exit from the loop:
> 
>         mov     x3, x0
>         adrp    x0, .LC0
>         add     x1, x2, 1600
>         ldr     q2, [x0, #:lo12:.LC0]
>         ldr     q0, [x3]
>         mov     v1.16b, v0.16b
>         tbl     v0.16b, {v0.16b - v1.16b}, v2.16b    // <--------
>         .p2align 3,,7
> .L6:
>         ldr     q1, [x2], 16
>         add     v0.4s, v0.4s, v1.4s
>         cmp     x2, x1
>         bne     .L6
>         mov     v1.16b, v0.16b
>         adrp    x0, .LC0
>         ldr     q2, [x0, #:lo12:.LC0]
>         tbl     v0.16b, {v0.16b - v1.16b}, v2.16b    // <--------
>         str     q0, [x3]
>         ret
> 
> Similarly, for the very artificial testcase:
> 
> int f4(int *__restrict a, int *__restrict b, int *__restrict c,
>        int *__restrict d)
> {
>   int a0 = a[0];
>   int a1 = a[1];
>   int a2 = a[2];
>   int a3 = a[3];
>   for (int i = 0; i < 100; ++i)
>     {
>       a0 ^= c[0];
>       a1 ^= c[1];
>       a2 ^= c[2];
>       a3 ^= c[3];
>       c += 4;
>       for (int j = 0; j < 100; ++j)
> 	{
> 	  a0 += d[1];
> 	  a1 += d[0];
> 	  a2 += d[3];
> 	  a3 += d[2];
> 	  d += 4;
> 	}
>       b[0] = a0;
>       b[1] = a1;
>       b[2] = a2;
>       b[3] = a3;
>       b += 4;
>     }
>   a[0] = a0;
>   a[1] = a1;
>   a[2] = a2;
>   a[3] = a3;
> }
> 
> the a vector in the inner loop maintains the order { 1, 0, 3, 2 },
> even though it's part of an SCC that includes the outer loop.
> In other words, this is a motivating case for not assigning
> permutes at SCC granularity.  The code we get is:
> 
>         ldr     q0, [x0]
>         mov     x4, x1
>         mov     x5, x0
>         add     x1, x3, 1600
>         add     x3, x4, 1600
>         .p2align 3,,7
> .L11:
>         ldr     q1, [x2], 16
>         sub     x0, x1, #1600
>         eor     v0.16b, v1.16b, v0.16b
>         rev64   v0.4s, v0.4s              // <---
>         .p2align 3,,7
> .L10:
>         ldr     q1, [x0], 16
>         add     v0.4s, v0.4s, v1.4s
>         cmp     x0, x1
>         bne     .L10
>         rev64   v0.4s, v0.4s              // <---
>         add     x1, x0, 1600
>         str     q0, [x4], 16
>         cmp     x3, x4
>         bne     .L11
>         str     q0, [x5]
>         ret
> 
> The patch is still incomplete, hence the FIXMEs, but it's at the stage
> where it passes bootstrap & regtest on aarch64-linux-gnu.  (There's one
> regression in slp-11b.c, but I think it's arguably restoring the
> expected load-lanes behaviour.)  Does this look like a plausible
> direction, before I go filling in the FIXME parts?

Yes, it looks nice!  I wonder if you have spent much thought on
layout optimisation for SLP patterns and/or how to integrate the two?

Thanks,
Richard.

> Thanks,
> Richard
> 
> ---
>  gcc/graphds.cc        |   13 +-
>  gcc/graphds.h         |    3 +-
>  gcc/hash-map-traits.h |   74 +-
>  gcc/hash-traits.h     |   96 ++-
>  gcc/tree-vect-slp.cc  | 1870 ++++++++++++++++++++++++++++++++---------
>  5 files changed, 1599 insertions(+), 457 deletions(-)
> 
> *** /tmp/iBbApg_graphds.cc	2022-08-02 08:16:23.975354155 +0100
> --- gcc/graphds.cc	2022-08-01 20:44:22.287691305 +0100
> ***************
> *** 281,287 ****
>      numbers assigned by the previous pass.  If SUBGRAPH is not NULL, it
>      specifies the subgraph of G whose strongly connected components we want
>      to determine.  If SKIP_EDGE_P is not NULL, it points to a callback function.
> !    Edge E will be skipped if callback function returns true.
>   
>      After running this function, v->component is the number of the strongly
>      connected component for each vertex of G.  Returns the number of the
> --- 281,294 ----
>      numbers assigned by the previous pass.  If SUBGRAPH is not NULL, it
>      specifies the subgraph of G whose strongly connected components we want
>      to determine.  If SKIP_EDGE_P is not NULL, it points to a callback function.
> !    Edge E will be skipped if callback function returns true.  If SCC_GROUPING
> !    is not null, the nodes will be added to it in the following order:
> ! 
> !    - If SCC A is a direct or indirect predecessor of SCC B in the SCC dag,
> !      A's nodes come before B's nodes.
> ! 
> !    - All of an SCC's nodes are listed consecutively, although the order
> !      of the nodes within an SCC is not really meaningful.
>   
>      After running this function, v->component is the number of the strongly
>      connected component for each vertex of G.  Returns the number of the
> ***************
> *** 289,295 ****
>   
>   int
>   graphds_scc (struct graph *g, bitmap subgraph,
> ! 	     skip_edge_callback skip_edge_p)
>   {
>     int *queue = XNEWVEC (int, g->n_vertices);
>     vec<int> postorder = vNULL;
> --- 296,302 ----
>   
>   int
>   graphds_scc (struct graph *g, bitmap subgraph,
> ! 	     skip_edge_callback skip_edge_p, vec<int> *scc_grouping)
>   {
>     int *queue = XNEWVEC (int, g->n_vertices);
>     vec<int> postorder = vNULL;
> ***************
> *** 317,323 ****
>   
>     for (i = 0; i < nq; i++)
>       queue[i] = postorder[nq - i - 1];
> !   comp = graphds_dfs (g, queue, nq, NULL, true, subgraph, skip_edge_p);
>   
>     free (queue);
>     postorder.release ();
> --- 324,330 ----
>   
>     for (i = 0; i < nq; i++)
>       queue[i] = postorder[nq - i - 1];
> !   comp = graphds_dfs (g, queue, nq, scc_grouping, true, subgraph, skip_edge_p);
>   
>     free (queue);
>     postorder.release ();
> *** /tmp/eRUAEk_graphds.h	2022-08-02 08:16:23.987354123 +0100
> --- gcc/graphds.h	2022-08-01 20:44:18.835700593 +0100
> ***************
> *** 58,64 ****
>   typedef bool (*skip_edge_callback) (struct graph_edge *);
>   int graphds_dfs (struct graph *, int *, int,
>   		 vec<int> *, bool, bitmap, skip_edge_callback = NULL);
> ! int graphds_scc (struct graph *, bitmap, skip_edge_callback = NULL);
>   void graphds_domtree (struct graph *, int, int *, int *, int *);
>   typedef void (*graphds_edge_callback) (struct graph *,
>   				       struct graph_edge *, void *);
> --- 58,65 ----
>   typedef bool (*skip_edge_callback) (struct graph_edge *);
>   int graphds_dfs (struct graph *, int *, int,
>   		 vec<int> *, bool, bitmap, skip_edge_callback = NULL);
> ! int graphds_scc (struct graph *, bitmap, skip_edge_callback = NULL,
> ! 		 vec<int> * = NULL);
>   void graphds_domtree (struct graph *, int, int *, int *, int *);
>   typedef void (*graphds_edge_callback) (struct graph *,
>   				       struct graph_edge *, void *);
> *** /tmp/4zWIek_hash-map-traits.h	2022-08-02 08:16:23.995354102 +0100
> --- gcc/hash-map-traits.h	2022-08-01 20:44:22.287691305 +0100
> ***************
> *** 105,118 ****
>     static const bool maybe_mx = false;
>   };
>   
> ! /* Implement traits for a hash_map with values of type Value for cases
> !    in which the key cannot represent empty and deleted slots.  Instead
> !    record empty and deleted entries in Value.  Derived classes must
> !    implement the hash and equal_keys functions.  */
>   
> ! template <typename Value>
>   struct unbounded_hashmap_traits
>   {
>     template <typename T> static inline void remove (T &);
>     static const bool empty_zero_p = default_hash_traits <Value>::empty_zero_p;
>     template <typename T> static inline bool is_empty (const T &);
> --- 105,123 ----
>     static const bool maybe_mx = false;
>   };
>   
> ! /* Implement traits for a hash_map with keys of type Key and values of
> !    type Value for cases in which the key cannot represent empty and
> !    deleted slots.  Instead record empty and deleted entries in Value.  */
>   
> ! template <typename Key, typename Value>
>   struct unbounded_hashmap_traits
>   {
> +   typedef typename Key::value_type key_type;
> + 
> +   static hashval_t hash (const typename Key::value_type &);
> +   static bool equal_keys (const typename Key::value_type &,
> + 			  const typename Key::compare_type &);
> + 
>     template <typename T> static inline void remove (T &);
>     static const bool empty_zero_p = default_hash_traits <Value>::empty_zero_p;
>     template <typename T> static inline bool is_empty (const T &);
> ***************
> *** 121,162 ****
>     template <typename T> static inline void mark_deleted (T &);
>   };
>   
> ! template <typename Value>
>   template <typename T>
>   inline void
> ! unbounded_hashmap_traits <Value>::remove (T &entry)
>   {
>     default_hash_traits <Value>::remove (entry.m_value);
>   }
>   
> ! template <typename Value>
>   template <typename T>
>   inline bool
> ! unbounded_hashmap_traits <Value>::is_empty (const T &entry)
>   {
>     return default_hash_traits <Value>::is_empty (entry.m_value);
>   }
>   
> ! template <typename Value>
>   template <typename T>
>   inline bool
> ! unbounded_hashmap_traits <Value>::is_deleted (const T &entry)
>   {
>     return default_hash_traits <Value>::is_deleted (entry.m_value);
>   }
>   
> ! template <typename Value>
>   template <typename T>
>   inline void
> ! unbounded_hashmap_traits <Value>::mark_empty (T &entry)
>   {
>     default_hash_traits <Value>::mark_empty (entry.m_value);
>   }
>   
> ! template <typename Value>
>   template <typename T>
>   inline void
> ! unbounded_hashmap_traits <Value>::mark_deleted (T &entry)
>   {
>     default_hash_traits <Value>::mark_deleted (entry.m_value);
>   }
> --- 126,184 ----
>     template <typename T> static inline void mark_deleted (T &);
>   };
>   
> ! template <typename Key, typename Value>
> ! inline hashval_t
> ! unbounded_hashmap_traits <Key, Value>
> ! ::hash (const typename Key::value_type &key)
> ! {
> !   return Key::hash (key);
> ! }
> ! 
> ! template <typename Key, typename Value>
> ! inline bool
> ! unbounded_hashmap_traits <Key, Value>
> ! ::equal_keys (const typename Key::value_type &x,
> ! 	      const typename Key::compare_type &y)
> ! {
> !   return Key::equal (x, y);
> ! }
> ! 
> ! template <typename Key, typename Value>
>   template <typename T>
>   inline void
> ! unbounded_hashmap_traits <Key, Value>::remove (T &entry)
>   {
>     default_hash_traits <Value>::remove (entry.m_value);
>   }
>   
> ! template <typename Key, typename Value>
>   template <typename T>
>   inline bool
> ! unbounded_hashmap_traits <Key, Value>::is_empty (const T &entry)
>   {
>     return default_hash_traits <Value>::is_empty (entry.m_value);
>   }
>   
> ! template <typename Key, typename Value>
>   template <typename T>
>   inline bool
> ! unbounded_hashmap_traits <Key, Value>::is_deleted (const T &entry)
>   {
>     return default_hash_traits <Value>::is_deleted (entry.m_value);
>   }
>   
> ! template <typename Key, typename Value>
>   template <typename T>
>   inline void
> ! unbounded_hashmap_traits <Key, Value>::mark_empty (T &entry)
>   {
>     default_hash_traits <Value>::mark_empty (entry.m_value);
>   }
>   
> ! template <typename Key, typename Value>
>   template <typename T>
>   inline void
> ! unbounded_hashmap_traits <Key, Value>::mark_deleted (T &entry)
>   {
>     default_hash_traits <Value>::mark_deleted (entry.m_value);
>   }
> ***************
> *** 166,190 ****
>      slots.  */
>   
>   template <typename Key, typename Value>
> ! struct unbounded_int_hashmap_traits : unbounded_hashmap_traits <Value>
> ! {
> !   typedef Key key_type;
> !   static inline hashval_t hash (Key);
> !   static inline bool equal_keys (Key, Key);
> ! };
> ! 
> ! template <typename Key, typename Value>
> ! inline hashval_t
> ! unbounded_int_hashmap_traits <Key, Value>::hash (Key k)
> ! {
> !   return k;
> ! }
> ! 
> ! template <typename Key, typename Value>
> ! inline bool
> ! unbounded_int_hashmap_traits <Key, Value>::equal_keys (Key k1, Key k2)
> ! {
> !   return k1 == k2;
> ! }
>   
>   #endif // HASH_MAP_TRAITS_H
> --- 188,194 ----
>      slots.  */
>   
>   template <typename Key, typename Value>
> ! using unbounded_int_hashmap_traits
> !   = unbounded_hashmap_traits <int_hash_base <Key>, Value>;
>   
>   #endif // HASH_MAP_TRAITS_H
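
The layering works out quite neatly here: all key-specific behaviour
comes from a standalone traits class (int_hash_base, or vec_hash_base
from hash-traits.h below) and unbounded_hashmap_traits simply forwards
to it.  A self-contained sketch of the forwarding pattern, using demo
names rather than the GCC classes and omitting the Value-side
empty/deleted machinery:

#include <cstdio>

template <typename Type>
struct int_hash_base_demo
{
  typedef Type value_type;
  typedef Type compare_type;
  static unsigned hash (Type x) { return (unsigned) x; }
  static bool equal (Type x, Type y) { return x == y; }
};

template <typename Key>
struct unbounded_traits_demo
{
  typedef typename Key::value_type key_type;
  static unsigned hash (const typename Key::value_type &k)
  { return Key::hash (k); }
  static bool equal_keys (const typename Key::value_type &x,
			  const typename Key::compare_type &y)
  { return Key::equal (x, y); }
};

/* Mirrors the unbounded_int_hashmap_traits alias above.  */
template <typename Key>
using unbounded_int_traits_demo
  = unbounded_traits_demo <int_hash_base_demo <Key> >;

int main ()
{
  typedef unbounded_int_traits_demo <int> traits;
  printf ("%u %d\n", traits::hash (42), (int) traits::equal_keys (42, 42));
}
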
> *** /tmp/FL7vGb_hash-traits.h	2022-08-02 08:16:24.011354061 +0100
> --- gcc/hash-traits.h	2022-08-01 20:44:22.287691305 +0100
> ***************
> *** 85,125 ****
>   {
>   }
>   
>   
> ! /* Hasher for integer type Type in which Empty is a spare value that can be
> !    used to mark empty slots.  If Deleted != Empty then Deleted is another
> !    spare value that can be used for deleted slots; if Deleted == Empty then
> !    hash table entries cannot be deleted.  */
> ! 
> ! template <typename Type, Type Empty, Type Deleted = Empty>
> ! struct int_hash : typed_noop_remove <Type>
>   {
>     typedef Type value_type;
>     typedef Type compare_type;
>   
>     static inline hashval_t hash (value_type);
>     static inline bool equal (value_type existing, value_type candidate);
> -   static inline void mark_deleted (Type &);
> -   static const bool empty_zero_p = Empty == 0;
> -   static inline void mark_empty (Type &);
> -   static inline bool is_deleted (Type);
> -   static inline bool is_empty (Type);
>   };
>   
> ! template <typename Type, Type Empty, Type Deleted>
>   inline hashval_t
> ! int_hash <Type, Empty, Deleted>::hash (value_type x)
>   {
>     return x;
>   }
>   
> ! template <typename Type, Type Empty, Type Deleted>
>   inline bool
> ! int_hash <Type, Empty, Deleted>::equal (value_type x, value_type y)
>   {
>     return x == y;
>   }
>   
>   template <typename Type, Type Empty, Type Deleted>
>   inline void
>   int_hash <Type, Empty, Deleted>::mark_deleted (Type &x)
> --- 85,135 ----
>   {
>   }
>   
> + /* Base traits for integer type Type, providing just the hash and
> +    comparison functionality.  */
>   
> ! template <typename Type>
> ! struct int_hash_base : typed_noop_remove <Type>
>   {
>     typedef Type value_type;
>     typedef Type compare_type;
>   
>     static inline hashval_t hash (value_type);
>     static inline bool equal (value_type existing, value_type candidate);
>   };
>   
> ! template <typename Type>
>   inline hashval_t
> ! int_hash_base <Type>::hash (value_type x)
>   {
>     return x;
>   }
>   
> ! template <typename Type>
>   inline bool
> ! int_hash_base <Type>::equal (value_type x, value_type y)
>   {
>     return x == y;
>   }
>   
> + /* Hasher for integer type Type in which Empty is a spare value that can be
> +    used to mark empty slots.  If Deleted != Empty then Deleted is another
> +    spare value that can be used for deleted slots; if Deleted == Empty then
> +    hash table entries cannot be deleted.  */
> + 
> + template <typename Type, Type Empty, Type Deleted = Empty>
> + struct int_hash : int_hash_base <Type>
> + {
> +   typedef Type value_type;
> +   typedef Type compare_type;
> + 
> +   static inline void mark_deleted (Type &);
> +   static const bool empty_zero_p = Empty == 0;
> +   static inline void mark_empty (Type &);
> +   static inline bool is_deleted (Type);
> +   static inline bool is_empty (Type);
> + };
> + 
>   template <typename Type, Type Empty, Type Deleted>
>   inline void
>   int_hash <Type, Empty, Deleted>::mark_deleted (Type &x)
> ***************
> *** 398,403 ****
> --- 408,467 ----
>     return T1::is_empty (x.first);
>   }
>   
> + /* Base traits for vectors, providing just the hash and comparison
> +    functionality.  Type gives the traits for the element type.  */
> + 
> + template <typename Type>
> + struct vec_hash_base
> + {
> +   typedef vec<typename Type::value_type> value_type;
> +   typedef vec<typename Type::compare_type> compare_type;
> + 
> +   static inline hashval_t hash (value_type);
> +   static inline bool equal (value_type, compare_type);
> + };
> + 
> + template <typename Type>
> + inline hashval_t
> + vec_hash_base <Type>::hash (value_type x)
> + {
> +   inchash::hash hstate;
> +   hstate.add_int (x.length ());
> +   for (auto &value : x)
> +     hstate.merge_hash (Type::hash (value));
> +   return hstate.end ();
> + }
> + 
> + template <typename Type>
> + inline bool
> + vec_hash_base <Type>::equal (value_type x, compare_type y)
> + {
> +   if (x.length () != y.length ())
> +     return false;
> +   for (unsigned int i = 0; i < x.length (); ++i)
> +     if (!Type::equal (x[i], y[i]))
> +       return false;
> +   return true;
> + }
> + 
> + /* Traits for vectors whose contents should be freed normally.  */
> + 
> + template <typename Type>
> + struct vec_free_hash_base : vec_hash_base <Type>
> + {
> +   static void remove (typename vec_hash_base <Type>::value_type &);
> + };
> + 
> + template <typename Type>
> + void
> + vec_free_hash_base <Type>
> + ::remove (typename vec_hash_base <Type>::value_type &x)
> + {
> +   for (auto &value : x)
> +     Type::remove (value);
> +   x.release ();
> + }
> + 
>   template <typename T> struct default_hash_traits : T {};
>   
>   template <typename T>
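
A self-contained model of the vec_hash_base scheme, with std::vector
standing in for vec and a simple multiplicative mixer standing in for
inchash::hash: the hash covers the length plus every element's hash,
and equality is element-wise.

#include <cstdio>
#include <vector>

static unsigned
vec_hash_demo (const std::vector<unsigned> &v)
{
  /* Combine the length with every element's hash.  */
  unsigned h = (unsigned) v.size ();
  for (unsigned x : v)
    h = h * 0x9e3779b9u + x;
  return h;
}

static bool
vec_equal_demo (const std::vector<unsigned> &x,
		const std::vector<unsigned> &y)
{
  /* Equal length, then element-wise comparison.  */
  if (x.size () != y.size ())
    return false;
  for (size_t i = 0; i < x.size (); ++i)
    if (x[i] != y[i])
      return false;
  return true;
}

int main ()
{
  std::vector<unsigned> a { 1, 2, 3, 0 };
  std::vector<unsigned> b { 1, 2, 3, 0 };
  printf ("%u %d\n", vec_hash_demo (a), (int) vec_equal_demo (a, b));
}
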
> *** /tmp/5ek0ya_tree-vect-slp.cc	2022-08-02 08:16:24.027354017 +0100
> --- gcc/tree-vect-slp.cc	2022-08-01 20:44:22.631690380 +0100
> ***************
> *** 20,25 ****
> --- 20,26 ----
>   <http://www.gnu.org/licenses/>.  */
>   
>   #include "config.h"
> + #define INCLUDE_ALGORITHM
>   #include "system.h"
>   #include "coretypes.h"
>   #include "backend.h"
> ***************
> *** 49,54 ****
> --- 50,57 ----
>   #include "tree-eh.h"
>   #include "tree-cfg.h"
>   #include "alloc-pool.h"
> + #include "sreal.h"
> + #include "predict.h"
>   
>   static bool vectorizable_slp_permutation (vec_info *, gimple_stmt_iterator *,
>   					  slp_tree, stmt_vector_for_cost *);
> ***************
> *** 304,309 ****
> --- 307,322 ----
>     oprnds_info.release ();
>   }
>   
> + /* Return the execution frequency of NODE (so that a higher value indicates
> +    a "more important" node when optimizing for speed).  */
> + 
> + static sreal
> + vect_slp_node_weight (slp_tree node)
> + {
> +   stmt_vec_info stmt_info = vect_orig_stmt (SLP_TREE_REPRESENTATIVE (node));
> +   basic_block bb = gimple_bb (stmt_info->stmt);
> +   return bb->count.to_sreal_scale (ENTRY_BLOCK_PTR_FOR_FN (cfun)->count);
> + }
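
So the weight is just the block's profile count scaled relative to the
function entry: code in a loop body that runs ~8 times per call gets
weight ~8, straight-line code weight ~1.  As plain arithmetic
(made-up counts):

#include <cstdio>

int main ()
{
  /* Toy version of to_sreal_scale: weight = bb count / entry count.  */
  double entry_count = 1000;
  double outer_loop_bb = 8000;	 /* ~8 iterations per call */
  double inner_loop_bb = 64000;	 /* ~8 inner iterations per outer one */
  printf ("outer %g, inner %g\n", outer_loop_bb / entry_count,
	  inner_loop_bb / entry_count);	 /* outer 8, inner 64 */
}
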
>   
>   /* Return true if STMTS contains a pattern statement.  */
>   
> ***************
> *** 3531,3559 ****
>     return opt_result::success ();
>   }
>   
> ! struct slpg_vertex
>   {
> !   slpg_vertex (slp_tree node_)
> !     : node (node_), perm_in (-1), perm_out (-1) {}
>   
> !   int get_perm_materialized () const
> !     { return perm_in != perm_out ? perm_in : 0; }
>   
>     slp_tree node;
> !   /* The common permutation on the incoming lanes (towards SLP children).  */
> !   int perm_in;
> !   /* The permutation on the outgoing lanes (towards SLP parents).  When
> !      the node is a materialization point for a permute this differs
> !      from perm_in (and is then usually zero).  Materialization happens
> !      on the input side.  */
> !   int perm_out;
>   };
>   
>   /* Fill the vertices and leafs vector with all nodes in the SLP graph.  */
>   
> ! static void
> ! vect_slp_build_vertices (hash_set<slp_tree> &visited, slp_tree node,
> ! 			 vec<slpg_vertex> &vertices, vec<int> &leafs)
>   {
>     unsigned i;
>     slp_tree child;
> --- 3544,3994 ----
>     return opt_result::success ();
>   }
>   
> ! /* Estimates the cost of inserting layout changes into the SLP graph.  */
> ! 
> ! struct slpg_layout_cost
> ! {
> !   slpg_layout_cost () = default;
> !   slpg_layout_cost (sreal, bool);
> ! 
> !   bool operator== (const slpg_layout_cost &) const;
> !   bool operator!= (const slpg_layout_cost &) const;
> ! 
> !   bool is_better_than (const slpg_layout_cost &, bool) const;
> ! 
> !   void add_parallel_cost (const slpg_layout_cost &);
> !   void add_serial_cost (const slpg_layout_cost &);
> !   void split (unsigned int);
> ! 
> !   /* The longest sequence of layout changes needed during any traversal
> !      of the partition dag, weighted by execution frequency.
> ! 
> !      This is the most important metric when optimizing for speed, since
> !      it helps to ensure that we keep the number of operations on
> !      critical paths to a minimum.  */
> !   sreal depth = 0;
> ! 
> !   /* An estimate of the total number of operations needed.  It is weighted by
> !      execution frequency when optimizing for speed but not when optimizing for
> !      size.  In order to avoid double-counting, a node with a fanout of N will
> !      distribute 1/N of its total cost to each successor.
> ! 
> !      This is the most important metric when optimizing for size, since
> !      it helps to keep the total number of operations to a minimum.  */
> !   sreal total = 0;
> ! };
> ! 
> ! /* Construct costs for a node with weight WEIGHT.  A higher weight
> !    indicates more frequent execution.  IS_FOR_SIZE is true if we are
> !    optimizing for size rather than speed.  */
> ! 
> ! slpg_layout_cost::slpg_layout_cost (sreal weight, bool is_for_size)
> !   : depth (weight), total (is_for_size && weight > 0 ? 1 : weight)
> ! {
> ! }
> ! 
> ! bool
> ! slpg_layout_cost::operator== (const slpg_layout_cost &other) const
> ! {
> !   return depth == other.depth && total == other.total;
> ! }
> ! 
> ! bool
> ! slpg_layout_cost::operator!= (const slpg_layout_cost &other) const
> ! {
> !   return !operator== (other);
> ! }
> ! 
> ! /* Return true if these costs are better than OTHER.  IS_FOR_SIZE is
> !    true if we are optimizing for size rather than speed.  */
> ! 
> ! bool
> ! slpg_layout_cost::is_better_than (const slpg_layout_cost &other,
> ! 				  bool is_for_size) const
> ! {
> !   if (is_for_size)
> !     {
> !       if (total != other.total)
> ! 	return total < other.total;
> !       return depth < other.depth;
> !     }
> !   else
> !     {
> !       if (depth != other.depth)
> ! 	return depth < other.depth;
> !       return total < other.total;
> !     }
> ! }
> ! 
> ! /* Increase the costs to account for something with cost INPUT_COST
> !    happening in parallel with the current costs.  */
> ! 
> ! void
> ! slpg_layout_cost::add_parallel_cost (const slpg_layout_cost &input_cost)
>   {
> !   depth = std::max (depth, input_cost.depth);
> !   total += input_cost.total;
> ! }
> ! 
> ! /* Increase the costs to account for something with cost INPUT_COST
> !    happening in series with the current costs.  */
> ! 
> ! void
> ! slpg_layout_cost::add_serial_cost (const slpg_layout_cost &other)
> ! {
> !   depth += other.depth;
> !   total += other.total;
> ! }
> ! 
> ! /* Split the total cost among TIMES successors or predecessors.  */
>   
> ! void
> ! slpg_layout_cost::split (unsigned int times)
> ! {
> !   if (times > 1)
> !     total /= times;
> ! }
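
The algebra is the usual critical-path one: depth combines with max
across parallel inputs and with + in series, while total always adds
and is divided across fanout.  A self-contained model with one worked
combination (made-up unit costs):

#include <algorithm>
#include <cstdio>

struct cost_demo { double depth = 0, total = 0; };

static void
add_parallel (cost_demo &c, const cost_demo &in)
{
  c.depth = std::max (c.depth, in.depth);  /* latency: max in parallel */
  c.total += in.total;			   /* ...but every operation counts */
}

static void
add_serial (cost_demo &c, const cost_demo &in)
{
  c.depth += in.depth;			   /* latency: sum in series */
  c.total += in.total;
}

static void
split (cost_demo &c, unsigned int times)
{
  if (times > 1)
    c.total /= times;			   /* each successor gets 1/N */
}

int main ()
{
  cost_demo node, a { 1, 1 }, b { 2, 1 }, own { 1, 1 };
  add_parallel (node, a);	/* depth 1, total 1 */
  add_parallel (node, b);	/* depth 2, total 2: a and b in parallel */
  add_serial (node, own);	/* depth 3, total 3: the node's own cost */
  split (node, 2);		/* two users each inherit total 1.5 */
  printf ("depth %g total %g\n", node.depth, node.total);
}
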
> ! 
> ! /* Information about one node in the SLP graph, for use during
> !    vect_optimize_slp_pass.  */
> ! 
> ! struct slpg_vertex
> ! {
> !   slpg_vertex (slp_tree node_) : node (node_) {}
>   
> +   /* The node itself.  */
>     slp_tree node;
> ! 
> !   /* Which partition the node belongs to, or -1 if none.  */
> !   int partition = -1;
> ! 
> !   /* The number of nodes that directly use the result of this one
> !      (i.e. the number of nodes that count this one as a child).  */
> !   unsigned int out_degree = 0;
> ! 
> !   /* The execution frequency of the node.  */
> !   sreal weight = 0;
> ! 
> !   /* The total execution frequency of all nodes that directly use the
> !      result of this one.  */
> !   sreal out_weight = 0;
> ! };
> ! 
> ! /* Information about one partition of the SLP graph, for use during
> !    vect_optimize_slp_pass.  */
> ! 
> ! struct slpg_partition_info
> ! {
> !   /* The nodes in the partition occupy indices [NODE_BEGIN, NODE_END)
> !      of m_partitioned_nodes.  */
> !   unsigned int node_begin = 0;
> !   unsigned int node_end = 0;
> ! 
> !   /* Which layout we've chosen to use for this partition, or -1 if
> !      we haven't picked one yet.  */
> !   int layout = -1;
> ! 
> !   /* The number of predecessors and successors in the partition dag.
> !      The predecessors always have lower partition numbers and the
> !      successors always have higher partition numbers.
> ! 
> !      Note that the directions of these edges are not necessarily the
> !      same as in the data flow graph.  For example, if an SCC contains
> !      an inner and an outer loop, the inner loop's partition will have
> !      at least two incoming edges from the outer loop's partition:
> !      one for a live-in value and one for a live-out value.  In data
> !      flow terms, the live-in edge would also be from the outer loop
> !      to the inner loop, but the live-out edge would be from the inner
> !      loop to the outer loop.  */
> !   unsigned int in_degree = 0;
> !   unsigned int out_degree = 0;
> ! };
> ! 
> ! /* Information about the costs of using a particular layout for a
> !    particular partition.  It can also say that the combination is
> !    invalid.  */
> ! 
> ! struct slpg_partition_layout_costs
> ! {
> !   bool is_possible () const { return cost.total != sreal::min (); }
> !   void mark_impossible () { cost.total = sreal::min (); }
> ! 
> !   /* The costs inherited from predecessor partitions plus the inherent
> !      cost of the layout within the node itself.  */
> !   slpg_layout_cost in_cost;
> ! 
> !   /* The costs inherited from successor partitions.  */
> !   slpg_layout_cost out_cost;
> ! 
> !   /* The sum of the above, when used.  */
> !   slpg_layout_cost cost;
> ! };
> ! 
> ! /* This class tries to optimize the layout of vectors in order to avoid
> !    unnecessary shuffling.  At the moment, the set of possible layouts is
> !    restricted to bijective permutes.
> ! 
> !    The goal of the pass depends on whether we're optimizing for size or
> !    for speed.  When optimizing for size, the goal is to reduce the overall
> !    number of layout changes (including layout changes implied by things
> !    like load permutes).  When optimizing for speed, the goal is to
> !    reduce the maximum latency attributable to layout changes on any
> !    non-cyclic path through the data flow graph.
> ! 
> !    For example, when optimizing a loop nest for speed, we will prefer
> !    to make layout changes outside of a loop rather than inside of a loop,
> !    and will prefer to make layout changes in parallel rather than serially,
> !    even if that increases the overall number of layout changes.
> ! 
> !    The high-level procedure is:
> ! 
> !    (1) Build a graph in which edges go from uses (parents) to definitions
> !        (children).
> ! 
> !    (2) Divide the graph into a dag of strongly-connected components (SCCs).
> ! 
> !    (3) Partition the nodes in each SCC based on their containing cfg loop,
> !        creating a dag of partitions.  The goal is now to assign a layout
> !        to each partition.
> ! 
> !    (4) Construct a set of vector layouts that are worth considering and
> !        calculate the local cost of using each layout in each partition.
> !        Some partitions might not allow a given layout at all.
> ! 
> !    (5) Perform a forward walk over the partition dag (from loads to stores)
> !        accumulating the "forward" cost of using each layout.  When visiting
> !        each partition, assign a tentative choice of layout to the partition
> !        and use that choice when calculating the cost of using a different
> !        layout in successor partitions.
> ! 
> !    (6) Perform a backward walk over the partition dag (from stores to loads),
> !        accumulating the "backward" cost of using each layout.  When visiting
> !        each partition, make a final choice of layout for that partition based
> !        on the accumulated forward costs (from (5)) and backward costs
> !        (from (6)).
> ! 
> !    (7) Apply the chosen layouts to the SLP graph.
> ! 
> !    For example, consider the SLP statements:
> ! 
> !    S1:      a_1 = load
> !        loop:
> !    S2:      a_2 = PHI<a_1, a_3>
> !    S3:      b_1 = load
> !    S4:      a_3 = a_2 + b_1
> !        exit:
> !    S5:      a_4 = PHI<a_3>
> !    S6:      store a_4
> ! 
> !    S2 and S4 form an SCC and are part of the same loop.  Every other
> !    statement is in a singleton SCC.  In this example there is a one-to-one
> !    mapping between SCCs and partitions, and the partition dag looks like this:
> ! 
> ! 	S1     S3
> ! 	 \     /
> ! 	  S2+S4
> ! 	    |
> ! 	   S5
> ! 	    |
> ! 	   S6
> ! 
> !    S2, S3 and S4 will have a higher execution frequency than the other
> !    statements, so when optimizing for speed, the goal is to avoid any
> !    layout changes:
> ! 
> !    - within S3
> !    - within S2+S4
> !    - on the S3->S2+S4 edge
> ! 
> !    For example, if S3 was originally a reversing load, the goal of the
> !    pass is to make it an unreversed load and change the layout on the
> !    S1->S2+S4 and S2+S4->S5 edges to compensate.  (Changing the layout
> !    on S1->S2+S4 and S5->S6 would also be acceptable.)
> ! 
> !    The difference between SCCs and partitions becomes important if we
> !    add an outer loop:
> ! 
> !    S1:      a_1 = ...
> !        loop1:
> !    S2:      a_2 = PHI<a_1, a_6>
> !    S3:      b_1 = load
> !    S4:      a_3 = a_2 + b_1
> !        loop2:
> !    S5:      a_4 = PHI<a_3, a_5>
> !    S6:      c_1 = load
> !    S7:      a_5 = a_4 + c_1
> !        exit2:
> !    S8:      a_6 = PHI<a_5>
> !    S9:      store a_6
> !        exit1:
> ! 
> !    Here, S2, S4, S5, S7 and S8 form a single SCC.  However, we usually
> !    do not want restrictions in the outer loop to "infect" the decision
> !    for the inner loop.  For example, if an outer-loop node in the SCC
> !    contains a statement with a fixed layout, that should not prevent the
> !    inner loop from using a different layout.  Conversely, the inner loop
> !    should not dictate a layout to the outer loop: if the outer loop does
> !    a lot of computation, then it may not be efficient to do all of that
> !    computation in the inner loop's preferred layout.
> ! 
> !    We therefore partition the SCC into S2+S4+S8 (outer) and S5+S7 (inner).
> !    We also try to arrange partitions so that:
> ! 
> !    - the partition for an outer loop comes before the partition for
> !      an inner loop
> ! 
> !    - if a sibling loop A dominates a sibling loop B, A's partition
> !      comes before B's
> ! 
> !    So in this case, the partition dag is:
> ! 
> ! 	S1        S3
> ! 	 \        /
> ! 	  S2+S4+S8   S6
> ! 	   |   \\    /
> ! 	   |    S5+S7
> ! 	   |
> ! 	  S9
> ! 
> !    There are two edges from S2+S4+S8 to S5+S7: one for the edge S4->S5 and
> !    one for a reversal of the edge S7->S8.
> ! 
> !    The backward walk picks a layout for S5+S7 before S2+S4+S8.  The choice
> !    for S2+S4+S8 therefore has to balance the cost of using the outer loop's
> !    preferred layout against the cost of changing the layout on entry to the
> !    inner loop (S4->S5) and on exit from the inner loop (S7->S8 reversed).
> ! 
> !    ??? Although this works well when optimizing for speed, it has the
> !    downside when optimizing for size that the choice of layout for S5+S7
> !    is completely independent of S9.  Perhaps we should not partition SCCs
> !    when optimizing for size?
> ! 
> !    To give a concrete example of the difference between optimizing
> !    for size and speed, consider:
> ! 
> !    a[0] = (b[1] << c[3]) - d[1];
> !    a[1] = (b[0] << c[2]) - d[0];
> !    a[2] = (b[3] << c[1]) - d[3];
> !    a[3] = (b[2] << c[0]) - d[2];
> ! 
> !    There are three different layouts here: one for a, one for b and d,
> !    and one for c.  When optimizing for speed it is better to permute
> !    each of b, c and d into the order required by a, since those permutes
> !    happen in parallel.  But when optimizing for size, it is better to:
> ! 
> !    - permute c into the same order as b
> !    - do the arithmetic
> !    - permute the result into the order required by a
> ! 
> !    This gives 2 permutes rather than 3.  */
> ! 
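
To make steps (5) and (6) concrete: on a straight chain of partitions
the two walks reduce to a shortest-path dynamic program over
(partition, layout) pairs.  A self-contained toy version with made-up
unit costs (the real pass works on a dag, weights everything by
execution frequency and supports impossible combinations):

#include <algorithm>
#include <climits>
#include <cstdio>

int main ()
{
  const int P = 3, L = 2;	/* partitions 0 -> 1 -> 2, two layouts */
  /* change[i][j]: cost of switching from layout i to j on an edge.  */
  int change[2][2] = { { 0, 1 }, { 1, 0 } };
  /* local[p][l]: cost of using layout l within partition p.  */
  int local[3][2] = { { 1, 0 },	  /* the loads prefer layout 1 */
		      { 1, 0 },	  /* so does the (loop) arithmetic */
		      { 0, 1 } }; /* the store wants layout 0 */
  int fwd[3][2], bwd[3][2];

  /* Forward pass: cheapest way of reaching (p, l) from the loads.  */
  for (int l = 0; l < L; ++l)
    fwd[0][l] = local[0][l];
  for (int p = 1; p < P; ++p)
    for (int l = 0; l < L; ++l)
      {
	fwd[p][l] = INT_MAX;
	for (int k = 0; k < L; ++k)
	  fwd[p][l] = std::min (fwd[p][l],
				fwd[p - 1][k] + change[k][l] + local[p][l]);
      }

  /* Backward pass: cheapest way from (p, l) to the stores.  */
  for (int l = 0; l < L; ++l)
    bwd[P - 1][l] = 0;
  for (int p = P - 2; p >= 0; --p)
    for (int l = 0; l < L; ++l)
      {
	bwd[p][l] = INT_MAX;
	for (int k = 0; k < L; ++k)
	  bwd[p][l] = std::min (bwd[p][l], (change[l][k] + local[p + 1][k]
					    + bwd[p + 1][k]));
      }

  /* Final choice: minimize forward + backward cost per partition.  */
  for (int p = 0; p < P; ++p)
    {
      int best = 0;
      for (int l = 1; l < L; ++l)
	if (fwd[p][l] + bwd[p][l] < fwd[p][best] + bwd[p][best])
	  best = l;
      printf ("partition %d: layout %d\n", p, best);
    }
}

With these numbers, partitions 0 and 1 keep the loads' preferred
layout and a single layout change is inserted in front of the store,
which is the outcome described above.
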
> ! class vect_optimize_slp_pass
> ! {
> ! public:
> !   vect_optimize_slp_pass (vec_info *vinfo) : m_vinfo (vinfo) {}
> !   void run ();
> ! 
> ! private:
> !   /* Graph building.  */
> !   struct loop *containing_loop (slp_tree);
> !   bool is_cfg_latch_edge (graph_edge *);
> !   void build_vertices (hash_set<slp_tree> &, slp_tree);
> !   void build_vertices ();
> !   void build_graph ();
> ! 
> !   /* Partitioning.  */
> !   void create_partitions ();
> !   template<typename T> void for_each_partition_edge (unsigned int, T);
> ! 
> !   /* Layout selection.  */
> !   bool can_change_layouts (unsigned int, unsigned int);
> !   bool layout_is_reversible (unsigned int);
> !   unsigned int layout_cost (unsigned int, unsigned int);
> !   slpg_partition_layout_costs &partition_layout_costs (unsigned int,
> ! 						       unsigned int);
> !   bool partition_can_absorb_permutes (unsigned int);
> !   int node_absorption_cost (slp_tree, unsigned int);
> !   void start_choosing_layouts ();
> ! 
> !   /* Cost propagation.  */
> !   slpg_layout_cost edge_layout_cost (graph_edge *, unsigned int,
> ! 				     unsigned int, unsigned int);
> !   slpg_layout_cost forward_cost (graph_edge *, unsigned int, unsigned int);
> !   slpg_layout_cost backward_cost (graph_edge *, unsigned int, unsigned int);
> !   void forward_pass ();
> !   void backward_pass ();
> ! 
> !   /* Rematerialization.  */
> !   slp_tree get_result_with_layout (slp_tree, unsigned int);
> !   void materialize ();
> ! 
> !   /* Clean-up.  */
> !   void remove_redundant_permutes ();
> ! 
> !   void dump ();
> ! 
> !   vec_info *m_vinfo;
> ! 
> !   /* True if we should optimize the graph for size, false if we should
> !      optimize it for speed.  (It wouldn't be easy to make this decision
> !      more locally.)  */
> !   bool m_optimize_size;
> ! 
> !   /* A graph of all SLP nodes, with edges leading from uses to definitions.
> !      In other words, a node's predecessors are its slp_tree parents and
> !      a node's successors are its slp_tree children.  */
> !   graph *m_slpg = nullptr;
> ! 
> !   /* The vertices of M_SLPG, indexed by slp_tree::vertex.  */
> !   auto_vec<slpg_vertex> m_vertices;
> ! 
> !   /* The list of all leaves of M_SLPG, such as external definitions, constants,
> !      and loads.  */
> !   auto_vec<int> m_leafs;
> ! 
> !   /* This array has one entry for every vector layout that we're considering.
> !      Element 0 is null and indicates "no change".  Other entries describe
> !      permutes that are inherent in the current graph and that we would
> !      like to reverse if possible.
> ! 
> !      For example, a permute { 1, 2, 3, 0 } means that something has
> !      effectively been permuted in that way, such as a load group
> !      { a[1], a[2], a[3], a[0] } (viewed as a permutation of a[0:3]).
> !      We'd then like to apply the reverse permute { 3, 0, 1, 2 }
> !      in order to put things "back" in order.  */
> !   auto_vec<vec<unsigned> > m_perms;
> ! 
> !   /* A partitioning of the nodes for which a layout must be chosen.
> !      Each partition represents an <SCC, cfg loop> pair; that is,
> !      nodes in different SCCs belong to different partitions, and nodes
> !      within an SCC are further partitioned according to the containing
> !      cfg loop.  Partition <SCC1, L1> comes before <SCC2, L2> if:
> ! 
> !      - SCC1 != SCC2 and SCC1 is a predecessor of SCC2 in a forward walk
> !        from leaves (such as loads) to roots (such as stores).
> ! 
> !      - SCC1 == SCC2 and L1's header strictly dominates L2's header.  */
> !   auto_vec<slpg_partition_info> m_partitions;
> ! 
> !   /* The list of all nodes for which a layout must be chosen.  Nodes for
> !      partition P come before the nodes for partition P+1.  Nodes within a
> !      partition are in reverse postorder.  */
> !   auto_vec<unsigned int> m_partitioned_nodes;
> ! 
> !   /* Index P * num-layouts + L contains the cost of using layout L
> !      for partition P.  */
> !   auto_vec<slpg_partition_layout_costs> m_partition_layout_costs;
> ! 
> !   /* Index N * num-layouts + L, if nonnull, is a node that provides the
> !      original output of node N adjusted to have layout L.  */
> !   auto_vec<slp_tree> m_node_layouts;
>   };
>   
>   /* Fill the vertices and leafs vector with all nodes in the SLP graph.  */
>   
> ! void
> ! vect_optimize_slp_pass::build_vertices (hash_set<slp_tree> &visited,
> ! 					slp_tree node)
>   {
>     unsigned i;
>     slp_tree child;
> ***************
> *** 3561,3568 ****
>     if (visited.add (node))
>       return;
>   
> !   node->vertex = vertices.length ();
> !   vertices.safe_push (slpg_vertex (node));
>   
>     bool leaf = true;
>     bool force_leaf = false;
> --- 3996,4003 ----
>     if (visited.add (node))
>       return;
>   
> !   node->vertex = m_vertices.length ();
> !   m_vertices.safe_push (slpg_vertex (node));
>   
>     bool leaf = true;
>     bool force_leaf = false;
> ***************
> *** 3570,3576 ****
>       if (child)
>         {
>   	leaf = false;
> ! 	vect_slp_build_vertices (visited, child, vertices, leafs);
>         }
>       else
>         force_leaf = true;
> --- 4005,4011 ----
>       if (child)
>         {
>   	leaf = false;
> ! 	build_vertices (visited, child);
>         }
>       else
>         force_leaf = true;
> ***************
> *** 3580,3600 ****
>        and inductions.  Force those SLP PHIs to act as leafs to make them
>        backwards reachable.  */
>     if (leaf || force_leaf)
> !     leafs.safe_push (node->vertex);
>   }
>   
>   /* Fill the vertices and leafs vector with all nodes in the SLP graph.  */
>   
> ! static void
> ! vect_slp_build_vertices (vec_info *info, vec<slpg_vertex> &vertices,
> ! 			 vec<int> &leafs)
>   {
>     hash_set<slp_tree> visited;
>     unsigned i;
>     slp_instance instance;
> !   FOR_EACH_VEC_ELT (info->slp_instances, i, instance)
> !     vect_slp_build_vertices (visited, SLP_INSTANCE_TREE (instance), vertices,
> ! 			     leafs);
>   }
>   
>   /* Apply (reverse) bijective PERM to VEC.  */
> --- 4015,4033 ----
>        and inductions.  Force those SLP PHIs to act as leafs to make them
>        backwards reachable.  */
>     if (leaf || force_leaf)
> !     m_leafs.safe_push (node->vertex);
>   }
>   
>   /* Fill the vertices and leafs vector with all nodes in the SLP graph.  */
>   
> ! void
> ! vect_optimize_slp_pass::build_vertices ()
>   {
>     hash_set<slp_tree> visited;
>     unsigned i;
>     slp_instance instance;
> !   FOR_EACH_VEC_ELT (m_vinfo->slp_instances, i, instance)
> !     build_vertices (visited, SLP_INSTANCE_TREE (instance));
>   }
>   
>   /* Apply (reverse) bijective PERM to VEC.  */
> ***************
> *** 3625,3699 ****
>       }
>   }
>   
> ! /* Return whether permutations PERM_A and PERM_B as recorded in the
> !    PERMS vector are equal.  */
>   
> ! static bool
> ! vect_slp_perms_eq (const vec<vec<unsigned> > &perms,
> ! 		   int perm_a, int perm_b)
>   {
> !   return (perm_a == perm_b
> ! 	  || (perm_a != -1 && perm_b != -1
> ! 	      && perms[perm_a].length () == perms[perm_b].length ()
> ! 	      && memcmp (&perms[perm_a][0], &perms[perm_b][0],
> ! 			 sizeof (unsigned) * perms[perm_a].length ()) == 0));
>   }
>   
> ! /* Optimize the SLP graph of VINFO.  */
>   
> ! void
> ! vect_optimize_slp (vec_info *vinfo)
>   {
> !   if (vinfo->slp_instances.is_empty ())
> !     return;
>   
> !   slp_tree node;
> !   unsigned i;
> !   auto_vec<slpg_vertex> vertices;
> !   auto_vec<int> leafs;
> !   vect_slp_build_vertices (vinfo, vertices, leafs);
>   
> !   struct graph *slpg = new_graph (vertices.length ());
> !   for (slpg_vertex &v : vertices)
>       for (slp_tree child : SLP_TREE_CHILDREN (v.node))
>         if (child)
> ! 	add_edge (slpg, v.node->vertex, child->vertex);
>   
> !   /* Compute (reverse) postorder on the inverted graph.  */
> !   auto_vec<int> ipo;
> !   graphds_dfs (slpg, &leafs[0], leafs.length (), &ipo, false, NULL, NULL);
> ! 
> !   auto_vec<vec<unsigned> > perms;
> !   perms.safe_push (vNULL); /* zero is no permute */
> ! 
> !   /* Produce initial permutations.  */
> !   for (i = 0; i < leafs.length (); ++i)
> !     {
> !       int idx = leafs[i];
> !       slp_tree node = vertices[idx].node;
> ! 
> !       /* Handle externals and constants optimistically throughout the
> ! 	 iteration.  But treat existing vectors as fixed since we
> ! 	 do not handle permuting them below.  */
> !       if ((SLP_TREE_DEF_TYPE (node) == vect_external_def
> ! 	   && !SLP_TREE_VEC_DEFS (node).exists ())
> ! 	  || SLP_TREE_DEF_TYPE (node) == vect_constant_def)
> ! 	continue;
>   
> !       /* Leafs do not change across iterations.  Note leafs also double
> ! 	 as entries to the reverse graph.  */
> !       if (!slpg->vertices[idx].succ)
>   	{
> ! 	  vertices[idx].perm_in = 0;
> ! 	  vertices[idx].perm_out = 0;
>   	}
>   
>         /* Loads are the only thing generating permutes.  */
>         if (!SLP_TREE_LOAD_PERMUTATION (node).exists ())
>   	continue;
>   
>         /* If splitting out a SLP_TREE_LANE_PERMUTATION can make the
> ! 	 node unpermuted, record this permute.  */
>         stmt_vec_info dr_stmt = SLP_TREE_REPRESENTATIVE (node);
>         if (!STMT_VINFO_GROUPED_ACCESS (dr_stmt))
>   	continue;
> --- 4058,4416 ----
>       }
>   }
>   
> ! /* Return the cfg loop that contains NODE.  */
>   
> ! struct loop *
> ! vect_optimize_slp_pass::containing_loop (slp_tree node)
>   {
> !   stmt_vec_info rep = SLP_TREE_REPRESENTATIVE (node);
> !   if (!rep)
> !     return ENTRY_BLOCK_PTR_FOR_FN (cfun)->loop_father;
> !   return gimple_bb (vect_orig_stmt (rep)->stmt)->loop_father;
>   }
>   
> ! /* Return true if UD (an edge from a use to a definition) is associated
> !    with a loop latch edge in the cfg.  */
>   
> ! bool
> ! vect_optimize_slp_pass::is_cfg_latch_edge (graph_edge *ud)
>   {
> !   slp_tree use = m_vertices[ud->src].node;
> !   slp_tree def = m_vertices[ud->dest].node;
> !   if (SLP_TREE_DEF_TYPE (use) != vect_internal_def
> !       || SLP_TREE_DEF_TYPE (def) != vect_internal_def)
> !     return false;
>   
> !   stmt_vec_info use_rep = vect_orig_stmt (SLP_TREE_REPRESENTATIVE (use));
> !   return (is_a<gphi *> (use_rep->stmt)
> ! 	  && bb_loop_header_p (gimple_bb (use_rep->stmt))
> ! 	  && containing_loop (def) == containing_loop (use));
> ! }
> ! 
> ! /* Build the graph.  Mark edges that correspond to cfg loop latch edges with
> !    a nonnull data field.  */
>   
> ! void
> ! vect_optimize_slp_pass::build_graph ()
> ! {
> !   build_vertices ();
> ! 
> !   m_slpg = new_graph (m_vertices.length ());
> !   for (slpg_vertex &v : m_vertices)
>       for (slp_tree child : SLP_TREE_CHILDREN (v.node))
>         if (child)
> ! 	{
> ! 	  graph_edge *ud = add_edge (m_slpg, v.node->vertex, child->vertex);
> ! 	  if (is_cfg_latch_edge (ud))
> ! 	    ud->data = this;
> ! 	}
> ! }
>   
> ! /* Return true if E corresponds to a loop latch edge in the cfg.  */
>   
> ! static bool
> ! skip_cfg_latch_edges (graph_edge *e)
> ! {
> !   return e->data;
> ! }
> ! 
> ! /* Create the node partitions.  */
> ! 
> ! void
> ! vect_optimize_slp_pass::create_partitions ()
> ! {
> !   /* Calculate a postorder of the graph, ignoring edges that correspond
> !      to natural latch edges in the cfg.  Reading the vector from the
> !      end to the beginning gives the reverse postorder.  */
> !   auto_vec<int> initial_rpo;
> !   graphds_dfs (m_slpg, &m_leafs[0], m_leafs.length (), &initial_rpo,
> ! 	       false, NULL, skip_cfg_latch_edges);
> !   gcc_assert (initial_rpo.length () == m_vertices.length ());
> ! 
> !   /* Calculate the strongly connected components of the graph.  */
> !   auto_vec<int> scc_grouping;
> !   unsigned int num_sccs = graphds_scc (m_slpg, NULL, NULL, &scc_grouping);
> ! 
> !   /* Create a new index order in which all nodes from the same SCC are
> !      consecutive.  Use scc_pos to record the index of the first node in
> !      each SCC.  */
> !   auto_vec<unsigned int> scc_pos (num_sccs);
> !   int last_component = -1;
> !   unsigned int node_count = 0;
> !   for (unsigned int node_i : scc_grouping)
> !     {
> !       if (last_component != m_slpg->vertices[node_i].component)
> ! 	{
> ! 	  last_component = m_slpg->vertices[node_i].component;
> ! 	  gcc_assert (last_component == int (scc_pos.length ()));
> ! 	  scc_pos.quick_push (node_count);
> ! 	}
> !       node_count += 1;
> !     }
> !   gcc_assert (node_count == initial_rpo.length ()
> ! 	      && last_component + 1 == int (num_sccs));
> ! 
> !   /* Use m_partitioned_nodes to group nodes into SCC order, with the nodes
> !      inside each SCC following the RPO we calculated above.  The fact that
> !      we ignored natural latch edges when calculating the RPO should ensure
> !      that, for natural loop nests:
> ! 
> !      - the first node that we encounter in a cfg loop is the loop header phi
> !      - the loop header phis are in dominance order
> ! 
> !      Arranging for this is an optimization (see below) rather than a
> !      correctness issue.  Unnatural loops with a tangled mess of backedges
> !      will still work correctly, but might give poorer results.
> ! 
> !      Also update scc_pos so that it gives 1 + the index of the last node
> !      in the SCC.  */
> !   m_partitioned_nodes.safe_grow (node_count);
> !   for (unsigned int old_i = initial_rpo.length (); old_i-- > 0;)
> !     {
> !       unsigned int node_i = initial_rpo[old_i];
> !       unsigned int new_i = scc_pos[m_slpg->vertices[node_i].component]++;
> !       m_partitioned_nodes[new_i] = node_i;
> !     }
> ! 
> !   /* Partition each SCC based on the containing cfg loop.  The order we
> !      constructed above should ensure that, for natural cfg loops, we'll
> !      create sub-SCC partitions for outer loops before the corresponding
> !      sub-SCC partitions for inner loops.  Similarly, when one sibling
> !      loop A dominates another sibling loop B, we should create a sub-SCC
> !      partition for A before a sub-SCC partition for B.
> ! 
> !      As above, nothing depends for correctness on whether this achieves
> !      a natural nesting, but we should get better results when it does.  */
> !   m_partitions.reserve (m_vertices.length ());
> !   unsigned int next_partition_i = 0;
> !   hash_map<struct loop *, int> loop_partitions;
> !   unsigned int rpo_begin = 0;
> !   unsigned int num_partitioned_nodes = 0;
> !   for (unsigned int rpo_end : scc_pos)
> !     {
> !       loop_partitions.empty ();
> !       for (unsigned int rpo_i = rpo_begin; rpo_i < rpo_end; ++rpo_i)
> ! 	{
> ! 	  /* Handle externals and constants optimistically throughout.
> ! 	     But treat existing vectors as fixed since we do not handle
> ! 	     permuting them.  */
> ! 	  unsigned int node_i = m_partitioned_nodes[rpo_i];
> ! 	  slp_tree node = m_vertices[node_i].node;
> ! 	  if ((SLP_TREE_DEF_TYPE (node) == vect_external_def
> ! 	       && !SLP_TREE_VEC_DEFS (node).exists ())
> ! 	      || SLP_TREE_DEF_TYPE (node) == vect_constant_def)
> ! 	    m_vertices[node_i].partition = -1;
> ! 	  else
> ! 	    {
> ! 	      struct loop *loop = containing_loop (node);
> ! 	      bool existed;
> ! 	      auto &partition_i
> ! 		= loop_partitions.get_or_insert (loop, &existed);
> ! 	      if (!existed)
> ! 		{
> ! 		  partition_i = next_partition_i++;
> ! 		  m_partitions.quick_push (slpg_partition_info ());
> ! 		}
> ! 	      m_vertices[node_i].partition = partition_i;
> ! 	      num_partitioned_nodes += 1;
> ! 	      m_partitions[partition_i].node_end += 1;
> ! 	    }
> ! 	}
> !       rpo_begin = rpo_end;
> !     }
> ! 
> !   /* Assign ranges of consecutive node indices to each partition,
> !      in partition order.  Start with node_end being the same as
> !      node_begin so that the next loop can use it as a counter.  */
> !   unsigned int node_begin = 0;
> !   for (auto &partition : m_partitions)
> !     {
> !       partition.node_begin = node_begin;
> !       node_begin += partition.node_end;
> !       partition.node_end = partition.node_begin;
> !     }
> !   gcc_assert (node_begin == num_partitioned_nodes);
> ! 
> !   /* Finally build the list of nodes in partition order.  */
> !   m_partitioned_nodes.truncate (num_partitioned_nodes);
> !   for (unsigned int node_i = 0; node_i < m_vertices.length (); ++node_i)
> !     {
> !       int partition_i = m_vertices[node_i].partition;
> !       if (partition_i >= 0)
>   	{
> ! 	  unsigned int order_i = m_partitions[partition_i].node_end++;
> ! 	  m_partitioned_nodes[order_i] = node_i;
>   	}
> +     }
> + }
> + 
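
The scc_pos updates above amount to a counting sort: each placement
bumps its component's slot, which is also why scc_pos ends up holding
1 + the index of the last node in each SCC.  A minimal standalone
version of the trick:

#include <cstdio>
#include <vector>

int main ()
{
  /* component[v] is each vertex's SCC; pos[c] starts as the index of
     the first slot reserved for component c.  */
  std::vector<int> component = { 1, 0, 1, 0, 2 };
  std::vector<int> order = { 4, 3, 2, 1, 0 };	/* within-SCC order */
  std::vector<int> pos = { 0, 2, 4 };		/* sizes 2, 2 and 1 */
  std::vector<int> grouped (component.size ());
  for (int v : order)
    grouped[pos[component[v]]++] = v;	/* pos[c] ends as 1 + last index */
  for (int v : grouped)
    printf ("%d ", v);			/* 3 1 2 0 4: grouped by SCC */
  printf ("\n");
}
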
> + /* Look for edges from earlier partitions into node NODE_I and edges from
> +    node NODE_I into later partitions.  Call:
> + 
> +       FN (ud, other_node_i, other_partition_i)
> + 
> +    for each such use-to-def edge ud, where other_node_i and other_partition_i
> +    are the node and partition at the other end of the edge.  */
> + 
> + template<typename T>
> + void
> + vect_optimize_slp_pass::for_each_partition_edge (unsigned int node_i, T fn)
> + {
> +   int partition_i = m_vertices[node_i].partition;
> +   for (graph_edge *pred = m_slpg->vertices[node_i].pred;
> +        pred; pred = pred->pred_next)
> +     {
> +       int src_partition_i = m_vertices[pred->src].partition;
> +       if (src_partition_i >= 0 && src_partition_i != partition_i)
> + 	fn (pred, pred->src, src_partition_i);
> +     }
> +   for (graph_edge *succ = m_slpg->vertices[node_i].succ;
> +        succ; succ = succ->succ_next)
> +     {
> +       int dest_partition_i = m_vertices[succ->dest].partition;
> +       if (dest_partition_i >= 0 && dest_partition_i != partition_i)
> + 	fn (succ, succ->dest, dest_partition_i);
> +     }
> + }
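
Since FN is a template parameter, callers can pass a lambda directly.
A hypothetical (made-up) use that counts a node's cross-partition
edges:

  unsigned int n_cross_edges = 0;
  for_each_partition_edge (node_i,
			   [&] (graph_edge *, unsigned int, unsigned int)
			     { n_cross_edges += 1; });
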
> + 
> + /* Return true if the target is able to change a vector from layout
> +    FROM_LAYOUT_I to layout TO_LAYOUT_I.  */
> + 
> + bool
> + vect_optimize_slp_pass::can_change_layouts (unsigned int from_layout_i,
> + 					    unsigned int to_layout_i)
> + {
> +   /* FIXME: test what the target can support.  */
> +   (void) from_layout_i;
> +   (void) to_layout_i;
> +   return true;
> + }
> + 
> + /* We already know that the target can perform permute m_perms[LAYOUT_I].
> +    In our terms this means that we know that the target can go from a
> +    layout with m_perms[LAYOUT_I] reversed (layout LAYOUT_I) to one with
> +    it applied (layout 0, the original layout of the SLP graph).  Return
> +    true if the target can also go in the opposite direction: from the
> +    original layout to one in which m_perms[LAYOUT_I] has been reversed.  */
> + 
> + bool
> + vect_optimize_slp_pass::layout_is_reversible (unsigned int layout_i)
> + {
> +   return can_change_layouts (0, layout_i);
> + }
> + 
> + /* Return the cost (in arbitrary units) of going from layout FROM_LAYOUT_I
> +    to layout TO_LAYOUT_I.  */
> + 
> + unsigned int
> + vect_optimize_slp_pass::layout_cost (unsigned int from_layout_i,
> + 				     unsigned int to_layout_i)
> + {
> +   if (from_layout_i == to_layout_i)
> +     return 0;
> + 
> +   if (can_change_layouts (from_layout_i, to_layout_i))
> +     return 1;
> + 
> +   /* Change to layout 0 and then out again.  */
> +   gcc_assert (layout_is_reversible (from_layout_i));
> +   return 2;
> + }
> + 
> + /* Return the costs of assigning layout LAYOUT_I to partition PARTITION_I.  */
> + 
> + inline slpg_partition_layout_costs &
> + vect_optimize_slp_pass::partition_layout_costs (unsigned int partition_i,
> + 						unsigned int layout_i)
> + {
> +   return m_partition_layout_costs[partition_i * m_perms.length () + layout_i];
> + }
> + 
> + /* Return true if partition PARTITION_I is structurally allowed to absorb
> +    neighboring permutes.  */
> + 
> + bool
> + vect_optimize_slp_pass::
> + partition_can_absorb_permutes (unsigned int partition_i)
> + {
> +   auto &partition = m_partitions[partition_i];
> +   if (partition.node_end != partition.node_begin + 1)
> +     return false;
> + 
> +   slp_tree node = m_vertices[m_partitioned_nodes[partition.node_begin]].node;
> +   if (SLP_TREE_CODE (node) == VEC_PERM_EXPR)
> +     return true;
> + 
> +   stmt_vec_info rep = SLP_TREE_REPRESENTATIVE (node);
> +   if (rep
> +       && STMT_VINFO_DATA_REF (rep)
> +       && DR_IS_READ (STMT_VINFO_DATA_REF (rep)))
> +     return true;
> + 
> +   return false;
> + }
> + 
> + /* Check whether NODE (which is a VEC_PERM_EXPR or a load) can absorb the
> +    permute that would be required to go from layout 0 to layout LAYOUT_I.
> +    Return the cost of the change if so, in the same arbitrary units as
> +    for layout_cost.  Return -1 otherwise.  */
> + 
> + int
> + vect_optimize_slp_pass::node_absorption_cost (slp_tree node,
> + 					      unsigned int layout_i)
> + {
> +   if (SLP_TREE_CODE (node) == VEC_PERM_EXPR)
> +     /* FIXME: check whether NODE can handle cases in which both the inputs
> +        and outputs have layout LAYOUT_I.  Return -1 if not.  */
> +     return 0;
> + 
> +   auto &vertex = m_vertices[node->vertex];
> +   auto &partition = m_partitions[vertex.partition];
> +   if (partition.layout == int (layout_i))
> +     return 0;
> +   /* FIXME: check whether NODE can be changed from its current layout
> +      to layout LAYOUT_I.  Return -1 if not.  */
> +   return 1;
> + }
> + 
> + /* Decide which element layouts we should consider using.  Mark which
> +    partitions can potentially use which layouts, and record the local
> +    cost of each choice.  */
> + 
> + void
> + vect_optimize_slp_pass::start_choosing_layouts ()
> + {
> +   /* Used to assign unique permute indices.  */
> +   using perm_hash = unbounded_hashmap_traits<
> +     vec_free_hash_base<int_hash_base<unsigned>>,
> +     int_hash<int, -1, -2>
> +   >;
> +   hash_map<vec<unsigned>, int, perm_hash> layout_ids;
> + 
> +   /* Layout 0 is "no change".  */
> +   m_perms.safe_push (vNULL);
> + 
> +   /* Create layouts from existing permutations.  */
> +   /* FIXME: cap the maximum number of layouts, to avoid quadraticness.  */
> +   for (unsigned int node_i : m_leafs)
> +     {
> +       int partition_i = m_vertices[node_i].partition;
> +       if (partition_i < 0)
> + 	continue;
> + 
> +       /* Leafs also double as entries to the reverse graph.  Allow the
> + 	 layout of those to be changed.  */
> +       if (!m_slpg->vertices[node_i].succ)
> + 	m_partitions[partition_i].layout = 0;
>   
>         /* Loads are the only thing generating permutes.  */
> +       slp_tree node = m_vertices[node_i].node;
>         if (!SLP_TREE_LOAD_PERMUTATION (node).exists ())
>   	continue;
>   
>         /* If splitting out a SLP_TREE_LANE_PERMUTATION can make the
> ! 	 node unpermuted, record a layout that reverses this permute.  */
> !       gcc_assert (m_partitions[partition_i].layout == 0);
>         stmt_vec_info dr_stmt = SLP_TREE_REPRESENTATIVE (node);
>         if (!STMT_VINFO_GROUPED_ACCESS (dr_stmt))
>   	continue;
> ***************
> *** 3708,3720 ****
>   	  if (idx - SLP_TREE_LOAD_PERMUTATION (node)[0] != j)
>   	    any_permute = true;
>   	}
> -       /* If there's no permute no need to split one out.  */
> -       if (!any_permute)
> - 	continue;
>         /* If the span doesn't match we'd disrupt VF computation, avoid
>   	 that for now.  */
>         if (imax - imin + 1 != SLP_TREE_LANES (node))
>   	continue;
>   
>         /* For now only handle true permutes, like
>   	 vect_attempt_slp_rearrange_stmts did.  This allows us to be lazy
> --- 4425,4442 ----
>   	  if (idx - SLP_TREE_LOAD_PERMUTATION (node)[0] != j)
>   	    any_permute = true;
>   	}
>         /* If the span doesn't match we'd disrupt VF computation, avoid
>   	 that for now.  */
>         if (imax - imin + 1 != SLP_TREE_LANES (node))
>   	continue;
> +       /* If there's no permute, no need to split one out.  In this case
> + 	 we can consider turning the load into a permuted load, if that
> + 	 turns out to be cheaper than alternatives.  */
> +       if (!any_permute)
> + 	{
> + 	  m_partitions[partition_i].layout = -1;
> + 	  continue;
> + 	}
>   
>         /* For now only handle true permutes, like
>   	 vect_attempt_slp_rearrange_stmts did.  This allows us to be lazy
> ***************
> *** 3735,3985 ****
>         perm.safe_grow (SLP_TREE_LANES (node), true);
>         for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j)
>   	perm[j] = SLP_TREE_LOAD_PERMUTATION (node)[j] - imin;
> !       perms.safe_push (perm);
> !       vertices[idx].perm_in = perms.length () - 1;
> !       vertices[idx].perm_out = perms.length () - 1;
>       }
>   
> !   /* In addition to the above we have to mark outgoing permutes facing
> !      non-reduction graph entries that are not represented as to be
> !      materialized.  */
> !   for (slp_instance instance : vinfo->slp_instances)
>       if (SLP_INSTANCE_KIND (instance) == slp_inst_kind_ctor)
>         {
> ! 	/* Just setting perm_out isn't enough for the propagation to
> ! 	   pick this up.  */
> ! 	vertices[SLP_INSTANCE_TREE (instance)->vertex].perm_in = 0;
> ! 	vertices[SLP_INSTANCE_TREE (instance)->vertex].perm_out = 0;
>         }
>   
> !   /* Propagate permutes along the graph and compute materialization points.  */
> !   bool changed;
> !   bool do_materialization = false;
> !   unsigned iteration = 0;
> !   do
> !     {
> !       changed = false;
> !       ++iteration;
> ! 
> !       if (dump_enabled_p ())
> ! 	dump_printf_loc (MSG_NOTE, vect_location,
> ! 			 "SLP optimize iteration %d\n", iteration);
>   
> !       for (i = vertices.length (); i > 0 ; --i)
>   	{
> ! 	  int idx = ipo[i-1];
> ! 	  slp_tree node = vertices[idx].node;
>   
> ! 	  /* Handle externals and constants optimistically throughout the
> ! 	     iteration.  */
> ! 	  if (SLP_TREE_DEF_TYPE (node) == vect_external_def
> ! 	      || SLP_TREE_DEF_TYPE (node) == vect_constant_def)
> ! 	    continue;
>   
> ! 	  /* We still eventually have failed backedge SLP nodes in the
> ! 	     graph, those are only cancelled when analyzing operations.
> ! 	     Simply treat them as transparent ops, propagating permutes
> ! 	     through them.  */
> ! 	  if (SLP_TREE_DEF_TYPE (node) == vect_internal_def)
> ! 	    {
> ! 	      /* We do not handle stores with a permutation, so all
> ! 		 incoming permutes must have been materialized.  */
> ! 	      stmt_vec_info rep = SLP_TREE_REPRESENTATIVE (node);
> ! 	      if (STMT_VINFO_DATA_REF (rep)
> ! 		  && DR_IS_WRITE (STMT_VINFO_DATA_REF (rep)))
> ! 		{
> ! 		  /* ???  We're forcing materialization in place
> ! 		     of the child here, we'd need special handling
> ! 		     in materialization to leave perm_in -1 here.  */
> ! 		  vertices[idx].perm_in = 0;
> ! 		  vertices[idx].perm_out = 0;
> ! 		}
> ! 	      /* We cannot move a permute across an operation that is
> ! 		 not independent on lanes.  Note this is an explicit
> ! 		 negative list since that's much shorter than the respective
> ! 		 positive one but it's critical to keep maintaining it.  */
> ! 	      if (is_gimple_call (STMT_VINFO_STMT (rep)))
> ! 		switch (gimple_call_combined_fn (STMT_VINFO_STMT (rep)))
> ! 		  {
> ! 		  case CFN_COMPLEX_ADD_ROT90:
> ! 		  case CFN_COMPLEX_ADD_ROT270:
> ! 		  case CFN_COMPLEX_MUL:
> ! 		  case CFN_COMPLEX_MUL_CONJ:
> ! 		  case CFN_VEC_ADDSUB:
> ! 		  case CFN_VEC_FMADDSUB:
> ! 		  case CFN_VEC_FMSUBADD:
> ! 		    vertices[idx].perm_in = 0;
> ! 		    vertices[idx].perm_out = 0;
> ! 		  default:;
> ! 		  }
>   	    }
>   
> ! 	  if (!slpg->vertices[idx].succ)
> ! 	    /* Pick up pre-computed leaf values.  */
> ! 	    ;
> ! 	  else
>   	    {
> ! 	      bool any_succ_perm_out_m1 = false;
> ! 	      int perm_in = vertices[idx].perm_in;
> ! 	      for (graph_edge *succ = slpg->vertices[idx].succ;
> ! 		   succ; succ = succ->succ_next)
>   		{
> ! 		  int succ_idx = succ->dest;
> ! 		  int succ_perm = vertices[succ_idx].perm_out;
> ! 		  /* Handle unvisited (and constant) nodes optimistically.  */
> ! 		  /* ???  But for constants once we want to handle
> ! 		     non-bijective permutes we have to verify the permute,
> ! 		     when unifying lanes, will not unify different constants.
> ! 		     For example see gcc.dg/vect/bb-slp-14.c for a case
> ! 		     that would break.  */
> ! 		  if (succ_perm == -1)
> ! 		    {
> ! 		      /* When we handled a non-leaf optimistically, note
> ! 			 that so we can adjust its outgoing permute below.  */
> ! 		      slp_tree succ_node = vertices[succ_idx].node;
> ! 		      if (SLP_TREE_DEF_TYPE (succ_node) != vect_external_def
> ! 			  && SLP_TREE_DEF_TYPE (succ_node) != vect_constant_def)
> ! 			any_succ_perm_out_m1 = true;
> ! 		      continue;
> ! 		    }
> ! 		  if (perm_in == -1)
> ! 		    perm_in = succ_perm;
> ! 		  else if (succ_perm == 0
> ! 			   || !vect_slp_perms_eq (perms, perm_in, succ_perm))
>   		    {
> ! 		      perm_in = 0;
> ! 		      break;
>   		    }
>   		}
>   
> ! 	      /* Adjust any incoming permutes we treated optimistically.  */
> ! 	      if (perm_in != -1 && any_succ_perm_out_m1)
>   		{
> ! 		  for (graph_edge *succ = slpg->vertices[idx].succ;
> ! 		       succ; succ = succ->succ_next)
>   		    {
> ! 		      slp_tree succ_node = vertices[succ->dest].node;
> ! 		      if (vertices[succ->dest].perm_out == -1
> ! 			  && SLP_TREE_DEF_TYPE (succ_node) != vect_external_def
> ! 			  && SLP_TREE_DEF_TYPE (succ_node) != vect_constant_def)
> ! 			{
> ! 			  vertices[succ->dest].perm_out = perm_in;
> ! 			  /* And ensure this propagates.  */
> ! 			  if (vertices[succ->dest].perm_in == -1)
> ! 			    vertices[succ->dest].perm_in = perm_in;
> ! 			}
>   		    }
> ! 		  changed = true;
> ! 		}
>   
> ! 	      if (!vect_slp_perms_eq (perms, perm_in,
> ! 				      vertices[idx].perm_in))
> ! 		{
> ! 		  /* Make sure we eventually converge.  */
> ! 		  gcc_checking_assert (vertices[idx].perm_in == -1
> ! 				       || perm_in == 0);
> ! 		  vertices[idx].perm_in = perm_in;
> ! 
> ! 		  /* While we can handle VEC_PERM nodes as transparent
> ! 		     pass-through they can be a cheap materialization
> ! 		     point as well.  In addition they can act as source
> ! 		     of a random permutation as well.
> ! 		     The following ensures that former materialization
> ! 		     points that now have zero incoming permutes no
> ! 		     longer appear as such and that former "any" permutes
> ! 		     get pass-through.  We keep VEC_PERM nodes optimistic
> ! 		     as "any" outgoing permute though.  */
> ! 		  if (vertices[idx].perm_out != 0
> ! 		      && SLP_TREE_CODE (node) != VEC_PERM_EXPR)
> ! 		    vertices[idx].perm_out = perm_in;
> ! 		  changed = true;
> ! 		}
>   	    }
>   
> ! 	  /* Elide pruning at materialization points in the first
> ! 	     iteration phase.  */
> ! 	  if (!do_materialization)
> ! 	    continue;
>   
> ! 	  int perm = vertices[idx].perm_out;
> ! 	  if (perm == 0 || perm == -1)
>   	    continue;
>   
> ! 	  /* Decide on permute materialization.  Look whether there's
> ! 	     a use (pred) edge that is permuted differently than us.
> ! 	     In that case mark ourselves so the permutation is applied.  */
> ! 	  bool all_preds_permuted = slpg->vertices[idx].pred != NULL;
> ! 	  if (all_preds_permuted)
> ! 	    for (graph_edge *pred = slpg->vertices[idx].pred;
> ! 		 pred; pred = pred->pred_next)
> ! 	      {
> ! 		int pred_perm = vertices[pred->src].perm_in;
> ! 		gcc_checking_assert (pred_perm != -1);
> ! 		if (!vect_slp_perms_eq (perms, perm, pred_perm))
> ! 		  {
> ! 		    all_preds_permuted = false;
> ! 		    break;
> ! 		  }
> ! 	      }
> ! 	  if (!all_preds_permuted)
> ! 	    {
> ! 	      vertices[idx].perm_out = 0;
> ! 	      changed = true;
>   	    }
> - 	}
>   
> !       /* If the initial propagation converged, switch on materialization
> ! 	 and re-propagate.  */
> !       if (!changed && !do_materialization)
>   	{
> ! 	  do_materialization = true;
> ! 	  changed = true;
>   	}
>       }
> !   while (changed);
> !   statistics_histogram_event (cfun, "SLP optimize perm iterations", iteration);
>   
> !   /* Materialize.  */
> !   for (i = 0; i < vertices.length (); ++i)
>       {
> !       int perm_in = vertices[i].perm_in;
> !       slp_tree node = vertices[i].node;
>   
> !       /* First permute invariant/external original successors, we handle
> ! 	 those optimistically during propagation and duplicate them if
> ! 	 they are used with different permutations.  */
> !       unsigned j;
> !       slp_tree child;
> !       if (perm_in > 0)
> ! 	FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), j, child)
> ! 	  {
> ! 	    if (!child
> ! 		|| (SLP_TREE_DEF_TYPE (child) != vect_constant_def
> ! 		    && SLP_TREE_DEF_TYPE (child) != vect_external_def))
> ! 	      continue;
>   
> ! 	    /* If the vector is uniform there's nothing to do.  */
> ! 	    if (vect_slp_tree_uniform_p (child))
> ! 	      continue;
>   
> ! 	    /* We can end up sharing some externals via two_operator
> ! 	       handling.  Be prepared to unshare those.  */
> ! 	    if (child->refcnt != 1)
> ! 	      {
> ! 		gcc_assert (slpg->vertices[child->vertex].pred->pred_next);
> ! 		SLP_TREE_CHILDREN (node)[j] = child
> ! 		  = vect_create_new_slp_node
> ! 		      (SLP_TREE_SCALAR_OPS (child).copy ());
> ! 	      }
> ! 	    vect_slp_permute (perms[perm_in],
> ! 			      SLP_TREE_SCALAR_OPS (child), true);
> ! 	  }
>   
>         if (SLP_TREE_CODE (node) == VEC_PERM_EXPR)
>   	{
>   	  /* Apply the common permutes to the input vectors.  */
> ! 	  if (perm_in > 0)
> ! 	    {
>   	      /* If the node is already a permute node we can apply
>   		 the permutation to the lane selection, effectively
>   		 materializing it on the incoming vectors.  */
> --- 4457,5092 ----
>         perm.safe_grow (SLP_TREE_LANES (node), true);
>         for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j)
>   	perm[j] = SLP_TREE_LOAD_PERMUTATION (node)[j] - imin;
> !       bool existed;
> !       int &layout_i = layout_ids.get_or_insert (perm, &existed);
> !       if (existed)
> ! 	perm.release ();
> !       else
> ! 	{
> ! 	  layout_i = m_perms.length ();
> ! 	  m_perms.safe_push (perm);
> ! 	}
> !       m_partitions[partition_i].layout = layout_i;
>       }
>   
> !   /* Initially assume that every layout is possible and has zero cost
> !      in every partition.  */
> !   m_partition_layout_costs.safe_grow_cleared (m_partitions.length ()
> ! 					      * m_perms.length ());
> ! 
> !   /* Outgoing permutes that face non-reduction graph entries cannot
> !      be represented in the graph, so they have to be materialized:
> !      pin such entries to the original layout.  */
> !   for (slp_instance instance : m_vinfo->slp_instances)
>       if (SLP_INSTANCE_KIND (instance) == slp_inst_kind_ctor)
>         {
> ! 	unsigned int node_i = SLP_INSTANCE_TREE (instance)->vertex;
> ! 	m_partitions[m_vertices[node_i].partition].layout = 0;
>         }
>   
> !   /* Check which layouts each node and partition can handle.  Calculate the
> !      weights associated with inserting layout changes on edges.  */
> !   m_optimize_size = true;
> !   for (unsigned int node_i : m_partitioned_nodes)
> !     {
> !       auto &vertex = m_vertices[node_i];
> !       unsigned int partition_i = vertex.partition;
> !       auto &partition = m_partitions[partition_i];
> !       slp_tree node = vertex.node;
> ! 
> !       if (stmt_vec_info rep = SLP_TREE_REPRESENTATIVE (node))
> ! 	{
> ! 	  basic_block bb = gimple_bb (vect_orig_stmt (rep)->stmt);
> ! 	  if (optimize_bb_for_speed_p (bb))
> ! 	    m_optimize_size = false;
> ! 
> ! 	  vertex.weight = vect_slp_node_weight (node);
> ! 
> ! 	  /* We do not handle stores with a permutation, so all
> ! 	     incoming permutes must have been materialized.  */
> ! 	  if (STMT_VINFO_DATA_REF (rep)
> ! 	      && DR_IS_WRITE (STMT_VINFO_DATA_REF (rep)))
> ! 	    /* ???  We're forcing materialization in place
> ! 	       of the child here; we'd need special handling
> ! 	       in materialization to leave layout -1 here.  */
> ! 	    partition.layout = 0;
> ! 
> ! 	  /* We cannot change the layout of an operation that is
> ! 	     not independent of lanes.  Note this is an explicit
> ! 	     negative list, since that's much shorter than the
> ! 	     corresponding positive one, but it's critical to keep
> ! 	     it maintained.  */
> ! 	  if (is_gimple_call (STMT_VINFO_STMT (rep)))
> ! 	    switch (gimple_call_combined_fn (STMT_VINFO_STMT (rep)))
> ! 	      {
> ! 	      case CFN_COMPLEX_ADD_ROT90:
> ! 	      case CFN_COMPLEX_ADD_ROT270:
> ! 	      case CFN_COMPLEX_MUL:
> ! 	      case CFN_COMPLEX_MUL_CONJ:
> ! 	      case CFN_VEC_ADDSUB:
> ! 	      case CFN_VEC_FMADDSUB:
> ! 	      case CFN_VEC_FMSUBADD:
> ! 		partition.layout = 0;
> ! 	      default:;
> ! 	      }
> ! 	}
>   
> !       if (partition.layout == 0)
>   	{
> ! 	  /* The node must stay in its current order.  */
> ! 	  for (unsigned int layout_i = 1; layout_i < m_perms.length ();
> ! 	       ++layout_i)
> ! 	    partition_layout_costs (partition_i, layout_i).mark_impossible ();
> ! 	}
> !       else if (partition_can_absorb_permutes (partition_i))
> ! 	/* Calculate the cost of converting NODE to use each permute.
> ! 	   If PARTITION.LAYOUT > 0 then that layout's permute will be free.  */
> ! 	for (unsigned int layout_i = 0; layout_i < m_perms.length ();
> ! 	     ++layout_i)
> ! 	  {
> ! 	    auto &layout_costs = partition_layout_costs (partition_i,
> ! 							 layout_i);
> ! 	    int factor = node_absorption_cost (node, layout_i);
> ! 	    if (factor < 0)
> ! 	      {
> ! 		layout_costs.mark_impossible ();
> ! 		continue;
> ! 	      }
> ! 	    gcc_assert (layout_costs.is_possible ());
> ! 	    layout_costs.in_cost = { vertex.weight * factor, m_optimize_size };
> ! 	  }
>   
> !       auto process_edge = [&](graph_edge *ud, unsigned int node2_i,
> ! 			      unsigned int partition2_i)
> ! 	{
> ! 	  /* Count the number of edges from earlier partitions and the number
> ! 	     of edges to later partitions.  */
> ! 	  if (partition2_i < partition_i)
> ! 	    partition.in_degree += 1;
> ! 	  else
> ! 	    partition.out_degree += 1;
>   
> ! 	  /* If the current node uses the result of NODE2_I, accumulate
> ! 	     the effects of that.  */
> ! 	  if (ud->src == int (node_i))
> ! 	    {
> ! 	      m_vertices[node2_i].out_weight += vertex.weight;
> ! 	      m_vertices[node2_i].out_degree += 1;
>   	    }
> + 	};
> +       for_each_partition_edge (node_i, process_edge);
> +     }
>   
> !   /* Layout 0 represents the original order and layout LAYOUT_I represents
> !      the effect of reversing m_perms[LAYOUT_I].  If a predecessor partition
> !      does not support layout LAYOUT_I and if it also isn't possible to go
> !      from layout 0 to layout LAYOUT_I, then it is very difficult for a
> !      successor partition to support layout LAYOUT_I.  (The only alternative
> !      would be to go to layout LAYOUT_I from a different nonzero layout,
> !      but that's a niche case and not really worth supporting.)  */
> !   for (unsigned int layout_i = 1; layout_i < m_perms.length (); ++layout_i)
> !     {
> !       if (layout_is_reversible (layout_i))
> ! 	continue;
> ! 
> !       bool changed;
> !       do
> ! 	{
> ! 	  changed = false;
> ! 	  for (unsigned int node_i : m_partitioned_nodes)
>   	    {
> ! 	      unsigned int partition_i = m_vertices[node_i].partition;
> ! 	      if (partition_layout_costs (partition_i,
> ! 					  layout_i).is_possible ())
> ! 		continue;
> ! 	      for (graph_edge *pred = m_slpg->vertices[node_i].pred;
> ! 		   pred; pred = pred->pred_next)
>   		{
> ! 		  unsigned int partition2_i = m_vertices[pred->src].partition;
> ! 		  auto &costs = partition_layout_costs (partition2_i,
> ! 							layout_i);
> ! 		  if (costs.is_possible ())
>   		    {
> ! 		      costs.mark_impossible ();
> ! 		      changed = true;
>   		    }
>   		}
> + 	    }
> + 	}
> +       while (changed);
> +     }
> + }
> + 
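(To make the direction of that fixpoint concrete, here is a standalone
sketch with made-up data; it is not code from the patch.  Impossibility
flows from a definition partition to every partition that consumes its
result:)

    #include <cstdio>
    #include <vector>

    int
    main ()
    {
      /* users[p] lists the partitions that consume partition p's
         result.  Partition 0 cannot produce irreversible layout L.  */
      std::vector<std::vector<int>> users = { { 1 }, { 2 }, { } };
      std::vector<bool> possible = { false, true, true };

      bool changed;
      do
        {
          changed = false;
          for (unsigned p = 0; p < users.size (); ++p)
            if (!possible[p])
              for (int u : users[p])
                if (possible[u])
                  {
                    possible[u] = false;
                    changed = true;
                  }
        }
      while (changed);

      /* Prints "impossible" for all three partitions; a single sweep
         suffices here because the partitions are visited in dataflow
         order.  */
      for (unsigned p = 0; p < possible.size (); ++p)
        std::printf ("partition %u: %s\n", p,
                     possible[p] ? "possible" : "impossible");
    }
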
> + /* Return the cost of switching between layout LAYOUT1_I (at node NODE1_I)
> +    and layout LAYOUT2_I on cross-partition use-to-def edge UD.  */
> + 
> + slpg_layout_cost
> + vect_optimize_slp_pass::edge_layout_cost (graph_edge *ud,
> + 					  unsigned int node1_i,
> + 					  unsigned int layout1_i,
> + 					  unsigned int layout2_i)
> + {
> +   auto def_layout_i = ud->dest == int (node1_i) ? layout1_i : layout2_i;
> +   auto use_layout_i = ud->dest == int (node1_i) ? layout2_i : layout1_i;
> +   auto factor = layout_cost (def_layout_i, use_layout_i);
> + 
> +   /* We have a choice of putting the layout change at the site of the
> +      definition or at the site of the use.  Prefer the former when
> +      optimizing for size or when the execution frequency of the
> +      definition is no greater than the combined execution frequencies of
> +      the uses.  When putting the layout change at the site of the definition,
> +      divvy up the cost among all consumers.  */
> +   slpg_vertex &def_vertex = m_vertices[ud->dest];
> +   slpg_vertex &use_vertex = m_vertices[ud->src];
> +   if (m_optimize_size || def_vertex.weight <= def_vertex.out_weight)
> +     {
> +       slpg_layout_cost cost = { def_vertex.weight * factor, m_optimize_size };
> +       cost.split (def_vertex.out_degree);
> +       return cost;
> +     }
> +   return { use_vertex.weight * factor, m_optimize_size };
> + }
> + 
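(A made-up worked example of the choice above: with factor 1, a
definition of weight 2.0, out_degree 4 and combined use weight
(out_weight) 3.0, the layout change is placed at the definition and
each of the four edges is charged 2.0 / 4 = 0.5.  If the definition
instead sits in a loop with weight 20.0 while its uses still have
combined weight 3.0, then when optimising for speed the change is
placed at the use, and the edge is charged that use's own weight.)
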
> + /* UD represents a use-def link between FROM_NODE_I and a node in a later
> +    partition; FROM_NODE_I could be the definition node or the use node.
> +    The node at the other end of the link wants to use layout TO_LAYOUT_I;
> +    return the cost of any necessary fix-ups on edge UD.
> + 
> +    At this point, FROM_NODE_I's partition has chosen the cheapest
> +    layout based on the information available so far, but this choice
> +    is only provisional.  */
> + 
> + slpg_layout_cost
> + vect_optimize_slp_pass::forward_cost (graph_edge *ud,
> + 				      unsigned int from_node_i,
> + 				      unsigned int to_layout_i)
> + {
> +   slpg_vertex &from_vertex = m_vertices[from_node_i];
> +   unsigned int from_partition_i = from_vertex.partition;
> +   slpg_partition_info &from_partition = m_partitions[from_partition_i];
> +   gcc_assert (from_partition.layout >= 0);
> + 
> +   /* First calculate the cost on the assumption that FROM_PARTITION sticks
> +      with its current layout preference.  */
> +   slpg_layout_cost cost
> +     = partition_layout_costs (from_partition_i,
> + 			      from_partition.layout).in_cost;
> +   cost.split (from_partition.out_degree);
> +   cost.add_serial_cost (edge_layout_cost (ud, from_node_i,
> + 					  from_partition.layout, to_layout_i));
> + 
> +   /* Take the minimum of that cost and the cost that applies if
> +      FROM_PARTITION instead switches to TO_LAYOUT_I.  */
> +   auto &direct_layout_costs = partition_layout_costs (from_partition_i,
> + 						      to_layout_i);
> +   if (direct_layout_costs.is_possible ())
> +     {
> +       slpg_layout_cost direct_cost = direct_layout_costs.in_cost;
> +       direct_cost.split (from_partition.out_degree);
> +       if (direct_cost.is_better_than (cost, m_optimize_size))
> + 	cost = direct_cost;
> +     }
> + 
> +   /* If FROM_PARTITION is likely to be able to absorb the required outgoing
> +      permute, optimistically assume that that is what will happen.  */
> +   if (partition_can_absorb_permutes (from_partition_i))
> +     {
> +       int factor = node_absorption_cost (from_vertex.node, to_layout_i);
> +       if (factor >= 0)
> + 	{
> + 	  sreal weight = from_vertex.weight * factor;
> + 	  slpg_layout_cost absorb_cost = { weight, m_optimize_size };
> + 	  if (absorb_cost.is_better_than (cost, m_optimize_size))
> + 	    cost = absorb_cost;
> + 	}
> +     }
> + 
> +   return cost;
> + }
> + 
> + /* UD represents a use-def link between TO_NODE_I and a node in an earlier
> +    partition; TO_NODE_I could be the definition node or the use node.
> +    The node at the other end of the link wants to use layout FROM_LAYOUT_I;
> +    return the cost of any necessary fix-ups on edge UD.
> + 
> +    At this point, TO_NODE_I's partition has a fixed choice of layout.  */
> + 
> + slpg_layout_cost
> + vect_optimize_slp_pass::backward_cost (graph_edge *ud,
> + 				       unsigned int to_node_i,
> + 				       unsigned int from_layout_i)
> + {
> +   slpg_vertex &to_vertex = m_vertices[to_node_i];
> +   unsigned int to_partition_i = to_vertex.partition;
> +   slpg_partition_info &to_partition = m_partitions[to_partition_i];
> +   gcc_assert (to_partition.layout >= 0);
> + 
> +   slpg_layout_cost cost
> +     = partition_layout_costs (to_partition_i,
> + 			      to_partition.layout).out_cost;
> +   cost.split (to_partition.in_degree);
> +   cost.add_serial_cost (edge_layout_cost (ud, to_node_i, to_partition.layout,
> + 					  from_layout_i));
> + 
> +   /* If TO_PARTITION is likely to be able to absorb the required input permute,
> +      optimistically assume that that is what will happen.  */
> +   if (partition_can_absorb_permutes (to_partition_i))
> +     {
> +       int factor = node_absorption_cost (to_vertex.node, from_layout_i);
> +       if (factor >= 0)
> + 	{
> + 	  sreal weight = to_vertex.weight * factor;
> + 	  slpg_layout_cost absorb_cost = { weight, m_optimize_size };
> + 	  if (absorb_cost.is_better_than (cost, m_optimize_size))
> + 	    cost = absorb_cost;
> + 	}
> +     }
> + 
> +   return cost;
> + }
>   
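(The "absorb" shortcut in the two functions above matters mainly for
VEC_PERM_EXPR nodes: as the materialization code later in the patch
notes, an extra permute can be folded into an existing lane selection,
so absorb_cost can beat both edge placements.  That reading of
node_absorption_cost is my inference from the code, not a statement
from the patch.)
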
> ! /* Make a forward pass through the partitions, accumulating input costs.
> !    Make a provisional choice of layout for each partition.  */
> ! 
> ! void
> ! vect_optimize_slp_pass::forward_pass ()
> ! {
> !   for (unsigned int partition_i = 0; partition_i < m_partitions.length ();
> !        ++partition_i)
> !     {
> !       auto &partition = m_partitions[partition_i];
> !       unsigned int min_layout_i = 0;
> !       slpg_layout_cost *min_layout_cost = nullptr;
> !       for (unsigned int layout_i = 0; layout_i < m_perms.length (); ++layout_i)
> ! 	{
> ! 	  auto &layout_costs = partition_layout_costs (partition_i, layout_i);
> ! 	  if (!layout_costs.is_possible ())
> ! 	    continue;
> ! 
> ! 	  /* Accumulate the costs from predecessor partitions.  */
> ! 	  for (unsigned int order_i = partition.node_begin;
> ! 	       order_i < partition.node_end; ++order_i)
> ! 	    {
> ! 	      unsigned int node_i = m_partitioned_nodes[order_i];
> ! 	      auto add_cost = [&](graph_edge *ud, unsigned int from_node_i,
> ! 				  unsigned int from_partition_i)
>   		{
> ! 		  if (from_partition_i < partition_i)
>   		    {
> ! 		      auto cost = forward_cost (ud, from_node_i, layout_i);
> ! 		      layout_costs.in_cost.add_parallel_cost (cost);
>   		    }
> ! 		};
> ! 	      for_each_partition_edge (node_i, add_cost);
> ! 	    }
>   
> ! 	  /* Record the layout with the lowest cost.  Prefer layout 0 in
> ! 	     the event of a tie between it and another layout.  */
> ! 	  if (!min_layout_cost
> ! 	      || layout_costs.in_cost.is_better_than (*min_layout_cost,
> ! 						      m_optimize_size))
> ! 	    {
> ! 	      min_layout_i = layout_i;
> ! 	      min_layout_cost = &layout_costs.in_cost;
>   	    }
> + 	}
> +       gcc_assert (min_layout_cost);
> +       partition.layout = min_layout_i;
> +     }
> + }
>   
> ! /* Make a backward pass through the partitions, accumulating output costs.
> !    Make a final choice of layout for each partition.  */
>   
> ! void
> ! vect_optimize_slp_pass::backward_pass ()
> ! {
> !   for (unsigned int partition_i = m_partitions.length (); partition_i-- > 0;)
> !     {
> !       auto &partition = m_partitions[partition_i];
> !       unsigned int min_layout_i = 0;
> !       slpg_layout_cost *min_layout_cost = nullptr;
> !       unsigned int seen_min_layout_i = 0;
> !       slpg_layout_cost *seen_min_layout_cost = nullptr;
> !       for (unsigned int layout_i = 0; layout_i < m_perms.length (); ++layout_i)
> ! 	{
> ! 	  auto &layout_costs = partition_layout_costs (partition_i, layout_i);
> ! 	  if (!layout_costs.is_possible ())
>   	    continue;
>   
> ! 	  /* Accumulate the costs from successor partitions.  Record whether
> ! 	     any of them actually uses this layout.  */
> ! 	  bool seen = false;
> ! 	  for (unsigned int order_i = partition.node_begin;
> ! 	       order_i < partition.node_end; ++order_i)
> ! 	    {
> ! 	      unsigned int node_i = m_partitioned_nodes[order_i];
> ! 	      auto add_cost = [&](graph_edge *ud, unsigned int to_node_i,
> ! 				  unsigned int to_partition_i)
> ! 		{
> ! 		  if (to_partition_i > partition_i)
> ! 		    {
> ! 		      auto cost = backward_cost (ud, to_node_i, layout_i);
> ! 		      layout_costs.out_cost.add_parallel_cost (cost);
> ! 		      seen |= (m_partitions[to_partition_i].layout
> ! 			       == int (layout_i));
> ! 		    }
> ! 		};
> ! 	      for_each_partition_edge (node_i, add_cost);
>   	    }
>   
> ! 	  /* Locally combine the costs from the forward and backward passes.
> ! 	     (This combined cost is not passed on, since that would lead
> ! 	     to double counting.)  */
> ! 	  layout_costs.cost = layout_costs.in_cost;
> ! 	  layout_costs.cost.add_serial_cost (layout_costs.out_cost);
> ! 
> ! 	  /* Record the layout with the lowest cost.  Prefer layout 0 in
> ! 	     the event of a tie between it and another layout.  */
> ! 	  if (!min_layout_cost
> ! 	      || layout_costs.cost.is_better_than (*min_layout_cost,
> ! 						   m_optimize_size))
> ! 	    {
> ! 	      min_layout_i = layout_i;
> ! 	      min_layout_cost = &layout_costs.cost;
> ! 	    }
> ! 	  /* Likewise, but restrict the choice of layouts to those that
> ! 	     are used by successor partitions.  */
> ! 	  if (seen
> ! 	      && (!seen_min_layout_cost
> ! 		  || layout_costs.cost.is_better_than (*seen_min_layout_cost,
> ! 						       m_optimize_size)))
> ! 	    {
> ! 	      seen_min_layout_i = layout_i;
> ! 	      seen_min_layout_cost = &layout_costs.cost;
> ! 	    }
> ! 	}
> !       gcc_assert (min_layout_cost);
> ! 
> !       /* If the partition can absorb permutes, prefer a layout that has
> ! 	 actually been used.  */
> !       if (seen_min_layout_cost
> ! 	  && partition_can_absorb_permutes (partition_i))
> ! 	partition.layout = seen_min_layout_i;
> !       else
> ! 	partition.layout = min_layout_i;
> !     }
> ! }
> ! 
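(As a made-up illustration of how the two passes interact, in the
spirit of the f3 example quoted below: take partitions P0 -> P1 -> P2
with layouts 0 and 1, where P1 runs inside a loop and P2 is a store
pinned to layout 0.  The forward pass gives P1 a provisional layout 1
because the permute on the loop-entry edge is cheap; the backward pass
then adds the cost of the loop-exit edge facing P2's fixed layout 0,
and P1 keeps layout 1 as long as the two edge permutes together are
cheaper than repeating a permute on every loop iteration.)
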
> ! /* Return a node that applies layout TO_LAYOUT_I to the original form of NODE.
> !    NODE already has the layout that was selected for its partition.  */
> ! 
> ! slp_tree
> ! vect_optimize_slp_pass::get_result_with_layout (slp_tree node,
> ! 						unsigned int to_layout_i)
> ! {
> !   unsigned int result_i = node->vertex * m_perms.length () + to_layout_i;
> !   slp_tree result = m_node_layouts[result_i];
> !   if (result)
> !     return result;
> ! 
> !   if (SLP_TREE_DEF_TYPE (node) == vect_constant_def
> !       || SLP_TREE_DEF_TYPE (node) == vect_external_def)
> !     {
> !       /* If the vector is uniform or unchanged, there's nothing to do.  */
> !       if (to_layout_i == 0 || vect_slp_tree_uniform_p (node))
> ! 	result = node;
> !       else
>   	{
> ! 	  auto scalar_ops = SLP_TREE_SCALAR_OPS (node).copy ();
> ! 	  result = vect_create_new_slp_node (scalar_ops);
> ! 	  vect_slp_permute (m_perms[to_layout_i], scalar_ops, true);
>   	}
>       }
> !   else
> !     {
> !       unsigned int partition_i = m_vertices[node->vertex].partition;
> !       unsigned int from_layout_i = m_partitions[partition_i].layout;
> !       if (from_layout_i == to_layout_i)
> ! 	return node;
> ! 
> !       /* FIXME: if the result of a pre-existing VEC_PERM_EXPR is used
> ! 	 twice, with two different layouts, then we could create
> ! 	 parallel VEC_PERM_EXPRs rather than add a serial one.  */
> ! 
> !       /* FIXME: When calculating the costs, we considered the possibility
> ! 	 of duplicating a VEC_PERM_EXPR at each use site, rather than
> ! 	 having one at the definition site.  Do that here if it turns
> ! 	 out to be cheaper.  */
> ! 
> !       if (dump_enabled_p ())
> ! 	dump_printf_loc (MSG_NOTE, vect_location,
> ! 			 "inserting permute node in place of %p\n",
> ! 			 node);
> ! 
> !       unsigned int num_lanes = SLP_TREE_LANES (node);
> !       result = vect_create_new_slp_node (1, VEC_PERM_EXPR);
> !       if (SLP_TREE_SCALAR_STMTS (node).length ())
> ! 	{
> ! 	  auto &stmts = SLP_TREE_SCALAR_STMTS (result);
> ! 	  stmts.safe_splice (SLP_TREE_SCALAR_STMTS (node));
> ! 	  if (from_layout_i != 0)
> ! 	    vect_slp_permute (m_perms[from_layout_i], stmts, false);
> ! 	  if (to_layout_i != 0)
> ! 	    vect_slp_permute (m_perms[to_layout_i], stmts, true);
> ! 	}
> !       SLP_TREE_REPRESENTATIVE (result) = SLP_TREE_REPRESENTATIVE (node);
> !       SLP_TREE_LANES (result) = num_lanes;
> !       SLP_TREE_VECTYPE (result) = SLP_TREE_VECTYPE (node);
> !       result->vertex = -1;
> ! 
> !       auto &lane_perm = SLP_TREE_LANE_PERMUTATION (result);
> !       lane_perm.create (num_lanes);
> !       for (unsigned j = 0; j < num_lanes; ++j)
> ! 	lane_perm.quick_push ({ 0, j });
> !       if (from_layout_i != 0)
> ! 	vect_slp_permute (m_perms[from_layout_i], lane_perm, false);
> !       if (to_layout_i != 0)
> ! 	vect_slp_permute (m_perms[to_layout_i], lane_perm, true);
> ! 
> !       node->refcnt++;
> !       SLP_TREE_CHILDREN (result).safe_push (node);
> !     }
> !   m_node_layouts[result_i] = result;
> !   return result;
> ! }
> ! 
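(A concrete case, with numbers of my own choosing: if NODE's partition
chose m_perms[from_layout_i] = { 1, 0, 3, 2 } and a consumer needs
layout 0, the code above wraps NODE in a single-input VEC_PERM_EXPR
whose lane permutation is { (0,1), (0,0), (0,3), (0,2) }; since
{ 1, 0, 3, 2 } is its own inverse, applying it once more restores the
original lane order.)
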
> ! /* Print the partition graph and layout information to the dump file.  */
>   
> ! void
> ! vect_optimize_slp_pass::dump ()
> ! {
> !   dump_printf_loc (MSG_NOTE, vect_location,
> ! 		   "SLP optimize permutes:\n");
> !   for (unsigned int layout_i = 1; layout_i < m_perms.length (); ++layout_i)
>       {
> !       dump_printf_loc (MSG_NOTE, vect_location, "  %d: { ", layout_i);
> !       const char *sep = "";
> !       for (unsigned int idx : m_perms[layout_i])
> ! 	{
> ! 	  dump_printf (MSG_NOTE, "%s%d", sep, idx);
> ! 	  sep = ", ";
> ! 	}
> !       dump_printf (MSG_NOTE, " }\n");
> !     }
> !   dump_printf_loc (MSG_NOTE, vect_location,
> ! 		   "SLP optimize partitions:\n");
> !   for (unsigned int partition_i = 0; partition_i < m_partitions.length ();
> !        ++partition_i)
> !     {
> !       auto &partition = m_partitions[partition_i];
> !       dump_printf_loc (MSG_NOTE, vect_location,  "  -------------\n");
> !       dump_printf_loc (MSG_NOTE, vect_location,
> ! 		       "  partition %d (layout %d):\n",
> ! 		       partition_i, partition.layout);
> !       dump_printf_loc (MSG_NOTE, vect_location, "    nodes:\n");
> !       for (unsigned int order_i = partition.node_begin;
> ! 	   order_i < partition.node_end; ++order_i)
> ! 	{
> ! 	  unsigned int node_i = m_partitioned_nodes[order_i];
> ! 	  slp_tree node = m_vertices[node_i].node;
> ! 	  dump_printf_loc (MSG_NOTE, vect_location, "      - %p:\n",
> ! 			   (void *) m_vertices[node_i].node);
> ! 	  dump_printf_loc (MSG_NOTE, vect_location,
> ! 			   "          weight: %f\n",
> ! 			   m_vertices[node_i].weight.to_double ());
> ! 	  if (m_vertices[node_i].out_degree)
> ! 	    dump_printf_loc (MSG_NOTE, vect_location,
> ! 			     "          out weight: %f (degree %d)\n",
> ! 			     m_vertices[node_i].out_weight.to_double (),
> ! 			     m_vertices[node_i].out_degree);
> ! 	  if (SLP_TREE_CODE (node) == VEC_PERM_EXPR)
> ! 	    dump_printf_loc (MSG_NOTE, vect_location,
> ! 			     "          op: VEC_PERM_EXPR\n");
> ! 	  else if (auto rep = SLP_TREE_REPRESENTATIVE (node))
> ! 	    dump_printf_loc (MSG_NOTE, vect_location,
> ! 			     "          op template: %G", rep->stmt);
> ! 	}
> !       dump_printf_loc (MSG_NOTE, vect_location, "    edges:\n");
> !       for (unsigned int order_i = partition.node_begin;
> ! 	   order_i < partition.node_end; ++order_i)
> ! 	{
> ! 	  unsigned int node_i = m_partitioned_nodes[order_i];
> ! 	  auto print_edge = [&](graph_edge *, unsigned int node2_i,
> ! 				unsigned int partition2_i)
> ! 	    {
> ! 	      if (partition2_i < partition_i)
> ! 		dump_printf_loc (MSG_NOTE, vect_location,
> ! 				 "      - %p [%d] --> %p\n",
> ! 				 (void *) m_vertices[node2_i].node, partition2_i,
> ! 				 (void *) m_vertices[node_i].node);
> ! 	      else
> ! 		dump_printf_loc (MSG_NOTE, vect_location,
> ! 				 "      - %p --> [%d] %p\n",
> ! 				 (void *) m_vertices[node_i].node, partition2_i,
> ! 				 (void *) m_vertices[node2_i].node);
> ! 	    };
> ! 	  for_each_partition_edge (node_i, print_edge);
> ! 	}
>   
> !       for (unsigned int layout_i = 0; layout_i < m_perms.length (); ++layout_i)
> ! 	{
> ! 	  auto &layout_costs = partition_layout_costs (partition_i, layout_i);
> ! 	  if (layout_costs.is_possible ())
> ! 	    {
> ! 	      dump_printf_loc (MSG_NOTE, vect_location,
> ! 			       "    layout %d:%s\n", layout_i,
> ! 			       partition.layout == int (layout_i)
> ! 			       ? " (*)" : "");
> ! #define TEMPLATE "{depth: %f, total: %f}"
> ! 	      dump_printf_loc (MSG_NOTE, vect_location,
> ! 			       "        " TEMPLATE "\n",
> ! 			       layout_costs.in_cost.depth.to_double (),
> ! 			       layout_costs.in_cost.total.to_double ());
> ! 	      dump_printf_loc (MSG_NOTE, vect_location,
> ! 			       "      + " TEMPLATE "\n",
> ! 			       layout_costs.out_cost.depth.to_double (),
> ! 			       layout_costs.out_cost.total.to_double ());
> ! 	      dump_printf_loc (MSG_NOTE, vect_location,
> ! 			       "      = " TEMPLATE "\n",
> ! 			       layout_costs.cost.depth.to_double (),
> ! 			       layout_costs.cost.total.to_double ());
> ! #undef TEMPLATE
> ! 	    }
> ! 	  else
> ! 	    dump_printf_loc (MSG_NOTE, vect_location,
> ! 			     "    layout %d: rejected\n", layout_i);
> ! 	}
> !     }
> ! }
>   
> ! /* Apply the chosen vector layouts to the SLP graph.  */
>   
> ! void
> ! vect_optimize_slp_pass::materialize ()
> ! {
> !   m_partition_layout_costs.release ();
> !   m_node_layouts.safe_grow_cleared (m_vertices.length () * m_perms.length ());
>   
> +   for (unsigned int node_i : m_partitioned_nodes)
> +     {
> +       /* Rearrange the scalar statements to match the chosen layout.  */
> +       slp_tree node = m_vertices[node_i].node;
> +       int layout_i = m_partitions[m_vertices[node_i].partition].layout;
> +       gcc_assert (layout_i >= 0);
> +       if (layout_i > 0)
> + 	vect_slp_permute (m_perms[layout_i],
> + 			  SLP_TREE_SCALAR_STMTS (node), true);
> + 
> +       /* Update load and lane permutes.  */
>         if (SLP_TREE_CODE (node) == VEC_PERM_EXPR)
>   	{
>   	  /* Apply the common permutes to the input vectors.  */
> ! 	  /* FIXME: check whether the target accepts this folding and
> ! 	     handle it like a normal node if not.  That would include
> ! 	     adjusting the follow-up loop below.  */
> ! 	  unsigned int opno;
> ! 	  slp_tree child;
> ! 	  FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), opno, child)
> ! 	    {
> ! 	      unsigned int partition2_i = m_vertices[child->vertex].partition;
> ! 	      int layout2_i = m_partitions[partition2_i].layout;
> ! 	      if (layout2_i == 0)
> ! 		continue;
> ! 
>   	      /* If the node is already a permute node we can apply
>   		 the permutation to the lane selection, effectively
>   		 materializing it on the incoming vectors.  */
> ***************
> *** 3987,4141 ****
>   		dump_printf_loc (MSG_NOTE, vect_location,
>   				 "simplifying permute node %p\n",
>   				 node);
> ! 	      for (unsigned k = 0;
> ! 		   k < SLP_TREE_LANE_PERMUTATION (node).length (); ++k)
> ! 		SLP_TREE_LANE_PERMUTATION (node)[k].second
> ! 		  = perms[perm_in][SLP_TREE_LANE_PERMUTATION (node)[k].second];
> ! 	    }
> ! 	  /* Apply the anticipated output permute to the permute and
> ! 	     stmt vectors.  */
> ! 	  int perm_out = vertices[i].perm_out;
> ! 	  if (perm_out > 0)
> ! 	    {
> ! 	      vect_slp_permute (perms[perm_out],
> ! 				SLP_TREE_SCALAR_STMTS (node), true);
> ! 	      vect_slp_permute (perms[perm_out],
> ! 				SLP_TREE_LANE_PERMUTATION (node), true);
> ! 	    }
> ! 	}
> !       else if (vertices[i].get_perm_materialized () != 0)
> ! 	{
> ! 	  if (SLP_TREE_LOAD_PERMUTATION (node).exists ())
> ! 	    /* For loads simply drop the permutation, the load permutation
> ! 	       already performs the desired permutation.  */
> ! 	    ;
> ! 	  else if (SLP_TREE_LANE_PERMUTATION (node).exists ())
> ! 	    gcc_unreachable ();
> ! 	  else
> ! 	    {
> ! 	      if (dump_enabled_p ())
> ! 		dump_printf_loc (MSG_NOTE, vect_location,
> ! 				 "inserting permute node in place of %p\n",
> ! 				 node);
> ! 
> ! 	      /* Make a copy of NODE and in-place change it to a
> ! 		 VEC_PERM node to permute the lanes of the copy.  */
> ! 	      slp_tree copy = new _slp_tree;
> ! 	      SLP_TREE_CHILDREN (copy) = SLP_TREE_CHILDREN (node);
> ! 	      SLP_TREE_CHILDREN (node) = vNULL;
> ! 	      SLP_TREE_SCALAR_STMTS (copy)
> ! 		= SLP_TREE_SCALAR_STMTS (node).copy ();
> ! 	      vect_slp_permute (perms[perm_in],
> ! 				SLP_TREE_SCALAR_STMTS (copy), true);
> ! 	      gcc_assert (!SLP_TREE_SCALAR_OPS (node).exists ());
> ! 	      SLP_TREE_REPRESENTATIVE (copy) = SLP_TREE_REPRESENTATIVE (node);
> ! 	      gcc_assert (!SLP_TREE_LOAD_PERMUTATION (node).exists ());
> ! 	      SLP_TREE_LANE_PERMUTATION (copy)
> ! 		= SLP_TREE_LANE_PERMUTATION (node);
> ! 	      SLP_TREE_LANE_PERMUTATION (node) = vNULL;
> ! 	      SLP_TREE_VECTYPE (copy) = SLP_TREE_VECTYPE (node);
> ! 	      copy->refcnt = 1;
> ! 	      copy->max_nunits = node->max_nunits;
> ! 	      SLP_TREE_DEF_TYPE (copy) = SLP_TREE_DEF_TYPE (node);
> ! 	      SLP_TREE_LANES (copy) = SLP_TREE_LANES (node);
> ! 	      SLP_TREE_CODE (copy) = SLP_TREE_CODE (node);
> ! 
> ! 	      /* Now turn NODE into a VEC_PERM.  */
> ! 	      SLP_TREE_CHILDREN (node).safe_push (copy);
> ! 	      SLP_TREE_LANE_PERMUTATION (node).create (SLP_TREE_LANES (node));
> ! 	      for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j)
> ! 		SLP_TREE_LANE_PERMUTATION (node)
> ! 		  .quick_push (std::make_pair (0, perms[perm_in][j]));
> ! 	      SLP_TREE_CODE (node) = VEC_PERM_EXPR;
> ! 	    }
> ! 	}
> !       else if (perm_in > 0) /* perm_in == perm_out */
> ! 	{
> ! 	  /* Apply the reverse permutation to our stmts.  */
> ! 	  vect_slp_permute (perms[perm_in],
> ! 			    SLP_TREE_SCALAR_STMTS (node), true);
> ! 	  /* And to the lane/load permutation, which we can simply
> ! 	     make regular by design.  */
> ! 	  if (SLP_TREE_LOAD_PERMUTATION (node).exists ())
> ! 	    {
> ! 	      gcc_assert (!SLP_TREE_LANE_PERMUTATION (node).exists ());
> ! 	      /* ???  When we handle non-bijective permutes the idea
> ! 		 is that we can force the load-permutation to be
> ! 		 { min, min + 1, min + 2, ... max }.  But then the
> ! 		 scalar defs might no longer match the lane content
> ! 		 which means wrong-code with live lane vectorization.
> ! 		 So we possibly have to have NULL entries for those.  */
> ! 	      vect_slp_permute (perms[perm_in],
> ! 				SLP_TREE_LOAD_PERMUTATION (node), true);
> ! 	    }
> ! 	  else if (SLP_TREE_LANE_PERMUTATION (node).exists ())
> ! 	    gcc_unreachable ();
>   	}
>       }
>   
> !   /* Elide any permutations at BB reduction roots.  */
> !   if (is_a <bb_vec_info> (vinfo))
>       {
> !       for (slp_instance instance : vinfo->slp_instances)
>   	{
> ! 	  if (SLP_INSTANCE_KIND (instance) != slp_inst_kind_bb_reduc)
>   	    continue;
> ! 	  slp_tree old = SLP_INSTANCE_TREE (instance);
> ! 	  if (SLP_TREE_CODE (old) == VEC_PERM_EXPR
> ! 	      && SLP_TREE_CHILDREN (old).length () == 1)
>   	    {
> ! 	      slp_tree child = SLP_TREE_CHILDREN (old)[0];
> ! 	      if (SLP_TREE_DEF_TYPE (child) == vect_external_def)
> ! 		{
> ! 		  /* Preserve the special VEC_PERM we use to shield existing
> ! 		     vector defs from the rest.  But make it a no-op.  */
> ! 		  unsigned i = 0;
> ! 		  for (std::pair<unsigned, unsigned> &p
> ! 		       : SLP_TREE_LANE_PERMUTATION (old))
> ! 		    p.second = i++;
> ! 		}
> ! 	      else
> ! 		{
> ! 		  SLP_INSTANCE_TREE (instance) = child;
> ! 		  SLP_TREE_REF_COUNT (child)++;
> ! 		  vect_free_slp_tree (old);
> ! 		}
> ! 	    }
> ! 	  else if (SLP_TREE_LOAD_PERMUTATION (old).exists ()
> ! 		   && SLP_TREE_REF_COUNT (old) == 1
> ! 		   && vertices[old->vertex].get_perm_materialized () != 0)
> ! 	    {
> ! 	      /* ???  For loads the situation is more complex since
> ! 		 we can't modify the permute in place in case the
> ! 		 node is used multiple times.  In fact for loads this
> ! 		 should be somehow handled in the propagation engine.  */
> ! 	      /* Apply the reverse permutation to our stmts.  */
> ! 	      int perm = vertices[old->vertex].get_perm_materialized ();
> ! 	      vect_slp_permute (perms[perm],
> ! 				SLP_TREE_SCALAR_STMTS (old), true);
> ! 	      vect_slp_permute (perms[perm],
> ! 				SLP_TREE_LOAD_PERMUTATION (old), true);
>   	    }
>   	}
>       }
>   
> !   /* Free the perms vector used for propagation.  */
> !   while (!perms.is_empty ())
> !     perms.pop ().release ();
> !   free_graph (slpg);
> ! 
>   
> !   /* Now elide load permutations that are not necessary.  */
> !   for (i = 0; i < leafs.length (); ++i)
>       {
> !       node = vertices[leafs[i]].node;
>         if (!SLP_TREE_LOAD_PERMUTATION (node).exists ())
>   	continue;
>   
>         /* In basic block vectorization we allow any subchain of an interleaving
>   	 chain.
>   	 FORNOW: not in loop SLP because of realignment complications.  */
> !       if (is_a <bb_vec_info> (vinfo))
>   	{
>   	  bool subchain_p = true;
>   	  stmt_vec_info next_load_info = NULL;
> --- 5094,5173 ----
>   		dump_printf_loc (MSG_NOTE, vect_location,
>   				 "simplifying permute node %p\n",
>   				 node);
> ! 	      for (auto &entry : SLP_TREE_LANE_PERMUTATION (node))
> ! 		if (entry.first == opno)
> ! 		  entry.second = m_perms[layout2_i][entry.second];
> ! 	    }
> ! 	  /* Reverse the permutation that we chose to reverse.  */
> ! 	  if (layout_i > 0)
> ! 	    vect_slp_permute (m_perms[layout_i],
> ! 			      SLP_TREE_LANE_PERMUTATION (node), true);
> ! 	}
> !       else
> ! 	{
> ! 	  gcc_assert (!SLP_TREE_LANE_PERMUTATION (node).exists ());
> ! 	  auto &load_perm = SLP_TREE_LOAD_PERMUTATION (node);
> ! 	  if (layout_i > 0)
> ! 	    /* ???  When we handle non-bijective permutes the idea
> ! 	       is that we can force the load-permutation to be
> ! 	       { min, min + 1, min + 2, ... max }.  But then the
> ! 	       scalar defs might no longer match the lane content
> ! 	       which means wrong-code with live lane vectorization.
> ! 	       So we possibly have to have NULL entries for those.  */
> ! 	    vect_slp_permute (m_perms[layout_i], load_perm, true);
>   	}
>       }
>   
> !   /* Do this before any nodes disappear, since it involves a walk
> !      over the leaves.  */
> !   remove_redundant_permutes ();
> ! 
> !   /* Replace each child with a correctly laid-out version.  */
> !   for (unsigned int node_i : m_partitioned_nodes)
>       {
> !       slp_tree node = m_vertices[node_i].node;
> ! 
> !       /* The loop above adjusted permutes for each child's chosen layout,
> ! 	 so we don't need to adjust them here.  */
> !       if (SLP_TREE_CODE (node) == VEC_PERM_EXPR)
> ! 	continue;
> ! 
> !       int layout_i = m_partitions[m_vertices[node_i].partition].layout;
> !       gcc_assert (layout_i >= 0);
> !       unsigned j;
> !       slp_tree child;
> !       FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), j, child)
>   	{
> ! 	  if (!child)
>   	    continue;
> ! 
> ! 	  slp_tree new_child = get_result_with_layout (child, layout_i);
> ! 	  if (new_child != child)
>   	    {
> ! 	      vect_free_slp_tree (child);
> ! 	      SLP_TREE_CHILDREN (node)[j] = new_child;
> ! 	      new_child->refcnt += 1;
>   	    }
>   	}
>       }
> + }
>   
> ! /* Elide load permutations that are not necessary.  Such permutations might
> !    be pre-existing, rather than created by the layout optimizations.  */
>   
> ! void
> ! vect_optimize_slp_pass::remove_redundant_permutes ()
> ! {
> !   for (unsigned int node_i : m_leafs)
>       {
> !       slp_tree node = m_vertices[node_i].node;
>         if (!SLP_TREE_LOAD_PERMUTATION (node).exists ())
>   	continue;
>   
>         /* In basic block vectorization we allow any subchain of an interleaving
>   	 chain.
>   	 FORNOW: not in loop SLP because of realignment complications.  */
> !       if (is_a <bb_vec_info> (m_vinfo))
>   	{
>   	  bool subchain_p = true;
>   	  stmt_vec_info next_load_info = NULL;
> ***************
> *** 4160,4165 ****
> --- 5192,5198 ----
>   	}
>         else
>   	{
> + 	  loop_vec_info loop_vinfo = as_a<loop_vec_info> (m_vinfo);
>   	  stmt_vec_info load_info;
>   	  bool this_load_permuted = false;
>   	  unsigned j;
> ***************
> *** 4175,4182 ****
>   	      /* The load requires permutation when unrolling exposes
>   		 a gap either because the group is larger than the SLP
>   		 group-size or because there is a gap between the groups.  */
> ! 	      && (known_eq (LOOP_VINFO_VECT_FACTOR
> ! 			      (as_a <loop_vec_info> (vinfo)), 1U)
>   		  || ((SLP_TREE_LANES (node) == DR_GROUP_SIZE (first_stmt_info))
>   		      && DR_GROUP_GAP (first_stmt_info) == 0)))
>   	    {
> --- 5208,5214 ----
>   	      /* The load requires permutation when unrolling exposes
>   		 a gap either because the group is larger than the SLP
>   		 group-size or because there is a gap between the groups.  */
> ! 	      && (known_eq (LOOP_VINFO_VECT_FACTOR (loop_vinfo), 1U)
>   		  || ((SLP_TREE_LANES (node) == DR_GROUP_SIZE (first_stmt_info))
>   		      && DR_GROUP_GAP (first_stmt_info) == 0)))
>   	    {
> ***************
> *** 4187,4192 ****
> --- 5219,5258 ----
>       }
>   }
>   
> + /* Main entry point for the optimization pass.  */
> + 
> + void
> + vect_optimize_slp_pass::run ()
> + {
> +   build_graph ();
> +   create_partitions ();
> +   start_choosing_layouts ();
> +   if (m_perms.length () > 1)
> +     {
> +       forward_pass ();
> +       backward_pass ();
> +       if (dump_enabled_p ())
> + 	dump ();
> +       materialize ();
> +     }
> +   else
> +     remove_redundant_permutes ();
> +   /* Free the perms vector used for propagation.  */
> +   while (!m_perms.is_empty ())
> +     m_perms.pop ().release ();
> +   free_graph (m_slpg);
> + }
> + 
> + /* Optimize the SLP graph of VINFO.  */
> + 
> + void
> + vect_optimize_slp (vec_info *vinfo)
> + {
> +   if (vinfo->slp_instances.is_empty ())
> +     return;
> +   vect_optimize_slp_pass (vinfo).run ();
> + }
> + 
>   /* Gather loads reachable from the individual SLP graph entries.  */
>   
>   void
> ***************
> *** 6647,6653 ****
>   {
>     stmt_vec_info stmt_info = SLP_TREE_SCALAR_STMTS (node)[0];
>     int vec_index = 0;
> !   tree vectype = STMT_VINFO_VECTYPE (stmt_info);
>     unsigned int group_size = SLP_TREE_SCALAR_STMTS (node).length ();
>     unsigned int mask_element;
>     machine_mode mode;
> --- 7713,7719 ----
>   {
>     stmt_vec_info stmt_info = SLP_TREE_SCALAR_STMTS (node)[0];
>     int vec_index = 0;
> !   tree vectype = SLP_TREE_VECTYPE (node);
>     unsigned int group_size = SLP_TREE_SCALAR_STMTS (node).length ();
>     unsigned int mask_element;
>     machine_mode mode;
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
HRB 36809 (AG Nuernberg)


* Re: [PATCH] RFC: Extend SLP permutation optimisations
  2022-08-03 11:23 ` Richard Biener
@ 2022-08-04  9:06   ` Richard Sandiford
  0 siblings, 0 replies; 3+ messages in thread
From: Richard Sandiford @ 2022-08-04  9:06 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches

Richard Biener <rguenther@suse.de> writes:
> On Tue, 2 Aug 2022, Richard Sandiford wrote:
>
>> Currently SLP tries to force permute operations "down" the graph
>> from loads in the hope of reducing the total number of permutes
>> needed or (in the best case) removing the need for the permutes
>> entirely.  This patch tries to extend it as follows:
>> 
>> - Allow loads to take a different permutation from the one they
>>   started with, rather than choosing between "original permutation"
>>   and "no permutation".
>> 
>> - Allow changes in both directions, if the target supports the
>>   reverse permute operation.
>> 
>> - Treat the placement of permute operations as a two-way dataflow
>>   problem: after propagating information from leaves to roots (as now),
>>   propagate information back up the graph.
>> 
>> - Take execution frequency into account when optimising for speed,
>>   so that (for example) permutes inside loops have a higher cost
>>   than permutes outside loops.
>> 
>> - Try to reduce the total number of permutes when optimising for
>>   size, even if that increases the number of permutes on a given
>>   execution path.
>> 
>> See the big block comment above vect_optimize_slp_pass for
>> a detailed description.
>> 
>> A while back I posted a patch to extend the existing optimisation
>> to non-consecutive loads.  This patch doesn't include that change
>> (although it's a possible future addition).
>> 
>> The original motivation for doing this was to add a framework that would
>> allow other layout differences in future.  The two main ones are:
>> 
>> - Make it easier to represent predicated operations, including
>>   predicated operations with gaps.  E.g.:
>> 
>>      a[0] += 1;
>>      a[1] += 1;
>>      a[3] += 1;
>> 
>>   could be a single load/add/store for SVE.  We could handle this
>>   by representing a layout such as { 0, 1, _, 2 } or { 0, 1, _, 3 }
>>   (depending on what's being counted).  We might need to move
>>   elements between lanes at various points, like with permutes.
>> 
>>   (This would first mean adding support for stores with gaps.)
>> 
>> - Make it easier to switch between an even/odd and unpermuted layout
>>   when switching between wide and narrow elements.  E.g. if a widening
>>   operation produces an even vector and an odd vector, we should try
>>   to keep operations on the wide elements in that order rather than
>>   force them to be permuted back "in order".
>> 
>> To give some examples of what the current patch does:
>> 
>> int f1(int *__restrict a, int *__restrict b, int *__restrict c,
>>        int *__restrict d)
>> {
>>   a[0] = (b[1] << c[3]) - d[1];
>>   a[1] = (b[0] << c[2]) - d[0];
>>   a[2] = (b[3] << c[1]) - d[3];
>>   a[3] = (b[2] << c[0]) - d[2];
>> }
>> 
>> continues to produce the same code as before when optimising for
>> speed: b, c and d are permuted at load time.  But when optimising
>> for size we instead permute c into the same order as b+d and then
>> permute the result of the arithmetic into the same order as a:
>> 
>>         ldr     q1, [x2]
>>         ldr     q0, [x1]
>>         ext     v1.16b, v1.16b, v1.16b, #8     // <------
>>         sshl    v0.4s, v0.4s, v1.4s
>>         ldr     q1, [x3]
>>         sub     v0.4s, v0.4s, v1.4s
>>         rev64   v0.4s, v0.4s                   // <------
>>         str     q0, [x0]
>>         ret
>> 
>> The following function:
>> 
>> int f2(int *__restrict a, int *__restrict b, int *__restrict c,
>>        int *__restrict d)
>> {
>>   a[0] = (b[3] << c[3]) - d[3];
>>   a[1] = (b[2] << c[2]) - d[2];
>>   a[2] = (b[1] << c[1]) - d[1];
>>   a[3] = (b[0] << c[0]) - d[0];
>> }
>> 
>> continues to push the reverse down to just before the store,
>> like the current code does.
>> 
>> In:
>> 
>> int f3(int *__restrict a, int *__restrict b, int *__restrict c,
>>        int *__restrict d)
>> {
>>   for (int i = 0; i < 100; ++i)
>>     {
>>       a[0] = (a[0] + c[3]);
>>       a[1] = (a[1] + c[2]);
>>       a[2] = (a[2] + c[1]);
>>       a[3] = (a[3] + c[0]);
>>       c += 4;
>>     }
>> }
>> 
>> the loads of a are hoisted and the stores of a are sunk, so that
>> only the load from c happens in the loop.  When optimising for
>> speed, we prefer to have the loop operate on the reversed layout,
>> changing on entry and exit from the loop:
>> 
>>         mov     x3, x0
>>         adrp    x0, .LC0
>>         add     x1, x2, 1600
>>         ldr     q2, [x0, #:lo12:.LC0]
>>         ldr     q0, [x3]
>>         mov     v1.16b, v0.16b
>>         tbl     v0.16b, {v0.16b - v1.16b}, v2.16b    // <--------
>>         .p2align 3,,7
>> .L6:
>>         ldr     q1, [x2], 16
>>         add     v0.4s, v0.4s, v1.4s
>>         cmp     x2, x1
>>         bne     .L6
>>         mov     v1.16b, v0.16b
>>         adrp    x0, .LC0
>>         ldr     q2, [x0, #:lo12:.LC0]
>>         tbl     v0.16b, {v0.16b - v1.16b}, v2.16b    // <--------
>>         str     q0, [x3]
>>         ret
>> 
>> Similarly, for the very artificial testcase:
>> 
>> int f4(int *__restrict a, int *__restrict b, int *__restrict c,
>>        int *__restrict d)
>> {
>>   int a0 = a[0];
>>   int a1 = a[1];
>>   int a2 = a[2];
>>   int a3 = a[3];
>>   for (int i = 0; i < 100; ++i)
>>     {
>>       a0 ^= c[0];
>>       a1 ^= c[1];
>>       a2 ^= c[2];
>>       a3 ^= c[3];
>>       c += 4;
>>       for (int j = 0; j < 100; ++j)
>> 	{
>> 	  a0 += d[1];
>> 	  a1 += d[0];
>> 	  a2 += d[3];
>> 	  a3 += d[2];
>> 	  d += 4;
>> 	}
>>       b[0] = a0;
>>       b[1] = a1;
>>       b[2] = a2;
>>       b[3] = a3;
>>       b += 4;
>>     }
>>   a[0] = a0;
>>   a[1] = a1;
>>   a[2] = a2;
>>   a[3] = a3;
>> }
>> 
>> the a vector in the inner loop maintains the order { 1, 0, 3, 2 },
>> even though it's part of an SCC that includes the outer loop.
>> In other words, this is a motivating case for not assigning
>> permutes at SCC granularity.  The code we get is:
>> 
>>         ldr     q0, [x0]
>>         mov     x4, x1
>>         mov     x5, x0
>>         add     x1, x3, 1600
>>         add     x3, x4, 1600
>>         .p2align 3,,7
>> .L11:
>>         ldr     q1, [x2], 16
>>         sub     x0, x1, #1600
>>         eor     v0.16b, v1.16b, v0.16b
>>         rev64   v0.4s, v0.4s              // <---
>>         .p2align 3,,7
>> .L10:
>>         ldr     q1, [x0], 16
>>         add     v0.4s, v0.4s, v1.4s
>>         cmp     x0, x1
>>         bne     .L10
>>         rev64   v0.4s, v0.4s              // <---
>>         add     x1, x0, 1600
>>         str     q0, [x4], 16
>>         cmp     x3, x4
>>         bne     .L11
>>         str     q0, [x5]
>>         ret
>> 
>> The patch is still incomplete, hence the FIXMEs, but it's at the stage
>> where it passes bootstrap & regtest on aarch64-linux-gnu.  (There's one
>> regression in slp-11b.c, but I think it's arguably restoring the
>> expected load-lanes behaviour.)  Does this look like a plausible
>> direction, before I go filling in the FIXME parts?
>
> Yes, it looks nice!

Thanks

> I wonder if you spent much thought on
> layout optimizing for SLP patterns and/or how to integrate both?

Hadn't thought of that.  But yeah, it looks like there's some overlap.
Perhaps the main extensions that would help SLP pattern matching are:

(1) Look for multiple loads from the same group.  Represent them as
    a single load node followed by individual VEC_PERM_EXPR nodes.
    Try to optimise the placement of these new VEC_PERM_EXPRs
    (a source sketch follows this list).

(2) Support non-bijective permutes.

(3) Look for CSE opportunities that can be exposed by moving the
    permutes (not sure about this one -- maybe we wouldn't discover
    much that wouldn't also be discovered on scalar code).
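
For (1), here is a minimal made-up source example (the function name
and lane choices are mine):

int f5(int *__restrict a, int *__restrict b)
{
  /* The two operands of the vectorised addition read the same
     four-element group of b with different lane orders, so (1)
     would represent them as one load node feeding two
     VEC_PERM_EXPRs with selections { 0, 1, 2, 3 } and
     { 3, 2, 1, 0 }.  */
  a[0] = b[0] + b[3];
  a[1] = b[1] + b[2];
  a[2] = b[2] + b[1];
  a[3] = b[3] + b[0];
}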

If (1) moved the new VEC_PERM_EXPRs as far from the loads as possible
then complex mul could become a peephole pattern match.  But moving
VEC_PERM_EXPRs as far away as possible isn't always the best thing
to do otherwise, so I think in practice we'd still need to track
multiple possibilities.

So yeah, it looks like integrating them would help, but I'm not
sure how much it would reduce the overall complexity.

Richard
