Re: [patch][vectorizer, SPU, PPC] Support load permutation in loop-aware SLP

public inbox for gcc-patches@gcc.gnu.org

* Re: [patch][vectorizer, SPU, PPC] Support load permutation in loop-aware  SLP
       [not found] <OFA024DE88.71972625-ONC22574B0.0042F525-C22574B0.0043F9D7@LocalDomain>
@ 2008-08-29  8:43 ` Dorit Nuzman
  0 siblings, 0 replies; 8+ messages in thread
From: Dorit Nuzman @ 2008-08-29  8:43 UTC (permalink / raw)
  To: Ira Rosen; +Cc: gcc-patches

Ira Rosen/Haifa/IBM wrote on 25/08/2008 15:22:28:

> Dorit Nuzman/Haifa/IBM wrote on 23/08/2008 09:52:47:
>
> > Ira Rosen/Haifa/IBM wrote on 21/08/2008 14:21:41:
> >
> > > Dorit Nuzman/Haifa/IBM wrote on 08/08/2008 18:06:46:
> > >
> > > > I have a problem with the fact that this specific permutation is so
> > > > hard-coded into the analysis. It's ok to support only one
> > > > permutation as a start, but the analysis itself should be general.
> > > > Hopefully this could be rewritten to identify more general patterns
> > > > during the analysis, represent the identified permutation somehow
> > > > (e.g. [3,2,1,0]), and then decide if we can proceed to vectorize
> > > > it or not.
> > >
> > > I changed the analysis part, so now during the SLP tree construction
> > > we only store the permutation, and check if the permutation is
> > > supported afterwards. I am attaching the updated (not fully tested)
> > > analysis part of the patch.
> > >
> >
> > great, thanks! (when you ci this patch, maybe add a couple testcases
> > for permutations that are not yet supported).
>
> Such testcases already exist in the original patch.
>
> >
> > (small question/request: can you please document what's the
> > difference between vect_supported_slp_permutation_p  and
> > vect_supported_load_permutation_p?)
>
> Sure.
>
> >
> > ...
> > > > I also have a problem with the transformation: it assumes a very
> > > > specific form of permute at the gimple level - a permute that takes
> > > > two vectors as input and a byte mask. I don't think this is a
> > > > general enough representation (I don't think that the SSE shuffles
> > > > take a byte mask for example?).
> > > > We need to think of a more general
> > > > way to represent a permute at this level, and maybe have a target
> > > > specific builtin expand it using byte mask when appropriate.
> > >
> > > AFAIK, SSE5 permute does take two vectors as input and a byte mask.
> > > But the mask is not similar to altivec/spu mask.
> >
> > Intel's SSE/AVX shuffle/permute insns (e.g. pshuf*, vpermil2*) have
> > 8-bit control fields per element (rather than per byte), and some of
> > these insns shuffle/permute elements only from a single input vector
> > (rather than two) ...
> >
> > > Maybe I can create
> > > an element mask at the tree level and leave the correct mask
> > > creation to the target (builtin)?
> > >
> >
> > yes, I think the mask creation should be done on a target specific
> > basis; the vectorizer could create a control mask given as a vector
> > of indices per element.
>
> Actually, the original patch already creates element mask and then
> calls vect_get_mask_element() to convert the mask according to its
> type (received from builtin_vec_perm).
> I must change the variable names (like mask_bytes, first_byte,
> etc.), but otherwise the mask creation is already target specific...
>

great.
ok with these changes,

thanks,
dorit

> >
> > I guess we can start by introducing a 2-operand permute (this is
> > what the vectorizer would currently know how to use), but it may be
> > useful to consider a single operand permute (+ control mask) later on.
> >
> > > >
> > > > Actually, I think the particular testcase you are targeting could be
> > > > vectorized by preparing an appropriate vector of constants instead
> > > > of working so hard on permuting the loads. Maybe we can try
> > > > something like that for now (and potentially defer the decision on a
> > > > representation of permute to a separate patch (and testcase)?)
> > >
> > > I don't think this will work. If we only permute the constants, we
> > > can't get the multiples in the correct order and we will have to
> > > permute them anyway:
> > > yi = M00 * ri + M01 * gi + M02 * bi
> > > ui = M11 * gi + M12 * bi + M10 * ri
> > > vi = M22 * bi + M20 * ri + M21 * gi
> > > (we have gbr and brg in the second and third columns instead of rgb).
> > >
> > > In case that the number of the grouped statements is smaller than
> > > the vector size (as in the rgb conversion), we need to unroll the
> > > loop, and then such permutation will be done across several vectors
> > > and will be as painful as the load permutation.
> > >
> >
> > ok. In a separate followup patch we could look into optimizing cases
> > in which the group size is equal to the vector size (like rgba).
>
> OK.
>
> Thanks,
> Ira


* Re: [patch][vectorizer, SPU, PPC] Support load permutation in loop-aware  SLP
       [not found] <OF3E4AB8DB.84B1F38D-ONC22574AE.00224129-C22574AE.0025CAB0@LocalDomain>
@ 2008-08-25 12:55 ` Ira Rosen
  0 siblings, 0 replies; 8+ messages in thread
From: Ira Rosen @ 2008-08-25 12:55 UTC (permalink / raw)
  To: Dorit Nuzman; +Cc: gcc-patches



Dorit Nuzman/Haifa/IBM wrote on 23/08/2008 09:52:47:

> Ira Rosen/Haifa/IBM wrote on 21/08/2008 14:21:41:
>
> > Dorit Nuzman/Haifa/IBM wrote on 08/08/2008 18:06:46:
> >
> > > I have a problem with the fact that this specific permutation is so
> > > hard-coded into the analysis. It's ok to support only one
> > > permutation as a start, but the analysis itself should be general.
> > > Hopefully this could be rewritten to identify more general patterns
> > > during the analysis, represent the identified permutation somehow
> > > (e.g. [3,2,1,0]), and then decide if we can proceed to vectorize
> > > it or not.
> >
> > I changed the analysis part, so now during the SLP tree construction
> > we only store the permutation, and check if the permutation is
> > supported afterwards. I am attaching the updated (not fully tested)
> > analysis part of the patch.
> >
>
> great, thanks! (when you ci this patch, maybe add a couple testcases
> for permutations that are not yet supported).

Such testcases already exist in the original patch.

>
> (small question/request: can you please document what's the
> difference between vect_supported_slp_permutation_p  and
> vect_supported_load_permutation_p?)

Sure.

>
> ...
> > > I also have a problem with the transformation: it assumes a very
> > > specific form of permute at the gimple level - a permute that takes
> > > two vectors as input and a byte mask. I don't think this is a
> > > general enough representation (I don't think that the SSE shuffles
> > > take a byte mask for example?).
> > > We need to think of a more general
> > > way to represent a permute at this level, and maybe have a target
> > > specific builtin expand it using byte mask when appropriate.
> >
> > AFAIK, SSE5 permute does take two vectors as input and a byte mask.
> > But the mask is not similar to altivec/spu mask.
>
> Intel's SSE/AVX shuffle/permute insns (e.g. pshuf*, vpermil2*) have
> 8-bit control fields per element (rather than per byte), and some of
> these insns shuffle/permute elements only from a single input vector
> (rather than two) ...
>
> > Maybe I can create
> > an element mask at the tree level and leave the correct mask
> > creation to the target (builtin)?
> >
>
> yes, I think the mask creation should be done on a target specific
> basis; the vectorizer could create a control mask given as a vector
> of indices per element.

Actually, the original patch already creates element mask and then calls
vect_get_mask_element() to convert the mask according to its type (received
from builtin_vec_perm).
I must change the variable names (like mask_bytes, first_byte, etc.), but
otherwise the mask creation is already target specific...

>
> I guess we can start by introducing a 2-operand permute (this is
> what the vectorizer would currently know how to use), but it may be
> useful to consider a single operand permute (+ control mask) later on.
>
> > >
> > > Actually, I think the particular testcase you are targeting could be
> > > vectorized by preparing an appropriate vector of constants instead
> > > of working so hard on permuting the loads. Maybe we can try
> > > something like that for now (and potentially defer the decision on a
> > > representation of permute to a separate patch (and testcase)?)
> >
> > I don't think this will work. If we only permute the constants, we
> > can't get the multiples in the correct order and we will have to
> > permute them anyway:
> > yi = M00 * ri + M01 * gi + M02 * bi
> > ui = M11 * gi + M12 * bi + M10 * ri
> > vi = M22 * bi + M20 * ri + M21 * gi
> > (we have gbr and brg in the second and third columns instead of rgb).
> >
> > In case that the number of the grouped statements is smaller than
> > the vector size (as in the rgb conversion), we need to unroll the
> > loop, and then such permutation will be done across several vectors
> > and will be as painful as the load permutation.
> >
>
> ok. In a separate followup patch we could look into optimizing cases
> in which the group size is equal to the vector size (like rgba).

OK.

Thanks,
Ira



* Re: [patch][vectorizer, SPU, PPC] Support load permutation in loop-aware  SLP
       [not found] <OF078FB537.452762A7-ONC22574AC.003A132A-C22574AC.003E6955@LocalDomain>
@ 2008-08-23  9:21 ` Dorit Nuzman
  0 siblings, 0 replies; 8+ messages in thread
From: Dorit Nuzman @ 2008-08-23  9:21 UTC (permalink / raw)
  To: Ira Rosen; +Cc: gcc-patches

Ira Rosen/Haifa/IBM wrote on 21/08/2008 14:21:41:

> Dorit Nuzman/Haifa/IBM wrote on 08/08/2008 18:06:46:
>
> > I have a problem with the fact that this specific permutation is so
> > hard-coded into the analysis. It's ok to support only one
> > permutation as a start, but the analysis itself should be general.
> > Hopefully this could be rewritten to identify more general patterns
> > during the analysis, represent the identified permutation somehow
> > (e.g. [3,2,1,0]), and then decide if we can proceed to vectorize
> > it or not.
>
> I changed the analysis part, so now during the SLP tree construction
> we only store the permutation, and check if the permutation is
> supported afterwards. I am attaching the updated (not fully tested)
> analysis part of the patch.
>

great, thanks! (when you ci this patch, maybe add a couple testcases for
permutations that are not yet supported).

(small question/request: can you please document what's the difference
between vect_supported_slp_permutation_p  and
vect_supported_load_permutation_p?)

...
> > I also have a problem with the transformation: it assumes a very
> > specific form of permute at the gimple level - a permute that takes
> > two vectors as input and a byte mask. I don't think this is a
> > general enough representation (I don't think that the SSE shuffles
> > take a byte mask for example?).
> > We need to think of a more general
> > way to represent a permute at this level, and maybe have a target
> > specific builtin expand it using byte mask when appropriate.
>
> AFAIK, SSE5 permute does take two vectors as input and a byte mask.
> But the mask is not similar to altivec/spu mask.

Intel's SSE/AVX shuffle/permute insns (e.g. pshuf*, vpermil2*) have 8-bit
control fields per element (rather than per byte), and some of these insns
shuffle/permute elements only from a single input vector (rather than
two) ...

> Maybe I can create
> an element mask at the tree level and leave the correct mask
> creation to the target (builtin)?
>

yes, I think the mask creation should be done on a target specific basis;
the vectorizer could create a control mask given as a vector of indices per
element.

I guess we can start by introducing a 2-operand permute (this is what the
vectorizer would currently know how to use), but it may be useful to
consider a single operand permute (+ control mask) later on.
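
To make the "vector of indices per element" idea concrete, here is a minimal
sketch of the semantics a 2-operand permute could have. The function name, the
use of plain int arrays in place of vectors, and the index convention (indices
0..N-1 select from the first operand, N..2N-1 from the second) are assumptions
made for illustration only; nothing here is taken from the patch or from any
target's builtin.

```c
#include <stddef.h>

/* Hypothetical helper, not part of the patch: apply a two-operand permute
   described by per-element indices.  Index k selects element k of A when
   k < NELEMS, and element k - NELEMS of B otherwise.
   Returns 0 on success, -1 if any index is out of range.  */
int
permute_two_operands (const int *a, const int *b,
                      const unsigned int *mask, size_t nelems, int *out)
{
  size_t i;

  /* Validate the whole mask first so OUT is untouched on failure.  */
  for (i = 0; i < nelems; i++)
    if (mask[i] >= 2 * nelems)
      return -1;

  for (i = 0; i < nelems; i++)
    out[i] = mask[i] < nelems ? a[mask[i]] : b[mask[i] - nelems];

  return 0;
}
```

Under this convention a single operand permute is just the special case in
which every index is smaller than NELEMS, so the second operand is never read.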

> >
> > Actually, I think the particular testcase you are targeting could be
> > vectorized by preparing an appropriate vector of constants instead
> > of working so hard on permuting the loads. Maybe we can try
> > something like that for now (and potentially defer the decision on a
> > representation of permute to a separate patch (and testcase)?)
>
> I don't think this will work. If we only permute the constants, we
> can't get the multiples in the correct order and we will have to
> permute them anyway:
> yi = M00 * ri + M01 * gi + M02 * bi
> ui = M11 * gi + M12 * bi + M10 * ri
> vi = M22 * bi + M20 * ri + M21 * gi
> (we have gbr and brg in the second and third columns instead of rgb).
>
> In case that the number of the grouped statements is smaller than
> the vector size (as in the rgb conversion), we need to unroll the
> loop, and then such permutation will be done across several vectors
> and will be as painful as the load permutation.
>

ok. In a separate followup patch we could look into optimizing cases in
which the group size is equal to the vector size (like rgba).

thanks,
dorit

> Thanks,
> Ira
>
> [attachment "slp-perm-updated.txt" deleted by Dorit Nuzman/Haifa/IBM]


* Re: [patch][vectorizer, SPU, PPC] Support load permutation in loop-aware  SLP
       [not found] <OF4F4F147D.44C29413-ONC225749F.004AA5E5-C225749F.005304A0@LocalDomain>
@ 2008-08-21 11:45 ` Ira Rosen
  0 siblings, 0 replies; 8+ messages in thread
From: Ira Rosen @ 2008-08-21 11:45 UTC (permalink / raw)
  To: Dorit Nuzman; +Cc: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 3257 bytes --]



Dorit Nuzman/Haifa/IBM wrote on 08/08/2008 18:06:46:

> I have a problem with the fact that this specific permutation is so
> hard-coded into the analysis. It's ok to support only one
> permutation as a start, but the analysis itself should be general.
> Hopefully this could be rewritten to identify more general patterns
> during the analysis, represent the identified permutation somehow
> (e.g. [3,2,1,0]), and then decide if we can proceed to vectorize
> it or not.

I changed the analysis part, so now during the SLP tree construction we
only store the permutation, and check if the permutation is supported
afterwards. I am attaching the updated (not fully tested)  analysis part of
the patch.
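
As a rough illustration of "store first, check afterwards", the sketch below
validates a recorded permutation on its own. It assumes the permutation is kept
as a flat array holding each load's place in its interleaving chain, and that
the only accepted shape is GROUP_SIZE runs of GROUP_SIZE identical entries
(e.g. 0 0 0 1 1 1 2 2 2 for the rgb case). The function name and the plain
array are hypothetical, and this is not the check from the attached patch.

```c
#include <stddef.h>

/* Hypothetical stand-alone check, not the patch's code.  PERM holds, for
   every load visited while building the SLP tree, the place of that load
   in its interleaving chain.  Accept only the "0..01..1.." shape: exactly
   GROUP_SIZE runs of GROUP_SIZE identical, in-range entries.  */
int
load_permutation_is_runs (const int *perm, size_t len, size_t group_size)
{
  size_t run, k;

  if (group_size == 0 || len != group_size * group_size)
    return 0;

  for (run = 0; run < group_size; run++)
    {
      int first = perm[run * group_size];

      /* Every place must name a member of the interleaving chain.  */
      if (first < 0 || (size_t) first >= group_size)
        return 0;

      for (k = 1; k < group_size; k++)
        if (perm[run * group_size + k] != first)
          return 0;
    }

  return 1;
}
```

Keeping the check separate from the recording step means a new shape can be
accepted later without changing how the tree construction stores the places.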

>
> > to YUV conversion, that can be viewed as {y, u, v} = M * {r, g, b},
> > where M is a matrix of constant coefficients, and the calculation is
> > performed in a single-nested loop:
> > for i
> > yi = M00 * ri +  M01 * gi + M02 * bi
> > ui = M10 * ri +  M11 * gi + M12 * bi
> > vi = M20 * ri +  M21 * gi + M22 * bi
> > The required permutation of loads is to transform rgb stream into
> > {r,r,r}, {g,g,g} and {b,b,b} vectors (ignoring vector size for simplicity).
>
> > The SLP analysis detects such cases: all the loads in the same SLP node
> > must access the same memory location, and all the SLP nodes that contain
> > loads must form a group of adjacent memory accesses. The transformation
> > phase generates vector permutations of the input vectors with compiler
> > generated masks, depending on the data type, vectorization factor and
> > size of SLP nodes.

> I also have a problem with the transformation: it assumes a very
> specific form of permute at the gimple level - a permute that takes
> two vectors as input and a byte mask. I don't think this is a
> general enough representation (I don't think that the SSE shuffles
> take a byte mask for example?).
> We need to think of a more general
> way to represent a permute at this level, and maybe have a target
> specific builtin expand it using byte mask when appropriate.

AFAIK, SSE5 permute does take two vectors as input and a byte mask. But the
mask is not similar to altivec/spu mask. Maybe I can create an element mask
at the tree level and leave the correct mask creation to the target
(builtin)?
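
One way the split could look, as a sketch only: the tree level hands over
element indices, and a target-side helper expands them into byte selectors
for a permute that picks bytes out of two concatenated input vectors. The
helper name, the element-to-byte numbering (element E covers bytes
E * size .. E * size + size - 1) and the absence of any endianness handling are
assumptions for illustration; a real target hook may well differ.

```c
#include <stddef.h>

/* Hypothetical target-side helper, not part of the patch: expand a mask of
   element indices into a byte-selector mask, in the style of a permute that
   picks individual bytes out of two concatenated input vectors.
   Element index E covers bytes E * ELEM_SIZE .. E * ELEM_SIZE + ELEM_SIZE - 1.
   BYTE_MASK must have room for NELEMS * ELEM_SIZE entries.  */
void
element_mask_to_byte_mask (const unsigned int *elem_mask, size_t nelems,
                           size_t elem_size, unsigned char *byte_mask)
{
  size_t i, b;

  for (i = 0; i < nelems; i++)
    for (b = 0; b < elem_size; b++)
      byte_mask[i * elem_size + b]
        = (unsigned char) (elem_mask[i] * elem_size + b);
}
```

A target whose permute takes per-element control fields could consume the
element mask directly and skip this expansion.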

>
> Actually, I think the particular testcase you are targeting could be
> vectorized by preparing an appropriate vector of constants instead
> of working so hard on permuting the loads. Maybe we can try
> something like that for now (and potentially defer the decision on a
> representation of permute to a separate patch (and testcase)?)

I don't think this will work. If we only permute the constants, we can't
get the multiples in the correct order and we will have to permute them
anyway:
yi = M00 * ri + M01 * gi + M02 * bi
ui = M11 * gi + M12 * bi + M10 * ri
vi = M22 * bi + M20 * ri + M21 * gi
(we have gbr and brg in the second and third columns instead of rgb).

In case that the number of the grouped statements is smaller than the
vector size (as in the rgb conversion), we need to unroll the loop, and
then such permutation will be done across several vectors and will be as
painful as the load permutation.
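
To show why the unrolled case crosses vector boundaries, the sketch below
lists which stream element each lane of the permuted data would need. It
assumes the permuted layout repeats each component GROUP_SIZE times per scalar
iteration ({r0,r0,r0,r1,...} for the r node); the function name and this
layout are illustrative assumptions, not the code in the patch.

```c
#include <stddef.h>

/* Hypothetical illustration, not the patch's code.  For one SLP load node
   that reads component COMPONENT (0 = r, 1 = g, 2 = b) of an interleaved
   stream with GROUP_SIZE components, fill OUT with the stream index needed
   at each of the GROUP_SIZE * UNROLL positions of the permuted data.  */
void
permuted_stream_indices (size_t group_size, size_t unroll,
                         size_t component, size_t *out)
{
  size_t pos;

  for (pos = 0; pos < group_size * unroll; pos++)
    {
      /* Scalar iteration that this position belongs to.  */
      size_t iter = pos / group_size;

      out[pos] = iter * group_size + component;
    }
}
```

With a group of 3, 4-element vectors and an unrolling factor of 4, the r node
needs stream indices 0 0 0 3 | 3 3 6 6 | 6 9 9 9. Index 6 sits in the second
input vector and index 9 in the third, so the second and third result vectors
each draw from two different input vectors.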

Thanks,
Ira

(See attached file: slp-perm-updated.txt)


[-- Attachment #2: slp-perm-updated.txt --]
[-- Type: text/plain, Size: 20077 bytes --]

Index: tree-vectorizer.c
===================================================================
--- tree-vectorizer.c	(revision 139265)
+++ tree-vectorizer.c	(working copy)
@@ -1802,7 +1802,8 @@ destroy_loop_vec_info (loop_vec_info loo
   VEC_free (ddr_p, heap, LOOP_VINFO_MAY_ALIAS_DDRS (loop_vinfo));
   slp_instances = LOOP_VINFO_SLP_INSTANCES (loop_vinfo);
   for (j = 0; VEC_iterate (slp_instance, slp_instances, j, instance); j++)
-    vect_free_slp_tree (SLP_INSTANCE_TREE (instance));
+    vect_free_slp_instance (instance);
+
   VEC_free (slp_instance, heap, LOOP_VINFO_SLP_INSTANCES (loop_vinfo));
   VEC_free (gimple, heap, LOOP_VINFO_STRIDED_STORES (loop_vinfo));
 
Index: tree-vectorizer.h
===================================================================
--- tree-vectorizer.h	(revision 139265)
+++ tree-vectorizer.h	(working copy)
@@ -105,6 +105,8 @@ typedef struct _slp_tree {
   } cost;
 } *slp_tree;
 
+DEF_VEC_P(slp_tree);
+DEF_VEC_ALLOC_P(slp_tree, heap);
 
 /* SLP instance is a sequence of stmts in a loop that can be packed into
    SIMD stmts.  */
@@ -124,6 +126,13 @@ typedef struct _slp_instance {
     int outside_of_loop;     /* Statements generated outside loop.  */
     int inside_of_loop;      /* Statements generated inside loop.  */
   } cost;
+
+  /* Loads permutation relatively to the stores, NULL if there is no 
+     permutation.  */
+  VEC (int, heap) *load_permutation;
+
+  /* The group of nodes that contain loads of this SLP instance.  */
+  VEC (slp_tree, heap) *loads;
 } *slp_instance;
 
 DEF_VEC_P(slp_instance);
@@ -135,6 +144,8 @@ DEF_VEC_ALLOC_P(slp_instance, heap);
 #define SLP_INSTANCE_UNROLLING_FACTOR(S)         (S)->unrolling_factor
 #define SLP_INSTANCE_OUTSIDE_OF_LOOP_COST(S)     (S)->cost.outside_of_loop
 #define SLP_INSTANCE_INSIDE_OF_LOOP_COST(S)      (S)->cost.inside_of_loop
+#define SLP_INSTANCE_LOAD_PERMUTATION(S)         (S)->load_permutation
+#define SLP_INSTANCE_LOADS(S)                    (S)->loads
 
 #define SLP_TREE_LEFT(S)                         (S)->left
 #define SLP_TREE_RIGHT(S)                        (S)->right
@@ -700,7 +711,7 @@ extern void free_stmt_vec_info (gimple s
 /** In tree-vect-analyze.c  **/
 /* Driver for analysis stage.  */
 extern loop_vec_info vect_analyze_loop (struct loop *);
-extern void vect_free_slp_tree (slp_tree);
+extern void vect_free_slp_instance (slp_instance);
 extern loop_vec_info vect_analyze_loop_form (struct loop *);
 
 /** In tree-vect-patterns.c  **/
@@ -714,7 +725,7 @@ void vect_pattern_recog (loop_vec_info);
 
 /** In tree-vect-transform.c  **/
 extern bool vectorizable_load (gimple, gimple_stmt_iterator *, gimple *,
-			       slp_tree);
+			       slp_tree, slp_instance);
 extern bool vectorizable_store (gimple, gimple_stmt_iterator *, gimple *,
 				slp_tree);
 extern bool vectorizable_operation (gimple, gimple_stmt_iterator *, gimple *,
@@ -740,6 +751,9 @@ extern void vect_model_simple_cost (stmt
 extern void vect_model_store_cost (stmt_vec_info, int, enum vect_def_type, 
 				   slp_tree);
 extern void vect_model_load_cost (stmt_vec_info, int, slp_tree);
+extern bool vect_transform_slp_perm_load (gimple, VEC (tree, heap) *, 
+                             gimple_stmt_iterator *, int, slp_instance, bool);
+
 /* Driver for transformation stage.  */
 extern void vect_transform_loop (loop_vec_info);
 
Index: tree-vect-analyze.c
===================================================================
--- tree-vect-analyze.c	(revision 139265)
+++ tree-vect-analyze.c	(working copy)
@@ -467,7 +467,7 @@ vect_analyze_operations (loop_vec_info l
 		|| vectorizable_conversion (stmt, NULL, NULL, NULL)
 		|| vectorizable_operation (stmt, NULL, NULL, NULL)
 		|| vectorizable_assignment (stmt, NULL, NULL, NULL)
-		|| vectorizable_load (stmt, NULL, NULL, NULL)
+		|| vectorizable_load (stmt, NULL, NULL, NULL, NULL)
 		|| vectorizable_call (stmt, NULL, NULL)
 		|| vectorizable_store (stmt, NULL, NULL, NULL)
 		|| vectorizable_condition (stmt, NULL, NULL)
@@ -827,6 +827,31 @@ vect_analyze_scalar_cycles (loop_vec_inf
 }
 
 
+/* Find the place of the data-ref in STMT in the interleaving chain that starts
+   from FIRST_STMT. Return -1 if the data-ref is not a part of the chain.  */
+
+static int 
+vect_get_place_in_interleaving_chain (gimple stmt, gimple first_stmt)
+{
+  gimple next_stmt = first_stmt;
+  int result = 0;
+
+  if (first_stmt != DR_GROUP_FIRST_DR (vinfo_for_stmt (stmt)))
+    return -1;
+
+  while (next_stmt && next_stmt != stmt)
+    {
+      result++;
+      next_stmt = DR_GROUP_NEXT_DR (vinfo_for_stmt (next_stmt));
+    }
+
+  if (next_stmt)
+    return result;
+  else
+    return -1;
+}
+
+
 /* Function vect_insert_into_interleaving_chain.
 
    Insert DRA into the interleaving chain of DRB according to DRA's INIT.  */
@@ -2463,7 +2488,7 @@ vect_prune_runtime_alias_test_list (loop
 
 /* Recursively free the memory allocated for the SLP tree rooted at NODE.  */
 
-void
+static void
 vect_free_slp_tree (slp_tree node)
 {
   if (!node)
@@ -2484,6 +2509,17 @@ vect_free_slp_tree (slp_tree node)
 }
 
 
+/* Free the memory allocated for the SLP instance.  */
+
+void
+vect_free_slp_instance (slp_instance instance)
+{
+  vect_free_slp_tree (SLP_INSTANCE_TREE (instance));
+  VEC_free (int, heap, SLP_INSTANCE_LOAD_PERMUTATION (instance));
+  VEC_free (slp_tree, heap, SLP_INSTANCE_LOADS (instance));
+}
+
+
 /* Get the defs for the rhs of STMT (collect them in DEF_STMTS0/1), check that
    they are of a legal type and that they match the defs of the first stmt of
    the SLP group (stored in FIRST_STMT_...).  */
@@ -2683,7 +2719,9 @@ static bool
 vect_build_slp_tree (loop_vec_info loop_vinfo, slp_tree *node, 
 		     unsigned int group_size, 
 		     int *inside_cost, int *outside_cost,
-		     int ncopies_for_cost, unsigned int *max_nunits)
+		     int ncopies_for_cost, unsigned int *max_nunits,
+                     VEC (int, heap) **load_permutation,
+                     VEC (slp_tree, heap) **loads)
 {
   VEC (gimple, heap) *def_stmts0 = VEC_alloc (gimple, heap, group_size);
   VEC (gimple, heap) *def_stmts1 =  VEC_alloc (gimple, heap, group_size);
@@ -2694,7 +2732,6 @@ vect_build_slp_tree (loop_vec_info loop_
   enum tree_code first_stmt_code = 0, rhs_code;
   tree first_stmt_def1_type = NULL_TREE, first_stmt_def0_type = NULL_TREE;
   tree lhs;
-  gimple prev_stmt = NULL;
   bool stop_recursion = false, need_same_oprnds = false;
   tree vectype, scalar_type, first_op1 = NULL_TREE;
   unsigned int vectorization_factor = 0, ncopies;
@@ -2705,6 +2742,9 @@ vect_build_slp_tree (loop_vec_info loop_
   tree first_stmt_const_oprnd = NULL_TREE;
   struct data_reference *first_dr;
   bool pattern0 = false, pattern1 = false;
+  bool permutation = false;
+  unsigned int load_place;
+  gimple first_load;
 
   /* For every stmt in NODE find its def stmt/s.  */
   for (i = 0; VEC_iterate (gimple, stmts, i, stmt); i++)
@@ -2852,77 +2892,62 @@ vect_build_slp_tree (loop_vec_info loop_
                                                 &pattern0, &pattern1))
 		return false;
 	    }
-	    else
-	      {
-		/* Load.  */
-		if (i == 0)
-		  {
-                    /* In case of multiple types we need to detect the smallest
-                       type.  */
-                    if (*max_nunits < TYPE_VECTOR_SUBPARTS (vectype))
-                       *max_nunits = TYPE_VECTOR_SUBPARTS (vectype);
-
-		    /* First stmt of the SLP group should be the first load of 
-		       the interleaving loop if data permutation is not allowed.
-		       Check that there is no gap between the loads.  */
-		    if (DR_GROUP_FIRST_DR (vinfo_for_stmt (stmt)) != stmt
-                        || DR_GROUP_GAP (vinfo_for_stmt (stmt)) != 0) 
-		      {
-			/* FORNOW: data permutations and gaps in loads are not 
-                           supported.  */
-			if (vect_print_dump_info (REPORT_SLP)) 
-			  {
-			    fprintf (vect_dump, "Build SLP failed: strided "
-				     " loads need permutation or have gaps ");
-			    print_gimple_stmt (vect_dump, stmt, 0, TDF_SLIM);
-			  }
-
-			return false;
-		      }
-
-		    first_dr = STMT_VINFO_DATA_REF (vinfo_for_stmt (stmt));
-		    if (vect_supportable_dr_alignment (first_dr)
-			== dr_unaligned_unsupported)
-		      {
-			if (vect_print_dump_info (REPORT_SLP)) 
-			  {
-			    fprintf (vect_dump, "Build SLP failed: unsupported "
-				     " unaligned load ");
-			    print_gimple_stmt (vect_dump, stmt, 0, TDF_SLIM);
-			  }
-
-			return false;
-		      }
-
-		    /* Analyze costs (for the first stmt in the group).  */
-		    vect_model_load_cost (vinfo_for_stmt (stmt), 
-					  ncopies_for_cost, *node);
-		  }
-		else
-		  {
-                    /* Check that we have consecutive loads from interleaving
-                       chain and that there is no gap between the loads.  */
-		    if (DR_GROUP_NEXT_DR (vinfo_for_stmt (prev_stmt)) != stmt
-                        || DR_GROUP_GAP (vinfo_for_stmt (stmt)) != 1)
-		      {
-			/* FORNOW: data permutations and gaps in loads are not
-                           supported.  */
-			if (vect_print_dump_info (REPORT_SLP)) 
-			  {
-			    fprintf (vect_dump, "Build SLP failed: strided "
-				     " loads need permutation or have gaps ");
-			    print_gimple_stmt (vect_dump, stmt, 0, TDF_SLIM);
-			  }
-			return false;
-		      }
-		  }
-
-		prev_stmt = stmt;
-
-		/* We stop the tree when we reach a group of loads.  */
-		stop_recursion = true;
-		continue;
-	      }
+          else
+	    {
+	      /* Load.  */
+              /* FORNOW: Check that there is no gap between the loads.  */
+              if ((DR_GROUP_FIRST_DR (vinfo_for_stmt (stmt)) == stmt
+                   && DR_GROUP_GAP (vinfo_for_stmt (stmt)) != 0)
+                  || (DR_GROUP_FIRST_DR (vinfo_for_stmt (stmt)) != stmt
+                      && DR_GROUP_GAP (vinfo_for_stmt (stmt)) != 1))
+                {
+                  if (vect_print_dump_info (REPORT_SLP))
+                    {
+                      fprintf (vect_dump, "Build SLP failed: strided "
+                                          "loads have gaps ");
+                      print_gimple_stmt (vect_dump, stmt, 0, TDF_SLIM);
+                    }
+
+                  return false;
+                }
+
+              first_load = DR_GROUP_FIRST_DR (vinfo_for_stmt (stmt));
+
+	      if (first_load == stmt)
+	        {
+		  first_dr = STMT_VINFO_DATA_REF (vinfo_for_stmt (stmt));
+		  if (vect_supportable_dr_alignment (first_dr)
+		      == dr_unaligned_unsupported)
+	            {
+		      if (vect_print_dump_info (REPORT_SLP)) 
+			{
+			  fprintf (vect_dump, "Build SLP failed: unsupported "
+			  	              "unaligned load ");
+			  print_gimple_stmt (vect_dump, stmt, 0, TDF_SLIM);
+		        }
+
+		      return false;
+		    }
+
+		  /* Analyze costs (for the first stmt in the group).  */
+		  vect_model_load_cost (vinfo_for_stmt (stmt), 
+				        ncopies_for_cost, *node);
+		}
+ 
+              /* Store the place of this load in the interleaving chain. In 
+                 case that permutation is needed we later decide if a specific
+                 permutation is supported.  */
+              load_place = vect_get_place_in_interleaving_chain (stmt,
+                                                                 first_load);
+              if (load_place != i)
+                permutation = true;
+
+              VEC_safe_push (int, heap, *load_permutation, load_place);
+
+	      /* We stop the tree when we reach a group of loads.  */
+	      stop_recursion = true;
+	      continue;
+	    }
 	} /* Strided access.  */
       else
 	{
@@ -2972,7 +2997,12 @@ vect_build_slp_tree (loop_vec_info loop_
 
   /* Strided loads were reached - stop the recursion.  */
   if (stop_recursion)
-    return true;
+    {
+      if (permutation)
+        VEC_safe_push (slp_tree, heap, *loads, *node); 
+
+      return true;
+    }
 
   /* Create SLP_TREE nodes for the definition node/s.  */ 
   if (first_stmt_dt0 == vect_loop_def)
@@ -2986,7 +3016,8 @@ vect_build_slp_tree (loop_vec_info loop_
       SLP_TREE_INSIDE_OF_LOOP_COST (left_node) = 0;
       if (!vect_build_slp_tree (loop_vinfo, &left_node, group_size, 
 				inside_cost, outside_cost,
-				ncopies_for_cost, max_nunits))
+				ncopies_for_cost, max_nunits, load_permutation,
+                                loads))
 	return false;
       
       SLP_TREE_LEFT (*node) = left_node;
@@ -3003,7 +3034,8 @@ vect_build_slp_tree (loop_vec_info loop_
       SLP_TREE_INSIDE_OF_LOOP_COST (right_node) = 0;
       if (!vect_build_slp_tree (loop_vinfo, &right_node, group_size,
 				inside_cost, outside_cost,
-				ncopies_for_cost, max_nunits))
+				ncopies_for_cost, max_nunits, load_permutation,
+                                loads))
 	return false;
       
       SLP_TREE_RIGHT (*node) = right_node;
@@ -3058,6 +3090,115 @@ vect_mark_slp_stmts (slp_tree node, enum
 }
 
 
+/* Check if the permutation required by the SLP INSTANCE is supported.  
+   Reorganize the SLP nodes stored in SLP_INSTANCE_LOADS if needed.  */
+
+static bool
+vect_supported_slp_permutation_p (slp_instance instance)
+{
+  slp_tree node = VEC_index (slp_tree, SLP_INSTANCE_LOADS (instance), 0);
+  gimple stmt = VEC_index (gimple, SLP_TREE_SCALAR_STMTS (node), 0);
+  gimple first_load = DR_GROUP_FIRST_DR (vinfo_for_stmt (stmt));
+  VEC (slp_tree, heap) *sorted_loads = NULL;
+  int index;
+  slp_tree *tmp_loads = NULL;
+  int group_size = SLP_INSTANCE_GROUP_SIZE (instance), i, j; 
+  slp_tree load;
+ 
+  /* FORNOW: The only supported loads permutation is loads from the same 
+     location in all the loads in the node, when the data-refs in
+     nodes of LOADS constitute an interleaving chain.  
+     Sort the nodes according to the order of accesses in the chain.  */
+  tmp_loads = (slp_tree *) xmalloc (sizeof (slp_tree) * group_size);
+  for (i = 0, j = 0; 
+       VEC_iterate (int, SLP_INSTANCE_LOAD_PERMUTATION (instance), i, index) 
+       && VEC_iterate (slp_tree, SLP_INSTANCE_LOADS (instance), j, load); 
+       i += group_size, j++)
+    {
+      gimple scalar_stmt = VEC_index (gimple, SLP_TREE_SCALAR_STMTS (load), 0);
+      /* Check that the loads are all in the same interleaving chain.  */
+      if (DR_GROUP_FIRST_DR (vinfo_for_stmt (scalar_stmt)) != first_load)
+        {
+          if (vect_print_dump_info (REPORT_DETAILS))
+            {
+              fprintf (vect_dump, "Build SLP failed: unsupported data "
+                                   "permutation ");
+              print_gimple_stmt (vect_dump, scalar_stmt, 0, TDF_SLIM);
+            }
+             
+          free (tmp_loads);
+          return false; 
+        }
+
+      tmp_loads[index] = load;
+    }
+  
+  sorted_loads = VEC_alloc (slp_tree, heap, group_size);
+  for (i = 0; i < group_size; i++)
+     VEC_safe_push (slp_tree, heap, sorted_loads, tmp_loads[i]);
+
+  VEC_free (slp_tree, heap, SLP_INSTANCE_LOADS (instance));
+  SLP_INSTANCE_LOADS (instance) = sorted_loads;
+  free (tmp_loads);
+
+  if (!vect_transform_slp_perm_load (stmt, NULL, NULL,
+                                     SLP_INSTANCE_UNROLLING_FACTOR (instance),
+                                     instance, true))
+    return false;
+
+  return true;
+}
+
+
+/* Check if the required load permutation is supported.
+   LOAD_PERMUTATION contains a list of indices of the loads.
+   In SLP this permutation is relative to the order of strided stores that are
+   the base of the SLP instance.  */
+static bool
+vect_supported_load_permutation_p (slp_instance slp_instn, int group_size,
+                                   VEC (int, heap) *load_permutation)
+{
+  int i = 0, j, prev = -1, next;
+  bool supported;
+
+  /* FORNOW: permutations are only supported for loop-aware SLP.  */
+  if (!slp_instn)
+    return false;
+
+  if (vect_print_dump_info (REPORT_SLP))
+    {
+      fprintf (vect_dump, "Load permutation ");
+      for (i = 0; VEC_iterate (int, load_permutation, i, next); i++)
+        fprintf (vect_dump, "%d ", next);
+    }
+
+  /* FORNOW: the only supported permutation is 0..01..1.. of length equal to 
+     GROUP_SIZE and where each sequence of same drs is of GROUP_SIZE length as 
+     well.  */
+  supported = true;
+  for (j = 0; j < group_size; j++)
+    {
+      for (i = j * group_size;
+           VEC_iterate (int, load_permutation, i, next)
+           && i < (j + 1) * group_size; i++)
+       {
+         if (i != j * group_size && next != prev)
+          {
+            supported = false;
+            break;
+          }
+
+         prev = next;
+       }  
+    }
+
+  if (supported && i == group_size * group_size 
+      && vect_supported_slp_permutation_p (slp_instn))
+    return true;
+
+  return false; 
+}
+
 /* Analyze an SLP instance starting from a group of strided stores. Call
    vect_build_slp_tree to build a tree of packed stmts if possible.  
    Return FALSE if it's impossible to SLP any stmt in the loop.  */
@@ -3075,7 +3216,9 @@ vect_analyze_slp_instance (loop_vec_info
   bool slp_impossible = false; 
   int inside_cost = 0, outside_cost = 0, ncopies_for_cost;
   unsigned int max_nunits = 0;
-
+  VEC (int, heap) *load_permutation;
+  VEC (slp_tree, heap) *loads;
+ 
   scalar_type = TREE_TYPE (DR_REF (STMT_VINFO_DATA_REF (vinfo_for_stmt (stmt))));
   vectype = get_vectype_for_scalar_type (scalar_type);
   if (!vectype)
@@ -3116,16 +3259,21 @@ vect_analyze_slp_instance (loop_vec_info
      factor (number of vectors is 1 if NUNITS >= GROUP_SIZE, and is
      GROUP_SIZE / NUNITS otherwise.  */
   ncopies_for_cost = unrolling_factor * group_size / nunits;
+  
+  load_permutation = VEC_alloc (int, heap, group_size * group_size); 
+  loads = VEC_alloc (slp_tree, heap, group_size); 
 
   /* Build the tree for the SLP instance.  */
   if (vect_build_slp_tree (loop_vinfo, &node, group_size, &inside_cost,  
-			   &outside_cost, ncopies_for_cost, &max_nunits))
+			   &outside_cost, ncopies_for_cost, &max_nunits,
+                           &load_permutation, &loads))
     {
       /* Create a new SLP instance.  */  
       new_instance = XNEW (struct _slp_instance);
       SLP_INSTANCE_TREE (new_instance) = node;
       SLP_INSTANCE_GROUP_SIZE (new_instance) = group_size;
-      /* Calculate the unrolling factor based on the smallest type.  */
+      /* Calculate the unrolling factor based on the smallest type in the
+         loop.  */
       if (max_nunits > nunits)
         unrolling_factor = least_common_multiple (max_nunits, group_size)
                            / group_size;
@@ -3133,6 +3281,27 @@ vect_analyze_slp_instance (loop_vec_info
       SLP_INSTANCE_UNROLLING_FACTOR (new_instance) = unrolling_factor;
       SLP_INSTANCE_OUTSIDE_OF_LOOP_COST (new_instance) = outside_cost;
       SLP_INSTANCE_INSIDE_OF_LOOP_COST (new_instance) = inside_cost;
+      SLP_INSTANCE_LOADS (new_instance) = loads;
+      SLP_INSTANCE_LOAD_PERMUTATION (new_instance) = load_permutation;
+      if (VEC_length (slp_tree, loads))
+        {
+          if (!vect_supported_load_permutation_p (new_instance, group_size,
+                                                  load_permutation)) 
+            {
+              if (vect_print_dump_info (REPORT_SLP))
+                {
+                  fprintf (vect_dump, "Build SLP failed: unsupported load "
+                                      "permutation ");
+                  print_gimple_stmt (vect_dump, stmt, 0, TDF_SLIM);
+                }
+
+              vect_free_slp_instance (new_instance);
+              return false;
+            }
+        }
+      else
+        VEC_free (int, heap, SLP_INSTANCE_LOAD_PERMUTATION (new_instance));
+
       VEC_safe_push (slp_instance, heap, LOOP_VINFO_SLP_INSTANCES (loop_vinfo), 
 		     new_instance);
       if (vect_print_dump_info (REPORT_SLP))
@@ -3144,7 +3313,9 @@ vect_analyze_slp_instance (loop_vec_info
   /* Failed to SLP.  */
   /* Free the allocated memory.  */
   vect_free_slp_tree (node);
-
+  VEC_free (int, heap, load_permutation);
+  VEC_free (slp_tree, heap, loads);
+   
   if (slp_impossible)
     return false;
 

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [patch][vectorizer, SPU, PPC] Support load permutation in loop-aware  SLP
  2008-07-06  9:24 Ira Rosen
  2008-07-06 22:14 ` Andrew Pinski
  2008-07-07  0:45 ` David Edelsohn
@ 2008-08-08 15:11 ` Dorit Nuzman
  2 siblings, 0 replies; 8+ messages in thread
From: Dorit Nuzman @ 2008-08-08 15:11 UTC (permalink / raw)
  To: Ira Rosen; +Cc: gcc-patches

> Hi,

Hi Ira,

> The current loop-aware SLP scheme starts from a group of adjacent stores and
> follows use-def chains until it reaches a group of loads. The loads must be
> adjacent and their order must match the order of the stores, i.e., no
> permutations are currently allowed.

> This patch adds support for a specific type of load permutation, along with
> general support for load permutations in SLP. It aims to vectorize RGB

I have a problem with the fact that this specific permutation is so
hard-coded into the analysis. It's ok to support only one permutation as a
start, but the analysis itself should be general. Hopefully this could be
rewritten to identify more general patterns during the analysis, represent
the identified permutation somehow (e.g. [3,2,1,0]), and then decide if we
can proceed to vectorize it or not.
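
One way to read this suggestion is to keep the identified permutation as a
plain index vector and let a separate predicate decide whether its shape is
supported. The following is a minimal, stand-alone sketch of that split, not
the vectorizer code: the function name perm_is_grouped_runs is hypothetical,
and the accepted shape is assumed to be the same-location pattern discussed
in this thread (GROUP_SIZE runs of GROUP_SIZE identical indices, such as
0..01..1..).

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: return true if PERM (LEN entries) consists of
   GROUP_SIZE consecutive runs, each made of GROUP_SIZE copies of one
   index in the range [0, GROUP_SIZE).  */
bool
perm_is_grouped_runs (const int *perm, size_t len, size_t group_size)
{
  if (group_size == 0 || len != group_size * group_size)
    return false;

  for (size_t j = 0; j < group_size; j++)
    {
      int first = perm[j * group_size];

      /* Every index must name one of the loads in the group.  */
      if (first < 0 || (size_t) first >= group_size)
        return false;

      /* All entries of run J must repeat the first entry of the run.  */
      for (size_t k = 1; k < group_size; k++)
        if (perm[j * group_size + k] != first)
          return false;
    }

  return true;
}
```

With this split, a permutation such as [3,2,1,0] is still recorded during
the analysis and is simply rejected by the predicate until it is supported.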

> to YUV conversion, which can be viewed as {y, u, v} = M * {r, g, b}, where M
> is a matrix of constant coefficients, and the calculation is performed in a
> single-nested loop:
>   for i
>     yi = M00 * ri +  M01 * gi + M02 * bi
>     ui = M10 * ri +  M11 * gi + M12 * bi
>     vi = M20 * ri +  M21 * gi + M22 * bi
> The required permutation of loads is to transform rgb stream into {r,r,r},
> {g,g,g} and {b,b,b} vectors (ignoring vector size for simplicity).

> The SLP analysis detects such cases: all the loads in the same SLP node
> must access the same memory location, and all the SLP nodes that contain
> loads must form a group of adjacent memory accesses. The transformation
> phase generates vector permutations of the input vectors with compiler
> generated masks, depending on the data type, vectorization factor and size
> of SLP nodes.

I also have a problem with the transformation: it assumes a very specific
form of permute at the gimple level - a permute that takes two vectors as
input and a byte mask. I don't think this is a general enough
representation (I don't think that the SSE shuffles take a byte mask for
example?). We need to think of a more general way to represent a permute at
this level, and maybe have a target specific builtin expand it using byte
mask when appropriate.

Actually, I think the particular testcase you are targeting could be
vectorized by preparing an appropriate vector of constants instead of
working so hard on permuting the loads. Maybe we can try something like
that for now (and potentially defer the decision on a representation of
permute to a separate patch (and testcase)?)

thanks,
dorit

> Bootstrapped with vectorization enabled on ppc-linux and tested on Cell SPU
> and ppc-linux.
> O.K. for mainline?

> Thanks,
> Ira

> ChangeLog:

> * target.h (struct vectorize): Add new target builtin.
> * tree-vectorizer.h (enum slp_load_perm_type): New.
> (struct _slp_tree): Add new field loads_perm_type.
> (struct _slp_instance): Add new field same_perm_nodes.
> (SLP_INSTANCE_SAME_PERM_NODES): New.
> (SLP_TREE_LOADS_PERM_TYPE, TARG_VEC_PERMUTE_COST): New.
> (vectorizable_load): Add argument.
> (vect_transform_slp_perm_load): New.
> * tree-vect-analyze.c (vect_analyze_operations): Add an argument to
> vectorizable_load.
> (vect_build_slp_tree): Add new argument. Allow load permutations for
> the case when all the loads in the same SLP node access the same
> memory location.
> (vect_analyze_slp_instance): In case of same location loads check
> that the loads from different nodes form an interleaving chain. Sort
> the nodes according to the chain.
> * target-def.h (TARGET_VECTORIZE_BUILTIN_VEC_PERM): New.
> * tree-vect-transform.c (vect_transform_stmt): Add new argument.
> (vectorizable_store): Allow number of created vectors to be greater
> than the size of an interleaving group. Don't go along the
> interleaving chain for SLP.
> (vect_create_mask_and_perm): New function.
> (vect_get_mask_element, vect_transform_slp_perm_load): Likewise.
> (vectorizable_load): Allocate DR_CHAIN according to the number of
> generated vectors. Don't keep the created vector statements in the
> node if permutation is required. Call vect_transform_slp_perm_load to
> generate the permutation.
> (vect_transform_stmt): Add new argument. Call vectorizable_load with
> additional argument. Don't wait for other stores in case of SLP.
> (vect_schedule_slp_instance): Add new argument. Calculate the number
> of vector statements. In case of loads from the same location,
> allocate vectorized statements structure for all the related SLP
> nodes. Call vect_transform_stmt with additional argument.
> (vect_schedule_slp): Remove one argument. Move number of vector
> statements calculation to vect_schedule_slp_instance.
> (vect_transform_loop): Call vect_transform_stmt and vect_schedule_slp
> with correct arguments.
> * config/spu/spu.c (spu_builtin_vec_perm): New.
> (TARGET_VECTORIZE_BUILTIN_VEC_PERM): Redefine.
> * config/spu/spu.h (TARG_VEC_PERMUTE_COST): Define.
> * config/rs6000/rs6000.c (rs6000_builtin_vec_perm): New.
> (TARGET_VECTORIZE_BUILTIN_VEC_PERM): Redefine.
>
> testsuite/ChangeLog:

> * lib/target-supports.exp (check_effective_target_vect_perm): New.
> * gcc.dg/vect/slp-perm-1.c: New testcase.
> * gcc.dg/vect/slp-perm-2.c: Likewise.
> * gcc.dg/vect/slp-perm-3.c: Likewise.
> * gcc.dg/vect/slp-perm-4.c: Likewise.
> * gcc.dg/vect/slp-perm-5.c: Likewise.
> * gcc.dg/vect/slp-perm-6.c: Likewise.
> * gcc.dg/vect/slp-perm-7.c: Likewise.
> * gcc.dg/vect/slp-perm-8.c: Likewise.
> * gcc.dg/vect/slp-perm-9.c: Likewise.
>
> (See attached file: slp-perm.txt)(See attached file: tests.txt)

>
> [attachment "tests.txt" deleted by Dorit Nuzman/Haifa/IBM]
>
> #### slp-perm.txt has been deleted (was saved in repository
> MyAttachments Repository ->) from this note on 07 July 2008 by Dorit Nuzman

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [patch][vectorizer, SPU, PPC] Support load permutation in loop-aware SLP
  2008-07-06  9:24 Ira Rosen
  2008-07-06 22:14 ` Andrew Pinski
@ 2008-07-07  0:45 ` David Edelsohn
  2008-08-08 15:11 ` Dorit Nuzman
  2 siblings, 0 replies; 8+ messages in thread
From: David Edelsohn @ 2008-07-07  0:45 UTC (permalink / raw)
  To: Ira Rosen; +Cc: gcc-patches

      * config/rs6000/rs6000.c (rs6000_builtin_vec_perm): New.
      (TARGET_VECTORIZE_BUILTIN_VEC_PERM): Redefine.

The rs6000 part of the patch is okay.

Thanks, David

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [patch][vectorizer, SPU, PPC] Support load permutation in loop-aware SLP
  2008-07-06  9:24 Ira Rosen
@ 2008-07-06 22:14 ` Andrew Pinski
  2008-07-07  0:45 ` David Edelsohn
  2008-08-08 15:11 ` Dorit Nuzman
  2 siblings, 0 replies; 8+ messages in thread
From: Andrew Pinski @ 2008-07-06 22:14 UTC (permalink / raw)
  To: Ira Rosen; +Cc: gcc-patches

On Sun, Jul 6, 2008 at 2:06 AM, Ira Rosen <IRAR@il.ibm.com> wrote:
> Bootstrapped with vectorization enabled on ppc-linux and tested on Cell SPU
> and ppc-linux.
> O.K. for mainline?
>      * config/spu/spu.c (spu_builtin_vec_perm): New.
>      (TARGET_VECTORIZE_BUILTIN_VEC_PERM): Redefine.
>      * config/spu/spu.h (TARG_VEC_PERMUTE_COST): Define.

The SPU part is ok if the rest of the patch is approved.

Thanks,
Andrew Thomas Pinski

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [patch][vectorizer, SPU, PPC] Support load permutation in loop-aware SLP
@ 2008-07-06  9:24 Ira Rosen
  2008-07-06 22:14 ` Andrew Pinski
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Ira Rosen @ 2008-07-06  9:24 UTC (permalink / raw)
  To: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 4420 bytes --]


Hi,

The current loop-aware SLP scheme starts from a group of adjacent stores and
follows use-def chains until it reaches a group of loads. The loads must be
adjacent and their order must match the order of the stores, i.e., no
permutations are currently allowed.

This patch adds support for a specific type of load permutation, along with
general support for load permutations in SLP. It aims to vectorize RGB to
YUV conversion, which can be viewed as {y, u, v} = M * {r, g, b}, where M is
a matrix of constant coefficients, and the calculation is performed in a
single-nested loop:
  for i
    yi = M00 * ri +  M01 * gi + M02 * bi
    ui = M10 * ri +  M11 * gi + M12 * bi
    vi = M20 * ri +  M21 * gi + M22 * bi
The required permutation of loads is to transform rgb stream into {r,r,r},
{g,g,g} and {b,b,b} vectors (ignoring vector size for simplicity).
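
For concreteness, the loop above can be written as the following scalar C
function. This is only an illustrative sketch: the name rgb_to_yuv, the
integer element type and the interleaved rgb/yuv array layout are
assumptions made for the example, not taken from the testcases.

```c
#include <stddef.h>

/* Hypothetical scalar kernel: M is the 3x3 matrix of constant
   coefficients, RGB and YUV are interleaved streams of N pixels.  */
void
rgb_to_yuv (const int m[3][3], const int *rgb, int *yuv, size_t n)
{
  for (size_t i = 0; i < n; i++)
    {
      /* The three adjacent stores form the SLP group.  The first operand
         of each statement loads the same location (r), the second g and
         the third b, which gives the {r,r,r}, {g,g,g}, {b,b,b} nodes.  */
      yuv[3 * i]     = m[0][0] * rgb[3 * i] + m[0][1] * rgb[3 * i + 1]
                       + m[0][2] * rgb[3 * i + 2];
      yuv[3 * i + 1] = m[1][0] * rgb[3 * i] + m[1][1] * rgb[3 * i + 1]
                       + m[1][2] * rgb[3 * i + 2];
      yuv[3 * i + 2] = m[2][0] * rgb[3 * i] + m[2][1] * rgb[3 * i + 1]
                       + m[2][2] * rgb[3 * i + 2];
    }
}
```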

The SLP analysis detects such cases: all the loads in the same SLP node
must access the same memory location, and all the SLP nodes that contain
loads must form a group of adjacent memory accesses. The transformation
phase generates vector permutations of the input vectors with compiler
generated masks, depending on the data type, vectorization factor and size
of SLP nodes.
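
As a rough illustration of what such a mask has to encode, the sketch below
computes, for one load node, which element of the interleaved input stream
each lane of the permuted stream selects. It is a simplified model under
stated assumptions, not the code in the patch: the helper names are
hypothetical, indices are counted in elements rather than bytes, and the
split of the result into two-input-vector masks is left out.

```c
#include <stddef.h>

/* Hypothetical helper: element index, in the interleaved input stream,
   selected by position POS of the permuted stream of load node NODE.
   Each scalar iteration contributes GROUP_SIZE copies of one element.  */
static size_t
perm_source_index (size_t group_size, size_t node, size_t pos)
{
  return group_size * (pos / group_size) + node;
}

/* Fill OUT[0..COUNT-1] with the source indices for load node NODE.  */
void
fill_perm_indices (size_t group_size, size_t node, size_t count, size_t *out)
{
  for (size_t pos = 0; pos < count; pos++)
    out[pos] = perm_source_index (group_size, node, pos);
}
```

For a group of three (r, g, b) and four elements per vector, the r node
would select elements 0,0,0,3 | 3,3,6,6 | 6,9,9,9 of the rgb stream across
three vectors.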

Bootstrapped with vectorization enabled on ppc-linux and tested on Cell SPU
and ppc-linux.
O.K. for mainline?

Thanks,
Ira

ChangeLog:

      * target.h (struct vectorize): Add new target builtin.
      * tree-vectorizer.h (enum slp_load_perm_type): New.
      (struct _slp_tree): Add new field loads_perm_type.
      (struct _slp_instance): Add new field same_perm_nodes.
      (SLP_INSTANCE_SAME_PERM_NODES): New.
      (SLP_TREE_LOADS_PERM_TYPE, TARG_VEC_PERMUTE_COST): New.
      (vectorizable_load): Add argument.
      (vect_transform_slp_perm_load): New.
      * tree-vect-analyze.c (vect_analyze_operations): Add an argument to
      vectorizable_load.
      (vect_build_slp_tree): Add new argument. Allow load permutations for
      the case when all the loads in the same SLP node access the same
      memory location.
      (vect_analyze_slp_instance): In case of same location loads check
      that the loads from different nodes form an interleaving chain. Sort
      the nodes according to the chain.
      * target-def.h (TARGET_VECTORIZE_BUILTIN_VEC_PERM): New.
      * tree-vect-transform.c (vect_transform_stmt): Add new argument.
      (vectorizable_store): Allow number of created vectors to be greater
      than the size of an interleaving group. Don't go along the
      interleaving chain for SLP.
      (vect_create_mask_and_perm): New function.
      (vect_get_mask_element, vect_transform_slp_perm_load): Likewise.
      (vectorizable_load): Allocate DR_CHAIN according to the number of
      generated vectors. Don't keep the created vector statements in the
      node if permutation is required. Call vect_transform_slp_perm_load to
      generate the permutation.
      (vect_transform_stmt): Add new argument. Call vectorizable_load with
      additional argument. Don't wait for other stores in case of SLP.
      (vect_schedule_slp_instance): Add new argument. Calculate the number
      of vector statements. In case of loads from the same location,
      allocate vectorized statements structure for all the related SLP
      nodes. Call vect_transform_stmt with additional argument.
      (vect_schedule_slp): Remove one argument. Move number of vector
      statements calculation to vect_schedule_slp_instance.
      (vect_transform_loop): Call vect_transform_stmt and vect_schedule_slp
      with correct arguments.
      * config/spu/spu.c (spu_builtin_vec_perm): New.
      (TARGET_VECTORIZE_BUILTIN_VEC_PERM): Redefine.
      * config/spu/spu.h (TARG_VEC_PERMUTE_COST): Define.
      * config/rs6000/rs6000.c (rs6000_builtin_vec_perm): New.
      (TARGET_VECTORIZE_BUILTIN_VEC_PERM): Redefine.

testsuite/ChangeLog:

      * lib/target-supports.exp (check_effective_target_vect_perm): New.
      * gcc.dg/vect/slp-perm-1.c: New testcase.
      * gcc.dg/vect/slp-perm-2.c: Likewise.
      * gcc.dg/vect/slp-perm-3.c: Likewise.
      * gcc.dg/vect/slp-perm-4.c: Likewise.
      * gcc.dg/vect/slp-perm-5.c: Likewise.
      * gcc.dg/vect/slp-perm-6.c: Likewise.
      * gcc.dg/vect/slp-perm-7.c: Likewise.
      * gcc.dg/vect/slp-perm-8.c: Likewise.
      * gcc.dg/vect/slp-perm-9.c: Likewise.

(See attached file: slp-perm.txt)(See attached file: tests.txt)




[-- Attachment #2: slp-perm.txt --]
[-- Type: text/plain, Size: 48610 bytes --]

Index: target.h
===================================================================
--- target.h	(revision 137312)
+++ target.h	(working copy)
@@ -438,6 +438,9 @@ struct gcc_target
     /* Return true if vector alignment is reachable (by peeling N
        iterations) for the given type.  */
     bool (* vector_alignment_reachable) (const_tree, bool);
+
+    /* Target builtin that implements vector permute.  */
+    tree (* builtin_vec_perm) (tree, tree*);
   } vectorize;
 
   /* The initial value of target_flags.  */
Index: tree-vectorizer.h
===================================================================
--- tree-vectorizer.h	(revision 137312)
+++ tree-vectorizer.h	(working copy)
@@ -79,6 +79,12 @@ enum verbosity_levels {
 /************************************************************************
   SLP
  ************************************************************************/
+/* Define possible load permutation types: interleaved data or loads from the
+   same location.  */
+enum slp_load_perm_type {
+  interleaved,
+  same
+};
 
 /* A computation tree of an SLP instance. Each node corresponds to a group of
    stmts to be packed in a SIMD stmt.  */
@@ -103,8 +109,12 @@ typedef struct _slp_tree {
     int outside_of_loop;     /* Statements generated outside loop.  */
     int inside_of_loop;      /* Statements generated inside loop.  */
   } cost;
+  /* The type of load permutation.  */
+  enum slp_load_perm_type loads_perm_type;
 } *slp_tree;
 
+DEF_VEC_P(slp_tree);
+DEF_VEC_ALLOC_P(slp_tree, heap);
 
 /* SLP instance is a sequence of stmts in a loop that can be packed into
    SIMD stmts.  */
@@ -124,6 +134,9 @@ typedef struct _slp_instance {
     int outside_of_loop;     /* Statements generated outside loop.  */
     int inside_of_loop;      /* Statements generated inside loop.  */
   } cost;
+  
+  /* The list of nodes with loads from the same location in the node.  */
+  VEC (slp_tree, heap) *same_perm_nodes;
 } *slp_instance;
 
 DEF_VEC_P(slp_instance);
@@ -135,6 +148,7 @@ DEF_VEC_ALLOC_P(slp_instance, heap);
 #define SLP_INSTANCE_UNROLLING_FACTOR(S)         (S)->unrolling_factor
 #define SLP_INSTANCE_OUTSIDE_OF_LOOP_COST(S)     (S)->cost.outside_of_loop
 #define SLP_INSTANCE_INSIDE_OF_LOOP_COST(S)      (S)->cost.inside_of_loop
+#define SLP_INSTANCE_SAME_PERM_NODES(S)          (S)->same_perm_nodes
 
 #define SLP_TREE_LEFT(S)                         (S)->left
 #define SLP_TREE_RIGHT(S)                        (S)->right
@@ -143,6 +157,8 @@ DEF_VEC_ALLOC_P(slp_instance, heap);
 #define SLP_TREE_NUMBER_OF_VEC_STMTS(S)          (S)->vec_stmts_size
 #define SLP_TREE_OUTSIDE_OF_LOOP_COST(S)         (S)->cost.outside_of_loop
 #define SLP_TREE_INSIDE_OF_LOOP_COST(S)          (S)->cost.inside_of_loop
+#define SLP_TREE_LOADS_PERM_TYPE(S)              (S)->loads_perm_type
+
 
 /*-----------------------------------------------------------------*/
 /* Info on vectorized loops.                                       */
@@ -522,6 +538,11 @@ typedef struct _stmt_vec_info {
 #define TARG_VEC_STORE_COST          1
 #endif
 
+/* Cost of vector permutation.  */
+#ifndef TARG_VEC_PERMUTE_COST
+#define TARG_VEC_PERMUTE_COST          1
+#endif
+
 static inline void set_stmt_info (stmt_ann_t ann, stmt_vec_info stmt_info);
 static inline stmt_vec_info vinfo_for_stmt (tree stmt);
 
@@ -679,7 +700,8 @@ void vect_pattern_recog (loop_vec_info);
 
 
 /** In tree-vect-transform.c  **/
-extern bool vectorizable_load (tree, block_stmt_iterator *, tree *, slp_tree);
+extern bool vectorizable_load (tree, block_stmt_iterator *, tree *, slp_tree,
+                               slp_instance);
 extern bool vectorizable_store (tree, block_stmt_iterator *, tree *, slp_tree);
 extern bool vectorizable_operation (tree, block_stmt_iterator *, tree *, 
 				    slp_tree);
@@ -701,6 +723,9 @@ extern void vect_model_simple_cost (stmt
 extern void vect_model_store_cost (stmt_vec_info, int, enum vect_def_type, 
 				   slp_tree);
 extern void vect_model_load_cost (stmt_vec_info, int, slp_tree);
+extern bool vect_transform_slp_perm_load (tree, VEC(tree,heap) *, 
+                                          block_stmt_iterator *, int,
+                                          slp_instance, bool);
 /* Driver for transformation stage.  */
 extern void vect_transform_loop (loop_vec_info);
 
Index: tree-vect-analyze.c
===================================================================
--- tree-vect-analyze.c	(revision 137312)
+++ tree-vect-analyze.c	(working copy)
@@ -468,7 +468,7 @@ vect_analyze_operations (loop_vec_info l
 		|| vectorizable_conversion (stmt, NULL, NULL, NULL)
 		|| vectorizable_operation (stmt, NULL, NULL, NULL)
 		|| vectorizable_assignment (stmt, NULL, NULL, NULL)
-		|| vectorizable_load (stmt, NULL, NULL, NULL)
+		|| vectorizable_load (stmt, NULL, NULL, NULL, NULL)
 		|| vectorizable_call (stmt, NULL, NULL)
 		|| vectorizable_store (stmt, NULL, NULL, NULL)
 		|| vectorizable_condition (stmt, NULL, NULL)
@@ -1596,6 +1596,7 @@ vect_verify_datarefs_alignment (loop_vec
           && vect_print_dump_info (REPORT_ALIGNMENT))
         fprintf (vect_dump, "Vectorizing an unaligned access.");
     }
+
   return true;
 }
 
@@ -2632,7 +2633,8 @@ static bool
 vect_build_slp_tree (loop_vec_info loop_vinfo, slp_tree *node, 
 		     unsigned int group_size, bool *slp_impossible,
 		     int *inside_cost, int *outside_cost,
-		     int ncopies_for_cost)
+		     int ncopies_for_cost,
+                     VEC (slp_tree, heap) **same_perm_nodes)
 {
   VEC (tree, heap) *def_stmts0 = VEC_alloc (tree, heap, group_size);
   VEC (tree, heap) *def_stmts1 =  VEC_alloc (tree, heap, group_size);
@@ -2651,7 +2653,8 @@ vect_build_slp_tree (loop_vec_info loop_
   enum machine_mode optab_op2_mode;
   enum machine_mode vec_mode;
   tree first_stmt_const_oprnd = NULL_TREE;
-  struct data_reference *first_dr;
+  struct data_reference *first_dr = NULL;
+  bool not_strided_loads = false;
 
   /* For every stmt in NODE find its def stmt/s.  */
   for (i = 0; VEC_iterate (tree, stmts, i, stmt); i++)
@@ -2798,23 +2801,11 @@ vect_build_slp_tree (loop_vec_info loop_
 		/* Load.  */
 		if (i == 0)
 		  {
-		    /* First stmt of the SLP group should be the first load of 
-		       the interleaving loop if data permutation is not allowed.
-		       Check that there is no gap between the loads.  */
-		    if (DR_GROUP_FIRST_DR (vinfo_for_stmt (stmt)) != stmt
-                        || DR_GROUP_GAP (vinfo_for_stmt (stmt)) != 0) 
-		      {
-			/* FORNOW: data permutations and gaps in loads are not 
-                           supported.  */
-			if (vect_print_dump_info (REPORT_SLP)) 
-			  {
-			    fprintf (vect_dump, "Build SLP failed: strided "
-				     " loads need permutation or have gaps ");
-			    print_generic_expr (vect_dump, stmt, TDF_SLIM);
-			  }
-
-			return false;
-		      }
+                    /* First stmt of the SLP group should be the first load of
+                       the interleaving loop, otherwise we check later if this 
+                       is a supported permutation.  */
+                    if  (DR_GROUP_FIRST_DR (vinfo_for_stmt (stmt)) != stmt)
+                       not_strided_loads = true;
 
 		    first_dr = STMT_VINFO_DATA_REF (vinfo_for_stmt (stmt));
 		    if (vect_supportable_dr_alignment (first_dr)
@@ -2823,7 +2814,7 @@ vect_build_slp_tree (loop_vec_info loop_
 			if (vect_print_dump_info (REPORT_SLP)) 
 			  {
 			    fprintf (vect_dump, "Build SLP failed: unsupported "
-				     " unaligned load ");
+		  		                " unaligned load ");
 			    print_generic_expr (vect_dump, stmt, TDF_SLIM);
 			  }
 
@@ -2839,18 +2830,44 @@ vect_build_slp_tree (loop_vec_info loop_
                     /* Check that we have consecutive loads from interleaving
                        chain and that there is no gap between the loads.  */
 		    if (DR_GROUP_NEXT_DR (vinfo_for_stmt (prev_stmt)) != stmt
-                        || DR_GROUP_GAP (vinfo_for_stmt (stmt)) != 1)
+                        || not_strided_loads)
+                      {
+                        /* The only supported load permutation is same
+                           data-refs in all the loads in the node.  */
+                        if (operand_equal_p (DR_REF (first_dr),
+                                             DR_REF (STMT_VINFO_DATA_REF (
+                                                   vinfo_for_stmt (stmt))), 0))
+                           not_strided_loads = true;
+                        else
+                          {
+                            /* FORNOW: other data permutations are not
+                               supported.  */
+                            if (vect_print_dump_info (REPORT_DETAILS))
+                              {
+                                fprintf (vect_dump, "Build SLP failed: strided "
+                                                    " loads need permutation ");
+                                print_generic_expr (vect_dump, stmt, TDF_SLIM);
+                              }
+
+                             return false;
+                          }
+                      }
+                  }
+                  
+                if ((DR_GROUP_FIRST_DR (vinfo_for_stmt (stmt)) == stmt 
+                     && DR_GROUP_GAP (vinfo_for_stmt (stmt)) != 0)
+                    || (DR_GROUP_FIRST_DR (vinfo_for_stmt (stmt)) != stmt
+                     && DR_GROUP_GAP (vinfo_for_stmt (stmt)) != 1))
+	          {
+		    /* FORNOW: gaps in loads are not supported.  */
+		    if (vect_print_dump_info (REPORT_SLP)) 
 		      {
-			/* FORNOW: data permutations and gaps in loads are not
-                           supported.  */
-			if (vect_print_dump_info (REPORT_SLP)) 
-			  {
-			    fprintf (vect_dump, "Build SLP failed: strided "
-				     " loads need permutation or have gaps ");
-			    print_generic_expr (vect_dump, stmt, TDF_SLIM);
-			  }
-			return false;
+		        fprintf (vect_dump, "Build SLP failed: strided "
+			                    "loads have gaps ");
+  		        print_generic_expr (vect_dump, stmt, TDF_SLIM);
 		      }
+
+		    return false;
 		  }
 
 		prev_stmt = stmt;
@@ -2906,7 +2923,28 @@ vect_build_slp_tree (loop_vec_info loop_
 
   /* Strided loads were reached - stop the recursion.  */
   if (stop_recursion)
-    return true;
+    {
+      if (not_strided_loads)
+        {
+          /* Not strided loads are supported only for same data-ref in loads
+             of the same node.  */
+	  SLP_TREE_LOADS_PERM_TYPE (*node) = same;
+          if (!*same_perm_nodes)
+            *same_perm_nodes = VEC_alloc (slp_tree, heap, group_size);
+
+          VEC_safe_push (slp_tree, heap, *same_perm_nodes, *node); 
+
+          *inside_cost += TARG_VEC_PERMUTE_COST * group_size;  
+
+          if (vect_print_dump_info (REPORT_DETAILS))
+            {
+              fprintf (vect_dump, "Build SLP: detected not strided loads ");
+              print_generic_expr (vect_dump, stmt, TDF_SLIM);
+            }
+        }
+
+      return true;
+    }
 
   /* Create SLP_TREE nodes for the definition node/s.  */ 
   if (first_stmt_dt0 == vect_loop_def)
@@ -2918,9 +2956,11 @@ vect_build_slp_tree (loop_vec_info loop_
       SLP_TREE_RIGHT (left_node) = NULL;
       SLP_TREE_OUTSIDE_OF_LOOP_COST (left_node) = 0;
       SLP_TREE_INSIDE_OF_LOOP_COST (left_node) = 0;
+      SLP_TREE_LOADS_PERM_TYPE (left_node) = interleaved;
       if (!vect_build_slp_tree (loop_vinfo, &left_node, group_size, 
-				slp_impossible, inside_cost, outside_cost,
-				ncopies_for_cost))
+                                slp_impossible, 
+				inside_cost, outside_cost, ncopies_for_cost, 
+				same_perm_nodes))
 	return false;
       
       SLP_TREE_LEFT (*node) = left_node;
@@ -2935,9 +2975,11 @@ vect_build_slp_tree (loop_vec_info loop_
       SLP_TREE_RIGHT (right_node) = NULL;
       SLP_TREE_OUTSIDE_OF_LOOP_COST (right_node) = 0;
       SLP_TREE_INSIDE_OF_LOOP_COST (right_node) = 0;
-      if (!vect_build_slp_tree (loop_vinfo, &right_node, group_size,
-				slp_impossible, inside_cost, outside_cost,
-				ncopies_for_cost))
+      SLP_TREE_LOADS_PERM_TYPE (right_node) = interleaved;
+      if (!vect_build_slp_tree (loop_vinfo, &right_node, group_size, 
+                                slp_impossible,
+				inside_cost, outside_cost, ncopies_for_cost,
+				same_perm_nodes))
 	return false;
       
       SLP_TREE_RIGHT (*node) = right_node;
@@ -2993,20 +3035,26 @@ vect_mark_slp_stmts (slp_tree node, enum
 
 
 /* Analyze an SLP instance starting from a group of strided stores. Call
-   vect_build_slp_tree to build a tree of packed stmts if possible. 
+   vect_build_slp_tree to build a tree of packed stmts if possible.  
    Return FALSE if it's impossible to SLP any stmt in the loop.  */
 
-static bool
+static bool 
 vect_analyze_slp_instance (loop_vec_info loop_vinfo, tree stmt)
 {
   slp_instance new_instance;
   slp_tree node = XNEW (struct _slp_tree);
   unsigned int group_size = DR_GROUP_SIZE (vinfo_for_stmt (stmt));
-  unsigned int unrolling_factor = 1, nunits;
+  unsigned int unrolling_factor = 1, nunits, i;
   tree vectype, scalar_type, next;
   unsigned int vectorization_factor = 0, ncopies;
   bool slp_impossible = false; 
   int inside_cost = 0, outside_cost = 0, ncopies_for_cost;
+  VEC (slp_tree, heap) *same_perm_nodes = NULL;
+  slp_tree first_node, next_node;
+  tree first_load = NULL_TREE, next_load, load;
+  VEC (slp_tree, heap) *sorted_same_perm_nodes = NULL;
+  unsigned int index;
+  slp_tree *temp_nodes = NULL;
 
   /* FORNOW: multiple types are not supported.  */
   scalar_type = TREE_TYPE (DR_REF (STMT_VINFO_DATA_REF (vinfo_for_stmt (stmt))));
@@ -3018,6 +3066,7 @@ vect_analyze_slp_instance (loop_vec_info
           fprintf (vect_dump, "Build SLP failed: unsupported data-type ");
           print_generic_expr (vect_dump, scalar_type, TDF_SLIM);
         }
+
       return false;
     }
 
@@ -3048,6 +3097,7 @@ vect_analyze_slp_instance (loop_vec_info
   SLP_TREE_RIGHT (node) = NULL;
   SLP_TREE_OUTSIDE_OF_LOOP_COST (node) = 0;
   SLP_TREE_INSIDE_OF_LOOP_COST (node) = 0;
+  SLP_TREE_LOADS_PERM_TYPE (node) = interleaved;
 
   /* Calculate the unrolling factor.  */
   unrolling_factor = least_common_multiple (nunits, group_size) / group_size;
@@ -3058,21 +3108,101 @@ vect_analyze_slp_instance (loop_vec_info
   ncopies_for_cost = unrolling_factor * group_size / nunits;
 
   /* Build the tree for the SLP instance.  */
-  if (vect_build_slp_tree (loop_vinfo, &node, group_size, &slp_impossible,
-			   &inside_cost, &outside_cost, ncopies_for_cost))
-    {
+  if (vect_build_slp_tree (loop_vinfo, &node, group_size, &slp_impossible, 
+                           &inside_cost, &outside_cost, ncopies_for_cost,
+			   &same_perm_nodes))
+    {
+      if (same_perm_nodes)
+        {
+          /* The only supported load permutation is one in which all the
+             loads in a node read the same location, and the data-refs in
+             the nodes of SAME_PERM_NODES constitute an interleaving chain.
+             First we check that the number of such nodes is equal to the
+             size of the interleaving chain.  */
+          if (VEC_length (slp_tree, same_perm_nodes) != group_size)
+            {
+              if (vect_print_dump_info (REPORT_DETAILS))
+                {
+                  fprintf (vect_dump, "Build SLP failed: unsupported data "
+                                      "permutation ");
+                  print_generic_expr (vect_dump, first_load, TDF_SLIM);
+                }
+              
+              VEC_free (slp_tree, heap, same_perm_nodes);
+              return false; 
+            }
+
+          /* Check that the loads in SAME_PERM_NODES constitute an interleaving
+             chain. Sort the nodes according to the order of accesses in the
+             chain.  */ 
+          temp_nodes = (slp_tree *) xmalloc (sizeof (slp_tree) * group_size);
+          first_node = VEC_index (slp_tree, same_perm_nodes, 0);
+          first_load = VEC_index (tree, SLP_TREE_SCALAR_STMTS (first_node), 0);
+          first_load = DR_GROUP_FIRST_DR (vinfo_for_stmt (first_load));
+
+          for (i = 0; VEC_iterate (slp_tree, same_perm_nodes, i, next_node); 
+               i++)
+            {
+              load = VEC_index (tree, SLP_TREE_SCALAR_STMTS (next_node), 0);
+
+              index = 0;
+              next_load = first_load;
+
+              while (next_load && load != next_load)
+                {
+                  next_load = DR_GROUP_NEXT_DR (vinfo_for_stmt (next_load));
+                  index++;
+                }
+ 
+              /* No matching data-ref was found in the interleaving chain.  */
+              if (index >= group_size)
+                {
+                  if (vect_print_dump_info (REPORT_DETAILS))
+                    {
+                      fprintf (vect_dump, "Build SLP failed: unsupported data "
+                                          "permutation ");
+                      print_generic_expr (vect_dump, first_load, TDF_SLIM);
+                    }
+
+                  VEC_free (slp_tree, heap, same_perm_nodes);
+                  return false;
+                }
+
+              temp_nodes[index] = next_node;
+            }
+
+          sorted_same_perm_nodes = VEC_alloc (slp_tree, heap, group_size);
+          for (i = 0; i < group_size; i++)
+             VEC_safe_push (slp_tree, heap, sorted_same_perm_nodes, temp_nodes[i]);
+
+          VEC_free (slp_tree, heap, same_perm_nodes);
+        }
+
       /* Create a new SLP instance.  */  
       new_instance = XNEW (struct _slp_instance);
       SLP_INSTANCE_TREE (new_instance) = node;
       SLP_INSTANCE_GROUP_SIZE (new_instance) = group_size;
-      SLP_INSTANCE_UNROLLING_FACTOR (new_instance) = unrolling_factor;
       SLP_INSTANCE_OUTSIDE_OF_LOOP_COST (new_instance) = outside_cost;
       SLP_INSTANCE_INSIDE_OF_LOOP_COST (new_instance) = inside_cost;
-      VEC_safe_push (slp_instance, heap, LOOP_VINFO_SLP_INSTANCES (loop_vinfo), 
-		     new_instance);
-      if (vect_print_dump_info (REPORT_SLP))
-	vect_print_slp_tree (node);
+      SLP_INSTANCE_SAME_PERM_NODES (new_instance) = sorted_same_perm_nodes;
+      SLP_INSTANCE_UNROLLING_FACTOR (new_instance) = unrolling_factor;
+
+      /* Check that the permutation of loads is possible.  */
+      if (sorted_same_perm_nodes
+          && !vect_transform_slp_perm_load (first_load, NULL, NULL, unrolling_factor,
+                                            new_instance, true))
+        {
+          vect_free_slp_tree (node);
+          VEC_free (slp_tree, heap, sorted_same_perm_nodes);
+          free (new_instance);
+          return false;
+        }
 
+      VEC_safe_push (slp_instance, heap, LOOP_VINFO_SLP_INSTANCES (loop_vinfo),
+                     new_instance);
+      if (vect_print_dump_info (REPORT_DETAILS))
+        vect_print_slp_tree (node);
+      
       return true;
     }
 
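The chain lookup in vect_analyze_slp_instance above can be illustrated outside
of GCC. The following is only a minimal sketch under stated assumptions: the
type sketch_load and the function sketch_chain_index are hypothetical names
that stand in for the DR_GROUP_NEXT_DR links and are not part of the patch. It
shows a bounded walk that reports "not found" without stepping past the end of
the chain.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a load in an interleaving chain; NEXT plays
   the role of DR_GROUP_NEXT_DR and is NULL for the last load.  */
struct sketch_load
{
  struct sketch_load *next;
};

/* Return the position of LOAD in the chain that starts at FIRST, or -1 if
   LOAD is not among the first GROUP_SIZE members of the chain.  */
static int
sketch_chain_index (const struct sketch_load *first,
                    const struct sketch_load *load, int group_size)
{
  const struct sketch_load *cur;
  int index = 0;

  assert (group_size >= 0);

  /* Stop at the end of the chain or after GROUP_SIZE steps, whichever
     comes first, so a missing load never causes a NULL dereference.  */
  for (cur = first; cur != NULL && index < group_size;
       cur = cur->next, index++)
    if (cur == load)
      return index;

  return -1;
}
```

A caller would use the returned index to place each node in a temporary array,
as the patch does with TEMP_NODES, and give up on SLP when -1 is returned.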
Index: target-def.h
===================================================================
--- target-def.h	(revision 137312)
+++ target-def.h	(working copy)
@@ -364,6 +364,7 @@
 #define TARGET_VECTORIZE_BUILTIN_VECTORIZATION_COST 0
 #define TARGET_VECTOR_ALIGNMENT_REACHABLE \
   default_builtin_vector_alignment_reachable
+#define TARGET_VECTORIZE_BUILTIN_VEC_PERM 0
 
 #define TARGET_VECTORIZE                                                \
   {									\
@@ -373,7 +374,8 @@
     TARGET_VECTORIZE_BUILTIN_MUL_WIDEN_EVEN,                            \
     TARGET_VECTORIZE_BUILTIN_MUL_WIDEN_ODD,				\
     TARGET_VECTORIZE_BUILTIN_VECTORIZATION_COST,			\
-    TARGET_VECTOR_ALIGNMENT_REACHABLE					\
+    TARGET_VECTOR_ALIGNMENT_REACHABLE,                                  \
+    TARGET_VECTORIZE_BUILTIN_VEC_PERM                                   \
   }
 
 #define TARGET_DEFAULT_TARGET_FLAGS 0
Index: tree-vect-transform.c
===================================================================
--- tree-vect-transform.c	(revision 137312)
+++ tree-vect-transform.c	(working copy)
@@ -46,7 +46,8 @@ along with GCC; see the file COPYING3.  
 #include "real.h"
 
 /* Utility functions for the code transformation.  */
-static bool vect_transform_stmt (tree, block_stmt_iterator *, bool *, slp_tree);
+static bool vect_transform_stmt (tree, block_stmt_iterator *, bool *, slp_tree,
+                                 slp_instance);
 static tree vect_create_destination_var (tree, tree);
 static tree vect_create_data_ref_ptr 
   (tree, struct loop*, tree, tree *, tree *, bool, bool *); 
@@ -906,7 +907,7 @@ vect_create_addr_base_for_vector_ref (tr
       base_offset = force_gimple_operand (base_offset, &new_stmt, false, tmp);
       append_to_statement_list_force (new_stmt, new_stmt_list);
     }
-  
+
   /* base + base_offset */
   addr_base = fold_build2 (POINTER_PLUS_EXPR, TREE_TYPE (data_ref_base), 
 			   data_ref_base, base_offset);
@@ -4841,7 +4842,7 @@ vectorizable_store (tree stmt, block_stm
 	strided_store = false;
 
       /* VEC_NUM is the number of vect stmts to be created for this group.  */
-      if (slp && SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node) < group_size)
+      if (slp)
 	vec_num = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node);
       else
 	vec_num = group_size;
@@ -5004,7 +5005,10 @@ vectorizable_store (tree stmt, block_stm
 	  new_stmt = build_gimple_modify_stmt (data_ref, vec_oprnd);
 	  vect_finish_stmt_generation (stmt, new_stmt, bsi);
 	  mark_symbols_for_renaming (new_stmt);
-	  
+	 
+          if (slp)
+            continue;
+ 
           if (j == 0)
             STMT_VINFO_VEC_STMT (stmt_info) = *vec_stmt =  new_stmt;
 	  else
@@ -5537,6 +5541,299 @@ vect_transform_strided_load (tree stmt, 
 }
 
 
+/* Create NCOPIES permutation statements using the mask MASK_BYTES (by 
+   building a vector of type MASK_TYPE from it) and two input vectors placed in
+   DR_CHAIN at FIRST_VEC_INDX and SECOND_VEC_INDX for the first copy and
+   shifting by STRIDE elements of DR_CHAIN for every copy.
+   (STRIDE is the number of vectorized stmts for NODE divided by the number of
+   copies).  
+   VECT_STMTS_COUNTER specifies the index in the vectorized stmts of NODE, where
+   the created stmts must be inserted.  */
+
+static inline void
+vect_create_mask_and_perm (tree stmt, tree next_scalar_stmt, int *mask_bytes,
+                           int mask_nunits, tree mask_element_type,
+                           tree mask_type, int first_vec_indx, 
+                           int second_vec_indx, block_stmt_iterator *bsi,
+                           slp_tree node,  tree builtin_decl, 
+                           tree vectype, VEC(tree,heap) *dr_chain,
+                           int ncopies, int vect_stmts_counter)
+{
+  tree t = NULL_TREE, mask_vec, mask, perm_dest, perm_stmt = NULL_TREE;
+  stmt_vec_info next_stmt_info;
+  int i, group_size, stride, dr_chain_size;
+  tree params, first_vec, second_vec, data_ref;
+
+  /* Create a vector mask.  */
+  for (i = mask_nunits - 1; i >= 0; --i)
+    t = tree_cons (NULL_TREE, build_int_cst (mask_element_type, mask_bytes[i]),
+                   t);
+
+  mask_vec = build_vector (mask_type, t);
+  mask = vect_init_vector (stmt, mask_vec, mask_type, NULL);
+
+  group_size = VEC_length (tree, SLP_TREE_SCALAR_STMTS (node));
+  stride = SLP_TREE_NUMBER_OF_VEC_STMTS (node) / ncopies;
+  dr_chain_size = VEC_length (tree, dr_chain); 
+
+  /* Initialize the vect stmts of NODE to properly insert the generated 
+     stmts later.  */
+  for (i = VEC_length (tree, SLP_TREE_VEC_STMTS (node)); 
+       i < (int) SLP_TREE_NUMBER_OF_VEC_STMTS (node); i++)
+    VEC_quick_push (tree, SLP_TREE_VEC_STMTS (node), NULL_TREE);
+
+  for (i = 0; i < ncopies; i++)
+    {
+      first_vec = VEC_index (tree, dr_chain, first_vec_indx);
+      second_vec = VEC_index (tree, dr_chain, second_vec_indx);
+
+      /* Build parameters list for permute statement.  */
+      params = build_tree_list (NULL_TREE, mask);
+      params = tree_cons (NULL_TREE, second_vec, params);
+      params = tree_cons (NULL_TREE, first_vec, params);
+
+      /* Generate the permute statement.  */
+      perm_stmt = build_function_call_expr (builtin_decl, params);
+      perm_dest = create_tmp_var (vectype, "vec_perm");
+      DECL_GIMPLE_REG_P (perm_dest) = 1;
+      add_referenced_var (perm_dest);
+      perm_stmt = build_gimple_modify_stmt (perm_dest, perm_stmt);
+      data_ref = make_ssa_name (perm_dest, perm_stmt);
+      GIMPLE_STMT_OPERAND (perm_stmt, 0) = data_ref;
+      vect_finish_stmt_generation (stmt, perm_stmt, bsi);
+      mark_symbols_for_renaming (perm_stmt);
+
+      /* Store the vector statement in NODE.  */ 
+      VEC_replace (tree, SLP_TREE_VEC_STMTS (node), 
+                   stride * i + vect_stmts_counter, perm_stmt);
+
+      first_vec_indx += stride;
+      second_vec_indx += stride;
+    }
+
+  /* Mark the scalar stmt as vectorized.  */
+  next_stmt_info = vinfo_for_stmt (next_scalar_stmt);
+  STMT_VINFO_VEC_STMT (next_stmt_info) = perm_stmt;
+}
+
+
+/* Given FIRST_BYTE - the mask element in element representation, return
+   in CURRENT_BYTE its equivalent in byte representation. Check that the mask
+   is valid and return FALSE if not. Return TRUE in NEED_NEXT_VECTOR if the
+   permutation requires moving to the next vector, i.e., the current first
+   vector is not needed.  */
+   
+static bool
+vect_get_mask_element (tree stmt, int first_byte, int m, int mask_nunits,
+                       bool only_one_vec, int index, int *mask_bytes,
+                       int *current_byte, bool *need_next_vector)
+{
+  int i;
+  static int number_of_mask_fixes = 1;
+  static bool mask_fixed = false;
+  static bool needs_first_vector = false;
+
+  /* Convert to bytes.  */
+  *current_byte = first_byte + m;
+  /* Adjust the value in case it's a mask for second and third vectors.  */
+  *current_byte -= mask_nunits * (number_of_mask_fixes - 1);
+
+  if (*current_byte < mask_nunits)
+    needs_first_vector = true;
+
+  /* We have only one input vector to permute but the mask accesses values in
+     the next vector as well.  */
+  if (only_one_vec && *current_byte >= mask_nunits)
+    {
+      if (vect_print_dump_info (REPORT_DETAILS))
+        {
+          fprintf (vect_dump, "permutation requires at least two vectors ");
+          print_generic_expr (vect_dump, stmt, TDF_SLIM);
+        }
+
+      return false;
+    }
+
+  /* The mask requires the next vector.  */
+  if (*current_byte >= mask_nunits * 2)
+    {
+      if (needs_first_vector || mask_fixed)
+        {
+          /* We either need the first vector too or have already moved to the
+             next vector. In both cases, this permutation needs three 
+             vectors.  */
+          if (vect_print_dump_info (REPORT_DETAILS))
+            {
+              fprintf (vect_dump, "permutation requires at "
+                                  "least three vectors ");
+              print_generic_expr (vect_dump, stmt, TDF_SLIM);
+            }
+
+          return false;
+        }
+
+      /* We move to the next vector, dropping the first one and working with
+         the second and the third - we need to adjust the values of the mask
+         accordingly.  */
+      *current_byte -= mask_nunits * number_of_mask_fixes;
+
+      for (i = 0; i < index; i++)
+        mask_bytes[i] -= mask_nunits * number_of_mask_fixes;
+
+      (number_of_mask_fixes)++;
+      mask_fixed = true;
+    }
+
+  *need_next_vector = mask_fixed;
+
+  /* This was the last element of this mask. Start a new one.  */
+  if (index == mask_nunits - 1)
+    {
+      number_of_mask_fixes = 1;
+      mask_fixed = false;
+      needs_first_vector = false;
+    }
+
+  return true;
+}
+
+
+/* Generate vector permute statements from a list of loads in DR_CHAIN.
+   If ANALYZE_ONLY is TRUE, only check that it is possible to create valid
+   permute statements for SLP_NODE_INSTANCE.  */
+bool
+vect_transform_slp_perm_load (tree stmt, VEC(tree,heap) *dr_chain,
+                              block_stmt_iterator *bsi, int vf,
+                              slp_instance slp_node_instance, bool analyze_only)
+{
+  stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
+  tree mask_element_type = NULL_TREE, mask_type;
+  int i, j, k, m, scale, mask_nunits, nunits, vec_index = 0, scalar_index;
+  slp_tree node;
+  tree vectype = STMT_VINFO_VECTYPE (stmt_info), builtin_decl;
+  tree next_scalar_stmt;
+  int group_size = SLP_INSTANCE_GROUP_SIZE (slp_node_instance), first_byte;
+  int index, unroll_factor, *mask_bytes, current_byte, ncopies;
+  bool only_one_vec = false, need_next_vector = false;
+  int first_vec_index, second_vec_index, orig_vec_stmts_num, vect_stmts_counter;
+
+  if (!targetm.vectorize.builtin_vec_perm)
+    {
+      if (vect_print_dump_info (REPORT_DETAILS))
+        {
+          fprintf (vect_dump, "no builtin for vect permute for ");
+          print_generic_expr (vect_dump, stmt, TDF_SLIM);
+        }
+
+       return false;
+    }
+
+  builtin_decl = targetm.vectorize.builtin_vec_perm (vectype,
+                                                     &mask_element_type);
+  if (!builtin_decl || !mask_element_type)
+    {
+      if (vect_print_dump_info (REPORT_DETAILS))
+        {
+          fprintf (vect_dump, "no builtin for vect permute for ");
+          print_generic_expr (vect_dump, stmt, TDF_SLIM);
+        }
+
+       return false;
+    }
+
+  mask_type = get_vectype_for_scalar_type (mask_element_type);
+  mask_nunits = TYPE_VECTOR_SUBPARTS (mask_type);
+  mask_bytes = (int *) xmalloc (sizeof (int) * mask_nunits);
+  nunits = TYPE_VECTOR_SUBPARTS (vectype);
+  scale = mask_nunits / nunits;
+  unroll_factor = SLP_INSTANCE_UNROLLING_FACTOR (slp_node_instance);
+
+  /* The number of vector stmts to generate based only on SLP_NODE_INSTANCE
+     unrolling factor.  */
+  orig_vec_stmts_num = group_size * 
+                SLP_INSTANCE_UNROLLING_FACTOR (slp_node_instance) / nunits;
+  if (orig_vec_stmts_num == 1)
+    only_one_vec = true;
+
+  /* Number of copies is determined by the final vectorization factor 
+     relative to SLP_NODE_INSTANCE unrolling factor.  */
+  ncopies = vf / SLP_INSTANCE_UNROLLING_FACTOR (slp_node_instance); 
+
+  /* Generate permutation masks for every NODE. Number of masks for each NODE 
+     is equal to GROUP_SIZE.  
+     E.g., we have a group of three nodes with three loads from the same 
+     location in each node, and the vector size is 4. I.e., we have a 
+     a0b0c0a1b1c1... sequence and we need to create the following vectors: 
+     for a's: a0a0a0a1 a1a1a2a2 a2a3a3a3
+     for b's: b0b0b0b1 b1b1b2b2 b2b3b3b3
+     ...
+
+     The masks for a's should be: {0,0,0,3} {3,3,6,6} {6,9,9,9}, but in bytes.
+     The last mask is illegal, since it refers to elements beyond the two
+     input vectors, and must be converted into {2,5,5,5}.  For the first two
+     permutations we need the first and the second input vectors:
+     {a0,b0,c0,a1} and {b1,c1,a2,b2}, and for the last permutation we need
+     the second and the third vectors: {b1,c1,a2,b2} and {c2,a3,b3,c3}.  */
+
+  for (i = 0;
+       VEC_iterate (slp_tree, SLP_INSTANCE_SAME_PERM_NODES (slp_node_instance),
+                    i, node);
+       i++)
+    {
+      scalar_index = 0;
+      index = 0;
+      vect_stmts_counter = 0;
+      vec_index = 0;
+      first_vec_index = vec_index++;
+      if (only_one_vec)
+        second_vec_index = first_vec_index;
+      else
+        second_vec_index =  vec_index++;
+
+      for (j = 0; j < unroll_factor; j++)
+        {
+          for (k = 0; k < group_size; k++)
+            {
+              first_byte = (i + j * group_size) * scale;
+              for (m = 0; m < scale; m++)
+                {
+                  if (!vect_get_mask_element (stmt, first_byte, m, mask_nunits,
+                                              only_one_vec, index, mask_bytes,
+                                              &current_byte, &need_next_vector))
+                    return false;
+
+                  mask_bytes[index++] = current_byte;
+                } 
+
+              if (index == mask_nunits)
+                {
+                  index = 0;
+                  if (!analyze_only)
+                    {
+                      if (need_next_vector)
+                        {
+                          first_vec_index = second_vec_index;
+                          second_vec_index = vec_index;
+                        }
+
+                      next_scalar_stmt = VEC_index (tree,
+                                SLP_TREE_SCALAR_STMTS (node), scalar_index++);
+
+                      vect_create_mask_and_perm (stmt, next_scalar_stmt,
+                                    mask_bytes, mask_nunits, mask_element_type,
+                                    mask_type, first_vec_index, second_vec_index, bsi,
+                                    node, builtin_decl, vectype, dr_chain, ncopies,
+                                    vect_stmts_counter++); 
+                    }
+                } 
+            } 
+        } 
+    } 
+
+  free (mask_bytes);
+  return true;
+}
+
 /* vectorizable_load.
 
    Check if STMT reads a non scalar data-ref (array/pointer/structure) that 
@@ -5547,7 +5844,7 @@ vect_transform_strided_load (tree stmt, 
 
 bool
 vectorizable_load (tree stmt, block_stmt_iterator *bsi, tree *vec_stmt, 
-		   slp_tree slp_node)
+		   slp_tree slp_node, slp_instance slp_node_instance)
 {
   tree scalar_dest;
   tree vec_dest = NULL;
@@ -5569,7 +5866,7 @@ vectorizable_load (tree stmt, block_stmt
   tree dataref_ptr = NULL_TREE;
   tree ptr_incr;
   int nunits = TYPE_VECTOR_SUBPARTS (vectype);
-  int ncopies = LOOP_VINFO_VECT_FACTOR (loop_vinfo) / nunits;
+  int ncopies;
   int i, j, group_size;
   tree msq = NULL_TREE, lsq;
   tree offset = NULL_TREE;
@@ -5584,11 +5881,17 @@ vectorizable_load (tree stmt, block_stmt
   struct loop *at_loop;
   int vec_num;
   bool slp = (slp_node != NULL);
+  bool slp_perm = false;
+
+  if (slp && SLP_TREE_LOADS_PERM_TYPE (slp_node) == same)
+    slp_perm = true;
 
   /* FORNOW: SLP with multiple types is not supported. The SLP analysis verifies
-      this, so we can safely override NCOPIES with 1 here.  */
+     this, so we can safely override NCOPIES with 1 here.  */
   if (slp)
-    ncopies = 1;
+    ncopies = 1; 
+  else
+    ncopies = LOOP_VINFO_VECT_FACTOR (loop_vinfo) / nunits;
 
   gcc_assert (ncopies >= 1);
 
@@ -5682,8 +5985,6 @@ vectorizable_load (tree stmt, block_stmt
 	}
       first_dr = STMT_VINFO_DATA_REF (vinfo_for_stmt (first_stmt));
       group_size = DR_GROUP_SIZE (vinfo_for_stmt (first_stmt));
-      dr_chain = VEC_alloc (tree, heap, group_size);
-
       /* VEC_NUM is the number of vect stmts to be created for this group.  */
       if (slp)
 	{
@@ -5692,6 +5993,8 @@ vectorizable_load (tree stmt, block_stmt
 	}
       else
 	vec_num = group_size;
+
+      dr_chain = VEC_alloc (tree, heap, vec_num);
     }
   else
     {
@@ -5964,34 +6267,48 @@ vectorizable_load (tree stmt, block_stmt
 	    }
 
 	  /* Collect vector loads and later create their permutation in
-	     vect_transform_strided_load ().  */
-          if (strided_load)
+	     vect_transform_strided_load () or 
+             vect_transform_slp_perm_load ().  */
+          if (strided_load || slp_perm)
             VEC_quick_push (tree, dr_chain, new_temp);
 
-         /* Store vector loads in the corresponding SLP_NODE.  */
-	  if (slp)
+          /* Store vector loads in the corresponding SLP_NODE.  */
+	  if (slp && !slp_perm)
 	    VEC_quick_push (tree, SLP_TREE_VEC_STMTS (slp_node), new_stmt);
 	}
 
-      /* FORNOW: SLP with multiple types is unsupported.  */
-      if (slp)
-	return true;
+      if (slp && !slp_perm)
+	continue;
 
-      if (strided_load)
-	{
-	  if (!vect_transform_strided_load (stmt, dr_chain, group_size, bsi))
-	    return false;	  
-	  *vec_stmt = STMT_VINFO_VEC_STMT (stmt_info);
-          VEC_free (tree, heap, dr_chain);
-	  dr_chain = VEC_alloc (tree, heap, group_size);
-	}
-      else
-	{
-	  if (j == 0)
-	    STMT_VINFO_VEC_STMT (stmt_info) = *vec_stmt = new_stmt;
-	  else
-	    STMT_VINFO_RELATED_STMT (prev_stmt_info) = new_stmt;
-	  prev_stmt_info = vinfo_for_stmt (new_stmt);
+      if (slp_perm)
+        {
+          if (!vect_transform_slp_perm_load (stmt, dr_chain, bsi,
+                                   LOOP_VINFO_VECT_FACTOR (loop_vinfo),
+                                             slp_node_instance, false))
+            {
+              VEC_free (tree, heap, dr_chain);
+              return false;
+            }
+        }
+      else 
+        {
+          if (strided_load)
+  	    { 
+	      if (!vect_transform_strided_load (stmt, dr_chain, group_size, 
+                                                bsi))
+   	        return false;	  
+	      *vec_stmt = STMT_VINFO_VEC_STMT (stmt_info);
+              VEC_free (tree, heap, dr_chain);
+	      dr_chain = VEC_alloc (tree, heap, group_size);
+            }
+          else
+	    {
+	      if (j == 0)
+	        STMT_VINFO_VEC_STMT (stmt_info) = *vec_stmt = new_stmt;
+	      else
+	        STMT_VINFO_RELATED_STMT (prev_stmt_info) = new_stmt;
+   	      prev_stmt_info = vinfo_for_stmt (new_stmt);
+            }
 	}
     }
 
@@ -6247,7 +6564,7 @@ vectorizable_condition (tree stmt, block
 
 static bool
 vect_transform_stmt (tree stmt, block_stmt_iterator *bsi, bool *strided_store, 
-		     slp_tree slp_node)
+		     slp_tree slp_node, slp_instance slp_node_instance)
 {
   bool is_store = false;
   tree vec_stmt = NULL_TREE;
@@ -6291,14 +6608,15 @@ vect_transform_stmt (tree stmt, block_st
       break;
 
     case load_vec_info_type:
-      done = vectorizable_load (stmt, bsi, &vec_stmt, slp_node);
+      done = vectorizable_load (stmt, bsi, &vec_stmt, slp_node, 
+                                slp_node_instance);
       gcc_assert (done);
       break;
 
     case store_vec_info_type:
       done = vectorizable_store (stmt, bsi, &vec_stmt, slp_node);
       gcc_assert (done);
-      if (STMT_VINFO_STRIDED_ACCESS (stmt_info))
+      if (STMT_VINFO_STRIDED_ACCESS (stmt_info) && !slp_node)
 	{
 	  /* In case of interleaving, the whole chain is vectorized when the
 	     last store in the chain is reached. Store stmts before the last
@@ -7346,23 +7664,62 @@ vect_remove_stores (tree first_stmt)
 /* Vectorize SLP instance tree in postorder.  */
 
 static bool
-vect_schedule_slp_instance (slp_tree node, unsigned int vec_stmts_size)
+vect_schedule_slp_instance (slp_tree node, slp_instance instance,
+                            unsigned int vectorization_factor)
 {
   tree stmt;
   bool strided_store, is_store;
   block_stmt_iterator si;
   stmt_vec_info stmt_info;
+  unsigned int i;
+  slp_tree same_loads_node;
+  unsigned int vec_stmts_size, nunits, group_size;
+  tree vectype;
 
   if (!node)
     return false;
 
-  vect_schedule_slp_instance (SLP_TREE_LEFT (node), vec_stmts_size);
-  vect_schedule_slp_instance (SLP_TREE_RIGHT (node), vec_stmts_size);
+  vect_schedule_slp_instance (SLP_TREE_LEFT (node), instance, 
+                              vectorization_factor);
+  vect_schedule_slp_instance (SLP_TREE_RIGHT (node), instance,
+                              vectorization_factor);
   
   stmt = VEC_index(tree, SLP_TREE_SCALAR_STMTS (node), 0);
   stmt_info = vinfo_for_stmt (stmt);
-  SLP_TREE_VEC_STMTS (node) = VEC_alloc (tree, heap, vec_stmts_size);
-  SLP_TREE_NUMBER_OF_VEC_STMTS (node) = vec_stmts_size;
+
+  /* VECTYPE is the type of the destination.  */
+  vectype = get_vectype_for_scalar_type (TREE_TYPE (GIMPLE_STMT_OPERAND (stmt, 
+                                                                         0)));
+  nunits = (unsigned int) TYPE_VECTOR_SUBPARTS (vectype);
+  group_size = SLP_INSTANCE_GROUP_SIZE (instance);
+
+  /* For each SLP instance calculate number of vector stmts to be created
+     for the scalar stmts in each node of the SLP tree. Number of vector
+     elements in one vector iteration is the number of scalar elements in
+     one scalar iteration (GROUP_SIZE) multiplied by VF divided by vector
+     size.  */
+  vec_stmts_size = (vectorization_factor * group_size) / nunits;
+
+  if (SLP_TREE_LOADS_PERM_TYPE (node) == same)
+    {
+      if (!SLP_TREE_VEC_STMTS (node))
+        {
+          for (i = 0;
+               VEC_iterate (slp_tree, SLP_INSTANCE_SAME_PERM_NODES (instance),
+                            i, same_loads_node);
+               i++)
+            {
+              SLP_TREE_VEC_STMTS (same_loads_node) = VEC_alloc (tree, heap,
+                                                                vec_stmts_size);
+              SLP_TREE_NUMBER_OF_VEC_STMTS (same_loads_node) = vec_stmts_size;
+            }
+        }
+    }
+  else
+    {
+      SLP_TREE_VEC_STMTS (node) = VEC_alloc (tree, heap, vec_stmts_size);
+      SLP_TREE_NUMBER_OF_VEC_STMTS (node) = vec_stmts_size;
+    }
 
   if (vect_print_dump_info (REPORT_DETAILS))
     {
@@ -7371,7 +7728,7 @@ vect_schedule_slp_instance (slp_tree nod
     }	
 
   si = bsi_for_stmt (stmt);
-  is_store = vect_transform_stmt (stmt, &si, &strided_store, node);
+  is_store = vect_transform_stmt (stmt, &si, &strided_store, node, instance);
   if (is_store)
     {
       if (DR_GROUP_FIRST_DR (stmt_info))
@@ -7392,29 +7749,20 @@ vect_schedule_slp_instance (slp_tree nod
 
 
 static bool
-vect_schedule_slp (loop_vec_info loop_vinfo, unsigned int nunits)
+vect_schedule_slp (loop_vec_info loop_vinfo)
 {
   VEC (slp_instance, heap) *slp_instances = 
     LOOP_VINFO_SLP_INSTANCES (loop_vinfo);
   slp_instance instance;
-  unsigned int vec_stmts_size;
-  unsigned int group_size, i;
-  unsigned int vectorization_factor = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
+  unsigned int i;
   bool is_store = false;
 
   for (i = 0; VEC_iterate (slp_instance, slp_instances, i, instance); i++)
     {
-      group_size = SLP_INSTANCE_GROUP_SIZE (instance);
-      /* For each SLP instance calculate number of vector stmts to be created 
-	 for the scalar stmts in each node of the SLP tree. Number of vector 
-	 elements in one vector iteration is the number of scalar elements in 
-	 one scalar iteration (GROUP_SIZE) multiplied by VF divided by vector 
-	 size.  */		      
-      vec_stmts_size = vectorization_factor * group_size / nunits;
-			  
       /* Schedule the tree of INSTANCE.  */
-      is_store = vect_schedule_slp_instance (SLP_INSTANCE_TREE (instance), 
-					     vec_stmts_size);
+      is_store = vect_schedule_slp_instance (SLP_INSTANCE_TREE (instance),
+                                          instance,
+                                          LOOP_VINFO_VECT_FACTOR (loop_vinfo)); 
 		     
       if (vect_print_dump_info (REPORT_VECTORIZED_LOOPS)
 	  || vect_print_dump_info (REPORT_UNVECTORIZED_LOOPS))
@@ -7519,7 +7867,7 @@ vect_transform_loop (loop_vec_info loop_
 	    {
 	      if (vect_print_dump_info (REPORT_DETAILS))
 		fprintf (vect_dump, "transform phi.");
-	      vect_transform_stmt (phi, NULL, NULL, NULL);
+	      vect_transform_stmt (phi, NULL, NULL, NULL, NULL);
 	    }
 	}
 
@@ -7573,7 +7921,7 @@ vect_transform_loop (loop_vec_info loop_
 		  if (vect_print_dump_info (REPORT_DETAILS))
 		    fprintf (vect_dump, "=== scheduling SLP instances ===");
 
-		  is_store = vect_schedule_slp (loop_vinfo, nunits);
+		  is_store = vect_schedule_slp (loop_vinfo);
 
 		  /* IS_STORE is true if STMT is a store. Stores cannot be of
 		     hybrid SLP type. They are removed in
@@ -7598,7 +7946,7 @@ vect_transform_loop (loop_vec_info loop_
 	    fprintf (vect_dump, "transform statement.");
 
 	  strided_store = false;
-	  is_store = vect_transform_stmt (stmt, &si, &strided_store, NULL);
+	  is_store = vect_transform_stmt (stmt, &si, &strided_store, NULL, NULL);
           if (is_store)
             {
 	      if (STMT_VINFO_STRIDED_ACCESS (stmt_info))
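The mask example in the comment of vect_transform_slp_perm_load (three nodes,
vector size 4) can be reproduced with a small standalone program. This is a
sketch under stated assumptions, not the algorithm of the patch: the function
sketch_perm_mask is a hypothetical name, it works in element units rather than
bytes (the patch multiplies by SCALE), and it picks the two-vector window from
the smallest selector instead of tracking state across calls as
vect_get_mask_element does.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper: fill MASK with the element-level selector number
   MASK_NO (0-based) for the node at position NODE_POS in an interleaving
   chain of GROUP_SIZE loads, with NUNITS elements per vector.  Store in
   *FIRST_VEC the index of the first of the two adjacent input vectors.
   Return false if the mask would need more than two adjacent vectors.  */
static bool
sketch_perm_mask (int node_pos, int group_size, int nunits, int mask_no,
                  int *mask, int *first_vec)
{
  int n, base;

  assert (group_size > 0 && nunits > 0);

  for (n = 0; n < nunits; n++)
    {
      /* Position in the flattened output sequence a0 a0 a0 a1 a1 a1 ...  */
      int flat = mask_no * nunits + n;

      /* Each scalar is repeated GROUP_SIZE times, and consecutive scalars
         of one node lie GROUP_SIZE elements apart in memory.  */
      mask[n] = node_pos + (flat / group_size) * group_size;
    }

  /* The selectors are nondecreasing, so mask[0] is the smallest one and
     determines the first input vector of the window.  */
  *first_vec = mask[0] / nunits;
  base = *first_vec * nunits;

  for (n = 0; n < nunits; n++)
    {
      /* Rebase the selector so that it indexes into the two-vector window.  */
      mask[n] -= base;
      if (mask[n] >= 2 * nunits)
        return false;
    }

  return true;
}
```

For the a's (NODE_POS 0, GROUP_SIZE 3, NUNITS 4), masks 0 and 1 come out as
{0,0,0,3} and {3,3,6,6} on the first and second vectors, and mask 2 comes out
as {2,5,5,5} on the second and third vectors, matching the comment.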
Index: config/spu/spu.c
===================================================================
--- config/spu/spu.c	(revision 137312)
+++ config/spu/spu.c	(working copy)
@@ -137,6 +137,7 @@ static tree spu_builtin_mul_widen_odd (t
 static tree spu_builtin_mask_for_load (void);
 static int spu_builtin_vectorization_cost (bool);
 static bool spu_vector_alignment_reachable (const_tree, bool);
+static tree spu_builtin_vec_perm (tree, tree *);
 static int spu_sms_res_mii (struct ddg *g);
 
 extern const char *reg_names[];
@@ -288,6 +289,9 @@ const struct attribute_spec spu_attribut
 #undef TARGET_VECTOR_ALIGNMENT_REACHABLE
 #define TARGET_VECTOR_ALIGNMENT_REACHABLE spu_vector_alignment_reachable
 
+#undef TARGET_VECTORIZE_BUILTIN_VEC_PERM
+#define TARGET_VECTORIZE_BUILTIN_VEC_PERM spu_builtin_vec_perm
+
 #undef TARGET_LIBGCC_CMP_RETURN_MODE
 #define TARGET_LIBGCC_CMP_RETURN_MODE spu_libgcc_cmp_return_mode
 
@@ -5531,6 +5535,60 @@ spu_vector_alignment_reachable (const_tr
   return true;
 }
 
+/* Implement targetm.vectorize.builtin_vec_perm.  */
+tree
+spu_builtin_vec_perm (tree type, tree *mask_element_type)
+{
+  struct spu_builtin_description *d;
+
+  *mask_element_type = unsigned_char_type_node;
+
+  switch (TYPE_MODE (type))
+    {
+    case V16QImode:
+      if (TYPE_UNSIGNED (type))
+        d = &spu_builtins[SPU_SHUFFLE_0];
+      else
+        d = &spu_builtins[SPU_SHUFFLE_1];
+      break;
+
+    case V8HImode:
+      if (TYPE_UNSIGNED (type))
+        d = &spu_builtins[SPU_SHUFFLE_2];
+      else
+        d = &spu_builtins[SPU_SHUFFLE_3];
+      break;
+
+    case V4SImode:
+      if (TYPE_UNSIGNED (type))
+        d = &spu_builtins[SPU_SHUFFLE_4];
+      else
+        d = &spu_builtins[SPU_SHUFFLE_5];
+      break;
+
+    case V2DImode:
+      if (TYPE_UNSIGNED (type))
+        d = &spu_builtins[SPU_SHUFFLE_6];
+      else
+        d = &spu_builtins[SPU_SHUFFLE_7];
+      break;
+
+    case V4SFmode:
+      d = &spu_builtins[SPU_SHUFFLE_8];
+      break;
+
+    case V2DFmode:
+      d = &spu_builtins[SPU_SHUFFLE_9];
+      break;
+
+    default:
+      return NULL_TREE;
+    }
+
+  gcc_assert (d);
+  return d->fndecl;
+}
+
 /* Count the total number of instructions in each pipe and return the
    maximum, which is used as the Minimum Iteration Interval (MII)
    in the modulo scheduler.  get_pipe() will return -2, -1, 0, or 1.
Index: config/spu/spu.h
===================================================================
--- config/spu/spu.h	(revision 137312)
+++ config/spu/spu.h	(working copy)
@@ -594,6 +594,11 @@ targetm.resolve_overloaded_builtin = spu
 #undef TARG_VEC_STORE_COST
 #define TARG_VEC_STORE_COST          1
 
+/* Cost of vector permutation.  */
+#ifndef TARG_VEC_PERMUTE_COST
+#define TARG_VEC_PERMUTE_COST        1
+#endif
+
 \f
 /* Misc */
 
Index: config/rs6000/rs6000.c
===================================================================
--- config/rs6000/rs6000.c	(revision 137312)
+++ config/rs6000/rs6000.c	(working copy)
@@ -862,6 +862,7 @@ static tree rs6000_builtin_mask_for_load
 static tree rs6000_builtin_mul_widen_even (tree);
 static tree rs6000_builtin_mul_widen_odd (tree);
 static tree rs6000_builtin_conversion (enum tree_code, tree);
+static tree rs6000_builtin_vec_perm (tree, tree *);
 
 static void def_builtin (int, const char *, tree, int);
 static bool rs6000_vector_alignment_reachable (const_tree, bool);
@@ -1138,6 +1139,8 @@ static const char alt_reg_names[][8] =
 #define TARGET_VECTORIZE_BUILTIN_MUL_WIDEN_ODD rs6000_builtin_mul_widen_odd
 #undef TARGET_VECTORIZE_BUILTIN_CONVERSION
 #define TARGET_VECTORIZE_BUILTIN_CONVERSION rs6000_builtin_conversion
+#undef TARGET_VECTORIZE_BUILTIN_VEC_PERM
+#define TARGET_VECTORIZE_BUILTIN_VEC_PERM rs6000_builtin_vec_perm
 
 #undef TARGET_VECTOR_ALIGNMENT_REACHABLE
 #define TARGET_VECTOR_ALIGNMENT_REACHABLE rs6000_vector_alignment_reachable
@@ -2060,6 +2063,40 @@ rs6000_vector_alignment_reachable (const
     }
 }
 
+/* Implement targetm.vectorize.builtin_vec_perm.  */
+static tree
+rs6000_builtin_vec_perm (tree type, tree *mask_element_type)
+{
+  tree d;
+
+  *mask_element_type = unsigned_char_type_node;
+
+  switch (TYPE_MODE (type))
+    {
+    case V16QImode:
+      d = rs6000_builtin_decls[ALTIVEC_BUILTIN_VPERM_16QI];
+      break;
+
+    case V8HImode:
+      d = rs6000_builtin_decls[ALTIVEC_BUILTIN_VPERM_8HI];
+      break;
+
+    case V4SImode:
+      d = rs6000_builtin_decls[ALTIVEC_BUILTIN_VPERM_4SI];
+      break;
+
+    case V4SFmode:
+      d = rs6000_builtin_decls[ALTIVEC_BUILTIN_VPERM_4SF];
+      break;
+
+    default:
+      return NULL_TREE;
+    }
+
+  gcc_assert (d);
+  return d;
+}
+
 /* Handle generic options of the form -mfoo=yes/no.
    NAME is the option name.
    VALUE is the option value.

[-- Attachment #3: tests.txt --]
[-- Type: text/plain, Size: 18069 bytes --]

Index: testsuite/gcc.dg/vect/slp-perm-7.c
===================================================================
--- testsuite/gcc.dg/vect/slp-perm-7.c	(revision 0)
+++ testsuite/gcc.dg/vect/slp-perm-7.c	(revision 0)
@@ -0,0 +1,77 @@
+/* { dg-require-effective-target vect_int } */
+
+#include <stdarg.h>
+#include <stdio.h>
+#include "tree-vect.h"
+
+#define M00 100
+#define M10 216
+#define M20 23
+#define M01 1322
+#define M11 13
+#define M21 27271
+#define M02 74
+#define M12 191
+#define M22 500
+
+#define K00 405
+#define K10 112
+#define K01 4322
+#define K11 135
+
+#define N 16
+
+/* SLP with load permutation and loop-based vectorization.  */
+void foo (int *__restrict__ pInput, int *__restrict__ pOutput,
+          int *__restrict__ pInput2, int *__restrict__ pOutput2)
+{
+  int i, a, b, c, d;
+
+  for (i = 0; i < N / 3; i++)
+    {
+       a = *pInput++;
+       b = *pInput++;
+       c = *pInput++;
+
+       d = *pInput2++;
+
+       *pOutput++ = M00 * a + M01 * b + M02 * c;
+       *pOutput++ = M10 * a + M11 * b + M12 * c;
+       *pOutput++ = M20 * a + M21 * b + M22 * c;
+
+       /* Loop-based vectorization.  */
+       *pOutput2++ = K00 * d;
+    }
+}
+
+int main (int argc, const char* argv[])
+{
+  int input[N], output[N], i;
+  int check_results[N] = {1470, 395, 28271, 5958, 1655, 111653, 10446, 2915, 195035, 14934, 4175, 278417, 19422, 5435, 361799, 0};
+  int input2[N], output2[N];
+  int check_results2[N] = {0, 405, 810, 1215, 1620, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
+
+  for (i = 0; i < N; i++)
+    {
+      input[i] = i%256;
+      input2[i] = i%256;
+      output[i] = 0;
+      output2[i] = 0;
+      if (input[i] > 200)
+        abort ();
+    }
+
+  foo (input, output, input2, output2);
+
+  for (i = 0; i < N; i++)
+     if (output[i] != check_results[i] || output2[i] != check_results2[i])
+       abort ();
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect"  { target vect_perm } } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { target vect_perm } } } */
+/* { dg-final { cleanup-tree-dump "vect" } } */
+
+
Index: testsuite/gcc.dg/vect/slp-perm-8.c
===================================================================
--- testsuite/gcc.dg/vect/slp-perm-8.c	(revision 0)
+++ testsuite/gcc.dg/vect/slp-perm-8.c	(revision 0)
@@ -0,0 +1,58 @@
+/* { dg-require-effective-target vect_int } */
+
+#include <stdarg.h>
+#include <stdio.h>
+#include "tree-vect.h"
+
+#define N 200
+
+void foo (unsigned char *__restrict__ pInput, unsigned char *__restrict__ pOutput)
+{
+  unsigned char i, a, b, c;
+
+  for (i = 0; i < N / 3; i++)
+    {
+       a = *pInput++;
+       b = *pInput++;
+       c = *pInput++;
+
+       *pOutput++ = a + b + c + 3;
+       *pOutput++ = a + b + c + 12;
+       *pOutput++ = a + b + c + 1;
+    }
+}
+
+int main (int argc, const char* argv[])
+{
+  unsigned char input[N], output[N], i;
+  unsigned char check_results[N];
+
+  for (i = 0; i < N; i++)
+    {
+      input[i] = i;
+      output[i] = 0;
+      if (input[i] > 256)
+        abort ();
+    }
+
+  for (i = 0; i < N / 3; i++)
+    {
+      check_results[3*i] = 9 * i + 6;
+      check_results[3*i+1] = 9 * i + 15;
+      check_results[3*i+2] = 9 * i + 4;
+    }
+
+  foo (input, output);
+
+  for (i = 0; i < N - (N % 3); i++)
+     if (output[i] != check_results[i]) 
+       abort ();
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 2 "vect" { target vect_perm } } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { target vect_perm } } } */
+/* { dg-final { cleanup-tree-dump "vect" } } */
+
+
Index: testsuite/gcc.dg/vect/slp-perm-1.c
===================================================================
--- testsuite/gcc.dg/vect/slp-perm-1.c	(revision 0)
+++ testsuite/gcc.dg/vect/slp-perm-1.c	(revision 0)
@@ -0,0 +1,61 @@
+/* { dg-require-effective-target vect_int } */
+
+#include <stdarg.h>
+#include <stdio.h>
+#include "tree-vect.h"
+
+#define M00 100
+#define M10 216
+#define M20 23
+#define M01 1322
+#define M11 13
+#define M21 27271
+#define M02 74
+#define M12 191
+#define M22 500
+
+#define N 16
+
+void foo (unsigned int *__restrict__ pInput, unsigned int *__restrict__ pOutput)
+{
+  unsigned int i, a, b, c;
+
+  for (i = 0; i < N / 3; i++)
+    {
+       a = *pInput++;
+       b = *pInput++;
+       c = *pInput++;
+
+       *pOutput++ = M00 * a + M01 * b + M02 * c;
+       *pOutput++ = M10 * a + M11 * b + M12 * c;
+       *pOutput++ = M20 * a + M21 * b + M22 * c;
+    }
+}
+
+int main (int argc, const char* argv[])
+{
+  unsigned int input[N], output[N], i;
+  unsigned int check_results[N] = {1470, 395, 28271, 5958, 1655, 111653, 10446, 2915, 195035, 14934, 4175, 278417, 19422, 5435, 361799, 0};
+
+  for (i = 0; i < N; i++)
+    {
+      input[i] = i%256;
+      if (input[i] > 200)
+        abort();
+      output[i] = 0;
+    }
+
+  foo (input, output);
+
+  for (i = 0; i < N; i++)
+     if (output[i] != check_results[i])
+       abort ();
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect"  { target vect_perm } } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { target vect_perm } } } */
+/* { dg-final { cleanup-tree-dump "vect" } } */
+
+
Index: testsuite/gcc.dg/vect/slp-perm-9.c
===================================================================
--- testsuite/gcc.dg/vect/slp-perm-9.c	(revision 0)
+++ testsuite/gcc.dg/vect/slp-perm-9.c	(revision 0)
@@ -0,0 +1,58 @@
+/* { dg-require-effective-target vect_int } */
+
+#include <stdarg.h>
+#include <stdio.h>
+#include "tree-vect.h"
+
+#define N 200
+
+void foo (unsigned short *__restrict__ pInput, unsigned short *__restrict__ pOutput)
+{
+  unsigned short i, a, b, c;
+
+  for (i = 0; i < N / 3; i++)
+    {
+       a = *pInput++;
+       b = *pInput++;
+       c = *pInput++;
+
+       *pOutput++ = a + b + c + 3;
+       *pOutput++ = a + b + c + 12;
+       *pOutput++ = a + b + c + 1;
+    }
+}
+
+int main (int argc, const char* argv[])
+{
+  unsigned short input[N], output[N], i;
+  unsigned short check_results[N];
+
+  for (i = 0; i < N; i++)
+    {
+      input[i] = i;
+      output[i] = 0;
+      if (input[i] > 256)
+        abort ();
+    }
+
+  for (i = 0; i < N / 3; i++)
+    {
+      check_results[3*i] = 9 * i + 6;
+      check_results[3*i+1] = 9 * i + 15;
+      check_results[3*i+2] = 9 * i + 4;
+    }
+
+  foo (input, output);
+
+  for (i = 0; i < N - (N % 3); i++)
+     if (output[i] != check_results[i]) 
+       abort ();
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect"  } } */
+/* { dg-final { scan-tree-dump-times "permutation requires at least three vectors" 1 "vect" { target vect_perm } } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 0 "vect"  } } */
+/* { dg-final { cleanup-tree-dump "vect" } } */
+
Index: testsuite/gcc.dg/vect/slp-perm-2.c
===================================================================
--- testsuite/gcc.dg/vect/slp-perm-2.c	(revision 0)
+++ testsuite/gcc.dg/vect/slp-perm-2.c	(revision 0)
@@ -0,0 +1,56 @@
+/* { dg-require-effective-target vect_int } */
+
+#include <stdarg.h>
+#include <stdio.h>
+#include "tree-vect.h"
+
+#define M00 100
+#define M10 216
+#define M01 1322
+#define M11 13
+#define M02 74
+#define M12 191
+
+#define N 16
+
+void foo (unsigned int *__restrict__ pInput, unsigned int *__restrict__ pOutput)
+{
+  unsigned int i, a, b;
+
+  for (i = 0; i < N / 2; i++)
+    {
+       a = *pInput++;
+       b = *pInput++;
+
+       *pOutput++ = M00 * a + M01 * b;
+       *pOutput++ = M10 * a + M11 * b;
+    }
+}
+
+int main (int argc, const char* argv[])
+{
+  unsigned int input[N], output[N], i;
+  unsigned int check_results[N] = {1322, 13, 4166, 471, 7010, 929, 9854, 1387, 12698, 1845, 15542, 2303, 18386, 2761, 21230, 3219};
+
+  for (i = 0; i < N; i++)
+    {
+      input[i] = i%256;
+      if (input[i] > 200)
+        abort();
+      output[i] = 0;
+    }
+
+  foo (input, output);
+
+  for (i = 0; i < N; i++)
+     if (output[i] != check_results[i])
+       abort ();
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect"  { target vect_perm } } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { target vect_perm } } } */
+/* { dg-final { cleanup-tree-dump "vect" } } */
+
+
Index: testsuite/gcc.dg/vect/slp-perm-3.c
===================================================================
--- testsuite/gcc.dg/vect/slp-perm-3.c	(revision 0)
+++ testsuite/gcc.dg/vect/slp-perm-3.c	(revision 0)
@@ -0,0 +1,70 @@
+/* { dg-require-effective-target vect_int } */
+
+#include <stdarg.h>
+#include <stdio.h>
+#include "tree-vect.h"
+
+#define M00 100
+#define M10 216
+#define M20 23
+#define M30 237
+#define M01 1322
+#define M11 13
+#define M21 27271
+#define M31 2280
+#define M02 74
+#define M12 191
+#define M22 500
+#define M32 111
+#define M03 134
+#define M13 117
+#define M23 11
+#define M33 771
+
+#define N 16
+
+void foo (unsigned int *__restrict__ pInput, unsigned int *__restrict__ pOutput)
+{
+  unsigned int i, a, b, c, d;
+
+  for (i = 0; i < N / 4; i++)
+    {
+       a = *pInput++;
+       b = *pInput++;
+       c = *pInput++;
+       d = *pInput++;
+
+       *pOutput++ = M00 * a + M01 * b + M02 * c + M03 * d;
+       *pOutput++ = M10 * a + M11 * b + M12 * c + M13 * d;
+       *pOutput++ = M20 * a + M21 * b + M22 * c + M23 * d;
+       *pOutput++ = M30 * a + M31 * b + M32 * c + M33 * d;
+    }
+}
+
+int main (int argc, const char* argv[])
+{
+  unsigned int input[N], output[N], i;
+  unsigned int check_results[N] = {1872, 746, 28304, 4815, 8392, 2894, 139524, 18411, 14912, 5042, 250744, 32007, 21432, 7190, 361964, 45603};
+
+  for (i = 0; i < N; i++)
+    {
+      input[i] = i%256;
+      if (input[i] > 200)
+        abort();
+      output[i] = 0;
+    }
+
+  foo (input, output);
+
+  for (i = 0; i < N; i++)
+     if (output[i] != check_results[i])
+       abort ();
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect"  { target vect_perm } } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { target vect_perm } } } */
+/* { dg-final { cleanup-tree-dump "vect" } } */
+
+
Index: testsuite/gcc.dg/vect/slp-perm-4.c
===================================================================
--- testsuite/gcc.dg/vect/slp-perm-4.c	(revision 0)
+++ testsuite/gcc.dg/vect/slp-perm-4.c	(revision 0)
@@ -0,0 +1,87 @@
+/* { dg-require-effective-target vect_int } */
+
+#include <stdarg.h>
+#include <stdio.h>
+#include "tree-vect.h"
+
+#define M00 100
+#define M10 216
+#define M20 23
+#define M30 237
+#define M40 437
+
+#define M01 1322
+#define M11 13
+#define M21 27271
+#define M31 2280
+#define M41 284
+
+#define M02 74
+#define M12 191
+#define M22 500
+#define M32 111
+#define M42 1114
+
+#define M03 134
+#define M13 117
+#define M23 11
+#define M33 771
+#define M43 71
+
+#define M04 334
+#define M14 147
+#define M24 115
+#define M34 7716
+#define M44 16
+
+
+#define N 16
+
+void foo (unsigned int *__restrict__ pInput, unsigned int *__restrict__ pOutput)
+{
+  unsigned int i, a, b, c, d, e;
+
+  for (i = 0; i < N / 5; i++)
+    {
+       a = *pInput++;
+       b = *pInput++;
+       c = *pInput++;
+       d = *pInput++;
+       e = *pInput++;
+
+       *pOutput++ = M00 * a + M01 * b + M02 * c + M03 * d + M04 * e;
+       *pOutput++ = M10 * a + M11 * b + M12 * c + M13 * d + M14 * e;
+       *pOutput++ = M20 * a + M21 * b + M22 * c + M23 * d + M24 * e;
+       *pOutput++ = M30 * a + M31 * b + M32 * c + M33 * d + M34 * e;
+       *pOutput++ = M40 * a + M41 * b + M42 * c + M43 * d + M44 * e;
+    }
+}
+
+int main (int argc, const char* argv[])
+{
+  unsigned int input[N], output[N], i;
+  unsigned int check_results[N] = {3208, 1334, 28764, 35679, 2789, 13028, 4754, 168364, 91254, 12399, 22848, 8174, 307964, 146829, 22009, 0};
+  
+  for (i = 0; i < N; i++)
+    {
+      input[i] = i%256;
+      if (input[i] > 200)
+        abort();
+      output[i] = 0;
+    }
+
+  foo (input, output);
+
+  for (i = 0; i < N - (N % 5); i++)
+     if (output[i] != check_results[i])
+       abort ();
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 0 "vect"  } } */
+/* { dg-final { scan-tree-dump-times "permutation requires at least three vectors" 1 "vect" { target vect_perm } } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 0 "vect"  } } */
+/* { dg-final { cleanup-tree-dump "vect" } } */
+
+
Index: testsuite/gcc.dg/vect/slp-perm-5.c
===================================================================
--- testsuite/gcc.dg/vect/slp-perm-5.c	(revision 0)
+++ testsuite/gcc.dg/vect/slp-perm-5.c	(revision 0)
@@ -0,0 +1,77 @@
+/* { dg-require-effective-target vect_int } */
+
+#include <stdarg.h>
+#include <stdio.h>
+#include "tree-vect.h"
+
+#define M00 100
+#define M10 216
+#define M20 23
+#define M01 1322
+#define M11 13
+#define M21 27271
+#define M02 74
+#define M12 191
+#define M22 500
+
+#define K00 405
+#define K10 112
+#define K01 4322
+#define K11 135
+
+#define N 16
+
+void foo (int *__restrict__ pInput, int *__restrict__ pOutput,
+          int *__restrict__ pInput2, int *__restrict__ pOutput2)
+{
+  int i, a, b, c, d, e;
+
+  for (i = 0; i < N / 3; i++)
+    {
+       a = *pInput++;
+       b = *pInput++;
+       c = *pInput++;
+
+       d = *pInput2++;
+       e = *pInput2++;
+
+       *pOutput++ = M00 * a + M01 * b + M02 * c;
+       *pOutput++ = M10 * a + M11 * b + M12 * c;
+       *pOutput++ = M20 * a + M21 * b + M22 * c;
+
+       *pOutput2++ = K00 * d + K01 * e;
+       *pOutput2++ = K10 * d + K11 * e;
+    }
+}
+
+int main (int argc, const char* argv[])
+{
+  int input[N], output[N], i;
+  int check_results[N] = {1470, 395, 28271, 5958, 1655, 111653, 10446, 2915, 195035, 14934, 4175, 278417, 19422, 5435, 361799, 0};
+  int input2[N], output2[N];
+  int check_results2[N] = {4322, 135, 13776, 629, 23230, 1123, 32684, 1617, 42138, 2111, 0, 0, 0, 0, 0, 0};
+
+  for (i = 0; i < N; i++)
+    {
+      input[i] = i%256;
+      input2[i] = i%256;
+      output[i] = 0;
+      output2[i] = 0;
+      if (input[i] > 256)
+        abort ();
+    }
+
+  foo (input, output, input2, output2);
+
+  for (i = 0; i < N; i++)
+     if (output[i] != check_results[i] || output2[i] != check_results2[i])
+       abort ();
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect"  { target vect_perm } } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect" { target vect_perm } } } */
+/* { dg-final { cleanup-tree-dump "vect" } } */
+
+
Index: testsuite/gcc.dg/vect/slp-perm-6.c
===================================================================
--- testsuite/gcc.dg/vect/slp-perm-6.c	(revision 0)
+++ testsuite/gcc.dg/vect/slp-perm-6.c	(revision 0)
@@ -0,0 +1,78 @@
+/* { dg-require-effective-target vect_int } */
+
+#include <stdarg.h>
+#include <stdio.h>
+#include "tree-vect.h"
+
+#define M00 100
+#define M10 216
+#define M20 23
+#define M01 1322
+#define M11 13
+#define M21 27271
+#define M02 74
+#define M12 191
+#define M22 500
+
+#define K00 405
+#define K10 112
+#define K01 4322
+#define K11 135
+
+#define N 16
+
+void foo (int *__restrict__ pInput, int *__restrict__ pOutput,
+          int *__restrict__ pInput2, int *__restrict__ pOutput2)
+{
+  int i, a, b, c, d, e;
+
+  for (i = 0; i < N / 3; i++)
+    {
+       a = *pInput++;
+       b = *pInput++;
+       c = *pInput++;
+
+       d = *pInput2++;
+       e = *pInput2++;
+
+       *pOutput++ = M00 * a + M01 * b + M02 * c;
+       *pOutput++ = M10 * a + M11 * b + M12 * c;
+       *pOutput++ = M20 * a + M21 * b + M22 * c;
+
+       /* Regular SLP - no permutation required.  */
+       *pOutput2++ = K00 * d;
+       *pOutput2++ = K10 * e;
+    }
+}
+
+int main (int argc, const char* argv[])
+{
+  int input[N], output[N], i;
+  int check_results[N] = {1470, 395, 28271, 5958, 1655, 111653, 10446, 2915, 195035, 14934, 4175, 278417, 19422, 5435, 361799, 0};
+  int input2[N], output2[N];
+  int check_results2[N] = {0, 112, 810, 336, 1620, 560, 2430, 784, 3240, 1008, 0, 0, 0, 0, 0, 0};
+
+  for (i = 0; i < N; i++)
+    {
+      input[i] = i%256;
+      input2[i] = i%256;
+      output[i] = 0;
+      output2[i] = 0;
+      if (input[i] > 256)
+        abort ();
+    }
+
+  foo (input, output, input2, output2);
+
+  for (i = 0; i < N; i++)
+     if (output[i] != check_results[i] || output2[i] != check_results2[i])
+       abort ();
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect"  { target vect_perm } } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect" { target vect_perm } } } */
+/* { dg-final { cleanup-tree-dump "vect" } } */
+
+
Index: testsuite/lib/target-supports.exp
===================================================================
--- testsuite/lib/target-supports.exp	(revision 137312)
+++ testsuite/lib/target-supports.exp	(working copy)
@@ -1531,6 +1531,28 @@ proc check_effective_target_vect_no_bitw
     return $et_vect_no_bitwise_saved
 }
 
+# Return 1 if the target plus current options supports vector permutation,
+# 0 otherwise.
+#
+# This won't change for different subtargets so cache the result.
+
+proc check_effective_target_vect_perm { } {
+    global et_vect_perm_saved
+
+    if [info exists et_vect_perm_saved] {
+        verbose "check_effective_target_vect_perm: using cached result" 2
+    } else {
+        set et_vect_perm_saved 0
+        if { [istarget powerpc*-*-*]
+             || [istarget spu-*-*] } {
+            set et_vect_perm_saved 1
+        }
+    }
+    verbose "check_effective_target_vect_perm: returning $et_vect_perm_saved" 2
+    return $et_vect_perm_saved
+}
+
+
 # Return 1 if the target plus current options supports a vector
 # widening summation of *short* args into *int* result, 0 otherwise.
 # A target can also support this widening summation if it can support
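
The hard-coded check_results tables in the testcases above are the scalar
results of the loops under test. As a rough cross-check, and not as part of
the patch, the sketch below recomputes the table used by slp-perm-1.c from its
M<row><col> coefficients; the helper name reference_3x3 and the coeff array
are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Coefficients copied from slp-perm-1.c; coeff[r][c] is M<r><c>.  */
static const unsigned coeff[3][3] = {
  { 100,  1322,  74 },   /* M00 M01 M02 */
  { 216,    13, 191 },   /* M10 M11 M12 */
  {  23, 27271, 500 },   /* M20 M21 M22 */
};

/* Scalar reference for the loop in foo (): every group of three
   consecutive inputs produces three consecutive outputs.  */
static void
reference_3x3 (const unsigned *in, unsigned *out, size_t groups)
{
  assert (in != out);
  for (size_t g = 0; g < groups; g++)
    for (size_t r = 0; r < 3; r++)
      {
        unsigned sum = 0;

        for (size_t c = 0; c < 3; c++)
          sum += coeff[r][c] * in[3 * g + c];
        out[3 * g + r] = sum;
      }
}
```

With input[i] = i and N = 16, only N / 3 = 5 groups are processed, so the
last output element keeps its initial value of 0, matching the trailing 0 in
check_results.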


end of thread, other threads:[~2008-08-28  9:23 UTC | newest]

Thread overview: 8+ messages
     [not found] <OFA024DE88.71972625-ONC22574B0.0042F525-C22574B0.0043F9D7@LocalDomain>
2008-08-29  8:43 ` [patch][vectorizer, SPU, PPC] Support load permutation in loop-aware SLP Dorit Nuzman
     [not found] <OF3E4AB8DB.84B1F38D-ONC22574AE.00224129-C22574AE.0025CAB0@LocalDomain>
2008-08-25 12:55 ` Ira Rosen
     [not found] <OF078FB537.452762A7-ONC22574AC.003A132A-C22574AC.003E6955@LocalDomain>
2008-08-23  9:21 ` Dorit Nuzman
     [not found] <OF4F4F147D.44C29413-ONC225749F.004AA5E5-C225749F.005304A0@LocalDomain>
2008-08-21 11:45 ` Ira Rosen
2008-07-06  9:24 Ira Rosen
2008-07-06 22:14 ` Andrew Pinski
2008-07-07  0:45 ` David Edelsohn
2008-08-08 15:11 ` Dorit Nuzman
