public inbox for gcc-patches@gcc.gnu.org
* Patch ping
@ 2011-11-02 20:19 Jakub Jelinek
  2011-11-04 10:11 ` Richard Guenther
  0 siblings, 1 reply; 8+ messages in thread
From: Jakub Jelinek @ 2011-11-02 20:19 UTC (permalink / raw)
  To: Richard Guenther, Ira Rosen, Richard Henderson; +Cc: gcc-patches

Hi!

- Gather vectorization patch + incremental patches
  http://gcc.gnu.org/ml/gcc-patches/2011-10/msg02411.html
  http://gcc.gnu.org/ml/gcc-patches/2011-10/msg02846.html
  http://gcc.gnu.org/ml/gcc-patches/2011-10/msg02851.html

	Jakub


* Re: Patch ping
  2011-11-02 20:19 Patch ping Jakub Jelinek
@ 2011-11-04 10:11 ` Richard Guenther
  2011-11-04 10:39   ` Jakub Jelinek
  0 siblings, 1 reply; 8+ messages in thread
From: Richard Guenther @ 2011-11-04 10:11 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Ira Rosen, Richard Henderson, gcc-patches

On Wed, 2 Nov 2011, Jakub Jelinek wrote:

> Hi!
> 
> - Gather vectorization patch + incremental patches
>   http://gcc.gnu.org/ml/gcc-patches/2011-10/msg02411.html

I'm not sure I like using new builtins for gather representation
on the tree level too much, given that we are now moving
towards using tree codes for shuffle.  Thus, how complicated
would it be to have a gather tree code and optab and to
handle the mixed size index issue in the expander?

I realize this would be quite some reorg to the patchset ...
so, why did you choose builtins over a more generic approach?

>   http://gcc.gnu.org/ml/gcc-patches/2011-10/msg02846.html

+  if (TREE_CODE (base) == MEM_REF)
     {
-      off = TREE_OPERAND (base, 1);
+      if (!integer_zerop (TREE_OPERAND (base, 1)))
+	{
+	  if (off == NULL_TREE)
+	    {
+	      double_int moff = mem_ref_offset (base);
+	      off = double_int_to_tree (sizetype, moff);
+	    }
+	  else
+	    off = size_binop (PLUS_EXPR, off, TREE_OPERAND (base, 1));

that's not safe, TREE_OPERAND (base, 1) is of pointer type, so
you unconditionally need to convert TREE_OPERAND (base, 1)
to sizetype.
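
E.g. something along these lines should do (just a sketch of the fix):

      off = size_binop (PLUS_EXPR, off,
                        fold_convert (sizetype, TREE_OPERAND (base, 1)));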

The routine lacks comments - it's got quite big and fails to
state any reason for its complexity.  I'm also not sure why
DR would include any loop invariant parts of the SCEV - doesn't
it instantiate just for the loop in question?

Thanks,
Richard.


* Re: Patch ping
  2011-11-04 10:11 ` Richard Guenther
@ 2011-11-04 10:39   ` Jakub Jelinek
  2011-11-04 11:44     ` Richard Guenther
  0 siblings, 1 reply; 8+ messages in thread
From: Jakub Jelinek @ 2011-11-04 10:39 UTC (permalink / raw)
  To: Richard Guenther; +Cc: Ira Rosen, Richard Henderson, gcc-patches

On Fri, Nov 04, 2011 at 10:52:44AM +0100, Richard Guenther wrote:
> > - Gather vectorization patch + incremental patches
> >   http://gcc.gnu.org/ml/gcc-patches/2011-10/msg02411.html
> 
> I'm not sure I like using new builtins for gather representation
> on the tree level too much, given that we are now moving
> towards using tree codes for shuffle.  Thus, how complicated
> would it be to have a gather tree code and optab and to
> handle the mixed size index issue in the expander?
> 
> I realize this would be quite some reorg to the patchset ...
> so, why did you choose builtins over a more generic approach?

Because while permutations etc. are common to most targets,
gather is currently very specialized and specific to one target,
with lots of details about how it must look (e.g. the mask stuff, where
we currently don't even have a tree code for conditional loads or
conditional stores).  Additionally, unlike VEC_PERM_EXPR etc., which
are normal expressions, this one is a reference (and a conditional one
too), so I'm afraid I'd need to touch huge amounts of code (most places
that currently handle MEM_REF/TARGET_MEM_REF would need to handle
VEC_GATHER_MEM_REF too, as it is a memory read - well, a set of
conditional memory reads).  The i?86 backend already has all the needed
builtins anyway (except 4) and handles expanding them too; those 4 are
just there to cope with the weirdo definition of some of them (half
sized vectors).  And when it is represented as a builtin, the
side-effects are handled by all optimization passes automatically,
similarly to how e.g. atomic builtins are builtins and not expressions
right now.

So I thought it is better to use builtins right now; then, when we
hopefully do something about conditional loads/stores and their
vectorization in 4.8+ and introduce some new GIMPLE representation for
that, this could be done on top of it.
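
For illustration (just a sketch, the SSA names are made up), with the
builtin representation the vectorized loop body simply ends up with a
call of the shape the patch builds with gimple_build_call, e.g. for a
V4DF load with a V4SI index:

  vect_idx_3 = ...;   /* vectorized index, V4SI */
  vect_ld_4 = __builtin_ia32_gathersiv4df (mask_2, base_1, vect_idx_3,
                                           mask_2, 8);

i.e. an ordinary call statement that every pass already treats
conservatively as a (possible) memory read.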

> >   http://gcc.gnu.org/ml/gcc-patches/2011-10/msg02846.html
> 
> +  if (TREE_CODE (base) == MEM_REF)
>      {
> -      off = TREE_OPERAND (base, 1);
> +      if (!integer_zerop (TREE_OPERAND (base, 1)))
> +	{
> +	  if (off == NULL_TREE)
> +	    {
> +	      double_int moff = mem_ref_offset (base);
> +	      off = double_int_to_tree (sizetype, moff);
> +	    }
> +	  else
> +	    off = size_binop (PLUS_EXPR, off, TREE_OPERAND (base, 1));
> 
> that's not safe, TREE_OPERAND (base, 1) is of pointer type, so
> you unconditionally need to convert TREE_OPERAND (base, 1)
> to sizetype.

Ok, will fix.

> The routine lacks comments - it's got quite big and fails to

And add the comments.

> state any reason for its complexity.  I'm also not sure why
> DR would include any loop invariant parts of the SCEV - doesn't
> it instantiate just for the loop in question?

I'm not sure I understand your question.  With the incremental
patch I'm not using any DR info apart from DR_REF to determine
which part of the address is loop invariant and which is not.
The reason for that is that split_constant_offset turns the GIMPLE
code into a sometimes big tree, which may actually contain a mixture
of loop invariants/constants and SSA_NAMEs defined in the loop,
all of that with casts, multiplications, additions and subtractions.
For gather I need to split it into a single loop invariant
argument (which can be computed before the loop, as it is loop
invariant, and thus can be an arbitrary tree expression that is just
gimplified there) and another SSA_NAME defined in the loop which can
be vectorized and is perhaps sign-extended and multiplied by 1/2/4/8.

With the approach the incremental patch takes I just walk what
split_constant_offset produced during the DR analysis and peel off
loop invariants until I have something that should be used as the
vectorized index.
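
As a made-up example (not from the testsuite), for something like

  void
  foo (double *p, int *idx, double *q, int off, int n)
  {
    int i;
    for (i = 0; i < n; i++)
      q[i] = p[idx[i] + off];
  }

the address p + (idx[i] + off) * 8 needs to be split into the loop
invariant base p + off * 8 (gimplified before the loop) and the in-loop
index idx[i], which is sign-extended and scaled by 8 in the gather.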

	Jakub


* Re: Patch ping
  2011-11-04 10:39   ` Jakub Jelinek
@ 2011-11-04 11:44     ` Richard Guenther
  2011-11-04 13:32       ` [PATCH] Gather vectorization (PR tree-optimization/50789, take 2) Jakub Jelinek
  2011-11-04 14:09       ` Patch ping Michael Matz
  0 siblings, 2 replies; 8+ messages in thread
From: Richard Guenther @ 2011-11-04 11:44 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Ira Rosen, Richard Henderson, gcc-patches

On Fri, 4 Nov 2011, Jakub Jelinek wrote:

> On Fri, Nov 04, 2011 at 10:52:44AM +0100, Richard Guenther wrote:
> > > - Gather vectorization patch + incremental patches
> > >   http://gcc.gnu.org/ml/gcc-patches/2011-10/msg02411.html
> > 
> > I'm not sure I like using new builtins for gather representation
> > on the tree level too much, given that we are now moving
> > towards using tree codes for shuffle.  Thus, how complicated
> > would it be to have a gather tree code and optab and to
> > handle the mixed size index issue in the expander?
> > 
> > I realize this would be quite some reorg to the patchset ...
> > so, why did you choose builtins over a more generic approach?
> 
> Because while permutations etc. are common to most targets,
> gather is currently very specialized and specific to one target,
> with lots of details about how it must look (e.g. the mask stuff, where
> we currently don't even have a tree code for conditional loads or
> conditional stores).  Additionally, unlike VEC_PERM_EXPR etc., which
> are normal expressions, this one is a reference (and a conditional one
> too), so I'm afraid I'd need to touch huge amounts of code (most places
> that currently handle MEM_REF/TARGET_MEM_REF would need to handle
> VEC_GATHER_MEM_REF too, as it is a memory read - well, a set of
> conditional memory reads).  The i?86 backend already has all the needed
> builtins anyway (except 4) and handles expanding them too; those 4 are
> just there to cope with the weirdo definition of some of them (half
> sized vectors).  And when it is represented as a builtin, the
> side-effects are handled by all optimization passes automatically,
> similarly to how e.g. atomic builtins are builtins and not expressions
> right now.
> 
> So I thought it is better to use builtins right now; then, when we
> hopefully do something about conditional loads/stores and their
> vectorization in 4.8+ and introduce some new GIMPLE representation for
> that, this could be done on top of it.

Ok.  I guess it's ok to use builtins for now - I didn't think of
the memory reference issue ;)

> > >   http://gcc.gnu.org/ml/gcc-patches/2011-10/msg02846.html
> > 
> > +  if (TREE_CODE (base) == MEM_REF)
> >      {
> > -      off = TREE_OPERAND (base, 1);
> > +      if (!integer_zerop (TREE_OPERAND (base, 1)))
> > +	{
> > +	  if (off == NULL_TREE)
> > +	    {
> > +	      double_int moff = mem_ref_offset (base);
> > +	      off = double_int_to_tree (sizetype, moff);
> > +	    }
> > +	  else
> > +	    off = size_binop (PLUS_EXPR, off, TREE_OPERAND (base, 1));
> > 
> > that's not safe, TREE_OPERAND (base, 1) is of pointer type, so
> > you unconditionally need to convert TREE_OPERAND (base, 1)
> > to sizetype.
> 
> Ok, will fix.
> 
> > The routine lacks comments - it's got quite big and fails to
> 
> And add the comments.
> 
> > state any reason for its complexity.  I'm also not sure why
> > DR would include any loop invariant parts of the SCEV - doesn't
> > it instantiate just for the loop in question?
> 
> I'm not sure I understand your question.  With the incremental
> patch I'm not using any DR info apart from DR_REF to determine
> which part of the address is loop invariant and which is not.
> The reason for that is that split_constant_offset turns the GIMPLE
> code into a sometimes big tree, which may actually contain a mixture
> of loop invariants/constants and SSA_NAMEs defined in the loop,
> all of that with casts, multiplications, additions and subtractions.
> For gather I need to split it into a single loop invariant
> argument (which can be computed before the loop, as it is loop
> invariant, and thus can be an arbitrary tree expression that is just
> gimplified there) and another SSA_NAME defined in the loop which can
> be vectorized and is perhaps sign-extended and multiplied by 1/2/4/8.
> 
> With the approach the incremental patch takes I just walk what
> split_constant_offset produced during the DR analysis and peel off
> loop invariants until I have something that should be used as the
> vectorized index.

It looks like split_constant_offset walks def stmts in an unbounded
fashion.  That's surely a bad idea - SCEV should already have
expanded everything non-loop-invariant, thus it should at most
look through DEFs that trivially add to the constant offset,
not through others.

Richard.


* [PATCH] Gather vectorization (PR tree-optimization/50789, take 2)
  2011-11-04 11:44     ` Richard Guenther
@ 2011-11-04 13:32       ` Jakub Jelinek
  2011-11-04 23:54         ` Richard Guenther
  2011-11-07 15:49         ` Uros Bizjak
  2011-11-04 14:09       ` Patch ping Michael Matz
  1 sibling, 2 replies; 8+ messages in thread
From: Jakub Jelinek @ 2011-11-04 13:32 UTC (permalink / raw)
  To: Richard Guenther, Ira Rosen, Uros Bizjak, Richard Henderson; +Cc: gcc-patches

Hi!

On Fri, Nov 04, 2011 at 12:21:49PM +0100, Richard Guenther wrote:
> Ok.  I guess it's ok to use builtins for now - I didn't think of
> the memory reference issue ;)

Based on IRC discussion I'm posting an updated patch (both former
patches (base and incremental) in one).

I'm now using expr_invariant_in_loop_p instead of chrec_contains*,
which also nicely handles the is_gimple_min_invariant case,
and I've added several comments and fixed the MEM_REF offset
folding.  Smoke tested on the *gather* testcases, will do full
bootstrap/regtest soon.

2011-11-04  Jakub Jelinek  <jakub@redhat.com>

	PR tree-optimization/50789
	* tree-vect-stmts.c (process_use): Add force argument, avoid
	exist_non_indexing_operands_for_use_p check if true.
	(vect_mark_stmts_to_be_vectorized): Adjust callers.  Handle
	STMT_VINFO_GATHER_P.
	(gen_perm_mask): New function.
	(perm_mask_for_reverse): Use it.
	(reverse_vec_element): Rename to...
	(permute_vec_elements): ... this.  Add Y and MASK_VEC arguments,
	generalize for any permutations.
	(vectorizable_load): Adjust caller.  Handle STMT_VINFO_GATHER_P.
	* target.def (TARGET_VECTORIZE_BUILTIN_GATHER): New hook.
	* doc/tm.texi.in (TARGET_VECTORIZE_BUILTIN_GATHER): Document it.
	* doc/tm.texi: Regenerate.
	* tree-data-ref.c (initialize_data_dependence_relation,
	compute_self_dependence): No longer static.
	* tree-data-ref.h (initialize_data_dependence_relation,
	compute_self_dependence): New prototypes.
	* tree-vect-data-refs.c (vect_check_gather): New function.
	(vect_analyze_data_refs): Detect possible gather load data
	refs.
	* tree-vectorizer.h (struct _stmt_vec_info): Add gather_p field.
	(STMT_VINFO_GATHER_P): Define.
	(vect_check_gather): New prototype.
	* config/i386/i386-builtin-types.def: Add types for alternate
	gather builtins.
	* config/i386/sse.md (AVXMODE48P_DI): Remove.
	(VEC_GATHER_MODE): Rename mode_attr to...
	(VEC_GATHER_IDXSI): ... this.
	(VEC_GATHER_IDXDI, VEC_GATHER_SRCDI): New mode_attrs.
	(avx2_gathersi<mode>, *avx2_gathersi<mode>): Use <VEC_GATHER_IDXSI>
	instead of <VEC_GATHER_MODE>.
	(avx2_gatherdi<mode>): Use <VEC_GATHER_IDXDI> instead of
	<AVXMODE48P_DI> and <VEC_GATHER_SRCDI> instead of VEC_GATHER_MODE
	on src and mask operands.
	(*avx2_gatherdi<mode>): Likewise.  Use VEC_GATHER_MODE iterator
	instead of AVXMODE48P_DI.
	(avx2_gatherdi<mode>256, *avx2_gatherdi<mode>256): Removed.
	* config/i386/i386.c (enum ix86_builtins): Add
	IX86_BUILTIN_GATHERALTSIV4DF, IX86_BUILTIN_GATHERALTDIV8SF,
	IX86_BUILTIN_GATHERALTSIV4DI and IX86_BUILTIN_GATHERALTDIV8SI.
	(ix86_init_mmx_sse_builtins): Create those builtins.
	(ix86_expand_builtin): Handle those builtins and adjust expansions
	of other gather builtins.
	(ix86_vectorize_builtin_gather): New function.
	(TARGET_VECTORIZE_BUILTIN_GATHER): Define.

	* gcc.target/i386/avx2-gather-1.c: New test.
	* gcc.target/i386/avx2-gather-2.c: New test.
	* gcc.target/i386/avx2-gather-3.c: New test.
	* gcc.target/i386/avx2-gather-4.c: New test.

--- gcc/tree-vect-stmts.c.jj	2011-11-04 08:52:19.000000000 +0100
+++ gcc/tree-vect-stmts.c	2011-11-04 08:54:11.000000000 +0100
@@ -332,6 +332,8 @@ exist_non_indexing_operands_for_use_p (t
    - LIVE_P, RELEVANT - enum values to be set in the STMT_VINFO of the stmt
      that defined USE.  This is done by calling mark_relevant and passing it
      the WORKLIST (to add DEF_STMT to the WORKLIST in case it is relevant).
+   - FORCE is true if the exist_non_indexing_operands_for_use_p check shouldn't
+     be performed.
 
    Outputs:
    Generally, LIVE_P and RELEVANT are used to define the liveness and
@@ -351,7 +353,8 @@ exist_non_indexing_operands_for_use_p (t
 
 static bool
 process_use (gimple stmt, tree use, loop_vec_info loop_vinfo, bool live_p,
-	     enum vect_relevant relevant, VEC(gimple,heap) **worklist)
+	     enum vect_relevant relevant, VEC(gimple,heap) **worklist,
+	     bool force)
 {
   struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
   stmt_vec_info stmt_vinfo = vinfo_for_stmt (stmt);
@@ -363,7 +366,7 @@ process_use (gimple stmt, tree use, loop
 
   /* case 1: we are only interested in uses that need to be vectorized.  Uses
      that are used for address computation are not considered relevant.  */
-  if (!exist_non_indexing_operands_for_use_p (use, stmt))
+  if (!force && !exist_non_indexing_operands_for_use_p (use, stmt))
      return true;
 
   if (!vect_is_simple_use (use, loop_vinfo, NULL, &def_stmt, &def, &dt))
@@ -646,7 +649,7 @@ vect_mark_stmts_to_be_vectorized (loop_v
             break;
         }
 
-      if (is_pattern_stmt_p (vinfo_for_stmt (stmt)))
+      if (is_pattern_stmt_p (stmt_vinfo))
         {
           /* Pattern statements are not inserted into the code, so
              FOR_EACH_PHI_OR_STMT_USE optimizes their operands out, and we
@@ -660,9 +663,9 @@ vect_mark_stmts_to_be_vectorized (loop_v
 	      if (rhs_code == COND_EXPR && COMPARISON_CLASS_P (op))
 		{
 		  if (!process_use (stmt, TREE_OPERAND (op, 0), loop_vinfo,
-				    live_p, relevant, &worklist)
+				    live_p, relevant, &worklist, false)
 		      || !process_use (stmt, TREE_OPERAND (op, 1), loop_vinfo,
-				       live_p, relevant, &worklist))
+				       live_p, relevant, &worklist, false))
 		    {
 		      VEC_free (gimple, heap, worklist);
 		      return false;
@@ -673,7 +676,7 @@ vect_mark_stmts_to_be_vectorized (loop_v
                 {
 		  op = gimple_op (stmt, i);
                   if (!process_use (stmt, op, loop_vinfo, live_p, relevant,
-                                    &worklist))
+				    &worklist, false))
                     {
                       VEC_free (gimple, heap, worklist);
                       return false;
@@ -686,7 +689,7 @@ vect_mark_stmts_to_be_vectorized (loop_v
                 {
                   tree arg = gimple_call_arg (stmt, i);
                   if (!process_use (stmt, arg, loop_vinfo, live_p, relevant,
-                                    &worklist))
+				    &worklist, false))
                     {
                       VEC_free (gimple, heap, worklist);
                       return false;
@@ -699,12 +702,25 @@ vect_mark_stmts_to_be_vectorized (loop_v
           {
             tree op = USE_FROM_PTR (use_p);
             if (!process_use (stmt, op, loop_vinfo, live_p, relevant,
-                              &worklist))
+			      &worklist, false))
               {
                 VEC_free (gimple, heap, worklist);
                 return false;
               }
           }
+
+      if (STMT_VINFO_GATHER_P (stmt_vinfo))
+	{
+	  tree off;
+	  tree decl = vect_check_gather (stmt, loop_vinfo, NULL, &off, NULL);
+	  gcc_assert (decl);
+	  if (!process_use (stmt, off, loop_vinfo, live_p, relevant,
+			    &worklist, true))
+	    {
+	      VEC_free (gimple, heap, worklist);
+	      return false;
+	    }
+	}
     } /* while worklist */
 
   VEC_free (gimple, heap, worklist);
@@ -4142,23 +4158,17 @@ vectorizable_store (gimple stmt, gimple_
   return true;
 }
 
-/* Given a vector type VECTYPE returns a builtin DECL to be used
-   for vector permutation and returns the mask that implements
-   reversal of the vector elements.  If that is impossible to do,
-   returns NULL.  */
+/* Given a vector type VECTYPE and permutation SEL returns
+   the VECTOR_CST mask that implements the permutation of the
+   vector elements.  If that is impossible to do, returns NULL.  */
 
 static tree
-perm_mask_for_reverse (tree vectype)
+gen_perm_mask (tree vectype, unsigned char *sel)
 {
   tree mask_elt_type, mask_type, mask_vec;
   int i, nunits;
-  unsigned char *sel;
 
   nunits = TYPE_VECTOR_SUBPARTS (vectype);
-  sel = XALLOCAVEC (unsigned char, nunits);
-
-  for (i = 0; i < nunits; ++i)
-    sel[i] = nunits - 1 - i;
 
   if (!can_vec_perm_p (TYPE_MODE (vectype), false, sel))
     return NULL;
@@ -4169,33 +4179,52 @@ perm_mask_for_reverse (tree vectype)
   mask_type = get_vectype_for_scalar_type (mask_elt_type);
 
   mask_vec = NULL;
-  for (i = 0; i < nunits; i++)
-    mask_vec = tree_cons (NULL, build_int_cst (mask_elt_type, i), mask_vec);
+  for (i = nunits - 1; i >= 0; i--)
+    mask_vec = tree_cons (NULL, build_int_cst (mask_elt_type, sel[i]),
+			  mask_vec);
   mask_vec = build_vector (mask_type, mask_vec);
 
   return mask_vec;
 }
 
-/* Given a vector variable X, that was generated for the scalar LHS of
-   STMT, generate instructions to reverse the vector elements of X,
-   insert them a *GSI and return the permuted vector variable.  */
+/* Given a vector type VECTYPE returns the VECTOR_CST mask that implements
+   reversal of the vector elements.  If that is impossible to do,
+   returns NULL.  */
 
 static tree
-reverse_vec_elements (tree x, gimple stmt, gimple_stmt_iterator *gsi)
+perm_mask_for_reverse (tree vectype)
+{
+  int i, nunits;
+  unsigned char *sel;
+
+  nunits = TYPE_VECTOR_SUBPARTS (vectype);
+  sel = XALLOCAVEC (unsigned char, nunits);
+
+  for (i = 0; i < nunits; ++i)
+    sel[i] = nunits - 1 - i;
+
+  return gen_perm_mask (vectype, sel);
+}
+
+/* Given vector variables X and Y that were generated for the scalar
+   STMT, generate instructions to permute the vector elements of X and Y
+   using permutation mask MASK_VEC, insert them at *GSI and return the
+   permuted vector variable.  */
+
+static tree
+permute_vec_elements (tree x, tree y, tree mask_vec, gimple stmt,
+		      gimple_stmt_iterator *gsi)
 {
   tree vectype = TREE_TYPE (x);
-  tree mask_vec, perm_dest, data_ref;
+  tree perm_dest, data_ref;
   gimple perm_stmt;
 
-  mask_vec = perm_mask_for_reverse (vectype);
-
   perm_dest = vect_create_destination_var (gimple_assign_lhs (stmt), vectype);
+  data_ref = make_ssa_name (perm_dest, NULL);
 
   /* Generate the permute statement.  */
-  perm_stmt = gimple_build_assign_with_ops3 (VEC_PERM_EXPR, perm_dest,
-					     x, x, mask_vec);
-  data_ref = make_ssa_name (perm_dest, perm_stmt);
-  gimple_set_lhs (perm_stmt, data_ref);
+  perm_stmt = gimple_build_assign_with_ops3 (VEC_PERM_EXPR, data_ref,
+					     x, y, mask_vec);
   vect_finish_stmt_generation (stmt, perm_stmt, gsi);
 
   return data_ref;
@@ -4254,6 +4283,10 @@ vectorizable_load (gimple stmt, gimple_s
   bb_vec_info bb_vinfo = STMT_VINFO_BB_VINFO (stmt_info);
   int vf;
   tree aggr_type;
+  tree gather_base = NULL_TREE, gather_off = NULL_TREE;
+  tree gather_off_vectype = NULL_TREE, gather_decl = NULL_TREE;
+  int gather_scale = 1;
+  enum vect_def_type gather_dt = vect_unknown_def_type;
 
   if (loop_vinfo)
     {
@@ -4334,7 +4367,7 @@ vectorizable_load (gimple stmt, gimple_s
     {
       strided_load = true;
       /* FORNOW */
-      gcc_assert (! nested_in_vect_loop);
+      gcc_assert (! nested_in_vect_loop && !STMT_VINFO_GATHER_P (stmt_info));
 
       first_stmt = GROUP_FIRST_ELEMENT (stmt_info);
       if (!slp && !PURE_SLP_STMT (stmt_info))
@@ -4349,7 +4382,7 @@ vectorizable_load (gimple stmt, gimple_s
 
   if (negative)
     {
-      gcc_assert (!strided_load);
+      gcc_assert (!strided_load && !STMT_VINFO_GATHER_P (stmt_info));
       alignment_support_scheme = vect_supportable_dr_alignment (dr, false);
       if (alignment_support_scheme != dr_aligned
 	  && alignment_support_scheme != dr_unaligned_supported)
@@ -4366,6 +4399,23 @@ vectorizable_load (gimple stmt, gimple_s
 	}
     }
 
+  if (STMT_VINFO_GATHER_P (stmt_info))
+    {
+      gimple def_stmt;
+      tree def;
+      gather_decl = vect_check_gather (stmt, loop_vinfo, &gather_base,
+				       &gather_off, &gather_scale);
+      gcc_assert (gather_decl);
+      if (!vect_is_simple_use_1 (gather_off, loop_vinfo, bb_vinfo,
+				 &def_stmt, &def, &gather_dt,
+				 &gather_off_vectype))
+	{
+	  if (vect_print_dump_info (REPORT_DETAILS))
+	    fprintf (vect_dump, "gather index use not simple.");
+	  return false;
+	}
+    }
+
   if (!vec_stmt) /* transformation not required.  */
     {
       STMT_VINFO_TYPE (stmt_info) = load_vec_info_type;
@@ -4378,6 +4428,161 @@ vectorizable_load (gimple stmt, gimple_s
 
   /** Transform.  **/
 
+  if (STMT_VINFO_GATHER_P (stmt_info))
+    {
+      tree vec_oprnd0 = NULL_TREE, op;
+      tree arglist = TYPE_ARG_TYPES (TREE_TYPE (gather_decl));
+      tree rettype, srctype, ptrtype, idxtype, masktype, scaletype;
+      tree ptr, mask, var, scale, perm_mask = NULL_TREE, prev_res = NULL_TREE;
+      edge pe = loop_preheader_edge (loop);
+      gimple_seq seq;
+      basic_block new_bb;
+      enum { NARROW, NONE, WIDEN } modifier;
+      int gather_off_nunits = TYPE_VECTOR_SUBPARTS (gather_off_vectype);
+
+      if (nunits == gather_off_nunits)
+	modifier = NONE;
+      else if (nunits == gather_off_nunits / 2)
+	{
+	  unsigned char *sel = XALLOCAVEC (unsigned char, gather_off_nunits);
+	  modifier = WIDEN;
+
+	  for (i = 0; i < gather_off_nunits; ++i)
+	    sel[i] = i | nunits;
+
+	  perm_mask = gen_perm_mask (gather_off_vectype, sel);
+	  gcc_assert (perm_mask != NULL_TREE);
+	}
+      else if (nunits == gather_off_nunits * 2)
+	{
+	  unsigned char *sel = XALLOCAVEC (unsigned char, nunits);
+	  modifier = NARROW;
+
+	  for (i = 0; i < nunits; ++i)
+	    sel[i] = i < gather_off_nunits
+		     ? i : i + nunits - gather_off_nunits;
+
+	  perm_mask = gen_perm_mask (vectype, sel);
+	  gcc_assert (perm_mask != NULL_TREE);
+	  ncopies *= 2;
+	}
+      else
+	gcc_unreachable ();
+
+      rettype = TREE_TYPE (TREE_TYPE (gather_decl));
+      srctype = TREE_VALUE (arglist); arglist = TREE_CHAIN (arglist);
+      ptrtype = TREE_VALUE (arglist); arglist = TREE_CHAIN (arglist);
+      idxtype = TREE_VALUE (arglist); arglist = TREE_CHAIN (arglist);
+      masktype = TREE_VALUE (arglist); arglist = TREE_CHAIN (arglist);
+      scaletype = TREE_VALUE (arglist);
+      gcc_checking_assert (types_compatible_p (srctype, rettype)
+			   && types_compatible_p (srctype, masktype));
+
+      vec_dest = vect_create_destination_var (scalar_dest, vectype);
+
+      ptr = fold_convert (ptrtype, gather_base);
+      if (!is_gimple_min_invariant (ptr))
+	{
+	  ptr = force_gimple_operand (ptr, &seq, true, NULL_TREE);
+	  new_bb = gsi_insert_seq_on_edge_immediate (pe, seq);
+	  gcc_assert (!new_bb);
+	}
+
+      /* Currently we support only unconditional gather loads,
+	 so mask should be all ones.  */
+      if (TREE_CODE (TREE_TYPE (masktype)) == INTEGER_TYPE)
+	mask = build_int_cst (TREE_TYPE (masktype), -1);
+      else if (SCALAR_FLOAT_TYPE_P (TREE_TYPE (masktype)))
+	{
+	  REAL_VALUE_TYPE r;
+	  long tmp[6];
+	  for (j = 0; j < 6; ++j)
+	    tmp[j] = -1;
+	  real_from_target (&r, tmp, TYPE_MODE (TREE_TYPE (masktype)));
+	  mask = build_real (TREE_TYPE (masktype), r);
+	}
+      else
+	gcc_unreachable ();
+      mask = build_vector_from_val (masktype, mask);
+      mask = vect_init_vector (stmt, mask, masktype, NULL);
+
+      scale = build_int_cst (scaletype, gather_scale);
+
+      prev_stmt_info = NULL;
+      for (j = 0; j < ncopies; ++j)
+	{
+	  if (modifier == WIDEN && (j & 1))
+	    op = permute_vec_elements (vec_oprnd0, vec_oprnd0,
+				       perm_mask, stmt, gsi);
+	  else if (j == 0)
+	    op = vec_oprnd0
+	      = vect_get_vec_def_for_operand (gather_off, stmt, NULL);
+	  else
+	    op = vec_oprnd0
+	      = vect_get_vec_def_for_stmt_copy (gather_dt, vec_oprnd0);
+
+	  if (!useless_type_conversion_p (idxtype, TREE_TYPE (op)))
+	    {
+	      gcc_assert (TYPE_VECTOR_SUBPARTS (TREE_TYPE (op))
+			  == TYPE_VECTOR_SUBPARTS (idxtype));
+	      var = vect_get_new_vect_var (idxtype, vect_simple_var, NULL);
+	      add_referenced_var (var);
+	      var = make_ssa_name (var, NULL);
+	      op = build1 (VIEW_CONVERT_EXPR, idxtype, op);
+	      new_stmt
+		= gimple_build_assign_with_ops (VIEW_CONVERT_EXPR, var,
+						op, NULL_TREE);
+	      vect_finish_stmt_generation (stmt, new_stmt, gsi);
+	      op = var;
+	    }
+
+	  new_stmt
+	    = gimple_build_call (gather_decl, 5, mask, ptr, op, mask, scale);
+
+	  if (!useless_type_conversion_p (vectype, rettype))
+	    {
+	      gcc_assert (TYPE_VECTOR_SUBPARTS (vectype)
+			  == TYPE_VECTOR_SUBPARTS (rettype));
+	      var = vect_get_new_vect_var (rettype, vect_simple_var, NULL);
+	      add_referenced_var (var);
+	      op = make_ssa_name (var, new_stmt);
+	      gimple_call_set_lhs (new_stmt, op);
+	      vect_finish_stmt_generation (stmt, new_stmt, gsi);
+	      var = make_ssa_name (vec_dest, NULL);
+	      op = build1 (VIEW_CONVERT_EXPR, vectype, op);
+	      new_stmt
+		= gimple_build_assign_with_ops (VIEW_CONVERT_EXPR, var, op,
+						NULL_TREE);
+	    }
+	  else
+	    {
+	      var = make_ssa_name (vec_dest, new_stmt);
+	      gimple_call_set_lhs (new_stmt, var);
+	    }
+
+	  vect_finish_stmt_generation (stmt, new_stmt, gsi);
+
+	  if (modifier == NARROW)
+	    {
+	      if ((j & 1) == 0)
+		{
+		  prev_res = var;
+		  continue;
+		}
+	      var = permute_vec_elements (prev_res, var,
+					  perm_mask, stmt, gsi);
+	      new_stmt = SSA_NAME_DEF_STMT (var);
+	    }
+
+	  if (prev_stmt_info == NULL)
+	    STMT_VINFO_VEC_STMT (stmt_info) = *vec_stmt = new_stmt;
+	  else
+	    STMT_VINFO_RELATED_STMT (prev_stmt_info) = new_stmt;
+	  prev_stmt_info = vinfo_for_stmt (new_stmt);
+	}
+      return true;
+    }
+
   if (strided_load)
     {
       first_stmt = GROUP_FIRST_ELEMENT (stmt_info);
@@ -4769,7 +4974,9 @@ vectorizable_load (gimple stmt, gimple_s
 
 	      if (negative)
 		{
-		  new_temp = reverse_vec_elements (new_temp, stmt, gsi);
+		  tree perm_mask = perm_mask_for_reverse (vectype);
+		  new_temp = permute_vec_elements (new_temp, new_temp,
+						   perm_mask, stmt, gsi);
 		  new_stmt = SSA_NAME_DEF_STMT (new_temp);
 		}
 
--- gcc/target.def.jj	2011-11-04 08:52:19.000000000 +0100
+++ gcc/target.def	2011-11-04 08:53:13.000000000 +0100
@@ -1021,6 +1021,14 @@ DEFHOOK
  (void),
  default_autovectorize_vector_sizes)
 
+/* Target builtin that implements vector gather operation.  */
+DEFHOOK
+(builtin_gather,
+ "",
+ tree,
+ (const_tree mem_vectype, const_tree index_type, int scale),
+ NULL)
+
 HOOK_VECTOR_END (vectorize)
 
 #undef HOOK_PREFIX
--- gcc/tree-data-ref.c.jj	2011-11-04 08:52:19.000000000 +0100
+++ gcc/tree-data-ref.c	2011-11-04 08:53:13.000000000 +0100
@@ -1351,13 +1351,11 @@ dr_may_alias_p (const struct data_refere
   return refs_may_alias_p (addr_a, addr_b);
 }
 
-static void compute_self_dependence (struct data_dependence_relation *);
-
 /* Initialize a data dependence relation between data accesses A and
    B.  NB_LOOPS is the number of loops surrounding the references: the
    size of the classic distance/direction vectors.  */
 
-static struct data_dependence_relation *
+struct data_dependence_relation *
 initialize_data_dependence_relation (struct data_reference *a,
 				     struct data_reference *b,
  				     VEC (loop_p, heap) *loop_nest)
@@ -4121,7 +4119,7 @@ compute_affine_dependence (struct data_d
 /* This computes the dependence relation for the same data
    reference into DDR.  */
 
-static void
+void
 compute_self_dependence (struct data_dependence_relation *ddr)
 {
   unsigned int i;
--- gcc/tree-data-ref.h.jj	2011-11-04 08:52:19.000000000 +0100
+++ gcc/tree-data-ref.h	2011-11-04 13:22:28.000000000 +0100
@@ -1,5 +1,5 @@
 /* Data references and dependences detectors.
-   Copyright (C) 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010
+   Copyright (C) 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011
    Free Software Foundation, Inc.
    Contributed by Sebastian Pop <pop@cri.ensmp.fr>
 
@@ -423,6 +423,9 @@ extern bool graphite_find_data_reference
 						   VEC (data_reference_p, heap) **);
 struct data_reference *create_data_ref (loop_p, loop_p, tree, gimple, bool);
 extern bool find_loop_nest (struct loop *, VEC (loop_p, heap) **);
+extern struct data_dependence_relation *initialize_data_dependence_relation
+     (struct data_reference *, struct data_reference *, VEC (loop_p, heap) *); 
+extern void compute_self_dependence (struct data_dependence_relation *);
 extern void compute_all_dependences (VEC (data_reference_p, heap) *,
 				     VEC (ddr_p, heap) **, VEC (loop_p, heap) *,
 				     bool);
--- gcc/doc/tm.texi.in.jj	2011-11-04 08:52:19.000000000 +0100
+++ gcc/doc/tm.texi.in	2011-11-04 08:53:13.000000000 +0100
@@ -5696,6 +5696,14 @@ mode returned by @code{TARGET_VECTORIZE_
 The default is zero which means to not iterate over other vector sizes.
 @end deftypefn
 
+@hook TARGET_VECTORIZE_BUILTIN_GATHER
+Target builtin that implements vector gather operation.  @var{mem_vectype}
+is the vector type of the load and @var{index_type} is the scalar type of
+the index, scaled by @var{scale}.
+The default is @code{NULL_TREE} which means to not vectorize gather
+loads.
+@end deftypefn
+
 @node Anchored Addresses
 @section Anchored Addresses
 @cindex anchored addresses
--- gcc/doc/tm.texi.jj	2011-11-04 08:52:19.000000000 +0100
+++ gcc/doc/tm.texi	2011-11-04 08:53:13.000000000 +0100
@@ -5758,6 +5758,14 @@ mode returned by @code{TARGET_VECTORIZE_
 The default is zero which means to not iterate over other vector sizes.
 @end deftypefn
 
+@deftypefn {Target Hook} tree TARGET_VECTORIZE_BUILTIN_GATHER (const_tree @var{mem_vectype}, const_tree @var{index_type}, int @var{scale})
+Target builtin that implements vector gather operation.  @var{mem_vectype}
+is the vector type of the load and @var{index_type} is the scalar type of
+the index, scaled by @var{scale}.
+The default is @code{NULL_TREE} which means to not vectorize gather
+loads.
+@end deftypefn
+
 @node Anchored Addresses
 @section Anchored Addresses
 @cindex anchored addresses
--- gcc/tree-vect-data-refs.c.jj	2011-11-04 08:52:57.000000000 +0100
+++ gcc/tree-vect-data-refs.c	2011-11-04 14:10:17.000000000 +0100
@@ -2497,6 +2497,199 @@ vect_prune_runtime_alias_test_list (loop
   return true;
 }
 
+/* Check whether a non-affine read in STMT is suitable for a gather load
+   and if so, return a builtin decl for that operation.  */
+
+tree
+vect_check_gather (gimple stmt, loop_vec_info loop_vinfo, tree *basep,
+		   tree *offp, int *scalep)
+{
+  HOST_WIDE_INT scale = 1, pbitpos, pbitsize;
+  struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
+  stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
+  struct data_reference *dr = STMT_VINFO_DATA_REF (stmt_info);
+  tree offtype = NULL_TREE;
+  tree decl, base, off;
+  enum machine_mode pmode;
+  int punsignedp, pvolatilep;
+
+  /* The gather builtins need an address of the form
+     loop_invariant + vector * {1, 2, 4, 8}
+     or
+     loop_invariant + sign_extend (vector) * { 1, 2, 4, 8 }.
+     Unfortunately DR_BASE_ADDRESS/DR_OFFSET can be a mixture
+     of loop invariants/SSA_NAMEs defined in the loop, with casts,
+     multiplications and additions in it.  To get a vector, we need
+     a single SSA_NAME that will be defined in the loop and will
+     contain everything that is not loop invariant and that can be
+     vectorized.  The following code attempts to find such a preexisting
+     SSA_NAME OFF and put the loop invariants into a tree BASE
+     that can be gimplified before the loop.  */
+  base = get_inner_reference (DR_REF (dr), &pbitsize, &pbitpos, &off,
+			      &pmode, &punsignedp, &pvolatilep, false);
+  gcc_assert (base != NULL_TREE && (pbitpos % BITS_PER_UNIT) == 0);
+
+  if (TREE_CODE (base) == MEM_REF)
+    {
+      if (!integer_zerop (TREE_OPERAND (base, 1)))
+	{
+	  if (off == NULL_TREE)
+	    {
+	      double_int moff = mem_ref_offset (base);
+	      off = double_int_to_tree (sizetype, moff);
+	    }
+	  else
+	    off = size_binop (PLUS_EXPR, off,
+			      fold_convert (sizetype, TREE_OPERAND (base, 1)));
+	}
+      base = TREE_OPERAND (base, 0);
+    }
+  else
+    base = build_fold_addr_expr (base);
+
+  if (off == NULL_TREE)
+    off = size_zero_node;
+
+  /* If base is not loop invariant, then if off is 0 we start with just
+     the constant offset in the loop invariant BASE and continue with base
+     as OFF, otherwise we give up.
+     We could handle that case by gimplifying the addition of base + off
+     into some SSA_NAME and use that as off, but for now punt.  */
+  if (!expr_invariant_in_loop_p (loop, base))
+    {
+      if (!integer_zerop (off))
+	return NULL_TREE;
+      off = base;
+      base = size_int (pbitpos / BITS_PER_UNIT);
+    }
+  /* Otherwise put base + constant offset into the loop invariant BASE
+     and continue with OFF.  */
+  else
+    {
+      base = fold_convert (sizetype, base);
+      base = size_binop (PLUS_EXPR, base, size_int (pbitpos / BITS_PER_UNIT));
+    }
+
+  /* OFF at this point may be either a SSA_NAME or some tree expression
+     from get_inner_reference.  Try to peel off loop invariants from it
+     into BASE as long as possible.  */
+  STRIP_NOPS (off);
+  while (offtype == NULL_TREE)
+    {
+      enum tree_code code;
+      tree op0, op1, add = NULL_TREE;
+
+      if (TREE_CODE (off) == SSA_NAME)
+	{
+	  gimple def_stmt = SSA_NAME_DEF_STMT (off);
+
+	  if (expr_invariant_in_loop_p (loop, off))
+	    return NULL_TREE;
+
+	  if (gimple_code (def_stmt) != GIMPLE_ASSIGN)
+	    break;
+
+	  op0 = gimple_assign_rhs1 (def_stmt);
+	  code = gimple_assign_rhs_code (def_stmt);
+	  op1 = gimple_assign_rhs2 (def_stmt);
+	}
+      else
+	{
+	  if (get_gimple_rhs_class (TREE_CODE (off)) == GIMPLE_TERNARY_RHS)
+	    return NULL_TREE;
+	  code = TREE_CODE (off);
+	  extract_ops_from_tree (off, &code, &op0, &op1);
+	}
+      switch (code)
+	{
+	case POINTER_PLUS_EXPR:
+	case PLUS_EXPR:
+	  if (expr_invariant_in_loop_p (loop, op0))
+	    {
+	      add = op0;
+	      off = op1;
+	    do_add:
+	      add = fold_convert (sizetype, add);
+	      if (scale != 1)
+		add = size_binop (MULT_EXPR, add, size_int (scale));
+	      base = size_binop (PLUS_EXPR, base, add);
+	      continue;
+	    }
+	  if (expr_invariant_in_loop_p (loop, op1))
+	    {
+	      add = op1;
+	      off = op0;
+	      goto do_add;
+	    }
+	  break;
+	case MINUS_EXPR:
+	  if (expr_invariant_in_loop_p (loop, op1))
+	    {
+	      add = fold_convert (sizetype, op1);
+	      add = size_binop (MINUS_EXPR, size_zero_node, add);
+	      off = op0;
+	      goto do_add;
+	    }
+	  break;
+	case MULT_EXPR:
+	  if (scale == 1 && host_integerp (op1, 0))
+	    {
+	      scale = tree_low_cst (op1, 0);
+	      off = op0;
+	      continue;
+	    }
+	  break;
+	case SSA_NAME:
+	  off = op0;
+	  continue;
+	CASE_CONVERT:
+	  if (!POINTER_TYPE_P (TREE_TYPE (op0))
+	      && !INTEGRAL_TYPE_P (TREE_TYPE (op0)))
+	    break;
+	  if (TYPE_PRECISION (TREE_TYPE (op0))
+	      == TYPE_PRECISION (TREE_TYPE (off)))
+	    {
+	      off = op0;
+	      continue;
+	    }
+	  if (TYPE_PRECISION (TREE_TYPE (op0))
+	      < TYPE_PRECISION (TREE_TYPE (off)))
+	    {
+	      off = op0;
+	      offtype = TREE_TYPE (off);
+	      STRIP_NOPS (off);
+	      continue;
+	    }
+	  break;
+	default:
+	  break;
+	}
+      break;
+    }
+
+  /* If at the end OFF still isn't a SSA_NAME or isn't
+     defined in the loop, punt.  */
+  if (TREE_CODE (off) != SSA_NAME
+      || expr_invariant_in_loop_p (loop, off))
+    return NULL_TREE;
+
+  if (offtype == NULL_TREE)
+    offtype = TREE_TYPE (off);
+
+  decl = targetm.vectorize.builtin_gather (STMT_VINFO_VECTYPE (stmt_info),
+					   offtype, scale);
+  if (decl == NULL_TREE)
+    return NULL_TREE;
+
+  if (basep)
+    *basep = base;
+  if (offp)
+    *offp = off;
+  if (scalep)
+    *scalep = scale;
+  return decl;
+}
+
 
 /* Function vect_analyze_data_refs.
 
@@ -2573,6 +2766,7 @@ vect_analyze_data_refs (loop_vec_info lo
       gimple stmt;
       stmt_vec_info stmt_info;
       tree base, offset, init;
+      bool gather = false;
       int vf;
 
       if (!dr || !DR_REF (dr))
@@ -2594,22 +2788,51 @@ vect_analyze_data_refs (loop_vec_info lo
 
       /* Check that analysis of the data-ref succeeded.  */
       if (!DR_BASE_ADDRESS (dr) || !DR_OFFSET (dr) || !DR_INIT (dr)
-          || !DR_STEP (dr))
+	  || !DR_STEP (dr))
         {
-          if (vect_print_dump_info (REPORT_UNVECTORIZED_LOCATIONS))
-            {
-              fprintf (vect_dump, "not vectorized: data ref analysis failed ");
-              print_gimple_stmt (vect_dump, stmt, 0, TDF_SLIM);
-            }
+	  /* If the target supports vector gather loads, see if they can
+	     be used.  */
+	  if (loop_vinfo
+	      && DR_IS_READ (dr)
+	      && !TREE_THIS_VOLATILE (DR_REF (dr))
+	      && targetm.vectorize.builtin_gather != NULL
+	      && !nested_in_vect_loop_p (loop, stmt))
+	    {
+	      struct data_reference *newdr
+		= create_data_ref (NULL, loop_containing_stmt (stmt),
+				   DR_REF (dr), stmt, true);
+	      gcc_assert (newdr != NULL && DR_REF (newdr));
+	      if (DR_BASE_ADDRESS (newdr)
+		  && DR_OFFSET (newdr)
+		  && DR_INIT (newdr)
+		  && DR_STEP (newdr)
+		  && integer_zerop (DR_STEP (newdr)))
+		{
+		  dr = newdr;
+		  gather = true;
+		}
+	      else
+		free_data_ref (newdr);
+	    }
 
-          if (bb_vinfo)
-            {
-              STMT_VINFO_VECTORIZABLE (stmt_info) = false;
-              stop_bb_analysis = true;
-              continue;
-            }
+	  if (!gather)
+	    {
+	      if (vect_print_dump_info (REPORT_UNVECTORIZED_LOCATIONS))
+		{
+		  fprintf (vect_dump, "not vectorized: data ref analysis "
+				      "failed ");
+		  print_gimple_stmt (vect_dump, stmt, 0, TDF_SLIM);
+		}
+
+	      if (bb_vinfo)
+		{
+		  STMT_VINFO_VECTORIZABLE (stmt_info) = false;
+		  stop_bb_analysis = true;
+		  continue;
+		}
 
-          return false;
+	      return false;
+	    }
         }
 
       if (TREE_CODE (DR_BASE_ADDRESS (dr)) == INTEGER_CST)
@@ -2625,7 +2848,9 @@ vect_analyze_data_refs (loop_vec_info lo
               continue;
             }
 
-           return false;
+	  if (gather)
+	    free_data_ref (dr);
+	  return false;
         }
 
       if (TREE_THIS_VOLATILE (DR_REF (dr)))
@@ -2666,6 +2891,8 @@ vect_analyze_data_refs (loop_vec_info lo
               continue;
             }
 
+	  if (gather)
+	    free_data_ref (dr);
           return false;
         }
 
@@ -2791,6 +3018,8 @@ vect_analyze_data_refs (loop_vec_info lo
               continue;
             }
 
+	  if (gather)
+	    free_data_ref (dr);
           return false;
         }
 
@@ -2818,8 +3047,13 @@ vect_analyze_data_refs (loop_vec_info lo
               stop_bb_analysis = true;
               continue;
             }
-          else
-            return false;
+
+	  if (gather)
+	    {
+	      STMT_VINFO_DATA_REF (stmt_info) = NULL;
+	      free_data_ref (dr);
+	    }
+	  return false;
         }
 
       /* Adjust the minimal vectorization factor according to the
@@ -2827,6 +3061,86 @@ vect_analyze_data_refs (loop_vec_info lo
       vf = TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info));
       if (vf > *min_vf)
 	*min_vf = vf;
+
+      if (gather)
+	{
+	  unsigned int j, k, n;
+	  struct data_reference *olddr
+	    = VEC_index (data_reference_p, datarefs, i);
+	  VEC (ddr_p, heap) *ddrs = LOOP_VINFO_DDRS (loop_vinfo);
+	  struct data_dependence_relation *ddr, *newddr;
+	  bool bad = false;
+	  tree off;
+	  VEC (loop_p, heap) *nest = LOOP_VINFO_LOOP_NEST (loop_vinfo);
+
+	  if (!vect_check_gather (stmt, loop_vinfo, NULL, &off, NULL)
+	      || get_vectype_for_scalar_type (TREE_TYPE (off)) == NULL_TREE)
+	    {
+	      if (vect_print_dump_info (REPORT_UNVECTORIZED_LOCATIONS))
+		{
+		  fprintf (vect_dump,
+			   "not vectorized: not suitable for gather ");
+		  print_gimple_stmt (vect_dump, stmt, 0, TDF_SLIM);
+		}
+	      return false;
+	    }
+
+	  n = VEC_length (data_reference_p, datarefs) - 1;
+	  for (j = 0, k = i - 1; j < i; j++)
+	    {
+	      ddr = VEC_index (ddr_p, ddrs, k);
+	      gcc_assert (DDR_B (ddr) == olddr);
+	      newddr = initialize_data_dependence_relation (DDR_A (ddr), dr,
+							    nest);
+	      VEC_replace (ddr_p, ddrs, k, newddr);
+	      free_dependence_relation (ddr);
+	      if (!bad
+		  && DR_IS_WRITE (DDR_A (newddr))
+		  && DDR_ARE_DEPENDENT (newddr) != chrec_known)
+		bad = true;
+	      k += --n;
+	    }
+
+	  k++;
+	  n = k + VEC_length (data_reference_p, datarefs) - i - 1;
+	  for (; k < n; k++)
+	    {
+	      ddr = VEC_index (ddr_p, ddrs, k);
+	      gcc_assert (DDR_A (ddr) == olddr);
+	      newddr = initialize_data_dependence_relation (dr, DDR_B (ddr),
+							    nest);
+	      VEC_replace (ddr_p, ddrs, k, newddr);
+	      free_dependence_relation (ddr);
+	      if (!bad
+		  && DR_IS_WRITE (DDR_B (newddr))
+		  && DDR_ARE_DEPENDENT (newddr) != chrec_known)
+		bad = true;
+	    }
+
+	  k = VEC_length (ddr_p, ddrs)
+	      - VEC_length (data_reference_p, datarefs) + i;
+	  ddr = VEC_index (ddr_p, ddrs, k);
+	  gcc_assert (DDR_A (ddr) == olddr && DDR_B (ddr) == olddr);
+	  newddr = initialize_data_dependence_relation (dr, dr, nest);
+	  compute_self_dependence (newddr);
+	  VEC_replace (ddr_p, ddrs, k, newddr);
+	  free_dependence_relation (ddr);
+	  VEC_replace (data_reference_p, datarefs, i, dr);
+
+	  if (bad)
+	    {
+	      if (vect_print_dump_info (REPORT_UNVECTORIZED_LOCATIONS))
+		{
+		  fprintf (vect_dump,
+			   "not vectorized: data dependence conflict"
+			   " prevents gather");
+		  print_gimple_stmt (vect_dump, stmt, 0, TDF_SLIM);
+		}
+	      return false;
+	    }
+
+	  STMT_VINFO_GATHER_P (stmt_info) = true;
+	}
     }
 
   return true;
--- gcc/tree-vectorizer.h.jj	2011-11-04 08:52:19.000000000 +0100
+++ gcc/tree-vectorizer.h	2011-11-04 08:53:13.000000000 +0100
@@ -535,6 +535,9 @@ typedef struct _stmt_vec_info {
   /* Is this statement vectorizable or should it be skipped in (partial)
      vectorization.  */
   bool vectorizable;
+
+  /* For loads only, true if this is a gather load.  */
+  bool gather_p;
 } *stmt_vec_info;
 
 /* Access Functions.  */
@@ -548,6 +551,7 @@ typedef struct _stmt_vec_info {
 #define STMT_VINFO_VEC_STMT(S)             (S)->vectorized_stmt
 #define STMT_VINFO_VECTORIZABLE(S)         (S)->vectorizable
 #define STMT_VINFO_DATA_REF(S)             (S)->data_ref_info
+#define STMT_VINFO_GATHER_P(S)		   (S)->gather_p
 
 #define STMT_VINFO_DR_BASE_ADDRESS(S)      (S)->dr_base_address
 #define STMT_VINFO_DR_INIT(S)              (S)->dr_init
@@ -858,6 +862,8 @@ extern bool vect_analyze_data_refs_align
 extern bool vect_verify_datarefs_alignment (loop_vec_info, bb_vec_info);
 extern bool vect_analyze_data_ref_accesses (loop_vec_info, bb_vec_info);
 extern bool vect_prune_runtime_alias_test_list (loop_vec_info);
+extern tree vect_check_gather (gimple, loop_vec_info, tree *, tree *,
+			       int *);
 extern bool vect_analyze_data_refs (loop_vec_info, bb_vec_info, int *);
 extern tree vect_create_data_ref_ptr (gimple, tree, struct loop *, tree,
 				      tree *, gimple_stmt_iterator *,
--- gcc/config/i386/i386-builtin-types.def.jj	2011-11-04 08:52:19.000000000 +0100
+++ gcc/config/i386/i386-builtin-types.def	2011-11-04 08:53:13.000000000 +0100
@@ -432,20 +432,24 @@ DEF_FUNCTION_TYPE (V8QI, QI, QI, QI, QI,
 
 DEF_FUNCTION_TYPE (V2DF, V2DF, PCDOUBLE, V4SI, V2DF, INT)
 DEF_FUNCTION_TYPE (V4DF, V4DF, PCDOUBLE, V4SI, V4DF, INT)
+DEF_FUNCTION_TYPE (V4DF, V4DF, PCDOUBLE, V8SI, V4DF, INT)
 DEF_FUNCTION_TYPE (V2DF, V2DF, PCDOUBLE, V2DI, V2DF, INT)
 DEF_FUNCTION_TYPE (V4DF, V4DF, PCDOUBLE, V4DI, V4DF, INT)
 DEF_FUNCTION_TYPE (V4SF, V4SF, PCFLOAT, V4SI, V4SF, INT)
 DEF_FUNCTION_TYPE (V8SF, V8SF, PCFLOAT, V8SI, V8SF, INT)
 DEF_FUNCTION_TYPE (V4SF, V4SF, PCFLOAT, V2DI, V4SF, INT)
 DEF_FUNCTION_TYPE (V4SF, V4SF, PCFLOAT, V4DI, V4SF, INT)
+DEF_FUNCTION_TYPE (V8SF, V8SF, PCFLOAT, V4DI, V8SF, INT)
 DEF_FUNCTION_TYPE (V2DI, V2DI, PCINT64, V4SI, V2DI, INT)
 DEF_FUNCTION_TYPE (V4DI, V4DI, PCINT64, V4SI, V4DI, INT)
+DEF_FUNCTION_TYPE (V4DI, V4DI, PCINT64, V8SI, V4DI, INT)
 DEF_FUNCTION_TYPE (V2DI, V2DI, PCINT64, V2DI, V2DI, INT)
 DEF_FUNCTION_TYPE (V4DI, V4DI, PCINT64, V4DI, V4DI, INT)
 DEF_FUNCTION_TYPE (V4SI, V4SI, PCINT, V4SI, V4SI, INT)
 DEF_FUNCTION_TYPE (V8SI, V8SI, PCINT, V8SI, V8SI, INT)
 DEF_FUNCTION_TYPE (V4SI, V4SI, PCINT, V2DI, V4SI, INT)
 DEF_FUNCTION_TYPE (V4SI, V4SI, PCINT, V4DI, V4SI, INT)
+DEF_FUNCTION_TYPE (V8SI, V8SI, PCINT, V4DI, V8SI, INT)
 
 DEF_FUNCTION_TYPE_ALIAS (V2DF_FTYPE_V2DF, ROUND)
 DEF_FUNCTION_TYPE_ALIAS (V4DF_FTYPE_V4DF, ROUND)
--- gcc/config/i386/sse.md.jj	2011-11-04 08:52:19.000000000 +0100
+++ gcc/config/i386/sse.md	2011-11-04 12:48:16.000000000 +0100
@@ -316,14 +316,6 @@ (define_mode_attr i128
 ;; Mix-n-match
 (define_mode_iterator AVX256MODE2P [V8SI V8SF V4DF])
 
-(define_mode_iterator AVXMODE48P_DI
-		      [V2DI V2DF V4DI V4DF V4SF V4SI])
-(define_mode_attr AVXMODE48P_DI
-		      [(V2DI "V2DI") (V2DF "V2DI")
-		       (V4DI "V4DI") (V4DF "V4DI")
-		       (V4SI "V2DI") (V4SF "V2DI")
-		       (V8SI "V4DI") (V8SF "V4DI")])
-
 (define_mode_iterator FMAMODE [SF DF V4SF V2DF V8SF V4DF])
 
 ;; Mapping of immediate bits for blend instructions
@@ -12516,11 +12508,21 @@ (define_insn "vcvtps2ph256"
 ;; For gather* insn patterns
 (define_mode_iterator VEC_GATHER_MODE
 		      [V2DI V2DF V4DI V4DF V4SI V4SF V8SI V8SF])
-(define_mode_attr VEC_GATHER_MODE
+(define_mode_attr VEC_GATHER_IDXSI
 		      [(V2DI "V4SI") (V2DF "V4SI")
 		       (V4DI "V4SI") (V4DF "V4SI")
 		       (V4SI "V4SI") (V4SF "V4SI")
 		       (V8SI "V8SI") (V8SF "V8SI")])
+(define_mode_attr VEC_GATHER_IDXDI
+		      [(V2DI "V2DI") (V2DF "V2DI")
+		       (V4DI "V4DI") (V4DF "V4DI")
+		       (V4SI "V2DI") (V4SF "V2DI")
+		       (V8SI "V4DI") (V8SF "V4DI")])
+(define_mode_attr VEC_GATHER_SRCDI
+		      [(V2DI "V2DI") (V2DF "V2DF")
+		       (V4DI "V4DI") (V4DF "V4DF")
+		       (V4SI "V4SI") (V4SF "V4SF")
+		       (V8SI "V4SI") (V8SF "V4SF")])
 
 (define_expand "avx2_gathersi<mode>"
   [(parallel [(set (match_operand:VEC_GATHER_MODE 0 "register_operand" "")
@@ -12529,7 +12531,8 @@ (define_expand "avx2_gathersi<mode>"
 		      (mem:<ssescalarmode>
 			(match_par_dup 7
 			  [(match_operand 2 "vsib_address_operand" "")
-			   (match_operand:<VEC_GATHER_MODE> 3 "register_operand" "")
+			   (match_operand:<VEC_GATHER_IDXSI>
+			      3 "register_operand" "")
 			   (match_operand:SI 5 "const1248_operand " "")]))
 		      (mem:BLK (scratch))
 		      (match_operand:VEC_GATHER_MODE 4 "register_operand" "")]
@@ -12549,7 +12552,7 @@ (define_insn "*avx2_gathersi<mode>"
 	   (match_operator:<ssescalarmode> 7 "vsib_mem_operator"
 	     [(unspec:P
 		[(match_operand:P 3 "vsib_address_operand" "p")
-		 (match_operand:<VEC_GATHER_MODE> 4 "register_operand" "x")
+		 (match_operand:<VEC_GATHER_IDXSI> 4 "register_operand" "x")
 		 (match_operand:SI 6 "const1248_operand" "n")]
 		UNSPEC_VSIBADDR)])
 	   (mem:BLK (scratch))
@@ -12565,14 +12568,16 @@ (define_insn "*avx2_gathersi<mode>"
 (define_expand "avx2_gatherdi<mode>"
   [(parallel [(set (match_operand:VEC_GATHER_MODE 0 "register_operand" "")
 		   (unspec:VEC_GATHER_MODE
-		     [(match_operand:VEC_GATHER_MODE 1 "register_operand" "")
+		     [(match_operand:<VEC_GATHER_SRCDI> 1 "register_operand" "")
 		      (mem:<ssescalarmode>
 			(match_par_dup 7
 			  [(match_operand 2 "vsib_address_operand" "")
-			   (match_operand:<AVXMODE48P_DI> 3 "register_operand" "")
+			   (match_operand:<VEC_GATHER_IDXDI>
+			      3 "register_operand" "")
 			   (match_operand:SI 5 "const1248_operand " "")]))
 		      (mem:BLK (scratch))
-		      (match_operand:VEC_GATHER_MODE 4 "register_operand" "")]
+		      (match_operand:<VEC_GATHER_SRCDI>
+			4 "register_operand" "")]
 		     UNSPEC_GATHER))
 	      (clobber (match_scratch:VEC_GATHER_MODE 6 ""))])]
   "TARGET_AVX2"
@@ -12583,63 +12588,21 @@ (define_expand "avx2_gatherdi<mode>"
 })
 
 (define_insn "*avx2_gatherdi<mode>"
-  [(set (match_operand:AVXMODE48P_DI 0 "register_operand" "=&x")
-	(unspec:AVXMODE48P_DI
-	  [(match_operand:AVXMODE48P_DI 2 "register_operand" "0")
+  [(set (match_operand:VEC_GATHER_MODE 0 "register_operand" "=&x")
+	(unspec:VEC_GATHER_MODE
+	  [(match_operand:<VEC_GATHER_SRCDI> 2 "register_operand" "0")
 	   (match_operator:<ssescalarmode> 7 "vsib_mem_operator"
 	     [(unspec:P
 		[(match_operand:P 3 "vsib_address_operand" "p")
-		 (match_operand:<AVXMODE48P_DI> 4 "register_operand" "x")
+		 (match_operand:<VEC_GATHER_IDXDI> 4 "register_operand" "x")
 		 (match_operand:SI 6 "const1248_operand" "n")]
 		UNSPEC_VSIBADDR)])
 	   (mem:BLK (scratch))
-	   (match_operand:AVXMODE48P_DI 5 "register_operand" "1")]
+	   (match_operand:<VEC_GATHER_SRCDI> 5 "register_operand" "1")]
 	  UNSPEC_GATHER))
-   (clobber (match_scratch:AVXMODE48P_DI 1 "=&x"))]
-  "TARGET_AVX2"
-  "v<sseintprefix>gatherq<ssemodesuffix>\t{%1, %7, %0|%0, %7, %1}"
-  [(set_attr "type" "ssemov")
-   (set_attr "prefix" "vex")
-   (set_attr "mode" "<sseinsnmode>")])
-
-;; Special handling for VEX.256 with float arguments
-;; since there're still xmms as operands
-(define_expand "avx2_gatherdi<mode>256"
-  [(parallel [(set (match_operand:VI4F_128 0 "register_operand" "")
-		   (unspec:VI4F_128
-		     [(match_operand:VI4F_128 1 "register_operand" "")
-		      (mem:<ssescalarmode>
-			(match_par_dup 7
-			  [(match_operand 2 "vsib_address_operand" "")
-			   (match_operand:V4DI 3 "register_operand" "")
-			   (match_operand:SI 5 "const1248_operand " "")]))
-		      (mem:BLK (scratch))
-		      (match_operand:VI4F_128 4 "register_operand" "")]
-		     UNSPEC_GATHER))
-	      (clobber (match_scratch:VI4F_128 6 ""))])]
-  "TARGET_AVX2"
-{
-  operands[7]
-    = gen_rtx_UNSPEC (Pmode, gen_rtvec (3, operands[2], operands[3],
-					operands[5]), UNSPEC_VSIBADDR);
-})
-
-(define_insn "*avx2_gatherdi<mode>256"
-  [(set (match_operand:VI4F_128 0 "register_operand" "=x")
-	(unspec:VI4F_128
-	  [(match_operand:VI4F_128 2 "register_operand" "0")
-	   (match_operator:<ssescalarmode> 7 "vsib_mem_operator"
-	     [(unspec:P
-		[(match_operand:P 3 "vsib_address_operand" "p")
-		 (match_operand:V4DI 4 "register_operand" "x")
-		 (match_operand:SI 6 "const1248_operand" "n")]
-		UNSPEC_VSIBADDR)])
-	   (mem:BLK (scratch))
-	   (match_operand:VI4F_128 5 "register_operand" "1")]
-	  UNSPEC_GATHER)) 
-   (clobber (match_scratch:VI4F_128 1 "=&x"))]
+   (clobber (match_scratch:VEC_GATHER_MODE 1 "=&x"))]
   "TARGET_AVX2"
-  "v<sseintprefix>gatherq<ssemodesuffix>\t{%1, %7, %0|%0, %7, %1}"
+  "v<sseintprefix>gatherq<ssemodesuffix>\t{%5, %7, %2|%2, %7, %5}"
   [(set_attr "type" "ssemov")
    (set_attr "prefix" "vex")
    (set_attr "mode" "<sseinsnmode>")])
--- gcc/config/i386/i386.c.jj	2011-11-04 08:52:19.000000000 +0100
+++ gcc/config/i386/i386.c	2011-11-04 12:48:16.000000000 +0100
@@ -25105,6 +25105,13 @@ enum ix86_builtins
   IX86_BUILTIN_GATHERDIV4SI,
   IX86_BUILTIN_GATHERDIV8SI,
 
+  /* Alternate 4 element gather for the vectorizer where
+     all operands are 32-byte wide.  */
+  IX86_BUILTIN_GATHERALTSIV4DF,
+  IX86_BUILTIN_GATHERALTDIV8SF,
+  IX86_BUILTIN_GATHERALTSIV4DI,
+  IX86_BUILTIN_GATHERALTDIV8SI,
+
   /* TFmode support builtins.  */
   IX86_BUILTIN_INFQ,
   IX86_BUILTIN_HUGE_VALQ,
@@ -26883,6 +26890,22 @@ ix86_init_mmx_sse_builtins (void)
 	       V4SI_FTYPE_V4SI_PCINT_V4DI_V4SI_INT,
 	       IX86_BUILTIN_GATHERDIV8SI);
 
+  def_builtin (OPTION_MASK_ISA_AVX2, "__builtin_ia32_gatheraltsiv4df ",
+	       V4DF_FTYPE_V4DF_PCDOUBLE_V8SI_V4DF_INT,
+	       IX86_BUILTIN_GATHERALTSIV4DF);
+
+  def_builtin (OPTION_MASK_ISA_AVX2, "__builtin_ia32_gatheraltdiv4sf256 ",
+	       V8SF_FTYPE_V8SF_PCFLOAT_V4DI_V8SF_INT,
+	       IX86_BUILTIN_GATHERALTDIV8SF);
+
+  def_builtin (OPTION_MASK_ISA_AVX2, "__builtin_ia32_gatheraltsiv4di ",
+	       V4DI_FTYPE_V4DI_PCINT64_V8SI_V4DI_INT,
+	       IX86_BUILTIN_GATHERALTSIV4DI);
+
+  def_builtin (OPTION_MASK_ISA_AVX2, "__builtin_ia32_gatheraltdiv4si256 ",
+	       V8SI_FTYPE_V8SI_PCINT_V4DI_V8SI_INT,
+	       IX86_BUILTIN_GATHERALTDIV8SI);
+
   /* MMX access to the vec_init patterns.  */
   def_builtin_const (OPTION_MASK_ISA_MMX, "__builtin_ia32_vec_init_v2si",
 		     V2SI_FTYPE_INT_INT, IX86_BUILTIN_VEC_INIT_V2SI);
@@ -28869,7 +28892,7 @@ rdrand_step:
       icode = CODE_FOR_avx2_gatherdiv4sf;
       goto gather_gen;
     case IX86_BUILTIN_GATHERDIV8SF:
-      icode = CODE_FOR_avx2_gatherdiv4sf256;
+      icode = CODE_FOR_avx2_gatherdiv8sf;
       goto gather_gen;
     case IX86_BUILTIN_GATHERSIV2DI:
       icode = CODE_FOR_avx2_gathersiv2di;
@@ -28893,7 +28916,20 @@ rdrand_step:
       icode = CODE_FOR_avx2_gatherdiv4si;
       goto gather_gen;
     case IX86_BUILTIN_GATHERDIV8SI:
-      icode = CODE_FOR_avx2_gatherdiv4si256;
+      icode = CODE_FOR_avx2_gatherdiv8si;
+      goto gather_gen;
+    case IX86_BUILTIN_GATHERALTSIV4DF:
+      icode = CODE_FOR_avx2_gathersiv4df;
+      goto gather_gen;
+    case IX86_BUILTIN_GATHERALTDIV8SF:
+      icode = CODE_FOR_avx2_gatherdiv8sf;
+      goto gather_gen;
+    case IX86_BUILTIN_GATHERALTSIV4DI:
+      icode = CODE_FOR_avx2_gathersiv4di;
+      goto gather_gen;
+    case IX86_BUILTIN_GATHERALTDIV8SI:
+      icode = CODE_FOR_avx2_gatherdiv8si;
+      goto gather_gen;
 
     gather_gen:
       arg0 = CALL_EXPR_ARG (exp, 0);
@@ -28912,8 +28948,39 @@ rdrand_step:
       mode3 = insn_data[icode].operand[4].mode;
       mode4 = insn_data[icode].operand[5].mode;
 
-      if (target == NULL_RTX)
-	target = gen_reg_rtx (insn_data[icode].operand[0].mode);
+      if (target == NULL_RTX
+	  || GET_MODE (target) != insn_data[icode].operand[0].mode)
+	subtarget = gen_reg_rtx (insn_data[icode].operand[0].mode);
+      else
+	subtarget = target;
+
+      if (fcode == IX86_BUILTIN_GATHERALTSIV4DF
+	  || fcode == IX86_BUILTIN_GATHERALTSIV4DI)
+	{
+	  rtx half = gen_reg_rtx (V4SImode);
+	  if (!nonimmediate_operand (op2, V8SImode))
+	    op2 = copy_to_mode_reg (V8SImode, op2);
+	  emit_insn (gen_vec_extract_lo_v8si (half, op2));
+	  op2 = half;
+	}
+      else if (fcode == IX86_BUILTIN_GATHERALTDIV8SF
+	       || fcode == IX86_BUILTIN_GATHERALTDIV8SI)
+	{
+	  rtx (*gen) (rtx, rtx);
+	  rtx half = gen_reg_rtx (mode0);
+	  if (mode0 == V4SFmode)
+	    gen = gen_vec_extract_lo_v8sf;
+	  else
+	    gen = gen_vec_extract_lo_v8si;
+	  if (!nonimmediate_operand (op0, GET_MODE (op0)))
+	    op0 = copy_to_mode_reg (GET_MODE (op0), op0);
+	  emit_insn (gen (half, op0));
+	  op0 = half;
+	  if (!nonimmediate_operand (op3, GET_MODE (op3)))
+	    op3 = copy_to_mode_reg (GET_MODE (op3), op3);
+	  emit_insn (gen (half, op3));
+	  op3 = half;
+	}
 
       /* Force memory operand only with base register here.  But we
 	 don't want to do it on memory operand for other builtin
@@ -28935,10 +29002,26 @@ rdrand_step:
           error ("last argument must be scale 1, 2, 4, 8");
           return const0_rtx;
 	}
-      pat = GEN_FCN (icode) (target, op0, op1, op2, op3, op4);
+      pat = GEN_FCN (icode) (subtarget, op0, op1, op2, op3, op4);
       if (! pat)
 	return const0_rtx;
       emit_insn (pat);
+
+      if (fcode == IX86_BUILTIN_GATHERDIV8SF
+	  || fcode == IX86_BUILTIN_GATHERDIV8SI)
+	{
+	  enum machine_mode tmode = GET_MODE (subtarget) == V8SFmode
+				    ? V4SFmode : V4SImode;
+	  if (target == NULL_RTX)
+	    target = gen_reg_rtx (tmode);
+	  if (tmode == V4SFmode)
+	    emit_insn (gen_vec_extract_lo_v8sf (target, subtarget));
+	  else
+	    emit_insn (gen_vec_extract_lo_v8si (target, subtarget));
+	}
+      else
+	target = subtarget;
+
       return target;
 
     default:
@@ -29443,6 +29526,73 @@ ix86_veclibabi_acml (enum built_in_funct
   return new_fndecl;
 }
 
+/* Returns a decl of a function that implements gather load with
+   memory type MEM_VECTYPE and index type INDEX_VECTYPE and SCALE.
+   Return NULL_TREE if it is not available.  */
+
+static tree
+ix86_vectorize_builtin_gather (const_tree mem_vectype,
+			       const_tree index_type, int scale)
+{
+  bool si;
+  enum ix86_builtins code;
+
+  if (! TARGET_AVX2)
+    return NULL_TREE;
+
+  if ((TREE_CODE (index_type) != INTEGER_TYPE
+       && !POINTER_TYPE_P (index_type))
+      || (TYPE_MODE (index_type) != SImode
+	  && TYPE_MODE (index_type) != DImode))
+    return NULL_TREE;
+
+  if (TYPE_PRECISION (index_type) > POINTER_SIZE)
+    return NULL_TREE;
+
+  /* v*gather* insn sign extends index to pointer mode.  */
+  if (TYPE_PRECISION (index_type) < POINTER_SIZE
+      && TYPE_UNSIGNED (index_type))
+    return NULL_TREE;
+
+  if (scale <= 0
+      || scale > 8
+      || (scale & (scale - 1)) != 0)
+    return NULL_TREE;
+
+  si = TYPE_MODE (index_type) == SImode;
+  switch (TYPE_MODE (mem_vectype))
+    {
+    case V2DFmode:
+      code = si ? IX86_BUILTIN_GATHERSIV2DF : IX86_BUILTIN_GATHERDIV2DF;
+      break;
+    case V4DFmode:
+      code = si ? IX86_BUILTIN_GATHERALTSIV4DF : IX86_BUILTIN_GATHERDIV4DF;
+      break;
+    case V2DImode:
+      code = si ? IX86_BUILTIN_GATHERSIV2DI : IX86_BUILTIN_GATHERDIV2DI;
+      break;
+    case V4DImode:
+      code = si ? IX86_BUILTIN_GATHERALTSIV4DI : IX86_BUILTIN_GATHERDIV4DI;
+      break;
+    case V4SFmode:
+      code = si ? IX86_BUILTIN_GATHERSIV4SF : IX86_BUILTIN_GATHERDIV4SF;
+      break;
+    case V8SFmode:
+      code = si ? IX86_BUILTIN_GATHERSIV8SF : IX86_BUILTIN_GATHERALTDIV8SF;
+      break;
+    case V4SImode:
+      code = si ? IX86_BUILTIN_GATHERSIV4SI : IX86_BUILTIN_GATHERDIV4SI;
+      break;
+    case V8SImode:
+      code = si ? IX86_BUILTIN_GATHERSIV8SI : IX86_BUILTIN_GATHERALTDIV8SI;
+      break;
+    default:
+      return NULL_TREE;
+    }
+
+  return ix86_builtins[code];
+}
+
 /* Returns a code for a target-specific builtin that implements
    reciprocal of the function, or NULL_TREE if not available.  */
 
@@ -37642,6 +37792,9 @@ ix86_autovectorize_vector_sizes (void)
 #define TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION \
   ix86_builtin_vectorized_function
 
+#undef TARGET_VECTORIZE_BUILTIN_GATHER
+#define TARGET_VECTORIZE_BUILTIN_GATHER ix86_vectorize_builtin_gather
+
 #undef TARGET_BUILTIN_RECIPROCAL
 #define TARGET_BUILTIN_RECIPROCAL ix86_builtin_reciprocal
 
--- gcc/testsuite/gcc.target/i386/avx2-gather-1.c.jj	2011-11-04 08:53:13.000000000 +0100
+++ gcc/testsuite/gcc.target/i386/avx2-gather-1.c	2011-11-04 08:53:13.000000000 +0100
@@ -0,0 +1,215 @@
+/* { dg-do run } */
+/* { dg-require-effective-target avx2 } */
+/* { dg-options "-O3 -mavx2" } */
+
+#include "avx2-check.h"
+
+#define N 1024
+float vf1[N+16], vf2[N];
+double vd1[N+16], vd2[N];
+int k[N];
+long l[N];
+short n[N];
+
+__attribute__((noinline, noclone)) void
+f1 (void)
+{
+  int i;
+  for (i = 0; i < N; i++)
+    vf2[i] = vf1[k[i]];
+}
+
+__attribute__((noinline, noclone)) void
+f2 (void)
+{
+  int i;
+  for (i = 0; i < N; i++)
+    n[i] = (int) vf1[k[i]];
+}
+
+__attribute__((noinline, noclone)) void
+f3 (int x)
+{
+  int i;
+  for (i = 0; i < N; i++)
+    vf2[i] = vf1[k[i] + x];
+}
+
+__attribute__((noinline, noclone)) void
+f4 (int x)
+{
+  int i;
+  for (i = 0; i < N; i++)
+    n[i] = (int) vf1[k[i] + x];
+}
+
+__attribute__((noinline, noclone)) void
+f5 (void)
+{
+  int i;
+  for (i = 0; i < N; i++)
+    vd2[i] = vd1[k[i]];
+}
+
+__attribute__((noinline, noclone)) void
+f6 (void)
+{
+  int i;
+  for (i = 0; i < N; i++)
+    n[i] = (int) vd1[k[i]];
+}
+
+__attribute__((noinline, noclone)) void
+f7 (int x)
+{
+  int i;
+  for (i = 0; i < N; i++)
+    vd2[i] = vd1[k[i] + x];
+}
+
+__attribute__((noinline, noclone)) void
+f8 (int x)
+{
+  int i;
+  for (i = 0; i < N; i++)
+    n[i] = (int) vd1[k[i] + x];
+}
+
+__attribute__((noinline, noclone)) void
+f9 (void)
+{
+  int i;
+  for (i = 0; i < N; i++)
+    vf2[i] = vf1[l[i]];
+}
+
+__attribute__((noinline, noclone)) void
+f10 (void)
+{
+  int i;
+  for (i = 0; i < N; i++)
+    n[i] = (int) vf1[l[i]];
+}
+
+__attribute__((noinline, noclone)) void
+f11 (long x)
+{
+  int i;
+  for (i = 0; i < N; i++)
+    vf2[i] = vf1[l[i] + x];
+}
+
+__attribute__((noinline, noclone)) void
+f12 (long x)
+{
+  int i;
+  for (i = 0; i < N; i++)
+    n[i] = (int) vf1[l[i] + x];
+}
+
+__attribute__((noinline, noclone)) void
+f13 (void)
+{
+  int i;
+  for (i = 0; i < N; i++)
+    vd2[i] = vd1[l[i]];
+}
+
+__attribute__((noinline, noclone)) void
+f14 (void)
+{
+  int i;
+  for (i = 0; i < N; i++)
+    n[i] = (int) vd1[l[i]];
+}
+
+__attribute__((noinline, noclone)) void
+f15 (long x)
+{
+  int i;
+  for (i = 0; i < N; i++)
+    vd2[i] = vd1[l[i] + x];
+}
+
+__attribute__((noinline, noclone)) void
+f16 (long x)
+{
+  int i;
+  for (i = 0; i < N; i++)
+    n[i] = (int) vd1[l[i] + x];
+}
+
+static void
+avx2_test (void)
+{
+  int i;
+
+  for (i = 0; i < N + 16; i++)
+    {
+      asm ("");
+      vf1[i] = 17.0f + i;
+      vd1[i] = 19.0 + i;
+    }
+  for (i = 0; i < N; i++)
+    {
+      asm ("");
+      k[i] = (i * 731) & (N - 1);
+      l[i] = (i * 657) & (N - 1);
+    }
+
+  f1 ();
+  f2 ();
+  for (i = 0; i < N; i++)
+    if (vf2[i] != ((i * 731) & (N - 1)) + 17
+	|| n[i] != ((i * 731) & (N - 1)) + 17)
+      abort ();
+
+  f3 (12);
+  f4 (14);
+  for (i = 0; i < N; i++)
+    if (vf2[i] != ((i * 731) & (N - 1)) + 17 + 12
+	|| n[i] != ((i * 731) & (N - 1)) + 17 + 14)
+      abort ();
+
+  f5 ();
+  f6 ();
+  for (i = 0; i < N; i++)
+    if (vd2[i] != ((i * 731) & (N - 1)) + 19
+	|| n[i] != ((i * 731) & (N - 1)) + 19)
+      abort ();
+
+  f7 (7);
+  f8 (9);
+  for (i = 0; i < N; i++)
+    if (vd2[i] != ((i * 731) & (N - 1)) + 19 + 7
+	|| n[i] != ((i * 731) & (N - 1)) + 19 + 9)
+      abort ();
+
+  f9 ();
+  f10 ();
+  for (i = 0; i < N; i++)
+    if (vf2[i] != ((i * 657) & (N - 1)) + 17
+	|| n[i] != ((i * 657) & (N - 1)) + 17)
+      abort ();
+
+  f11 (2);
+  f12 (4);
+  for (i = 0; i < N; i++)
+    if (vf2[i] != ((i * 657) & (N - 1)) + 17 + 2
+	|| n[i] != ((i * 657) & (N - 1)) + 17 + 4)
+      abort ();
+
+  f13 ();
+  f14 ();
+  for (i = 0; i < N; i++)
+    if (vd2[i] != ((i * 657) & (N - 1)) + 19
+	|| n[i] != ((i * 657) & (N - 1)) + 19)
+      abort ();
+
+  f15 (13);
+  f16 (15);
+  for (i = 0; i < N; i++)
+    if (vd2[i] != ((i * 657) & (N - 1)) + 19 + 13
+	|| n[i] != ((i * 657) & (N - 1)) + 19 + 15)
+      abort ();
+}
--- gcc/testsuite/gcc.target/i386/avx2-gather-2.c.jj	2011-11-04 08:53:13.000000000 +0100
+++ gcc/testsuite/gcc.target/i386/avx2-gather-2.c	2011-11-04 08:53:13.000000000 +0100
@@ -0,0 +1,7 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 -mavx2 -fdump-tree-vect-details" } */
+
+#include "avx2-gather-1.c"
+
+/* { dg-final { scan-tree-dump-times "note: vectorized 1 loops in function" 16 "vect" } } */
+/* { dg-final { cleanup-tree-dump "vect" } } */
--- gcc/testsuite/gcc.target/i386/avx2-gather-3.c.jj	2011-11-04 08:53:13.000000000 +0100
+++ gcc/testsuite/gcc.target/i386/avx2-gather-3.c	2011-11-04 08:53:13.000000000 +0100
@@ -0,0 +1,167 @@
+/* { dg-do run } */
+/* { dg-require-effective-target avx2 } */
+/* { dg-options "-O3 -mavx2 -ffast-math" } */
+
+#include "avx2-check.h"
+
+#define N 1024
+float f[N];
+double d[N];
+int k[N];
+float *l[N];
+double *n[N];
+int **m[N];
+long **o[N];
+long q[N];
+long *r[N];
+int *s[N];
+
+__attribute__((noinline, noclone)) float
+f1 (void)
+{
+  int i;
+  float g = 0.0;
+  for (i = 0; i < N / 2; i++)
+    g += f[k[i]];
+  return g;
+}
+
+__attribute__((noinline, noclone)) float
+f2 (float *p)
+{
+  int i;
+  float g = 0.0;
+  for (i = 0; i < N / 2; i++)
+    g += p[k[i]];
+  return g;
+}
+
+__attribute__((noinline, noclone)) float
+f3 (void)
+{
+  int i;
+  float g = 0.0;
+  for (i = 0; i < N / 2; i++)
+    g += *l[i];
+  return g;
+}
+
+__attribute__((noinline, noclone)) int
+f4 (void)
+{
+  int i;
+  int g = 0;
+  for (i = 0; i < N / 2; i++)
+    g += **m[i];
+  return g;
+}
+
+__attribute__((noinline, noclone)) double
+f5 (void)
+{
+  int i;
+  double g = 0.0;
+  for (i = 0; i < N / 2; i++)
+    g += d[k[i]];
+  return g;
+}
+
+__attribute__((noinline, noclone)) double
+f6 (double *p)
+{
+  int i;
+  double g = 0.0;
+  for (i = 0; i < N / 2; i++)
+    g += p[k[i]];
+  return g;
+}
+
+__attribute__((noinline, noclone)) double
+f7 (void)
+{
+  int i;
+  double g = 0.0;
+  for (i = 0; i < N / 2; i++)
+    g += *n[i];
+  return g;
+}
+
+__attribute__((noinline, noclone)) int
+f8 (void)
+{
+  int i;
+  int g = 0;
+  for (i = 0; i < N / 2; i++)
+    g += **o[i];
+  return g;
+}
+
+__attribute__((noinline, noclone)) float
+f9 (void)
+{
+  int i;
+  float g = 0.0;
+  for (i = 0; i < N / 2; i++)
+    g += f[q[i]];
+  return g;
+}
+
+__attribute__((noinline, noclone)) float
+f10 (float *p)
+{
+  int i;
+  float g = 0.0;
+  for (i = 0; i < N / 2; i++)
+    g += p[q[i]];
+  return g;
+}
+
+__attribute__((noinline, noclone)) double
+f11 (void)
+{
+  int i;
+  double g = 0.0;
+  for (i = 0; i < N / 2; i++)
+    g += d[q[i]];
+  return g;
+}
+
+__attribute__((noinline, noclone)) double
+f12 (double *p)
+{
+  int i;
+  double g = 0.0;
+  for (i = 0; i < N / 2; i++)
+    g += p[q[i]];
+  return g;
+}
+
+static void
+avx2_test (void)
+{
+  int i;
+
+  for (i = 0; i < N; i++)
+    {
+      asm ("");
+      f[i] = -256.0f + i;
+      d[i] = -258.0 + i;
+      k[i] = (i * 731) & (N - 1);
+      q[i] = (i * 657) & (N - 1);
+      l[i] = &f[(i * 239) & (N - 1)];
+      n[i] = &d[(i * 271) & (N - 1)];
+      r[i] = &q[(i * 323) & (N - 1)];
+      s[i] = &k[(i * 565) & (N - 1)];
+      m[i] = &s[(i * 13) & (N - 1)];
+      o[i] = &r[(i * 19) & (N - 1)];
+    }
+
+  if (f1 () != 136448.0f || f2 (f) != 136448.0f || f3 () != 130304.0)
+    abort ();
+  if (f4 () != 261376 || f5 () != 135424.0 || f6 (d) != 135424.0)
+    abort ();
+  if (f7 () != 129280.0 || f8 () != 259840L || f9 () != 130816.0f)
+    abort ();
+  if (f10 (f) != 130816.0f || f11 () != 129792.0 || f12 (d) != 129792.0)
+    abort ();
+}
--- gcc/testsuite/gcc.target/i386/avx2-gather-4.c.jj	2011-11-04 08:54:11.000000000 +0100
+++ gcc/testsuite/gcc.target/i386/avx2-gather-4.c	2011-11-04 08:54:11.000000000 +0100
@@ -0,0 +1,38 @@
+/* { dg-do run } */
+/* { dg-require-effective-target avx2 } */
+/* { dg-options "-O3 -mavx2" } */
+
+#include "avx2-check.h"
+
+#define N 1024
+int a[N], b[N], c[N], d[N];
+
+__attribute__((noinline, noclone)) void
+foo (float *__restrict p, float *__restrict q, float *__restrict r,
+     long s1, long s2, long s3)
+{
+  int i;
+  for (i = 0; i < N; i++)
+    p[i] = q[a[i] * s1 + b[i] * s2 + s3] * r[c[i] * s1 + d[i] * s2 + s3];
+}
+
+static void
+avx2_test (void)
+{
+  int i;
+  float e[N], f[N], g[N];
+  for (i = 0; i < N; i++)
+    {
+      a[i] = (i * 7) & (N / 8 - 1);
+      b[i] = (i * 13) & (N / 8 - 1);
+      c[i] = (i * 23) & (N / 8 - 1);
+      d[i] = (i * 5) & (N / 8 - 1);
+      e[i] = 16.5 + i;
+      f[i] = 127.5 - i;
+    }
+  foo (g, e, f, 3, 2, 4);
+  for (i = 0; i < N; i++)
+    if (g[i] != (float) ((20.5 + a[i] * 3 + b[i] * 2)
+			 * (123.5 - c[i] * 3 - d[i] * 2)))
+      abort ();
+}
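
For illustration only (this is not part of the patch nor literal compiler
output), a hand-written intrinsic-level sketch of roughly what one vector
iteration of f1 from avx2-gather-1.c above does once the gather path is
taken; the function and variable names here are made up:

#include <immintrin.h>

/* Load eight int indices, gather eight floats with scale 4, store them;
   roughly one 8-wide iteration of f1.  */
static void
f1_one_iter (float *vf2, const float *vf1, const int *k, int i)
{
  __m256i idx = _mm256_loadu_si256 ((const __m256i *) &k[i]);
  __m256 val = _mm256_i32gather_ps (vf1, idx, 4);
  _mm256_storeu_ps (&vf2[i], val);
}

The vectorizer calls the internal __builtin_ia32_gather* builtins directly
rather than these intrinsics, but the addressing form (scalar base, vector
of 32-bit indices, scale 4) is the same.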


	Jakub

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Patch ping
  2011-11-04 11:44     ` Richard Guenther
  2011-11-04 13:32       ` [PATCH] Gather vectorization (PR tree-optimization/50789, take 2) Jakub Jelinek
@ 2011-11-04 14:09       ` Michael Matz
  1 sibling, 0 replies; 8+ messages in thread
From: Michael Matz @ 2011-11-04 14:09 UTC (permalink / raw)
  To: Richard Guenther; +Cc: Jakub Jelinek, Ira Rosen, Richard Henderson, gcc-patches

Hi,

On Fri, 4 Nov 2011, Richard Guenther wrote:

> > With the approach the incremental patch takes I just walk what
> > split_constant_offset walks during DR analysis and peel off loop invariants
> > until I have something that should be used as the vectorized index.
> 
> It looks like split_constant_offset walks def stmts in an unbounded
> fashion.  That's surely a bad idea - SCEV should already have expanded 
> everything non-loop-invariant, thus it should at most look through DEFs 
> that trivially add to the constant offset, not through others.

split_constant_offset is also used for canonicalization, to increase the 
chances of finding the same base in two data refs so that offset-based 
disambiguation can be used.  For that it sometimes also has to look through 
loop-invariant parts.
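
A made-up example (only an illustration, not from the patch) of the kind
of case I mean:

  void
  foo (float *p, long n, int m)
  {
    long n1 = n + 1;
    int i;
    for (i = 0; i < m; i++)
      p[i + n] = p[i + n1] * 2.0f;
  }

Only by also looking through the loop-invariant n1 = n + 1 do the two
references end up with the same base (p plus 4 * (i + n)) and constant
offsets 0 and 4, which is what the offset-based disambiguation needs.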


Ciao,
Michael.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] Gather vectorization (PR tree-optimization/50789, take 2)
  2011-11-04 13:32       ` [PATCH] Gather vectorization (PR tree-optimization/50789, take 2) Jakub Jelinek
@ 2011-11-04 23:54         ` Richard Guenther
  2011-11-07 15:49         ` Uros Bizjak
  1 sibling, 0 replies; 8+ messages in thread
From: Richard Guenther @ 2011-11-04 23:54 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Richard Guenther, Ira Rosen, Uros Bizjak, Richard Henderson, gcc-patches

On Fri, Nov 4, 2011 at 2:22 PM, Jakub Jelinek <jakub@redhat.com> wrote:
> Hi!
>
> On Fri, Nov 04, 2011 at 12:21:49PM +0100, Richard Guenther wrote:
>> Ok.  I guess it's ok to use builtins for now - I didn't think of
>> the memory reference issue ;)
>
> Based on IRC discussion I'm posting an updated patch (the former base
> and incremental patches combined into one).
>
> I'm now using expr_invariant_in_loop_p instead of chrec_contains*,
> which also nicely handles the is_gimple_min_invariant case,
> and I've added several comments and fixed the MEM_REF offset
> folding.  Smoke tested on the *gather* testcases, will do full
> bootstrap/regtest soon.
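
To check that I read the decomposition right on a concrete testcase (f11
from avx2-gather-1.c, i.e. vf2[i] = vf1[l[i] + x] with a long index array
l and a loop-invariant long x), it should roughly come out as

  BASE  = &vf1 + 4 * x               /* loop invariant, gimplified on the
                                        preheader edge  */
  OFF   = the SSA name holding l[i]  /* defined in the loop, vectorized  */
  SCALE = 4

i.e. the invariant addend x is peeled into BASE and only the part that
varies per iteration is left for the gather index vector.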

Ok for the vectorizer pieces, I'll defer to x86 maintainers for the target bits.

Thanks,
Richard.

> 2011-11-04  Jakub Jelinek  <jakub@redhat.com>
>
>        PR tree-optimization/50789
>        * tree-vect-stmts.c (process_use): Add force argument, avoid
>        exist_non_indexing_operands_for_use_p check if true.
>        (vect_mark_stmts_to_be_vectorized): Adjust callers.  Handle
>        STMT_VINFO_GATHER_P.
>        (gen_perm_mask): New function.
>        (perm_mask_for_reverse): Use it.
>        (reverse_vec_element): Rename to...
>        (permute_vec_elements): ... this.  Add Y and MASK_VEC arguments,
>        generalize for any permutations.
>        (vectorizable_load): Adjust caller.  Handle STMT_VINFO_GATHER_P.
>        * target.def (TARGET_VECTORIZE_BUILTIN_GATHER): New hook.
>        * doc/tm.texi.in (TARGET_VECTORIZE_BUILTIN_GATHER): Document it.
>        * doc/tm.texi: Regenerate.
>        * tree-data-ref.c (initialize_data_dependence_relation,
>        compute_self_dependence): No longer static.
>        * tree-data-ref.h (initialize_data_dependence_relation,
>        compute_self_dependence): New prototypes.
>        * tree-vect-data-refs.c (vect_check_gather): New function.
>        (vect_analyze_data_refs): Detect possible gather load data
>        refs.
>        * tree-vectorizer.h (struct _stmt_vec_info): Add gather_p field.
>        (STMT_VINFO_GATHER_P): Define.
>        (vect_check_gather): New prototype.
>        * config/i386/i386-builtin-types.def: Add types for alternate
>        gather builtins.
>        * config/i386/sse.md (AVXMODE48P_DI): Remove.
>        (VEC_GATHER_MODE): Rename mode_attr to...
>        (VEC_GATHER_IDXSI): ... this.
>        (VEC_GATHER_IDXDI, VEC_GATHER_SRCDI): New mode_attrs.
>        (avx2_gathersi<mode>, *avx2_gathersi<mode>): Use <VEC_GATHER_IDXSI>
>        instead of <VEC_GATHER_MODE>.
>        (avx2_gatherdi<mode>): Use <VEC_GATHER_IDXDI> instead of
>        <AVXMODE48P_DI> and <VEC_GATHER_SRCDI> instead of VEC_GATHER_MODE
>        on src and mask operands.
>        (*avx2_gatherdi<mode>): Likewise.  Use VEC_GATHER_MODE iterator
>        instead of AVXMODE48P_DI.
>        (avx2_gatherdi<mode>256, *avx2_gatherdi<mode>256): Removed.
>        * config/i386/i386.c (enum ix86_builtins): Add
>        IX86_BUILTIN_GATHERALTSIV4DF, IX86_BUILTIN_GATHERALTDIV8SF,
>        IX86_BUILTIN_GATHERALTSIV4DI and IX86_BUILTIN_GATHERALTDIV8SI.
>        (ix86_init_mmx_sse_builtins): Create those builtins.
>        (ix86_expand_builtin): Handle those builtins and adjust expansions
>        of other gather builtins.
>        (ix86_vectorize_builtin_gather): New function.
>        (TARGET_VECTORIZE_BUILTIN_GATHER): Define.
>
>        * gcc.target/i386/avx2-gather-1.c: New test.
>        * gcc.target/i386/avx2-gather-2.c: New test.
>        * gcc.target/i386/avx2-gather-3.c: New test.
>        * gcc.target/i386/avx2-gather-4.c: New test.
>
> --- gcc/tree-vect-stmts.c.jj    2011-11-04 08:52:19.000000000 +0100
> +++ gcc/tree-vect-stmts.c       2011-11-04 08:54:11.000000000 +0100
> @@ -332,6 +332,8 @@ exist_non_indexing_operands_for_use_p (t
>    - LIVE_P, RELEVANT - enum values to be set in the STMT_VINFO of the stmt
>      that defined USE.  This is done by calling mark_relevant and passing it
>      the WORKLIST (to add DEF_STMT to the WORKLIST in case it is relevant).
> +   - FORCE is true if exist_non_indexing_operands_for_use_p check shouldn't
> +     be performed.
>
>    Outputs:
>    Generally, LIVE_P and RELEVANT are used to define the liveness and
> @@ -351,7 +353,8 @@ exist_non_indexing_operands_for_use_p (t
>
>  static bool
>  process_use (gimple stmt, tree use, loop_vec_info loop_vinfo, bool live_p,
> -            enum vect_relevant relevant, VEC(gimple,heap) **worklist)
> +            enum vect_relevant relevant, VEC(gimple,heap) **worklist,
> +            bool force)
>  {
>   struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
>   stmt_vec_info stmt_vinfo = vinfo_for_stmt (stmt);
> @@ -363,7 +366,7 @@ process_use (gimple stmt, tree use, loop
>
>   /* case 1: we are only interested in uses that need to be vectorized.  Uses
>      that are used for address computation are not considered relevant.  */
> -  if (!exist_non_indexing_operands_for_use_p (use, stmt))
> +  if (!force && !exist_non_indexing_operands_for_use_p (use, stmt))
>      return true;
>
>   if (!vect_is_simple_use (use, loop_vinfo, NULL, &def_stmt, &def, &dt))
> @@ -646,7 +649,7 @@ vect_mark_stmts_to_be_vectorized (loop_v
>             break;
>         }
>
> -      if (is_pattern_stmt_p (vinfo_for_stmt (stmt)))
> +      if (is_pattern_stmt_p (stmt_vinfo))
>         {
>           /* Pattern statements are not inserted into the code, so
>              FOR_EACH_PHI_OR_STMT_USE optimizes their operands out, and we
> @@ -660,9 +663,9 @@ vect_mark_stmts_to_be_vectorized (loop_v
>              if (rhs_code == COND_EXPR && COMPARISON_CLASS_P (op))
>                {
>                  if (!process_use (stmt, TREE_OPERAND (op, 0), loop_vinfo,
> -                                   live_p, relevant, &worklist)
> +                                   live_p, relevant, &worklist, false)
>                      || !process_use (stmt, TREE_OPERAND (op, 1), loop_vinfo,
> -                                      live_p, relevant, &worklist))
> +                                      live_p, relevant, &worklist, false))
>                    {
>                      VEC_free (gimple, heap, worklist);
>                      return false;
> @@ -673,7 +676,7 @@ vect_mark_stmts_to_be_vectorized (loop_v
>                 {
>                  op = gimple_op (stmt, i);
>                   if (!process_use (stmt, op, loop_vinfo, live_p, relevant,
> -                                    &worklist))
> +                                   &worklist, false))
>                     {
>                       VEC_free (gimple, heap, worklist);
>                       return false;
> @@ -686,7 +689,7 @@ vect_mark_stmts_to_be_vectorized (loop_v
>                 {
>                   tree arg = gimple_call_arg (stmt, i);
>                   if (!process_use (stmt, arg, loop_vinfo, live_p, relevant,
> -                                    &worklist))
> +                                   &worklist, false))
>                     {
>                       VEC_free (gimple, heap, worklist);
>                       return false;
> @@ -699,12 +702,25 @@ vect_mark_stmts_to_be_vectorized (loop_v
>           {
>             tree op = USE_FROM_PTR (use_p);
>             if (!process_use (stmt, op, loop_vinfo, live_p, relevant,
> -                              &worklist))
> +                             &worklist, false))
>               {
>                 VEC_free (gimple, heap, worklist);
>                 return false;
>               }
>           }
> +
> +      if (STMT_VINFO_GATHER_P (stmt_vinfo))
> +       {
> +         tree off;
> +         tree decl = vect_check_gather (stmt, loop_vinfo, NULL, &off, NULL);
> +         gcc_assert (decl);
> +         if (!process_use (stmt, off, loop_vinfo, live_p, relevant,
> +                           &worklist, true))
> +           {
> +             VEC_free (gimple, heap, worklist);
> +             return false;
> +           }
> +       }
>     } /* while worklist */
>
>   VEC_free (gimple, heap, worklist);
> @@ -4142,23 +4158,17 @@ vectorizable_store (gimple stmt, gimple_
>   return true;
>  }
>
> -/* Given a vector type VECTYPE returns a builtin DECL to be used
> -   for vector permutation and returns the mask that implements
> -   reversal of the vector elements.  If that is impossible to do,
> -   returns NULL.  */
> +/* Given a vector type VECTYPE and permutation SEL returns
> +   the VECTOR_CST mask that implements the permutation of the
> +   vector elements.  If that is impossible to do, returns NULL.  */
>
>  static tree
> -perm_mask_for_reverse (tree vectype)
> +gen_perm_mask (tree vectype, unsigned char *sel)
>  {
>   tree mask_elt_type, mask_type, mask_vec;
>   int i, nunits;
> -  unsigned char *sel;
>
>   nunits = TYPE_VECTOR_SUBPARTS (vectype);
> -  sel = XALLOCAVEC (unsigned char, nunits);
> -
> -  for (i = 0; i < nunits; ++i)
> -    sel[i] = nunits - 1 - i;
>
>   if (!can_vec_perm_p (TYPE_MODE (vectype), false, sel))
>     return NULL;
> @@ -4169,33 +4179,52 @@ perm_mask_for_reverse (tree vectype)
>   mask_type = get_vectype_for_scalar_type (mask_elt_type);
>
>   mask_vec = NULL;
> -  for (i = 0; i < nunits; i++)
> -    mask_vec = tree_cons (NULL, build_int_cst (mask_elt_type, i), mask_vec);
> +  for (i = nunits - 1; i >= 0; i--)
> +    mask_vec = tree_cons (NULL, build_int_cst (mask_elt_type, sel[i]),
> +                         mask_vec);
>   mask_vec = build_vector (mask_type, mask_vec);
>
>   return mask_vec;
>  }
>
> -/* Given a vector variable X, that was generated for the scalar LHS of
> -   STMT, generate instructions to reverse the vector elements of X,
> -   insert them a *GSI and return the permuted vector variable.  */
> +/* Given a vector type VECTYPE returns the VECTOR_CST mask that implements
> +   reversal of the vector elements.  If that is impossible to do,
> +   returns NULL.  */
>
>  static tree
> -reverse_vec_elements (tree x, gimple stmt, gimple_stmt_iterator *gsi)
> +perm_mask_for_reverse (tree vectype)
> +{
> +  int i, nunits;
> +  unsigned char *sel;
> +
> +  nunits = TYPE_VECTOR_SUBPARTS (vectype);
> +  sel = XALLOCAVEC (unsigned char, nunits);
> +
> +  for (i = 0; i < nunits; ++i)
> +    sel[i] = nunits - 1 - i;
> +
> +  return gen_perm_mask (vectype, sel);
> +}
> +
> +/* Given a vector variable X and Y, that was generated for the scalar
> +   STMT, generate instructions to permute the vector elements of X and Y
> +   using permutation mask MASK_VEC, insert them at *GSI and return the
> +   permuted vector variable.  */
> +
> +static tree
> +permute_vec_elements (tree x, tree y, tree mask_vec, gimple stmt,
> +                     gimple_stmt_iterator *gsi)
>  {
>   tree vectype = TREE_TYPE (x);
> -  tree mask_vec, perm_dest, data_ref;
> +  tree perm_dest, data_ref;
>   gimple perm_stmt;
>
> -  mask_vec = perm_mask_for_reverse (vectype);
> -
>   perm_dest = vect_create_destination_var (gimple_assign_lhs (stmt), vectype);
> +  data_ref = make_ssa_name (perm_dest, NULL);
>
>   /* Generate the permute statement.  */
> -  perm_stmt = gimple_build_assign_with_ops3 (VEC_PERM_EXPR, perm_dest,
> -                                            x, x, mask_vec);
> -  data_ref = make_ssa_name (perm_dest, perm_stmt);
> -  gimple_set_lhs (perm_stmt, data_ref);
> +  perm_stmt = gimple_build_assign_with_ops3 (VEC_PERM_EXPR, data_ref,
> +                                            x, y, mask_vec);
>   vect_finish_stmt_generation (stmt, perm_stmt, gsi);
>
>   return data_ref;
> @@ -4254,6 +4283,10 @@ vectorizable_load (gimple stmt, gimple_s
>   bb_vec_info bb_vinfo = STMT_VINFO_BB_VINFO (stmt_info);
>   int vf;
>   tree aggr_type;
> +  tree gather_base = NULL_TREE, gather_off = NULL_TREE;
> +  tree gather_off_vectype = NULL_TREE, gather_decl = NULL_TREE;
> +  int gather_scale = 1;
> +  enum vect_def_type gather_dt = vect_unknown_def_type;
>
>   if (loop_vinfo)
>     {
> @@ -4334,7 +4367,7 @@ vectorizable_load (gimple stmt, gimple_s
>     {
>       strided_load = true;
>       /* FORNOW */
> -      gcc_assert (! nested_in_vect_loop);
> +      gcc_assert (! nested_in_vect_loop && !STMT_VINFO_GATHER_P (stmt_info));
>
>       first_stmt = GROUP_FIRST_ELEMENT (stmt_info);
>       if (!slp && !PURE_SLP_STMT (stmt_info))
> @@ -4349,7 +4382,7 @@ vectorizable_load (gimple stmt, gimple_s
>
>   if (negative)
>     {
> -      gcc_assert (!strided_load);
> +      gcc_assert (!strided_load && !STMT_VINFO_GATHER_P (stmt_info));
>       alignment_support_scheme = vect_supportable_dr_alignment (dr, false);
>       if (alignment_support_scheme != dr_aligned
>          && alignment_support_scheme != dr_unaligned_supported)
> @@ -4366,6 +4399,23 @@ vectorizable_load (gimple stmt, gimple_s
>        }
>     }
>
> +  if (STMT_VINFO_GATHER_P (stmt_info))
> +    {
> +      gimple def_stmt;
> +      tree def;
> +      gather_decl = vect_check_gather (stmt, loop_vinfo, &gather_base,
> +                                      &gather_off, &gather_scale);
> +      gcc_assert (gather_decl);
> +      if (!vect_is_simple_use_1 (gather_off, loop_vinfo, bb_vinfo,
> +                                &def_stmt, &def, &gather_dt,
> +                                &gather_off_vectype))
> +       {
> +         if (vect_print_dump_info (REPORT_DETAILS))
> +           fprintf (vect_dump, "gather index use not simple.");
> +         return false;
> +       }
> +    }
> +
>   if (!vec_stmt) /* transformation not required.  */
>     {
>       STMT_VINFO_TYPE (stmt_info) = load_vec_info_type;
> @@ -4378,6 +4428,161 @@ vectorizable_load (gimple stmt, gimple_s
>
>   /** Transform.  **/
>
> +  if (STMT_VINFO_GATHER_P (stmt_info))
> +    {
> +      tree vec_oprnd0 = NULL_TREE, op;
> +      tree arglist = TYPE_ARG_TYPES (TREE_TYPE (gather_decl));
> +      tree rettype, srctype, ptrtype, idxtype, masktype, scaletype;
> +      tree ptr, mask, var, scale, perm_mask = NULL_TREE, prev_res = NULL_TREE;
> +      edge pe = loop_preheader_edge (loop);
> +      gimple_seq seq;
> +      basic_block new_bb;
> +      enum { NARROW, NONE, WIDEN } modifier;
> +      int gather_off_nunits = TYPE_VECTOR_SUBPARTS (gather_off_vectype);
> +
> +      if (nunits == gather_off_nunits)
> +       modifier = NONE;
> +      else if (nunits == gather_off_nunits / 2)
> +       {
> +         unsigned char *sel = XALLOCAVEC (unsigned char, gather_off_nunits);
> +         modifier = WIDEN;
> +
> +         for (i = 0; i < gather_off_nunits; ++i)
> +           sel[i] = i | nunits;
> +
> +         perm_mask = gen_perm_mask (gather_off_vectype, sel);
> +         gcc_assert (perm_mask != NULL_TREE);
> +       }
> +      else if (nunits == gather_off_nunits * 2)
> +       {
> +         unsigned char *sel = XALLOCAVEC (unsigned char, nunits);
> +         modifier = NARROW;
> +
> +         for (i = 0; i < nunits; ++i)
> +           sel[i] = i < gather_off_nunits
> +                    ? i : i + nunits - gather_off_nunits;
> +
> +         perm_mask = gen_perm_mask (vectype, sel);
> +         gcc_assert (perm_mask != NULL_TREE);
> +         ncopies *= 2;
> +       }
> +      else
> +       gcc_unreachable ();
> +
> +      rettype = TREE_TYPE (TREE_TYPE (gather_decl));
> +      srctype = TREE_VALUE (arglist); arglist = TREE_CHAIN (arglist);
> +      ptrtype = TREE_VALUE (arglist); arglist = TREE_CHAIN (arglist);
> +      idxtype = TREE_VALUE (arglist); arglist = TREE_CHAIN (arglist);
> +      masktype = TREE_VALUE (arglist); arglist = TREE_CHAIN (arglist);
> +      scaletype = TREE_VALUE (arglist);
> +      gcc_checking_assert (types_compatible_p (srctype, rettype)
> +                          && types_compatible_p (srctype, masktype));
> +
> +      vec_dest = vect_create_destination_var (scalar_dest, vectype);
> +
> +      ptr = fold_convert (ptrtype, gather_base);
> +      if (!is_gimple_min_invariant (ptr))
> +       {
> +         ptr = force_gimple_operand (ptr, &seq, true, NULL_TREE);
> +         new_bb = gsi_insert_seq_on_edge_immediate (pe, seq);
> +         gcc_assert (!new_bb);
> +       }
> +
> +      /* Currently we support only unconditional gather loads,
> +        so mask should be all ones.  */
> +      if (TREE_CODE (TREE_TYPE (masktype)) == INTEGER_TYPE)
> +       mask = build_int_cst (TREE_TYPE (masktype), -1);
> +      else if (SCALAR_FLOAT_TYPE_P (TREE_TYPE (masktype)))
> +       {
> +         REAL_VALUE_TYPE r;
> +         long tmp[6];
> +         for (j = 0; j < 6; ++j)
> +           tmp[j] = -1;
> +         real_from_target (&r, tmp, TYPE_MODE (TREE_TYPE (masktype)));
> +         mask = build_real (TREE_TYPE (masktype), r);
> +       }
> +      else
> +       gcc_unreachable ();
> +      mask = build_vector_from_val (masktype, mask);
> +      mask = vect_init_vector (stmt, mask, masktype, NULL);
> +
> +      scale = build_int_cst (scaletype, gather_scale);
> +
> +      prev_stmt_info = NULL;
> +      for (j = 0; j < ncopies; ++j)
> +       {
> +         if (modifier == WIDEN && (j & 1))
> +           op = permute_vec_elements (vec_oprnd0, vec_oprnd0,
> +                                      perm_mask, stmt, gsi);
> +         else if (j == 0)
> +           op = vec_oprnd0
> +             = vect_get_vec_def_for_operand (gather_off, stmt, NULL);
> +         else
> +           op = vec_oprnd0
> +             = vect_get_vec_def_for_stmt_copy (gather_dt, vec_oprnd0);
> +
> +         if (!useless_type_conversion_p (idxtype, TREE_TYPE (op)))
> +           {
> +             gcc_assert (TYPE_VECTOR_SUBPARTS (TREE_TYPE (op))
> +                         == TYPE_VECTOR_SUBPARTS (idxtype));
> +             var = vect_get_new_vect_var (idxtype, vect_simple_var, NULL);
> +             add_referenced_var (var);
> +             var = make_ssa_name (var, NULL);
> +             op = build1 (VIEW_CONVERT_EXPR, idxtype, op);
> +             new_stmt
> +               = gimple_build_assign_with_ops (VIEW_CONVERT_EXPR, var,
> +                                               op, NULL_TREE);
> +             vect_finish_stmt_generation (stmt, new_stmt, gsi);
> +             op = var;
> +           }
> +
> +         new_stmt
> +           = gimple_build_call (gather_decl, 5, mask, ptr, op, mask, scale);
> +
> +         if (!useless_type_conversion_p (vectype, rettype))
> +           {
> +             gcc_assert (TYPE_VECTOR_SUBPARTS (vectype)
> +                         == TYPE_VECTOR_SUBPARTS (rettype));
> +             var = vect_get_new_vect_var (rettype, vect_simple_var, NULL);
> +             add_referenced_var (var);
> +             op = make_ssa_name (var, new_stmt);
> +             gimple_call_set_lhs (new_stmt, op);
> +             vect_finish_stmt_generation (stmt, new_stmt, gsi);
> +             var = make_ssa_name (vec_dest, NULL);
> +             op = build1 (VIEW_CONVERT_EXPR, vectype, op);
> +             new_stmt
> +               = gimple_build_assign_with_ops (VIEW_CONVERT_EXPR, var, op,
> +                                               NULL_TREE);
> +           }
> +         else
> +           {
> +             var = make_ssa_name (vec_dest, new_stmt);
> +             gimple_call_set_lhs (new_stmt, var);
> +           }
> +
> +         vect_finish_stmt_generation (stmt, new_stmt, gsi);
> +
> +         if (modifier == NARROW)
> +           {
> +             if ((j & 1) == 0)
> +               {
> +                 prev_res = var;
> +                 continue;
> +               }
> +             var = permute_vec_elements (prev_res, var,
> +                                         perm_mask, stmt, gsi);
> +             new_stmt = SSA_NAME_DEF_STMT (var);
> +           }
> +
> +         if (prev_stmt_info == NULL)
> +           STMT_VINFO_VEC_STMT (stmt_info) = *vec_stmt = new_stmt;
> +         else
> +           STMT_VINFO_RELATED_STMT (prev_stmt_info) = new_stmt;
> +         prev_stmt_info = vinfo_for_stmt (new_stmt);
> +       }
> +      return true;
> +    }
> +
>   if (strided_load)
>     {
>       first_stmt = GROUP_FIRST_ELEMENT (stmt_info);
> @@ -4769,7 +4974,9 @@ vectorizable_load (gimple stmt, gimple_s
>
>              if (negative)
>                {
> -                 new_temp = reverse_vec_elements (new_temp, stmt, gsi);
> +                 tree perm_mask = perm_mask_for_reverse (vectype);
> +                 new_temp = permute_vec_elements (new_temp, new_temp,
> +                                                  perm_mask, stmt, gsi);
>                  new_stmt = SSA_NAME_DEF_STMT (new_temp);
>                }
>
> --- gcc/target.def.jj   2011-11-04 08:52:19.000000000 +0100
> +++ gcc/target.def      2011-11-04 08:53:13.000000000 +0100
> @@ -1021,6 +1021,14 @@ DEFHOOK
>  (void),
>  default_autovectorize_vector_sizes)
>
> +/* Target builtin that implements vector gather operation.  */
> +DEFHOOK
> +(builtin_gather,
> + "",
> + tree,
> + (const_tree mem_vectype, const_tree index_type, int scale),
> + NULL)
> +
>  HOOK_VECTOR_END (vectorize)
>
>  #undef HOOK_PREFIX
> --- gcc/tree-data-ref.c.jj      2011-11-04 08:52:19.000000000 +0100
> +++ gcc/tree-data-ref.c 2011-11-04 08:53:13.000000000 +0100
> @@ -1351,13 +1351,11 @@ dr_may_alias_p (const struct data_refere
>   return refs_may_alias_p (addr_a, addr_b);
>  }
>
> -static void compute_self_dependence (struct data_dependence_relation *);
> -
>  /* Initialize a data dependence relation between data accesses A and
>    B.  NB_LOOPS is the number of loops surrounding the references: the
>    size of the classic distance/direction vectors.  */
>
> -static struct data_dependence_relation *
> +struct data_dependence_relation *
>  initialize_data_dependence_relation (struct data_reference *a,
>                                     struct data_reference *b,
>                                     VEC (loop_p, heap) *loop_nest)
> @@ -4121,7 +4119,7 @@ compute_affine_dependence (struct data_d
>  /* This computes the dependence relation for the same data
>    reference into DDR.  */
>
> -static void
> +void
>  compute_self_dependence (struct data_dependence_relation *ddr)
>  {
>   unsigned int i;
> --- gcc/tree-data-ref.h.jj      2011-11-04 08:52:19.000000000 +0100
> +++ gcc/tree-data-ref.h 2011-11-04 13:22:28.000000000 +0100
> @@ -1,5 +1,5 @@
>  /* Data references and dependences detectors.
> -   Copyright (C) 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010
> +   Copyright (C) 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011
>    Free Software Foundation, Inc.
>    Contributed by Sebastian Pop <pop@cri.ensmp.fr>
>
> @@ -423,6 +423,9 @@ extern bool graphite_find_data_reference
>                                                   VEC (data_reference_p, heap) **);
>  struct data_reference *create_data_ref (loop_p, loop_p, tree, gimple, bool);
>  extern bool find_loop_nest (struct loop *, VEC (loop_p, heap) **);
> +extern struct data_dependence_relation *initialize_data_dependence_relation
> +     (struct data_reference *, struct data_reference *, VEC (loop_p, heap) *);
> +extern void compute_self_dependence (struct data_dependence_relation *);
>  extern void compute_all_dependences (VEC (data_reference_p, heap) *,
>                                     VEC (ddr_p, heap) **, VEC (loop_p, heap) *,
>                                     bool);
> --- gcc/doc/tm.texi.in.jj       2011-11-04 08:52:19.000000000 +0100
> +++ gcc/doc/tm.texi.in  2011-11-04 08:53:13.000000000 +0100
> @@ -5696,6 +5696,14 @@ mode returned by @code{TARGET_VECTORIZE_
>  The default is zero which means to not iterate over other vector sizes.
>  @end deftypefn
>
> +@hook TARGET_VECTORIZE_BUILTIN_GATHER
> +Target builtin that implements vector gather operation.  @var{mem_vectype}
> +is the vector type of the load and @var{index_type} is the scalar type of
> +the index, scaled by @var{scale}.
> +The default is @code{NULL_TREE} which means to not vectorize gather
> +loads.
> +@end deftypefn
> +
>  @node Anchored Addresses
>  @section Anchored Addresses
>  @cindex anchored addresses
> --- gcc/doc/tm.texi.jj  2011-11-04 08:52:19.000000000 +0100
> +++ gcc/doc/tm.texi     2011-11-04 08:53:13.000000000 +0100
> @@ -5758,6 +5758,14 @@ mode returned by @code{TARGET_VECTORIZE_
>  The default is zero which means to not iterate over other vector sizes.
>  @end deftypefn
>
> +@deftypefn {Target Hook} tree TARGET_VECTORIZE_BUILTIN_GATHER (const_tree @var{mem_vectype}, const_tree @var{index_type}, int @var{scale})
> +Target builtin that implements vector gather operation.  @var{mem_vectype}
> +is the vector type of the load and @var{index_type} is the scalar type of
> +the index, scaled by @var{scale}.
> +The default is @code{NULL_TREE} which means to not vectorize gather
> +loads.
> +@end deftypefn
> +
>  @node Anchored Addresses
>  @section Anchored Addresses
>  @cindex anchored addresses
> --- gcc/tree-vect-data-refs.c.jj        2011-11-04 08:52:57.000000000 +0100
> +++ gcc/tree-vect-data-refs.c   2011-11-04 14:10:17.000000000 +0100
> @@ -2497,6 +2497,199 @@ vect_prune_runtime_alias_test_list (loop
>   return true;
>  }
>
> +/* Check whether a non-affine read in stmt is suitable for gather load
> +   and if so, return a builtin decl for that operation.  */
> +
> +tree
> +vect_check_gather (gimple stmt, loop_vec_info loop_vinfo, tree *basep,
> +                  tree *offp, int *scalep)
> +{
> +  HOST_WIDE_INT scale = 1, pbitpos, pbitsize;
> +  struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> +  stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
> +  struct data_reference *dr = STMT_VINFO_DATA_REF (stmt_info);
> +  tree offtype = NULL_TREE;
> +  tree decl, base, off;
> +  enum machine_mode pmode;
> +  int punsignedp, pvolatilep;
> +
> +  /* The gather builtins need address of the form
> +     loop_invariant + vector * {1, 2, 4, 8}
> +     or
> +     loop_invariant + sign_extend (vector) * { 1, 2, 4, 8 }.
> +     Unfortunately DR_BASE_ADDRESS/DR_OFFSET can be a mixture
> +     of loop invariants/SSA_NAMEs defined in the loop, with casts,
> +     multiplications and additions in it.  To get a vector, we need
> +     a single SSA_NAME that will be defined in the loop and will
> +     contain everything that is not loop invariant and that can be
> +     vectorized.  The following code attempts to find such a preexisting
> +     SSA_NAME OFF and put the loop invariants into a tree BASE
> +     that can be gimplified before the loop.  */
> +  base = get_inner_reference (DR_REF (dr), &pbitsize, &pbitpos, &off,
> +                             &pmode, &punsignedp, &pvolatilep, false);
> +  gcc_assert (base != NULL_TREE && (pbitpos % BITS_PER_UNIT) == 0);
> +
> +  if (TREE_CODE (base) == MEM_REF)
> +    {
> +      if (!integer_zerop (TREE_OPERAND (base, 1)))
> +       {
> +         if (off == NULL_TREE)
> +           {
> +             double_int moff = mem_ref_offset (base);
> +             off = double_int_to_tree (sizetype, moff);
> +           }
> +         else
> +           off = size_binop (PLUS_EXPR, off,
> +                             fold_convert (sizetype, TREE_OPERAND (base, 1)));
> +       }
> +      base = TREE_OPERAND (base, 0);
> +    }
> +  else
> +    base = build_fold_addr_expr (base);
> +
> +  if (off == NULL_TREE)
> +    off = size_zero_node;
> +
> +  /* If base is not loop invariant, then if off is 0 we start with just
> +     the constant offset in the loop invariant BASE and continue with base
> +     as OFF, otherwise we give up.
> +     We could handle that case by gimplifying the addition of base + off
> +     into some SSA_NAME and using that as off, but for now punt.  */
> +  if (!expr_invariant_in_loop_p (loop, base))
> +    {
> +      if (!integer_zerop (off))
> +       return NULL_TREE;
> +      off = base;
> +      base = size_int (pbitpos / BITS_PER_UNIT);
> +    }
> +  /* Otherwise put base + constant offset into the loop invariant BASE
> +     and continue with OFF.  */
> +  else
> +    {
> +      base = fold_convert (sizetype, base);
> +      base = size_binop (PLUS_EXPR, base, size_int (pbitpos / BITS_PER_UNIT));
> +    }
> +
> +  /* OFF at this point may be either a SSA_NAME or some tree expression
> +     from get_inner_reference.  Try to peel off loop invariants from it
> +     into BASE as long as possible.  */
> +  STRIP_NOPS (off);
> +  while (offtype == NULL_TREE)
> +    {
> +      enum tree_code code;
> +      tree op0, op1, add = NULL_TREE;
> +
> +      if (TREE_CODE (off) == SSA_NAME)
> +       {
> +         gimple def_stmt = SSA_NAME_DEF_STMT (off);
> +
> +         if (expr_invariant_in_loop_p (loop, off))
> +           return NULL_TREE;
> +
> +         if (gimple_code (def_stmt) != GIMPLE_ASSIGN)
> +           break;
> +
> +         op0 = gimple_assign_rhs1 (def_stmt);
> +         code = gimple_assign_rhs_code (def_stmt);
> +         op1 = gimple_assign_rhs2 (def_stmt);
> +       }
> +      else
> +       {
> +         if (get_gimple_rhs_class (TREE_CODE (off)) == GIMPLE_TERNARY_RHS)
> +           return NULL_TREE;
> +         code = TREE_CODE (off);
> +         extract_ops_from_tree (off, &code, &op0, &op1);
> +       }
> +      switch (code)
> +       {
> +       case POINTER_PLUS_EXPR:
> +       case PLUS_EXPR:
> +         if (expr_invariant_in_loop_p (loop, op0))
> +           {
> +             add = op0;
> +             off = op1;
> +           do_add:
> +             add = fold_convert (sizetype, add);
> +             if (scale != 1)
> +               add = size_binop (MULT_EXPR, add, size_int (scale));
> +             base = size_binop (PLUS_EXPR, base, add);
> +             continue;
> +           }
> +         if (expr_invariant_in_loop_p (loop, op1))
> +           {
> +             add = op1;
> +             off = op0;
> +             goto do_add;
> +           }
> +         break;
> +       case MINUS_EXPR:
> +         if (expr_invariant_in_loop_p (loop, op1))
> +           {
> +             add = fold_convert (sizetype, op1);
> +             add = size_binop (MINUS_EXPR, size_zero_node, add);
> +             off = op0;
> +             goto do_add;
> +           }
> +         break;
> +       case MULT_EXPR:
> +         if (scale == 1 && host_integerp (op1, 0))
> +           {
> +             scale = tree_low_cst (op1, 0);
> +             off = op0;
> +             continue;
> +           }
> +         break;
> +       case SSA_NAME:
> +         off = op0;
> +         continue;
> +       CASE_CONVERT:
> +         if (!POINTER_TYPE_P (TREE_TYPE (op0))
> +             && !INTEGRAL_TYPE_P (TREE_TYPE (op0)))
> +           break;
> +         if (TYPE_PRECISION (TREE_TYPE (op0))
> +             == TYPE_PRECISION (TREE_TYPE (off)))
> +           {
> +             off = op0;
> +             continue;
> +           }
> +         if (TYPE_PRECISION (TREE_TYPE (op0))
> +             < TYPE_PRECISION (TREE_TYPE (off)))
> +           {
> +             off = op0;
> +             offtype = TREE_TYPE (off);
> +             STRIP_NOPS (off);
> +             continue;
> +           }
> +         break;
> +       default:
> +         break;
> +       }
> +      break;
> +    }
> +
> +  /* If at the end OFF still isn't a SSA_NAME or isn't
> +     defined in the loop, punt.  */
> +  if (TREE_CODE (off) != SSA_NAME
> +      || expr_invariant_in_loop_p (loop, off))
> +    return NULL_TREE;
> +
> +  if (offtype == NULL_TREE)
> +    offtype = TREE_TYPE (off);
> +
> +  decl = targetm.vectorize.builtin_gather (STMT_VINFO_VECTYPE (stmt_info),
> +                                          offtype, scale);
> +  if (decl == NULL_TREE)
> +    return NULL_TREE;
> +
> +  if (basep)
> +    *basep = base;
> +  if (offp)
> +    *offp = off;
> +  if (scalep)
> +    *scalep = scale;
> +  return decl;
> +}
> +
>
>  /* Function vect_analyze_data_refs.
>
> @@ -2573,6 +2766,7 @@ vect_analyze_data_refs (loop_vec_info lo
>       gimple stmt;
>       stmt_vec_info stmt_info;
>       tree base, offset, init;
> +      bool gather = false;
>       int vf;
>
>       if (!dr || !DR_REF (dr))
> @@ -2594,22 +2788,51 @@ vect_analyze_data_refs (loop_vec_info lo
>
>       /* Check that analysis of the data-ref succeeded.  */
>       if (!DR_BASE_ADDRESS (dr) || !DR_OFFSET (dr) || !DR_INIT (dr)
> -          || !DR_STEP (dr))
> +         || !DR_STEP (dr))
>         {
> -          if (vect_print_dump_info (REPORT_UNVECTORIZED_LOCATIONS))
> -            {
> -              fprintf (vect_dump, "not vectorized: data ref analysis failed ");
> -              print_gimple_stmt (vect_dump, stmt, 0, TDF_SLIM);
> -            }
> +         /* If the target supports vector gather loads, see if they can
> +            be used.  */
> +         if (loop_vinfo
> +             && DR_IS_READ (dr)
> +             && !TREE_THIS_VOLATILE (DR_REF (dr))
> +             && targetm.vectorize.builtin_gather != NULL
> +             && !nested_in_vect_loop_p (loop, stmt))
> +           {
> +             struct data_reference *newdr
> +               = create_data_ref (NULL, loop_containing_stmt (stmt),
> +                                  DR_REF (dr), stmt, true);
> +             gcc_assert (newdr != NULL && DR_REF (newdr));
> +             if (DR_BASE_ADDRESS (newdr)
> +                 && DR_OFFSET (newdr)
> +                 && DR_INIT (newdr)
> +                 && DR_STEP (newdr)
> +                 && integer_zerop (DR_STEP (newdr)))
> +               {
> +                 dr = newdr;
> +                 gather = true;
> +               }
> +             else
> +               free_data_ref (newdr);
> +           }
>
> -          if (bb_vinfo)
> -            {
> -              STMT_VINFO_VECTORIZABLE (stmt_info) = false;
> -              stop_bb_analysis = true;
> -              continue;
> -            }
> +         if (!gather)
> +           {
> +             if (vect_print_dump_info (REPORT_UNVECTORIZED_LOCATIONS))
> +               {
> +                 fprintf (vect_dump, "not vectorized: data ref analysis "
> +                                     "failed ");
> +                 print_gimple_stmt (vect_dump, stmt, 0, TDF_SLIM);
> +               }
> +
> +             if (bb_vinfo)
> +               {
> +                 STMT_VINFO_VECTORIZABLE (stmt_info) = false;
> +                 stop_bb_analysis = true;
> +                 continue;
> +               }
>
> -          return false;
> +             return false;
> +           }
>         }
>
>       if (TREE_CODE (DR_BASE_ADDRESS (dr)) == INTEGER_CST)
> @@ -2625,7 +2848,9 @@ vect_analyze_data_refs (loop_vec_info lo
>               continue;
>             }
>
> -           return false;
> +         if (gather)
> +           free_data_ref (dr);
> +         return false;
>         }
>
>       if (TREE_THIS_VOLATILE (DR_REF (dr)))
> @@ -2666,6 +2891,8 @@ vect_analyze_data_refs (loop_vec_info lo
>               continue;
>             }
>
> +         if (gather)
> +           free_data_ref (dr);
>           return false;
>         }
>
> @@ -2791,6 +3018,8 @@ vect_analyze_data_refs (loop_vec_info lo
>               continue;
>             }
>
> +         if (gather)
> +           free_data_ref (dr);
>           return false;
>         }
>
> @@ -2818,8 +3047,13 @@ vect_analyze_data_refs (loop_vec_info lo
>               stop_bb_analysis = true;
>               continue;
>             }
> -          else
> -            return false;
> +
> +         if (gather)
> +           {
> +             STMT_VINFO_DATA_REF (stmt_info) = NULL;
> +             free_data_ref (dr);
> +           }
> +         return false;
>         }
>
>       /* Adjust the minimal vectorization factor according to the
> @@ -2827,6 +3061,86 @@ vect_analyze_data_refs (loop_vec_info lo
>       vf = TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info));
>       if (vf > *min_vf)
>        *min_vf = vf;
> +
> +      if (gather)
> +       {
> +         unsigned int j, k, n;
> +         struct data_reference *olddr
> +           = VEC_index (data_reference_p, datarefs, i);
> +         VEC (ddr_p, heap) *ddrs = LOOP_VINFO_DDRS (loop_vinfo);
> +         struct data_dependence_relation *ddr, *newddr;
> +         bool bad = false;
> +         tree off;
> +         VEC (loop_p, heap) *nest = LOOP_VINFO_LOOP_NEST (loop_vinfo);
> +
> +         if (!vect_check_gather (stmt, loop_vinfo, NULL, &off, NULL)
> +             || get_vectype_for_scalar_type (TREE_TYPE (off)) == NULL_TREE)
> +           {
> +             if (vect_print_dump_info (REPORT_UNVECTORIZED_LOCATIONS))
> +               {
> +                 fprintf (vect_dump,
> +                          "not vectorized: not suitable for gather ");
> +                 print_gimple_stmt (vect_dump, stmt, 0, TDF_SLIM);
> +               }
> +             return false;
> +           }
> +
> +         n = VEC_length (data_reference_p, datarefs) - 1;
> +         for (j = 0, k = i - 1; j < i; j++)
> +           {
> +             ddr = VEC_index (ddr_p, ddrs, k);
> +             gcc_assert (DDR_B (ddr) == olddr);
> +             newddr = initialize_data_dependence_relation (DDR_A (ddr), dr,
> +                                                           nest);
> +             VEC_replace (ddr_p, ddrs, k, newddr);
> +             free_dependence_relation (ddr);
> +             if (!bad
> +                 && DR_IS_WRITE (DDR_A (newddr))
> +                 && DDR_ARE_DEPENDENT (newddr) != chrec_known)
> +               bad = true;
> +             k += --n;
> +           }
> +
> +         k++;
> +         n = k + VEC_length (data_reference_p, datarefs) - i - 1;
> +         for (; k < n; k++)
> +           {
> +             ddr = VEC_index (ddr_p, ddrs, k);
> +             gcc_assert (DDR_A (ddr) == olddr);
> +             newddr = initialize_data_dependence_relation (dr, DDR_B (ddr),
> +                                                           nest);
> +             VEC_replace (ddr_p, ddrs, k, newddr);
> +             free_dependence_relation (ddr);
> +             if (!bad
> +                 && DR_IS_WRITE (DDR_B (newddr))
> +                 && DDR_ARE_DEPENDENT (newddr) != chrec_known)
> +               bad = true;
> +           }
> +
> +         k = VEC_length (ddr_p, ddrs)
> +             - VEC_length (data_reference_p, datarefs) + i;
> +         ddr = VEC_index (ddr_p, ddrs, k);
> +         gcc_assert (DDR_A (ddr) == olddr && DDR_B (ddr) == olddr);
> +         newddr = initialize_data_dependence_relation (dr, dr, nest);
> +         compute_self_dependence (newddr);
> +         VEC_replace (ddr_p, ddrs, k, newddr);
> +         free_dependence_relation (ddr);
> +         VEC_replace (data_reference_p, datarefs, i, dr);
> +
> +         if (bad)
> +           {
> +             if (vect_print_dump_info (REPORT_UNVECTORIZED_LOCATIONS))
> +               {
> +                 fprintf (vect_dump,
> +                          "not vectorized: data dependence conflict"
> +                          " prevents gather");
> +                 print_gimple_stmt (vect_dump, stmt, 0, TDF_SLIM);
> +               }
> +             return false;
> +           }
> +
> +         STMT_VINFO_GATHER_P (stmt_info) = true;
> +       }
>     }
>
>   return true;
> --- gcc/tree-vectorizer.h.jj    2011-11-04 08:52:19.000000000 +0100
> +++ gcc/tree-vectorizer.h       2011-11-04 08:53:13.000000000 +0100
> @@ -535,6 +535,9 @@ typedef struct _stmt_vec_info {
>   /* Is this statement vectorizable or should it be skipped in (partial)
>      vectorization.  */
>   bool vectorizable;
> +
> +  /* For loads only, true if this is a gather load.  */
> +  bool gather_p;
>  } *stmt_vec_info;
>
>  /* Access Functions.  */
> @@ -548,6 +551,7 @@ typedef struct _stmt_vec_info {
>  #define STMT_VINFO_VEC_STMT(S)             (S)->vectorized_stmt
>  #define STMT_VINFO_VECTORIZABLE(S)         (S)->vectorizable
>  #define STMT_VINFO_DATA_REF(S)             (S)->data_ref_info
> +#define STMT_VINFO_GATHER_P(S)            (S)->gather_p
>
>  #define STMT_VINFO_DR_BASE_ADDRESS(S)      (S)->dr_base_address
>  #define STMT_VINFO_DR_INIT(S)              (S)->dr_init
> @@ -858,6 +862,8 @@ extern bool vect_analyze_data_refs_align
>  extern bool vect_verify_datarefs_alignment (loop_vec_info, bb_vec_info);
>  extern bool vect_analyze_data_ref_accesses (loop_vec_info, bb_vec_info);
>  extern bool vect_prune_runtime_alias_test_list (loop_vec_info);
> +extern tree vect_check_gather (gimple, loop_vec_info, tree *, tree *,
> +                              int *);
>  extern bool vect_analyze_data_refs (loop_vec_info, bb_vec_info, int *);
>  extern tree vect_create_data_ref_ptr (gimple, tree, struct loop *, tree,
>                                      tree *, gimple_stmt_iterator *,
> --- gcc/config/i386/i386-builtin-types.def.jj   2011-11-04 08:52:19.000000000 +0100
> +++ gcc/config/i386/i386-builtin-types.def      2011-11-04 08:53:13.000000000 +0100
> @@ -432,20 +432,24 @@ DEF_FUNCTION_TYPE (V8QI, QI, QI, QI, QI,
>
>  DEF_FUNCTION_TYPE (V2DF, V2DF, PCDOUBLE, V4SI, V2DF, INT)
>  DEF_FUNCTION_TYPE (V4DF, V4DF, PCDOUBLE, V4SI, V4DF, INT)
> +DEF_FUNCTION_TYPE (V4DF, V4DF, PCDOUBLE, V8SI, V4DF, INT)
>  DEF_FUNCTION_TYPE (V2DF, V2DF, PCDOUBLE, V2DI, V2DF, INT)
>  DEF_FUNCTION_TYPE (V4DF, V4DF, PCDOUBLE, V4DI, V4DF, INT)
>  DEF_FUNCTION_TYPE (V4SF, V4SF, PCFLOAT, V4SI, V4SF, INT)
>  DEF_FUNCTION_TYPE (V8SF, V8SF, PCFLOAT, V8SI, V8SF, INT)
>  DEF_FUNCTION_TYPE (V4SF, V4SF, PCFLOAT, V2DI, V4SF, INT)
>  DEF_FUNCTION_TYPE (V4SF, V4SF, PCFLOAT, V4DI, V4SF, INT)
> +DEF_FUNCTION_TYPE (V8SF, V8SF, PCFLOAT, V4DI, V8SF, INT)
>  DEF_FUNCTION_TYPE (V2DI, V2DI, PCINT64, V4SI, V2DI, INT)
>  DEF_FUNCTION_TYPE (V4DI, V4DI, PCINT64, V4SI, V4DI, INT)
> +DEF_FUNCTION_TYPE (V4DI, V4DI, PCINT64, V8SI, V4DI, INT)
>  DEF_FUNCTION_TYPE (V2DI, V2DI, PCINT64, V2DI, V2DI, INT)
>  DEF_FUNCTION_TYPE (V4DI, V4DI, PCINT64, V4DI, V4DI, INT)
>  DEF_FUNCTION_TYPE (V4SI, V4SI, PCINT, V4SI, V4SI, INT)
>  DEF_FUNCTION_TYPE (V8SI, V8SI, PCINT, V8SI, V8SI, INT)
>  DEF_FUNCTION_TYPE (V4SI, V4SI, PCINT, V2DI, V4SI, INT)
>  DEF_FUNCTION_TYPE (V4SI, V4SI, PCINT, V4DI, V4SI, INT)
> +DEF_FUNCTION_TYPE (V8SI, V8SI, PCINT, V4DI, V8SI, INT)
>
>  DEF_FUNCTION_TYPE_ALIAS (V2DF_FTYPE_V2DF, ROUND)
>  DEF_FUNCTION_TYPE_ALIAS (V4DF_FTYPE_V4DF, ROUND)
> --- gcc/config/i386/sse.md.jj   2011-11-04 08:52:19.000000000 +0100
> +++ gcc/config/i386/sse.md      2011-11-04 12:48:16.000000000 +0100
> @@ -316,14 +316,6 @@ (define_mode_attr i128
>  ;; Mix-n-match
>  (define_mode_iterator AVX256MODE2P [V8SI V8SF V4DF])
>
> -(define_mode_iterator AVXMODE48P_DI
> -                     [V2DI V2DF V4DI V4DF V4SF V4SI])
> -(define_mode_attr AVXMODE48P_DI
> -                     [(V2DI "V2DI") (V2DF "V2DI")
> -                      (V4DI "V4DI") (V4DF "V4DI")
> -                      (V4SI "V2DI") (V4SF "V2DI")
> -                      (V8SI "V4DI") (V8SF "V4DI")])
> -
>  (define_mode_iterator FMAMODE [SF DF V4SF V2DF V8SF V4DF])
>
>  ;; Mapping of immediate bits for blend instructions
> @@ -12516,11 +12508,21 @@ (define_insn "vcvtps2ph256"
>  ;; For gather* insn patterns
>  (define_mode_iterator VEC_GATHER_MODE
>                      [V2DI V2DF V4DI V4DF V4SI V4SF V8SI V8SF])
> -(define_mode_attr VEC_GATHER_MODE
> +(define_mode_attr VEC_GATHER_IDXSI
>                      [(V2DI "V4SI") (V2DF "V4SI")
>                       (V4DI "V4SI") (V4DF "V4SI")
>                       (V4SI "V4SI") (V4SF "V4SI")
>                       (V8SI "V8SI") (V8SF "V8SI")])
> +(define_mode_attr VEC_GATHER_IDXDI
> +                     [(V2DI "V2DI") (V2DF "V2DI")
> +                      (V4DI "V4DI") (V4DF "V4DI")
> +                      (V4SI "V2DI") (V4SF "V2DI")
> +                      (V8SI "V4DI") (V8SF "V4DI")])
> +(define_mode_attr VEC_GATHER_SRCDI
> +                     [(V2DI "V2DI") (V2DF "V2DF")
> +                      (V4DI "V4DI") (V4DF "V4DF")
> +                      (V4SI "V4SI") (V4SF "V4SF")
> +                      (V8SI "V4SI") (V8SF "V4SF")])
>
>  (define_expand "avx2_gathersi<mode>"
>   [(parallel [(set (match_operand:VEC_GATHER_MODE 0 "register_operand" "")
> @@ -12529,7 +12531,8 @@ (define_expand "avx2_gathersi<mode>"
>                      (mem:<ssescalarmode>
>                        (match_par_dup 7
>                          [(match_operand 2 "vsib_address_operand" "")
> -                          (match_operand:<VEC_GATHER_MODE> 3 "register_operand" "")
> +                          (match_operand:<VEC_GATHER_IDXSI>
> +                             3 "register_operand" "")
>                           (match_operand:SI 5 "const1248_operand " "")]))
>                      (mem:BLK (scratch))
>                      (match_operand:VEC_GATHER_MODE 4 "register_operand" "")]
> @@ -12549,7 +12552,7 @@ (define_insn "*avx2_gathersi<mode>"
>           (match_operator:<ssescalarmode> 7 "vsib_mem_operator"
>             [(unspec:P
>                [(match_operand:P 3 "vsib_address_operand" "p")
> -                (match_operand:<VEC_GATHER_MODE> 4 "register_operand" "x")
> +                (match_operand:<VEC_GATHER_IDXSI> 4 "register_operand" "x")
>                 (match_operand:SI 6 "const1248_operand" "n")]
>                UNSPEC_VSIBADDR)])
>           (mem:BLK (scratch))
> @@ -12565,14 +12568,16 @@ (define_insn "*avx2_gathersi<mode>"
>  (define_expand "avx2_gatherdi<mode>"
>   [(parallel [(set (match_operand:VEC_GATHER_MODE 0 "register_operand" "")
>                   (unspec:VEC_GATHER_MODE
> -                    [(match_operand:VEC_GATHER_MODE 1 "register_operand" "")
> +                    [(match_operand:<VEC_GATHER_SRCDI> 1 "register_operand" "")
>                      (mem:<ssescalarmode>
>                        (match_par_dup 7
>                          [(match_operand 2 "vsib_address_operand" "")
> -                          (match_operand:<AVXMODE48P_DI> 3 "register_operand" "")
> +                          (match_operand:<VEC_GATHER_IDXDI>
> +                             3 "register_operand" "")
>                           (match_operand:SI 5 "const1248_operand " "")]))
>                      (mem:BLK (scratch))
> -                     (match_operand:VEC_GATHER_MODE 4 "register_operand" "")]
> +                     (match_operand:<VEC_GATHER_SRCDI>
> +                       4 "register_operand" "")]
>                     UNSPEC_GATHER))
>              (clobber (match_scratch:VEC_GATHER_MODE 6 ""))])]
>   "TARGET_AVX2"
> @@ -12583,63 +12588,21 @@ (define_expand "avx2_gatherdi<mode>"
>  })
>
>  (define_insn "*avx2_gatherdi<mode>"
> -  [(set (match_operand:AVXMODE48P_DI 0 "register_operand" "=&x")
> -       (unspec:AVXMODE48P_DI
> -         [(match_operand:AVXMODE48P_DI 2 "register_operand" "0")
> +  [(set (match_operand:VEC_GATHER_MODE 0 "register_operand" "=&x")
> +       (unspec:VEC_GATHER_MODE
> +         [(match_operand:<VEC_GATHER_SRCDI> 2 "register_operand" "0")
>           (match_operator:<ssescalarmode> 7 "vsib_mem_operator"
>             [(unspec:P
>                [(match_operand:P 3 "vsib_address_operand" "p")
> -                (match_operand:<AVXMODE48P_DI> 4 "register_operand" "x")
> +                (match_operand:<VEC_GATHER_IDXDI> 4 "register_operand" "x")
>                 (match_operand:SI 6 "const1248_operand" "n")]
>                UNSPEC_VSIBADDR)])
>           (mem:BLK (scratch))
> -          (match_operand:AVXMODE48P_DI 5 "register_operand" "1")]
> +          (match_operand:<VEC_GATHER_SRCDI> 5 "register_operand" "1")]
>          UNSPEC_GATHER))
> -   (clobber (match_scratch:AVXMODE48P_DI 1 "=&x"))]
> -  "TARGET_AVX2"
> -  "v<sseintprefix>gatherq<ssemodesuffix>\t{%1, %7, %0|%0, %7, %1}"
> -  [(set_attr "type" "ssemov")
> -   (set_attr "prefix" "vex")
> -   (set_attr "mode" "<sseinsnmode>")])
> -
> -;; Special handling for VEX.256 with float arguments
> -;; since there're still xmms as operands
> -(define_expand "avx2_gatherdi<mode>256"
> -  [(parallel [(set (match_operand:VI4F_128 0 "register_operand" "")
> -                  (unspec:VI4F_128
> -                    [(match_operand:VI4F_128 1 "register_operand" "")
> -                     (mem:<ssescalarmode>
> -                       (match_par_dup 7
> -                         [(match_operand 2 "vsib_address_operand" "")
> -                          (match_operand:V4DI 3 "register_operand" "")
> -                          (match_operand:SI 5 "const1248_operand " "")]))
> -                     (mem:BLK (scratch))
> -                     (match_operand:VI4F_128 4 "register_operand" "")]
> -                    UNSPEC_GATHER))
> -             (clobber (match_scratch:VI4F_128 6 ""))])]
> -  "TARGET_AVX2"
> -{
> -  operands[7]
> -    = gen_rtx_UNSPEC (Pmode, gen_rtvec (3, operands[2], operands[3],
> -                                       operands[5]), UNSPEC_VSIBADDR);
> -})
> -
> -(define_insn "*avx2_gatherdi<mode>256"
> -  [(set (match_operand:VI4F_128 0 "register_operand" "=x")
> -       (unspec:VI4F_128
> -         [(match_operand:VI4F_128 2 "register_operand" "0")
> -          (match_operator:<ssescalarmode> 7 "vsib_mem_operator"
> -            [(unspec:P
> -               [(match_operand:P 3 "vsib_address_operand" "p")
> -                (match_operand:V4DI 4 "register_operand" "x")
> -                (match_operand:SI 6 "const1248_operand" "n")]
> -               UNSPEC_VSIBADDR)])
> -          (mem:BLK (scratch))
> -          (match_operand:VI4F_128 5 "register_operand" "1")]
> -         UNSPEC_GATHER))
> -   (clobber (match_scratch:VI4F_128 1 "=&x"))]
> +   (clobber (match_scratch:VEC_GATHER_MODE 1 "=&x"))]
>   "TARGET_AVX2"
> -  "v<sseintprefix>gatherq<ssemodesuffix>\t{%1, %7, %0|%0, %7, %1}"
> +  "v<sseintprefix>gatherq<ssemodesuffix>\t{%5, %7, %2|%2, %7, %5}"
>   [(set_attr "type" "ssemov")
>    (set_attr "prefix" "vex")
>    (set_attr "mode" "<sseinsnmode>")])
> --- gcc/config/i386/i386.c.jj   2011-11-04 08:52:19.000000000 +0100
> +++ gcc/config/i386/i386.c      2011-11-04 12:48:16.000000000 +0100
> @@ -25105,6 +25105,13 @@ enum ix86_builtins
>   IX86_BUILTIN_GATHERDIV4SI,
>   IX86_BUILTIN_GATHERDIV8SI,
>
> +  /* Alternate 4 element gather for the vectorizer where
> +     all operands are 32-byte wide.  */
> +  IX86_BUILTIN_GATHERALTSIV4DF,
> +  IX86_BUILTIN_GATHERALTDIV8SF,
> +  IX86_BUILTIN_GATHERALTSIV4DI,
> +  IX86_BUILTIN_GATHERALTDIV8SI,
> +
>   /* TFmode support builtins.  */
>   IX86_BUILTIN_INFQ,
>   IX86_BUILTIN_HUGE_VALQ,
> @@ -26883,6 +26890,22 @@ ix86_init_mmx_sse_builtins (void)
>               V4SI_FTYPE_V4SI_PCINT_V4DI_V4SI_INT,
>               IX86_BUILTIN_GATHERDIV8SI);
>
> +  def_builtin (OPTION_MASK_ISA_AVX2, "__builtin_ia32_gatheraltsiv4df ",
> +              V4DF_FTYPE_V4DF_PCDOUBLE_V8SI_V4DF_INT,
> +              IX86_BUILTIN_GATHERALTSIV4DF);
> +
> +  def_builtin (OPTION_MASK_ISA_AVX2, "__builtin_ia32_gatheraltdiv4sf256 ",
> +              V8SF_FTYPE_V8SF_PCFLOAT_V4DI_V8SF_INT,
> +              IX86_BUILTIN_GATHERALTDIV8SF);
> +
> +  def_builtin (OPTION_MASK_ISA_AVX2, "__builtin_ia32_gatheraltsiv4di ",
> +              V4DI_FTYPE_V4DI_PCINT64_V8SI_V4DI_INT,
> +              IX86_BUILTIN_GATHERALTSIV4DI);
> +
> +  def_builtin (OPTION_MASK_ISA_AVX2, "__builtin_ia32_gatheraltdiv4si256 ",
> +              V8SI_FTYPE_V8SI_PCINT_V4DI_V8SI_INT,
> +              IX86_BUILTIN_GATHERALTDIV8SI);
> +
>   /* MMX access to the vec_init patterns.  */
>   def_builtin_const (OPTION_MASK_ISA_MMX, "__builtin_ia32_vec_init_v2si",
>                     V2SI_FTYPE_INT_INT, IX86_BUILTIN_VEC_INIT_V2SI);
> @@ -28869,7 +28892,7 @@ rdrand_step:
>       icode = CODE_FOR_avx2_gatherdiv4sf;
>       goto gather_gen;
>     case IX86_BUILTIN_GATHERDIV8SF:
> -      icode = CODE_FOR_avx2_gatherdiv4sf256;
> +      icode = CODE_FOR_avx2_gatherdiv8sf;
>       goto gather_gen;
>     case IX86_BUILTIN_GATHERSIV2DI:
>       icode = CODE_FOR_avx2_gathersiv2di;
> @@ -28893,7 +28916,20 @@ rdrand_step:
>       icode = CODE_FOR_avx2_gatherdiv4si;
>       goto gather_gen;
>     case IX86_BUILTIN_GATHERDIV8SI:
> -      icode = CODE_FOR_avx2_gatherdiv4si256;
> +      icode = CODE_FOR_avx2_gatherdiv8si;
> +      goto gather_gen;
> +    case IX86_BUILTIN_GATHERALTSIV4DF:
> +      icode = CODE_FOR_avx2_gathersiv4df;
> +      goto gather_gen;
> +    case IX86_BUILTIN_GATHERALTDIV8SF:
> +      icode = CODE_FOR_avx2_gatherdiv8sf;
> +      goto gather_gen;
> +    case IX86_BUILTIN_GATHERALTSIV4DI:
> +      icode = CODE_FOR_avx2_gathersiv4di;
> +      goto gather_gen;
> +    case IX86_BUILTIN_GATHERALTDIV8SI:
> +      icode = CODE_FOR_avx2_gatherdiv8si;
> +      goto gather_gen;
>
>     gather_gen:
>       arg0 = CALL_EXPR_ARG (exp, 0);
> @@ -28912,8 +28948,39 @@ rdrand_step:
>       mode3 = insn_data[icode].operand[4].mode;
>       mode4 = insn_data[icode].operand[5].mode;
>
> -      if (target == NULL_RTX)
> -       target = gen_reg_rtx (insn_data[icode].operand[0].mode);
> +      if (target == NULL_RTX
> +         || GET_MODE (target) != insn_data[icode].operand[0].mode)
> +       subtarget = gen_reg_rtx (insn_data[icode].operand[0].mode);
> +      else
> +       subtarget = target;
> +
> +      if (fcode == IX86_BUILTIN_GATHERALTSIV4DF
> +         || fcode == IX86_BUILTIN_GATHERALTSIV4DI)
> +       {
> +         rtx half = gen_reg_rtx (V4SImode);
> +         if (!nonimmediate_operand (op2, V8SImode))
> +           op2 = copy_to_mode_reg (V8SImode, op2);
> +         emit_insn (gen_vec_extract_lo_v8si (half, op2));
> +         op2 = half;
> +       }
> +      else if (fcode == IX86_BUILTIN_GATHERALTDIV8SF
> +              || fcode == IX86_BUILTIN_GATHERALTDIV8SI)
> +       {
> +         rtx (*gen) (rtx, rtx);
> +         rtx half = gen_reg_rtx (mode0);
> +         if (mode0 == V4SFmode)
> +           gen = gen_vec_extract_lo_v8sf;
> +         else
> +           gen = gen_vec_extract_lo_v8si;
> +         if (!nonimmediate_operand (op0, GET_MODE (op0)))
> +           op0 = copy_to_mode_reg (GET_MODE (op0), op0);
> +         emit_insn (gen (half, op0));
> +         op0 = half;
> +         if (!nonimmediate_operand (op3, GET_MODE (op3)))
> +           op3 = copy_to_mode_reg (GET_MODE (op3), op3);
> +         emit_insn (gen (half, op3));
> +         op3 = half;
> +       }
>
>       /* Force memory operand only with base register here.  But we
>         don't want to do it on memory operand for other builtin
> @@ -28935,10 +29002,26 @@ rdrand_step:
>           error ("last argument must be scale 1, 2, 4, 8");
>           return const0_rtx;
>        }
> -      pat = GEN_FCN (icode) (target, op0, op1, op2, op3, op4);
> +      pat = GEN_FCN (icode) (subtarget, op0, op1, op2, op3, op4);
>       if (! pat)
>        return const0_rtx;
>       emit_insn (pat);
> +
> +      if (fcode == IX86_BUILTIN_GATHERDIV8SF
> +         || fcode == IX86_BUILTIN_GATHERDIV8SI)
> +       {
> +         enum machine_mode tmode = GET_MODE (subtarget) == V8SFmode
> +                                   ? V4SFmode : V4SImode;
> +         if (target == NULL_RTX)
> +           target = gen_reg_rtx (tmode);
> +         if (tmode == V4SFmode)
> +           emit_insn (gen_vec_extract_lo_v8sf (target, subtarget));
> +         else
> +           emit_insn (gen_vec_extract_lo_v8si (target, subtarget));
> +       }
> +      else
> +       target = subtarget;
> +
>       return target;
>
>     default:
> @@ -29443,6 +29526,73 @@ ix86_veclibabi_acml (enum built_in_funct
>   return new_fndecl;
>  }
>
> +/* Returns a decl of a function that implements gather load with
> +   memory type MEM_VECTYPE and index type INDEX_VECTYPE and SCALE.
> +   Return NULL_TREE if it is not available.  */
> +
> +static tree
> +ix86_vectorize_builtin_gather (const_tree mem_vectype,
> +                              const_tree index_type, int scale)
> +{
> +  bool si;
> +  enum ix86_builtins code;
> +
> +  if (! TARGET_AVX2)
> +    return NULL_TREE;
> +
> +  if ((TREE_CODE (index_type) != INTEGER_TYPE
> +       && !POINTER_TYPE_P (index_type))
> +      || (TYPE_MODE (index_type) != SImode
> +         && TYPE_MODE (index_type) != DImode))
> +    return NULL_TREE;
> +
> +  if (TYPE_PRECISION (index_type) > POINTER_SIZE)
> +    return NULL_TREE;
> +
> +  /* v*gather* insn sign extends index to pointer mode.  */
> +  if (TYPE_PRECISION (index_type) < POINTER_SIZE
> +      && TYPE_UNSIGNED (index_type))
> +    return NULL_TREE;
> +
> +  if (scale <= 0
> +      || scale > 8
> +      || (scale & (scale - 1)) != 0)
> +    return NULL_TREE;
> +
> +  si = TYPE_MODE (index_type) == SImode;
> +  switch (TYPE_MODE (mem_vectype))
> +    {
> +    case V2DFmode:
> +      code = si ? IX86_BUILTIN_GATHERSIV2DF : IX86_BUILTIN_GATHERDIV2DF;
> +      break;
> +    case V4DFmode:
> +      code = si ? IX86_BUILTIN_GATHERALTSIV4DF : IX86_BUILTIN_GATHERDIV4DF;
> +      break;
> +    case V2DImode:
> +      code = si ? IX86_BUILTIN_GATHERSIV2DI : IX86_BUILTIN_GATHERDIV2DI;
> +      break;
> +    case V4DImode:
> +      code = si ? IX86_BUILTIN_GATHERALTSIV4DI : IX86_BUILTIN_GATHERDIV4DI;
> +      break;
> +    case V4SFmode:
> +      code = si ? IX86_BUILTIN_GATHERSIV4SF : IX86_BUILTIN_GATHERDIV4SF;
> +      break;
> +    case V8SFmode:
> +      code = si ? IX86_BUILTIN_GATHERSIV8SF : IX86_BUILTIN_GATHERALTDIV8SF;
> +      break;
> +    case V4SImode:
> +      code = si ? IX86_BUILTIN_GATHERSIV4SI : IX86_BUILTIN_GATHERDIV4SI;
> +      break;
> +    case V8SImode:
> +      code = si ? IX86_BUILTIN_GATHERSIV8SI : IX86_BUILTIN_GATHERALTDIV8SI;
> +      break;
> +    default:
> +      return NULL_TREE;
> +    }
> +
> +  return ix86_builtins[code];
> +}
> +
>  /* Returns a code for a target-specific builtin that implements
>    reciprocal of the function, or NULL_TREE if not available.  */
>
> @@ -37642,6 +37792,9 @@ ix86_autovectorize_vector_sizes (void)
>  #define TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION \
>   ix86_builtin_vectorized_function
>
> +#undef TARGET_VECTORIZE_BUILTIN_GATHER
> +#define TARGET_VECTORIZE_BUILTIN_GATHER ix86_vectorize_builtin_gather
> +
>  #undef TARGET_BUILTIN_RECIPROCAL
>  #define TARGET_BUILTIN_RECIPROCAL ix86_builtin_reciprocal
>
> --- gcc/testsuite/gcc.target/i386/avx2-gather-1.c.jj    2011-11-04 08:53:13.000000000 +0100
> +++ gcc/testsuite/gcc.target/i386/avx2-gather-1.c       2011-11-04 08:53:13.000000000 +0100
> @@ -0,0 +1,215 @@
> +/* { dg-do run } */
> +/* { dg-require-effective-target avx2 } */
> +/* { dg-options "-O3 -mavx2" } */
> +
> +#include "avx2-check.h"
> +
> +#define N 1024
> +float vf1[N+16], vf2[N];
> +double vd1[N+16], vd2[N];
> +int k[N];
> +long l[N];
> +short n[N];
> +
> +__attribute__((noinline, noclone)) void
> +f1 (void)
> +{
> +  int i;
> +  for (i = 0; i < N; i++)
> +    vf2[i] = vf1[k[i]];
> +}
> +
> +__attribute__((noinline, noclone)) void
> +f2 (void)
> +{
> +  int i;
> +  for (i = 0; i < N; i++)
> +    n[i] = (int) vf1[k[i]];
> +}
> +
> +__attribute__((noinline, noclone)) void
> +f3 (int x)
> +{
> +  int i;
> +  for (i = 0; i < N; i++)
> +    vf2[i] = vf1[k[i] + x];
> +}
> +
> +__attribute__((noinline, noclone)) void
> +f4 (int x)
> +{
> +  int i;
> +  for (i = 0; i < N; i++)
> +    n[i] = (int) vf1[k[i] + x];
> +}
> +
> +__attribute__((noinline, noclone)) void
> +f5 (void)
> +{
> +  int i;
> +  for (i = 0; i < N; i++)
> +    vd2[i] = vd1[k[i]];
> +}
> +
> +__attribute__((noinline, noclone)) void
> +f6 (void)
> +{
> +  int i;
> +  for (i = 0; i < N; i++)
> +    n[i] = (int) vd1[k[i]];
> +}
> +
> +__attribute__((noinline, noclone)) void
> +f7 (int x)
> +{
> +  int i;
> +  for (i = 0; i < N; i++)
> +    vd2[i] = vd1[k[i] + x];
> +}
> +
> +__attribute__((noinline, noclone)) void
> +f8 (int x)
> +{
> +  int i;
> +  for (i = 0; i < N; i++)
> +    n[i] = (int) vd1[k[i] + x];
> +}
> +
> +__attribute__((noinline, noclone)) void
> +f9 (void)
> +{
> +  int i;
> +  for (i = 0; i < N; i++)
> +    vf2[i] = vf1[l[i]];
> +}
> +
> +__attribute__((noinline, noclone)) void
> +f10 (void)
> +{
> +  int i;
> +  for (i = 0; i < N; i++)
> +    n[i] = (int) vf1[l[i]];
> +}
> +
> +__attribute__((noinline, noclone)) void
> +f11 (long x)
> +{
> +  int i;
> +  for (i = 0; i < N; i++)
> +    vf2[i] = vf1[l[i] + x];
> +}
> +
> +__attribute__((noinline, noclone)) void
> +f12 (long x)
> +{
> +  int i;
> +  for (i = 0; i < N; i++)
> +    n[i] = (int) vf1[l[i] + x];
> +}
> +
> +__attribute__((noinline, noclone)) void
> +f13 (void)
> +{
> +  int i;
> +  for (i = 0; i < N; i++)
> +    vd2[i] = vd1[l[i]];
> +}
> +
> +__attribute__((noinline, noclone)) void
> +f14 (void)
> +{
> +  int i;
> +  for (i = 0; i < N; i++)
> +    n[i] = (int) vd1[l[i]];
> +}
> +
> +__attribute__((noinline, noclone)) void
> +f15 (long x)
> +{
> +  int i;
> +  for (i = 0; i < N; i++)
> +    vd2[i] = vd1[l[i] + x];
> +}
> +
> +__attribute__((noinline, noclone)) void
> +f16 (long x)
> +{
> +  int i;
> +  for (i = 0; i < N; i++)
> +    n[i] = (int) vd1[l[i] + x];
> +}
> +
> +static void
> +avx2_test (void)
> +{
> +  int i;
> +
> +  for (i = 0; i < N + 16; i++)
> +    {
> +      asm ("");
> +      vf1[i] = 17.0f + i;
> +      vd1[i] = 19.0 + i;
> +    }
> +  for (i = 0; i < N; i++)
> +    {
> +      asm ("");
> +      k[i] = (i * 731) & (N - 1);
> +      l[i] = (i * 657) & (N - 1);
> +    }
> +
> +  f1 ();
> +  f2 ();
> +  for (i = 0; i < N; i++)
> +    if (vf2[i] != ((i * 731) & (N - 1)) + 17
> +       || n[i] != ((i * 731) & (N - 1)) + 17)
> +      abort ();
> +
> +  f3 (12);
> +  f4 (14);
> +  for (i = 0; i < N; i++)
> +    if (vf2[i] != ((i * 731) & (N - 1)) + 17 + 12
> +       || n[i] != ((i * 731) & (N - 1)) + 17 + 14)
> +      abort ();
> +
> +  f5 ();
> +  f6 ();
> +  for (i = 0; i < N; i++)
> +    if (vd2[i] != ((i * 731) & (N - 1)) + 19
> +       || n[i] != ((i * 731) & (N - 1)) + 19)
> +      abort ();
> +
> +  f7 (7);
> +  f8 (9);
> +  for (i = 0; i < N; i++)
> +    if (vd2[i] != ((i * 731) & (N - 1)) + 19 + 7
> +       || n[i] != ((i * 731) & (N - 1)) + 19 + 9)
> +      abort ();
> +
> +  f9 ();
> +  f10 ();
> +  for (i = 0; i < N; i++)
> +    if (vf2[i] != ((i * 657) & (N - 1)) + 17
> +       || n[i] != ((i * 657) & (N - 1)) + 17)
> +      abort ();
> +
> +  f11 (2);
> +  f12 (4);
> +  for (i = 0; i < N; i++)
> +    if (vf2[i] != ((i * 657) & (N - 1)) + 17 + 2
> +       || n[i] != ((i * 657) & (N - 1)) + 17 + 4)
> +      abort ();
> +
> +  f13 ();
> +  f14 ();
> +  for (i = 0; i < N; i++)
> +    if (vd2[i] != ((i * 657) & (N - 1)) + 19
> +       || n[i] != ((i * 657) & (N - 1)) + 19)
> +      abort ();
> +
> +  f15 (13);
> +  f16 (15);
> +  for (i = 0; i < N; i++)
> +    if (vd2[i] != ((i * 657) & (N - 1)) + 19 + 13
> +       || n[i] != ((i * 657) & (N - 1)) + 19 + 15)
> +      abort ();
> +}
> --- gcc/testsuite/gcc.target/i386/avx2-gather-2.c.jj    2011-11-04 08:53:13.000000000 +0100
> +++ gcc/testsuite/gcc.target/i386/avx2-gather-2.c       2011-11-04 08:53:13.000000000 +0100
> @@ -0,0 +1,7 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -mavx2 -fdump-tree-vect-details" } */
> +
> +#include "avx2-gather-1.c"
> +
> +/* { dg-final { scan-tree-dump-times "note: vectorized 1 loops in function" 16 "vect" } } */
> +/* { dg-final { cleanup-tree-dump "vect" } } */
> --- gcc/testsuite/gcc.target/i386/avx2-gather-3.c.jj    2011-11-04 08:53:13.000000000 +0100
> +++ gcc/testsuite/gcc.target/i386/avx2-gather-3.c       2011-11-04 08:53:13.000000000 +0100
> @@ -0,0 +1,167 @@
> +/* { dg-do run } */
> +/* { dg-require-effective-target avx2 } */
> +/* { dg-options "-O3 -mavx2 -ffast-math" } */
> +
> +#include "avx2-check.h"
> +
> +#define N 1024
> +float f[N];
> +double d[N];
> +int k[N];
> +float *l[N];
> +double *n[N];
> +int **m[N];
> +long **o[N];
> +long q[N];
> +long *r[N];
> +int *s[N];
> +
> +__attribute__((noinline, noclone)) float
> +f1 (void)
> +{
> +  int i;
> +  float g = 0.0;
> +  for (i = 0; i < N / 2; i++)
> +    g += f[k[i]];
> +  return g;
> +}
> +
> +__attribute__((noinline, noclone)) float
> +f2 (float *p)
> +{
> +  int i;
> +  float g = 0.0;
> +  for (i = 0; i < N / 2; i++)
> +    g += p[k[i]];
> +  return g;
> +}
> +
> +__attribute__((noinline, noclone)) float
> +f3 (void)
> +{
> +  int i;
> +  float g = 0.0;
> +  for (i = 0; i < N / 2; i++)
> +    g += *l[i];
> +  return g;
> +}
> +
> +__attribute__((noinline, noclone)) int
> +f4 (void)
> +{
> +  int i;
> +  int g = 0;
> +  for (i = 0; i < N / 2; i++)
> +    g += **m[i];
> +  return g;
> +}
> +
> +__attribute__((noinline, noclone)) double
> +f5 (void)
> +{
> +  int i;
> +  double g = 0.0;
> +  for (i = 0; i < N / 2; i++)
> +    g += d[k[i]];
> +  return g;
> +}
> +
> +__attribute__((noinline, noclone)) double
> +f6 (double *p)
> +{
> +  int i;
> +  double g = 0.0;
> +  for (i = 0; i < N / 2; i++)
> +    g += p[k[i]];
> +  return g;
> +}
> +
> +__attribute__((noinline, noclone)) double
> +f7 (void)
> +{
> +  int i;
> +  double g = 0.0;
> +  for (i = 0; i < N / 2; i++)
> +    g += *n[i];
> +  return g;
> +}
> +
> +__attribute__((noinline, noclone)) int
> +f8 (void)
> +{
> +  int i;
> +  int g = 0;
> +  for (i = 0; i < N / 2; i++)
> +    g += **o[i];
> +  return g;
> +}
> +
> +__attribute__((noinline, noclone)) float
> +f9 (void)
> +{
> +  int i;
> +  float g = 0.0;
> +  for (i = 0; i < N / 2; i++)
> +    g += f[q[i]];
> +  return g;
> +}
> +
> +__attribute__((noinline, noclone)) float
> +f10 (float *p)
> +{
> +  int i;
> +  float g = 0.0;
> +  for (i = 0; i < N / 2; i++)
> +    g += p[q[i]];
> +  return g;
> +}
> +
> +__attribute__((noinline, noclone)) double
> +f11 (void)
> +{
> +  int i;
> +  double g = 0.0;
> +  for (i = 0; i < N / 2; i++)
> +    g += d[q[i]];
> +  return g;
> +}
> +
> +__attribute__((noinline, noclone)) double
> +f12 (double *p)
> +{
> +  int i;
> +  double g = 0.0;
> +  for (i = 0; i < N / 2; i++)
> +    g += p[q[i]];
> +  return g;
> +}
> +
> +static void
> +avx2_test (void)
> +{
> +  int i;
> +
> +  for (i = 0; i < N; i++)
> +    {
> +      asm ("");
> +      f[i] = -256.0f + i;
> +      d[i] = -258.0 + i;
> +      k[i] = (i * 731) & (N - 1);
> +      q[i] = (i * 657) & (N - 1);
> +      l[i] = &f[(i * 239) & (N - 1)];
> +      n[i] = &d[(i * 271) & (N - 1)];
> +      r[i] = &q[(i * 323) & (N - 1)];
> +      s[i] = &k[(i * 565) & (N - 1)];
> +      m[i] = &s[(i * 13) & (N - 1)];
> +      o[i] = &r[(i * 19) & (N - 1)];
> +    }
> +
> +  if (f1 () != 136448.0f || f2 (f) != 136448.0f || f3 () != 130304.0)
> +    abort ();
> +  if (f4 () != 261376 || f5 () != 135424.0 || f6 (d) != 135424.0)
> +    abort ();
> +  if (f7 () != 129280.0 || f8 () != 259840L || f9 () != 130816.0f)
> +    abort ();
> +  if (f10 (f) != 130816.0f || f11 () != 129792.0 || f12 (d) != 129792.0)
> +    abort ();
> +}
> --- gcc/testsuite/gcc.target/i386/avx2-gather-4.c.jj    2011-11-04 08:54:11.000000000 +0100
> +++ gcc/testsuite/gcc.target/i386/avx2-gather-4.c       2011-11-04 08:54:11.000000000 +0100
> @@ -0,0 +1,38 @@
> +/* { dg-do run } */
> +/* { dg-require-effective-target avx2 } */
> +/* { dg-options "-O3 -mavx2" } */
> +
> +#include "avx2-check.h"
> +
> +#define N 1024
> +int a[N], b[N], c[N], d[N];
> +
> +__attribute__((noinline, noclone)) void
> +foo (float *__restrict p, float *__restrict q, float *__restrict r,
> +     long s1, long s2, long s3)
> +{
> +  int i;
> +  for (i = 0; i < N; i++)
> +    p[i] = q[a[i] * s1 + b[i] * s2 + s3] * r[c[i] * s1 + d[i] * s2 + s3];
> +}
> +
> +static void
> +avx2_test (void)
> +{
> +  int i;
> +  float e[N], f[N], g[N];
> +  for (i = 0; i < N; i++)
> +    {
> +      a[i] = (i * 7) & (N / 8 - 1);
> +      b[i] = (i * 13) & (N / 8 - 1);
> +      c[i] = (i * 23) & (N / 8 - 1);
> +      d[i] = (i * 5) & (N / 8 - 1);
> +      e[i] = 16.5 + i;
> +      f[i] = 127.5 - i;
> +    }
> +  foo (g, e, f, 3, 2, 4);
> +  for (i = 0; i < N; i++)
> +    if (g[i] != (float) ((20.5 + a[i] * 3 + b[i] * 2)
> +                        * (123.5 - c[i] * 3 - d[i] * 2)))
> +      abort ();
> +}
>
>
>        Jakub
>

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] Gather vectorization (PR tree-optimization/50789, take 2)
  2011-11-04 13:32       ` [PATCH] Gather vectorization (PR tree-optimization/50789, take 2) Jakub Jelinek
  2011-11-04 23:54         ` Richard Guenther
@ 2011-11-07 15:49         ` Uros Bizjak
  1 sibling, 0 replies; 8+ messages in thread
From: Uros Bizjak @ 2011-11-07 15:49 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Richard Guenther, Ira Rosen, Richard Henderson, gcc-patches

On Fri, Nov 4, 2011 at 2:22 PM, Jakub Jelinek <jakub@redhat.com> wrote:

> On Fri, Nov 04, 2011 at 12:21:49PM +0100, Richard Guenther wrote:
>> Ok.  I guess it's ok to use builtins for now - I didn't think of
>> the memory reference issue ;)
>
> Based on IRC discussion I'm posting an updated patch (both former
> patches (base and incremental) in one).
>
> I'm now using expr_invariant_in_loop_p instead of chrec_contains*,
> which nicely handles also the is_gimple_min_invariant case,
> and I've added several comments and fixed the MEM_REF offset
> folding.  Smoke tested on the *gather* testcases, will do full
> bootstrap/regtest soon.
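>
> A minimal sketch of what the fixed MEM_REF offset folding boils down to
> (illustrative only; off and base are the surrounding vect_check_gather
> locals, and this is not the exact hunk from the patch):
>
>   if (TREE_CODE (base) == MEM_REF)
>     {
>       if (!integer_zerop (TREE_OPERAND (base, 1)))
>         {
>           /* The MEM_REF offset operand is pointer-typed; fetch it via
>              mem_ref_offset and represent it in sizetype before doing
>              any size_binop arithmetic on it.  */
>           double_int moff = mem_ref_offset (base);
>           tree moff_tree = double_int_to_tree (sizetype, moff);
>           off = (off == NULL_TREE
>                  ? moff_tree
>                  : size_binop (PLUS_EXPR, off, moff_tree));
>         }
>       base = TREE_OPERAND (base, 0);
>     }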
>
> 2011-11-04  Jakub Jelinek  <jakub@redhat.com>
>
>        PR tree-optimization/50789
>        * tree-vect-stmts.c (process_use): Add force argument, avoid
>        exist_non_indexing_operands_for_use_p check if true.
>        (vect_mark_stmts_to_be_vectorized): Adjust callers.  Handle
>        STMT_VINFO_GATHER_P.
>        (gen_perm_mask): New function.
>        (perm_mask_for_reverse): Use it.
>        (reverse_vec_element): Rename to...
>        (permute_vec_elements): ... this.  Add Y and MASK_VEC arguments,
>        generalize for any permutations.
>        (vectorizable_load): Adjust caller.  Handle STMT_VINFO_GATHER_P.
>        * target.def (TARGET_VECTORIZE_BUILTIN_GATHER): New hook.
>        * doc/tm.texi.in (TARGET_VECTORIZE_BUILTIN_GATHER): Document it.
>        * doc/tm.texi: Regenerate.
>        * tree-data-ref.c (initialize_data_dependence_relation,
>        compute_self_dependence): No longer static.
>        * tree-data-ref.h (initialize_data_dependence_relation,
>        compute_self_dependence): New prototypes.
>        * tree-vect-data-refs.c (vect_check_gather): New function.
>        (vect_analyze_data_refs): Detect possible gather load data
>        refs.
>        * tree-vectorizer.h (struct _stmt_vec_info): Add gather_p field.
>        (STMT_VINFO_GATHER_P): Define.
>        (vect_check_gather): New prototype.
>        * config/i386/i386-builtin-types.def: Add types for alternate
>        gather builtins.
>        * config/i386/sse.md (AVXMODE48P_DI): Remove.
>        (VEC_GATHER_MODE): Rename mode_attr to...
>        (VEC_GATHER_IDXSI): ... this.
>        (VEC_GATHER_IDXDI, VEC_GATHER_SRCDI): New mode_attrs.
>        (avx2_gathersi<mode>, *avx2_gathersi<mode>): Use <VEC_GATHER_IDXSI>
>        instead of <VEC_GATHER_MODE>.
>        (avx2_gatherdi<mode>): Use <VEC_GATHER_IDXDI> instead of
>        <AVXMODE48P_DI> and <VEC_GATHER_SRCDI> instead of VEC_GATHER_MODE
>        on src and mask operands.
>        (*avx2_gatherdi<mode>): Likewise.  Use VEC_GATHER_MODE iterator
>        instead of AVXMODE48P_DI.
>        (avx2_gatherdi<mode>256, *avx2_gatherdi<mode>256): Removed.
>        * config/i386/i386.c (enum ix86_builtins): Add
>        IX86_BUILTIN_GATHERALTSIV4DF, IX86_BUILTIN_GATHERALTDIV8SF,
>        IX86_BUILTIN_GATHERALTSIV4DI and IX86_BUILTIN_GATHERALTDIV8SI.
>        (ix86_init_mmx_sse_builtins): Create those builtins.
>        (ix86_expand_builtin): Handle those builtins and adjust expansions
>        of other gather builtins.
>        (ix86_vectorize_builtin_gather): New function.
>        (TARGET_VECTORIZE_BUILTIN_GATHER): Define.
>
>        * gcc.target/i386/avx2-gather-1.c: New test.
>        * gcc.target/i386/avx2-gather-2.c: New test.
>        * gcc.target/i386/avx2-gather-3.c: New test.
>        * gcc.target/i386/avx2-gather-4.c: New test.

x86 parts are also OK.

Thanks,
Uros.

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2011-11-07 15:38 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-11-02 20:19 Patch ping Jakub Jelinek
2011-11-04 10:11 ` Richard Guenther
2011-11-04 10:39   ` Jakub Jelinek
2011-11-04 11:44     ` Richard Guenther
2011-11-04 13:32       ` [PATCH] Gather vectorization (PR tree-optimization/50789, take 2) Jakub Jelinek
2011-11-04 23:54         ` Richard Guenther
2011-11-07 15:49         ` Uros Bizjak
2011-11-04 14:09       ` Patch ping Michael Matz
