Re: [PATCH, PR52252] Vectorization for load/store groups of size 3.

public inbox for gcc-patches@gcc.gnu.org

* Re: [PATCH, PR52252] Vectorization for load/store groups of size 3.
       [not found] <CAOvf_xxEQ3tm+fwL5EfVSOUKDQnaBt+jTz4huK66T_8+TXzzfQ@mail.gmail.com>
@ 2014-02-11 13:01 ` Richard Biener
  2014-02-11 14:03   ` Evgeny Stupachenko
  0 siblings, 1 reply; 26+ messages in thread
From: Richard Biener @ 2014-02-11 13:01 UTC (permalink / raw)
  To: Evgeny Stupachenko; +Cc: gcc-patches, Jakub Jelinek, ubizjak

On Tue, 11 Feb 2014, Evgeny Stupachenko wrote:

> Hi,
> 
> The patch gives the expected 3x speedup for the test case in PR52252
> (and even 6x for AVX2).
> It passes make check and bootstrap on x86.
> spec2000/spec2006 show no regressions or gains on x86.
> 
> Is this patch ok?

I've worked on generalizing the permutation support in the light
of the availability of the generic shuffle support in the IL
but hit some road-blocks in the way code-generation works for
group loads with permutations (I don't remember if I posted all patches).

This patch seems to touch a slightly different place but it again
special-cases a specific permutation.  Why's that?  Why can't we
support groups of size 7 for example?  So - can this be generalized
to support arbitrary non-power-of-two load/store groups?

Other than that the patch has to wait for stage1 to open again,
of course.  And it misses a testcase.

Btw, do you have a copyright assignment on file with the FSF covering
work on GCC?

Thanks,
Richard.

> ChangeLog:
> 
> 2014-02-11  Evgeny Stupachenko  <evstupac@gmail.com>
> 
>         * target.h (vect_cost_for_stmt): Defining new cost vec_perm_shuffle.
>         * tree-vect-data-refs.c (vect_grouped_store_supported): New
>         check for stores group of length 3.
>         (vect_permute_store_chain): New permutations for stores group of
>         length 3.
>         (vect_grouped_load_supported): New check for loads group of
>         length 3.
>         (vect_permute_load_chain): New permutations for loads group of
>         length 3.
>         * tree-vect-stmts.c (vect_model_store_cost): New cost
>         vec_perm_shuffle for the new permutations.
>         (vect_model_load_cost): Ditto.
>         * config/aarch64/aarch64.c (builtin_vectorization_cost): Adding
>         vec_perm_shuffle cost as equivalent of vec_perm cost.
>         * config/arm/arm.c: Ditto.
>         * config/rs6000/rs6000.c: Ditto.
>         * config/spu/spu.c: Ditto.
>         * config/i386/x86-tune.def (TARGET_SLOW_PHUFFB): Target for slow
>         byte shuffle on some x86 architectures.
>         * config/i386/i386.h (processor_costs): Defining pshufb cost.
>         * config/i386/i386.c (processor_costs): Adding pshufb cost.
>         (ix86_builtin_vectorization_cost): Adding cost for the new
>         permutations.
>         Fixing cost for other permutations.
>         (expand_vec_perm_even_odd_1): Avoid byte shuffles when they are
>         slow (TARGET_SLOW_PHUFFB).
>         (ix86_add_stmt_cost): Adding cost when STMT is WIDEN_MULTIPLY.
>         Adding new shuffle cost only when byte shuffle is expected.
>         Fixing cost model for Silvermont.
> 
> Thanks,
> Evgeny
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE / SUSE Labs
SUSE LINUX Products GmbH - Nuernberg - AG Nuernberg - HRB 16746
GF: Jeff Hawn, Jennifer Guild, Felix Imend"orffer


* Re: [PATCH, PR52252] Vectorization for load/store groups of size 3.
  2014-02-11 13:01 ` [PATCH, PR52252] Vectorization for load/store groups of size 3 Richard Biener
@ 2014-02-11 14:03   ` Evgeny Stupachenko
  2014-02-11 15:20     ` Richard Biener
  0 siblings, 1 reply; 26+ messages in thread
From: Evgeny Stupachenko @ 2014-02-11 14:03 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches, Jakub Jelinek, ubizjak

[-- Attachment #1: Type: text/plain, Size: 3765 bytes --]

The missing patch is attached as plain text.

I have a copyright assignment on file with the FSF covering work on GCC.

Load/store groups of length 3 are the most frequent non-power-of-2
case. They are used in RGB image processing (like the test case in
PR52252). We can certainly extend the patch to length 5 and more.
However, this could affect performance on some other architectures and
requires more extensive testing. So length 3 is just a first step. The
algorithm in the patch could be modified for the general case in several
steps.
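
To make the length-3 store permutation concrete, here is a minimal scalar
sketch of the two-step shuffle that the length-3 branch of
vect_permute_store_chain builds in the attached patch. It models
VEC_PERM_EXPR on plain int arrays rather than real vectors. The names
vec_perm, permute_store_chain3, check_store3 and MAX_NELT are made up for
illustration and do not exist in GCC; only the mask arithmetic is taken
from the patch.

```c
#include <assert.h>
#include <string.h>

#define MAX_NELT 16  /* Hypothetical upper bound on elements per vector.  */

/* Scalar model of VEC_PERM_EXPR <v1, v2, sel>: element i of the result is
   taken from the concatenation of v1 and v2 at index sel[i].  */
static void
vec_perm (int *dst, const int *v1, const int *v2,
          const unsigned char *sel, unsigned nelt)
{
  int tmp[MAX_NELT];
  for (unsigned i = 0; i < nelt; i++)
    tmp[i] = sel[i] < nelt ? v1[sel[i]] : v2[sel[i] - nelt];
  memcpy (dst, tmp, nelt * sizeof (int));
}

/* Interleave the three input vectors chain[0..2] into result[0..2] with
   the same mask construction as the length-3 branch in the patch.  */
static void
permute_store_chain3 (int chain[3][MAX_NELT], int result[3][MAX_NELT],
                      unsigned nelt)
{
  unsigned char sel[MAX_NELT];
  int low[MAX_NELT];
  unsigned j0 = 0, j1 = 0, j2 = 0;

  assert (nelt <= MAX_NELT);
  for (unsigned j = 0; j < 3; j++)
    {
      unsigned nelt0 = ((3 - j) * nelt) % 3;
      unsigned nelt1 = ((3 - j) * nelt + 1) % 3;
      unsigned nelt2 = ((3 - j) * nelt + 2) % 3;

      /* First mask: place elements of chain[0] and chain[1]; the slots
         reserved for chain[2] are don't-care and get index 0.  */
      for (unsigned i = 0; i < nelt; i++)
        {
          if (3 * i + nelt0 < nelt)
            sel[3 * i + nelt0] = j0++;
          if (3 * i + nelt1 < nelt)
            sel[3 * i + nelt1] = nelt + j1++;
          if (3 * i + nelt2 < nelt)
            sel[3 * i + nelt2] = 0;
        }
      vec_perm (low, chain[0], chain[1], sel, nelt);

      /* Second mask: keep what the first shuffle placed and fill the
         remaining slots from chain[2].  */
      for (unsigned i = 0; i < nelt; i++)
        {
          if (3 * i + nelt0 < nelt)
            sel[3 * i + nelt0] = 3 * i + nelt0;
          if (3 * i + nelt1 < nelt)
            sel[3 * i + nelt1] = 3 * i + nelt1;
          if (3 * i + nelt2 < nelt)
            sel[3 * i + nelt2] = nelt + j2++;
        }
      vec_perm (result[j], low, chain[2], sel, nelt);
    }
}

/* Return 1 if the three result vectors, read back to back, equal the
   interleaved stream a0 b0 c0 a1 b1 c1 ..., and 0 otherwise.  */
int
check_store3 (unsigned nelt)
{
  int chain[3][MAX_NELT], result[3][MAX_NELT];

  for (unsigned k = 0; k < 3; k++)
    for (unsigned i = 0; i < nelt; i++)
      chain[k][i] = 100 * (int) k + (int) i;

  permute_store_chain3 (chain, result, nelt);

  for (unsigned p = 0; p < 3 * nelt; p++)
    if (result[p / nelt][p % nelt] != 100 * (int) (p % 3) + (int) (p / 3))
      return 0;
  return 1;
}
```

For nelt = 8 the first pair of masks comes out as {0, 8, 0, 1, 9, 0, 2, 10}
and {0, 1, 8, 3, 4, 9, 6, 7}, so each output vector costs two permutations.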

I understand that the patch should wait for stage 1; however, since
it is ready, we can discuss it right now and make some changes (like
a general group size).
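
As a starting point for the missing testcase, here is a minimal sketch of
the kind of loop the patch targets, with one load group and one store group
of size 3. This is not the test from PR52252; the function name swap_rb and
the channel swap are assumptions chosen for illustration, and no dg-
directives are shown. Whether the loop is actually vectorized depends on
the target and flags.

```c
#include <assert.h>

/* Swap the R and B channels of n packed 3-byte pixels.  Each iteration
   reads three adjacent bytes and writes three adjacent bytes, so both the
   loads and the stores form groups of size 3.  */
void
swap_rb (const unsigned char *restrict in, unsigned char *restrict out,
         int n)
{
  assert (n >= 0);
  for (int i = 0; i < n; i++)
    {
      out[3 * i + 0] = in[3 * i + 2];
      out[3 * i + 1] = in[3 * i + 1];
      out[3 * i + 2] = in[3 * i + 0];
    }
}
```

A runtime check can compare the output of such a loop against known values
for a small input.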

Thanks,
Evgeny

On Tue, Feb 11, 2014 at 5:00 PM, Richard Biener <rguenther@suse.de> wrote:
>
> On Tue, 11 Feb 2014, Evgeny Stupachenko wrote:
>
> > Hi,
> >
> > The patch gives the expected 3x speedup for the test case in PR52252
> > (and even 6x for AVX2).
> > It passes make check and bootstrap on x86.
> > spec2000/spec2006 show no regressions or gains on x86.
> >
> > Is this patch ok?
>
> I've worked on generalizing the permutation support in the light
> of the availability of the generic shuffle support in the IL
> but hit some road-blocks in the way code-generation works for
> group loads with permutations (I don't remember if I posted all patches).
>
> This patch seems to touch a slightly different place but it again
> special-cases a specific permutation.  Why's that?  Why can't we
> support groups of size 7 for example?  So - can this be generalized
> to support arbitrary non-power-of-two load/store groups?
>
> Other than that the patch has to wait for stage1 to open again,
> of course.  And it misses a testcase.
>
> Btw, do you have a copyright assignment on file with the FSF covering
> work on GCC?
>
> Thanks,
> Richard.
>
> > ChangeLog:
> >
> > 2014-02-11  Evgeny Stupachenko  <evstupac@gmail.com>
> >
> >         * target.h (vect_cost_for_stmt): Defining new cost vec_perm_shuffle.
> >         * tree-vect-data-refs.c (vect_grouped_store_supported): New
> >         check for stores group of length 3.
> >         (vect_permute_store_chain): New permutations for stores group of
> >         length 3.
> >         (vect_grouped_load_supported): New check for loads group of
> >         length 3.
> >         (vect_permute_load_chain): New permutations for loads group of
> >         length 3.
> >         * tree-vect-stmts.c (vect_model_store_cost): New cost
> >         vec_perm_shuffle for the new permutations.
> >         (vect_model_load_cost): Ditto.
> >         * config/aarch64/aarch64.c (builtin_vectorization_cost): Adding
> >         vec_perm_shuffle cost as equivalent of vec_perm cost.
> >         * config/arm/arm.c: Ditto.
> >         * config/rs6000/rs6000.c: Ditto.
> >         * config/spu/spu.c: Ditto.
> >         * config/i386/x86-tune.def (TARGET_SLOW_PHUFFB): Target for slow
> >         byte shuffle on some x86 architectures.
> >         * config/i386/i386.h (processor_costs): Defining pshufb cost.
> >         * config/i386/i386.c (processor_costs): Adding pshufb cost.
> >         (ix86_builtin_vectorization_cost): Adding cost for the new
> >         permutations.
> >         Fixing cost for other permutations.
> >         (expand_vec_perm_even_odd_1): Avoid byte shuffles when they are
> >         slow (TARGET_SLOW_PHUFFB).
> >         (ix86_add_stmt_cost): Adding cost when STMT is WIDEN_MULTIPLY.
> >         Adding new shuffle cost only when byte shuffle is expected.
> >         Fixing cost model for Silvermont.
> >
> > Thanks,
> > Evgeny
> >
>
> --
> Richard Biener <rguenther@suse.de>
> SUSE / SUSE Labs
> SUSE LINUX Products GmbH - Nuernberg - AG Nuernberg - HRB 16746
> GF: Jeff Hawn, Jennifer Guild, Felix Imend"orffer

[-- Attachment #2: vect.patch --]
[-- Type: application/octet-stream, Size: 34619 bytes --]

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index d3c5cbc..cb876f7 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -4954,6 +4954,7 @@ aarch64_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost,
       case cond_branch_not_taken:
 	return aarch64_tune_params->vec_costs->cond_not_taken_branch_cost;
 
+      case vec_perm_shuffle:
       case vec_perm:
       case vec_promote_demote:
 	return aarch64_tune_params->vec_costs->vec_stmt_cost;
diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
index 825407f..4630c7b 100644
--- a/gcc/config/arm/arm.c
+++ b/gcc/config/arm/arm.c
@@ -11311,6 +11311,7 @@ arm_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost,
       case cond_branch_not_taken:
         return current_tune->vec_costs->cond_not_taken_branch_cost;
 
+      case vec_perm_shuffle:
       case vec_perm:
       case vec_promote_demote:
         return current_tune->vec_costs->vec_stmt_cost;
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index dcf96e6..3e20db4 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -81,6 +81,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "context.h"
 #include "pass_manager.h"
 #include "target-globals.h"
+#include "tree-vectorizer.h"
 
 static rtx legitimize_dllimport_symbol (rtx, bool);
 static rtx legitimize_pe_coff_extern_decl (rtx, bool);
@@ -175,6 +176,7 @@ struct processor_costs ix86_size_cost = {/* costs for tuning for size */
   1,					/* vec_align_load_cost.  */
   1,					/* vec_unalign_load_cost.  */
   1,					/* vec_store_cost.  */
+  1,					/* vec_pshufb_cost.  */
   1,					/* cond_taken_branch_cost.  */
   1,					/* cond_not_taken_branch_cost.  */
 };
@@ -251,6 +253,7 @@ struct processor_costs i386_cost = {	/* 386 specific costs */
   1,					/* vec_align_load_cost.  */
   2,					/* vec_unalign_load_cost.  */
   1,					/* vec_store_cost.  */
+  1,					/* vec_pshufb_cost.  */
   3,					/* cond_taken_branch_cost.  */
   1,					/* cond_not_taken_branch_cost.  */
 };
@@ -328,6 +331,7 @@ struct processor_costs i486_cost = {	/* 486 specific costs */
   1,					/* vec_align_load_cost.  */
   2,					/* vec_unalign_load_cost.  */
   1,					/* vec_store_cost.  */
+  1,					/* vec_pshufb_cost.  */
   3,					/* cond_taken_branch_cost.  */
   1,					/* cond_not_taken_branch_cost.  */
 };
@@ -403,6 +407,7 @@ struct processor_costs pentium_cost = {
   1,					/* vec_align_load_cost.  */
   2,					/* vec_unalign_load_cost.  */
   1,					/* vec_store_cost.  */
+  1,					/* vec_pshufb_cost.  */
   3,					/* cond_taken_branch_cost.  */
   1,					/* cond_not_taken_branch_cost.  */
 };
@@ -486,6 +491,7 @@ struct processor_costs pentiumpro_cost = {
   1,					/* vec_align_load_cost.  */
   2,					/* vec_unalign_load_cost.  */
   1,					/* vec_store_cost.  */
+  1,					/* vec_pshufb_cost.  */
   3,					/* cond_taken_branch_cost.  */
   1,					/* cond_not_taken_branch_cost.  */
 };
@@ -561,6 +567,7 @@ struct processor_costs geode_cost = {
   1,					/* vec_align_load_cost.  */
   2,					/* vec_unalign_load_cost.  */
   1,					/* vec_store_cost.  */
+  1,					/* vec_pshufb_cost.  */
   3,					/* cond_taken_branch_cost.  */
   1,					/* cond_not_taken_branch_cost.  */
 };
@@ -638,6 +645,7 @@ struct processor_costs k6_cost = {
   1,					/* vec_align_load_cost.  */
   2,					/* vec_unalign_load_cost.  */
   1,					/* vec_store_cost.  */
+  1,					/* vec_pshufb_cost.  */
   3,					/* cond_taken_branch_cost.  */
   1,					/* cond_not_taken_branch_cost.  */
 };
@@ -715,6 +723,7 @@ struct processor_costs athlon_cost = {
   1,					/* vec_align_load_cost.  */
   2,					/* vec_unalign_load_cost.  */
   1,					/* vec_store_cost.  */
+  1,					/* vec_pshufb_cost.  */
   3,					/* cond_taken_branch_cost.  */
   1,					/* cond_not_taken_branch_cost.  */
 };
@@ -802,6 +811,7 @@ struct processor_costs k8_cost = {
   2,					/* vec_align_load_cost.  */
   3,					/* vec_unalign_load_cost.  */
   3,					/* vec_store_cost.  */
+  5,					/* vec_pshufb_cost.  */
   3,					/* cond_taken_branch_cost.  */
   2,					/* cond_not_taken_branch_cost.  */
 };
@@ -896,6 +906,7 @@ struct processor_costs amdfam10_cost = {
   2,					/* vec_align_load_cost.  */
   2,					/* vec_unalign_load_cost.  */
   2,					/* vec_store_cost.  */
+  6,					/* vec_pshufb_cost.  */
   2,					/* cond_taken_branch_cost.  */
   1,					/* cond_not_taken_branch_cost.  */
 };
@@ -991,6 +1002,7 @@ const struct processor_costs bdver1_cost = {
   4,					/* vec_align_load_cost.  */
   4,					/* vec_unalign_load_cost.  */
   4,					/* vec_store_cost.  */
+  6,					/* vec_pshufb_cost.  */
   2,					/* cond_taken_branch_cost.  */
   1,					/* cond_not_taken_branch_cost.  */
 };
@@ -1087,6 +1099,7 @@ const struct processor_costs bdver2_cost = {
   4,					/* vec_align_load_cost.  */
   4,					/* vec_unalign_load_cost.  */
   4,					/* vec_store_cost.  */
+  6,					/* vec_pshufb_cost.  */
   2,					/* cond_taken_branch_cost.  */
   1,					/* cond_not_taken_branch_cost.  */
 };
@@ -1174,6 +1187,7 @@ struct processor_costs bdver3_cost = {
   4,					/* vec_align_load_cost.  */
   4,					/* vec_unalign_load_cost.  */
   4,					/* vec_store_cost.  */
+  6,					/* vec_pshufb_cost.  */
   2,					/* cond_taken_branch_cost.  */
   1,					/* cond_not_taken_branch_cost.  */
 };
@@ -1260,6 +1274,7 @@ struct processor_costs bdver4_cost = {
   4,					/* vec_align_load_cost.  */
   4,					/* vec_unalign_load_cost.  */
   4,					/* vec_store_cost.  */
+  6,					/* vec_pshufb_cost.  */
   2,					/* cond_taken_branch_cost.  */
   1,					/* cond_not_taken_branch_cost.  */
 };
@@ -1349,6 +1364,7 @@ const struct processor_costs btver1_cost = {
   2,					/* vec_align_load_cost.  */
   2,					/* vec_unalign_load_cost.  */
   2,					/* vec_store_cost.  */
+  6,					/* vec_pshufb_cost.  */
   2,					/* cond_taken_branch_cost.  */
   1,					/* cond_not_taken_branch_cost.  */
 };
@@ -1434,6 +1450,7 @@ const struct processor_costs btver2_cost = {
   2,					/* vec_align_load_cost.  */
   2,					/* vec_unalign_load_cost.  */
   2,					/* vec_store_cost.  */
+  6,					/* vec_pshufb_cost.  */
   2,					/* cond_taken_branch_cost.  */
   1,					/* cond_not_taken_branch_cost.  */
 };
@@ -1510,6 +1527,7 @@ struct processor_costs pentium4_cost = {
   1,					/* vec_align_load_cost.  */
   2,					/* vec_unalign_load_cost.  */
   1,					/* vec_store_cost.  */
+  1,					/* vec_pshufb_cost.  */
   3,					/* cond_taken_branch_cost.  */
   1,					/* cond_not_taken_branch_cost.  */
 };
@@ -1589,6 +1607,7 @@ struct processor_costs nocona_cost = {
   1,					/* vec_align_load_cost.  */
   2,					/* vec_unalign_load_cost.  */
   1,					/* vec_store_cost.  */
+  1,					/* vec_pshufb_cost.  */
   3,					/* cond_taken_branch_cost.  */
   1,					/* cond_not_taken_branch_cost.  */
 };
@@ -1666,6 +1685,7 @@ struct processor_costs atom_cost = {
   1,					/* vec_align_load_cost.  */
   2,					/* vec_unalign_load_cost.  */
   1,					/* vec_store_cost.  */
+  4,					/* vec_pshufb_cost.  */
   3,					/* cond_taken_branch_cost.  */
   1,					/* cond_not_taken_branch_cost.  */
 };
@@ -1738,11 +1758,12 @@ struct processor_costs slm_cost = {
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
   1,					/* vec_stmt_cost.  */
-  1,					/* vec_to_scalar_cost.  */
+  4,					/* vec_to_scalar_cost.  */
   1,					/* scalar_to_vec_cost.  */
   1,					/* vec_align_load_cost.  */
-  2,					/* vec_unalign_load_cost.  */
+  3,					/* vec_unalign_load_cost.  */
   1,					/* vec_store_cost.  */
+  6,					/* vec_pshufb_cost.  */
   3,					/* cond_taken_branch_cost.  */
   1,					/* cond_not_taken_branch_cost.  */
 };
@@ -1815,11 +1836,12 @@ struct processor_costs intel_cost = {
   1,					/* scalar load_cost.  */
   1,					/* scalar_store_cost.  */
   1,					/* vec_stmt_cost.  */
-  1,					/* vec_to_scalar_cost.  */
+  4,					/* vec_to_scalar_cost.  */
   1,					/* scalar_to_vec_cost.  */
   1,					/* vec_align_load_cost.  */
-  2,					/* vec_unalign_load_cost.  */
+  3,					/* vec_unalign_load_cost.  */
   1,					/* vec_store_cost.  */
+  6,					/* vec_pshufb_cost.  */
   3,					/* cond_taken_branch_cost.  */
   1,					/* cond_not_taken_branch_cost.  */
 };
@@ -1907,6 +1929,7 @@ struct processor_costs generic_cost = {
   1,					/* vec_align_load_cost.  */
   2,					/* vec_unalign_load_cost.  */
   1,					/* vec_store_cost.  */
+  1,					/* vec_pshufb_cost.  */
   3,					/* cond_taken_branch_cost.  */
   1,					/* cond_not_taken_branch_cost.  */
 };
@@ -1993,6 +2016,7 @@ struct processor_costs core_cost = {
   1,					/* vec_align_load_cost.  */
   2,					/* vec_unalign_load_cost.  */
   1,					/* vec_store_cost.  */
+  1,					/* vec_pshufb_cost.  */
   3,					/* cond_taken_branch_cost.  */
   1,					/* cond_not_taken_branch_cost.  */
 };
@@ -42382,7 +42406,20 @@ ix86_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost,
       case cond_branch_not_taken:
         return ix86_cost->cond_not_taken_branch_cost;
 
+      /* Byte permutations need 2 "pshufb" instructions and 1 "or" on x86;
+	 pshufb costs up to 6 ticks on Silvermont.  */
+      case vec_perm_shuffle:
+        return 2 * ix86_cost->vec_pshufb_cost + ix86_cost->vec_stmt_cost;
+
+      /* Other permutations need 2 non-byte shuffle instructions, but
+	 not always 1 "or" or "shuffle" in addition.
+	 All non-byte shuffles cost 1 tick on x86.  */
       case vec_perm:
+        return 2 * ix86_cost->vec_stmt_cost;
+
+      /* The "or" instruction is not used in promotions/demotions.
+	 Moreover, promotions need 1 or 2 instructions,
+	 so the model here is optimistic.  */
       case vec_promote_demote:
         return ix86_cost->vec_stmt_cost;
 
@@ -44110,7 +44147,7 @@ expand_vec_perm_even_odd_1 (struct expand_vec_perm_d *d, unsigned odd)
       gcc_unreachable ();
 
     case V8HImode:
-      if (TARGET_SSSE3)
+      if (TARGET_SSSE3 && !TARGET_SLOW_PHUFFB)
 	return expand_vec_perm_pshufb2 (d);
       else
 	{
@@ -44131,7 +44168,7 @@ expand_vec_perm_even_odd_1 (struct expand_vec_perm_d *d, unsigned odd)
       break;
 
     case V16QImode:
-      if (TARGET_SSSE3)
+      if (TARGET_SSSE3 && !TARGET_SLOW_PHUFFB)
 	return expand_vec_perm_pshufb2 (d);
       else
 	{
@@ -46375,6 +46412,10 @@ ix86_init_cost (struct loop *loop_info ATTRIBUTE_UNUSED)
   return cost;
 }
 
+#define CODE_IS_WIDEN_MULTIPLY(c) ((c) == WIDEN_MULT_EXPR || \
+				   (c) == WIDEN_MULT_PLUS_EXPR || \
+				   (c) == WIDEN_MULT_MINUS_EXPR)
+
 /* Implement targetm.vectorize.add_stmt_cost.  */
 
 static unsigned
@@ -46384,9 +46425,38 @@ ix86_add_stmt_cost (void *data, int count, enum vect_cost_for_stmt kind,
 {
   unsigned *cost = (unsigned *) data;
   unsigned retval = 0;
-
   tree vectype = stmt_info ? stmt_vectype (stmt_info) : NULL_TREE;
   int stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
+  enum tree_code code;
+
+  /* Check the vector type to detect byte shuffles.  We need this as byte
+     shuffles are expensive on Silvermont and Bonnell.  */
+  if (kind == vec_perm_shuffle)
+    {
+      if (vectype && stmt_info && stmt_info->loop_vinfo
+	  && ((TYPE_MODE (TREE_TYPE (vectype)) == QImode
+	       && LOOP_VINFO_VECT_FACTOR (stmt_info->loop_vinfo) == 16)
+	  || (TYPE_MODE (TREE_TYPE (vectype)) == HImode
+	      && LOOP_VINFO_VECT_FACTOR (stmt_info->loop_vinfo) == 8)))
+	    stmt_cost = ix86_builtin_vectorization_cost (vec_perm_shuffle,
+							 vectype, misalign);
+      else
+	    stmt_cost = ix86_builtin_vectorization_cost (vec_perm,
+							 vectype, misalign);
+    }
+
+  /* We need to multiply cost of vector widen multiplication by 3 as we need
+     about 3 multiplications instead of 1 to complete the operation.  */
+  if (stmt_info && stmt_info->stmt)
+    {
+      if (is_gimple_assign (stmt_info->stmt)
+	  && TREE_CODE (gimple_assign_lhs (stmt_info->stmt)) == SSA_NAME)
+	{
+	  code = gimple_assign_rhs_code (stmt_info->stmt);
+	  if (CODE_IS_WIDEN_MULTIPLY (code))
+	    stmt_cost *= 4;
+	}
+    }
 
   /* Statements in an inner loop relative to the loop being
      vectorized are weighted more heavily.  The value here is
@@ -46395,6 +46465,18 @@ ix86_add_stmt_cost (void *data, int count, enum vect_cost_for_stmt kind,
     count *= 50;  /* FIXME.  */
 
   retval = (unsigned) (count * stmt_cost);
+
+  /* We need to multiply all vector stmt costs by 1.8 (estimated cost)
+     for Silvermont as it has an out-of-order integer pipeline and can execute
+     2 scalar instructions per tick, but has an in-order SIMD pipeline.  */
+  if (TARGET_SILVERMONT || TARGET_INTEL)
+    if (stmt_info && stmt_info->stmt)
+      {
+	tree lhs_op = gimple_get_lhs (stmt_info->stmt);
+	if (lhs_op && TREE_CODE (TREE_TYPE (lhs_op)) == INTEGER_TYPE)
+	  retval = (retval * 18) / 10;
+      }
+
   cost[where] += retval;
 
   return retval;
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index bfb6dc6..78ac50f 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -237,6 +237,7 @@ struct processor_costs {
   const int vec_align_load_cost;   /* Cost of aligned vector load.  */
   const int vec_unalign_load_cost; /* Cost of unaligned vector load.  */
   const int vec_store_cost;        /* Cost of vector store.  */
+  const int vec_pshufb_cost;       /* Cost of vector pshufb.  */
   const int cond_taken_branch_cost;    /* Cost of taken branch for vectorizer
 					  cost model.  */
   const int cond_not_taken_branch_cost;/* Cost of not taken branch for
@@ -420,6 +421,8 @@ extern unsigned char ix86_tune_features[X86_TUNE_LAST];
 	ix86_tune_features[X86_TUNE_USE_VECTOR_FP_CONVERTS]
 #define TARGET_USE_VECTOR_CONVERTS \
 	ix86_tune_features[X86_TUNE_USE_VECTOR_CONVERTS]
+#define TARGET_SLOW_PHUFFB \
+	ix86_tune_features[X86_TUNE_SLOW_PHUFFB]
 #define TARGET_FUSE_CMP_AND_BRANCH_32 \
 	ix86_tune_features[X86_TUNE_FUSE_CMP_AND_BRANCH_32]
 #define TARGET_FUSE_CMP_AND_BRANCH_64 \
diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
index 8399102..f9ebf94 100644
--- a/gcc/config/i386/x86-tune.def
+++ b/gcc/config/i386/x86-tune.def
@@ -386,6 +386,10 @@ DEF_TUNE (X86_TUNE_USE_VECTOR_FP_CONVERTS, "use_vector_fp_converts",
    from integer to FP. */
 DEF_TUNE (X86_TUNE_USE_VECTOR_CONVERTS, "use_vector_converts", m_AMDFAM10)
 
+/* X86_TUNE_SLOW_PHUFFB: Indicates tunings with slow pshufb instruction.  */
+DEF_TUNE (X86_TUNE_SLOW_PHUFFB, "slow_pshuffb",
+          m_BONNELL | m_SILVERMONT | m_INTEL)
+
 /*****************************************************************************/
 /* AVX instruction selection tuning (some of SSE flags affects AVX, too)     */
 /*****************************************************************************/
diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index 0bd92d4..1a3020f 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -4131,6 +4131,7 @@ rs6000_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost,
       case cond_branch_not_taken:
         return 1;
 
+      case vec_perm_shuffle:
       case vec_perm:
 	if (TARGET_VSX)
 	  return 3;
diff --git a/gcc/config/spu/spu.c b/gcc/config/spu/spu.c
index 302d7e0..7a2618e 100644
--- a/gcc/config/spu/spu.c
+++ b/gcc/config/spu/spu.c
@@ -6622,6 +6622,7 @@ spu_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost,
       case vec_to_scalar:
       case scalar_to_vec:
       case cond_branch_not_taken:
+      case vec_perm_shuffle:
       case vec_perm:
       case vec_promote_demote:
         return 1;
diff --git a/gcc/target.h b/gcc/target.h
index 7c28efa..7f6b723 100644
--- a/gcc/target.h
+++ b/gcc/target.h
@@ -154,6 +154,7 @@ enum vect_cost_for_stmt
   cond_branch_not_taken,
   cond_branch_taken,
   vec_perm,
+  vec_perm_shuffle,
   vec_promote_demote,
   vec_construct
 };
diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c
index c3e8f37..a5ff330 100644
--- a/gcc/tree-vect-data-refs.c
+++ b/gcc/tree-vect-data-refs.c
@@ -4324,13 +4324,14 @@ vect_grouped_store_supported (tree vectype, unsigned HOST_WIDE_INT count)
 {
   enum machine_mode mode = TYPE_MODE (vectype);
 
-  /* vect_permute_store_chain requires the group size to be a power of two.  */
-  if (exact_log2 (count) == -1)
+  /* vect_permute_store_chain requires the group size to be equal to 3 or
+     be a power of two.  */
+  if (count != 3 && exact_log2 (count) == -1)
     {
       if (dump_enabled_p ())
 	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-                         "the size of the group of accesses"
-                         " is not a power of 2\n");
+			 "the size of the group of accesses"
+			 " is neither a power of 2 nor equal to 3\n");
       return false;
     }
 
@@ -4339,23 +4340,73 @@ vect_grouped_store_supported (tree vectype, unsigned HOST_WIDE_INT count)
     {
       unsigned int i, nelt = GET_MODE_NUNITS (mode);
       unsigned char *sel = XALLOCAVEC (unsigned char, nelt);
-      for (i = 0; i < nelt / 2; i++)
+
+      if (exact_log2 (count) != -1)
 	{
-	  sel[i * 2] = i;
-	  sel[i * 2 + 1] = i + nelt;
+	  for (i = 0; i < nelt / 2; i++)
+	    {
+	      sel[i * 2] = i;
+	      sel[i * 2 + 1] = i + nelt;
+	    }
+	    if (can_vec_perm_p (mode, false, sel))
+	      {
+		for (i = 0; i < nelt; i++)
+		  sel[i] += nelt / 2;
+		if (can_vec_perm_p (mode, false, sel))
+		  return true;
+	      }
 	}
-      if (can_vec_perm_p (mode, false, sel))
+      else if (count == 3)
 	{
-	  for (i = 0; i < nelt; i++)
-	    sel[i] += nelt / 2;
-	  if (can_vec_perm_p (mode, false, sel))
-	    return true;
-	}
+	  unsigned int j0 = 0, j1 = 0, j2 = 0;
+	  unsigned int i, j;
+
+	  for (j = 0; j < 3; j++)
+	    {
+	      int nelt0 = ((3 - j) * nelt) % 3;
+	      int nelt1 = ((3 - j) * nelt + 1) % 3;
+	      int nelt2 = ((3 - j) * nelt + 2) % 3;
+	      for (i = 0; i < nelt; i++)
+		{
+		  if (3 * i + nelt0 < nelt)
+		    sel[3 * i + nelt0] = j0++;
+		  if (3 * i + nelt1 < nelt)
+		    sel[3 * i + nelt1] = nelt + j1++;
+		  if (3 * i + nelt2 < nelt)
+		    sel[3 * i + nelt2] = 0;
+		}
+	      if (!can_vec_perm_p (mode, false, sel))
+		{
+		  if (dump_enabled_p ())
+		    dump_printf (MSG_MISSED_OPTIMIZATION,
+				 "permutation op not supported by target.\n");
+		  return false;
+		}
+
+	      for (i = 0; i < nelt; i++)
+		{
+		  if (3 * i + nelt0 < nelt)
+		    sel[3 * i + nelt0] = 3 * i + nelt0;
+		  if (3 * i + nelt1 < nelt)
+		    sel[3 * i + nelt1] = 3 * i + nelt1;
+		  if (3 * i + nelt2 < nelt)
+		    sel[3 * i + nelt2] = nelt + j2++;
+		}
+	      if (!can_vec_perm_p (mode, false, sel))
+		{
+		  if (dump_enabled_p ())
+		    dump_printf (MSG_MISSED_OPTIMIZATION,
+				 "permutation op not supported by target.\n");
+		  return false;
+		}
+	    }
+	  return true;
+      }
     }
 
   if (dump_enabled_p ())
     dump_printf (MSG_MISSED_OPTIMIZATION,
-                 "interleave op not supported by target.\n");
+		 "permutation op not supported by target.\n");
   return false;
 }
 
@@ -4375,9 +4426,9 @@ vect_store_lanes_supported (tree vectype, unsigned HOST_WIDE_INT count)
 /* Function vect_permute_store_chain.
 
    Given a chain of interleaved stores in DR_CHAIN of LENGTH that must be
-   a power of 2, generate interleave_high/low stmts to reorder the data
-   correctly for the stores.  Return the final references for stores in
-   RESULT_CHAIN.
+   a power of 2 or equal to 3, generate interleave_high/low stmts to reorder
+   the data correctly for the stores.  Return the final references for stores
+   in RESULT_CHAIN.
 
    E.g., LENGTH is 4 and the scalar type is short, i.e., VF is 8.
    The input is 4 vectors each containing 8 elements.  We assign a number to
@@ -4444,7 +4495,9 @@ vect_permute_store_chain (vec<tree> dr_chain,
   gimple perm_stmt;
   tree vectype = STMT_VINFO_VECTYPE (vinfo_for_stmt (stmt));
   tree perm_mask_low, perm_mask_high;
-  unsigned int i, n;
+  tree data_ref;
+  tree perm3_mask_low, perm3_mask_high;
+  unsigned int i, n, log_length = exact_log2 (length);
   unsigned int j, nelt = TYPE_VECTOR_SUBPARTS (vectype);
   unsigned char *sel = XALLOCAVEC (unsigned char, nelt);
 
@@ -4452,47 +4505,117 @@ vect_permute_store_chain (vec<tree> dr_chain,
   memcpy (result_chain->address (), dr_chain.address (),
 	  length * sizeof (tree));
 
-  for (i = 0, n = nelt / 2; i < n; i++)
+  if (log_length != -1)
     {
-      sel[i * 2] = i;
-      sel[i * 2 + 1] = i + nelt;
-    }
-  perm_mask_high = vect_gen_perm_mask (vectype, sel);
-  gcc_assert (perm_mask_high != NULL);
+      for (i = 0, n = nelt / 2; i < n; i++)
+	{
+	  sel[i * 2] = i;
+	  sel[i * 2 + 1] = i + nelt;
+	}
+	perm_mask_high = vect_gen_perm_mask (vectype, sel);
+	gcc_assert (perm_mask_high != NULL);
 
-  for (i = 0; i < nelt; i++)
-    sel[i] += nelt / 2;
-  perm_mask_low = vect_gen_perm_mask (vectype, sel);
-  gcc_assert (perm_mask_low != NULL);
+	for (i = 0; i < nelt; i++)
+	  sel[i] += nelt / 2;
+	perm_mask_low = vect_gen_perm_mask (vectype, sel);
+	gcc_assert (perm_mask_low != NULL);
 
-  for (i = 0, n = exact_log2 (length); i < n; i++)
+	for (i = 0, n = log_length; i < n; i++)
+	  {
+	    for (j = 0; j < length/2; j++)
+	      {
+		vect1 = dr_chain[j];
+		vect2 = dr_chain[j+length/2];
+
+		/* Create interleaving stmt:
+		   high = VEC_PERM_EXPR <vect1, vect2, {0, nelt, 1, nelt+1,
+							...}>  */
+		high = make_temp_ssa_name (vectype, NULL, "vect_inter_high");
+		perm_stmt
+		  = gimple_build_assign_with_ops (VEC_PERM_EXPR, high,
+						  vect1, vect2, perm_mask_high);
+		vect_finish_stmt_generation (stmt, perm_stmt, gsi);
+		(*result_chain)[2*j] = high;
+
+		/* Create interleaving stmt:
+		   low = VEC_PERM_EXPR <vect1, vect2,
+					{nelt/2, nelt*3/2, nelt/2+1, nelt*3/2+1,
+					 ...}>  */
+		low = make_temp_ssa_name (vectype, NULL, "vect_inter_low");
+		perm_stmt
+		  = gimple_build_assign_with_ops (VEC_PERM_EXPR, low,
+						  vect1, vect2, perm_mask_low);
+		vect_finish_stmt_generation (stmt, perm_stmt, gsi);
+		(*result_chain)[2*j+1] = low;
+	      }
+	    memcpy (dr_chain.address (), result_chain->address (),
+		    length * sizeof (tree));
+	  }
+    }
+  /* length is not a power of 2.  */
+  else
     {
-      for (j = 0; j < length/2; j++)
-	{
-	  vect1 = dr_chain[j];
-	  vect2 = dr_chain[j+length/2];
+      unsigned int j0 = 0, j1 = 0, j2 = 0;
+
+      /* Currently only length 3 is supported as the most frequent case.  */
+      gcc_assert (length == 3);
+
+      for (j = 0; j < 3; j++)
+        {
+	  int nelt0 = ((3 - j) * nelt) % 3;
+	  int nelt1 = ((3 - j) * nelt + 1) % 3;
+	  int nelt2 = ((3 - j) * nelt + 2) % 3;
+
+	  for (i = 0; i < nelt; i++)
+	    {
+	      if (3 * i + nelt0 < nelt)
+		sel[3 * i + nelt0] = j0++;
+	      if (3 * i + nelt1 < nelt)
+		sel[3 * i + nelt1] = nelt + j1++;
+	      if (3 * i + nelt2 < nelt)
+		sel[3 * i + nelt2] = 0;
+	    }
+	  perm3_mask_low = vect_gen_perm_mask (vectype, sel);
+	  gcc_assert (perm3_mask_low != NULL);
+
+	  for (i = 0; i < nelt; i++)
+	    {
+	      if (3 * i + nelt0 < nelt)
+		sel[3 * i + nelt0] = 3 * i + nelt0;
+	      if (3 * i + nelt1 < nelt)
+		sel[3 * i + nelt1] = 3 * i + nelt1;
+	      if (3 * i + nelt2 < nelt)
+		sel[3 * i + nelt2] = nelt + j2++;
+	    }
+	  perm3_mask_high = vect_gen_perm_mask (vectype, sel);
+	  gcc_assert (perm3_mask_high != NULL);
+
+	  vect1 = dr_chain[0];
+	  vect2 = dr_chain[1];
 
 	  /* Create interleaving stmt:
-	     high = VEC_PERM_EXPR <vect1, vect2, {0, nelt, 1, nelt+1, ...}>  */
-	  high = make_temp_ssa_name (vectype, NULL, "vect_inter_high");
-	  perm_stmt
-	    = gimple_build_assign_with_ops (VEC_PERM_EXPR, high,
-					    vect1, vect2, perm_mask_high);
+	     low = VEC_PERM_EXPR <vect1, vect2,
+				  {j, nelt, *, j + 1, nelt + j + 1, *,
+				   j + 2, nelt + j + 2, *, ...}>  */
+	  data_ref = make_temp_ssa_name (vectype, NULL, "vect_shuffle3_low");
+	  perm_stmt = gimple_build_assign_with_ops (VEC_PERM_EXPR, data_ref,
+						    vect1, vect2,
+						    perm3_mask_low);
 	  vect_finish_stmt_generation (stmt, perm_stmt, gsi);
-	  (*result_chain)[2*j] = high;
 
+	  vect1 = data_ref;
+	  vect2 = dr_chain[2];
 	  /* Create interleaving stmt:
-	     low = VEC_PERM_EXPR <vect1, vect2, {nelt/2, nelt*3/2, nelt/2+1,
-						 nelt*3/2+1, ...}>  */
-	  low = make_temp_ssa_name (vectype, NULL, "vect_inter_low");
-	  perm_stmt
-	    = gimple_build_assign_with_ops (VEC_PERM_EXPR, low,
-					    vect1, vect2, perm_mask_low);
+	     high = VEC_PERM_EXPR <vect1, vect2,
+				   {0, 1, nelt + j, 3, 4, nelt + j + 1,
+				    6, 7, nelt + j + 2, ...}>  */
+	  data_ref = make_temp_ssa_name (vectype, NULL, "vect_shuffle3_high");
+	  perm_stmt = gimple_build_assign_with_ops (VEC_PERM_EXPR, data_ref,
+						    vect1, vect2,
+						    perm3_mask_high);
 	  vect_finish_stmt_generation (stmt, perm_stmt, gsi);
-	  (*result_chain)[2*j+1] = low;
+	  (*result_chain)[j] = data_ref;
 	}
-      memcpy (dr_chain.address (), result_chain->address (),
-	      length * sizeof (tree));
     }
 }
 
@@ -4771,36 +4894,74 @@ vect_grouped_load_supported (tree vectype, unsigned HOST_WIDE_INT count)
 {
   enum machine_mode mode = TYPE_MODE (vectype);
 
-  /* vect_permute_load_chain requires the group size to be a power of two.  */
-  if (exact_log2 (count) == -1)
+  /* vect_permute_load_chain requires the group size to be equal to 3 or
+     be a power of two.  */
+  if (count != 3 && exact_log2 (count) == -1)
     {
       if (dump_enabled_p ())
 	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-                         "the size of the group of accesses"
-                         " is not a power of 2\n");
+			 "the size of the group of accesses"
+			 " is neither a power of 2 nor equal to 3\n");
       return false;
     }
 
   /* Check that the permutation is supported.  */
   if (VECTOR_MODE_P (mode))
     {
-      unsigned int i, nelt = GET_MODE_NUNITS (mode);
+      unsigned int i, j, nelt = GET_MODE_NUNITS (mode);
       unsigned char *sel = XALLOCAVEC (unsigned char, nelt);
 
-      for (i = 0; i < nelt; i++)
-	sel[i] = i * 2;
-      if (can_vec_perm_p (mode, false, sel))
+      if (exact_log2 (count) != -1)
 	{
 	  for (i = 0; i < nelt; i++)
-	    sel[i] = i * 2 + 1;
+	    sel[i] = i * 2;
 	  if (can_vec_perm_p (mode, false, sel))
-	    return true;
+	    {
+	      for (i = 0; i < nelt; i++)
+		sel[i] = i * 2 + 1;
+	      if (can_vec_perm_p (mode, false, sel))
+		return true;
+	    }
+	}
+      else if (count == 3)
+	{
+	  unsigned int k;
+	  for (k = 0; k < 3; k++)
+	    {
+	      for (i = 0; i < nelt; i++)
+		if (3 * i + k < 2 * nelt)
+		  sel[i] = 3 * i + k;
+		else
+		  sel[i] = 0;
+	      if (!can_vec_perm_p (mode, false, sel))
+		{
+		  if (dump_enabled_p ())
+		    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+				     "shuffle of 3 loads is not supported by"
+				     " target\n");
+		  return false;
+		}
+	      for (i = 0, j = 0; i < nelt; i++)
+		if (3 * i + k < 2 * nelt)
+		  sel[i] = i;
+		else
+		  sel[i] = nelt + ((nelt + k) % 3) + 3 * (j++);
+	      if (!can_vec_perm_p (mode, false, sel))
+		{
+		  if (dump_enabled_p ())
+		    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+				     "shuffle of 3 loads is not supported by"
+				     " target\n");
+		  return false;
+		}
+	    }
+	  return true;
 	}
     }
 
   if (dump_enabled_p ())
     dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-                     "extract even/odd not supported by target\n");
+		     "extract even/odd not supported by target\n");
   return false;
 }
 
@@ -4818,8 +4979,9 @@ vect_load_lanes_supported (tree vectype, unsigned HOST_WIDE_INT count)
 /* Function vect_permute_load_chain.
 
    Given a chain of interleaved loads in DR_CHAIN of LENGTH that must be
-   a power of 2, generate extract_even/odd stmts to reorder the input data
-   correctly.  Return the final references for loads in RESULT_CHAIN.
+   a power of 2 or equal to 3, generate extract_even/odd stmts to reorder
+   the input data correctly.  Return the final references for loads in
+   RESULT_CHAIN.
 
    E.g., LENGTH is 4 and the scalar type is short, i.e., VF is 8.
    The input is 4 vectors each containing 8 elements. We assign a number to each
@@ -4900,6 +5062,7 @@ vect_permute_load_chain (vec<tree> dr_chain,
 {
   tree data_ref, first_vect, second_vect;
   tree perm_mask_even, perm_mask_odd;
+  tree perm3_mask_low, perm3_mask_high;
   gimple perm_stmt;
   tree vectype = STMT_VINFO_VECTYPE (vinfo_for_stmt (stmt));
   unsigned int i, j, log_length = exact_log2 (length);
@@ -4910,45 +5073,99 @@ vect_permute_load_chain (vec<tree> dr_chain,
   memcpy (result_chain->address (), dr_chain.address (),
 	  length * sizeof (tree));
 
-  for (i = 0; i < nelt; ++i)
-    sel[i] = i * 2;
-  perm_mask_even = vect_gen_perm_mask (vectype, sel);
-  gcc_assert (perm_mask_even != NULL);
+  if (log_length != -1)
+    {
+      for (i = 0; i < nelt; ++i)
+	sel[i] = i * 2;
+      perm_mask_even = vect_gen_perm_mask (vectype, sel);
+      gcc_assert (perm_mask_even != NULL);
 
-  for (i = 0; i < nelt; ++i)
-    sel[i] = i * 2 + 1;
-  perm_mask_odd = vect_gen_perm_mask (vectype, sel);
-  gcc_assert (perm_mask_odd != NULL);
+      for (i = 0; i < nelt; ++i)
+	sel[i] = i * 2 + 1;
+      perm_mask_odd = vect_gen_perm_mask (vectype, sel);
+      gcc_assert (perm_mask_odd != NULL);
 
-  for (i = 0; i < log_length; i++)
-    {
-      for (j = 0; j < length; j += 2)
+      for (i = 0; i < log_length; i++)
 	{
-	  first_vect = dr_chain[j];
-	  second_vect = dr_chain[j+1];
+	  for (j = 0; j < length; j += 2)
+	    {
+	      first_vect = dr_chain[j];
+	      second_vect = dr_chain[j+1];
+
+	      /* data_ref = permute_even (first_data_ref, second_data_ref);  */
+	      data_ref = make_temp_ssa_name (vectype, NULL, "vect_perm_even");
+	      perm_stmt = gimple_build_assign_with_ops (VEC_PERM_EXPR, data_ref,
+							first_vect, second_vect,
+							perm_mask_even);
+	      vect_finish_stmt_generation (stmt, perm_stmt, gsi);
+	      (*result_chain)[j/2] = data_ref;
+
+	      /* data_ref = permute_odd (first_data_ref, second_data_ref);  */
+	      data_ref = make_temp_ssa_name (vectype, NULL, "vect_perm_odd");
+	      perm_stmt = gimple_build_assign_with_ops (VEC_PERM_EXPR, data_ref,
+							first_vect, second_vect,
+							perm_mask_odd);
+	      vect_finish_stmt_generation (stmt, perm_stmt, gsi);
+	      (*result_chain)[j/2+length/2] = data_ref;
+	    }
+	  memcpy (dr_chain.address (), result_chain->address (),
+		  length * sizeof (tree));
+	}
+    }
+  /* length is not a power of 2.  */
+  else
+    {
+      unsigned int k;
+
+      /* Currently only length 3 is supported, as the most frequent case.  */
+      gcc_assert (length == 3);
 
-	  /* data_ref = permute_even (first_data_ref, second_data_ref);  */
-	  data_ref = make_temp_ssa_name (vectype, NULL, "vect_perm_even");
+      for (k = 0; k < 3; k++)
+	{
+	  for (i = 0; i < nelt; i++)
+	    if (3 * i + k < 2 * nelt)
+	      sel[i] = 3 * i + k;
+	    else
+	      sel[i] = 0;
+	  perm3_mask_low = vect_gen_perm_mask (vectype, sel);
+	  gcc_assert (perm3_mask_low != NULL);
+
+	  for (i = 0, j = 0; i < nelt; i++)
+	    if (3 * i + k < 2 * nelt)
+	      sel[i] = i;
+	    else
+	      sel[i] = nelt + ((nelt + k) % 3) + 3 * (j++);
+
+	  perm3_mask_high = vect_gen_perm_mask (vectype, sel);
+	  gcc_assert (perm3_mask_high != NULL);
+
+	  first_vect = dr_chain[0];
+	  second_vect = dr_chain[1];
+
+	  /* Create interleaving stmt (low part of):
+	     low = VEC_PERM_EXPR <first_vect, second_vect, {k, 3 + k, 6 + k,
+							    ...}>  */
+	  data_ref = make_temp_ssa_name (vectype, NULL, "vect_shuffle3_low");
 	  perm_stmt = gimple_build_assign_with_ops (VEC_PERM_EXPR, data_ref,
 						    first_vect, second_vect,
-						    perm_mask_even);
+						    perm3_mask_low);
 	  vect_finish_stmt_generation (stmt, perm_stmt, gsi);
-	  (*result_chain)[j/2] = data_ref;
 
-	  /* data_ref = permute_odd (first_data_ref, second_data_ref);  */
-	  data_ref = make_temp_ssa_name (vectype, NULL, "vect_perm_odd");
+	  /* Create interleaving stmt (high part of):
+	     high = VEC_PERM_EXPR <low, dr_chain[2],
+				   {0, 1, ..., nelt + (nelt + k) % 3, ...}>  */
+	  first_vect = data_ref;
+	  second_vect = dr_chain[2];
+	  data_ref = make_temp_ssa_name (vectype, NULL, "vect_shuffle3_high");
 	  perm_stmt = gimple_build_assign_with_ops (VEC_PERM_EXPR, data_ref,
 						    first_vect, second_vect,
-						    perm_mask_odd);
+						    perm3_mask_high);
 	  vect_finish_stmt_generation (stmt, perm_stmt, gsi);
-	  (*result_chain)[j/2+length/2] = data_ref;
+	  (*result_chain)[k] = data_ref;
 	}
-      memcpy (dr_chain.address (), result_chain->address (),
-	      length * sizeof (tree));
     }
 }
 
-
 /* Function vect_transform_grouped_load.
 
    Given a chain of input interleaved data-refs (in DR_CHAIN), build statements
diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
index 2a2364d..f8f3b83 100644
--- a/gcc/tree-vect-stmts.c
+++ b/gcc/tree-vect-stmts.c
@@ -974,10 +974,12 @@ vect_model_store_cost (stmt_vec_info stmt_info, int ncopies,
      include the cost of the permutes.  */
   if (!store_lanes_p && group_size > 1)
     {
-      /* Uses a high and low interleave operation for each needed permute.  */
-      
-      int nstmts = ncopies * exact_log2 (group_size) * group_size;
-      inside_cost = record_stmt_cost (body_cost_vec, nstmts, vec_perm,
+      /* Uses high and low interleave or shuffle operations for each
+	 needed permute.  */
+      int nstmts = ncopies * ceil_log2 (group_size) * group_size;
+      enum vect_cost_for_stmt perm =
+	exact_log2 (group_size) != -1 ? vec_perm : vec_perm_shuffle;
+      inside_cost = record_stmt_cost (body_cost_vec, nstmts, perm,
 				      stmt_info, 0, vect_body);
 
       if (dump_enabled_p ())
@@ -1091,10 +1093,13 @@ vect_model_load_cost (stmt_vec_info stmt_info, int ncopies,
      include the cost of the permutes.  */
   if (!load_lanes_p && group_size > 1)
     {
-      /* Uses an even and odd extract operations for each needed permute.  */
-      int nstmts = ncopies * exact_log2 (group_size) * group_size;
-      inside_cost += record_stmt_cost (body_cost_vec, nstmts, vec_perm,
-				       stmt_info, 0, vect_body);
+      /* Uses even and odd extract operations or shuffle operations
+	 for each needed permute.  */
+      int nstmts = ncopies * ceil_log2 (group_size) * group_size;
+      enum vect_cost_for_stmt perm =
+	exact_log2 (group_size) != -1 ? vec_perm : vec_perm_shuffle;
+      inside_cost = record_stmt_cost (body_cost_vec, nstmts, perm,
+				      stmt_info, 0, vect_body);
 
       if (dump_enabled_p ())
         dump_printf_loc (MSG_NOTE, vect_location,

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH, PR52252] Vectorization for load/store groups of size 3.
  2014-02-11 14:03   ` Evgeny Stupachenko
@ 2014-02-11 15:20     ` Richard Biener
  2014-03-06 14:42       ` Evgeny Stupachenko
  0 siblings, 1 reply; 26+ messages in thread
From: Richard Biener @ 2014-02-11 15:20 UTC (permalink / raw)
  To: Evgeny Stupachenko; +Cc: gcc-patches, Jakub Jelinek, ubizjak

On Tue, 11 Feb 2014, Evgeny Stupachenko wrote:

> Missed patch attached in plain-text.
> 
> I have copyright assignment on file with the FSF covering work on GCC.
> 
> Load/store groups of length 3 are the most frequent non-power-of-2
> case. They are used in RGB image processing (like the test case in
> PR52252). We can certainly extend the patch to length 5 and more.
> However, this could affect performance on some other architectures
> and requires more extensive testing, so length 3 is just a first
> step. The algorithm in the patch could be modified for the general
> case in several steps.
> 
> I understand that the patch should wait for stage 1; however, since
> it's ready we can discuss it right now and make some changes (like
> supporting a general group size).
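
For illustration only, a loop of roughly the following shape has a load
group and a store group of size 3, which is the kind of access pattern
discussed above. This is a hypothetical sketch (the function name and
the operation are assumptions), not the PR52252 test case, and whether
it is actually vectorized depends on the target and options.

```c
#include <stddef.h>

/* Hypothetical example: each iteration reads three adjacent bytes and
   writes three adjacent bytes, so both the load group and the store
   group have size 3 (as with packed RGB pixels).  */
void
swap_red_blue (const unsigned char *in, unsigned char *out, size_t npixels)
{
  for (size_t i = 0; i < npixels; i++)
    {
      out[3 * i + 0] = in[3 * i + 2];
      out[3 * i + 1] = in[3 * i + 1];
      out[3 * i + 2] = in[3 * i + 0];
    }
}
```

Calling swap_red_blue (in, out, n) on n packed 3-byte pixels swaps the
first and third channel of every pixel.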

Other than that I'd like to see a vectorizer hook querying the cost of a
vec_perm_const expansion instead of adding vec_perm_shuffle
(thus requires the constant shuffle mask to be passed as well
as the vector type).  That's more useful for other uses that
would require (arbitrary) shuffles.
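
For context, the number of permute statements that the patch as posted
hands to the cost model is ncopies * ceil_log2 (group_size) *
group_size. A minimal sketch of that arithmetic follows; the names
ceil_log2_sketch and permute_stmt_count are illustrative stand-ins and
not GCC functions.

```c
#include <assert.h>

/* Stand-in for a ceiling log2: smallest n with 2^n >= x.  */
static int
ceil_log2_sketch (unsigned x)
{
  int n = 0;

  assert (x > 0);
  while ((1u << n) < x)
    n++;
  return n;
}

/* Permute statements charged for one grouped access, following the
   formula used in the posted patch.  */
int
permute_stmt_count (int ncopies, unsigned group_size)
{
  return ncopies * ceil_log2_sketch (group_size) * (int) group_size;
}
```

For group_size 3 this gives 6 permutes per copy, which matches the two
VEC_PERM_EXPRs emitted for each of the three result vectors; for
group_size 4 it gives 8.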

Didn't look at the rest of the patch yet - queued in my review
pipeline.

Thanks,
Richard.

> Thanks,
> Evgeny
> 
> On Tue, Feb 11, 2014 at 5:00 PM, Richard Biener <rguenther@suse.de> wrote:
> >
> > On Tue, 11 Feb 2014, Evgeny Stupachenko wrote:
> >
> > > Hi,
> > >
> > > The patch gives an expected 3 times gain for the test case in the PR52252
> > > (and even 6 times for AVX2).
> > > It passes make check and bootstrap on x86.
> > > spec2000/spec2006 got no regressions/gains on x86.
> > >
> > > Is this patch ok?
> >
> > I've worked on generalizing the permutation support in the light
> > of the availability of the generic shuffle support in the IL
> > but hit some road-blocks in the way code-generation works for
> > group loads with permutations (I don't remember if I posted all patches).
> >
> > This patch seems to be to a slightly different place but it again
> > special-cases a specific permutation.  Why's that?  Why can't we
> > support groups of size 7 for example?  So - can this be generalized
> > to support arbitrary non-power-of-two load/store groups?
> >
> > Other than that the patch has to wait for stage1 to open again,
> > of course.  And it misses a testcase.
> >
> > Btw, do you have a copyright assignment on file with the FSF covering
> > work on GCC?
> >
> > Thanks,
> > Richard.
> >
> > > ChangeLog:
> > >
> > > 2014-02-11  Evgeny Stupachenko  <evstupac@gmail.com>
> > >
> > >         * target.h (vect_cost_for_stmt): Defining new cost vec_perm_shuffle.
> > >         * tree-vect-data-refs.c (vect_grouped_store_supported): New
> > >         check for stores group of length 3.
> > >         (vect_permute_store_chain): New permutations for stores group of
> > >         length 3.
> > >         (vect_grouped_load_supported): New check for loads group of length
> > > 3.
> > >         (vect_permute_load_chain): New permutations for loads group of
> > > length 3.
> > >         * tree-vect-stmts.c (vect_model_store_cost): New cost
> > > vec_perm_shuffle
> > >         for the new permutations.
> > >         (vect_model_load_cost): Ditto.
> > >         * config/aarch64/aarch64.c (builtin_vectorization_cost): Adding
> > >         vec_perm_shuffle cost as equivalent of vec_perm cost.
> > >         * config/arm/arm.c: Ditto.
> > >         * config/rs6000/rs6000.c: Ditto.
> > >         * config/spu/spu.c: Ditto.
> > >         * config/i386/x86-tune.def (TARGET_SLOW_PHUFFB): Target for slow
> > > byte
> > >         shuffle on some x86 architectures.
> > >         * config/i386/i386.h (processor_costs): Defining pshuffb cost.
> > >         * config/i386/i386.c (processor_costs): Adding pshuffb cost.
> > >         (ix86_builtin_vectorization_cost): Adding cost for the new
> > > permutations.
> > >         Fixing cost for other permutations.
> > >         (expand_vec_perm_even_odd_1): Avoid byte shuffles when they are
> > >         slow (TARGET_SLOW_PHUFFB).
> > >         (ix86_add_stmt_cost): Adding cost when STMT is WIDEN_MULTIPLY.
> > >         Adding new shuffle cost only when byte shuffle is expected.
> > >         Fixing cost model for Silvermont.
> > >
> > > Thanks,
> > > Evgeny
> > >
> >
> > --
> > Richard Biener <rguenther@suse.de>
> > SUSE / SUSE Labs
> > SUSE LINUX Products GmbH - Nuernberg - AG Nuernberg - HRB 16746
> > GF: Jeff Hawn, Jennifer Guild, Felix Imend"orffer
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE / SUSE Labs
SUSE LINUX Products GmbH - Nuernberg - AG Nuernberg - HRB 16746
GF: Jeff Hawn, Jennifer Guild, Felix Imend"orffer

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH, PR52252] Vectorization for load/store groups of size 3.
  2014-02-11 15:20     ` Richard Biener
@ 2014-03-06 14:42       ` Evgeny Stupachenko
  2014-03-06 14:44         ` Evgeny Stupachenko
  0 siblings, 1 reply; 26+ messages in thread
From: Evgeny Stupachenko @ 2014-03-06 14:42 UTC (permalink / raw)
  To: Richard Biener; +Cc: GCC Patches, Jakub Jelinek, Uros Bizjak

I've separated the patch into two: cost model tuning and load/store
group parallelism.
SLM tuning was partially introduced in the patch:
http://gcc.gnu.org/ml/gcc-patches/2014-03/msg00226.html
The patch introducing vectorization for load/store groups of size 3 is
attached.
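
The two-step shuffle used for loads of group size 3 can be checked
outside the compiler. The sketch below is only a model: it fills the
selector arrays with the same index formulas as the patch, but
model_vec_perm and check_load3 are hypothetical helpers that mimic
VEC_PERM_EXPR on plain int arrays, not GCC code.

```c
#include <assert.h>

#define MAX_NELT 64

/* Model of VEC_PERM_EXPR <a, b, sel>: an index x < nelt selects a[x],
   any other index selects b[x - nelt].  */
static void
model_vec_perm (const int *a, const int *b, const unsigned *sel,
                unsigned nelt, int *out)
{
  for (unsigned i = 0; i < nelt; i++)
    out[i] = sel[i] < nelt ? a[sel[i]] : b[sel[i] - nelt];
}

/* Return 1 if the two shuffles de-interleave 3 * nelt elements, i.e.
   result k holds elements k, k + 3, k + 6, ... of the input.  */
int
check_load3 (unsigned nelt)
{
  int mem[3 * MAX_NELT], low[MAX_NELT], high[MAX_NELT];
  unsigned sel[MAX_NELT];

  assert (nelt > 0 && nelt <= MAX_NELT);
  for (unsigned i = 0; i < 3 * nelt; i++)
    mem[i] = (int) i;

  /* The three loaded vectors of the chain.  */
  const int *v0 = mem, *v1 = mem + nelt, *v2 = mem + 2 * nelt;

  for (unsigned k = 0; k < 3; k++)
    {
      unsigned i, j;

      /* Low mask: take every third element from v0 and v1; slots that
         need data from v2 get a placeholder index 0.  */
      for (i = 0; i < nelt; i++)
        sel[i] = 3 * i + k < 2 * nelt ? 3 * i + k : 0;
      model_vec_perm (v0, v1, sel, nelt, low);

      /* High mask: keep the slots already filled and take the
         remaining ones from v2.  */
      for (i = 0, j = 0; i < nelt; i++)
        sel[i] = 3 * i + k < 2 * nelt ? i : nelt + (nelt + k) % 3 + 3 * j++;
      model_vec_perm (low, v2, sel, nelt, high);

      for (i = 0; i < nelt; i++)
        if (high[i] != (int) (3 * i + k))
          return 0;
    }
  return 1;
}
```

check_load3 (8) models 8-element vectors; the placeholder slots written
by the first shuffle are all overwritten by the second one.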

Is it ok for stage1?

ChangeLog:

2014-03-06  Evgeny Stupachenko  <evstupac@gmail.com>

       * tree-vect-data-refs.c (vect_grouped_store_supported): New
       check for stores group of length 3.
       (vect_permute_store_chain): New permutations for stores group of
       length 3.
       (vect_grouped_load_supported): New check for loads group of length 3.
       (vect_permute_load_chain): New permutations for loads group of length 3.
       * tree-vect-stmts.c (vect_model_store_cost): Change cost
       of vec_perm_shuffle for the new permutations.
       (vect_model_load_cost): Ditto.
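
The store side can be modelled the same way. The sketch below fills the
selector arrays with the index formulas from vect_permute_store_chain in
the patch and checks that two shuffles per output vector interleave
three inputs; model_vec_perm and check_store3 are hypothetical helpers
working on plain int arrays, not GCC code.

```c
#include <assert.h>

#define MAX_NELT 64

/* Model of VEC_PERM_EXPR <a, b, sel>: an index x < nelt selects a[x],
   any other index selects b[x - nelt].  */
static void
model_vec_perm (const int *a, const int *b, const unsigned *sel,
                unsigned nelt, int *out)
{
  for (unsigned i = 0; i < nelt; i++)
    out[i] = sel[i] < nelt ? a[sel[i]] : b[sel[i] - nelt];
}

/* Return 1 if the three output vectors, concatenated, hold
   a[0] b[0] c[0] a[1] b[1] c[1] ...  */
int
check_store3 (unsigned nelt)
{
  int a[MAX_NELT], b[MAX_NELT], c[MAX_NELT];
  int low[MAX_NELT], out[MAX_NELT];
  unsigned sel[MAX_NELT];
  unsigned i, j, j0 = 0, j1 = 0, j2 = 0;

  assert (nelt > 0 && nelt <= MAX_NELT);
  /* Choose values so that interleaved position p must hold value p.  */
  for (i = 0; i < nelt; i++)
    {
      a[i] = (int) (3 * i);
      b[i] = (int) (3 * i + 1);
      c[i] = (int) (3 * i + 2);
    }

  for (j = 0; j < 3; j++)
    {
      unsigned nelt0 = ((3 - j) * nelt) % 3;
      unsigned nelt1 = ((3 - j) * nelt + 1) % 3;
      unsigned nelt2 = ((3 - j) * nelt + 2) % 3;

      /* Low mask: place elements of a and b; slots for c get a
         placeholder index 0.  */
      for (i = 0; i < nelt; i++)
        {
          if (3 * i + nelt0 < nelt)
            sel[3 * i + nelt0] = j0++;
          if (3 * i + nelt1 < nelt)
            sel[3 * i + nelt1] = nelt + j1++;
          if (3 * i + nelt2 < nelt)
            sel[3 * i + nelt2] = 0;
        }
      model_vec_perm (a, b, sel, nelt, low);

      /* High mask: keep the a and b slots, fill the rest from c.  */
      for (i = 0; i < nelt; i++)
        {
          if (3 * i + nelt0 < nelt)
            sel[3 * i + nelt0] = 3 * i + nelt0;
          if (3 * i + nelt1 < nelt)
            sel[3 * i + nelt1] = 3 * i + nelt1;
          if (3 * i + nelt2 < nelt)
            sel[3 * i + nelt2] = nelt + j2++;
        }
      model_vec_perm (low, c, sel, nelt, out);

      for (i = 0; i < nelt; i++)
        if (out[i] != (int) (j * nelt + i))
          return 0;
    }
  return 1;
}
```

Note that the counters j0, j1 and j2 run across all three output
vectors, which is why they are initialized once outside the loop over j.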



On Tue, Feb 11, 2014 at 7:19 PM, Richard Biener <rguenther@suse.de> wrote:
> On Tue, 11 Feb 2014, Evgeny Stupachenko wrote:
>
>> Missed patch attached in plain-text.
>>
>> I have copyright assignment on file with the FSF covering work on GCC.
>>
>> Load/store groups of length 3 are the most frequent non-power-of-2
>> case. They are used in RGB image processing (like the test case in
>> PR52252). We can certainly extend the patch to length 5 and more.
>> However, this could affect performance on some other architectures
>> and requires more extensive testing, so length 3 is just a first
>> step. The algorithm in the patch could be modified for the general
>> case in several steps.
>>
>> I understand that the patch should wait for stage 1; however, since
>> it's ready we can discuss it right now and make some changes (like
>> supporting a general group size).
>
> Other than that I'd like to see a vectorizer hook querying the cost of a
> vec_perm_const expansion instead of adding vec_perm_shuffle
> (thus requires the constant shuffle mask to be passed as well
> as the vector type).  That's more useful for other uses that
> would require (arbitrary) shuffles.
>
> Didn't look at the rest of the patch yet - queued in my review
> pipeline.
>
> Thanks,
> Richard.
>
>> Thanks,
>> Evgeny
>>
>> On Tue, Feb 11, 2014 at 5:00 PM, Richard Biener <rguenther@suse.de> wrote:
>> >
>> > On Tue, 11 Feb 2014, Evgeny Stupachenko wrote:
>> >
>> > > Hi,
>> > >
>> > > The patch gives an expected 3 times gain for the test case in the PR52252
>> > > (and even 6 times for AVX2).
>> > > It passes make check and bootstrap on x86.
>> > > spec2000/spec2006 got no regressions/gains on x86.
>> > >
>> > > Is this patch ok?
>> >
>> > I've worked on generalizing the permutation support in the light
>> > of the availability of the generic shuffle support in the IL
>> > but hit some road-blocks in the way code-generation works for
>> > group loads with permutations (I don't remember if I posted all patches).
>> >
>> > This patch seems to be to a slightly different place but it again
>> > special-cases a specific permutation.  Why's that?  Why can't we
>> > support groups of size 7 for example?  So - can this be generalized
>> > to support arbitrary non-power-of-two load/store groups?
>> >
>> > Other than that the patch has to wait for stage1 to open again,
>> > of course.  And it misses a testcase.
>> >
>> > Btw, do you have a copyright assignment on file with the FSF covering
>> > work on GCC?
>> >
>> > Thanks,
>> > Richard.
>> >
>> > > ChangeLog:
>> > >
>> > > 2014-02-11  Evgeny Stupachenko  <evstupac@gmail.com>
>> > >
>> > >         * target.h (vect_cost_for_stmt): Defining new cost vec_perm_shuffle.
>> > >         * tree-vect-data-refs.c (vect_grouped_store_supported): New
>> > >         check for stores group of length 3.
>> > >         (vect_permute_store_chain): New permutations for stores group of
>> > >         length 3.
>> > >         (vect_grouped_load_supported): New check for loads group of length
>> > > 3.
>> > >         (vect_permute_load_chain): New permutations for loads group of
>> > > length 3.
>> > >         * tree-vect-stmts.c (vect_model_store_cost): New cost
>> > > vec_perm_shuffle
>> > >         for the new permutations.
>> > >         (vect_model_load_cost): Ditto.
>> > >         * config/aarch64/aarch64.c (builtin_vectorization_cost): Adding
>> > >         vec_perm_shuffle cost as equivalent of vec_perm cost.
>> > >         * config/arm/arm.c: Ditto.
>> > >         * config/rs6000/rs6000.c: Ditto.
>> > >         * config/spu/spu.c: Ditto.
>> > >         * config/i386/x86-tune.def (TARGET_SLOW_PHUFFB): Target for slow
>> > > byte
>> > >         shuffle on some x86 architectures.
>> > >         * config/i386/i386.h (processor_costs): Defining pshuffb cost.
>> > >         * config/i386/i386.c (processor_costs): Adding pshuffb cost.
>> > >         (ix86_builtin_vectorization_cost): Adding cost for the new
>> > > permutations.
>> > >         Fixing cost for other permutations.
>> > >         (expand_vec_perm_even_odd_1): Avoid byte shuffles when they are
>> > >         slow (TARGET_SLOW_PHUFFB).
>> > >         (ix86_add_stmt_cost): Adding cost when STMT is WIDEN_MULTIPLY.
>> > >         Adding new shuffle cost only when byte shuffle is expected.
>> > >         Fixing cost model for Silvermont.
>> > >
>> > > Thanks,
>> > > Evgeny
>> > >
>> >
>> > --
>> > Richard Biener <rguenther@suse.de>
>> > SUSE / SUSE Labs
>> > SUSE LINUX Products GmbH - Nuernberg - AG Nuernberg - HRB 16746
>> > GF: Jeff Hawn, Jennifer Guild, Felix Imend"orffer
>>
>
> --
> Richard Biener <rguenther@suse.de>
> SUSE / SUSE Labs
> SUSE LINUX Products GmbH - Nuernberg - AG Nuernberg - HRB 16746
> GF: Jeff Hawn, Jennifer Guild, Felix Imend"orffer

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH, PR52252] Vectorization for load/store groups of size 3.
  2014-03-06 14:42       ` Evgeny Stupachenko
@ 2014-03-06 14:44         ` Evgeny Stupachenko
  2014-04-18 10:05           ` Evgeny Stupachenko
  0 siblings, 1 reply; 26+ messages in thread
From: Evgeny Stupachenko @ 2014-03-06 14:44 UTC (permalink / raw)
  To: Richard Biener; +Cc: GCC Patches, Jakub Jelinek, Uros Bizjak

[-- Attachment #1: Type: text/plain, Size: 5895 bytes --]

Missed attachment.

On Thu, Mar 6, 2014 at 6:42 PM, Evgeny Stupachenko <evstupac@gmail.com> wrote:
> I've separated the patch into two: cost model tuning and load/store
> group parallelism.
> SLM tuning was partially introduced in the patch:
> http://gcc.gnu.org/ml/gcc-patches/2014-03/msg00226.html
> The patch introducing vectorization for load/store groups of size 3 is
> attached.
>
> Is it ok for stage1?
>
> ChangeLog:
>
> 2014-03-06  Evgeny Stupachenko  <evstupac@gmail.com>
>
>        * tree-vect-data-refs.c (vect_grouped_store_supported): New
>        check for stores group of length 3.
>        (vect_permute_store_chain): New permutations for stores group of
>        length 3.
>        (vect_grouped_load_supported): New check for loads group of length 3.
>        (vect_permute_load_chain): New permutations for loads group of length 3.
>        * tree-vect-stmts.c (vect_model_store_cost): Change cost
>        of vec_perm_shuffle for the new permutations.
>        (vect_model_load_cost): Ditto.
>
>
>
> On Tue, Feb 11, 2014 at 7:19 PM, Richard Biener <rguenther@suse.de> wrote:
>> On Tue, 11 Feb 2014, Evgeny Stupachenko wrote:
>>
>>> Missed patch attached in plain-text.
>>>
>>> I have copyright assignment on file with the FSF covering work on GCC.
>>>
>>> Load/store groups of length 3 are the most frequent non-power-of-2
>>> case. They are used in RGB image processing (like the test case in
>>> PR52252). We can certainly extend the patch to length 5 and more.
>>> However, this could affect performance on some other architectures
>>> and requires more extensive testing, so length 3 is just a first
>>> step. The algorithm in the patch could be modified for the general
>>> case in several steps.
>>>
>>> I understand that the patch should wait for stage 1; however, since
>>> it's ready we can discuss it right now and make some changes (like
>>> supporting a general group size).
>>
>> Other than that I'd like to see a vectorizer hook querying the cost of a
>> vec_perm_const expansion instead of adding vec_perm_shuffle
>> (thus requires the constant shuffle mask to be passed as well
>> as the vector type).  That's more useful for other uses that
>> would require (arbitrary) shuffles.
>>
>> Didn't look at the rest of the patch yet - queued in my review
>> pipeline.
>>
>> Thanks,
>> Richard.
>>
>>> Thanks,
>>> Evgeny
>>>
>>> On Tue, Feb 11, 2014 at 5:00 PM, Richard Biener <rguenther@suse.de> wrote:
>>> >
>>> > On Tue, 11 Feb 2014, Evgeny Stupachenko wrote:
>>> >
>>> > > Hi,
>>> > >
>>> > > The patch gives an expected 3 times gain for the test case in the PR52252
>>> > > (and even 6 times for AVX2).
>>> > > It passes make check and bootstrap on x86.
>>> > > spec2000/spec2006 got no regressions/gains on x86.
>>> > >
>>> > > Is this patch ok?
>>> >
>>> > I've worked on generalizing the permutation support in the light
>>> > of the availability of the generic shuffle support in the IL
>>> > but hit some road-blocks in the way code-generation works for
>>> > group loads with permutations (I don't remember if I posted all patches).
>>> >
>>> > This patch seems to be to a slightly different place but it again
>>> > special-cases a specific permutation.  Why's that?  Why can't we
>>> > support groups of size 7 for example?  So - can this be generalized
>>> > to support arbitrary non-power-of-two load/store groups?
>>> >
>>> > Other than that the patch has to wait for stage1 to open again,
>>> > of course.  And it misses a testcase.
>>> >
>>> > Btw, do you have a copyright assignment on file with the FSF covering
>>> > work on GCC?
>>> >
>>> > Thanks,
>>> > Richard.
>>> >
>>> > > ChangeLog:
>>> > >
>>> > > 2014-02-11  Evgeny Stupachenko  <evstupac@gmail.com>
>>> > >
>>> > >         * target.h (vect_cost_for_stmt): Defining new cost vec_perm_shuffle.
>>> > >         * tree-vect-data-refs.c (vect_grouped_store_supported): New
>>> > >         check for stores group of length 3.
>>> > >         (vect_permute_store_chain): New permutations for stores group of
>>> > >         length 3.
>>> > >         (vect_grouped_load_supported): New check for loads group of length
>>> > > 3.
>>> > >         (vect_permute_load_chain): New permutations for loads group of
>>> > > length 3.
>>> > >         * tree-vect-stmts.c (vect_model_store_cost): New cost
>>> > > vec_perm_shuffle
>>> > >         for the new permutations.
>>> > >         (vect_model_load_cost): Ditto.
>>> > >         * config/aarch64/aarch64.c (builtin_vectorization_cost): Adding
>>> > >         vec_perm_shuffle cost as equivalent of vec_perm cost.
>>> > >         * config/arm/arm.c: Ditto.
>>> > >         * config/rs6000/rs6000.c: Ditto.
>>> > >         * config/spu/spu.c: Ditto.
>>> > >         * config/i386/x86-tune.def (TARGET_SLOW_PHUFFB): Target for slow
>>> > > byte
>>> > >         shuffle on some x86 architectures.
>>> > >         * config/i386/i386.h (processor_costs): Defining pshuffb cost.
>>> > >         * config/i386/i386.c (processor_costs): Adding pshuffb cost.
>>> > >         (ix86_builtin_vectorization_cost): Adding cost for the new
>>> > > permutations.
>>> > >         Fixing cost for other permutations.
>>> > >         (expand_vec_perm_even_odd_1): Avoid byte shuffles when they are
>>> > >         slow (TARGET_SLOW_PHUFFB).
>>> > >         (ix86_add_stmt_cost): Adding cost when STMT is WIDEN_MULTIPLY.
>>> > >         Adding new shuffle cost only when byte shuffle is expected.
>>> > >         Fixing cost model for Silvermont.
>>> > >
>>> > > Thanks,
>>> > > Evgeny
>>> > >
>>> >
>>> > --
>>> > Richard Biener <rguenther@suse.de>
>>> > SUSE / SUSE Labs
>>> > SUSE LINUX Products GmbH - Nuernberg - AG Nuernberg - HRB 16746
>>> > GF: Jeff Hawn, Jennifer Guild, Felix Imend"orffer
>>>
>>
>> --
>> Richard Biener <rguenther@suse.de>
>> SUSE / SUSE Labs
>> SUSE LINUX Products GmbH - Nuernberg - AG Nuernberg - HRB 16746
>> GF: Jeff Hawn, Jennifer Guild, Felix Imend"orffer

[-- Attachment #2: vect3.patch --]
[-- Type: application/octet-stream, Size: 18528 bytes --]

diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c
index fbc35a3..337c69b 100644
--- a/gcc/tree-vect-data-refs.c
+++ b/gcc/tree-vect-data-refs.c
@@ -4359,13 +4359,14 @@ vect_grouped_store_supported (tree vectype, unsigned HOST_WIDE_INT count)
 {
   enum machine_mode mode = TYPE_MODE (vectype);
 
-  /* vect_permute_store_chain requires the group size to be a power of two.  */
-  if (exact_log2 (count) == -1)
+  /* vect_permute_store_chain requires the group size to be equal to 3 or
+     be a power of two.  */
+  if (count != 3 && exact_log2 (count) == -1)
     {
       if (dump_enabled_p ())
 	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-                         "the size of the group of accesses"
-                         " is not a power of 2\n");
+			 "the size of the group of accesses"
+			 " is neither a power of 2 nor equal to 3\n");
       return false;
     }
 
@@ -4374,23 +4375,73 @@ vect_grouped_store_supported (tree vectype, unsigned HOST_WIDE_INT count)
     {
       unsigned int i, nelt = GET_MODE_NUNITS (mode);
       unsigned char *sel = XALLOCAVEC (unsigned char, nelt);
-      for (i = 0; i < nelt / 2; i++)
+
+      if (exact_log2 (count) != -1)
 	{
-	  sel[i * 2] = i;
-	  sel[i * 2 + 1] = i + nelt;
+	  for (i = 0; i < nelt / 2; i++)
+	    {
+	      sel[i * 2] = i;
+	      sel[i * 2 + 1] = i + nelt;
+	    }
+	  if (can_vec_perm_p (mode, false, sel))
+	    {
+	      for (i = 0; i < nelt; i++)
+		sel[i] += nelt / 2;
+	      if (can_vec_perm_p (mode, false, sel))
+		return true;
+	    }
 	}
-      if (can_vec_perm_p (mode, false, sel))
+      else if (count == 3)
 	{
-	  for (i = 0; i < nelt; i++)
-	    sel[i] += nelt / 2;
-	  if (can_vec_perm_p (mode, false, sel))
-	    return true;
-	}
+	  unsigned int j0 = 0, j1 = 0, j2 = 0;
+	  unsigned int i, j;
+
+	  for (j = 0; j < 3; j++)
+	    {
+	      int nelt0 = ((3 - j) * nelt) % 3;
+	      int nelt1 = ((3 - j) * nelt + 1) % 3;
+	      int nelt2 = ((3 - j) * nelt + 2) % 3;
+	      for (i = 0; i < nelt; i++)
+		{
+		  if (3 * i + nelt0 < nelt)
+		    sel[3 * i + nelt0] = j0++;
+		  if (3 * i + nelt1 < nelt)
+		    sel[3 * i + nelt1] = nelt + j1++;
+		  if (3 * i + nelt2 < nelt)
+		    sel[3 * i + nelt2] = 0;
+		}
+	      if (!can_vec_perm_p (mode, false, sel))
+		{
+		  if (dump_enabled_p ())
+		    dump_printf (MSG_MISSED_OPTIMIZATION,
+				 "permutation op not supported by target.\n");
+		  return false;
+		}
+
+	      for (i = 0; i < nelt; i++)
+		{
+		  if (3 * i + nelt0 < nelt)
+		    sel[3 * i + nelt0] = 3 * i + nelt0;
+		  if (3 * i + nelt1 < nelt)
+		    sel[3 * i + nelt1] = 3 * i + nelt1;
+		  if (3 * i + nelt2 < nelt)
+		    sel[3 * i + nelt2] = nelt + j2++;
+		}
+	      if (!can_vec_perm_p (mode, false, sel))
+		{
+		  if (dump_enabled_p ())
+		    dump_printf (MSG_MISSED_OPTIMIZATION,
+				 "permutation op not supported by target.\n");
+		  return false;
+		}
+	    }
+	  return true;
+	}
     }
 
   if (dump_enabled_p ())
     dump_printf (MSG_MISSED_OPTIMIZATION,
-                 "interleave op not supported by target.\n");
+		 "permutation op not supported by target.\n");
   return false;
 }
 
@@ -4410,9 +4461,9 @@ vect_store_lanes_supported (tree vectype, unsigned HOST_WIDE_INT count)
 /* Function vect_permute_store_chain.
 
    Given a chain of interleaved stores in DR_CHAIN of LENGTH that must be
-   a power of 2, generate interleave_high/low stmts to reorder the data
-   correctly for the stores.  Return the final references for stores in
-   RESULT_CHAIN.
+   a power of 2 or equal to 3, generate interleave_high/low stmts to reorder
+   the data correctly for the stores.  Return the final references for stores
+   in RESULT_CHAIN.
 
    E.g., LENGTH is 4 and the scalar type is short, i.e., VF is 8.
    The input is 4 vectors each containing 8 elements.  We assign a number to
@@ -4479,7 +4530,9 @@ vect_permute_store_chain (vec<tree> dr_chain,
   gimple perm_stmt;
   tree vectype = STMT_VINFO_VECTYPE (vinfo_for_stmt (stmt));
   tree perm_mask_low, perm_mask_high;
-  unsigned int i, n;
+  tree data_ref;
+  tree perm3_mask_low, perm3_mask_high;
+  unsigned int i, n, log_length = exact_log2 (length);
   unsigned int j, nelt = TYPE_VECTOR_SUBPARTS (vectype);
   unsigned char *sel = XALLOCAVEC (unsigned char, nelt);
 
@@ -4487,47 +4540,117 @@ vect_permute_store_chain (vec<tree> dr_chain,
   memcpy (result_chain->address (), dr_chain.address (),
 	  length * sizeof (tree));
 
-  for (i = 0, n = nelt / 2; i < n; i++)
+  if (log_length != -1)
     {
-      sel[i * 2] = i;
-      sel[i * 2 + 1] = i + nelt;
-    }
-  perm_mask_high = vect_gen_perm_mask (vectype, sel);
-  gcc_assert (perm_mask_high != NULL);
+      for (i = 0, n = nelt / 2; i < n; i++)
+	{
+	  sel[i * 2] = i;
+	  sel[i * 2 + 1] = i + nelt;
+	}
+	perm_mask_high = vect_gen_perm_mask (vectype, sel);
+	gcc_assert (perm_mask_high != NULL);
 
-  for (i = 0; i < nelt; i++)
-    sel[i] += nelt / 2;
-  perm_mask_low = vect_gen_perm_mask (vectype, sel);
-  gcc_assert (perm_mask_low != NULL);
+	for (i = 0; i < nelt; i++)
+	  sel[i] += nelt / 2;
+	perm_mask_low = vect_gen_perm_mask (vectype, sel);
+	gcc_assert (perm_mask_low != NULL);
 
-  for (i = 0, n = exact_log2 (length); i < n; i++)
+	for (i = 0, n = log_length; i < n; i++)
+	  {
+	    for (j = 0; j < length/2; j++)
+	      {
+		vect1 = dr_chain[j];
+		vect2 = dr_chain[j+length/2];
+
+		/* Create interleaving stmt:
+		   high = VEC_PERM_EXPR <vect1, vect2, {0, nelt, 1, nelt+1,
+							...}>  */
+		high = make_temp_ssa_name (vectype, NULL, "vect_inter_high");
+		perm_stmt
+		  = gimple_build_assign_with_ops (VEC_PERM_EXPR, high,
+						  vect1, vect2, perm_mask_high);
+		vect_finish_stmt_generation (stmt, perm_stmt, gsi);
+		(*result_chain)[2*j] = high;
+
+		/* Create interleaving stmt:
+		   low = VEC_PERM_EXPR <vect1, vect2,
+					{nelt/2, nelt*3/2, nelt/2+1, nelt*3/2+1,
+					 ...}>  */
+		low = make_temp_ssa_name (vectype, NULL, "vect_inter_low");
+		perm_stmt
+		  = gimple_build_assign_with_ops (VEC_PERM_EXPR, low,
+						  vect1, vect2, perm_mask_low);
+		vect_finish_stmt_generation (stmt, perm_stmt, gsi);
+		(*result_chain)[2*j+1] = low;
+	      }
+	    memcpy (dr_chain.address (), result_chain->address (),
+		    length * sizeof (tree));
+	  }
+    }
+  /* length is not a power of 2.  */
+  else
     {
-      for (j = 0; j < length/2; j++)
-	{
-	  vect1 = dr_chain[j];
-	  vect2 = dr_chain[j+length/2];
+      unsigned int j0 = 0, j1 = 0, j2 = 0;
+
+      /* currently only length 3 is supported as most frequent case.  */
+      gcc_assert (length == 3);
+
+      for (j = 0; j < 3; j++)
+        {
+	  int nelt0 = ((3 - j) * nelt) % 3;
+	  int nelt1 = ((3 - j) * nelt + 1) % 3;
+	  int nelt2 = ((3 - j) * nelt + 2) % 3;
+
+	  for (i = 0; i < nelt; i++)
+	    {
+	      if (3 * i + nelt0 < nelt)
+		sel[3 * i + nelt0] = j0++;
+	      if (3 * i + nelt1 < nelt)
+		sel[3 * i + nelt1] = nelt + j1++;
+	      if (3 * i + nelt2 < nelt)
+		sel[3 * i + nelt2] = 0;
+	    }
+	  perm3_mask_low = vect_gen_perm_mask (vectype, sel);
+	  gcc_assert (perm3_mask_low != NULL);
+
+	  for (i = 0; i < nelt; i++)
+	    {
+	      if (3 * i + nelt0 < nelt)
+		sel[3 * i + nelt0] = 3 * i + nelt0;
+	      if (3 * i + nelt1 < nelt)
+		sel[3 * i + nelt1] = 3 * i + nelt1;
+	      if (3 * i + nelt2 < nelt)
+		sel[3 * i + nelt2] = nelt + j2++;
+	    }
+	  perm3_mask_high = vect_gen_perm_mask (vectype, sel);
+	  gcc_assert (perm3_mask_high != NULL);
+
+	  vect1 = dr_chain[0];
+	  vect2 = dr_chain[1];
 
 	  /* Create interleaving stmt:
-	     high = VEC_PERM_EXPR <vect1, vect2, {0, nelt, 1, nelt+1, ...}>  */
-	  high = make_temp_ssa_name (vectype, NULL, "vect_inter_high");
-	  perm_stmt
-	    = gimple_build_assign_with_ops (VEC_PERM_EXPR, high,
-					    vect1, vect2, perm_mask_high);
+	     low = VEC_PERM_EXPR <vect1, vect2,
+				  {j, nelt, *, j + 1, nelt + j + 1, *,
+				   j + 2, nelt + j + 2, *, ...}>  */
+	  data_ref = make_temp_ssa_name (vectype, NULL, "vect_shuffle3_low");
+	  perm_stmt = gimple_build_assign_with_ops (VEC_PERM_EXPR, data_ref,
+						    vect1, vect2,
+						    perm3_mask_low);
 	  vect_finish_stmt_generation (stmt, perm_stmt, gsi);
-	  (*result_chain)[2*j] = high;
 
+	  vect1 = data_ref;
+	  vect2 = dr_chain[2];
 	  /* Create interleaving stmt:
-	     low = VEC_PERM_EXPR <vect1, vect2, {nelt/2, nelt*3/2, nelt/2+1,
-						 nelt*3/2+1, ...}>  */
-	  low = make_temp_ssa_name (vectype, NULL, "vect_inter_low");
-	  perm_stmt
-	    = gimple_build_assign_with_ops (VEC_PERM_EXPR, low,
-					    vect1, vect2, perm_mask_low);
+	     high = VEC_PERM_EXPR <vect1, vect2,
+				  {0, 1, nelt + j, 3, 4, nelt + j + 1,
+				   6, 7, nelt + j + 2, ...}>  */
+	  data_ref = make_temp_ssa_name (vectype, NULL, "vect_shuffle3_high");
+	  perm_stmt = gimple_build_assign_with_ops (VEC_PERM_EXPR, data_ref,
+						    vect1, vect2,
+						    perm3_mask_high);
 	  vect_finish_stmt_generation (stmt, perm_stmt, gsi);
-	  (*result_chain)[2*j+1] = low;
+	  (*result_chain)[j] = data_ref;
 	}
-      memcpy (dr_chain.address (), result_chain->address (),
-	      length * sizeof (tree));
     }
 }
 
@@ -4806,36 +4929,74 @@ vect_grouped_load_supported (tree vectype, unsigned HOST_WIDE_INT count)
 {
   enum machine_mode mode = TYPE_MODE (vectype);
 
-  /* vect_permute_load_chain requires the group size to be a power of two.  */
-  if (exact_log2 (count) == -1)
+  /* vect_permute_load_chain requires the group size to be equal to 3 or
+     be a power of two.  */
+  if (count != 3 && exact_log2 (count) == -1)
     {
       if (dump_enabled_p ())
 	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-                         "the size of the group of accesses"
-                         " is not a power of 2\n");
+			 "the size of the group of accesses"
+			 " is neither a power of 2 nor equal to 3\n");
       return false;
     }
 
   /* Check that the permutation is supported.  */
   if (VECTOR_MODE_P (mode))
     {
-      unsigned int i, nelt = GET_MODE_NUNITS (mode);
+      unsigned int i, j, nelt = GET_MODE_NUNITS (mode);
       unsigned char *sel = XALLOCAVEC (unsigned char, nelt);
 
-      for (i = 0; i < nelt; i++)
-	sel[i] = i * 2;
-      if (can_vec_perm_p (mode, false, sel))
+      if (exact_log2 (count) != -1)
 	{
 	  for (i = 0; i < nelt; i++)
-	    sel[i] = i * 2 + 1;
+	    sel[i] = i * 2;
 	  if (can_vec_perm_p (mode, false, sel))
-	    return true;
+	    {
+	      for (i = 0; i < nelt; i++)
+		sel[i] = i * 2 + 1;
+	      if (can_vec_perm_p (mode, false, sel))
+		return true;
+	    }
+        }
+      else if (count == 3)
+	{
+	  unsigned int k;
+	  for (k = 0; k < 3; k++)
+	    {
+	      for (i = 0; i < nelt; i++)
+		if (3 * i + k < 2 * nelt)
+		  sel[i] = 3 * i + k;
+		else
+		  sel[i] = 0;
+	      if (!can_vec_perm_p (mode, false, sel))
+		{
+		  if (dump_enabled_p ())
+		    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+				     "shuffle of 3 loads is not supported by"
+				     " target\n");
+		  return false;
+		}
+	      for (i = 0, j = 0; i < nelt; i++)
+		if (3 * i + k < 2 * nelt)
+		  sel[i] = i;
+		else
+		  sel[i] = nelt + ((nelt + k) % 3) + 3 * (j++);
+	      if (!can_vec_perm_p (mode, false, sel))
+		{
+		  if (dump_enabled_p ())
+		    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+				     "shuffle of 3 loads is not supported by"
+				     " target\n");
+		  return false;
+		}
+	    }
+	  return true;
 	}
     }
 
   if (dump_enabled_p ())
     dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-                     "extract even/odd not supported by target\n");
+		     "extract even/odd not supported by target\n");
   return false;
 }
 
@@ -4853,8 +5014,9 @@ vect_load_lanes_supported (tree vectype, unsigned HOST_WIDE_INT count)
 /* Function vect_permute_load_chain.
 
    Given a chain of interleaved loads in DR_CHAIN of LENGTH that must be
-   a power of 2, generate extract_even/odd stmts to reorder the input data
-   correctly.  Return the final references for loads in RESULT_CHAIN.
+   a power of 2 or equal to 3, generate extract_even/odd stmts to reorder
+   the input data correctly.  Return the final references for loads in
+   RESULT_CHAIN.
 
    E.g., LENGTH is 4 and the scalar type is short, i.e., VF is 8.
    The input is 4 vectors each containing 8 elements. We assign a number to each
@@ -4935,6 +5097,7 @@ vect_permute_load_chain (vec<tree> dr_chain,
 {
   tree data_ref, first_vect, second_vect;
   tree perm_mask_even, perm_mask_odd;
+  tree perm3_mask_low, perm3_mask_high;
   gimple perm_stmt;
   tree vectype = STMT_VINFO_VECTYPE (vinfo_for_stmt (stmt));
   unsigned int i, j, log_length = exact_log2 (length);
@@ -4945,45 +5108,99 @@ vect_permute_load_chain (vec<tree> dr_chain,
   memcpy (result_chain->address (), dr_chain.address (),
 	  length * sizeof (tree));
 
-  for (i = 0; i < nelt; ++i)
-    sel[i] = i * 2;
-  perm_mask_even = vect_gen_perm_mask (vectype, sel);
-  gcc_assert (perm_mask_even != NULL);
+  if (log_length != -1)
+    {
+      for (i = 0; i < nelt; ++i)
+	sel[i] = i * 2;
+      perm_mask_even = vect_gen_perm_mask (vectype, sel);
+      gcc_assert (perm_mask_even != NULL);
 
-  for (i = 0; i < nelt; ++i)
-    sel[i] = i * 2 + 1;
-  perm_mask_odd = vect_gen_perm_mask (vectype, sel);
-  gcc_assert (perm_mask_odd != NULL);
+      for (i = 0; i < nelt; ++i)
+	sel[i] = i * 2 + 1;
+      perm_mask_odd = vect_gen_perm_mask (vectype, sel);
+      gcc_assert (perm_mask_odd != NULL);
 
-  for (i = 0; i < log_length; i++)
-    {
-      for (j = 0; j < length; j += 2)
+      for (i = 0; i < log_length; i++)
 	{
-	  first_vect = dr_chain[j];
-	  second_vect = dr_chain[j+1];
+	  for (j = 0; j < length; j += 2)
+	    {
+	      first_vect = dr_chain[j];
+	      second_vect = dr_chain[j+1];
+
+	      /* data_ref = permute_even (first_data_ref, second_data_ref);  */
+	      data_ref = make_temp_ssa_name (vectype, NULL, "vect_perm_even");
+	      perm_stmt = gimple_build_assign_with_ops (VEC_PERM_EXPR, data_ref,
+							first_vect, second_vect,
+							perm_mask_even);
+	      vect_finish_stmt_generation (stmt, perm_stmt, gsi);
+	      (*result_chain)[j/2] = data_ref;
+
+	      /* data_ref = permute_odd (first_data_ref, second_data_ref);  */
+	      data_ref = make_temp_ssa_name (vectype, NULL, "vect_perm_odd");
+	      perm_stmt = gimple_build_assign_with_ops (VEC_PERM_EXPR, data_ref,
+							first_vect, second_vect,
+							perm_mask_odd);
+	      vect_finish_stmt_generation (stmt, perm_stmt, gsi);
+	      (*result_chain)[j/2+length/2] = data_ref;
+	    }
+	  memcpy (dr_chain.address (), result_chain->address (),
+		  length * sizeof (tree));
+	}
+    }
+  /* length is not a power of 2.  */
+  else
+    {
+      unsigned int k;
+
+      /* currently only length 3 is supported as most frequent case.  */
+      gcc_assert (length == 3);
 
-	  /* data_ref = permute_even (first_data_ref, second_data_ref);  */
-	  data_ref = make_temp_ssa_name (vectype, NULL, "vect_perm_even");
+      for (k = 0; k < 3; k++)
+	{
+	  for (i = 0; i < nelt; i++)
+	    if (3 * i + k < 2 * nelt)
+	      sel[i] = 3 * i + k;
+	    else
+	      sel[i] = 0;
+	  perm3_mask_low = vect_gen_perm_mask (vectype, sel);
+	  gcc_assert (perm3_mask_low != NULL);
+
+	  for (i = 0, j = 0; i < nelt; i++)
+	    if (3 * i + k < 2 * nelt)
+	      sel[i] = i;
+	    else
+	      sel[i] = nelt + ((nelt + k) % 3) + 3 * (j++);
+
+	  perm3_mask_high = vect_gen_perm_mask (vectype, sel);
+	  gcc_assert (perm3_mask_high != NULL);
+
+	  first_vect = dr_chain[0];
+	  second_vect = dr_chain[1];
+
+	  /* Create interleaving stmt (low part of):
+	     low = VEC_PERM_EXPR <first_vect, second_vect, {k, 3 + k, 6 + k,
+							    ...}>  */
+	  data_ref = make_temp_ssa_name (vectype, NULL, "vect_shuffle3_low");
 	  perm_stmt = gimple_build_assign_with_ops (VEC_PERM_EXPR, data_ref,
 						    first_vect, second_vect,
-						    perm_mask_even);
+						    perm3_mask_low);
 	  vect_finish_stmt_generation (stmt, perm_stmt, gsi);
-	  (*result_chain)[j/2] = data_ref;
 
-	  /* data_ref = permute_odd (first_data_ref, second_data_ref);  */
-	  data_ref = make_temp_ssa_name (vectype, NULL, "vect_perm_odd");
+	  /* Create interleaving stmt (high part of):
+	     high = VEC_PERM_EXPR <first_vect, second_vect,
+				   {0, 1, ..., nelt + (nelt + k) % 3, ...}>  */
+	  first_vect = data_ref;
+	  second_vect = dr_chain[2];
+	  data_ref = make_temp_ssa_name (vectype, NULL, "vect_shuffle3_high");
 	  perm_stmt = gimple_build_assign_with_ops (VEC_PERM_EXPR, data_ref,
 						    first_vect, second_vect,
-						    perm_mask_odd);
+						    perm3_mask_high);
 	  vect_finish_stmt_generation (stmt, perm_stmt, gsi);
-	  (*result_chain)[j/2+length/2] = data_ref;
+	  (*result_chain)[k] = data_ref;
 	}
-      memcpy (dr_chain.address (), result_chain->address (),
-	      length * sizeof (tree));
     }
 }
 
-
 /* Function vect_transform_grouped_load.
 
    Given a chain of input interleaved data-refs (in DR_CHAIN), build statements
diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
index 70fb411..05ccf2a 100644
--- a/gcc/tree-vect-stmts.c
+++ b/gcc/tree-vect-stmts.c
@@ -974,9 +974,9 @@ vect_model_store_cost (stmt_vec_info stmt_info, int ncopies,
      include the cost of the permutes.  */
   if (!store_lanes_p && group_size > 1)
     {
-      /* Uses a high and low interleave operation for each needed permute.  */
-      
-      int nstmts = ncopies * exact_log2 (group_size) * group_size;
+      /* Uses a high and low interleave or shuffle operations for each
+	 needed permute.  */
+      int nstmts = ncopies * ceil_log2 (group_size) * group_size;
       inside_cost = record_stmt_cost (body_cost_vec, nstmts, vec_perm,
 				      stmt_info, 0, vect_body);
 
@@ -1091,10 +1091,11 @@ vect_model_load_cost (stmt_vec_info stmt_info, int ncopies,
      include the cost of the permutes.  */
   if (!load_lanes_p && group_size > 1)
     {
-      /* Uses an even and odd extract operations for each needed permute.  */
-      int nstmts = ncopies * exact_log2 (group_size) * group_size;
-      inside_cost += record_stmt_cost (body_cost_vec, nstmts, vec_perm,
-				       stmt_info, 0, vect_body);
+      /* Uses an even and odd extract operations or shuffle operations
+	 for each needed permute.  */
+      int nstmts = ncopies * ceil_log2 (group_size) * group_size;
+      inside_cost = record_stmt_cost (body_cost_vec, nstmts, vec_perm,
+				      stmt_info, 0, vect_body);
 
       if (dump_enabled_p ())
         dump_printf_loc (MSG_NOTE, vect_location,

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH, PR52252] Vectorization for load/store groups of size 3.
  2014-03-06 14:44         ` Evgeny Stupachenko
@ 2014-04-18 10:05           ` Evgeny Stupachenko
  2014-04-30 15:03             ` Evgeny Stupachenko
  0 siblings, 1 reply; 26+ messages in thread
From: Evgeny Stupachenko @ 2014-04-18 10:05 UTC (permalink / raw)
  To: Richard Biener; +Cc: GCC Patches, Jakub Jelinek, Uros Bizjak

[-- Attachment #1: Type: text/plain, Size: 7088 bytes --]

Hi,

Merged with current master, the patch passes bootstrap and gives the
expected gains.
The patch and new tests are attached.

ChangeLog:

2014-04-18  Evgeny Stupachenko  <evstupac@gmail.com>

        * tree-vect-data-refs.c (vect_grouped_store_supported): New
        check for stores group of length 3.
        (vect_permute_store_chain): New permutations for stores group of
        length 3.
        (vect_grouped_load_supported): New check for loads group of length 3.
        (vect_permute_load_chain): New permutations for loads group of length 3.
        * tree-vect-stmts.c (vect_model_store_cost): Change cost
        of vec_perm_shuffle for the new permutations.
        (vect_model_load_cost): Ditto.
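
For readers following the load side, here is a small standalone model of
the two-step shuffle that vect_permute_load_chain uses for a group of 3.
It is only a sketch: vec_perm and check_load_group3 are illustrative names
that do not exist in GCC, MAX_NELT is an arbitrary bound, and plain int
arrays stand in for vectors. The mask formulas are the ones from the patch.

```c
#include <assert.h>

#define MAX_NELT 64

/* Model of VEC_PERM_EXPR <a, b, sel>: selector values below NELT pick
   a lane from A, the remaining values pick a lane from B.  */
static void
vec_perm (const int *a, const int *b, const unsigned char *sel,
	  int *out, unsigned int nelt)
{
  unsigned int i;

  for (i = 0; i < nelt; i++)
    out[i] = sel[i] < nelt ? a[sel[i]] : b[sel[i] - nelt];
}

/* Return 1 if, for every K in 0..2, the two shuffles leave element
   3 * i + K of the interleaved input in lane I of the result.  */
int
check_load_group3 (unsigned int nelt)
{
  int mem[3 * MAX_NELT], low[MAX_NELT], high[MAX_NELT];
  unsigned char sel[MAX_NELT];
  unsigned int i, j, k;

  assert (nelt >= 2 && nelt <= MAX_NELT);

  /* Three consecutive "vectors": mem, mem + nelt and mem + 2 * nelt.  */
  for (i = 0; i < 3 * nelt; i++)
    mem[i] = (int) i;

  for (k = 0; k < 3; k++)
    {
      /* First mask: gather every third element from vectors 0 and 1;
	 lanes that need vector 2 get a placeholder index 0.  */
      for (i = 0; i < nelt; i++)
	sel[i] = (unsigned char) (3 * i + k < 2 * nelt ? 3 * i + k : 0);
      vec_perm (mem, mem + nelt, sel, low, nelt);

      /* Second mask: keep the lanes already filled and take the
	 remaining ones from vector 2.  */
      for (i = 0, j = 0; i < nelt; i++)
	sel[i] = (unsigned char) (3 * i + k < 2 * nelt
				  ? i
				  : nelt + ((nelt + k) % 3) + 3 * j++);
      vec_perm (low, mem + 2 * nelt, sel, high, nelt);

      for (i = 0; i < nelt; i++)
	if (high[i] != (int) (3 * i + k))
	  return 0;
    }
  return 1;
}
```

Calling check_load_group3 with nelt of 4, 8 or 16 exercises the masks for
typical vector lengths. The model says nothing about whether a given
target can expand these masks; that is what can_vec_perm_p decides.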

ChangeLog for testsuite:

2014-04-18  Evgeny Stupachenko  <evstupac@gmail.com>

       PR tree-optimization/52252
       * gcc.dg/vect/pr52252-ld.c: Test on loads group of size 3.
       * gcc.dg/vect/pr52252-st.c: Test on stores group of size 3.
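
To make "group of size 3" concrete, the loops below show the general
shape of code this applies to. They are a hypothetical illustration in
the spirit of the RGB case from the PR, not the contents of the attached
test files; average3 and expand3 are made-up names.

```c
#include <assert.h>

/* Loads group of size 3: each iteration reads the three adjacent
   elements in[3*i], in[3*i+1] and in[3*i+2].  */
void
average3 (const unsigned char *in, unsigned char *out, int n)
{
  int i;

  assert (n >= 0);
  for (i = 0; i < n; i++)
    out[i] = (unsigned char) ((in[3 * i] + in[3 * i + 1]
			       + in[3 * i + 2]) / 3);
}

/* Stores group of size 3: each iteration writes the three adjacent
   elements out[3*i], out[3*i+1] and out[3*i+2].  */
void
expand3 (const unsigned char *in, unsigned char *out, int n)
{
  int i;

  assert (n >= 0);
  for (i = 0; i < n; i++)
    {
      out[3 * i] = in[i];
      out[3 * i + 1] = (unsigned char) (in[i] + 1);
      out[3 * i + 2] = (unsigned char) (in[i] + 2);
    }
}
```

Whether such loops are actually vectorized still depends on the target
supporting the required permutations and on the cost model.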

Evgeny

On Thu, Mar 6, 2014 at 6:44 PM, Evgeny Stupachenko <evstupac@gmail.com> wrote:
> Missed attachment.
>
> On Thu, Mar 6, 2014 at 6:42 PM, Evgeny Stupachenko <evstupac@gmail.com> wrote:
>> I've separated the patch into 2: cost model tuning and load/store
>> groups parallelism.
>> SLM tuning was partially introduced in the patch:
>> http://gcc.gnu.org/ml/gcc-patches/2014-03/msg00226.html
>> The patch introducing vectorization for load/store groups of size 3 attached.
>>
>> Is it ok for stage1?
>>
>> ChangeLog:
>>
>> 2014-03-06  Evgeny Stupachenko  <evstupac@gmail.com>
>>
>>        * tree-vect-data-refs.c (vect_grouped_store_supported): New
>>        check for stores group of length 3.
>>        (vect_permute_store_chain): New permutations for stores group of
>>        length 3.
>>        (vect_grouped_load_supported): New check for loads group of length 3.
>>        (vect_permute_load_chain): New permutations for loads group of length 3.
>>        * tree-vect-stmts.c (vect_model_store_cost): Change cost
>>        of vec_perm_shuffle for the new permutations.
>>        (vect_model_load_cost): Ditto.
>>
>>
>>
>> On Tue, Feb 11, 2014 at 7:19 PM, Richard Biener <rguenther@suse.de> wrote:
>>> On Tue, 11 Feb 2014, Evgeny Stupachenko wrote:
>>>
>>>> Missed patch attached in plain-text.
>>>>
>>>> I have copyright assignment on file with the FSF covering work on GCC.
>>>>
>>>> Load/store groups of length 3 are the most frequent non-power-of-2
>>>> case. They are used in RGB image processing (like the test case in
>>>> PR52252). For sure we can extend the patch to length 5 and more.
>>>> However, this could potentially affect performance on some other
>>>> architectures and requires broader testing. So length 3 is just a
>>>> first step. The algorithm in the patch could be modified for the
>>>> general case in several steps.
>>>>
>>>> I understand that the patch should wait for stage 1; however, since
>>>> it's ready we can discuss it right now and make some changes (like
>>>> a general group size).
>>>
>>> Other than that I'd like to see a vectorizer hook querying the cost of a
>>> vec_perm_const expansion instead of adding vec_perm_shuffle
>>> (thus requires the constant shuffle mask to be passed as well
>>> as the vector type).  That's more useful for other uses that
>>> would require (arbitrary) shuffles.
>>>
>>> Didn't look at the rest of the patch yet - queued in my review
>>> pipeline.
>>>
>>> Thanks,
>>> Richard.
>>>
>>>> Thanks,
>>>> Evgeny
>>>>
>>>> On Tue, Feb 11, 2014 at 5:00 PM, Richard Biener <rguenther@suse.de> wrote:
>>>> >
>>>> > On Tue, 11 Feb 2014, Evgeny Stupachenko wrote:
>>>> >
>>>> > > Hi,
>>>> > >
>>>> > > The patch gives an expected 3 times gain for the test case in the PR52252
>>>> > > (and even 6 times for AVX2).
>>>> > > It passes make check and bootstrap on x86.
>>>> > > spec2000/spec2006 got no regressions/gains on x86.
>>>> > >
>>>> > > Is this patch ok?
>>>> >
>>>> > I've worked on generalizing the permutation support in the light
>>>> > of the availability of the generic shuffle support in the IL
>>>> > but hit some road-blocks in the way code-generation works for
>>>> > group loads with permutations (I don't remember if I posted all patches).
>>>> >
>>>> > This patch seems to be to a slightly different place but it again
>>>> > special-cases a specific permutation.  Why's that?  Why can't we
>>>> > support groups of size 7 for example?  So - can this be generalized
>>>> > to support arbitrary non-power-of-two load/store groups?
>>>> >
>>>> > Other than that the patch has to wait for stage1 to open again,
>>>> > of course.  And it misses a testcase.
>>>> >
>>>> > Btw, do you have a copyright assignment on file with the FSF covering
>>>> > work on GCC?
>>>> >
>>>> > Thanks,
>>>> > Richard.
>>>> >
>>>> > > ChangeLog:
>>>> > >
>>>> > > 2014-02-11  Evgeny Stupachenko  <evstupac@gmail.com>
>>>> > >
>>>> > >         * target.h (vect_cost_for_stmt): Defining new cost vec_perm_shuffle.
>>>> > >         * tree-vect-data-refs.c (vect_grouped_store_supported): New
>>>> > >         check for stores group of length 3.
>>>> > >         (vect_permute_store_chain): New permutations for stores group of
>>>> > >         length 3.
>>>> > >         (vect_grouped_load_supported): New check for loads group of length
>>>> > > 3.
>>>> > >         (vect_permute_load_chain): New permutations for loads group of
>>>> > > length 3.
>>>> > >         * tree-vect-stmts.c (vect_model_store_cost): New cost
>>>> > > vec_perm_shuffle
>>>> > >         for the new permutations.
>>>> > >         (vect_model_load_cost): Ditto.
>>>> > >         * config/aarch64/aarch64.c (builtin_vectorization_cost): Adding
>>>> > >         vec_perm_shuffle cost as equivalent of vec_perm cost.
>>>> > >         * config/arm/arm.c: Ditto.
>>>> > >         * config/rs6000/rs6000.c: Ditto.
>>>> > >         * config/spu/spu.c: Ditto.
>>>> > >         * config/i386/x86-tune.def (TARGET_SLOW_PHUFFB): Target for slow
>>>> > > byte
>>>> > >         shuffle on some x86 architectures.
>>>> > >         * config/i386/i386.h (processor_costs): Defining pshuffb cost.
>>>> > >         * config/i386/i386.c (processor_costs): Adding pshuffb cost.
>>>> > >         (ix86_builtin_vectorization_cost): Adding cost for the new
>>>> > > permutations.
>>>> > >         Fixing cost for other permutations.
>>>> > >         (expand_vec_perm_even_odd_1): Avoid byte shuffles when they are
>>>> > >         slow (TARGET_SLOW_PHUFFB).
>>>> > >         (ix86_add_stmt_cost): Adding cost when STMT is WIDEN_MULTIPLY.
>>>> > >         Adding new shuffle cost only when byte shuffle is expected.
>>>> > >         Fixing cost model for Silvermont.
>>>> > >
>>>> > > Thanks,
>>>> > > Evgeny
>>>> > >
>>>> >
>>>> > --
>>>> > Richard Biener <rguenther@suse.de>
>>>> > SUSE / SUSE Labs
>>>> > SUSE LINUX Products GmbH - Nuernberg - AG Nuernberg - HRB 16746
>>>> > GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer
>>>>
>>>
>>> --
>>> Richard Biener <rguenther@suse.de>
>>> SUSE / SUSE Labs
>>> SUSE LINUX Products GmbH - Nuernberg - AG Nuernberg - HRB 16746
>>> GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer

[-- Attachment #2: vect3.patch --]
[-- Type: application/octet-stream, Size: 18556 bytes --]

diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c
index 274cdbd..042b97c 100644
--- a/gcc/tree-vect-data-refs.c
+++ b/gcc/tree-vect-data-refs.c
@@ -4365,13 +4365,14 @@ vect_grouped_store_supported (tree vectype, unsigned HOST_WIDE_INT count)
 {
   enum machine_mode mode = TYPE_MODE (vectype);
 
-  /* vect_permute_store_chain requires the group size to be a power of two.  */
-  if (exact_log2 (count) == -1)
+  /* vect_permute_store_chain requires the group size to be equal to 3 or
+     be a power of two.  */
+  if (count != 3 && exact_log2 (count) == -1)
     {
       if (dump_enabled_p ())
 	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-                         "the size of the group of accesses"
-                         " is not a power of 2\n");
+			 "the size of the group of accesses"
+			 " is neither a power of 2 nor equal to 3\n");
       return false;
     }
 
@@ -4380,23 +4381,73 @@ vect_grouped_store_supported (tree vectype, unsigned HOST_WIDE_INT count)
     {
       unsigned int i, nelt = GET_MODE_NUNITS (mode);
       unsigned char *sel = XALLOCAVEC (unsigned char, nelt);
-      for (i = 0; i < nelt / 2; i++)
+
+      if (exact_log2 (count) != -1)
 	{
-	  sel[i * 2] = i;
-	  sel[i * 2 + 1] = i + nelt;
+	  for (i = 0; i < nelt / 2; i++)
+	    {
+	      sel[i * 2] = i;
+	      sel[i * 2 + 1] = i + nelt;
+	    }
+	    if (can_vec_perm_p (mode, false, sel))
+	      {
+		for (i = 0; i < nelt; i++)
+		  sel[i] += nelt / 2;
+		if (can_vec_perm_p (mode, false, sel))
+		  return true;
+	      }
 	}
-      if (can_vec_perm_p (mode, false, sel))
+      else if (count == 3)
 	{
-	  for (i = 0; i < nelt; i++)
-	    sel[i] += nelt / 2;
-	  if (can_vec_perm_p (mode, false, sel))
-	    return true;
-	}
+	  unsigned int j0 = 0, j1 = 0, j2 = 0;
+	  unsigned int i, j;
+
+	  for (j = 0; j < 3; j++)
+	    {
+	      int nelt0 = ((3 - j) * nelt) % 3;
+	      int nelt1 = ((3 - j) * nelt + 1) % 3;
+	      int nelt2 = ((3 - j) * nelt + 2) % 3;
+	      for (i = 0; i < nelt; i++)
+		{
+		  if (3 * i + nelt0 < nelt)
+		    sel[3 * i + nelt0] = j0++;
+		  if (3 * i + nelt1 < nelt)
+		    sel[3 * i + nelt1] = nelt + j1++;
+		  if (3 * i + nelt2 < nelt)
+		    sel[3 * i + nelt2] = 0;
+		}
+	      if (!can_vec_perm_p (mode, false, sel))
+		{
+		  if (dump_enabled_p ())
+		    dump_printf (MSG_MISSED_OPTIMIZATION,
+				 "permutation op not supported by target.\n");
+		  return false;
+		}
+
+	      for (i = 0; i < nelt; i++)
+		{
+		  if (3 * i + nelt0 < nelt)
+		    sel[3 * i + nelt0] = 3 * i + nelt0;
+		  if (3 * i + nelt1 < nelt)
+		    sel[3 * i + nelt1] = 3 * i + nelt1;
+		  if (3 * i + nelt2 < nelt)
+		    sel[3 * i + nelt2] = nelt + j2++;
+		}
+	      if (!can_vec_perm_p (mode, false, sel))
+		{
+		  if (dump_enabled_p ())
+		    dump_printf (MSG_MISSED_OPTIMIZATION,
+				 "permutation op not supported by target.\n");
+		  return false;
+		}
+	    }
+	  return true;
+      }
     }
 
   if (dump_enabled_p ())
     dump_printf (MSG_MISSED_OPTIMIZATION,
-                 "interleave op not supported by target.\n");
+		 "permutation op not supported by target.\n");
   return false;
 }
 
@@ -4416,9 +4467,9 @@ vect_store_lanes_supported (tree vectype, unsigned HOST_WIDE_INT count)
 /* Function vect_permute_store_chain.
 
    Given a chain of interleaved stores in DR_CHAIN of LENGTH that must be
-   a power of 2, generate interleave_high/low stmts to reorder the data
-   correctly for the stores.  Return the final references for stores in
-   RESULT_CHAIN.
+   a power of 2 or equal to 3, generate interleave_high/low stmts to reorder
+   the data correctly for the stores.  Return the final references for stores
+   in RESULT_CHAIN.
 
    E.g., LENGTH is 4 and the scalar type is short, i.e., VF is 8.
    The input is 4 vectors each containing 8 elements.  We assign a number to
@@ -4485,7 +4536,9 @@ vect_permute_store_chain (vec<tree> dr_chain,
   gimple perm_stmt;
   tree vectype = STMT_VINFO_VECTYPE (vinfo_for_stmt (stmt));
   tree perm_mask_low, perm_mask_high;
-  unsigned int i, n;
+  tree data_ref;
+  tree perm3_mask_low, perm3_mask_high;
+  unsigned int i, n, log_length = exact_log2 (length);
   unsigned int j, nelt = TYPE_VECTOR_SUBPARTS (vectype);
   unsigned char *sel = XALLOCAVEC (unsigned char, nelt);
 
@@ -4493,47 +4546,117 @@ vect_permute_store_chain (vec<tree> dr_chain,
   memcpy (result_chain->address (), dr_chain.address (),
 	  length * sizeof (tree));
 
-  for (i = 0, n = nelt / 2; i < n; i++)
+  if (log_length != (unsigned int)-1)
     {
-      sel[i * 2] = i;
-      sel[i * 2 + 1] = i + nelt;
-    }
-  perm_mask_high = vect_gen_perm_mask (vectype, sel);
-  gcc_assert (perm_mask_high != NULL);
+      for (i = 0, n = nelt / 2; i < n; i++)
+	{
+	  sel[i * 2] = i;
+	  sel[i * 2 + 1] = i + nelt;
+	}
+	perm_mask_high = vect_gen_perm_mask (vectype, sel);
+	gcc_assert (perm_mask_high != NULL);
 
-  for (i = 0; i < nelt; i++)
-    sel[i] += nelt / 2;
-  perm_mask_low = vect_gen_perm_mask (vectype, sel);
-  gcc_assert (perm_mask_low != NULL);
+	for (i = 0; i < nelt; i++)
+	  sel[i] += nelt / 2;
+	perm_mask_low = vect_gen_perm_mask (vectype, sel);
+	gcc_assert (perm_mask_low != NULL);
 
-  for (i = 0, n = exact_log2 (length); i < n; i++)
+	for (i = 0, n = log_length; i < n; i++)
+	  {
+	    for (j = 0; j < length/2; j++)
+	      {
+		vect1 = dr_chain[j];
+		vect2 = dr_chain[j+length/2];
+
+		/* Create interleaving stmt:
+		   high = VEC_PERM_EXPR <vect1, vect2, {0, nelt, 1, nelt+1,
+							...}>  */
+		high = make_temp_ssa_name (vectype, NULL, "vect_inter_high");
+		perm_stmt
+		  = gimple_build_assign_with_ops (VEC_PERM_EXPR, high,
+						  vect1, vect2, perm_mask_high);
+		vect_finish_stmt_generation (stmt, perm_stmt, gsi);
+		(*result_chain)[2*j] = high;
+
+		/* Create interleaving stmt:
+		   low = VEC_PERM_EXPR <vect1, vect2,
+					{nelt/2, nelt*3/2, nelt/2+1, nelt*3/2+1,
+					 ...}>  */
+		low = make_temp_ssa_name (vectype, NULL, "vect_inter_low");
+		perm_stmt
+		  = gimple_build_assign_with_ops (VEC_PERM_EXPR, low,
+						  vect1, vect2, perm_mask_low);
+		vect_finish_stmt_generation (stmt, perm_stmt, gsi);
+		(*result_chain)[2*j+1] = low;
+	      }
+	    memcpy (dr_chain.address (), result_chain->address (),
+		    length * sizeof (tree));
+	  }
+    }
+  /* length is not a power of 2.  */
+  else
     {
-      for (j = 0; j < length/2; j++)
-	{
-	  vect1 = dr_chain[j];
-	  vect2 = dr_chain[j+length/2];
+      unsigned int j0 = 0, j1 = 0, j2 = 0;
+
+      /* currently only length 3 is supported as most frequent case.  */
+      gcc_assert (length == 3);
+
+      for (j = 0; j < 3; j++)
+        {
+	  int nelt0 = ((3 - j) * nelt) % 3;
+	  int nelt1 = ((3 - j) * nelt + 1) % 3;
+	  int nelt2 = ((3 - j) * nelt + 2) % 3;
+
+	  for (i = 0; i < nelt; i++)
+	    {
+	      if (3 * i + nelt0 < nelt)
+		sel[3 * i + nelt0] = j0++;
+	      if (3 * i + nelt1 < nelt)
+		sel[3 * i + nelt1] = nelt + j1++;
+	      if (3 * i + nelt2 < nelt)
+		sel[3 * i + nelt2] = 0;
+	    }
+	  perm3_mask_low = vect_gen_perm_mask (vectype, sel);
+	  gcc_assert (perm3_mask_low != NULL);
+
+	  for (i = 0; i < nelt; i++)
+	    {
+	      if (3 * i + nelt0 < nelt)
+		sel[3 * i + nelt0] = 3 * i + nelt0;
+	      if (3 * i + nelt1 < nelt)
+		sel[3 * i + nelt1] = 3 * i + nelt1;
+	      if (3 * i + nelt2 < nelt)
+		sel[3 * i + nelt2] = nelt + j2++;
+	    }
+	  perm3_mask_high = vect_gen_perm_mask (vectype, sel);
+	  gcc_assert (perm3_mask_high != NULL);
+
+	  vect1 = dr_chain[0];
+	  vect2 = dr_chain[1];
 
 	  /* Create interleaving stmt:
-	     high = VEC_PERM_EXPR <vect1, vect2, {0, nelt, 1, nelt+1, ...}>  */
-	  high = make_temp_ssa_name (vectype, NULL, "vect_inter_high");
-	  perm_stmt
-	    = gimple_build_assign_with_ops (VEC_PERM_EXPR, high,
-					    vect1, vect2, perm_mask_high);
+	     low = VEC_PERM_EXPR <vect1, vect2,
+				  {j, nelt, *, j + 1, nelt + j + 1, *,
+				   j + 2, nelt + j + 2, *, ...}>  */
+	  data_ref = make_temp_ssa_name (vectype, NULL, "vect_shuffle3_low");
+	  perm_stmt = gimple_build_assign_with_ops (VEC_PERM_EXPR, data_ref,
+						    vect1, vect2,
+						    perm3_mask_low);
 	  vect_finish_stmt_generation (stmt, perm_stmt, gsi);
-	  (*result_chain)[2*j] = high;
 
+	  vect1 = data_ref;
+	  vect2 = dr_chain[2];
 	  /* Create interleaving stmt:
-	     low = VEC_PERM_EXPR <vect1, vect2, {nelt/2, nelt*3/2, nelt/2+1,
-						 nelt*3/2+1, ...}>  */
-	  low = make_temp_ssa_name (vectype, NULL, "vect_inter_low");
-	  perm_stmt
-	    = gimple_build_assign_with_ops (VEC_PERM_EXPR, low,
-					    vect1, vect2, perm_mask_low);
+	     high = VEC_PERM_EXPR <vect1, vect2,
+				  {0, 1, nelt + j, 3, 4, nelt + j + 1,
+				   6, 7, nelt + j + 2, ...}>  */
+	  data_ref = make_temp_ssa_name (vectype, NULL, "vect_shuffle3_high");
+	  perm_stmt = gimple_build_assign_with_ops (VEC_PERM_EXPR, data_ref,
+						    vect1, vect2,
+						    perm3_mask_high);
 	  vect_finish_stmt_generation (stmt, perm_stmt, gsi);
-	  (*result_chain)[2*j+1] = low;
+	  (*result_chain)[j] = data_ref;
 	}
-      memcpy (dr_chain.address (), result_chain->address (),
-	      length * sizeof (tree));
     }
 }
 
@@ -4812,36 +4935,74 @@ vect_grouped_load_supported (tree vectype, unsigned HOST_WIDE_INT count)
 {
   enum machine_mode mode = TYPE_MODE (vectype);
 
-  /* vect_permute_load_chain requires the group size to be a power of two.  */
-  if (exact_log2 (count) == -1)
+  /* vect_permute_load_chain requires the group size to be equal to 3 or
+     be a power of two.  */
+  if (count != 3 && exact_log2 (count) == -1)
     {
       if (dump_enabled_p ())
 	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-                         "the size of the group of accesses"
-                         " is not a power of 2\n");
+			 "the size of the group of accesses"
+			 " is not a power of 2 or not equal to 3\n");
       return false;
     }
 
   /* Check that the permutation is supported.  */
   if (VECTOR_MODE_P (mode))
     {
-      unsigned int i, nelt = GET_MODE_NUNITS (mode);
+      unsigned int i, j, nelt = GET_MODE_NUNITS (mode);
       unsigned char *sel = XALLOCAVEC (unsigned char, nelt);
 
-      for (i = 0; i < nelt; i++)
-	sel[i] = i * 2;
-      if (can_vec_perm_p (mode, false, sel))
+      if (exact_log2 (count) != -1)
 	{
 	  for (i = 0; i < nelt; i++)
-	    sel[i] = i * 2 + 1;
+	    sel[i] = i * 2;
 	  if (can_vec_perm_p (mode, false, sel))
-	    return true;
+	    {
+	      for (i = 0; i < nelt; i++)
+		sel[i] = i * 2 + 1;
+	      if (can_vec_perm_p (mode, false, sel))
+		return true;
+	    }
+        }
+      else if (count == 3)
+	{
+	  unsigned int k;
+	  for (k = 0; k < 3; k++)
+	    {
+	      for (i = 0; i < nelt; i++)
+		if (3 * i + k < 2 * nelt)
+		  sel[i] = 3 * i + k;
+		else
+		  sel[i] = 0;
+	      if (!can_vec_perm_p (mode, false, sel))
+		{
+		  if (dump_enabled_p ())
+		    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+				     "shuffle of 3 loads is not supported by"
+				     " target\n");
+		  return false;
+		}
+	      for (i = 0, j = 0; i < nelt; i++)
+		if (3 * i + k < 2 * nelt)
+		  sel[i] = i;
+		else
+		  sel[i] = nelt + ((nelt + k) % 3) + 3 * (j++);
+	      if (!can_vec_perm_p (mode, false, sel))
+		{
+		  if (dump_enabled_p ())
+		    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+				     "shuffle of 3 loads is not supported by"
+				     " target\n");
+		  return false;
+		}
+	    }
+	  return true;
 	}
     }
 
   if (dump_enabled_p ())
     dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-                     "extract even/odd not supported by target\n");
+		     "extract even/odd not supported by target\n");
   return false;
 }
 
@@ -4859,8 +5020,9 @@ vect_load_lanes_supported (tree vectype, unsigned HOST_WIDE_INT count)
 /* Function vect_permute_load_chain.
 
    Given a chain of interleaved loads in DR_CHAIN of LENGTH that must be
-   a power of 2, generate extract_even/odd stmts to reorder the input data
-   correctly.  Return the final references for loads in RESULT_CHAIN.
+   a power of 2 or equal to 3, generate extract_even/odd stmts to reorder
+   the input data correctly.  Return the final references for loads in
+   RESULT_CHAIN.
 
    E.g., LENGTH is 4 and the scalar type is short, i.e., VF is 8.
    The input is 4 vectors each containing 8 elements. We assign a number to each
@@ -4941,6 +5103,7 @@ vect_permute_load_chain (vec<tree> dr_chain,
 {
   tree data_ref, first_vect, second_vect;
   tree perm_mask_even, perm_mask_odd;
+  tree perm3_mask_low, perm3_mask_high;
   gimple perm_stmt;
   tree vectype = STMT_VINFO_VECTYPE (vinfo_for_stmt (stmt));
   unsigned int i, j, log_length = exact_log2 (length);
@@ -4951,45 +5114,99 @@ vect_permute_load_chain (vec<tree> dr_chain,
   memcpy (result_chain->address (), dr_chain.address (),
 	  length * sizeof (tree));
 
-  for (i = 0; i < nelt; ++i)
-    sel[i] = i * 2;
-  perm_mask_even = vect_gen_perm_mask (vectype, sel);
-  gcc_assert (perm_mask_even != NULL);
+  if (log_length != (unsigned int)-1)
+    {
+      for (i = 0; i < nelt; ++i)
+	sel[i] = i * 2;
+      perm_mask_even = vect_gen_perm_mask (vectype, sel);
+      gcc_assert (perm_mask_even != NULL);
 
-  for (i = 0; i < nelt; ++i)
-    sel[i] = i * 2 + 1;
-  perm_mask_odd = vect_gen_perm_mask (vectype, sel);
-  gcc_assert (perm_mask_odd != NULL);
+      for (i = 0; i < nelt; ++i)
+	sel[i] = i * 2 + 1;
+      perm_mask_odd = vect_gen_perm_mask (vectype, sel);
+      gcc_assert (perm_mask_odd != NULL);
 
-  for (i = 0; i < log_length; i++)
-    {
-      for (j = 0; j < length; j += 2)
+      for (i = 0; i < log_length; i++)
 	{
-	  first_vect = dr_chain[j];
-	  second_vect = dr_chain[j+1];
+	  for (j = 0; j < length; j += 2)
+	    {
+	      first_vect = dr_chain[j];
+	      second_vect = dr_chain[j+1];
+
+	      /* data_ref = permute_even (first_data_ref, second_data_ref);  */
+	      data_ref = make_temp_ssa_name (vectype, NULL, "vect_perm_even");
+	      perm_stmt = gimple_build_assign_with_ops (VEC_PERM_EXPR, data_ref,
+							first_vect, second_vect,
+							perm_mask_even);
+	      vect_finish_stmt_generation (stmt, perm_stmt, gsi);
+	      (*result_chain)[j/2] = data_ref;
+
+	      /* data_ref = permute_odd (first_data_ref, second_data_ref);  */
+	      data_ref = make_temp_ssa_name (vectype, NULL, "vect_perm_odd");
+	      perm_stmt = gimple_build_assign_with_ops (VEC_PERM_EXPR, data_ref,
+							first_vect, second_vect,
+							perm_mask_odd);
+	      vect_finish_stmt_generation (stmt, perm_stmt, gsi);
+	      (*result_chain)[j/2+length/2] = data_ref;
+	    }
+	  memcpy (dr_chain.address (), result_chain->address (),
+		  length * sizeof (tree));
+	}
+    }
+  /* length is not a power of 2.  */
+  else
+    {
+      unsigned int k;
+
+      /* Currently only length 3 is supported, as the most frequent case.  */
+      gcc_assert (length == 3);
 
-	  /* data_ref = permute_even (first_data_ref, second_data_ref);  */
-	  data_ref = make_temp_ssa_name (vectype, NULL, "vect_perm_even");
+      for (k = 0; k < 3; k++)
+	{
+	  for (i = 0; i < nelt; i++)
+	    if (3 * i + k < 2 * nelt)
+	      sel[i] = 3 * i + k;
+	    else
+	      sel[i] = 0;
+	  perm3_mask_low = vect_gen_perm_mask (vectype, sel);
+	  gcc_assert (perm3_mask_low != NULL);
+
+	  for (i = 0, j = 0; i < nelt; i++)
+	    if (3 * i + k < 2 * nelt)
+	      sel[i] = i;
+	    else
+	      sel[i] = nelt + ((nelt + k) % 3) + 3 * (j++);
+
+	  perm3_mask_high = vect_gen_perm_mask (vectype, sel);
+	  gcc_assert (perm3_mask_high != NULL);
+
+	  first_vect = dr_chain[0];
+	  second_vect = dr_chain[1];
+
+	  /* Create interleaving stmt (low part of):
+	     low = VEC_PERM_EXPR <first_vect, second_vect, {k, 3 + k, 6 + k,
+							    ...}>  */
+	  data_ref = make_temp_ssa_name (vectype, NULL, "vect_shuffle3_low");
 	  perm_stmt = gimple_build_assign_with_ops (VEC_PERM_EXPR, data_ref,
 						    first_vect, second_vect,
-						    perm_mask_even);
+						    perm3_mask_low);
 	  vect_finish_stmt_generation (stmt, perm_stmt, gsi);
-	  (*result_chain)[j/2] = data_ref;
 
-	  /* data_ref = permute_odd (first_data_ref, second_data_ref);  */
-	  data_ref = make_temp_ssa_name (vectype, NULL, "vect_perm_odd");
+	  /* Create interleaving stmt (high part of):
+	     high = VEC_PERM_EXPR <low, dr_chain[2],
+				   {0, 1, ..., nelt + (nelt + k) % 3, ...}>  */
+	  first_vect = data_ref;
+	  second_vect = dr_chain[2];
+	  data_ref = make_temp_ssa_name (vectype, NULL, "vect_shuffle3_high");
 	  perm_stmt = gimple_build_assign_with_ops (VEC_PERM_EXPR, data_ref,
 						    first_vect, second_vect,
-						    perm_mask_odd);
+						    perm3_mask_high);
 	  vect_finish_stmt_generation (stmt, perm_stmt, gsi);
-	  (*result_chain)[j/2+length/2] = data_ref;
+	  (*result_chain)[k] = data_ref;
 	}
-      memcpy (dr_chain.address (), result_chain->address (),
-	      length * sizeof (tree));
     }
 }
 
-
 /* Function vect_transform_grouped_load.
 
    Given a chain of input interleaved data-refs (in DR_CHAIN), build statements
diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
index 1a51d6d..24d0b94 100644
--- a/gcc/tree-vect-stmts.c
+++ b/gcc/tree-vect-stmts.c
@@ -974,9 +974,9 @@ vect_model_store_cost (stmt_vec_info stmt_info, int ncopies,
      include the cost of the permutes.  */
   if (!store_lanes_p && group_size > 1)
     {
-      /* Uses a high and low interleave operation for each needed permute.  */
-      
-      int nstmts = ncopies * exact_log2 (group_size) * group_size;
+      /* Uses a high and low interleave or shuffle operations for each
+	 needed permute.  */
+      int nstmts = ncopies * ceil_log2 (group_size) * group_size;
       inside_cost = record_stmt_cost (body_cost_vec, nstmts, vec_perm,
 				      stmt_info, 0, vect_body);
 
@@ -1091,10 +1091,11 @@ vect_model_load_cost (stmt_vec_info stmt_info, int ncopies,
      include the cost of the permutes.  */
   if (!load_lanes_p && group_size > 1)
     {
-      /* Uses an even and odd extract operations for each needed permute.  */
-      int nstmts = ncopies * exact_log2 (group_size) * group_size;
-      inside_cost += record_stmt_cost (body_cost_vec, nstmts, vec_perm,
-				       stmt_info, 0, vect_body);
+      /* Uses an even and odd extract operations or shuffle operations
+	 for each needed permute.  */
+      int nstmts = ncopies * ceil_log2 (group_size) * group_size;
+      inside_cost = record_stmt_cost (body_cost_vec, nstmts, vec_perm,
+				      stmt_info, 0, vect_body);
 
       if (dump_enabled_p ())
         dump_printf_loc (MSG_NOTE, vect_location,

[-- Attachment #3: vect3_tests.patch --]
[-- Type: application/octet-stream, Size: 1732 bytes --]

diff --git a/gcc/testsuite/gcc.dg/vect/pr52252-ld.c b/gcc/testsuite/gcc.dg/vect/pr52252-ld.c
new file mode 100644
index 0000000..eecd4ee
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr52252-ld.c
@@ -0,0 +1,30 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -g -ftree-vectorize -mssse3 -fdump-tree-vect-details" { target { i?86-*-* x86_64-*-* } } } */
+
+#define byte unsigned char
+
+void
+matrix_mul(byte *in, byte *out, int size)
+{
+  int i; 
+  for(i = 0; i < size; i++)
+    {
+      byte in0 = in[0];
+      byte in1 = in[1];
+      byte in2 = in[2];
+      byte out0, out1, out2, out3;
+      out0 = in0 + in1;
+      out1 = in0 + in2;
+      out2 = in1 + in2;
+      out3 = in0 + in1 + in2;
+      out[0] = out0;
+      out[1] = out1;
+      out[2] = out2;
+      out[3] = out3;
+      in += 3;
+      out += 4;
+    } 
+}
+
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
+/* { dg-final { cleanup-tree-dump "vect" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/pr52252-st.c b/gcc/testsuite/gcc.dg/vect/pr52252-st.c
new file mode 100644
index 0000000..170a95a
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr52252-st.c
@@ -0,0 +1,21 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -g -ftree-vectorize -mssse3 -fdump-tree-vect-details" { target { i?86-*-* x86_64-*-* } } } */
+
+#define byte unsigned char
+
+void
+matrix_mul(byte *in, byte *out, int size)
+{
+  int i; 
+  for(i = 0; i < size; i++)
+    {
+      out[0] = in[0] + in[1] + in[3];
+      out[1] = in[0] + in[2] + in[4];
+      out[2] = in[1] + in[2] + in[4];
+      in += 4;
+      out += 3;
+    } 
+}
+
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
+/* { dg-final { cleanup-tree-dump "vect" } } */

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH, PR52252] Vectorization for load/store groups of size 3.
  2014-04-18 10:05           ` Evgeny Stupachenko
@ 2014-04-30 15:03             ` Evgeny Stupachenko
  2014-05-06 11:27               ` Evgeny Stupachenko
  0 siblings, 1 reply; 26+ messages in thread
From: Evgeny Stupachenko @ 2014-04-30 15:03 UTC (permalink / raw)
  To: Richard Biener; +Cc: GCC Patches, Jakub Jelinek, Uros Bizjak

Ping.

On Fri, Apr 18, 2014 at 2:05 PM, Evgeny Stupachenko <evstupac@gmail.com> wrote:
> Hi,
>
> Merged with current master the patch passes bootstrap and is giving
> expected gains.
> Patch and new tests are attached.
>
> ChangeLog:
>
> 2014-04-18  Evgeny Stupachenko  <evstupac@gmail.com>
>
>         * tree-vect-data-refs.c (vect_grouped_store_supported): New
>         check for stores group of length 3.
>         (vect_permute_store_chain): New permutations for stores group of
>         length 3.
>         (vect_grouped_load_supported): New check for loads group of length 3.
>         (vect_permute_load_chain): New permutations for loads group of length 3.
>         * tree-vect-stmts.c (vect_model_store_cost): Change cost
>         of vec_perm_shuffle for the new permutations.
>         (vect_model_load_cost): Ditto.
>
> ChangeLog for testsuite:
>
> 2014-04-18  Evgeny Stupachenko  <evstupac@gmail.com>
>
>        PR tree-optimization/52252
>        * gcc.dg/vect/pr52252-ld.c: Test on loads group of size 3.
>        * gcc.dg/vect/pr52252-st.c: Test on stores group of size 3.
>
> Evgeny
>
> On Thu, Mar 6, 2014 at 6:44 PM, Evgeny Stupachenko <evstupac@gmail.com> wrote:
>> Missed attachment.
>>
>> On Thu, Mar 6, 2014 at 6:42 PM, Evgeny Stupachenko <evstupac@gmail.com> wrote:
>>> I've separated the patch into 2: cost model tuning and load/store
>>> groups parallelism.
>>> SLM tuning was partially introduced in the patch:
>>> http://gcc.gnu.org/ml/gcc-patches/2014-03/msg00226.html
>>> The patch introducing vectorization for load/store groups of size 3 attached.
>>>
>>> Is it ok for stage1?
>>>
>>> ChangeLog:
>>>
>>> 2014-03-06  Evgeny Stupachenko  <evstupac@gmail.com>
>>>
>>>        * tree-vect-data-refs.c (vect_grouped_store_supported): New
>>>        check for stores group of length 3.
>>>        (vect_permute_store_chain): New permutations for stores group of
>>>        length 3.
>>>        (vect_grouped_load_supported): New check for loads group of length 3.
>>>        (vect_permute_load_chain): New permutations for loads group of length 3.
>>>        * tree-vect-stmts.c (vect_model_store_cost): Change cost
>>>        of vec_perm_shuffle for the new permutations.
>>>        (vect_model_load_cost): Ditto.
>>>
>>>
>>>
>>> On Tue, Feb 11, 2014 at 7:19 PM, Richard Biener <rguenther@suse.de> wrote:
>>>> On Tue, 11 Feb 2014, Evgeny Stupachenko wrote:
>>>>
>>>>> Missed patch attached in plain-text.
>>>>>
>>>>> I have copyright assignment on file with the FSF covering work on GCC.
>>>>>
>>>>> Load/store groups of length 3 are the most frequent non-power-of-2
>>>>> case. They are used in RGB image processing (like the test case in PR52252).
>>>>> We can certainly extend the patch to length 5 and more. However, this
>>>>> potentially affects performance on some other architectures and
>>>>> requires more extensive testing. So length 3 is just the first step. The
>>>>> algorithm in the patch could be modified for the general case in several
>>>>> steps.
>>>>>
>>>>> I understand that the patch should wait for stage 1; however, since
>>>>> it's ready we can discuss it right now and make some changes (like
>>>>> a general group size).
>>>>
>>>> Other than that I'd like to see a vectorizer hook querying the cost of a
>>>> vec_perm_const expansion instead of adding vec_perm_shuffle
>>>> (thus requires the constant shuffle mask to be passed as well
>>>> as the vector type).  That's more useful for other uses that
>>>> would require (arbitrary) shuffles.
>>>>
>>>> Didn't look at the rest of the patch yet - queued in my review
>>>> pipeline.
>>>>
>>>> Thanks,
>>>> Richard.
>>>>
>>>>> Thanks,
>>>>> Evgeny
>>>>>
>>>>> On Tue, Feb 11, 2014 at 5:00 PM, Richard Biener <rguenther@suse.de> wrote:
>>>>> >
>>>>> > On Tue, 11 Feb 2014, Evgeny Stupachenko wrote:
>>>>> >
>>>>> > > Hi,
>>>>> > >
>>>>> > > The patch gives an expected 3 times gain for the test case in the PR52252
>>>>> > > (and even 6 times for AVX2).
>>>>> > > It passes make check and bootstrap on x86.
>>>>> > > spec2000/spec2006 got no regressions/gains on x86.
>>>>> > >
>>>>> > > Is this patch ok?
>>>>> >
>>>>> > I've worked on generalizing the permutation support in the light
>>>>> > of the availability of the generic shuffle support in the IL
>>>>> > but hit some road-blocks in the way code-generation works for
>>>>> > group loads with permutations (I don't remember if I posted all patches).
>>>>> >
>>>>> > This patch seems to be to a slightly different place but it again
>>>>> > special-cases a specific permutation.  Why's that?  Why can't we
>>>>> > support groups of size 7 for example?  So - can this be generalized
>>>>> > to support arbitrary non-power-of-two load/store groups?
>>>>> >
>>>>> > Other than that the patch has to wait for stage1 to open again,
>>>>> > of course.  And it misses a testcase.
>>>>> >
>>>>> > Btw, do you have a copyright assignment on file with the FSF covering
>>>>> > work on GCC?
>>>>> >
>>>>> > Thanks,
>>>>> > Richard.
>>>>> >
>>>>> > > ChangeLog:
>>>>> > >
>>>>> > > 2014-02-11  Evgeny Stupachenko  <evstupac@gmail.com>
>>>>> > >
>>>>> > >         * target.h (vect_cost_for_stmt): Defining new cost vec_perm_shuffle.
>>>>> > >         * tree-vect-data-refs.c (vect_grouped_store_supported): New
>>>>> > >         check for stores group of length 3.
>>>>> > >         (vect_permute_store_chain): New permutations for stores group of
>>>>> > >         length 3.
>>>>> > >         (vect_grouped_load_supported): New check for loads group of length
>>>>> > > 3.
>>>>> > >         (vect_permute_load_chain): New permutations for loads group of
>>>>> > > length 3.
>>>>> > >         * tree-vect-stmts.c (vect_model_store_cost): New cost
>>>>> > > vec_perm_shuffle
>>>>> > >         for the new permutations.
>>>>> > >         (vect_model_load_cost): Ditto.
>>>>> > >         * config/aarch64/aarch64.c (builtin_vectorization_cost): Adding
>>>>> > >         vec_perm_shuffle cost as equivalent of vec_perm cost.
>>>>> > >         * config/arm/arm.c: Ditto.
>>>>> > >         * config/rs6000/rs6000.c: Ditto.
>>>>> > >         * config/spu/spu.c: Ditto.
>>>>> > >         * config/i386/x86-tune.def (TARGET_SLOW_PHUFFB): Target for slow
>>>>> > > byte
>>>>> > >         shuffle on some x86 architectures.
>>>>> > >         * config/i386/i386.h (processor_costs): Defining pshuffb cost.
>>>>> > >         * config/i386/i386.c (processor_costs): Adding pshuffb cost.
>>>>> > >         (ix86_builtin_vectorization_cost): Adding cost for the new
>>>>> > > permutations.
>>>>> > >         Fixing cost for other permutations.
>>>>> > >         (expand_vec_perm_even_odd_1): Avoid byte shuffles when they are
>>>>> > >         slow (TARGET_SLOW_PHUFFB).
>>>>> > >         (ix86_add_stmt_cost): Adding cost when STMT is WIDEN_MULTIPLY.
>>>>> > >         Adding new shuffle cost only when byte shuffle is expected.
>>>>> > >         Fixing cost model for Silvermont.
>>>>> > >
>>>>> > > Thanks,
>>>>> > > Evgeny
>>>>> > >
>>>>> >
>>>>> > --
>>>>> > Richard Biener <rguenther@suse.de>
>>>>> > SUSE / SUSE Labs
>>>>> > SUSE LINUX Products GmbH - Nuernberg - AG Nuernberg - HRB 16746
>>>>> > GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer
>>>>>
>>>>
>>>> --
>>>> Richard Biener <rguenther@suse.de>
>>>> SUSE / SUSE Labs
>>>> SUSE LINUX Products GmbH - Nuernberg - AG Nuernberg - HRB 16746
>>>> GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH, PR52252] Vectorization for load/store groups of size 3.
  2014-04-30 15:03             ` Evgeny Stupachenko
@ 2014-05-06 11:27               ` Evgeny Stupachenko
  2014-05-06 11:49                 ` Richard Biener
  0 siblings, 1 reply; 26+ messages in thread
From: Evgeny Stupachenko @ 2014-05-06 11:27 UTC (permalink / raw)
  To: Richard Biener; +Cc: GCC Patches, Jakub Jelinek, Uros Bizjak

The cost model patch has been committed.
I've split the rest of the patch (load/store groups) into two parts:
one for load groups and one for store groups.
Below is the first part, for load groups.

Bootstrap and make check passed on x86.

Is it ok?

ChangeLog:

2014-05-06  Evgeny Stupachenko  <evstupac@gmail.com>

        * tree-vect-data-refs.c (vect_grouped_load_supported): New
        check for loads group of length 3.
        (vect_permute_load_chain): New permutations for loads group of
        length 3.
        * tree-vect-stmts.c (vect_model_load_cost): Count vec_perm
        stmts with ceil_log2 to cover the new permutations.

ChangeLog for testsuite:

2014-05-06  Evgeny Stupachenko  <evstupac@gmail.com>

       PR tree-optimization/52252
       * gcc.dg/vect/pr52252-ld.c: Test on loads group of size 3.
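
For anyone who wants to check the length-3 load permutation outside GCC,
here is a small standalone sketch. It is not part of the patch: it only
mirrors the selector computation from the k loop in vect_permute_load_chain
and applies it to plain int arrays. The names vec_perm, build_low_mask,
build_high_mask, deinterleave3 and MAX_NELT are made up for illustration,
and the code assumes nelt <= 32.

```c
#include <assert.h>

#define MAX_NELT 32

/* Model of VEC_PERM_EXPR <a, b, sel>: selector values below NELT pick from A,
   values in [NELT, 2 * NELT) pick from B.  OUT may alias A or B.  */
void
vec_perm (const int *a, const int *b, const unsigned char *sel,
          unsigned int nelt, int *out)
{
  int tmp[MAX_NELT];
  unsigned int i;

  assert (nelt <= MAX_NELT);
  for (i = 0; i < nelt; i++)
    tmp[i] = sel[i] < nelt ? a[sel[i]] : b[sel[i] - nelt];
  for (i = 0; i < nelt; i++)
    out[i] = tmp[i];
}

/* First selector: gather elements 3 * i + k from the first two vectors.
   Lanes that need the third vector get a dummy 0 for now.  */
void
build_low_mask (unsigned int k, unsigned int nelt, unsigned char *sel)
{
  unsigned int i;

  for (i = 0; i < nelt; i++)
    sel[i] = 3 * i + k < 2 * nelt ? 3 * i + k : 0;
}

/* Second selector: keep the lanes already filled and take the remaining
   ones from the third vector, starting at offset (nelt + k) % 3.  */
void
build_high_mask (unsigned int k, unsigned int nelt, unsigned char *sel)
{
  unsigned int i, j = 0;

  for (i = 0; i < nelt; i++)
    if (3 * i + k < 2 * nelt)
      sel[i] = i;
    else
      sel[i] = nelt + ((nelt + k) % 3) + 3 * (j++);
}

/* IN holds three consecutive vectors of NELT elements each.  OUT[k] receives
   the elements whose index in IN is congruent to k modulo 3.  */
void
deinterleave3 (const int *in, unsigned int nelt, int out[3][MAX_NELT])
{
  unsigned char sel[MAX_NELT];
  int low[MAX_NELT];
  unsigned int k;

  for (k = 0; k < 3; k++)
    {
      build_low_mask (k, nelt, sel);
      vec_perm (in, in + nelt, sel, nelt, low);
      build_high_mask (k, nelt, sel);
      vec_perm (low, in + 2 * nelt, sel, nelt, out[k]);
    }
}
```

With in[i] = i and nelt = 8, out[k][i] should come out as 3 * i + k, which
is the de-interleaved order the vectorized loop needs.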

diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c
index 274cdbd..feafb38 100644
--- a/gcc/tree-vect-data-refs.c
+++ b/gcc/tree-vect-data-refs.c
@@ -4812,36 +4812,74 @@ vect_grouped_load_supported (tree vectype, unsigned HOST_WIDE_INT count)
 {
   enum machine_mode mode = TYPE_MODE (vectype);

-  /* vect_permute_load_chain requires the group size to be a power of two.  */
-  if (exact_log2 (count) == -1)
+  /* vect_permute_load_chain requires the group size to be equal to 3 or
+     be a power of two.  */
+  if (count != 3 && exact_log2 (count) == -1)
     {
       if (dump_enabled_p ())
        dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-                         "the size of the group of accesses"
-                         " is not a power of 2\n");
+                        "the size of the group of accesses"
+                        " is not a power of 2 or not equal to 3\n");
       return false;
     }

   /* Check that the permutation is supported.  */
   if (VECTOR_MODE_P (mode))
     {
-      unsigned int i, nelt = GET_MODE_NUNITS (mode);
+      unsigned int i, j, nelt = GET_MODE_NUNITS (mode);
       unsigned char *sel = XALLOCAVEC (unsigned char, nelt);

-      for (i = 0; i < nelt; i++)
-       sel[i] = i * 2;
-      if (can_vec_perm_p (mode, false, sel))
+      if (exact_log2 (count) != -1)
        {
          for (i = 0; i < nelt; i++)
-           sel[i] = i * 2 + 1;
+           sel[i] = i * 2;
          if (can_vec_perm_p (mode, false, sel))
-           return true;
+           {
+             for (i = 0; i < nelt; i++)
+               sel[i] = i * 2 + 1;
+             if (can_vec_perm_p (mode, false, sel))
+               return true;
+           }
+        }
+      else if (count == 3)
+       {
+         unsigned int k;
+         for (k = 0; k < 3; k++)
+           {
+             for (i = 0; i < nelt; i++)
+               if (3 * i + k < 2 * nelt)
+                 sel[i] = 3 * i + k;
+               else
+                 sel[i] = 0;
+             if (!can_vec_perm_p (mode, false, sel))
+               {
+                 if (dump_enabled_p ())
+                   dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+                                    "shuffle of 3 loads is not supported by"
+                                    " target\n");
+                 return false;
+               }
+             for (i = 0, j = 0; i < nelt; i++)
+               if (3 * i + k < 2 * nelt)
+                 sel[i] = i;
+               else
+                 sel[i] = nelt + ((nelt + k) % 3) + 3 * (j++);
+             if (!can_vec_perm_p (mode, false, sel))
+               {
+                 if (dump_enabled_p ())
+                   dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+                                    "shuffle of 3 loads is not supported by"
+                                    " target\n");
+                 return false;
+               }
+           }
+         return true;
        }
     }

   if (dump_enabled_p ())
     dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-                     "extract even/odd not supported by target\n");
+                    "extract even/odd not supported by target\n");
   return false;
 }

@@ -4859,8 +4897,9 @@ vect_load_lanes_supported (tree vectype, unsigned HOST_WIDE_INT count)
 /* Function vect_permute_load_chain.

    Given a chain of interleaved loads in DR_CHAIN of LENGTH that must be
-   a power of 2, generate extract_even/odd stmts to reorder the input data
-   correctly.  Return the final references for loads in RESULT_CHAIN.
+   a power of 2 or equal to 3, generate extract_even/odd stmts to reorder
+   the input data correctly.  Return the final references for loads in
+   RESULT_CHAIN.

    E.g., LENGTH is 4 and the scalar type is short, i.e., VF is 8.
    The input is 4 vectors each containing 8 elements. We assign a number to each
@@ -4941,6 +4980,7 @@ vect_permute_load_chain (vec<tree> dr_chain,
 {
   tree data_ref, first_vect, second_vect;
   tree perm_mask_even, perm_mask_odd;
+  tree perm3_mask_low, perm3_mask_high;
   gimple perm_stmt;
   tree vectype = STMT_VINFO_VECTYPE (vinfo_for_stmt (stmt));
   unsigned int i, j, log_length = exact_log2 (length);
@@ -4951,45 +4991,99 @@ vect_permute_load_chain (vec<tree> dr_chain,
   memcpy (result_chain->address (), dr_chain.address (),
          length * sizeof (tree));

-  for (i = 0; i < nelt; ++i)
-    sel[i] = i * 2;
-  perm_mask_even = vect_gen_perm_mask (vectype, sel);
-  gcc_assert (perm_mask_even != NULL);
+  if (log_length != (unsigned int)-1)
+    {
+      for (i = 0; i < nelt; ++i)
+       sel[i] = i * 2;
+      perm_mask_even = vect_gen_perm_mask (vectype, sel);
+      gcc_assert (perm_mask_even != NULL);

-  for (i = 0; i < nelt; ++i)
-    sel[i] = i * 2 + 1;
-  perm_mask_odd = vect_gen_perm_mask (vectype, sel);
-  gcc_assert (perm_mask_odd != NULL);
+      for (i = 0; i < nelt; ++i)
+       sel[i] = i * 2 + 1;
+      perm_mask_odd = vect_gen_perm_mask (vectype, sel);
+      gcc_assert (perm_mask_odd != NULL);

-  for (i = 0; i < log_length; i++)
-    {
-      for (j = 0; j < length; j += 2)
+      for (i = 0; i < log_length; i++)
        {
-         first_vect = dr_chain[j];
-         second_vect = dr_chain[j+1];
+         for (j = 0; j < length; j += 2)
+           {
+             first_vect = dr_chain[j];
+             second_vect = dr_chain[j+1];
+
+             /* data_ref = permute_even (first_data_ref, second_data_ref);  */
+             data_ref = make_temp_ssa_name (vectype, NULL, "vect_perm_even");
+             perm_stmt = gimple_build_assign_with_ops (VEC_PERM_EXPR, data_ref,
+                                                       first_vect, second_vect,
+                                                       perm_mask_even);
+             vect_finish_stmt_generation (stmt, perm_stmt, gsi);
+             (*result_chain)[j/2] = data_ref;
+
+             /* data_ref = permute_odd (first_data_ref, second_data_ref);  */
+             data_ref = make_temp_ssa_name (vectype, NULL, "vect_perm_odd");
+             perm_stmt = gimple_build_assign_with_ops (VEC_PERM_EXPR, data_ref,
+                                                       first_vect, second_vect,
+                                                       perm_mask_odd);
+             vect_finish_stmt_generation (stmt, perm_stmt, gsi);
+             (*result_chain)[j/2+length/2] = data_ref;
+           }
+         memcpy (dr_chain.address (), result_chain->address (),
+                 length * sizeof (tree));
+       }
+    }
+  /* length is not a power of 2.  */
+  else
+    {
+      unsigned int k;

-         /* data_ref = permute_even (first_data_ref, second_data_ref);  */
-         data_ref = make_temp_ssa_name (vectype, NULL, "vect_perm_even");
+      /* Currently only length 3 is supported, as the most frequent case.  */
+      gcc_assert (length == 3);
+
+      for (k = 0; k < 3; k++)
+       {
+         for (i = 0; i < nelt; i++)
+           if (3 * i + k < 2 * nelt)
+             sel[i] = 3 * i + k;
+           else
+             sel[i] = 0;
+         perm3_mask_low = vect_gen_perm_mask (vectype, sel);
+         gcc_assert (perm3_mask_low != NULL);
+
+         for (i = 0, j = 0; i < nelt; i++)
+           if (3 * i + k < 2 * nelt)
+             sel[i] = i;
+           else
+             sel[i] = nelt + ((nelt + k) % 3) + 3 * (j++);
+
+         perm3_mask_high = vect_gen_perm_mask (vectype, sel);
+         gcc_assert (perm3_mask_high != NULL);
+
+         first_vect = dr_chain[0];
+         second_vect = dr_chain[1];
+
+         /* Create interleaving stmt (low part of):
+            low = VEC_PERM_EXPR <first_vect, second_vect, {k, 3 + k, 6 + k,
+                                                           ...}>  */
+         data_ref = make_temp_ssa_name (vectype, NULL, "vect_shuffle3_low");
          perm_stmt = gimple_build_assign_with_ops (VEC_PERM_EXPR, data_ref,
                                                    first_vect, second_vect,
-                                                   perm_mask_even);
+                                                   perm3_mask_low);
          vect_finish_stmt_generation (stmt, perm_stmt, gsi);
-         (*result_chain)[j/2] = data_ref;

-         /* data_ref = permute_odd (first_data_ref, second_data_ref);  */
-         data_ref = make_temp_ssa_name (vectype, NULL, "vect_perm_odd");
+         /* Create interleaving stmt (high part of):
+            high = VEC_PERM_EXPR <low, dr_chain[2],
+                                  {0, 1, ..., nelt + (nelt + k) % 3, ...}>  */
+         first_vect = data_ref;
+         second_vect = dr_chain[2];
+         data_ref = make_temp_ssa_name (vectype, NULL, "vect_shuffle3_high");
          perm_stmt = gimple_build_assign_with_ops (VEC_PERM_EXPR, data_ref,
                                                    first_vect, second_vect,
-                                                   perm_mask_odd);
+                                                   perm3_mask_high);
          vect_finish_stmt_generation (stmt, perm_stmt, gsi);
-         (*result_chain)[j/2+length/2] = data_ref;
+         (*result_chain)[k] = data_ref;
        }
-      memcpy (dr_chain.address (), result_chain->address (),
-             length * sizeof (tree));
     }
 }

-
 /* Function vect_transform_grouped_load.

    Given a chain of input interleaved data-refs (in DR_CHAIN), build statements
diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
index 1a51d6d..b87c143 100644
--- a/gcc/tree-vect-stmts.c
+++ b/gcc/tree-vect-stmts.c
@@ -1091,10 +1091,11 @@ vect_model_load_cost (stmt_vec_info stmt_info, int ncopies,
      include the cost of the permutes.  */
   if (!load_lanes_p && group_size > 1)
     {
-      /* Uses an even and odd extract operations for each needed permute.  */
-      int nstmts = ncopies * exact_log2 (group_size) * group_size;
-      inside_cost += record_stmt_cost (body_cost_vec, nstmts, vec_perm,
-                                      stmt_info, 0, vect_body);
+      /* Uses an even and odd extract operations or shuffle operations
+        for each needed permute.  */
+      int nstmts = ncopies * ceil_log2 (group_size) * group_size;
+      inside_cost = record_stmt_cost (body_cost_vec, nstmts, vec_perm,
+                                     stmt_info, 0, vect_body);

       if (dump_enabled_p ())
         dump_printf_loc (MSG_NOTE, vect_location,
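
A note on why the cost hunk above switches from exact_log2 to ceil_log2:
exact_log2 (3) is -1, so the old expression would give a negative nstmts
for a group of 3. The sketch below is standalone illustration code, not
GCC source. my_exact_log2 and my_ceil_log2 are simplified stand-ins for
the GCC helpers, and perm_stmts is a hypothetical name for the nstmts
expression.

```c
#include <assert.h>

/* Stand-in for exact_log2: log2 of X if X is a power of 2, otherwise -1.  */
int
my_exact_log2 (unsigned long x)
{
  int n = 0;

  if (x == 0 || (x & (x - 1)) != 0)
    return -1;
  while (x > 1)
    {
      x >>= 1;
      n++;
    }
  return n;
}

/* Stand-in for ceil_log2: smallest N with 2**N >= X, for X >= 1.  */
int
my_ceil_log2 (unsigned long x)
{
  int n = 0;

  assert (x >= 1);
  while ((1UL << n) < x)
    n++;
  return n;
}

/* Same shape as the nstmts expression in the hunk above.  */
int
perm_stmts (int ncopies, unsigned long group_size)
{
  return ncopies * my_ceil_log2 (group_size) * (int) group_size;
}
```

For group_size 3 and ncopies 1 this gives 1 * 2 * 3 = 6, which matches the
two VEC_PERM_EXPRs emitted for each of the three result vectors. For powers
of two both helpers agree, so the existing costs do not change.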


diff --git a/gcc/testsuite/gcc.dg/vect/pr52252-ld.c b/gcc/testsuite/gcc.dg/vect/pr52252-ld.c
new file mode 100644
index 0000000..6e3cb52
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr52252-ld.c
@@ -0,0 +1,30 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -g -ftree-vectorize -mssse3 -fdump-tree-vect-details" { target { i?86-*-* x86_64-*-* } } } */
+
+#define byte unsigned char
+
+void
+matrix_mul (byte *in, byte *out, int size)
+{
+  int i;
+  for (i = 0; i < size; i++)
+    {
+      byte in0 = in[0];
+      byte in1 = in[1];
+      byte in2 = in[2];
+      byte out0, out1, out2, out3;
+      out0 = in0 + in1;
+      out1 = in0 + in2;
+      out2 = in1 + in2;
+      out3 = in0 + in1 + in2;
+      out[0] = out0;
+      out[1] = out1;
+      out[2] = out2;
+      out[3] = out3;
+      in += 3;
+      out += 4;
+    }
+}
+
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
+/* { dg-final { cleanup-tree-dump "vect" } } */


On Wed, Apr 30, 2014 at 6:31 PM, Evgeny Stupachenko <evstupac@gmail.com> wrote:
> Ping.
>
> On Fri, Apr 18, 2014 at 2:05 PM, Evgeny Stupachenko <evstupac@gmail.com> wrote:
>> Hi,
>>
>> Merged with current master the patch passes bootstrap and is giving
>> expected gains.
>> Patch and new tests are attached.
>>
>> ChangeLog:
>>
>> 2014-04-18  Evgeny Stupachenko  <evstupac@gmail.com>
>>
>>         * tree-vect-data-refs.c (vect_grouped_store_supported): New
>>         check for stores group of length 3.
>>         (vect_permute_store_chain): New permutations for stores group of
>>         length 3.
>>         (vect_grouped_load_supported): New check for loads group of length 3.
>>         (vect_permute_load_chain): New permutations for loads group of length 3.
>>         * tree-vect-stmts.c (vect_model_store_cost): Change cost
>>         of vec_perm_shuffle for the new permutations.
>>         (vect_model_load_cost): Ditto.
>>
>> ChangeLog for testsuite:
>>
>> 2014-04-18  Evgeny Stupachenko  <evstupac@gmail.com>
>>
>>        PR tree-optimization/52252
>>        * gcc.dg/vect/pr52252-ld.c: Test on loads group of size 3.
>>        * gcc.dg/vect/pr52252-st.c: Test on stores group of size 3.
>>
>> Evgeny
>>
>> On Thu, Mar 6, 2014 at 6:44 PM, Evgeny Stupachenko <evstupac@gmail.com> wrote:
>>> Missed attachment.
>>>
>>> On Thu, Mar 6, 2014 at 6:42 PM, Evgeny Stupachenko <evstupac@gmail.com> wrote:
>>>> I've separated the patch into 2: cost model tuning and load/store
>>>> groups parallelism.
>>>> SLM tuning was partially introduced in the patch:
>>>> http://gcc.gnu.org/ml/gcc-patches/2014-03/msg00226.html
>>>> The patch introducing vectorization for load/store groups of size 3 attached.
>>>>
>>>> Is it ok for stage1?
>>>>
>>>> ChangeLog:
>>>>
>>>> 2014-03-06  Evgeny Stupachenko  <evstupac@gmail.com>
>>>>
>>>>        * tree-vect-data-refs.c (vect_grouped_store_supported): New
>>>>        check for stores group of length 3.
>>>>        (vect_permute_store_chain): New permutations for stores group of
>>>>        length 3.
>>>>        (vect_grouped_load_supported): New check for loads group of length 3.
>>>>        (vect_permute_load_chain): New permutations for loads group of length 3.
>>>>        * tree-vect-stmts.c (vect_model_store_cost): Change cost
>>>>        of vec_perm_shuffle for the new permutations.
>>>>        (vect_model_load_cost): Ditto.
>>>>
>>>>
>>>>
>>>> On Tue, Feb 11, 2014 at 7:19 PM, Richard Biener <rguenther@suse.de> wrote:
>>>>> On Tue, 11 Feb 2014, Evgeny Stupachenko wrote:
>>>>>
>>>>>> Missed patch attached in plain-text.
>>>>>>
>>>>>> I have copyright assignment on file with the FSF covering work on GCC.
>>>>>>
>>>>>> Load/store groups of length 3 are the most frequent non-power-of-2
>>>>>> case. They are used in RGB image processing (like the test case in
>>>>>> PR52252). We can certainly extend the patch to length 5 and more.
>>>>>> However, this potentially affects performance on some other
>>>>>> architectures and requires more extensive testing. So length 3 is
>>>>>> just the first step. The algorithm in the patch could be modified
>>>>>> for the general case in several steps.
>>>>>>
>>>>>> I understand that the patch should wait for stage 1; however, since
>>>>>> it is ready we can discuss it right now and make some changes (like
>>>>>> supporting a general group size).
>>>>>
>>>>> Other than that I'd like to see a vectorizer hook querying the cost of a
>>>>> vec_perm_const expansion instead of adding vec_perm_shuffle
>>>>> (thus requires the constant shuffle mask to be passed as well
>>>>> as the vector type).  That's more useful for other uses that
>>>>> would require (arbitrary) shuffles.
>>>>>
>>>>> Didn't look at the rest of the patch yet - queued in my review
>>>>> pipeline.
>>>>>
>>>>> Thanks,
>>>>> Richard.
>>>>>
>>>>>> Thanks,
>>>>>> Evgeny
>>>>>>
>>>>>> On Tue, Feb 11, 2014 at 5:00 PM, Richard Biener <rguenther@suse.de> wrote:
>>>>>> >
>>>>>> > On Tue, 11 Feb 2014, Evgeny Stupachenko wrote:
>>>>>> >
>>>>>> > > Hi,
>>>>>> > >
>>>>>> > > The patch gives an expected 3 times gain for the test case in the PR52252
>>>>>> > > (and even 6 times for AVX2).
>>>>>> > > It passes make check and bootstrap on x86.
>>>>>> > > spec2000/spec2006 got no regressions/gains on x86.
>>>>>> > >
>>>>>> > > Is this patch ok?
>>>>>> >
>>>>>> > I've worked on generalizing the permutation support in the light
>>>>>> > of the availability of the generic shuffle support in the IL
>>>>>> > but hit some road-blocks in the way code-generation works for
>>>>>> > group loads with permutations (I don't remember if I posted all patches).
>>>>>> >
>>>>>> > This patch seems to be to a slightly different place but it again
>>>>>> > special-cases a specific permutation.  Why's that?  Why can't we
>>>>>> > support groups of size 7 for example?  So - can this be generalized
>>>>>> > to support arbitrary non-power-of-two load/store groups?
>>>>>> >
>>>>>> > Other than that the patch has to wait for stage1 to open again,
>>>>>> > of course.  And it misses a testcase.
>>>>>> >
>>>>>> > Btw, do you have a copyright assignment on file with the FSF covering
>>>>>> > work on GCC?
>>>>>> >
>>>>>> > Thanks,
>>>>>> > Richard.
>>>>>> >
>>>>>> > > ChangeLog:
>>>>>> > >
>>>>>> > > 2014-02-11  Evgeny Stupachenko  <evstupac@gmail.com>
>>>>>> > >
>>>>>> > >         * target.h (vect_cost_for_stmt): Defining new cost vec_perm_shuffle.
>>>>>> > >         * tree-vect-data-refs.c (vect_grouped_store_supported): New
>>>>>> > >         check for stores group of length 3.
>>>>>> > >         (vect_permute_store_chain): New permutations for stores group of
>>>>>> > >         length 3.
>>>>>> > >         (vect_grouped_load_supported): New check for loads group of length
>>>>>> > > 3.
>>>>>> > >         (vect_permute_load_chain): New permutations for loads group of
>>>>>> > > length 3.
>>>>>> > >         * tree-vect-stmts.c (vect_model_store_cost): New cost
>>>>>> > > vec_perm_shuffle
>>>>>> > >         for the new permutations.
>>>>>> > >         (vect_model_load_cost): Ditto.
>>>>>> > >         * config/aarch64/aarch64.c (builtin_vectorization_cost): Adding
>>>>>> > >         vec_perm_shuffle cost as equivalent to vec_perm cost.
>>>>>> > >         * config/arm/arm.c: Ditto.
>>>>>> > >         * config/rs6000/rs6000.c: Ditto.
>>>>>> > >         * config/spu/spu.c: Ditto.
>>>>>> > >         * config/i386/x86-tune.def (TARGET_SLOW_PHUFFB): Target for slow
>>>>>> > > byte
>>>>>> > >         shuffle on some x86 architectures.
>>>>>> > >         * config/i386/i386.h (processor_costs): Defining pshuffb cost.
>>>>>> > >         * config/i386/i386.c (processor_costs): Adding pshuffb cost.
>>>>>> > >         (ix86_builtin_vectorization_cost): Adding cost for the new
>>>>>> > > permutations.
>>>>>> > >         Fixing cost for other permutations.
>>>>>> > >         (expand_vec_perm_even_odd_1): Avoid byte shuffles when they are
>>>>>> > >         slow (TARGET_SLOW_PHUFFB).
>>>>>> > >         (ix86_add_stmt_cost): Adding cost when STMT is WIDEN_MULTIPLY.
>>>>>> > >         Adding new shuffle cost only when byte shuffle is expected.
>>>>>> > >         Fixing cost model for Silvermont.
>>>>>> > >
>>>>>> > > Thanks,
>>>>>> > > Evgeny
>>>>>> > >
>>>>>> >
>>>>>> > --
>>>>>> > Richard Biener <rguenther@suse.de>
>>>>>> > SUSE / SUSE Labs
>>>>>> > SUSE LINUX Products GmbH - Nuernberg - AG Nuernberg - HRB 16746
>>>>>> > GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer
>>>>>>
>>>>>
>>>>> --
>>>>> Richard Biener <rguenther@suse.de>
>>>>> SUSE / SUSE Labs
>>>>> SUSE LINUX Products GmbH - Nuernberg - AG Nuernberg - HRB 16746
>>>>> GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH, PR52252] Vectorization for load/store groups of size 3.
  2014-05-06 11:27               ` Evgeny Stupachenko
@ 2014-05-06 11:49                 ` Richard Biener
  2014-05-06 14:38                   ` Evgeny Stupachenko
  0 siblings, 1 reply; 26+ messages in thread
From: Richard Biener @ 2014-05-06 11:49 UTC (permalink / raw)
  To: Evgeny Stupachenko; +Cc: GCC Patches, Jakub Jelinek, Uros Bizjak

On Tue, 6 May 2014, Evgeny Stupachenko wrote:

> The cost model patch was successfully committed.
> I've split the rest of the patch on load/store groups into
> 2: one for load groups and one for store groups.
> Below is the first part, on load groups.
> 
> Bootstrap and make check passed on x86.
> 
> Is it ok?
> 
> ChangeLog:
> 
> 2014-05-06  Evgeny Stupachenko  <evstupac@gmail.com>
> 
>         * tree-vect-data-refs.c (vect_grouped_load_supported): New
>         check for loads group of length 3.
>         (vect_permute_load_chain): New permutations for loads group of
>         length 3.
>         * tree-vect-stmts.c (vect_model_load_cost): Change cost
>         of vec_perm_shuffle for the new permutations.
> 
> ChangeLog for testsuite:
> 
> 2014-05-06  Evgeny Stupachenko  <evstupac@gmail.com>
> 
>        PR tree-optimization/52252
>        * gcc.dg/vect/pr52252-ld.c: Test on loads group of size 3.
> 
> diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c
> index 274cdbd..feafb38 100644
> --- a/gcc/tree-vect-data-refs.c
> +++ b/gcc/tree-vect-data-refs.c
> @@ -4812,36 +4812,74 @@ vect_grouped_load_supported (tree vectype, unsigned HOST_WIDE_INT count)
>  {
>    enum machine_mode mode = TYPE_MODE (vectype);
> 
> -  /* vect_permute_load_chain requires the group size to be a power of two.  */
> -  if (exact_log2 (count) == -1)
> +  /* vect_permute_load_chain requires the group size to be equal to 3 or
> +     be a power of two.  */
> +  if (count != 3 && exact_log2 (count) == -1)
>      {
>        if (dump_enabled_p ())
>         dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> -                         "the size of the group of accesses"
> -                         " is not a power of 2\n");
> +                        "the size of the group of accesses"
> +                        " is not a power of 2 or not eqaul to 3\n");

equal

>        return false;
>      }
> 
>    /* Check that the permutation is supported.  */
>    if (VECTOR_MODE_P (mode))
>      {
> -      unsigned int i, nelt = GET_MODE_NUNITS (mode);
> +      unsigned int i, j, nelt = GET_MODE_NUNITS (mode);
>        unsigned char *sel = XALLOCAVEC (unsigned char, nelt);
> 
> -      for (i = 0; i < nelt; i++)
> -       sel[i] = i * 2;
> -      if (can_vec_perm_p (mode, false, sel))
> +      if (exact_log2 (count) != -1)
>         {
>           for (i = 0; i < nelt; i++)
> -           sel[i] = i * 2 + 1;
> +           sel[i] = i * 2;
>           if (can_vec_perm_p (mode, false, sel))
> -           return true;
> +           {
> +             for (i = 0; i < nelt; i++)
> +               sel[i] = i * 2 + 1;
> +             if (can_vec_perm_p (mode, false, sel))
> +               return true;
> +           }
> +        }
> +      else if (count == 3)

Please structure this if as having special cases first and then an
else with gcc_assert (exact_log2 (count) != -1).

> +       {
> +         unsigned int k;
> +         for (k = 0; k < 3; k++)
> +           {
> +             for (i = 0; i < nelt; i++)
> +               if (3 * i + k < 2 * nelt)
> +                 sel[i] = 3 * i + k;
> +               else
> +                 sel[i] = 0;
> +             if (!can_vec_perm_p (mode, false, sel))
> +               {
> +                 if (dump_enabled_p ())
> +                   dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +                                    "shuffle of 3 loads is not supported by \
> +                                    target\n");

Don't use multi-line strings but do

			"shuffle of ..."
			"target\n");

instead.
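
To illustrate why: with a backslash continuation inside the literal, the
indentation of the continuation line ends up in the dump message, while
adjacent literals are concatenated by the compiler without it.  A minimal
standalone sketch (continued_msg, concatenated_msg and extra_bytes are
made-up names for illustration, not code from the patch):

```c
#include <string.h>

/* Backslash-newline inside a string literal: the leading whitespace of
   the continuation line becomes part of the string.  */
static const char *continued_msg = "shuffle of 3 loads is not supported by \
        target\n";

/* Adjacent string literals are concatenated by the compiler, so the
   source indentation never leaks into the message.  */
static const char *concatenated_msg = "shuffle of 3 loads is not supported by "
                                      "target\n";

/* Number of extra bytes carried by the continued form.  */
static size_t
extra_bytes (void)
{
  return strlen (continued_msg) - strlen (concatenated_msg);
}
```

extra_bytes () is non-zero whenever the continuation line is indented,
which is exactly the stray whitespace that would show up in the dump file.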

> +                   return false;
> +               }
> +             for (i = 0, j = 0; i < nelt; i++)
> +               if (3 * i + k < 2 * nelt)
> +                 sel[i] = i;
> +               else
> +                 sel[i] = nelt + ((nelt + k) % 3) + 3 * (j++);
> +             if (!can_vec_perm_p (mode, false, sel))
> +               {
> +                 if (dump_enabled_p ())
> +                   dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +                                    "shuffle of 3 loads is not supported by \
> +                                    target\n");

Likewise.
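
For reference, the two masks built per k can be checked outside the
compiler.  The sketch below models VEC_PERM_EXPR on plain int arrays and
verifies that the low/high shuffle pair gathers elements k, 3 + k, 6 + k, ...
of three consecutive vectors.  vec_perm, check_lane and MAX_NELT are
hypothetical names used only for this illustration; the mask formulas are
copied from the patch, and the sketch assumes nelt <= MAX_NELT.

```c
#include <stdbool.h>

#define MAX_NELT 64

/* Model of VEC_PERM_EXPR <a, b, sel>: element i of the result is element
   sel[i] of the 2 * nelt element concatenation of A and B.  */
static void
vec_perm (const int *a, const int *b, const unsigned char *sel,
          unsigned nelt, int *out)
{
  for (unsigned i = 0; i < nelt; i++)
    out[i] = sel[i] < nelt ? a[sel[i]] : b[sel[i] - nelt];
}

/* Return true if the two shuffles for lane K gather elements
   k, 3 + k, 6 + k, ... of three consecutive NELT-element vectors.  */
static bool
check_lane (unsigned nelt, unsigned k)
{
  int v0[MAX_NELT], v1[MAX_NELT], v2[MAX_NELT];
  int low[MAX_NELT], high[MAX_NELT];
  unsigned char sel[MAX_NELT];
  unsigned i, j;

  /* Number the 3 * nelt interleaved input elements 0, 1, 2, ...  */
  for (i = 0; i < nelt; i++)
    {
      v0[i] = i;
      v1[i] = nelt + i;
      v2[i] = 2 * nelt + i;
    }

  /* First mask: take what is available in v0/v1; the rest is filler.  */
  for (i = 0; i < nelt; i++)
    sel[i] = 3 * i + k < 2 * nelt ? 3 * i + k : 0;
  vec_perm (v0, v1, sel, nelt, low);

  /* Second mask: keep the lanes already filled, pull the rest from v2.  */
  for (i = 0, j = 0; i < nelt; i++)
    sel[i] = 3 * i + k < 2 * nelt ? i : nelt + ((nelt + k) % 3) + 3 * (j++);
  vec_perm (low, v2, sel, nelt, high);

  for (i = 0; i < nelt; i++)
    if (high[i] != (int) (3 * i + k))
      return false;
  return true;
}
```

The (nelt + k) % 3 term is the offset of the first element in the third
vector that belongs to lane k, since 2 * nelt + r must be congruent to k
modulo 3.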

> +                 return false;
> +               }
> +           }
> +         return true;
>         }
>      }
> 
>    if (dump_enabled_p ())
>      dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> -                     "extract even/odd not supported by target\n");
> +                    "extract even/odd not supported by target\n");
>    return false;
>  }
> 
> @@ -4859,8 +4897,9 @@ vect_load_lanes_supported (tree vectype, unsigned HOST_WIDE_INT count)
>  /* Function vect_permute_load_chain.
> 
>     Given a chain of interleaved loads in DR_CHAIN of LENGTH that must be
> -   a power of 2, generate extract_even/odd stmts to reorder the input data
> -   correctly.  Return the final references for loads in RESULT_CHAIN.
> +   a power of 2 or equal to 3, generate extract_even/odd stmts to reorder
> +   the input data correctly.  Return the final references for loads in
> +   RESULT_CHAIN.
> 
>     E.g., LENGTH is 4 and the scalar type is short, i.e., VF is 8.
>     The input is 4 vectors each containing 8 elements. We assign a number to each
> @@ -4941,6 +4980,7 @@ vect_permute_load_chain (vec<tree> dr_chain,
>  {
>    tree data_ref, first_vect, second_vect;
>    tree perm_mask_even, perm_mask_odd;
> +  tree perm3_mask_low, perm3_mask_high;
>    gimple perm_stmt;
>    tree vectype = STMT_VINFO_VECTYPE (vinfo_for_stmt (stmt));
>    unsigned int i, j, log_length = exact_log2 (length);
> @@ -4951,45 +4991,99 @@ vect_permute_load_chain (vec<tree> dr_chain,
>    memcpy (result_chain->address (), dr_chain.address (),
>           length * sizeof (tree));
> 
> -  for (i = 0; i < nelt; ++i)
> -    sel[i] = i * 2;
> -  perm_mask_even = vect_gen_perm_mask (vectype, sel);
> -  gcc_assert (perm_mask_even != NULL);
> +  if (log_length != (unsigned int)-1)

Same for the if-structure - first handle all special values
and then in the else handle power-of-two cases.

Ok with those changes.

Thanks,
Richard.

> +    {
> +      for (i = 0; i < nelt; ++i)
> +       sel[i] = i * 2;
> +      perm_mask_even = vect_gen_perm_mask (vectype, sel);
> +      gcc_assert (perm_mask_even != NULL);
> 
> -  for (i = 0; i < nelt; ++i)
> -    sel[i] = i * 2 + 1;
> -  perm_mask_odd = vect_gen_perm_mask (vectype, sel);
> -  gcc_assert (perm_mask_odd != NULL);
> +      for (i = 0; i < nelt; ++i)
> +       sel[i] = i * 2 + 1;
> +      perm_mask_odd = vect_gen_perm_mask (vectype, sel);
> +      gcc_assert (perm_mask_odd != NULL);
> 
> -  for (i = 0; i < log_length; i++)
> -    {
> -      for (j = 0; j < length; j += 2)
> +      for (i = 0; i < log_length; i++)
>         {
> -         first_vect = dr_chain[j];
> -         second_vect = dr_chain[j+1];
> +         for (j = 0; j < length; j += 2)
> +           {
> +             first_vect = dr_chain[j];
> +             second_vect = dr_chain[j+1];
> +
> +             /* data_ref = permute_even (first_data_ref, second_data_ref);  */
> +             data_ref = make_temp_ssa_name (vectype, NULL, "vect_perm_even");
> +             perm_stmt = gimple_build_assign_with_ops (VEC_PERM_EXPR, data_ref,
> +                                                       first_vect, second_vect,
> +                                                       perm_mask_even);
> +             vect_finish_stmt_generation (stmt, perm_stmt, gsi);
> +             (*result_chain)[j/2] = data_ref;
> +
> +             /* data_ref = permute_odd (first_data_ref, second_data_ref);  */
> +             data_ref = make_temp_ssa_name (vectype, NULL, "vect_perm_odd");
> +             perm_stmt = gimple_build_assign_with_ops (VEC_PERM_EXPR, data_ref,
> +                                                       first_vect, second_vect,
> +                                                       perm_mask_odd);
> +             vect_finish_stmt_generation (stmt, perm_stmt, gsi);
> +             (*result_chain)[j/2+length/2] = data_ref;
> +           }
> +         memcpy (dr_chain.address (), result_chain->address (),
> +                 length * sizeof (tree));
> +       }
> +    }
> +  /* length is not a power of 2.  */
> +  else
> +    {
> +      unsigned int k;
> 
> -         /* data_ref = permute_even (first_data_ref, second_data_ref);  */
> -         data_ref = make_temp_ssa_name (vectype, NULL, "vect_perm_even");
> +      /* currently only length 3 is supported as most frequent case.  */
> +      gcc_assert (length == 3);
> +
> +      for (k = 0; k < 3; k++)
> +       {
> +         for (i = 0; i < nelt; i++)
> +           if (3 * i + k < 2 * nelt)
> +             sel[i] = 3 * i + k;
> +           else
> +             sel[i] = 0;
> +         perm3_mask_low = vect_gen_perm_mask (vectype, sel);
> +         gcc_assert (perm3_mask_low != NULL);
> +
> +         for (i = 0, j = 0; i < nelt; i++)
> +           if (3 * i + k < 2 * nelt)
> +             sel[i] = i;
> +           else
> +             sel[i] = nelt + ((nelt + k) % 3) + 3 * (j++);
> +
> +         perm3_mask_high = vect_gen_perm_mask (vectype, sel);
> +         gcc_assert (perm3_mask_high != NULL);
> +
> +         first_vect = dr_chain[0];
> +         second_vect = dr_chain[1];
> +
> +         /* Create interleaving stmt (low part of):
> +            low = VEC_PERM_EXPR <first_vect, second_vect2, {k, 3 + k, 6 + k,
> +                                                            ...}>  */
> +         data_ref = make_temp_ssa_name (vectype, NULL, "vect_suffle3_low");
>           perm_stmt = gimple_build_assign_with_ops (VEC_PERM_EXPR, data_ref,
>                                                     first_vect, second_vect,
> -                                                   perm_mask_even);
> +                                                   perm3_mask_low);
>           vect_finish_stmt_generation (stmt, perm_stmt, gsi);
> -         (*result_chain)[j/2] = data_ref;
> 
> -         /* data_ref = permute_odd (first_data_ref, second_data_ref);  */
> -         data_ref = make_temp_ssa_name (vectype, NULL, "vect_perm_odd");
> +         /* Create interleaving stmt (high part of):
> +            high = VEC_PERM_EXPR <first_vect, second_vect2, {k, 3 + k, 6 + k,
> +                                                             ...}>  */
> +         first_vect = data_ref;
> +         second_vect = dr_chain[2];
> +         data_ref = make_temp_ssa_name (vectype, NULL, "vect_suffle3_high");
>           perm_stmt = gimple_build_assign_with_ops (VEC_PERM_EXPR, data_ref,
>                                                     first_vect, second_vect,
> -                                                   perm_mask_odd);
> +                                                   perm3_mask_high);
>           vect_finish_stmt_generation (stmt, perm_stmt, gsi);
> -         (*result_chain)[j/2+length/2] = data_ref;
> +         (*result_chain)[k] = data_ref;
>         }
> -      memcpy (dr_chain.address (), result_chain->address (),
> -             length * sizeof (tree));
>      }
>  }
> 
> -
>  /* Function vect_transform_grouped_load.
> 
>     Given a chain of input interleaved data-refs (in DR_CHAIN), build statements
> diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
> index 1a51d6d..b87c143 100644
> --- a/gcc/tree-vect-stmts.c
> +++ b/gcc/tree-vect-stmts.c
> @@ -1091,10 +1091,11 @@ vect_model_load_cost (stmt_vec_info stmt_info, int ncopies,
>       include the cost of the permutes.  */
>    if (!load_lanes_p && group_size > 1)
>      {
> -      /* Uses an even and odd extract operations for each needed permute.  */
> -      int nstmts = ncopies * exact_log2 (group_size) * group_size;
> -      inside_cost += record_stmt_cost (body_cost_vec, nstmts, vec_perm,
> -                                      stmt_info, 0, vect_body);
> +      /* Uses an even and odd extract operations or shuffle operations
> +        for each needed permute.  */
> +      int nstmts = ncopies * ceil_log2 (group_size) * group_size;
> +      inside_cost = record_stmt_cost (body_cost_vec, nstmts, vec_perm,
> +                                     stmt_info, 0, vect_body);
> 
>        if (dump_enabled_p ())
>          dump_printf_loc (MSG_NOTE, vect_location,
> 
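
As a sanity check on the statement count in the hunk above: for a group of
3 the permute code emits two shuffles per output vector, i.e. 6 per copy,
and ceil_log2 (3) * 3 gives the same number.  A small sketch, assuming
ceil_log2 returns the smallest n with 2**n >= x; ceil_log2_sketch and
permute_stmts are hypothetical stand-ins, not GCC functions:

```c
/* Hypothetical stand-in for GCC's ceil_log2: smallest N with 2**N >= X.  */
static int
ceil_log2_sketch (unsigned long x)
{
  int n = 0;
  while ((1UL << n) < x)
    n++;
  return n;
}

/* Number of permute statements charged for a grouped load, following
   the formula in the patch.  */
static int
permute_stmts (int ncopies, unsigned long group_size)
{
  return ncopies * ceil_log2_sketch (group_size) * (int) group_size;
}
```

For power-of-two sizes ceil_log2 and exact_log2 agree, so the existing
cost is unchanged there (e.g. 2 * 4 = 8 statements for a group of 4).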
> 
> diff --git a/gcc/testsuite/gcc.dg/vect/pr52252-ld.c b/gcc/testsuite/gcc.dg/vect/pr52252-ld.c
> new file mode 100644
> index 0000000..6e3cb52
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/pr52252-ld.c
> @@ -0,0 +1,30 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -g -ftree-vectorize -mssse3 -fdump-tree-vect-details" { target { i?86-*-* x86_64-*-* } } } */
> +
> +#define byte unsigned char
> +
> +void
> +matrix_mul (byte *in, byte *out, int size)
> +{
> +  int i;
> +  for (i = 0; i < size; i++)
> +    {
> +      byte in0 = in[0];
> +      byte in1 = in[1];
> +      byte in2 = in[2];
> +      byte out0, out1, out2, out3;
> +      out0 = in0 + in1;
> +      out1 = in0 + in2;
> +      out2 = in1 + in2;
> +      out3 = in0 + in1 + in2;
> +      out[0] = out0;
> +      out[1] = out1;
> +      out[2] = out2;
> +      out[3] = out3;
> +      in += 3;
> +      out += 4;
> +    }
> +}
> +
> +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
> +/* { dg-final { cleanup-tree-dump "vect" } } */
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE / SUSE Labs
SUSE LINUX Products GmbH - Nuernberg - AG Nuernberg - HRB 16746
GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH, PR52252] Vectorization for load/store groups of size 3.
  2014-05-06 11:49                 ` Richard Biener
@ 2014-05-06 14:38                   ` Evgeny Stupachenko
  2014-05-06 14:40                     ` Richard Biener
  2014-05-12 15:14                     ` Rainer Orth
  0 siblings, 2 replies; 26+ messages in thread
From: Evgeny Stupachenko @ 2014-05-06 14:38 UTC (permalink / raw)
  To: Richard Biener; +Cc: GCC Patches, Jakub Jelinek, Uros Bizjak

[-- Attachment #1: Type: text/plain, Size: 22470 bytes --]

Patch with fixes attached.
The if-structure is now as follows:
+      if (count == 3)
...
+      else
+       {
+         /* If length is not equal to 3 then only power of 2 is supported.  */
+         gcc_assert (exact_log2 (count) != -1);
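
Spelled out as a standalone sketch, the structure is: handle the special
size first, then assert the power-of-two precondition in the else branch.
exact_log2_sketch, perm_kind and select_permutation are hypothetical names
for illustration; the sketch assumes exact_log2 returns -1 for anything
that is not a power of two, and uses plain assert in place of gcc_assert.

```c
#include <assert.h>

/* Hypothetical stand-in for GCC's exact_log2: log2 of X if X is a power
   of two, otherwise -1.  */
static int
exact_log2_sketch (unsigned long x)
{
  int log = 0;
  if (x == 0 || (x & (x - 1)) != 0)
    return -1;
  while (x >>= 1)
    log++;
  return log;
}

enum perm_kind { PERM_SHUFFLE3, PERM_EVEN_ODD };

/* Pick the permutation scheme for a group of COUNT accesses.  */
static enum perm_kind
select_permutation (unsigned long count)
{
  if (count == 3)
    return PERM_SHUFFLE3;
  else
    {
      /* If count is not 3 then only a power of 2 is supported.  */
      assert (exact_log2_sketch (count) != -1);
      return PERM_EVEN_ODD;
    }
}
```

The explicit comparison with -1 matters: -1 is non-zero, so asserting on
the bare return value would pass for unsupported sizes and fail for
count == 1, where the result is 0.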

For the stores group I've started a separate mail thread.

Thanks,
Evgeny


On Tue, May 6, 2014 at 3:47 PM, Richard Biener <rguenther@suse.de> wrote:
> On Tue, 6 May 2014, Evgeny Stupachenko wrote:
>
>> The cost model patch was successfully committed.
>> I've split the rest of the patch on load/store groups into
>> 2: one for load groups and one for store groups.
>> Below is the first part, on load groups.
>>
>> Bootstrap and make check passed on x86.
>>
>> Is it ok?
>>
>> ChangeLog:
>>
>> 2014-05-06  Evgeny Stupachenko  <evstupac@gmail.com>
>>
>>         * tree-vect-data-refs.c (vect_grouped_load_supported): New
>>         check for loads group of length 3.
>>         (vect_permute_load_chain): New permutations for loads group of
>>         length 3.
>>         * tree-vect-stmts.c (vect_model_load_cost): Change cost
>>         of vec_perm_shuffle for the new permutations.
>>
>> ChangeLog for testsuite:
>>
>> 2014-05-06  Evgeny Stupachenko  <evstupac@gmail.com>
>>
>>        PR tree-optimization/52252
>>        * gcc.dg/vect/pr52252-ld.c: Test on loads group of size 3.
>>
>> diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c
>> index 274cdbd..feafb38 100644
>> --- a/gcc/tree-vect-data-refs.c
>> +++ b/gcc/tree-vect-data-refs.c
>> @@ -4812,36 +4812,74 @@ vect_grouped_load_supported (tree vectype, unsigned HOST_WIDE_INT count)
>>  {
>>    enum machine_mode mode = TYPE_MODE (vectype);
>>
>> -  /* vect_permute_load_chain requires the group size to be a power of two.  */
>> -  if (exact_log2 (count) == -1)
>> +  /* vect_permute_load_chain requires the group size to be equal to 3 or
>> +     be a power of two.  */
>> +  if (count != 3 && exact_log2 (count) == -1)
>>      {
>>        if (dump_enabled_p ())
>>         dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>> -                         "the size of the group of accesses"
>> -                         " is not a power of 2\n");
>> +                        "the size of the group of accesses"
>> +                        " is not a power of 2 or not eqaul to 3\n");
>
> equal
>
>>        return false;
>>      }
>>
>>    /* Check that the permutation is supported.  */
>>    if (VECTOR_MODE_P (mode))
>>      {
>> -      unsigned int i, nelt = GET_MODE_NUNITS (mode);
>> +      unsigned int i, j, nelt = GET_MODE_NUNITS (mode);
>>        unsigned char *sel = XALLOCAVEC (unsigned char, nelt);
>>
>> -      for (i = 0; i < nelt; i++)
>> -       sel[i] = i * 2;
>> -      if (can_vec_perm_p (mode, false, sel))
>> +      if (exact_log2 (count) != -1)
>>         {
>>           for (i = 0; i < nelt; i++)
>> -           sel[i] = i * 2 + 1;
>> +           sel[i] = i * 2;
>>           if (can_vec_perm_p (mode, false, sel))
>> -           return true;
>> +           {
>> +             for (i = 0; i < nelt; i++)
>> +               sel[i] = i * 2 + 1;
>> +             if (can_vec_perm_p (mode, false, sel))
>> +               return true;
>> +           }
>> +        }
>> +      else if (count == 3)
>
> Please structure this if as having special cases first and then an
> else with gcc_assert (exact_log2 (count)).
>
>> +       {
>> +         unsigned int k;
>> +         for (k = 0; k < 3; k++)
>> +           {
>> +             for (i = 0; i < nelt; i++)
>> +               if (3 * i + k < 2 * nelt)
>> +                 sel[i] = 3 * i + k;
>> +               else
>> +                 sel[i] = 0;
>> +             if (!can_vec_perm_p (mode, false, sel))
>> +               {
>> +                 if (dump_enabled_p ())
>> +                   dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>> +                                    "shuffle of 3 loads is not supported by \
>> +                                    target\n");
>
> Don't use multi-line strings but do
>
>                         "shuffle of ..."
>                         "target\n");
>
> instead.
>
>> +                   return false;
>> +               }
>> +             for (i = 0, j = 0; i < nelt; i++)
>> +               if (3 * i + k < 2 * nelt)
>> +                 sel[i] = i;
>> +               else
>> +                 sel[i] = nelt + ((nelt + k) % 3) + 3 * (j++);
>> +             if (!can_vec_perm_p (mode, false, sel))
>> +               {
>> +                 if (dump_enabled_p ())
>> +                   dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>> +                                    "shuffle of 3 loads is not supported by \
>> +                                    target\n");
>
> Likewise.
>
>> +                 return false;
>> +               }
>> +           }
>> +         return true;
>>         }
>>      }
>>
>>    if (dump_enabled_p ())
>>      dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>> -                     "extract even/odd not supported by target\n");
>> +                    "extract even/odd not supported by target\n");
>>    return false;
>>  }
>>
>> @@ -4859,8 +4897,9 @@ vect_load_lanes_supported (tree vectype,
>> unsigned HOST_WIDE_INT count)
>>  /* Function vect_permute_load_chain.
>>
>>     Given a chain of interleaved loads in DR_CHAIN of LENGTH that must be
>> -   a power of 2, generate extract_even/odd stmts to reorder the input data
>> -   correctly.  Return the final references for loads in RESULT_CHAIN.
>> +   a power of 2 or equal to 3, generate extract_even/odd stmts to reorder
>> +   the input data correctly.  Return the final references for loads in
>> +   RESULT_CHAIN.
>>
>>     E.g., LENGTH is 4 and the scalar type is short, i.e., VF is 8.
>>     The input is 4 vectors each containing 8 elements. We assign a
>> number to each
>> @@ -4941,6 +4980,7 @@ vect_permute_load_chain (vec<tree> dr_chain,
>>  {
>>    tree data_ref, first_vect, second_vect;
>>    tree perm_mask_even, perm_mask_odd;
>> +  tree perm3_mask_low, perm3_mask_high;
>>    gimple perm_stmt;
>>    tree vectype = STMT_VINFO_VECTYPE (vinfo_for_stmt (stmt));
>>    unsigned int i, j, log_length = exact_log2 (length);
>> @@ -4951,45 +4991,99 @@ vect_permute_load_chain (vec<tree> dr_chain,
>>    memcpy (result_chain->address (), dr_chain.address (),
>>           length * sizeof (tree));
>>
>> -  for (i = 0; i < nelt; ++i)
>> -    sel[i] = i * 2;
>> -  perm_mask_even = vect_gen_perm_mask (vectype, sel);
>> -  gcc_assert (perm_mask_even != NULL);
>> +  if (log_length != (unsigned int)-1)
>
> Same for the if-structure - first handle all special values
> and then in the else handle power-of-two cases.
>
> Ok with those changes.
>
> Thanks,
> Richard.
>
>> +    {
>> +      for (i = 0; i < nelt; ++i)
>> +       sel[i] = i * 2;
>> +      perm_mask_even = vect_gen_perm_mask (vectype, sel);
>> +      gcc_assert (perm_mask_even != NULL);
>>
>> -  for (i = 0; i < nelt; ++i)
>> -    sel[i] = i * 2 + 1;
>> -  perm_mask_odd = vect_gen_perm_mask (vectype, sel);
>> -  gcc_assert (perm_mask_odd != NULL);
>> +      for (i = 0; i < nelt; ++i)
>> +       sel[i] = i * 2 + 1;
>> +      perm_mask_odd = vect_gen_perm_mask (vectype, sel);
>> +      gcc_assert (perm_mask_odd != NULL);
>>
>> -  for (i = 0; i < log_length; i++)
>> -    {
>> -      for (j = 0; j < length; j += 2)
>> +      for (i = 0; i < log_length; i++)
>>         {
>> -         first_vect = dr_chain[j];
>> -         second_vect = dr_chain[j+1];
>> +         for (j = 0; j < length; j += 2)
>> +           {
>> +             first_vect = dr_chain[j];
>> +             second_vect = dr_chain[j+1];
>> +
>> +             /* data_ref = permute_even (first_data_ref, second_data_ref);  */
>> +             data_ref = make_temp_ssa_name (vectype, NULL, "vect_perm_even");
>> +             perm_stmt = gimple_build_assign_with_ops (VEC_PERM_EXPR, data_ref,
>> +                                                       first_vect, second_vect,
>> +                                                       perm_mask_even);
>> +             vect_finish_stmt_generation (stmt, perm_stmt, gsi);
>> +             (*result_chain)[j/2] = data_ref;
>> +
>> +             /* data_ref = permute_odd (first_data_ref, second_data_ref);  */
>> +             data_ref = make_temp_ssa_name (vectype, NULL, "vect_perm_odd");
>> +             perm_stmt = gimple_build_assign_with_ops (VEC_PERM_EXPR, data_ref,
>> +                                                       first_vect, second_vect,
>> +                                                       perm_mask_odd);
>> +             vect_finish_stmt_generation (stmt, perm_stmt, gsi);
>> +             (*result_chain)[j/2+length/2] = data_ref;
>> +           }
>> +         memcpy (dr_chain.address (), result_chain->address (),
>> +                 length * sizeof (tree));
>> +       }
>> +    }
>> +  /* length is not a power of 2.  */
>> +  else
>> +    {
>> +      unsigned int k;
>>
>> -         /* data_ref = permute_even (first_data_ref, second_data_ref);  */
>> -         data_ref = make_temp_ssa_name (vectype, NULL, "vect_perm_even");
>> +      /* currently only length 3 is supported as most frequent case.  */
>> +      gcc_assert (length == 3);
>> +
>> +      for (k = 0; k < 3; k++)
>> +       {
>> +         for (i = 0; i < nelt; i++)
>> +           if (3 * i + k < 2 * nelt)
>> +             sel[i] = 3 * i + k;
>> +           else
>> +             sel[i] = 0;
>> +         perm3_mask_low = vect_gen_perm_mask (vectype, sel);
>> +         gcc_assert (perm3_mask_low != NULL);
>> +
>> +         for (i = 0, j = 0; i < nelt; i++)
>> +           if (3 * i + k < 2 * nelt)
>> +             sel[i] = i;
>> +           else
>> +             sel[i] = nelt + ((nelt + k) % 3) + 3 * (j++);
>> +
>> +         perm3_mask_high = vect_gen_perm_mask (vectype, sel);
>> +         gcc_assert (perm3_mask_high != NULL);
>> +
>> +         first_vect = dr_chain[0];
>> +         second_vect = dr_chain[1];
>> +
>> +         /* Create interleaving stmt (low part of):
>> +            low = VEC_PERM_EXPR <first_vect, second_vect2, {k, 3 + k, 6 + k,
>> +                                                            ...}>  */
>> +         data_ref = make_temp_ssa_name (vectype, NULL, "vect_suffle3_low");
>>           perm_stmt = gimple_build_assign_with_ops (VEC_PERM_EXPR, data_ref,
>>                                                     first_vect, second_vect,
>> -                                                   perm_mask_even);
>> +                                                   perm3_mask_low);
>>           vect_finish_stmt_generation (stmt, perm_stmt, gsi);
>> -         (*result_chain)[j/2] = data_ref;
>>
>> -         /* data_ref = permute_odd (first_data_ref, second_data_ref);  */
>> -         data_ref = make_temp_ssa_name (vectype, NULL, "vect_perm_odd");
>> +         /* Create interleaving stmt (high part of):
>> +            high = VEC_PERM_EXPR <first_vect, second_vect2, {k, 3 + k, 6 + k,
>> +                                                             ...}>  */
>> +         first_vect = data_ref;
>> +         second_vect = dr_chain[2];
>> +         data_ref = make_temp_ssa_name (vectype, NULL, "vect_suffle3_high");
>>           perm_stmt = gimple_build_assign_with_ops (VEC_PERM_EXPR, data_ref,
>>                                                     first_vect, second_vect,
>> -                                                   perm_mask_odd);
>> +                                                   perm3_mask_high);
>>           vect_finish_stmt_generation (stmt, perm_stmt, gsi);
>> -         (*result_chain)[j/2+length/2] = data_ref;
>> +         (*result_chain)[k] = data_ref;
>>         }
>> -      memcpy (dr_chain.address (), result_chain->address (),
>> -             length * sizeof (tree));
>>      }
>>  }
>>
>> -
>>  /* Function vect_transform_grouped_load.
>>
>>     Given a chain of input interleaved data-refs (in DR_CHAIN), build statements
>> diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
>> index 1a51d6d..b87c143 100644
>> --- a/gcc/tree-vect-stmts.c
>> +++ b/gcc/tree-vect-stmts.c
>> @@ -1091,10 +1091,11 @@ vect_model_load_cost (stmt_vec_info stmt_info,
>> int ncopies,
>>       include the cost of the permutes.  */
>>    if (!load_lanes_p && group_size > 1)
>>      {
>> -      /* Uses an even and odd extract operations for each needed permute.  */
>> -      int nstmts = ncopies * exact_log2 (group_size) * group_size;
>> -      inside_cost += record_stmt_cost (body_cost_vec, nstmts, vec_perm,
>> -                                      stmt_info, 0, vect_body);
>> +      /* Uses an even and odd extract operations or shuffle operations
>> +        for each needed permute.  */
>> +      int nstmts = ncopies * ceil_log2 (group_size) * group_size;
>> +      inside_cost = record_stmt_cost (body_cost_vec, nstmts, vec_perm,
>> +                                     stmt_info, 0, vect_body);
>>
>>        if (dump_enabled_p ())
>>          dump_printf_loc (MSG_NOTE, vect_location,
>>
>>
>> diff --git a/gcc/testsuite/gcc.dg/vect/pr52252-ld.c
>> b/gcc/testsuite/gcc.dg/vect/pr52252-ld.c
>> new file mode 100644
>> index 0000000..6e3cb52
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.dg/vect/pr52252-ld.c
>> @@ -0,0 +1,30 @@
>> +/* { dg-do compile } */
>> +/* { dg-options "-O2 -g -ftree-vectorize -mssse3
>> -fdump-tree-vect-details" { target { i?86-*-* x86_64-*-* } } } */
>> +
>> +#define byte unsigned char
>> +
>> +void
>> +matrix_mul (byte *in, byte *out, int size)
>> +{
>> +  int i;
>> +  for (i = 0; i < size; i++)
>> +    {
>> +      byte in0 = in[0];
>> +      byte in1 = in[1];
>> +      byte in2 = in[2];
>> +      byte out0, out1, out2, out3;
>> +      out0 = in0 + in1;
>> +      out1 = in0 + in2;
>> +      out2 = in1 + in2;
>> +      out3 = in0 + in1 + in2;
>> +      out[0] = out0;
>> +      out[1] = out1;
>> +      out[2] = out2;
>> +      out[3] = out3;
>> +      in += 3;
>> +      out += 4;
>> +    }
>> +}
>> +
>> +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
>> +/* { dg-final { cleanup-tree-dump "vect" } } */
>>
>>
>> On Wed, Apr 30, 2014 at 6:31 PM, Evgeny Stupachenko <evstupac@gmail.com> wrote:
>> > Ping.
>> >
>> > On Fri, Apr 18, 2014 at 2:05 PM, Evgeny Stupachenko <evstupac@gmail.com> wrote:
>> >> Hi,
>> >>
>> >> Merged with current master, the patch passes bootstrap and gives the
>> >> expected gains.
>> >> Patch and new tests are attached.
>> >>
>> >> ChangeLog:
>> >>
>> >> 2014-04-18  Evgeny Stupachenko  <evstupac@gmail.com>
>> >>
>> >>         * tree-vect-data-refs.c (vect_grouped_store_supported): New
>> >>         check for stores group of length 3.
>> >>         (vect_permute_store_chain): New permutations for stores group of
>> >>         length 3.
>> >>         (vect_grouped_load_supported): New check for loads group of length 3.
>> >>         (vect_permute_load_chain): New permutations for loads group of length 3.
>> >>         * tree-vect-stmts.c (vect_model_store_cost): Change cost
>> >>         of vec_perm_shuffle for the new permutations.
>> >>         (vect_model_load_cost): Ditto.
>> >>
>> >> ChangeLog for testsuite:
>> >>
>> >> 2014-04-18  Evgeny Stupachenko  <evstupac@gmail.com>
>> >>
>> >>        PR tree-optimization/52252
>> >>        * gcc.dg/vect/pr52252-ld.c: Test on loads group of size 3.
>> >>        * gcc.dg/vect/pr52252-st.c: Test on stores group of size 3.
>> >>
>> >> Evgeny
>> >>
>> >> On Thu, Mar 6, 2014 at 6:44 PM, Evgeny Stupachenko <evstupac@gmail.com> wrote:
>> >>> Missed attachment.
>> >>>
>> >>> On Thu, Mar 6, 2014 at 6:42 PM, Evgeny Stupachenko <evstupac@gmail.com> wrote:
>> >>>> I've separated the patch into 2: cost model tuning and load/store
>> >>>> groups parallelism.
>> >>>> SLM tuning was partially introduced in the patch:
>> >>>> http://gcc.gnu.org/ml/gcc-patches/2014-03/msg00226.html
>> >>>> The patch introducing vectorization for load/store groups of size 3 is attached.
>> >>>>
>> >>>> Is it ok for stage1?
>> >>>>
>> >>>> ChangeLog:
>> >>>>
>> >>>> 2014-03-06  Evgeny Stupachenko  <evstupac@gmail.com>
>> >>>>
>> >>>>        * tree-vect-data-refs.c (vect_grouped_store_supported): New
>> >>>>        check for stores group of length 3.
>> >>>>        (vect_permute_store_chain): New permutations for stores group of
>> >>>>        length 3.
>> >>>>        (vect_grouped_load_supported): New check for loads group of length 3.
>> >>>>        (vect_permute_load_chain): New permutations for loads group of length 3.
>> >>>>        * tree-vect-stmts.c (vect_model_store_cost): Change cost
>> >>>>        of vec_perm_shuffle for the new permutations.
>> >>>>        (vect_model_load_cost): Ditto.
>> >>>>
>> >>>>
>> >>>>
>> >>>> On Tue, Feb 11, 2014 at 7:19 PM, Richard Biener <rguenther@suse.de> wrote:
>> >>>>> On Tue, 11 Feb 2014, Evgeny Stupachenko wrote:
>> >>>>>
>> >>>>>> Missed patch attached in plain-text.
>> >>>>>>
>> >>>>>> I have copyright assignment on file with the FSF covering work on GCC.
>> >>>>>>
>> >>>>>> Load/store groups of length 3 are the most frequent non-power-of-2
>> >>>>>> case. They are used in RGB image processing (like the test case in
>> >>>>>> PR52252). We can certainly extend the patch to length 5 and more.
>> >>>>>> However, this could affect performance on some other architectures
>> >>>>>> and requires more extensive testing. So length 3 is just a first
>> >>>>>> step. The algorithm in the patch could be modified for the general
>> >>>>>> case in several steps.
>> >>>>>>
>> >>>>>> I understand that the patch should wait for stage 1; however, since
>> >>>>>> it's ready we can discuss it right now and make some changes (like
>> >>>>>> a general group size).
>> >>>>>
>> >>>>> Other than that I'd like to see a vectorizer hook querying the cost of a
>> >>>>> vec_perm_const expansion instead of adding vec_perm_shuffle
>> >>>>> (thus requires the constant shuffle mask to be passed as well
>> >>>>> as the vector type).  That's more useful for other uses that
>> >>>>> would require (arbitrary) shuffles.
>> >>>>>
>> >>>>> Didn't look at the rest of the patch yet - queued in my review
>> >>>>> pipeline.
>> >>>>>
>> >>>>> Thanks,
>> >>>>> Richard.
>> >>>>>
>> >>>>>> Thanks,
>> >>>>>> Evgeny
>> >>>>>>
>> >>>>>> On Tue, Feb 11, 2014 at 5:00 PM, Richard Biener <rguenther@suse.de> wrote:
>> >>>>>> >
>> >>>>>> > On Tue, 11 Feb 2014, Evgeny Stupachenko wrote:
>> >>>>>> >
>> >>>>>> > > Hi,
>> >>>>>> > >
>> >>>>>> > > The patch gives an expected 3 times gain for the test case in the PR52252
>> >>>>>> > > (and even 6 times for AVX2).
>> >>>>>> > > It passes make check and bootstrap on x86.
>> >>>>>> > > spec2000/spec2006 got no regressions/gains on x86.
>> >>>>>> > >
>> >>>>>> > > Is this patch ok?
>> >>>>>> >
>> >>>>>> > I've worked on generalizing the permutation support in the light
>> >>>>>> > of the availability of the generic shuffle support in the IL
>> >>>>>> > but hit some road-blocks in the way code-generation works for
>> >>>>>> > group loads with permutations (I don't remember if I posted all patches).
>> >>>>>> >
>> >>>>>> > This patch seems to be to a slightly different place but it again
>> >>>>>> > special-cases a specific permutation.  Why's that?  Why can't we
>> >>>>>> > support groups of size 7 for example?  So - can this be generalized
>> >>>>>> > to support arbitrary non-power-of-two load/store groups?
>> >>>>>> >
>> >>>>>> > Other than that the patch has to wait for stage1 to open again,
>> >>>>>> > of course.  And it misses a testcase.
>> >>>>>> >
>> >>>>>> > Btw, do you have a copyright assignment on file with the FSF covering
>> >>>>>> > work on GCC?
>> >>>>>> >
>> >>>>>> > Thanks,
>> >>>>>> > Richard.
>> >>>>>> >
>> >>>>>> > > ChangeLog:
>> >>>>>> > >
>> >>>>>> > > 2014-02-11  Evgeny Stupachenko  <evstupac@gmail.com>
>> >>>>>> > >
>> >>>>>> > >         * target.h (vect_cost_for_stmt): Defining new cost vec_perm_shuffle.
>> >>>>>> > >         * tree-vect-data-refs.c (vect_grouped_store_supported): New
>> >>>>>> > >         check for stores group of length 3.
>> >>>>>> > >         (vect_permute_store_chain): New permutations for stores group of
>> >>>>>> > >         length 3.
>> >>>>>> > >         (vect_grouped_load_supported): New check for loads group of length
>> >>>>>> > > 3.
>> >>>>>> > >         (vect_permute_load_chain): New permutations for loads group of
>> >>>>>> > > length 3.
>> >>>>>> > >         * tree-vect-stmts.c (vect_model_store_cost): New cost
>> >>>>>> > > vec_perm_shuffle
>> >>>>>> > >         for the new permutations.
>> >>>>>> > >         (vect_model_load_cost): Ditto.
>> >>>>>> > >         * config/aarch64/aarch64.c (builtin_vectorization_cost): Adding
>> >>>>>> > >         vec_perm_shuffle cost as equivalent of vec_perm cost.
>> >>>>>> > >         * config/arm/arm.c: Ditto.
>> >>>>>> > >         * config/rs6000/rs6000.c: Ditto.
>> >>>>>> > >         * config/spu/spu.c: Ditto.
>> >>>>>> > >         * config/i386/x86-tune.def (TARGET_SLOW_PHUFFB): Target for slow
>> >>>>>> > > byte
>> >>>>>> > >         shuffle on some x86 architectures.
>> >>>>>> > >         * config/i386/i386.h (processor_costs): Defining pshuffb cost.
>> >>>>>> > >         * config/i386/i386.c (processor_costs): Adding pshuffb cost.
>> >>>>>> > >         (ix86_builtin_vectorization_cost): Adding cost for the new
>> >>>>>> > > permutations.
>> >>>>>> > >         Fixing cost for other permutations.
>> >>>>>> > >         (expand_vec_perm_even_odd_1): Avoid byte shuffles when they are
>> >>>>>> > >         slow (TARGET_SLOW_PHUFFB).
>> >>>>>> > >         (ix86_add_stmt_cost): Adding cost when STMT is WIDEN_MULTIPLY.
>> >>>>>> > >         Adding new shuffle cost only when byte shuffle is expected.
>> >>>>>> > >         Fixing cost model for Silvermont.
>> >>>>>> > >
>> >>>>>> > > Thanks,
>> >>>>>> > > Evgeny
>> >>>>>> > >
>> >>>>>> >
>> >>>>>> > --
>> >>>>>> > Richard Biener <rguenther@suse.de>
>> >>>>>> > SUSE / SUSE Labs
>> >>>>>> > SUSE LINUX Products GmbH - Nuernberg - AG Nuernberg - HRB 16746
>> >>>>>> > GF: Jeff Hawn, Jennifer Guild, Felix Imend"orffer
>> >>>>>>
>> >>>>>
>> >>>>> --
>> >>>>> Richard Biener <rguenther@suse.de>
>> >>>>> SUSE / SUSE Labs
>> >>>>> SUSE LINUX Products GmbH - Nuernberg - AG Nuernberg - HRB 16746
>> >>>>> GF: Jeff Hawn, Jennifer Guild, Felix Imend"orffer
>>
>>
>
> --
> Richard Biener <rguenther@suse.de>
> SUSE / SUSE Labs
> SUSE LINUX Products GmbH - Nuernberg - AG Nuernberg - HRB 16746
> GF: Jeff Hawn, Jennifer Guild, Felix Imend"orffer

[-- Attachment #2: vect3load.patch --]
[-- Type: application/octet-stream, Size: 8945 bytes --]

diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c
index 274cdbd..ef710cf 100644
--- a/gcc/tree-vect-data-refs.c
+++ b/gcc/tree-vect-data-refs.c
@@ -4812,36 +4812,76 @@ vect_grouped_load_supported (tree vectype, unsigned HOST_WIDE_INT count)
 {
   enum machine_mode mode = TYPE_MODE (vectype);
 
-  /* vect_permute_load_chain requires the group size to be a power of two.  */
-  if (exact_log2 (count) == -1)
+  /* vect_permute_load_chain requires the group size to be equal to 3 or
+     be a power of two.  */
+  if (count != 3 && exact_log2 (count) == -1)
     {
       if (dump_enabled_p ())
 	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-                         "the size of the group of accesses"
-                         " is not a power of 2\n");
+			 "the size of the group of accesses"
+			 " is not a power of 2 or not equal to 3\n");
       return false;
     }
 
   /* Check that the permutation is supported.  */
   if (VECTOR_MODE_P (mode))
     {
-      unsigned int i, nelt = GET_MODE_NUNITS (mode);
+      unsigned int i, j, nelt = GET_MODE_NUNITS (mode);
       unsigned char *sel = XALLOCAVEC (unsigned char, nelt);
 
-      for (i = 0; i < nelt; i++)
-	sel[i] = i * 2;
-      if (can_vec_perm_p (mode, false, sel))
+      if (count == 3)
 	{
+	  unsigned int k;
+	  for (k = 0; k < 3; k++)
+	    {
+	      for (i = 0; i < nelt; i++)
+		if (3 * i + k < 2 * nelt)
+		  sel[i] = 3 * i + k;
+		else
+		  sel[i] = 0;
+	      if (!can_vec_perm_p (mode, false, sel))
+		{
+		  if (dump_enabled_p ())
+		    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+				     "shuffle of 3 loads is not supported by"
+				     " target\n");
+		  return false;
+		}
+	      for (i = 0, j = 0; i < nelt; i++)
+		if (3 * i + k < 2 * nelt)
+		  sel[i] = i;
+		else
+		  sel[i] = nelt + ((nelt + k) % 3) + 3 * (j++);
+	      if (!can_vec_perm_p (mode, false, sel))
+		{
+		  if (dump_enabled_p ())
+		    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+				     "shuffle of 3 loads is not supported by"
+				     " target\n");
+		  return false;
+		}
+	    }
+	  return true;
+	}
+      else
+	{
+	  /* If length is not equal to 3 then only power of 2 is supported.  */
+	  gcc_assert (exact_log2 (count) != -1);
 	  for (i = 0; i < nelt; i++)
-	    sel[i] = i * 2 + 1;
+	    sel[i] = i * 2;
 	  if (can_vec_perm_p (mode, false, sel))
-	    return true;
-	}
+	    {
+	      for (i = 0; i < nelt; i++)
+		sel[i] = i * 2 + 1;
+	      if (can_vec_perm_p (mode, false, sel))
+		return true;
+	    }
+        }
     }
 
   if (dump_enabled_p ())
     dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-                     "extract even/odd not supported by target\n");
+		     "extract even/odd not supported by target\n");
   return false;
 }
 
@@ -4859,8 +4899,9 @@ vect_load_lanes_supported (tree vectype, unsigned HOST_WIDE_INT count)
 /* Function vect_permute_load_chain.
 
    Given a chain of interleaved loads in DR_CHAIN of LENGTH that must be
-   a power of 2, generate extract_even/odd stmts to reorder the input data
-   correctly.  Return the final references for loads in RESULT_CHAIN.
+   a power of 2 or equal to 3, generate extract_even/odd stmts to reorder
+   the input data correctly.  Return the final references for loads in
+   RESULT_CHAIN.
 
    E.g., LENGTH is 4 and the scalar type is short, i.e., VF is 8.
    The input is 4 vectors each containing 8 elements. We assign a number to each
@@ -4941,6 +4982,7 @@ vect_permute_load_chain (vec<tree> dr_chain,
 {
   tree data_ref, first_vect, second_vect;
   tree perm_mask_even, perm_mask_odd;
+  tree perm3_mask_low, perm3_mask_high;
   gimple perm_stmt;
   tree vectype = STMT_VINFO_VECTYPE (vinfo_for_stmt (stmt));
   unsigned int i, j, log_length = exact_log2 (length);
@@ -4951,44 +4993,97 @@ vect_permute_load_chain (vec<tree> dr_chain,
   memcpy (result_chain->address (), dr_chain.address (),
 	  length * sizeof (tree));
 
-  for (i = 0; i < nelt; ++i)
-    sel[i] = i * 2;
-  perm_mask_even = vect_gen_perm_mask (vectype, sel);
-  gcc_assert (perm_mask_even != NULL);
-
-  for (i = 0; i < nelt; ++i)
-    sel[i] = i * 2 + 1;
-  perm_mask_odd = vect_gen_perm_mask (vectype, sel);
-  gcc_assert (perm_mask_odd != NULL);
-
-  for (i = 0; i < log_length; i++)
+  if (length == 3)
     {
-      for (j = 0; j < length; j += 2)
-	{
-	  first_vect = dr_chain[j];
-	  second_vect = dr_chain[j+1];
+      unsigned int k;
 
-	  /* data_ref = permute_even (first_data_ref, second_data_ref);  */
-	  data_ref = make_temp_ssa_name (vectype, NULL, "vect_perm_even");
+      for (k = 0; k < 3; k++)
+	{
+	  for (i = 0; i < nelt; i++)
+	    if (3 * i + k < 2 * nelt)
+	      sel[i] = 3 * i + k;
+	    else
+	      sel[i] = 0;
+	  perm3_mask_low = vect_gen_perm_mask (vectype, sel);
+	  gcc_assert (perm3_mask_low != NULL);
+
+	  for (i = 0, j = 0; i < nelt; i++)
+	    if (3 * i + k < 2 * nelt)
+	      sel[i] = i;
+	    else
+	      sel[i] = nelt + ((nelt + k) % 3) + 3 * (j++);
+
+	  perm3_mask_high = vect_gen_perm_mask (vectype, sel);
+	  gcc_assert (perm3_mask_high != NULL);
+
+	  first_vect = dr_chain[0];
+	  second_vect = dr_chain[1];
+
+	  /* Create interleaving stmt (low part of):
+	     low = VEC_PERM_EXPR <first_vect, second_vect2, {k, 3 + k, 6 + k,
+							     ...}>  */
+	  data_ref = make_temp_ssa_name (vectype, NULL, "vect_suffle3_low");
 	  perm_stmt = gimple_build_assign_with_ops (VEC_PERM_EXPR, data_ref,
 						    first_vect, second_vect,
-						    perm_mask_even);
+						    perm3_mask_low);
 	  vect_finish_stmt_generation (stmt, perm_stmt, gsi);
-	  (*result_chain)[j/2] = data_ref;
 
-	  /* data_ref = permute_odd (first_data_ref, second_data_ref);  */
-	  data_ref = make_temp_ssa_name (vectype, NULL, "vect_perm_odd");
+	  /* Create interleaving stmt (high part of):
+	     high = VEC_PERM_EXPR <first_vect, second_vect2, {k, 3 + k, 6 + k,
+							      ...}>  */
+	  first_vect = data_ref;
+	  second_vect = dr_chain[2];
+	  data_ref = make_temp_ssa_name (vectype, NULL, "vect_suffle3_high");
 	  perm_stmt = gimple_build_assign_with_ops (VEC_PERM_EXPR, data_ref,
 						    first_vect, second_vect,
-						    perm_mask_odd);
+						    perm3_mask_high);
 	  vect_finish_stmt_generation (stmt, perm_stmt, gsi);
-	  (*result_chain)[j/2+length/2] = data_ref;
+	  (*result_chain)[k] = data_ref;
 	}
-      memcpy (dr_chain.address (), result_chain->address (),
-	      length * sizeof (tree));
     }
-}
+  else
+    {
+      /* If length is not equal to 3 then only power of 2 is supported.  */
+      gcc_assert (exact_log2 (length) != -1);
+
+      for (i = 0; i < nelt; ++i)
+	sel[i] = i * 2;
+      perm_mask_even = vect_gen_perm_mask (vectype, sel);
+      gcc_assert (perm_mask_even != NULL);
+
+      for (i = 0; i < nelt; ++i)
+	sel[i] = i * 2 + 1;
+      perm_mask_odd = vect_gen_perm_mask (vectype, sel);
+      gcc_assert (perm_mask_odd != NULL);
 
+      for (i = 0; i < log_length; i++)
+	{
+	  for (j = 0; j < length; j += 2)
+	    {
+	      first_vect = dr_chain[j];
+	      second_vect = dr_chain[j+1];
+
+	      /* data_ref = permute_even (first_data_ref, second_data_ref);  */
+	      data_ref = make_temp_ssa_name (vectype, NULL, "vect_perm_even");
+	      perm_stmt = gimple_build_assign_with_ops (VEC_PERM_EXPR, data_ref,
+							first_vect, second_vect,
+							perm_mask_even);
+	      vect_finish_stmt_generation (stmt, perm_stmt, gsi);
+	      (*result_chain)[j/2] = data_ref;
+
+	      /* data_ref = permute_odd (first_data_ref, second_data_ref);  */
+	      data_ref = make_temp_ssa_name (vectype, NULL, "vect_perm_odd");
+	      perm_stmt = gimple_build_assign_with_ops (VEC_PERM_EXPR, data_ref,
+							first_vect, second_vect,
+							perm_mask_odd);
+	      vect_finish_stmt_generation (stmt, perm_stmt, gsi);
+	      (*result_chain)[j/2+length/2] = data_ref;
+	    }
+	  memcpy (dr_chain.address (), result_chain->address (),
+		  length * sizeof (tree));
+	}
+    }
+}
 
 /* Function vect_transform_grouped_load.
 
diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
index 1a51d6d..b87c143 100644
--- a/gcc/tree-vect-stmts.c
+++ b/gcc/tree-vect-stmts.c
@@ -1091,10 +1091,11 @@ vect_model_load_cost (stmt_vec_info stmt_info, int ncopies,
      include the cost of the permutes.  */
   if (!load_lanes_p && group_size > 1)
     {
-      /* Uses an even and odd extract operations for each needed permute.  */
-      int nstmts = ncopies * exact_log2 (group_size) * group_size;
-      inside_cost += record_stmt_cost (body_cost_vec, nstmts, vec_perm,
-				       stmt_info, 0, vect_body);
+      /* Uses an even and odd extract operations or shuffle operations
+	 for each needed permute.  */
+      int nstmts = ncopies * ceil_log2 (group_size) * group_size;
+      inside_cost = record_stmt_cost (body_cost_vec, nstmts, vec_perm,
+				      stmt_info, 0, vect_body);
 
       if (dump_enabled_p ())
         dump_printf_loc (MSG_NOTE, vect_location,
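
To make the two masks in the length-3 case easier to follow, here is a minimal standalone sketch that simulates them outside of GCC. It assumes a vector of 8 elements (NELT below) and models VEC_PERM_EXPR with a plain loop; model_vec_perm and extract_stride3 are hypothetical names used only for illustration and are not part of the patch.

```c
#include <assert.h>

#define NELT 8	/* Assumed number of elements per vector.  */

/* Model of VEC_PERM_EXPR <a, b, sel>: a selector value below NELT picks
   from A, anything else picks from B.  OUT must not alias A or B.  */
void
model_vec_perm (const unsigned char *a, const unsigned char *b,
		const unsigned char *sel, unsigned char *out)
{
  unsigned int i;
  for (i = 0; i < NELT; i++)
    out[i] = sel[i] < NELT ? a[sel[i]] : b[sel[i] - NELT];
}

/* Extract elements K, K + 3, K + 6, ... from the 3 * NELT interleaved
   elements held in V0, V1 and V2, using the same selector formulas as
   the length-3 branch of the patch.  */
void
extract_stride3 (const unsigned char *v0, const unsigned char *v1,
		 const unsigned char *v2, unsigned int k,
		 unsigned char *out)
{
  unsigned char sel[NELT], low[NELT];
  unsigned int i, j;

  assert (k < 3);

  /* Low mask: take every third element from the first two vectors;
     lanes that need the third vector get a dummy index 0 for now.  */
  for (i = 0; i < NELT; i++)
    sel[i] = (unsigned char) (3 * i + k < 2 * NELT ? 3 * i + k : 0);
  model_vec_perm (v0, v1, sel, low);

  /* High mask: keep the lanes already filled and pull the remaining
     ones from the third vector.  */
  for (i = 0, j = 0; i < NELT; i++)
    if (3 * i + k < 2 * NELT)
      sel[i] = (unsigned char) i;
    else
      sel[i] = (unsigned char) (NELT + ((NELT + k) % 3) + 3 * (j++));
  model_vec_perm (low, v2, sel, out);
}
```

With the 24 input elements numbered 0..23 and k = 0, the low permute produces {0, 3, 6, 9, 12, 15, 0, 0} and the high permute replaces the last two lanes with elements 18 and 21 of the third vector, so each of the three results costs two permutes.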


* Re: [PATCH, PR52252] Vectorization for load/store groups of size 3.
  2014-05-06 14:38                   ` Evgeny Stupachenko
@ 2014-05-06 14:40                     ` Richard Biener
  2014-05-12 15:14                     ` Rainer Orth
  1 sibling, 0 replies; 26+ messages in thread
From: Richard Biener @ 2014-05-06 14:40 UTC (permalink / raw)
  To: Evgeny Stupachenko; +Cc: GCC Patches, Jakub Jelinek, Uros Bizjak

On Tue, 6 May 2014, Evgeny Stupachenko wrote:

> Patch with fixes attached.

Ok if bootstrap/regtest passes.

Thanks,
Richard.

> Currently the if-structure is as follows:
> +      if (count == 3)
> ...
> +      else
> +       {
> +         /* If length is not equal to 3 then only power of 2 is supported.  */
> +         gcc_assert (exact_log2 (count) != -1);
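
The structure quoted above (special case first, power of two asserted in the else branch) can be sketched in isolation as follows. This is only an illustration under stated assumptions: exact_log2_sketch is a hypothetical stand-in for GCC's exact_log2, and perm_kind, group_size_ok and choose_perm are made-up names, not functions from the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for GCC's exact_log2: return log2 of X if X is
   a power of two, otherwise -1.  */
int
exact_log2_sketch (unsigned long x)
{
  int n = 0;
  if (x == 0 || (x & (x - 1)) != 0)
    return -1;
  while (x >>= 1)
    n++;
  return n;
}

enum perm_kind { PERM_SHUFFLE3, PERM_EVEN_ODD };

/* Early rejection, as at the top of vect_grouped_load_supported:
   only 3 and powers of two get past this point.  */
bool
group_size_ok (unsigned long count)
{
  return count == 3 || exact_log2_sketch (count) != -1;
}

/* Special values are handled first; whatever reaches the else branch
   must be a power of two, which the assert documents.  */
enum perm_kind
choose_perm (unsigned long count)
{
  if (count == 3)
    return PERM_SHUFFLE3;
  else
    {
      assert (exact_log2_sketch (count) != -1);
      return PERM_EVEN_ODD;
    }
}
```

Putting the special case first means a later special value (say, 5) would be added as another branch ahead of the else, without touching the power-of-two path.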
> 
> For stores group I've created another mail thread.
> 
> Thanks,
> Evgeny
> 
> 
> On Tue, May 6, 2014 at 3:47 PM, Richard Biener <rguenther@suse.de> wrote:
> > On Tue, 6 May 2014, Evgeny Stupachenko wrote:
> >
> >> The cost model patch was successfully committed.
> >> I've split the rest of the load/store group patch into two parts:
> >> one for load groups and one for store groups.
> >> Below is the first part, for load groups.
> >>
> >> Bootstrap and make check passed on x86.
> >>
> >> Is it ok?
> >>
> >> ChangeLog:
> >>
> >> 2014-05-06  Evgeny Stupachenko  <evstupac@gmail.com>
> >>
> >>         * tree-vect-data-refs.c (vect_grouped_load_supported): New
> >>         check for loads group of length 3.
> >>         (vect_permute_load_chain): New permutations for loads group of
> >>         length 3.
> >>         * tree-vect-stmts.c (vect_model_load_cost): Change cost
> >>         of vec_perm_shuffle for the new permutations.
> >>
> >> ChangeLog for testsuite:
> >>
> >> 2014-05-06  Evgeny Stupachenko  <evstupac@gmail.com>
> >>
> >>        PR tree-optimization/52252
> >>        * gcc.dg/vect/pr52252-ld.c: Test on loads group of size 3.
> >>
> >> diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c
> >> index 274cdbd..feafb38 100644
> >> --- a/gcc/tree-vect-data-refs.c
> >> +++ b/gcc/tree-vect-data-refs.c
> >> @@ -4812,36 +4812,74 @@ vect_grouped_load_supported (tree vectype,
> >> unsigned HOST_WIDE_INT count)
> >>  {
> >>    enum machine_mode mode = TYPE_MODE (vectype);
> >>
> >> -  /* vect_permute_load_chain requires the group size to be a power of two.  */
> >> -  if (exact_log2 (count) == -1)
> >> +  /* vect_permute_load_chain requires the group size to be equal to 3 or
> >> +     be a power of two.  */
> >> +  if (count != 3 && exact_log2 (count) == -1)
> >>      {
> >>        if (dump_enabled_p ())
> >>         dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> >> -                         "the size of the group of accesses"
> >> -                         " is not a power of 2\n");
> >> +                        "the size of the group of accesses"
> >> +                        " is not a power of 2 or not eqaul to 3\n");
> >
> > equal
> >
> >>        return false;
> >>      }
> >>
> >>    /* Check that the permutation is supported.  */
> >>    if (VECTOR_MODE_P (mode))
> >>      {
> >> -      unsigned int i, nelt = GET_MODE_NUNITS (mode);
> >> +      unsigned int i, j, nelt = GET_MODE_NUNITS (mode);
> >>        unsigned char *sel = XALLOCAVEC (unsigned char, nelt);
> >>
> >> -      for (i = 0; i < nelt; i++)
> >> -       sel[i] = i * 2;
> >> -      if (can_vec_perm_p (mode, false, sel))
> >> +      if (exact_log2 (count) != -1)
> >>         {
> >>           for (i = 0; i < nelt; i++)
> >> -           sel[i] = i * 2 + 1;
> >> +           sel[i] = i * 2;
> >>           if (can_vec_perm_p (mode, false, sel))
> >> -           return true;
> >> +           {
> >> +             for (i = 0; i < nelt; i++)
> >> +               sel[i] = i * 2 + 1;
> >> +             if (can_vec_perm_p (mode, false, sel))
> >> +               return true;
> >> +           }
> >> +        }
> >> +      else if (count == 3)
> >
> > Please structure this if as having special cases first and then an
> > else with gcc_assert (exact_log2 (count)).
> >
> >> +       {
> >> +         unsigned int k;
> >> +         for (k = 0; k < 3; k++)
> >> +           {
> >> +             for (i = 0; i < nelt; i++)
> >> +               if (3 * i + k < 2 * nelt)
> >> +                 sel[i] = 3 * i + k;
> >> +               else
> >> +                 sel[i] = 0;
> >> +             if (!can_vec_perm_p (mode, false, sel))
> >> +               {
> >> +                 if (dump_enabled_p ())
> >> +                   dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> >> +                                    "shuffle of 3 loads is not supported by \
> >> +                                    target\n");
> >
> > Don't use multi-line strings but do
> >
> >                         "shuffle of ..."
> >                         "target\n");
> >
> > instead.
> >
> >> +                   return false;
> >> +               }
> >> +             for (i = 0, j = 0; i < nelt; i++)
> >> +               if (3 * i + k < 2 * nelt)
> >> +                 sel[i] = i;
> >> +               else
> >> +                 sel[i] = nelt + ((nelt + k) % 3) + 3 * (j++);
> >> +             if (!can_vec_perm_p (mode, false, sel))
> >> +               {
> >> +                 if (dump_enabled_p ())
> >> +                   dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> >> +                                    "shuffle of 3 loads is not supported by \
> >> +                                    target\n");
> >
> > Likewise.
> >
> >> +                 return false;
> >> +               }
> >> +           }
> >> +         return true;
> >>         }
> >>      }
> >>
> >>    if (dump_enabled_p ())
> >>      dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> >> -                     "extract even/odd not supported by target\n");
> >> +                    "extract even/odd not supported by target\n");
> >>    return false;
> >>  }
> >>
> >> @@ -4859,8 +4897,9 @@ vect_load_lanes_supported (tree vectype, unsigned HOST_WIDE_INT count)
> >>  /* Function vect_permute_load_chain.
> >>
> >>     Given a chain of interleaved loads in DR_CHAIN of LENGTH that must be
> >> -   a power of 2, generate extract_even/odd stmts to reorder the input data
> >> -   correctly.  Return the final references for loads in RESULT_CHAIN.
> >> +   a power of 2 or equal to 3, generate extract_even/odd stmts to reorder
> >> +   the input data correctly.  Return the final references for loads in
> >> +   RESULT_CHAIN.
> >>
> >>     E.g., LENGTH is 4 and the scalar type is short, i.e., VF is 8.
> >>     The input is 4 vectors each containing 8 elements. We assign a number to each
> >> @@ -4941,6 +4980,7 @@ vect_permute_load_chain (vec<tree> dr_chain,
> >>  {
> >>    tree data_ref, first_vect, second_vect;
> >>    tree perm_mask_even, perm_mask_odd;
> >> +  tree perm3_mask_low, perm3_mask_high;
> >>    gimple perm_stmt;
> >>    tree vectype = STMT_VINFO_VECTYPE (vinfo_for_stmt (stmt));
> >>    unsigned int i, j, log_length = exact_log2 (length);
> >> @@ -4951,45 +4991,99 @@ vect_permute_load_chain (vec<tree> dr_chain,
> >>    memcpy (result_chain->address (), dr_chain.address (),
> >>           length * sizeof (tree));
> >>
> >> -  for (i = 0; i < nelt; ++i)
> >> -    sel[i] = i * 2;
> >> -  perm_mask_even = vect_gen_perm_mask (vectype, sel);
> >> -  gcc_assert (perm_mask_even != NULL);
> >> +  if (log_length != (unsigned int)-1)
> >
> > Same for the if-structure - first handle all special values
> > and then in the else handle power-of-two cases.
> >
> > Ok with those changes.
> >
> > Thanks,
> > Richard.
> >
> >> +    {
> >> +      for (i = 0; i < nelt; ++i)
> >> +       sel[i] = i * 2;
> >> +      perm_mask_even = vect_gen_perm_mask (vectype, sel);
> >> +      gcc_assert (perm_mask_even != NULL);
> >>
> >> -  for (i = 0; i < nelt; ++i)
> >> -    sel[i] = i * 2 + 1;
> >> -  perm_mask_odd = vect_gen_perm_mask (vectype, sel);
> >> -  gcc_assert (perm_mask_odd != NULL);
> >> +      for (i = 0; i < nelt; ++i)
> >> +       sel[i] = i * 2 + 1;
> >> +      perm_mask_odd = vect_gen_perm_mask (vectype, sel);
> >> +      gcc_assert (perm_mask_odd != NULL);
> >>
> >> -  for (i = 0; i < log_length; i++)
> >> -    {
> >> -      for (j = 0; j < length; j += 2)
> >> +      for (i = 0; i < log_length; i++)
> >>         {
> >> -         first_vect = dr_chain[j];
> >> -         second_vect = dr_chain[j+1];
> >> +         for (j = 0; j < length; j += 2)
> >> +           {
> >> +             first_vect = dr_chain[j];
> >> +             second_vect = dr_chain[j+1];
> >> +
> >> +             /* data_ref = permute_even (first_data_ref, second_data_ref);  */
> >> +             data_ref = make_temp_ssa_name (vectype, NULL, "vect_perm_even");
> >> +             perm_stmt = gimple_build_assign_with_ops (VEC_PERM_EXPR, data_ref,
> >> +                                                       first_vect, second_vect,
> >> +                                                       perm_mask_even);
> >> +             vect_finish_stmt_generation (stmt, perm_stmt, gsi);
> >> +             (*result_chain)[j/2] = data_ref;
> >> +
> >> +             /* data_ref = permute_odd (first_data_ref, second_data_ref);  */
> >> +             data_ref = make_temp_ssa_name (vectype, NULL, "vect_perm_odd");
> >> +             perm_stmt = gimple_build_assign_with_ops (VEC_PERM_EXPR, data_ref,
> >> +                                                       first_vect, second_vect,
> >> +                                                       perm_mask_odd);
> >> +             vect_finish_stmt_generation (stmt, perm_stmt, gsi);
> >> +             (*result_chain)[j/2+length/2] = data_ref;
> >> +           }
> >> +         memcpy (dr_chain.address (), result_chain->address (),
> >> +                 length * sizeof (tree));
> >> +       }
> >> +    }
> >> +  /* length is not a power of 2.  */
> >> +  else
> >> +    {
> >> +      unsigned int k;
> >>
> >> -         /* data_ref = permute_even (first_data_ref, second_data_ref);  */
> >> -         data_ref = make_temp_ssa_name (vectype, NULL, "vect_perm_even");
> >> +      /* currently only length 3 is supported as most frequent case.  */
> >> +      gcc_assert (length == 3);
> >> +
> >> +      for (k = 0; k < 3; k++)
> >> +       {
> >> +         for (i = 0; i < nelt; i++)
> >> +           if (3 * i + k < 2 * nelt)
> >> +             sel[i] = 3 * i + k;
> >> +           else
> >> +             sel[i] = 0;
> >> +         perm3_mask_low = vect_gen_perm_mask (vectype, sel);
> >> +         gcc_assert (perm3_mask_low != NULL);
> >> +
> >> +         for (i = 0, j = 0; i < nelt; i++)
> >> +           if (3 * i + k < 2 * nelt)
> >> +             sel[i] = i;
> >> +           else
> >> +             sel[i] = nelt + ((nelt + k) % 3) + 3 * (j++);
> >> +
> >> +         perm3_mask_high = vect_gen_perm_mask (vectype, sel);
> >> +         gcc_assert (perm3_mask_high != NULL);
> >> +
> >> +         first_vect = dr_chain[0];
> >> +         second_vect = dr_chain[1];
> >> +
> >> +         /* Create interleaving stmt (low part of):
> >> +            low = VEC_PERM_EXPR <first_vect, second_vect2, {k, 3 + k, 6 + k,
> >> +                                                            ...}>  */
> >> +         data_ref = make_temp_ssa_name (vectype, NULL, "vect_suffle3_low");
> >>           perm_stmt = gimple_build_assign_with_ops (VEC_PERM_EXPR, data_ref,
> >>                                                     first_vect, second_vect,
> >> -                                                   perm_mask_even);
> >> +                                                   perm3_mask_low);
> >>           vect_finish_stmt_generation (stmt, perm_stmt, gsi);
> >> -         (*result_chain)[j/2] = data_ref;
> >>
> >> -         /* data_ref = permute_odd (first_data_ref, second_data_ref);  */
> >> -         data_ref = make_temp_ssa_name (vectype, NULL, "vect_perm_odd");
> >> +         /* Create interleaving stmt (high part of):
> >> +            high = VEC_PERM_EXPR <first_vect, second_vect2, {k, 3 + k, 6 + k,
> >> +                                                             ...}>  */
> >> +         first_vect = data_ref;
> >> +         second_vect = dr_chain[2];
> >> +         data_ref = make_temp_ssa_name (vectype, NULL, "vect_suffle3_high");
> >>           perm_stmt = gimple_build_assign_with_ops (VEC_PERM_EXPR, data_ref,
> >>                                                     first_vect, second_vect,
> >> -                                                   perm_mask_odd);
> >> +                                                   perm3_mask_high);
> >>           vect_finish_stmt_generation (stmt, perm_stmt, gsi);
> >> -         (*result_chain)[j/2+length/2] = data_ref;
> >> +         (*result_chain)[k] = data_ref;
> >>         }
> >> -      memcpy (dr_chain.address (), result_chain->address (),
> >> -             length * sizeof (tree));
> >>      }
> >>  }
> >>
> >> -
> >>  /* Function vect_transform_grouped_load.
> >>
> >>     Given a chain of input interleaved data-refs (in DR_CHAIN), build statements
> >> diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
> >> index 1a51d6d..b87c143 100644
> >> --- a/gcc/tree-vect-stmts.c
> >> +++ b/gcc/tree-vect-stmts.c
> >> @@ -1091,10 +1091,11 @@ vect_model_load_cost (stmt_vec_info stmt_info, int ncopies,
> >>       include the cost of the permutes.  */
> >>    if (!load_lanes_p && group_size > 1)
> >>      {
> >> -      /* Uses an even and odd extract operations for each needed permute.  */
> >> -      int nstmts = ncopies * exact_log2 (group_size) * group_size;
> >> -      inside_cost += record_stmt_cost (body_cost_vec, nstmts, vec_perm,
> >> -                                      stmt_info, 0, vect_body);
> >> +      /* Uses an even and odd extract operations or shuffle operations
> >> +        for each needed permute.  */
> >> +      int nstmts = ncopies * ceil_log2 (group_size) * group_size;
> >> +      inside_cost = record_stmt_cost (body_cost_vec, nstmts, vec_perm,
> >> +                                     stmt_info, 0, vect_body);
> >>
> >>        if (dump_enabled_p ())
> >>          dump_printf_loc (MSG_NOTE, vect_location,
> >>
> >>
> >> diff --git a/gcc/testsuite/gcc.dg/vect/pr52252-ld.c b/gcc/testsuite/gcc.dg/vect/pr52252-ld.c
> >> new file mode 100644
> >> index 0000000..6e3cb52
> >> --- /dev/null
> >> +++ b/gcc/testsuite/gcc.dg/vect/pr52252-ld.c
> >> @@ -0,0 +1,30 @@
> >> +/* { dg-do compile } */
> >> +/* { dg-options "-O2 -g -ftree-vectorize -mssse3 -fdump-tree-vect-details" { target { i?86-*-* x86_64-*-* } } } */
> >> +
> >> +#define byte unsigned char
> >> +
> >> +void
> >> +matrix_mul (byte *in, byte *out, int size)
> >> +{
> >> +  int i;
> >> +  for (i = 0; i < size; i++)
> >> +    {
> >> +      byte in0 = in[0];
> >> +      byte in1 = in[1];
> >> +      byte in2 = in[2];
> >> +      byte out0, out1, out2, out3;
> >> +      out0 = in0 + in1;
> >> +      out1 = in0 + in2;
> >> +      out2 = in1 + in2;
> >> +      out3 = in0 + in1 + in2;
> >> +      out[0] = out0;
> >> +      out[1] = out1;
> >> +      out[2] = out2;
> >> +      out[3] = out3;
> >> +      in += 3;
> >> +      out += 4;
> >> +    }
> >> +}
> >> +
> >> +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
> >> +/* { dg-final { cleanup-tree-dump "vect" } } */
> >>
> >>
> >> On Wed, Apr 30, 2014 at 6:31 PM, Evgeny Stupachenko <evstupac@gmail.com> wrote:
> >> > Ping.
> >> >
> >> > On Fri, Apr 18, 2014 at 2:05 PM, Evgeny Stupachenko <evstupac@gmail.com> wrote:
> >> >> Hi,
> >> >>
> >> >> Merged with current master the patch passes bootstrap and is giving
> >> >> expected gains.
> >> >> Patch and new tests are attached.
> >> >>
> >> >> ChangeLog:
> >> >>
> >> >> 2014-04-18  Evgeny Stupachenko  <evstupac@gmail.com>
> >> >>
> >> >>         * tree-vect-data-refs.c (vect_grouped_store_supported): New
> >> >>         check for stores group of length 3.
> >> >>         (vect_permute_store_chain): New permutations for stores group of
> >> >>         length 3.
> >> >>         (vect_grouped_load_supported): New check for loads group of length 3.
> >> >>         (vect_permute_load_chain): New permutations for loads group of length 3.
> >> >>         * tree-vect-stmts.c (vect_model_store_cost): Change cost
> >> >>         of vec_perm_shuffle for the new permutations.
> >> >>         (vect_model_load_cost): Ditto.
> >> >>
> >> >> ChangeLog for testsuite:
> >> >>
> >> >> 2014-04-18  Evgeny Stupachenko  <evstupac@gmail.com>
> >> >>
> >> >>        PR tree-optimization/52252
> >> >>        * gcc.dg/vect/pr52252-ld.c: Test on loads group of size 3.
> >> >>        * gcc.dg/vect/pr52252-st.c: Test on stores group of size 3.
> >> >>
> >> >> Evgeny
> >> >>
> >> >> On Thu, Mar 6, 2014 at 6:44 PM, Evgeny Stupachenko <evstupac@gmail.com> wrote:
> >> >>> Missed attachment.
> >> >>>
> >> >>> On Thu, Mar 6, 2014 at 6:42 PM, Evgeny Stupachenko <evstupac@gmail.com> wrote:
> >> >>>> I've separated the patch into 2: cost model tuning and load/store
> >> >>>> groups parallelism.
> >> >>>> SLM tuning was partially introduced in the patch:
> >> >>>> http://gcc.gnu.org/ml/gcc-patches/2014-03/msg00226.html
> >> >>>> The patch introducing vectorization for load/store groups of size 3 attached.
> >> >>>>
> >> >>>> Is it ok for stage1?
> >> >>>>
> >> >>>> ChangeLog:
> >> >>>>
> >> >>>> 2014-03-06  Evgeny Stupachenko  <evstupac@gmail.com>
> >> >>>>
> >> >>>>        * tree-vect-data-refs.c (vect_grouped_store_supported): New
> >> >>>>        check for stores group of length 3.
> >> >>>>        (vect_permute_store_chain): New permutations for stores group of
> >> >>>>        length 3.
> >> >>>>        (vect_grouped_load_supported): New check for loads group of length 3.
> >> >>>>        (vect_permute_load_chain): New permutations for loads group of length 3.
> >> >>>>        * tree-vect-stmts.c (vect_model_store_cost): Change cost
> >> >>>>        of vec_perm_shuffle for the new permutations.
> >> >>>>        (vect_model_load_cost): Ditto.
> >> >>>>
> >> >>>>
> >> >>>>
> >> >>>> On Tue, Feb 11, 2014 at 7:19 PM, Richard Biener <rguenther@suse.de> wrote:
> >> >>>>> On Tue, 11 Feb 2014, Evgeny Stupachenko wrote:
> >> >>>>>
> >> >>>>>> Missed patch attached in plain-text.
> >> >>>>>>
> >> >>>>>> I have copyright assignment on file with the FSF covering work on GCC.
> >> >>>>>>
> >> >>>>>> Load/stores groups of length 3 is the most frequent non-power-of-2
> >> >>>>>> case. It is used in RGB image processing (like test case in PR52252).
> >> >>>>>> For sure we can extend the patch to length 5 and more. However, this
> >> >>>>>> potentially affect performance on some other architectures and
> >> >>>>>> requires larger testing. So length 3 it is just first step.The
> >> >>>>>> algorithm in the patch could be modified for a general case in several
> >> >>>>>> steps.
> >> >>>>>>
> >> >>>>>> I understand that the patch should wait for the stage 1, however since
> >> >>>>>> its ready we can discuss it right now and make some changes (like
> >> >>>>>> general size of group).
> >> >>>>>
> >> >>>>> Other than that I'd like to see a vectorizer hook querying the cost of a
> >> >>>>> vec_perm_const expansion instead of adding vec_perm_shuffle
> >> >>>>> (thus requires the constant shuffle mask to be passed as well
> >> >>>>> as the vector type).  That's more useful for other uses that
> >> >>>>> would require (arbitrary) shuffles.
> >> >>>>>
> >> >>>>> Didn't look at the rest of the patch yet - queued in my review
> >> >>>>> pipeline.
> >> >>>>>
> >> >>>>> Thanks,
> >> >>>>> Richard.
> >> >>>>>
> >> >>>>>> Thanks,
> >> >>>>>> Evgeny
> >> >>>>>>
> >> >>>>>> On Tue, Feb 11, 2014 at 5:00 PM, Richard Biener <rguenther@suse.de> wrote:
> >> >>>>>> >
> >> >>>>>> > On Tue, 11 Feb 2014, Evgeny Stupachenko wrote:
> >> >>>>>> >
> >> >>>>>> > > Hi,
> >> >>>>>> > >
> >> >>>>>> > > The patch gives an expected 3 times gain for the test case in the PR52252
> >> >>>>>> > > (and even 6 times for AVX2).
> >> >>>>>> > > It passes make check and bootstrap on x86.
> >> >>>>>> > > spec2000/spec2006 got no regressions/gains on x86.
> >> >>>>>> > >
> >> >>>>>> > > Is this patch ok?
> >> >>>>>> >
> >> >>>>>> > I've worked on generalizing the permutation support in the light
> >> >>>>>> > of the availability of the generic shuffle support in the IL
> >> >>>>>> > but hit some road-blocks in the way code-generation works for
> >> >>>>>> > group loads with permutations (I don't remember if I posted all patches).
> >> >>>>>> >
> >> >>>>>> > This patch seems to be to a slightly different place but it again
> >> >>>>>> > special-cases a specific permutation.  Why's that?  Why can't we
> >> >>>>>> > support groups of size 7 for example?  So - can this be generalized
> >> >>>>>> > to support arbitrary non-power-of-two load/store groups?
> >> >>>>>> >
> >> >>>>>> > Other than that the patch has to wait for stage1 to open again,
> >> >>>>>> > of course.  And it misses a testcase.
> >> >>>>>> >
> >> >>>>>> > Btw, do you have a copyright assignment on file with the FSF covering
> >> >>>>>> > work on GCC?
> >> >>>>>> >
> >> >>>>>> > Thanks,
> >> >>>>>> > Richard.
> >> >>>>>> >
> >> >>>>>> > > ChangeLog:
> >> >>>>>> > >
> >> >>>>>> > > 2014-02-11  Evgeny Stupachenko  <evstupac@gmail.com>
> >> >>>>>> > >
> >> >>>>>> > >         * target.h (vect_cost_for_stmt): Defining new cost vec_perm_shuffle.
> >> >>>>>> > >         * tree-vect-data-refs.c (vect_grouped_store_supported): New
> >> >>>>>> > >         check for stores group of length 3.
> >> >>>>>> > >         (vect_permute_store_chain): New permutations for stores group of
> >> >>>>>> > >         length 3.
> >> >>>>>> > >         (vect_grouped_load_supported): New check for loads group of length
> >> >>>>>> > > 3.
> >> >>>>>> > >         (vect_permute_load_chain): New permutations for loads group of
> >> >>>>>> > > length 3.
> >> >>>>>> > >         * tree-vect-stmts.c (vect_model_store_cost): New cost
> >> >>>>>> > > vec_perm_shuffle
> >> >>>>>> > >         for the new permutations.
> >> >>>>>> > >         (vect_model_load_cost): Ditto.
> >> >>>>>> > >         * config/aarch64/aarch64.c (builtin_vectorization_cost): Adding
> >> >>>>>> > >         vec_perm_shuffle cost as equvivalent of vec_perm cost.
> >> >>>>>> > >         * config/arm/arm.c: Ditto.
> >> >>>>>> > >         * config/rs6000/rs6000.c: Ditto.
> >> >>>>>> > >         * config/spu/spu.c: Ditto.
> >> >>>>>> > >         * config/i386/x86-tune.def (TARGET_SLOW_PHUFFB): Target for slow
> >> >>>>>> > > byte
> >> >>>>>> > >         shuffle on some x86 architectures.
> >> >>>>>> > >         * config/i386/i386.h (processor_costs): Defining pshuffb cost.
> >> >>>>>> > >         * config/i386/i386.c (processor_costs): Adding pshuffb cost.
> >> >>>>>> > >         (ix86_builtin_vectorization_cost): Adding cost for the new
> >> >>>>>> > > permutations.
> >> >>>>>> > >         Fixing cost for other permutations.
> >> >>>>>> > >         (expand_vec_perm_even_odd_1): Avoid byte shuffles when they are
> >> >>>>>> > >         slow (TARGET_SLOW_PHUFFB).
> >> >>>>>> > >         (ix86_add_stmt_cost): Adding cost when STMT is WIDEN_MULTIPLY.
> >> >>>>>> > >         Adding new shuffle cost only when byte shuffle is expected.
> >> >>>>>> > >         Fixing cost model for Silvermont.
> >> >>>>>> > >
> >> >>>>>> > > Thanks,
> >> >>>>>> > > Evgeny
> >> >>>>>> > >
> >> >>>>>> >
> >> >>>>>> > --
> >> >>>>>> > Richard Biener <rguenther@suse.de>
> >> >>>>>> > SUSE / SUSE Labs
> >> >>>>>> > SUSE LINUX Products GmbH - Nuernberg - AG Nuernberg - HRB 16746
> >> >>>>>> > GF: Jeff Hawn, Jennifer Guild, Felix Imend"orffer
> >> >>>>>>
> >> >>>>>
> >> >>>>> --
> >> >>>>> Richard Biener <rguenther@suse.de>
> >> >>>>> SUSE / SUSE Labs
> >> >>>>> SUSE LINUX Products GmbH - Nuernberg - AG Nuernberg - HRB 16746
> >> >>>>> GF: Jeff Hawn, Jennifer Guild, Felix Imend"orffer
> >>
> >>
> >
> > --
> > Richard Biener <rguenther@suse.de>
> > SUSE / SUSE Labs
> > SUSE LINUX Products GmbH - Nuernberg - AG Nuernberg - HRB 16746
> > GF: Jeff Hawn, Jennifer Guild, Felix Imend"orffer
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE / SUSE Labs
SUSE LINUX Products GmbH - Nuernberg - AG Nuernberg - HRB 16746
GF: Jeff Hawn, Jennifer Guild, Felix Imend"orffer


* Re: [PATCH, PR52252] Vectorization for load/store groups of size 3.
  2014-05-06 14:38                   ` Evgeny Stupachenko
  2014-05-06 14:40                     ` Richard Biener
@ 2014-05-12 15:14                     ` Rainer Orth
  2014-05-12 18:08                       ` Evgeny Stupachenko
  1 sibling, 1 reply; 26+ messages in thread
From: Rainer Orth @ 2014-05-12 15:14 UTC (permalink / raw)
  To: Evgeny Stupachenko
  Cc: Richard Biener, GCC Patches, Jakub Jelinek, Uros Bizjak

Evgeny Stupachenko <evstupac@gmail.com> writes:

> Patch with fixes attached.
> Currently if-structure is as following:
> +      if (count == 3)
> ...
> +      else
> +       {
> +         /* If length is not equal to 3 then only power of 2 is supported.  */
> +         gcc_assert (exact_log2 (count) != -1);
>
> For stores group I've created another mail thread.
[...]
>>> 2014-05-06  Evgeny Stupachenko  <evstupac@gmail.com>
>>>
>>>        PR tree-optimization/52252
>>>        * gcc.dg/vect/pr52252-ld.c: Test on loads group of size 3.

This test FAILs on sparc-sun-solaris2.11, both 32 and 64-bit:

FAIL: gcc.dg/vect/pr52252-ld.c scan-tree-dump-times vect "vectorized 1 loops" 1
FAIL: gcc.dg/vect/pr52252-ld.c -flto -ffat-lto-objects  scan-tree-dump-times vect "vectorized 1 loops" 1

The dumps have

/vol/gcc/src/hg/trunk/local/gcc/testsuite/gcc.dg/vect/pr52252-ld.c:10:3: note: not vectorized: relevant stmt not supported: in0_9 = *in_27;
/vol/gcc/src/hg/trunk/local/gcc/testsuite/gcc.dg/vect/pr52252-ld.c:7:1: note: vectorized 0 loops in function.

	Rainer

-- 
-----------------------------------------------------------------------------
Rainer Orth, Center for Biotechnology, Bielefeld University


* Re: [PATCH, PR52252] Vectorization for load/store groups of size 3.
  2014-05-12 15:14                     ` Rainer Orth
@ 2014-05-12 18:08                       ` Evgeny Stupachenko
  2014-05-13  8:23                         ` Richard Biener
  2014-05-13  8:36                         ` Rainer Orth
  0 siblings, 2 replies; 26+ messages in thread
From: Evgeny Stupachenko @ 2014-05-12 18:08 UTC (permalink / raw)
  To: Rainer Orth; +Cc: Richard Biener, GCC Patches, Jakub Jelinek, Uros Bizjak

The test covers the general changes; however, I was able to test it only
on x86.
I see 2 possible solutions:
1. Restrict the test to x86 only.
2. Modify it so that it passes on sparc-sun-solaris2.

If 2. is not acceptable, I'll create a patch for 1.
Currently I don't see why "in0_9 = *in_27" is not supported. Does the
test fail because of an unsupported permutation?




* Re: [PATCH, PR52252] Vectorization for load/store groups of size 3.
  2014-05-12 18:08                       ` Evgeny Stupachenko
@ 2014-05-13  8:23                         ` Richard Biener
  2014-05-16 13:03                           ` Evgeny Stupachenko
  2014-05-13  8:36                         ` Rainer Orth
  1 sibling, 1 reply; 26+ messages in thread
From: Richard Biener @ 2014-05-13  8:23 UTC (permalink / raw)
  To: Evgeny Stupachenko; +Cc: Rainer Orth, GCC Patches, Jakub Jelinek, Uros Bizjak

On Mon, 12 May 2014, Evgeny Stupachenko wrote:

> The test is on general changes. However I was able to test it on x86 only.
> I see 2 possible solutions:
> 1. Set the test for x86 only.
> 2. Modify it so that it will pass on sparc-sun-solaris2.
> 
> If 2. is not acceptable I'll create patch for 1.
> Currently I don't see why "in0_9 = *in_27" is not supported. Does the
> test fail because of unsupported permutation?

The test uses

/* { dg-options "-O2 -g -ftree-vectorize -mssse3 -fdump-tree-vect-details" { target { i?86-*-* x86_64-*-* } } } */

that's bogus.  You shouldn't add any dg-options.  Instead use proper
dg-require-effective-target checks for the -mssse3 feature you are using.
Note that the dg-final checking is applied regardless of whether the
options above are applied or not.

Why does the test only succeed with -mssse3 btw?

The proper way to restrict the test to a single target is to use

/* { dg-skip-if "why" { ! { x86_64-*-* i?86-*-* } } } */

Sorry for not catching this in the review.

Richard.


-- 
Richard Biener <rguenther@suse.de>
SUSE / SUSE Labs
SUSE LINUX Products GmbH - Nuernberg - AG Nuernberg - HRB 16746
GF: Jeff Hawn, Jennifer Guild, Felix Imend"orffer


* Re: [PATCH, PR52252] Vectorization for load/store groups of size 3.
  2014-05-13  8:23                         ` Richard Biener
@ 2014-05-16 13:03                           ` Evgeny Stupachenko
  2014-05-16 13:11                             ` Rainer Orth
  0 siblings, 1 reply; 26+ messages in thread
From: Evgeny Stupachenko @ 2014-05-16 13:03 UTC (permalink / raw)
  To: Richard Biener; +Cc: Rainer Orth, GCC Patches, Jakub Jelinek, Uros Bizjak

The test uses SSSE3 because of the following restriction in i386.c:

static bool
expand_vec_perm_pshufb2 (struct expand_vec_perm_d *d)
{
  rtx rperm[2][16], vperm, l, h, op, m128;
  unsigned int i, nelt, eltsz;

  if (!TARGET_SSSE3 || GET_MODE_SIZE (d->vmode) != 16)
    return false;

Is the following fix OK?

2014-05-16  Evgeny Stupachenko  <evstupac@gmail.com>

       * gcc.dg/vect/pr52252-ld.c: Fix target for the test.


diff --git a/gcc/testsuite/gcc.dg/vect/pr52252-ld.c b/gcc/testsuite/gcc.dg/vect/pr52252-ld.c
index 6e3cb52..301433b 100644
--- a/gcc/testsuite/gcc.dg/vect/pr52252-ld.c
+++ b/gcc/testsuite/gcc.dg/vect/pr52252-ld.c
@@ -1,5 +1,6 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -g -ftree-vectorize -mssse3 -fdump-tree-vect-details" { target { i?86-*-* x86_64-*-* } } } */
+/* { dg-options "-O2 -g -ftree-vectorize -mssse3 -fdump-tree-vect-details" } */
+/* { dg-skip-if "why" { ! { x86_64-*-* i?86-*-* } } } */

 #define byte unsigned char

On Tue, May 13, 2014 at 12:21 PM, Richard Biener <rguenther@suse.de> wrote:
> On Mon, 12 May 2014, Evgeny Stupachenko wrote:
>
>> The test is on general changes. However I was able to test it on x86 only.
>> I see 2 possible solutions:
>> 1. Set the test for x86 only.
>> 2. Modify it so that it will pass on sparc-sun-solaris2.
>>
>> If 2. is not acceptable I'll create patch for 1.
>> Currently I don't see why "in0_9 = *in_27" is not supported. Does the
>> test fail because of unsupported permutation?
>
> The test uses
>
> /* { dg-options "-O2 -g -ftree-vectorize -mssse3 -fdump-tree-vect-details"
> { target { i?86-*-* x86_64-*-* } } } */
>
> that's bogus.  You shouldn't add any dg-options.  Instead use proper
> dg-effective-target checks for the mssse3 feature you are using.
> Note that the dg-final checking is applied regardless of the options
> above are applied or not.
>
> Why does the test only succeed with -mssse3 btw?
>
> The proper way to restrict the test to a single target is to use
>
> /* { dg-skip-if "why" { ! { x86_64-*-* i?86-*-* } } } */
>
> Sorry for not catching this in the review.
>
> Richard.
>
>>
>> On Mon, May 12, 2014 at 7:14 PM, Rainer Orth
>> <ro@cebitec.uni-bielefeld.de> wrote:
>> > Evgeny Stupachenko <evstupac@gmail.com> writes:
>> >
>> >> Patch with fixes attached.
>> >> Currently if-structure is as following:
>> >> +      if (count == 3)
>> >> ...
>> >> +      else
>> >> +       {
>> >> +         /* If length is not equal to 3 then only power of 2 is supported.  */
>> >> +         gcc_assert (exact_log2 (count) != -1);
>> >>
>> >> For stores group I've created another mail thread.
>> > [...]
>> >>>> 2014-05-06  Evgeny Stupachenko  <evstupac@gmail.com>
>> >>>>
>> >>>>        PR tree-optimization/52252
>> >>>>        * gcc.dg/vect/pr52252-ld.c: Test on loads group of size 3.
>> >
>> > This test FAILs on sparc-sun-solaris2.11, both 32 and 64-bit:
>> >
>> > FAIL: gcc.dg/vect/pr52252-ld.c scan-tree-dump-times vect "vectorized 1 loops" 1
>> > FAIL: gcc.dg/vect/pr52252-ld.c -flto -ffat-lto-objects  scan-tree-dump-times vect "vectorized 1 loops" 1
>> >
>> > The dumps have
>> >
>> > /vol/gcc/src/hg/trunk/local/gcc/testsuite/gcc.dg/vect/pr52252-ld.c:10:3: note: not vectorized: relevant stmt not supported: in0_9 = *in_27;
>> > /vol/gcc/src/hg/trunk/local/gcc/testsuite/gcc.dg/vect/pr52252-ld.c:7:1: note: vectorized 0 loops in function.
>> >
>> >         Rainer
>> >
>> > --
>> > -----------------------------------------------------------------------------
>> > Rainer Orth, Center for Biotechnology, Bielefeld University
>>
>>
>
> --
> Richard Biener <rguenther@suse.de>
> SUSE / SUSE Labs
> SUSE LINUX Products GmbH - Nuernberg - AG Nuernberg - HRB 16746
> GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer

* Re: [PATCH, PR52252] Vectorization for load/store groups of size 3.
  2014-05-16 13:03                           ` Evgeny Stupachenko
@ 2014-05-16 13:11                             ` Rainer Orth
  2014-05-16 13:22                               ` Jakub Jelinek
  2014-05-16 13:33                               ` Evgeny Stupachenko
  0 siblings, 2 replies; 26+ messages in thread
From: Rainer Orth @ 2014-05-16 13:11 UTC (permalink / raw)
  To: Evgeny Stupachenko
  Cc: Richard Biener, GCC Patches, Jakub Jelinek, Uros Bizjak

Hi Evgeny,

> Is the following fix OK?
>
> 2014-05-16  Evgeny Stupachenko  <evstupac@gmail.com>
>
>        * gcc.dg/vect/pr52252-ld.c: Fix target for the test.
>
>
> diff --git a/gcc/testsuite/gcc.dg/vect/pr52252-ld.c
> b/gcc/testsuite/gcc.dg/vect/pr52252-ld.c
> index 6e3cb52..301433b 100644
> --- a/gcc/testsuite/gcc.dg/vect/pr52252-ld.c
> +++ b/gcc/testsuite/gcc.dg/vect/pr52252-ld.c
> @@ -1,5 +1,6 @@
>  /* { dg-do compile } */
> -/* { dg-options "-O2 -g -ftree-vectorize -mssse3
> -fdump-tree-vect-details" { target { i?86-*-* x86_64-*-* } } } */
> +/* { dg-options "-O2 -g -ftree-vectorize -mssse3 -fdump-tree-vect-details" } */
> +/* { dg-skip-if "why" { ! { x86_64-*-* i?86-*-* } } } */

If the test is really x86 specific, move it to gcc.target/i386 and
remove the dg-skip-if.  Otherwise, add an explanation for skipping the
test on other targets to the first arg of dg-skip-if.  This is supposed
to be a comment stating why the test is skipped, not "why" literally.

Approval or rejection of the testcase is up to the target maintainers.

	Rainer

-- 
-----------------------------------------------------------------------------
Rainer Orth, Center for Biotechnology, Bielefeld University

* Re: [PATCH, PR52252] Vectorization for load/store groups of size 3.
  2014-05-16 13:11                             ` Rainer Orth
@ 2014-05-16 13:22                               ` Jakub Jelinek
  2014-05-28 10:52                                 ` Evgeny Stupachenko
  2014-05-16 13:33                               ` Evgeny Stupachenko
  1 sibling, 1 reply; 26+ messages in thread
From: Jakub Jelinek @ 2014-05-16 13:22 UTC (permalink / raw)
  To: Rainer Orth; +Cc: Evgeny Stupachenko, Richard Biener, GCC Patches, Uros Bizjak

On Fri, May 16, 2014 at 03:11:05PM +0200, Rainer Orth wrote:
> Hi Evgeny,
> 
> > Is the following fix OK?
> >
> > 2014-05-16  Evgeny Stupachenko  <evstupac@gmail.com>
> >
> >        * gcc.dg/vect/pr52252-ld.c: Fix target for the test.
> >
> >
> > diff --git a/gcc/testsuite/gcc.dg/vect/pr52252-ld.c
> > b/gcc/testsuite/gcc.dg/vect/pr52252-ld.c
> > index 6e3cb52..301433b 100644
> > --- a/gcc/testsuite/gcc.dg/vect/pr52252-ld.c
> > +++ b/gcc/testsuite/gcc.dg/vect/pr52252-ld.c
> > @@ -1,5 +1,6 @@
> >  /* { dg-do compile } */
> > -/* { dg-options "-O2 -g -ftree-vectorize -mssse3
> > -fdump-tree-vect-details" { target { i?86-*-* x86_64-*-* } } } */
> > +/* { dg-options "-O2 -g -ftree-vectorize -mssse3 -fdump-tree-vect-details" } */
> > +/* { dg-skip-if "why" { ! { x86_64-*-* i?86-*-* } } } */
> 
> If the test is really x86 specific, move it to gcc.target/i386 and
> remove the dg-skip-if.  Otherwise, add an explanation for skipping the
> test on other targets to the first arg of dg-skip-if.  This is supposed
> to be a comment stating why the test is skipped, not "why" literally.

Well, I don't see anything i?86/x86_64 specific in the test.  What
is specific is the -mssse3, which presumably should be added through
/* { dg-additional-options "-mssse3" { target { i?86-*-* x86_64-*-* } } } */
and then the test might not necessarily be vectorized (so the
dg-final line may need a target guard as well).
But, I see no reason not to try to compile this on other targets.

	Jakub
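
A minimal sketch of the layout suggested above, for illustration only: no
dg-options line, -mssse3 added through dg-additional-options on x86 only, and
the dg-final scan guarded by the same target selector.  The function name
group3_load and its loop body are assumptions made for this example; they are
not the contents of pr52252-ld.c.

```c
/* Hypothetical test file; the loop body is illustrative and is not the
   actual pr52252-ld.c from the patch.  */
/* { dg-do compile } */
/* { dg-additional-options "-mssse3" { target { i?86-*-* x86_64-*-* } } } */

#define byte unsigned char

/* Each iteration reads a group of 3 adjacent bytes (stride 3) and writes
   a group of 4 adjacent bytes (stride 4).  */
void
group3_load (byte *in, byte *out, int size)
{
  int i;
  for (i = 0; i < size; i++)
    {
      byte in0 = in[0];
      byte in1 = in[1];
      byte in2 = in[2];
      out[0] = in0 + in1;
      out[1] = in0 + in2;
      out[2] = in1 + in2;
      out[3] = in0 + in1 + in2;
      in += 3;
      out += 4;
    }
}

/* The scan is only checked on x86; other targets still compile the file.  */
/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target { i?86-*-* x86_64-*-* } } } } */
/* { dg-final { cleanup-tree-dump "vect" } } */
```

With this shape the file is compiled on every target, and only the
"vectorized 1 loops" check is restricted to x86.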

* Re: [PATCH, PR52252] Vectorization for load/store groups of size 3.
  2014-05-16 13:22                               ` Jakub Jelinek
@ 2014-05-28 10:52                                 ` Evgeny Stupachenko
  2014-05-28 10:55                                   ` Jakub Jelinek
  2014-05-28 10:55                                   ` Evgeny Stupachenko
  0 siblings, 2 replies; 26+ messages in thread
From: Evgeny Stupachenko @ 2014-05-28 10:52 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Rainer Orth, Richard Biener, GCC Patches, Uros Bizjak

Is the following fix OK?

2014-05-28  Evgeny Stupachenko  <evstupac@gmail.com>

       * gcc.dg/vect/pr52252-ld.c: Fix target and options for the test.

diff --git a/gcc/testsuite/gcc.dg/vect/pr52252-ld.c
b/gcc/testsuite/gcc.dg/vect/pr52252-ld.c
index 6e3cb52..57e8468 100644
--- a/gcc/testsuite/gcc.dg/vect/pr52252-ld.c
+++ b/gcc/testsuite/gcc.dg/vect/pr52252-ld.c
@@ -1,6 +1,6 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -g -ftree-vectorize -mssse3
-fdump-tree-vect-details" { target { i?86-*-* x86_64-*
-
+/* { dg-options "-O2 -g -ftree-vectorize -fdump-tree-vect-details"  } */
+/* { dg-additional-options "-mssse3" { target { i?86-*-* x86_64-*-* } } } */
 #define byte unsigned char

 void
@@ -26,5 +26,5 @@ matrix_mul (byte *in, byte *out, int size)
     }
 }

-/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" {
target { i?86-*-* x86_64-*-* } } }
 /* { dg-final { cleanup-tree-dump "vect" } } */

On Fri, May 16, 2014 at 5:21 PM, Jakub Jelinek <jakub@redhat.com> wrote:
> On Fri, May 16, 2014 at 03:11:05PM +0200, Rainer Orth wrote:
>> Hi Evgeny,
>>
>> > Is the following fix OK?
>> >
>> > 2014-05-16  Evgeny Stupachenko  <evstupac@gmail.com>
>> >
>> >        * gcc.dg/vect/pr52252-ld.c: Fix target for the test.
>> >
>> >
>> > diff --git a/gcc/testsuite/gcc.dg/vect/pr52252-ld.c
>> > b/gcc/testsuite/gcc.dg/vect/pr52252-ld.c
>> > index 6e3cb52..301433b 100644
>> > --- a/gcc/testsuite/gcc.dg/vect/pr52252-ld.c
>> > +++ b/gcc/testsuite/gcc.dg/vect/pr52252-ld.c
>> > @@ -1,5 +1,6 @@
>> >  /* { dg-do compile } */
>> > -/* { dg-options "-O2 -g -ftree-vectorize -mssse3
>> > -fdump-tree-vect-details" { target { i?86-*-* x86_64-*-* } } } */
>> > +/* { dg-options "-O2 -g -ftree-vectorize -mssse3 -fdump-tree-vect-details" } */
>> > +/* { dg-skip-if "why" { ! { x86_64-*-* i?86-*-* } } } */
>>
>> If the test is really x86 specific, move it to gcc.target/i386 and
>> remove the dg-skip-if.  Otherwise, add an explanation for skipping the
>> test on other targets to the first arg of dg-skip-if.  This is supposed
>> to be a comment stating why the test is skipped, not "why" literally.
>
> Well, I don't see anything i?86/x86_64 specific in the test.  What
> is specific is the -mssse3, which presumably should be added through
> /* { dg-additional-options "-mssse3" { target { i?86-*-* x86_64-*-* } } } */
> and then the test might not necessarily be vectorized (so the
> dg-final line may need a target guard as well).
> But, I see no reason not to try to compile this on other targets.
>
>         Jakub

* Re: [PATCH, PR52252] Vectorization for load/store groups of size 3.
  2014-05-28 10:52                                 ` Evgeny Stupachenko
@ 2014-05-28 10:55                                   ` Jakub Jelinek
  2014-05-28 11:33                                     ` Evgeny Stupachenko
  2014-05-28 10:55                                   ` Evgeny Stupachenko
  1 sibling, 1 reply; 26+ messages in thread
From: Jakub Jelinek @ 2014-05-28 10:55 UTC (permalink / raw)
  To: Evgeny Stupachenko; +Cc: Rainer Orth, Richard Biener, GCC Patches, Uros Bizjak

On Wed, May 28, 2014 at 02:51:57PM +0400, Evgeny Stupachenko wrote:
> Is the following fix OK?
> 
> 2014-05-28  Evgeny Stupachenko  <evstupac@gmail.com>
> 
>        * gcc.dg/vect/pr52252-ld.c: Fix target and options for the test.
> 
> diff --git a/gcc/testsuite/gcc.dg/vect/pr52252-ld.c
> b/gcc/testsuite/gcc.dg/vect/pr52252-ld.c
> index 6e3cb52..57e8468 100644
> --- a/gcc/testsuite/gcc.dg/vect/pr52252-ld.c
> +++ b/gcc/testsuite/gcc.dg/vect/pr52252-ld.c
> @@ -1,6 +1,6 @@
>  /* { dg-do compile } */
> -/* { dg-options "-O2 -g -ftree-vectorize -mssse3
> -fdump-tree-vect-details" { target { i?86-*-* x86_64-*
> -
> +/* { dg-options "-O2 -g -ftree-vectorize -fdump-tree-vect-details"  } */

dg-options should not be used in g*.dg/vect/* at all.
Not sure about -g, but the other options are provided by default already,
and shouldn't be overridden.

	Jakub

* Re: [PATCH, PR52252] Vectorization for load/store groups of size 3.
  2014-05-28 10:55                                   ` Jakub Jelinek
@ 2014-05-28 11:33                                     ` Evgeny Stupachenko
  2014-05-28 11:35                                       ` Jakub Jelinek
  0 siblings, 1 reply; 26+ messages in thread
From: Evgeny Stupachenko @ 2014-05-28 11:33 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Rainer Orth, Richard Biener, GCC Patches, Uros Bizjak

Ok. Fixed. Test still passes on x86:

diff --git a/gcc/testsuite/gcc.dg/vect/pr52252-ld.c
b/gcc/testsuite/gcc.dg/vect/pr52252-ld.c
index 6e3cb52..e37b177 100644
--- a/gcc/testsuite/gcc.dg/vect/pr52252-ld.c
+++ b/gcc/testsuite/gcc.dg/vect/pr52252-ld.c
@@ -1,6 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -g -ftree-vectorize -mssse3
-fdump-tree-vect-details" { target { i?86-*-* x86_64-*-* } } } */
-
+/* { dg-additional-options "-mssse3" { target { i?86-*-* x86_64-*-* } } } */
 #define byte unsigned char

 void
@@ -26,5 +25,5 @@ matrix_mul (byte *in, byte *out, int size)
     }
 }

-/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" {
target { i?86-*-* x86_64-*-* } } } } */
 /* { dg-final { cleanup-tree-dump "vect" } } */

On Wed, May 28, 2014 at 2:55 PM, Jakub Jelinek <jakub@redhat.com> wrote:
> On Wed, May 28, 2014 at 02:51:57PM +0400, Evgeny Stupachenko wrote:
>> Is the following fix OK?
>>
>> 2014-05-28  Evgeny Stupachenko  <evstupac@gmail.com>
>>
>>        * gcc.dg/vect/pr52252-ld.c: Fix target and options for the test.
>>
>> diff --git a/gcc/testsuite/gcc.dg/vect/pr52252-ld.c
>> b/gcc/testsuite/gcc.dg/vect/pr52252-ld.c
>> index 6e3cb52..57e8468 100644
>> --- a/gcc/testsuite/gcc.dg/vect/pr52252-ld.c
>> +++ b/gcc/testsuite/gcc.dg/vect/pr52252-ld.c
>> @@ -1,6 +1,6 @@
>>  /* { dg-do compile } */
>> -/* { dg-options "-O2 -g -ftree-vectorize -mssse3
>> -fdump-tree-vect-details" { target { i?86-*-* x86_64-*
>> -
>> +/* { dg-options "-O2 -g -ftree-vectorize -fdump-tree-vect-details"  } */
>
> dg-options should not be used in g*.dg/vect/* at all.
> Not sure about -g, but the other options are provided by default already,
> and shouldn't be overridden.
>
>         Jakub

* Re: [PATCH, PR52252] Vectorization for load/store groups of size 3.
  2014-05-28 11:33                                     ` Evgeny Stupachenko
@ 2014-05-28 11:35                                       ` Jakub Jelinek
  0 siblings, 0 replies; 26+ messages in thread
From: Jakub Jelinek @ 2014-05-28 11:35 UTC (permalink / raw)
  To: Evgeny Stupachenko; +Cc: Rainer Orth, Richard Biener, GCC Patches, Uros Bizjak

On Wed, May 28, 2014 at 03:33:15PM +0400, Evgeny Stupachenko wrote:
> Ok. Fixed. Test still passes on x86:

Ok.

> --- a/gcc/testsuite/gcc.dg/vect/pr52252-ld.c
> +++ b/gcc/testsuite/gcc.dg/vect/pr52252-ld.c
> @@ -1,6 +1,5 @@
>  /* { dg-do compile } */
> -/* { dg-options "-O2 -g -ftree-vectorize -mssse3
> -fdump-tree-vect-details" { target { i?86-*-* x86_64-*-* } } } */
> -
> +/* { dg-additional-options "-mssse3" { target { i?86-*-* x86_64-*-* } } } */
>  #define byte unsigned char
> 
>  void
> @@ -26,5 +25,5 @@ matrix_mul (byte *in, byte *out, int size)
>      }
>  }
> 
> -/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
> +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" {
> target { i?86-*-* x86_64-*-* } } } } */
>  /* { dg-final { cleanup-tree-dump "vect" } } */
> 

	Jakub

* Re: [PATCH, PR52252] Vectorization for load/store groups of size 3.
  2014-05-28 10:52                                 ` Evgeny Stupachenko
  2014-05-28 10:55                                   ` Jakub Jelinek
@ 2014-05-28 10:55                                   ` Evgeny Stupachenko
  1 sibling, 0 replies; 26+ messages in thread
From: Evgeny Stupachenko @ 2014-05-28 10:55 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Rainer Orth, Richard Biener, GCC Patches, Uros Bizjak

Missed some line tails. Corrected patch below:

diff --git a/gcc/testsuite/gcc.dg/vect/pr52252-ld.c
b/gcc/testsuite/gcc.dg/vect/pr52252-ld.c
index 6e3cb52..57e8468 100644
--- a/gcc/testsuite/gcc.dg/vect/pr52252-ld.c
+++ b/gcc/testsuite/gcc.dg/vect/pr52252-ld.c
@@ -1,6 +1,6 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -g -ftree-vectorize -mssse3
-fdump-tree-vect-details" { target { i?86-*-* x86_64-*-* } } } */
-
+/* { dg-options "-O2 -g -ftree-vectorize -fdump-tree-vect-details"  } */
+/* { dg-additional-options "-mssse3" { target { i?86-*-* x86_64-*-* } } } */
 #define byte unsigned char

 void
@@ -26,5 +26,5 @@ matrix_mul (byte *in, byte *out, int size)
     }
 }

-/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" {
target { i?86-*-* x86_64-*-* } } } } */
 /* { dg-final { cleanup-tree-dump "vect" } } */

On Wed, May 28, 2014 at 2:51 PM, Evgeny Stupachenko <evstupac@gmail.com> wrote:
> Is the following fix OK?
>
> 2014-05-28  Evgeny Stupachenko  <evstupac@gmail.com>
>
>        * gcc.dg/vect/pr52252-ld.c: Fix target and options for the test.
>
> diff --git a/gcc/testsuite/gcc.dg/vect/pr52252-ld.c
> b/gcc/testsuite/gcc.dg/vect/pr52252-ld.c
> index 6e3cb52..57e8468 100644
> --- a/gcc/testsuite/gcc.dg/vect/pr52252-ld.c
> +++ b/gcc/testsuite/gcc.dg/vect/pr52252-ld.c
> @@ -1,6 +1,6 @@
>  /* { dg-do compile } */
> -/* { dg-options "-O2 -g -ftree-vectorize -mssse3
> -fdump-tree-vect-details" { target { i?86-*-* x86_64-*
> -
> +/* { dg-options "-O2 -g -ftree-vectorize -fdump-tree-vect-details"  } */
> +/* { dg-additional-options "-mssse3" { target { i?86-*-* x86_64-*-* } } } */
>  #define byte unsigned char
>
>  void
> @@ -26,5 +26,5 @@ matrix_mul (byte *in, byte *out, int size)
>      }
>  }
>
> -/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
> +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" {
> target { i?86-*-* x86_64-*-* } } }
>  /* { dg-final { cleanup-tree-dump "vect" } } */
>
> On Fri, May 16, 2014 at 5:21 PM, Jakub Jelinek <jakub@redhat.com> wrote:
>> On Fri, May 16, 2014 at 03:11:05PM +0200, Rainer Orth wrote:
>>> Hi Evgeny,
>>>
>>> > Is the following fix OK?
>>> >
>>> > 2014-05-16  Evgeny Stupachenko  <evstupac@gmail.com>
>>> >
>>> >        * gcc.dg/vect/pr52252-ld.c: Fix target for the test.
>>> >
>>> >
>>> > diff --git a/gcc/testsuite/gcc.dg/vect/pr52252-ld.c
>>> > b/gcc/testsuite/gcc.dg/vect/pr52252-ld.c
>>> > index 6e3cb52..301433b 100644
>>> > --- a/gcc/testsuite/gcc.dg/vect/pr52252-ld.c
>>> > +++ b/gcc/testsuite/gcc.dg/vect/pr52252-ld.c
>>> > @@ -1,5 +1,6 @@
>>> >  /* { dg-do compile } */
>>> > -/* { dg-options "-O2 -g -ftree-vectorize -mssse3
>>> > -fdump-tree-vect-details" { target { i?86-*-* x86_64-*-* } } } */
>>> > +/* { dg-options "-O2 -g -ftree-vectorize -mssse3 -fdump-tree-vect-details" } */
>>> > +/* { dg-skip-if "why" { ! { x86_64-*-* i?86-*-* } } } */
>>>
>>> If the test is really x86 specific, move it to gcc.target/i386 and
>>> remove the dg-skip-if.  Otherwise, add an explanation for skipping the
>>> test on other targets to the first arg of dg-skip-if.  This is supposed
>>> to be a comment stating why the test is skipped, not "why" literally.
>>
>> Well, I don't see anything i?86/x86_64 specific in the test.  What
>> is specific is the -mssse3, which presumably should be added through
>> /* { dg-additional-options "-mssse3" { target { i?86-*-* x86_64-*-* } } } */
>> and then the test might not necessarily be vectorized (so the
>> dg-final line may need a target guard as well).
>> But, I see no reason not to try to compile this on other targets.
>>
>>         Jakub

* Re: [PATCH, PR52252] Vectorization for load/store groups of size 3.
  2014-05-16 13:11                             ` Rainer Orth
  2014-05-16 13:22                               ` Jakub Jelinek
@ 2014-05-16 13:33                               ` Evgeny Stupachenko
  2014-05-16 13:38                                 ` Rainer Orth
  1 sibling, 1 reply; 26+ messages in thread
From: Evgeny Stupachenko @ 2014-05-16 13:33 UTC (permalink / raw)
  To: Rainer Orth; +Cc: Richard Biener, GCC Patches, Jakub Jelinek, Uros Bizjak

I'm not sure about other architectures. I can only test it on x86. Most
likely it will pass on Arm, but again I'm not sure that { arm*-*-* }
is OK.

The test is for general (target-independent) changes. So we can wait for
others and, if there are no more objections, leave sparc-sun-solaris2 as
the target to skip.

Or change to the following:
/* { dg-skip-if "The test should pass on x86, other architectures are
untested" { ! { x86_64-*-* i?86-*-* } } } */

That way others can add their targets if necessary.

Thanks,
Evgeny

* Re: [PATCH, PR52252] Vectorization for load/store groups of size 3.
  2014-05-16 13:33                               ` Evgeny Stupachenko
@ 2014-05-16 13:38                                 ` Rainer Orth
  0 siblings, 0 replies; 26+ messages in thread
From: Rainer Orth @ 2014-05-16 13:38 UTC (permalink / raw)
  To: Evgeny Stupachenko
  Cc: Richard Biener, GCC Patches, Jakub Jelinek, Uros Bizjak

Evgeny Stupachenko <evstupac@gmail.com> writes:

> The test is for general (target-independent) changes. So we can wait for
> others and, if there are no more objections, leave sparc-sun-solaris2 as
> the target to skip.

If so, use sparc*-*-* instead.

> Or change to the following:
> /* { dg-skip-if "The test should pass on x86, other architectures are
> untested" { ! { x86_64-*-* i?86-*-* } } } */
>
> That way others can add their targets if necessary.

I wouldn't do it this way, because this will never happen.  Rather,
start with all targets and skip or xfail if necessary, adding an
explanation why this is necessary.

	Rainer

-- 
-----------------------------------------------------------------------------
Rainer Orth, Center for Biotechnology, Bielefeld University
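
A minimal sketch of the "start with all targets and skip where necessary"
approach, assuming sparc is the only target known to fail.  The explanation
string reuses the vectorizer dump message reported for sparc; the function
sum3 is a hypothetical placeholder, not the real test body.

```c
/* Hypothetical test header; the function below is a placeholder.  */
/* { dg-do compile } */
/* The first argument of dg-skip-if states why the test is skipped.  */
/* { dg-skip-if "shuffle of 3 loads is not supported by target" { sparc*-*-* } } */

/* Reads one group of 3 adjacent bytes and returns their wrapped sum.  */
unsigned char
sum3 (const unsigned char *in)
{
  return (unsigned char) (in[0] + in[1] + in[2]);
}
```

The selector lists only the targets that are known to fail, so every other
target keeps running the test.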

* Re: [PATCH, PR52252] Vectorization for load/store groups of size 3.
  2014-05-12 18:08                       ` Evgeny Stupachenko
  2014-05-13  8:23                         ` Richard Biener
@ 2014-05-13  8:36                         ` Rainer Orth
  2014-05-14 12:09                           ` Evgeny Stupachenko
  1 sibling, 1 reply; 26+ messages in thread
From: Rainer Orth @ 2014-05-13  8:36 UTC (permalink / raw)
  To: Evgeny Stupachenko
  Cc: Richard Biener, GCC Patches, Jakub Jelinek, Uros Bizjak

Evgeny Stupachenko <evstupac@gmail.com> writes:

> The test is for general (target-independent) changes. However, I was able
> to test it on x86 only.
> I see 2 possible solutions:
> 1. Set the test for x86 only.
> 2. Modify it so that it will pass on sparc-sun-solaris2.
>
> If 2. is not acceptable I'll create a patch for 1.
> Currently I don't see why "in0_9 = *in_27" is not supported. Does the
> test fail because of unsupported permutation?

The .vect dump has

gcc/testsuite/gcc.dg/vect/pr52252-ld.c:10:3: note: ==> examining statement: in0_9 = *in_27;

gcc/testsuite/gcc.dg/vect/pr52252-ld.c:10:3: note: vect_is_simple_use: operand *in_27
gcc/testsuite/gcc.dg/vect/pr52252-ld.c:10:3: note: not ssa-name.
gcc/testsuite/gcc.dg/vect/pr52252-ld.c:10:3: note: use not simple.
gcc/testsuite/gcc.dg/vect/pr52252-ld.c:10:3: note: vect_is_simple_use: operand *in_27
gcc/testsuite/gcc.dg/vect/pr52252-ld.c:10:3: note: not ssa-name.
gcc/testsuite/gcc.dg/vect/pr52252-ld.c:10:3: note: use not simple.
gcc/testsuite/gcc.dg/vect/pr52252-ld.c:10:3: note: no array mode for V8QI[3]
gcc/testsuite/gcc.dg/vect/pr52252-ld.c:10:3: note: shuffle of 3 loads is not supported by target
gcc/testsuite/gcc.dg/vect/pr52252-ld.c:10:3: note: not vectorized: relevant stmt not supported: in0_9 = *in_27;

I can send you the full dump if necessary.

	Rainer

-- 
-----------------------------------------------------------------------------
Rainer Orth, Center for Biotechnology, Bielefeld University

* Re: [PATCH, PR52252] Vectorization for load/store groups of size 3.
  2014-05-13  8:36                         ` Rainer Orth
@ 2014-05-14 12:09                           ` Evgeny Stupachenko
  0 siblings, 0 replies; 26+ messages in thread
From: Evgeny Stupachenko @ 2014-05-14 12:09 UTC (permalink / raw)
  To: Rainer Orth; +Cc: Richard Biener, GCC Patches, Jakub Jelinek, Uros Bizjak

It seems that "shuffle of 3 loads is not supported by target" is the root cause.
A permutation like 0, 3, 6, 9, c, ... is not supported by the target.
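
To make the index pattern concrete, here is a small sketch that computes the
selector for one field of a load group and applies it to the concatenated
input vectors.  The helpers strided_selector and apply_selector are
hypothetical names for this illustration; they are not GCC functions and do
not reproduce how the vectorizer builds its permutation masks.

```c
#include <assert.h>

/* Fill SEL with the element indices that gather field FIELD out of
   GROUP_SIZE vectors of NELT elements laid end to end.
   Hypothetical helper for illustration; not a GCC function.  */
static void
strided_selector (unsigned group_size, unsigned field, unsigned nelt,
                  unsigned *sel)
{
  assert (group_size > 0 && field < group_size);
  for (unsigned i = 0; i < nelt; i++)
    sel[i] = field + i * group_size;
}

/* Apply SEL to SRC (the concatenated input vectors), writing NELT
   elements to DST.  */
static void
apply_selector (const unsigned char *src, const unsigned *sel,
                unsigned nelt, unsigned char *dst)
{
  for (unsigned i = 0; i < nelt; i++)
    dst[i] = src[sel[i]];
}
```

For a group of 3 with 8-element vectors (the V8QI case in the dump), field 0
gives the indices 0, 3, 6, 9, 12, 15, 18, 21.  These reach into three input
vectors, while a single VEC_PERM_EXPR selects from two, so the target has to
support a suitable sequence of permutations.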

Thread overview: 26+ messages
     [not found] <CAOvf_xxEQ3tm+fwL5EfVSOUKDQnaBt+jTz4huK66T_8+TXzzfQ@mail.gmail.com>
2014-02-11 13:01 ` [PATCH, PR52252] Vectorization for load/store groups of size 3 Richard Biener
2014-02-11 14:03   ` Evgeny Stupachenko
2014-02-11 15:20     ` Richard Biener
2014-03-06 14:42       ` Evgeny Stupachenko
2014-03-06 14:44         ` Evgeny Stupachenko
2014-04-18 10:05           ` Evgeny Stupachenko
2014-04-30 15:03             ` Evgeny Stupachenko
2014-05-06 11:27               ` Evgeny Stupachenko
2014-05-06 11:49                 ` Richard Biener
2014-05-06 14:38                   ` Evgeny Stupachenko
2014-05-06 14:40                     ` Richard Biener
2014-05-12 15:14                     ` Rainer Orth
2014-05-12 18:08                       ` Evgeny Stupachenko
2014-05-13  8:23                         ` Richard Biener
2014-05-16 13:03                           ` Evgeny Stupachenko
2014-05-16 13:11                             ` Rainer Orth
2014-05-16 13:22                               ` Jakub Jelinek
2014-05-28 10:52                                 ` Evgeny Stupachenko
2014-05-28 10:55                                   ` Jakub Jelinek
2014-05-28 11:33                                     ` Evgeny Stupachenko
2014-05-28 11:35                                       ` Jakub Jelinek
2014-05-28 10:55                                   ` Evgeny Stupachenko
2014-05-16 13:33                               ` Evgeny Stupachenko
2014-05-16 13:38                                 ` Rainer Orth
2014-05-13  8:36                         ` Rainer Orth
2014-05-14 12:09                           ` Evgeny Stupachenko
