[PATCH][1/2] Fix PR68553

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

* [PATCH][1/2] Fix PR68553
@ 2015-11-27  8:36 Richard Biener
  2015-12-04 15:32 ` Alan Lawrence
  0 siblings, 1 reply; 6+ messages in thread
From: Richard Biener @ 2015-11-27  8:36 UTC (permalink / raw)
  To: gcc-patches


This is part 1 of a fix for PR68533 which shows that some targets
cannot can_vec_perm_p on an identity permutation.  I chose to fix
this in the vectorizer by detecting the identity itself but with
the current structure of vect_transform_slp_perm_load this is
somewhat awkward.  Thus the following no-op patch simplifies it
greatly (from the times it was restricted to do interleaving-kind
of permutes).  It turned out to not be 100% no-op as we now can
handle non-adjacent source operands so I split it out from the
actual fix.

The two adjusted testcases no longer fail to vectorize because
of "need three vectors" but unadjusted would fail because there
are simply not enough scalar iterations in the loop.  I adjusted
that and now we vectorize it just fine (running into PR68559
which I filed).

Bootstrapped and tested on x86_64-unknown-linux-gnu, applied.

Richard.

2015-11-27  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/68553
	* tree-vect-slp.c (vect_get_mask_element): Remove.
	(vect_transform_slp_perm_load): Implement in a simpler way.

	* gcc.dg/vect/pr45752.c: Adjust.
	* gcc.dg/vect/slp-perm-4.c: Likewise.

Index: gcc/tree-vect-slp.c
===================================================================
*** gcc/tree-vect-slp.c	(revision 230962)
--- gcc/tree-vect-slp.c	(working copy)
*************** vect_create_mask_and_perm (gimple *stmt,
*** 3241,3342 ****
  }
  
  
- /* Given FIRST_MASK_ELEMENT - the mask element in element representation,
-    return in CURRENT_MASK_ELEMENT its equivalent in target specific
-    representation.  Check that the mask is valid and return FALSE if not.
-    Return TRUE in NEED_NEXT_VECTOR if the permutation requires to move to
-    the next vector, i.e., the current first vector is not needed.  */
- 
- static bool
- vect_get_mask_element (gimple *stmt, int first_mask_element, int m,
-                        int mask_nunits, bool only_one_vec, int index,
- 		       unsigned char *mask, int *current_mask_element,
-                        bool *need_next_vector, int *number_of_mask_fixes,
-                        bool *mask_fixed, bool *needs_first_vector)
- {
-   int i;
- 
-   /* Convert to target specific representation.  */
-   *current_mask_element = first_mask_element + m;
-   /* Adjust the value in case it's a mask for second and third vectors.  */
-   *current_mask_element -= mask_nunits * (*number_of_mask_fixes - 1);
- 
-   if (*current_mask_element < 0)
-     {
-       if (dump_enabled_p ())
- 	{
- 	  dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
- 			   "permutation requires past vector ");
- 	  dump_gimple_stmt (MSG_MISSED_OPTIMIZATION, TDF_SLIM, stmt, 0);
- 	  dump_printf (MSG_MISSED_OPTIMIZATION, "\n");
- 	}
-       return false;
-     }
- 
-   if (*current_mask_element < mask_nunits)
-     *needs_first_vector = true;
- 
-   /* We have only one input vector to permute but the mask accesses values in
-      the next vector as well.  */
-   if (only_one_vec && *current_mask_element >= mask_nunits)
-     {
-       if (dump_enabled_p ())
-         {
-           dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
- 			   "permutation requires at least two vectors ");
-           dump_gimple_stmt (MSG_MISSED_OPTIMIZATION, TDF_SLIM, stmt, 0);
-           dump_printf (MSG_MISSED_OPTIMIZATION, "\n");
-         }
- 
-       return false;
-     }
- 
-   /* The mask requires the next vector.  */
-   while (*current_mask_element >= mask_nunits * 2)
-     {
-       if (*needs_first_vector || *mask_fixed)
-         {
-           /* We either need the first vector too or have already moved to the
-              next vector. In both cases, this permutation needs three
-              vectors.  */
-           if (dump_enabled_p ())
-             {
-               dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
- 			       "permutation requires at "
- 			       "least three vectors ");
-               dump_gimple_stmt (MSG_MISSED_OPTIMIZATION, TDF_SLIM, stmt, 0);
-               dump_printf (MSG_MISSED_OPTIMIZATION, "\n");
-             }
- 
-           return false;
-         }
- 
-       /* We move to the next vector, dropping the first one and working with
-          the second and the third - we need to adjust the values of the mask
-          accordingly.  */
-       *current_mask_element -= mask_nunits * *number_of_mask_fixes;
- 
-       for (i = 0; i < index; i++)
-         mask[i] -= mask_nunits * *number_of_mask_fixes;
- 
-       (*number_of_mask_fixes)++;
-       *mask_fixed = true;
-     }
- 
-   *need_next_vector = *mask_fixed;
- 
-   /* This was the last element of this mask. Start a new one.  */
-   if (index == mask_nunits - 1)
-     {
-       *number_of_mask_fixes = 1;
-       *mask_fixed = false;
-       *needs_first_vector = false;
-     }
- 
-   return true;
- }
- 
- 
  /* Generate vector permute statements from a list of loads in DR_CHAIN.
     If ANALYZE_ONLY is TRUE, only check that it is possible to create valid
     permute statements for the SLP node NODE of the SLP instance
--- 3270,3275 ----
*************** vect_transform_slp_perm_load (slp_tree n
*** 3350,3366 ****
    gimple *stmt = SLP_TREE_SCALAR_STMTS (node)[0];
    stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
    tree mask_element_type = NULL_TREE, mask_type;
!   int i, j, k, nunits, vec_index = 0;
    tree vectype = STMT_VINFO_VECTYPE (stmt_info);
    int group_size = SLP_INSTANCE_GROUP_SIZE (slp_node_instance);
!   int first_mask_element;
!   int index, unroll_factor, current_mask_element, ncopies;
    unsigned char *mask;
-   bool only_one_vec = false, need_next_vector = false;
-   int first_vec_index, second_vec_index, orig_vec_stmts_num, vect_stmts_counter;
-   int number_of_mask_fixes = 1;
-   bool mask_fixed = false;
-   bool needs_first_vector = false;
    machine_mode mode;
  
    if (!STMT_VINFO_GROUPED_ACCESS (stmt_info))
--- 3283,3293 ----
    gimple *stmt = SLP_TREE_SCALAR_STMTS (node)[0];
    stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
    tree mask_element_type = NULL_TREE, mask_type;
!   int nunits, vec_index = 0;
    tree vectype = STMT_VINFO_VECTYPE (stmt_info);
    int group_size = SLP_INSTANCE_GROUP_SIZE (slp_node_instance);
!   int unroll_factor, mask_element, ncopies;
    unsigned char *mask;
    machine_mode mode;
  
    if (!STMT_VINFO_GROUPED_ACCESS (stmt_info))
*************** vect_transform_slp_perm_load (slp_tree n
*** 3391,3405 ****
    mask = XALLOCAVEC (unsigned char, nunits);
    unroll_factor = SLP_INSTANCE_UNROLLING_FACTOR (slp_node_instance);
  
-   /* The number of vector stmts to generate based only on SLP_NODE_INSTANCE
-      unrolling factor.  */
-   orig_vec_stmts_num
-     = (STMT_VINFO_GROUP_SIZE (stmt_info)
-        * SLP_INSTANCE_UNROLLING_FACTOR (slp_node_instance)
-        + nunits - 1) / nunits;
-   if (orig_vec_stmts_num == 1)
-     only_one_vec = true;
- 
    /* Number of copies is determined by the final vectorization factor
       relatively to SLP_NODE_INSTANCE unrolling factor.  */
    ncopies = vf / SLP_INSTANCE_UNROLLING_FACTOR (slp_node_instance);
--- 3318,3323 ----
*************** vect_transform_slp_perm_load (slp_tree n
*** 3422,3496 ****
       we need the second and the third vectors: {b1,c1,a2,b2} and
       {c2,a3,b3,c3}.  */
  
!   {
!       index = 0;
!       vect_stmts_counter = 0;
!       vec_index = 0;
!       first_vec_index = vec_index++;
!       if (only_one_vec)
!         second_vec_index = first_vec_index;
!       else
!         second_vec_index =  vec_index++;
  
!       for (j = 0; j < unroll_factor; j++)
!         {
!           for (k = 0; k < group_size; k++)
!             {
! 	      i = SLP_TREE_LOAD_PERMUTATION (node)[k];
!               first_mask_element = i + j * STMT_VINFO_GROUP_SIZE (stmt_info);
!               if (!vect_get_mask_element (stmt, first_mask_element, 0,
! 					  nunits, only_one_vec, index,
! 					  mask, &current_mask_element,
! 					  &need_next_vector,
! 					  &number_of_mask_fixes, &mask_fixed,
! 					  &needs_first_vector))
! 		return false;
! 	      gcc_assert (current_mask_element >= 0
! 			  && current_mask_element < 2 * nunits);
! 	      mask[index++] = current_mask_element;
  
!               if (index == nunits)
!                 {
! 		  index = 0;
! 		  if (!can_vec_perm_p (mode, false, mask))
  		    {
! 		      if (dump_enabled_p ())
! 			{
! 			  dump_printf_loc (MSG_MISSED_OPTIMIZATION,
! 					   vect_location, 
! 					   "unsupported vect permute { ");
! 			  for (i = 0; i < nunits; ++i)
! 			    dump_printf (MSG_MISSED_OPTIMIZATION, "%d ",
! 					 mask[i]);
! 			  dump_printf (MSG_MISSED_OPTIMIZATION, "}\n");
! 			}
! 		      return false;
  		    }
  
!                   if (!analyze_only)
!                     {
! 		      int l;
! 		      tree mask_vec, *mask_elts;
! 		      mask_elts = XALLOCAVEC (tree, nunits);
! 		      for (l = 0; l < nunits; ++l)
! 			mask_elts[l] = build_int_cst (mask_element_type,
! 						      mask[l]);
! 		      mask_vec = build_vector (mask_type, mask_elts);
! 
! 		      if (need_next_vector)
!                         {
!                           first_vec_index = second_vec_index;
!                           second_vec_index = vec_index;
!                         }
! 
!                       vect_create_mask_and_perm (stmt,
!                                mask_vec, first_vec_index, second_vec_index,
! 			       gsi, node, vectype, dr_chain,
! 			       ncopies, vect_stmts_counter++);
!                     }
!                 }
!             }
!         }
      }
  
    return true;
--- 3340,3424 ----
       we need the second and the third vectors: {b1,c1,a2,b2} and
       {c2,a3,b3,c3}.  */
  
!   int vect_stmts_counter = 0;
!   int index = 0;
!   int first_vec_index = -1;
!   int second_vec_index = -1;
  
!   for (int j = 0; j < unroll_factor; j++)
!     {
!       for (int k = 0; k < group_size; k++)
! 	{
! 	  int i = (SLP_TREE_LOAD_PERMUTATION (node)[k]
! 		   + j * STMT_VINFO_GROUP_SIZE (stmt_info));
! 	  vec_index = i / nunits;
! 	  mask_element = i % nunits;
! 	  if (vec_index == first_vec_index
! 	      || first_vec_index == -1)
! 	    {
! 	      first_vec_index = vec_index;
! 	    }
! 	  else if (vec_index == second_vec_index
! 		   || second_vec_index == -1)
! 	    {
! 	      second_vec_index = vec_index;
! 	      mask_element += nunits;
! 	    }
! 	  else
! 	    {
! 	      if (dump_enabled_p ())
! 		{
! 		  dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
! 				   "permutation requires at "
! 				   "least three vectors ");
! 		  dump_gimple_stmt (MSG_MISSED_OPTIMIZATION, TDF_SLIM,
! 				    stmt, 0);
! 		  dump_printf (MSG_MISSED_OPTIMIZATION, "\n");
! 		}
! 	      return false;
! 	    }
  
! 	  gcc_assert (mask_element >= 0
! 		      && mask_element < 2 * nunits);
! 	  mask[index++] = mask_element;
! 
! 	  if (index == nunits)
! 	    {
! 	      if (!can_vec_perm_p (mode, false, mask))
! 		{
! 		  if (dump_enabled_p ())
  		    {
! 		      dump_printf_loc (MSG_MISSED_OPTIMIZATION,
! 				       vect_location, 
! 				       "unsupported vect permute { ");
! 		      for (i = 0; i < nunits; ++i)
! 			dump_printf (MSG_MISSED_OPTIMIZATION, "%d ", mask[i]);
! 		      dump_printf (MSG_MISSED_OPTIMIZATION, "}\n");
  		    }
+ 		  return false;
+ 		}
  
! 	      if (!analyze_only)
! 		{
! 		  tree mask_vec, *mask_elts;
! 		  mask_elts = XALLOCAVEC (tree, nunits);
! 		  for (int l = 0; l < nunits; ++l)
! 		    mask_elts[l] = build_int_cst (mask_element_type, mask[l]);
! 		  mask_vec = build_vector (mask_type, mask_elts);
! 
! 		  if (second_vec_index == -1)
! 		    second_vec_index = first_vec_index;
! 		  vect_create_mask_and_perm (stmt, mask_vec, first_vec_index,
! 					     second_vec_index,
! 					     gsi, node, vectype, dr_chain,
! 					     ncopies, vect_stmts_counter++);
! 		}
! 
! 	      index = 0;
! 	      first_vec_index = -1;
! 	      second_vec_index = -1;
! 	    }
! 	}
      }
  
    return true;
Index: gcc/testsuite/gcc.dg/vect/pr45752.c
===================================================================
*** gcc/testsuite/gcc.dg/vect/pr45752.c	(revision 230962)
--- gcc/testsuite/gcc.dg/vect/pr45752.c	(working copy)
***************
*** 33,39 ****
  #define M34 7716
  #define M44 16
  
! #define N 16
  
  void foo (unsigned int *__restrict__ pInput,
            unsigned int *__restrict__ pOutput,
--- 33,39 ----
  #define M34 7716
  #define M44 16
  
! #define N 40
  
  void foo (unsigned int *__restrict__ pInput,
            unsigned int *__restrict__ pOutput,
*************** void foo (unsigned int *__restrict__ pIn
*** 75,84 ****
  int main (int argc, const char* argv[])
  {
    unsigned int input[N], output[N], i, input2[N], output2[N];
!   unsigned int check_results[N] = {3208, 1334, 28764, 35679, 2789, 13028,
!     4754, 168364, 91254, 12399, 22848, 8174, 307964, 146829, 22009, 0};
!   unsigned int check_results2[N] = {7136, 2702, 84604, 57909, 6633, 16956,
!     6122, 224204, 113484, 16243, 26776, 9542, 363804, 169059, 25853, 0};
  
    check_vect ();
  
--- 75,90 ----
  int main (int argc, const char* argv[])
  {
    unsigned int input[N], output[N], i, input2[N], output2[N];
!   unsigned int check_results[N]
!     = {3208, 1334, 28764, 35679, 2789, 13028, 4754, 168364, 91254, 12399, 
!     22848, 8174, 307964, 146829, 22009, 32668, 11594, 447564, 202404, 31619, 
!     42488, 15014, 587164, 257979, 41229, 52308, 18434, 726764, 313554, 50839, 
!     62128, 21854, 866364, 369129, 60449, 71948, 25274, 1005964, 424704, 70059};
!   unsigned int check_results2[N]
!     = {7136, 2702, 84604, 57909, 6633, 16956, 6122, 224204, 113484, 16243, 
!     26776, 9542, 363804, 169059, 25853, 36596, 12962, 503404, 224634, 35463, 
!     46416, 16382, 643004, 280209, 45073, 56236, 19802, 782604, 335784, 54683, 
!     66056, 23222, 922204, 391359, 64293, 75876, 26642, 1061804, 446934, 73903};
  
    check_vect ();
  
*************** int main (int argc, const char* argv[])
*** 101,107 ****
    return 0;
  }
  
! /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 0 "vect"  } } */
! /* { dg-final { scan-tree-dump "permutation requires at least three vectors" "vect" { target vect_perm } } } */
! /* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 0 "vect"  } } */
! 
--- 107,111 ----
    return 0;
  }
  
! /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect"  } } */
! /* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect" { target vect_perm } } } */
Index: gcc/testsuite/gcc.dg/vect/slp-perm-4.c
===================================================================
*** gcc/testsuite/gcc.dg/vect/slp-perm-4.c	(revision 230962)
--- gcc/testsuite/gcc.dg/vect/slp-perm-4.c	(working copy)
***************
*** 33,39 ****
  #define M34 7716
  #define M44 16
  
! #define N 16
  
  void foo (unsigned int *__restrict__ pInput, unsigned int *__restrict__ pOutput)
  {
--- 33,39 ----
  #define M34 7716
  #define M44 16
  
! #define N 40
  
  void foo (unsigned int *__restrict__ pInput, unsigned int *__restrict__ pOutput)
  {
*************** void foo (unsigned int *__restrict__ pIn
*** 58,64 ****
  int main (int argc, const char* argv[])
  {
    unsigned int input[N], output[N], i;
!   unsigned int check_results[N] = {3208, 1334, 28764, 35679, 2789, 13028, 4754, 168364, 91254, 12399, 22848, 8174, 307964, 146829, 22009, 0};
  
    check_vect ();
  
--- 58,68 ----
  int main (int argc, const char* argv[])
  {
    unsigned int input[N], output[N], i;
!   unsigned int check_results[N]
!     = {3208, 1334, 28764, 35679, 2789, 13028, 4754, 168364, 91254, 12399, 
!     22848, 8174, 307964, 146829, 22009, 32668, 11594, 447564, 202404, 31619, 
!     42488, 15014, 587164, 257979, 41229, 52308, 18434, 726764, 313554, 50839, 
!     62128, 21854, 866364, 369129, 60449, 71948, 25274, 1005964, 424704, 70059};
  
    check_vect ();
  
*************** int main (int argc, const char* argv[])
*** 80,86 ****
    return 0;
  }
  
! /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 0 "vect"  } } */
! /* { dg-final { scan-tree-dump "permutation requires at least three vectors" "vect" { target vect_perm} } } */
! /* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 0 "vect"  } } */
  
--- 84,89 ----
    return 0;
  }
  
! /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
! /* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { target vect_perm } } } */
  

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH][1/2] Fix PR68553
  2015-11-27  8:36 [PATCH][1/2] Fix PR68553 Richard Biener
@ 2015-12-04 15:32 ` Alan Lawrence
  2015-12-04 16:04   ` Richard Biener
  0 siblings, 1 reply; 6+ messages in thread
From: Alan Lawrence @ 2015-12-04 15:32 UTC (permalink / raw)
  To: Richard Biener, gcc-patches

On 27/11/15 08:30, Richard Biener wrote:
>
> This is part 1 of a fix for PR68533 which shows that some targets
> cannot can_vec_perm_p on an identity permutation.  I chose to fix
> this in the vectorizer by detecting the identity itself but with
> the current structure of vect_transform_slp_perm_load this is
> somewhat awkward.  Thus the following no-op patch simplifies it
> greatly (from the times it was restricted to do interleaving-kind
> of permutes).  It turned out to not be 100% no-op as we now can
> handle non-adjacent source operands so I split it out from the
> actual fix.
>
> The two adjusted testcases no longer fail to vectorize because
> of "need three vectors" but unadjusted would fail because there
> are simply not enough scalar iterations in the loop.  I adjusted
> that and now we vectorize it just fine (running into PR68559
> which I filed).
>
> Bootstrapped and tested on x86_64-unknown-linux-gnu, applied.
>
> Richard.
>
> 2015-11-27  Richard Biener  <rguenther@suse.de>
>
> 	PR tree-optimization/68553
> 	* tree-vect-slp.c (vect_get_mask_element): Remove.
> 	(vect_transform_slp_perm_load): Implement in a simpler way.
>
> 	* gcc.dg/vect/pr45752.c: Adjust.
> 	* gcc.dg/vect/slp-perm-4.c: Likewise.

On aarch64 and ARM targets, this causes

PASS->FAIL: gcc.dg/vect/O3-pr36098.c scan-tree-dump-times vect "vectorizing 
stmts using SLP" 0

That is, we now vectorize using SLP, when previously we did not.

On aarch64 (and I expect ARM too), previously we used a VEC_LOAD_LANES, without 
unrolling, but now we unroll * 4, and vectorize using 3 loads and permutes:

../gcc/gcc/testsuite/gcc.dg/vect/O3-pr36098.c:15:2: note: add new stmt: 
vect__31.15_94 = VEC_PERM_EXPR <vect__31.11_87, vect__31.12_89, { 0, 1, 2, 4 }>;
../gcc/gcc/testsuite/gcc.dg/vect/O3-pr36098.c:15:2: note: add new stmt: 
vect__31.16_95 = VEC_PERM_EXPR <vect__31.12_89, vect__31.13_91, { 1, 2, 4, 5 }>;
../gcc/gcc/testsuite/gcc.dg/vect/O3-pr36098.c:15:2: note: add new stmt: 
vect__31.17_96 = VEC_PERM_EXPR <vect__31.13_91, vect__31.14_93, { 2, 4, 5, 6 }>

which *is* a valid vectorization strategy...


--Alan

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH][1/2] Fix PR68553
  2015-12-04 15:32 ` Alan Lawrence
@ 2015-12-04 16:04   ` Richard Biener
  2015-12-04 17:46     ` Ramana Radhakrishnan
  0 siblings, 1 reply; 6+ messages in thread
From: Richard Biener @ 2015-12-04 16:04 UTC (permalink / raw)
  To: Alan Lawrence, gcc-patches

On December 4, 2015 4:32:33 PM GMT+01:00, Alan Lawrence <alan.lawrence@arm.com> wrote:
>On 27/11/15 08:30, Richard Biener wrote:
>>
>> This is part 1 of a fix for PR68533 which shows that some targets
>> cannot can_vec_perm_p on an identity permutation.  I chose to fix
>> this in the vectorizer by detecting the identity itself but with
>> the current structure of vect_transform_slp_perm_load this is
>> somewhat awkward.  Thus the following no-op patch simplifies it
>> greatly (from the times it was restricted to do interleaving-kind
>> of permutes).  It turned out to not be 100% no-op as we now can
>> handle non-adjacent source operands so I split it out from the
>> actual fix.
>>
>> The two adjusted testcases no longer fail to vectorize because
>> of "need three vectors" but unadjusted would fail because there
>> are simply not enough scalar iterations in the loop.  I adjusted
>> that and now we vectorize it just fine (running into PR68559
>> which I filed).
>>
>> Bootstrapped and tested on x86_64-unknown-linux-gnu, applied.
>>
>> Richard.
>>
>> 2015-11-27  Richard Biener  <rguenther@suse.de>
>>
>> 	PR tree-optimization/68553
>> 	* tree-vect-slp.c (vect_get_mask_element): Remove.
>> 	(vect_transform_slp_perm_load): Implement in a simpler way.
>>
>> 	* gcc.dg/vect/pr45752.c: Adjust.
>> 	* gcc.dg/vect/slp-perm-4.c: Likewise.
>
>On aarch64 and ARM targets, this causes
>
>PASS->FAIL: gcc.dg/vect/O3-pr36098.c scan-tree-dump-times vect
>"vectorizing 
>stmts using SLP" 0
>
>That is, we now vectorize using SLP, when previously we did not.
>
>On aarch64 (and I expect ARM too), previously we used a VEC_LOAD_LANES,
>without 
>unrolling, 
but now we unroll * 4, and vectorize using 3 loads and
>permutes:

Happens on x86_64 as well with at least Sse4.1.  Unfortunately we'll have to start introducing much more fine-grained target-supports for vect_perm to reliably guard all targets.

Richard.

>../gcc/gcc/testsuite/gcc.dg/vect/O3-pr36098.c:15:2: note: add new stmt:
>
>vect__31.15_94 = VEC_PERM_EXPR <vect__31.11_87, vect__31.12_89, { 0, 1,
>2, 4 }>;
>../gcc/gcc/testsuite/gcc.dg/vect/O3-pr36098.c:15:2: note: add new stmt:
>
>vect__31.16_95 = VEC_PERM_EXPR <vect__31.12_89, vect__31.13_91, { 1, 2,
>4, 5 }>;
>../gcc/gcc/testsuite/gcc.dg/vect/O3-pr36098.c:15:2: note: add new stmt:
>
>vect__31.17_96 = VEC_PERM_EXPR <vect__31.13_91, vect__31.14_93, { 2, 4,
>5, 6 }>
>
>which *is* a valid vectorization strategy...
>
>
>--Alan


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH][1/2] Fix PR68553
  2015-12-04 16:04   ` Richard Biener
@ 2015-12-04 17:46     ` Ramana Radhakrishnan
  2015-12-04 18:23       ` Alan Lawrence
  0 siblings, 1 reply; 6+ messages in thread
From: Ramana Radhakrishnan @ 2015-12-04 17:46 UTC (permalink / raw)
  To: Richard Biener, Alan Lawrence, gcc-patches



On 04/12/15 16:04, Richard Biener wrote:
> On December 4, 2015 4:32:33 PM GMT+01:00, Alan Lawrence <alan.lawrence@arm.com> wrote:
>> On 27/11/15 08:30, Richard Biener wrote:
>>>
>>> This is part 1 of a fix for PR68533 which shows that some targets
>>> cannot can_vec_perm_p on an identity permutation.  I chose to fix
>>> this in the vectorizer by detecting the identity itself but with
>>> the current structure of vect_transform_slp_perm_load this is
>>> somewhat awkward.  Thus the following no-op patch simplifies it
>>> greatly (from the times it was restricted to do interleaving-kind
>>> of permutes).  It turned out to not be 100% no-op as we now can
>>> handle non-adjacent source operands so I split it out from the
>>> actual fix.
>>>
>>> The two adjusted testcases no longer fail to vectorize because
>>> of "need three vectors" but unadjusted would fail because there
>>> are simply not enough scalar iterations in the loop.  I adjusted
>>> that and now we vectorize it just fine (running into PR68559
>>> which I filed).
>>>
>>> Bootstrapped and tested on x86_64-unknown-linux-gnu, applied.
>>>
>>> Richard.
>>>
>>> 2015-11-27  Richard Biener  <rguenther@suse.de>
>>>
>>> 	PR tree-optimization/68553
>>> 	* tree-vect-slp.c (vect_get_mask_element): Remove.
>>> 	(vect_transform_slp_perm_load): Implement in a simpler way.
>>>
>>> 	* gcc.dg/vect/pr45752.c: Adjust.
>>> 	* gcc.dg/vect/slp-perm-4.c: Likewise.
>>
>> On aarch64 and ARM targets, this causes
>>
>> PASS->FAIL: gcc.dg/vect/O3-pr36098.c scan-tree-dump-times vect
>> "vectorizing 
>> stmts using SLP" 0
>>
>> That is, we now vectorize using SLP, when previously we did not.
>>
>> On aarch64 (and I expect ARM too), previously we used a VEC_LOAD_LANES,
>> without 
>> unrolling, 
> but now we unroll * 4, and vectorize using 3 loads and
>> permutes:
> 
> Happens on x86_64 as well with at least Sse4.1.  Unfortunately we'll have to start introducing much more fine-grained target-supports for vect_perm to reliably guard all targets.

I don't know enough about SSE4.1 to know whether it's a problem there or not. This is an actual regression on AArch64 and ARM and not just a testism, you now get :

.L5:
        ldr     q0, [x5, 16]
        add     x4, x4, 48
        ldr     q1, [x5, 32]
        add     w6, w6, 1
        ldr     q4, [x5, 48]
        cmp     w3, w6
        ldr     q2, [x5], 64
        orr     v3.16b, v0.16b, v0.16b
        orr     v5.16b, v4.16b, v4.16b
        orr     v4.16b, v1.16b, v1.16b
        tbl     v0.16b, {v0.16b - v1.16b}, v6.16b
        tbl     v2.16b, {v2.16b - v3.16b}, v7.16b
        tbl     v4.16b, {v4.16b - v5.16b}, v16.16b
        str     q0, [x4, -32]
        str     q2, [x4, -48]
        str     q4, [x4, -16]
        bhi     .L5

instead of 

.L5:
        ld4     {v4.4s - v7.4s}, [x7], 64
        add     w4, w4, 1
        cmp     w3, w4
        orr     v1.16b, v4.16b, v4.16b
        orr     v2.16b, v5.16b, v5.16b
        orr     v3.16b, v6.16b, v6.16b
        st3     {v1.4s - v3.4s}, [x6], 48
        bhi     .L5

LD4 and ST3 do all the permutes without needing actual permute instructions - a strategy that favours generic permutes avoiding the load_lanes case is likely to be more expensive on most implementations. I think worth a PR atleast.

regards
Ramana








> 
> Richard.
> 
>> ../gcc/gcc/testsuite/gcc.dg/vect/O3-pr36098.c:15:2: note: add new stmt:
>>
>> vect__31.15_94 = VEC_PERM_EXPR <vect__31.11_87, vect__31.12_89, { 0, 1,
>> 2, 4 }>;
>> ../gcc/gcc/testsuite/gcc.dg/vect/O3-pr36098.c:15:2: note: add new stmt:
>>
>> vect__31.16_95 = VEC_PERM_EXPR <vect__31.12_89, vect__31.13_91, { 1, 2,
>> 4, 5 }>;
>> ../gcc/gcc/testsuite/gcc.dg/vect/O3-pr36098.c:15:2: note: add new stmt:
>>
>> vect__31.17_96 = VEC_PERM_EXPR <vect__31.13_91, vect__31.14_93, { 2, 4,
>> 5, 6 }>
>>
>> which *is* a valid vectorization strategy...
>>
>>
>> --Alan
> 
> 

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH][1/2] Fix PR68553
  2015-12-04 17:46     ` Ramana Radhakrishnan
@ 2015-12-04 18:23       ` Alan Lawrence
  2015-12-07  8:24         ` Richard Biener
  0 siblings, 1 reply; 6+ messages in thread
From: Alan Lawrence @ 2015-12-04 18:23 UTC (permalink / raw)
  To: Ramana Radhakrishnan, Richard Biener, gcc-patches

On 04/12/15 17:46, Ramana Radhakrishnan wrote:
>
>
> On 04/12/15 16:04, Richard Biener wrote:
>> On December 4, 2015 4:32:33 PM GMT+01:00, Alan Lawrence <alan.lawrence@arm.com> wrote:
>>> On 27/11/15 08:30, Richard Biener wrote:
>>>>
>>>> This is part 1 of a fix for PR68533 which shows that some targets
>>>> cannot can_vec_perm_p on an identity permutation.  I chose to fix
>>>> this in the vectorizer by detecting the identity itself but with
>>>> the current structure of vect_transform_slp_perm_load this is
>>>> somewhat awkward.  Thus the following no-op patch simplifies it
>>>> greatly (from the times it was restricted to do interleaving-kind
>>>> of permutes).  It turned out to not be 100% no-op as we now can
>>>> handle non-adjacent source operands so I split it out from the
>>>> actual fix.
>>>>
>>>> The two adjusted testcases no longer fail to vectorize because
>>>> of "need three vectors" but unadjusted would fail because there
>>>> are simply not enough scalar iterations in the loop.  I adjusted
>>>> that and now we vectorize it just fine (running into PR68559
>>>> which I filed).
>>>>
>>>> Bootstrapped and tested on x86_64-unknown-linux-gnu, applied.
>>>>
>>>> Richard.
>>>>
>>>> 2015-11-27  Richard Biener  <rguenther@suse.de>
>>>>
>>>> 	PR tree-optimization/68553
>>>> 	* tree-vect-slp.c (vect_get_mask_element): Remove.
>>>> 	(vect_transform_slp_perm_load): Implement in a simpler way.
>>>>
>>>> 	* gcc.dg/vect/pr45752.c: Adjust.
>>>> 	* gcc.dg/vect/slp-perm-4.c: Likewise.
>>>
>>> On aarch64 and ARM targets, this causes
>>>
>>> PASS->FAIL: gcc.dg/vect/O3-pr36098.c scan-tree-dump-times vect
>>> "vectorizing
>>> stmts using SLP" 0
>>>
>>> That is, we now vectorize using SLP, when previously we did not.
>>>
>>> On aarch64 (and I expect ARM too), previously we used a VEC_LOAD_LANES,
>>> without
>>> unrolling,
>> but now we unroll * 4, and vectorize using 3 loads and
>>> permutes:
>>
>> Happens on x86_64 as well with at least Sse4.1.  Unfortunately we'll have to start introducing much more fine-grained target-supports for vect_perm to reliably guard all targets.
>
> I don't know enough about SSE4.1 to know whether it's a problem there or not. This is an actual regression on AArch64 and ARM and not just a testism, you now get :
>
> .L5:
>          ldr     q0, [x5, 16]
>          add     x4, x4, 48
>          ldr     q1, [x5, 32]
>          add     w6, w6, 1
>          ldr     q4, [x5, 48]
>          cmp     w3, w6
>          ldr     q2, [x5], 64
>          orr     v3.16b, v0.16b, v0.16b
>          orr     v5.16b, v4.16b, v4.16b
>          orr     v4.16b, v1.16b, v1.16b
>          tbl     v0.16b, {v0.16b - v1.16b}, v6.16b
>          tbl     v2.16b, {v2.16b - v3.16b}, v7.16b
>          tbl     v4.16b, {v4.16b - v5.16b}, v16.16b
>          str     q0, [x4, -32]
>          str     q2, [x4, -48]
>          str     q4, [x4, -16]
>          bhi     .L5
>
> instead of
>
> .L5:
>          ld4     {v4.4s - v7.4s}, [x7], 64
>          add     w4, w4, 1
>          cmp     w3, w4
>          orr     v1.16b, v4.16b, v4.16b
>          orr     v2.16b, v5.16b, v5.16b
>          orr     v3.16b, v6.16b, v6.16b
>          st3     {v1.4s - v3.4s}, [x6], 48
>          bhi     .L5
>
> LD4 and ST3 do all the permutes without needing actual permute instructions - a strategy that favours generic permutes avoiding the load_lanes case is likely to be more expensive on most implementations. I think worth a PR atleast.
>
> regards
> Ramana
>

Yes, quite right. PR 68707.

--Alan

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH][1/2] Fix PR68553
  2015-12-04 18:23       ` Alan Lawrence
@ 2015-12-07  8:24         ` Richard Biener
  0 siblings, 0 replies; 6+ messages in thread
From: Richard Biener @ 2015-12-07  8:24 UTC (permalink / raw)
  To: Alan Lawrence; +Cc: Ramana Radhakrishnan, gcc-patches

On Fri, 4 Dec 2015, Alan Lawrence wrote:

> On 04/12/15 17:46, Ramana Radhakrishnan wrote:
> > 
> > 
> > On 04/12/15 16:04, Richard Biener wrote:
> > > On December 4, 2015 4:32:33 PM GMT+01:00, Alan Lawrence
> > > <alan.lawrence@arm.com> wrote:
> > > > On 27/11/15 08:30, Richard Biener wrote:
> > > > > 
> > > > > This is part 1 of a fix for PR68533 which shows that some targets
> > > > > cannot can_vec_perm_p on an identity permutation.  I chose to fix
> > > > > this in the vectorizer by detecting the identity itself but with
> > > > > the current structure of vect_transform_slp_perm_load this is
> > > > > somewhat awkward.  Thus the following no-op patch simplifies it
> > > > > greatly (from the times it was restricted to do interleaving-kind
> > > > > of permutes).  It turned out to not be 100% no-op as we now can
> > > > > handle non-adjacent source operands so I split it out from the
> > > > > actual fix.
> > > > > 
> > > > > The two adjusted testcases no longer fail to vectorize because
> > > > > of "need three vectors" but unadjusted would fail because there
> > > > > are simply not enough scalar iterations in the loop.  I adjusted
> > > > > that and now we vectorize it just fine (running into PR68559
> > > > > which I filed).
> > > > > 
> > > > > Bootstrapped and tested on x86_64-unknown-linux-gnu, applied.
> > > > > 
> > > > > Richard.
> > > > > 
> > > > > 2015-11-27  Richard Biener  <rguenther@suse.de>
> > > > > 
> > > > > 	PR tree-optimization/68553
> > > > > 	* tree-vect-slp.c (vect_get_mask_element): Remove.
> > > > > 	(vect_transform_slp_perm_load): Implement in a simpler way.
> > > > > 
> > > > > 	* gcc.dg/vect/pr45752.c: Adjust.
> > > > > 	* gcc.dg/vect/slp-perm-4.c: Likewise.
> > > > 
> > > > On aarch64 and ARM targets, this causes
> > > > 
> > > > PASS->FAIL: gcc.dg/vect/O3-pr36098.c scan-tree-dump-times vect
> > > > "vectorizing
> > > > stmts using SLP" 0
> > > > 
> > > > That is, we now vectorize using SLP, when previously we did not.
> > > > 
> > > > On aarch64 (and I expect ARM too), previously we used a VEC_LOAD_LANES,
> > > > without
> > > > unrolling,
> > > but now we unroll * 4, and vectorize using 3 loads and
> > > > permutes:
> > > 
> > > Happens on x86_64 as well with at least Sse4.1.  Unfortunately we'll have
> > > to start introducing much more fine-grained target-supports for vect_perm
> > > to reliably guard all targets.
> > 
> > I don't know enough about SSE4.1 to know whether it's a problem there or
> > not. This is an actual regression on AArch64 and ARM and not just a testism,
> > you now get :
> > 
> > .L5:
> >          ldr     q0, [x5, 16]
> >          add     x4, x4, 48
> >          ldr     q1, [x5, 32]
> >          add     w6, w6, 1
> >          ldr     q4, [x5, 48]
> >          cmp     w3, w6
> >          ldr     q2, [x5], 64
> >          orr     v3.16b, v0.16b, v0.16b
> >          orr     v5.16b, v4.16b, v4.16b
> >          orr     v4.16b, v1.16b, v1.16b
> >          tbl     v0.16b, {v0.16b - v1.16b}, v6.16b
> >          tbl     v2.16b, {v2.16b - v3.16b}, v7.16b
> >          tbl     v4.16b, {v4.16b - v5.16b}, v16.16b
> >          str     q0, [x4, -32]
> >          str     q2, [x4, -48]
> >          str     q4, [x4, -16]
> >          bhi     .L5
> > 
> > instead of
> > 
> > .L5:
> >          ld4     {v4.4s - v7.4s}, [x7], 64
> >          add     w4, w4, 1
> >          cmp     w3, w4
> >          orr     v1.16b, v4.16b, v4.16b
> >          orr     v2.16b, v5.16b, v5.16b
> >          orr     v3.16b, v6.16b, v6.16b
> >          st3     {v1.4s - v3.4s}, [x6], 48
> >          bhi     .L5
> > 
> > LD4 and ST3 do all the permutes without needing actual permute instructions
> > - a strategy that favours generic permutes avoiding the load_lanes case is
> > likely to be more expensive on most implementations. I think worth a PR
> > atleast.
> > 
> > regards
> > Ramana
> > 
> 
> Yes, quite right. PR 68707.

Thanks - I will think of sth.  Note that it's not all clearly obvious
the 2nd variant is cheaper because in the load-lane variant you have
a larger unrolling factor plus extra peeling due to the gap (ld4).
This means that loops not iterating much at runtime are likely pessimized
compared to the SLP variant.  Not sure where the actual cut-off would be.

Btw, is a ld3/st3 pair actually cheaper than three ld/st (without
extra permutation)?  It's shorter code, but is it faster?

Thanks,
Richard.

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2015-12-07  8:24 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-11-27  8:36 [PATCH][1/2] Fix PR68553 Richard Biener
2015-12-04 15:32 ` Alan Lawrence
2015-12-04 16:04   ` Richard Biener
2015-12-04 17:46     ` Ramana Radhakrishnan
2015-12-04 18:23       ` Alan Lawrence
2015-12-07  8:24         ` Richard Biener

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).