public inbox for gcc-patches@gcc.gnu.org
* [PATCH 00/10] vect: Move costing next to the transform for vect store
@ 2023-09-14  3:11 Kewen Lin
  2023-09-14  3:11 ` [PATCH 01/10] vect: Ensure vect store is supported for some VMAT_ELEMENTWISE case Kewen Lin
                   ` (9 more replies)
  0 siblings, 10 replies; 23+ messages in thread
From: Kewen Lin @ 2023-09-14  3:11 UTC (permalink / raw)
  To: gcc-patches; +Cc: richard.guenther, richard.sandiford, segher, bergner

This patch series is a follow-up to the previous patch
series for vector load [1].  Some of the associated test
cases show the benefits of this kind of structuring.  Like
the series for vector load, this one makes costing no
longer call function vect_model_store_cost; instead, the
costing is done next to the transform.

Most of the patches are organized according to the memory
access types of vector store, which should hopefully make
review and bisection easier.  The changes first follow the
handling in function vect_model_store_cost, then refine it
a bit by referring to the transform code; I also checked
them with some typical test cases to verify the costing.

The whole series has been bootstrapped and regtested
incrementally on:
  - x86_64-redhat-linux
  - aarch64-linux-gnu
  - powerpc64-linux-gnu P7, P8 and P9
  - powerpc64le-linux-gnu P8, P9 and P10

Since the current vector test buckets are mainly tested
without the cost model, I also verified that the whole patch
series was neutral for SPEC2017 int/fp on Power9 at O2,
O3 and Ofast separately.

[1] https://gcc.gnu.org/pipermail/gcc-patches/2023-June/621460.html

Kewen Lin (10):
  vect: Ensure vect store is supported for some VMAT_ELEMENTWISE case
  vect: Move vect_model_store_cost next to the transform in vectorizable_store
  vect: Adjust vectorizable_store costing on VMAT_GATHER_SCATTER
  vect: Simplify costing on vectorizable_scan_store
  vect: Adjust vectorizable_store costing on VMAT_ELEMENTWISE and VMAT_STRIDED_SLP
  vect: Adjust vectorizable_store costing on VMAT_LOAD_STORE_LANES
  vect: Adjust vectorizable_store costing on VMAT_CONTIGUOUS_PERMUTE
  aarch64: Don't use CEIL for vector_store in aarch64_stp_sequence_cost
  vect: Get rid of vect_model_store_cost
  vect: Consider vec_perm costing for VMAT_CONTIGUOUS_REVERSE

 gcc/config/aarch64/aarch64.cc                 |   2 +-
 .../costmodel/ppc/costmodel-vect-store-1.c    |  23 +
 .../costmodel/ppc/costmodel-vect-store-2.c    |  29 +
 gcc/tree-vect-stmts.cc                        | 717 +++++++++++-------
 4 files changed, 493 insertions(+), 278 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/ppc/costmodel-vect-store-1.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/ppc/costmodel-vect-store-2.c

-- 
2.31.1


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 01/10] vect: Ensure vect store is supported for some VMAT_ELEMENTWISE case
  2023-09-14  3:11 [PATCH 00/10] vect: Move costing next to the transform for vect store Kewen Lin
@ 2023-09-14  3:11 ` Kewen Lin
  2023-09-27 11:22   ` Richard Biener
  2023-09-14  3:11 ` [PATCH 02/10] vect: Move vect_model_store_cost next to the transform in vectorizable_store Kewen Lin
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 23+ messages in thread
From: Kewen Lin @ 2023-09-14  3:11 UTC (permalink / raw)
  To: gcc-patches; +Cc: richard.guenther, richard.sandiford

While making and testing patches to move costing next to
the transform code for vectorizable_store, some ICEs were
exposed when I further refined the costing handling on
VMAT_ELEMENTWISE.  The apparent cause is that we trigger
the assertion in the rs6000-specific costing function
rs6000_builtin_vectorization_cost:

  if (TARGET_ALTIVEC)
     /* Misaligned stores are not supported.  */
     gcc_unreachable ();

I had switched from the original costing, record_stmt_cost
with scalar_store, to vect_get_store_cost, that is, one
unaligned_store instead; this matches what we use in the
transform, which is a vector store as below:

  else if (group_size >= const_nunits
           && group_size % const_nunits == 0)
    {
       nstores = 1;
       lnel = const_nunits;
       ltype = vectype;
       lvectype = vectype;
    }

So IMHO costing it as a vector store is more consistent
than as a scalar store.  With the given compilation option
-mno-allow-movmisalign, a misaligned vector store is not
expected to be used by the vectorizer, so why is it still
adopted?  In the current implementation of function
get_group_load_store_type, we always set the alignment
support scheme to dr_unaligned_supported for
VMAT_ELEMENTWISE.  That is fine if we always adopt scalar
stores, but as the above code shows, we can use vector
stores for some cases, so we should use the correct
alignment support scheme for them.
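
For illustration only (this is a hand-written sketch, not
one of the affected tests), a grouped strided store whose
group size is a multiple of the vector length, and which
can therefore take the whole-vector-store path above, could
look like:

  /* Hypothetical example: with V4SI vectors the store group
     a[6*i+0..3] has group_size 4 == const_nunits, so the
     strided-store path can emit one whole-vector store per
     group (the exact access type chosen depends on the
     target).  */
  void
  foo (int *restrict a, int *restrict b, int n)
  {
    for (int i = 0; i < n; i++)
      {
        a[6 * i + 0] = b[4 * i + 0];
        a[6 * i + 1] = b[4 * i + 1];
        a[6 * i + 2] = b[4 * i + 2];
        a[6 * i + 3] = b[4 * i + 3];
      }
  }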

This patch ensures the vector store is supported by further
checking with vect_supportable_dr_alignment.  The ICEs were
exposed by the patches that move costing next to the
transform, which haven't landed yet; the test coverage will
be there once they land.  The affected test cases are:
  - gcc.dg/vect/slp-45.c
  - gcc.dg/vect/vect-alias-check-{10,11,12}.c

Btw, I tried to make a correctness test case, but I
realized that -mno-allow-movmisalign mainly controls the
movmisalign optab and doesn't guard the actual hardware
vector memory access insns, so I failed to make one unless
I also altered some conditions for them.

gcc/ChangeLog:

	* tree-vect-stmts.cc (vectorizable_store): Ensure the generated
	vector store for some case of VMAT_ELEMENTWISE is supported.
---
 gcc/tree-vect-stmts.cc | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index cd7c1090d88..a5caaf0bca2 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -8558,10 +8558,18 @@ vectorizable_store (vec_info *vinfo,
 	  else if (group_size >= const_nunits
 		   && group_size % const_nunits == 0)
 	    {
-	      nstores = 1;
-	      lnel = const_nunits;
-	      ltype = vectype;
-	      lvectype = vectype;
+	      int mis_align = dr_misalignment (first_dr_info, vectype);
+	      dr_alignment_support dr_align
+		= vect_supportable_dr_alignment (vinfo, dr_info, vectype,
+						 mis_align);
+	      if (dr_align == dr_aligned
+		  || dr_align == dr_unaligned_supported)
+		{
+		  nstores = 1;
+		  lnel = const_nunits;
+		  ltype = vectype;
+		  lvectype = vectype;
+		}
 	    }
 	  ltype = build_aligned_type (ltype, TYPE_ALIGN (elem_type));
 	  ncopies = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node);
-- 
2.31.1


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 02/10] vect: Move vect_model_store_cost next to the transform in vectorizable_store
  2023-09-14  3:11 [PATCH 00/10] vect: Move costing next to the transform for vect store Kewen Lin
  2023-09-14  3:11 ` [PATCH 01/10] vect: Ensure vect store is supported for some VMAT_ELEMENTWISE case Kewen Lin
@ 2023-09-14  3:11 ` Kewen Lin
  2023-09-27 11:23   ` Richard Biener
  2023-09-14  3:11 ` [PATCH 03/10] vect: Adjust vectorizable_store costing on VMAT_GATHER_SCATTER Kewen Lin
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 23+ messages in thread
From: Kewen Lin @ 2023-09-14  3:11 UTC (permalink / raw)
  To: gcc-patches; +Cc: richard.guenther, richard.sandiford

This is an initial patch to move costing next to the
transform.  It still uses vect_model_store_cost for the
costing, but moves and duplicates the call down into the
paths for the different vect_memory_access_types or other
special handling, which should hopefully make the
subsequent patches easier to review.  This patch should not
have any functional changes.
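
As a rough, self-contained sketch (plain C, not GCC code
and not part of this patch) of the structure this moves
towards: one function serves both the costing phase and the
transform phase, and each memory-access-type branch either
records its own cost or performs the transform before
returning.

  enum access_kind { ELEMENTWISE, GATHER_SCATTER, CONTIGUOUS };

  static int
  handle_store (enum access_kind kind, int costing_p,
                unsigned *inside_cost)
  {
    if (kind == GATHER_SCATTER)
      {
        if (costing_p)
          /* The cost is recorded right next to this branch.  */
          *inside_cost += 4;
        else
          {
            /* ... emit the scatter-store statements here ...  */
          }
        return 1;
      }
    /* Remaining contiguous/elementwise handling.  */
    if (costing_p)
      *inside_cost += 1;
    else
      {
        /* ... emit the vector store ...  */
      }
    return 1;
  }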

gcc/ChangeLog:

	* tree-vect-stmts.cc (vectorizable_store): Move and duplicate the call
	to vect_model_store_cost down to some different transform paths
	according to the handlings of different vect_memory_access_types
	or some special handling need.
---
 gcc/tree-vect-stmts.cc | 79 ++++++++++++++++++++++++++++++++----------
 1 file changed, 60 insertions(+), 19 deletions(-)

diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index a5caaf0bca2..36f7c5b9f4b 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -8372,7 +8372,8 @@ vectorizable_store (vec_info *vinfo,
 	return false;
     }
 
-  if (!vec_stmt) /* transformation not required.  */
+  bool costing_p = !vec_stmt;
+  if (costing_p) /* transformation not required.  */
     {
       STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) = memory_access_type;
 
@@ -8401,11 +8402,6 @@ vectorizable_store (vec_info *vinfo,
 			 "Vectorizing an unaligned access.\n");
 
       STMT_VINFO_TYPE (stmt_info) = store_vec_info_type;
-      vect_model_store_cost (vinfo, stmt_info, ncopies,
-			     memory_access_type, &gs_info,
-			     alignment_support_scheme,
-			     misalignment, vls_type, slp_node, cost_vec);
-      return true;
     }
   gcc_assert (memory_access_type == STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info));
 
@@ -8415,12 +8411,27 @@ vectorizable_store (vec_info *vinfo,
 
   if (memory_access_type == VMAT_GATHER_SCATTER && gs_info.decl)
     {
-      vect_build_scatter_store_calls (vinfo, stmt_info, gsi, vec_stmt,
-				      &gs_info, mask);
+      if (costing_p)
+	vect_model_store_cost (vinfo, stmt_info, ncopies, memory_access_type,
+			       &gs_info, alignment_support_scheme, misalignment,
+			       vls_type, slp_node, cost_vec);
+      else
+	vect_build_scatter_store_calls (vinfo, stmt_info, gsi, vec_stmt,
+					&gs_info, mask);
       return true;
     }
   else if (STMT_VINFO_SIMD_LANE_ACCESS_P (stmt_info) >= 3)
-    return vectorizable_scan_store (vinfo, stmt_info, gsi, vec_stmt, ncopies);
+    {
+      gcc_assert (memory_access_type == VMAT_CONTIGUOUS);
+      if (costing_p)
+	{
+	  vect_model_store_cost (vinfo, stmt_info, ncopies, memory_access_type,
+				 &gs_info, alignment_support_scheme,
+				 misalignment, vls_type, slp_node, cost_vec);
+	  return true;
+	}
+      return vectorizable_scan_store (vinfo, stmt_info, gsi, vec_stmt, ncopies);
+    }
 
   if (grouped_store)
     {
@@ -8449,13 +8460,21 @@ vectorizable_store (vec_info *vinfo,
   else
     ref_type = reference_alias_ptr_type (DR_REF (first_dr_info->dr));
 
-  if (dump_enabled_p ())
-    dump_printf_loc (MSG_NOTE, vect_location,
-                     "transform store. ncopies = %d\n", ncopies);
+  if (!costing_p && dump_enabled_p ())
+    dump_printf_loc (MSG_NOTE, vect_location, "transform store. ncopies = %d\n",
+		     ncopies);
 
   if (memory_access_type == VMAT_ELEMENTWISE
       || memory_access_type == VMAT_STRIDED_SLP)
     {
+      if (costing_p)
+	{
+	  vect_model_store_cost (vinfo, stmt_info, ncopies, memory_access_type,
+				 &gs_info, alignment_support_scheme,
+				 misalignment, vls_type, slp_node, cost_vec);
+	  return true;
+	}
+
       gimple_stmt_iterator incr_gsi;
       bool insert_after;
       gimple *incr;
@@ -8718,8 +8737,9 @@ vectorizable_store (vec_info *vinfo,
   else if (memory_access_type == VMAT_GATHER_SCATTER)
     {
       aggr_type = elem_type;
-      vect_get_strided_load_store_ops (stmt_info, loop_vinfo, gsi, &gs_info,
-				       &bump, &vec_offset, loop_lens);
+      if (!costing_p)
+	vect_get_strided_load_store_ops (stmt_info, loop_vinfo, gsi, &gs_info,
+					 &bump, &vec_offset, loop_lens);
     }
   else
     {
@@ -8731,7 +8751,7 @@ vectorizable_store (vec_info *vinfo,
 					  memory_access_type, loop_lens);
     }
 
-  if (mask)
+  if (mask && !costing_p)
     LOOP_VINFO_HAS_MASK_STORE (loop_vinfo) = true;
 
   /* In case the vectorization factor (VF) is bigger than the number
@@ -8782,6 +8802,13 @@ vectorizable_store (vec_info *vinfo,
   if (memory_access_type == VMAT_LOAD_STORE_LANES)
     {
       gcc_assert (!slp && grouped_store);
+      if (costing_p)
+	{
+	  vect_model_store_cost (vinfo, stmt_info, ncopies, memory_access_type,
+				 &gs_info, alignment_support_scheme,
+				 misalignment, vls_type, slp_node, cost_vec);
+	  return true;
+	}
       for (j = 0; j < ncopies; j++)
 	{
 	  gimple *new_stmt;
@@ -8927,6 +8954,13 @@ vectorizable_store (vec_info *vinfo,
   if (memory_access_type == VMAT_GATHER_SCATTER)
     {
       gcc_assert (!slp && !grouped_store);
+      if (costing_p)
+	{
+	  vect_model_store_cost (vinfo, stmt_info, ncopies, memory_access_type,
+				 &gs_info, alignment_support_scheme,
+				 misalignment, vls_type, slp_node, cost_vec);
+	  return true;
+	}
       auto_vec<tree> vec_offsets;
       for (j = 0; j < ncopies; j++)
 	{
@@ -9091,7 +9125,7 @@ vectorizable_store (vec_info *vinfo,
   for (j = 0; j < ncopies; j++)
     {
       gimple *new_stmt;
-      if (j == 0)
+      if (j == 0 && !costing_p)
 	{
 	  if (slp)
 	    {
@@ -9158,7 +9192,7 @@ vectorizable_store (vec_info *vinfo,
 					  offset, &dummy, gsi, &ptr_incr,
 					  simd_lane_access_p, bump);
 	}
-      else
+      else if (!costing_p)
 	{
 	  gcc_assert (!LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo));
 	  /* DR_CHAIN is then used as an input to vect_permute_store_chain().
@@ -9179,7 +9213,7 @@ vectorizable_store (vec_info *vinfo,
 	}
 
       new_stmt = NULL;
-      if (grouped_store)
+      if (!costing_p && grouped_store)
 	/* Permute.  */
 	vect_permute_store_chain (vinfo, dr_chain, group_size, stmt_info, gsi,
 				  &result_chain);
@@ -9187,6 +9221,8 @@ vectorizable_store (vec_info *vinfo,
       stmt_vec_info next_stmt_info = first_stmt_info;
       for (i = 0; i < vec_num; i++)
 	{
+	  if (costing_p)
+	    continue;
 	  unsigned misalign;
 	  unsigned HOST_WIDE_INT align;
 
@@ -9361,7 +9397,7 @@ vectorizable_store (vec_info *vinfo,
 	  if (!next_stmt_info)
 	    break;
 	}
-      if (!slp)
+      if (!slp && !costing_p)
 	{
 	  if (j == 0)
 	    *vec_stmt = new_stmt;
@@ -9369,6 +9405,11 @@ vectorizable_store (vec_info *vinfo,
 	}
     }
 
+  if (costing_p)
+    vect_model_store_cost (vinfo, stmt_info, ncopies, memory_access_type,
+			   &gs_info, alignment_support_scheme, misalignment,
+			   vls_type, slp_node, cost_vec);
+
   return true;
 }
 
-- 
2.31.1


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 03/10] vect: Adjust vectorizable_store costing on VMAT_GATHER_SCATTER
  2023-09-14  3:11 [PATCH 00/10] vect: Move costing next to the transform for vect store Kewen Lin
  2023-09-14  3:11 ` [PATCH 01/10] vect: Ensure vect store is supported for some VMAT_ELEMENTWISE case Kewen Lin
  2023-09-14  3:11 ` [PATCH 02/10] vect: Move vect_model_store_cost next to the transform in vectorizable_store Kewen Lin
@ 2023-09-14  3:11 ` Kewen Lin
  2023-09-27 11:24   ` Richard Biener
  2023-09-14  3:11 ` [PATCH 04/10] vect: Simplify costing on vectorizable_scan_store Kewen Lin
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 23+ messages in thread
From: Kewen Lin @ 2023-09-14  3:11 UTC (permalink / raw)
  To: gcc-patches; +Cc: richard.guenther, richard.sandiford

This patch adjusts the cost handling on VMAT_GATHER_SCATTER
in function vectorizable_store (all three cases), so that
we no longer depend on vect_model_store_cost for its
costing.  This patch shouldn't have any functional changes.
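
For illustration only (not taken from the patch), a loop
whose store side is a scatter, i.e. VMAT_GATHER_SCATTER on
targets with scatter-store support or an emulated scatter
elsewhere, could look like:

  /* Hypothetical example: the indexed store a[idx[i]] = ...
     is a scatter store.  */
  void
  scatter (float *restrict a, int *restrict idx,
           float *restrict b, int n)
  {
    for (int i = 0; i < n; i++)
      a[idx[i]] = b[i];
  }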

gcc/ChangeLog:

	* tree-vect-stmts.cc (vect_model_store_cost): Assert it won't get
	VMAT_GATHER_SCATTER any more, remove VMAT_GATHER_SCATTER related
	handlings and the related parameter gs_info.
	(vect_build_scatter_store_calls): Add the handlings on costing with
	one more argument cost_vec.
	(vectorizable_store): Adjust the cost handling on VMAT_GATHER_SCATTER
	without calling vect_model_store_cost any more.
---
 gcc/tree-vect-stmts.cc | 188 ++++++++++++++++++++++++++---------------
 1 file changed, 118 insertions(+), 70 deletions(-)

diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index 36f7c5b9f4b..3f908242fee 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -959,12 +959,12 @@ cfun_returns (tree decl)
 static void
 vect_model_store_cost (vec_info *vinfo, stmt_vec_info stmt_info, int ncopies,
 		       vect_memory_access_type memory_access_type,
-		       gather_scatter_info *gs_info,
 		       dr_alignment_support alignment_support_scheme,
 		       int misalignment,
 		       vec_load_store_type vls_type, slp_tree slp_node,
 		       stmt_vector_for_cost *cost_vec)
 {
+  gcc_assert (memory_access_type != VMAT_GATHER_SCATTER);
   unsigned int inside_cost = 0, prologue_cost = 0;
   stmt_vec_info first_stmt_info = stmt_info;
   bool grouped_access_p = STMT_VINFO_GROUPED_ACCESS (stmt_info);
@@ -1012,18 +1012,9 @@ vect_model_store_cost (vec_info *vinfo, stmt_vec_info stmt_info, int ncopies,
 
   tree vectype = STMT_VINFO_VECTYPE (stmt_info);
   /* Costs of the stores.  */
-  if (memory_access_type == VMAT_ELEMENTWISE
-      || memory_access_type == VMAT_GATHER_SCATTER)
+  if (memory_access_type == VMAT_ELEMENTWISE)
     {
       unsigned int assumed_nunits = vect_nunits_for_cost (vectype);
-      if (memory_access_type == VMAT_GATHER_SCATTER
-	  && gs_info->ifn == IFN_LAST && !gs_info->decl)
-	/* For emulated scatter N offset vector element extracts
-	   (we assume the scalar scaling and ptr + offset add is consumed by
-	   the load).  */
-	inside_cost += record_stmt_cost (cost_vec, ncopies * assumed_nunits,
-					 vec_to_scalar, stmt_info, 0,
-					 vect_body);
       /* N scalar stores plus extracting the elements.  */
       inside_cost += record_stmt_cost (cost_vec,
 				       ncopies * assumed_nunits,
@@ -1034,9 +1025,7 @@ vect_model_store_cost (vec_info *vinfo, stmt_vec_info stmt_info, int ncopies,
 			 misalignment, &inside_cost, cost_vec);
 
   if (memory_access_type == VMAT_ELEMENTWISE
-      || memory_access_type == VMAT_STRIDED_SLP
-      || (memory_access_type == VMAT_GATHER_SCATTER
-	  && gs_info->ifn == IFN_LAST && !gs_info->decl))
+      || memory_access_type == VMAT_STRIDED_SLP)
     {
       /* N scalar stores plus extracting the elements.  */
       unsigned int assumed_nunits = vect_nunits_for_cost (vectype);
@@ -2999,7 +2988,8 @@ vect_build_gather_load_calls (vec_info *vinfo, stmt_vec_info stmt_info,
 static void
 vect_build_scatter_store_calls (vec_info *vinfo, stmt_vec_info stmt_info,
 				gimple_stmt_iterator *gsi, gimple **vec_stmt,
-				gather_scatter_info *gs_info, tree mask)
+				gather_scatter_info *gs_info, tree mask,
+				stmt_vector_for_cost *cost_vec)
 {
   loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (vinfo);
   tree vectype = STMT_VINFO_VECTYPE (stmt_info);
@@ -3009,6 +2999,30 @@ vect_build_scatter_store_calls (vec_info *vinfo, stmt_vec_info stmt_info,
   poly_uint64 scatter_off_nunits
     = TYPE_VECTOR_SUBPARTS (gs_info->offset_vectype);
 
+  /* FIXME: Keep the previous costing way in vect_model_store_cost by
+     costing N scalar stores, but it should be tweaked to use target
+     specific costs on related scatter store calls.  */
+  if (cost_vec)
+    {
+      tree op = vect_get_store_rhs (stmt_info);
+      enum vect_def_type dt;
+      gcc_assert (vect_is_simple_use (op, vinfo, &dt));
+      unsigned int inside_cost, prologue_cost = 0;
+      if (dt == vect_constant_def || dt == vect_external_def)
+	prologue_cost += record_stmt_cost (cost_vec, 1, scalar_to_vec,
+					   stmt_info, 0, vect_prologue);
+      unsigned int assumed_nunits = vect_nunits_for_cost (vectype);
+      inside_cost = record_stmt_cost (cost_vec, ncopies * assumed_nunits,
+				      scalar_store, stmt_info, 0, vect_body);
+
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_NOTE, vect_location,
+			 "vect_model_store_cost: inside_cost = %d, "
+			 "prologue_cost = %d .\n",
+			 inside_cost, prologue_cost);
+      return;
+    }
+
   tree perm_mask = NULL_TREE, mask_halfvectype = NULL_TREE;
   if (known_eq (nunits, scatter_off_nunits))
     modifier = NONE;
@@ -8411,13 +8425,8 @@ vectorizable_store (vec_info *vinfo,
 
   if (memory_access_type == VMAT_GATHER_SCATTER && gs_info.decl)
     {
-      if (costing_p)
-	vect_model_store_cost (vinfo, stmt_info, ncopies, memory_access_type,
-			       &gs_info, alignment_support_scheme, misalignment,
-			       vls_type, slp_node, cost_vec);
-      else
-	vect_build_scatter_store_calls (vinfo, stmt_info, gsi, vec_stmt,
-					&gs_info, mask);
+      vect_build_scatter_store_calls (vinfo, stmt_info, gsi, vec_stmt, &gs_info,
+				      mask, cost_vec);
       return true;
     }
   else if (STMT_VINFO_SIMD_LANE_ACCESS_P (stmt_info) >= 3)
@@ -8426,8 +8435,8 @@ vectorizable_store (vec_info *vinfo,
       if (costing_p)
 	{
 	  vect_model_store_cost (vinfo, stmt_info, ncopies, memory_access_type,
-				 &gs_info, alignment_support_scheme,
-				 misalignment, vls_type, slp_node, cost_vec);
+				 alignment_support_scheme, misalignment,
+				 vls_type, slp_node, cost_vec);
 	  return true;
 	}
       return vectorizable_scan_store (vinfo, stmt_info, gsi, vec_stmt, ncopies);
@@ -8470,8 +8479,8 @@ vectorizable_store (vec_info *vinfo,
       if (costing_p)
 	{
 	  vect_model_store_cost (vinfo, stmt_info, ncopies, memory_access_type,
-				 &gs_info, alignment_support_scheme,
-				 misalignment, vls_type, slp_node, cost_vec);
+				 alignment_support_scheme, misalignment,
+				 vls_type, slp_node, cost_vec);
 	  return true;
 	}
 
@@ -8805,8 +8814,8 @@ vectorizable_store (vec_info *vinfo,
       if (costing_p)
 	{
 	  vect_model_store_cost (vinfo, stmt_info, ncopies, memory_access_type,
-				 &gs_info, alignment_support_scheme,
-				 misalignment, vls_type, slp_node, cost_vec);
+				 alignment_support_scheme, misalignment,
+				 vls_type, slp_node, cost_vec);
 	  return true;
 	}
       for (j = 0; j < ncopies; j++)
@@ -8954,49 +8963,50 @@ vectorizable_store (vec_info *vinfo,
   if (memory_access_type == VMAT_GATHER_SCATTER)
     {
       gcc_assert (!slp && !grouped_store);
-      if (costing_p)
-	{
-	  vect_model_store_cost (vinfo, stmt_info, ncopies, memory_access_type,
-				 &gs_info, alignment_support_scheme,
-				 misalignment, vls_type, slp_node, cost_vec);
-	  return true;
-	}
       auto_vec<tree> vec_offsets;
+      unsigned int inside_cost = 0, prologue_cost = 0;
       for (j = 0; j < ncopies; j++)
 	{
 	  gimple *new_stmt;
 	  if (j == 0)
 	    {
-	      /* Since the store is not grouped, DR_GROUP_SIZE is 1, and
-		 DR_CHAIN is of size 1.  */
-	      gcc_assert (group_size == 1);
-	      op = vect_get_store_rhs (first_stmt_info);
-	      vect_get_vec_defs_for_operand (vinfo, first_stmt_info, ncopies,
-					     op, gvec_oprnds[0]);
-	      vec_oprnd = (*gvec_oprnds[0])[0];
-	      dr_chain.quick_push (vec_oprnd);
-	      if (mask)
+	      if (costing_p && vls_type == VLS_STORE_INVARIANT)
+		prologue_cost += record_stmt_cost (cost_vec, 1, scalar_to_vec,
+						   stmt_info, 0, vect_prologue);
+	      else if (!costing_p)
 		{
-		  vect_get_vec_defs_for_operand (vinfo, stmt_info, ncopies,
-						 mask, &vec_masks,
-						 mask_vectype);
-		  vec_mask = vec_masks[0];
-		}
+		  /* Since the store is not grouped, DR_GROUP_SIZE is 1, and
+		     DR_CHAIN is of size 1.  */
+		  gcc_assert (group_size == 1);
+		  op = vect_get_store_rhs (first_stmt_info);
+		  vect_get_vec_defs_for_operand (vinfo, first_stmt_info,
+						 ncopies, op, gvec_oprnds[0]);
+		  vec_oprnd = (*gvec_oprnds[0])[0];
+		  dr_chain.quick_push (vec_oprnd);
+		  if (mask)
+		    {
+		      vect_get_vec_defs_for_operand (vinfo, stmt_info, ncopies,
+						     mask, &vec_masks,
+						     mask_vectype);
+		      vec_mask = vec_masks[0];
+		    }
 
-	      /* We should have catched mismatched types earlier.  */
-	      gcc_assert (useless_type_conversion_p (vectype,
-						     TREE_TYPE (vec_oprnd)));
-	      if (STMT_VINFO_GATHER_SCATTER_P (stmt_info))
-		vect_get_gather_scatter_ops (loop_vinfo, loop, stmt_info,
-					     slp_node, &gs_info, &dataref_ptr,
-					     &vec_offsets);
-	      else
-		dataref_ptr
-		  = vect_create_data_ref_ptr (vinfo, first_stmt_info, aggr_type,
-					      NULL, offset, &dummy, gsi,
-					      &ptr_incr, false, bump);
+		  /* We should have catched mismatched types earlier.  */
+		  gcc_assert (
+		    useless_type_conversion_p (vectype, TREE_TYPE (vec_oprnd)));
+		  if (STMT_VINFO_GATHER_SCATTER_P (stmt_info))
+		    vect_get_gather_scatter_ops (loop_vinfo, loop, stmt_info,
+						 slp_node, &gs_info,
+						 &dataref_ptr, &vec_offsets);
+		  else
+		    dataref_ptr
+		      = vect_create_data_ref_ptr (vinfo, first_stmt_info,
+						  aggr_type, NULL, offset,
+						  &dummy, gsi, &ptr_incr, false,
+						  bump);
+		}
 	    }
-	  else
+	  else if (!costing_p)
 	    {
 	      gcc_assert (!LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo));
 	      vec_oprnd = (*gvec_oprnds[0])[j];
@@ -9013,15 +9023,27 @@ vectorizable_store (vec_info *vinfo,
 	  tree final_mask = NULL_TREE;
 	  tree final_len = NULL_TREE;
 	  tree bias = NULL_TREE;
-	  if (loop_masks)
-	    final_mask = vect_get_loop_mask (loop_vinfo, gsi, loop_masks,
-					     ncopies, vectype, j);
-	  if (vec_mask)
-	    final_mask = prepare_vec_mask (loop_vinfo, mask_vectype, final_mask,
-					   vec_mask, gsi);
+	  if (!costing_p)
+	    {
+	      if (loop_masks)
+		final_mask = vect_get_loop_mask (loop_vinfo, gsi, loop_masks,
+						 ncopies, vectype, j);
+	      if (vec_mask)
+		final_mask = prepare_vec_mask (loop_vinfo, mask_vectype,
+					       final_mask, vec_mask, gsi);
+	    }
 
 	  if (gs_info.ifn != IFN_LAST)
 	    {
+	      if (costing_p)
+		{
+		  unsigned int cnunits = vect_nunits_for_cost (vectype);
+		  inside_cost
+		    += record_stmt_cost (cost_vec, cnunits, scalar_store,
+					 stmt_info, 0, vect_body);
+		  continue;
+		}
+
 	      if (STMT_VINFO_GATHER_SCATTER_P (stmt_info))
 		vec_offset = vec_offsets[j];
 	      tree scale = size_int (gs_info.scale);
@@ -9067,6 +9089,25 @@ vectorizable_store (vec_info *vinfo,
 	    {
 	      /* Emulated scatter.  */
 	      gcc_assert (!final_mask);
+	      if (costing_p)
+		{
+		  unsigned int cnunits = vect_nunits_for_cost (vectype);
+		  /* For emulated scatter N offset vector element extracts
+		     (we assume the scalar scaling and ptr + offset add is
+		     consumed by the load).  */
+		  inside_cost
+		    += record_stmt_cost (cost_vec, cnunits, vec_to_scalar,
+					 stmt_info, 0, vect_body);
+		  /* N scalar stores plus extracting the elements.  */
+		  inside_cost
+		    += record_stmt_cost (cost_vec, cnunits, vec_to_scalar,
+					 stmt_info, 0, vect_body);
+		  inside_cost
+		    += record_stmt_cost (cost_vec, cnunits, scalar_store,
+					 stmt_info, 0, vect_body);
+		  continue;
+		}
+
 	      unsigned HOST_WIDE_INT const_nunits = nunits.to_constant ();
 	      unsigned HOST_WIDE_INT const_offset_nunits
 		= TYPE_VECTOR_SUBPARTS (gs_info.offset_vectype).to_constant ();
@@ -9117,6 +9158,13 @@ vectorizable_store (vec_info *vinfo,
 	    *vec_stmt = new_stmt;
 	  STMT_VINFO_VEC_STMTS (stmt_info).safe_push (new_stmt);
 	}
+
+      if (costing_p && dump_enabled_p ())
+	dump_printf_loc (MSG_NOTE, vect_location,
+			 "vect_model_store_cost: inside_cost = %d, "
+			 "prologue_cost = %d .\n",
+			 inside_cost, prologue_cost);
+
       return true;
     }
 
@@ -9407,8 +9455,8 @@ vectorizable_store (vec_info *vinfo,
 
   if (costing_p)
     vect_model_store_cost (vinfo, stmt_info, ncopies, memory_access_type,
-			   &gs_info, alignment_support_scheme, misalignment,
-			   vls_type, slp_node, cost_vec);
+			   alignment_support_scheme, misalignment, vls_type,
+			   slp_node, cost_vec);
 
   return true;
 }
-- 
2.31.1


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 04/10] vect: Simplify costing on vectorizable_scan_store
  2023-09-14  3:11 [PATCH 00/10] vect: Move costing next to the transform for vect store Kewen Lin
                   ` (2 preceding siblings ...)
  2023-09-14  3:11 ` [PATCH 03/10] vect: Adjust vectorizable_store costing on VMAT_GATHER_SCATTER Kewen Lin
@ 2023-09-14  3:11 ` Kewen Lin
  2023-09-27 11:25   ` Richard Biener
  2023-09-14  3:11 ` [PATCH 05/10] vect: Adjust vectorizable_store costing on VMAT_ELEMENTWISE and VMAT_STRIDED_SLP Kewen Lin
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 23+ messages in thread
From: Kewen Lin @ 2023-09-14  3:11 UTC (permalink / raw)
  To: gcc-patches; +Cc: richard.guenther, richard.sandiford

This patch simplifies the costing for the
vectorizable_scan_store case so that it no longer calls
function vect_model_store_cost.

I considered whether moving the costing into function
vectorizable_scan_store would be a good idea, but to do
that we would have to pass several variables down which
are only used for costing.  For now we just want to keep
the costing as it was before, and haven't tried to make it
consistent with what the transform does, so I think we can
leave it as is for now.
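
For illustration only (not taken from the patch), the kind
of OpenMP simd scan whose store goes through
vectorizable_scan_store looks roughly like:

  /* Hypothetical example: an inclusive scan; the store to
     b[i] is the scan store.  Assumes OpenMP inscan support
     (-fopenmp-simd or -fopenmp).  */
  void
  prefix_sum (int *restrict a, int *restrict b, int n)
  {
    int sum = 0;
  #pragma omp simd reduction (inscan, +:sum)
    for (int i = 0; i < n; i++)
      {
        sum += a[i];
  #pragma omp scan inclusive (sum)
        b[i] = sum;
      }
  }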

gcc/ChangeLog:

	* tree-vect-stmts.cc (vectorizable_store): Adjust costing on
	vectorizable_scan_store without calling vect_model_store_cost
	any more.
---
 gcc/tree-vect-stmts.cc | 18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index 3f908242fee..048c14d291c 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -8432,11 +8432,23 @@ vectorizable_store (vec_info *vinfo,
   else if (STMT_VINFO_SIMD_LANE_ACCESS_P (stmt_info) >= 3)
     {
       gcc_assert (memory_access_type == VMAT_CONTIGUOUS);
+      gcc_assert (!slp);
       if (costing_p)
 	{
-	  vect_model_store_cost (vinfo, stmt_info, ncopies, memory_access_type,
-				 alignment_support_scheme, misalignment,
-				 vls_type, slp_node, cost_vec);
+	  unsigned int inside_cost = 0, prologue_cost = 0;
+	  if (vls_type == VLS_STORE_INVARIANT)
+	    prologue_cost += record_stmt_cost (cost_vec, 1, scalar_to_vec,
+					       stmt_info, 0, vect_prologue);
+	  vect_get_store_cost (vinfo, stmt_info, ncopies,
+			       alignment_support_scheme, misalignment,
+			       &inside_cost, cost_vec);
+
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_NOTE, vect_location,
+			     "vect_model_store_cost: inside_cost = %d, "
+			     "prologue_cost = %d .\n",
+			     inside_cost, prologue_cost);
+
 	  return true;
 	}
       return vectorizable_scan_store (vinfo, stmt_info, gsi, vec_stmt, ncopies);
-- 
2.31.1


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 05/10] vect: Adjust vectorizable_store costing on VMAT_ELEMENTWISE and VMAT_STRIDED_SLP
  2023-09-14  3:11 [PATCH 00/10] vect: Move costing next to the transform for vect store Kewen Lin
                   ` (3 preceding siblings ...)
  2023-09-14  3:11 ` [PATCH 04/10] vect: Simplify costing on vectorizable_scan_store Kewen Lin
@ 2023-09-14  3:11 ` Kewen Lin
  2023-09-27 11:26   ` Richard Biener
  2023-09-14  3:11 ` [PATCH 06/10] vect: Adjust vectorizable_store costing on VMAT_LOAD_STORE_LANES Kewen Lin
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 23+ messages in thread
From: Kewen Lin @ 2023-09-14  3:11 UTC (permalink / raw)
  To: gcc-patches; +Cc: richard.guenther, richard.sandiford

This patch adjusts the cost handling on VMAT_ELEMENTWISE
and VMAT_STRIDED_SLP in function vectorizable_store.  We
no longer call function vect_model_store_cost for them.

Like what we improved for PR82255 on the load side, this
change gets rid of unnecessary vec_to_scalar costing for
some cases with VMAT_STRIDED_SLP.  A typical test case,
gcc.dg/vect/costmodel/ppc/costmodel-vect-store-1.c, is
included.  It also helps some cases that had inconsistent
costing.

Besides, this also special-cases the interleaving stores
for these two affected memory access types: for
interleaving stores the whole chain is vectorized when the
last store in the chain is reached, and the other stores
in the group are skipped.  To stay consistent with that and
follow the transform's handling of iterating over the whole
group, we only cost for the first store in the group.
Ideally we would cost only for the last one, but that's not
trivial, and using the first one is actually equivalent.

gcc/ChangeLog:

	* tree-vect-stmts.cc (vect_model_store_cost): Assert it won't get
	VMAT_ELEMENTWISE and VMAT_STRIDED_SLP any more, and remove their
	related handlings.
	(vectorizable_store): Adjust the cost handling on VMAT_ELEMENTWISE
	and VMAT_STRIDED_SLP without calling vect_model_store_cost.

gcc/testsuite/ChangeLog:

	* gcc.dg/vect/costmodel/ppc/costmodel-vect-store-1.c: New test.
---
 .../costmodel/ppc/costmodel-vect-store-1.c    |  23 +++
 gcc/tree-vect-stmts.cc                        | 160 +++++++++++-------
 2 files changed, 120 insertions(+), 63 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/ppc/costmodel-vect-store-1.c

diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/ppc/costmodel-vect-store-1.c b/gcc/testsuite/gcc.dg/vect/costmodel/ppc/costmodel-vect-store-1.c
new file mode 100644
index 00000000000..ab5f3301492
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/costmodel/ppc/costmodel-vect-store-1.c
@@ -0,0 +1,23 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-additional-options "-O3" } */
+
+/* This test case is partially extracted from case
+   gcc.dg/vect/vect-avg-16.c, it's to verify we don't
+   cost a store with vec_to_scalar when we shouldn't.  */
+
+void
+test (signed char *restrict a, signed char *restrict b, signed char *restrict c,
+      int n)
+{
+  for (int j = 0; j < n; ++j)
+    {
+      for (int i = 0; i < 16; ++i)
+	a[i] = (b[i] + c[i]) >> 1;
+      a += 20;
+      b += 20;
+      c += 20;
+    }
+}
+
+/* { dg-final { scan-tree-dump-times "vec_to_scalar" 0 "vect" } } */
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index 048c14d291c..3d01168080a 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -964,7 +964,9 @@ vect_model_store_cost (vec_info *vinfo, stmt_vec_info stmt_info, int ncopies,
 		       vec_load_store_type vls_type, slp_tree slp_node,
 		       stmt_vector_for_cost *cost_vec)
 {
-  gcc_assert (memory_access_type != VMAT_GATHER_SCATTER);
+  gcc_assert (memory_access_type != VMAT_GATHER_SCATTER
+	      && memory_access_type != VMAT_ELEMENTWISE
+	      && memory_access_type != VMAT_STRIDED_SLP);
   unsigned int inside_cost = 0, prologue_cost = 0;
   stmt_vec_info first_stmt_info = stmt_info;
   bool grouped_access_p = STMT_VINFO_GROUPED_ACCESS (stmt_info);
@@ -1010,29 +1012,9 @@ vect_model_store_cost (vec_info *vinfo, stmt_vec_info stmt_info, int ncopies,
                          group_size);
     }
 
-  tree vectype = STMT_VINFO_VECTYPE (stmt_info);
   /* Costs of the stores.  */
-  if (memory_access_type == VMAT_ELEMENTWISE)
-    {
-      unsigned int assumed_nunits = vect_nunits_for_cost (vectype);
-      /* N scalar stores plus extracting the elements.  */
-      inside_cost += record_stmt_cost (cost_vec,
-				       ncopies * assumed_nunits,
-				       scalar_store, stmt_info, 0, vect_body);
-    }
-  else
-    vect_get_store_cost (vinfo, stmt_info, ncopies, alignment_support_scheme,
-			 misalignment, &inside_cost, cost_vec);
-
-  if (memory_access_type == VMAT_ELEMENTWISE
-      || memory_access_type == VMAT_STRIDED_SLP)
-    {
-      /* N scalar stores plus extracting the elements.  */
-      unsigned int assumed_nunits = vect_nunits_for_cost (vectype);
-      inside_cost += record_stmt_cost (cost_vec,
-				       ncopies * assumed_nunits,
-				       vec_to_scalar, stmt_info, 0, vect_body);
-    }
+  vect_get_store_cost (vinfo, stmt_info, ncopies, alignment_support_scheme,
+		       misalignment, &inside_cost, cost_vec);
 
   /* When vectorizing a store into the function result assign
      a penalty if the function returns in a multi-register location.
@@ -8416,6 +8398,18 @@ vectorizable_store (vec_info *vinfo,
 			 "Vectorizing an unaligned access.\n");
 
       STMT_VINFO_TYPE (stmt_info) = store_vec_info_type;
+
+      /* As function vect_transform_stmt shows, for interleaving stores
+	 the whole chain is vectorized when the last store in the chain
+	 is reached, the other stores in the group are skipped.  So we
+	 want to only cost the last one here, but it's not trivial to
+	 get the last, as it's equivalent to use the first one for
+	 costing, use the first one instead.  */
+      if (grouped_store
+	  && !slp
+	  && first_stmt_info != stmt_info
+	  && memory_access_type == VMAT_ELEMENTWISE)
+	return true;
     }
   gcc_assert (memory_access_type == STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info));
 
@@ -8488,14 +8482,7 @@ vectorizable_store (vec_info *vinfo,
   if (memory_access_type == VMAT_ELEMENTWISE
       || memory_access_type == VMAT_STRIDED_SLP)
     {
-      if (costing_p)
-	{
-	  vect_model_store_cost (vinfo, stmt_info, ncopies, memory_access_type,
-				 alignment_support_scheme, misalignment,
-				 vls_type, slp_node, cost_vec);
-	  return true;
-	}
-
+      unsigned inside_cost = 0, prologue_cost = 0;
       gimple_stmt_iterator incr_gsi;
       bool insert_after;
       gimple *incr;
@@ -8503,7 +8490,7 @@ vectorizable_store (vec_info *vinfo,
       tree ivstep;
       tree running_off;
       tree stride_base, stride_step, alias_off;
-      tree vec_oprnd;
+      tree vec_oprnd = NULL_TREE;
       tree dr_offset;
       unsigned int g;
       /* Checked by get_load_store_type.  */
@@ -8609,26 +8596,30 @@ vectorizable_store (vec_info *vinfo,
 		  lnel = const_nunits;
 		  ltype = vectype;
 		  lvectype = vectype;
+		  alignment_support_scheme = dr_align;
+		  misalignment = mis_align;
 		}
 	    }
 	  ltype = build_aligned_type (ltype, TYPE_ALIGN (elem_type));
 	  ncopies = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node);
 	}
 
-      ivstep = stride_step;
-      ivstep = fold_build2 (MULT_EXPR, TREE_TYPE (ivstep), ivstep,
-			    build_int_cst (TREE_TYPE (ivstep), vf));
+      if (!costing_p)
+	{
+	  ivstep = stride_step;
+	  ivstep = fold_build2 (MULT_EXPR, TREE_TYPE (ivstep), ivstep,
+				build_int_cst (TREE_TYPE (ivstep), vf));
 
-      standard_iv_increment_position (loop, &incr_gsi, &insert_after);
+	  standard_iv_increment_position (loop, &incr_gsi, &insert_after);
 
-      stride_base = cse_and_gimplify_to_preheader (loop_vinfo, stride_base);
-      ivstep = cse_and_gimplify_to_preheader (loop_vinfo, ivstep);
-      create_iv (stride_base, PLUS_EXPR, ivstep, NULL,
-		 loop, &incr_gsi, insert_after,
-		 &offvar, NULL);
-      incr = gsi_stmt (incr_gsi);
+	  stride_base = cse_and_gimplify_to_preheader (loop_vinfo, stride_base);
+	  ivstep = cse_and_gimplify_to_preheader (loop_vinfo, ivstep);
+	  create_iv (stride_base, PLUS_EXPR, ivstep, NULL, loop, &incr_gsi,
+		     insert_after, &offvar, NULL);
+	  incr = gsi_stmt (incr_gsi);
 
-      stride_step = cse_and_gimplify_to_preheader (loop_vinfo, stride_step);
+	  stride_step = cse_and_gimplify_to_preheader (loop_vinfo, stride_step);
+	}
 
       alias_off = build_int_cst (ref_type, 0);
       stmt_vec_info next_stmt_info = first_stmt_info;
@@ -8636,39 +8627,76 @@ vectorizable_store (vec_info *vinfo,
       for (g = 0; g < group_size; g++)
 	{
 	  running_off = offvar;
-	  if (g)
+	  if (!costing_p)
 	    {
-	      tree size = TYPE_SIZE_UNIT (ltype);
-	      tree pos = fold_build2 (MULT_EXPR, sizetype, size_int (g),
-				      size);
-	      tree newoff = copy_ssa_name (running_off, NULL);
-	      incr = gimple_build_assign (newoff, POINTER_PLUS_EXPR,
-					  running_off, pos);
-	      vect_finish_stmt_generation (vinfo, stmt_info, incr, gsi);
-	      running_off = newoff;
+	      if (g)
+		{
+		  tree size = TYPE_SIZE_UNIT (ltype);
+		  tree pos
+		    = fold_build2 (MULT_EXPR, sizetype, size_int (g), size);
+		  tree newoff = copy_ssa_name (running_off, NULL);
+		  incr = gimple_build_assign (newoff, POINTER_PLUS_EXPR,
+					      running_off, pos);
+		  vect_finish_stmt_generation (vinfo, stmt_info, incr, gsi);
+		  running_off = newoff;
+		}
 	    }
 	  if (!slp)
 	    op = vect_get_store_rhs (next_stmt_info);
-	  vect_get_vec_defs (vinfo, next_stmt_info, slp_node, ncopies,
-			     op, &vec_oprnds);
+	  if (!costing_p)
+	    vect_get_vec_defs (vinfo, next_stmt_info, slp_node, ncopies, op,
+			       &vec_oprnds);
+	  else if (!slp)
+	    {
+	      enum vect_def_type cdt;
+	      gcc_assert (vect_is_simple_use (op, vinfo, &cdt));
+	      if (cdt == vect_constant_def || cdt == vect_external_def)
+		prologue_cost += record_stmt_cost (cost_vec, 1, scalar_to_vec,
+						   stmt_info, 0, vect_prologue);
+	    }
 	  unsigned int group_el = 0;
 	  unsigned HOST_WIDE_INT
 	    elsz = tree_to_uhwi (TYPE_SIZE_UNIT (TREE_TYPE (vectype)));
 	  for (j = 0; j < ncopies; j++)
 	    {
-	      vec_oprnd = vec_oprnds[j];
-	      /* Pun the vector to extract from if necessary.  */
-	      if (lvectype != vectype)
+	      if (!costing_p)
 		{
-		  tree tem = make_ssa_name (lvectype);
-		  gimple *pun
-		    = gimple_build_assign (tem, build1 (VIEW_CONVERT_EXPR,
-							lvectype, vec_oprnd));
-		  vect_finish_stmt_generation (vinfo, stmt_info, pun, gsi);
-		  vec_oprnd = tem;
+		  vec_oprnd = vec_oprnds[j];
+		  /* Pun the vector to extract from if necessary.  */
+		  if (lvectype != vectype)
+		    {
+		      tree tem = make_ssa_name (lvectype);
+		      tree cvt
+			= build1 (VIEW_CONVERT_EXPR, lvectype, vec_oprnd);
+		      gimple *pun = gimple_build_assign (tem, cvt);
+		      vect_finish_stmt_generation (vinfo, stmt_info, pun, gsi);
+		      vec_oprnd = tem;
+		    }
 		}
 	      for (i = 0; i < nstores; i++)
 		{
+		  if (costing_p)
+		    {
+		      /* Only need vector extracting when there are more
+			 than one stores.  */
+		      if (nstores > 1)
+			inside_cost
+			  += record_stmt_cost (cost_vec, 1, vec_to_scalar,
+					       stmt_info, 0, vect_body);
+		      /* Take a single lane vector type store as scalar
+			 store to avoid ICE like 110776.  */
+		      if (VECTOR_TYPE_P (ltype)
+			  && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U))
+			vect_get_store_cost (vinfo, stmt_info, 1,
+					     alignment_support_scheme,
+					     misalignment, &inside_cost,
+					     cost_vec);
+		      else
+			inside_cost
+			  += record_stmt_cost (cost_vec, 1, scalar_store,
+					       stmt_info, 0, vect_body);
+		      continue;
+		    }
 		  tree newref, newoff;
 		  gimple *incr, *assign;
 		  tree size = TYPE_SIZE (ltype);
@@ -8719,6 +8747,12 @@ vectorizable_store (vec_info *vinfo,
 	    break;
 	}
 
+      if (costing_p && dump_enabled_p ())
+	dump_printf_loc (MSG_NOTE, vect_location,
+			 "vect_model_store_cost: inside_cost = %d, "
+			 "prologue_cost = %d .\n",
+			 inside_cost, prologue_cost);
+
       return true;
     }
 
-- 
2.31.1


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 06/10] vect: Adjust vectorizable_store costing on VMAT_LOAD_STORE_LANES
  2023-09-14  3:11 [PATCH 00/10] vect: Move costing next to the transform for vect store Kewen Lin
                   ` (4 preceding siblings ...)
  2023-09-14  3:11 ` [PATCH 05/10] vect: Adjust vectorizable_store costing on VMAT_ELEMENTWISE and VMAT_STRIDED_SLP Kewen Lin
@ 2023-09-14  3:11 ` Kewen Lin
  2023-09-27 11:27   ` Richard Biener
  2023-09-14  3:11 ` [PATCH 07/10] vect: Adjust vectorizable_store costing on VMAT_CONTIGUOUS_PERMUTE Kewen Lin
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 23+ messages in thread
From: Kewen Lin @ 2023-09-14  3:11 UTC (permalink / raw)
  To: gcc-patches; +Cc: richard.guenther, richard.sandiford

This patch adjusts the cost handling on
VMAT_LOAD_STORE_LANES in function vectorizable_store.  We
no longer call function vect_model_store_cost for it.  This
is the interleaving store case, so all stmts except
first_stmt_info are skipped and the whole group is
considered when costing first_stmt_info.  This patch
shouldn't have any functional changes.
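
For illustration only (not taken from the patch), a
contiguous interleaving store group that can be vectorized
with store-lanes on targets supporting it (e.g. st2 on
aarch64) could look like:

  /* Hypothetical example: the group {a[2*i], a[2*i+1]} is
     an interleaving store; with VMAT_LOAD_STORE_LANES only
     first_stmt_info is costed, covering the whole group.  */
  void
  interleave (int *restrict a, int *restrict b,
              int *restrict c, int n)
  {
    for (int i = 0; i < n; i++)
      {
        a[2 * i] = b[i];
        a[2 * i + 1] = c[i];
      }
  }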

gcc/ChangeLog:

	* tree-vect-stmts.cc (vect_model_store_cost): Assert it will never
	get VMAT_LOAD_STORE_LANES.
	(vectorizable_store): Adjust the cost handling on VMAT_LOAD_STORE_LANES
	without calling vect_model_store_cost.  Factor out new lambda function
	update_prologue_cost.
---
 gcc/tree-vect-stmts.cc | 110 ++++++++++++++++++++++++++++-------------
 1 file changed, 75 insertions(+), 35 deletions(-)

diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index 3d01168080a..fbd16b8a487 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -966,7 +966,8 @@ vect_model_store_cost (vec_info *vinfo, stmt_vec_info stmt_info, int ncopies,
 {
   gcc_assert (memory_access_type != VMAT_GATHER_SCATTER
 	      && memory_access_type != VMAT_ELEMENTWISE
-	      && memory_access_type != VMAT_STRIDED_SLP);
+	      && memory_access_type != VMAT_STRIDED_SLP
+	      && memory_access_type != VMAT_LOAD_STORE_LANES);
   unsigned int inside_cost = 0, prologue_cost = 0;
   stmt_vec_info first_stmt_info = stmt_info;
   bool grouped_access_p = STMT_VINFO_GROUPED_ACCESS (stmt_info);
@@ -8408,7 +8409,8 @@ vectorizable_store (vec_info *vinfo,
       if (grouped_store
 	  && !slp
 	  && first_stmt_info != stmt_info
-	  && memory_access_type == VMAT_ELEMENTWISE)
+	  && (memory_access_type == VMAT_ELEMENTWISE
+	      || memory_access_type == VMAT_LOAD_STORE_LANES))
 	return true;
     }
   gcc_assert (memory_access_type == STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info));
@@ -8479,6 +8481,31 @@ vectorizable_store (vec_info *vinfo,
     dump_printf_loc (MSG_NOTE, vect_location, "transform store. ncopies = %d\n",
 		     ncopies);
 
+  /* Check if we need to update prologue cost for invariant,
+     and update it accordingly if so.  If it's not for
+     interleaving store, we can just check vls_type; but if
+     it's for interleaving store, need to check the def_type
+     of the stored value since the current vls_type is just
+     for first_stmt_info.  */
+  auto update_prologue_cost = [&](unsigned *prologue_cost, tree store_rhs)
+  {
+    gcc_assert (costing_p);
+    if (slp)
+      return;
+    if (grouped_store)
+      {
+	gcc_assert (store_rhs);
+	enum vect_def_type cdt;
+	gcc_assert (vect_is_simple_use (store_rhs, vinfo, &cdt));
+	if (cdt != vect_constant_def && cdt != vect_external_def)
+	  return;
+      }
+    else if (vls_type != VLS_STORE_INVARIANT)
+      return;
+    *prologue_cost += record_stmt_cost (cost_vec, 1, scalar_to_vec, stmt_info,
+					0, vect_prologue);
+  };
+
   if (memory_access_type == VMAT_ELEMENTWISE
       || memory_access_type == VMAT_STRIDED_SLP)
     {
@@ -8646,14 +8673,8 @@ vectorizable_store (vec_info *vinfo,
 	  if (!costing_p)
 	    vect_get_vec_defs (vinfo, next_stmt_info, slp_node, ncopies, op,
 			       &vec_oprnds);
-	  else if (!slp)
-	    {
-	      enum vect_def_type cdt;
-	      gcc_assert (vect_is_simple_use (op, vinfo, &cdt));
-	      if (cdt == vect_constant_def || cdt == vect_external_def)
-		prologue_cost += record_stmt_cost (cost_vec, 1, scalar_to_vec,
-						   stmt_info, 0, vect_prologue);
-	    }
+	  else
+	    update_prologue_cost (&prologue_cost, op);
 	  unsigned int group_el = 0;
 	  unsigned HOST_WIDE_INT
 	    elsz = tree_to_uhwi (TYPE_SIZE_UNIT (TREE_TYPE (vectype)));
@@ -8857,13 +8878,7 @@ vectorizable_store (vec_info *vinfo,
   if (memory_access_type == VMAT_LOAD_STORE_LANES)
     {
       gcc_assert (!slp && grouped_store);
-      if (costing_p)
-	{
-	  vect_model_store_cost (vinfo, stmt_info, ncopies, memory_access_type,
-				 alignment_support_scheme, misalignment,
-				 vls_type, slp_node, cost_vec);
-	  return true;
-	}
+      unsigned inside_cost = 0, prologue_cost = 0;
       for (j = 0; j < ncopies; j++)
 	{
 	  gimple *new_stmt;
@@ -8879,29 +8894,39 @@ vectorizable_store (vec_info *vinfo,
 		     DR_GROUP_SIZE is the exact number of stmts in the
 		     chain. Therefore, NEXT_STMT_INFO can't be NULL_TREE.  */
 		  op = vect_get_store_rhs (next_stmt_info);
-		  vect_get_vec_defs_for_operand (vinfo, next_stmt_info, ncopies,
-						 op, gvec_oprnds[i]);
-		  vec_oprnd = (*gvec_oprnds[i])[0];
-		  dr_chain.quick_push (vec_oprnd);
+		  if (costing_p)
+		    update_prologue_cost (&prologue_cost, op);
+		  else
+		    {
+		      vect_get_vec_defs_for_operand (vinfo, next_stmt_info,
+						     ncopies, op,
+						     gvec_oprnds[i]);
+		      vec_oprnd = (*gvec_oprnds[i])[0];
+		      dr_chain.quick_push (vec_oprnd);
+		    }
 		  next_stmt_info = DR_GROUP_NEXT_ELEMENT (next_stmt_info);
 		}
-	      if (mask)
+
+	      if (!costing_p)
 		{
-		  vect_get_vec_defs_for_operand (vinfo, stmt_info, ncopies,
-						 mask, &vec_masks,
-						 mask_vectype);
-		  vec_mask = vec_masks[0];
-		}
+		  if (mask)
+		    {
+		      vect_get_vec_defs_for_operand (vinfo, stmt_info, ncopies,
+						     mask, &vec_masks,
+						     mask_vectype);
+		      vec_mask = vec_masks[0];
+		    }
 
-	      /* We should have catched mismatched types earlier.  */
-	      gcc_assert (
-		useless_type_conversion_p (vectype, TREE_TYPE (vec_oprnd)));
-	      dataref_ptr
-		= vect_create_data_ref_ptr (vinfo, first_stmt_info, aggr_type,
-					    NULL, offset, &dummy, gsi,
-					    &ptr_incr, false, bump);
+		  /* We should have catched mismatched types earlier.  */
+		  gcc_assert (
+		    useless_type_conversion_p (vectype, TREE_TYPE (vec_oprnd)));
+		  dataref_ptr
+		    = vect_create_data_ref_ptr (vinfo, first_stmt_info,
+						aggr_type, NULL, offset, &dummy,
+						gsi, &ptr_incr, false, bump);
+		}
 	    }
-	  else
+	  else if (!costing_p)
 	    {
 	      gcc_assert (!LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo));
 	      /* DR_CHAIN is then used as an input to
@@ -8917,6 +8942,15 @@ vectorizable_store (vec_info *vinfo,
 					     stmt_info, bump);
 	    }
 
+	  if (costing_p)
+	    {
+	      for (i = 0; i < vec_num; i++)
+		vect_get_store_cost (vinfo, stmt_info, 1,
+				     alignment_support_scheme, misalignment,
+				     &inside_cost, cost_vec);
+	      continue;
+	    }
+
 	  /* Get an array into which we can store the individual vectors.  */
 	  tree vec_array = create_vector_array (vectype, vec_num);
 
@@ -9003,6 +9037,12 @@ vectorizable_store (vec_info *vinfo,
 	  STMT_VINFO_VEC_STMTS (stmt_info).safe_push (new_stmt);
 	}
 
+      if (costing_p && dump_enabled_p ())
+	dump_printf_loc (MSG_NOTE, vect_location,
+			 "vect_model_store_cost: inside_cost = %d, "
+			 "prologue_cost = %d .\n",
+			 inside_cost, prologue_cost);
+
       return true;
     }
 
-- 
2.31.1


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 07/10] vect: Adjust vectorizable_store costing on VMAT_CONTIGUOUS_PERMUTE
  2023-09-14  3:11 [PATCH 00/10] vect: Move costing next to the transform for vect store Kewen Lin
                   ` (5 preceding siblings ...)
  2023-09-14  3:11 ` [PATCH 06/10] vect: Adjust vectorizable_store costing on VMAT_LOAD_STORE_LANES Kewen Lin
@ 2023-09-14  3:11 ` Kewen Lin
  2023-09-27 11:28   ` Richard Biener
  2023-09-14  3:11 ` [PATCH/RFC 08/10] aarch64: Don't use CEIL for vector_store in aarch64_stp_sequence_cost Kewen Lin
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 23+ messages in thread
From: Kewen Lin @ 2023-09-14  3:11 UTC (permalink / raw)
  To: gcc-patches; +Cc: richard.guenther, richard.sandiford

This patch adjusts the cost handling on
VMAT_CONTIGUOUS_PERMUTE in function vectorizable_store.  We
no longer call function vect_model_store_cost for it.  This
is the interleaving store case, so all stmts except
first_stmt_info are skipped and the whole group is
considered when costing first_stmt_info.  This patch
shouldn't have any functional changes.
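
For illustration only (not taken from the patch), a group
of contiguous stores that is typically vectorized as
VMAT_CONTIGUOUS_PERMUTE when store-lanes isn't used, i.e.
the inputs are interleaved with vec_perm operations before
contiguous vector stores, could look like:

  /* Hypothetical example: four contiguous stores per
     iteration forming one interleaving group.  */
  void
  pack4 (int *restrict out, int *restrict r, int *restrict g,
         int *restrict b, int *restrict a, int n)
  {
    for (int i = 0; i < n; i++)
      {
        out[4 * i + 0] = r[i];
        out[4 * i + 1] = g[i];
        out[4 * i + 2] = b[i];
        out[4 * i + 3] = a[i];
      }
  }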

gcc/ChangeLog:

	* tree-vect-stmts.cc (vect_model_store_cost): Assert it will never
	get VMAT_CONTIGUOUS_PERMUTE and remove VMAT_CONTIGUOUS_PERMUTE related
	handlings.
	(vectorizable_store): Adjust the cost handling on
	VMAT_CONTIGUOUS_PERMUTE without calling vect_model_store_cost.
---
 gcc/tree-vect-stmts.cc | 128 ++++++++++++++++++++++++-----------------
 1 file changed, 74 insertions(+), 54 deletions(-)

diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index fbd16b8a487..e3ba8077091 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -967,10 +967,10 @@ vect_model_store_cost (vec_info *vinfo, stmt_vec_info stmt_info, int ncopies,
   gcc_assert (memory_access_type != VMAT_GATHER_SCATTER
 	      && memory_access_type != VMAT_ELEMENTWISE
 	      && memory_access_type != VMAT_STRIDED_SLP
-	      && memory_access_type != VMAT_LOAD_STORE_LANES);
+	      && memory_access_type != VMAT_LOAD_STORE_LANES
+	      && memory_access_type != VMAT_CONTIGUOUS_PERMUTE);
+
   unsigned int inside_cost = 0, prologue_cost = 0;
-  stmt_vec_info first_stmt_info = stmt_info;
-  bool grouped_access_p = STMT_VINFO_GROUPED_ACCESS (stmt_info);
 
   /* ???  Somehow we need to fix this at the callers.  */
   if (slp_node)
@@ -983,35 +983,6 @@ vect_model_store_cost (vec_info *vinfo, stmt_vec_info stmt_info, int ncopies,
 					   stmt_info, 0, vect_prologue);
     }
 
-  /* Grouped stores update all elements in the group at once,
-     so we want the DR for the first statement.  */
-  if (!slp_node && grouped_access_p)
-    first_stmt_info = DR_GROUP_FIRST_ELEMENT (stmt_info);
-
-  /* True if we should include any once-per-group costs as well as
-     the cost of the statement itself.  For SLP we only get called
-     once per group anyhow.  */
-  bool first_stmt_p = (first_stmt_info == stmt_info);
-
-  /* We assume that the cost of a single store-lanes instruction is
-     equivalent to the cost of DR_GROUP_SIZE separate stores.  If a grouped
-     access is instead being provided by a permute-and-store operation,
-     include the cost of the permutes.  */
-  if (first_stmt_p
-      && memory_access_type == VMAT_CONTIGUOUS_PERMUTE)
-    {
-      /* Uses a high and low interleave or shuffle operations for each
-	 needed permute.  */
-      int group_size = DR_GROUP_SIZE (first_stmt_info);
-      int nstmts = ncopies * ceil_log2 (group_size) * group_size;
-      inside_cost = record_stmt_cost (cost_vec, nstmts, vec_perm,
-				      stmt_info, 0, vect_body);
-
-      if (dump_enabled_p ())
-        dump_printf_loc (MSG_NOTE, vect_location,
-                         "vect_model_store_cost: strided group_size = %d .\n",
-                         group_size);
-    }
 
   /* Costs of the stores.  */
   vect_get_store_cost (vinfo, stmt_info, ncopies, alignment_support_scheme,
@@ -8408,9 +8379,7 @@ vectorizable_store (vec_info *vinfo,
 	 costing, use the first one instead.  */
       if (grouped_store
 	  && !slp
-	  && first_stmt_info != stmt_info
-	  && (memory_access_type == VMAT_ELEMENTWISE
-	      || memory_access_type == VMAT_LOAD_STORE_LANES))
+	  && first_stmt_info != stmt_info)
 	return true;
     }
   gcc_assert (memory_access_type == STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info));
@@ -9254,14 +9223,15 @@ vectorizable_store (vec_info *vinfo,
       return true;
     }
 
+  unsigned inside_cost = 0, prologue_cost = 0;
   auto_vec<tree> result_chain (group_size);
   auto_vec<tree, 1> vec_oprnds;
   for (j = 0; j < ncopies; j++)
     {
       gimple *new_stmt;
-      if (j == 0 && !costing_p)
+      if (j == 0)
 	{
-	  if (slp)
+	  if (slp && !costing_p)
 	    {
 	      /* Get vectorized arguments for SLP_NODE.  */
 	      vect_get_vec_defs (vinfo, stmt_info, slp_node, 1, op,
@@ -9287,13 +9257,20 @@ vectorizable_store (vec_info *vinfo,
 		     that there is no interleaving, DR_GROUP_SIZE is 1,
 		     and only one iteration of the loop will be executed.  */
 		  op = vect_get_store_rhs (next_stmt_info);
-		  vect_get_vec_defs_for_operand (vinfo, next_stmt_info, ncopies,
-						 op, gvec_oprnds[i]);
-		  vec_oprnd = (*gvec_oprnds[i])[0];
-		  dr_chain.quick_push (vec_oprnd);
+		  if (costing_p
+		      && memory_access_type == VMAT_CONTIGUOUS_PERMUTE)
+		    update_prologue_cost (&prologue_cost, op);
+		  else if (!costing_p)
+		    {
+		      vect_get_vec_defs_for_operand (vinfo, next_stmt_info,
+						     ncopies, op,
+						     gvec_oprnds[i]);
+		      vec_oprnd = (*gvec_oprnds[i])[0];
+		      dr_chain.quick_push (vec_oprnd);
+		    }
 		  next_stmt_info = DR_GROUP_NEXT_ELEMENT (next_stmt_info);
 		}
-	      if (mask)
+	      if (mask && !costing_p)
 		{
 		  vect_get_vec_defs_for_operand (vinfo, stmt_info, ncopies,
 						 mask, &vec_masks,
@@ -9303,11 +9280,13 @@ vectorizable_store (vec_info *vinfo,
 	    }
 
 	  /* We should have catched mismatched types earlier.  */
-	  gcc_assert (useless_type_conversion_p (vectype,
-						 TREE_TYPE (vec_oprnd)));
+	  gcc_assert (costing_p
+		      || useless_type_conversion_p (vectype,
+						    TREE_TYPE (vec_oprnd)));
 	  bool simd_lane_access_p
 	    = STMT_VINFO_SIMD_LANE_ACCESS_P (stmt_info) != 0;
-	  if (simd_lane_access_p
+	  if (!costing_p
+	      && simd_lane_access_p
 	      && !loop_masks
 	      && TREE_CODE (DR_BASE_ADDRESS (first_dr_info->dr)) == ADDR_EXPR
 	      && VAR_P (TREE_OPERAND (DR_BASE_ADDRESS (first_dr_info->dr), 0))
@@ -9319,7 +9298,7 @@ vectorizable_store (vec_info *vinfo,
 	      dataref_ptr = unshare_expr (DR_BASE_ADDRESS (first_dr_info->dr));
 	      dataref_offset = build_int_cst (ref_type, 0);
 	    }
-	  else
+	  else if (!costing_p)
 	    dataref_ptr
 	      = vect_create_data_ref_ptr (vinfo, first_stmt_info, aggr_type,
 					  simd_lane_access_p ? loop : NULL,
@@ -9347,16 +9326,46 @@ vectorizable_store (vec_info *vinfo,
 	}
 
       new_stmt = NULL;
-      if (!costing_p && grouped_store)
-	/* Permute.  */
-	vect_permute_store_chain (vinfo, dr_chain, group_size, stmt_info, gsi,
-				  &result_chain);
+      if (grouped_store)
+	{
+	  /* Permute.  */
+	  gcc_assert (memory_access_type == VMAT_CONTIGUOUS_PERMUTE);
+	  if (costing_p)
+	    {
+	      int group_size = DR_GROUP_SIZE (first_stmt_info);
+	      int nstmts = ceil_log2 (group_size) * group_size;
+	      inside_cost += record_stmt_cost (cost_vec, nstmts, vec_perm,
+					       stmt_info, 0, vect_body);
+	      if (dump_enabled_p ())
+		dump_printf_loc (MSG_NOTE, vect_location,
+				 "vect_model_store_cost: "
+				 "strided group_size = %d .\n",
+				 group_size);
+	    }
+	  else
+	    vect_permute_store_chain (vinfo, dr_chain, group_size, stmt_info,
+				      gsi, &result_chain);
+	}
 
       stmt_vec_info next_stmt_info = first_stmt_info;
       for (i = 0; i < vec_num; i++)
 	{
 	  if (costing_p)
-	    continue;
+	    {
+	      if (memory_access_type == VMAT_CONTIGUOUS_PERMUTE)
+		vect_get_store_cost (vinfo, stmt_info, 1,
+				     alignment_support_scheme, misalignment,
+				     &inside_cost, cost_vec);
+
+	      if (!slp)
+		{
+		  next_stmt_info = DR_GROUP_NEXT_ELEMENT (next_stmt_info);
+		  if (!next_stmt_info)
+		    break;
+		}
+
+	      continue;
+	    }
 	  unsigned misalign;
 	  unsigned HOST_WIDE_INT align;
 
@@ -9540,9 +9549,20 @@ vectorizable_store (vec_info *vinfo,
     }
 
   if (costing_p)
-    vect_model_store_cost (vinfo, stmt_info, ncopies, memory_access_type,
-			   alignment_support_scheme, misalignment, vls_type,
-			   slp_node, cost_vec);
+    {
+      if (memory_access_type == VMAT_CONTIGUOUS_PERMUTE)
+	{
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_NOTE, vect_location,
+			     "vect_model_store_cost: inside_cost = %d, "
+			     "prologue_cost = %d .\n",
+			     inside_cost, prologue_cost);
+	}
+      else
+	vect_model_store_cost (vinfo, stmt_info, ncopies, memory_access_type,
+			       alignment_support_scheme, misalignment, vls_type,
+			       slp_node, cost_vec);
+    }
 
   return true;
 }
-- 
2.31.1


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH/RFC 08/10] aarch64: Don't use CEIL for vector_store in aarch64_stp_sequence_cost
  2023-09-14  3:11 [PATCH 00/10] vect: Move costing next to the transform for vect store Kewen Lin
                   ` (6 preceding siblings ...)
  2023-09-14  3:11 ` [PATCH 07/10] vect: Adjust vectorizable_store costing on VMAT_CONTIGUOUS_PERMUTE Kewen Lin
@ 2023-09-14  3:11 ` Kewen Lin
  2023-09-18  8:41   ` Richard Sandiford
  2023-09-14  3:11 ` [PATCH 09/10] vect: Get rid of vect_model_store_cost Kewen Lin
  2023-09-14  3:11 ` [PATCH 10/10] vect: Consider vec_perm costing for VMAT_CONTIGUOUS_REVERSE Kewen Lin
  9 siblings, 1 reply; 23+ messages in thread
From: Kewen Lin @ 2023-09-14  3:11 UTC (permalink / raw)
  To: gcc-patches; +Cc: richard.guenther, richard.sandiford

This costing adjustment patch series exposes one issue in
the aarch64-specific costing adjustment for STP sequences.
It causes the below test cases to fail:

  - gcc/testsuite/gcc.target/aarch64/ldp_stp_15.c
  - gcc/testsuite/gcc.target/aarch64/ldp_stp_16.c
  - gcc/testsuite/gcc.target/aarch64/ldp_stp_17.c
  - gcc/testsuite/gcc.target/aarch64/ldp_stp_18.c

Take the below function extracted from ldp_stp_15.c as
an example:

void
dup_8_int32_t (int32_t *x, int32_t val)
{
    for (int i = 0; i < 8; ++i)
          x[i] = val;
}

Without my patch series, during slp1 it gets:

  val_8(D) 2 times unaligned_store (misalign -1) costs 2 in body
  node 0x10008c85e38 1 times scalar_to_vec costs 1 in prologue

then the final vector cost is 3.

With my patch series, during slp1 it gets:

  val_8(D) 1 times unaligned_store (misalign -1) costs 1 in body
  val_8(D) 1 times unaligned_store (misalign -1) costs 1 in body
  node 0x10004cc5d88 1 times scalar_to_vec costs 1 in prologue

but the final vector cost is 17.  The unaligned_store count is
actually unchanged, but the final vector costs differ because
the aarch64-specific handling below computes them differently:

  /* Apply the heuristic described above m_stp_sequence_cost.  */
  if (m_stp_sequence_cost != ~0U)
    {
      uint64_t cost = aarch64_stp_sequence_cost (count, kind,
						 stmt_info, vectype);
      m_stp_sequence_cost = MIN (m_stp_sequence_cost + cost, ~0U);
    }

For the former, since the count is 2, function
aarch64_stp_sequence_cost returns 2 as "CEIL (count, 2) * 2".
For the latter, it's split into two calls with count 1, and
aarch64_stp_sequence_cost returns 2 each time, so it returns
4 in total.

For this case, the stmt with scalar_to_vec also contributes
4 to m_stp_sequence_cost, so the final m_stp_sequence_cost
values are 6 (2+4) vs. 8 (4+4).

Considering that scalar_costs->m_stp_sequence_cost is 8, and
the below check and re-assignment:

  else if (m_stp_sequence_cost >= scalar_costs->m_stp_sequence_cost)
    m_costs[vect_body] = 2 * scalar_costs->total_cost ();

For the former, the vector body cost isn't changed; but for
the latter, the vector body cost becomes double the scalar
cost, which is 8 for this case, so it becomes 16, bigger
than what we expect.
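
To make the arithmetic concrete, below is a standalone toy
recomputation (not GCC code; CEIL is just the usual round-up
macro) of the numbers above, assuming unaligned_store is costed
as "CEIL (count, 2) * 2" and scalar_to_vec contributes 4 as
described:

#include <stdio.h>

#define CEIL(x, y) (((x) + (y) - 1) / (y))

int
main (void)
{
  /* Before the series: one unaligned_store costing call with count 2.  */
  unsigned before = CEIL (2, 2) * 2 + 4;      /* 2 + 4 = 6 */
  /* After the series: two unaligned_store costing calls with count 1.  */
  unsigned after = 2 * (CEIL (1, 2) * 2) + 4; /* 4 + 4 = 8 */
  /* The scalar m_stp_sequence_cost is 8, so only the latter reaches the
     threshold that doubles the vector body cost to 16.  */
  printf ("before: %u, after: %u\n", before, after);
  return 0;
}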

I'm not sure why it adopts CEIL for the return value for
case unaligned_store in function aarch64_stp_sequence_cost,
but I tried to modify it with "return count;" (as it can
get back to the previous cost), and no failures were exposed
in regression testing.  I expected that if the previous
unaligned_store count is even, this adjustment doesn't
change anything; if it's odd, the adjustment may reduce
it by one, but I'd guess such cases would be few.  Besides,
as the comments for m_stp_sequence_cost note, the current
handling seems temporary, so maybe a tweak like this can be
accepted.  I posted this RFC/PATCH to request comments on
this one-line change.

gcc/ChangeLog:

	* config/aarch64/aarch64.cc (aarch64_stp_sequence_cost): Return
	count directly instead of the adjusted value computed with CEIL.
---
 gcc/config/aarch64/aarch64.cc | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 37d414021ca..9fb4fbd883d 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -17051,7 +17051,7 @@ aarch64_stp_sequence_cost (unsigned int count, vect_cost_for_stmt kind,
 	  if (!aarch64_aligned_constant_offset_p (stmt_info, size))
 	    return count * 2;
 	}
-      return CEIL (count, 2) * 2;
+      return count;
 
     case scalar_store:
       if (stmt_info && STMT_VINFO_DATA_REF (stmt_info))
-- 
2.31.1


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 09/10] vect: Get rid of vect_model_store_cost
  2023-09-14  3:11 [PATCH 00/10] vect: Move costing next to the transform for vect store Kewen Lin
                   ` (7 preceding siblings ...)
  2023-09-14  3:11 ` [PATCH/RFC 08/10] aarch64: Don't use CEIL for vector_store in aarch64_stp_sequence_cost Kewen Lin
@ 2023-09-14  3:11 ` Kewen Lin
  2023-09-27 11:29   ` Richard Biener
  2023-09-14  3:11 ` [PATCH 10/10] vect: Consider vec_perm costing for VMAT_CONTIGUOUS_REVERSE Kewen Lin
  9 siblings, 1 reply; 23+ messages in thread
From: Kewen Lin @ 2023-09-14  3:11 UTC (permalink / raw)
  To: gcc-patches; +Cc: richard.guenther, richard.sandiford

This patch eventually gets rid of vect_model_store_cost; it
adjusts the costing for the remaining memory access types
VMAT_CONTIGUOUS{, _DOWN, _REVERSE} by moving the costing close
to the transform code.  Note that vect_model_store_cost has
one piece of special handling for vectorizing a store into
the function result: since it's an extra penalty and the
transform part has no counterpart for it, this patch leaves
it alone.
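
As a hypothetical illustration (the names are made up; whether
the result really comes back in a multi-register location
depends on the target ABI), the penalty is meant for code like:

typedef struct { double v[2]; } pair_t;

pair_t
fill_pair (double x)
{
  pair_t r;
  /* Vectorizable stores into the function result.  */
  for (int i = 0; i < 2; i++)
    r.v[i] = x;
  /* If pair_t is returned in two registers, the vector result may
     need to be spilled and reloaded piecewise, hence the
     conservative penalty.  */
  return r;
}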

gcc/ChangeLog:

	* tree-vect-stmts.cc (vect_model_store_cost): Remove.
	(vectorizable_store): Adjust the costing for the remaining memory
	access types VMAT_CONTIGUOUS{, _DOWN, _REVERSE}.
---
 gcc/tree-vect-stmts.cc | 137 +++++++++++++----------------------------
 1 file changed, 44 insertions(+), 93 deletions(-)

diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index e3ba8077091..3d451c80bca 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -951,81 +951,6 @@ cfun_returns (tree decl)
   return false;
 }
 
-/* Function vect_model_store_cost
-
-   Models cost for stores.  In the case of grouped accesses, one access
-   has the overhead of the grouped access attributed to it.  */
-
-static void
-vect_model_store_cost (vec_info *vinfo, stmt_vec_info stmt_info, int ncopies,
-		       vect_memory_access_type memory_access_type,
-		       dr_alignment_support alignment_support_scheme,
-		       int misalignment,
-		       vec_load_store_type vls_type, slp_tree slp_node,
-		       stmt_vector_for_cost *cost_vec)
-{
-  gcc_assert (memory_access_type != VMAT_GATHER_SCATTER
-	      && memory_access_type != VMAT_ELEMENTWISE
-	      && memory_access_type != VMAT_STRIDED_SLP
-	      && memory_access_type != VMAT_LOAD_STORE_LANES
-	      && memory_access_type != VMAT_CONTIGUOUS_PERMUTE);
-
-  unsigned int inside_cost = 0, prologue_cost = 0;
-
-  /* ???  Somehow we need to fix this at the callers.  */
-  if (slp_node)
-    ncopies = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node);
-
-  if (vls_type == VLS_STORE_INVARIANT)
-    {
-      if (!slp_node)
-	prologue_cost += record_stmt_cost (cost_vec, 1, scalar_to_vec,
-					   stmt_info, 0, vect_prologue);
-    }
-
-
-  /* Costs of the stores.  */
-  vect_get_store_cost (vinfo, stmt_info, ncopies, alignment_support_scheme,
-		       misalignment, &inside_cost, cost_vec);
-
-  /* When vectorizing a store into the function result assign
-     a penalty if the function returns in a multi-register location.
-     In this case we assume we'll end up with having to spill the
-     vector result and do piecewise loads as a conservative estimate.  */
-  tree base = get_base_address (STMT_VINFO_DATA_REF (stmt_info)->ref);
-  if (base
-      && (TREE_CODE (base) == RESULT_DECL
-	  || (DECL_P (base) && cfun_returns (base)))
-      && !aggregate_value_p (base, cfun->decl))
-    {
-      rtx reg = hard_function_value (TREE_TYPE (base), cfun->decl, 0, 1);
-      /* ???  Handle PARALLEL in some way.  */
-      if (REG_P (reg))
-	{
-	  int nregs = hard_regno_nregs (REGNO (reg), GET_MODE (reg));
-	  /* Assume that a single reg-reg move is possible and cheap,
-	     do not account for vector to gp register move cost.  */
-	  if (nregs > 1)
-	    {
-	      /* Spill.  */
-	      prologue_cost += record_stmt_cost (cost_vec, ncopies,
-						 vector_store,
-						 stmt_info, 0, vect_epilogue);
-	      /* Loads.  */
-	      prologue_cost += record_stmt_cost (cost_vec, ncopies * nregs,
-						 scalar_load,
-						 stmt_info, 0, vect_epilogue);
-	    }
-	}
-    }
-
-  if (dump_enabled_p ())
-    dump_printf_loc (MSG_NOTE, vect_location,
-                     "vect_model_store_cost: inside_cost = %d, "
-                     "prologue_cost = %d .\n", inside_cost, prologue_cost);
-}
-
-
 /* Calculate cost of DR's memory access.  */
 void
 vect_get_store_cost (vec_info *, stmt_vec_info stmt_info, int ncopies,
@@ -9223,6 +9148,11 @@ vectorizable_store (vec_info *vinfo,
       return true;
     }
 
+  gcc_assert (memory_access_type == VMAT_CONTIGUOUS
+	      || memory_access_type == VMAT_CONTIGUOUS_DOWN
+	      || memory_access_type == VMAT_CONTIGUOUS_PERMUTE
+	      || memory_access_type == VMAT_CONTIGUOUS_REVERSE);
+
   unsigned inside_cost = 0, prologue_cost = 0;
   auto_vec<tree> result_chain (group_size);
   auto_vec<tree, 1> vec_oprnds;
@@ -9257,10 +9187,9 @@ vectorizable_store (vec_info *vinfo,
 		     that there is no interleaving, DR_GROUP_SIZE is 1,
 		     and only one iteration of the loop will be executed.  */
 		  op = vect_get_store_rhs (next_stmt_info);
-		  if (costing_p
-		      && memory_access_type == VMAT_CONTIGUOUS_PERMUTE)
+		  if (costing_p)
 		    update_prologue_cost (&prologue_cost, op);
-		  else if (!costing_p)
+		  else
 		    {
 		      vect_get_vec_defs_for_operand (vinfo, next_stmt_info,
 						     ncopies, op,
@@ -9352,10 +9281,9 @@ vectorizable_store (vec_info *vinfo,
 	{
 	  if (costing_p)
 	    {
-	      if (memory_access_type == VMAT_CONTIGUOUS_PERMUTE)
-		vect_get_store_cost (vinfo, stmt_info, 1,
-				     alignment_support_scheme, misalignment,
-				     &inside_cost, cost_vec);
+	      vect_get_store_cost (vinfo, stmt_info, 1,
+				   alignment_support_scheme, misalignment,
+				   &inside_cost, cost_vec);
 
 	      if (!slp)
 		{
@@ -9550,18 +9478,41 @@ vectorizable_store (vec_info *vinfo,
 
   if (costing_p)
     {
-      if (memory_access_type == VMAT_CONTIGUOUS_PERMUTE)
-	{
-	  if (dump_enabled_p ())
-	    dump_printf_loc (MSG_NOTE, vect_location,
-			     "vect_model_store_cost: inside_cost = %d, "
-			     "prologue_cost = %d .\n",
-			     inside_cost, prologue_cost);
+      /* When vectorizing a store into the function result assign
+	 a penalty if the function returns in a multi-register location.
+	 In this case we assume we'll end up with having to spill the
+	 vector result and do piecewise loads as a conservative estimate.  */
+      tree base = get_base_address (STMT_VINFO_DATA_REF (stmt_info)->ref);
+      if (base
+	  && (TREE_CODE (base) == RESULT_DECL
+	      || (DECL_P (base) && cfun_returns (base)))
+	  && !aggregate_value_p (base, cfun->decl))
+	{
+	  rtx reg = hard_function_value (TREE_TYPE (base), cfun->decl, 0, 1);
+	  /* ???  Handle PARALLEL in some way.  */
+	  if (REG_P (reg))
+	    {
+	      int nregs = hard_regno_nregs (REGNO (reg), GET_MODE (reg));
+	      /* Assume that a single reg-reg move is possible and cheap,
+		 do not account for vector to gp register move cost.  */
+	      if (nregs > 1)
+		{
+		  /* Spill.  */
+		  prologue_cost
+		    += record_stmt_cost (cost_vec, ncopies, vector_store,
+					 stmt_info, 0, vect_epilogue);
+		  /* Loads.  */
+		  prologue_cost
+		    += record_stmt_cost (cost_vec, ncopies * nregs, scalar_load,
+					 stmt_info, 0, vect_epilogue);
+		}
+	    }
 	}
-      else
-	vect_model_store_cost (vinfo, stmt_info, ncopies, memory_access_type,
-			       alignment_support_scheme, misalignment, vls_type,
-			       slp_node, cost_vec);
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_NOTE, vect_location,
+			 "vect_model_store_cost: inside_cost = %d, "
+			 "prologue_cost = %d .\n",
+			 inside_cost, prologue_cost);
     }
 
   return true;
-- 
2.31.1


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 10/10] vect: Consider vec_perm costing for VMAT_CONTIGUOUS_REVERSE
  2023-09-14  3:11 [PATCH 00/10] vect: Move costing next to the transform for vect store Kewen Lin
                   ` (8 preceding siblings ...)
  2023-09-14  3:11 ` [PATCH 09/10] vect: Get rid of vect_model_store_cost Kewen Lin
@ 2023-09-14  3:11 ` Kewen Lin
  2023-09-27 11:30   ` Richard Biener
  9 siblings, 1 reply; 23+ messages in thread
From: Kewen Lin @ 2023-09-14  3:11 UTC (permalink / raw)
  To: gcc-patches; +Cc: richard.guenther, richard.sandiford

For VMAT_CONTIGUOUS_REVERSE, the transform code in function
vectorizable_store generates a VEC_PERM_EXPR stmt before
storing, but it's never considered in costing.

This patch makes costing consider that vec_perm; it also
adjusts the order of the transform code a bit to make it
easy to return early for costing_p.
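
Conceptually the transform does something like the following
standalone sketch (plain C with GNU vector extensions, not
actual vectorizer output; it assumes a suitably aligned
destination):

typedef int v4si __attribute__ ((vector_size (16)));

static inline void
store_reversed (int *dst, v4si val)
{
  /* The reversing permute is the statement now costed as vec_perm.  */
  v4si rev = __builtin_shufflevector (val, val, 3, 2, 1, 0);
  /* Followed by the normal contiguous vector store.  */
  *(v4si *) dst = rev;
}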

gcc/ChangeLog:

	* tree-vect-stmts.cc (vectorizable_store): Consider generated
	VEC_PERM_EXPR stmt for VMAT_CONTIGUOUS_REVERSE in costing as
	vec_perm.

gcc/testsuite/ChangeLog:

	* gcc.dg/vect/costmodel/ppc/costmodel-vect-store-2.c: New test.
---
 .../costmodel/ppc/costmodel-vect-store-2.c    | 29 +++++++++
 gcc/tree-vect-stmts.cc                        | 63 +++++++++++--------
 2 files changed, 65 insertions(+), 27 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/ppc/costmodel-vect-store-2.c

diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/ppc/costmodel-vect-store-2.c b/gcc/testsuite/gcc.dg/vect/costmodel/ppc/costmodel-vect-store-2.c
new file mode 100644
index 00000000000..72b67cf9040
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/costmodel/ppc/costmodel-vect-store-2.c
@@ -0,0 +1,29 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-require-effective-target powerpc_vsx_ok } */
+/* { dg-additional-options "-mvsx" } */
+
+/* Verify we do cost the required vec_perm.  */
+
+int
+foo (int *a, int *b, int len)
+{
+  int i;
+  int *a1 = a;
+  int *a0 = a1 - 4;
+  for (i = 0; i < len; i++)
+    {
+      *b = *a0 + *a1;
+      b--;
+      a0++;
+      a1++;
+    }
+  return 0;
+}
+
+/* The reason why it doesn't check the exact count is that
+   we can get more than 1 vec_perm when it's compiled with
+   partial vector capability like Power10 (retrying for
+   the epilogue) or it's compiled without unaligned vector
+   memory access support (realign).  */
+/* { dg-final { scan-tree-dump {\mvec_perm\M} "vect" } } */
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index 3d451c80bca..ce925cc1d53 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -9279,6 +9279,40 @@ vectorizable_store (vec_info *vinfo,
       stmt_vec_info next_stmt_info = first_stmt_info;
       for (i = 0; i < vec_num; i++)
 	{
+	  if (!costing_p)
+	    {
+	      if (slp)
+		vec_oprnd = vec_oprnds[i];
+	      else if (grouped_store)
+		/* For grouped stores vectorized defs are interleaved in
+		   vect_permute_store_chain().  */
+		vec_oprnd = result_chain[i];
+	    }
+
+	  if (memory_access_type == VMAT_CONTIGUOUS_REVERSE)
+	    {
+	      if (costing_p)
+		inside_cost += record_stmt_cost (cost_vec, 1, vec_perm,
+						 stmt_info, 0, vect_body);
+	      else
+		{
+		  tree perm_mask = perm_mask_for_reverse (vectype);
+		  tree perm_dest = vect_create_destination_var (
+		    vect_get_store_rhs (stmt_info), vectype);
+		  tree new_temp = make_ssa_name (perm_dest);
+
+		  /* Generate the permute statement.  */
+		  gimple *perm_stmt
+		    = gimple_build_assign (new_temp, VEC_PERM_EXPR, vec_oprnd,
+					   vec_oprnd, perm_mask);
+		  vect_finish_stmt_generation (vinfo, stmt_info, perm_stmt,
+					       gsi);
+
+		  perm_stmt = SSA_NAME_DEF_STMT (new_temp);
+		  vec_oprnd = new_temp;
+		}
+	    }
+
 	  if (costing_p)
 	    {
 	      vect_get_store_cost (vinfo, stmt_info, 1,
@@ -9294,8 +9328,6 @@ vectorizable_store (vec_info *vinfo,
 
 	      continue;
 	    }
-	  unsigned misalign;
-	  unsigned HOST_WIDE_INT align;
 
 	  tree final_mask = NULL_TREE;
 	  tree final_len = NULL_TREE;
@@ -9315,13 +9347,8 @@ vectorizable_store (vec_info *vinfo,
 	    dataref_ptr = bump_vector_ptr (vinfo, dataref_ptr, ptr_incr, gsi,
 					   stmt_info, bump);
 
-	  if (slp)
-	    vec_oprnd = vec_oprnds[i];
-	  else if (grouped_store)
-	    /* For grouped stores vectorized defs are interleaved in
-	       vect_permute_store_chain().  */
-	    vec_oprnd = result_chain[i];
-
+	  unsigned misalign;
+	  unsigned HOST_WIDE_INT align;
 	  align = known_alignment (DR_TARGET_ALIGNMENT (first_dr_info));
 	  if (alignment_support_scheme == dr_aligned)
 	    misalign = 0;
@@ -9338,24 +9365,6 @@ vectorizable_store (vec_info *vinfo,
 				    misalign);
 	  align = least_bit_hwi (misalign | align);
 
-	  if (memory_access_type == VMAT_CONTIGUOUS_REVERSE)
-	    {
-	      tree perm_mask = perm_mask_for_reverse (vectype);
-	      tree perm_dest
-		= vect_create_destination_var (vect_get_store_rhs (stmt_info),
-					       vectype);
-	      tree new_temp = make_ssa_name (perm_dest);
-
-	      /* Generate the permute statement.  */
-	      gimple *perm_stmt
-		= gimple_build_assign (new_temp, VEC_PERM_EXPR, vec_oprnd,
-				       vec_oprnd, perm_mask);
-	      vect_finish_stmt_generation (vinfo, stmt_info, perm_stmt, gsi);
-
-	      perm_stmt = SSA_NAME_DEF_STMT (new_temp);
-	      vec_oprnd = new_temp;
-	    }
-
 	  /* Compute IFN when LOOP_LENS or final_mask valid.  */
 	  machine_mode vmode = TYPE_MODE (vectype);
 	  machine_mode new_vmode = vmode;
-- 
2.31.1


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH/RFC 08/10] aarch64: Don't use CEIL for vector_store in aarch64_stp_sequence_cost
  2023-09-14  3:11 ` [PATCH/RFC 08/10] aarch64: Don't use CEIL for vector_store in aarch64_stp_sequence_cost Kewen Lin
@ 2023-09-18  8:41   ` Richard Sandiford
  2023-09-18  8:53     ` Richard Biener
  0 siblings, 1 reply; 23+ messages in thread
From: Richard Sandiford @ 2023-09-18  8:41 UTC (permalink / raw)
  To: Kewen Lin; +Cc: gcc-patches, richard.guenther

Kewen Lin <linkw@linux.ibm.com> writes:
> This costing adjustment patch series exposes one issue in
> aarch64 specific costing adjustment for STP sequence.  It
> causes the below test cases to fail:
>
>   - gcc/testsuite/gcc.target/aarch64/ldp_stp_15.c
>   - gcc/testsuite/gcc.target/aarch64/ldp_stp_16.c
>   - gcc/testsuite/gcc.target/aarch64/ldp_stp_17.c
>   - gcc/testsuite/gcc.target/aarch64/ldp_stp_18.c
>
> Take the below function extracted from ldp_stp_15.c as
> example:
>
> void
> dup_8_int32_t (int32_t *x, int32_t val)
> {
>     for (int i = 0; i < 8; ++i)
>           x[i] = val;
> }
>
> Without my patch series, during slp1 it gets:
>
>   val_8(D) 2 times unaligned_store (misalign -1) costs 2 in body
>   node 0x10008c85e38 1 times scalar_to_vec costs 1 in prologue
>
> then the final vector cost is 3.
>
> With my patch series, during slp1 it gets:
>
>   val_8(D) 1 times unaligned_store (misalign -1) costs 1 in body
>   val_8(D) 1 times unaligned_store (misalign -1) costs 1 in body
>   node 0x10004cc5d88 1 times scalar_to_vec costs 1 in prologue
>
> but the final vector cost is 17.  The unaligned_store count is
> actually unchanged, but the final vector costs become different,
> it's because the below aarch64 special handling makes the
> different costs:
>
>   /* Apply the heuristic described above m_stp_sequence_cost.  */
>   if (m_stp_sequence_cost != ~0U)
>     {
>       uint64_t cost = aarch64_stp_sequence_cost (count, kind,
> 						 stmt_info, vectype);
>       m_stp_sequence_cost = MIN (m_stp_sequence_cost + cost, ~0U);
>     }
>
> For the former, since the count is 2, function
> aarch64_stp_sequence_cost returns 2 as "CEIL (count, 2) * 2".
> While for the latter, it's separated into twice calls with
> count 1, aarch64_stp_sequence_cost returns 2 for each time,
> so it returns 4 in total.
>
> For this case, the stmt with scalar_to_vec also contributes
> 4 to m_stp_sequence_cost, then the final m_stp_sequence_cost
> are 6 (2+4) vs. 8 (4+4).
>
> Considering scalar_costs->m_stp_sequence_cost is 8 and below
> checking and re-assigning:
>
>   else if (m_stp_sequence_cost >= scalar_costs->m_stp_sequence_cost)
>     m_costs[vect_body] = 2 * scalar_costs->total_cost ();
>
> For the former, the body cost of vector isn't changed; but
> for the latter, the body cost of vector is double of scalar
> cost which is 8 for this case, then it becomes 16 which is
> bigger than what we expect.
>
> I'm not sure why it adopts CEIL for the return value for
> case unaligned_store in function aarch64_stp_sequence_cost,
> but I tried to modify it with "return count;" (as it can
> get back to previous cost), there is no failures exposed
> in regression testing.  I expected that if the previous
> unaligned_store count is even, this adjustment doesn't
> change anything, if it's odd, the adjustment may reduce
> it by one, but I'd guess it would be few.  Besides, as
> the comments for m_stp_sequence_cost, the current
> handlings seems temporary, maybe a tweak like this can be
> accepted, so I posted this RFC/PATCH to request comments.
> this one line change is considered.

It's unfortunate that doing this didn't show up a regression.
I guess it's not a change we explicitly added tests to guard against.

But the point of the condition is to estimate how many single stores
(STRs) and how many paired stores (STPs) would be generated.  As far
as this heuristic goes, STP (storing two values) is as cheap as STR
(storing only one value).  So the point of the CEIL is to count 1 store
as having equal cost to 2, 3 as having equal cost to 4, etc.

For a heuristic like that, costing a vector stmt once with count 2
is different from costing 2 vector stmts with count 1.  The former
makes it obvious that the 2 vector stmts are associated with the
same scalar stmt, and are highly likely to be consecutive.  The latter
(costing 2 stmts with count 1) could also happen for unrelated stmts.

ISTM that costing once with count N provides strictly more information
to targets than costing N time with count 1.  Is there no way we can
keep the current behaviour?  E.g. rather than costing a stmt immediately
within a loop, could we just increment a counter and cost once at the end?
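
Just to illustrate what I mean, here is a standalone toy (not
vectorizer code; stp_like_cost stands in for the target hook):

#include <stdio.h>

/* Mimics the CEIL (count, 2) * 2 heuristic for unaligned_store.  */
static unsigned
stp_like_cost (unsigned count)
{
  return (count + 1) / 2 * 2;
}

int
main (void)
{
  unsigned vec_num = 2, per_copy = 0;

  /* Costing each copy separately inside the loop loses the adjacency.  */
  for (unsigned i = 0; i < vec_num; i++)
    per_copy += stp_like_cost (1);              /* 2 + 2 = 4 */

  /* Counting first and costing once at the end keeps it.  */
  unsigned deferred = stp_like_cost (vec_num);  /* 2 */

  printf ("per-copy %u vs. deferred %u\n", per_copy, deferred);
  return 0;
}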

Thanks,
Richard

> gcc/ChangeLog:
>
> 	* config/aarch64/aarch64.cc (aarch64_stp_sequence_cost): Return
> 	count directly instead of the adjusted value computed with CEIL.
> ---
>  gcc/config/aarch64/aarch64.cc | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 37d414021ca..9fb4fbd883d 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -17051,7 +17051,7 @@ aarch64_stp_sequence_cost (unsigned int count, vect_cost_for_stmt kind,
>  	  if (!aarch64_aligned_constant_offset_p (stmt_info, size))
>  	    return count * 2;
>  	}
> -      return CEIL (count, 2) * 2;
> +      return count;
>  
>      case scalar_store:
>        if (stmt_info && STMT_VINFO_DATA_REF (stmt_info))

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH/RFC 08/10] aarch64: Don't use CEIL for vector_store in aarch64_stp_sequence_cost
  2023-09-18  8:41   ` Richard Sandiford
@ 2023-09-18  8:53     ` Richard Biener
  2023-09-20  2:40       ` Kewen.Lin
  0 siblings, 1 reply; 23+ messages in thread
From: Richard Biener @ 2023-09-18  8:53 UTC (permalink / raw)
  To: Kewen Lin, gcc-patches, richard.guenther, richard.sandiford

On Mon, Sep 18, 2023 at 10:41 AM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Kewen Lin <linkw@linux.ibm.com> writes:
> > This costing adjustment patch series exposes one issue in
> > aarch64 specific costing adjustment for STP sequence.  It
> > causes the below test cases to fail:
> >
> >   - gcc/testsuite/gcc.target/aarch64/ldp_stp_15.c
> >   - gcc/testsuite/gcc.target/aarch64/ldp_stp_16.c
> >   - gcc/testsuite/gcc.target/aarch64/ldp_stp_17.c
> >   - gcc/testsuite/gcc.target/aarch64/ldp_stp_18.c
> >
> > Take the below function extracted from ldp_stp_15.c as
> > example:
> >
> > void
> > dup_8_int32_t (int32_t *x, int32_t val)
> > {
> >     for (int i = 0; i < 8; ++i)
> >           x[i] = val;
> > }
> >
> > Without my patch series, during slp1 it gets:
> >
> >   val_8(D) 2 times unaligned_store (misalign -1) costs 2 in body
> >   node 0x10008c85e38 1 times scalar_to_vec costs 1 in prologue
> >
> > then the final vector cost is 3.
> >
> > With my patch series, during slp1 it gets:
> >
> >   val_8(D) 1 times unaligned_store (misalign -1) costs 1 in body
> >   val_8(D) 1 times unaligned_store (misalign -1) costs 1 in body
> >   node 0x10004cc5d88 1 times scalar_to_vec costs 1 in prologue
> >
> > but the final vector cost is 17.  The unaligned_store count is
> > actually unchanged, but the final vector costs become different,
> > it's because the below aarch64 special handling makes the
> > different costs:
> >
> >   /* Apply the heuristic described above m_stp_sequence_cost.  */
> >   if (m_stp_sequence_cost != ~0U)
> >     {
> >       uint64_t cost = aarch64_stp_sequence_cost (count, kind,
> >                                                stmt_info, vectype);
> >       m_stp_sequence_cost = MIN (m_stp_sequence_cost + cost, ~0U);
> >     }
> >
> > For the former, since the count is 2, function
> > aarch64_stp_sequence_cost returns 2 as "CEIL (count, 2) * 2".
> > While for the latter, it's separated into twice calls with
> > count 1, aarch64_stp_sequence_cost returns 2 for each time,
> > so it returns 4 in total.
> >
> > For this case, the stmt with scalar_to_vec also contributes
> > 4 to m_stp_sequence_cost, then the final m_stp_sequence_cost
> > are 6 (2+4) vs. 8 (4+4).
> >
> > Considering scalar_costs->m_stp_sequence_cost is 8 and below
> > checking and re-assigning:
> >
> >   else if (m_stp_sequence_cost >= scalar_costs->m_stp_sequence_cost)
> >     m_costs[vect_body] = 2 * scalar_costs->total_cost ();
> >
> > For the former, the body cost of vector isn't changed; but
> > for the latter, the body cost of vector is double of scalar
> > cost which is 8 for this case, then it becomes 16 which is
> > bigger than what we expect.
> >
> > I'm not sure why it adopts CEIL for the return value for
> > case unaligned_store in function aarch64_stp_sequence_cost,
> > but I tried to modify it with "return count;" (as it can
> > get back to previous cost), there is no failures exposed
> > in regression testing.  I expected that if the previous
> > unaligned_store count is even, this adjustment doesn't
> > change anything, if it's odd, the adjustment may reduce
> > it by one, but I'd guess it would be few.  Besides, as
> > the comments for m_stp_sequence_cost, the current
> > handlings seems temporary, maybe a tweak like this can be
> > accepted, so I posted this RFC/PATCH to request comments.
> > this one line change is considered.
>
> It's unfortunate that doing this didn't show up a regression.
> I guess it's not a change we explicitly added tests to guard against.
>
> But the point of the condition is to estimate how many single stores
> (STRs) and how many paired stores (STPs) would be generated.  As far
> as this heuristic goes, STP (storing two values) is as cheap as STR
> (storing only one value).  So the point of the CEIL is to count 1 store
> as having equal cost to 2, 3 as having equal cost to 4, etc.
>
> For a heuristic like that, costing a vector stmt once with count 2
> is different from costing 2 vector stmts with count 1.  The former
> makes it obvious that the 2 vector stmts are associated with the
> same scalar stmt, and are highly likely to be consecutive.  The latter
> (costing 2 stmts with count 1) could also happen for unrelated stmts.
>
> ISTM that costing once with count N provides strictly more information
> to targets than costing N time with count 1.  Is there no way we can
> keep the current behaviour?  E.g. rather than costing a stmt immediately
> within a loop, could we just increment a counter and cost once at the end?

I suppose we can.  But isn't the logic currently (or before the series) cheated
for variable-strided stores with ncopies > 1?  That is, while it sounds like a
reasonable heuristic, you can't really rely on this, as the vectorizer doesn't
currently provide the info on whether two vector loads/stores are adjacent?

Making sure we only pass count > 1 for adjacent loads/stores would be possible,
though.  We should document this with comments.

Richard.

>
> Thanks,
> Richard
>
> > gcc/ChangeLog:
> >
> >       * config/aarch64/aarch64.cc (aarch64_stp_sequence_cost): Return
> >       count directly instead of the adjusted value computed with CEIL.
> > ---
> >  gcc/config/aarch64/aarch64.cc | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> > index 37d414021ca..9fb4fbd883d 100644
> > --- a/gcc/config/aarch64/aarch64.cc
> > +++ b/gcc/config/aarch64/aarch64.cc
> > @@ -17051,7 +17051,7 @@ aarch64_stp_sequence_cost (unsigned int count, vect_cost_for_stmt kind,
> >         if (!aarch64_aligned_constant_offset_p (stmt_info, size))
> >           return count * 2;
> >       }
> > -      return CEIL (count, 2) * 2;
> > +      return count;
> >
> >      case scalar_store:
> >        if (stmt_info && STMT_VINFO_DATA_REF (stmt_info))

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH/RFC 08/10] aarch64: Don't use CEIL for vector_store in aarch64_stp_sequence_cost
  2023-09-18  8:53     ` Richard Biener
@ 2023-09-20  2:40       ` Kewen.Lin
  0 siblings, 0 replies; 23+ messages in thread
From: Kewen.Lin @ 2023-09-20  2:40 UTC (permalink / raw)
  To: Richard Biener, richard.sandiford; +Cc: gcc-patches

Hi,

on 2023/9/18 16:53, Richard Biener wrote:
> On Mon, Sep 18, 2023 at 10:41 AM Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>>
>> Kewen Lin <linkw@linux.ibm.com> writes:
>>> This costing adjustment patch series exposes one issue in
>>> aarch64 specific costing adjustment for STP sequence.  It
>>> causes the below test cases to fail:
>>>
>>>   - gcc/testsuite/gcc.target/aarch64/ldp_stp_15.c
>>>   - gcc/testsuite/gcc.target/aarch64/ldp_stp_16.c
>>>   - gcc/testsuite/gcc.target/aarch64/ldp_stp_17.c
>>>   - gcc/testsuite/gcc.target/aarch64/ldp_stp_18.c
>>>
>>> Take the below function extracted from ldp_stp_15.c as
>>> example:
>>>
>>> void
>>> dup_8_int32_t (int32_t *x, int32_t val)
>>> {
>>>     for (int i = 0; i < 8; ++i)
>>>           x[i] = val;
>>> }
>>>
>>> Without my patch series, during slp1 it gets:
>>>
>>>   val_8(D) 2 times unaligned_store (misalign -1) costs 2 in body
>>>   node 0x10008c85e38 1 times scalar_to_vec costs 1 in prologue
>>>
>>> then the final vector cost is 3.
>>>
>>> With my patch series, during slp1 it gets:
>>>
>>>   val_8(D) 1 times unaligned_store (misalign -1) costs 1 in body
>>>   val_8(D) 1 times unaligned_store (misalign -1) costs 1 in body
>>>   node 0x10004cc5d88 1 times scalar_to_vec costs 1 in prologue
>>>
>>> but the final vector cost is 17.  The unaligned_store count is
>>> actually unchanged, but the final vector costs become different,
>>> it's because the below aarch64 special handling makes the
>>> different costs:
>>>
>>>   /* Apply the heuristic described above m_stp_sequence_cost.  */
>>>   if (m_stp_sequence_cost != ~0U)
>>>     {
>>>       uint64_t cost = aarch64_stp_sequence_cost (count, kind,
>>>                                                stmt_info, vectype);
>>>       m_stp_sequence_cost = MIN (m_stp_sequence_cost + cost, ~0U);
>>>     }
>>>
>>> For the former, since the count is 2, function
>>> aarch64_stp_sequence_cost returns 2 as "CEIL (count, 2) * 2".
>>> While for the latter, it's separated into twice calls with
>>> count 1, aarch64_stp_sequence_cost returns 2 for each time,
>>> so it returns 4 in total.
>>>
>>> For this case, the stmt with scalar_to_vec also contributes
>>> 4 to m_stp_sequence_cost, then the final m_stp_sequence_cost
>>> are 6 (2+4) vs. 8 (4+4).
>>>
>>> Considering scalar_costs->m_stp_sequence_cost is 8 and below
>>> checking and re-assigning:
>>>
>>>   else if (m_stp_sequence_cost >= scalar_costs->m_stp_sequence_cost)
>>>     m_costs[vect_body] = 2 * scalar_costs->total_cost ();
>>>
>>> For the former, the body cost of vector isn't changed; but
>>> for the latter, the body cost of vector is double of scalar
>>> cost which is 8 for this case, then it becomes 16 which is
>>> bigger than what we expect.
>>>
>>> I'm not sure why it adopts CEIL for the return value for
>>> case unaligned_store in function aarch64_stp_sequence_cost,
>>> but I tried to modify it with "return count;" (as it can
>>> get back to previous cost), there is no failures exposed
>>> in regression testing.  I expected that if the previous
>>> unaligned_store count is even, this adjustment doesn't
>>> change anything, if it's odd, the adjustment may reduce
>>> it by one, but I'd guess it would be few.  Besides, as
>>> the comments for m_stp_sequence_cost, the current
>>> handlings seems temporary, maybe a tweak like this can be
>>> accepted, so I posted this RFC/PATCH to request comments.
>>> this one line change is considered.
>>
>> It's unfortunate that doing this didn't show up a regression.
>> I guess it's not a change we explicitly added tests to guard against.
>>
>> But the point of the condition is to estimate how many single stores
>> (STRs) and how many paired stores (STPs) would be generated.  As far
>> as this heuristic goes, STP (storing two values) is as cheap as STR
>> (storing only one value).  So the point of the CEIL is to count 1 store
>> as having equal cost to 2, 3 as having equal cost to 4, etc.
>>

Thanks for the explanation and ...

>> For a heuristic like that, costing a vector stmt once with count 2
>> is different from costing 2 vector stmts with count 1.  The former
>> makes it obvious that the 2 vector stmts are associated with the
>> same scalar stmt, and are highly likely to be consecutive.  The latter
>> (costing 2 stmts with count 1) could also happen for unrelated stmts.
>>
>> ISTM that costing once with count N provides strictly more information
>> to targets than costing N time with count 1.  Is there no way we can
>> keep the current behaviour?  E.g. rather than costing a stmt immediately
>> within a loop, could we just increment a counter and cost once at the end?
> 
> I suppose we can.  But isn't the logic currently (or before the series) cheated
> for variable-strided stores with ncopies > 1?  That is, while it sounds like
> reasonable heuristics you can't really rely on this as the vectorizer doesn't
> currently provide the info whether two vector loads/stores are adjacent?
> 
> Making sure we only pass count > 1 for adjacent load/store would be possible
> though.  We should document this with comments though.

... the suggestion!

This sounds applicable to both load and store.  If the others in this series look
fine, I guess we can file a bug for the exposed test cases first, then fix it
with a follow-up patch for both load and store.

BR,
Kewen

> 
> Richard.
> 
>>
>> Thanks,
>> Richard
>>
>>> gcc/ChangeLog:
>>>
>>>       * config/aarch64/aarch64.cc (aarch64_stp_sequence_cost): Return
>>>       count directly instead of the adjusted value computed with CEIL.
>>> ---
>>>  gcc/config/aarch64/aarch64.cc | 2 +-
>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
>>> index 37d414021ca..9fb4fbd883d 100644
>>> --- a/gcc/config/aarch64/aarch64.cc
>>> +++ b/gcc/config/aarch64/aarch64.cc
>>> @@ -17051,7 +17051,7 @@ aarch64_stp_sequence_cost (unsigned int count, vect_cost_for_stmt kind,
>>>         if (!aarch64_aligned_constant_offset_p (stmt_info, size))
>>>           return count * 2;
>>>       }
>>> -      return CEIL (count, 2) * 2;
>>> +      return count;
>>>
>>>      case scalar_store:
>>>        if (stmt_info && STMT_VINFO_DATA_REF (stmt_info))

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 01/10] vect: Ensure vect store is supported for some VMAT_ELEMENTWISE case
  2023-09-14  3:11 ` [PATCH 01/10] vect: Ensure vect store is supported for some VMAT_ELEMENTWISE case Kewen Lin
@ 2023-09-27 11:22   ` Richard Biener
  0 siblings, 0 replies; 23+ messages in thread
From: Richard Biener @ 2023-09-27 11:22 UTC (permalink / raw)
  To: Kewen Lin; +Cc: gcc-patches, richard.sandiford

On Thu, Sep 14, 2023 at 5:12 AM Kewen Lin <linkw@linux.ibm.com> wrote:
>
> When making/testing patches to move costing next to the
> transform code for vectorizable_store, some ICEs got
> exposed when I further refined the costing handlings on
> VMAT_ELEMENTWISE.  The apparent cause is triggering the
> assertion in rs6000 specific function for costing
> rs6000_builtin_vectorization_cost:
>
>   if (TARGET_ALTIVEC)
>      /* Misaligned stores are not supported.  */
>      gcc_unreachable ();
>
> I used vect_get_store_cost instead of the original way by
> record_stmt_cost with scalar_store for costing, that is to
> use one unaligned_store instead, it matches what we use in
> transforming, it's a vector store as below:
>
>   else if (group_size >= const_nunits
>            && group_size % const_nunits == 0)
>     {
>        nstores = 1;
>        lnel = const_nunits;
>        ltype = vectype;
>        lvectype = vectype;
>     }
>
> So IMHO it's more consistent with vector store instead of
> scalar store, with the given compilation option
> -mno-allow-movmisalign, the misaligned vector store is
> unexpected to be used in vectorizer, but why it's still
> adopted?  In the current implementation of function
> get_group_load_store_type, we always set alignment support
> scheme as dr_unaligned_supported for VMAT_ELEMENTWISE, it
> is true if we always adopt scalar stores, but as the above
> code shows, we could use vector stores for some cases, so
> we should use the correct alignment support scheme for it.
>
> This patch is to ensure the vector store is supported by
> further checking with vect_supportable_dr_alignment.  The
> ICEs got exposed with patches moving costing next to the
> transform but they haven't been landed, the test coverage
> would be there once they get landed.  The affected test
> cases are:
>   - gcc.dg/vect/slp-45.c
>   - gcc.dg/vect/vect-alias-check-{10,11,12}.c
>
> btw, I tried to make some correctness test case, but I
> realized that -mno-allow-movmisalign is mainly for noting
> movmisalign optab and it doesn't guard for the actual hw
> vector memory access insns, so I failed to make it unless
> I also altered some conditions for them as it.

OK.

> gcc/ChangeLog:
>
>         * tree-vect-stmts.cc (vectorizable_store): Ensure the generated
>         vector store for some case of VMAT_ELEMENTWISE is supported.
> ---
>  gcc/tree-vect-stmts.cc | 16 ++++++++++++----
>  1 file changed, 12 insertions(+), 4 deletions(-)
>
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index cd7c1090d88..a5caaf0bca2 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -8558,10 +8558,18 @@ vectorizable_store (vec_info *vinfo,
>           else if (group_size >= const_nunits
>                    && group_size % const_nunits == 0)
>             {
> -             nstores = 1;
> -             lnel = const_nunits;
> -             ltype = vectype;
> -             lvectype = vectype;
> +             int mis_align = dr_misalignment (first_dr_info, vectype);
> +             dr_alignment_support dr_align
> +               = vect_supportable_dr_alignment (vinfo, dr_info, vectype,
> +                                                mis_align);
> +             if (dr_align == dr_aligned
> +                 || dr_align == dr_unaligned_supported)
> +               {
> +                 nstores = 1;
> +                 lnel = const_nunits;
> +                 ltype = vectype;
> +                 lvectype = vectype;
> +               }
>             }
>           ltype = build_aligned_type (ltype, TYPE_ALIGN (elem_type));
>           ncopies = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node);
> --
> 2.31.1
>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 02/10] vect: Move vect_model_store_cost next to the transform in vectorizable_store
  2023-09-14  3:11 ` [PATCH 02/10] vect: Move vect_model_store_cost next to the transform in vectorizable_store Kewen Lin
@ 2023-09-27 11:23   ` Richard Biener
  0 siblings, 0 replies; 23+ messages in thread
From: Richard Biener @ 2023-09-27 11:23 UTC (permalink / raw)
  To: Kewen Lin; +Cc: gcc-patches, richard.sandiford

On Thu, Sep 14, 2023 at 5:12 AM Kewen Lin <linkw@linux.ibm.com> wrote:
>
> This patch is an initial patch to move costing next to the
> transform, it still adopts vect_model_store_cost for costing
> but moves and duplicates it down according to the handlings
> of different vect_memory_access_types or some special
> handling need, hope it can make the subsequent patches easy
> to review.  This patch should not have any functional
> changes.

OK

> gcc/ChangeLog:
>
>         * tree-vect-stmts.cc (vectorizable_store): Move and duplicate the call
>         to vect_model_store_cost down to some different transform paths
>         according to the handlings of different vect_memory_access_types
>         or some special handling need.
> ---
>  gcc/tree-vect-stmts.cc | 79 ++++++++++++++++++++++++++++++++----------
>  1 file changed, 60 insertions(+), 19 deletions(-)
>
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index a5caaf0bca2..36f7c5b9f4b 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -8372,7 +8372,8 @@ vectorizable_store (vec_info *vinfo,
>         return false;
>      }
>
> -  if (!vec_stmt) /* transformation not required.  */
> +  bool costing_p = !vec_stmt;
> +  if (costing_p) /* transformation not required.  */
>      {
>        STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) = memory_access_type;
>
> @@ -8401,11 +8402,6 @@ vectorizable_store (vec_info *vinfo,
>                          "Vectorizing an unaligned access.\n");
>
>        STMT_VINFO_TYPE (stmt_info) = store_vec_info_type;
> -      vect_model_store_cost (vinfo, stmt_info, ncopies,
> -                            memory_access_type, &gs_info,
> -                            alignment_support_scheme,
> -                            misalignment, vls_type, slp_node, cost_vec);
> -      return true;
>      }
>    gcc_assert (memory_access_type == STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info));
>
> @@ -8415,12 +8411,27 @@ vectorizable_store (vec_info *vinfo,
>
>    if (memory_access_type == VMAT_GATHER_SCATTER && gs_info.decl)
>      {
> -      vect_build_scatter_store_calls (vinfo, stmt_info, gsi, vec_stmt,
> -                                     &gs_info, mask);
> +      if (costing_p)
> +       vect_model_store_cost (vinfo, stmt_info, ncopies, memory_access_type,
> +                              &gs_info, alignment_support_scheme, misalignment,
> +                              vls_type, slp_node, cost_vec);
> +      else
> +       vect_build_scatter_store_calls (vinfo, stmt_info, gsi, vec_stmt,
> +                                       &gs_info, mask);
>        return true;
>      }
>    else if (STMT_VINFO_SIMD_LANE_ACCESS_P (stmt_info) >= 3)
> -    return vectorizable_scan_store (vinfo, stmt_info, gsi, vec_stmt, ncopies);
> +    {
> +      gcc_assert (memory_access_type == VMAT_CONTIGUOUS);
> +      if (costing_p)
> +       {
> +         vect_model_store_cost (vinfo, stmt_info, ncopies, memory_access_type,
> +                                &gs_info, alignment_support_scheme,
> +                                misalignment, vls_type, slp_node, cost_vec);
> +         return true;
> +       }
> +      return vectorizable_scan_store (vinfo, stmt_info, gsi, vec_stmt, ncopies);
> +    }
>
>    if (grouped_store)
>      {
> @@ -8449,13 +8460,21 @@ vectorizable_store (vec_info *vinfo,
>    else
>      ref_type = reference_alias_ptr_type (DR_REF (first_dr_info->dr));
>
> -  if (dump_enabled_p ())
> -    dump_printf_loc (MSG_NOTE, vect_location,
> -                     "transform store. ncopies = %d\n", ncopies);
> +  if (!costing_p && dump_enabled_p ())
> +    dump_printf_loc (MSG_NOTE, vect_location, "transform store. ncopies = %d\n",
> +                    ncopies);
>
>    if (memory_access_type == VMAT_ELEMENTWISE
>        || memory_access_type == VMAT_STRIDED_SLP)
>      {
> +      if (costing_p)
> +       {
> +         vect_model_store_cost (vinfo, stmt_info, ncopies, memory_access_type,
> +                                &gs_info, alignment_support_scheme,
> +                                misalignment, vls_type, slp_node, cost_vec);
> +         return true;
> +       }
> +
>        gimple_stmt_iterator incr_gsi;
>        bool insert_after;
>        gimple *incr;
> @@ -8718,8 +8737,9 @@ vectorizable_store (vec_info *vinfo,
>    else if (memory_access_type == VMAT_GATHER_SCATTER)
>      {
>        aggr_type = elem_type;
> -      vect_get_strided_load_store_ops (stmt_info, loop_vinfo, gsi, &gs_info,
> -                                      &bump, &vec_offset, loop_lens);
> +      if (!costing_p)
> +       vect_get_strided_load_store_ops (stmt_info, loop_vinfo, gsi, &gs_info,
> +                                        &bump, &vec_offset, loop_lens);
>      }
>    else
>      {
> @@ -8731,7 +8751,7 @@ vectorizable_store (vec_info *vinfo,
>                                           memory_access_type, loop_lens);
>      }
>
> -  if (mask)
> +  if (mask && !costing_p)
>      LOOP_VINFO_HAS_MASK_STORE (loop_vinfo) = true;
>
>    /* In case the vectorization factor (VF) is bigger than the number
> @@ -8782,6 +8802,13 @@ vectorizable_store (vec_info *vinfo,
>    if (memory_access_type == VMAT_LOAD_STORE_LANES)
>      {
>        gcc_assert (!slp && grouped_store);
> +      if (costing_p)
> +       {
> +         vect_model_store_cost (vinfo, stmt_info, ncopies, memory_access_type,
> +                                &gs_info, alignment_support_scheme,
> +                                misalignment, vls_type, slp_node, cost_vec);
> +         return true;
> +       }
>        for (j = 0; j < ncopies; j++)
>         {
>           gimple *new_stmt;
> @@ -8927,6 +8954,13 @@ vectorizable_store (vec_info *vinfo,
>    if (memory_access_type == VMAT_GATHER_SCATTER)
>      {
>        gcc_assert (!slp && !grouped_store);
> +      if (costing_p)
> +       {
> +         vect_model_store_cost (vinfo, stmt_info, ncopies, memory_access_type,
> +                                &gs_info, alignment_support_scheme,
> +                                misalignment, vls_type, slp_node, cost_vec);
> +         return true;
> +       }
>        auto_vec<tree> vec_offsets;
>        for (j = 0; j < ncopies; j++)
>         {
> @@ -9091,7 +9125,7 @@ vectorizable_store (vec_info *vinfo,
>    for (j = 0; j < ncopies; j++)
>      {
>        gimple *new_stmt;
> -      if (j == 0)
> +      if (j == 0 && !costing_p)
>         {
>           if (slp)
>             {
> @@ -9158,7 +9192,7 @@ vectorizable_store (vec_info *vinfo,
>                                           offset, &dummy, gsi, &ptr_incr,
>                                           simd_lane_access_p, bump);
>         }
> -      else
> +      else if (!costing_p)
>         {
>           gcc_assert (!LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo));
>           /* DR_CHAIN is then used as an input to vect_permute_store_chain().
> @@ -9179,7 +9213,7 @@ vectorizable_store (vec_info *vinfo,
>         }
>
>        new_stmt = NULL;
> -      if (grouped_store)
> +      if (!costing_p && grouped_store)
>         /* Permute.  */
>         vect_permute_store_chain (vinfo, dr_chain, group_size, stmt_info, gsi,
>                                   &result_chain);
> @@ -9187,6 +9221,8 @@ vectorizable_store (vec_info *vinfo,
>        stmt_vec_info next_stmt_info = first_stmt_info;
>        for (i = 0; i < vec_num; i++)
>         {
> +         if (costing_p)
> +           continue;
>           unsigned misalign;
>           unsigned HOST_WIDE_INT align;
>
> @@ -9361,7 +9397,7 @@ vectorizable_store (vec_info *vinfo,
>           if (!next_stmt_info)
>             break;
>         }
> -      if (!slp)
> +      if (!slp && !costing_p)
>         {
>           if (j == 0)
>             *vec_stmt = new_stmt;
> @@ -9369,6 +9405,11 @@ vectorizable_store (vec_info *vinfo,
>         }
>      }
>
> +  if (costing_p)
> +    vect_model_store_cost (vinfo, stmt_info, ncopies, memory_access_type,
> +                          &gs_info, alignment_support_scheme, misalignment,
> +                          vls_type, slp_node, cost_vec);
> +
>    return true;
>  }
>
> --
> 2.31.1
>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 03/10] vect: Adjust vectorizable_store costing on VMAT_GATHER_SCATTER
  2023-09-14  3:11 ` [PATCH 03/10] vect: Adjust vectorizable_store costing on VMAT_GATHER_SCATTER Kewen Lin
@ 2023-09-27 11:24   ` Richard Biener
  0 siblings, 0 replies; 23+ messages in thread
From: Richard Biener @ 2023-09-27 11:24 UTC (permalink / raw)
  To: Kewen Lin; +Cc: gcc-patches, richard.sandiford

On Thu, Sep 14, 2023 at 5:12 AM Kewen Lin <linkw@linux.ibm.com> wrote:
>
> This patch adjusts the cost handling on VMAT_GATHER_SCATTER
> in function vectorizable_store (all three cases), then we
> won't depend on vect_model_load_store for its costing any
> more.  This patch shouldn't have any functional changes.

OK.

> gcc/ChangeLog:
>
>         * tree-vect-stmts.cc (vect_model_store_cost): Assert it won't get
>         VMAT_GATHER_SCATTER any more, remove VMAT_GATHER_SCATTER related
>         handlings and the related parameter gs_info.
>         (vect_build_scatter_store_calls): Add the handlings on costing with
>         one more argument cost_vec.
>         (vectorizable_store): Adjust the cost handling on VMAT_GATHER_SCATTER
>         without calling vect_model_store_cost any more.
> ---
>  gcc/tree-vect-stmts.cc | 188 ++++++++++++++++++++++++++---------------
>  1 file changed, 118 insertions(+), 70 deletions(-)
>
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index 36f7c5b9f4b..3f908242fee 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -959,12 +959,12 @@ cfun_returns (tree decl)
>  static void
>  vect_model_store_cost (vec_info *vinfo, stmt_vec_info stmt_info, int ncopies,
>                        vect_memory_access_type memory_access_type,
> -                      gather_scatter_info *gs_info,
>                        dr_alignment_support alignment_support_scheme,
>                        int misalignment,
>                        vec_load_store_type vls_type, slp_tree slp_node,
>                        stmt_vector_for_cost *cost_vec)
>  {
> +  gcc_assert (memory_access_type != VMAT_GATHER_SCATTER);
>    unsigned int inside_cost = 0, prologue_cost = 0;
>    stmt_vec_info first_stmt_info = stmt_info;
>    bool grouped_access_p = STMT_VINFO_GROUPED_ACCESS (stmt_info);
> @@ -1012,18 +1012,9 @@ vect_model_store_cost (vec_info *vinfo, stmt_vec_info stmt_info, int ncopies,
>
>    tree vectype = STMT_VINFO_VECTYPE (stmt_info);
>    /* Costs of the stores.  */
> -  if (memory_access_type == VMAT_ELEMENTWISE
> -      || memory_access_type == VMAT_GATHER_SCATTER)
> +  if (memory_access_type == VMAT_ELEMENTWISE)
>      {
>        unsigned int assumed_nunits = vect_nunits_for_cost (vectype);
> -      if (memory_access_type == VMAT_GATHER_SCATTER
> -         && gs_info->ifn == IFN_LAST && !gs_info->decl)
> -       /* For emulated scatter N offset vector element extracts
> -          (we assume the scalar scaling and ptr + offset add is consumed by
> -          the load).  */
> -       inside_cost += record_stmt_cost (cost_vec, ncopies * assumed_nunits,
> -                                        vec_to_scalar, stmt_info, 0,
> -                                        vect_body);
>        /* N scalar stores plus extracting the elements.  */
>        inside_cost += record_stmt_cost (cost_vec,
>                                        ncopies * assumed_nunits,
> @@ -1034,9 +1025,7 @@ vect_model_store_cost (vec_info *vinfo, stmt_vec_info stmt_info, int ncopies,
>                          misalignment, &inside_cost, cost_vec);
>
>    if (memory_access_type == VMAT_ELEMENTWISE
> -      || memory_access_type == VMAT_STRIDED_SLP
> -      || (memory_access_type == VMAT_GATHER_SCATTER
> -         && gs_info->ifn == IFN_LAST && !gs_info->decl))
> +      || memory_access_type == VMAT_STRIDED_SLP)
>      {
>        /* N scalar stores plus extracting the elements.  */
>        unsigned int assumed_nunits = vect_nunits_for_cost (vectype);
> @@ -2999,7 +2988,8 @@ vect_build_gather_load_calls (vec_info *vinfo, stmt_vec_info stmt_info,
>  static void
>  vect_build_scatter_store_calls (vec_info *vinfo, stmt_vec_info stmt_info,
>                                 gimple_stmt_iterator *gsi, gimple **vec_stmt,
> -                               gather_scatter_info *gs_info, tree mask)
> +                               gather_scatter_info *gs_info, tree mask,
> +                               stmt_vector_for_cost *cost_vec)
>  {
>    loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (vinfo);
>    tree vectype = STMT_VINFO_VECTYPE (stmt_info);
> @@ -3009,6 +2999,30 @@ vect_build_scatter_store_calls (vec_info *vinfo, stmt_vec_info stmt_info,
>    poly_uint64 scatter_off_nunits
>      = TYPE_VECTOR_SUBPARTS (gs_info->offset_vectype);
>
> +  /* FIXME: Keep the previous costing way in vect_model_store_cost by
> +     costing N scalar stores, but it should be tweaked to use target
> +     specific costs on related scatter store calls.  */
> +  if (cost_vec)
> +    {
> +      tree op = vect_get_store_rhs (stmt_info);
> +      enum vect_def_type dt;
> +      gcc_assert (vect_is_simple_use (op, vinfo, &dt));
> +      unsigned int inside_cost, prologue_cost = 0;
> +      if (dt == vect_constant_def || dt == vect_external_def)
> +       prologue_cost += record_stmt_cost (cost_vec, 1, scalar_to_vec,
> +                                          stmt_info, 0, vect_prologue);
> +      unsigned int assumed_nunits = vect_nunits_for_cost (vectype);
> +      inside_cost = record_stmt_cost (cost_vec, ncopies * assumed_nunits,
> +                                     scalar_store, stmt_info, 0, vect_body);
> +
> +      if (dump_enabled_p ())
> +       dump_printf_loc (MSG_NOTE, vect_location,
> +                        "vect_model_store_cost: inside_cost = %d, "
> +                        "prologue_cost = %d .\n",
> +                        inside_cost, prologue_cost);
> +      return;
> +    }
> +
>    tree perm_mask = NULL_TREE, mask_halfvectype = NULL_TREE;
>    if (known_eq (nunits, scatter_off_nunits))
>      modifier = NONE;
> @@ -8411,13 +8425,8 @@ vectorizable_store (vec_info *vinfo,
>
>    if (memory_access_type == VMAT_GATHER_SCATTER && gs_info.decl)
>      {
> -      if (costing_p)
> -       vect_model_store_cost (vinfo, stmt_info, ncopies, memory_access_type,
> -                              &gs_info, alignment_support_scheme, misalignment,
> -                              vls_type, slp_node, cost_vec);
> -      else
> -       vect_build_scatter_store_calls (vinfo, stmt_info, gsi, vec_stmt,
> -                                       &gs_info, mask);
> +      vect_build_scatter_store_calls (vinfo, stmt_info, gsi, vec_stmt, &gs_info,
> +                                     mask, cost_vec);
>        return true;
>      }
>    else if (STMT_VINFO_SIMD_LANE_ACCESS_P (stmt_info) >= 3)
> @@ -8426,8 +8435,8 @@ vectorizable_store (vec_info *vinfo,
>        if (costing_p)
>         {
>           vect_model_store_cost (vinfo, stmt_info, ncopies, memory_access_type,
> -                                &gs_info, alignment_support_scheme,
> -                                misalignment, vls_type, slp_node, cost_vec);
> +                                alignment_support_scheme, misalignment,
> +                                vls_type, slp_node, cost_vec);
>           return true;
>         }
>        return vectorizable_scan_store (vinfo, stmt_info, gsi, vec_stmt, ncopies);
> @@ -8470,8 +8479,8 @@ vectorizable_store (vec_info *vinfo,
>        if (costing_p)
>         {
>           vect_model_store_cost (vinfo, stmt_info, ncopies, memory_access_type,
> -                                &gs_info, alignment_support_scheme,
> -                                misalignment, vls_type, slp_node, cost_vec);
> +                                alignment_support_scheme, misalignment,
> +                                vls_type, slp_node, cost_vec);
>           return true;
>         }
>
> @@ -8805,8 +8814,8 @@ vectorizable_store (vec_info *vinfo,
>        if (costing_p)
>         {
>           vect_model_store_cost (vinfo, stmt_info, ncopies, memory_access_type,
> -                                &gs_info, alignment_support_scheme,
> -                                misalignment, vls_type, slp_node, cost_vec);
> +                                alignment_support_scheme, misalignment,
> +                                vls_type, slp_node, cost_vec);
>           return true;
>         }
>        for (j = 0; j < ncopies; j++)
> @@ -8954,49 +8963,50 @@ vectorizable_store (vec_info *vinfo,
>    if (memory_access_type == VMAT_GATHER_SCATTER)
>      {
>        gcc_assert (!slp && !grouped_store);
> -      if (costing_p)
> -       {
> -         vect_model_store_cost (vinfo, stmt_info, ncopies, memory_access_type,
> -                                &gs_info, alignment_support_scheme,
> -                                misalignment, vls_type, slp_node, cost_vec);
> -         return true;
> -       }
>        auto_vec<tree> vec_offsets;
> +      unsigned int inside_cost = 0, prologue_cost = 0;
>        for (j = 0; j < ncopies; j++)
>         {
>           gimple *new_stmt;
>           if (j == 0)
>             {
> -             /* Since the store is not grouped, DR_GROUP_SIZE is 1, and
> -                DR_CHAIN is of size 1.  */
> -             gcc_assert (group_size == 1);
> -             op = vect_get_store_rhs (first_stmt_info);
> -             vect_get_vec_defs_for_operand (vinfo, first_stmt_info, ncopies,
> -                                            op, gvec_oprnds[0]);
> -             vec_oprnd = (*gvec_oprnds[0])[0];
> -             dr_chain.quick_push (vec_oprnd);
> -             if (mask)
> +             if (costing_p && vls_type == VLS_STORE_INVARIANT)
> +               prologue_cost += record_stmt_cost (cost_vec, 1, scalar_to_vec,
> +                                                  stmt_info, 0, vect_prologue);
> +             else if (!costing_p)
>                 {
> -                 vect_get_vec_defs_for_operand (vinfo, stmt_info, ncopies,
> -                                                mask, &vec_masks,
> -                                                mask_vectype);
> -                 vec_mask = vec_masks[0];
> -               }
> +                 /* Since the store is not grouped, DR_GROUP_SIZE is 1, and
> +                    DR_CHAIN is of size 1.  */
> +                 gcc_assert (group_size == 1);
> +                 op = vect_get_store_rhs (first_stmt_info);
> +                 vect_get_vec_defs_for_operand (vinfo, first_stmt_info,
> +                                                ncopies, op, gvec_oprnds[0]);
> +                 vec_oprnd = (*gvec_oprnds[0])[0];
> +                 dr_chain.quick_push (vec_oprnd);
> +                 if (mask)
> +                   {
> +                     vect_get_vec_defs_for_operand (vinfo, stmt_info, ncopies,
> +                                                    mask, &vec_masks,
> +                                                    mask_vectype);
> +                     vec_mask = vec_masks[0];
> +                   }
>
> -             /* We should have catched mismatched types earlier.  */
> -             gcc_assert (useless_type_conversion_p (vectype,
> -                                                    TREE_TYPE (vec_oprnd)));
> -             if (STMT_VINFO_GATHER_SCATTER_P (stmt_info))
> -               vect_get_gather_scatter_ops (loop_vinfo, loop, stmt_info,
> -                                            slp_node, &gs_info, &dataref_ptr,
> -                                            &vec_offsets);
> -             else
> -               dataref_ptr
> -                 = vect_create_data_ref_ptr (vinfo, first_stmt_info, aggr_type,
> -                                             NULL, offset, &dummy, gsi,
> -                                             &ptr_incr, false, bump);
> +                 /* We should have catched mismatched types earlier.  */
> +                 gcc_assert (
> +                   useless_type_conversion_p (vectype, TREE_TYPE (vec_oprnd)));
> +                 if (STMT_VINFO_GATHER_SCATTER_P (stmt_info))
> +                   vect_get_gather_scatter_ops (loop_vinfo, loop, stmt_info,
> +                                                slp_node, &gs_info,
> +                                                &dataref_ptr, &vec_offsets);
> +                 else
> +                   dataref_ptr
> +                     = vect_create_data_ref_ptr (vinfo, first_stmt_info,
> +                                                 aggr_type, NULL, offset,
> +                                                 &dummy, gsi, &ptr_incr, false,
> +                                                 bump);
> +               }
>             }
> -         else
> +         else if (!costing_p)
>             {
>               gcc_assert (!LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo));
>               vec_oprnd = (*gvec_oprnds[0])[j];
> @@ -9013,15 +9023,27 @@ vectorizable_store (vec_info *vinfo,
>           tree final_mask = NULL_TREE;
>           tree final_len = NULL_TREE;
>           tree bias = NULL_TREE;
> -         if (loop_masks)
> -           final_mask = vect_get_loop_mask (loop_vinfo, gsi, loop_masks,
> -                                            ncopies, vectype, j);
> -         if (vec_mask)
> -           final_mask = prepare_vec_mask (loop_vinfo, mask_vectype, final_mask,
> -                                          vec_mask, gsi);
> +         if (!costing_p)
> +           {
> +             if (loop_masks)
> +               final_mask = vect_get_loop_mask (loop_vinfo, gsi, loop_masks,
> +                                                ncopies, vectype, j);
> +             if (vec_mask)
> +               final_mask = prepare_vec_mask (loop_vinfo, mask_vectype,
> +                                              final_mask, vec_mask, gsi);
> +           }
>
>           if (gs_info.ifn != IFN_LAST)
>             {
> +             if (costing_p)
> +               {
> +                 unsigned int cnunits = vect_nunits_for_cost (vectype);
> +                 inside_cost
> +                   += record_stmt_cost (cost_vec, cnunits, scalar_store,
> +                                        stmt_info, 0, vect_body);
> +                 continue;
> +               }
> +
>               if (STMT_VINFO_GATHER_SCATTER_P (stmt_info))
>                 vec_offset = vec_offsets[j];
>               tree scale = size_int (gs_info.scale);
> @@ -9067,6 +9089,25 @@ vectorizable_store (vec_info *vinfo,
>             {
>               /* Emulated scatter.  */
>               gcc_assert (!final_mask);
> +             if (costing_p)
> +               {
> +                 unsigned int cnunits = vect_nunits_for_cost (vectype);
> +                 /* For emulated scatter N offset vector element extracts
> +                    (we assume the scalar scaling and ptr + offset add is
> +                    consumed by the load).  */
> +                 inside_cost
> +                   += record_stmt_cost (cost_vec, cnunits, vec_to_scalar,
> +                                        stmt_info, 0, vect_body);
> +                 /* N scalar stores plus extracting the elements.  */
> +                 inside_cost
> +                   += record_stmt_cost (cost_vec, cnunits, vec_to_scalar,
> +                                        stmt_info, 0, vect_body);
> +                 inside_cost
> +                   += record_stmt_cost (cost_vec, cnunits, scalar_store,
> +                                        stmt_info, 0, vect_body);
> +                 continue;
> +               }
> +
>               unsigned HOST_WIDE_INT const_nunits = nunits.to_constant ();
>               unsigned HOST_WIDE_INT const_offset_nunits
>                 = TYPE_VECTOR_SUBPARTS (gs_info.offset_vectype).to_constant ();
> @@ -9117,6 +9158,13 @@ vectorizable_store (vec_info *vinfo,
>             *vec_stmt = new_stmt;
>           STMT_VINFO_VEC_STMTS (stmt_info).safe_push (new_stmt);
>         }
> +
> +      if (costing_p && dump_enabled_p ())
> +       dump_printf_loc (MSG_NOTE, vect_location,
> +                        "vect_model_store_cost: inside_cost = %d, "
> +                        "prologue_cost = %d .\n",
> +                        inside_cost, prologue_cost);
> +
>        return true;
>      }
>
> @@ -9407,8 +9455,8 @@ vectorizable_store (vec_info *vinfo,
>
>    if (costing_p)
>      vect_model_store_cost (vinfo, stmt_info, ncopies, memory_access_type,
> -                          &gs_info, alignment_support_scheme, misalignment,
> -                          vls_type, slp_node, cost_vec);
> +                          alignment_support_scheme, misalignment, vls_type,
> +                          slp_node, cost_vec);
>
>    return true;
>  }
> --
> 2.31.1
>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 04/10] vect: Simplify costing on vectorizable_scan_store
  2023-09-14  3:11 ` [PATCH 04/10] vect: Simplify costing on vectorizable_scan_store Kewen Lin
@ 2023-09-27 11:25   ` Richard Biener
  0 siblings, 0 replies; 23+ messages in thread
From: Richard Biener @ 2023-09-27 11:25 UTC (permalink / raw)
  To: Kewen Lin; +Cc: gcc-patches, richard.sandiford

On Thu, Sep 14, 2023 at 5:12 AM Kewen Lin <linkw@linux.ibm.com> wrote:
>
> This patch simplifies the costing for the vectorizable_scan_store
> case so that it no longer calls function vect_model_store_cost.
>
> I considered whether moving the costing into function
> vectorizable_scan_store would be a good idea, but doing that
> requires passing down several variables which are only used for
> costing.  For now we just want to keep the costing as before and
> have not tried to make it consistent with what the transform does,
> so I think we can leave it as is.

OK

> gcc/ChangeLog:
>
>         * tree-vect-stmts.cc (vectorizable_store): Adjust costing on
>         vectorizable_scan_store without calling vect_model_store_cost
>         any more.
> ---
>  gcc/tree-vect-stmts.cc | 18 +++++++++++++++---
>  1 file changed, 15 insertions(+), 3 deletions(-)
>
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index 3f908242fee..048c14d291c 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -8432,11 +8432,23 @@ vectorizable_store (vec_info *vinfo,
>    else if (STMT_VINFO_SIMD_LANE_ACCESS_P (stmt_info) >= 3)
>      {
>        gcc_assert (memory_access_type == VMAT_CONTIGUOUS);
> +      gcc_assert (!slp);
>        if (costing_p)
>         {
> -         vect_model_store_cost (vinfo, stmt_info, ncopies, memory_access_type,
> -                                alignment_support_scheme, misalignment,
> -                                vls_type, slp_node, cost_vec);
> +         unsigned int inside_cost = 0, prologue_cost = 0;
> +         if (vls_type == VLS_STORE_INVARIANT)
> +           prologue_cost += record_stmt_cost (cost_vec, 1, scalar_to_vec,
> +                                              stmt_info, 0, vect_prologue);
> +         vect_get_store_cost (vinfo, stmt_info, ncopies,
> +                              alignment_support_scheme, misalignment,
> +                              &inside_cost, cost_vec);
> +
> +         if (dump_enabled_p ())
> +           dump_printf_loc (MSG_NOTE, vect_location,
> +                            "vect_model_store_cost: inside_cost = %d, "
> +                            "prologue_cost = %d .\n",
> +                            inside_cost, prologue_cost);
> +
>           return true;
>         }
>        return vectorizable_scan_store (vinfo, stmt_info, gsi, vec_stmt, ncopies);
> --
> 2.31.1
>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 05/10] vect: Adjust vectorizable_store costing on VMAT_ELEMENTWISE and VMAT_STRIDED_SLP
  2023-09-14  3:11 ` [PATCH 05/10] vect: Adjust vectorizable_store costing on VMAT_ELEMENTWISE and VMAT_STRIDED_SLP Kewen Lin
@ 2023-09-27 11:26   ` Richard Biener
  0 siblings, 0 replies; 23+ messages in thread
From: Richard Biener @ 2023-09-27 11:26 UTC (permalink / raw)
  To: Kewen Lin; +Cc: gcc-patches, richard.sandiford

On Thu, Sep 14, 2023 at 5:12 AM Kewen Lin <linkw@linux.ibm.com> wrote:
>
> This patch adjusts the cost handling for VMAT_ELEMENTWISE
> and VMAT_STRIDED_SLP in function vectorizable_store.  We
> no longer call function vect_model_store_cost for them.
>
> Like the improvement made for PR82255 on the load side, this
> change gets rid of unnecessary vec_to_scalar costing for some
> cases with VMAT_STRIDED_SLP.  One typical test case,
> gcc.dg/vect/costmodel/ppc/costmodel-vect-store-1.c, is added
> for it.  It also helps some cases that had inconsistent
> costing.
>
> Besides, this also special-cases the interleaving stores
> for these two affected memory access types: for interleaving
> stores the whole chain is vectorized when the last store in
> the chain is reached, and the other stores in the group are
> skipped.  To stay consistent with this and to follow the
> transform code, which iterates over the whole group, the
> costing is done only for the first store in the group.
> Ideally we would cost only the last one, but that is not
> trivial, and using the first one is actually equivalent.

OK
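
To make the interleaving-store point above concrete, here is a minimal,
purely illustrative loop (not taken from the patch or the testsuite) in
which the two stores form one DR group of size two; the exact memory
access type chosen depends on the target, but in all cases the transform
emits the whole group when it reaches the last store, and with this
patch the whole group is likewise costed once, at the first store:

void
store_pair (int *restrict a, const int *restrict b, int n)
{
  for (int i = 0; i < n; i++)
    {
      a[2 * i] = b[i];     /* first store in the group: carries the cost */
      a[2 * i + 1] = b[i]; /* last store: triggers the group transform */
    }
}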

> gcc/ChangeLog:
>
>         * tree-vect-stmts.cc (vect_model_store_cost): Assert it won't get
>         VMAT_ELEMENTWISE and VMAT_STRIDED_SLP any more, and remove their
>         related handlings.
>         (vectorizable_store): Adjust the cost handling on VMAT_ELEMENTWISE
>         and VMAT_STRIDED_SLP without calling vect_model_store_cost.
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.dg/vect/costmodel/ppc/costmodel-vect-store-1.c: New test.
> ---
>  .../costmodel/ppc/costmodel-vect-store-1.c    |  23 +++
>  gcc/tree-vect-stmts.cc                        | 160 +++++++++++-------
>  2 files changed, 120 insertions(+), 63 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/ppc/costmodel-vect-store-1.c
>
> diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/ppc/costmodel-vect-store-1.c b/gcc/testsuite/gcc.dg/vect/costmodel/ppc/costmodel-vect-store-1.c
> new file mode 100644
> index 00000000000..ab5f3301492
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/costmodel/ppc/costmodel-vect-store-1.c
> @@ -0,0 +1,23 @@
> +/* { dg-do compile } */
> +/* { dg-require-effective-target vect_int } */
> +/* { dg-additional-options "-O3" }
> +
> +/* This test case is partially extracted from case
> +   gcc.dg/vect/vect-avg-16.c, it's to verify we don't
> +   cost a store with vec_to_scalar when we shouldn't.  */
> +
> +void
> +test (signed char *restrict a, signed char *restrict b, signed char *restrict c,
> +      int n)
> +{
> +  for (int j = 0; j < n; ++j)
> +    {
> +      for (int i = 0; i < 16; ++i)
> +       a[i] = (b[i] + c[i]) >> 1;
> +      a += 20;
> +      b += 20;
> +      c += 20;
> +    }
> +}
> +
> +/* { dg-final { scan-tree-dump-times "vec_to_scalar" 0 "vect" } } */
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index 048c14d291c..3d01168080a 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -964,7 +964,9 @@ vect_model_store_cost (vec_info *vinfo, stmt_vec_info stmt_info, int ncopies,
>                        vec_load_store_type vls_type, slp_tree slp_node,
>                        stmt_vector_for_cost *cost_vec)
>  {
> -  gcc_assert (memory_access_type != VMAT_GATHER_SCATTER);
> +  gcc_assert (memory_access_type != VMAT_GATHER_SCATTER
> +             && memory_access_type != VMAT_ELEMENTWISE
> +             && memory_access_type != VMAT_STRIDED_SLP);
>    unsigned int inside_cost = 0, prologue_cost = 0;
>    stmt_vec_info first_stmt_info = stmt_info;
>    bool grouped_access_p = STMT_VINFO_GROUPED_ACCESS (stmt_info);
> @@ -1010,29 +1012,9 @@ vect_model_store_cost (vec_info *vinfo, stmt_vec_info stmt_info, int ncopies,
>                           group_size);
>      }
>
> -  tree vectype = STMT_VINFO_VECTYPE (stmt_info);
>    /* Costs of the stores.  */
> -  if (memory_access_type == VMAT_ELEMENTWISE)
> -    {
> -      unsigned int assumed_nunits = vect_nunits_for_cost (vectype);
> -      /* N scalar stores plus extracting the elements.  */
> -      inside_cost += record_stmt_cost (cost_vec,
> -                                      ncopies * assumed_nunits,
> -                                      scalar_store, stmt_info, 0, vect_body);
> -    }
> -  else
> -    vect_get_store_cost (vinfo, stmt_info, ncopies, alignment_support_scheme,
> -                        misalignment, &inside_cost, cost_vec);
> -
> -  if (memory_access_type == VMAT_ELEMENTWISE
> -      || memory_access_type == VMAT_STRIDED_SLP)
> -    {
> -      /* N scalar stores plus extracting the elements.  */
> -      unsigned int assumed_nunits = vect_nunits_for_cost (vectype);
> -      inside_cost += record_stmt_cost (cost_vec,
> -                                      ncopies * assumed_nunits,
> -                                      vec_to_scalar, stmt_info, 0, vect_body);
> -    }
> +  vect_get_store_cost (vinfo, stmt_info, ncopies, alignment_support_scheme,
> +                      misalignment, &inside_cost, cost_vec);
>
>    /* When vectorizing a store into the function result assign
>       a penalty if the function returns in a multi-register location.
> @@ -8416,6 +8398,18 @@ vectorizable_store (vec_info *vinfo,
>                          "Vectorizing an unaligned access.\n");
>
>        STMT_VINFO_TYPE (stmt_info) = store_vec_info_type;
> +
> +      /* As function vect_transform_stmt shows, for interleaving stores
> +        the whole chain is vectorized when the last store in the chain
> +        is reached, the other stores in the group are skipped.  So we
> +        want to only cost the last one here, but it's not trivial to
> +        get the last, as it's equivalent to use the first one for
> +        costing, use the first one instead.  */
> +      if (grouped_store
> +         && !slp
> +         && first_stmt_info != stmt_info
> +         && memory_access_type == VMAT_ELEMENTWISE)
> +       return true;
>      }
>    gcc_assert (memory_access_type == STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info));
>
> @@ -8488,14 +8482,7 @@ vectorizable_store (vec_info *vinfo,
>    if (memory_access_type == VMAT_ELEMENTWISE
>        || memory_access_type == VMAT_STRIDED_SLP)
>      {
> -      if (costing_p)
> -       {
> -         vect_model_store_cost (vinfo, stmt_info, ncopies, memory_access_type,
> -                                alignment_support_scheme, misalignment,
> -                                vls_type, slp_node, cost_vec);
> -         return true;
> -       }
> -
> +      unsigned inside_cost = 0, prologue_cost = 0;
>        gimple_stmt_iterator incr_gsi;
>        bool insert_after;
>        gimple *incr;
> @@ -8503,7 +8490,7 @@ vectorizable_store (vec_info *vinfo,
>        tree ivstep;
>        tree running_off;
>        tree stride_base, stride_step, alias_off;
> -      tree vec_oprnd;
> +      tree vec_oprnd = NULL_TREE;
>        tree dr_offset;
>        unsigned int g;
>        /* Checked by get_load_store_type.  */
> @@ -8609,26 +8596,30 @@ vectorizable_store (vec_info *vinfo,
>                   lnel = const_nunits;
>                   ltype = vectype;
>                   lvectype = vectype;
> +                 alignment_support_scheme = dr_align;
> +                 misalignment = mis_align;
>                 }
>             }
>           ltype = build_aligned_type (ltype, TYPE_ALIGN (elem_type));
>           ncopies = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node);
>         }
>
> -      ivstep = stride_step;
> -      ivstep = fold_build2 (MULT_EXPR, TREE_TYPE (ivstep), ivstep,
> -                           build_int_cst (TREE_TYPE (ivstep), vf));
> +      if (!costing_p)
> +       {
> +         ivstep = stride_step;
> +         ivstep = fold_build2 (MULT_EXPR, TREE_TYPE (ivstep), ivstep,
> +                               build_int_cst (TREE_TYPE (ivstep), vf));
>
> -      standard_iv_increment_position (loop, &incr_gsi, &insert_after);
> +         standard_iv_increment_position (loop, &incr_gsi, &insert_after);
>
> -      stride_base = cse_and_gimplify_to_preheader (loop_vinfo, stride_base);
> -      ivstep = cse_and_gimplify_to_preheader (loop_vinfo, ivstep);
> -      create_iv (stride_base, PLUS_EXPR, ivstep, NULL,
> -                loop, &incr_gsi, insert_after,
> -                &offvar, NULL);
> -      incr = gsi_stmt (incr_gsi);
> +         stride_base = cse_and_gimplify_to_preheader (loop_vinfo, stride_base);
> +         ivstep = cse_and_gimplify_to_preheader (loop_vinfo, ivstep);
> +         create_iv (stride_base, PLUS_EXPR, ivstep, NULL, loop, &incr_gsi,
> +                    insert_after, &offvar, NULL);
> +         incr = gsi_stmt (incr_gsi);
>
> -      stride_step = cse_and_gimplify_to_preheader (loop_vinfo, stride_step);
> +         stride_step = cse_and_gimplify_to_preheader (loop_vinfo, stride_step);
> +       }
>
>        alias_off = build_int_cst (ref_type, 0);
>        stmt_vec_info next_stmt_info = first_stmt_info;
> @@ -8636,39 +8627,76 @@ vectorizable_store (vec_info *vinfo,
>        for (g = 0; g < group_size; g++)
>         {
>           running_off = offvar;
> -         if (g)
> +         if (!costing_p)
>             {
> -             tree size = TYPE_SIZE_UNIT (ltype);
> -             tree pos = fold_build2 (MULT_EXPR, sizetype, size_int (g),
> -                                     size);
> -             tree newoff = copy_ssa_name (running_off, NULL);
> -             incr = gimple_build_assign (newoff, POINTER_PLUS_EXPR,
> -                                         running_off, pos);
> -             vect_finish_stmt_generation (vinfo, stmt_info, incr, gsi);
> -             running_off = newoff;
> +             if (g)
> +               {
> +                 tree size = TYPE_SIZE_UNIT (ltype);
> +                 tree pos
> +                   = fold_build2 (MULT_EXPR, sizetype, size_int (g), size);
> +                 tree newoff = copy_ssa_name (running_off, NULL);
> +                 incr = gimple_build_assign (newoff, POINTER_PLUS_EXPR,
> +                                             running_off, pos);
> +                 vect_finish_stmt_generation (vinfo, stmt_info, incr, gsi);
> +                 running_off = newoff;
> +               }
>             }
>           if (!slp)
>             op = vect_get_store_rhs (next_stmt_info);
> -         vect_get_vec_defs (vinfo, next_stmt_info, slp_node, ncopies,
> -                            op, &vec_oprnds);
> +         if (!costing_p)
> +           vect_get_vec_defs (vinfo, next_stmt_info, slp_node, ncopies, op,
> +                              &vec_oprnds);
> +         else if (!slp)
> +           {
> +             enum vect_def_type cdt;
> +             gcc_assert (vect_is_simple_use (op, vinfo, &cdt));
> +             if (cdt == vect_constant_def || cdt == vect_external_def)
> +               prologue_cost += record_stmt_cost (cost_vec, 1, scalar_to_vec,
> +                                                  stmt_info, 0, vect_prologue);
> +           }
>           unsigned int group_el = 0;
>           unsigned HOST_WIDE_INT
>             elsz = tree_to_uhwi (TYPE_SIZE_UNIT (TREE_TYPE (vectype)));
>           for (j = 0; j < ncopies; j++)
>             {
> -             vec_oprnd = vec_oprnds[j];
> -             /* Pun the vector to extract from if necessary.  */
> -             if (lvectype != vectype)
> +             if (!costing_p)
>                 {
> -                 tree tem = make_ssa_name (lvectype);
> -                 gimple *pun
> -                   = gimple_build_assign (tem, build1 (VIEW_CONVERT_EXPR,
> -                                                       lvectype, vec_oprnd));
> -                 vect_finish_stmt_generation (vinfo, stmt_info, pun, gsi);
> -                 vec_oprnd = tem;
> +                 vec_oprnd = vec_oprnds[j];
> +                 /* Pun the vector to extract from if necessary.  */
> +                 if (lvectype != vectype)
> +                   {
> +                     tree tem = make_ssa_name (lvectype);
> +                     tree cvt
> +                       = build1 (VIEW_CONVERT_EXPR, lvectype, vec_oprnd);
> +                     gimple *pun = gimple_build_assign (tem, cvt);
> +                     vect_finish_stmt_generation (vinfo, stmt_info, pun, gsi);
> +                     vec_oprnd = tem;
> +                   }
>                 }
>               for (i = 0; i < nstores; i++)
>                 {
> +                 if (costing_p)
> +                   {
> +                     /* Only need vector extracting when there are more
> +                        than one stores.  */
> +                     if (nstores > 1)
> +                       inside_cost
> +                         += record_stmt_cost (cost_vec, 1, vec_to_scalar,
> +                                              stmt_info, 0, vect_body);
> +                     /* Take a single lane vector type store as scalar
> +                        store to avoid ICE like 110776.  */
> +                     if (VECTOR_TYPE_P (ltype)
> +                         && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U))
> +                       vect_get_store_cost (vinfo, stmt_info, 1,
> +                                            alignment_support_scheme,
> +                                            misalignment, &inside_cost,
> +                                            cost_vec);
> +                     else
> +                       inside_cost
> +                         += record_stmt_cost (cost_vec, 1, scalar_store,
> +                                              stmt_info, 0, vect_body);
> +                     continue;
> +                   }
>                   tree newref, newoff;
>                   gimple *incr, *assign;
>                   tree size = TYPE_SIZE (ltype);
> @@ -8719,6 +8747,12 @@ vectorizable_store (vec_info *vinfo,
>             break;
>         }
>
> +      if (costing_p && dump_enabled_p ())
> +       dump_printf_loc (MSG_NOTE, vect_location,
> +                        "vect_model_store_cost: inside_cost = %d, "
> +                        "prologue_cost = %d .\n",
> +                        inside_cost, prologue_cost);
> +
>        return true;
>      }
>
> --
> 2.31.1
>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 06/10] vect: Adjust vectorizable_store costing on VMAT_LOAD_STORE_LANES
  2023-09-14  3:11 ` [PATCH 06/10] vect: Adjust vectorizable_store costing on VMAT_LOAD_STORE_LANES Kewen Lin
@ 2023-09-27 11:27   ` Richard Biener
  0 siblings, 0 replies; 23+ messages in thread
From: Richard Biener @ 2023-09-27 11:27 UTC (permalink / raw)
  To: Kewen Lin; +Cc: gcc-patches, richard.sandiford

On Thu, Sep 14, 2023 at 5:12 AM Kewen Lin <linkw@linux.ibm.com> wrote:
>
> This patch adjusts the cost handling for VMAT_LOAD_STORE_LANES
> in function vectorizable_store.  We no longer call function
> vect_model_store_cost for it.  Since this is the interleaving
> store case, costing skips all stmts except first_stmt_info and
> considers the whole group when costing first_stmt_info.  This
> patch shouldn't introduce any functional changes.

OK.
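
As a side note on the new update_prologue_cost lambda: the prologue cost
it records corresponds to splatting an invariant stored value once before
the loop.  A minimal, purely illustrative example (not from the patch)
where that applies is:

void
splat_store (int *restrict a, int c, int n)
{
  for (int i = 0; i < n; i++)
    a[i] = c;  /* c is loop invariant, so one scalar_to_vec (splat) is
                  costed in the prologue; the vector stores themselves
                  are costed in the loop body.  */
}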

> gcc/ChangeLog:
>
>         * tree-vect-stmts.cc (vect_model_store_cost): Assert it will never
>         get VMAT_LOAD_STORE_LANES.
>         (vectorizable_store): Adjust the cost handling on VMAT_LOAD_STORE_LANES
>         without calling vect_model_store_cost.  Factor out new lambda function
>         update_prologue_cost.
> ---
>  gcc/tree-vect-stmts.cc | 110 ++++++++++++++++++++++++++++-------------
>  1 file changed, 75 insertions(+), 35 deletions(-)
>
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index 3d01168080a..fbd16b8a487 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -966,7 +966,8 @@ vect_model_store_cost (vec_info *vinfo, stmt_vec_info stmt_info, int ncopies,
>  {
>    gcc_assert (memory_access_type != VMAT_GATHER_SCATTER
>               && memory_access_type != VMAT_ELEMENTWISE
> -             && memory_access_type != VMAT_STRIDED_SLP);
> +             && memory_access_type != VMAT_STRIDED_SLP
> +             && memory_access_type != VMAT_LOAD_STORE_LANES);
>    unsigned int inside_cost = 0, prologue_cost = 0;
>    stmt_vec_info first_stmt_info = stmt_info;
>    bool grouped_access_p = STMT_VINFO_GROUPED_ACCESS (stmt_info);
> @@ -8408,7 +8409,8 @@ vectorizable_store (vec_info *vinfo,
>        if (grouped_store
>           && !slp
>           && first_stmt_info != stmt_info
> -         && memory_access_type == VMAT_ELEMENTWISE)
> +         && (memory_access_type == VMAT_ELEMENTWISE
> +             || memory_access_type == VMAT_LOAD_STORE_LANES))
>         return true;
>      }
>    gcc_assert (memory_access_type == STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info));
> @@ -8479,6 +8481,31 @@ vectorizable_store (vec_info *vinfo,
>      dump_printf_loc (MSG_NOTE, vect_location, "transform store. ncopies = %d\n",
>                      ncopies);
>
> +  /* Check if we need to update prologue cost for invariant,
> +     and update it accordingly if so.  If it's not for
> +     interleaving store, we can just check vls_type; but if
> +     it's for interleaving store, need to check the def_type
> +     of the stored value since the current vls_type is just
> +     for first_stmt_info.  */
> +  auto update_prologue_cost = [&](unsigned *prologue_cost, tree store_rhs)
> +  {
> +    gcc_assert (costing_p);
> +    if (slp)
> +      return;
> +    if (grouped_store)
> +      {
> +       gcc_assert (store_rhs);
> +       enum vect_def_type cdt;
> +       gcc_assert (vect_is_simple_use (store_rhs, vinfo, &cdt));
> +       if (cdt != vect_constant_def && cdt != vect_external_def)
> +         return;
> +      }
> +    else if (vls_type != VLS_STORE_INVARIANT)
> +      return;
> +    *prologue_cost += record_stmt_cost (cost_vec, 1, scalar_to_vec, stmt_info,
> +                                       0, vect_prologue);
> +  };
> +
>    if (memory_access_type == VMAT_ELEMENTWISE
>        || memory_access_type == VMAT_STRIDED_SLP)
>      {
> @@ -8646,14 +8673,8 @@ vectorizable_store (vec_info *vinfo,
>           if (!costing_p)
>             vect_get_vec_defs (vinfo, next_stmt_info, slp_node, ncopies, op,
>                                &vec_oprnds);
> -         else if (!slp)
> -           {
> -             enum vect_def_type cdt;
> -             gcc_assert (vect_is_simple_use (op, vinfo, &cdt));
> -             if (cdt == vect_constant_def || cdt == vect_external_def)
> -               prologue_cost += record_stmt_cost (cost_vec, 1, scalar_to_vec,
> -                                                  stmt_info, 0, vect_prologue);
> -           }
> +         else
> +           update_prologue_cost (&prologue_cost, op);
>           unsigned int group_el = 0;
>           unsigned HOST_WIDE_INT
>             elsz = tree_to_uhwi (TYPE_SIZE_UNIT (TREE_TYPE (vectype)));
> @@ -8857,13 +8878,7 @@ vectorizable_store (vec_info *vinfo,
>    if (memory_access_type == VMAT_LOAD_STORE_LANES)
>      {
>        gcc_assert (!slp && grouped_store);
> -      if (costing_p)
> -       {
> -         vect_model_store_cost (vinfo, stmt_info, ncopies, memory_access_type,
> -                                alignment_support_scheme, misalignment,
> -                                vls_type, slp_node, cost_vec);
> -         return true;
> -       }
> +      unsigned inside_cost = 0, prologue_cost = 0;
>        for (j = 0; j < ncopies; j++)
>         {
>           gimple *new_stmt;
> @@ -8879,29 +8894,39 @@ vectorizable_store (vec_info *vinfo,
>                      DR_GROUP_SIZE is the exact number of stmts in the
>                      chain. Therefore, NEXT_STMT_INFO can't be NULL_TREE.  */
>                   op = vect_get_store_rhs (next_stmt_info);
> -                 vect_get_vec_defs_for_operand (vinfo, next_stmt_info, ncopies,
> -                                                op, gvec_oprnds[i]);
> -                 vec_oprnd = (*gvec_oprnds[i])[0];
> -                 dr_chain.quick_push (vec_oprnd);
> +                 if (costing_p)
> +                   update_prologue_cost (&prologue_cost, op);
> +                 else
> +                   {
> +                     vect_get_vec_defs_for_operand (vinfo, next_stmt_info,
> +                                                    ncopies, op,
> +                                                    gvec_oprnds[i]);
> +                     vec_oprnd = (*gvec_oprnds[i])[0];
> +                     dr_chain.quick_push (vec_oprnd);
> +                   }
>                   next_stmt_info = DR_GROUP_NEXT_ELEMENT (next_stmt_info);
>                 }
> -             if (mask)
> +
> +             if (!costing_p)
>                 {
> -                 vect_get_vec_defs_for_operand (vinfo, stmt_info, ncopies,
> -                                                mask, &vec_masks,
> -                                                mask_vectype);
> -                 vec_mask = vec_masks[0];
> -               }
> +                 if (mask)
> +                   {
> +                     vect_get_vec_defs_for_operand (vinfo, stmt_info, ncopies,
> +                                                    mask, &vec_masks,
> +                                                    mask_vectype);
> +                     vec_mask = vec_masks[0];
> +                   }
>
> -             /* We should have catched mismatched types earlier.  */
> -             gcc_assert (
> -               useless_type_conversion_p (vectype, TREE_TYPE (vec_oprnd)));
> -             dataref_ptr
> -               = vect_create_data_ref_ptr (vinfo, first_stmt_info, aggr_type,
> -                                           NULL, offset, &dummy, gsi,
> -                                           &ptr_incr, false, bump);
> +                 /* We should have catched mismatched types earlier.  */
> +                 gcc_assert (
> +                   useless_type_conversion_p (vectype, TREE_TYPE (vec_oprnd)));
> +                 dataref_ptr
> +                   = vect_create_data_ref_ptr (vinfo, first_stmt_info,
> +                                               aggr_type, NULL, offset, &dummy,
> +                                               gsi, &ptr_incr, false, bump);
> +               }
>             }
> -         else
> +         else if (!costing_p)
>             {
>               gcc_assert (!LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo));
>               /* DR_CHAIN is then used as an input to
> @@ -8917,6 +8942,15 @@ vectorizable_store (vec_info *vinfo,
>                                              stmt_info, bump);
>             }
>
> +         if (costing_p)
> +           {
> +             for (i = 0; i < vec_num; i++)
> +               vect_get_store_cost (vinfo, stmt_info, 1,
> +                                    alignment_support_scheme, misalignment,
> +                                    &inside_cost, cost_vec);
> +             continue;
> +           }
> +
>           /* Get an array into which we can store the individual vectors.  */
>           tree vec_array = create_vector_array (vectype, vec_num);
>
> @@ -9003,6 +9037,12 @@ vectorizable_store (vec_info *vinfo,
>           STMT_VINFO_VEC_STMTS (stmt_info).safe_push (new_stmt);
>         }
>
> +      if (costing_p && dump_enabled_p ())
> +       dump_printf_loc (MSG_NOTE, vect_location,
> +                        "vect_model_store_cost: inside_cost = %d, "
> +                        "prologue_cost = %d .\n",
> +                        inside_cost, prologue_cost);
> +
>        return true;
>      }
>
> --
> 2.31.1
>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 07/10] vect: Adjust vectorizable_store costing on VMAT_CONTIGUOUS_PERMUTE
  2023-09-14  3:11 ` [PATCH 07/10] vect: Adjust vectorizable_store costing on VMAT_CONTIGUOUS_PERMUTE Kewen Lin
@ 2023-09-27 11:28   ` Richard Biener
  0 siblings, 0 replies; 23+ messages in thread
From: Richard Biener @ 2023-09-27 11:28 UTC (permalink / raw)
  To: Kewen Lin; +Cc: gcc-patches, richard.sandiford

On Thu, Sep 14, 2023 at 5:12 AM Kewen Lin <linkw@linux.ibm.com> wrote:
>
> This patch adjusts the cost handling for VMAT_CONTIGUOUS_PERMUTE
> in function vectorizable_store.  We no longer call function
> vect_model_store_cost for it.  Since this is the interleaving
> store case, costing skips all stmts except first_stmt_info and
> considers the whole group when costing first_stmt_info.  This
> patch shouldn't introduce any functional changes.

OK.

> gcc/ChangeLog:
>
>         * tree-vect-stmts.cc (vect_model_store_cost): Assert it will never
>         get VMAT_CONTIGUOUS_PERMUTE and remove VMAT_CONTIGUOUS_PERMUTE related
>         handlings.
>         (vectorizable_store): Adjust the cost handling on
>         VMAT_CONTIGUOUS_PERMUTE without calling vect_model_store_cost.
> ---
>  gcc/tree-vect-stmts.cc | 128 ++++++++++++++++++++++++-----------------
>  1 file changed, 74 insertions(+), 54 deletions(-)
>
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index fbd16b8a487..e3ba8077091 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -967,10 +967,10 @@ vect_model_store_cost (vec_info *vinfo, stmt_vec_info stmt_info, int ncopies,
>    gcc_assert (memory_access_type != VMAT_GATHER_SCATTER
>               && memory_access_type != VMAT_ELEMENTWISE
>               && memory_access_type != VMAT_STRIDED_SLP
> -             && memory_access_type != VMAT_LOAD_STORE_LANES);
> +             && memory_access_type != VMAT_LOAD_STORE_LANES
> +             && memory_access_type != VMAT_CONTIGUOUS_PERMUTE);
> +
>    unsigned int inside_cost = 0, prologue_cost = 0;
> -  stmt_vec_info first_stmt_info = stmt_info;
> -  bool grouped_access_p = STMT_VINFO_GROUPED_ACCESS (stmt_info);
>
>    /* ???  Somehow we need to fix this at the callers.  */
>    if (slp_node)
> @@ -983,35 +983,6 @@ vect_model_store_cost (vec_info *vinfo, stmt_vec_info stmt_info, int ncopies,
>                                            stmt_info, 0, vect_prologue);
>      }
>
> -  /* Grouped stores update all elements in the group at once,
> -     so we want the DR for the first statement.  */
> -  if (!slp_node && grouped_access_p)
> -    first_stmt_info = DR_GROUP_FIRST_ELEMENT (stmt_info);
> -
> -  /* True if we should include any once-per-group costs as well as
> -     the cost of the statement itself.  For SLP we only get called
> -     once per group anyhow.  */
> -  bool first_stmt_p = (first_stmt_info == stmt_info);
> -
> -  /* We assume that the cost of a single store-lanes instruction is
> -     equivalent to the cost of DR_GROUP_SIZE separate stores.  If a grouped
> -     access is instead being provided by a permute-and-store operation,
> -     include the cost of the permutes.  */
> -  if (first_stmt_p
> -      && memory_access_type == VMAT_CONTIGUOUS_PERMUTE)
> -    {
> -      /* Uses a high and low interleave or shuffle operations for each
> -        needed permute.  */
> -      int group_size = DR_GROUP_SIZE (first_stmt_info);
> -      int nstmts = ncopies * ceil_log2 (group_size) * group_size;
> -      inside_cost = record_stmt_cost (cost_vec, nstmts, vec_perm,
> -                                     stmt_info, 0, vect_body);
> -
> -      if (dump_enabled_p ())
> -        dump_printf_loc (MSG_NOTE, vect_location,
> -                         "vect_model_store_cost: strided group_size = %d .\n",
> -                         group_size);
> -    }
>
>    /* Costs of the stores.  */
>    vect_get_store_cost (vinfo, stmt_info, ncopies, alignment_support_scheme,
> @@ -8408,9 +8379,7 @@ vectorizable_store (vec_info *vinfo,
>          costing, use the first one instead.  */
>        if (grouped_store
>           && !slp
> -         && first_stmt_info != stmt_info
> -         && (memory_access_type == VMAT_ELEMENTWISE
> -             || memory_access_type == VMAT_LOAD_STORE_LANES))
> +         && first_stmt_info != stmt_info)
>         return true;
>      }
>    gcc_assert (memory_access_type == STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info));
> @@ -9254,14 +9223,15 @@ vectorizable_store (vec_info *vinfo,
>        return true;
>      }
>
> +  unsigned inside_cost = 0, prologue_cost = 0;
>    auto_vec<tree> result_chain (group_size);
>    auto_vec<tree, 1> vec_oprnds;
>    for (j = 0; j < ncopies; j++)
>      {
>        gimple *new_stmt;
> -      if (j == 0 && !costing_p)
> +      if (j == 0)
>         {
> -         if (slp)
> +         if (slp && !costing_p)
>             {
>               /* Get vectorized arguments for SLP_NODE.  */
>               vect_get_vec_defs (vinfo, stmt_info, slp_node, 1, op,
> @@ -9287,13 +9257,20 @@ vectorizable_store (vec_info *vinfo,
>                      that there is no interleaving, DR_GROUP_SIZE is 1,
>                      and only one iteration of the loop will be executed.  */
>                   op = vect_get_store_rhs (next_stmt_info);
> -                 vect_get_vec_defs_for_operand (vinfo, next_stmt_info, ncopies,
> -                                                op, gvec_oprnds[i]);
> -                 vec_oprnd = (*gvec_oprnds[i])[0];
> -                 dr_chain.quick_push (vec_oprnd);
> +                 if (costing_p
> +                     && memory_access_type == VMAT_CONTIGUOUS_PERMUTE)
> +                   update_prologue_cost (&prologue_cost, op);
> +                 else if (!costing_p)
> +                   {
> +                     vect_get_vec_defs_for_operand (vinfo, next_stmt_info,
> +                                                    ncopies, op,
> +                                                    gvec_oprnds[i]);
> +                     vec_oprnd = (*gvec_oprnds[i])[0];
> +                     dr_chain.quick_push (vec_oprnd);
> +                   }
>                   next_stmt_info = DR_GROUP_NEXT_ELEMENT (next_stmt_info);
>                 }
> -             if (mask)
> +             if (mask && !costing_p)
>                 {
>                   vect_get_vec_defs_for_operand (vinfo, stmt_info, ncopies,
>                                                  mask, &vec_masks,
> @@ -9303,11 +9280,13 @@ vectorizable_store (vec_info *vinfo,
>             }
>
>           /* We should have catched mismatched types earlier.  */
> -         gcc_assert (useless_type_conversion_p (vectype,
> -                                                TREE_TYPE (vec_oprnd)));
> +         gcc_assert (costing_p
> +                     || useless_type_conversion_p (vectype,
> +                                                   TREE_TYPE (vec_oprnd)));
>           bool simd_lane_access_p
>             = STMT_VINFO_SIMD_LANE_ACCESS_P (stmt_info) != 0;
> -         if (simd_lane_access_p
> +         if (!costing_p
> +             && simd_lane_access_p
>               && !loop_masks
>               && TREE_CODE (DR_BASE_ADDRESS (first_dr_info->dr)) == ADDR_EXPR
>               && VAR_P (TREE_OPERAND (DR_BASE_ADDRESS (first_dr_info->dr), 0))
> @@ -9319,7 +9298,7 @@ vectorizable_store (vec_info *vinfo,
>               dataref_ptr = unshare_expr (DR_BASE_ADDRESS (first_dr_info->dr));
>               dataref_offset = build_int_cst (ref_type, 0);
>             }
> -         else
> +         else if (!costing_p)
>             dataref_ptr
>               = vect_create_data_ref_ptr (vinfo, first_stmt_info, aggr_type,
>                                           simd_lane_access_p ? loop : NULL,
> @@ -9347,16 +9326,46 @@ vectorizable_store (vec_info *vinfo,
>         }
>
>        new_stmt = NULL;
> -      if (!costing_p && grouped_store)
> -       /* Permute.  */
> -       vect_permute_store_chain (vinfo, dr_chain, group_size, stmt_info, gsi,
> -                                 &result_chain);
> +      if (grouped_store)
> +       {
> +         /* Permute.  */
> +         gcc_assert (memory_access_type == VMAT_CONTIGUOUS_PERMUTE);
> +         if (costing_p)
> +           {
> +             int group_size = DR_GROUP_SIZE (first_stmt_info);
> +             int nstmts = ceil_log2 (group_size) * group_size;
> +             inside_cost += record_stmt_cost (cost_vec, nstmts, vec_perm,
> +                                              stmt_info, 0, vect_body);
> +             if (dump_enabled_p ())
> +               dump_printf_loc (MSG_NOTE, vect_location,
> +                                "vect_model_store_cost: "
> +                                "strided group_size = %d .\n",
> +                                group_size);
> +           }
> +         else
> +           vect_permute_store_chain (vinfo, dr_chain, group_size, stmt_info,
> +                                     gsi, &result_chain);
> +       }
>
>        stmt_vec_info next_stmt_info = first_stmt_info;
>        for (i = 0; i < vec_num; i++)
>         {
>           if (costing_p)
> -           continue;
> +           {
> +             if (memory_access_type == VMAT_CONTIGUOUS_PERMUTE)
> +               vect_get_store_cost (vinfo, stmt_info, 1,
> +                                    alignment_support_scheme, misalignment,
> +                                    &inside_cost, cost_vec);
> +
> +             if (!slp)
> +               {
> +                 next_stmt_info = DR_GROUP_NEXT_ELEMENT (next_stmt_info);
> +                 if (!next_stmt_info)
> +                   break;
> +               }
> +
> +             continue;
> +           }
>           unsigned misalign;
>           unsigned HOST_WIDE_INT align;
>
> @@ -9540,9 +9549,20 @@ vectorizable_store (vec_info *vinfo,
>      }
>
>    if (costing_p)
> -    vect_model_store_cost (vinfo, stmt_info, ncopies, memory_access_type,
> -                          alignment_support_scheme, misalignment, vls_type,
> -                          slp_node, cost_vec);
> +    {
> +      if (memory_access_type == VMAT_CONTIGUOUS_PERMUTE)
> +       {
> +         if (dump_enabled_p ())
> +           dump_printf_loc (MSG_NOTE, vect_location,
> +                            "vect_model_store_cost: inside_cost = %d, "
> +                            "prologue_cost = %d .\n",
> +                            inside_cost, prologue_cost);
> +       }
> +      else
> +       vect_model_store_cost (vinfo, stmt_info, ncopies, memory_access_type,
> +                              alignment_support_scheme, misalignment, vls_type,
> +                              slp_node, cost_vec);
> +    }
>
>    return true;
>  }
> --
> 2.31.1
>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 09/10] vect: Get rid of vect_model_store_cost
  2023-09-14  3:11 ` [PATCH 09/10] vect: Get rid of vect_model_store_cost Kewen Lin
@ 2023-09-27 11:29   ` Richard Biener
  0 siblings, 0 replies; 23+ messages in thread
From: Richard Biener @ 2023-09-27 11:29 UTC (permalink / raw)
  To: Kewen Lin; +Cc: gcc-patches, richard.sandiford

On Thu, Sep 14, 2023 at 5:12 AM Kewen Lin <linkw@linux.ibm.com> wrote:
>
> This patch finally gets rid of vect_model_store_cost; it
> adjusts the costing for the remaining memory access types
> VMAT_CONTIGUOUS{, _DOWN, _REVERSE} by moving the costing close
> to the transform code.  Note that vect_model_store_cost has one
> special handling for vectorizing a store into the function
> result; since that is an extra penalty and the transform part
> has no counterpart for it, this patch keeps that handling
> unchanged.

OK.
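
Regarding the function-result penalty that is kept: it applies to
vectorized stores into a return value that comes back in more than one
register.  A rough, illustrative sketch (whether the penalty actually
triggers depends on the ABI, i.e. on hard_function_value returning a
plain multi-register REG rather than a PARALLEL):

typedef struct { int v[4]; } quad;

quad
make_quad (const int *restrict p)
{
  quad r;
  for (int i = 0; i < 4; i++)
    r.v[i] = p[i] + 1;  /* vector store into the function result; with a
                           multi-register return the costing assumes a
                           spill plus piecewise loads.  */
  return r;
}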

> gcc/ChangeLog:
>
>         * tree-vect-stmts.cc (vect_model_store_cost): Remove.
>         (vectorizable_store): Adjust the costing for the remaining memory
>         access types VMAT_CONTIGUOUS{, _DOWN, _REVERSE}.
> ---
>  gcc/tree-vect-stmts.cc | 137 +++++++++++++----------------------------
>  1 file changed, 44 insertions(+), 93 deletions(-)
>
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index e3ba8077091..3d451c80bca 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -951,81 +951,6 @@ cfun_returns (tree decl)
>    return false;
>  }
>
> -/* Function vect_model_store_cost
> -
> -   Models cost for stores.  In the case of grouped accesses, one access
> -   has the overhead of the grouped access attributed to it.  */
> -
> -static void
> -vect_model_store_cost (vec_info *vinfo, stmt_vec_info stmt_info, int ncopies,
> -                      vect_memory_access_type memory_access_type,
> -                      dr_alignment_support alignment_support_scheme,
> -                      int misalignment,
> -                      vec_load_store_type vls_type, slp_tree slp_node,
> -                      stmt_vector_for_cost *cost_vec)
> -{
> -  gcc_assert (memory_access_type != VMAT_GATHER_SCATTER
> -             && memory_access_type != VMAT_ELEMENTWISE
> -             && memory_access_type != VMAT_STRIDED_SLP
> -             && memory_access_type != VMAT_LOAD_STORE_LANES
> -             && memory_access_type != VMAT_CONTIGUOUS_PERMUTE);
> -
> -  unsigned int inside_cost = 0, prologue_cost = 0;
> -
> -  /* ???  Somehow we need to fix this at the callers.  */
> -  if (slp_node)
> -    ncopies = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node);
> -
> -  if (vls_type == VLS_STORE_INVARIANT)
> -    {
> -      if (!slp_node)
> -       prologue_cost += record_stmt_cost (cost_vec, 1, scalar_to_vec,
> -                                          stmt_info, 0, vect_prologue);
> -    }
> -
> -
> -  /* Costs of the stores.  */
> -  vect_get_store_cost (vinfo, stmt_info, ncopies, alignment_support_scheme,
> -                      misalignment, &inside_cost, cost_vec);
> -
> -  /* When vectorizing a store into the function result assign
> -     a penalty if the function returns in a multi-register location.
> -     In this case we assume we'll end up with having to spill the
> -     vector result and do piecewise loads as a conservative estimate.  */
> -  tree base = get_base_address (STMT_VINFO_DATA_REF (stmt_info)->ref);
> -  if (base
> -      && (TREE_CODE (base) == RESULT_DECL
> -         || (DECL_P (base) && cfun_returns (base)))
> -      && !aggregate_value_p (base, cfun->decl))
> -    {
> -      rtx reg = hard_function_value (TREE_TYPE (base), cfun->decl, 0, 1);
> -      /* ???  Handle PARALLEL in some way.  */
> -      if (REG_P (reg))
> -       {
> -         int nregs = hard_regno_nregs (REGNO (reg), GET_MODE (reg));
> -         /* Assume that a single reg-reg move is possible and cheap,
> -            do not account for vector to gp register move cost.  */
> -         if (nregs > 1)
> -           {
> -             /* Spill.  */
> -             prologue_cost += record_stmt_cost (cost_vec, ncopies,
> -                                                vector_store,
> -                                                stmt_info, 0, vect_epilogue);
> -             /* Loads.  */
> -             prologue_cost += record_stmt_cost (cost_vec, ncopies * nregs,
> -                                                scalar_load,
> -                                                stmt_info, 0, vect_epilogue);
> -           }
> -       }
> -    }
> -
> -  if (dump_enabled_p ())
> -    dump_printf_loc (MSG_NOTE, vect_location,
> -                     "vect_model_store_cost: inside_cost = %d, "
> -                     "prologue_cost = %d .\n", inside_cost, prologue_cost);
> -}
> -
> -
>  /* Calculate cost of DR's memory access.  */
>  void
>  vect_get_store_cost (vec_info *, stmt_vec_info stmt_info, int ncopies,
> @@ -9223,6 +9148,11 @@ vectorizable_store (vec_info *vinfo,
>        return true;
>      }
>
> +  gcc_assert (memory_access_type == VMAT_CONTIGUOUS
> +             || memory_access_type == VMAT_CONTIGUOUS_DOWN
> +             || memory_access_type == VMAT_CONTIGUOUS_PERMUTE
> +             || memory_access_type == VMAT_CONTIGUOUS_REVERSE);
> +
>    unsigned inside_cost = 0, prologue_cost = 0;
>    auto_vec<tree> result_chain (group_size);
>    auto_vec<tree, 1> vec_oprnds;
> @@ -9257,10 +9187,9 @@ vectorizable_store (vec_info *vinfo,
>                      that there is no interleaving, DR_GROUP_SIZE is 1,
>                      and only one iteration of the loop will be executed.  */
>                   op = vect_get_store_rhs (next_stmt_info);
> -                 if (costing_p
> -                     && memory_access_type == VMAT_CONTIGUOUS_PERMUTE)
> +                 if (costing_p)
>                     update_prologue_cost (&prologue_cost, op);
> -                 else if (!costing_p)
> +                 else
>                     {
>                       vect_get_vec_defs_for_operand (vinfo, next_stmt_info,
>                                                      ncopies, op,
> @@ -9352,10 +9281,9 @@ vectorizable_store (vec_info *vinfo,
>         {
>           if (costing_p)
>             {
> -             if (memory_access_type == VMAT_CONTIGUOUS_PERMUTE)
> -               vect_get_store_cost (vinfo, stmt_info, 1,
> -                                    alignment_support_scheme, misalignment,
> -                                    &inside_cost, cost_vec);
> +             vect_get_store_cost (vinfo, stmt_info, 1,
> +                                  alignment_support_scheme, misalignment,
> +                                  &inside_cost, cost_vec);
>
>               if (!slp)
>                 {
> @@ -9550,18 +9478,41 @@ vectorizable_store (vec_info *vinfo,
>
>    if (costing_p)
>      {
> -      if (memory_access_type == VMAT_CONTIGUOUS_PERMUTE)
> -       {
> -         if (dump_enabled_p ())
> -           dump_printf_loc (MSG_NOTE, vect_location,
> -                            "vect_model_store_cost: inside_cost = %d, "
> -                            "prologue_cost = %d .\n",
> -                            inside_cost, prologue_cost);
> +      /* When vectorizing a store into the function result assign
> +        a penalty if the function returns in a multi-register location.
> +        In this case we assume we'll end up with having to spill the
> +        vector result and do piecewise loads as a conservative estimate.  */
> +      tree base = get_base_address (STMT_VINFO_DATA_REF (stmt_info)->ref);
> +      if (base
> +         && (TREE_CODE (base) == RESULT_DECL
> +             || (DECL_P (base) && cfun_returns (base)))
> +         && !aggregate_value_p (base, cfun->decl))
> +       {
> +         rtx reg = hard_function_value (TREE_TYPE (base), cfun->decl, 0, 1);
> +         /* ???  Handle PARALLEL in some way.  */
> +         if (REG_P (reg))
> +           {
> +             int nregs = hard_regno_nregs (REGNO (reg), GET_MODE (reg));
> +             /* Assume that a single reg-reg move is possible and cheap,
> +                do not account for vector to gp register move cost.  */
> +             if (nregs > 1)
> +               {
> +                 /* Spill.  */
> +                 prologue_cost
> +                   += record_stmt_cost (cost_vec, ncopies, vector_store,
> +                                        stmt_info, 0, vect_epilogue);
> +                 /* Loads.  */
> +                 prologue_cost
> +                   += record_stmt_cost (cost_vec, ncopies * nregs, scalar_load,
> +                                        stmt_info, 0, vect_epilogue);
> +               }
> +           }
>         }
> -      else
> -       vect_model_store_cost (vinfo, stmt_info, ncopies, memory_access_type,
> -                              alignment_support_scheme, misalignment, vls_type,
> -                              slp_node, cost_vec);
> +      if (dump_enabled_p ())
> +       dump_printf_loc (MSG_NOTE, vect_location,
> +                        "vect_model_store_cost: inside_cost = %d, "
> +                        "prologue_cost = %d .\n",
> +                        inside_cost, prologue_cost);
>      }
>
>    return true;
> --
> 2.31.1
>
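
For illustration, here is a minimal sketch (hypothetical code, not taken
from the patch or the GCC testsuite) of the kind of function the
multi-register return penalty in the last hunk above is aimed at.  Whether
the penalty actually fires depends on the target ABI: the return value has
to come back in a single hard register spanning several registers
(nregs > 1), not in memory and not as a PARALLEL.

/* Hypothetical example: the four stores into R may be vectorized into one
   vector store, but on many 64-bit ABIs R is returned in registers, so the
   vector result is assumed to be spilled and reloaded piecewise, which the
   costing above accounts for conservatively.  */
struct quad { int a, b, c, d; };

struct quad
make_quad (const int *p)
{
  struct quad r;
  r.a = p[0] + 1;
  r.b = p[1] + 1;
  r.c = p[2] + 1;
  r.d = p[3] + 1;
  return r;
}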


* Re: [PATCH 10/10] vect: Consider vec_perm costing for VMAT_CONTIGUOUS_REVERSE
  2023-09-14  3:11 ` [PATCH 10/10] vect: Consider vec_perm costing for VMAT_CONTIGUOUS_REVERSE Kewen Lin
@ 2023-09-27 11:30   ` Richard Biener
  0 siblings, 0 replies; 23+ messages in thread
From: Richard Biener @ 2023-09-27 11:30 UTC (permalink / raw)
  To: Kewen Lin; +Cc: gcc-patches, richard.sandiford

On Thu, Sep 14, 2023 at 5:12 AM Kewen Lin <linkw@linux.ibm.com> wrote:
>
> For VMAT_CONTIGUOUS_REVERSE, the transform code in function
> vectorizable_store generates a VEC_PERM_EXPR stmt before
> storing, but it's never considered in costing.
>
> This patch makes costing consider that vec_perm.  It also
> adjusts the order of the transform code a bit so that the
> costing_p path can return early.

OK.

> gcc/ChangeLog:
>
>         * tree-vect-stmts.cc (vectorizable_store): Consider generated
>         VEC_PERM_EXPR stmt for VMAT_CONTIGUOUS_REVERSE in costing as
>         vec_perm.
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.dg/vect/costmodel/ppc/costmodel-vect-store-2.c: New test.
> ---
>  .../costmodel/ppc/costmodel-vect-store-2.c    | 29 +++++++++
>  gcc/tree-vect-stmts.cc                        | 63 +++++++++++--------
>  2 files changed, 65 insertions(+), 27 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/ppc/costmodel-vect-store-2.c
>
> diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/ppc/costmodel-vect-store-2.c b/gcc/testsuite/gcc.dg/vect/costmodel/ppc/costmodel-vect-store-2.c
> new file mode 100644
> index 00000000000..72b67cf9040
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/costmodel/ppc/costmodel-vect-store-2.c
> @@ -0,0 +1,29 @@
> +/* { dg-do compile } */
> +/* { dg-require-effective-target vect_int } */
> +/* { dg-require-effective-target powerpc_vsx_ok } */
> +/* { dg-additional-options "-mvsx" } */
> +
> +/* Verify we do cost the required vec_perm.  */
> +
> +int
> +foo (int *a, int *b, int len)
> +{
> +  int i;
> +  int *a1 = a;
> +  int *a0 = a1 - 4;
> +  for (i = 0; i < len; i++)
> +    {
> +      *b = *a0 + *a1;
> +      b--;
> +      a0++;
> +      a1++;
> +    }
> +  return 0;
> +}
> +
> +/* The reason why it doesn't check the exact count is that
> +   we can get more than 1 vec_perm when it's compiled with
> +   partial vector capability like Power10 (retrying for
> +   the epilogue) or it's compiled without unaligned vector
> +   memory access support (realign).  */
> +/* { dg-final { scan-tree-dump {\mvec_perm\M} "vect" } } */
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index 3d451c80bca..ce925cc1d53 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -9279,6 +9279,40 @@ vectorizable_store (vec_info *vinfo,
>        stmt_vec_info next_stmt_info = first_stmt_info;
>        for (i = 0; i < vec_num; i++)
>         {
> +         if (!costing_p)
> +           {
> +             if (slp)
> +               vec_oprnd = vec_oprnds[i];
> +             else if (grouped_store)
> +               /* For grouped stores vectorized defs are interleaved in
> +                  vect_permute_store_chain().  */
> +               vec_oprnd = result_chain[i];
> +           }
> +
> +         if (memory_access_type == VMAT_CONTIGUOUS_REVERSE)
> +           {
> +             if (costing_p)
> +               inside_cost += record_stmt_cost (cost_vec, 1, vec_perm,
> +                                                stmt_info, 0, vect_body);
> +             else
> +               {
> +                 tree perm_mask = perm_mask_for_reverse (vectype);
> +                 tree perm_dest = vect_create_destination_var (
> +                   vect_get_store_rhs (stmt_info), vectype);
> +                 tree new_temp = make_ssa_name (perm_dest);
> +
> +                 /* Generate the permute statement.  */
> +                 gimple *perm_stmt
> +                   = gimple_build_assign (new_temp, VEC_PERM_EXPR, vec_oprnd,
> +                                          vec_oprnd, perm_mask);
> +                 vect_finish_stmt_generation (vinfo, stmt_info, perm_stmt,
> +                                              gsi);
> +
> +                 perm_stmt = SSA_NAME_DEF_STMT (new_temp);
> +                 vec_oprnd = new_temp;
> +               }
> +           }
> +
>           if (costing_p)
>             {
>               vect_get_store_cost (vinfo, stmt_info, 1,
> @@ -9294,8 +9328,6 @@ vectorizable_store (vec_info *vinfo,
>
>               continue;
>             }
> -         unsigned misalign;
> -         unsigned HOST_WIDE_INT align;
>
>           tree final_mask = NULL_TREE;
>           tree final_len = NULL_TREE;
> @@ -9315,13 +9347,8 @@ vectorizable_store (vec_info *vinfo,
>             dataref_ptr = bump_vector_ptr (vinfo, dataref_ptr, ptr_incr, gsi,
>                                            stmt_info, bump);
>
> -         if (slp)
> -           vec_oprnd = vec_oprnds[i];
> -         else if (grouped_store)
> -           /* For grouped stores vectorized defs are interleaved in
> -              vect_permute_store_chain().  */
> -           vec_oprnd = result_chain[i];
> -
> +         unsigned misalign;
> +         unsigned HOST_WIDE_INT align;
>           align = known_alignment (DR_TARGET_ALIGNMENT (first_dr_info));
>           if (alignment_support_scheme == dr_aligned)
>             misalign = 0;
> @@ -9338,24 +9365,6 @@ vectorizable_store (vec_info *vinfo,
>                                     misalign);
>           align = least_bit_hwi (misalign | align);
>
> -         if (memory_access_type == VMAT_CONTIGUOUS_REVERSE)
> -           {
> -             tree perm_mask = perm_mask_for_reverse (vectype);
> -             tree perm_dest
> -               = vect_create_destination_var (vect_get_store_rhs (stmt_info),
> -                                              vectype);
> -             tree new_temp = make_ssa_name (perm_dest);
> -
> -             /* Generate the permute statement.  */
> -             gimple *perm_stmt
> -               = gimple_build_assign (new_temp, VEC_PERM_EXPR, vec_oprnd,
> -                                      vec_oprnd, perm_mask);
> -             vect_finish_stmt_generation (vinfo, stmt_info, perm_stmt, gsi);
> -
> -             perm_stmt = SSA_NAME_DEF_STMT (new_temp);
> -             vec_oprnd = new_temp;
> -           }
> -
>           /* Compute IFN when LOOP_LENS or final_mask valid.  */
>           machine_mode vmode = TYPE_MODE (vectype);
>           machine_mode new_vmode = vmode;
> --
> 2.31.1
>
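
A brief illustration of why the permute matters (a hypothetical example,
similar in spirit to the new test above, not taken from the patch): when
the stored addresses decrease across iterations, the vector of results is
computed in forward lane order and has to be reversed with one
VEC_PERM_EXPR before it can be stored as a single contiguous vector at the
lowest address; that permute is what now gets costed as vec_perm.

/* Hypothetical reverse-store loop: DST is written in decreasing address
   order, so vectorizing the store typically uses VMAT_CONTIGUOUS_REVERSE
   and needs one lane-reversing permute per vector store.  */
void
reverse_inc (int *restrict dst, const int *restrict src, int n)
{
  for (int i = 0; i < n; i++)
    dst[n - 1 - i] = src[i] + 1;
}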

