public inbox for gcc-patches@gcc.gnu.org
* [PATCH 0/9] vect: Move costing next to the transform for vect load
@ 2023-06-13  2:03 Kewen Lin
  2023-06-13  2:03 ` [PATCH 1/9] vect: Move vect_model_load_cost next to the transform in vectorizable_load Kewen Lin
                   ` (10 more replies)
  0 siblings, 11 replies; 27+ messages in thread
From: Kewen Lin @ 2023-06-13  2:03 UTC (permalink / raw)
  To: gcc-patches; +Cc: richard.guenther, richard.sandiford, segher, bergner

This patch series follows Richi's suggestion at the link [1],
which suggests restructuring vectorizable_load so that the
costing is done next to the transform, in order to make it
easier to keep costing and the transform in sync.  It is a
known issue that what we cost can be inconsistent with what
we transform, as PR82255 and some other associated test
cases in the patches of this series show.

Basically this patch series makes the costing stop calling
function vect_model_load_cost altogether.  To make review
and bisection easy, I organized the changes according to the
memory access types of vector load.  For each memory access
type, the costing first follows the handlings in function
vect_model_load_cost to avoid missing anything, and then is
refined further by referring to the transform code; I also
checked the results with some typical test cases.  Hope the
subjects of the patches are clear enough.
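As a rough sketch (simplified, not the exact code), the resulting
shape of vectorizable_load is:

  bool costing_p = !vec_stmt;  /* Analysis phase: only record costs.  */
  ...
  if (memory_access_type == VMAT_INVARIANT)
    {
      if (costing_p)
        {
          /* Record the costs matching the transform below.  */
          return true;
        }
      /* ... the actual transform for invariant loads ...  */
    }
  /* ... likewise for the other vect_memory_access_types, so each access
     type is costed right next to the code it generates.  */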

The whole series can be bootstrapped and regtested
incrementally on:
  - x86_64-redhat-linux
  - aarch64-linux-gnu
  - powerpc64-linux-gnu P7, P8 and P9
  - powerpc64le-linux-gnu P8, P9 and P10

Considering that the current vector test buckets are mainly
run without the cost model, I also verified that the whole
patch series is neutral for SPEC2017 int/fp on Power9 at O2,
O3 and Ofast separately.

[1] https://gcc.gnu.org/pipermail/gcc-patches/2021-January/563624.html

Kewen Lin (9):
  vect: Move vect_model_load_cost next to the transform in vectorizable_load
  vect: Adjust vectorizable_load costing on VMAT_GATHER_SCATTER && gs_info.decl
  vect: Adjust vectorizable_load costing on VMAT_INVARIANT
  vect: Adjust vectorizable_load costing on VMAT_ELEMENTWISE and VMAT_STRIDED_SLP
  vect: Adjust vectorizable_load costing on VMAT_GATHER_SCATTER
  vect: Adjust vectorizable_load costing on VMAT_LOAD_STORE_LANES
  vect: Adjust vectorizable_load costing on VMAT_CONTIGUOUS_REVERSE
  vect: Adjust vectorizable_load costing on VMAT_CONTIGUOUS_PERMUTE
  vect: Adjust vectorizable_load costing on VMAT_CONTIGUOUS

 .../vect/costmodel/ppc/costmodel-pr82255.c    |  31 +
 .../costmodel/ppc/costmodel-vect-reversed.c   |  22 +
 gcc/testsuite/gcc.target/i386/pr70021.c       |   2 +-
 gcc/tree-vect-stmts.cc                        | 651 ++++++++++--------
 4 files changed, 432 insertions(+), 274 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/ppc/costmodel-pr82255.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/ppc/costmodel-vect-reversed.c

-- 
2.31.1



* [PATCH 1/9] vect: Move vect_model_load_cost next to the transform in vectorizable_load
  2023-06-13  2:03 [PATCH 0/9] vect: Move costing next to the transform for vect load Kewen Lin
@ 2023-06-13  2:03 ` Kewen Lin
  2023-06-13  2:03 ` [PATCH 2/9] vect: Adjust vectorizable_load costing on VMAT_GATHER_SCATTER && gs_info.decl Kewen Lin
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 27+ messages in thread
From: Kewen Lin @ 2023-06-13  2:03 UTC (permalink / raw)
  To: gcc-patches; +Cc: richard.guenther, richard.sandiford, segher, bergner

This patch is an initial patch to move costing next to the
transform.  It still adopts vect_model_load_cost for costing,
but moves and duplicates the call down according to the
handlings of the different vect_memory_access_types, which
should make the subsequent patches easier to review.  This
patch should not cause any functional changes.

gcc/ChangeLog:

	* tree-vect-stmts.cc (vectorizable_load): Move and duplicate the call
	to vect_model_load_cost down to some different transform paths
	according to the handlings of different vect_memory_access_types.
---
 gcc/tree-vect-stmts.cc | 86 ++++++++++++++++++++++++++++--------------
 1 file changed, 57 insertions(+), 29 deletions(-)

diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index a7acc032d47..44514658be3 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -9430,7 +9430,9 @@ vectorizable_load (vec_info *vinfo,
 	}
     }
 
-  if (!vec_stmt) /* transformation not required.  */
+  bool costing_p = !vec_stmt;
+
+  if (costing_p) /* transformation not required.  */
     {
       if (slp_node
 	  && mask
@@ -9464,17 +9466,13 @@ vectorizable_load (vec_info *vinfo,
 	vinfo->any_known_not_updated_vssa = true;
 
       STMT_VINFO_TYPE (stmt_info) = load_vec_info_type;
-      vect_model_load_cost (vinfo, stmt_info, ncopies, vf, memory_access_type,
-			    alignment_support_scheme, misalignment,
-			    &gs_info, slp_node, cost_vec);
-      return true;
     }
 
   if (!slp)
     gcc_assert (memory_access_type
 		== STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info));
 
-  if (dump_enabled_p ())
+  if (dump_enabled_p () && !costing_p)
     dump_printf_loc (MSG_NOTE, vect_location,
                      "transform load. ncopies = %d\n", ncopies);
 
@@ -9485,13 +9483,26 @@ vectorizable_load (vec_info *vinfo,
 
   if (memory_access_type == VMAT_GATHER_SCATTER && gs_info.decl)
     {
-      vect_build_gather_load_calls (vinfo,
-				    stmt_info, gsi, vec_stmt, &gs_info, mask);
+      if (costing_p)
+	vect_model_load_cost (vinfo, stmt_info, ncopies, vf, memory_access_type,
+			      alignment_support_scheme, misalignment, &gs_info,
+			      slp_node, cost_vec);
+      else
+	vect_build_gather_load_calls (vinfo, stmt_info, gsi, vec_stmt, &gs_info,
+				      mask);
       return true;
     }
 
   if (memory_access_type == VMAT_INVARIANT)
     {
+      if (costing_p)
+	{
+	  vect_model_load_cost (vinfo, stmt_info, ncopies, vf,
+				memory_access_type, alignment_support_scheme,
+				misalignment, &gs_info, slp_node, cost_vec);
+	  return true;
+	}
+
       gcc_assert (!grouped_load && !mask && !bb_vinfo);
       /* If we have versioned for aliasing or the loop doesn't
 	 have any data dependencies that would preclude this,
@@ -9548,6 +9559,14 @@ vectorizable_load (vec_info *vinfo,
   if (memory_access_type == VMAT_ELEMENTWISE
       || memory_access_type == VMAT_STRIDED_SLP)
     {
+      if (costing_p)
+	{
+	  vect_model_load_cost (vinfo, stmt_info, ncopies, vf,
+				memory_access_type, alignment_support_scheme,
+				misalignment, &gs_info, slp_node, cost_vec);
+	  return true;
+	}
+
       gimple_stmt_iterator incr_gsi;
       bool insert_after;
       tree offvar;
@@ -9989,17 +10008,20 @@ vectorizable_load (vec_info *vinfo,
 	 here, since we can't guarantee first_stmt_info DR has been
 	 initialized yet, use first_stmt_info_for_drptr DR by bumping the
 	 distance from first_stmt_info DR instead as below.  */
-      if (!diff_first_stmt_info)
-	msq = vect_setup_realignment (vinfo,
-				      first_stmt_info, gsi, &realignment_token,
-				      alignment_support_scheme, NULL_TREE,
-				      &at_loop);
-      if (alignment_support_scheme == dr_explicit_realign_optimized)
-	{
-	  phi = as_a <gphi *> (SSA_NAME_DEF_STMT (msq));
-	  offset = size_binop (MINUS_EXPR, TYPE_SIZE_UNIT (vectype),
-			       size_one_node);
-	  gcc_assert (!first_stmt_info_for_drptr);
+      if (!costing_p)
+	{
+	  if (!diff_first_stmt_info)
+	    msq = vect_setup_realignment (vinfo, first_stmt_info, gsi,
+					  &realignment_token,
+					  alignment_support_scheme, NULL_TREE,
+					  &at_loop);
+	  if (alignment_support_scheme == dr_explicit_realign_optimized)
+	    {
+	      phi = as_a<gphi *> (SSA_NAME_DEF_STMT (msq));
+	      offset = size_binop (MINUS_EXPR, TYPE_SIZE_UNIT (vectype),
+				   size_one_node);
+	      gcc_assert (!first_stmt_info_for_drptr);
+	    }
 	}
     }
   else
@@ -10020,8 +10042,9 @@ vectorizable_load (vec_info *vinfo,
   else if (memory_access_type == VMAT_GATHER_SCATTER)
     {
       aggr_type = elem_type;
-      vect_get_strided_load_store_ops (stmt_info, loop_vinfo, &gs_info,
-				       &bump, &vec_offset);
+      if (!costing_p)
+	vect_get_strided_load_store_ops (stmt_info, loop_vinfo, &gs_info, &bump,
+					 &vec_offset);
     }
   else
     {
@@ -10035,7 +10058,7 @@ vectorizable_load (vec_info *vinfo,
 
   auto_vec<tree> vec_offsets;
   auto_vec<tree> vec_masks;
-  if (mask)
+  if (mask && !costing_p)
     {
       if (slp_node)
 	vect_get_slp_defs (SLP_TREE_CHILDREN (slp_node)[mask_index],
@@ -10049,7 +10072,7 @@ vectorizable_load (vec_info *vinfo,
   for (j = 0; j < ncopies; j++)
     {
       /* 1. Create the vector or array pointer update chain.  */
-      if (j == 0)
+      if (j == 0 && !costing_p)
 	{
 	  bool simd_lane_access_p
 	    = STMT_VINFO_SIMD_LANE_ACCESS_P (stmt_info) != 0;
@@ -10108,7 +10131,7 @@ vectorizable_load (vec_info *vinfo,
 	  if (mask)
 	    vec_mask = vec_masks[0];
 	}
-      else
+      else if (!costing_p)
 	{
 	  gcc_assert (!LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo));
 	  if (dataref_offset)
@@ -10125,7 +10148,7 @@ vectorizable_load (vec_info *vinfo,
 	dr_chain.create (vec_num);
 
       gimple *new_stmt = NULL;
-      if (memory_access_type == VMAT_LOAD_STORE_LANES)
+      if (memory_access_type == VMAT_LOAD_STORE_LANES && !costing_p)
 	{
 	  tree vec_array;
 
@@ -10177,7 +10200,7 @@ vectorizable_load (vec_info *vinfo,
 	  /* Record that VEC_ARRAY is now dead.  */
 	  vect_clobber_variable (vinfo, stmt_info, gsi, vec_array);
 	}
-      else
+      else if (!costing_p)
 	{
 	  for (i = 0; i < vec_num; i++)
 	    {
@@ -10631,7 +10654,7 @@ vectorizable_load (vec_info *vinfo,
       if (slp && !slp_perm)
 	continue;
 
-      if (slp_perm)
+      if (slp_perm && !costing_p)
         {
 	  unsigned n_perms;
 	  /* For SLP we know we've seen all possible uses of dr_chain so
@@ -10643,7 +10666,7 @@ vectorizable_load (vec_info *vinfo,
 						  nullptr, true);
 	  gcc_assert (ok);
         }
-      else
+      else if (!costing_p)
         {
           if (grouped_load)
   	    {
@@ -10659,9 +10682,14 @@ vectorizable_load (vec_info *vinfo,
         }
       dr_chain.release ();
     }
-  if (!slp)
+  if (!slp && !costing_p)
     *vec_stmt = STMT_VINFO_VEC_STMTS (stmt_info)[0];
 
+  if (costing_p)
+    vect_model_load_cost (vinfo, stmt_info, ncopies, vf, memory_access_type,
+			  alignment_support_scheme, misalignment, &gs_info,
+			  slp_node, cost_vec);
+
   return true;
 }
 
-- 
2.31.1



* [PATCH 2/9] vect: Adjust vectorizable_load costing on VMAT_GATHER_SCATTER && gs_info.decl
  2023-06-13  2:03 [PATCH 0/9] vect: Move costing next to the transform for vect load Kewen Lin
  2023-06-13  2:03 ` [PATCH 1/9] vect: Move vect_model_load_cost next to the transform in vectorizable_load Kewen Lin
@ 2023-06-13  2:03 ` Kewen Lin
  2023-06-30 11:11   ` Richard Biener
  2023-06-13  2:03 ` [PATCH 3/9] vect: Adjust vectorizable_load costing on VMAT_INVARIANT Kewen Lin
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 27+ messages in thread
From: Kewen Lin @ 2023-06-13  2:03 UTC (permalink / raw)
  To: gcc-patches; +Cc: richard.guenther, richard.sandiford, segher, bergner

This patch adds one extra argument cost_vec to function
vect_build_gather_load_calls, so that we can do the costing
next to the transform in vect_build_gather_load_calls.
For now, the implementation just follows the handlings in
vect_model_load_cost; that isn't ideal, so a FIXME is placed
there for further improvement.  This patch should not
cause any functional changes.

gcc/ChangeLog:

	* tree-vect-stmts.cc (vect_build_gather_load_calls): Add the handlings
	on costing with one extra argument cost_vec.
	(vectorizable_load): Adjust the call to vect_build_gather_load_calls.
	(vect_model_load_cost): Assert it won't get VMAT_GATHER_SCATTER with
	gs_info.decl set any more.
---
 gcc/tree-vect-stmts.cc | 31 +++++++++++++++++++++++--------
 1 file changed, 23 insertions(+), 8 deletions(-)

diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index 44514658be3..744cdf40e26 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -1135,6 +1135,8 @@ vect_model_load_cost (vec_info *vinfo,
 		      slp_tree slp_node,
 		      stmt_vector_for_cost *cost_vec)
 {
+  gcc_assert (memory_access_type != VMAT_GATHER_SCATTER || !gs_info->decl);
+
   unsigned int inside_cost = 0, prologue_cost = 0;
   bool grouped_access_p = STMT_VINFO_GROUPED_ACCESS (stmt_info);
 
@@ -2819,7 +2821,8 @@ vect_build_gather_load_calls (vec_info *vinfo, stmt_vec_info stmt_info,
 			      gimple_stmt_iterator *gsi,
 			      gimple **vec_stmt,
 			      gather_scatter_info *gs_info,
-			      tree mask)
+			      tree mask,
+			      stmt_vector_for_cost *cost_vec)
 {
   loop_vec_info loop_vinfo = dyn_cast <loop_vec_info> (vinfo);
   class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
@@ -2831,6 +2834,23 @@ vect_build_gather_load_calls (vec_info *vinfo, stmt_vec_info stmt_info,
   poly_uint64 gather_off_nunits
     = TYPE_VECTOR_SUBPARTS (gs_info->offset_vectype);
 
+  /* FIXME: Keep the previous costing way in vect_model_load_cost by costing
+     N scalar loads, but it should be tweaked to use target specific costs
+     on related gather load calls.  */
+  if (!vec_stmt)
+    {
+      unsigned int assumed_nunits = vect_nunits_for_cost (vectype);
+      unsigned int inside_cost;
+      inside_cost = record_stmt_cost (cost_vec, ncopies * assumed_nunits,
+				      scalar_load, stmt_info, 0, vect_body);
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_NOTE, vect_location,
+			 "vect_model_load_cost: inside_cost = %d, "
+			 "prologue_cost = 0 .\n",
+			 inside_cost);
+      return;
+    }
+
   tree arglist = TYPE_ARG_TYPES (TREE_TYPE (gs_info->decl));
   tree rettype = TREE_TYPE (TREE_TYPE (gs_info->decl));
   tree srctype = TREE_VALUE (arglist); arglist = TREE_CHAIN (arglist);
@@ -9483,13 +9503,8 @@ vectorizable_load (vec_info *vinfo,
 
   if (memory_access_type == VMAT_GATHER_SCATTER && gs_info.decl)
     {
-      if (costing_p)
-	vect_model_load_cost (vinfo, stmt_info, ncopies, vf, memory_access_type,
-			      alignment_support_scheme, misalignment, &gs_info,
-			      slp_node, cost_vec);
-      else
-	vect_build_gather_load_calls (vinfo, stmt_info, gsi, vec_stmt, &gs_info,
-				      mask);
+      vect_build_gather_load_calls (vinfo, stmt_info, gsi, vec_stmt, &gs_info,
+				    mask, cost_vec);
       return true;
     }
 
-- 
2.31.1



* [PATCH 3/9] vect: Adjust vectorizable_load costing on VMAT_INVARIANT
  2023-06-13  2:03 [PATCH 0/9] vect: Move costing next to the transform for vect load Kewen Lin
  2023-06-13  2:03 ` [PATCH 1/9] vect: Move vect_model_load_cost next to the transform in vectorizable_load Kewen Lin
  2023-06-13  2:03 ` [PATCH 2/9] vect: Adjust vectorizable_load costing on VMAT_GATHER_SCATTER && gs_info.decl Kewen Lin
@ 2023-06-13  2:03 ` Kewen Lin
  2023-06-30 11:18   ` Richard Biener
  2023-06-13  2:03 ` [PATCH 4/9] vect: Adjust vectorizable_load costing on VMAT_ELEMENTWISE and VMAT_STRIDED_SLP Kewen Lin
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 27+ messages in thread
From: Kewen Lin @ 2023-06-13  2:03 UTC (permalink / raw)
  To: gcc-patches; +Cc: richard.guenther, richard.sandiford, segher, bergner

This patch adjusts the cost handling on VMAT_INVARIANT in
function vectorizable_load.  We don't call function
vect_model_load_cost for it any more.

To make the costing on VMAT_INVARIANT better, this patch
queries hoist_defs_of_uses for the hoisting decision and adds
the costs to the proper "where" (prologue or body) based on
it.  Currently function hoist_defs_of_uses would always hoist
the defs of all SSA uses; the new argument HOIST_P allows us
to skip the actual hoisting during the costing phase.
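
Roughly, the costing path then becomes (a simplified sketch of the
hunk below):

  bool hoist_p = (LOOP_VINFO_NO_DATA_DEPENDENCIES (loop_vinfo)
                  && !nested_in_vect_loop
                  && hoist_defs_of_uses (stmt_info, loop, !costing_p));
  if (costing_p)
    {
      /* hoist_p: cost scalar_load + scalar_to_vec in the prologue,
         otherwise cost the same two in the loop body.  */
      return true;
    }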

gcc/ChangeLog:

	* tree-vect-stmts.cc (hoist_defs_of_uses): Add one argument HOIST_P.
	(vectorizable_load): Adjust the handling on VMAT_INVARIANT to respect
	hoisting decision and without calling vect_model_load_cost.
	(vect_model_load_cost): Assert it won't get VMAT_INVARIANT any more
	and remove VMAT_INVARIANT related handlings.
---
 gcc/tree-vect-stmts.cc | 61 +++++++++++++++++++++++++++---------------
 1 file changed, 39 insertions(+), 22 deletions(-)

diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index 744cdf40e26..19c61d703c8 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -1135,7 +1135,8 @@ vect_model_load_cost (vec_info *vinfo,
 		      slp_tree slp_node,
 		      stmt_vector_for_cost *cost_vec)
 {
-  gcc_assert (memory_access_type != VMAT_GATHER_SCATTER || !gs_info->decl);
+  gcc_assert ((memory_access_type != VMAT_GATHER_SCATTER || !gs_info->decl)
+	      && memory_access_type != VMAT_INVARIANT);
 
   unsigned int inside_cost = 0, prologue_cost = 0;
   bool grouped_access_p = STMT_VINFO_GROUPED_ACCESS (stmt_info);
@@ -1238,16 +1239,6 @@ vect_model_load_cost (vec_info *vinfo,
 				       ncopies * assumed_nunits,
 				       scalar_load, stmt_info, 0, vect_body);
     }
-  else if (memory_access_type == VMAT_INVARIANT)
-    {
-      /* Invariant loads will ideally be hoisted and splat to a vector.  */
-      prologue_cost += record_stmt_cost (cost_vec, 1,
-					 scalar_load, stmt_info, 0,
-					 vect_prologue);
-      prologue_cost += record_stmt_cost (cost_vec, 1,
-					 scalar_to_vec, stmt_info, 0,
-					 vect_prologue);
-    }
   else
     vect_get_load_cost (vinfo, stmt_info, ncopies,
 			alignment_support_scheme, misalignment, first_stmt_p,
@@ -9121,10 +9112,11 @@ permute_vec_elements (vec_info *vinfo,
 /* Hoist the definitions of all SSA uses on STMT_INFO out of the loop LOOP,
    inserting them on the loops preheader edge.  Returns true if we
    were successful in doing so (and thus STMT_INFO can be moved then),
-   otherwise returns false.  */
+   otherwise returns false.  HOIST_P indicates if we want to hoist the
+   definitions of all SSA uses, it would be false when we are costing.  */
 
 static bool
-hoist_defs_of_uses (stmt_vec_info stmt_info, class loop *loop)
+hoist_defs_of_uses (stmt_vec_info stmt_info, class loop *loop, bool hoist_p)
 {
   ssa_op_iter i;
   tree op;
@@ -9158,6 +9150,9 @@ hoist_defs_of_uses (stmt_vec_info stmt_info, class loop *loop)
   if (!any)
     return true;
 
+  if (!hoist_p)
+    return true;
+
   FOR_EACH_SSA_TREE_OPERAND (op, stmt_info->stmt, i, SSA_OP_USE)
     {
       gimple *def_stmt = SSA_NAME_DEF_STMT (op);
@@ -9510,14 +9505,6 @@ vectorizable_load (vec_info *vinfo,
 
   if (memory_access_type == VMAT_INVARIANT)
     {
-      if (costing_p)
-	{
-	  vect_model_load_cost (vinfo, stmt_info, ncopies, vf,
-				memory_access_type, alignment_support_scheme,
-				misalignment, &gs_info, slp_node, cost_vec);
-	  return true;
-	}
-
       gcc_assert (!grouped_load && !mask && !bb_vinfo);
       /* If we have versioned for aliasing or the loop doesn't
 	 have any data dependencies that would preclude this,
@@ -9525,7 +9512,37 @@ vectorizable_load (vec_info *vinfo,
 	 thus we can insert it on the preheader edge.  */
       bool hoist_p = (LOOP_VINFO_NO_DATA_DEPENDENCIES (loop_vinfo)
 		      && !nested_in_vect_loop
-		      && hoist_defs_of_uses (stmt_info, loop));
+		      && hoist_defs_of_uses (stmt_info, loop, !costing_p));
+      if (costing_p)
+	{
+	  if (hoist_p)
+	    {
+	      unsigned int prologue_cost;
+	      prologue_cost = record_stmt_cost (cost_vec, 1, scalar_load,
+						stmt_info, 0, vect_prologue);
+	      prologue_cost += record_stmt_cost (cost_vec, 1, scalar_to_vec,
+						 stmt_info, 0, vect_prologue);
+	      if (dump_enabled_p ())
+		dump_printf_loc (MSG_NOTE, vect_location,
+				 "vect_model_load_cost: inside_cost = 0, "
+				 "prologue_cost = %d .\n",
+				 prologue_cost);
+	    }
+	  else
+	    {
+	      unsigned int inside_cost;
+	      inside_cost = record_stmt_cost (cost_vec, 1, scalar_load,
+					      stmt_info, 0, vect_body);
+	      inside_cost += record_stmt_cost (cost_vec, 1, scalar_to_vec,
+					       stmt_info, 0, vect_body);
+	      if (dump_enabled_p ())
+		dump_printf_loc (MSG_NOTE, vect_location,
+				 "vect_model_load_cost: inside_cost = %d, "
+				 "prologue_cost = 0 .\n",
+				 inside_cost);
+	    }
+	  return true;
+	}
       if (hoist_p)
 	{
 	  gassign *stmt = as_a <gassign *> (stmt_info->stmt);
-- 
2.31.1



* [PATCH 4/9] vect: Adjust vectorizable_load costing on VMAT_ELEMENTWISE and VMAT_STRIDED_SLP
  2023-06-13  2:03 [PATCH 0/9] vect: Move costing next to the transform for vect load Kewen Lin
                   ` (2 preceding siblings ...)
  2023-06-13  2:03 ` [PATCH 3/9] vect: Adjust vectorizable_load costing on VMAT_INVARIANT Kewen Lin
@ 2023-06-13  2:03 ` Kewen Lin
  2023-07-02  8:58   ` Richard Sandiford
  2023-06-13  2:03 ` [PATCH 5/9] vect: Adjust vectorizable_load costing on VMAT_GATHER_SCATTER Kewen Lin
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 27+ messages in thread
From: Kewen Lin @ 2023-06-13  2:03 UTC (permalink / raw)
  To: gcc-patches; +Cc: richard.guenther, richard.sandiford, segher, bergner

This patch adjusts the cost handling on VMAT_ELEMENTWISE
and VMAT_STRIDED_SLP in function vectorizable_load.  We
don't call function vect_model_load_cost for them any more.

As PR82255 shows, we don't always need a vector construction
there; moving the costing next to the transform lets us cost
the vector construction only when it's actually needed.
Besides, it also counts the number of loads consistently with
the transform for some cases.

	 PR tree-optimization/82255

gcc/ChangeLog:

	* tree-vect-stmts.cc (vectorizable_load): Adjust the cost handling
	on VMAT_ELEMENTWISE and VMAT_STRIDED_SLP without calling
	vect_model_load_cost.
	(vect_model_load_cost): Assert it won't get VMAT_ELEMENTWISE and
	VMAT_STRIDED_SLP any more, and remove their related handlings.

gcc/testsuite/ChangeLog:

	* gcc.dg/vect/costmodel/ppc/costmodel-pr82255.c: New test.

2023-06-13  Bill Schmidt  <wschmidt@linux.ibm.com>
	    Kewen Lin  <linkw@linux.ibm.com>
---
 .../vect/costmodel/ppc/costmodel-pr82255.c    |  31 ++++
 gcc/tree-vect-stmts.cc                        | 170 +++++++++++-------
 2 files changed, 134 insertions(+), 67 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/ppc/costmodel-pr82255.c

diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/ppc/costmodel-pr82255.c b/gcc/testsuite/gcc.dg/vect/costmodel/ppc/costmodel-pr82255.c
new file mode 100644
index 00000000000..9317ee2e15b
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/costmodel/ppc/costmodel-pr82255.c
@@ -0,0 +1,31 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target vect_int } */
+
+/* PR82255: Ensure we don't require a vec_construct cost when we aren't
+   going to generate a strided load.  */
+
+extern int abs (int __x) __attribute__ ((__nothrow__, __leaf__))
+__attribute__ ((__const__));
+
+static int
+foo (unsigned char *w, int i, unsigned char *x, int j)
+{
+  int tot = 0;
+  for (int a = 0; a < 16; a++)
+    {
+#pragma GCC unroll 16
+      for (int b = 0; b < 16; b++)
+	tot += abs (w[b] - x[b]);
+      w += i;
+      x += j;
+    }
+  return tot;
+}
+
+void
+bar (unsigned char *w, unsigned char *x, int i, int *result)
+{
+  *result = foo (w, 16, x, i);
+}
+
+/* { dg-final { scan-tree-dump-times "vec_construct" 0 "vect" } } */
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index 19c61d703c8..651dc800380 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -1136,7 +1136,9 @@ vect_model_load_cost (vec_info *vinfo,
 		      stmt_vector_for_cost *cost_vec)
 {
   gcc_assert ((memory_access_type != VMAT_GATHER_SCATTER || !gs_info->decl)
-	      && memory_access_type != VMAT_INVARIANT);
+	      && memory_access_type != VMAT_INVARIANT
+	      && memory_access_type != VMAT_ELEMENTWISE
+	      && memory_access_type != VMAT_STRIDED_SLP);
 
   unsigned int inside_cost = 0, prologue_cost = 0;
   bool grouped_access_p = STMT_VINFO_GROUPED_ACCESS (stmt_info);
@@ -1221,8 +1223,7 @@ vect_model_load_cost (vec_info *vinfo,
     }
 
   /* The loads themselves.  */
-  if (memory_access_type == VMAT_ELEMENTWISE
-      || memory_access_type == VMAT_GATHER_SCATTER)
+  if (memory_access_type == VMAT_GATHER_SCATTER)
     {
       tree vectype = STMT_VINFO_VECTYPE (stmt_info);
       unsigned int assumed_nunits = vect_nunits_for_cost (vectype);
@@ -1244,10 +1245,10 @@ vect_model_load_cost (vec_info *vinfo,
 			alignment_support_scheme, misalignment, first_stmt_p,
 			&inside_cost, &prologue_cost, 
 			cost_vec, cost_vec, true);
-  if (memory_access_type == VMAT_ELEMENTWISE
-      || memory_access_type == VMAT_STRIDED_SLP
-      || (memory_access_type == VMAT_GATHER_SCATTER
-	  && gs_info->ifn == IFN_LAST && !gs_info->decl))
+
+  if (memory_access_type == VMAT_GATHER_SCATTER
+      && gs_info->ifn == IFN_LAST
+      && !gs_info->decl)
     inside_cost += record_stmt_cost (cost_vec, ncopies, vec_construct,
 				     stmt_info, 0, vect_body);
 
@@ -9591,14 +9592,6 @@ vectorizable_load (vec_info *vinfo,
   if (memory_access_type == VMAT_ELEMENTWISE
       || memory_access_type == VMAT_STRIDED_SLP)
     {
-      if (costing_p)
-	{
-	  vect_model_load_cost (vinfo, stmt_info, ncopies, vf,
-				memory_access_type, alignment_support_scheme,
-				misalignment, &gs_info, slp_node, cost_vec);
-	  return true;
-	}
-
       gimple_stmt_iterator incr_gsi;
       bool insert_after;
       tree offvar;
@@ -9610,6 +9603,7 @@ vectorizable_load (vec_info *vinfo,
       unsigned int const_nunits = nunits.to_constant ();
       unsigned HOST_WIDE_INT cst_offset = 0;
       tree dr_offset;
+      unsigned int inside_cost = 0;
 
       gcc_assert (!LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo));
       gcc_assert (!nested_in_vect_loop);
@@ -9624,6 +9618,7 @@ vectorizable_load (vec_info *vinfo,
 	  first_stmt_info = stmt_info;
 	  first_dr_info = dr_info;
 	}
+
       if (slp && grouped_load)
 	{
 	  group_size = DR_GROUP_SIZE (first_stmt_info);
@@ -9640,43 +9635,44 @@ vectorizable_load (vec_info *vinfo,
 	  ref_type = reference_alias_ptr_type (DR_REF (dr_info->dr));
 	}
 
-      dr_offset = get_dr_vinfo_offset (vinfo, first_dr_info);
-      stride_base
-	= fold_build_pointer_plus
-	    (DR_BASE_ADDRESS (first_dr_info->dr),
-	     size_binop (PLUS_EXPR,
-			 convert_to_ptrofftype (dr_offset),
-			 convert_to_ptrofftype (DR_INIT (first_dr_info->dr))));
-      stride_step = fold_convert (sizetype, DR_STEP (first_dr_info->dr));
+      if (!costing_p)
+	{
+	  dr_offset = get_dr_vinfo_offset (vinfo, first_dr_info);
+	  stride_base = fold_build_pointer_plus (
+	    DR_BASE_ADDRESS (first_dr_info->dr),
+	    size_binop (PLUS_EXPR, convert_to_ptrofftype (dr_offset),
+			convert_to_ptrofftype (DR_INIT (first_dr_info->dr))));
+	  stride_step = fold_convert (sizetype, DR_STEP (first_dr_info->dr));
 
-      /* For a load with loop-invariant (but other than power-of-2)
-         stride (i.e. not a grouped access) like so:
+	  /* For a load with loop-invariant (but other than power-of-2)
+	     stride (i.e. not a grouped access) like so:
 
-	   for (i = 0; i < n; i += stride)
-	     ... = array[i];
+	       for (i = 0; i < n; i += stride)
+		 ... = array[i];
 
-	 we generate a new induction variable and new accesses to
-	 form a new vector (or vectors, depending on ncopies):
+	     we generate a new induction variable and new accesses to
+	     form a new vector (or vectors, depending on ncopies):
 
-	   for (j = 0; ; j += VF*stride)
-	     tmp1 = array[j];
-	     tmp2 = array[j + stride];
-	     ...
-	     vectemp = {tmp1, tmp2, ...}
-         */
+	       for (j = 0; ; j += VF*stride)
+		 tmp1 = array[j];
+		 tmp2 = array[j + stride];
+		 ...
+		 vectemp = {tmp1, tmp2, ...}
+	     */
 
-      ivstep = fold_build2 (MULT_EXPR, TREE_TYPE (stride_step), stride_step,
-			    build_int_cst (TREE_TYPE (stride_step), vf));
+	  ivstep = fold_build2 (MULT_EXPR, TREE_TYPE (stride_step), stride_step,
+				build_int_cst (TREE_TYPE (stride_step), vf));
 
-      standard_iv_increment_position (loop, &incr_gsi, &insert_after);
+	  standard_iv_increment_position (loop, &incr_gsi, &insert_after);
 
-      stride_base = cse_and_gimplify_to_preheader (loop_vinfo, stride_base);
-      ivstep = cse_and_gimplify_to_preheader (loop_vinfo, ivstep);
-      create_iv (stride_base, PLUS_EXPR, ivstep, NULL,
-		 loop, &incr_gsi, insert_after,
-		 &offvar, NULL);
+	  stride_base = cse_and_gimplify_to_preheader (loop_vinfo, stride_base);
+	  ivstep = cse_and_gimplify_to_preheader (loop_vinfo, ivstep);
+	  create_iv (stride_base, PLUS_EXPR, ivstep, NULL,
+		     loop, &incr_gsi, insert_after,
+		     &offvar, NULL);
 
-      stride_step = cse_and_gimplify_to_preheader (loop_vinfo, stride_step);
+	  stride_step = cse_and_gimplify_to_preheader (loop_vinfo, stride_step);
+	}
 
       running_off = offvar;
       alias_off = build_int_cst (ref_type, 0);
@@ -9743,11 +9739,23 @@ vectorizable_load (vec_info *vinfo,
       unsigned int n_groups = 0;
       for (j = 0; j < ncopies; j++)
 	{
-	  if (nloads > 1)
+	  if (nloads > 1 && !costing_p)
 	    vec_alloc (v, nloads);
 	  gimple *new_stmt = NULL;
 	  for (i = 0; i < nloads; i++)
 	    {
+	      if (costing_p)
+		{
+		  if (VECTOR_TYPE_P (ltype))
+		    vect_get_load_cost (vinfo, stmt_info, 1,
+					alignment_support_scheme, misalignment,
+					false, &inside_cost, nullptr, cost_vec,
+					cost_vec, true);
+		  else
+		    inside_cost += record_stmt_cost (cost_vec, 1, scalar_load,
+						     stmt_info, 0, vect_body);
+		  continue;
+		}
 	      tree this_off = build_int_cst (TREE_TYPE (alias_off),
 					     group_el * elsz + cst_offset);
 	      tree data_ref = build2 (MEM_REF, ltype, running_off, this_off);
@@ -9778,42 +9786,70 @@ vectorizable_load (vec_info *vinfo,
 		  group_el = 0;
 		}
 	    }
+
 	  if (nloads > 1)
 	    {
-	      tree vec_inv = build_constructor (lvectype, v);
-	      new_temp = vect_init_vector (vinfo, stmt_info,
-					   vec_inv, lvectype, gsi);
-	      new_stmt = SSA_NAME_DEF_STMT (new_temp);
-	      if (lvectype != vectype)
+	      if (costing_p)
+		inside_cost += record_stmt_cost (cost_vec, 1, vec_construct,
+						 stmt_info, 0, vect_body);
+	      else
 		{
-		  new_stmt = gimple_build_assign (make_ssa_name (vectype),
-						  VIEW_CONVERT_EXPR,
-						  build1 (VIEW_CONVERT_EXPR,
-							  vectype, new_temp));
-		  vect_finish_stmt_generation (vinfo, stmt_info, new_stmt, gsi);
+		  tree vec_inv = build_constructor (lvectype, v);
+		  new_temp = vect_init_vector (vinfo, stmt_info, vec_inv,
+					       lvectype, gsi);
+		  new_stmt = SSA_NAME_DEF_STMT (new_temp);
+		  if (lvectype != vectype)
+		    {
+		      new_stmt
+			= gimple_build_assign (make_ssa_name (vectype),
+					       VIEW_CONVERT_EXPR,
+					       build1 (VIEW_CONVERT_EXPR,
+						       vectype, new_temp));
+		      vect_finish_stmt_generation (vinfo, stmt_info, new_stmt,
+						   gsi);
+		    }
 		}
 	    }
 
-	  if (slp)
+	  if (!costing_p)
 	    {
-	      if (slp_perm)
-		dr_chain.quick_push (gimple_assign_lhs (new_stmt));
+	      if (slp)
+		{
+		  if (slp_perm)
+		    dr_chain.quick_push (gimple_assign_lhs (new_stmt));
+		  else
+		    SLP_TREE_VEC_STMTS (slp_node).quick_push (new_stmt);
+		}
 	      else
-		SLP_TREE_VEC_STMTS (slp_node).quick_push (new_stmt);
-	    }
-	  else
-	    {
-	      if (j == 0)
-		*vec_stmt = new_stmt;
-	      STMT_VINFO_VEC_STMTS (stmt_info).safe_push (new_stmt);
+		{
+		  if (j == 0)
+		    *vec_stmt = new_stmt;
+		  STMT_VINFO_VEC_STMTS (stmt_info).safe_push (new_stmt);
+		}
 	    }
 	}
       if (slp_perm)
 	{
 	  unsigned n_perms;
-	  vect_transform_slp_perm_load (vinfo, slp_node, dr_chain, gsi, vf,
-					false, &n_perms);
+	  if (costing_p)
+	    {
+	      unsigned n_loads;
+	      vect_transform_slp_perm_load (vinfo, slp_node, vNULL, NULL, vf,
+					    true, &n_perms, &n_loads);
+	      inside_cost += record_stmt_cost (cost_vec, n_perms, vec_perm,
+					       first_stmt_info, 0, vect_body);
+	    }
+	  else
+	    vect_transform_slp_perm_load (vinfo, slp_node, dr_chain, gsi, vf,
+					  false, &n_perms);
 	}
+
+      if (costing_p && dump_enabled_p ())
+	dump_printf_loc (MSG_NOTE, vect_location,
+			 "vect_model_load_cost: inside_cost = %u, "
+			 "prologue_cost = 0 .\n",
+			 inside_cost);
+
       return true;
     }
 
-- 
2.31.1



* [PATCH 5/9] vect: Adjust vectorizable_load costing on VMAT_GATHER_SCATTER
  2023-06-13  2:03 [PATCH 0/9] vect: Move costing next to the transform for vect load Kewen Lin
                   ` (3 preceding siblings ...)
  2023-06-13  2:03 ` [PATCH 4/9] vect: Adjust vectorizable_load costing on VMAT_ELEMENTWISE and VMAT_STRIDED_SLP Kewen Lin
@ 2023-06-13  2:03 ` Kewen Lin
  2023-07-03  3:01   ` [PATCH 5/9 v2] " Kewen.Lin
  2023-06-13  2:03 ` [PATCH 6/9] vect: Adjust vectorizable_load costing on VMAT_LOAD_STORE_LANES Kewen Lin
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 27+ messages in thread
From: Kewen Lin @ 2023-06-13  2:03 UTC (permalink / raw)
  To: gcc-patches; +Cc: richard.guenther, richard.sandiford, segher, bergner

This patch adjusts the cost handling on VMAT_GATHER_SCATTER
in function vectorizable_load.  We don't call function
vect_model_load_cost for it any more.

It mainly covers gather loads with IFN and emulated gather
loads, and follows the handlings in function
vect_model_load_cost.  This patch shouldn't cause any
functional changes.

gcc/ChangeLog:

	* tree-vect-stmts.cc (vectorizable_load): Adjust the cost handling on
	VMAT_GATHER_SCATTER without calling vect_model_load_cost.
	(vect_model_load_cost): Adjust the assertion on VMAT_GATHER_SCATTER,
	remove VMAT_GATHER_SCATTER related handlings and the related parameter
	gs_info.
---
 gcc/tree-vect-stmts.cc | 123 +++++++++++++++++++++++++----------------
 1 file changed, 75 insertions(+), 48 deletions(-)

diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index 651dc800380..a3fd0bf879e 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -1131,11 +1131,10 @@ vect_model_load_cost (vec_info *vinfo,
 		      vect_memory_access_type memory_access_type,
 		      dr_alignment_support alignment_support_scheme,
 		      int misalignment,
-		      gather_scatter_info *gs_info,
 		      slp_tree slp_node,
 		      stmt_vector_for_cost *cost_vec)
 {
-  gcc_assert ((memory_access_type != VMAT_GATHER_SCATTER || !gs_info->decl)
+  gcc_assert (memory_access_type != VMAT_GATHER_SCATTER
 	      && memory_access_type != VMAT_INVARIANT
 	      && memory_access_type != VMAT_ELEMENTWISE
 	      && memory_access_type != VMAT_STRIDED_SLP);
@@ -1222,35 +1221,9 @@ vect_model_load_cost (vec_info *vinfo,
                          group_size);
     }
 
-  /* The loads themselves.  */
-  if (memory_access_type == VMAT_GATHER_SCATTER)
-    {
-      tree vectype = STMT_VINFO_VECTYPE (stmt_info);
-      unsigned int assumed_nunits = vect_nunits_for_cost (vectype);
-      if (memory_access_type == VMAT_GATHER_SCATTER
-	  && gs_info->ifn == IFN_LAST && !gs_info->decl)
-	/* For emulated gathers N offset vector element extracts
-	   (we assume the scalar scaling and ptr + offset add is consumed by
-	   the load).  */
-	inside_cost += record_stmt_cost (cost_vec, ncopies * assumed_nunits,
-					 vec_to_scalar, stmt_info, 0,
-					 vect_body);
-      /* N scalar loads plus gathering them into a vector.  */
-      inside_cost += record_stmt_cost (cost_vec,
-				       ncopies * assumed_nunits,
-				       scalar_load, stmt_info, 0, vect_body);
-    }
-  else
-    vect_get_load_cost (vinfo, stmt_info, ncopies,
-			alignment_support_scheme, misalignment, first_stmt_p,
-			&inside_cost, &prologue_cost, 
-			cost_vec, cost_vec, true);
-
-  if (memory_access_type == VMAT_GATHER_SCATTER
-      && gs_info->ifn == IFN_LAST
-      && !gs_info->decl)
-    inside_cost += record_stmt_cost (cost_vec, ncopies, vec_construct,
-				     stmt_info, 0, vect_body);
+  vect_get_load_cost (vinfo, stmt_info, ncopies, alignment_support_scheme,
+		      misalignment, first_stmt_p, &inside_cost, &prologue_cost,
+		      cost_vec, cost_vec, true);
 
   if (dump_enabled_p ())
     dump_printf_loc (MSG_NOTE, vect_location,
@@ -10137,6 +10110,7 @@ vectorizable_load (vec_info *vinfo,
     }
   tree vec_mask = NULL_TREE;
   poly_uint64 group_elt = 0;
+  unsigned int inside_cost = 0;
   for (j = 0; j < ncopies; j++)
     {
       /* 1. Create the vector or array pointer update chain.  */
@@ -10268,23 +10242,25 @@ vectorizable_load (vec_info *vinfo,
 	  /* Record that VEC_ARRAY is now dead.  */
 	  vect_clobber_variable (vinfo, stmt_info, gsi, vec_array);
 	}
-      else if (!costing_p)
+      else
 	{
 	  for (i = 0; i < vec_num; i++)
 	    {
 	      tree final_mask = NULL_TREE;
-	      if (loop_masks
-		  && memory_access_type != VMAT_INVARIANT)
-		final_mask = vect_get_loop_mask (gsi, loop_masks,
-						 vec_num * ncopies,
-						 vectype, vec_num * j + i);
-	      if (vec_mask)
-		final_mask = prepare_vec_mask (loop_vinfo, mask_vectype,
-					       final_mask, vec_mask, gsi);
-
-	      if (i > 0 && !STMT_VINFO_GATHER_SCATTER_P (stmt_info))
-		dataref_ptr = bump_vector_ptr (vinfo, dataref_ptr, ptr_incr,
-					       gsi, stmt_info, bump);
+	      if (!costing_p)
+		{
+		  if (loop_masks && memory_access_type != VMAT_INVARIANT)
+		    final_mask
+		      = vect_get_loop_mask (gsi, loop_masks, vec_num * ncopies,
+					    vectype, vec_num * j + i);
+		  if (vec_mask)
+		    final_mask = prepare_vec_mask (loop_vinfo, mask_vectype,
+						   final_mask, vec_mask, gsi);
+
+		  if (i > 0 && !STMT_VINFO_GATHER_SCATTER_P (stmt_info))
+		    dataref_ptr = bump_vector_ptr (vinfo, dataref_ptr, ptr_incr,
+						   gsi, stmt_info, bump);
+		}
 
 	      /* 2. Create the vector-load in the loop.  */
 	      switch (alignment_support_scheme)
@@ -10298,6 +10274,16 @@ vectorizable_load (vec_info *vinfo,
 		    if (memory_access_type == VMAT_GATHER_SCATTER
 			&& gs_info.ifn != IFN_LAST)
 		      {
+			if (costing_p)
+			  {
+			    unsigned int cnunits
+			      = vect_nunits_for_cost (vectype);
+			    inside_cost
+			      = record_stmt_cost (cost_vec, cnunits,
+						  scalar_load, stmt_info, 0,
+						  vect_body);
+			    goto vec_num_loop_costing_end;
+			  }
 			if (STMT_VINFO_GATHER_SCATTER_P (stmt_info))
 			  vec_offset = vec_offsets[vec_num * j + i];
 			tree zero = build_zero_cst (vectype);
@@ -10322,6 +10308,25 @@ vectorizable_load (vec_info *vinfo,
 			gcc_assert (!final_mask);
 			unsigned HOST_WIDE_INT const_nunits
 			  = nunits.to_constant ();
+			if (costing_p)
+			  {
+			    /* For emulated gathers N offset vector element
+			       extracts (we assume the scalar scaling and ptr
+			       + offset add is consumed by the load).  */
+			    inside_cost
+			      = record_stmt_cost (cost_vec, const_nunits,
+						  vec_to_scalar, stmt_info, 0,
+						  vect_body);
+			    /* N scalar loads plus gathering them into a
+			       vector.  */
+			    inside_cost
+			      = record_stmt_cost (cost_vec, const_nunits,
+						  scalar_load, stmt_info, 0,
+						  vect_body);
+			    inside_cost
+			      = record_stmt_cost (cost_vec, 1, vec_construct,
+						  stmt_info, 0, vect_body);
+			    goto vec_num_loop_costing_end;
+			  }
 			unsigned HOST_WIDE_INT const_offset_nunits
 			  = TYPE_VECTOR_SUBPARTS (gs_info.offset_vectype)
 			      .to_constant ();
@@ -10374,6 +10379,9 @@ vectorizable_load (vec_info *vinfo,
 			break;
 		      }
 
+		    if (costing_p)
+		      goto vec_num_loop_costing_end;
+
 		    align =
 		      known_alignment (DR_TARGET_ALIGNMENT (first_dr_info));
 		    if (alignment_support_scheme == dr_aligned)
@@ -10544,6 +10552,8 @@ vectorizable_load (vec_info *vinfo,
 		  }
 		case dr_explicit_realign:
 		  {
+		    if (costing_p)
+		      goto vec_num_loop_costing_end;
 		    tree ptr, bump;
 
 		    tree vs = size_int (TYPE_VECTOR_SUBPARTS (vectype));
@@ -10606,6 +10616,8 @@ vectorizable_load (vec_info *vinfo,
 		  }
 		case dr_explicit_realign_optimized:
 		  {
+		    if (costing_p)
+		      goto vec_num_loop_costing_end;
 		    if (TREE_CODE (dataref_ptr) == SSA_NAME)
 		      new_temp = copy_ssa_name (dataref_ptr);
 		    else
@@ -10702,10 +10714,14 @@ vectorizable_load (vec_info *vinfo,
 						 gsi, stmt_info, bump);
 		  group_elt = 0;
 		}
+vec_num_loop_costing_end:
+	      ;
 	    }
 	  /* Bump the vector pointer to account for a gap or for excess
 	     elements loaded for a permuted SLP load.  */
-	  if (maybe_ne (group_gap_adj, 0U) && slp_perm)
+	  if (!costing_p
+	      && maybe_ne (group_gap_adj, 0U)
+	      && slp_perm)
 	    {
 	      poly_wide_int bump_val
 		= (wi::to_wide (TYPE_SIZE_UNIT (elem_type))
@@ -10754,9 +10770,20 @@ vectorizable_load (vec_info *vinfo,
     *vec_stmt = STMT_VINFO_VEC_STMTS (stmt_info)[0];
 
   if (costing_p)
-    vect_model_load_cost (vinfo, stmt_info, ncopies, vf, memory_access_type,
-			  alignment_support_scheme, misalignment, &gs_info,
-			  slp_node, cost_vec);
+    {
+      if (memory_access_type == VMAT_GATHER_SCATTER)
+	{
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_NOTE, vect_location,
+			     "vect_model_load_cost: inside_cost = %u, "
+			     "prologue_cost = 0 .\n",
+			     inside_cost);
+	}
+      else
+	vect_model_load_cost (vinfo, stmt_info, ncopies, vf, memory_access_type,
+			      alignment_support_scheme, misalignment, slp_node,
+			      cost_vec);
+    }
 
   return true;
 }
-- 
2.31.1



* [PATCH 6/9] vect: Adjust vectorizable_load costing on VMAT_LOAD_STORE_LANES
  2023-06-13  2:03 [PATCH 0/9] vect: Move costing next to the transform for vect load Kewen Lin
                   ` (4 preceding siblings ...)
  2023-06-13  2:03 ` [PATCH 5/9] vect: Adjust vectorizable_load costing on VMAT_GATHER_SCATTER Kewen Lin
@ 2023-06-13  2:03 ` Kewen Lin
  2023-06-13  2:03 ` [PATCH 7/9] vect: Adjust vectorizable_load costing on VMAT_CONTIGUOUS_REVERSE Kewen Lin
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 27+ messages in thread
From: Kewen Lin @ 2023-06-13  2:03 UTC (permalink / raw)
  To: gcc-patches; +Cc: richard.guenther, richard.sandiford, segher, bergner

This patch adjusts the cost handling on
VMAT_LOAD_STORE_LANES in function vectorizable_load.  We
don't call function vect_model_load_cost for it any more.

It follows what we do in the function vect_model_load_cost,
and shouldn't have any functional changes.

gcc/ChangeLog:

	* tree-vect-stmts.cc (vectorizable_load): Adjust the cost handling on
	VMAT_LOAD_STORE_LANES without calling vect_model_load_cost.
	(vect_model_load_cost): Remove VMAT_LOAD_STORE_LANES related handling and
	assert it will never get VMAT_LOAD_STORE_LANES.
---
 gcc/tree-vect-stmts.cc | 73 ++++++++++++++++++++++++------------------
 1 file changed, 42 insertions(+), 31 deletions(-)

diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index a3fd0bf879e..4c5ce2ab278 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -1137,7 +1137,8 @@ vect_model_load_cost (vec_info *vinfo,
   gcc_assert (memory_access_type != VMAT_GATHER_SCATTER
 	      && memory_access_type != VMAT_INVARIANT
 	      && memory_access_type != VMAT_ELEMENTWISE
-	      && memory_access_type != VMAT_STRIDED_SLP);
+	      && memory_access_type != VMAT_STRIDED_SLP
+	      && memory_access_type != VMAT_LOAD_STORE_LANES);
 
   unsigned int inside_cost = 0, prologue_cost = 0;
   bool grouped_access_p = STMT_VINFO_GROUPED_ACCESS (stmt_info);
@@ -1176,31 +1177,6 @@ vect_model_load_cost (vec_info *vinfo,
      once per group anyhow.  */
   bool first_stmt_p = (first_stmt_info == stmt_info);
 
-  /* An IFN_LOAD_LANES will load all its vector results, regardless of which
-     ones we actually need.  Account for the cost of unused results.  */
-  if (first_stmt_p && !slp_node && memory_access_type == VMAT_LOAD_STORE_LANES)
-    {
-      unsigned int gaps = DR_GROUP_SIZE (first_stmt_info);
-      stmt_vec_info next_stmt_info = first_stmt_info;
-      do
-	{
-	  gaps -= 1;
-	  next_stmt_info = DR_GROUP_NEXT_ELEMENT (next_stmt_info);
-	}
-      while (next_stmt_info);
-      if (gaps)
-	{
-	  if (dump_enabled_p ())
-	    dump_printf_loc (MSG_NOTE, vect_location,
-			     "vect_model_load_cost: %d unused vectors.\n",
-			     gaps);
-	  vect_get_load_cost (vinfo, stmt_info, ncopies * gaps,
-			      alignment_support_scheme, misalignment, false,
-			      &inside_cost, &prologue_cost,
-			      cost_vec, cost_vec, true);
-	}
-    }
-
   /* We assume that the cost of a single load-lanes instruction is
      equivalent to the cost of DR_GROUP_SIZE separate loads.  If a grouped
      access is instead being provided by a load-and-permute operation,
@@ -10110,7 +10086,7 @@ vectorizable_load (vec_info *vinfo,
     }
   tree vec_mask = NULL_TREE;
   poly_uint64 group_elt = 0;
-  unsigned int inside_cost = 0;
+  unsigned int inside_cost = 0, prologue_cost = 0;
   for (j = 0; j < ncopies; j++)
     {
       /* 1. Create the vector or array pointer update chain.  */
@@ -10190,8 +10166,42 @@ vectorizable_load (vec_info *vinfo,
 	dr_chain.create (vec_num);
 
       gimple *new_stmt = NULL;
-      if (memory_access_type == VMAT_LOAD_STORE_LANES && !costing_p)
+      if (memory_access_type == VMAT_LOAD_STORE_LANES)
 	{
+	  if (costing_p)
+	    {
+	      /* An IFN_LOAD_LANES will load all its vector results,
+		 regardless of which ones we actually need.  Account
+		 for the cost of unused results.  */
+	      if (grouped_load && first_stmt_info == stmt_info)
+		{
+		  unsigned int gaps = DR_GROUP_SIZE (first_stmt_info);
+		  stmt_vec_info next_stmt_info = first_stmt_info;
+		  do
+		    {
+		      gaps -= 1;
+		      next_stmt_info = DR_GROUP_NEXT_ELEMENT (next_stmt_info);
+		    }
+		  while (next_stmt_info);
+		  if (gaps)
+		    {
+		      if (dump_enabled_p ())
+			dump_printf_loc (MSG_NOTE, vect_location,
+					 "vect_model_load_cost: %d "
+					 "unused vectors.\n",
+					 gaps);
+		      vect_get_load_cost (vinfo, stmt_info, gaps,
+					  alignment_support_scheme,
+					  misalignment, false, &inside_cost,
+					  &prologue_cost, cost_vec, cost_vec,
+					  true);
+		    }
+		}
+	      vect_get_load_cost (vinfo, stmt_info, 1, alignment_support_scheme,
+				  misalignment, false, &inside_cost,
+				  &prologue_cost, cost_vec, cost_vec, true);
+	      continue;
+	    }
 	  tree vec_array;
 
 	  vec_array = create_vector_array (vectype, vec_num);
@@ -10771,13 +10781,14 @@ vec_num_loop_costing_end:
 
   if (costing_p)
     {
-      if (memory_access_type == VMAT_GATHER_SCATTER)
+      if (memory_access_type == VMAT_GATHER_SCATTER
+	  || memory_access_type == VMAT_LOAD_STORE_LANES)
 	{
 	  if (dump_enabled_p ())
 	    dump_printf_loc (MSG_NOTE, vect_location,
 			     "vect_model_load_cost: inside_cost = %u, "
-			     "prologue_cost = 0 .\n",
-			     inside_cost);
+			     "prologue_cost = %u .\n",
+			     inside_cost, prologue_cost);
 	}
       else
 	vect_model_load_cost (vinfo, stmt_info, ncopies, vf, memory_access_type,
-- 
2.31.1



* [PATCH 7/9] vect: Adjust vectorizable_load costing on VMAT_CONTIGUOUS_REVERSE
  2023-06-13  2:03 [PATCH 0/9] vect: Move costing next to the transform for vect load Kewen Lin
                   ` (5 preceding siblings ...)
  2023-06-13  2:03 ` [PATCH 6/9] vect: Adjust vectorizable_load costing on VMAT_LOAD_STORE_LANES Kewen Lin
@ 2023-06-13  2:03 ` Kewen Lin
  2023-06-13  2:03 ` [PATCH 8/9] vect: Adjust vectorizable_load costing on VMAT_CONTIGUOUS_PERMUTE Kewen Lin
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 27+ messages in thread
From: Kewen Lin @ 2023-06-13  2:03 UTC (permalink / raw)
  To: gcc-patches; +Cc: richard.guenther, richard.sandiford, segher, bergner

This patch adjusts the cost handling on
VMAT_CONTIGUOUS_REVERSE in function vectorizable_load.  We
don't call function vect_model_load_cost for it any more.

This change stops us from miscounting some required vector
permutations, as the associated test case shows.

gcc/ChangeLog:

	* tree-vect-stmts.cc (vect_model_load_cost): Assert it won't get
	VMAT_CONTIGUOUS_REVERSE any more.
	(vectorizable_load): Adjust the costing handling on
	VMAT_CONTIGUOUS_REVERSE without calling vect_model_load_cost.

gcc/testsuite/ChangeLog:

	* gcc.dg/vect/costmodel/ppc/costmodel-vect-reversed.c: New test.
---
 .../costmodel/ppc/costmodel-vect-reversed.c   |  22 ++++
 gcc/tree-vect-stmts.cc                        | 109 ++++++++++++------
 2 files changed, 93 insertions(+), 38 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/ppc/costmodel-vect-reversed.c

diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/ppc/costmodel-vect-reversed.c b/gcc/testsuite/gcc.dg/vect/costmodel/ppc/costmodel-vect-reversed.c
new file mode 100644
index 00000000000..651274be038
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/costmodel/ppc/costmodel-vect-reversed.c
@@ -0,0 +1,22 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-require-effective-target powerpc_vsx_ok } */
+/* { dg-additional-options "-mvsx" } */
+
+/* Verify we do cost the required vec_perm.  */
+
+int x[1024], y[1024];
+
+void
+foo ()
+{
+  for (int i = 0; i < 512; ++i)
+    {
+      x[2 * i] = y[1023 - (2 * i)];
+      x[2 * i + 1] = y[1023 - (2 * i + 1)];
+    }
+}
+/* The reason why it doesn't check the exact count is that
+   retrying for the epilogue with partial vector capability
+   like Power10 can result in more than 1 vec_perm.  */
+/* { dg-final { scan-tree-dump {\mvec_perm\M} "vect" } } */
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index 4c5ce2ab278..7f8d9db5363 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -1134,11 +1134,8 @@ vect_model_load_cost (vec_info *vinfo,
 		      slp_tree slp_node,
 		      stmt_vector_for_cost *cost_vec)
 {
-  gcc_assert (memory_access_type != VMAT_GATHER_SCATTER
-	      && memory_access_type != VMAT_INVARIANT
-	      && memory_access_type != VMAT_ELEMENTWISE
-	      && memory_access_type != VMAT_STRIDED_SLP
-	      && memory_access_type != VMAT_LOAD_STORE_LANES);
+  gcc_assert (memory_access_type == VMAT_CONTIGUOUS
+	      || memory_access_type == VMAT_CONTIGUOUS_PERMUTE);
 
   unsigned int inside_cost = 0, prologue_cost = 0;
   bool grouped_access_p = STMT_VINFO_GROUPED_ACCESS (stmt_info);
@@ -10292,7 +10289,7 @@ vectorizable_load (vec_info *vinfo,
 			      = record_stmt_cost (cost_vec, cnunits,
 						  scalar_load, stmt_info, 0,
 						  vect_body);
-			    goto vec_num_loop_costing_end;
+			    break;
 			  }
 			if (STMT_VINFO_GATHER_SCATTER_P (stmt_info))
 			  vec_offset = vec_offsets[vec_num * j + i];
@@ -10335,7 +10332,7 @@ vectorizable_load (vec_info *vinfo,
 			    inside_cost
 			      = record_stmt_cost (cost_vec, 1, vec_construct,
 						  stmt_info, 0, vect_body);
-			    goto vec_num_loop_costing_end;
+			    break;
 			  }
 			unsigned HOST_WIDE_INT const_offset_nunits
 			  = TYPE_VECTOR_SUBPARTS (gs_info.offset_vectype)
@@ -10390,7 +10387,7 @@ vectorizable_load (vec_info *vinfo,
 		      }
 
 		    if (costing_p)
-		      goto vec_num_loop_costing_end;
+		      break;
 
 		    align =
 		      known_alignment (DR_TARGET_ALIGNMENT (first_dr_info));
@@ -10563,7 +10560,7 @@ vectorizable_load (vec_info *vinfo,
 		case dr_explicit_realign:
 		  {
 		    if (costing_p)
-		      goto vec_num_loop_costing_end;
+		      break;
 		    tree ptr, bump;
 
 		    tree vs = size_int (TYPE_VECTOR_SUBPARTS (vectype));
@@ -10627,7 +10624,7 @@ vectorizable_load (vec_info *vinfo,
 		case dr_explicit_realign_optimized:
 		  {
 		    if (costing_p)
-		      goto vec_num_loop_costing_end;
+		      break;
 		    if (TREE_CODE (dataref_ptr) == SSA_NAME)
 		      new_temp = copy_ssa_name (dataref_ptr);
 		    else
@@ -10650,22 +10647,37 @@ vectorizable_load (vec_info *vinfo,
 		default:
 		  gcc_unreachable ();
 		}
-	      vec_dest = vect_create_destination_var (scalar_dest, vectype);
-	      /* DATA_REF is null if we've already built the statement.  */
-	      if (data_ref)
+
+	      /* One common place to cost the above vect load for different
+		 alignment support schemes.  */
+	      if (costing_p)
 		{
-		  vect_copy_ref_info (data_ref, DR_REF (first_dr_info->dr));
-		  new_stmt = gimple_build_assign (vec_dest, data_ref);
+		  if (memory_access_type == VMAT_CONTIGUOUS_REVERSE)
+		    vect_get_load_cost (vinfo, stmt_info, 1,
+					alignment_support_scheme, misalignment,
+					false, &inside_cost, &prologue_cost,
+					cost_vec, cost_vec, true);
+		}
+	      else
+		{
+		  vec_dest = vect_create_destination_var (scalar_dest, vectype);
+		  /* DATA_REF is null if we've already built the statement.  */
+		  if (data_ref)
+		    {
+		      vect_copy_ref_info (data_ref, DR_REF (first_dr_info->dr));
+		      new_stmt = gimple_build_assign (vec_dest, data_ref);
+		    }
+		  new_temp = make_ssa_name (vec_dest, new_stmt);
+		  gimple_set_lhs (new_stmt, new_temp);
+		  vect_finish_stmt_generation (vinfo, stmt_info, new_stmt, gsi);
 		}
-	      new_temp = make_ssa_name (vec_dest, new_stmt);
-	      gimple_set_lhs (new_stmt, new_temp);
-	      vect_finish_stmt_generation (vinfo, stmt_info, new_stmt, gsi);
 
 	      /* 3. Handle explicit realignment if necessary/supported.
 		 Create in loop:
 		   vec_dest = realign_load (msq, lsq, realignment_token)  */
-	      if (alignment_support_scheme == dr_explicit_realign_optimized
-		  || alignment_support_scheme == dr_explicit_realign)
+	      if (!costing_p
+		  && (alignment_support_scheme == dr_explicit_realign_optimized
+		      || alignment_support_scheme == dr_explicit_realign))
 		{
 		  lsq = gimple_assign_lhs (new_stmt);
 		  if (!realignment_token)
@@ -10690,26 +10702,34 @@ vectorizable_load (vec_info *vinfo,
 
 	      if (memory_access_type == VMAT_CONTIGUOUS_REVERSE)
 		{
-		  tree perm_mask = perm_mask_for_reverse (vectype);
-		  new_temp = permute_vec_elements (vinfo, new_temp, new_temp,
-						   perm_mask, stmt_info, gsi);
-		  new_stmt = SSA_NAME_DEF_STMT (new_temp);
+		  if (costing_p)
+		    inside_cost = record_stmt_cost (cost_vec, 1, vec_perm,
+						    stmt_info, 0, vect_body);
+		  else
+		    {
+		      tree perm_mask = perm_mask_for_reverse (vectype);
+		      new_temp
+			= permute_vec_elements (vinfo, new_temp, new_temp,
+						perm_mask, stmt_info, gsi);
+		      new_stmt = SSA_NAME_DEF_STMT (new_temp);
+		    }
 		}
 
 	      /* Collect vector loads and later create their permutation in
 		 vect_transform_grouped_load ().  */
-	      if (grouped_load || slp_perm)
+	      if (!costing_p && (grouped_load || slp_perm))
 		dr_chain.quick_push (new_temp);
 
 	      /* Store vector loads in the corresponding SLP_NODE.  */
-	      if (slp && !slp_perm)
+	      if (!costing_p && slp && !slp_perm)
 		SLP_TREE_VEC_STMTS (slp_node).quick_push (new_stmt);
 
 	      /* With SLP permutation we load the gaps as well, without
 	         we need to skip the gaps after we manage to fully load
 		 all elements.  group_gap_adj is DR_GROUP_SIZE here.  */
 	      group_elt += nunits;
-	      if (maybe_ne (group_gap_adj, 0U)
+	      if (!costing_p
+		  && maybe_ne (group_gap_adj, 0U)
 		  && !slp_perm
 		  && known_eq (group_elt, group_size - group_gap_adj))
 		{
@@ -10724,8 +10744,6 @@ vectorizable_load (vec_info *vinfo,
 						 gsi, stmt_info, bump);
 		  group_elt = 0;
 		}
-vec_num_loop_costing_end:
-	      ;
 	    }
 	  /* Bump the vector pointer to account for a gap or for excess
 	     elements loaded for a permuted SLP load.  */
@@ -10748,18 +10766,30 @@ vec_num_loop_costing_end:
       if (slp && !slp_perm)
 	continue;
 
-      if (slp_perm && !costing_p)
-        {
+      if (slp_perm)
+	{
 	  unsigned n_perms;
 	  /* For SLP we know we've seen all possible uses of dr_chain so
 	     direct vect_transform_slp_perm_load to DCE the unused parts.
 	     ???  This is a hack to prevent compile-time issues as seen
 	     in PR101120 and friends.  */
-	  bool ok = vect_transform_slp_perm_load (vinfo, slp_node, dr_chain,
-						  gsi, vf, false, &n_perms,
-						  nullptr, true);
-	  gcc_assert (ok);
-        }
+	  if (costing_p
+	      && memory_access_type != VMAT_CONTIGUOUS
+	      && memory_access_type != VMAT_CONTIGUOUS_PERMUTE)
+	    {
+	      vect_transform_slp_perm_load (vinfo, slp_node, vNULL, nullptr, vf,
+					    true, &n_perms, nullptr);
+	      inside_cost = record_stmt_cost (cost_vec, n_perms, vec_perm,
+					      stmt_info, 0, vect_body);
+	    }
+	  else if (!costing_p)
+	    {
+	      bool ok = vect_transform_slp_perm_load (vinfo, slp_node, dr_chain,
+						      gsi, vf, false, &n_perms,
+						      nullptr, true);
+	      gcc_assert (ok);
+	    }
+	}
       else if (!costing_p)
         {
           if (grouped_load)
@@ -10781,8 +10811,11 @@ vec_num_loop_costing_end:
 
   if (costing_p)
     {
-      if (memory_access_type == VMAT_GATHER_SCATTER
-	  || memory_access_type == VMAT_LOAD_STORE_LANES)
+      gcc_assert (memory_access_type != VMAT_INVARIANT
+		  && memory_access_type != VMAT_ELEMENTWISE
+		  && memory_access_type != VMAT_STRIDED_SLP);
+      if (memory_access_type != VMAT_CONTIGUOUS
+	  && memory_access_type != VMAT_CONTIGUOUS_PERMUTE)
 	{
 	  if (dump_enabled_p ())
 	    dump_printf_loc (MSG_NOTE, vect_location,
-- 
2.31.1


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 8/9] vect: Adjust vectorizable_load costing on VMAT_CONTIGUOUS_PERMUTE
  2023-06-13  2:03 [PATCH 0/9] vect: Move costing next to the transform for vect load Kewen Lin
                   ` (6 preceding siblings ...)
  2023-06-13  2:03 ` [PATCH 7/9] vect: Adjust vectorizable_load costing on VMAT_CONTIGUOUS_REVERSE Kewen Lin
@ 2023-06-13  2:03 ` Kewen Lin
  2023-06-14  8:17   ` Hongtao Liu
  2023-06-13  2:03 ` [PATCH 9/9] vect: Adjust vectorizable_load costing on VMAT_CONTIGUOUS Kewen Lin
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 27+ messages in thread
From: Kewen Lin @ 2023-06-13  2:03 UTC (permalink / raw)
  To: gcc-patches
  Cc: richard.guenther, richard.sandiford, segher, bergner, ubizjak,
	hongtao.liu

This patch adjusts the cost handling on
VMAT_CONTIGUOUS_PERMUTE in function vectorizable_load.  We
don't call function vect_model_load_cost for it any more.

As the affected test case gcc.target/i386/pr70021.c shows,
the previous costing can under-cost the total generated
vector loads, since for VMAT_CONTIGUOUS_PERMUTE function
vect_model_load_cost doesn't consider the group size, which
is used as vec_num during the transformation.

This patch makes the count of vector loads in costing
consistent with what we generate during the transformation.
To be more specific, for the given test case, for the memory
access b[i_20] we costed 2 vector loads before; with this
patch we cost 8 instead, which matches the final count of
vector loads generated from b.  This costing change makes
the cost model analysis consider it unprofitable to
vectorize the first loop, so this patch adjusts the test
case to run without the vect cost model.

Note that this test case also exposes something we can
improve further: although the numbers of vector permutations
we cost and generate are consistent, DCE can optimize some
of the unused permutations out afterwards; it would be good
if we could predict that and generate only the necessary
permutations.
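
As a rough illustration of the load counting change (the group
size and copy count below are assumed just for the example,
they are not taken from the actual vect dump):

  /* Hypothetical numbers: DR_GROUP_SIZE = 4, ncopies = 2 for b.  */
  /* old costing: ncopies vector loads for the whole group -> 2          */
  /* new costing: vec_num (== group size) loads per copy   -> 4 * 2 = 8  */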

gcc/ChangeLog:

	* tree-vect-stmts.cc (vect_model_load_cost): Assert this function only
	handle memory_access_type VMAT_CONTIGUOUS, remove some
	VMAT_CONTIGUOUS_PERMUTE related handlings.
	(vectorizable_load): Adjust the cost handling on VMAT_CONTIGUOUS_PERMUTE
	without calling vect_model_load_cost.

gcc/testsuite/ChangeLog:

	* gcc.target/i386/pr70021.c: Adjust with -fno-vect-cost-model.
---
 gcc/testsuite/gcc.target/i386/pr70021.c |  2 +-
 gcc/tree-vect-stmts.cc                  | 88 ++++++++++++++-----------
 2 files changed, 51 insertions(+), 39 deletions(-)

diff --git a/gcc/testsuite/gcc.target/i386/pr70021.c b/gcc/testsuite/gcc.target/i386/pr70021.c
index 6562c0f2bd0..d509583601e 100644
--- a/gcc/testsuite/gcc.target/i386/pr70021.c
+++ b/gcc/testsuite/gcc.target/i386/pr70021.c
@@ -1,7 +1,7 @@
 /* PR target/70021 */
 /* { dg-do run } */
 /* { dg-require-effective-target avx2 } */
-/* { dg-options "-O2 -ftree-vectorize -mavx2 -fdump-tree-vect-details -mtune=skylake" } */
+/* { dg-options "-O2 -ftree-vectorize -mavx2 -fdump-tree-vect-details -mtune=skylake -fno-vect-cost-model" } */
 
 #include "avx2-check.h"
 
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index 7f8d9db5363..e7a97dbe05d 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -1134,8 +1134,7 @@ vect_model_load_cost (vec_info *vinfo,
 		      slp_tree slp_node,
 		      stmt_vector_for_cost *cost_vec)
 {
-  gcc_assert (memory_access_type == VMAT_CONTIGUOUS
-	      || memory_access_type == VMAT_CONTIGUOUS_PERMUTE);
+  gcc_assert (memory_access_type == VMAT_CONTIGUOUS);
 
   unsigned int inside_cost = 0, prologue_cost = 0;
   bool grouped_access_p = STMT_VINFO_GROUPED_ACCESS (stmt_info);
@@ -1174,26 +1173,6 @@ vect_model_load_cost (vec_info *vinfo,
      once per group anyhow.  */
   bool first_stmt_p = (first_stmt_info == stmt_info);
 
-  /* We assume that the cost of a single load-lanes instruction is
-     equivalent to the cost of DR_GROUP_SIZE separate loads.  If a grouped
-     access is instead being provided by a load-and-permute operation,
-     include the cost of the permutes.  */
-  if (first_stmt_p
-      && memory_access_type == VMAT_CONTIGUOUS_PERMUTE)
-    {
-      /* Uses an even and odd extract operations or shuffle operations
-	 for each needed permute.  */
-      int group_size = DR_GROUP_SIZE (first_stmt_info);
-      int nstmts = ncopies * ceil_log2 (group_size) * group_size;
-      inside_cost += record_stmt_cost (cost_vec, nstmts, vec_perm,
-				       stmt_info, 0, vect_body);
-
-      if (dump_enabled_p ())
-        dump_printf_loc (MSG_NOTE, vect_location,
-                         "vect_model_load_cost: strided group_size = %d .\n",
-                         group_size);
-    }
-
   vect_get_load_cost (vinfo, stmt_info, ncopies, alignment_support_scheme,
 		      misalignment, first_stmt_p, &inside_cost, &prologue_cost,
 		      cost_vec, cost_vec, true);
@@ -10652,11 +10631,22 @@ vectorizable_load (vec_info *vinfo,
 		 alignment support schemes.  */
 	      if (costing_p)
 		{
-		  if (memory_access_type == VMAT_CONTIGUOUS_REVERSE)
+		  /* For VMAT_CONTIGUOUS_PERMUTE if it's grouped load, we
+		     only need to take care of the first stmt, whose
+		     stmt_info is first_stmt_info, vec_num iterating on it
+		     will cover the cost for the remaining, it's consistent
+		     with transforming.  For the prologue cost for realign,
+		     we only need to count it once for the whole group.  */
+		  bool first_stmt_info_p = first_stmt_info == stmt_info;
+		  bool add_realign_cost = first_stmt_info_p && i == 0;
+		  if (memory_access_type == VMAT_CONTIGUOUS_REVERSE
+		      || (memory_access_type == VMAT_CONTIGUOUS_PERMUTE
+			  && (!grouped_load || first_stmt_info_p)))
 		    vect_get_load_cost (vinfo, stmt_info, 1,
 					alignment_support_scheme, misalignment,
-					false, &inside_cost, &prologue_cost,
-					cost_vec, cost_vec, true);
+					add_realign_cost, &inside_cost,
+					&prologue_cost, cost_vec, cost_vec,
+					true);
 		}
 	      else
 		{
@@ -10774,8 +10764,7 @@ vectorizable_load (vec_info *vinfo,
 	     ???  This is a hack to prevent compile-time issues as seen
 	     in PR101120 and friends.  */
 	  if (costing_p
-	      && memory_access_type != VMAT_CONTIGUOUS
-	      && memory_access_type != VMAT_CONTIGUOUS_PERMUTE)
+	      && memory_access_type != VMAT_CONTIGUOUS)
 	    {
 	      vect_transform_slp_perm_load (vinfo, slp_node, vNULL, nullptr, vf,
 					    true, &n_perms, nullptr);
@@ -10790,20 +10779,44 @@ vectorizable_load (vec_info *vinfo,
 	      gcc_assert (ok);
 	    }
 	}
-      else if (!costing_p)
+      else
         {
           if (grouped_load)
   	    {
 	      if (memory_access_type != VMAT_LOAD_STORE_LANES)
-		vect_transform_grouped_load (vinfo, stmt_info, dr_chain,
-					     group_size, gsi);
-	      *vec_stmt = STMT_VINFO_VEC_STMTS (stmt_info)[0];
-	    }
-          else
-	    {
-	      STMT_VINFO_VEC_STMTS (stmt_info).safe_push (new_stmt);
+		{
+		  gcc_assert (memory_access_type == VMAT_CONTIGUOUS_PERMUTE);
+		  /* We assume that the cost of a single load-lanes instruction
+		     is equivalent to the cost of DR_GROUP_SIZE separate loads.
+		     If a grouped access is instead being provided by a
+		     load-and-permute operation, include the cost of the
+		     permutes.  */
+		  if (costing_p && first_stmt_info == stmt_info)
+		    {
+		      /* Uses an even and odd extract operations or shuffle
+			 operations for each needed permute.  */
+		      int group_size = DR_GROUP_SIZE (first_stmt_info);
+		      int nstmts = ceil_log2 (group_size) * group_size;
+		      inside_cost
+			+= record_stmt_cost (cost_vec, nstmts, vec_perm,
+					     stmt_info, 0, vect_body);
+
+		      if (dump_enabled_p ())
+			dump_printf_loc (
+			  MSG_NOTE, vect_location,
+			  "vect_model_load_cost: strided group_size = %d .\n",
+			  group_size);
+		    }
+		  else if (!costing_p)
+		    vect_transform_grouped_load (vinfo, stmt_info, dr_chain,
+						 group_size, gsi);
+		}
+	      if (!costing_p)
+		*vec_stmt = STMT_VINFO_VEC_STMTS (stmt_info)[0];
 	    }
-        }
+	  else if (!costing_p)
+	    STMT_VINFO_VEC_STMTS (stmt_info).safe_push (new_stmt);
+	}
       dr_chain.release ();
     }
   if (!slp && !costing_p)
@@ -10814,8 +10827,7 @@ vectorizable_load (vec_info *vinfo,
       gcc_assert (memory_access_type != VMAT_INVARIANT
 		  && memory_access_type != VMAT_ELEMENTWISE
 		  && memory_access_type != VMAT_STRIDED_SLP);
-      if (memory_access_type != VMAT_CONTIGUOUS
-	  && memory_access_type != VMAT_CONTIGUOUS_PERMUTE)
+      if (memory_access_type != VMAT_CONTIGUOUS)
 	{
 	  if (dump_enabled_p ())
 	    dump_printf_loc (MSG_NOTE, vect_location,
-- 
2.31.1


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 9/9] vect: Adjust vectorizable_load costing on VMAT_CONTIGUOUS
  2023-06-13  2:03 [PATCH 0/9] vect: Move costing next to the transform for vect load Kewen Lin
                   ` (7 preceding siblings ...)
  2023-06-13  2:03 ` [PATCH 8/9] vect: Adjust vectorizable_load costing on VMAT_CONTIGUOUS_PERMUTE Kewen Lin
@ 2023-06-13  2:03 ` Kewen Lin
  2023-07-03  3:06   ` [PATCH 9/9 v2] " Kewen.Lin
  2023-06-26  6:00 ` [PATCH 0/9] vect: Move costing next to the transform for vect load Kewen.Lin
  2023-06-30 11:37 ` Richard Biener
  10 siblings, 1 reply; 27+ messages in thread
From: Kewen Lin @ 2023-06-13  2:03 UTC (permalink / raw)
  To: gcc-patches; +Cc: richard.guenther, richard.sandiford, segher, bergner

This patch adjusts the cost handling on VMAT_CONTIGUOUS in
function vectorizable_load.  We don't call function
vect_model_load_cost for it any more.  It also removes
function vect_model_load_cost, which has now become useless
and unreachable.

gcc/ChangeLog:

	* tree-vect-stmts.cc (vect_model_load_cost): Remove.
	(vectorizable_load): Adjust the cost handling on VMAT_CONTIGUOUS without
	calling vect_model_load_cost.
---
 gcc/tree-vect-stmts.cc | 92 +++++-------------------------------------
 1 file changed, 9 insertions(+), 83 deletions(-)

diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index e7a97dbe05d..be3b277e8e1 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -1117,73 +1117,6 @@ vect_get_store_cost (vec_info *, stmt_vec_info stmt_info, int ncopies,
     }
 }
 
-
-/* Function vect_model_load_cost
-
-   Models cost for loads.  In the case of grouped accesses, one access has
-   the overhead of the grouped access attributed to it.  Since unaligned
-   accesses are supported for loads, we also account for the costs of the
-   access scheme chosen.  */
-
-static void
-vect_model_load_cost (vec_info *vinfo,
-		      stmt_vec_info stmt_info, unsigned ncopies, poly_uint64 vf,
-		      vect_memory_access_type memory_access_type,
-		      dr_alignment_support alignment_support_scheme,
-		      int misalignment,
-		      slp_tree slp_node,
-		      stmt_vector_for_cost *cost_vec)
-{
-  gcc_assert (memory_access_type == VMAT_CONTIGUOUS);
-
-  unsigned int inside_cost = 0, prologue_cost = 0;
-  bool grouped_access_p = STMT_VINFO_GROUPED_ACCESS (stmt_info);
-
-  gcc_assert (cost_vec);
-
-  /* ???  Somehow we need to fix this at the callers.  */
-  if (slp_node)
-    ncopies = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node);
-
-  if (slp_node && SLP_TREE_LOAD_PERMUTATION (slp_node).exists ())
-    {
-      /* If the load is permuted then the alignment is determined by
-	 the first group element not by the first scalar stmt DR.  */
-      stmt_vec_info first_stmt_info = DR_GROUP_FIRST_ELEMENT (stmt_info);
-      /* Record the cost for the permutation.  */
-      unsigned n_perms, n_loads;
-      vect_transform_slp_perm_load (vinfo, slp_node, vNULL, NULL,
-				    vf, true, &n_perms, &n_loads);
-      inside_cost += record_stmt_cost (cost_vec, n_perms, vec_perm,
-				       first_stmt_info, 0, vect_body);
-
-      /* And adjust the number of loads performed.  This handles
-	 redundancies as well as loads that are later dead.  */
-      ncopies = n_loads;
-    }
-
-  /* Grouped loads read all elements in the group at once,
-     so we want the DR for the first statement.  */
-  stmt_vec_info first_stmt_info = stmt_info;
-  if (!slp_node && grouped_access_p)
-    first_stmt_info = DR_GROUP_FIRST_ELEMENT (stmt_info);
-
-  /* True if we should include any once-per-group costs as well as
-     the cost of the statement itself.  For SLP we only get called
-     once per group anyhow.  */
-  bool first_stmt_p = (first_stmt_info == stmt_info);
-
-  vect_get_load_cost (vinfo, stmt_info, ncopies, alignment_support_scheme,
-		      misalignment, first_stmt_p, &inside_cost, &prologue_cost,
-		      cost_vec, cost_vec, true);
-
-  if (dump_enabled_p ())
-    dump_printf_loc (MSG_NOTE, vect_location,
-                     "vect_model_load_cost: inside_cost = %d, "
-                     "prologue_cost = %d .\n", inside_cost, prologue_cost);
-}
-
-
 /* Calculate cost of DR's memory access.  */
 void
 vect_get_load_cost (vec_info *, stmt_vec_info stmt_info, int ncopies,
@@ -10639,7 +10572,8 @@ vectorizable_load (vec_info *vinfo,
 		     we only need to count it once for the whole group.  */
 		  bool first_stmt_info_p = first_stmt_info == stmt_info;
 		  bool add_realign_cost = first_stmt_info_p && i == 0;
-		  if (memory_access_type == VMAT_CONTIGUOUS_REVERSE
+		  if (memory_access_type == VMAT_CONTIGUOUS
+		      || memory_access_type == VMAT_CONTIGUOUS_REVERSE
 		      || (memory_access_type == VMAT_CONTIGUOUS_PERMUTE
 			  && (!grouped_load || first_stmt_info_p)))
 		    vect_get_load_cost (vinfo, stmt_info, 1,
@@ -10763,15 +10697,14 @@ vectorizable_load (vec_info *vinfo,
 	     direct vect_transform_slp_perm_load to DCE the unused parts.
 	     ???  This is a hack to prevent compile-time issues as seen
 	     in PR101120 and friends.  */
-	  if (costing_p
-	      && memory_access_type != VMAT_CONTIGUOUS)
+	  if (costing_p)
 	    {
 	      vect_transform_slp_perm_load (vinfo, slp_node, vNULL, nullptr, vf,
 					    true, &n_perms, nullptr);
 	      inside_cost = record_stmt_cost (cost_vec, n_perms, vec_perm,
 					      stmt_info, 0, vect_body);
 	    }
-	  else if (!costing_p)
+	  else
 	    {
 	      bool ok = vect_transform_slp_perm_load (vinfo, slp_node, dr_chain,
 						      gsi, vf, false, &n_perms,
@@ -10827,18 +10760,11 @@ vectorizable_load (vec_info *vinfo,
       gcc_assert (memory_access_type != VMAT_INVARIANT
 		  && memory_access_type != VMAT_ELEMENTWISE
 		  && memory_access_type != VMAT_STRIDED_SLP);
-      if (memory_access_type != VMAT_CONTIGUOUS)
-	{
-	  if (dump_enabled_p ())
-	    dump_printf_loc (MSG_NOTE, vect_location,
-			     "vect_model_load_cost: inside_cost = %u, "
-			     "prologue_cost = %u .\n",
-			     inside_cost, prologue_cost);
-	}
-      else
-	vect_model_load_cost (vinfo, stmt_info, ncopies, vf, memory_access_type,
-			      alignment_support_scheme, misalignment, slp_node,
-			      cost_vec);
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_NOTE, vect_location,
+			 "vect_model_load_cost: inside_cost = %u, "
+			 "prologue_cost = %u .\n",
+			 inside_cost, prologue_cost);
     }
 
   return true;
-- 
2.31.1


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 8/9] vect: Adjust vectorizable_load costing on VMAT_CONTIGUOUS_PERMUTE
  2023-06-13  2:03 ` [PATCH 8/9] vect: Adjust vectorizable_load costing on VMAT_CONTIGUOUS_PERMUTE Kewen Lin
@ 2023-06-14  8:17   ` Hongtao Liu
  2023-06-19  7:23     ` Kewen.Lin
  0 siblings, 1 reply; 27+ messages in thread
From: Hongtao Liu @ 2023-06-14  8:17 UTC (permalink / raw)
  To: Kewen Lin
  Cc: gcc-patches, richard.guenther, richard.sandiford, segher,
	bergner, ubizjak, hongtao.liu

On Tue, Jun 13, 2023 at 10:07 AM Kewen Lin via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> This patch adjusts the cost handling on
> VMAT_CONTIGUOUS_PERMUTE in function vectorizable_load.  We
> don't call function vect_model_load_cost for it any more.
>
> As the affected test case gcc.target/i386/pr70021.c shows,
> the previous costing can under-cost the total generated
> vector loads as for VMAT_CONTIGUOUS_PERMUTE function
> vect_model_load_cost doesn't consider the group size which
> is considered as vec_num during the transformation.
The original PR is about a correctness issue, and I'm not sure how
much of a performance impact the patch would have, but the change
looks reasonable, so the test change looks ok to me.
I'll track the performance impact on SPEC2017 to see if there's any
regression caused by the patch (guess probably not).
>
> This patch makes the count of vector load in costing become
> consistent with what we generates during the transformation.
> To be more specific, for the given test case, for memory
> access b[i_20], it costed for 2 vector loads before,
> with this patch it costs 8 instead, it matches the final
> count of generated vector loads basing from b.  This costing
> change makes cost model analysis feel it's not profitable
> to vectorize the first loop, so this patch adjusts the test
> case without vect cost model any more.
>
> But note that this test case also exposes something we can
> improve further is that although the number of vector
> permutation what we costed and generated are consistent,
> but DCE can further optimize some unused permutation out,
> it would be good if we can predict that and generate only
> those necessary permutations.
>
> gcc/ChangeLog:
>
>         * tree-vect-stmts.cc (vect_model_load_cost): Assert this function only
>         handle memory_access_type VMAT_CONTIGUOUS, remove some
>         VMAT_CONTIGUOUS_PERMUTE related handlings.
>         (vectorizable_load): Adjust the cost handling on VMAT_CONTIGUOUS_PERMUTE
>         without calling vect_model_load_cost.
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.target/i386/pr70021.c: Adjust with -fno-vect-cost-model.
> ---
>  gcc/testsuite/gcc.target/i386/pr70021.c |  2 +-
>  gcc/tree-vect-stmts.cc                  | 88 ++++++++++++++-----------
>  2 files changed, 51 insertions(+), 39 deletions(-)
>
> diff --git a/gcc/testsuite/gcc.target/i386/pr70021.c b/gcc/testsuite/gcc.target/i386/pr70021.c
> index 6562c0f2bd0..d509583601e 100644
> --- a/gcc/testsuite/gcc.target/i386/pr70021.c
> +++ b/gcc/testsuite/gcc.target/i386/pr70021.c
> @@ -1,7 +1,7 @@
>  /* PR target/70021 */
>  /* { dg-do run } */
>  /* { dg-require-effective-target avx2 } */
> -/* { dg-options "-O2 -ftree-vectorize -mavx2 -fdump-tree-vect-details -mtune=skylake" } */
> +/* { dg-options "-O2 -ftree-vectorize -mavx2 -fdump-tree-vect-details -mtune=skylake -fno-vect-cost-model" } */
>
>  #include "avx2-check.h"
>
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index 7f8d9db5363..e7a97dbe05d 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -1134,8 +1134,7 @@ vect_model_load_cost (vec_info *vinfo,
>                       slp_tree slp_node,
>                       stmt_vector_for_cost *cost_vec)
>  {
> -  gcc_assert (memory_access_type == VMAT_CONTIGUOUS
> -             || memory_access_type == VMAT_CONTIGUOUS_PERMUTE);
> +  gcc_assert (memory_access_type == VMAT_CONTIGUOUS);
>
>    unsigned int inside_cost = 0, prologue_cost = 0;
>    bool grouped_access_p = STMT_VINFO_GROUPED_ACCESS (stmt_info);
> @@ -1174,26 +1173,6 @@ vect_model_load_cost (vec_info *vinfo,
>       once per group anyhow.  */
>    bool first_stmt_p = (first_stmt_info == stmt_info);
>
> -  /* We assume that the cost of a single load-lanes instruction is
> -     equivalent to the cost of DR_GROUP_SIZE separate loads.  If a grouped
> -     access is instead being provided by a load-and-permute operation,
> -     include the cost of the permutes.  */
> -  if (first_stmt_p
> -      && memory_access_type == VMAT_CONTIGUOUS_PERMUTE)
> -    {
> -      /* Uses an even and odd extract operations or shuffle operations
> -        for each needed permute.  */
> -      int group_size = DR_GROUP_SIZE (first_stmt_info);
> -      int nstmts = ncopies * ceil_log2 (group_size) * group_size;
> -      inside_cost += record_stmt_cost (cost_vec, nstmts, vec_perm,
> -                                      stmt_info, 0, vect_body);
> -
> -      if (dump_enabled_p ())
> -        dump_printf_loc (MSG_NOTE, vect_location,
> -                         "vect_model_load_cost: strided group_size = %d .\n",
> -                         group_size);
> -    }
> -
>    vect_get_load_cost (vinfo, stmt_info, ncopies, alignment_support_scheme,
>                       misalignment, first_stmt_p, &inside_cost, &prologue_cost,
>                       cost_vec, cost_vec, true);
> @@ -10652,11 +10631,22 @@ vectorizable_load (vec_info *vinfo,
>                  alignment support schemes.  */
>               if (costing_p)
>                 {
> -                 if (memory_access_type == VMAT_CONTIGUOUS_REVERSE)
> +                 /* For VMAT_CONTIGUOUS_PERMUTE if it's grouped load, we
> +                    only need to take care of the first stmt, whose
> +                    stmt_info is first_stmt_info, vec_num iterating on it
> +                    will cover the cost for the remaining, it's consistent
> +                    with transforming.  For the prologue cost for realign,
> +                    we only need to count it once for the whole group.  */
> +                 bool first_stmt_info_p = first_stmt_info == stmt_info;
> +                 bool add_realign_cost = first_stmt_info_p && i == 0;
> +                 if (memory_access_type == VMAT_CONTIGUOUS_REVERSE
> +                     || (memory_access_type == VMAT_CONTIGUOUS_PERMUTE
> +                         && (!grouped_load || first_stmt_info_p)))
>                     vect_get_load_cost (vinfo, stmt_info, 1,
>                                         alignment_support_scheme, misalignment,
> -                                       false, &inside_cost, &prologue_cost,
> -                                       cost_vec, cost_vec, true);
> +                                       add_realign_cost, &inside_cost,
> +                                       &prologue_cost, cost_vec, cost_vec,
> +                                       true);
>                 }
>               else
>                 {
> @@ -10774,8 +10764,7 @@ vectorizable_load (vec_info *vinfo,
>              ???  This is a hack to prevent compile-time issues as seen
>              in PR101120 and friends.  */
>           if (costing_p
> -             && memory_access_type != VMAT_CONTIGUOUS
> -             && memory_access_type != VMAT_CONTIGUOUS_PERMUTE)
> +             && memory_access_type != VMAT_CONTIGUOUS)
>             {
>               vect_transform_slp_perm_load (vinfo, slp_node, vNULL, nullptr, vf,
>                                             true, &n_perms, nullptr);
> @@ -10790,20 +10779,44 @@ vectorizable_load (vec_info *vinfo,
>               gcc_assert (ok);
>             }
>         }
> -      else if (!costing_p)
> +      else
>          {
>            if (grouped_load)
>             {
>               if (memory_access_type != VMAT_LOAD_STORE_LANES)
> -               vect_transform_grouped_load (vinfo, stmt_info, dr_chain,
> -                                            group_size, gsi);
> -             *vec_stmt = STMT_VINFO_VEC_STMTS (stmt_info)[0];
> -           }
> -          else
> -           {
> -             STMT_VINFO_VEC_STMTS (stmt_info).safe_push (new_stmt);
> +               {
> +                 gcc_assert (memory_access_type == VMAT_CONTIGUOUS_PERMUTE);
> +                 /* We assume that the cost of a single load-lanes instruction
> +                    is equivalent to the cost of DR_GROUP_SIZE separate loads.
> +                    If a grouped access is instead being provided by a
> +                    load-and-permute operation, include the cost of the
> +                    permutes.  */
> +                 if (costing_p && first_stmt_info == stmt_info)
> +                   {
> +                     /* Uses an even and odd extract operations or shuffle
> +                        operations for each needed permute.  */
> +                     int group_size = DR_GROUP_SIZE (first_stmt_info);
> +                     int nstmts = ceil_log2 (group_size) * group_size;
> +                     inside_cost
> +                       += record_stmt_cost (cost_vec, nstmts, vec_perm,
> +                                            stmt_info, 0, vect_body);
> +
> +                     if (dump_enabled_p ())
> +                       dump_printf_loc (
> +                         MSG_NOTE, vect_location,
> +                         "vect_model_load_cost: strided group_size = %d .\n",
> +                         group_size);
> +                   }
> +                 else if (!costing_p)
> +                   vect_transform_grouped_load (vinfo, stmt_info, dr_chain,
> +                                                group_size, gsi);
> +               }
> +             if (!costing_p)
> +               *vec_stmt = STMT_VINFO_VEC_STMTS (stmt_info)[0];
>             }
> -        }
> +         else if (!costing_p)
> +           STMT_VINFO_VEC_STMTS (stmt_info).safe_push (new_stmt);
> +       }
>        dr_chain.release ();
>      }
>    if (!slp && !costing_p)
> @@ -10814,8 +10827,7 @@ vectorizable_load (vec_info *vinfo,
>        gcc_assert (memory_access_type != VMAT_INVARIANT
>                   && memory_access_type != VMAT_ELEMENTWISE
>                   && memory_access_type != VMAT_STRIDED_SLP);
> -      if (memory_access_type != VMAT_CONTIGUOUS
> -         && memory_access_type != VMAT_CONTIGUOUS_PERMUTE)
> +      if (memory_access_type != VMAT_CONTIGUOUS)
>         {
>           if (dump_enabled_p ())
>             dump_printf_loc (MSG_NOTE, vect_location,
> --
> 2.31.1
>


-- 
BR,
Hongtao

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 8/9] vect: Adjust vectorizable_load costing on VMAT_CONTIGUOUS_PERMUTE
  2023-06-14  8:17   ` Hongtao Liu
@ 2023-06-19  7:23     ` Kewen.Lin
  0 siblings, 0 replies; 27+ messages in thread
From: Kewen.Lin @ 2023-06-19  7:23 UTC (permalink / raw)
  To: Hongtao Liu
  Cc: gcc-patches, richard.guenther, richard.sandiford, segher,
	bergner, ubizjak, hongtao.liu

Hi Hongtao,

on 2023/6/14 16:17, Hongtao Liu wrote:
> On Tue, Jun 13, 2023 at 10:07 AM Kewen Lin via Gcc-patches
> <gcc-patches@gcc.gnu.org> wrote:
>>
>> This patch adjusts the cost handling on
>> VMAT_CONTIGUOUS_PERMUTE in function vectorizable_load.  We
>> don't call function vect_model_load_cost for it any more.
>>
>> As the affected test case gcc.target/i386/pr70021.c shows,
>> the previous costing can under-cost the total generated
>> vector loads as for VMAT_CONTIGUOUS_PERMUTE function
>> vect_model_load_cost doesn't consider the group size which
>> is considered as vec_num during the transformation.
> The original PR is for the correctness issue, and I'm not sure how
> much of a performance impact the patch would be, but the change looks
> reasonable, so the test change looks ok to me.
> I'll track performance impact on SPEC2017 to see if there's any
> regression caused by the patch(Guess probably not).

Thanks for the feedback and further tracking!  Hope this (and
this whole series) doesn't impact SPEC2017 performance on x86. :)

BR,
Kewen

>>
>> This patch makes the count of vector load in costing become
>> consistent with what we generates during the transformation.
>> To be more specific, for the given test case, for memory
>> access b[i_20], it costed for 2 vector loads before,
>> with this patch it costs 8 instead, it matches the final
>> count of generated vector loads basing from b.  This costing
>> change makes cost model analysis feel it's not profitable
>> to vectorize the first loop, so this patch adjusts the test
>> case without vect cost model any more.
>>
>> But note that this test case also exposes something we can
>> improve further is that although the number of vector
>> permutation what we costed and generated are consistent,
>> but DCE can further optimize some unused permutation out,
>> it would be good if we can predict that and generate only
>> those necessary permutations.
>>
>> gcc/ChangeLog:
>>
>>         * tree-vect-stmts.cc (vect_model_load_cost): Assert this function only
>>         handle memory_access_type VMAT_CONTIGUOUS, remove some
>>         VMAT_CONTIGUOUS_PERMUTE related handlings.
>>         (vectorizable_load): Adjust the cost handling on VMAT_CONTIGUOUS_PERMUTE
>>         without calling vect_model_load_cost.
>>
>> gcc/testsuite/ChangeLog:
>>
>>         * gcc.target/i386/pr70021.c: Adjust with -fno-vect-cost-model.
>> ---
>>  gcc/testsuite/gcc.target/i386/pr70021.c |  2 +-
>>  gcc/tree-vect-stmts.cc                  | 88 ++++++++++++++-----------
>>  2 files changed, 51 insertions(+), 39 deletions(-)
>>
>> diff --git a/gcc/testsuite/gcc.target/i386/pr70021.c b/gcc/testsuite/gcc.target/i386/pr70021.c
>> index 6562c0f2bd0..d509583601e 100644
>> --- a/gcc/testsuite/gcc.target/i386/pr70021.c
>> +++ b/gcc/testsuite/gcc.target/i386/pr70021.c
>> @@ -1,7 +1,7 @@
>>  /* PR target/70021 */
>>  /* { dg-do run } */
>>  /* { dg-require-effective-target avx2 } */
>> -/* { dg-options "-O2 -ftree-vectorize -mavx2 -fdump-tree-vect-details -mtune=skylake" } */
>> +/* { dg-options "-O2 -ftree-vectorize -mavx2 -fdump-tree-vect-details -mtune=skylake -fno-vect-cost-model" } */
>>
>>  #include "avx2-check.h"
>>
>> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
>> index 7f8d9db5363..e7a97dbe05d 100644
>> --- a/gcc/tree-vect-stmts.cc
>> +++ b/gcc/tree-vect-stmts.cc
>> @@ -1134,8 +1134,7 @@ vect_model_load_cost (vec_info *vinfo,
>>                       slp_tree slp_node,
>>                       stmt_vector_for_cost *cost_vec)
>>  {
>> -  gcc_assert (memory_access_type == VMAT_CONTIGUOUS
>> -             || memory_access_type == VMAT_CONTIGUOUS_PERMUTE);
>> +  gcc_assert (memory_access_type == VMAT_CONTIGUOUS);
>>
>>    unsigned int inside_cost = 0, prologue_cost = 0;
>>    bool grouped_access_p = STMT_VINFO_GROUPED_ACCESS (stmt_info);
>> @@ -1174,26 +1173,6 @@ vect_model_load_cost (vec_info *vinfo,
>>       once per group anyhow.  */
>>    bool first_stmt_p = (first_stmt_info == stmt_info);
>>
>> -  /* We assume that the cost of a single load-lanes instruction is
>> -     equivalent to the cost of DR_GROUP_SIZE separate loads.  If a grouped
>> -     access is instead being provided by a load-and-permute operation,
>> -     include the cost of the permutes.  */
>> -  if (first_stmt_p
>> -      && memory_access_type == VMAT_CONTIGUOUS_PERMUTE)
>> -    {
>> -      /* Uses an even and odd extract operations or shuffle operations
>> -        for each needed permute.  */
>> -      int group_size = DR_GROUP_SIZE (first_stmt_info);
>> -      int nstmts = ncopies * ceil_log2 (group_size) * group_size;
>> -      inside_cost += record_stmt_cost (cost_vec, nstmts, vec_perm,
>> -                                      stmt_info, 0, vect_body);
>> -
>> -      if (dump_enabled_p ())
>> -        dump_printf_loc (MSG_NOTE, vect_location,
>> -                         "vect_model_load_cost: strided group_size = %d .\n",
>> -                         group_size);
>> -    }
>> -
>>    vect_get_load_cost (vinfo, stmt_info, ncopies, alignment_support_scheme,
>>                       misalignment, first_stmt_p, &inside_cost, &prologue_cost,
>>                       cost_vec, cost_vec, true);
>> @@ -10652,11 +10631,22 @@ vectorizable_load (vec_info *vinfo,
>>                  alignment support schemes.  */
>>               if (costing_p)
>>                 {
>> -                 if (memory_access_type == VMAT_CONTIGUOUS_REVERSE)
>> +                 /* For VMAT_CONTIGUOUS_PERMUTE if it's grouped load, we
>> +                    only need to take care of the first stmt, whose
>> +                    stmt_info is first_stmt_info, vec_num iterating on it
>> +                    will cover the cost for the remaining, it's consistent
>> +                    with transforming.  For the prologue cost for realign,
>> +                    we only need to count it once for the whole group.  */
>> +                 bool first_stmt_info_p = first_stmt_info == stmt_info;
>> +                 bool add_realign_cost = first_stmt_info_p && i == 0;
>> +                 if (memory_access_type == VMAT_CONTIGUOUS_REVERSE
>> +                     || (memory_access_type == VMAT_CONTIGUOUS_PERMUTE
>> +                         && (!grouped_load || first_stmt_info_p)))
>>                     vect_get_load_cost (vinfo, stmt_info, 1,
>>                                         alignment_support_scheme, misalignment,
>> -                                       false, &inside_cost, &prologue_cost,
>> -                                       cost_vec, cost_vec, true);
>> +                                       add_realign_cost, &inside_cost,
>> +                                       &prologue_cost, cost_vec, cost_vec,
>> +                                       true);
>>                 }
>>               else
>>                 {
>> @@ -10774,8 +10764,7 @@ vectorizable_load (vec_info *vinfo,
>>              ???  This is a hack to prevent compile-time issues as seen
>>              in PR101120 and friends.  */
>>           if (costing_p
>> -             && memory_access_type != VMAT_CONTIGUOUS
>> -             && memory_access_type != VMAT_CONTIGUOUS_PERMUTE)
>> +             && memory_access_type != VMAT_CONTIGUOUS)
>>             {
>>               vect_transform_slp_perm_load (vinfo, slp_node, vNULL, nullptr, vf,
>>                                             true, &n_perms, nullptr);
>> @@ -10790,20 +10779,44 @@ vectorizable_load (vec_info *vinfo,
>>               gcc_assert (ok);
>>             }
>>         }
>> -      else if (!costing_p)
>> +      else
>>          {
>>            if (grouped_load)
>>             {
>>               if (memory_access_type != VMAT_LOAD_STORE_LANES)
>> -               vect_transform_grouped_load (vinfo, stmt_info, dr_chain,
>> -                                            group_size, gsi);
>> -             *vec_stmt = STMT_VINFO_VEC_STMTS (stmt_info)[0];
>> -           }
>> -          else
>> -           {
>> -             STMT_VINFO_VEC_STMTS (stmt_info).safe_push (new_stmt);
>> +               {
>> +                 gcc_assert (memory_access_type == VMAT_CONTIGUOUS_PERMUTE);
>> +                 /* We assume that the cost of a single load-lanes instruction
>> +                    is equivalent to the cost of DR_GROUP_SIZE separate loads.
>> +                    If a grouped access is instead being provided by a
>> +                    load-and-permute operation, include the cost of the
>> +                    permutes.  */
>> +                 if (costing_p && first_stmt_info == stmt_info)
>> +                   {
>> +                     /* Uses an even and odd extract operations or shuffle
>> +                        operations for each needed permute.  */
>> +                     int group_size = DR_GROUP_SIZE (first_stmt_info);
>> +                     int nstmts = ceil_log2 (group_size) * group_size;
>> +                     inside_cost
>> +                       += record_stmt_cost (cost_vec, nstmts, vec_perm,
>> +                                            stmt_info, 0, vect_body);
>> +
>> +                     if (dump_enabled_p ())
>> +                       dump_printf_loc (
>> +                         MSG_NOTE, vect_location,
>> +                         "vect_model_load_cost: strided group_size = %d .\n",
>> +                         group_size);
>> +                   }
>> +                 else if (!costing_p)
>> +                   vect_transform_grouped_load (vinfo, stmt_info, dr_chain,
>> +                                                group_size, gsi);
>> +               }
>> +             if (!costing_p)
>> +               *vec_stmt = STMT_VINFO_VEC_STMTS (stmt_info)[0];
>>             }
>> -        }
>> +         else if (!costing_p)
>> +           STMT_VINFO_VEC_STMTS (stmt_info).safe_push (new_stmt);
>> +       }
>>        dr_chain.release ();
>>      }
>>    if (!slp && !costing_p)
>> @@ -10814,8 +10827,7 @@ vectorizable_load (vec_info *vinfo,
>>        gcc_assert (memory_access_type != VMAT_INVARIANT
>>                   && memory_access_type != VMAT_ELEMENTWISE
>>                   && memory_access_type != VMAT_STRIDED_SLP);
>> -      if (memory_access_type != VMAT_CONTIGUOUS
>> -         && memory_access_type != VMAT_CONTIGUOUS_PERMUTE)
>> +      if (memory_access_type != VMAT_CONTIGUOUS)
>>         {
>>           if (dump_enabled_p ())
>>             dump_printf_loc (MSG_NOTE, vect_location,
>> --
>> 2.31.1
>>
> 
> 


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/9] vect: Move costing next to the transform for vect load
  2023-06-13  2:03 [PATCH 0/9] vect: Move costing next to the transform for vect load Kewen Lin
                   ` (8 preceding siblings ...)
  2023-06-13  2:03 ` [PATCH 9/9] vect: Adjust vectorizable_load costing on VMAT_CONTIGUOUS Kewen Lin
@ 2023-06-26  6:00 ` Kewen.Lin
  2023-06-30 11:37 ` Richard Biener
  10 siblings, 0 replies; 27+ messages in thread
From: Kewen.Lin @ 2023-06-26  6:00 UTC (permalink / raw)
  To: gcc-patches; +Cc: Richard Biener, richard.sandiford, segher, bergner

Hi,

I'd like to gently ping this patch series:

https://gcc.gnu.org/pipermail/gcc-patches/2023-June/621460.html

BR,
Kewen

on 2023/6/13 10:03, Kewen Lin via Gcc-patches wrote:
> This patch series follows Richi's suggestion at the link [1],
> which suggest structuring vectorizable_load to make costing
> next to the transform, in order to make it easier to keep
> costing and the transform in sync.  For now, it's a known
> issue that what we cost can be inconsistent with what we
> transform, as the case in PR82255 and some other associated
> test cases in the patches of this series show.
> 
> Basically this patch series makes costing not call function
> vect_model_load_cost any more.  To make the review and
> bisection easy, I organized the changes according to the
> memory access types of vector load.  For each memory access
> type, firstly it follows the handlings in the function
> vect_model_load_costto avoid any missing, then refines
> further by referring to the transform code, I also checked
> them with some typical test cases to verify.  Hope the
> subjects of patches are clear enough.
> 
> The whole series can be bootstrapped and regtested
> incrementally on:
>   - x86_64-redhat-linux
>   - aarch64-linux-gnu
>   - powerpc64-linux-gnu P7, P8 and P9
>   - powerpc64le-linux-gnu P8, P9 and P10
> 
> By considering the current vector test buckets are mainly
> tested without cost model, I also verified the whole patch
> series was neutral for SPEC2017 int/fp on Power9 at O2,
> O3 and Ofast separately.
> 
> [1] https://gcc.gnu.org/pipermail/gcc-patches/2021-January/563624.html
> 
> Kewen Lin (9):
>   vect: Move vect_model_load_cost next to the transform in vectorizable_load
>   vect: Adjust vectorizable_load costing on VMAT_GATHER_SCATTER && gs_info.decl
>   vect: Adjust vectorizable_load costing on VMAT_INVARIANT
>   vect: Adjust vectorizable_load costing on VMAT_ELEMENTWISE and VMAT_STRIDED_SLP
>   vect: Adjust vectorizable_load costing on VMAT_GATHER_SCATTER
>   vect: Adjust vectorizable_load costing on VMAT_LOAD_STORE_LANES
>   vect: Adjust vectorizable_load costing on VMAT_CONTIGUOUS_REVERSE
>   vect: Adjust vectorizable_load costing on VMAT_CONTIGUOUS_PERMUTE
>   vect: Adjust vectorizable_load costing on VMAT_CONTIGUOUS
> 
>  .../vect/costmodel/ppc/costmodel-pr82255.c    |  31 +
>  .../costmodel/ppc/costmodel-vect-reversed.c   |  22 +
>  gcc/testsuite/gcc.target/i386/pr70021.c       |   2 +-
>  gcc/tree-vect-stmts.cc                        | 651 ++++++++++--------
>  4 files changed, 432 insertions(+), 274 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/ppc/costmodel-pr82255.c
>  create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/ppc/costmodel-vect-reversed.c
>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/9] vect: Adjust vectorizable_load costing on VMAT_GATHER_SCATTER && gs_info.decl
  2023-06-13  2:03 ` [PATCH 2/9] vect: Adjust vectorizable_load costing on VMAT_GATHER_SCATTER && gs_info.decl Kewen Lin
@ 2023-06-30 11:11   ` Richard Biener
  2023-07-03  2:57     ` [PATCH 2/9 v2] " Kewen.Lin
  0 siblings, 1 reply; 27+ messages in thread
From: Richard Biener @ 2023-06-30 11:11 UTC (permalink / raw)
  To: Kewen Lin; +Cc: gcc-patches, richard.sandiford, segher, bergner

On Tue, Jun 13, 2023 at 4:03 AM Kewen Lin <linkw@linux.ibm.com> wrote:
>
> This patch adds one extra argument cost_vec to function
> vect_build_gather_load_calls, so that we can do costing
> next to the tranform in vect_build_gather_load_calls.
> For now, the implementation just follows the handlings in
> vect_model_load_cost, it isn't so good, so placing one
> FIXME for any further improvement.  This patch should not
> cause any functional changes.
>
> gcc/ChangeLog:
>
>         * tree-vect-stmts.cc (vect_build_gather_load_calls): Add the handlings
>         on costing with one extra argument cost_vec.
>         (vectorizable_load): Adjust the call to vect_build_gather_load_calls.
>         (vect_model_load_cost): Assert it won't get VMAT_GATHER_SCATTER with
>         gs_info.decl set any more.
> ---
>  gcc/tree-vect-stmts.cc | 31 +++++++++++++++++++++++--------
>  1 file changed, 23 insertions(+), 8 deletions(-)
>
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index 44514658be3..744cdf40e26 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -1135,6 +1135,8 @@ vect_model_load_cost (vec_info *vinfo,
>                       slp_tree slp_node,
>                       stmt_vector_for_cost *cost_vec)
>  {
> +  gcc_assert (memory_access_type != VMAT_GATHER_SCATTER || !gs_info->decl);
> +
>    unsigned int inside_cost = 0, prologue_cost = 0;
>    bool grouped_access_p = STMT_VINFO_GROUPED_ACCESS (stmt_info);
>
> @@ -2819,7 +2821,8 @@ vect_build_gather_load_calls (vec_info *vinfo, stmt_vec_info stmt_info,
>                               gimple_stmt_iterator *gsi,
>                               gimple **vec_stmt,
>                               gather_scatter_info *gs_info,
> -                             tree mask)
> +                             tree mask,
> +                             stmt_vector_for_cost *cost_vec)
>  {
>    loop_vec_info loop_vinfo = dyn_cast <loop_vec_info> (vinfo);
>    class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> @@ -2831,6 +2834,23 @@ vect_build_gather_load_calls (vec_info *vinfo, stmt_vec_info stmt_info,
>    poly_uint64 gather_off_nunits
>      = TYPE_VECTOR_SUBPARTS (gs_info->offset_vectype);
>
> +  /* FIXME: Keep the previous costing way in vect_model_load_cost by costing
> +     N scalar loads, but it should be tweaked to use target specific costs
> +     on related gather load calls.  */
> +  if (!vec_stmt)

Going over the series now; I'm collecting comments but wanted to get
this one out here: I'd rather see "if (cost_vec)" here, since that
'vec_stmt' argument is quite legacy (I think it can be completely
purged everywhere).
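
A minimal sketch of that spelling, assuming cost_vec is non-null
exactly when we are costing:

  if (cost_vec)
    {
      unsigned int assumed_nunits = vect_nunits_for_cost (vectype);
      unsigned int inside_cost
        = record_stmt_cost (cost_vec, ncopies * assumed_nunits,
                            scalar_load, stmt_info, 0, vect_body);
      /* ... dump and return as in the hunk quoted below ...  */
      return;
    }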

> +    {
> +      unsigned int assumed_nunits = vect_nunits_for_cost (vectype);
> +      unsigned int inside_cost;
> +      inside_cost = record_stmt_cost (cost_vec, ncopies * assumed_nunits,
> +                                     scalar_load, stmt_info, 0, vect_body);
> +      if (dump_enabled_p ())
> +       dump_printf_loc (MSG_NOTE, vect_location,
> +                        "vect_model_load_cost: inside_cost = %d, "
> +                        "prologue_cost = 0 .\n",
> +                        inside_cost);
> +      return;
> +    }
> +
>    tree arglist = TYPE_ARG_TYPES (TREE_TYPE (gs_info->decl));
>    tree rettype = TREE_TYPE (TREE_TYPE (gs_info->decl));
>    tree srctype = TREE_VALUE (arglist); arglist = TREE_CHAIN (arglist);
> @@ -9483,13 +9503,8 @@ vectorizable_load (vec_info *vinfo,
>
>    if (memory_access_type == VMAT_GATHER_SCATTER && gs_info.decl)
>      {
> -      if (costing_p)
> -       vect_model_load_cost (vinfo, stmt_info, ncopies, vf, memory_access_type,
> -                             alignment_support_scheme, misalignment, &gs_info,
> -                             slp_node, cost_vec);
> -      else
> -       vect_build_gather_load_calls (vinfo, stmt_info, gsi, vec_stmt, &gs_info,
> -                                     mask);
> +      vect_build_gather_load_calls (vinfo, stmt_info, gsi, vec_stmt, &gs_info,
> +                                   mask, cost_vec);
>        return true;
>      }
>
> --
> 2.31.1
>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 3/9] vect: Adjust vectorizable_load costing on VMAT_INVARIANT
  2023-06-13  2:03 ` [PATCH 3/9] vect: Adjust vectorizable_load costing on VMAT_INVARIANT Kewen Lin
@ 2023-06-30 11:18   ` Richard Biener
  2023-07-03  2:58     ` [PATCH 3/9 v2] " Kewen.Lin
  0 siblings, 1 reply; 27+ messages in thread
From: Richard Biener @ 2023-06-30 11:18 UTC (permalink / raw)
  To: Kewen Lin; +Cc: gcc-patches, richard.sandiford, segher, bergner

On Tue, Jun 13, 2023 at 4:03 AM Kewen Lin <linkw@linux.ibm.com> wrote:
>
> This patch adjusts the cost handling on VMAT_INVARIANT in
> function vectorizable_load.  We don't call function
> vect_model_load_cost for it any more.
>
> To make the costing on VMAT_INVARIANT better, this patch is
> to query hoist_defs_of_uses for hoisting decision, and add
> costs for different "where" based on it.  Currently function
> hoist_defs_of_uses would always hoist the defs of all SSA
> uses, adding one argument HOIST_P aims to avoid the actual
> hoisting during costing phase.
>
> gcc/ChangeLog:
>
>         * tree-vect-stmts.cc (hoist_defs_of_uses): Add one argument HOIST_P.
>         (vectorizable_load): Adjust the handling on VMAT_INVARIANT to respect
>         hoisting decision and without calling vect_model_load_cost.
>         (vect_model_load_cost): Assert it won't get VMAT_INVARIANT any more
>         and remove VMAT_INVARIANT related handlings.
> ---
>  gcc/tree-vect-stmts.cc | 61 +++++++++++++++++++++++++++---------------
>  1 file changed, 39 insertions(+), 22 deletions(-)
>
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index 744cdf40e26..19c61d703c8 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -1135,7 +1135,8 @@ vect_model_load_cost (vec_info *vinfo,
>                       slp_tree slp_node,
>                       stmt_vector_for_cost *cost_vec)
>  {
> -  gcc_assert (memory_access_type != VMAT_GATHER_SCATTER || !gs_info->decl);
> +  gcc_assert ((memory_access_type != VMAT_GATHER_SCATTER || !gs_info->decl)
> +             && memory_access_type != VMAT_INVARIANT);
>
>    unsigned int inside_cost = 0, prologue_cost = 0;
>    bool grouped_access_p = STMT_VINFO_GROUPED_ACCESS (stmt_info);
> @@ -1238,16 +1239,6 @@ vect_model_load_cost (vec_info *vinfo,
>                                        ncopies * assumed_nunits,
>                                        scalar_load, stmt_info, 0, vect_body);
>      }
> -  else if (memory_access_type == VMAT_INVARIANT)
> -    {
> -      /* Invariant loads will ideally be hoisted and splat to a vector.  */
> -      prologue_cost += record_stmt_cost (cost_vec, 1,
> -                                        scalar_load, stmt_info, 0,
> -                                        vect_prologue);
> -      prologue_cost += record_stmt_cost (cost_vec, 1,
> -                                        scalar_to_vec, stmt_info, 0,
> -                                        vect_prologue);
> -    }
>    else
>      vect_get_load_cost (vinfo, stmt_info, ncopies,
>                         alignment_support_scheme, misalignment, first_stmt_p,
> @@ -9121,10 +9112,11 @@ permute_vec_elements (vec_info *vinfo,
>  /* Hoist the definitions of all SSA uses on STMT_INFO out of the loop LOOP,
>     inserting them on the loops preheader edge.  Returns true if we
>     were successful in doing so (and thus STMT_INFO can be moved then),
> -   otherwise returns false.  */
> +   otherwise returns false.  HOIST_P indicates if we want to hoist the
> +   definitions of all SSA uses, it would be false when we are costing.  */
>
>  static bool
> -hoist_defs_of_uses (stmt_vec_info stmt_info, class loop *loop)
> +hoist_defs_of_uses (stmt_vec_info stmt_info, class loop *loop, bool hoist_p)
>  {
>    ssa_op_iter i;
>    tree op;
> @@ -9158,6 +9150,9 @@ hoist_defs_of_uses (stmt_vec_info stmt_info, class loop *loop)
>    if (!any)
>      return true;
>
> +  if (!hoist_p)
> +    return true;
> +
>    FOR_EACH_SSA_TREE_OPERAND (op, stmt_info->stmt, i, SSA_OP_USE)
>      {
>        gimple *def_stmt = SSA_NAME_DEF_STMT (op);
> @@ -9510,14 +9505,6 @@ vectorizable_load (vec_info *vinfo,
>
>    if (memory_access_type == VMAT_INVARIANT)
>      {
> -      if (costing_p)
> -       {
> -         vect_model_load_cost (vinfo, stmt_info, ncopies, vf,
> -                               memory_access_type, alignment_support_scheme,
> -                               misalignment, &gs_info, slp_node, cost_vec);
> -         return true;
> -       }
> -
>        gcc_assert (!grouped_load && !mask && !bb_vinfo);
>        /* If we have versioned for aliasing or the loop doesn't
>          have any data dependencies that would preclude this,
> @@ -9525,7 +9512,37 @@ vectorizable_load (vec_info *vinfo,
>          thus we can insert it on the preheader edge.  */
>        bool hoist_p = (LOOP_VINFO_NO_DATA_DEPENDENCIES (loop_vinfo)
>                       && !nested_in_vect_loop
> -                     && hoist_defs_of_uses (stmt_info, loop));
> +                     && hoist_defs_of_uses (stmt_info, loop, !costing_p));

'hoist_defs_of_uses' should ideally be computed once at analysis time
and the result remembered.  It's not so easy in this case, so maybe
just add a comment about this here.
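
For instance, a comment along these lines (the wording is only a
suggestion):

  /* ???  hoist_defs_of_uses is evaluated again at transform time;
     ideally the hoisting decision would be computed once during
     analysis and remembered here.  */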

> +      if (costing_p)
> +       {
> +         if (hoist_p)
> +           {
> +             unsigned int prologue_cost;
> +             prologue_cost = record_stmt_cost (cost_vec, 1, scalar_load,
> +                                               stmt_info, 0, vect_prologue);
> +             prologue_cost += record_stmt_cost (cost_vec, 1, scalar_to_vec,
> +                                                stmt_info, 0, vect_prologue);
> +             if (dump_enabled_p ())
> +               dump_printf_loc (MSG_NOTE, vect_location,
> +                                "vect_model_load_cost: inside_cost = 0, "
> +                                "prologue_cost = %d .\n",
> +                                prologue_cost);
> +           }
> +         else
> +           {
> +             unsigned int inside_cost;
> +             inside_cost = record_stmt_cost (cost_vec, 1, scalar_load,
> +                                             stmt_info, 0, vect_body);
> +             inside_cost += record_stmt_cost (cost_vec, 1, scalar_to_vec,
> +                                              stmt_info, 0, vect_body);
> +             if (dump_enabled_p ())
> +               dump_printf_loc (MSG_NOTE, vect_location,
> +                                "vect_model_load_cost: inside_cost = %d, "
> +                                "prologue_cost = 0 .\n",
> +                                inside_cost);
> +           }

Please instead do

  enum vect_cost_model_location loc = hoist_p ? vect_prologue : vect_body;

and merge the two branches which otherwise look identical to me.
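
A minimal sketch of the merged form (the dump message wording below is
an assumption, just to show the shape):

  enum vect_cost_model_location loc = hoist_p ? vect_prologue : vect_body;
  unsigned int cost = record_stmt_cost (cost_vec, 1, scalar_load,
                                        stmt_info, 0, loc);
  cost += record_stmt_cost (cost_vec, 1, scalar_to_vec, stmt_info, 0, loc);
  if (dump_enabled_p ())
    dump_printf_loc (MSG_NOTE, vect_location,
                     "vect_model_load_cost: %s cost = %d .\n",
                     hoist_p ? "prologue" : "inside", cost);
  return true;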

> +         return true;
> +       }
>        if (hoist_p)
>         {
>           gassign *stmt = as_a <gassign *> (stmt_info->stmt);
> --
> 2.31.1
>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/9] vect: Move costing next to the transform for vect load
  2023-06-13  2:03 [PATCH 0/9] vect: Move costing next to the transform for vect load Kewen Lin
                   ` (9 preceding siblings ...)
  2023-06-26  6:00 ` [PATCH 0/9] vect: Move costing next to the transform for vect load Kewen.Lin
@ 2023-06-30 11:37 ` Richard Biener
  2023-07-02  9:13   ` Richard Sandiford
  2023-07-03  3:39   ` Kewen.Lin
  10 siblings, 2 replies; 27+ messages in thread
From: Richard Biener @ 2023-06-30 11:37 UTC (permalink / raw)
  To: Kewen Lin; +Cc: gcc-patches, richard.sandiford, segher, bergner

On Tue, Jun 13, 2023 at 4:07 AM Kewen Lin <linkw@linux.ibm.com> wrote:
>
> This patch series follows Richi's suggestion at the link [1],
> which suggest structuring vectorizable_load to make costing
> next to the transform, in order to make it easier to keep
> costing and the transform in sync.  For now, it's a known
> issue that what we cost can be inconsistent with what we
> transform, as the case in PR82255 and some other associated
> test cases in the patches of this series show.
>
> Basically this patch series makes costing not call function
> vect_model_load_cost any more.  To make the review and
> bisection easy, I organized the changes according to the
> memory access types of vector load.  For each memory access
> type, firstly it follows the handlings in the function
> vect_model_load_costto avoid any missing, then refines
> further by referring to the transform code, I also checked
> them with some typical test cases to verify.  Hope the
> subjects of patches are clear enough.
>
> The whole series can be bootstrapped and regtested
> incrementally on:
>   - x86_64-redhat-linux
>   - aarch64-linux-gnu
>   - powerpc64-linux-gnu P7, P8 and P9
>   - powerpc64le-linux-gnu P8, P9 and P10
>
> By considering the current vector test buckets are mainly
> tested without cost model, I also verified the whole patch
> series was neutral for SPEC2017 int/fp on Power9 at O2,
> O3 and Ofast separately.

I went through the series now and I like it overall (well, I suggested
the change).
Looking at the changes I think we want some follow-up to reduce the
mess in the final loop nest.  We already have some VMAT_* cases handled
separately, maybe we can split out some more cases.  Maybe we should
bite the bullet and duplicate that loop nest for the different VMAT_*
cases.  Maybe we can merge some of the if (!costing_p) checks by clever
re-ordering.  So what this series doesn't improve is the overall
readability of the code (indentation and our 80 char line limit).

The change also makes it more difficult(?) to separate analysis and
transform, though in the end I hope that analysis will actually "code
generate" to a (SLP) data structure so the target will have a chance
to see the actual flow of insns.

That said, I'd like to hear from Richard whether he thinks this is a step
in the right direction.

Are you willing to follow up with doing the same restructuring to
vectorizable_store?

OK from my side with the few comments addressed.  The patch likely
needs a refresh after the RVV changes in this area?

Thanks,
Richard.

> [1] https://gcc.gnu.org/pipermail/gcc-patches/2021-January/563624.html
>
> Kewen Lin (9):
>   vect: Move vect_model_load_cost next to the transform in vectorizable_load
>   vect: Adjust vectorizable_load costing on VMAT_GATHER_SCATTER && gs_info.decl
>   vect: Adjust vectorizable_load costing on VMAT_INVARIANT
>   vect: Adjust vectorizable_load costing on VMAT_ELEMENTWISE and VMAT_STRIDED_SLP
>   vect: Adjust vectorizable_load costing on VMAT_GATHER_SCATTER
>   vect: Adjust vectorizable_load costing on VMAT_LOAD_STORE_LANES
>   vect: Adjust vectorizable_load costing on VMAT_CONTIGUOUS_REVERSE
>   vect: Adjust vectorizable_load costing on VMAT_CONTIGUOUS_PERMUTE
>   vect: Adjust vectorizable_load costing on VMAT_CONTIGUOUS
>
>  .../vect/costmodel/ppc/costmodel-pr82255.c    |  31 +
>  .../costmodel/ppc/costmodel-vect-reversed.c   |  22 +
>  gcc/testsuite/gcc.target/i386/pr70021.c       |   2 +-
>  gcc/tree-vect-stmts.cc                        | 651 ++++++++++--------
>  4 files changed, 432 insertions(+), 274 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/ppc/costmodel-pr82255.c
>  create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/ppc/costmodel-vect-reversed.c
>
> --
> 2.31.1
>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 4/9] vect: Adjust vectorizable_load costing on VMAT_ELEMENTWISE and VMAT_STRIDED_SLP
  2023-06-13  2:03 ` [PATCH 4/9] vect: Adjust vectorizable_load costing on VMAT_ELEMENTWISE and VMAT_STRIDED_SLP Kewen Lin
@ 2023-07-02  8:58   ` Richard Sandiford
  2023-07-03  3:19     ` Kewen.Lin
  0 siblings, 1 reply; 27+ messages in thread
From: Richard Sandiford @ 2023-07-02  8:58 UTC (permalink / raw)
  To: Kewen Lin; +Cc: gcc-patches, richard.guenther, segher, bergner

Kewen Lin <linkw@linux.ibm.com> writes:
> @@ -9743,11 +9739,23 @@ vectorizable_load (vec_info *vinfo,
>        unsigned int n_groups = 0;
>        for (j = 0; j < ncopies; j++)
>  	{
> -	  if (nloads > 1)
> +	  if (nloads > 1 && !costing_p)
>  	    vec_alloc (v, nloads);
>  	  gimple *new_stmt = NULL;
>  	  for (i = 0; i < nloads; i++)
>  	    {
> +	      if (costing_p)
> +		{
> +		  if (VECTOR_TYPE_P (ltype))
> +		    vect_get_load_cost (vinfo, stmt_info, 1,
> +					alignment_support_scheme, misalignment,
> +					false, &inside_cost, nullptr, cost_vec,
> +					cost_vec, true);
> +		  else
> +		    inside_cost += record_stmt_cost (cost_vec, 1, scalar_load,
> +						     stmt_info, 0, vect_body);
> +		  continue;
> +		}

Just a note that this might make life harder for AArch64 costing.
Strided SLP loads are still equivalent to vector loads for AArch64,
since they happen on the FPR/vector side even if they have integral
modes.

But I agree this is more accurate from a general target-independent POV,
especially given the relatively coarse-grain costing enum.  So I think
that's just something AArch64 will need to account for.
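
For illustration only, the kind of target-side adjustment meant here could
look roughly like the following; all names are hypothetical stand-ins (the
real hook would work on GCC's vect_cost_for_stmt kinds and stmt_vec_info
instead).

#include <cstdio>

/* Hypothetical cost kinds and statement descriptor, purely for illustration.  */
enum stmt_cost_kind { SCALAR_LOAD_COST, VECTOR_LOAD_COST };

struct stmt_desc
{
  bool strided_slp_load_p;	/* assumed to be derivable during analysis  */
};

static unsigned
target_stmt_cost (stmt_cost_kind kind, const stmt_desc &stmt)
{
  /* If strided SLP element loads really execute on the FPR/vector side,
     cost them like vector-side loads even though the vectorizer records
     them as scalar loads.  */
  if (kind == SCALAR_LOAD_COST && stmt.strided_slp_load_p)
    kind = VECTOR_LOAD_COST;

  switch (kind)
    {
    case SCALAR_LOAD_COST: return 1;	/* made-up unit costs  */
    case VECTOR_LOAD_COST: return 2;
    }
  return 1;
}

int
main ()
{
  stmt_desc s = { true };
  std::printf ("strided element load costed as %u\n",
	       target_stmt_cost (SCALAR_LOAD_COST, s));
  return 0;
}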

Thanks,
Richard

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/9] vect: Move costing next to the transform for vect load
  2023-06-30 11:37 ` Richard Biener
@ 2023-07-02  9:13   ` Richard Sandiford
  2023-07-03  3:39   ` Kewen.Lin
  1 sibling, 0 replies; 27+ messages in thread
From: Richard Sandiford @ 2023-07-02  9:13 UTC (permalink / raw)
  To: Richard Biener; +Cc: Kewen Lin, gcc-patches, segher, bergner

Richard Biener <richard.guenther@gmail.com> writes:
> On Tue, Jun 13, 2023 at 4:07 AM Kewen Lin <linkw@linux.ibm.com> wrote:
>>
>> This patch series follows Richi's suggestion at the link [1],
>> which suggest structuring vectorizable_load to make costing
>> next to the transform, in order to make it easier to keep
>> costing and the transform in sync.

FTR, I was keeping quiet given that this was following an agreed plan :)

Thanks for organising the series this way.  It made it easier to review.

>> For now, it's a known
>> issue that what we cost can be inconsistent with what we
>> transform, as the case in PR82255 and some other associated
>> test cases in the patches of this series show.
>>
>> Basically this patch series makes costing not call function
>> vect_model_load_cost any more.  To make the review and
>> bisection easy, I organized the changes according to the
>> memory access types of vector load.  For each memory access
>> type, firstly it follows the handlings in the function
>> vect_model_load_costto avoid any missing, then refines
>> further by referring to the transform code, I also checked
>> them with some typical test cases to verify.  Hope the
>> subjects of patches are clear enough.
>>
>> The whole series can be bootstrapped and regtested
>> incrementally on:
>>   - x86_64-redhat-linux
>>   - aarch64-linux-gnu
>>   - powerpc64-linux-gnu P7, P8 and P9
>>   - powerpc64le-linux-gnu P8, P9 and P10
>>
>> By considering the current vector test buckets are mainly
>> tested without cost model, I also verified the whole patch
>> series was neutral for SPEC2017 int/fp on Power9 at O2,
>> O3 and Ofast separately.
>
> I went through the series now and I like it overall (well, I suggested
> the change).
> Looking at the changes I think we want some followup to reduce the
> mess in the final loop nest.  We already have some VMAT_* cases handled
> separately, maybe we can split out some more cases.  Maybe we should
> bite the bullet and duplicate that loop nest for the different VMAT_* cases.
> Maybe we can merge some of the if (!costing_p) checks by clever
> re-ordering.  So what
> this series doesn't improve is overall readability of the code (indent and our
> 80 char line limit).
>
> The change also makes it more difficult(?) to separate analysis and transform
> though in the end I hope that analysis will actually "code generate" to a (SLP)
> data structure so the target will have a chance to see the actual flow of insns.
>
> That said, I'd like to hear from Richard whether he thinks this is a step
> in the right direction.

Yeah, agree that it's probably better on balance.  It's going to need a
bit of discipline to make sure that we don't accidentally change the IR
during the analysis phase, but I guess that already exists to a lesser
extent with the “before !vec_stmt”/“after !vec_stmt” split.

Thanks,
Richard

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 2/9 v2] vect: Adjust vectorizable_load costing on VMAT_GATHER_SCATTER && gs_info.decl
  2023-06-30 11:11   ` Richard Biener
@ 2023-07-03  2:57     ` Kewen.Lin
  0 siblings, 0 replies; 27+ messages in thread
From: Kewen.Lin @ 2023-07-03  2:57 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches, richard.sandiford, segher, bergner

Hi Richi,

on 2023/6/30 19:11, Richard Biener wrote:
> On Tue, Jun 13, 2023 at 4:03 AM Kewen Lin <linkw@linux.ibm.com> wrote:
>>
>> This patch adds one extra argument cost_vec to function
>> vect_build_gather_load_calls, so that we can do costing
>> next to the tranform in vect_build_gather_load_calls.
>> For now, the implementation just follows the handlings in
>> vect_model_load_cost, it isn't so good, so placing one
>> FIXME for any further improvement.  This patch should not
>> cause any functional changes.
>>
>> gcc/ChangeLog:
>>
>>         * tree-vect-stmts.cc (vect_build_gather_load_calls): Add the handlings
>>         on costing with one extra argument cost_vec.
>>         (vectorizable_load): Adjust the call to vect_build_gather_load_calls.
>>         (vect_model_load_cost): Assert it won't get VMAT_GATHER_SCATTER with
>>         gs_info.decl set any more.
>> ---
>>  gcc/tree-vect-stmts.cc | 31 +++++++++++++++++++++++--------
>>  1 file changed, 23 insertions(+), 8 deletions(-)
>>
>> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
>> index 44514658be3..744cdf40e26 100644
>> --- a/gcc/tree-vect-stmts.cc
>> +++ b/gcc/tree-vect-stmts.cc
>> @@ -1135,6 +1135,8 @@ vect_model_load_cost (vec_info *vinfo,
>>                       slp_tree slp_node,
>>                       stmt_vector_for_cost *cost_vec)
>>  {
>> +  gcc_assert (memory_access_type != VMAT_GATHER_SCATTER || !gs_info->decl);
>> +
>>    unsigned int inside_cost = 0, prologue_cost = 0;
>>    bool grouped_access_p = STMT_VINFO_GROUPED_ACCESS (stmt_info);
>>
>> @@ -2819,7 +2821,8 @@ vect_build_gather_load_calls (vec_info *vinfo, stmt_vec_info stmt_info,
>>                               gimple_stmt_iterator *gsi,
>>                               gimple **vec_stmt,
>>                               gather_scatter_info *gs_info,
>> -                             tree mask)
>> +                             tree mask,
>> +                             stmt_vector_for_cost *cost_vec)
>>  {
>>    loop_vec_info loop_vinfo = dyn_cast <loop_vec_info> (vinfo);
>>    class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
>> @@ -2831,6 +2834,23 @@ vect_build_gather_load_calls (vec_info *vinfo, stmt_vec_info stmt_info,
>>    poly_uint64 gather_off_nunits
>>      = TYPE_VECTOR_SUBPARTS (gs_info->offset_vectype);
>>
>> +  /* FIXME: Keep the previous costing way in vect_model_load_cost by costing
>> +     N scalar loads, but it should be tweaked to use target specific costs
>> +     on related gather load calls.  */
>> +  if (!vec_stmt)
> 
> going over the series now, I'm collecting comments but wanted to get
> this one out here:
> I'd rather see if (cost_vec) here, that 'vec_stmt' argument is quite
> legacy (I think it can
> be completely purged everywhere)
> 

OK, updated with the new version v2 below.  I guess you meant we should use
cost_vec to check whether we are costing, and use vec_stmts in stmt_info to
store the vectorized stmts.

diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index ec177dcad82..dd8f5421d4e 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -1136,6 +1136,8 @@ vect_model_load_cost (vec_info *vinfo,
 		      slp_tree slp_node,
 		      stmt_vector_for_cost *cost_vec)
 {
+  gcc_assert (memory_access_type != VMAT_GATHER_SCATTER || !gs_info->decl);
+
   unsigned int inside_cost = 0, prologue_cost = 0;
   bool grouped_access_p = STMT_VINFO_GROUPED_ACCESS (stmt_info);

@@ -2846,7 +2848,8 @@ vect_build_gather_load_calls (vec_info *vinfo, stmt_vec_info stmt_info,
 			      gimple_stmt_iterator *gsi,
 			      gimple **vec_stmt,
 			      gather_scatter_info *gs_info,
-			      tree mask)
+			      tree mask,
+			      stmt_vector_for_cost *cost_vec)
 {
   loop_vec_info loop_vinfo = dyn_cast <loop_vec_info> (vinfo);
   class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
@@ -2858,6 +2861,23 @@ vect_build_gather_load_calls (vec_info *vinfo, stmt_vec_info stmt_info,
   poly_uint64 gather_off_nunits
     = TYPE_VECTOR_SUBPARTS (gs_info->offset_vectype);

+  /* FIXME: Keep the previous costing way in vect_model_load_cost by costing
+     N scalar loads, but it should be tweaked to use target specific costs
+     on related gather load calls.  */
+  if (cost_vec)
+    {
+      unsigned int assumed_nunits = vect_nunits_for_cost (vectype);
+      unsigned int inside_cost;
+      inside_cost = record_stmt_cost (cost_vec, ncopies * assumed_nunits,
+				      scalar_load, stmt_info, 0, vect_body);
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_NOTE, vect_location,
+			 "vect_model_load_cost: inside_cost = %d, "
+			 "prologue_cost = 0 .\n",
+			 inside_cost);
+      return;
+    }
+
   tree arglist = TYPE_ARG_TYPES (TREE_TYPE (gs_info->decl));
   tree rettype = TREE_TYPE (TREE_TYPE (gs_info->decl));
   tree srctype = TREE_VALUE (arglist); arglist = TREE_CHAIN (arglist);
@@ -9631,13 +9651,8 @@ vectorizable_load (vec_info *vinfo,

   if (memory_access_type == VMAT_GATHER_SCATTER && gs_info.decl)
     {
-      if (costing_p)
-	vect_model_load_cost (vinfo, stmt_info, ncopies, vf, memory_access_type,
-			      alignment_support_scheme, misalignment, &gs_info,
-			      slp_node, cost_vec);
-      else
-	vect_build_gather_load_calls (vinfo, stmt_info, gsi, vec_stmt, &gs_info,
-				      mask);
+      vect_build_gather_load_calls (vinfo, stmt_info, gsi, vec_stmt, &gs_info,
+				    mask, cost_vec);
       return true;
     }

--
2.31.1

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 3/9 v2] vect: Adjust vectorizable_load costing on VMAT_INVARIANT
  2023-06-30 11:18   ` Richard Biener
@ 2023-07-03  2:58     ` Kewen.Lin
  0 siblings, 0 replies; 27+ messages in thread
From: Kewen.Lin @ 2023-07-03  2:58 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches, richard.sandiford, segher, bergner

Hi Richi,

on 2023/6/30 19:18, Richard Biener wrote:
> On Tue, Jun 13, 2023 at 4:03 AM Kewen Lin <linkw@linux.ibm.com> wrote:
>>
>> This patch adjusts the cost handling on VMAT_INVARIANT in
>> function vectorizable_load.  We don't call function
>> vect_model_load_cost for it any more.
>>
>> To make the costing on VMAT_INVARIANT better, this patch is
>> to query hoist_defs_of_uses for hoisting decision, and add
>> costs for different "where" based on it.  Currently function
>> hoist_defs_of_uses would always hoist the defs of all SSA
>> uses, adding one argument HOIST_P aims to avoid the actual
>> hoisting during costing phase.
>>
>> gcc/ChangeLog:
>>
>>         * tree-vect-stmts.cc (hoist_defs_of_uses): Add one argument HOIST_P.
>>         (vectorizable_load): Adjust the handling on VMAT_INVARIANT to respect
>>         hoisting decision and without calling vect_model_load_cost.
>>         (vect_model_load_cost): Assert it won't get VMAT_INVARIANT any more
>>         and remove VMAT_INVARIANT related handlings.
>> ---
>>  gcc/tree-vect-stmts.cc | 61 +++++++++++++++++++++++++++---------------
>>  1 file changed, 39 insertions(+), 22 deletions(-)
>>
>> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
>> index 744cdf40e26..19c61d703c8 100644
>> --- a/gcc/tree-vect-stmts.cc
>> +++ b/gcc/tree-vect-stmts.cc
>> @@ -1135,7 +1135,8 @@ vect_model_load_cost (vec_info *vinfo,
>>                       slp_tree slp_node,
>>                       stmt_vector_for_cost *cost_vec)
>>  {
>> -  gcc_assert (memory_access_type != VMAT_GATHER_SCATTER || !gs_info->decl);
>> +  gcc_assert ((memory_access_type != VMAT_GATHER_SCATTER || !gs_info->decl)
>> +             && memory_access_type != VMAT_INVARIANT);
>>
>>    unsigned int inside_cost = 0, prologue_cost = 0;
>>    bool grouped_access_p = STMT_VINFO_GROUPED_ACCESS (stmt_info);
>> @@ -1238,16 +1239,6 @@ vect_model_load_cost (vec_info *vinfo,
>>                                        ncopies * assumed_nunits,
>>                                        scalar_load, stmt_info, 0, vect_body);
>>      }
>> -  else if (memory_access_type == VMAT_INVARIANT)
>> -    {
>> -      /* Invariant loads will ideally be hoisted and splat to a vector.  */
>> -      prologue_cost += record_stmt_cost (cost_vec, 1,
>> -                                        scalar_load, stmt_info, 0,
>> -                                        vect_prologue);
>> -      prologue_cost += record_stmt_cost (cost_vec, 1,
>> -                                        scalar_to_vec, stmt_info, 0,
>> -                                        vect_prologue);
>> -    }
>>    else
>>      vect_get_load_cost (vinfo, stmt_info, ncopies,
>>                         alignment_support_scheme, misalignment, first_stmt_p,
>> @@ -9121,10 +9112,11 @@ permute_vec_elements (vec_info *vinfo,
>>  /* Hoist the definitions of all SSA uses on STMT_INFO out of the loop LOOP,
>>     inserting them on the loops preheader edge.  Returns true if we
>>     were successful in doing so (and thus STMT_INFO can be moved then),
>> -   otherwise returns false.  */
>> +   otherwise returns false.  HOIST_P indicates if we want to hoist the
>> +   definitions of all SSA uses, it would be false when we are costing.  */
>>
>>  static bool
>> -hoist_defs_of_uses (stmt_vec_info stmt_info, class loop *loop)
>> +hoist_defs_of_uses (stmt_vec_info stmt_info, class loop *loop, bool hoist_p)
>>  {
>>    ssa_op_iter i;
>>    tree op;
>> @@ -9158,6 +9150,9 @@ hoist_defs_of_uses (stmt_vec_info stmt_info, class loop *loop)
>>    if (!any)
>>      return true;
>>
>> +  if (!hoist_p)
>> +    return true;
>> +
>>    FOR_EACH_SSA_TREE_OPERAND (op, stmt_info->stmt, i, SSA_OP_USE)
>>      {
>>        gimple *def_stmt = SSA_NAME_DEF_STMT (op);
>> @@ -9510,14 +9505,6 @@ vectorizable_load (vec_info *vinfo,
>>
>>    if (memory_access_type == VMAT_INVARIANT)
>>      {
>> -      if (costing_p)
>> -       {
>> -         vect_model_load_cost (vinfo, stmt_info, ncopies, vf,
>> -                               memory_access_type, alignment_support_scheme,
>> -                               misalignment, &gs_info, slp_node, cost_vec);
>> -         return true;
>> -       }
>> -
>>        gcc_assert (!grouped_load && !mask && !bb_vinfo);
>>        /* If we have versioned for aliasing or the loop doesn't
>>          have any data dependencies that would preclude this,
>> @@ -9525,7 +9512,37 @@ vectorizable_load (vec_info *vinfo,
>>          thus we can insert it on the preheader edge.  */
>>        bool hoist_p = (LOOP_VINFO_NO_DATA_DEPENDENCIES (loop_vinfo)
>>                       && !nested_in_vect_loop
>> -                     && hoist_defs_of_uses (stmt_info, loop));
>> +                     && hoist_defs_of_uses (stmt_info, loop, !costing_p));
> 
> 'hoist_defs_of_uses' should ideally be computed once at analysis time and
> the result remembered.  It's not so easy in this case so maybe just
> add a comment
> for this here.

Ok, updated with: 

       /* If we have versioned for aliasing or the loop doesn't
 	 have any data dependencies that would preclude this,
 	 then we are sure this is a loop invariant load and
-	 thus we can insert it on the preheader edge.  */
+	 thus we can insert it on the preheader edge.
+	 TODO: hoist_defs_of_uses should ideally be computed
+	 once at analysis time, remembered and used in the
+	 transform time.  */


> 
>> +      if (costing_p)
>> +       {
>> +         if (hoist_p)
>> +           {
>> +             unsigned int prologue_cost;
>> +             prologue_cost = record_stmt_cost (cost_vec, 1, scalar_load,
>> +                                               stmt_info, 0, vect_prologue);
>> +             prologue_cost += record_stmt_cost (cost_vec, 1, scalar_to_vec,
>> +                                                stmt_info, 0, vect_prologue);
>> +             if (dump_enabled_p ())
>> +               dump_printf_loc (MSG_NOTE, vect_location,
>> +                                "vect_model_load_cost: inside_cost = 0, "
>> +                                "prologue_cost = %d .\n",
>> +                                prologue_cost);
>> +           }
>> +         else
>> +           {
>> +             unsigned int inside_cost;
>> +             inside_cost = record_stmt_cost (cost_vec, 1, scalar_load,
>> +                                             stmt_info, 0, vect_body);
>> +             inside_cost += record_stmt_cost (cost_vec, 1, scalar_to_vec,
>> +                                              stmt_info, 0, vect_body);
>> +             if (dump_enabled_p ())
>> +               dump_printf_loc (MSG_NOTE, vect_location,
>> +                                "vect_model_load_cost: inside_cost = %d, "
>> +                                "prologue_cost = 0 .\n",
>> +                                inside_cost);
>> +           }
> 
> Please instead do
> 
>               enum vect_cost_model_location loc = hoist_p ?
> vect_prologue : vect_body;
> 
> and merge the two branches which otherwise look identical to me.

Good idea; the dump_printf_loc calls also differ slightly, so I updated it as follows:

+      if (costing_p)
+	{
+	  enum vect_cost_model_location cost_loc
+	    = hoist_p ? vect_prologue : vect_body;
+	  unsigned int cost = record_stmt_cost (cost_vec, 1, scalar_load,
+						stmt_info, 0, cost_loc);
+	  cost += record_stmt_cost (cost_vec, 1, scalar_to_vec, stmt_info, 0,
+				    cost_loc);
+	  unsigned int prologue_cost = hoist_p ? cost : 0;
+	  unsigned int inside_cost = hoist_p ? 0 : cost;
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_NOTE, vect_location,
+			     "vect_model_load_cost: inside_cost = %d, "
+			     "prologue_cost = %d .\n",
+			     inside_cost, prologue_cost);
+	  return true;
+	}

---------------------

The whole patch v2 is as below:

diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index dd8f5421d4e..ce53cb30c79 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -1136,7 +1136,8 @@ vect_model_load_cost (vec_info *vinfo,
 		      slp_tree slp_node,
 		      stmt_vector_for_cost *cost_vec)
 {
-  gcc_assert (memory_access_type != VMAT_GATHER_SCATTER || !gs_info->decl);
+  gcc_assert ((memory_access_type != VMAT_GATHER_SCATTER || !gs_info->decl)
+	      && memory_access_type != VMAT_INVARIANT);

   unsigned int inside_cost = 0, prologue_cost = 0;
   bool grouped_access_p = STMT_VINFO_GROUPED_ACCESS (stmt_info);
@@ -1241,16 +1242,6 @@ vect_model_load_cost (vec_info *vinfo,
 				       ncopies * assumed_nunits,
 				       scalar_load, stmt_info, 0, vect_body);
     }
-  else if (memory_access_type == VMAT_INVARIANT)
-    {
-      /* Invariant loads will ideally be hoisted and splat to a vector.  */
-      prologue_cost += record_stmt_cost (cost_vec, 1,
-					 scalar_load, stmt_info, 0,
-					 vect_prologue);
-      prologue_cost += record_stmt_cost (cost_vec, 1,
-					 scalar_to_vec, stmt_info, 0,
-					 vect_prologue);
-    }
   else
     vect_get_load_cost (vinfo, stmt_info, ncopies,
 			alignment_support_scheme, misalignment, first_stmt_p,
@@ -9269,10 +9260,11 @@ permute_vec_elements (vec_info *vinfo,
 /* Hoist the definitions of all SSA uses on STMT_INFO out of the loop LOOP,
    inserting them on the loops preheader edge.  Returns true if we
    were successful in doing so (and thus STMT_INFO can be moved then),
-   otherwise returns false.  */
+   otherwise returns false.  HOIST_P indicates whether we want to hoist the
+   definitions of all SSA uses; it would be false when we are costing.  */

 static bool
-hoist_defs_of_uses (stmt_vec_info stmt_info, class loop *loop)
+hoist_defs_of_uses (stmt_vec_info stmt_info, class loop *loop, bool hoist_p)
 {
   ssa_op_iter i;
   tree op;
@@ -9306,6 +9298,9 @@ hoist_defs_of_uses (stmt_vec_info stmt_info, class loop *loop)
   if (!any)
     return true;

+  if (!hoist_p)
+    return true;
+
   FOR_EACH_SSA_TREE_OPERAND (op, stmt_info->stmt, i, SSA_OP_USE)
     {
       gimple *def_stmt = SSA_NAME_DEF_STMT (op);
@@ -9658,22 +9653,34 @@ vectorizable_load (vec_info *vinfo,

   if (memory_access_type == VMAT_INVARIANT)
     {
-      if (costing_p)
-	{
-	  vect_model_load_cost (vinfo, stmt_info, ncopies, vf,
-				memory_access_type, alignment_support_scheme,
-				misalignment, &gs_info, slp_node, cost_vec);
-	  return true;
-	}
-
       gcc_assert (!grouped_load && !mask && !bb_vinfo);
       /* If we have versioned for aliasing or the loop doesn't
 	 have any data dependencies that would preclude this,
 	 then we are sure this is a loop invariant load and
-	 thus we can insert it on the preheader edge.  */
+	 thus we can insert it on the preheader edge.
+	 TODO: hoist_defs_of_uses should ideally be computed
+	 once at analysis time, remembered and used in the
+	 transform time.  */
       bool hoist_p = (LOOP_VINFO_NO_DATA_DEPENDENCIES (loop_vinfo)
 		      && !nested_in_vect_loop
-		      && hoist_defs_of_uses (stmt_info, loop));
+		      && hoist_defs_of_uses (stmt_info, loop, !costing_p));
+      if (costing_p)
+	{
+	  enum vect_cost_model_location cost_loc
+	    = hoist_p ? vect_prologue : vect_body;
+	  unsigned int cost = record_stmt_cost (cost_vec, 1, scalar_load,
+						stmt_info, 0, cost_loc);
+	  cost += record_stmt_cost (cost_vec, 1, scalar_to_vec, stmt_info, 0,
+				    cost_loc);
+	  unsigned int prologue_cost = hoist_p ? cost : 0;
+	  unsigned int inside_cost = hoist_p ? 0 : cost;
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_NOTE, vect_location,
+			     "vect_model_load_cost: inside_cost = %d, "
+			     "prologue_cost = %d .\n",
+			     inside_cost, prologue_cost);
+	  return true;
+	}
       if (hoist_p)
 	{
 	  gassign *stmt = as_a <gassign *> (stmt_info->stmt);
--
2.31.1

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 5/9 v2] vect: Adjust vectorizable_load costing on VMAT_GATHER_SCATTER
  2023-06-13  2:03 ` [PATCH 5/9] vect: Adjust vectorizable_load costing on VMAT_GATHER_SCATTER Kewen Lin
@ 2023-07-03  3:01   ` Kewen.Lin
  0 siblings, 0 replies; 27+ messages in thread
From: Kewen.Lin @ 2023-07-03  3:01 UTC (permalink / raw)
  To: gcc-patches; +Cc: richard.guenther, richard.sandiford, segher, bergner

This is version v2, rebased on the latest trunk.

=====

This patch adjusts the cost handling on VMAT_GATHER_SCATTER
in function vectorizable_load.  We don't call function
vect_model_load_cost for it any more.

It mainly covers gather loads with an IFN and emulated gather
loads; it follows the handlings in function
vect_model_load_cost.  This patch shouldn't cause any
functional changes.

gcc/ChangeLog:

	* tree-vect-stmts.cc (vectorizable_load): Adjust the cost handling on
	VMAT_GATHER_SCATTER without calling vect_model_load_cost.
	(vect_model_load_cost): Adjust the assertion on VMAT_GATHER_SCATTER,
	remove VMAT_GATHER_SCATTER related handlings and the related parameter
	gs_info.
---
 gcc/tree-vect-stmts.cc | 124 +++++++++++++++++++++++++----------------
 1 file changed, 76 insertions(+), 48 deletions(-)

diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index 7d8e72bda67..1ae917db627 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -1132,11 +1132,10 @@ vect_model_load_cost (vec_info *vinfo,
 		      vect_memory_access_type memory_access_type,
 		      dr_alignment_support alignment_support_scheme,
 		      int misalignment,
-		      gather_scatter_info *gs_info,
 		      slp_tree slp_node,
 		      stmt_vector_for_cost *cost_vec)
 {
-  gcc_assert ((memory_access_type != VMAT_GATHER_SCATTER || !gs_info->decl)
+  gcc_assert (memory_access_type != VMAT_GATHER_SCATTER
 	      && memory_access_type != VMAT_INVARIANT
 	      && memory_access_type != VMAT_ELEMENTWISE
 	      && memory_access_type != VMAT_STRIDED_SLP);
@@ -1225,35 +1224,9 @@ vect_model_load_cost (vec_info *vinfo,
                          group_size);
     }

-  /* The loads themselves.  */
-  if (memory_access_type == VMAT_GATHER_SCATTER)
-    {
-      tree vectype = STMT_VINFO_VECTYPE (stmt_info);
-      unsigned int assumed_nunits = vect_nunits_for_cost (vectype);
-      if (memory_access_type == VMAT_GATHER_SCATTER
-	  && gs_info->ifn == IFN_LAST && !gs_info->decl)
-	/* For emulated gathers N offset vector element extracts
-	   (we assume the scalar scaling and ptr + offset add is consumed by
-	   the load).  */
-	inside_cost += record_stmt_cost (cost_vec, ncopies * assumed_nunits,
-					 vec_to_scalar, stmt_info, 0,
-					 vect_body);
-      /* N scalar loads plus gathering them into a vector.  */
-      inside_cost += record_stmt_cost (cost_vec,
-				       ncopies * assumed_nunits,
-				       scalar_load, stmt_info, 0, vect_body);
-    }
-  else
-    vect_get_load_cost (vinfo, stmt_info, ncopies,
-			alignment_support_scheme, misalignment, first_stmt_p,
-			&inside_cost, &prologue_cost,
-			cost_vec, cost_vec, true);
-
-  if (memory_access_type == VMAT_GATHER_SCATTER
-      && gs_info->ifn == IFN_LAST
-      && !gs_info->decl)
-    inside_cost += record_stmt_cost (cost_vec, ncopies, vec_construct,
-				     stmt_info, 0, vect_body);
+  vect_get_load_cost (vinfo, stmt_info, ncopies, alignment_support_scheme,
+		      misalignment, first_stmt_p, &inside_cost, &prologue_cost,
+		      cost_vec, cost_vec, true);

   if (dump_enabled_p ())
     dump_printf_loc (MSG_NOTE, vect_location,
@@ -10283,6 +10256,7 @@ vectorizable_load (vec_info *vinfo,
     }
   tree vec_mask = NULL_TREE;
   poly_uint64 group_elt = 0;
+  unsigned int inside_cost = 0;
   for (j = 0; j < ncopies; j++)
     {
       /* 1. Create the vector or array pointer update chain.  */
@@ -10414,23 +10388,26 @@ vectorizable_load (vec_info *vinfo,
 	  /* Record that VEC_ARRAY is now dead.  */
 	  vect_clobber_variable (vinfo, stmt_info, gsi, vec_array);
 	}
-      else if (!costing_p)
+      else
 	{
 	  for (i = 0; i < vec_num; i++)
 	    {
 	      tree final_mask = NULL_TREE;
-	      if (loop_masks
-		  && memory_access_type != VMAT_INVARIANT)
-		final_mask = vect_get_loop_mask (loop_vinfo, gsi, loop_masks,
-						 vec_num * ncopies,
-						 vectype, vec_num * j + i);
-	      if (vec_mask)
-		final_mask = prepare_vec_mask (loop_vinfo, mask_vectype,
-					       final_mask, vec_mask, gsi);
-
-	      if (i > 0 && !STMT_VINFO_GATHER_SCATTER_P (stmt_info))
-		dataref_ptr = bump_vector_ptr (vinfo, dataref_ptr, ptr_incr,
-					       gsi, stmt_info, bump);
+	      if (!costing_p)
+		{
+		  if (loop_masks && memory_access_type != VMAT_INVARIANT)
+		    final_mask
+		      = vect_get_loop_mask (loop_vinfo, gsi, loop_masks,
+					    vec_num * ncopies, vectype,
+					    vec_num * j + i);
+		  if (vec_mask)
+		    final_mask = prepare_vec_mask (loop_vinfo, mask_vectype,
+						   final_mask, vec_mask, gsi);
+
+		  if (i > 0 && !STMT_VINFO_GATHER_SCATTER_P (stmt_info))
+		    dataref_ptr = bump_vector_ptr (vinfo, dataref_ptr, ptr_incr,
+						   gsi, stmt_info, bump);
+		}

 	      /* 2. Create the vector-load in the loop.  */
 	      switch (alignment_support_scheme)
@@ -10444,6 +10421,16 @@ vectorizable_load (vec_info *vinfo,
 		    if (memory_access_type == VMAT_GATHER_SCATTER
 			&& gs_info.ifn != IFN_LAST)
 		      {
+			if (costing_p)
+			  {
+			    unsigned int cnunits
+			      = vect_nunits_for_cost (vectype);
+			    inside_cost
+			      = record_stmt_cost (cost_vec, cnunits,
+						  scalar_load, stmt_info, 0,
+						  vect_body);
+			    goto vec_num_loop_costing_end;
+			  }
 			if (STMT_VINFO_GATHER_SCATTER_P (stmt_info))
 			  vec_offset = vec_offsets[vec_num * j + i];
 			tree zero = build_zero_cst (vectype);
@@ -10468,6 +10455,25 @@ vectorizable_load (vec_info *vinfo,
 			gcc_assert (!final_mask);
 			unsigned HOST_WIDE_INT const_nunits
 			  = nunits.to_constant ();
+			if (costing_p)
+			  {
+			    /* For emulated gathers N offset vector element
+			       extracts (we assume the scalar scaling and
+			       ptr + offset add is consumed by the load).  */
+			    inside_cost
+			      = record_stmt_cost (cost_vec, const_nunits,
+						  vec_to_scalar, stmt_info, 0,
+						  vect_body);
+			    /* N scalar loads plus gathering them into a
+			       vector.  */
+			    inside_cost
+			      = record_stmt_cost (cost_vec, const_nunits,
+						  scalar_load, stmt_info, 0,
+						  vect_body);
+			    inside_cost
+			      = record_stmt_cost (cost_vec, 1, vec_construct,
+						  stmt_info, 0, vect_body);
+			    goto vec_num_loop_costing_end;
+			  }
 			unsigned HOST_WIDE_INT const_offset_nunits
 			  = TYPE_VECTOR_SUBPARTS (gs_info.offset_vectype)
 			      .to_constant ();
@@ -10520,6 +10526,9 @@ vectorizable_load (vec_info *vinfo,
 			break;
 		      }

+		    if (costing_p)
+		      goto vec_num_loop_costing_end;
+
 		    align =
 		      known_alignment (DR_TARGET_ALIGNMENT (first_dr_info));
 		    if (alignment_support_scheme == dr_aligned)
@@ -10734,6 +10743,8 @@ vectorizable_load (vec_info *vinfo,
 		  }
 		case dr_explicit_realign:
 		  {
+		    if (costing_p)
+		      goto vec_num_loop_costing_end;
 		    tree ptr, bump;

 		    tree vs = size_int (TYPE_VECTOR_SUBPARTS (vectype));
@@ -10796,6 +10807,8 @@ vectorizable_load (vec_info *vinfo,
 		  }
 		case dr_explicit_realign_optimized:
 		  {
+		    if (costing_p)
+		      goto vec_num_loop_costing_end;
 		    if (TREE_CODE (dataref_ptr) == SSA_NAME)
 		      new_temp = copy_ssa_name (dataref_ptr);
 		    else
@@ -10892,10 +10905,14 @@ vectorizable_load (vec_info *vinfo,
 						 gsi, stmt_info, bump);
 		  group_elt = 0;
 		}
+vec_num_loop_costing_end:
+	      ;
 	    }
 	  /* Bump the vector pointer to account for a gap or for excess
 	     elements loaded for a permuted SLP load.  */
-	  if (maybe_ne (group_gap_adj, 0U) && slp_perm)
+	  if (!costing_p
+	      && maybe_ne (group_gap_adj, 0U)
+	      && slp_perm)
 	    {
 	      poly_wide_int bump_val
 		= (wi::to_wide (TYPE_SIZE_UNIT (elem_type))
@@ -10944,9 +10961,20 @@ vectorizable_load (vec_info *vinfo,
     *vec_stmt = STMT_VINFO_VEC_STMTS (stmt_info)[0];

   if (costing_p)
-    vect_model_load_cost (vinfo, stmt_info, ncopies, vf, memory_access_type,
-			  alignment_support_scheme, misalignment, &gs_info,
-			  slp_node, cost_vec);
+    {
+      if (memory_access_type == VMAT_GATHER_SCATTER)
+	{
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_NOTE, vect_location,
+			     "vect_model_load_cost: inside_cost = %u, "
+			     "prologue_cost = 0 .\n",
+			     inside_cost);
+	}
+      else
+	vect_model_load_cost (vinfo, stmt_info, ncopies, vf, memory_access_type,
+			      alignment_support_scheme, misalignment, slp_node,
+			      cost_vec);
+    }

   return true;
 }
--
2.31.1

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 9/9 v2] vect: Adjust vectorizable_load costing on VMAT_CONTIGUOUS
  2023-06-13  2:03 ` [PATCH 9/9] vect: Adjust vectorizable_load costing on VMAT_CONTIGUOUS Kewen Lin
@ 2023-07-03  3:06   ` Kewen.Lin
  0 siblings, 0 replies; 27+ messages in thread
From: Kewen.Lin @ 2023-07-03  3:06 UTC (permalink / raw)
  To: gcc-patches; +Cc: richard.guenther, richard.sandiford, segher, bergner

This is version v2, rebased on the latest trunk.

=====

This patch adjusts the cost handling on VMAT_CONTIGUOUS in
function vectorizable_load.  We don't call function
vect_model_load_cost for it any more.  It also removes function
vect_model_load_cost, which has become useless and unreachable.

gcc/ChangeLog:

	* tree-vect-stmts.cc (vect_model_load_cost): Remove.
	(vectorizable_load): Adjust the cost handling on VMAT_CONTIGUOUS without
	calling vect_model_load_cost.
---
 gcc/tree-vect-stmts.cc | 94 ++++--------------------------------------
 1 file changed, 9 insertions(+), 85 deletions(-)

diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index 0a9a75ce3c7..9dfe1903181 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -1118,75 +1118,6 @@ vect_get_store_cost (vec_info *, stmt_vec_info stmt_info, int ncopies,
     }
 }

-
-/* Function vect_model_load_cost
-
-   Models cost for loads.  In the case of grouped accesses, one access has
-   the overhead of the grouped access attributed to it.  Since unaligned
-   accesses are supported for loads, we also account for the costs of the
-   access scheme chosen.  */
-
-static void
-vect_model_load_cost (vec_info *vinfo,
-		      stmt_vec_info stmt_info, unsigned ncopies, poly_uint64 vf,
-		      vect_memory_access_type memory_access_type,
-		      dr_alignment_support alignment_support_scheme,
-		      int misalignment,
-		      slp_tree slp_node,
-		      stmt_vector_for_cost *cost_vec)
-{
-  gcc_assert (memory_access_type == VMAT_CONTIGUOUS);
-
-  unsigned int inside_cost = 0, prologue_cost = 0;
-  bool grouped_access_p = STMT_VINFO_GROUPED_ACCESS (stmt_info);
-
-  gcc_assert (cost_vec);
-
-  /* ???  Somehow we need to fix this at the callers.  */
-  if (slp_node)
-    ncopies = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node);
-
-  if (slp_node && SLP_TREE_LOAD_PERMUTATION (slp_node).exists ())
-    {
-      /* If the load is permuted then the alignment is determined by
-	 the first group element not by the first scalar stmt DR.  */
-      stmt_vec_info first_stmt_info = DR_GROUP_FIRST_ELEMENT (stmt_info);
-      if (!first_stmt_info)
-	first_stmt_info = stmt_info;
-      /* Record the cost for the permutation.  */
-      unsigned n_perms, n_loads;
-      vect_transform_slp_perm_load (vinfo, slp_node, vNULL, NULL,
-				    vf, true, &n_perms, &n_loads);
-      inside_cost += record_stmt_cost (cost_vec, n_perms, vec_perm,
-				       first_stmt_info, 0, vect_body);
-
-      /* And adjust the number of loads performed.  This handles
-	 redundancies as well as loads that are later dead.  */
-      ncopies = n_loads;
-    }
-
-  /* Grouped loads read all elements in the group at once,
-     so we want the DR for the first statement.  */
-  stmt_vec_info first_stmt_info = stmt_info;
-  if (!slp_node && grouped_access_p)
-    first_stmt_info = DR_GROUP_FIRST_ELEMENT (stmt_info);
-
-  /* True if we should include any once-per-group costs as well as
-     the cost of the statement itself.  For SLP we only get called
-     once per group anyhow.  */
-  bool first_stmt_p = (first_stmt_info == stmt_info);
-
-  vect_get_load_cost (vinfo, stmt_info, ncopies, alignment_support_scheme,
-		      misalignment, first_stmt_p, &inside_cost, &prologue_cost,
-		      cost_vec, cost_vec, true);
-
-  if (dump_enabled_p ())
-    dump_printf_loc (MSG_NOTE, vect_location,
-                     "vect_model_load_cost: inside_cost = %d, "
-                     "prologue_cost = %d .\n", inside_cost, prologue_cost);
-}
-
-
 /* Calculate cost of DR's memory access.  */
 void
 vect_get_load_cost (vec_info *, stmt_vec_info stmt_info, int ncopies,
@@ -10830,7 +10761,8 @@ vectorizable_load (vec_info *vinfo,
 		     we only need to count it once for the whole group.  */
 		  bool first_stmt_info_p = first_stmt_info == stmt_info;
 		  bool add_realign_cost = first_stmt_info_p && i == 0;
-		  if (memory_access_type == VMAT_CONTIGUOUS_REVERSE
+		  if (memory_access_type == VMAT_CONTIGUOUS
+		      || memory_access_type == VMAT_CONTIGUOUS_REVERSE
 		      || (memory_access_type == VMAT_CONTIGUOUS_PERMUTE
 			  && (!grouped_load || first_stmt_info_p)))
 		    vect_get_load_cost (vinfo, stmt_info, 1,
@@ -10954,15 +10886,14 @@ vectorizable_load (vec_info *vinfo,
 	     direct vect_transform_slp_perm_load to DCE the unused parts.
 	     ???  This is a hack to prevent compile-time issues as seen
 	     in PR101120 and friends.  */
-	  if (costing_p
-	      && memory_access_type != VMAT_CONTIGUOUS)
+	  if (costing_p)
 	    {
 	      vect_transform_slp_perm_load (vinfo, slp_node, vNULL, nullptr, vf,
 					    true, &n_perms, nullptr);
 	      inside_cost = record_stmt_cost (cost_vec, n_perms, vec_perm,
 					      stmt_info, 0, vect_body);
 	    }
-	  else if (!costing_p)
+	  else
 	    {
 	      bool ok = vect_transform_slp_perm_load (vinfo, slp_node, dr_chain,
 						      gsi, vf, false, &n_perms,
@@ -11018,18 +10949,11 @@ vectorizable_load (vec_info *vinfo,
       gcc_assert (memory_access_type != VMAT_INVARIANT
 		  && memory_access_type != VMAT_ELEMENTWISE
 		  && memory_access_type != VMAT_STRIDED_SLP);
-      if (memory_access_type != VMAT_CONTIGUOUS)
-	{
-	  if (dump_enabled_p ())
-	    dump_printf_loc (MSG_NOTE, vect_location,
-			     "vect_model_load_cost: inside_cost = %u, "
-			     "prologue_cost = %u .\n",
-			     inside_cost, prologue_cost);
-	}
-      else
-	vect_model_load_cost (vinfo, stmt_info, ncopies, vf, memory_access_type,
-			      alignment_support_scheme, misalignment, slp_node,
-			      cost_vec);
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_NOTE, vect_location,
+			 "vect_model_load_cost: inside_cost = %u, "
+			 "prologue_cost = %u .\n",
+			 inside_cost, prologue_cost);
     }

   return true;
--
2.31.1

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 4/9] vect: Adjust vectorizable_load costing on VMAT_ELEMENTWISE and VMAT_STRIDED_SLP
  2023-07-02  8:58   ` Richard Sandiford
@ 2023-07-03  3:19     ` Kewen.Lin
  2023-07-22 15:58       ` Iain Sandoe
  0 siblings, 1 reply; 27+ messages in thread
From: Kewen.Lin @ 2023-07-03  3:19 UTC (permalink / raw)
  To: gcc-patches; +Cc: richard.guenther, segher, bergner, richard.sandiford

Hi Richard,

on 2023/7/2 16:58, Richard Sandiford wrote:
> Kewen Lin <linkw@linux.ibm.com> writes:
>> @@ -9743,11 +9739,23 @@ vectorizable_load (vec_info *vinfo,
>>        unsigned int n_groups = 0;
>>        for (j = 0; j < ncopies; j++)
>>  	{
>> -	  if (nloads > 1)
>> +	  if (nloads > 1 && !costing_p)
>>  	    vec_alloc (v, nloads);
>>  	  gimple *new_stmt = NULL;
>>  	  for (i = 0; i < nloads; i++)
>>  	    {
>> +	      if (costing_p)
>> +		{
>> +		  if (VECTOR_TYPE_P (ltype))
>> +		    vect_get_load_cost (vinfo, stmt_info, 1,
>> +					alignment_support_scheme, misalignment,
>> +					false, &inside_cost, nullptr, cost_vec,
>> +					cost_vec, true);
>> +		  else
>> +		    inside_cost += record_stmt_cost (cost_vec, 1, scalar_load,
>> +						     stmt_info, 0, vect_body);
>> +		  continue;
>> +		}
> 
> Just a note that this might make life harder for AArch64 costing.
> Strided SLP loads are still equivalent to vector loads for AArch64,
> since they happen on the FPR/vector side even if they have integral
> modes.
> 
> But I agree this is more accurate from a general target-independent POV,
> especially given the relatively coarse-grain costing enum.  So I think
> that's just something AArch64 will need to account for.

Sorry for the inconvenience.  I hope accounting for it in the target hook for
vect costing isn't too complicated.

BR,
Kewen

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/9] vect: Move costing next to the transform for vect load
  2023-06-30 11:37 ` Richard Biener
  2023-07-02  9:13   ` Richard Sandiford
@ 2023-07-03  3:39   ` Kewen.Lin
  2023-07-03  8:42     ` Richard Biener
  1 sibling, 1 reply; 27+ messages in thread
From: Kewen.Lin @ 2023-07-03  3:39 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches, richard.sandiford, segher, bergner

Hi Richi,

Thanks for your review comments on this and some others!

on 2023/6/30 19:37, Richard Biener wrote:
> On Tue, Jun 13, 2023 at 4:07 AM Kewen Lin <linkw@linux.ibm.com> wrote:
>>
>> This patch series follows Richi's suggestion at the link [1],
>> which suggest structuring vectorizable_load to make costing
>> next to the transform, in order to make it easier to keep
>> costing and the transform in sync.  For now, it's a known
>> issue that what we cost can be inconsistent with what we
>> transform, as the case in PR82255 and some other associated
>> test cases in the patches of this series show.
>>
>> Basically this patch series makes costing not call function
>> vect_model_load_cost any more.  To make the review and
>> bisection easy, I organized the changes according to the
>> memory access types of vector load.  For each memory access
>> type, firstly it follows the handlings in the function
>> vect_model_load_costto avoid any missing, then refines
>> further by referring to the transform code, I also checked
>> them with some typical test cases to verify.  Hope the
>> subjects of patches are clear enough.
>>
>> The whole series can be bootstrapped and regtested
>> incrementally on:
>>   - x86_64-redhat-linux
>>   - aarch64-linux-gnu
>>   - powerpc64-linux-gnu P7, P8 and P9
>>   - powerpc64le-linux-gnu P8, P9 and P10
>>
>> By considering the current vector test buckets are mainly
>> tested without cost model, I also verified the whole patch
>> series was neutral for SPEC2017 int/fp on Power9 at O2,
>> O3 and Ofast separately.
> 
> I went through the series now and I like it overall (well, I suggested
> the change).
> Looking at the changes I think we want some followup to reduce the
> mess in the final loop nest.  We already have some VMAT_* cases handled
> separately, maybe we can split out some more cases.  Maybe we should

At first glance, the simpler parts look to be the handlings for
VMAT_LOAD_STORE_LANES and VMAT_GATHER_SCATTER (both IFN and emulated).
It seems fairly straightforward if it's fine to duplicate the nested loop,
but we may need to take care to remove some then-useless code.

> bite the bullet and duplicate that loop nest for the different VMAT_* cases.
> Maybe we can merge some of the if (!costing_p) checks by clever
> re-ordering.

I've tried to merge them where possible, such as the place that checks
VMAT_CONTIGUOUS, VMAT_CONTIGUOUS_REVERSE and VMAT_CONTIGUOUS_PERMUTE,
but I'll keep this in mind for the following updates.

> So what
> this series doesn't improve is overall readability of the code (indent and our
> 80 char line limit).

Sorry about that.

> 
> The change also makes it more difficult(?) to separate analysis and transform
> though in the end I hope that analysis will actually "code generate" to a (SLP)
> data structure so the target will have a chance to see the actual flow of insns.
> 
> That said, I'd like to hear from Richard whether he thinks this is a step
> in the right direction.
> 
> Are you willing to followup with doing the same re-structuring to
> vectorizable_store?

Yes, vectorizable_store was also pointed out in your original suggestion [1];
I plan to update it once this series meets your expectations and gets landed.

> 
> OK from my side with the few comments addressed.  The patch likely needs refresh
> after the RVV changes in this area?

Thanks!  Yes, I've updated 2/9 and 3/9 according to your comments, and updated
5/9 and 9/9 as they had some conflicts when rebasing.  Re-testing is ongoing;
do the updated versions look good to you?  Is this series OK for trunk if all the
test runs go well again, as before?

BR,
Kewen

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/9] vect: Move costing next to the transform for vect load
  2023-07-03  3:39   ` Kewen.Lin
@ 2023-07-03  8:42     ` Richard Biener
  0 siblings, 0 replies; 27+ messages in thread
From: Richard Biener @ 2023-07-03  8:42 UTC (permalink / raw)
  To: Kewen.Lin; +Cc: gcc-patches, richard.sandiford, segher, bergner

On Mon, Jul 3, 2023 at 5:39 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
>
> Hi Richi,
>
> Thanks for your review comments on this and some others!
>
> on 2023/6/30 19:37, Richard Biener wrote:
> > On Tue, Jun 13, 2023 at 4:07 AM Kewen Lin <linkw@linux.ibm.com> wrote:
> >>
> >> This patch series follows Richi's suggestion at the link [1],
> >> which suggest structuring vectorizable_load to make costing
> >> next to the transform, in order to make it easier to keep
> >> costing and the transform in sync.  For now, it's a known
> >> issue that what we cost can be inconsistent with what we
> >> transform, as the case in PR82255 and some other associated
> >> test cases in the patches of this series show.
> >>
> >> Basically this patch series makes costing not call function
> >> vect_model_load_cost any more.  To make the review and
> >> bisection easy, I organized the changes according to the
> >> memory access types of vector load.  For each memory access
> >> type, firstly it follows the handlings in the function
> >> vect_model_load_costto avoid any missing, then refines
> >> further by referring to the transform code, I also checked
> >> them with some typical test cases to verify.  Hope the
> >> subjects of patches are clear enough.
> >>
> >> The whole series can be bootstrapped and regtested
> >> incrementally on:
> >>   - x86_64-redhat-linux
> >>   - aarch64-linux-gnu
> >>   - powerpc64-linux-gnu P7, P8 and P9
> >>   - powerpc64le-linux-gnu P8, P9 and P10
> >>
> >> By considering the current vector test buckets are mainly
> >> tested without cost model, I also verified the whole patch
> >> series was neutral for SPEC2017 int/fp on Power9 at O2,
> >> O3 and Ofast separately.
> >
> > I went through the series now and I like it overall (well, I suggested
> > the change).
> > Looking at the changes I think we want some followup to reduce the
> > mess in the final loop nest.  We already have some VMAT_* cases handled
> > separately, maybe we can split out some more cases.  Maybe we should
>
> At first glance, the simple parts look to be the handlings for
> VMAT_LOAD_STORE_LANES, and VMAT_GATHER_SCATTER (with ifn and emulated).
> It seems a bit straightforward if it's fine to duplicate the nested loop,
> but may need to care about removing some useless code.
>
> > bite the bullet and duplicate that loop nest for the different VMAT_* cases.
> > Maybe we can merge some of the if (!costing_p) checks by clever
> > re-ordering.
>
> I've tried a bit to merge them if possible, like the place to check
> VMAT_CONTIGUOUS, VMAT_CONTIGUOUS_REVERSE and VMAT_CONTIGUOUS_PERMUTE.
> But will keep in mind for the following updates.
>
> > So what
> > this series doesn't improve is overall readability of the code (indent and our
> > 80 char line limit).
>
> Sorry about that.
>
> >
> > The change also makes it more difficult(?) to separate analysis and transform
> > though in the end I hope that analysis will actually "code generate" to a (SLP)
> > data structure so the target will have a chance to see the actual flow of insns.
> >
> > That said, I'd like to hear from Richard whether he thinks this is a step
> > in the right direction.
> >
> > Are you willing to followup with doing the same re-structuring to
> > vectorizable_store?
>
> Yes, vectorizable_store was also pointed out in your original suggestion [1],
> I planned to update this once this series meets your expectations and gets landed.
>
> >
> > OK from my side with the few comments addressed.  The patch likely needs refresh
> > after the RVV changes in this area?
>
> Thanks!  Yes, I've updated 2/9 and 3/9 according to your comments, and updated
> 5/9 and 9/9 as they had some conflicts when rebasing.  Re-testing is ongoing,
> do the updated versions look good to you?  Is this series ok for trunk if all the
> test runs go well again as before?

Yes.

Thanks,
Richard.

> BR,
> Kewen

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 4/9] vect: Adjust vectorizable_load costing on VMAT_ELEMENTWISE and VMAT_STRIDED_SLP
  2023-07-03  3:19     ` Kewen.Lin
@ 2023-07-22 15:58       ` Iain Sandoe
  2023-07-24  1:50         ` Kewen.Lin
  0 siblings, 1 reply; 27+ messages in thread
From: Iain Sandoe @ 2023-07-22 15:58 UTC (permalink / raw)
  To: Kewen.Lin; +Cc: GCC Patches, Segher Boessenkool

Hi Kewen,

This patch breaks bootstrap on powerpc-darwin (which has Altivec, but not VSX) while building libgfortran.

> On 3 Jul 2023, at 04:19, Kewen.Lin via Gcc-patches <gcc-patches@gcc.gnu.org> wrote:

Please see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110776
thanks
Iain


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 4/9] vect: Adjust vectorizable_load costing on VMAT_ELEMENTWISE and VMAT_STRIDED_SLP
  2023-07-22 15:58       ` Iain Sandoe
@ 2023-07-24  1:50         ` Kewen.Lin
  0 siblings, 0 replies; 27+ messages in thread
From: Kewen.Lin @ 2023-07-24  1:50 UTC (permalink / raw)
  To: Iain Sandoe; +Cc: GCC Patches, Segher Boessenkool

Hi Iain,

on 2023/7/22 23:58, Iain Sandoe wrote:
> Hi Kewen,
> 
> This patch breaks bootstrap on powerpc-darwin (which has Altivec, but not VSX) while building libgfortran.
> 
>> On 3 Jul 2023, at 04:19, Kewen.Lin via Gcc-patches <gcc-patches@gcc.gnu.org> wrote:
> 
> Please see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110776
> thanks
> Iain
> 

Thanks for reporting!  I'll have a look at it.


BR,
Kewen

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2023-07-24  1:50 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-06-13  2:03 [PATCH 0/9] vect: Move costing next to the transform for vect load Kewen Lin
2023-06-13  2:03 ` [PATCH 1/9] vect: Move vect_model_load_cost next to the transform in vectorizable_load Kewen Lin
2023-06-13  2:03 ` [PATCH 2/9] vect: Adjust vectorizable_load costing on VMAT_GATHER_SCATTER && gs_info.decl Kewen Lin
2023-06-30 11:11   ` Richard Biener
2023-07-03  2:57     ` [PATCH 2/9 v2] " Kewen.Lin
2023-06-13  2:03 ` [PATCH 3/9] vect: Adjust vectorizable_load costing on VMAT_INVARIANT Kewen Lin
2023-06-30 11:18   ` Richard Biener
2023-07-03  2:58     ` [PATCH 3/9 v2] " Kewen.Lin
2023-06-13  2:03 ` [PATCH 4/9] vect: Adjust vectorizable_load costing on VMAT_ELEMENTWISE and VMAT_STRIDED_SLP Kewen Lin
2023-07-02  8:58   ` Richard Sandiford
2023-07-03  3:19     ` Kewen.Lin
2023-07-22 15:58       ` Iain Sandoe
2023-07-24  1:50         ` Kewen.Lin
2023-06-13  2:03 ` [PATCH 5/9] vect: Adjust vectorizable_load costing on VMAT_GATHER_SCATTER Kewen Lin
2023-07-03  3:01   ` [PATCH 5/9 v2] " Kewen.Lin
2023-06-13  2:03 ` [PATCH 6/9] vect: Adjust vectorizable_load costing on VMAT_LOAD_STORE_LANES Kewen Lin
2023-06-13  2:03 ` [PATCH 7/9] vect: Adjust vectorizable_load costing on VMAT_CONTIGUOUS_REVERSE Kewen Lin
2023-06-13  2:03 ` [PATCH 8/9] vect: Adjust vectorizable_load costing on VMAT_CONTIGUOUS_PERMUTE Kewen Lin
2023-06-14  8:17   ` Hongtao Liu
2023-06-19  7:23     ` Kewen.Lin
2023-06-13  2:03 ` [PATCH 9/9] vect: Adjust vectorizable_load costing on VMAT_CONTIGUOUS Kewen Lin
2023-07-03  3:06   ` [PATCH 9/9 v2] " Kewen.Lin
2023-06-26  6:00 ` [PATCH 0/9] vect: Move costing next to the transform for vect load Kewen.Lin
2023-06-30 11:37 ` Richard Biener
2023-07-02  9:13   ` Richard Sandiford
2023-07-03  3:39   ` Kewen.Lin
2023-07-03  8:42     ` Richard Biener
