public inbox for gcc-patches@gcc.gnu.org
* [PATCH 1/2] PR gcc/98350:Add a param to control the length of the chain with FMA in reassoc pass
@ 2023-05-11 10:12 Cui, Lili
  2023-05-11 10:12 ` [PATCH 2/2] Add a tune option to control the length of the chain with FMA Cui, Lili
  2023-05-11 10:52 ` [PATCH 1/2] PR gcc/98350:Add a param to control the length of the chain with FMA in reassoc pass Richard Biener
  0 siblings, 2 replies; 10+ messages in thread
From: Cui, Lili @ 2023-05-11 10:12 UTC (permalink / raw)
  To: gcc-patches; +Cc: Lili Cui

From: Lili Cui <lili.cui@intel.com>

Hi,

These two patches add a param to control the length of FMA chains in the
reassoc pass and a corresponding tuning option in the backend.

Bootstrapped and regtested. Ok for trunk?

Regards
Lili.

Add a param for FMA chains in the reassoc pass to make them more friendly to
the later fma pass.  First, detect whether the chain can generate more than
two FMAs; if so and param_reassoc_max_chain_length_with_fma is enabled, we
rearrange the ops so that they can be combined into more FMAs.  When the
chain length exceeds param_reassoc_max_chain_length_with_fma, build parallel
chains according to the given association width and try to preserve FMA
opportunities as much as possible.

TEST1:

float
foo (float a, float b, float c, float d, float *e)
{
   return  *e  + a * b + c * d ;
}

For -Ofast -march=icelake-server  GCC generates:
        vmulss  %xmm3, %xmm2, %xmm2
        vfmadd132ss     %xmm1, %xmm2, %xmm0
        vaddss  (%rdi), %xmm0, %xmm0
        ret

with "--param=reassoc-max-chain-length-with-fma=3" GCC generates:
        vfmadd213ss   (%rdi), %xmm1, %xmm0
        vfmadd231ss   %xmm2, %xmm3, %xmm0
        ret
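
When the chain length exceeds the param, the ops are instead split into
parallel sub-chains while keeping each mult-fed pair together, as in the
sketch from the new function's comment (association width 2):

   e + f + g + a * b + c * d;

   ssa1 = e + f;
   ssa2 = g + a * b;
   ssa3 = ssa1 + c * d;
   ssa4 = ssa2 + ssa3;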

gcc/ChangeLog:

	PR gcc/98350
	* params.opt (reassoc-max-chain-length-with-fma): New param.
	* tree-ssa-reassoc.cc
	(rewrite_expr_tree_parallel_for_fma): New.
	(rank_ops_for_fma): Ditto.
	(reassociate_bb): Handle new function.

gcc/testsuite/ChangeLog:

	PR gcc/98350
	* gcc.dg/pr98350-1.c: New test.
	* gcc.dg/pr98350-2.c: Ditto.
---
 gcc/params.opt                   |   4 +
 gcc/testsuite/gcc.dg/pr98350-1.c |  31 +++++
 gcc/testsuite/gcc.dg/pr98350-2.c |  17 +++
 gcc/tree-ssa-reassoc.cc          | 228 ++++++++++++++++++++++++++++---
 4 files changed, 264 insertions(+), 16 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/pr98350-1.c
 create mode 100644 gcc/testsuite/gcc.dg/pr98350-2.c

diff --git a/gcc/params.opt b/gcc/params.opt
index 823cdb2ff85..f7c719afe64 100644
--- a/gcc/params.opt
+++ b/gcc/params.opt
@@ -1182,4 +1182,8 @@ The maximum factor which the loop vectorizer applies to the cost of statements i
 Common Joined UInteger Var(param_vect_induction_float) Init(1) IntegerRange(0, 1) Param Optimization
 Enable loop vectorization of floating point inductions.
 
+-param=reassoc-max-chain-length-with-fma=
+Common Joined UInteger Var(param_reassoc_max_chain_length_with_fma) Init(1) IntegerRange(1, 65536) Param Optimization
+The maximum chain length with fma considered in reassociation pass.
+
 ; This comment is to ensure we retain the blank line above.
diff --git a/gcc/testsuite/gcc.dg/pr98350-1.c b/gcc/testsuite/gcc.dg/pr98350-1.c
new file mode 100644
index 00000000000..32ecce13a2d
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr98350-1.c
@@ -0,0 +1,31 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast -mfpmath=sse -mfma --param=reassoc-max-chain-length-with-fma=7 -Wno-attributes " } */
+
+/* Test that the compiler properly optimizes multiply and add 
+   to generate more FMA instructions.  */
+#define N 1024
+double a[N];
+double b[N];
+double c[N];
+double d[N];
+double e[N];
+double f[N];
+double g[N];
+double h[N];
+double j[N];
+double k[N];
+double l[N];
+double m[N];
+double o[N];
+double p[N];
+
+
+void
+foo (void)
+{
+  for (int i = 0; i < N; i++)
+  {
+    a[i] += b[i] * c[i] + d[i] * e[i] + f[i] * g[i] + h[i] * j[i] + k[i] * l[i] + m[i]* o[i] + p[i];
+  }
+}
+/* { dg-final { scan-assembler-times "vfm" 6  } } */
diff --git a/gcc/testsuite/gcc.dg/pr98350-2.c b/gcc/testsuite/gcc.dg/pr98350-2.c
new file mode 100644
index 00000000000..246025d43b8
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr98350-2.c
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast -mfpmath=sse -mfma --param=reassoc-max-chain-length-with-fma=6 -Wno-attributes " } */
+
+/* Test that the compiler properly builds parallel chains according to the given
+   association width and tries to keep FMA opportunities as much as possible.  */
+#define N 33
+double a[N];
+
+void
+foo (void)
+{
+  a[32] = a[0] *a[1] + a[2] * a[3] + a[4] * a[5] + a[6] * a[7] + a[8] * a[9]
+    + a[10] * a[11] + a[12] * a[13] + a[14] * a[15] + a[16] * a[17]
+    + a[18] * a[19] + a[20] * a[21] + a[22] * a[23] + a[24] + a[25]
+    + a[26] + a[27] + a[28] + a[29] + a[30] + a[31];
+}
+/* { dg-final { scan-assembler-times "vfm" 12  } } */
diff --git a/gcc/tree-ssa-reassoc.cc b/gcc/tree-ssa-reassoc.cc
index 067a3f07f7e..6d2e158c4f5 100644
--- a/gcc/tree-ssa-reassoc.cc
+++ b/gcc/tree-ssa-reassoc.cc
@@ -54,6 +54,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-ssa-reassoc.h"
 #include "tree-ssa-math-opts.h"
 #include "gimple-range.h"
+#include "internal-fn.h"
 
 /*  This is a simple global reassociation pass.  It is, in part, based
     on the LLVM pass of the same name (They do some things more/less
@@ -5468,6 +5469,114 @@ get_reassociation_width (int ops_num, enum tree_code opc,
   return width;
 }
 
+/* Rewrite statements in the dependency chain with regard to the chance to
+   generate FMA.  When the dependency chain length exceeds
+   param_reassoc_max_chain_length_with_fma, build parallel chains according to
+   given association width and try to keep fma opportunity as much as possible.
+   E.g.
+   e + f + g + a * b + c * d;
+
+   ssa1 = e + f;
+   ssa2 = g + a * b;
+   ssa3 = ssa1 + c * d;
+   ssa4 = ssa2 + ssa3;
+
+   This reassociation approach preserves the chance of fma generation as much
+   as possible.  */
+static void
+rewrite_expr_tree_parallel_for_fma (gassign *stmt, int width,
+					 const vec<operand_entry *> &ops)
+{
+  enum tree_code opcode = gimple_assign_rhs_code (stmt);
+  int op_num = ops.length ();
+  gcc_assert (op_num > 0);
+  int stmt_num = op_num - 1;
+  gimple **stmts = XALLOCAVEC (gimple *, stmt_num);
+  int op_index = op_num - 1;
+  int width_count = width;
+  int i = 0, j = 0;
+  tree tmp_op[2], op1;
+  operand_entry *oe;
+  gimple *stmt1 = NULL;
+  tree last_rhs1 = gimple_assign_rhs1 (stmt);
+
+  /* We start expression rewriting from the top statements.
+     So, in this loop we create a full list of statements
+     we will work with.  */
+  stmts[stmt_num - 1] = stmt;
+  for (i = stmt_num - 2; i >= 0; i--)
+    stmts[i] = SSA_NAME_DEF_STMT (gimple_assign_rhs1 (stmts[i+1]));
+
+  /* Build parallel FMA dependency chain according to width.  */
+  for (i = 0; i < width; i++)
+    {
+      for (j = 0; j < 2; j++)
+	{
+	  oe = ops[op_index--];
+	  tmp_op[j] = oe->op;
+	  stmt1 = oe->stmt_to_insert;
+	  if (stmt1)
+	    insert_stmt_before_use (stmts[i], stmt1);
+	  stmt1 = NULL;
+	}
+      stmts[i] = build_and_add_sum (TREE_TYPE (last_rhs1), tmp_op[1], tmp_op[0], opcode);
+      gimple_set_visited (stmts[i], true);
+
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	{
+	  fprintf (dump_file, " into ");
+	  print_gimple_stmt (dump_file, stmts[i], 0);
+	}
+    }
+
+  for (i = width; i < stmt_num; i++)
+    {
+      /* We keep original statement only for the last one.  All others are
+	 recreated.  */
+      if ( op_index < 0)
+	{
+	  if (width_count == 2)
+	    {
+
+	      gimple_assign_set_rhs1 (stmts[i], gimple_assign_lhs (stmts[i-1]));
+	      gimple_assign_set_rhs2 (stmts[i], gimple_assign_lhs (stmts[i-2]));
+	    }
+	  else
+	    {
+
+	      stmts[i] =
+		build_and_add_sum (TREE_TYPE (last_rhs1),
+				   gimple_assign_lhs (stmts[i-width_count]),
+				   gimple_assign_lhs (stmts[i-width_count+1]),
+				   opcode);
+	      width_count--;
+	    }
+	  update_stmt (stmts[i]);
+	}
+      else
+	{
+	  oe = ops[op_index--];
+	  op1 = oe->op;
+	  stmt1 = oe->stmt_to_insert;
+	  if (stmt1)
+	    insert_stmt_before_use (stmts[i], stmt1);
+	  stmt1 = NULL;
+	  stmts[i] = build_and_add_sum (TREE_TYPE (last_rhs1),
+				  gimple_assign_lhs (stmts[i-width]),
+				  op1,
+				  opcode);
+	  gimple_set_visited (stmts[i], true);
+  }
+
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	{
+	  fprintf (dump_file, " into ");
+	  print_gimple_stmt (dump_file, stmts[i], 0);
+	}
+    }
+  remove_visited_stmt_chain (last_rhs1);
+}
+
 /* Recursively rewrite our linearized statements so that the operators
    match those in OPS[OPINDEX], putting the computation in rank
    order and trying to allow operations to be executed in
@@ -6649,6 +6758,64 @@ transform_stmt_to_multiply (gimple_stmt_iterator *gsi, gimple *stmt,
     }
 }
 
+/* Rearrange ops to generate more FMAs when the chain may have more than 2 FMAs.
+   Putting ops that are not defined by a mult in front can generate more FMAs.
+   E.g.
+   a * b + c * d + e generates:
+
+   _4  = c_9(D) * d_10(D);
+   _12 = .FMA (a_7(D), b_8(D), _4);
+   _11 = e_6(D) + _12;
+
+   Rearranging the ops to e + a * b + c * d generates:
+
+   _4  = .FMA (c_7(D), d_8(D), _3);
+   _11 = .FMA (a_5(D), b_6(D), _4);
+ */
+static bool
+rank_ops_for_fma (vec<operand_entry *> *ops)
+{
+  operand_entry *oe;
+  unsigned int i;
+  auto_vec<operand_entry *> ops_mult;
+  auto_vec<operand_entry *> ops_others;
+
+  FOR_EACH_VEC_ELT (*ops, i, oe)
+    {
+      if (TREE_CODE (oe->op) == SSA_NAME)
+	{
+	  gimple *def_stmt = SSA_NAME_DEF_STMT (oe->op);
+	  if (is_gimple_assign (def_stmt)
+	      && gimple_assign_rhs_code (def_stmt) == MULT_EXPR)
+	    ops_mult.safe_push (oe);
+	  else
+	    ops_others.safe_push (oe);
+	}
+      else
+	ops_others.safe_push (oe);
+    }
+  /* When ops_mult.length () >= 2, as in the following case,
+
+     a * b + c * d + e,
+
+     we need to rearrange the ops:
+
+     putting ops that are not defined by a mult in front can generate more FMAs.  */
+  if (ops_mult.length () >= 2)
+    {
+      /* If all ops are defined with mult, we don't need to rearrange them.  */
+      if (ops_mult.length () != ops->length ())
+	{
+	  ops->truncate (0);
+	  FOR_EACH_VEC_ELT (ops_mult, i, oe)
+	    ops->safe_push (oe);
+	  FOR_EACH_VEC_ELT (ops_others, i, oe)
+	    ops->safe_push (oe);
+	}
+      return true;
+    }
+  return false;
+}
 /* Reassociate expressions in basic block BB and its post-dominator as
    children.
 
@@ -6813,6 +6980,7 @@ reassociate_bb (basic_block bb)
 		  machine_mode mode = TYPE_MODE (TREE_TYPE (lhs));
 		  int ops_num = ops.length ();
 		  int width;
+		  bool keep_fma_chain = false;
 
 		  /* For binary bit operations, if there are at least 3
 		     operands and the last operand in OPS is a constant,
@@ -6826,36 +6994,64 @@ reassociate_bb (basic_block bb)
 		      && TREE_CODE (ops.last ()->op) == INTEGER_CST)
 		    std::swap (*ops[0], *ops[ops_num - 1]);
 
+		  optimization_type opt_type = bb_optimization_type (bb);
+
+		  /* When enabling param_reassoc_max_chain_length_with_fma to
+		     keep the chain with fma, rank_ops_for_fma will detect if
+		     the chain has fmas and if so it will rearrange the ops.  */
+		  if (param_reassoc_max_chain_length_with_fma > 1
+		      && direct_internal_fn_supported_p (IFN_FMA,
+							 TREE_TYPE (lhs),
+							 opt_type)
+		      && (rhs_code == PLUS_EXPR || rhs_code == MINUS_EXPR))
+		    {
+		      keep_fma_chain = rank_ops_for_fma(&ops);
+		    }
+
+		  int len = ops.length ();
 		  /* Only rewrite the expression tree to parallel in the
 		     last reassoc pass to avoid useless work back-and-forth
 		     with initial linearization.  */
 		  if (!reassoc_insert_powi_p
-		      && ops.length () > 3
+		      && len > 3
+		      && (!keep_fma_chain
+			  || (keep_fma_chain
+			      && len > param_reassoc_max_chain_length_with_fma))
 		      && (width = get_reassociation_width (ops_num, rhs_code,
-							   mode)) > 1)
+							    mode)) > 1)
 		    {
-		      if (dump_file && (dump_flags & TDF_DETAILS))
-			fprintf (dump_file,
-				 "Width = %d was chosen for reassociation\n",
-				 width);
-		      rewrite_expr_tree_parallel (as_a <gassign *> (stmt),
-						  width, ops);
+		      if (keep_fma_chain)
+			{
+			  if (dump_file && (dump_flags & TDF_DETAILS))
+			    fprintf (dump_file,
+				     "Break chain len = %d into width for FMA\n",
+				     len);
+			  rewrite_expr_tree_parallel_for_fma
+			    (as_a <gassign *> (stmt), width, ops);
+			}
+		      else
+			{
+			  if (dump_file && (dump_flags & TDF_DETAILS))
+			    fprintf (dump_file,
+				     "Width = %d was chosen for reassociation\n",
+				     width);
+			  rewrite_expr_tree_parallel (as_a <gassign *> (stmt),
+						      width, ops);
+			}
 		    }
 		  else
-                    {
-                      /* When there are three operands left, we want
-                         to make sure the ones that get the double
-                         binary op are chosen wisely.  */
-                      int len = ops.length ();
-                      if (len >= 3)
+		    {
+		      /* When there are three operands left, we want
+			 to make sure the ones that get the double
+			 binary op are chosen wisely.  */
+		      if (len >= 3 && !keep_fma_chain)
 			swap_ops_for_binary_stmt (ops, len - 3);
 
 		      new_lhs = rewrite_expr_tree (stmt, rhs_code, 0, ops,
 						   powi_result != NULL
 						   || negate_result,
 						   len != orig_len);
-                    }
-
+		    }
 		  /* If we combined some repeated factors into a 
 		     __builtin_powi call, multiply that result by the
 		     reassociated operands.  */
-- 
2.25.1



* [PATCH 2/2] Add a tune option to control the length of the chain with FMA
  2023-05-11 10:12 [PATCH 1/2] PR gcc/98350:Add a param to control the length of the chain with FMA in reassoc pass Cui, Lili
@ 2023-05-11 10:12 ` Cui, Lili
  2023-05-11 10:56   ` Richard Biener
  2023-05-11 10:52 ` [PATCH 1/2] PR gcc/98350:Add a param to control the length of the chain with FMA in reassoc pass Richard Biener
  1 sibling, 1 reply; 10+ messages in thread
From: Cui, Lili @ 2023-05-11 10:12 UTC (permalink / raw)
  To: gcc-patches; +Cc: Lili Cui

From: Lili Cui <lili.cui@intel.com>

Set the length of the chain with FMA to 5 for icelake_cost.

With this patch applied,
SPR multi-copy: 508.namd_r increased by 3%
ICX multi-copy: 508.namd_r increased by 3.5%,
                507.cactuBSSN_r increased by 3.7%

Using FMA instead of mult + add reduces register pressure and instructions
retired.

gcc/ChangeLog:

        * config/i386/i386-options.cc (ix86_option_override_internal):
        Set param_reassoc_max_chain_length_with_fma.
        * config/i386/i386.h (struct processor_costs): Add new tune parameters.
        * config/i386/x86-tune-costs.h (struct processor_costs): Set
	reassoc_max_chain_length_with_fma to 5 for icelake.

gcc/testsuite/ChangeLog:

	* gcc.target/i386/fma-chain.c: New test.
---
 gcc/config/i386/i386-options.cc           |  2 ++
 gcc/config/i386/i386.h                    |  3 ++
 gcc/config/i386/x86-tune-costs.h          | 35 +++++++++++++++++++++++
 gcc/testsuite/gcc.target/i386/fma-chain.c | 11 +++++++
 4 files changed, 51 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/i386/fma-chain.c

diff --git a/gcc/config/i386/i386-options.cc b/gcc/config/i386/i386-options.cc
index 2cb0bddcd35..67d35d89d91 100644
--- a/gcc/config/i386/i386-options.cc
+++ b/gcc/config/i386/i386-options.cc
@@ -2684,6 +2684,8 @@ ix86_option_override_internal (bool main_args_p,
 		       ix86_tune_cost->l1_cache_size);
   SET_OPTION_IF_UNSET (opts, opts_set, param_l2_cache_size,
 		       ix86_tune_cost->l2_cache_size);
+  SET_OPTION_IF_UNSET (opts, opts_set, param_reassoc_max_chain_length_with_fma,
+		       ix86_tune_cost->reassoc_max_chain_length_with_fma);
 
   /* 64B is the accepted value for these for all x86.  */
   SET_OPTION_IF_UNSET (&global_options, &global_options_set,
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index c7439f89bdf..c7fa7312a67 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -206,6 +206,9 @@ struct processor_costs {
 				   to number of instructions executed in
 				   parallel.  See also
 				   ix86_reassociation_width.  */
+  const int reassoc_max_chain_length_with_fma;
+				/* Specify max reassociation chain length with
+				   FMA.  */
   struct stringop_algs *memcpy, *memset;
   const int cond_taken_branch_cost;    /* Cost of taken branch for vectorizer
 					  cost model.  */
diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
index 4f7a67ca5c5..1f57a5ee2a7 100644
--- a/gcc/config/i386/x86-tune-costs.h
+++ b/gcc/config/i386/x86-tune-costs.h
@@ -127,6 +127,7 @@ struct processor_costs ix86_size_cost = {/* costs for tuning for size */
   COSTS_N_BYTES (2),			/* cost of SQRTSS instruction.  */
   COSTS_N_BYTES (2),			/* cost of SQRTSD instruction.  */
   1, 1, 1, 1,				/* reassoc int, fp, vec_int, vec_fp.  */
+  1,					/* Reassoc max FMA chain length.  */
   ix86_size_memcpy,
   ix86_size_memset,
   COSTS_N_BYTES (1),			/* cond_taken_branch_cost.  */
@@ -238,6 +239,7 @@ struct processor_costs i386_cost = {	/* 386 specific costs */
   COSTS_N_INSNS (122),			/* cost of SQRTSS instruction.  */
   COSTS_N_INSNS (122),			/* cost of SQRTSD instruction.  */
   1, 1, 1, 1,				/* reassoc int, fp, vec_int, vec_fp.  */
+  1,					/* Reassoc max FMA chain length.  */
   i386_memcpy,
   i386_memset,
   COSTS_N_INSNS (3),			/* cond_taken_branch_cost.  */
@@ -350,6 +352,7 @@ struct processor_costs i486_cost = {	/* 486 specific costs */
   COSTS_N_INSNS (83),			/* cost of SQRTSS instruction.  */
   COSTS_N_INSNS (83),			/* cost of SQRTSD instruction.  */
   1, 1, 1, 1,				/* reassoc int, fp, vec_int, vec_fp.  */
+  1,					/* Reassoc max FMA chain length.  */
   i486_memcpy,
   i486_memset,
   COSTS_N_INSNS (3),			/* cond_taken_branch_cost.  */
@@ -460,6 +463,7 @@ struct processor_costs pentium_cost = {
   COSTS_N_INSNS (70),			/* cost of SQRTSS instruction.  */
   COSTS_N_INSNS (70),			/* cost of SQRTSD instruction.  */
   1, 1, 1, 1,				/* reassoc int, fp, vec_int, vec_fp.  */
+  1,					/* Reassoc max FMA chain length.  */
   pentium_memcpy,
   pentium_memset,
   COSTS_N_INSNS (3),			/* cond_taken_branch_cost.  */
@@ -563,6 +567,7 @@ struct processor_costs lakemont_cost = {
   COSTS_N_INSNS (31),			/* cost of SQRTSS instruction.  */
   COSTS_N_INSNS (63),			/* cost of SQRTSD instruction.  */
   1, 1, 1, 1,				/* reassoc int, fp, vec_int, vec_fp.  */
+  1,					/* Reassoc max FMA chain length.  */
   pentium_memcpy,
   pentium_memset,
   COSTS_N_INSNS (3),			/* cond_taken_branch_cost.  */
@@ -681,6 +686,7 @@ struct processor_costs pentiumpro_cost = {
   COSTS_N_INSNS (31),			/* cost of SQRTSS instruction.  */
   COSTS_N_INSNS (31),			/* cost of SQRTSD instruction.  */
   1, 1, 1, 1,				/* reassoc int, fp, vec_int, vec_fp.  */
+  1,					/* Reassoc max FMA chain length.  */
   pentiumpro_memcpy,
   pentiumpro_memset,
   COSTS_N_INSNS (3),			/* cond_taken_branch_cost.  */
@@ -790,6 +796,7 @@ struct processor_costs geode_cost = {
   COSTS_N_INSNS (54),			/* cost of SQRTSS instruction.  */
   COSTS_N_INSNS (54),			/* cost of SQRTSD instruction.  */
   1, 1, 1, 1,				/* reassoc int, fp, vec_int, vec_fp.  */
+  1,					/* Reassoc max FMA chain length.  */
   geode_memcpy,
   geode_memset,
   COSTS_N_INSNS (3),			/* cond_taken_branch_cost.  */
@@ -902,6 +909,7 @@ struct processor_costs k6_cost = {
   COSTS_N_INSNS (56),			/* cost of SQRTSS instruction.  */
   COSTS_N_INSNS (56),			/* cost of SQRTSD instruction.  */
   1, 1, 1, 1,				/* reassoc int, fp, vec_int, vec_fp.  */
+  1,					/* Reassoc max FMA chain length.  */
   k6_memcpy,
   k6_memset,
   COSTS_N_INSNS (3),			/* cond_taken_branch_cost.  */
@@ -1015,6 +1023,7 @@ struct processor_costs athlon_cost = {
   COSTS_N_INSNS (19),			/* cost of SQRTSS instruction.  */
   COSTS_N_INSNS (19),			/* cost of SQRTSD instruction.  */
   1, 1, 1, 1,				/* reassoc int, fp, vec_int, vec_fp.  */
+  1,					/* Reassoc max FMA chain length.  */
   athlon_memcpy,
   athlon_memset,
   COSTS_N_INSNS (3),			/* cond_taken_branch_cost.  */
@@ -1137,6 +1146,7 @@ struct processor_costs k8_cost = {
   COSTS_N_INSNS (19),			/* cost of SQRTSS instruction.  */
   COSTS_N_INSNS (27),			/* cost of SQRTSD instruction.  */
   1, 1, 1, 1,				/* reassoc int, fp, vec_int, vec_fp.  */
+  1,					/* Reassoc max FMA chain length.  */
   k8_memcpy,
   k8_memset,
   COSTS_N_INSNS (3),			/* cond_taken_branch_cost.  */
@@ -1267,6 +1277,7 @@ struct processor_costs amdfam10_cost = {
   COSTS_N_INSNS (19),			/* cost of SQRTSS instruction.  */
   COSTS_N_INSNS (27),			/* cost of SQRTSD instruction.  */
   1, 1, 1, 1,				/* reassoc int, fp, vec_int, vec_fp.  */
+  1,					/* Reassoc max FMA chain length.  */
   amdfam10_memcpy,
   amdfam10_memset,
   COSTS_N_INSNS (2),			/* cond_taken_branch_cost.  */
@@ -1390,6 +1401,7 @@ const struct processor_costs bdver_cost = {
   COSTS_N_INSNS (15),			/* cost of SQRTSS instruction.  */
   COSTS_N_INSNS (26),			/* cost of SQRTSD instruction.  */
   1, 2, 1, 1,				/* reassoc int, fp, vec_int, vec_fp.  */
+  1,					/* Reassoc max FMA chain length.  */
   bdver_memcpy,
   bdver_memset,
   COSTS_N_INSNS (4),			/* cond_taken_branch_cost.  */
@@ -1545,6 +1557,7 @@ struct processor_costs znver1_cost = {
      plus/minus operations per cycle but only one multiply.  This is adjusted
      in ix86_reassociation_width.  */
   4, 4, 3, 6,				/* reassoc int, fp, vec_int, vec_fp.  */
+  1,					/* Reassoc max FMA chain length.  */
   znver1_memcpy,
   znver1_memset,
   COSTS_N_INSNS (4),			/* cond_taken_branch_cost.  */
@@ -1704,6 +1717,7 @@ struct processor_costs znver2_cost = {
      plus/minus operations per cycle but only one multiply.  This is adjusted
      in ix86_reassociation_width.  */
   4, 4, 3, 6,				/* reassoc int, fp, vec_int, vec_fp.  */
+  1,					/* Reassoc max FMA chain length.  */
   znver2_memcpy,
   znver2_memset,
   COSTS_N_INSNS (4),			/* cond_taken_branch_cost.  */
@@ -1838,6 +1852,7 @@ struct processor_costs znver3_cost = {
      plus/minus operations per cycle but only one multiply.  This is adjusted
      in ix86_reassociation_width.  */
   4, 4, 3, 6,				/* reassoc int, fp, vec_int, vec_fp.  */
+  1,					/* Reassoc max FMA chain length.  */
   znver2_memcpy,
   znver2_memset,
   COSTS_N_INSNS (4),			/* cond_taken_branch_cost.  */
@@ -1974,6 +1989,7 @@ struct processor_costs znver4_cost = {
      plus/minus operations per cycle but only one multiply.  This is adjusted
      in ix86_reassociation_width.  */
   4, 4, 3, 6,				/* reassoc int, fp, vec_int, vec_fp.  */
+  1,					/* Reassoc max FMA chain length.  */
   znver2_memcpy,
   znver2_memset,
   COSTS_N_INSNS (4),			/* cond_taken_branch_cost.  */
@@ -2100,6 +2116,7 @@ struct processor_costs skylake_cost = {
   COSTS_N_INSNS (12),			/* cost of SQRTSS instruction.  */
   COSTS_N_INSNS (18),			/* cost of SQRTSD instruction.  */
   1, 4, 2, 2,				/* reassoc int, fp, vec_int, vec_fp.  */
+  1,					/* Reassoc max FMA chain length.  */
   skylake_memcpy,
   skylake_memset,
   COSTS_N_INSNS (3),			/* cond_taken_branch_cost.  */
@@ -2228,6 +2245,12 @@ struct processor_costs icelake_cost = {
   COSTS_N_INSNS (12),			/* cost of SQRTSS instruction.  */
   COSTS_N_INSNS (18),			/* cost of SQRTSD instruction.  */
   1, 4, 2, 2,				/* reassoc int, fp, vec_int, vec_fp.  */
+  /* Icelake-server prefers FMA chains over breaking the dependencies into
+     mult + add, which reduces instructions retired.  A value of 1 means do
+     not keep the FMA chain.  A value greater than 1 keeps the FMA chain;
+     when the actual chain length exceeds this value, the chain is split
+     according to the given reassociation width.  */
+  5,					/* Reassoc max FMA chain length.  */
   icelake_memcpy,
   icelake_memset,
   COSTS_N_INSNS (3),			/* cond_taken_branch_cost.  */
@@ -2350,6 +2373,7 @@ struct processor_costs alderlake_cost = {
   COSTS_N_INSNS (14),			/* cost of SQRTSS instruction.  */
   COSTS_N_INSNS (18),			/* cost of SQRTSD instruction.  */
   1, 4, 3, 3,				/* reassoc int, fp, vec_int, vec_fp.  */
+  1,					/* Reassoc max FMA chain length.  */
   alderlake_memcpy,
   alderlake_memset,
   COSTS_N_INSNS (4),			/* cond_taken_branch_cost.  */
@@ -2465,6 +2489,7 @@ const struct processor_costs btver1_cost = {
   COSTS_N_INSNS (14),			/* cost of SQRTSS instruction.  */
   COSTS_N_INSNS (48),			/* cost of SQRTSD instruction.  */
   1, 1, 1, 1,				/* reassoc int, fp, vec_int, vec_fp.  */
+  1,					/* Reassoc max FMA chain length.  */
   btver1_memcpy,
   btver1_memset,
   COSTS_N_INSNS (2),			/* cond_taken_branch_cost.  */
@@ -2577,6 +2602,7 @@ const struct processor_costs btver2_cost = {
   COSTS_N_INSNS (16),			/* cost of SQRTSS instruction.  */
   COSTS_N_INSNS (21),			/* cost of SQRTSD instruction.  */
   1, 1, 1, 1,				/* reassoc int, fp, vec_int, vec_fp.  */
+  1,					/* Reassoc max FMA chain length.  */
   btver2_memcpy,
   btver2_memset,
   COSTS_N_INSNS (2),			/* cond_taken_branch_cost.  */
@@ -2688,6 +2714,7 @@ struct processor_costs pentium4_cost = {
   COSTS_N_INSNS (23),			/* cost of SQRTSS instruction.  */
   COSTS_N_INSNS (38),			/* cost of SQRTSD instruction.  */
   1, 1, 1, 1,				/* reassoc int, fp, vec_int, vec_fp.  */
+  1,					/* Reassoc max FMA chain length.  */
   pentium4_memcpy,
   pentium4_memset,
   COSTS_N_INSNS (3),			/* cond_taken_branch_cost.  */
@@ -2802,6 +2829,7 @@ struct processor_costs nocona_cost = {
   COSTS_N_INSNS (32),			/* cost of SQRTSS instruction.  */
   COSTS_N_INSNS (41),			/* cost of SQRTSD instruction.  */
   1, 1, 1, 1,				/* reassoc int, fp, vec_int, vec_fp.  */
+  1,					/* Reassoc max FMA chain length.  */
   nocona_memcpy,
   nocona_memset,
   COSTS_N_INSNS (3),			/* cond_taken_branch_cost.  */
@@ -2914,6 +2942,7 @@ struct processor_costs atom_cost = {
   COSTS_N_INSNS (31),			/* cost of SQRTSS instruction.  */
   COSTS_N_INSNS (63),			/* cost of SQRTSD instruction.  */
   2, 2, 2, 2,				/* reassoc int, fp, vec_int, vec_fp.  */
+  1,					/* Reassoc max FMA chain length.  */
   atom_memcpy,
   atom_memset,
   COSTS_N_INSNS (3),			/* cond_taken_branch_cost.  */
@@ -3026,6 +3055,7 @@ struct processor_costs slm_cost = {
   COSTS_N_INSNS (20),			/* cost of SQRTSS instruction.  */
   COSTS_N_INSNS (35),			/* cost of SQRTSD instruction.  */
   1, 2, 1, 1,				/* reassoc int, fp, vec_int, vec_fp.  */
+  1,					/* Reassoc max FMA chain length.  */
   slm_memcpy,
   slm_memset,
   COSTS_N_INSNS (3),			/* cond_taken_branch_cost.  */
@@ -3152,6 +3182,7 @@ struct processor_costs tremont_cost = {
   COSTS_N_INSNS (14),			/* cost of SQRTSS instruction.  */
   COSTS_N_INSNS (18),			/* cost of SQRTSD instruction.  */
   1, 4, 3, 3,				/* reassoc int, fp, vec_int, vec_fp.  */
+  1,					/* Reassoc max FMA chain length.  */
   tremont_memcpy,
   tremont_memset,
   COSTS_N_INSNS (4),			/* cond_taken_branch_cost.  */
@@ -3264,6 +3295,7 @@ struct processor_costs intel_cost = {
   COSTS_N_INSNS (40),			/* cost of SQRTSS instruction.  */
   COSTS_N_INSNS (40),			/* cost of SQRTSD instruction.  */
   1, 4, 1, 1,				/* reassoc int, fp, vec_int, vec_fp.  */
+  1,					/* Reassoc max FMA chain length.  */
   intel_memcpy,
   intel_memset,
   COSTS_N_INSNS (3),			/* cond_taken_branch_cost.  */
@@ -3381,6 +3413,7 @@ struct processor_costs lujiazui_cost = {
   COSTS_N_INSNS (32),			/* cost of SQRTSS instruction.  */
   COSTS_N_INSNS (60),			/* cost of SQRTSD instruction.  */
   1, 4, 3, 3,				/* reassoc int, fp, vec_int, vec_fp.  */
+  1,					/* Reassoc max FMA chain length.  */
   lujiazui_memcpy,
   lujiazui_memset,
   COSTS_N_INSNS (4),			/* cond_taken_branch_cost.  */
@@ -3502,6 +3535,7 @@ struct processor_costs generic_cost = {
   COSTS_N_INSNS (14),			/* cost of SQRTSS instruction.  */
   COSTS_N_INSNS (18),			/* cost of SQRTSD instruction.  */
   1, 4, 3, 3,				/* reassoc int, fp, vec_int, vec_fp.  */
+  1,					/* Reassoc max FMA chain length.  */
   generic_memcpy,
   generic_memset,
   COSTS_N_INSNS (4),			/* cond_taken_branch_cost.  */
@@ -3630,6 +3664,7 @@ struct processor_costs core_cost = {
   COSTS_N_INSNS (30),			/* cost of SQRTSS instruction.  */
   COSTS_N_INSNS (58),			/* cost of SQRTSD instruction.  */
   1, 4, 2, 2,				/* reassoc int, fp, vec_int, vec_fp.  */
+  1,					/* Reassoc max FMA chain length.  */
   core_memcpy,
   core_memset,
   COSTS_N_INSNS (3),			/* cond_taken_branch_cost.  */
diff --git a/gcc/testsuite/gcc.target/i386/fma-chain.c b/gcc/testsuite/gcc.target/i386/fma-chain.c
new file mode 100644
index 00000000000..9de61f1b6ff
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/fma-chain.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast -march=icelake-server -Wno-attributes " } */
+
+/* Test that the compiler properly optimizes multiply and add
+   to generate more FMA instructions.  */
+float
+foo (float a, float b, float c, float d, float e, float f, float g, float h, float j)
+{
+   return a * b + c * d + e * f + g * h + j;
+}
+/* { dg-final { scan-assembler-times "vfm" 4 } } */
-- 
2.25.1



* Re: [PATCH 1/2] PR gcc/98350:Add a param to control the length of the chain with FMA in reassoc pass
  2023-05-11 10:12 [PATCH 1/2] PR gcc/98350:Add a param to control the length of the chain with FMA in reassoc pass Cui, Lili
  2023-05-11 10:12 ` [PATCH 2/2] Add a tune option to control the length of the chain with FMA Cui, Lili
@ 2023-05-11 10:52 ` Richard Biener
  2023-05-11 15:18   ` Cui, Lili
  1 sibling, 1 reply; 10+ messages in thread
From: Richard Biener @ 2023-05-11 10:52 UTC (permalink / raw)
  To: Cui, Lili; +Cc: gcc-patches

On Thu, May 11, 2023 at 12:13 PM Cui, Lili via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> From: Lili Cui <lili.cui@intel.com>
>
> Hi,
>
> These two patches add a param to control the length of FMA chains in the
> reassoc pass and a corresponding tuning option in the backend.
>
> Bootstrapped and regtested. Ok for trunk?
>
> Regards
> Lili.
>
> Add a param for FMA chains in the reassoc pass to make them more friendly to
> the later fma pass.  First, detect whether the chain can generate more than
> two FMAs; if so and param_reassoc_max_chain_length_with_fma is enabled, we
> rearrange the ops so that they can be combined into more FMAs.  When the
> chain length exceeds param_reassoc_max_chain_length_with_fma, build parallel
> chains according to the given association width and try to preserve FMA
> opportunities as much as possible.
>
> TEST1:
>
> float
> foo (float a, float b, float c, float d, float *e)
> {
>    return  *e  + a * b + c * d ;
> }
>
> For -Ofast -march=icelake-server  GCC generates:
>         vmulss  %xmm3, %xmm2, %xmm2
>         vfmadd132ss     %xmm1, %xmm2, %xmm0
>         vaddss  (%rdi), %xmm0, %xmm0
>         ret
>
> with "--param=reassoc-max-chain-length-with-fma=3" GCC generates:
>         vfmadd213ss   (%rdi), %xmm1, %xmm0
>         vfmadd231ss   %xmm2, %xmm3, %xmm0
>         ret
>
> gcc/ChangeLog:
>
>         PR gcc/98350
>         * params.opt (reassoc-max-chain-length-with-fma): New param.
>         * tree-ssa-reassoc.cc
>         (rewrite_expr_tree_parallel_for_fma): New.
>         (rank_ops_for_fma): Ditto.
>         (reassociate_bb): Handle new function.
>
> gcc/testsuite/ChangeLog:
>
>         PR gcc/98350
>         * gcc.dg/pr98350-1.c: New test.
>         * gcc.dg/pr98350-2.c: Ditto.
> ---
>  gcc/params.opt                   |   4 +
>  gcc/testsuite/gcc.dg/pr98350-1.c |  31 +++++
>  gcc/testsuite/gcc.dg/pr98350-2.c |  17 +++
>  gcc/tree-ssa-reassoc.cc          | 228 ++++++++++++++++++++++++++++---
>  4 files changed, 264 insertions(+), 16 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/pr98350-1.c
>  create mode 100644 gcc/testsuite/gcc.dg/pr98350-2.c
>
> diff --git a/gcc/params.opt b/gcc/params.opt
> index 823cdb2ff85..f7c719afe64 100644
> --- a/gcc/params.opt
> +++ b/gcc/params.opt
> @@ -1182,4 +1182,8 @@ The maximum factor which the loop vectorizer applies to the cost of statements i
>  Common Joined UInteger Var(param_vect_induction_float) Init(1) IntegerRange(0, 1) Param Optimization
>  Enable loop vectorization of floating point inductions.
>
> +-param=reassoc-max-chain-length-with-fma=
> +Common Joined UInteger Var(param_reassoc_max_chain_length_with_fma) Init(1) IntegerRange(1, 65536) Param Optimization
> +The maximum chain length with fma considered in reassociation pass.
> +
>  ; This comment is to ensure we retain the blank line above.
> diff --git a/gcc/testsuite/gcc.dg/pr98350-1.c b/gcc/testsuite/gcc.dg/pr98350-1.c
> new file mode 100644
> index 00000000000..32ecce13a2d
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/pr98350-1.c
> @@ -0,0 +1,31 @@
> +/* { dg-do compile } */
> +/* { dg-options "-Ofast -mfpmath=sse -mfma --param=reassoc-max-chain-length-with-fma=7 -Wno-attributes " } */
> +
> +/* Test that the compiler properly optimizes multiply and add
> +   to generate more FMA instructions.  */
> +#define N 1024
> +double a[N];
> +double b[N];
> +double c[N];
> +double d[N];
> +double e[N];
> +double f[N];
> +double g[N];
> +double h[N];
> +double j[N];
> +double k[N];
> +double l[N];
> +double m[N];
> +double o[N];
> +double p[N];
> +
> +
> +void
> +foo (void)
> +{
> +  for (int i = 0; i < N; i++)
> +  {
> +    a[i] += b[i] * c[i] + d[i] * e[i] + f[i] * g[i] + h[i] * j[i] + k[i] * l[i] + m[i]* o[i] + p[i];
> +  }
> +}
> +/* { dg-final { scan-assembler-times "vfm" 6  } } */
> diff --git a/gcc/testsuite/gcc.dg/pr98350-2.c b/gcc/testsuite/gcc.dg/pr98350-2.c
> new file mode 100644
> index 00000000000..246025d43b8
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/pr98350-2.c
> @@ -0,0 +1,17 @@
> +/* { dg-do compile } */
> +/* { dg-options "-Ofast -mfpmath=sse -mfma --param=reassoc-max-chain-length-with-fma=6 -Wno-attributes " } */
> +
> +/* Test that the compiler properly builds parallel chains according to the given
> +   association width and tries to keep FMA opportunities as much as possible.  */
> +#define N 33
> +double a[N];
> +
> +void
> +foo (void)
> +{
> +  a[32] = a[0] *a[1] + a[2] * a[3] + a[4] * a[5] + a[6] * a[7] + a[8] * a[9]
> +    + a[10] * a[11] + a[12] * a[13] + a[14] * a[15] + a[16] * a[17]
> +    + a[18] * a[19] + a[20] * a[21] + a[22] * a[23] + a[24] + a[25]
> +    + a[26] + a[27] + a[28] + a[29] + a[30] + a[31];
> +}
> +/* { dg-final { scan-assembler-times "vfm" 12  } } */
> diff --git a/gcc/tree-ssa-reassoc.cc b/gcc/tree-ssa-reassoc.cc
> index 067a3f07f7e..6d2e158c4f5 100644
> --- a/gcc/tree-ssa-reassoc.cc
> +++ b/gcc/tree-ssa-reassoc.cc
> @@ -54,6 +54,7 @@ along with GCC; see the file COPYING3.  If not see
>  #include "tree-ssa-reassoc.h"
>  #include "tree-ssa-math-opts.h"
>  #include "gimple-range.h"
> +#include "internal-fn.h"
>
>  /*  This is a simple global reassociation pass.  It is, in part, based
>      on the LLVM pass of the same name (They do some things more/less
> @@ -5468,6 +5469,114 @@ get_reassociation_width (int ops_num, enum tree_code opc,
>    return width;
>  }
>
> +/* Rewrite statements in the dependency chain with regard to the chance to
> +   generate FMA.  When the dependency chain length exceeds
> +   param_reassoc_max_chain_length_with_fma, build parallel chains according to
> +   given association width and try to keep fma opportunity as much as possible.
> +   E.g.
> +   e + f + g + a * b + c * d;
> +
> +   ssa1 = e + f;
> +   ssa2 = g + a * b;
> +   ssa3 = ssa1 + c * d;
> +   ssa4 = ssa2 + ssa3;
> +
> +   This reassociation approach preserves the chance of fma generation as much
> +   as possible.  */
> +static void
> +rewrite_expr_tree_parallel_for_fma (gassign *stmt, int width,
> +                                        const vec<operand_entry *> &ops)
> +{
> +  enum tree_code opcode = gimple_assign_rhs_code (stmt);
> +  int op_num = ops.length ();
> +  gcc_assert (op_num > 0);
> +  int stmt_num = op_num - 1;
> +  gimple **stmts = XALLOCAVEC (gimple *, stmt_num);
> +  int op_index = op_num - 1;
> +  int width_count = width;
> +  int i = 0, j = 0;
> +  tree tmp_op[2], op1;
> +  operand_entry *oe;
> +  gimple *stmt1 = NULL;
> +  tree last_rhs1 = gimple_assign_rhs1 (stmt);
> +
> +  /* We start expression rewriting from the top statements.
> +     So, in this loop we create a full list of statements
> +     we will work with.  */
> +  stmts[stmt_num - 1] = stmt;
> +  for (i = stmt_num - 2; i >= 0; i--)
> +    stmts[i] = SSA_NAME_DEF_STMT (gimple_assign_rhs1 (stmts[i+1]));
> +
> +  /* Build parallel FMA dependency chain according to width.  */
> +  for (i = 0; i < width; i++)
> +    {
> +      for (j = 0; j < 2; j++)
> +       {
> +         oe = ops[op_index--];
> +         tmp_op[j] = oe->op;
> +         stmt1 = oe->stmt_to_insert;
> +         if (stmt1)
> +           insert_stmt_before_use (stmts[i], stmt1);
> +         stmt1 = NULL;
> +       }
> +      stmts[i] = build_and_add_sum (TREE_TYPE (last_rhs1), tmp_op[1], tmp_op[0], opcode);
> +      gimple_set_visited (stmts[i], true);
> +
> +      if (dump_file && (dump_flags & TDF_DETAILS))
> +       {
> +         fprintf (dump_file, " into ");
> +         print_gimple_stmt (dump_file, stmts[i], 0);
> +       }
> +    }
> +
> +  for (i = width; i < stmt_num; i++)
> +    {
> +      /* We keep original statement only for the last one.  All others are
> +        recreated.  */
> +      if ( op_index < 0)
> +       {
> +         if (width_count == 2)
> +           {
> +
> +             gimple_assign_set_rhs1 (stmts[i], gimple_assign_lhs (stmts[i-1]));
> +             gimple_assign_set_rhs2 (stmts[i], gimple_assign_lhs (stmts[i-2]));
> +           }
> +         else
> +           {
> +
> +             stmts[i] =
> +               build_and_add_sum (TREE_TYPE (last_rhs1),
> +                                  gimple_assign_lhs (stmts[i-width_count]),
> +                                  gimple_assign_lhs (stmts[i-width_count+1]),
> +                                  opcode);
> +             width_count--;
> +           }
> +         update_stmt (stmts[i]);
> +       }
> +      else
> +       {
> +         oe = ops[op_index--];
> +         op1 = oe->op;
> +         stmt1 = oe->stmt_to_insert;
> +         if (stmt1)
> +           insert_stmt_before_use (stmts[i], stmt1);
> +         stmt1 = NULL;
> +         stmts[i] = build_and_add_sum (TREE_TYPE (last_rhs1),
> +                                 gimple_assign_lhs (stmts[i-width]),
> +                                 op1,
> +                                 opcode);
> +         gimple_set_visited (stmts[i], true);
> +  }
> +
> +      if (dump_file && (dump_flags & TDF_DETAILS))
> +       {
> +         fprintf (dump_file, " into ");
> +         print_gimple_stmt (dump_file, stmts[i], 0);
> +       }
> +    }
> +  remove_visited_stmt_chain (last_rhs1);
> +}
> +
>  /* Recursively rewrite our linearized statements so that the operators
>     match those in OPS[OPINDEX], putting the computation in rank
>     order and trying to allow operations to be executed in
> @@ -6649,6 +6758,64 @@ transform_stmt_to_multiply (gimple_stmt_iterator *gsi, gimple *stmt,
>      }
>  }
>
> +/* Rearrange ops to generate more FMAs when the chain may have more than 2 FMAs.
> +   Putting ops that are not defined by a mult in front can generate more FMAs.
> +   E.g.
> +   a * b + c * d + e generates:
> +
> +   _4  = c_9(D) * d_10(D);
> +   _12 = .FMA (a_7(D), b_8(D), _4);
> +   _11 = e_6(D) + _12;
> +
> +   Rearranging the ops to e + a * b + c * d generates:
> +
> +   _4  = .FMA (c_7(D), d_8(D), _3);
> +   _11 = .FMA (a_5(D), b_6(D), _4);
> + */
> +static bool
> +rank_ops_for_fma (vec<operand_entry *> *ops)
> +{
> +  operand_entry *oe;
> +  unsigned int i;
> +  auto_vec<operand_entry *> ops_mult;
> +  auto_vec<operand_entry *> ops_others;
> +
> +  FOR_EACH_VEC_ELT (*ops, i, oe)
> +    {
> +      if (TREE_CODE (oe->op) == SSA_NAME)
> +       {
> +         gimple *def_stmt = SSA_NAME_DEF_STMT (oe->op);
> +         if (is_gimple_assign (def_stmt)
> +             && gimple_assign_rhs_code (def_stmt) == MULT_EXPR)
> +           ops_mult.safe_push (oe);
> +         else
> +           ops_others.safe_push (oe);
> +       }
> +      else
> +       ops_others.safe_push (oe);
> +    }
> +  /* When ops_mult.length () >= 2, as in the following case,
> +
> +     a * b + c * d + e,
> +
> +     we need to rearrange the ops:
> +
> +     putting ops that are not defined by a mult in front can generate more FMAs.  */
> +  if (ops_mult.length () >= 2)
> +    {
> +      /* If all ops are defined with mult, we don't need to rearrange them.  */
> +      if (ops_mult.length () != ops->length ())
> +       {
> +         ops->truncate (0);
> +         FOR_EACH_VEC_ELT (ops_mult, i, oe)
> +           ops->safe_push (oe);

As you are not changing the number of ops you should be able to use
quick_push here and below.  You should be able to do

 ops->splice (ops_mult);
 ops->splice (ops_others);

as well.
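
A minimal sketch of that simplification (assuming vec::splice appends all
elements of the source vector; an illustration, not the committed patch):

  if (ops_mult.length () >= 2
      && ops_mult.length () != ops->length ())
    {
      /* Rebuild OPS with the mult-fed entries first.  */
      ops->truncate (0);
      ops->splice (ops_mult);
      ops->splice (ops_others);
    }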

> +         FOR_EACH_VEC_ELT (ops_others, i, oe)
> +           ops->safe_push (oe);
> +       }
> +      return true;
> +    }
> +  return false;
> +}
>  /* Reassociate expressions in basic block BB and its post-dominator as
>     children.
>
> @@ -6813,6 +6980,7 @@ reassociate_bb (basic_block bb)
>                   machine_mode mode = TYPE_MODE (TREE_TYPE (lhs));
>                   int ops_num = ops.length ();
>                   int width;
> +                 bool keep_fma_chain = false;
>
>                   /* For binary bit operations, if there are at least 3
>                      operands and the last operand in OPS is a constant,
> @@ -6826,36 +6994,64 @@ reassociate_bb (basic_block bb)
>                       && TREE_CODE (ops.last ()->op) == INTEGER_CST)
>                     std::swap (*ops[0], *ops[ops_num - 1]);
>
> +                 optimization_type opt_type = bb_optimization_type (bb);
> +
> +                 /* When enabling param_reassoc_max_chain_length_with_fma to
> +                    keep the chain with fma, rank_ops_for_fma will detect if
> +                    the chain has fmas and if so it will rearrange the ops.  */
> +                 if (param_reassoc_max_chain_length_with_fma > 1
> +                     && direct_internal_fn_supported_p (IFN_FMA,
> +                                                        TREE_TYPE (lhs),
> +                                                        opt_type)
> +                     && (rhs_code == PLUS_EXPR || rhs_code == MINUS_EXPR))
> +                   {
> +                     keep_fma_chain = rank_ops_for_fma(&ops);
> +                   }
> +
> +                 int len = ops.length ();
>                   /* Only rewrite the expression tree to parallel in the
>                      last reassoc pass to avoid useless work back-and-forth
>                      with initial linearization.  */

We are doing the parallel rewrite only in the last reassoc pass; I think it
makes sense to do the same for reassoc-for-fma.

Why do the existing expr rewrites not work after re-sorting the ops?

>                   if (!reassoc_insert_powi_p
> -                     && ops.length () > 3
> +                     && len > 3
> +                     && (!keep_fma_chain
> +                         || (keep_fma_chain
> +                             && len > param_reassoc_max_chain_length_with_fma))

In the case len < param_reassoc_max_chain_length_with_fma we have the
chain re-sorted but fall through to the non-parallel rewrite.  I wonder if we
do not want to instead adjust the reassociation width?  I'd say it depends on
the number of mult cases in the chain (sth the re-sorting could have computed).
Why do we have two completely independent --params here?  Can you give an
example --param value combination that makes "sense" and show how it is
beneficial?

>                       && (width = get_reassociation_width (ops_num, rhs_code,
> -                                                          mode)) > 1)
> +                                                           mode)) > 1)
>                     {
> -                     if (dump_file && (dump_flags & TDF_DETAILS))
> -                       fprintf (dump_file,
> -                                "Width = %d was chosen for reassociation\n",
> -                                width);
> -                     rewrite_expr_tree_parallel (as_a <gassign *> (stmt),
> -                                                 width, ops);
> +                     if (keep_fma_chain)
> +                       {
> +                         if (dump_file && (dump_flags & TDF_DETAILS))
> +                           fprintf (dump_file,
> +                                    "Break chain len = %d into width for FMA\n",
> +                                    len);
> +                         rewrite_expr_tree_parallel_for_fma
> +                           (as_a <gassign *> (stmt), width, ops);
> +                       }
> +                     else
> +                       {
> +                         if (dump_file && (dump_flags & TDF_DETAILS))
> +                           fprintf (dump_file,
> +                                    "Width = %d was chosen for reassociation\n",
> +                                    width);
> +                         rewrite_expr_tree_parallel (as_a <gassign *> (stmt),
> +                                                     width, ops);
> +                       }
>                     }
>                   else
> -                    {
> -                      /* When there are three operands left, we want
> -                         to make sure the ones that get the double
> -                         binary op are chosen wisely.  */
> -                      int len = ops.length ();
> -                      if (len >= 3)
> +                   {
> +                     /* When there are three operands left, we want
> +                        to make sure the ones that get the double
> +                        binary op are chosen wisely.  */
> +                     if (len >= 3 && !keep_fma_chain)
>                         swap_ops_for_binary_stmt (ops, len - 3);
>
>                       new_lhs = rewrite_expr_tree (stmt, rhs_code, 0, ops,
>                                                    powi_result != NULL
>                                                    || negate_result,
>                                                    len != orig_len);
> -                    }
> -
> +                   }
>                   /* If we combined some repeated factors into a
>                      __builtin_powi call, multiply that result by the
>                      reassociated operands.  */
> --
> 2.25.1
>


* Re: [PATCH 2/2] Add a tune option to control the length of the chain with FMA
  2023-05-11 10:12 ` [PATCH 2/2] Add a tune option to control the length of the chain with FMA Cui, Lili
@ 2023-05-11 10:56   ` Richard Biener
  0 siblings, 0 replies; 10+ messages in thread
From: Richard Biener @ 2023-05-11 10:56 UTC (permalink / raw)
  To: Cui, Lili; +Cc: gcc-patches

On Thu, May 11, 2023 at 12:13 PM Cui, Lili via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> From: Lili Cui <lili.cui@intel.com>
>
> Set the length of the chain with FMA to 5 for icelake_cost.
>
> With this patch applied,
> SPR multi-copy: 508.namd_r increased by 3%
> ICX multi-copy: 508.namd_r increased by 3.5%,
>                 507.cactuBSSN_r increased by 3.7%
>
> Using FMA instead of mult + add reduces register pressure and instructions
> retired.

I would say it would make more sense to use the existing reassoc_width
hook and based on the opcode specify the number of adds vs. mults
(where I guess all subarchs have #mults equal to the #fmas) that can
be carried out in parallel?

That means for the reassoc patch shouldn't we simply query
the PLUS and MULT reassoc width and compute something from that
instead of adding another --param?

Richard.
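
A hypothetical sketch of that direction (the names and the combining
heuristic are illustrative, not part of either patch): query the existing
reassociation widths for the add and the mult opcode and derive the decision
from them rather than introducing a separate --param:

  int add_width = get_reassociation_width (ops_num, PLUS_EXPR, mode);
  int mul_width = get_reassociation_width (ops_num, MULT_EXPR, mode);
  /* Keep the chain as FMAs e.g. when the target cannot issue the multiplies
     any wider than the additions (purely illustrative heuristic).  */
  bool keep_fma_chain = mul_width <= add_width;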

> gcc/ChangeLog:
>
>         * config/i386/i386-options.cc (ix86_option_override_internal):
>         Set param_reassoc_max_chain_length_with_fma.
>         * config/i386/i386.h (struct processor_costs): Add new tune parameters.
>         * config/i386/x86-tune-costs.h (struct processor_costs): Set
>         reassoc_max_chain_length_with_fma to 5 for icelake.
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.target/i386/fma-chain.c: New test.
> ---
>  gcc/config/i386/i386-options.cc           |  2 ++
>  gcc/config/i386/i386.h                    |  3 ++
>  gcc/config/i386/x86-tune-costs.h          | 35 +++++++++++++++++++++++
>  gcc/testsuite/gcc.target/i386/fma-chain.c | 11 +++++++
>  4 files changed, 51 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/i386/fma-chain.c
>
> diff --git a/gcc/config/i386/i386-options.cc b/gcc/config/i386/i386-options.cc
> index 2cb0bddcd35..67d35d89d91 100644
> --- a/gcc/config/i386/i386-options.cc
> +++ b/gcc/config/i386/i386-options.cc
> @@ -2684,6 +2684,8 @@ ix86_option_override_internal (bool main_args_p,
>                        ix86_tune_cost->l1_cache_size);
>    SET_OPTION_IF_UNSET (opts, opts_set, param_l2_cache_size,
>                        ix86_tune_cost->l2_cache_size);
> +  SET_OPTION_IF_UNSET (opts, opts_set, param_reassoc_max_chain_length_with_fma,
> +                      ix86_tune_cost->reassoc_max_chain_length_with_fma);
>
>    /* 64B is the accepted value for these for all x86.  */
>    SET_OPTION_IF_UNSET (&global_options, &global_options_set,
> diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
> index c7439f89bdf..c7fa7312a67 100644
> --- a/gcc/config/i386/i386.h
> +++ b/gcc/config/i386/i386.h
> @@ -206,6 +206,9 @@ struct processor_costs {
>                                    to number of instructions executed in
>                                    parallel.  See also
>                                    ix86_reassociation_width.  */
> +  const int reassoc_max_chain_length_with_fma;
> +                               /* Specify max reassociation chain length with
> +                                  FMA.  */
>    struct stringop_algs *memcpy, *memset;
>    const int cond_taken_branch_cost;    /* Cost of taken branch for vectorizer
>                                           cost model.  */
> diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
> index 4f7a67ca5c5..1f57a5ee2a7 100644
> --- a/gcc/config/i386/x86-tune-costs.h
> +++ b/gcc/config/i386/x86-tune-costs.h
> @@ -127,6 +127,7 @@ struct processor_costs ix86_size_cost = {/* costs for tuning for size */
>    COSTS_N_BYTES (2),                   /* cost of SQRTSS instruction.  */
>    COSTS_N_BYTES (2),                   /* cost of SQRTSD instruction.  */
>    1, 1, 1, 1,                          /* reassoc int, fp, vec_int, vec_fp.  */
> +  1,                                   /* Reassoc max FMA chain length.  */
>    ix86_size_memcpy,
>    ix86_size_memset,
>    COSTS_N_BYTES (1),                   /* cond_taken_branch_cost.  */
> @@ -238,6 +239,7 @@ struct processor_costs i386_cost = {        /* 386 specific costs */
>    COSTS_N_INSNS (122),                 /* cost of SQRTSS instruction.  */
>    COSTS_N_INSNS (122),                 /* cost of SQRTSD instruction.  */
>    1, 1, 1, 1,                          /* reassoc int, fp, vec_int, vec_fp.  */
> +  1,                                   /* Reassoc max FMA chain length.  */
>    i386_memcpy,
>    i386_memset,
>    COSTS_N_INSNS (3),                   /* cond_taken_branch_cost.  */
> @@ -350,6 +352,7 @@ struct processor_costs i486_cost = {        /* 486 specific costs */
>    COSTS_N_INSNS (83),                  /* cost of SQRTSS instruction.  */
>    COSTS_N_INSNS (83),                  /* cost of SQRTSD instruction.  */
>    1, 1, 1, 1,                          /* reassoc int, fp, vec_int, vec_fp.  */
> +  1,                                   /* Reassoc max FMA chain length.  */
>    i486_memcpy,
>    i486_memset,
>    COSTS_N_INSNS (3),                   /* cond_taken_branch_cost.  */
> @@ -460,6 +463,7 @@ struct processor_costs pentium_cost = {
>    COSTS_N_INSNS (70),                  /* cost of SQRTSS instruction.  */
>    COSTS_N_INSNS (70),                  /* cost of SQRTSD instruction.  */
>    1, 1, 1, 1,                          /* reassoc int, fp, vec_int, vec_fp.  */
> +  1,                                   /* Reassoc max FMA chain length.  */
>    pentium_memcpy,
>    pentium_memset,
>    COSTS_N_INSNS (3),                   /* cond_taken_branch_cost.  */
> @@ -563,6 +567,7 @@ struct processor_costs lakemont_cost = {
>    COSTS_N_INSNS (31),                  /* cost of SQRTSS instruction.  */
>    COSTS_N_INSNS (63),                  /* cost of SQRTSD instruction.  */
>    1, 1, 1, 1,                          /* reassoc int, fp, vec_int, vec_fp.  */
> +  1,                                   /* Reassoc max FMA chain length.  */
>    pentium_memcpy,
>    pentium_memset,
>    COSTS_N_INSNS (3),                   /* cond_taken_branch_cost.  */
> @@ -681,6 +686,7 @@ struct processor_costs pentiumpro_cost = {
>    COSTS_N_INSNS (31),                  /* cost of SQRTSS instruction.  */
>    COSTS_N_INSNS (31),                  /* cost of SQRTSD instruction.  */
>    1, 1, 1, 1,                          /* reassoc int, fp, vec_int, vec_fp.  */
> +  1,                                   /* Reassoc max FMA chain length.  */
>    pentiumpro_memcpy,
>    pentiumpro_memset,
>    COSTS_N_INSNS (3),                   /* cond_taken_branch_cost.  */
> @@ -790,6 +796,7 @@ struct processor_costs geode_cost = {
>    COSTS_N_INSNS (54),                  /* cost of SQRTSS instruction.  */
>    COSTS_N_INSNS (54),                  /* cost of SQRTSD instruction.  */
>    1, 1, 1, 1,                          /* reassoc int, fp, vec_int, vec_fp.  */
> +  1,                                   /* Reassoc max FMA chain length.  */
>    geode_memcpy,
>    geode_memset,
>    COSTS_N_INSNS (3),                   /* cond_taken_branch_cost.  */
> @@ -902,6 +909,7 @@ struct processor_costs k6_cost = {
>    COSTS_N_INSNS (56),                  /* cost of SQRTSS instruction.  */
>    COSTS_N_INSNS (56),                  /* cost of SQRTSD instruction.  */
>    1, 1, 1, 1,                          /* reassoc int, fp, vec_int, vec_fp.  */
> +  1,                                   /* Reassoc max FMA chain length.  */
>    k6_memcpy,
>    k6_memset,
>    COSTS_N_INSNS (3),                   /* cond_taken_branch_cost.  */
> @@ -1015,6 +1023,7 @@ struct processor_costs athlon_cost = {
>    COSTS_N_INSNS (19),                  /* cost of SQRTSS instruction.  */
>    COSTS_N_INSNS (19),                  /* cost of SQRTSD instruction.  */
>    1, 1, 1, 1,                          /* reassoc int, fp, vec_int, vec_fp.  */
> +  1,                                   /* Reassoc max FMA chain length.  */
>    athlon_memcpy,
>    athlon_memset,
>    COSTS_N_INSNS (3),                   /* cond_taken_branch_cost.  */
> @@ -1137,6 +1146,7 @@ struct processor_costs k8_cost = {
>    COSTS_N_INSNS (19),                  /* cost of SQRTSS instruction.  */
>    COSTS_N_INSNS (27),                  /* cost of SQRTSD instruction.  */
>    1, 1, 1, 1,                          /* reassoc int, fp, vec_int, vec_fp.  */
> +  1,                                   /* Reassoc max FMA chain length.  */
>    k8_memcpy,
>    k8_memset,
>    COSTS_N_INSNS (3),                   /* cond_taken_branch_cost.  */
> @@ -1267,6 +1277,7 @@ struct processor_costs amdfam10_cost = {
>    COSTS_N_INSNS (19),                  /* cost of SQRTSS instruction.  */
>    COSTS_N_INSNS (27),                  /* cost of SQRTSD instruction.  */
>    1, 1, 1, 1,                          /* reassoc int, fp, vec_int, vec_fp.  */
> +  1,                                   /* Reassoc max FMA chain length.  */
>    amdfam10_memcpy,
>    amdfam10_memset,
>    COSTS_N_INSNS (2),                   /* cond_taken_branch_cost.  */
> @@ -1390,6 +1401,7 @@ const struct processor_costs bdver_cost = {
>    COSTS_N_INSNS (15),                  /* cost of SQRTSS instruction.  */
>    COSTS_N_INSNS (26),                  /* cost of SQRTSD instruction.  */
>    1, 2, 1, 1,                          /* reassoc int, fp, vec_int, vec_fp.  */
> +  1,                                   /* Reassoc max FMA chain length.  */
>    bdver_memcpy,
>    bdver_memset,
>    COSTS_N_INSNS (4),                   /* cond_taken_branch_cost.  */
> @@ -1545,6 +1557,7 @@ struct processor_costs znver1_cost = {
>       plus/minus operations per cycle but only one multiply.  This is adjusted
>       in ix86_reassociation_width.  */
>    4, 4, 3, 6,                          /* reassoc int, fp, vec_int, vec_fp.  */
> +  1,                                   /* Reassoc max FMA chain length.  */
>    znver1_memcpy,
>    znver1_memset,
>    COSTS_N_INSNS (4),                   /* cond_taken_branch_cost.  */
> @@ -1704,6 +1717,7 @@ struct processor_costs znver2_cost = {
>       plus/minus operations per cycle but only one multiply.  This is adjusted
>       in ix86_reassociation_width.  */
>    4, 4, 3, 6,                          /* reassoc int, fp, vec_int, vec_fp.  */
> +  1,                                   /* Reassoc max FMA chain length.  */
>    znver2_memcpy,
>    znver2_memset,
>    COSTS_N_INSNS (4),                   /* cond_taken_branch_cost.  */
> @@ -1838,6 +1852,7 @@ struct processor_costs znver3_cost = {
>       plus/minus operations per cycle but only one multiply.  This is adjusted
>       in ix86_reassociation_width.  */
>    4, 4, 3, 6,                          /* reassoc int, fp, vec_int, vec_fp.  */
> +  1,                                   /* Reassoc max FMA chain length.  */
>    znver2_memcpy,
>    znver2_memset,
>    COSTS_N_INSNS (4),                   /* cond_taken_branch_cost.  */
> @@ -1974,6 +1989,7 @@ struct processor_costs znver4_cost = {
>       plus/minus operations per cycle but only one multiply.  This is adjusted
>       in ix86_reassociation_width.  */
>    4, 4, 3, 6,                          /* reassoc int, fp, vec_int, vec_fp.  */
> +  1,                                   /* Reassoc max FMA chain length.  */
>    znver2_memcpy,
>    znver2_memset,
>    COSTS_N_INSNS (4),                   /* cond_taken_branch_cost.  */
> @@ -2100,6 +2116,7 @@ struct processor_costs skylake_cost = {
>    COSTS_N_INSNS (12),                  /* cost of SQRTSS instruction.  */
>    COSTS_N_INSNS (18),                  /* cost of SQRTSD instruction.  */
>    1, 4, 2, 2,                          /* reassoc int, fp, vec_int, vec_fp.  */
> +  1,                                   /* Reassoc max FMA chain length.  */
>    skylake_memcpy,
>    skylake_memset,
>    COSTS_N_INSNS (3),                   /* cond_taken_branch_cost.  */
> @@ -2228,6 +2245,12 @@ struct processor_costs icelake_cost = {
>    COSTS_N_INSNS (12),                  /* cost of SQRTSS instruction.  */
>    COSTS_N_INSNS (18),                  /* cost of SQRTSD instruction.  */
>    1, 4, 2, 2,                          /* reassoc int, fp, vec_int, vec_fp.  */
> +  /* Icelake-server prefers to keep FMA chains rather than break dependencies
> +     into mult + add, which reduces the number of instructions retired.
> +     1 means do not keep the FMA chain.  When the value is greater than 1,
> +     FMA chains are kept; when the actual FMA chain length exceeds this
> +     value, the chain is split according to the reassociation width.  */
> +  5,                                   /* Reassoc max FMA chain length.  */
>    icelake_memcpy,
>    icelake_memset,
>    COSTS_N_INSNS (3),                   /* cond_taken_branch_cost.  */
> @@ -2350,6 +2373,7 @@ struct processor_costs alderlake_cost = {
>    COSTS_N_INSNS (14),                  /* cost of SQRTSS instruction.  */
>    COSTS_N_INSNS (18),                  /* cost of SQRTSD instruction.  */
>    1, 4, 3, 3,                          /* reassoc int, fp, vec_int, vec_fp.  */
> +  1,                                   /* Reassoc max FMA chain length.  */
>    alderlake_memcpy,
>    alderlake_memset,
>    COSTS_N_INSNS (4),                   /* cond_taken_branch_cost.  */
> @@ -2465,6 +2489,7 @@ const struct processor_costs btver1_cost = {
>    COSTS_N_INSNS (14),                  /* cost of SQRTSS instruction.  */
>    COSTS_N_INSNS (48),                  /* cost of SQRTSD instruction.  */
>    1, 1, 1, 1,                          /* reassoc int, fp, vec_int, vec_fp.  */
> +  1,                                   /* Reassoc max FMA chain length.  */
>    btver1_memcpy,
>    btver1_memset,
>    COSTS_N_INSNS (2),                   /* cond_taken_branch_cost.  */
> @@ -2577,6 +2602,7 @@ const struct processor_costs btver2_cost = {
>    COSTS_N_INSNS (16),                  /* cost of SQRTSS instruction.  */
>    COSTS_N_INSNS (21),                  /* cost of SQRTSD instruction.  */
>    1, 1, 1, 1,                          /* reassoc int, fp, vec_int, vec_fp.  */
> +  1,                                   /* Reassoc max FMA chain length.  */
>    btver2_memcpy,
>    btver2_memset,
>    COSTS_N_INSNS (2),                   /* cond_taken_branch_cost.  */
> @@ -2688,6 +2714,7 @@ struct processor_costs pentium4_cost = {
>    COSTS_N_INSNS (23),                  /* cost of SQRTSS instruction.  */
>    COSTS_N_INSNS (38),                  /* cost of SQRTSD instruction.  */
>    1, 1, 1, 1,                          /* reassoc int, fp, vec_int, vec_fp.  */
> +  1,                                   /* Reassoc max FMA chain length.  */
>    pentium4_memcpy,
>    pentium4_memset,
>    COSTS_N_INSNS (3),                   /* cond_taken_branch_cost.  */
> @@ -2802,6 +2829,7 @@ struct processor_costs nocona_cost = {
>    COSTS_N_INSNS (32),                  /* cost of SQRTSS instruction.  */
>    COSTS_N_INSNS (41),                  /* cost of SQRTSD instruction.  */
>    1, 1, 1, 1,                          /* reassoc int, fp, vec_int, vec_fp.  */
> +  1,                                   /* Reassoc max FMA chain length.  */
>    nocona_memcpy,
>    nocona_memset,
>    COSTS_N_INSNS (3),                   /* cond_taken_branch_cost.  */
> @@ -2914,6 +2942,7 @@ struct processor_costs atom_cost = {
>    COSTS_N_INSNS (31),                  /* cost of SQRTSS instruction.  */
>    COSTS_N_INSNS (63),                  /* cost of SQRTSD instruction.  */
>    2, 2, 2, 2,                          /* reassoc int, fp, vec_int, vec_fp.  */
> +  1,                                   /* Reassoc max FMA chain length.  */
>    atom_memcpy,
>    atom_memset,
>    COSTS_N_INSNS (3),                   /* cond_taken_branch_cost.  */
> @@ -3026,6 +3055,7 @@ struct processor_costs slm_cost = {
>    COSTS_N_INSNS (20),                  /* cost of SQRTSS instruction.  */
>    COSTS_N_INSNS (35),                  /* cost of SQRTSD instruction.  */
>    1, 2, 1, 1,                          /* reassoc int, fp, vec_int, vec_fp.  */
> +  1,                                   /* Reassoc max FMA chain length.  */
>    slm_memcpy,
>    slm_memset,
>    COSTS_N_INSNS (3),                   /* cond_taken_branch_cost.  */
> @@ -3152,6 +3182,7 @@ struct processor_costs tremont_cost = {
>    COSTS_N_INSNS (14),                  /* cost of SQRTSS instruction.  */
>    COSTS_N_INSNS (18),                  /* cost of SQRTSD instruction.  */
>    1, 4, 3, 3,                          /* reassoc int, fp, vec_int, vec_fp.  */
> +  1,                                   /* Reassoc max FMA chain length.  */
>    tremont_memcpy,
>    tremont_memset,
>    COSTS_N_INSNS (4),                   /* cond_taken_branch_cost.  */
> @@ -3264,6 +3295,7 @@ struct processor_costs intel_cost = {
>    COSTS_N_INSNS (40),                  /* cost of SQRTSS instruction.  */
>    COSTS_N_INSNS (40),                  /* cost of SQRTSD instruction.  */
>    1, 4, 1, 1,                          /* reassoc int, fp, vec_int, vec_fp.  */
> +  1,                                   /* Reassoc max FMA chain length.  */
>    intel_memcpy,
>    intel_memset,
>    COSTS_N_INSNS (3),                   /* cond_taken_branch_cost.  */
> @@ -3381,6 +3413,7 @@ struct processor_costs lujiazui_cost = {
>    COSTS_N_INSNS (32),                  /* cost of SQRTSS instruction.  */
>    COSTS_N_INSNS (60),                  /* cost of SQRTSD instruction.  */
>    1, 4, 3, 3,                          /* reassoc int, fp, vec_int, vec_fp.  */
> +  1,                                   /* Reassoc max FMA chain length.  */
>    lujiazui_memcpy,
>    lujiazui_memset,
>    COSTS_N_INSNS (4),                   /* cond_taken_branch_cost.  */
> @@ -3502,6 +3535,7 @@ struct processor_costs generic_cost = {
>    COSTS_N_INSNS (14),                  /* cost of SQRTSS instruction.  */
>    COSTS_N_INSNS (18),                  /* cost of SQRTSD instruction.  */
>    1, 4, 3, 3,                          /* reassoc int, fp, vec_int, vec_fp.  */
> +  1,                                   /* Reassoc max FMA chain length.  */
>    generic_memcpy,
>    generic_memset,
>    COSTS_N_INSNS (4),                   /* cond_taken_branch_cost.  */
> @@ -3630,6 +3664,7 @@ struct processor_costs core_cost = {
>    COSTS_N_INSNS (30),                  /* cost of SQRTSS instruction.  */
>    COSTS_N_INSNS (58),                  /* cost of SQRTSD instruction.  */
>    1, 4, 2, 2,                          /* reassoc int, fp, vec_int, vec_fp.  */
> +  1,                                   /* Reassoc max FMA chain length.  */
>    core_memcpy,
>    core_memset,
>    COSTS_N_INSNS (3),                   /* cond_taken_branch_cost.  */
> diff --git a/gcc/testsuite/gcc.target/i386/fma-chain.c b/gcc/testsuite/gcc.target/i386/fma-chain.c
> new file mode 100644
> index 00000000000..9de61f1b6ff
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/fma-chain.c
> @@ -0,0 +1,11 @@
> +/* { dg-do compile } */
> +/* { dg-options "-Ofast -march=icelake-server -Wno-attributes " } */
> +
> +/* Test that the compiler properly optimizes multiply and add
> +   to generate more FMA instructions.  */
> +float
> +foo (float a, float b, float c, float d, float e, float f, float g, float h, float j)
> +{
> +   return a * b + c * d + e * f + g * h + j;
> +}
> +/* { dg-final { scan-assembler-times "vfm" 4 } } */
> --
> 2.25.1
>


* RE: [PATCH 1/2] PR gcc/98350:Add a param to control the length of the chain with FMA in reassoc pass
  2023-05-11 10:52 ` [PATCH 1/2] PR gcc/98350:Add a param to control the length of the chain with FMA in reassoc pass Richard Biener
@ 2023-05-11 15:18   ` Cui, Lili
  2023-05-12  6:04     ` Richard Biener
  0 siblings, 1 reply; 10+ messages in thread
From: Cui, Lili @ 2023-05-11 15:18 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches

> -----Original Message-----
> From: Richard Biener <richard.guenther@gmail.com>
> Sent: Thursday, May 11, 2023 6:53 PM
> To: Cui, Lili <lili.cui@intel.com>
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH 1/2] PR gcc/98350:Add a param to control the length of
> the chain with FMA in reassoc pass

Hi Richard,
Thanks for helping to review the patch.

> 
> As you are not changing the number of ops you should be able to use
> quick_push here and below.  You should be able to do
> 
>  ops->splice (ops_mult);
>  ops->splice (ops_others);
> 
> as well.
> 
Done.

> > +                 /* When enabling param_reassoc_max_chain_length_with_fma
> to
> > +                    keep the chain with fma, rank_ops_for_fma will detect if
> > +                    the chain has fmas and if so it will rearrange the ops.  */
> > +                 if (param_reassoc_max_chain_length_with_fma > 1
> > +                     && direct_internal_fn_supported_p (IFN_FMA,
> > +                                                        TREE_TYPE (lhs),
> > +                                                        opt_type)
> > +                     && (rhs_code == PLUS_EXPR || rhs_code == MINUS_EXPR))
> > +                   {
> > +                     keep_fma_chain = rank_ops_for_fma(&ops);
> > +                   }
> > +
> > +                 int len = ops.length ();
> >                   /* Only rewrite the expression tree to parallel in the
> >                      last reassoc pass to avoid useless work back-and-forth
> >                      with initial linearization.  */
> 
> we are doing the parallel rewrite only in the last reassoc pass, i think it makes
> sense to do the same for reassoc-for-fma.

I rearranged the order of ops in reassoc1 without breaking the chain; it enabled more vectorization in the vector pass (seen in benchmark 503). So in reassoc1 I rewrite the SSA tree and keep the chain with the function "rewrite_expr_tree", and in reassoc2 I break the chain with "rewrite_expr_tree_parallel_for_fma".

> 
> Why do the existing expr rewrites not work after re-sorting the ops?

For the case https://godbolt.org/z/3x9PWE9Kb we put "j" first.

j + l * m + a * b + c * d + e * f + g * h;

GCC trunk: width = 2, ops_num = 6, the old function "rewrite_expr_tree_parallel" generates 3 FMAs.
---------------------------------------------------------------------------------------
  _1 = l_10(D) * m_11(D);
  _3 = a_13(D) * b_14(D);
  _4 = j_12(D) + _3;    --------> Here is one FMA.
  _5 = c_15(D) * d_16(D);
  _8 = _1 + _5;            --------> Here is one FMA and lost one.
  _7 = e_17(D) * f_18(D);
  _9 = g_19(D) * h_20(D);
  _2 = _7 + _9;           --------> Here is one FMA and lost one.
  _6 = _2 + _4;
  _21 = _6 + _8;
  # VUSE <.MEM_22(D)>
  return _21;
--------------------------------------------------------------------------------------
width = 2, ops_num = 6, the new function "rewrite_expr_tree_parallel_for_fma" generates 4 FMAs.
------------------------------------------------------------------------------
_1 = a_10(D) * b_11(D);
  _3 = c_13(D) * d_14(D);
  _5 = e_15(D) * f_16(D);
  _7 = g_17(D) * h_18(D);
  _4 = _5 + _7;           --------> Here is one FMA and lost one.
  _8 = _4 + _1;           --------> Here is one FMA.
  _9 = l_19(D) * m_20(D);
  _2 = _9 + j_12(D);    --------> Here is one FMA.
  _6 = _2 + _3;            --------> Here is one FMA.
  _21 = _8 + _6;         
  return _21;
------------------------------------------------------------------------------------
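
For reference, the case above as a standalone function (a reproducer sketch only;
the godbolt link has the original, and the parameter order here is arbitrary):

float
foo (float j, float l, float m, float a, float b, float c, float d,
     float e, float f, float g, float h)
{
  /* Five products plus the plain addend j, i.e. ops_num = 6 after
     linearization, matching the dumps above.  */
  return j + l * m + a * b + c * d + e * f + g * h;
}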


> 
> >                   if (!reassoc_insert_powi_p
> > -                     && ops.length () > 3
> > +                     && len > 3
> > +                     && (!keep_fma_chain
> > +                         || (keep_fma_chain
> > +                             && len >
> > + param_reassoc_max_chain_length_with_fma))
> 
> in the case len < param_reassoc_max_chain_length_with_fma we have the
> chain re-sorted but fall through to non-parallel rewrite.  I wonder if we do
> not want to instead adjust the reassociation width?  I'd say it depends on the
> number of mult cases in the chain (sth the re-sorting could have computed).
> Why do we have two completely independent --params here?  Can you give
> an example --param value combination that makes "sense" and show how it
> is beneficial?

For this small case https://godbolt.org/z/Pxczrre8P
a * b + c * d + e * f  + j

GCC trunk: ops_num = 4, targetm.sched.reassociation_width is 4 (scalar fp cost is 4). Calculated width = 2. We can get 2 FMAs.
----------------------------------
  _1 = a_6(D) * b_7(D);
  _2 = c_8(D) * d_9(D);
  _5 = _1 + _2;
  _4 = e_10(D) * f_11(D);
  _3 = _4 + j_12(D);
  _13 = _3 + _5;
--------------------------------------------------------
  _2 = c_8(D) * d_9(D);
  _5 = .FMA (a_6(D), b_7(D), _2);
  _3 = .FMA (e_10(D), f_11(D), j_12(D));
  _13 = _3 + _5;
--------------------------------------------------------
New patch: if we just rearrange the ops and fall through to the parallel rewrite, the chain is broken with width = 2.

---------------------------------------------------------
  _1 = a_6(D) * b_7(D);
  _2 = j + _1;          -----> put j first.
  _3 = c_8(D) * d_9(D);
  _4 = e_10(D) * f_11(D);
  _5 = _3 + _4;       -----> break the chain with width = 2; we lose an FMA here.
  _13 = _2 + _5;

-------------------------------------------------------
  _3 = c_8(D) * d_9(D);
  _2 = .FMA (a_6(D), b_7(D), j);
  _5 = .FMA (e_10(D), f_11(D), _3);
  _13 = _2 + _5;
--------------------------------------------------------
Sometimes breaking the chain loses an FMA (breaking the chain requires putting two mult ops together, which loses one FMA); we can only get 2 FMAs here. If we want 3 FMAs, we need to keep the chain and not break it. So I added a param to control the chain length, "param_reassoc_max_chain_length_with_fma = 4" (for the small case in Bugzilla 98350, we need to keep the chain to generate 6 FMAs).
-------------------------------------------------------
  _1 = a_6(D) * b_7(D);
  _2 = c_8(D) * d_9(D);
  _4 = e_10(D) * f_11(D);
  _15 = _4 + j_12(D);
  _16 = _15 + _2;
  _13 = _16 + _1;
-------------------------------------------------------
  _15 = .FMA (e_10(D), f_11(D), j_12(D));
  _16 = .FMA (c_8(D), d_9(D), _15);
  _13 = .FMA (a_6(D), b_7(D), _16);
-------------------------------------------------------
In some cases we want to break the chain according to the width; we can set "param_reassoc_max_chain_length_with_fma = 2", and it will rearrange the ops and break the chain with the given width.
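
As a concrete illustration, a test sketch modeled on the tests in this patch
(the dg-options line is only a sketch of mine; the command-line spelling of the
param is assumed from the internal param name):

/* { dg-do compile } */
/* { dg-options "-Ofast -mfma --param=reassoc-max-chain-length-with-fma=4" } */
float
foo (float a, float b, float c, float d, float e, float f, float j)
{
  /* Keeping the whole chain turns this into three dependent FMAs, as in
     the last dump above.  */
  return a * b + c * d + e * f + j;
}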

> 
> >                       && (width = get_reassociation_width (ops_num, rhs_code,
> > -                                                          mode)) > 1)
> > +                                                           mode)) >
> > + 1)
> >                     {
> > -                     if (dump_file && (dump_flags & TDF_DETAILS))
> > -                       fprintf (dump_file,
> > -                                "Width = %d was chosen for reassociation\n",
> > -                                width);
> > -                     rewrite_expr_tree_parallel (as_a <gassign *> (stmt),
> > -                                                 width, ops);
> > +                     if (keep_fma_chain)
> > +                       {
> > +                         if (dump_file && (dump_flags & TDF_DETAILS))
> > +                           fprintf (dump_file,
> > +                                    "Break chain len = %d into width for FMA\n",
> > +                                    len);
> > +                         rewrite_expr_tree_parallel_for_fma
> > +                           (as_a <gassign *> (stmt), width, ops);
> > +                       }
> > +                     else
> > +                       {
> > +                         if (dump_file && (dump_flags & TDF_DETAILS))
> > +                           fprintf (dump_file,
> > +                                    "Width = %d was chosen for reassociation\n",
> > +                                    width);
> > +                         rewrite_expr_tree_parallel (as_a <gassign *> (stmt),
> > +                                                     width, ops);
> > +                       }
> >                     }
> >                   else
> > -                    {
> > -                      /* When there are three operands left, we want
> > -                         to make sure the ones that get the double
> > -                         binary op are chosen wisely.  */
> > -                      int len = ops.length ();
> > -                      if (len >= 3)
> > +                   {
> > +                     /* When there are three operands left, we want
> > +                        to make sure the ones that get the double
> > +                        binary op are chosen wisely.  */
> > +                     if (len >= 3 && !keep_fma_chain)
> >                         swap_ops_for_binary_stmt (ops, len - 3);
> >
> >                       new_lhs = rewrite_expr_tree (stmt, rhs_code, 0, ops,
> >                                                    powi_result != NULL
> >                                                    || negate_result,
> >                                                    len != orig_len);
> > -                    }
> > -
> > +                   }
> >                   /* If we combined some repeated factors into a
> >                      __builtin_powi call, multiply that result by the
> >                      reassociated operands.  */
> > --
> > 2.25.1
> >


* Re: [PATCH 1/2] PR gcc/98350:Add a param to control the length of the chain with FMA in reassoc pass
  2023-05-11 15:18   ` Cui, Lili
@ 2023-05-12  6:04     ` Richard Biener
  2023-05-12  9:04       ` Cui, Lili
  0 siblings, 1 reply; 10+ messages in thread
From: Richard Biener @ 2023-05-12  6:04 UTC (permalink / raw)
  To: Cui, Lili; +Cc: gcc-patches

On Thu, May 11, 2023 at 5:20 PM Cui, Lili <lili.cui@intel.com> wrote:
>
> > -----Original Message-----
> > From: Richard Biener <richard.guenther@gmail.com>
> > Sent: Thursday, May 11, 2023 6:53 PM
> > To: Cui, Lili <lili.cui@intel.com>
> > Cc: gcc-patches@gcc.gnu.org
> > Subject: Re: [PATCH 1/2] PR gcc/98350:Add a param to control the length of
> > the chain with FMA in reassoc pass
>
> Hi Richard,
> Thanks for helping to review the patch.
>
> >
> > As you are not changing the number of ops you should be able to use
> > quick_push here and below.  You should be able to do
> >
> >  ops->splice (ops_mult);
> >  ops->splice (ops_others);
> >
> > as well.
> >
> Done.
>
> > > +                 /* When enabling param_reassoc_max_chain_length_with_fma
> > to
> > > +                    keep the chain with fma, rank_ops_for_fma will detect if
> > > +                    the chain has fmas and if so it will rearrange the ops.  */
> > > +                 if (param_reassoc_max_chain_length_with_fma > 1
> > > +                     && direct_internal_fn_supported_p (IFN_FMA,
> > > +                                                        TREE_TYPE (lhs),
> > > +                                                        opt_type)
> > > +                     && (rhs_code == PLUS_EXPR || rhs_code == MINUS_EXPR))
> > > +                   {
> > > +                     keep_fma_chain = rank_ops_for_fma(&ops);
> > > +                   }
> > > +
> > > +                 int len = ops.length ();
> > >                   /* Only rewrite the expression tree to parallel in the
> > >                      last reassoc pass to avoid useless work back-and-forth
> > >                      with initial linearization.  */
> >
> > we are doing the parallel rewrite only in the last reassoc pass, i think it makes
> > sense to do the same for reassoc-for-fma.
>
> I rearranged the order of ops in reassoc1 without break the chain, it generated more vectorize during vector pass( seen in benchmark 503). So I rewrite the ssa tree and keep the chain with function "rewrite_expr_tree" in reassoc1, break the chain with "rewrite_expr_tree_parallel_for_fma" in reassoc2.
>
> >
> > Why do the existing expr rewrites not work after re-sorting the ops?
>
> For case https://godbolt.org/z/3x9PWE9Kb:  we put  "j" at first.
>
> j + l * m + a * b + c * d + e * f + g * h;
>
> GCC trunk: width = 2, ops_num = 6, old function " rewrite_expr_tree_parallel " generates 3 FMAs.
> ---------------------------------------------------------------------------------------
>   _1 = l_10(D) * m_11(D);
>   _3 = a_13(D) * b_14(D);
>   _4 = j_12(D) + _3;    --------> Here is one FMA.
>   _5 = c_15(D) * d_16(D);
>   _8 = _1 + _5;            --------> Here is one FMA and lost one.
>   _7 = e_17(D) * f_18(D);
>   _9 = g_19(D) * h_20(D);
>   _2 = _7 + _9;           --------> Here is one FMA and lost one.
>   _6 = _2 + _4;
>   _21 = _6 + _8;
>   # VUSE <.MEM_22(D)>
>   return _21;
> --------------------------------------------------------------------------------------
> width = 2, ops_num = 6, new function " rewrite_expr_tree_parallel_for_fma " generates 4 FMAs.
> ------------------------------------------------------------------------------
> _1 = a_10(D) * b_11(D);
>   _3 = c_13(D) * d_14(D);
>   _5 = e_15(D) * f_16(D);
>   _7 = g_17(D) * h_18(D);
>   _4 = _5 + _7;           --------> Here is one FMA and lost one.
>   _8 = _4 + _1;           --------> Here is one FMA.
>   _9 = l_19(D) * m_20(D);
>   _2 = _9 + j_12(D);    --------> Here is one FMA.
>   _6 = _2 + _3;            --------> Here is one FMA.
>   _21 = _8 + _6;
>   return _21;
> ------------------------------------------------------------------------------------

ISTR there were no sufficient comments in the code explaining why
rewrite_expr_tree_parallel_for_fma is better by design.  In fact ...

>
> >
> > >                   if (!reassoc_insert_powi_p
> > > -                     && ops.length () > 3
> > > +                     && len > 3
> > > +                     && (!keep_fma_chain
> > > +                         || (keep_fma_chain
> > > +                             && len >
> > > + param_reassoc_max_chain_length_with_fma))
> >
> > in the case len < param_reassoc_max_chain_length_with_fma we have the
> > chain re-sorted but fall through to non-parallel rewrite.  I wonder if we do
> > not want to instead adjust the reassociation width?  I'd say it depends on the
> > number of mult cases in the chain (sth the re-sorting could have computed).
> > Why do we have two completely independent --params here?  Can you give
> > an example --param value combination that makes "sense" and show how it
> > is beneficial?
>
> For this small case https://godbolt.org/z/Pxczrre8P
> a * b + c * d + e * f  + j
>
> GCC trunk: ops_num = 4, targetm.sched.reassociation_width is 4 (scalar fp cost is 4). Calculated: Width = 2. we can get 2 FMAs.
> ----------------------------------
>   _1 = a_6(D) * b_7(D);
>   _2 = c_8(D) * d_9(D);
>   _5 = _1 + _2;
>   _4 = e_10(D) * f_11(D);
>   _3 = _4 + j_12(D);
>   _13 = _3 + _5;
> --------------------------------------------------------
>   _2 = c_8(D) * d_9(D);
>   _5 = .FMA (a_6(D), b_7(D), _2);
>   _3 = .FMA (e_10(D), f_11(D), j_12(D));
>   _13 = _3 + _5;
> --------------------------------------------------------
> New patch: If just rearrange ops and fall through to parallel rewrite to break the chain with width = 2.
>
> ---------------------------------------------------------
>   _1 = a_6(D) * b_7(D);
>   _2 = j + _1;          -----> put j at the first.
>   _3 = c_8(D) * d_9(D);
>   _4 = e_10(D) * f_11(D);
>   _5 = _3 + _4;       -----> break chain with width = 2. we lost a FMA here.
>   _13 = _2 + _5;
>
> -------------------------------------------------------
>   _3 = c_8(D) * d_9(D);
>   _2 = .FMA (a_6(D), b_7(D), j);
>   _5 = .FMA (e_10(D), f_11(D), _3);
>   _13 = _2 + _5;
> --------------------------------------------------------
> Sometimes break chain will lose FMA( break chain needs put two mult-ops together, which will lose one FMA ), we can only get 2 FMAs here, if we want to get 3 FMAs, we need to keep the chain and not break it. So I added a param to control chain length "param_reassoc_max_chain_length_with_fma = 4" (For the small case in Bugzilla 98350, we need to keep the chain to generate 6 FMAs.)
> -------------------------------------------------------
>   _1 = a_6(D) * b_7(D);
>   _2 = c_8(D) * d_9(D);
>   _4 = e_10(D) * f_11(D);
>   _15 = _4 + j_12(D);
>   _16 = _15 + _2;
>   _13 = _16 + _1;
> -------------------------------------------------------
>   _15 = .FMA (e_10(D), f_11(D), j_12(D));
>   _16 = .FMA (c_8(D), d_9(D), _15);
>   _13 = .FMA (a_6(D), b_7(D), _16);
> -------------------------------------------------------
> In some case we want to break the chain with width, we can set "param_reassoc_max_chain_length_with_fma = 2", it will rearrange ops and break the chain with width.

... it sounds like the problem could be fully addressed by sorting the
chain with reassoc-width in mind?
Wouldn't it be preferable if rewrite_expr_tree_parallel would get a
vector of mul and a vector of
non-mul ops so it can pick from the optimal candidate?

That said, I think rewrite_expr_tree_parallel_for_fma at least needs
more comments.

> >
> > >                       && (width = get_reassociation_width (ops_num, rhs_code,
> > > -                                                          mode)) > 1)
> > > +                                                           mode)) >
> > > + 1)
> > >                     {
> > > -                     if (dump_file && (dump_flags & TDF_DETAILS))
> > > -                       fprintf (dump_file,
> > > -                                "Width = %d was chosen for reassociation\n",
> > > -                                width);
> > > -                     rewrite_expr_tree_parallel (as_a <gassign *> (stmt),
> > > -                                                 width, ops);
> > > +                     if (keep_fma_chain)
> > > +                       {
> > > +                         if (dump_file && (dump_flags & TDF_DETAILS))
> > > +                           fprintf (dump_file,
> > > +                                    "Break chain len = %d into width for FMA\n",
> > > +                                    len);
> > > +                         rewrite_expr_tree_parallel_for_fma
> > > +                           (as_a <gassign *> (stmt), width, ops);
> > > +                       }
> > > +                     else
> > > +                       {
> > > +                         if (dump_file && (dump_flags & TDF_DETAILS))
> > > +                           fprintf (dump_file,
> > > +                                    "Width = %d was chosen for reassociation\n",
> > > +                                    width);
> > > +                         rewrite_expr_tree_parallel (as_a <gassign *> (stmt),
> > > +                                                     width, ops);
> > > +                       }
> > >                     }
> > >                   else
> > > -                    {
> > > -                      /* When there are three operands left, we want
> > > -                         to make sure the ones that get the double
> > > -                         binary op are chosen wisely.  */
> > > -                      int len = ops.length ();
> > > -                      if (len >= 3)
> > > +                   {
> > > +                     /* When there are three operands left, we want
> > > +                        to make sure the ones that get the double
> > > +                        binary op are chosen wisely.  */
> > > +                     if (len >= 3 && !keep_fma_chain)
> > >                         swap_ops_for_binary_stmt (ops, len - 3);
> > >
> > >                       new_lhs = rewrite_expr_tree (stmt, rhs_code, 0, ops,
> > >                                                    powi_result != NULL
> > >                                                    || negate_result,
> > >                                                    len != orig_len);
> > > -                    }
> > > -
> > > +                   }
> > >                   /* If we combined some repeated factors into a
> > >                      __builtin_powi call, multiply that result by the
> > >                      reassociated operands.  */
> > > --
> > > 2.25.1
> > >


* RE: [PATCH 1/2] PR gcc/98350:Add a param to control the length of the chain with FMA in reassoc pass
  2023-05-12  6:04     ` Richard Biener
@ 2023-05-12  9:04       ` Cui, Lili
  2023-05-15 13:12         ` Richard Biener
  0 siblings, 1 reply; 10+ messages in thread
From: Cui, Lili @ 2023-05-12  9:04 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches

> ISTR there were no sufficient comments in the code explaining why
> rewrite_expr_tree_parallel_for_fma is better by design.  In fact ...
> 
> >
> > >
> > > >                   if (!reassoc_insert_powi_p
> > > > -                     && ops.length () > 3
> > > > +                     && len > 3
> > > > +                     && (!keep_fma_chain
> > > > +                         || (keep_fma_chain
> > > > +                             && len >
> > > > + param_reassoc_max_chain_length_with_fma))
> > >
> > > in the case len < param_reassoc_max_chain_length_with_fma we have
> > > the chain re-sorted but fall through to non-parallel rewrite.  I
> > > wonder if we do not want to instead adjust the reassociation width?
> > > I'd say it depends on the number of mult cases in the chain (sth the re-
> sorting could have computed).
> > > Why do we have two completely independent --params here?  Can you
> > > give an example --param value combination that makes "sense" and
> > > show how it is beneficial?
> >
> > For this small case https://godbolt.org/z/Pxczrre8P a * b + c * d + e
> > * f  + j
> >
> > GCC trunk: ops_num = 4, targetm.sched.reassociation_width is 4 (scalar fp
> cost is 4). Calculated: Width = 2. we can get 2 FMAs.
> > ----------------------------------
> >   _1 = a_6(D) * b_7(D);
> >   _2 = c_8(D) * d_9(D);
> >   _5 = _1 + _2;
> >   _4 = e_10(D) * f_11(D);
> >   _3 = _4 + j_12(D);
> >   _13 = _3 + _5;
> > --------------------------------------------------------
> >   _2 = c_8(D) * d_9(D);
> >   _5 = .FMA (a_6(D), b_7(D), _2);
> >   _3 = .FMA (e_10(D), f_11(D), j_12(D));
> >   _13 = _3 + _5;
> > --------------------------------------------------------
> > New patch: If just rearrange ops and fall through to parallel rewrite to
> break the chain with width = 2.
> >
> > ---------------------------------------------------------
> >   _1 = a_6(D) * b_7(D);
> >   _2 = j + _1;          -----> put j at the first.
> >   _3 = c_8(D) * d_9(D);
> >   _4 = e_10(D) * f_11(D);
> >   _5 = _3 + _4;       -----> break chain with width = 2. we lost a FMA here.
> >   _13 = _2 + _5;
> >
> > -------------------------------------------------------
> >   _3 = c_8(D) * d_9(D);
> >   _2 = .FMA (a_6(D), b_7(D), j);
> >   _5 = .FMA (e_10(D), f_11(D), _3);
> >   _13 = _2 + _5;
> > --------------------------------------------------------
> > Sometimes break chain will lose FMA( break chain needs put two
> > mult-ops together, which will lose one FMA ), we can only get 2 FMAs
> > here, if we want to get 3 FMAs, we need to keep the chain and not
> > break it. So I added a param to control chain length
> > "param_reassoc_max_chain_length_with_fma = 4" (For the small case in
> > Bugzilla 98350, we need to keep the chain to generate 6 FMAs.)
> > -------------------------------------------------------
> >   _1 = a_6(D) * b_7(D);
> >   _2 = c_8(D) * d_9(D);
> >   _4 = e_10(D) * f_11(D);
> >   _15 = _4 + j_12(D);
> >   _16 = _15 + _2;
> >   _13 = _16 + _1;
> > -------------------------------------------------------
> >   _15 = .FMA (e_10(D), f_11(D), j_12(D));
> >   _16 = .FMA (c_8(D), d_9(D), _15);
> >   _13 = .FMA (a_6(D), b_7(D), _16);
> > -------------------------------------------------------
> > In some case we want to break the chain with width, we can set
> "param_reassoc_max_chain_length_with_fma = 2", it will rearrange ops and
> break the chain with width.
> 
> ... it sounds like the problem could be fully addressed by sorting the chain
> with reassoc-width in mind?
> Wouldn't it be preferable if rewrite_expr_tree_parallel would get a vector of
> mul and a vector of non-mul ops so it can pick from the optimal candidate?
> 
> That said, I think rewrite_expr_tree_parallel_for_fma at least needs more
> comments.
> 
Sorry for not writing the notes clearly enough, I'll add more.
I have two points that need to be clarified.

1. For some cases we need to keep the chain to generate more FMAs, because breaking the chain loses FMAs.
   For example, g + a * b + c * d + e * f:
   keeping the chain gives 3 FMAs, breaking it gives 2 FMAs. It's hard to say which one is better, so we provide a param for users to customize.
   
2. When the chain has FMAs and we need to break it with the given width,
for example l + a * b + c * d + e * f + g * h + j * k; (we already put the non-mult op first)
rewrite_expr_tree_parallel:
when width = 2, it breaks the chain like this. It actually breaks the chain into 3: it ignores the width and adds all ops two by two, which loses FMAs.

ssa1 = l + a * b;
ssa2 = c * d + e * f;
ssa3 = g * h + j * k;
ssa4 = ssa1 + ssa2;
ssa5 = ssa4 + ssa3;

rewrite_expr_tree_parallel_for_fma
when width = 2, we break the chain into two like this.

ssa1 = l + a * b; 
ssa2 = c * d + e * f;
ssa3 = ssa1 + g * h;
ssa4 = ssa2 + j * k;
ssa5 = ssa3 +ssa4;

I think it's okay either to remove or to keep rewrite_expr_tree_parallel_for_fma; more FMAs are generated only in some special cases.
I'm not sure whether the new method is better than the old one. I created a small case, and the execution time of the two sequences is almost the same on SPR and ICX.

Thanks,
Lili.



* Re: [PATCH 1/2] PR gcc/98350:Add a param to control the length of the chain with FMA in reassoc pass
  2023-05-12  9:04       ` Cui, Lili
@ 2023-05-15 13:12         ` Richard Biener
  2023-05-17 13:05           ` Cui, Lili
  0 siblings, 1 reply; 10+ messages in thread
From: Richard Biener @ 2023-05-15 13:12 UTC (permalink / raw)
  To: Cui, Lili; +Cc: gcc-patches

On Fri, May 12, 2023 at 11:05 AM Cui, Lili <lili.cui@intel.com> wrote:
>
> > ISTR there were no sufficient comments in the code explaining why
> > rewrite_expr_tree_parallel_for_fma is better by design.  In fact ...
> >
> > >
> > > >
> > > > >                   if (!reassoc_insert_powi_p
> > > > > -                     && ops.length () > 3
> > > > > +                     && len > 3
> > > > > +                     && (!keep_fma_chain
> > > > > +                         || (keep_fma_chain
> > > > > +                             && len >
> > > > > + param_reassoc_max_chain_length_with_fma))
> > > >
> > > > in the case len < param_reassoc_max_chain_length_with_fma we have
> > > > the chain re-sorted but fall through to non-parallel rewrite.  I
> > > > wonder if we do not want to instead adjust the reassociation width?
> > > > I'd say it depends on the number of mult cases in the chain (sth the re-
> > sorting could have computed).
> > > > Why do we have two completely independent --params here?  Can you
> > > > give an example --param value combination that makes "sense" and
> > > > show how it is beneficial?
> > >
> > > For this small case https://godbolt.org/z/Pxczrre8P a * b + c * d + e
> > > * f  + j
> > >
> > > GCC trunk: ops_num = 4, targetm.sched.reassociation_width is 4 (scalar fp
> > cost is 4). Calculated: Width = 2. we can get 2 FMAs.
> > > ----------------------------------
> > >   _1 = a_6(D) * b_7(D);
> > >   _2 = c_8(D) * d_9(D);
> > >   _5 = _1 + _2;
> > >   _4 = e_10(D) * f_11(D);
> > >   _3 = _4 + j_12(D);
> > >   _13 = _3 + _5;
> > > --------------------------------------------------------
> > >   _2 = c_8(D) * d_9(D);
> > >   _5 = .FMA (a_6(D), b_7(D), _2);
> > >   _3 = .FMA (e_10(D), f_11(D), j_12(D));
> > >   _13 = _3 + _5;
> > > --------------------------------------------------------
> > > New patch: If just rearrange ops and fall through to parallel rewrite to
> > break the chain with width = 2.
> > >
> > > ---------------------------------------------------------
> > >   _1 = a_6(D) * b_7(D);
> > >   _2 = j + _1;          -----> put j at the first.
> > >   _3 = c_8(D) * d_9(D);
> > >   _4 = e_10(D) * f_11(D);
> > >   _5 = _3 + _4;       -----> break chain with width = 2. we lost a FMA here.
> > >   _13 = _2 + _5;
> > >
> > > -------------------------------------------------------
> > >   _3 = c_8(D) * d_9(D);
> > >   _2 = .FMA (a_6(D), b_7(D), j);
> > >   _5 = .FMA (e_10(D), f_11(D), _3);
> > >   _13 = _2 + _5;
> > > --------------------------------------------------------
> > > Sometimes break chain will lose FMA( break chain needs put two
> > > mult-ops together, which will lose one FMA ), we can only get 2 FMAs
> > > here, if we want to get 3 FMAs, we need to keep the chain and not
> > > break it. So I added a param to control chain length
> > > "param_reassoc_max_chain_length_with_fma = 4" (For the small case in
> > > Bugzilla 98350, we need to keep the chain to generate 6 FMAs.)
> > > -------------------------------------------------------
> > >   _1 = a_6(D) * b_7(D);
> > >   _2 = c_8(D) * d_9(D);
> > >   _4 = e_10(D) * f_11(D);
> > >   _15 = _4 + j_12(D);
> > >   _16 = _15 + _2;
> > >   _13 = _16 + _1;
> > > -------------------------------------------------------
> > >   _15 = .FMA (e_10(D), f_11(D), j_12(D));
> > >   _16 = .FMA (c_8(D), d_9(D), _15);
> > >   _13 = .FMA (a_6(D), b_7(D), _16);
> > > -------------------------------------------------------
> > > In some case we want to break the chain with width, we can set
> > "param_reassoc_max_chain_length_with_fma = 2", it will rearrange ops and
> > break the chain with width.
> >
> > ... it sounds like the problem could be fully addressed by sorting the chain
> > with reassoc-width in mind?
> > Wouldn't it be preferable if rewrite_expr_tree_parallel would get a vector of
> > mul and a vector of non-mul ops so it can pick from the optimal candidate?
> >
> > That said, I think rewrite_expr_tree_parallel_for_fma at least needs more
> > comments.
> >
> Sorry for not writing note clearly enough, I'll add more.
> I have two places that need to be clarified.
>
> 1. For some case we need to keep chain to generate more FMAs, because break chain will lose FMA.
>    for example  g + a * b + c * d + e * f,
>    Keep chain can get 3 FMAs, break chain can get 2 FMAs. It's hard to say which one is better, so we provide a param for users to customize.
>
> 2. when the chain has FMAs and need to break the chain with width,
> for example l + a * b + c * d + e * f + g * h + j * k;(we already put non-mul first)
> rewrite_expr_tree_parallel :
> when width = 2, it will break the chain like this. actually it break the chain in to 3. It ignores the width and adds all ops two by two. it will lose FMA.
>
> ssa1 = l + a * b;
> ssa2 = c * d + e * f;
> ssa3 = g * h + j * k;
> ssa4 = ssa1 + ssa2;
> ssa5 = ssa4 + ssa3;
>
> rewrite_expr_tree_parallel_for_fma
> when width = 2, we break the chain into two like this.
>
> ssa1 = l + a * b;
> ssa2 = c * d + e * f;
> ssa3 = ssa1 + g * h;
> ssa4 = ssa2 + j * k;
> ssa5 = ssa3 +ssa4;
>
> I think it's okay to remove or keep rewrite_expr_tree_parallel_for_fma. More FMAs are generated only for some special cases.
> I'm not sure whether the new method is better than the old one. I created a small case the execution time of the two sequences is almost the same on SPR and ICX.

I think to make a difference you need to hit the number of parallel
fadd/fmul the pipeline can perform.  I don't
think issue width is ever a problem for chains w/o fma and throughput
of fma vs fadd + fmul should be similar.

That said, I think iff then we should try to improve
rewrite_expr_tree_parallel rather than adding a new
function.  For example for the case with equal rank operands we can
try to sort adds first.  I can't convince
myself that rewrite_expr_tree_parallel honors ranks properly quickly.

Richard.

> Thanks,
> Lili.
>


* RE: [PATCH 1/2] PR gcc/98350:Add a param to control the length of the chain with FMA in reassoc pass
  2023-05-15 13:12         ` Richard Biener
@ 2023-05-17 13:05           ` Cui, Lili
  2023-05-22 13:15             ` Richard Biener
  0 siblings, 1 reply; 10+ messages in thread
From: Cui, Lili @ 2023-05-17 13:05 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches

> I think to make a difference you need to hit the number of parallel fadd/fmul
> the pipeline can perform.  I don't think issue width is ever a problem for
> chains w/o fma and throughput of fma vs fadd + fmul should be similar.
> 

Yes, for the x86 backend, fadd, fmul and fma have the same throughput, meaning they should have the same width.
The current implementation is reasonable: /* reassoc int, fp, vec_int, vec_fp.  */.

> That said, I think iff then we should try to improve
> rewrite_expr_tree_parallel rather than adding a new function.  For example
> for the case with equal rank operands we can try to sort adds first.  I can't
> convince myself that rewrite_expr_tree_parallel honors ranks properly
> quickly.
> 

I rewrote this patch; there are mainly two changes:
1. I made some changes to rewrite_expr_tree_parallel_for_fma and used it instead of rewrite_expr_tree_parallel. The example below shows that the sequence generated by this patch is better.
2. Non-mult ops and mult ops are placed alternately at the end of the queue, which helps generate more FMAs and reduces the FMAs lost when breaking the chain (a rough sketch of this ordering follows).
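
A rough standalone sketch of the ordering in point 2 (purely illustrative: the
struct and helper below are hypothetical stand-ins for the operand_entry
handling in tree-ssa-reassoc.cc, and this is just one reading of "alternately"):

#include <stdbool.h>
#include <stddef.h>

struct op { const char *name; bool feeds_from_mult; };

/* Partition OPS into non-mult and mult operands (assumes N <= 64), then
   emit them alternately into OUT, so that when the chain is later split
   into sub-chains each split point still pairs a mult operand with an
   addend and can be fused into an FMA.  */
static size_t
order_ops_for_fma (const struct op *ops, size_t n, struct op *out)
{
  struct op mult[64], other[64];
  size_t nm = 0, no = 0, k = 0;

  for (size_t i = 0; i < n; i++)
    {
      if (ops[i].feeds_from_mult)
        mult[nm++] = ops[i];
      else
        other[no++] = ops[i];
    }

  for (size_t i = 0, j = 0; i < no || j < nm;)
    {
      if (i < no)
        out[k++] = other[i++];
      if (j < nm)
        out[k++] = mult[j++];
    }
  return k;
}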
  
With these two changes, GCC can break the chain with width = 2 and generate 6 FMAs for https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98350 without any params.

------------------------------------------------------------------------------------------------------------------
Source code: g + h + j + s + m + n + a + b + e  (https://godbolt.org/z/G8sb86n84)
Compile options: -Ofast -mfpmath=sse -mfma
Width = 3 was chosen for reassociation
-----------------------------------------------------------------------------------------------------------------
Old rewrite_expr_tree_parallel generates:
  _6 = g_8(D) + h_9(D);       ------> parallel 0
  _3 = s_11(D) + m_12(D);  ------> parallel 1
  _5 = _3 + j_10(D);
  _2 = n_13(D) + a_14(D);   ------> parallel 2
  _1 = b_15(D) + e_16(D);  -----> parallel 3; this is not necessary, and it is not FMA-friendly.
  _4 = _1 + _2;        
  _7 = _4 + _5;        
  _17 = _6 + _7;      
  return _17;

When the width = 3,  we need 5 cycles here.
---------------------------------------------first end---------------------------------------------------------
Rewriting the old rewrite_expr_tree_parallel (3 sets in parallel) generates:

  _3 = s_11(D) + m_12(D);  ------> parallel 0
  _5 = _3 + j_10(D);
  _2 = n_13(D) + a_14(D);   ------> parallel 1
  _1 = b_15(D) + e_16(D);   ------> parallel 2
  _4 = _1 + _2;
  _6 = _4 + _5;
  _7 = _6 + h_9(D);
  _17 = _7 + g_8(D); 
  return _17;

When the width = 3, we need 5 cycles here.
---------------------------------------------second end-------------------------------------------------------
Using rewrite_expr_tree_parallel_for_fma instead of rewrite_expr_tree_parallel generates:

  _3 = s_11(D) + m_12(D);
  _6 = _3 + g_8(D);
  _2 = n_13(D) + a_14(D);
  _5 = _2 + h_9(D);
  _1 = b_15(D) + e_16(D);
  _4 = _1 + j_10(D);
  _7 = _4 + _5;
  _17 = _7 + _6;
  return _17;

When the width = 3, we need 4 cycles here.
--------------------------------------------third end-----------------------------------------------------------
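
For completeness, the source above as a standalone function (a sketch only,
compiled with the options listed above, -Ofast -mfpmath=sse -mfma):

float
foo (float g, float h, float j, float s, float m, float n,
     float a, float b, float e)
{
  /* A pure add chain of nine operands; with width = 3 the three rewrites
     compared above need 5, 5 and 4 cycles respectively, as counted in
     the comparison.  */
  return g + h + j + s + m + n + a + b + e;
}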

Thanks,
Lili.



* Re: [PATCH 1/2] PR gcc/98350:Add a param to control the length of the chain with FMA in reassoc pass
  2023-05-17 13:05           ` Cui, Lili
@ 2023-05-22 13:15             ` Richard Biener
  0 siblings, 0 replies; 10+ messages in thread
From: Richard Biener @ 2023-05-22 13:15 UTC (permalink / raw)
  To: Cui, Lili; +Cc: gcc-patches

On Wed, May 17, 2023 at 3:05 PM Cui, Lili <lili.cui@intel.com> wrote:
>
> > I think to make a difference you need to hit the number of parallel fadd/fmul
> > the pipeline can perform.  I don't think issue width is ever a problem for
> > chains w/o fma and throughput of fma vs fadd + fmul should be similar.
> >
>
> Yes, for x86 backend, fadd , fmul and fma have the same TP meaning they should have the same width.
> The current implementation is reasonable  /* reassoc int, fp, vec_int, vec_fp.  */.
>
> > That said, I think iff then we should try to improve
> > rewrite_expr_tree_parallel rather than adding a new function.  For example
> > for the case with equal rank operands we can try to sort adds first.  I can't
> > convince myself that rewrite_expr_tree_parallel honors ranks properly
> > quickly.
> >
>
> I rewrite this patch, there are mainly two changes:
> 1. I made some changes to rewrite_expr_tree_parallel_for_fma and used it instead of rewrite_expr_tree_parallel. The following example shows that the sequence generated by the this patch is better.
> 2. Put no-mult ops and mult ops alternately at the end of the queue, which is conducive to generating more fma and reducing the loss of FMA when breaking the chain.
>
> With these two changes, GCC can break the chain with width = 2 and generates 6 FMAs for https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98350  without any params.
>
> ------------------------------------------------------------------------------------------------------------------
> Source code: g + h + j + s + m + n+a+b +e  (https://godbolt.org/z/G8sb86n84)
> Compile options: -Ofast -mfpmath=sse -mfma
> Width = 3 was chosen for reassociation
> -----------------------------------------------------------------------------------------------------------------
> Old rewrite_expr_tree_parallel generates:
>   _6 = g_8(D) + h_9(D);       ------> parallel 0
>   _3 = s_11(D) + m_12(D);  ------> parallel 1
>   _5 = _3 + j_10(D);
>   _2 = n_13(D) + a_14(D);   ------> parallel 2
>   _1 = b_15(D) + e_16(D);  -----> Parallel 3, This is not necessary, and it is not friendly to FMA.
>   _4 = _1 + _2;
>   _7 = _4 + _5;
>   _17 = _6 + _7;
>   return _17;
>
> When the width = 3,  we need 5 cycles here.
> ---------------------------------------------first end---------------------------------------------------------
> Rewrite the old rewrite_expr_tree_parallel (3 sets in parallel) generates:
>
>   _3 = s_11(D) + m_12(D);  ------> parallel 0
>   _5 = _3 + j_10(D);
>   _2 = n_13(D) + a_14(D);   ------> parallel 1
>   _1 = b_15(D) + e_16(D);   ------> parallel 2
>   _4 = _1 + _2;
>   _6 = _4 + _5;
>   _7 = _6 + h_9(D);
>   _17 = _7 + g_8(D);
>   return _17;
>
> When the width = 3, we need 5 cycles here.
> ---------------------------------------------second end-------------------------------------------------------
> Use rewrite_expr_tree_parallel_for_fma instead of rewrite_expr_tree_parallel generates:
>
>   _3 = s_11(D) + m_12(D);
>   _6 = _3 + g_8(D);
>   _2 = n_13(D) + a_14(D);
>   _5 = _2 + h_9(D);
>   _1 = b_15(D) + e_16(D);
>   _4 = _1 + j_10(D);
>   _7 = _4 + _5;
>   _17 = _7 + _6;
>   return _17;
>
> When the width = 3, we need 4 cycles here.
> --------------------------------------------third end-----------------------------------------------------------

Yes, so what I was saying is that I doubt rewrite_expr_tree_parallel
is optimal - you show
that for the specific example rewrite_expr_tree_parallel_for_fma is
better.  I was arguing
we want a single function, whether we single out leaves with
multiplications or not.

And we want documentation that shows the strategy will result in optimal latency
(I think we should not sacrifice latency just for the sake of forming
more FMAs).

Richard.

>
> Thanks,
> Lili.
>

