public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
* [PATCH] tree-optimization/110279- Check for nested FMA chains in reassoc
@ 2023-06-16  8:50 Di Zhao OS
  2023-07-07  8:26 ` [PING][PATCH] " Di Zhao OS
  0 siblings, 1 reply; 3+ messages in thread
From: Di Zhao OS @ 2023-06-16  8:50 UTC (permalink / raw)
  To: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 1508 bytes --]

This patch is to fix the regressions found in SPEC2017 fprate cases
 on aarch64.

1. Reused code in pass widening_mul to check for nested FMA chains
 (those connected by MULT_EXPRs), since re-writing to parallel
 generates worse codes.

2. Avoid re-arrange to produce less FMA chains that can be slow.

Tested on ampere1 and neoverse-n1, this fixed the regressions in
508.namd_r and 510.parest_r 1 copy run. While I'm still collecting data
on x86 machines we have, I'd like to know what do you think of this.

(Previously I tried to improve things with FMA by adding a widening_mul
pass before reassoc2 for it's easier to recognize different patterns
of FMA chains and decide whether to split them. But I suppose handling
them all in reassoc pass is more efficient.)

Thanks,
Di Zhao

---
gcc/ChangeLog:

        * tree-ssa-math-opts.cc (convert_mult_to_fma_1): Add new parameter.
        Support new mode that merely do the checking. 
        (struct fma_transformation_info): Moved to header.
        (class fma_deferring_state): Moved to header.
        (convert_mult_to_fma): Add new parameter.
        * tree-ssa-math-opts.h (struct fma_transformation_info):
        (class fma_deferring_state): Moved from .cc.
        (convert_mult_to_fma): Add function decl.
        * tree-ssa-reassoc.cc (rewrite_expr_tree_parallel):
        (rank_ops_for_fma): Return -1 if nested FMAs are found.
        (reassociate_bb): Avoid rewriting to parallel if nested FMAs are found.


[-- Attachment #2: pr110279-Check-for-nested-FMA-chains-in-reassoc.diff --]
[-- Type: application/octet-stream, Size: 12037 bytes --]

diff --git a/gcc/tree-ssa-math-opts.cc b/gcc/tree-ssa-math-opts.cc
index 9c9ca571bcd..98b7b70e14c 100644
--- a/gcc/tree-ssa-math-opts.cc
+++ b/gcc/tree-ssa-math-opts.cc
@@ -3067,7 +3067,8 @@ convert_plusminus_to_widen (gimple_stmt_iterator *gsi, gimple *stmt,
    multiplication, converted to FMAs, perform the transformation.  */
 
 static void
-convert_mult_to_fma_1 (tree mul_result, tree op1, tree op2)
+convert_mult_to_fma_1 (tree mul_result, tree op1, tree op2,
+		       tree *return_lhs = NULL)
 {
   tree type = TREE_TYPE (mul_result);
   gimple *use_stmt;
@@ -3091,8 +3092,11 @@ convert_mult_to_fma_1 (tree mul_result, tree op1, tree op2)
 	  use_operand_p use_p;
 	  gimple *neguse_stmt;
 	  single_imm_use (gimple_assign_lhs (use_stmt), &use_p, &neguse_stmt);
-	  gsi_remove (&gsi, true);
-	  release_defs (use_stmt);
+	  if (!return_lhs)
+	    {
+	      gsi_remove (&gsi, true);
+	      release_defs (use_stmt);
+	    }
 
 	  use_stmt = neguse_stmt;
 	  gsi = gsi_for_stmt (use_stmt);
@@ -3106,6 +3110,12 @@ convert_mult_to_fma_1 (tree mul_result, tree op1, tree op2)
 	gcc_unreachable ();
       addop = ops[0] == result ? ops[1] : ops[0];
 
+      if (return_lhs)
+	{
+	  *return_lhs = gimple_get_lhs (use_stmt);
+	  return;
+	}
+
       if (code == MINUS_EXPR)
 	{
 	  if (ops[0] == result)
@@ -3181,54 +3191,6 @@ convert_mult_to_fma_1 (tree mul_result, tree op1, tree op2)
     }
 }
 
-/* Data necessary to perform the actual transformation from a multiplication
-   and an addition to an FMA after decision is taken it should be done and to
-   then delete the multiplication statement from the function IL.  */
-
-struct fma_transformation_info
-{
-  gimple *mul_stmt;
-  tree mul_result;
-  tree op1;
-  tree op2;
-};
-
-/* Structure containing the current state of FMA deferring, i.e. whether we are
-   deferring, whether to continue deferring, and all data necessary to come
-   back and perform all deferred transformations.  */
-
-class fma_deferring_state
-{
-public:
-  /* Class constructor.  Pass true as PERFORM_DEFERRING in order to actually
-     do any deferring.  */
-
-  fma_deferring_state (bool perform_deferring)
-    : m_candidates (), m_mul_result_set (), m_initial_phi (NULL),
-      m_last_result (NULL_TREE), m_deferring_p (perform_deferring) {}
-
-  /* List of FMA candidates for which we the transformation has been determined
-     possible but we at this point in BB analysis we do not consider them
-     beneficial.  */
-  auto_vec<fma_transformation_info, 8> m_candidates;
-
-  /* Set of results of multiplication that are part of an already deferred FMA
-     candidates.  */
-  hash_set<tree> m_mul_result_set;
-
-  /* The PHI that supposedly feeds back result of a FMA to another over loop
-     boundary.  */
-  gphi *m_initial_phi;
-
-  /* Result of the last produced FMA candidate or NULL if there has not been
-     one.  */
-  tree m_last_result;
-
-  /* If true, deferring might still be profitable.  If false, transform all
-     candidates and no longer defer.  */
-  bool m_deferring_p;
-};
-
 /* Transform all deferred FMA candidates and mark STATE as no longer
    deferring.  */
 
@@ -3303,12 +3265,18 @@ last_fma_candidate_feeds_initial_phi (fma_deferring_state *state,
   or its unrolled version, i.e. with several FMA candidates that feed result
   of one into the addend of another.  Instead, we add them to a list in STATE
   and if we later discover an FMA candidate that is not part of such a chain,
-  we go back and perform all deferred past candidates.  */
+  we go back and perform all deferred past candidates.
 
-static bool
+  If RETURN_LHS is not NULL, instead of the actual conversion, set RETURN_LHS
+  and return true.  */
+
+bool
 convert_mult_to_fma (gimple *mul_stmt, tree op1, tree op2,
-		     fma_deferring_state *state, tree mul_cond = NULL_TREE)
+		     fma_deferring_state *state, tree mul_cond,
+		     tree *return_lhs)
 {
+  if (return_lhs)
+    *return_lhs = NULL_TREE;
   tree mul_result = gimple_get_lhs (mul_stmt);
   /* If there isn't a LHS then this can't be an FMA.  There can be no LHS
      if the statement was left just for the side-effects.  */
@@ -3550,7 +3518,7 @@ convert_mult_to_fma (gimple *mul_stmt, tree op1, tree op2,
     {
       if (state->m_deferring_p)
 	cancel_fma_deferring (state);
-      convert_mult_to_fma_1 (mul_result, op1, op2);
+      convert_mult_to_fma_1 (mul_result, op1, op2, return_lhs);
       return true;
     }
 }
diff --git a/gcc/tree-ssa-math-opts.h b/gcc/tree-ssa-math-opts.h
index 52b7938b599..252597ba428 100644
--- a/gcc/tree-ssa-math-opts.h
+++ b/gcc/tree-ssa-math-opts.h
@@ -23,4 +23,57 @@ along with GCC; see the file COPYING3.  If not see
 extern tree powi_as_mults (gimple_stmt_iterator *, location_t,
 			   tree, HOST_WIDE_INT);
 
+/* Data necessary to perform the actual transformation from a multiplication
+   and an addition to an FMA after decision is taken it should be done and to
+   then delete the multiplication statement from the function IL.  */
+
+struct fma_transformation_info
+{
+  gimple *mul_stmt;
+  tree mul_result;
+  tree op1;
+  tree op2;
+};
+
+/* Structure containing the current state of FMA deferring, i.e. whether we are
+   deferring, whether to continue deferring, and all data necessary to come
+   back and perform all deferred transformations.  */
+
+class fma_deferring_state
+{
+public:
+  /* Class constructor.  Pass true as PERFORM_DEFERRING in order to actually
+     do any deferring.  */
+
+  fma_deferring_state (bool perform_deferring)
+    : m_candidates (), m_mul_result_set (), m_initial_phi (NULL),
+      m_last_result (NULL_TREE), m_deferring_p (perform_deferring) {}
+
+  /* List of FMA candidates for which we the transformation has been determined
+     possible but we at this point in BB analysis we do not consider them
+     beneficial.  */
+  auto_vec<fma_transformation_info, 8> m_candidates;
+
+  /* Set of results of multiplication that are part of an already deferred FMA
+     candidates.  */
+  hash_set<tree> m_mul_result_set;
+
+  /* The PHI that supposedly feeds back result of a FMA to another over loop
+     boundary.  */
+  gphi *m_initial_phi;
+
+  /* Result of the last produced FMA candidate or NULL if there has not been
+     one.  */
+  tree m_last_result;
+
+  /* If true, deferring might still be profitable.  If false, transform all
+     candidates and no longer defer.  */
+  bool m_deferring_p;
+};
+
+bool
+convert_mult_to_fma (gimple *mul_stmt, tree op1, tree op2,
+		     fma_deferring_state *state, tree mul_cond = NULL_TREE,
+		     tree *return_lhs = NULL);
+
 #endif  /* GCC_TREE_SSA_MATH_OPTS_H  */
diff --git a/gcc/tree-ssa-reassoc.cc b/gcc/tree-ssa-reassoc.cc
index 96c88ec003e..dd78f41ae7b 100644
--- a/gcc/tree-ssa-reassoc.cc
+++ b/gcc/tree-ssa-reassoc.cc
@@ -5488,7 +5488,7 @@ get_reassociation_width (int ops_num, enum tree_code opc,
    as possible.  */
 static void
 rewrite_expr_tree_parallel (gassign *stmt, int width, bool has_fma,
-					 const vec<operand_entry *> &ops)
+			    const vec<operand_entry *> &ops)
 {
   enum tree_code opcode = gimple_assign_rhs_code (stmt);
   int op_num = ops.length ();
@@ -5517,7 +5517,7 @@ rewrite_expr_tree_parallel (gassign *stmt, int width, bool has_fma,
   /* Build parallel dependency chain according to width.  */
   for (i = 0; i < width; i++)
     {
-      /* If the chain has FAM, we do not swap two operands.  */
+      /* If the chain has FMA, we do not swap two operands.  */
       if (op_index > 1 && !has_fma)
 	swap_ops_for_binary_stmt (ops, op_index - 2);
 
@@ -6678,8 +6678,8 @@ transform_stmt_to_multiply (gimple_stmt_iterator *gsi, gimple *stmt,
 
    _4  = .FMA (c_7(D), d_8(D), _3);
    _11 = .FMA (a_5(D), b_6(D), _4);  */
-static bool
-rank_ops_for_fma (vec<operand_entry *> *ops)
+static int
+rank_ops_for_fma (vec<operand_entry *> *ops, tree type)
 {
   operand_entry *oe;
   unsigned int i;
@@ -6701,6 +6701,59 @@ rank_ops_for_fma (vec<operand_entry *> *ops)
       else
 	ops_others.safe_push (oe);
     }
+
+  /* Check the lhs of MULT_EXPRs, if there's a FMA chain with pattern:
+
+	d = .FMA (a, b, c);
+	g = .FMA (d, e, f);
+	...
+
+     then return -1, to avoid rewrite_expr_tree_parallel.  */
+  fma_deferring_state fma_state (false);
+  FOR_EACH_VEC_ELT (ops_mult, i, oe)
+    {
+      gimple *mul_stmt = SSA_NAME_DEF_STMT (oe->op);
+      tree lhs, lhs2;
+      if (convert_mult_to_fma (mul_stmt, gimple_assign_rhs1 (mul_stmt),
+			       gimple_assign_rhs2 (mul_stmt), &fma_state, NULL,
+			       &lhs))
+	{
+	  /* Check for MULT_EXPR that uses LHS.  */
+	  imm_use_iterator imm_iter;
+	  gimple *use_stmt;
+	  use_operand_p use_p;
+	  FOR_EACH_IMM_USE_FAST (use_p, imm_iter, lhs)
+	    {
+	      use_stmt = USE_STMT (use_p);
+	      if (is_gimple_debug (use_stmt))
+		continue;
+	      if (gimple_bb (use_stmt) != gimple_bb (mul_stmt))
+		continue;
+
+	      if (is_gimple_assign (use_stmt)
+		  && gimple_assign_rhs_code (use_stmt) == MULT_EXPR
+		  && convert_mult_to_fma (use_stmt,
+					  gimple_assign_rhs1 (use_stmt),
+					  gimple_assign_rhs2 (use_stmt),
+					  &fma_state, NULL, &lhs2)
+		  && gimple_bb (SSA_NAME_DEF_STMT (lhs2))
+		       == gimple_bb (mul_stmt))
+		{
+		  if (dump_file && (dump_flags & TDF_DETAILS))
+		    {
+		      fprintf (dump_file,
+			       "Found possible non-flat FMA chain. LHS1: ");
+		      print_generic_expr (dump_file, lhs);
+		      fprintf (dump_file, ", LHS2: ");
+		      print_generic_expr (dump_file, lhs2);
+		      fprintf (dump_file, "\n");
+		    }
+		  return -1;
+		}
+	    }
+	}
+    }
+
   /* 1. When ops_mult.length == 2, like the following case,
 
      a * b + c * d + e.
@@ -6712,6 +6765,11 @@ rank_ops_for_fma (vec<operand_entry *> *ops)
      2. If all ops are defined with mult, we don't need to rearrange them.  */
   if (ops_mult.length () >= 2 && ops_mult.length () != ops_length)
     {
+      if (maybe_le (tree_to_poly_int64 (TYPE_SIZE (type)),
+		    param_avoid_fma_max_bits))
+	/* Avoid re-arrange to produce less FMA chains that can be slow.  */
+	return 1;
+
       /* Put no-mult ops and mult ops alternately at the end of the
 	 queue, which is conducive to generating more FMA and reducing the
 	 loss of FMA when breaking the chain.  */
@@ -6726,9 +6784,9 @@ rank_ops_for_fma (vec<operand_entry *> *ops)
 	  if (opindex > 0)
 	    opindex--;
 	}
-      return true;
+      return 1;
     }
-  return false;
+  return 0;
 }
 /* Reassociate expressions in basic block BB and its post-dominator as
    children.
@@ -6894,7 +6952,7 @@ reassociate_bb (basic_block bb)
 		  machine_mode mode = TYPE_MODE (TREE_TYPE (lhs));
 		  int ops_num = ops.length ();
 		  int width;
-		  bool has_fma = false;
+		  int has_fma = 0;
 
 		  /* For binary bit operations, if there are at least 3
 		     operands and the last operand in OPS is a constant,
@@ -6917,7 +6975,7 @@ reassociate_bb (basic_block bb)
 						      opt_type)
 		      && (rhs_code == PLUS_EXPR || rhs_code == MINUS_EXPR))
 		    {
-		      has_fma = rank_ops_for_fma (&ops);
+		      has_fma = rank_ops_for_fma (&ops, TREE_TYPE (lhs));
 		    }
 
 		  /* Only rewrite the expression tree to parallel in the
@@ -6925,6 +6983,7 @@ reassociate_bb (basic_block bb)
 		     with initial linearization.  */
 		  if (!reassoc_insert_powi_p
 		      && ops.length () > 3
+		      && has_fma >= 0
 		      && (width = get_reassociation_width (ops_num, rhs_code,
 							   mode)) > 1)
 		    {
@@ -6934,7 +6993,7 @@ reassociate_bb (basic_block bb)
 				 width);
 		      rewrite_expr_tree_parallel (as_a <gassign *> (stmt),
 						  width,
-						  has_fma,
+						  has_fma > 0,
 						  ops);
 		    }
 		  else
@@ -6943,7 +7002,7 @@ reassociate_bb (basic_block bb)
 			 to make sure the ones that get the double
 			 binary op are chosen wisely.  */
 		      int len = ops.length ();
-		      if (len >= 3 && !has_fma)
+		      if (len >= 3 && has_fma == 0)
 			swap_ops_for_binary_stmt (ops, len - 3);
 
 		      new_lhs = rewrite_expr_tree (stmt, rhs_code, 0, ops,
-- 
2.25.1


^ permalink raw reply	[flat|nested] 3+ messages in thread

* [PING][PATCH] tree-optimization/110279- Check for nested FMA chains in reassoc
  2023-06-16  8:50 [PATCH] tree-optimization/110279- Check for nested FMA chains in reassoc Di Zhao OS
@ 2023-07-07  8:26 ` Di Zhao OS
  2023-07-07 14:31   ` Philipp Tomsich
  0 siblings, 1 reply; 3+ messages in thread
From: Di Zhao OS @ 2023-07-07  8:26 UTC (permalink / raw)
  To: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 2133 bytes --]

Update the patch so it can apply.

Tested on spec2017 fprate cases again. With option "-funroll-loops -Ofast -flto",
the improvements of 1-copy run are:

Ampere1:
    508.namd_r	4.26% 
    510.parest_r	2.55%
    Overall		0.54%
Intel Xeon:
    503.bwaves_r	1.3%	
    508.namd_r	1.58%
    overall		0.42%


Thanks,
Di Zhao


> -----Original Message-----
> From: Di Zhao OS
> Sent: Friday, June 16, 2023 4:51 PM
> To: gcc-patches@gcc.gnu.org
> Subject: [PATCH] tree-optimization/110279- Check for nested FMA chains in
> reassoc
> 
> This patch is to fix the regressions found in SPEC2017 fprate cases
>  on aarch64.
> 
> 1. Reused code in pass widening_mul to check for nested FMA chains
>  (those connected by MULT_EXPRs), since re-writing to parallel
>  generates worse codes.
> 
> 2. Avoid re-arrange to produce less FMA chains that can be slow.
> 
> Tested on ampere1 and neoverse-n1, this fixed the regressions in
> 508.namd_r and 510.parest_r 1 copy run. While I'm still collecting data
> on x86 machines we have, I'd like to know what do you think of this.
> 
> (Previously I tried to improve things with FMA by adding a widening_mul
> pass before reassoc2 for it's easier to recognize different patterns
> of FMA chains and decide whether to split them. But I suppose handling
> them all in reassoc pass is more efficient.)
> 
> Thanks,
> Di Zhao
> 
> ---
> gcc/ChangeLog:
> 
>         * tree-ssa-math-opts.cc (convert_mult_to_fma_1): Add new parameter.
>         Support new mode that merely do the checking.
>         (struct fma_transformation_info): Moved to header.
>         (class fma_deferring_state): Moved to header.
>         (convert_mult_to_fma): Add new parameter.
>         * tree-ssa-math-opts.h (struct fma_transformation_info):
>         (class fma_deferring_state): Moved from .cc.
>         (convert_mult_to_fma): Add function decl.
>         * tree-ssa-reassoc.cc (rewrite_expr_tree_parallel):
>         (rank_ops_for_fma): Return -1 if nested FMAs are found.
>         (reassociate_bb): Avoid rewriting to parallel if nested FMAs are
> found.


[-- Attachment #2: 0001-Check-for-nested-FMA-chains-in-reassoc.patch --]
[-- Type: application/octet-stream, Size: 11274 bytes --]

diff --git a/gcc/tree-ssa-math-opts.cc b/gcc/tree-ssa-math-opts.cc
index 68fc518b1ab..19e2bef9307 100644
--- a/gcc/tree-ssa-math-opts.cc
+++ b/gcc/tree-ssa-math-opts.cc
@@ -3067,7 +3067,8 @@ convert_plusminus_to_widen (gimple_stmt_iterator *gsi, gimple *stmt,
    multiplication, converted to FMAs, perform the transformation.  */
 
 static void
-convert_mult_to_fma_1 (tree mul_result, tree op1, tree op2)
+convert_mult_to_fma_1 (tree mul_result, tree op1, tree op2,
+		       tree *return_lhs = NULL)
 {
   tree type = TREE_TYPE (mul_result);
   gimple *use_stmt;
@@ -3091,8 +3092,11 @@ convert_mult_to_fma_1 (tree mul_result, tree op1, tree op2)
 	  use_operand_p use_p;
 	  gimple *neguse_stmt;
 	  single_imm_use (gimple_assign_lhs (use_stmt), &use_p, &neguse_stmt);
-	  gsi_remove (&gsi, true);
-	  release_defs (use_stmt);
+	  if (!return_lhs)
+	    {
+	      gsi_remove (&gsi, true);
+	      release_defs (use_stmt);
+	    }
 
 	  use_stmt = neguse_stmt;
 	  gsi = gsi_for_stmt (use_stmt);
@@ -3106,6 +3110,12 @@ convert_mult_to_fma_1 (tree mul_result, tree op1, tree op2)
 	gcc_unreachable ();
       addop = ops[0] == result ? ops[1] : ops[0];
 
+      if (return_lhs)
+	{
+	  *return_lhs = gimple_get_lhs (use_stmt);
+	  return;
+	}
+
       if (code == MINUS_EXPR)
 	{
 	  if (ops[0] == result)
@@ -3181,54 +3191,6 @@ convert_mult_to_fma_1 (tree mul_result, tree op1, tree op2)
     }
 }
 
-/* Data necessary to perform the actual transformation from a multiplication
-   and an addition to an FMA after decision is taken it should be done and to
-   then delete the multiplication statement from the function IL.  */
-
-struct fma_transformation_info
-{
-  gimple *mul_stmt;
-  tree mul_result;
-  tree op1;
-  tree op2;
-};
-
-/* Structure containing the current state of FMA deferring, i.e. whether we are
-   deferring, whether to continue deferring, and all data necessary to come
-   back and perform all deferred transformations.  */
-
-class fma_deferring_state
-{
-public:
-  /* Class constructor.  Pass true as PERFORM_DEFERRING in order to actually
-     do any deferring.  */
-
-  fma_deferring_state (bool perform_deferring)
-    : m_candidates (), m_mul_result_set (), m_initial_phi (NULL),
-      m_last_result (NULL_TREE), m_deferring_p (perform_deferring) {}
-
-  /* List of FMA candidates for which we the transformation has been determined
-     possible but we at this point in BB analysis we do not consider them
-     beneficial.  */
-  auto_vec<fma_transformation_info, 8> m_candidates;
-
-  /* Set of results of multiplication that are part of an already deferred FMA
-     candidates.  */
-  hash_set<tree> m_mul_result_set;
-
-  /* The PHI that supposedly feeds back result of a FMA to another over loop
-     boundary.  */
-  gphi *m_initial_phi;
-
-  /* Result of the last produced FMA candidate or NULL if there has not been
-     one.  */
-  tree m_last_result;
-
-  /* If true, deferring might still be profitable.  If false, transform all
-     candidates and no longer defer.  */
-  bool m_deferring_p;
-};
-
 /* Transform all deferred FMA candidates and mark STATE as no longer
    deferring.  */
 
@@ -3303,12 +3265,18 @@ last_fma_candidate_feeds_initial_phi (fma_deferring_state *state,
   or its unrolled version, i.e. with several FMA candidates that feed result
   of one into the addend of another.  Instead, we add them to a list in STATE
   and if we later discover an FMA candidate that is not part of such a chain,
-  we go back and perform all deferred past candidates.  */
+  we go back and perform all deferred past candidates.
 
-static bool
+  If RETURN_LHS is not NULL, instead of the actual conversion, set RETURN_LHS
+  and return true.  */
+
+bool
 convert_mult_to_fma (gimple *mul_stmt, tree op1, tree op2,
-		     fma_deferring_state *state, tree mul_cond = NULL_TREE)
+		     fma_deferring_state *state, tree mul_cond,
+		     tree *return_lhs)
 {
+  if (return_lhs)
+    *return_lhs = NULL_TREE;
   tree mul_result = gimple_get_lhs (mul_stmt);
   /* If there isn't a LHS then this can't be an FMA.  There can be no LHS
      if the statement was left just for the side-effects.  */
@@ -3550,7 +3518,7 @@ convert_mult_to_fma (gimple *mul_stmt, tree op1, tree op2,
     {
       if (state->m_deferring_p)
 	cancel_fma_deferring (state);
-      convert_mult_to_fma_1 (mul_result, op1, op2);
+      convert_mult_to_fma_1 (mul_result, op1, op2, return_lhs);
       return true;
     }
 }
diff --git a/gcc/tree-ssa-math-opts.h b/gcc/tree-ssa-math-opts.h
index 52b7938b599..252597ba428 100644
--- a/gcc/tree-ssa-math-opts.h
+++ b/gcc/tree-ssa-math-opts.h
@@ -23,4 +23,57 @@ along with GCC; see the file COPYING3.  If not see
 extern tree powi_as_mults (gimple_stmt_iterator *, location_t,
 			   tree, HOST_WIDE_INT);
 
+/* Data necessary to perform the actual transformation from a multiplication
+   and an addition to an FMA after decision is taken it should be done and to
+   then delete the multiplication statement from the function IL.  */
+
+struct fma_transformation_info
+{
+  gimple *mul_stmt;
+  tree mul_result;
+  tree op1;
+  tree op2;
+};
+
+/* Structure containing the current state of FMA deferring, i.e. whether we are
+   deferring, whether to continue deferring, and all data necessary to come
+   back and perform all deferred transformations.  */
+
+class fma_deferring_state
+{
+public:
+  /* Class constructor.  Pass true as PERFORM_DEFERRING in order to actually
+     do any deferring.  */
+
+  fma_deferring_state (bool perform_deferring)
+    : m_candidates (), m_mul_result_set (), m_initial_phi (NULL),
+      m_last_result (NULL_TREE), m_deferring_p (perform_deferring) {}
+
+  /* List of FMA candidates for which we the transformation has been determined
+     possible but we at this point in BB analysis we do not consider them
+     beneficial.  */
+  auto_vec<fma_transformation_info, 8> m_candidates;
+
+  /* Set of results of multiplication that are part of an already deferred FMA
+     candidates.  */
+  hash_set<tree> m_mul_result_set;
+
+  /* The PHI that supposedly feeds back result of a FMA to another over loop
+     boundary.  */
+  gphi *m_initial_phi;
+
+  /* Result of the last produced FMA candidate or NULL if there has not been
+     one.  */
+  tree m_last_result;
+
+  /* If true, deferring might still be profitable.  If false, transform all
+     candidates and no longer defer.  */
+  bool m_deferring_p;
+};
+
+bool
+convert_mult_to_fma (gimple *mul_stmt, tree op1, tree op2,
+		     fma_deferring_state *state, tree mul_cond = NULL_TREE,
+		     tree *return_lhs = NULL);
+
 #endif  /* GCC_TREE_SSA_MATH_OPTS_H  */
diff --git a/gcc/tree-ssa-reassoc.cc b/gcc/tree-ssa-reassoc.cc
index eda03bf98a6..73f5c93595c 100644
--- a/gcc/tree-ssa-reassoc.cc
+++ b/gcc/tree-ssa-reassoc.cc
@@ -6781,8 +6781,8 @@ transform_stmt_to_multiply (gimple_stmt_iterator *gsi, gimple *stmt,
 
    _4  = .FMA (c_7(D), d_8(D), _3);
    _11 = .FMA (a_5(D), b_6(D), _4);  */
-static bool
-rank_ops_for_fma (vec<operand_entry *> *ops)
+static int
+rank_ops_for_fma (vec<operand_entry *> *ops, tree type)
 {
   operand_entry *oe;
   unsigned int i;
@@ -6804,6 +6804,59 @@ rank_ops_for_fma (vec<operand_entry *> *ops)
       else
 	ops_others.safe_push (oe);
     }
+
+  /* Check the lhs of MULT_EXPRs, if there's a FMA chain with pattern:
+
+	d = .FMA (a, b, c);
+	g = .FMA (d, e, f);
+	...
+
+     then return -1, to avoid rewrite_expr_tree_parallel.  */
+  fma_deferring_state fma_state (false);
+  FOR_EACH_VEC_ELT (ops_mult, i, oe)
+    {
+      gimple *mul_stmt = SSA_NAME_DEF_STMT (oe->op);
+      tree lhs, lhs2;
+      if (convert_mult_to_fma (mul_stmt, gimple_assign_rhs1 (mul_stmt),
+			       gimple_assign_rhs2 (mul_stmt), &fma_state, NULL,
+			       &lhs))
+	{
+	  /* Check for MULT_EXPR that uses LHS.  */
+	  imm_use_iterator imm_iter;
+	  gimple *use_stmt;
+	  use_operand_p use_p;
+	  FOR_EACH_IMM_USE_FAST (use_p, imm_iter, lhs)
+	    {
+	      use_stmt = USE_STMT (use_p);
+	      if (is_gimple_debug (use_stmt))
+		continue;
+	      if (gimple_bb (use_stmt) != gimple_bb (mul_stmt))
+		continue;
+
+	      if (is_gimple_assign (use_stmt)
+		  && gimple_assign_rhs_code (use_stmt) == MULT_EXPR
+		  && convert_mult_to_fma (use_stmt,
+					  gimple_assign_rhs1 (use_stmt),
+					  gimple_assign_rhs2 (use_stmt),
+					  &fma_state, NULL, &lhs2)
+		  && gimple_bb (SSA_NAME_DEF_STMT (lhs2))
+		       == gimple_bb (mul_stmt))
+		{
+		  if (dump_file && (dump_flags & TDF_DETAILS))
+		    {
+		      fprintf (dump_file,
+			       "Found possible non-flat FMA chain. LHS1: ");
+		      print_generic_expr (dump_file, lhs);
+		      fprintf (dump_file, ", LHS2: ");
+		      print_generic_expr (dump_file, lhs2);
+		      fprintf (dump_file, "\n");
+		    }
+		  return -1;
+		}
+	    }
+	}
+    }
+
   /* 1. When ops_mult.length == 2, like the following case,
 
      a * b + c * d + e.
@@ -6815,6 +6868,11 @@ rank_ops_for_fma (vec<operand_entry *> *ops)
      2. If all ops are defined with mult, we don't need to rearrange them.  */
   if (ops_mult.length () >= 2 && ops_mult.length () != ops_length)
     {
+      if (maybe_le (tree_to_poly_int64 (TYPE_SIZE (type)),
+		    param_avoid_fma_max_bits))
+	/* Avoid re-arrange to produce less FMA chains that can be slow.  */
+	return 1;
+
       /* Put no-mult ops and mult ops alternately at the end of the
 	 queue, which is conducive to generating more FMA and reducing the
 	 loss of FMA when breaking the chain.  */
@@ -6829,9 +6887,9 @@ rank_ops_for_fma (vec<operand_entry *> *ops)
 	  if (opindex > 0)
 	    opindex--;
 	}
-      return true;
+      return 1;
     }
-  return false;
+  return 0;
 }
 /* Reassociate expressions in basic block BB and its post-dominator as
    children.
@@ -6997,7 +7055,7 @@ reassociate_bb (basic_block bb)
 		  machine_mode mode = TYPE_MODE (TREE_TYPE (lhs));
 		  int ops_num = ops.length ();
 		  int width;
-		  bool has_fma = false;
+		  int has_fma = 0;
 
 		  /* For binary bit operations, if there are at least 3
 		     operands and the last operand in OPS is a constant,
@@ -7020,7 +7078,7 @@ reassociate_bb (basic_block bb)
 						      opt_type)
 		      && (rhs_code == PLUS_EXPR || rhs_code == MINUS_EXPR))
 		    {
-		      has_fma = rank_ops_for_fma (&ops);
+		      has_fma = rank_ops_for_fma (&ops, TREE_TYPE (lhs));
 		    }
 
 		  /* Only rewrite the expression tree to parallel in the
@@ -7028,6 +7086,7 @@ reassociate_bb (basic_block bb)
 		     with initial linearization.  */
 		  if (!reassoc_insert_powi_p
 		      && ops.length () > 3
+		      && has_fma >= 0
 		      && (width = get_reassociation_width (ops_num, rhs_code,
 							   mode)) > 1)
 		    {
@@ -7037,7 +7096,7 @@ reassociate_bb (basic_block bb)
 				 width);
 		      rewrite_expr_tree_parallel (as_a <gassign *> (stmt),
 						  width,
-						  has_fma,
+						  has_fma > 0,
 						  ops);
 		    }
 		  else
@@ -7046,7 +7105,7 @@ reassociate_bb (basic_block bb)
 			 to make sure the ones that get the double
 			 binary op are chosen wisely.  */
 		      int len = ops.length ();
-		      if (len >= 3 && !has_fma)
+		      if (len >= 3 && has_fma == 0)
 			swap_ops_for_binary_stmt (ops, len - 3);
 
 		      new_lhs = rewrite_expr_tree (stmt, rhs_code, 0, ops,
-- 
2.25.1


^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [PING][PATCH] tree-optimization/110279- Check for nested FMA chains in reassoc
  2023-07-07  8:26 ` [PING][PATCH] " Di Zhao OS
@ 2023-07-07 14:31   ` Philipp Tomsich
  0 siblings, 0 replies; 3+ messages in thread
From: Philipp Tomsich @ 2023-07-07 14:31 UTC (permalink / raw)
  To: Di Zhao OS; +Cc: gcc-patches

On Fri, 7 Jul 2023 at 10:28, Di Zhao OS via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> Update the patch so it can apply.
>
> Tested on spec2017 fprate cases again. With option "-funroll-loops -Ofast -flto",
> the improvements of 1-copy run are:
>
> Ampere1:
>     508.namd_r  4.26%
>     510.parest_r        2.55%
>     Overall             0.54%
> Intel Xeon:
>     503.bwaves_r        1.3%
>     508.namd_r  1.58%
>     overall             0.42%

This looks like a worthwhile improvement.

From reviewing the patch, a few nit-picks:
- given that 'has_fma' can now take three values { -1, 0, 1 }, an enum
with more descriptive names for these 3 states should be used;
- using "has_fma >= 0" and "fma > 0" tests are hard to read; after
changing this to an enum, you can use macros or helper functions to
test the predicates (i.e., *_P macros or *_p helpers) for readability
- the meaning of the return values of rank_ops_for_fma should be
documented in the comment describing the function
- changing convert_mult_to_fma_1 to return a tree* (i.e., return_lhs
or NULL_TREE) removes the need for an in/out parameter

Thanks,
Philipp.

>
>
> Thanks,
> Di Zhao
>
>
> > -----Original Message-----
> > From: Di Zhao OS
> > Sent: Friday, June 16, 2023 4:51 PM
> > To: gcc-patches@gcc.gnu.org
> > Subject: [PATCH] tree-optimization/110279- Check for nested FMA chains in
> > reassoc
> >
> > This patch is to fix the regressions found in SPEC2017 fprate cases
> >  on aarch64.
> >
> > 1. Reused code in pass widening_mul to check for nested FMA chains
> >  (those connected by MULT_EXPRs), since re-writing to parallel
> >  generates worse codes.
> >
> > 2. Avoid re-arrange to produce less FMA chains that can be slow.
> >
> > Tested on ampere1 and neoverse-n1, this fixed the regressions in
> > 508.namd_r and 510.parest_r 1 copy run. While I'm still collecting data
> > on x86 machines we have, I'd like to know what do you think of this.
> >
> > (Previously I tried to improve things with FMA by adding a widening_mul
> > pass before reassoc2 for it's easier to recognize different patterns
> > of FMA chains and decide whether to split them. But I suppose handling
> > them all in reassoc pass is more efficient.)
> >
> > Thanks,
> > Di Zhao
> >
> > ---
> > gcc/ChangeLog:
> >
> >         * tree-ssa-math-opts.cc (convert_mult_to_fma_1): Add new parameter.
> >         Support new mode that merely do the checking.
> >         (struct fma_transformation_info): Moved to header.
> >         (class fma_deferring_state): Moved to header.
> >         (convert_mult_to_fma): Add new parameter.
> >         * tree-ssa-math-opts.h (struct fma_transformation_info):
> >         (class fma_deferring_state): Moved from .cc.
> >         (convert_mult_to_fma): Add function decl.
> >         * tree-ssa-reassoc.cc (rewrite_expr_tree_parallel):
> >         (rank_ops_for_fma): Return -1 if nested FMAs are found.
> >         (reassociate_bb): Avoid rewriting to parallel if nested FMAs are
> > found.
>

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2023-07-07 14:31 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-06-16  8:50 [PATCH] tree-optimization/110279- Check for nested FMA chains in reassoc Di Zhao OS
2023-07-07  8:26 ` [PING][PATCH] " Di Zhao OS
2023-07-07 14:31   ` Philipp Tomsich

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).