[PATCH v3] tree-optimization/110279- Check for nested FMA in reassoc

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

* [PATCH v3] tree-optimization/110279- Check for nested FMA in reassoc
@ 2023-08-09 16:52 Di Zhao OS
  2023-08-18 14:30 ` Di Zhao OS
  2023-09-01 11:38 ` Richard Biener
  0 siblings, 2 replies; 3+ messages in thread
From: Di Zhao OS @ 2023-08-09 16:52 UTC (permalink / raw)
  To: gcc-patches; +Cc: Richard Biener

[-- Attachment #1: Type: text/plain, Size: 2766 bytes --]

Hi,

The previous version of this patch tries to solve two problems
at the same time. For better clarity, I'll separate them and 
only deal with the "nested" FMA in this version. I plan to
propose another patch in avoiding bad shaped FMA (deferring FMA).

Other changes:

1. Added new testcases for the "nested" FMA issue. For the
   following code:

	tmp1 = a + c * c + d * d + x * y;
	tmp2 = x * tmp1;
	result += (a + c + d + tmp2);

   , when "tmp1 = ..." is not rewritten, tmp1 will be result of
   an FMA, and there will be a list of consecutive FMAs: 

	_1 = .FMA (c, c, a_39);
	_2 = .FMA (d, d, _1);
	tmp1 = .FMA (x, y, _2);
	_3 = .FMA (tmp1, x, d);
	...
   
   If "tmp1 = ..." is rewritten to parallel, tmp1 will be result
   of a PLUS_EXPR between FMAs:

	_1 = .FMA (c, c, a_39);
	_2 = x * y;
	_3 = .FMA (d, d, _2);
	 tmp1 = _3 + _1;
	 _4 = .FMA (tmp1, x, d);
	...

   It seems the register pressure of the latter is higher than
   the former. On the test machines we have (including Ampere1,
   Neoverse-n1 and Intel Xeon), with "tmp1 = ..." is rewritten to
   parallel, the run time all increased significantly. In
   contrast, when "tmp1" is not the 1st or 2nd operand of another
   FMA (pr110279-1.c), rewriting it results in better performance.
   (I'll also append the testcases in the bug tracker.)

2. Enhanced checking for nested FMA by: 1) Modified
   convert_mult_to_fma so it can return multiple LHS.  2) Check
   NEGATE_EXPRs for nested FMA.

(I think maybe this can be further refined by enabling rewriting
to parallel for very long op list. )

Bootstrapped and regression tested on x86_64-linux-gnu.

Thanks,
Di Zhao

----

        PR tree-optimization/110279

gcc/ChangeLog:

        * tree-ssa-math-opts.cc (convert_mult_to_fma_1): Added
	new parameter collect_lhs.
        (struct fma_transformation_info): Moved to header.
        (class fma_deferring_state): Moved to header.
        (convert_mult_to_fma): Added new parameter collect_lhs.
        * tree-ssa-math-opts.h (struct fma_transformation_info):
        (class fma_deferring_state): Moved from .cc.
        (convert_mult_to_fma): Moved from .cc.
        * tree-ssa-reassoc.cc (enum fma_state): Defined enum to
	describe the state of FMA candidates for a list of
	operands.
        (rewrite_expr_tree_parallel): Changed boolean parameter
	to enum type.
        (has_nested_fma_p): New function to check for nested FMA
	on given multiplication statement.
        (rank_ops_for_fma): Return enum fma_state.
        (reassociate_bb): Avoid rewriting to parallel if nested
	FMAs are found.

gcc/testsuite/ChangeLog:

        * gcc.dg/pr110279-1.c: New test.
        * gcc.dg/pr110279-2.c: New test.

[-- Attachment #2: tree-optimization-110279-Check-for-nested-FMA-in-rea.patch --]
[-- Type: application/octet-stream, Size: 19047 bytes --]

From be0c6e1540236388450ed66479a00bc50448ba96 Mon Sep 17 00:00:00 2001
From: "dzhao.ampere" <di.zhao@amperecomputing.com>
Date: Tue, 8 Aug 2023 20:20:34 +0800
Subject: [PATCH] tree-optimization/110279: Check for nested FMA in reassoc

Reused code in pass widening_mul to check for nested FMAs
(LHS of an expression tree is the 1st or 2nd operand of
another FMA), since re-writing it to parallel generates
worse codes.

With option "-funroll-loops -Ofast --param tree-reassoc-width=4",
the improvements in spec2017 508.namd_r 1-copy run are:

Ampere1:          4.5%
Intel Xeon:       1.8%
Neoverse-n1:      2.3%
---
 gcc/testsuite/gcc.dg/pr110279-1.c |  50 ++++++++++++
 gcc/testsuite/gcc.dg/pr110279-2.c |  49 ++++++++++++
 gcc/tree-ssa-math-opts.cc         |  85 +++++++-------------
 gcc/tree-ssa-math-opts.h          |  54 +++++++++++++
 gcc/tree-ssa-reassoc.cc           | 124 ++++++++++++++++++++++++++----
 5 files changed, 291 insertions(+), 71 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/pr110279-1.c
 create mode 100644 gcc/testsuite/gcc.dg/pr110279-2.c

diff --git a/gcc/testsuite/gcc.dg/pr110279-1.c b/gcc/testsuite/gcc.dg/pr110279-1.c
new file mode 100644
index 00000000000..3cf9169f471
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr110279-1.c
@@ -0,0 +1,50 @@
+/* PR tree-optimization/110279 */
+/* { dg-do compile } */
+/* { dg-options "-Ofast --param tree-reassoc-width=4 -fdump-tree-reassoc2-details" } */
+/* { dg-additional-options "-mfpmath=sse -mfma" { target { i?86-*-* x86_64-*-* } } } */
+
+#define LOOP_COUNT 800000000
+typedef double data_e;
+
+#include <stdio.h>
+
+/* No nested FMA, re-write both the two lists.
+
+Run time:
+		only rewrite "result"	rewrite both "result" and "tmp"
+Ampere1	 	5.87			4.53
+neoverse-n1	3.61			2.93
+Intel Xeon	4.28			3.33
+*/
+__attribute_noinline__ data_e
+foo (data_e a, data_e b, data_e c, data_e d, data_e x, data_e y)
+{
+  data_e fastc = 0, tmp1;
+  data_e result = 0;
+
+  for (int ic = 0; ic < LOOP_COUNT; ic++)
+    {
+      /* Length of op list: 4
+	 Number of MULT_EXPRs: 3 */
+      tmp1 = a + c * c - d * d + x * y;
+      result += b * c - x + y - 0.01 + tmp1;
+
+      a -= 0.1;
+      b += 0.9;
+      c *= tmp1;
+      d += a;
+      x *= 0.1;
+      y *= y;
+    }
+
+  return result;
+}
+
+int
+main (int argc, char **argv)
+{
+  printf ("%f\n", foo (-1.0, 0.01, 9.8, 1e2, -1.9, 0.2));
+}
+
+
+/* { dg-final { scan-tree-dump-times "Width = 2 was chosen for reassociation" 2 "reassoc2"} } */
\ No newline at end of file
diff --git a/gcc/testsuite/gcc.dg/pr110279-2.c b/gcc/testsuite/gcc.dg/pr110279-2.c
new file mode 100644
index 00000000000..fbdc6ad1cde
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr110279-2.c
@@ -0,0 +1,49 @@
+/* PR tree-optimization/110279 */
+/* { dg-do compile } */
+/* { dg-options "-Ofast --param tree-reassoc-width=4 -fdump-tree-reassoc2-details" } */
+/* { dg-additional-options "-mfpmath=sse -mfma" { target { i?86-*-* x86_64-*-* } } } */
+
+#define LOOP_COUNT 800000000
+typedef double data_e;
+
+#include <stdio.h>
+
+/*  Run time:
+		only rewrite "result"	rewrite both "result" and "tmp"
+Ampere1	 	1.93			2.10
+neoverse-n1	1.45			1.52
+Intel Xeon	1.55			1.62
+*/
+
+__attribute_noinline__ data_e
+foo (data_e a, data_e b, data_e c, data_e d, data_e x, data_e y)
+{
+  data_e tmp1, tmp2;
+  data_e result = 0;
+
+  for (int ic = 0; ic < LOOP_COUNT; ic++)
+    {
+      /*  LHS is operator of another FMA, re-writing to parallel is worse.  */
+      tmp1 = a + c * c - d * d + x * y;
+
+      tmp2 = x * tmp1;
+      result += (a + c - d + tmp2);
+
+      a -= 0.1;
+      b += 0.9;
+      c *= 1.02;
+      x *= 0.1;
+      y *= y;
+      d *= 0.61;
+    }
+
+  return result;
+}
+
+int
+main (int argc, char **argv)
+{
+  printf ("%f\n", foo (-1.0, 0.01, 9.8, 1e2, -1.9, 0.2));
+}
+
+/* { dg-final { scan-tree-dump-times "Width = 2 was chosen for reassociation" 1 "reassoc2"} } */
\ No newline at end of file
diff --git a/gcc/tree-ssa-math-opts.cc b/gcc/tree-ssa-math-opts.cc
index 712097ac5be..06a6094f2d3 100644
--- a/gcc/tree-ssa-math-opts.cc
+++ b/gcc/tree-ssa-math-opts.cc
@@ -3064,15 +3064,20 @@ convert_plusminus_to_widen (gimple_stmt_iterator *gsi, gimple *stmt,
 
 /* Given a result MUL_RESULT which is a result of a multiplication of OP1 and
    OP2 and which we know is used in statements that can be, together with the
-   multiplication, converted to FMAs, perform the transformation.  */
+   multiplication, converted to FMAs, perform the transformation.
+
+   If COLLECT_LHS is not NULL, skip the actual transformation and store LHS of
+   FMA in COLLECT_LHS.  */
 
 static void
-convert_mult_to_fma_1 (tree mul_result, tree op1, tree op2)
+convert_mult_to_fma_1 (tree mul_result, tree op1, tree op2,
+		       vec<tree> *collect_lhs = NULL)
 {
   tree type = TREE_TYPE (mul_result);
   gimple *use_stmt;
   imm_use_iterator imm_iter;
   gcall *fma_stmt;
+  tree lhs = NULL_TREE;
 
   FOR_EACH_IMM_USE_STMT (use_stmt, imm_iter, mul_result)
     {
@@ -3091,8 +3096,11 @@ convert_mult_to_fma_1 (tree mul_result, tree op1, tree op2)
 	  use_operand_p use_p;
 	  gimple *neguse_stmt;
 	  single_imm_use (gimple_assign_lhs (use_stmt), &use_p, &neguse_stmt);
-	  gsi_remove (&gsi, true);
-	  release_defs (use_stmt);
+	  if (!collect_lhs)
+	    {
+	      gsi_remove (&gsi, true);
+	      release_defs (use_stmt);
+	    }
 
 	  use_stmt = neguse_stmt;
 	  gsi = gsi_for_stmt (use_stmt);
@@ -3107,6 +3115,12 @@ convert_mult_to_fma_1 (tree mul_result, tree op1, tree op2)
 	gcc_unreachable ();
       addop = ops[0] == result ? ops[1] : ops[0];
 
+      lhs = gimple_get_lhs (use_stmt);
+      if (collect_lhs)
+	{
+	  collect_lhs->safe_push (lhs);
+	  continue;
+	}
       if (code == MINUS_EXPR)
 	{
 	  if (ops[0] == result)
@@ -3132,7 +3146,7 @@ convert_mult_to_fma_1 (tree mul_result, tree op1, tree op2)
 					       op2, addop, else_value);
       else
 	fma_stmt = gimple_build_call_internal (IFN_FMA, 3, mulop1, op2, addop);
-      gimple_set_lhs (fma_stmt, gimple_get_lhs (use_stmt));
+      gimple_set_lhs (fma_stmt, lhs);
       gimple_call_set_nothrow (fma_stmt, !stmt_can_throw_internal (cfun,
 								   use_stmt));
       gsi_replace (&gsi, fma_stmt, true);
@@ -3186,54 +3200,6 @@ convert_mult_to_fma_1 (tree mul_result, tree op1, tree op2)
     }
 }
 
-/* Data necessary to perform the actual transformation from a multiplication
-   and an addition to an FMA after decision is taken it should be done and to
-   then delete the multiplication statement from the function IL.  */
-
-struct fma_transformation_info
-{
-  gimple *mul_stmt;
-  tree mul_result;
-  tree op1;
-  tree op2;
-};
-
-/* Structure containing the current state of FMA deferring, i.e. whether we are
-   deferring, whether to continue deferring, and all data necessary to come
-   back and perform all deferred transformations.  */
-
-class fma_deferring_state
-{
-public:
-  /* Class constructor.  Pass true as PERFORM_DEFERRING in order to actually
-     do any deferring.  */
-
-  fma_deferring_state (bool perform_deferring)
-    : m_candidates (), m_mul_result_set (), m_initial_phi (NULL),
-      m_last_result (NULL_TREE), m_deferring_p (perform_deferring) {}
-
-  /* List of FMA candidates for which we the transformation has been determined
-     possible but we at this point in BB analysis we do not consider them
-     beneficial.  */
-  auto_vec<fma_transformation_info, 8> m_candidates;
-
-  /* Set of results of multiplication that are part of an already deferred FMA
-     candidates.  */
-  hash_set<tree> m_mul_result_set;
-
-  /* The PHI that supposedly feeds back result of a FMA to another over loop
-     boundary.  */
-  gphi *m_initial_phi;
-
-  /* Result of the last produced FMA candidate or NULL if there has not been
-     one.  */
-  tree m_last_result;
-
-  /* If true, deferring might still be profitable.  If false, transform all
-     candidates and no longer defer.  */
-  bool m_deferring_p;
-};
-
 /* Transform all deferred FMA candidates and mark STATE as no longer
    deferring.  */
 
@@ -3308,12 +3274,15 @@ last_fma_candidate_feeds_initial_phi (fma_deferring_state *state,
   or its unrolled version, i.e. with several FMA candidates that feed result
   of one into the addend of another.  Instead, we add them to a list in STATE
   and if we later discover an FMA candidate that is not part of such a chain,
-  we go back and perform all deferred past candidates.  */
+  we go back and perform all deferred past candidates.
 
-static bool
+  If COLLECT_LHS is not NULL, skip the actual transformation and store LHS of
+  FMA in COLLECT_LHS.  */
+
+bool
 convert_mult_to_fma (gimple *mul_stmt, tree op1, tree op2,
-		     fma_deferring_state *state, tree mul_cond = NULL_TREE,
-		     tree mul_len = NULL_TREE, tree mul_bias = NULL_TREE)
+		     fma_deferring_state *state, tree mul_cond, tree mul_len,
+		     tree mul_bias, vec<tree> *collect_lhs)
 {
   tree mul_result = gimple_get_lhs (mul_stmt);
   /* If there isn't a LHS then this can't be an FMA.  There can be no LHS
@@ -3590,7 +3559,7 @@ convert_mult_to_fma (gimple *mul_stmt, tree op1, tree op2,
     {
       if (state->m_deferring_p)
 	cancel_fma_deferring (state);
-      convert_mult_to_fma_1 (mul_result, op1, op2);
+      convert_mult_to_fma_1 (mul_result, op1, op2, collect_lhs);
       return true;
     }
 }
diff --git a/gcc/tree-ssa-math-opts.h b/gcc/tree-ssa-math-opts.h
index 52b7938b599..35773ac65ce 100644
--- a/gcc/tree-ssa-math-opts.h
+++ b/gcc/tree-ssa-math-opts.h
@@ -23,4 +23,58 @@ along with GCC; see the file COPYING3.  If not see
 extern tree powi_as_mults (gimple_stmt_iterator *, location_t,
 			   tree, HOST_WIDE_INT);
 
+/* Data necessary to perform the actual transformation from a multiplication
+   and an addition to an FMA after decision is taken it should be done and to
+   then delete the multiplication statement from the function IL.  */
+
+struct fma_transformation_info
+{
+  gimple *mul_stmt;
+  tree mul_result;
+  tree op1;
+  tree op2;
+};
+
+/* Structure containing the current state of FMA deferring, i.e. whether we are
+   deferring, whether to continue deferring, and all data necessary to come
+   back and perform all deferred transformations.  */
+
+class fma_deferring_state
+{
+public:
+  /* Class constructor.  Pass true as PERFORM_DEFERRING in order to actually
+     do any deferring.  */
+
+  fma_deferring_state (bool perform_deferring)
+    : m_candidates (), m_mul_result_set (), m_initial_phi (NULL),
+      m_last_result (NULL_TREE), m_deferring_p (perform_deferring) {}
+
+  /* List of FMA candidates for which we the transformation has been determined
+     possible but we at this point in BB analysis we do not consider them
+     beneficial.  */
+  auto_vec<fma_transformation_info, 8> m_candidates;
+
+  /* Set of results of multiplication that are part of an already deferred FMA
+     candidates.  */
+  hash_set<tree> m_mul_result_set;
+
+  /* The PHI that supposedly feeds back result of a FMA to another over loop
+     boundary.  */
+  gphi *m_initial_phi;
+
+  /* Result of the last produced FMA candidate or NULL if there has not been
+     one.  */
+  tree m_last_result;
+
+  /* If true, deferring might still be profitable.  If false, transform all
+     candidates and no longer defer.  */
+  bool m_deferring_p;
+};
+
+bool
+convert_mult_to_fma (gimple *mul_stmt, tree op1, tree op2,
+		     fma_deferring_state *state, tree mul_cond = NULL_TREE,
+		     tree mul_len = NULL_TREE, tree mul_bias = NULL_TREE,
+		     vec<tree> *collect_lhs = NULL);
+
 #endif  /* GCC_TREE_SSA_MATH_OPTS_H  */
diff --git a/gcc/tree-ssa-reassoc.cc b/gcc/tree-ssa-reassoc.cc
index eda03bf98a6..c5140e3778b 100644
--- a/gcc/tree-ssa-reassoc.cc
+++ b/gcc/tree-ssa-reassoc.cc
@@ -5471,6 +5471,22 @@ get_reassociation_width (int ops_num, enum tree_code opc,
   return width;
 }
 
+/* This enum describes the state of FMA candidates for a list of operands.  */
+
+enum fma_state {
+  /* There is at most one FMA candidate in the op list.  In this case, no
+     special treatment is needed when rewriting expression tree.  */
+  FMA_STATE_NONE_OR_SINGLE = 0,
+
+  /* The result of a FMA is the first or second operand of another FMA.  In
+     this case, rewriting expression tree should be avoided.  */
+  FMA_STATE_NESTED,
+
+  /* There are multiple FMA candidates in the op list.  Avoid swapping operands
+     in this case.  */
+  FMA_STATE_MULTIPLE
+};
+
 #define SPECIAL_BIASED_END_STMT 0 /* It is the end stmt of all ops.  */
 #define BIASED_END_STMT 1 /* It is the end stmt of normal or biased ops.  */
 #define NORMAL_END_STMT 2 /* It is the end stmt of normal ops.  */
@@ -5507,7 +5523,7 @@ get_reassociation_width (int ops_num, enum tree_code opc,
    SSA6 = SSA4 + SSA5;
  */
 static void
-rewrite_expr_tree_parallel (gassign *stmt, int width, bool has_fma,
+rewrite_expr_tree_parallel (gassign *stmt, int width, enum fma_state fs,
 			    const vec<operand_entry *> &ops)
 {
   enum tree_code opcode = gimple_assign_rhs_code (stmt);
@@ -5582,7 +5598,7 @@ rewrite_expr_tree_parallel (gassign *stmt, int width, bool has_fma,
 	}
 
       /* Swap the operands if no FMA in the chain.  */
-      if (ops1->length () > 2 && !has_fma)
+      if (ops1->length () > 2 && fs == FMA_STATE_NONE_OR_SINGLE)
 	swap_ops_for_binary_stmt (*ops1, ops1->length () - 3);
 
       if (i < width_active
@@ -6766,7 +6782,59 @@ transform_stmt_to_multiply (gimple_stmt_iterator *gsi, gimple *stmt,
     }
 }
 
+/* Given LHS is result of a expression tree and MUL_STMT is one of its
+   operators, return whether LHS is the operand of another FMA.  */
+static bool
+has_nested_fma_p (gimple *mul_stmt, tree lhs)
+{
+  auto_vec<tree, 2> lhs_vec;
+  fma_deferring_state fds (false);
+
+  if (convert_mult_to_fma (mul_stmt, gimple_assign_rhs1 (mul_stmt),
+			   gimple_assign_rhs2 (mul_stmt), &fds, NULL, NULL,
+			   NULL, &lhs_vec)
+      && lhs_vec.contains (lhs))
+    {
+      /* Check for MULT_EXPRs that use LHS.  */
+      imm_use_iterator imm_iter;
+      gimple *use_stmt;
+      use_operand_p use_p;
+
+      FOR_EACH_IMM_USE_FAST (use_p, imm_iter, lhs)
+	{
+	  use_stmt = USE_STMT (use_p);
+	  if (is_gimple_debug (use_stmt))
+	    continue;
+	  if (gimple_bb (use_stmt) != gimple_bb (mul_stmt))
+	    continue;
+
+	  lhs_vec.truncate (0);
+	  if (is_gimple_assign (use_stmt)
+	      && gimple_assign_rhs_code (use_stmt) == MULT_EXPR
+	      && convert_mult_to_fma (use_stmt, gimple_assign_rhs1 (use_stmt),
+				      gimple_assign_rhs2 (use_stmt), &fds, NULL,
+				      NULL, NULL, &lhs_vec))
+	    {
+	      if (dump_file && (dump_flags & TDF_DETAILS))
+		{
+		  fprintf (dump_file, "Found nested FMA for ");
+		  print_generic_expr (dump_file, lhs);
+		  fprintf (dump_file, "\n");
+		}
+	      return true;
+	    }
+	}
+    }
+  return false;
+}
+
 /* Rearrange ops may have more FMA when the chain may has more than 2 FMAs.
+   Returns the state of FMA candidates in OPS:
+   - FMA_STATE_NONE_OR_SINGLE if there's at most one FMA candidate in OPS,
+   - FMA_STATE_NESTED if an FMA's result is the first or second operand of
+     another FMA,
+   - FMA_STATE_MULTIPLE if there are multiple FMA candidates in the op list.
+
    Put no-mult ops and mult ops alternately at the end of the queue, which is
    conducive to generating more FMA and reducing the loss of FMA when breaking
    the chain.
@@ -6781,23 +6849,52 @@ transform_stmt_to_multiply (gimple_stmt_iterator *gsi, gimple *stmt,
 
    _4  = .FMA (c_7(D), d_8(D), _3);
    _11 = .FMA (a_5(D), b_6(D), _4);  */
-static bool
-rank_ops_for_fma (vec<operand_entry *> *ops)
+static enum fma_state
+rank_ops_for_fma (vec<operand_entry *> *ops, tree lhs)
 {
   operand_entry *oe;
   unsigned int i;
   unsigned int ops_length = ops->length ();
   auto_vec<operand_entry *> ops_mult;
   auto_vec<operand_entry *> ops_others;
+  if (dump_file)
+    {
+      fprintf (dump_file, "rank_ops_for_fma ");
+      print_generic_expr (dump_file, lhs);
+      fprintf (dump_file, "\n");
+    }
 
   FOR_EACH_VEC_ELT (*ops, i, oe)
     {
       if (TREE_CODE (oe->op) == SSA_NAME)
 	{
 	  gimple *def_stmt = SSA_NAME_DEF_STMT (oe->op);
-	  if (is_gimple_assign (def_stmt)
-	      && gimple_assign_rhs_code (def_stmt) == MULT_EXPR)
-	    ops_mult.safe_push (oe);
+	  if (is_gimple_assign (def_stmt))
+	    {
+	      if (gimple_assign_rhs_code (def_stmt) == MULT_EXPR)
+		{
+		  if (has_nested_fma_p (def_stmt, lhs))
+		    return FMA_STATE_NESTED;
+		  else
+		    ops_mult.safe_push (oe);
+		}
+	      /* Also check NEGATE_EXPRs for possible nested FMA.  */
+	      else if (gimple_assign_rhs_code (def_stmt) == NEGATE_EXPR
+		       && TREE_CODE (gimple_assign_rhs1 (def_stmt)) == SSA_NAME)
+		{
+		  gimple *neg_def_stmt
+		    = SSA_NAME_DEF_STMT (gimple_assign_rhs1 (def_stmt));
+		  if (is_gimple_assign (neg_def_stmt)
+		      && gimple_bb (neg_def_stmt) == gimple_bb (def_stmt)
+		      && gimple_assign_rhs_code (neg_def_stmt) == MULT_EXPR
+		      && has_nested_fma_p (neg_def_stmt, lhs))
+		    return FMA_STATE_NESTED;
+		  else
+		    ops_others.safe_push (oe);
+		}
+	      else
+		ops_others.safe_push (oe);
+	    }
 	  else
 	    ops_others.safe_push (oe);
 	}
@@ -6829,9 +6926,9 @@ rank_ops_for_fma (vec<operand_entry *> *ops)
 	  if (opindex > 0)
 	    opindex--;
 	}
-      return true;
+      return FMA_STATE_MULTIPLE;
     }
-  return false;
+  return FMA_STATE_NONE_OR_SINGLE;
 }
 /* Reassociate expressions in basic block BB and its post-dominator as
    children.
@@ -6997,7 +7094,7 @@ reassociate_bb (basic_block bb)
 		  machine_mode mode = TYPE_MODE (TREE_TYPE (lhs));
 		  int ops_num = ops.length ();
 		  int width;
-		  bool has_fma = false;
+		  enum fma_state fs = FMA_STATE_NONE_OR_SINGLE;
 
 		  /* For binary bit operations, if there are at least 3
 		     operands and the last operand in OPS is a constant,
@@ -7020,7 +7117,7 @@ reassociate_bb (basic_block bb)
 						      opt_type)
 		      && (rhs_code == PLUS_EXPR || rhs_code == MINUS_EXPR))
 		    {
-		      has_fma = rank_ops_for_fma (&ops);
+		      fs = rank_ops_for_fma (&ops, lhs);
 		    }
 
 		  /* Only rewrite the expression tree to parallel in the
@@ -7028,6 +7125,7 @@ reassociate_bb (basic_block bb)
 		     with initial linearization.  */
 		  if (!reassoc_insert_powi_p
 		      && ops.length () > 3
+		      && fs != FMA_STATE_NESTED
 		      && (width = get_reassociation_width (ops_num, rhs_code,
 							   mode)) > 1)
 		    {
@@ -7037,7 +7135,7 @@ reassociate_bb (basic_block bb)
 				 width);
 		      rewrite_expr_tree_parallel (as_a <gassign *> (stmt),
 						  width,
-						  has_fma,
+						  fs,
 						  ops);
 		    }
 		  else
@@ -7046,7 +7144,7 @@ reassociate_bb (basic_block bb)
 			 to make sure the ones that get the double
 			 binary op are chosen wisely.  */
 		      int len = ops.length ();
-		      if (len >= 3 && !has_fma)
+		      if (len >= 3 && fs == FMA_STATE_NONE_OR_SINGLE)
 			swap_ops_for_binary_stmt (ops, len - 3);
 
 		      new_lhs = rewrite_expr_tree (stmt, rhs_code, 0, ops,
-- 
2.25.1


^ permalink raw reply	[flat|nested] 3+ messages in thread

* RE: [PATCH v3] tree-optimization/110279- Check for nested FMA in reassoc
  2023-08-09 16:52 [PATCH v3] tree-optimization/110279- Check for nested FMA in reassoc Di Zhao OS
@ 2023-08-18 14:30 ` Di Zhao OS
  2023-09-01 11:38 ` Richard Biener
  1 sibling, 0 replies; 3+ messages in thread
From: Di Zhao OS @ 2023-08-18 14:30 UTC (permalink / raw)
  To: gcc-patches; +Cc: Richard Biener

[-- Attachment #1: Type: text/plain, Size: 3390 bytes --]

Hi,

A few updates to the patch:

1. rank_ops_for_fma: return FMA_STATE_NESTED only for complete
   FMA chain, since the regression is obvious only in this case.

2. Added new testcase.

Thanks,
Di Zhao

----

        PR tree-optimization/110279

gcc/ChangeLog:

        * tree-ssa-math-opts.cc (convert_mult_to_fma_1): Added
	new parameter collect_lhs.
        (struct fma_transformation_info): Moved to header.
        (class fma_deferring_state): Moved to header.
        (convert_mult_to_fma): Added new parameter collect_lhs.
        * tree-ssa-math-opts.h (struct fma_transformation_info):
        (class fma_deferring_state): Moved from .cc.
        (convert_mult_to_fma): Moved from .cc.
        * tree-ssa-reassoc.cc (enum fma_state): Defined enum to
	describe the state of FMA candidates for a list of
	operands.
        (rewrite_expr_tree_parallel): Changed boolean parameter
	to enum type.
        (has_nested_fma_p): New function to check for nested FMA
	on given multiplication statement.
        (rank_ops_for_fma): Return enum fma_state.
        (reassociate_bb): Avoid rewriting to parallel if nested
	FMAs are found.

gcc/testsuite/ChangeLog:

        * gcc.dg/pr110279-1.c: New test.
        * gcc.dg/pr110279-2.c: New test.
        * gcc.dg/pr110279-3.c: New test.

> -----Original Message-----
> From: Di Zhao OS
> Sent: Thursday, August 10, 2023 12:53 AM
> To: gcc-patches@gcc.gnu.org
> Cc: Richard Biener <richard.guenther@gmail.com>
> Subject: [PATCH v3] tree-optimization/110279- Check for nested FMA in reassoc
> 
> Hi,
> 
> The previous version of this patch tries to solve two problems
> at the same time. For better clarity, I'll separate them and
> only deal with the "nested" FMA in this version. I plan to
> propose another patch in avoiding bad shaped FMA (deferring FMA).
> 
> Other changes:
> 
> 1. Added new testcases for the "nested" FMA issue. For the
>    following code:
> 
> 	tmp1 = a + c * c + d * d + x * y;
> 	tmp2 = x * tmp1;
> 	result += (a + c + d + tmp2);
> 
>    , when "tmp1 = ..." is not rewritten, tmp1 will be result of
>    an FMA, and there will be a list of consecutive FMAs:
> 
> 	_1 = .FMA (c, c, a_39);
> 	_2 = .FMA (d, d, _1);
> 	tmp1 = .FMA (x, y, _2);
> 	_3 = .FMA (tmp1, x, d);
> 	...
> 
>    If "tmp1 = ..." is rewritten to parallel, tmp1 will be result
>    of a PLUS_EXPR between FMAs:
> 
> 	_1 = .FMA (c, c, a_39);
> 	_2 = x * y;
> 	_3 = .FMA (d, d, _2);
> 	 tmp1 = _3 + _1;
> 	 _4 = .FMA (tmp1, x, d);
> 	...
> 
>    It seems the register pressure of the latter is higher than
>    the former. On the test machines we have (including Ampere1,
>    Neoverse-n1 and Intel Xeon), with "tmp1 = ..." is rewritten to
>    parallel, the run time all increased significantly. In
>    contrast, when "tmp1" is not the 1st or 2nd operand of another
>    FMA (pr110279-1.c), rewriting it results in better performance.
>    (I'll also append the testcases in the bug tracker.)
> 
> 2. Enhanced checking for nested FMA by: 1) Modified
>    convert_mult_to_fma so it can return multiple LHS.  2) Check
>    NEGATE_EXPRs for nested FMA.
> 
> (I think maybe this can be further refined by enabling rewriting
> to parallel for very long op list. )
> 
> Bootstrapped and regression tested on x86_64-linux-gnu.
> 
> Thanks,
> Di Zhao


[-- Attachment #2: 0001-PATCH-tree-optimization-110279-Check-for-nested-FMA-.patch --]
[-- Type: application/octet-stream, Size: 20792 bytes --]

From 9168b905f5163492434a756298adeff5dfd1864f Mon Sep 17 00:00:00 2001
From: "Di Zhao OS" <dizhao@os.amperecomputing.com>
Date: Tue, 8 Aug 2023 20:20:34 +0800
Subject: [PATCH] tree-optimization/110279: Check for nested FMA in reassoc

Reused code in pass widening_mul to check for nested FMAs
(LHS of an expression tree is the 1st or 2nd operand of
another FMA), since re-writing it to parallel generates
worse codes.

With option "-funroll-loops -Ofast --param tree-reassoc-width=4",
the improvements in spec2017 508.namd_r 1-copy run are:

Ampere1:          4.5%
Intel Xeon:       1.8%
Neoverse-n1:      2.3%
---
 gcc/testsuite/gcc.dg/pr110279-1.c |  50 ++++++++++++
 gcc/testsuite/gcc.dg/pr110279-2.c |  49 ++++++++++++
 gcc/testsuite/gcc.dg/pr110279-3.c |  44 +++++++++++
 gcc/tree-ssa-math-opts.cc         |  85 +++++++--------------
 gcc/tree-ssa-math-opts.h          |  54 +++++++++++++
 gcc/tree-ssa-reassoc.cc           | 121 ++++++++++++++++++++++++++----
 6 files changed, 332 insertions(+), 71 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/pr110279-1.c
 create mode 100644 gcc/testsuite/gcc.dg/pr110279-2.c
 create mode 100644 gcc/testsuite/gcc.dg/pr110279-3.c

diff --git a/gcc/testsuite/gcc.dg/pr110279-1.c b/gcc/testsuite/gcc.dg/pr110279-1.c
new file mode 100644
index 00000000000..be6ff1e06ff
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr110279-1.c
@@ -0,0 +1,50 @@
+/* PR tree-optimization/110279 */
+/* { dg-do compile } */
+/* { dg-options "-Ofast --param tree-reassoc-width=4 -fdump-tree-reassoc2-details" } */
+/* { dg-additional-options "-mfpmath=sse -mfma" { target { i?86-*-* x86_64-*-* } } } */
+
+#define LOOP_COUNT 800000000
+typedef double data_e;
+
+#include <stdio.h>
+
+/* No nested FMA, re-write both the two lists.
+
+Run time:
+             only rewrite "result"  rewrite both "result" and "tmp"
+Ampere1      5.87                   4.53
+neoverse-n1  3.61                   2.93
+Intel Xeon   4.28                   3.33
+*/
+__attribute_noinline__ data_e
+foo (data_e a, data_e b, data_e c, data_e d, data_e x, data_e y)
+{
+  data_e fastc = 0, tmp1;
+  data_e result = 0;
+
+  for (int ic = 0; ic < LOOP_COUNT; ic++)
+    {
+      /* Length of op list: 4
+	 Number of MULT_EXPRs: 3 */
+      tmp1 = a + c * c - d * d + x * y;
+      result += b * c - x + y - 0.01 + tmp1;
+
+      a -= 0.1;
+      b += 0.9;
+      c *= tmp1;
+      d += a;
+      x *= 0.1;
+      y *= y;
+    }
+
+  return result;
+}
+
+int
+main (int argc, char **argv)
+{
+  printf ("%f\n", foo (-1.0, 0.01, 9.8, 1e2, -1.9, 0.2));
+}
+
+
+/* { dg-final { scan-tree-dump-times "Width = 2 was chosen for reassociation" 2 "reassoc2"} } */
\ No newline at end of file
diff --git a/gcc/testsuite/gcc.dg/pr110279-2.c b/gcc/testsuite/gcc.dg/pr110279-2.c
new file mode 100644
index 00000000000..869a9b41700
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr110279-2.c
@@ -0,0 +1,49 @@
+/* PR tree-optimization/110279 */
+/* { dg-do compile } */
+/* { dg-options "-Ofast --param tree-reassoc-width=4 -fdump-tree-reassoc2-details" } */
+/* { dg-additional-options "-mfpmath=sse -mfma" { target { i?86-*-* x86_64-*-* } } } */
+
+#define LOOP_COUNT 800000000
+typedef double data_e;
+
+#include <stdio.h>
+
+/*  Run time:
+              only rewrite "result"  rewrite both "result" and "tmp"
+Ampere1       1.88                   2.10
+neoverse-n1   1.45                   1.52
+Intel Xeon    1.55                   1.62
+*/
+
+__attribute_noinline__ data_e
+foo (data_e a, data_e b, data_e c, data_e d, data_e x, data_e y)
+{
+  data_e tmp1, tmp2;
+  data_e result = 0;
+
+  for (int ic = 0; ic < LOOP_COUNT; ic++)
+    {
+      /*  LHS is operator of another FMA, re-writing to parallel is worse.  */
+      tmp1 = a + c * c - d * d + x * y;
+
+      tmp2 = x * tmp1;
+      result += (a + c - d + tmp2);
+
+      a -= 0.1;
+      b += 0.9;
+      c *= 1.02;
+      x *= 0.1;
+      y *= y;
+      d *= 0.61;
+    }
+
+  return result;
+}
+
+int
+main (int argc, char **argv)
+{
+  printf ("%f\n", foo (-1.0, 0.01, 9.8, 1e2, -1.9, 0.2));
+}
+
+/* { dg-final { scan-tree-dump-times "Width = 2 was chosen for reassociation" 1 "reassoc2"} } */
\ No newline at end of file
diff --git a/gcc/testsuite/gcc.dg/pr110279-3.c b/gcc/testsuite/gcc.dg/pr110279-3.c
new file mode 100644
index 00000000000..0f814f34053
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr110279-3.c
@@ -0,0 +1,44 @@
+/* PR tree-optimization/110279 */
+/* { dg-do compile } */
+/* { dg-options "-Ofast --param tree-reassoc-width=4 -fdump-tree-reassoc2-details" } */
+/* { dg-additional-options "-mfpmath=sse -mfma" { target { i?86-*-* x86_64-*-* } } } */
+
+#define LOOP_COUNT 800000000
+typedef double data_e;
+
+#include <stdio.h>
+
+__attribute_noinline__ data_e
+foo (data_e a, data_e b, data_e c, data_e d, data_e x, data_e y)
+{
+  data_e tmp1, tmp2;
+  data_e result = 0;
+
+  for (int ic = 0; ic < LOOP_COUNT; ic++)
+    {
+      /* "tmp1" is not a complete FMA chain, performance change is not
+	 significant whether it is rewritten.  So leave this with default
+	 behavior.  */
+      tmp1 = a + c * c - d + x * y;
+
+      tmp2 = x * tmp1;
+      result += (a + c - d + tmp2);
+
+      a -= 0.1;
+      b += 0.9;
+      c *= 1.02;
+      x *= 0.1;
+      y *= y;
+      d *= 0.61;
+    }
+
+  return result;
+}
+
+int
+main (int argc, char **argv)
+{
+  printf ("%f\n", foo (-1.0, 0.01, 9.8, 1e2, -1.9, 0.2));
+}
+
+/* { dg-final { scan-tree-dump-times "Width = 2 was chosen for reassociation" 2 "reassoc2"} } */
\ No newline at end of file
diff --git a/gcc/tree-ssa-math-opts.cc b/gcc/tree-ssa-math-opts.cc
index 712097ac5be..06a6094f2d3 100644
--- a/gcc/tree-ssa-math-opts.cc
+++ b/gcc/tree-ssa-math-opts.cc
@@ -3064,15 +3064,20 @@ convert_plusminus_to_widen (gimple_stmt_iterator *gsi, gimple *stmt,
 
 /* Given a result MUL_RESULT which is a result of a multiplication of OP1 and
    OP2 and which we know is used in statements that can be, together with the
-   multiplication, converted to FMAs, perform the transformation.  */
+   multiplication, converted to FMAs, perform the transformation.
+
+   If COLLECT_LHS is not NULL, skip the actual transformation and store LHS of
+   FMA in COLLECT_LHS.  */
 
 static void
-convert_mult_to_fma_1 (tree mul_result, tree op1, tree op2)
+convert_mult_to_fma_1 (tree mul_result, tree op1, tree op2,
+		       vec<tree> *collect_lhs = NULL)
 {
   tree type = TREE_TYPE (mul_result);
   gimple *use_stmt;
   imm_use_iterator imm_iter;
   gcall *fma_stmt;
+  tree lhs = NULL_TREE;
 
   FOR_EACH_IMM_USE_STMT (use_stmt, imm_iter, mul_result)
     {
@@ -3091,8 +3096,11 @@ convert_mult_to_fma_1 (tree mul_result, tree op1, tree op2)
 	  use_operand_p use_p;
 	  gimple *neguse_stmt;
 	  single_imm_use (gimple_assign_lhs (use_stmt), &use_p, &neguse_stmt);
-	  gsi_remove (&gsi, true);
-	  release_defs (use_stmt);
+	  if (!collect_lhs)
+	    {
+	      gsi_remove (&gsi, true);
+	      release_defs (use_stmt);
+	    }
 
 	  use_stmt = neguse_stmt;
 	  gsi = gsi_for_stmt (use_stmt);
@@ -3107,6 +3115,12 @@ convert_mult_to_fma_1 (tree mul_result, tree op1, tree op2)
 	gcc_unreachable ();
       addop = ops[0] == result ? ops[1] : ops[0];
 
+      lhs = gimple_get_lhs (use_stmt);
+      if (collect_lhs)
+	{
+	  collect_lhs->safe_push (lhs);
+	  continue;
+	}
       if (code == MINUS_EXPR)
 	{
 	  if (ops[0] == result)
@@ -3132,7 +3146,7 @@ convert_mult_to_fma_1 (tree mul_result, tree op1, tree op2)
 					       op2, addop, else_value);
       else
 	fma_stmt = gimple_build_call_internal (IFN_FMA, 3, mulop1, op2, addop);
-      gimple_set_lhs (fma_stmt, gimple_get_lhs (use_stmt));
+      gimple_set_lhs (fma_stmt, lhs);
       gimple_call_set_nothrow (fma_stmt, !stmt_can_throw_internal (cfun,
 								   use_stmt));
       gsi_replace (&gsi, fma_stmt, true);
@@ -3186,54 +3200,6 @@ convert_mult_to_fma_1 (tree mul_result, tree op1, tree op2)
     }
 }
 
-/* Data necessary to perform the actual transformation from a multiplication
-   and an addition to an FMA after decision is taken it should be done and to
-   then delete the multiplication statement from the function IL.  */
-
-struct fma_transformation_info
-{
-  gimple *mul_stmt;
-  tree mul_result;
-  tree op1;
-  tree op2;
-};
-
-/* Structure containing the current state of FMA deferring, i.e. whether we are
-   deferring, whether to continue deferring, and all data necessary to come
-   back and perform all deferred transformations.  */
-
-class fma_deferring_state
-{
-public:
-  /* Class constructor.  Pass true as PERFORM_DEFERRING in order to actually
-     do any deferring.  */
-
-  fma_deferring_state (bool perform_deferring)
-    : m_candidates (), m_mul_result_set (), m_initial_phi (NULL),
-      m_last_result (NULL_TREE), m_deferring_p (perform_deferring) {}
-
-  /* List of FMA candidates for which we the transformation has been determined
-     possible but we at this point in BB analysis we do not consider them
-     beneficial.  */
-  auto_vec<fma_transformation_info, 8> m_candidates;
-
-  /* Set of results of multiplication that are part of an already deferred FMA
-     candidates.  */
-  hash_set<tree> m_mul_result_set;
-
-  /* The PHI that supposedly feeds back result of a FMA to another over loop
-     boundary.  */
-  gphi *m_initial_phi;
-
-  /* Result of the last produced FMA candidate or NULL if there has not been
-     one.  */
-  tree m_last_result;
-
-  /* If true, deferring might still be profitable.  If false, transform all
-     candidates and no longer defer.  */
-  bool m_deferring_p;
-};
-
 /* Transform all deferred FMA candidates and mark STATE as no longer
    deferring.  */
 
@@ -3308,12 +3274,15 @@ last_fma_candidate_feeds_initial_phi (fma_deferring_state *state,
   or its unrolled version, i.e. with several FMA candidates that feed result
   of one into the addend of another.  Instead, we add them to a list in STATE
   and if we later discover an FMA candidate that is not part of such a chain,
-  we go back and perform all deferred past candidates.  */
+  we go back and perform all deferred past candidates.
 
-static bool
+  If COLLECT_LHS is not NULL, skip the actual transformation and store LHS of
+  FMA in COLLECT_LHS.  */
+
+bool
 convert_mult_to_fma (gimple *mul_stmt, tree op1, tree op2,
-		     fma_deferring_state *state, tree mul_cond = NULL_TREE,
-		     tree mul_len = NULL_TREE, tree mul_bias = NULL_TREE)
+		     fma_deferring_state *state, tree mul_cond, tree mul_len,
+		     tree mul_bias, vec<tree> *collect_lhs)
 {
   tree mul_result = gimple_get_lhs (mul_stmt);
   /* If there isn't a LHS then this can't be an FMA.  There can be no LHS
@@ -3590,7 +3559,7 @@ convert_mult_to_fma (gimple *mul_stmt, tree op1, tree op2,
     {
       if (state->m_deferring_p)
 	cancel_fma_deferring (state);
-      convert_mult_to_fma_1 (mul_result, op1, op2);
+      convert_mult_to_fma_1 (mul_result, op1, op2, collect_lhs);
       return true;
     }
 }
diff --git a/gcc/tree-ssa-math-opts.h b/gcc/tree-ssa-math-opts.h
index 52b7938b599..35773ac65ce 100644
--- a/gcc/tree-ssa-math-opts.h
+++ b/gcc/tree-ssa-math-opts.h
@@ -23,4 +23,58 @@ along with GCC; see the file COPYING3.  If not see
 extern tree powi_as_mults (gimple_stmt_iterator *, location_t,
 			   tree, HOST_WIDE_INT);
 
+/* Data necessary to perform the actual transformation from a multiplication
+   and an addition to an FMA after decision is taken it should be done and to
+   then delete the multiplication statement from the function IL.  */
+
+struct fma_transformation_info
+{
+  gimple *mul_stmt;
+  tree mul_result;
+  tree op1;
+  tree op2;
+};
+
+/* Structure containing the current state of FMA deferring, i.e. whether we are
+   deferring, whether to continue deferring, and all data necessary to come
+   back and perform all deferred transformations.  */
+
+class fma_deferring_state
+{
+public:
+  /* Class constructor.  Pass true as PERFORM_DEFERRING in order to actually
+     do any deferring.  */
+
+  fma_deferring_state (bool perform_deferring)
+    : m_candidates (), m_mul_result_set (), m_initial_phi (NULL),
+      m_last_result (NULL_TREE), m_deferring_p (perform_deferring) {}
+
+  /* List of FMA candidates for which we the transformation has been determined
+     possible but we at this point in BB analysis we do not consider them
+     beneficial.  */
+  auto_vec<fma_transformation_info, 8> m_candidates;
+
+  /* Set of results of multiplication that are part of an already deferred FMA
+     candidates.  */
+  hash_set<tree> m_mul_result_set;
+
+  /* The PHI that supposedly feeds back result of a FMA to another over loop
+     boundary.  */
+  gphi *m_initial_phi;
+
+  /* Result of the last produced FMA candidate or NULL if there has not been
+     one.  */
+  tree m_last_result;
+
+  /* If true, deferring might still be profitable.  If false, transform all
+     candidates and no longer defer.  */
+  bool m_deferring_p;
+};
+
+bool
+convert_mult_to_fma (gimple *mul_stmt, tree op1, tree op2,
+		     fma_deferring_state *state, tree mul_cond = NULL_TREE,
+		     tree mul_len = NULL_TREE, tree mul_bias = NULL_TREE,
+		     vec<tree> *collect_lhs = NULL);
+
 #endif  /* GCC_TREE_SSA_MATH_OPTS_H  */
diff --git a/gcc/tree-ssa-reassoc.cc b/gcc/tree-ssa-reassoc.cc
index eda03bf98a6..78bfc4a9df0 100644
--- a/gcc/tree-ssa-reassoc.cc
+++ b/gcc/tree-ssa-reassoc.cc
@@ -5471,6 +5471,22 @@ get_reassociation_width (int ops_num, enum tree_code opc,
   return width;
 }
 
+/* This enum describes the state of FMA candidates for a list of operands.  */
+
+enum fma_state {
+  /* There is at most one FMA candidate in the op list.  In this case, no
+     special treatment is needed when rewriting expression tree.  */
+  FMA_STATE_NONE_OR_SINGLE = 0,
+
+  /* The result of a FMA is the first or second operand of another FMA.  In
+     this case, rewriting expression tree should be avoided.  */
+  FMA_STATE_NESTED,
+
+  /* There are multiple FMA candidates in the op list.  Avoid swapping operands
+     in this case.  */
+  FMA_STATE_MULTIPLE
+};
+
 #define SPECIAL_BIASED_END_STMT 0 /* It is the end stmt of all ops.  */
 #define BIASED_END_STMT 1 /* It is the end stmt of normal or biased ops.  */
 #define NORMAL_END_STMT 2 /* It is the end stmt of normal ops.  */
@@ -5507,7 +5523,7 @@ get_reassociation_width (int ops_num, enum tree_code opc,
    SSA6 = SSA4 + SSA5;
  */
 static void
-rewrite_expr_tree_parallel (gassign *stmt, int width, bool has_fma,
+rewrite_expr_tree_parallel (gassign *stmt, int width, enum fma_state fs,
 			    const vec<operand_entry *> &ops)
 {
   enum tree_code opcode = gimple_assign_rhs_code (stmt);
@@ -5582,7 +5598,7 @@ rewrite_expr_tree_parallel (gassign *stmt, int width, bool has_fma,
 	}
 
       /* Swap the operands if no FMA in the chain.  */
-      if (ops1->length () > 2 && !has_fma)
+      if (ops1->length () > 2 && fs == FMA_STATE_NONE_OR_SINGLE)
 	swap_ops_for_binary_stmt (*ops1, ops1->length () - 3);
 
       if (i < width_active
@@ -6766,7 +6782,59 @@ transform_stmt_to_multiply (gimple_stmt_iterator *gsi, gimple *stmt,
     }
 }
 
+/* Given LHS is result of a expression tree and MUL_STMT is one of its
+   operators, return whether LHS is the operand of another FMA.  */
+static bool
+has_nested_fma_p (gimple *mul_stmt, tree lhs)
+{
+  auto_vec<tree, 2> lhs_vec;
+  fma_deferring_state fds (false);
+
+  if (convert_mult_to_fma (mul_stmt, gimple_assign_rhs1 (mul_stmt),
+			   gimple_assign_rhs2 (mul_stmt), &fds, NULL, NULL,
+			   NULL, &lhs_vec)
+      && lhs_vec.contains (lhs))
+    {
+      /* Check for MULT_EXPRs that use LHS.  */
+      imm_use_iterator imm_iter;
+      gimple *use_stmt;
+      use_operand_p use_p;
+
+      FOR_EACH_IMM_USE_FAST (use_p, imm_iter, lhs)
+	{
+	  use_stmt = USE_STMT (use_p);
+	  if (is_gimple_debug (use_stmt))
+	    continue;
+	  if (gimple_bb (use_stmt) != gimple_bb (mul_stmt))
+	    continue;
+
+	  lhs_vec.truncate (0);
+	  if (is_gimple_assign (use_stmt)
+	      && gimple_assign_rhs_code (use_stmt) == MULT_EXPR
+	      && convert_mult_to_fma (use_stmt, gimple_assign_rhs1 (use_stmt),
+				      gimple_assign_rhs2 (use_stmt), &fds, NULL,
+				      NULL, NULL, &lhs_vec))
+	    {
+	      if (dump_file && (dump_flags & TDF_DETAILS))
+		{
+		  fprintf (dump_file, "Found nested FMA for ");
+		  print_generic_expr (dump_file, lhs);
+		  fprintf (dump_file, "\n");
+		}
+	      return true;
+	    }
+	}
+    }
+  return false;
+}
+
 /* Rearrange ops may have more FMA when the chain may has more than 2 FMAs.
+   Returns the state of FMA candidates in OPS:
+   - FMA_STATE_NONE_OR_SINGLE if there's at most one FMA candidate in OPS,
+   - FMA_STATE_NESTED if a FMA's result is the first or second operand of
+     another FMA,
+   - FMA_STATE_MULTIPLE if there are multiple FMA candidates in the op list.
+
    Put no-mult ops and mult ops alternately at the end of the queue, which is
    conducive to generating more FMA and reducing the loss of FMA when breaking
    the chain.
@@ -6781,29 +6849,55 @@ transform_stmt_to_multiply (gimple_stmt_iterator *gsi, gimple *stmt,
 
    _4  = .FMA (c_7(D), d_8(D), _3);
    _11 = .FMA (a_5(D), b_6(D), _4);  */
-static bool
-rank_ops_for_fma (vec<operand_entry *> *ops)
+static enum fma_state
+rank_ops_for_fma (vec<operand_entry *> *ops, tree lhs)
 {
   operand_entry *oe;
   unsigned int i;
   unsigned int ops_length = ops->length ();
   auto_vec<operand_entry *> ops_mult;
   auto_vec<operand_entry *> ops_others;
+  bool has_nested_fma = false;
 
   FOR_EACH_VEC_ELT (*ops, i, oe)
     {
       if (TREE_CODE (oe->op) == SSA_NAME)
 	{
 	  gimple *def_stmt = SSA_NAME_DEF_STMT (oe->op);
-	  if (is_gimple_assign (def_stmt)
-	      && gimple_assign_rhs_code (def_stmt) == MULT_EXPR)
-	    ops_mult.safe_push (oe);
+	  if (is_gimple_assign (def_stmt))
+	    {
+	      if (gimple_assign_rhs_code (def_stmt) == MULT_EXPR)
+		{
+		  has_nested_fma |= has_nested_fma_p (def_stmt, lhs);
+		  ops_mult.safe_push (oe);
+		}
+	      /* Also check NEGATE_EXPRs for possible nested FMA.  */
+	      else if (gimple_assign_rhs_code (def_stmt) == NEGATE_EXPR
+		       && TREE_CODE (gimple_assign_rhs1 (def_stmt)) == SSA_NAME)
+		{
+		  gimple *neg_def_stmt
+		    = SSA_NAME_DEF_STMT (gimple_assign_rhs1 (def_stmt));
+		  if (is_gimple_assign (neg_def_stmt)
+		      && gimple_bb (neg_def_stmt) == gimple_bb (def_stmt)
+		      && gimple_assign_rhs_code (neg_def_stmt) == MULT_EXPR)
+		    has_nested_fma |= has_nested_fma_p (neg_def_stmt, lhs);
+		  ops_others.safe_push (oe);
+		}
+	      else
+		ops_others.safe_push (oe);
+	    }
 	  else
 	    ops_others.safe_push (oe);
 	}
       else
 	ops_others.safe_push (oe);
     }
+
+  /* When there's nested FMA following a complete FMA chain, breaking it
+     increases run time (pr110279).  */
+  if (has_nested_fma && ops_mult.length () >= ops_length - 1)
+    return FMA_STATE_NESTED;
+
   /* 1. When ops_mult.length == 2, like the following case,
 
      a * b + c * d + e.
@@ -6829,9 +6923,9 @@ rank_ops_for_fma (vec<operand_entry *> *ops)
 	  if (opindex > 0)
 	    opindex--;
 	}
-      return true;
+      return FMA_STATE_MULTIPLE;
     }
-  return false;
+  return FMA_STATE_NONE_OR_SINGLE;
 }
 /* Reassociate expressions in basic block BB and its post-dominator as
    children.
@@ -6997,7 +7091,7 @@ reassociate_bb (basic_block bb)
 		  machine_mode mode = TYPE_MODE (TREE_TYPE (lhs));
 		  int ops_num = ops.length ();
 		  int width;
-		  bool has_fma = false;
+		  enum fma_state fs = FMA_STATE_NONE_OR_SINGLE;
 
 		  /* For binary bit operations, if there are at least 3
 		     operands and the last operand in OPS is a constant,
@@ -7020,7 +7114,7 @@ reassociate_bb (basic_block bb)
 						      opt_type)
 		      && (rhs_code == PLUS_EXPR || rhs_code == MINUS_EXPR))
 		    {
-		      has_fma = rank_ops_for_fma (&ops);
+		      fs = rank_ops_for_fma (&ops, lhs);
 		    }
 
 		  /* Only rewrite the expression tree to parallel in the
@@ -7028,6 +7122,7 @@ reassociate_bb (basic_block bb)
 		     with initial linearization.  */
 		  if (!reassoc_insert_powi_p
 		      && ops.length () > 3
+		      && fs != FMA_STATE_NESTED
 		      && (width = get_reassociation_width (ops_num, rhs_code,
 							   mode)) > 1)
 		    {
@@ -7037,7 +7132,7 @@ reassociate_bb (basic_block bb)
 				 width);
 		      rewrite_expr_tree_parallel (as_a <gassign *> (stmt),
 						  width,
-						  has_fma,
+						  fs,
 						  ops);
 		    }
 		  else
@@ -7046,7 +7141,7 @@ reassociate_bb (basic_block bb)
 			 to make sure the ones that get the double
 			 binary op are chosen wisely.  */
 		      int len = ops.length ();
-		      if (len >= 3 && !has_fma)
+		      if (len >= 3 && fs == FMA_STATE_NONE_OR_SINGLE)
 			swap_ops_for_binary_stmt (ops, len - 3);
 
 		      new_lhs = rewrite_expr_tree (stmt, rhs_code, 0, ops,
-- 
2.25.1


^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [PATCH v3] tree-optimization/110279- Check for nested FMA in reassoc
  2023-08-09 16:52 [PATCH v3] tree-optimization/110279- Check for nested FMA in reassoc Di Zhao OS
  2023-08-18 14:30 ` Di Zhao OS
@ 2023-09-01 11:38 ` Richard Biener
  1 sibling, 0 replies; 3+ messages in thread
From: Richard Biener @ 2023-09-01 11:38 UTC (permalink / raw)
  To: Di Zhao OS; +Cc: gcc-patches

On Wed, Aug 9, 2023 at 6:53 PM Di Zhao OS <dizhao@os.amperecomputing.com> wrote:
>
> Hi,
>
> The previous version of this patch tries to solve two problems
> at the same time. For better clarity, I'll separate them and
> only deal with the "nested" FMA in this version. I plan to
> propose another patch in avoiding bad shaped FMA (deferring FMA).
>
> Other changes:
>
> 1. Added new testcases for the "nested" FMA issue. For the
>    following code:
>
>         tmp1 = a + c * c + d * d + x * y;
>         tmp2 = x * tmp1;
>         result += (a + c + d + tmp2);
>
>    , when "tmp1 = ..." is not rewritten, tmp1 will be result of
>    an FMA, and there will be a list of consecutive FMAs:
>
>         _1 = .FMA (c, c, a_39);
>         _2 = .FMA (d, d, _1);
>         tmp1 = .FMA (x, y, _2);
>         _3 = .FMA (tmp1, x, d);
>         ...
>
>    If "tmp1 = ..." is rewritten to parallel, tmp1 will be result
>    of a PLUS_EXPR between FMAs:
>
>         _1 = .FMA (c, c, a_39);
>         _2 = x * y;
>         _3 = .FMA (d, d, _2);
>          tmp1 = _3 + _1;
>          _4 = .FMA (tmp1, x, d);
>         ...
>
>    It seems the register pressure of the latter is higher than
>    the former.

Yes, that's a natural consequence of rewriting to parallel.

>    On the test machines we have (including Ampere1,
>    Neoverse-n1 and Intel Xeon), with "tmp1 = ..." is rewritten to
>    parallel, the run time all increased significantly. In
>    contrast, when "tmp1" is not the 1st or 2nd operand of another
>    FMA (pr110279-1.c), rewriting it results in better performance.
>    (I'll also append the testcases in the bug tracker.)
>
> 2. Enhanced checking for nested FMA by: 1) Modified
>    convert_mult_to_fma so it can return multiple LHS.  2) Check
>    NEGATE_EXPRs for nested FMA.
>
> (I think maybe this can be further refined by enabling rewriting
> to parallel for very long op list. )

So again, what you do applies to all operations, not just FMA.
Consider

      tmp1 = a + c + d + y;
      tmp2 = x + tmp1;
      result += (a + c + d + tmp2);
      foo (tmp2);

where I just removed all multiplications.  Since re-assoc works
on isolated single-use chains it will rewrite the tmp2 chain
to parallel and it will rewrite the result chain to parallel, in
the end this results in reassoc-width for 'result' to not be honored
because we don't see that at 'tmp2' it will fork again.  OTOH
the other 'result' arms end, so eventually just two (for reassoc
width 2) arms are "active" at any point.

That said - isn't the issue that we "overcommit" reassoc-width
this way because we apply it locally instead of globally
(of course also ignoring every other chain of instructions
reassoc isn't interestedin)?

Unfortunately we work backwards when processing chains,
if we processed leaf chains first we could record the
association width applied to the chain at its new root and
honor that when such root ends up in the oplist of a consuming
chain.  But as we work backwards we'd have to record
the reassoc width used to in the leafs of the associated
chain.  So if those become roots of other chains we can
then still honor that.

Would it work to attack the problem this way?  For
parallel rewritten chains record the width used?
Similar to operand_rank we could use a hash-map
from SSA leaf to width it appears in.  When we rewrite
a chain with such leaf as root we can then subtract
the incoming chain width from reassoc-width to lower
the width its tail?

Richard.

> Bootstrapped and regression tested on x86_64-linux-gnu.
>
> Thanks,
> Di Zhao
>
> ----
>
>         PR tree-optimization/110279
>
> gcc/ChangeLog:
>
>         * tree-ssa-math-opts.cc (convert_mult_to_fma_1): Added
>         new parameter collect_lhs.
>         (struct fma_transformation_info): Moved to header.
>         (class fma_deferring_state): Moved to header.
>         (convert_mult_to_fma): Added new parameter collect_lhs.
>         * tree-ssa-math-opts.h (struct fma_transformation_info):
>         (class fma_deferring_state): Moved from .cc.
>         (convert_mult_to_fma): Moved from .cc.
>         * tree-ssa-reassoc.cc (enum fma_state): Defined enum to
>         describe the state of FMA candidates for a list of
>         operands.
>         (rewrite_expr_tree_parallel): Changed boolean parameter
>         to enum type.
>         (has_nested_fma_p): New function to check for nested FMA
>         on given multiplication statement.
>         (rank_ops_for_fma): Return enum fma_state.
>         (reassociate_bb): Avoid rewriting to parallel if nested
>         FMAs are found.
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.dg/pr110279-1.c: New test.
>         * gcc.dg/pr110279-2.c: New test.

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2023-09-01 11:39 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-08-09 16:52 [PATCH v3] tree-optimization/110279- Check for nested FMA in reassoc Di Zhao OS
2023-08-18 14:30 ` Di Zhao OS
2023-09-01 11:38 ` Richard Biener

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).