From: Di Zhao OS <dizhao@os.amperecomputing.com>
To: "gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>
Cc: Richard Biener <richard.guenther@gmail.com>
Subject: [PATCH] [tree-optimization/110279] swap operands in reassoc to reduce cross backedge FMA
Date: Mon, 28 Aug 2023 08:17:29 +0000 [thread overview]
Message-ID: <SN6PR01MB4240A22F29D390F5B96FD057E8E0A@SN6PR01MB4240.prod.exchangelabs.com> (raw)
[-- Attachment #1: Type: text/plain, Size: 2168 bytes --]
This patch tries to fix the 2% regression in 510.parest_r on
ampere1 in the tracker. (Previous discussion is here:
https://gcc.gnu.org/pipermail/gcc-patches/2023-July/624893.html)
1. Add testcases for the problem. For an op list in the form of
"acc = a * b + c * d + acc", currently reassociation doesn't
swap the operands so that more FMAs can be generated.
After widening_mul the result looks like:
_1 = .FMA(a, b, acc_0);
acc_1 = .FMA(c, d, _1);
While previously (before the "Handle FMA friendly..." patch),
widening_mul's result was like:
_1 = a * b;
_2 = .FMA (c, d, _1);
acc_1 = acc_0 + _2;
If the code fragment is in a loop, some architectures can execute
the latter in parallel, so it can be much faster than the
former. For the small testcase, the performance gap is over
10% on both ampere1 and neoverse-n1. So the point here is to avoid
turning the last statement into FMA, and keep it a PLUS_EXPR as
much as possible. (If we are rewriting the op list into parallel,
no special treatment is needed, since the last statement after
rewrite_expr_tree_parallel will be PLUS_EXPR anyway.)
2. Add function result_feeds_back_from_phi_p to check for cross
backedge dependency, and a new enum fma_state to describe the
state of FMA candidates for an op list.
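For reference, the SSA shape being detected is roughly the following (hypothetical names; compare foo1 in the new testcase):

```
  <bb 3>:
  # acc_0 = PHI <0.0 (2), acc_1 (3)>
  _1 = a * b;
  _2 = c * d;
  _3 = _1 + _2;
  acc_1 = acc_0 + _3;   ;; acc_1 feeds the PHI back over the backedge
```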
With this patch, there's a 3% improvement in 510.parest_r 1-copy
run on ampere1. The compile options are:
"-Ofast -mcpu=ampere1 -flto --param avoid-fma-max-bits=512".
Best regards,
Di Zhao
----
PR tree-optimization/110279
gcc/ChangeLog:
* tree-ssa-reassoc.cc (enum fma_state): New enum to
describe the state of FMA candidates for an op list.
(rewrite_expr_tree_parallel): Change boolean
parameter to enum type.
(result_feeds_back_from_phi_p): New function to check
for cross backedge dependency.
(rank_ops_for_fma): Return enum fma_state. Add new
parameter.
(reassociate_bb): If there's backedge dependency in an
op list, swap the operands before rewrite_expr_tree.
gcc/testsuite/ChangeLog:
* gcc.dg/pr110279.c: New test.
[-- Attachment #2: 0001-swap-operands-in-reassoc-to-reduce-cross-backedge-FM.patch --]
[-- Type: application/octet-stream, Size: 8928 bytes --]
From 663f9822b2f5b458ae4fd0983ec59fd1eb47877f Mon Sep 17 00:00:00 2001
From: amptest <amptest@amperecomputing.com>
Date: Wed, 23 Aug 2023 01:26:24 -0700
Subject: [PATCH] swap operands in reassoc to reduce cross backedge FMA
---
gcc/testsuite/gcc.dg/pr110279.c | 62 ++++++++++++++++++
gcc/tree-ssa-reassoc.cc | 111 +++++++++++++++++++++++++++++---
2 files changed, 163 insertions(+), 10 deletions(-)
create mode 100644 gcc/testsuite/gcc.dg/pr110279.c
diff --git a/gcc/testsuite/gcc.dg/pr110279.c b/gcc/testsuite/gcc.dg/pr110279.c
new file mode 100644
index 00000000000..9dc72658bff
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr110279.c
@@ -0,0 +1,62 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast --param avoid-fma-max-bits=512 --param tree-reassoc-width=4 -fdump-tree-widening_mul-details" } */
+/* { dg-additional-options "-march=armv8.2-a" } */
+
+#define LOOP_COUNT 800000000
+typedef double data_e;
+
+/* Check that FMAs with a backedge dependency are avoided.  Otherwise no FMA
+   would be generated with "--param avoid-fma-max-bits=512".  */
+
+data_e foo1 (data_e a, data_e b, data_e c, data_e d)
+{
+ data_e result = 0;
+
+ for (int ic = 0; ic < LOOP_COUNT; ic++)
+ {
+ result += (a * b + c * d);
+
+ a -= 0.1;
+ b += 0.9;
+ c *= 1.02;
+ d *= 0.61;
+ }
+
+ return result;
+}
+
+data_e foo2 (data_e a, data_e b, data_e c, data_e d)
+{
+ data_e result = 0;
+
+ for (int ic = 0; ic < LOOP_COUNT; ic++)
+ {
+ result += a * b + result + c * d;
+
+ a -= 0.1;
+ b += 0.9;
+ c *= 1.02;
+ d *= 0.61;
+ }
+
+ return result;
+}
+
+data_e foo3 (data_e a, data_e b, data_e c, data_e d)
+{
+ data_e result = 0;
+
+ for (int ic = 0; ic < LOOP_COUNT; ic++)
+ {
+ result += result + a * b + c * d;
+
+ a -= 0.1;
+ b += 0.9;
+ c *= 1.02;
+ d *= 0.61;
+ }
+
+ return result;
+}
+
+/* { dg-final { scan-tree-dump-times "Generated FMA" 3 "widening_mul"} } */
diff --git a/gcc/tree-ssa-reassoc.cc b/gcc/tree-ssa-reassoc.cc
index eda03bf98a6..3552ee9c0e7 100644
--- a/gcc/tree-ssa-reassoc.cc
+++ b/gcc/tree-ssa-reassoc.cc
@@ -5471,6 +5471,24 @@ get_reassociation_width (int ops_num, enum tree_code opc,
return width;
}
+/* This enum describes the state of FMA candidates for a list of operands. */
+
+enum fma_state
+{
+  /* There is at most one FMA candidate in the op list.  In this case, no
+     special treatment is needed when rewriting the expression tree.  */
+ FMA_STATE_NONE_OR_SINGLE = 0,
+
+  /* The result of the op list has a cross-backedge dependency, and we need
+     to avoid turning it into an FMA.  In this case, swap the operands if we
+     are not rewriting it into parallel.  */
+ FMA_STATE_CROSS_BACKEDGE,
+
+ /* There are multiple FMA candidates in the op list. Avoid swapping operands
+ in this case. */
+ FMA_STATE_MULTIPLE
+};
+
#define SPECIAL_BIASED_END_STMT 0 /* It is the end stmt of all ops. */
#define BIASED_END_STMT 1 /* It is the end stmt of normal or biased ops. */
#define NORMAL_END_STMT 2 /* It is the end stmt of normal ops. */
@@ -5507,7 +5525,7 @@ get_reassociation_width (int ops_num, enum tree_code opc,
SSA6 = SSA4 + SSA5;
*/
static void
-rewrite_expr_tree_parallel (gassign *stmt, int width, bool has_fma,
+rewrite_expr_tree_parallel (gassign *stmt, int width, fma_state fs,
const vec<operand_entry *> &ops)
{
enum tree_code opcode = gimple_assign_rhs_code (stmt);
@@ -5582,7 +5600,7 @@ rewrite_expr_tree_parallel (gassign *stmt, int width, bool has_fma,
}
/* Swap the operands if no FMA in the chain. */
- if (ops1->length () > 2 && !has_fma)
+ if (ops1->length () > 2 && fs == FMA_STATE_NONE_OR_SINGLE)
swap_ops_for_binary_stmt (*ops1, ops1->length () - 3);
if (i < width_active
@@ -6766,7 +6784,69 @@ transform_stmt_to_multiply (gimple_stmt_iterator *gsi, gimple *stmt,
}
}
+/* Given that LHS is the result SSA_NAME of OPS, return whether it has a
+   cross-backedge dependency, i.e.:
+
+   1. LHS is a phi argument in the same basic block where it is defined,
+
+   2. and the result of that phi node is used in OPS.  */
+
+static bool
+result_feeds_back_from_phi_p (vec<operand_entry *> *ops, tree lhs)
+{
+ basic_block bb = gimple_bb (SSA_NAME_DEF_STMT (lhs));
+ gimple_stmt_iterator gsi;
+
+  /* Iterate over the phi nodes in the basic block where LHS is defined.  */
+ for (gsi = gsi_start_phis (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+ {
+ gphi *phi = dyn_cast<gphi *> (gsi_stmt (gsi));
+ for (unsigned i = 0; i < gimple_phi_num_args (phi); ++i)
+ {
+ tree op = PHI_ARG_DEF (phi, i);
+ if (!(op == lhs && gimple_phi_arg_edge (phi, i)->src == bb))
+ continue;
+ tree phi_result = gimple_phi_result (phi);
+ operand_entry *oe;
+ unsigned int j;
+ FOR_EACH_VEC_ELT (*ops, j, oe)
+ {
+ if (TREE_CODE (oe->op) != SSA_NAME)
+ continue;
+
+ /* Result of phi is operand of PLUS_EXPR. */
+ if (oe->op == phi_result)
+ return true;
+
+	      /* Check if the result of phi is an operand of a MULT_EXPR.  */
+ gimple *def_stmt = SSA_NAME_DEF_STMT (oe->op);
+ if (is_gimple_assign (def_stmt)
+ && gimple_assign_rhs_code (def_stmt) == NEGATE_EXPR)
+ {
+ tree rhs = gimple_assign_rhs1 (def_stmt);
+ if (TREE_CODE (rhs) == SSA_NAME)
+ {
+ if (rhs == phi_result)
+ return true;
+ def_stmt = SSA_NAME_DEF_STMT (rhs);
+ }
+ }
+ if (is_gimple_assign (def_stmt)
+ && gimple_assign_rhs_code (def_stmt) == MULT_EXPR)
+ {
+ if (gimple_assign_rhs1 (def_stmt) == phi_result
+ || gimple_assign_rhs2 (def_stmt) == phi_result)
+ return true;
+ }
+ }
+ }
+ }
+ return false;
+}
+
 /* Rearranging ops may produce more FMAs when the chain contains more than 2 FMAs.
+ LHS is the result SSA_NAME of OPS.
+
Put no-mult ops and mult ops alternately at the end of the queue, which is
conducive to generating more FMA and reducing the loss of FMA when breaking
the chain.
@@ -6781,14 +6861,15 @@ transform_stmt_to_multiply (gimple_stmt_iterator *gsi, gimple *stmt,
_4 = .FMA (c_7(D), d_8(D), _3);
_11 = .FMA (a_5(D), b_6(D), _4); */
-static bool
-rank_ops_for_fma (vec<operand_entry *> *ops)
+static fma_state
+rank_ops_for_fma (vec<operand_entry *> *ops, tree lhs)
{
operand_entry *oe;
unsigned int i;
unsigned int ops_length = ops->length ();
auto_vec<operand_entry *> ops_mult;
auto_vec<operand_entry *> ops_others;
+ fma_state fs = FMA_STATE_NONE_OR_SINGLE;
FOR_EACH_VEC_ELT (*ops, i, oe)
{
@@ -6815,6 +6896,17 @@ rank_ops_for_fma (vec<operand_entry *> *ops)
2. If all ops are defined with mult, we don't need to rearrange them. */
if (ops_mult.length () >= 2 && ops_mult.length () != ops_length)
{
+      /* On some platforms, if the result has a cross-backedge dependency,
+ transforming it into an FMA leads to worse data parallelism. If we're
+ rewriting the op list to parallel we don't need to worry about this,
+ otherwise swap the operands. */
+ if (maybe_le (tree_to_poly_int64 (TYPE_SIZE (TREE_TYPE (lhs))),
+ param_avoid_fma_max_bits)
+ && result_feeds_back_from_phi_p (ops, lhs))
+ fs = FMA_STATE_CROSS_BACKEDGE;
+ else
+ fs = FMA_STATE_MULTIPLE;
+
/* Put no-mult ops and mult ops alternately at the end of the
queue, which is conducive to generating more FMA and reducing the
loss of FMA when breaking the chain. */
@@ -6829,9 +6921,8 @@ rank_ops_for_fma (vec<operand_entry *> *ops)
if (opindex > 0)
opindex--;
}
- return true;
}
- return false;
+ return fs;
}
/* Reassociate expressions in basic block BB and its post-dominator as
children.
@@ -6997,7 +7088,7 @@ reassociate_bb (basic_block bb)
machine_mode mode = TYPE_MODE (TREE_TYPE (lhs));
int ops_num = ops.length ();
int width;
- bool has_fma = false;
+ fma_state fs = FMA_STATE_NONE_OR_SINGLE;
/* For binary bit operations, if there are at least 3
operands and the last operand in OPS is a constant,
@@ -7020,7 +7111,7 @@ reassociate_bb (basic_block bb)
opt_type)
&& (rhs_code == PLUS_EXPR || rhs_code == MINUS_EXPR))
{
- has_fma = rank_ops_for_fma (&ops);
+ fs = rank_ops_for_fma (&ops, lhs);
}
/* Only rewrite the expression tree to parallel in the
@@ -7037,7 +7128,7 @@ reassociate_bb (basic_block bb)
width);
rewrite_expr_tree_parallel (as_a <gassign *> (stmt),
width,
- has_fma,
+ fs,
ops);
}
else
@@ -7046,7 +7137,7 @@ reassociate_bb (basic_block bb)
to make sure the ones that get the double
binary op are chosen wisely. */
int len = ops.length ();
- if (len >= 3 && !has_fma)
+ if (len >= 3 && fs != FMA_STATE_MULTIPLE)
swap_ops_for_binary_stmt (ops, len - 3);
new_lhs = rewrite_expr_tree (stmt, rhs_code, 0, ops,
--
2.25.1
Thread overview: 11+ messages
2023-08-28 8:17 Di Zhao OS [this message]
2023-08-28 23:22 ` Jeff Law
2023-08-29 7:41 ` Richard Biener
2023-08-29 7:49 ` Di Zhao OS
2023-08-29 8:09 ` Richard Biener
2023-08-29 8:58 ` Di Zhao OS
2023-08-29 11:11 ` Richard Biener
2023-08-30 9:33 ` Di Zhao OS
2023-08-31 12:22 ` Richard Biener
2023-09-04 10:33 ` Di Zhao OS
2023-08-29 12:34 ` Jeff Law