public inbox for gcc-patches@gcc.gnu.org
* [PATCH] [tree-optimization/110279] swap operands in reassoc to reduce cross backedge FMA
@ 2023-08-28  8:17 Di Zhao OS
  2023-08-28 23:22 ` Jeff Law
  0 siblings, 1 reply; 11+ messages in thread
From: Di Zhao OS @ 2023-08-28  8:17 UTC (permalink / raw)
  To: gcc-patches; +Cc: Richard Biener

[-- Attachment #1: Type: text/plain, Size: 2168 bytes --]

This patch tries to fix the 2% regression in 510.parest_r on
ampere1 in the tracker. (Previous discussion is here:
https://gcc.gnu.org/pipermail/gcc-patches/2023-July/624893.html)

1. Add testcases for the problem. For an op list in the form of
"acc = a * b + c * d + acc", reassociation currently doesn't
swap the operands, so that more FMAs can be generated.
After widening_mul the result looks like:

   _1 = .FMA(a, b, acc_0);
   acc_1 = .FMA(c, d, _1);

While previously (before the "Handle FMA friendly..." patch),
widening_mul's result was like:

   _1 = a * b;
   _2 = .FMA (c, d, _1);
   acc_1 = acc_0 + _2;

If the code fragment is in a loop, some architectures can execute
the latter in parallel, so it can be much faster than the former.
For the small testcase, the performance gap is over 10% on both
ampere1 and neoverse-n1. So the point here is to avoid turning the
last statement into an FMA and keep it a PLUS_EXPR as much as
possible. (If we are rewriting the op list into parallel, no special
treatment is needed, since the last statement after
rewrite_expr_tree_parallel will be a PLUS_EXPR anyway.)

2. Add function result_feeds_back_from_phi_p to check for cross
backedge dependency, and a new enum fma_state to describe the
state of FMA candidates. (A minimal loop exhibiting the problem
is sketched below.)
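
For reference, a minimal loop producing such an op list with a cross
backedge dependency looks like the following (essentially foo1 from
the attached testcase; the function name here is only illustrative):

   double
   acc_loop (double a, double b, double c, double d, int n)
   {
     double acc = 0;
     for (int i = 0; i < n; i++)
       {
         /* The op list "acc = a * b + c * d + acc"; acc is carried
            across the backedge.  */
         acc += a * b + c * d;
         /* Keep the multiplications loop-variant.  */
         a -= 0.1; b += 0.9; c *= 1.02; d *= 0.61;
       }
     return acc;
   }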

With this patch, there's a 3% improvement in 510.parest_r 1-copy
run on ampere1. The compile options are: 
"-Ofast -mcpu=ampere1 -flto --param avoid-fma-max-bits=512".

Best regards,
Di Zhao

----

        PR tree-optimization/110279

gcc/ChangeLog:

        * tree-ssa-reassoc.cc (enum fma_state): New enum to
        describe the state of FMA candidates for an op list.
        (rewrite_expr_tree_parallel): Change boolean parameter
        to enum type.
        (result_feeds_back_from_phi_p): New function to check
        for cross backedge dependency.
        (rank_ops_for_fma): Return enum fma_state.  Add new
        parameter.
        (reassociate_bb): If there's backedge dependency in an
        op list, swap the operands before rewrite_expr_tree.

gcc/testsuite/ChangeLog:

        * gcc.dg/pr110279.c: New test.

[-- Attachment #2: 0001-swap-operands-in-reassoc-to-reduce-cross-backedge-FM.patch --]
[-- Type: application/octet-stream, Size: 8928 bytes --]

From 663f9822b2f5b458ae4fd0983ec59fd1eb47877f Mon Sep 17 00:00:00 2001
From: amptest <amptest@amperecomputing.com>
Date: Wed, 23 Aug 2023 01:26:24 -0700
Subject: [PATCH] swap operands in reassoc to reduce cross backedge FMA

---
 gcc/testsuite/gcc.dg/pr110279.c |  62 ++++++++++++++++++
 gcc/tree-ssa-reassoc.cc         | 111 +++++++++++++++++++++++++++++---
 2 files changed, 163 insertions(+), 10 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/pr110279.c

diff --git a/gcc/testsuite/gcc.dg/pr110279.c b/gcc/testsuite/gcc.dg/pr110279.c
new file mode 100644
index 00000000000..9dc72658bff
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr110279.c
@@ -0,0 +1,62 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast --param avoid-fma-max-bits=512 --param tree-reassoc-width=4 -fdump-tree-widening_mul-details" } */
+/* { dg-additional-options "-march=armv8.2-a" } */
+
+#define LOOP_COUNT 800000000
+typedef double data_e;
+
+/* Check that FMAs with backedge dependency are avoided. Otherwise there won't
+   be FMA generated with "--param avoid-fma-max-bits=512".   */
+
+data_e foo1 (data_e a, data_e b, data_e c, data_e d)
+{
+  data_e result = 0;
+
+  for (int ic = 0; ic < LOOP_COUNT; ic++)
+    {
+      result += (a * b + c * d);
+
+      a -= 0.1;
+      b += 0.9;
+      c *= 1.02;
+      d *= 0.61;
+    }
+
+  return result;
+}
+
+data_e foo2 (data_e a, data_e b, data_e c, data_e d)
+{
+  data_e result = 0;
+
+  for (int ic = 0; ic < LOOP_COUNT; ic++)
+    {
+      result += a * b + result + c * d;
+
+      a -= 0.1;
+      b += 0.9;
+      c *= 1.02;
+      d *= 0.61;
+    }
+
+  return result;
+}
+
+data_e foo3 (data_e a, data_e b, data_e c, data_e d)
+{
+  data_e result = 0;
+
+  for (int ic = 0; ic < LOOP_COUNT; ic++)
+    {
+      result += result + a * b + c * d;
+
+      a -= 0.1;
+      b += 0.9;
+      c *= 1.02;
+      d *= 0.61;
+    }
+
+  return result;
+}
+
+/* { dg-final { scan-tree-dump-times "Generated FMA" 3 "widening_mul"} } */
diff --git a/gcc/tree-ssa-reassoc.cc b/gcc/tree-ssa-reassoc.cc
index eda03bf98a6..3552ee9c0e7 100644
--- a/gcc/tree-ssa-reassoc.cc
+++ b/gcc/tree-ssa-reassoc.cc
@@ -5471,6 +5471,24 @@ get_reassociation_width (int ops_num, enum tree_code opc,
   return width;
 }
 
+/* This enum describes the state of FMA candidates for a list of operands.  */
+
+enum fma_state
+{
+  /* There is at most one FMA candidate in the op list.  In this case, no
+     special treatment is needed when rewriting expression tree.  */
+  FMA_STATE_NONE_OR_SINGLE = 0,
+
+  /* The result of op-list has cross backedge dependency, and we need to avoid
+     turning it into an FMA.  In this case, swap the operands if we are not
+     rewriting it into parallel.  */
+  FMA_STATE_CROSS_BACKEDGE,
+
+  /* There are multiple FMA candidates in the op list.  Avoid swapping operands
+     in this case.  */
+  FMA_STATE_MULTIPLE
+};
+
 #define SPECIAL_BIASED_END_STMT 0 /* It is the end stmt of all ops.  */
 #define BIASED_END_STMT 1 /* It is the end stmt of normal or biased ops.  */
 #define NORMAL_END_STMT 2 /* It is the end stmt of normal ops.  */
@@ -5507,7 +5525,7 @@ get_reassociation_width (int ops_num, enum tree_code opc,
    SSA6 = SSA4 + SSA5;
  */
 static void
-rewrite_expr_tree_parallel (gassign *stmt, int width, bool has_fma,
+rewrite_expr_tree_parallel (gassign *stmt, int width, fma_state fs,
 			    const vec<operand_entry *> &ops)
 {
   enum tree_code opcode = gimple_assign_rhs_code (stmt);
@@ -5582,7 +5600,7 @@ rewrite_expr_tree_parallel (gassign *stmt, int width, bool has_fma,
 	}
 
       /* Swap the operands if no FMA in the chain.  */
-      if (ops1->length () > 2 && !has_fma)
+      if (ops1->length () > 2 && fs == FMA_STATE_NONE_OR_SINGLE)
 	swap_ops_for_binary_stmt (*ops1, ops1->length () - 3);
 
       if (i < width_active
@@ -6766,7 +6784,69 @@ transform_stmt_to_multiply (gimple_stmt_iterator *gsi, gimple *stmt,
     }
 }
 
+/* Given that LHS is the result SSA_NAME of OPS, returns whether it has cross
+   backedge dependency:
+
+   1. LHS is a phi argument in the same basic block it is defined.
+
+   2. And, the result of the phi node is used in OPS.  */
+
+static bool
+result_feeds_back_from_phi_p (vec<operand_entry *> *ops, tree lhs)
+{
+  basic_block bb = gimple_bb (SSA_NAME_DEF_STMT (lhs));
+  gimple_stmt_iterator gsi;
+
+  /* Iterate over the phi nodes in the basic block where LHS is defined.  */
+  for (gsi = gsi_start_phis (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+    {
+      gphi *phi = dyn_cast<gphi *> (gsi_stmt (gsi));
+      for (unsigned i = 0; i < gimple_phi_num_args (phi); ++i)
+	{
+	  tree op = PHI_ARG_DEF (phi, i);
+	  if (!(op == lhs && gimple_phi_arg_edge (phi, i)->src == bb))
+	    continue;
+	  tree phi_result = gimple_phi_result (phi);
+	  operand_entry *oe;
+	  unsigned int j;
+	  FOR_EACH_VEC_ELT (*ops, j, oe)
+	    {
+	      if (TREE_CODE (oe->op) != SSA_NAME)
+		continue;
+
+	      /* Result of phi is operand of PLUS_EXPR.  */
+	      if (oe->op == phi_result)
+		return true;
+
+	      /* Check if result of phi is operand of MULT_EXPR.  */
+	      gimple *def_stmt = SSA_NAME_DEF_STMT (oe->op);
+	      if (is_gimple_assign (def_stmt)
+		  && gimple_assign_rhs_code (def_stmt) == NEGATE_EXPR)
+		{
+		  tree rhs = gimple_assign_rhs1 (def_stmt);
+		  if (TREE_CODE (rhs) == SSA_NAME)
+		    {
+		      if (rhs == phi_result)
+			return true;
+		      def_stmt = SSA_NAME_DEF_STMT (rhs);
+		    }
+		}
+	      if (is_gimple_assign (def_stmt)
+		  && gimple_assign_rhs_code (def_stmt) == MULT_EXPR)
+		{
+		  if (gimple_assign_rhs1 (def_stmt) == phi_result
+		      || gimple_assign_rhs2 (def_stmt) == phi_result)
+		    return true;
+		}
+	    }
+	}
+    }
+  return false;
+}
+
 /* Rearrange ops may have more FMA when the chain may has more than 2 FMAs.
+   LHS is the result SSA_NAME of OPS.
+
    Put no-mult ops and mult ops alternately at the end of the queue, which is
    conducive to generating more FMA and reducing the loss of FMA when breaking
    the chain.
@@ -6781,14 +6861,15 @@ transform_stmt_to_multiply (gimple_stmt_iterator *gsi, gimple *stmt,
 
    _4  = .FMA (c_7(D), d_8(D), _3);
    _11 = .FMA (a_5(D), b_6(D), _4);  */
-static bool
-rank_ops_for_fma (vec<operand_entry *> *ops)
+static fma_state
+rank_ops_for_fma (vec<operand_entry *> *ops, tree lhs)
 {
   operand_entry *oe;
   unsigned int i;
   unsigned int ops_length = ops->length ();
   auto_vec<operand_entry *> ops_mult;
   auto_vec<operand_entry *> ops_others;
+  fma_state fs = FMA_STATE_NONE_OR_SINGLE;
 
   FOR_EACH_VEC_ELT (*ops, i, oe)
     {
@@ -6815,6 +6896,17 @@ rank_ops_for_fma (vec<operand_entry *> *ops)
      2. If all ops are defined with mult, we don't need to rearrange them.  */
   if (ops_mult.length () >= 2 && ops_mult.length () != ops_length)
     {
+      /* On some platforms, if the result has cross backedge dependency,
+	 transforming it into an FMA leads to worse data parallelism.  If we're
+	 rewriting the op list to parallel we don't need to worry about this,
+	 otherwise swap the operands.  */
+      if (maybe_le (tree_to_poly_int64 (TYPE_SIZE (TREE_TYPE (lhs))),
+		    param_avoid_fma_max_bits)
+	  && result_feeds_back_from_phi_p (ops, lhs))
+	fs = FMA_STATE_CROSS_BACKEDGE;
+      else
+	fs = FMA_STATE_MULTIPLE;
+
       /* Put no-mult ops and mult ops alternately at the end of the
 	 queue, which is conducive to generating more FMA and reducing the
 	 loss of FMA when breaking the chain.  */
@@ -6829,9 +6921,8 @@ rank_ops_for_fma (vec<operand_entry *> *ops)
 	  if (opindex > 0)
 	    opindex--;
 	}
-      return true;
     }
-  return false;
+  return fs;
 }
 /* Reassociate expressions in basic block BB and its post-dominator as
    children.
@@ -6997,7 +7088,7 @@ reassociate_bb (basic_block bb)
 		  machine_mode mode = TYPE_MODE (TREE_TYPE (lhs));
 		  int ops_num = ops.length ();
 		  int width;
-		  bool has_fma = false;
+		  fma_state fs = FMA_STATE_NONE_OR_SINGLE;
 
 		  /* For binary bit operations, if there are at least 3
 		     operands and the last operand in OPS is a constant,
@@ -7020,7 +7111,7 @@ reassociate_bb (basic_block bb)
 						      opt_type)
 		      && (rhs_code == PLUS_EXPR || rhs_code == MINUS_EXPR))
 		    {
-		      has_fma = rank_ops_for_fma (&ops);
+		      fs = rank_ops_for_fma (&ops, lhs);
 		    }
 
 		  /* Only rewrite the expression tree to parallel in the
@@ -7037,7 +7128,7 @@ reassociate_bb (basic_block bb)
 				 width);
 		      rewrite_expr_tree_parallel (as_a <gassign *> (stmt),
 						  width,
-						  has_fma,
+						  fs,
 						  ops);
 		    }
 		  else
@@ -7046,7 +7137,7 @@ reassociate_bb (basic_block bb)
 			 to make sure the ones that get the double
 			 binary op are chosen wisely.  */
 		      int len = ops.length ();
-		      if (len >= 3 && !has_fma)
+		      if (len >= 3 && fs != FMA_STATE_MULTIPLE)
 			swap_ops_for_binary_stmt (ops, len - 3);
 
 		      new_lhs = rewrite_expr_tree (stmt, rhs_code, 0, ops,
-- 
2.25.1



* Re: [PATCH] [tree-optimization/110279] swap operands in reassoc to reduce cross backedge FMA
  2023-08-28  8:17 [PATCH] [tree-optimization/110279] swap operands in reassoc to reduce cross backedge FMA Di Zhao OS
@ 2023-08-28 23:22 ` Jeff Law
  2023-08-29  7:41   ` Richard Biener
  0 siblings, 1 reply; 11+ messages in thread
From: Jeff Law @ 2023-08-28 23:22 UTC (permalink / raw)
  To: Di Zhao OS, gcc-patches



On 8/28/23 02:17, Di Zhao OS via Gcc-patches wrote:
> This patch tries to fix the 2% regression in 510.parest_r on
> ampere1 in the tracker. (Previous discussion is here:
> https://gcc.gnu.org/pipermail/gcc-patches/2023-July/624893.html)
> 
> 1. Add testcases for the problem. For an op list in the form of
> "acc = a * b + c * d + acc", currently reassociation doesn't
> Swap the operands so that more FMAs can be generated.
> After widening_mul the result looks like:
> 
>     _1 = .FMA(a, b, acc_0);
>     acc_1 = .FMA(c, d, _1);
> 
> While previously (before the "Handle FMA friendly..." patch),
> widening_mul's result was like:
> 
>     _1 = a * b;
>     _2 = .FMA (c, d, _1);
>     acc_1 = acc_0 + _2;
> 
> If the code fragment is in a loop, some architecture can execute
> the latter in parallel, so the performance can be much faster than
> the former. For the small testcase, the performance gap is over
> 10% on both ampere1 and neoverse-n1. So the point here is to avoid
> turning the last statement into FMA, and keep it a PLUS_EXPR as
> much as possible. (If we are rewriting the op list into parallel,
> no special treatment is needed, since the last statement after
> rewrite_expr_tree_parallel will be PLUS_EXPR anyway.)
> 
> 2. Function result_feeds_back_from_phi_p is to check for cross
> backedge dependency. Added new enum fma_state to describe the
> state of FMA candidates.
> 
> With this patch, there's a 3% improvement in 510.parest_r 1-copy
> run on ampere1. The compile options are:
> "-Ofast -mcpu=ampere1 -flto --param avoid-fma-max-bits=512".
> 
> Best regards,
> Di Zhao
> 
> ----
> 
>          PR tree-optimization/110279
> 
> gcc/ChangeLog:
> 
>          * tree-ssa-reassoc.cc (enum fma_state): New enum to
>          describe the state of FMA candidates for an op list.
>          (rewrite_expr_tree_parallel): Changed boolean
>          parameter to enum type.
>          (result_feeds_back_from_phi_p): New function to check
>          for cross backedge dependency.
>          (rank_ops_for_fma): Return enum fma_state. Added new
>          parameter.
>          (reassociate_bb): If there's backedge dependency in an
>          op list, swap the operands before rewrite_expr_tree.
> 
> gcc/testsuite/ChangeLog:
> 
>          * gcc.dg/pr110279.c: New test.
Not a review, but more of a question -- isn't this transformation's
profitability uarch sensitive?  I.e., just because it's bad for a set of
aarch64 uarches doesn't mean it's bad everywhere.

And in general we shy away from trying to adjust gimple code based on 
uarch preferences.

It seems the right place to do this is gimple->rtl expansion.

Jeff


* Re: [PATCH] [tree-optimization/110279] swap operands in reassoc to reduce cross backedge FMA
  2023-08-28 23:22 ` Jeff Law
@ 2023-08-29  7:41   ` Richard Biener
  2023-08-29  7:49     ` Di Zhao OS
  2023-08-29 12:34     ` Jeff Law
  0 siblings, 2 replies; 11+ messages in thread
From: Richard Biener @ 2023-08-29  7:41 UTC (permalink / raw)
  To: Jeff Law, Martin Jambor; +Cc: Di Zhao OS, gcc-patches

On Tue, Aug 29, 2023 at 1:23 AM Jeff Law via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
>
>
> On 8/28/23 02:17, Di Zhao OS via Gcc-patches wrote:
> > This patch tries to fix the 2% regression in 510.parest_r on
> > ampere1 in the tracker. (Previous discussion is here:
> > https://gcc.gnu.org/pipermail/gcc-patches/2023-July/624893.html)
> >
> > 1. Add testcases for the problem. For an op list in the form of
> > "acc = a * b + c * d + acc", currently reassociation doesn't
> > Swap the operands so that more FMAs can be generated.
> > After widening_mul the result looks like:
> >
> >     _1 = .FMA(a, b, acc_0);
> >     acc_1 = .FMA(c, d, _1);
> >
> > While previously (before the "Handle FMA friendly..." patch),
> > widening_mul's result was like:
> >
> >     _1 = a * b;
> >     _2 = .FMA (c, d, _1);
> >     acc_1 = acc_0 + _2;

How can we execute the multiply and the FMA in parallel?  They
depend on each other.  Or is it that the uarch can handle the
dependence on the add operand in some better way, but only when it
comes from a multiplication and not an FMA?  (I'd doubt so much
complexity.)

Can you explain in more detail how the uarch executes one vs. the
other case?

> > If the code fragment is in a loop, some architecture can execute
> > the latter in parallel, so the performance can be much faster than
> > the former. For the small testcase, the performance gap is over
> > 10% on both ampere1 and neoverse-n1. So the point here is to avoid
> > turning the last statement into FMA, and keep it a PLUS_EXPR as
> > much as possible. (If we are rewriting the op list into parallel,
> > no special treatment is needed, since the last statement after
> > rewrite_expr_tree_parallel will be PLUS_EXPR anyway.)
> >
> > 2. Function result_feeds_back_from_phi_p is to check for cross
> > backedge dependency. Added new enum fma_state to describe the
> > state of FMA candidates.
> >
> > With this patch, there's a 3% improvement in 510.parest_r 1-copy
> > run on ampere1. The compile options are:
> > "-Ofast -mcpu=ampere1 -flto --param avoid-fma-max-bits=512".
> >
> > Best regards,
> > Di Zhao
> >
> > ----
> >
> >          PR tree-optimization/110279
> >
> > gcc/ChangeLog:
> >
> >          * tree-ssa-reassoc.cc (enum fma_state): New enum to
> >          describe the state of FMA candidates for an op list.
> >          (rewrite_expr_tree_parallel): Changed boolean
> >          parameter to enum type.
> >          (result_feeds_back_from_phi_p): New function to check
> >          for cross backedge dependency.
> >          (rank_ops_for_fma): Return enum fma_state. Added new
> >          parameter.
> >          (reassociate_bb): If there's backedge dependency in an
> >          op list, swap the operands before rewrite_expr_tree.
> >
> > gcc/testsuite/ChangeLog:
> >
> >          * gcc.dg/pr110279.c: New test.
> Not a review, but more of a question -- isn't this transformation's
> profitability uarch sensitive.  ie, just because it's bad for a set of
> aarch64 uarches, doesn't mean it's bad everywhere.
>
> And in general we shy away from trying to adjust gimple code based on
> uarch preferences.
>
> It seems the right place to do this is gimple->rtl expansion.

Another comment is that FMA forming has this deferring code which I
think deals exactly with this kind of thing?  CCing Martin who did this
work based on AMD uarchs also not wanting cross-loop dependences
on FMAs (or so).  In particular I see

  if (fma_state.m_deferring_p
      && fma_state.m_initial_phi)
    {
      gcc_checking_assert (fma_state.m_last_result);
      if (!last_fma_candidate_feeds_initial_phi (&fma_state,
                                                 &m_last_result_set))
        cancel_fma_deferring (&fma_state);

and I think code to avoid FMAs in other/related cases should be here
as well, like avoid forming back-to-back FMAs.

Richard.

> Jeff


* RE: [PATCH] [tree-optimization/110279] swap operands in reassoc to reduce cross backedge FMA
  2023-08-29  7:41   ` Richard Biener
@ 2023-08-29  7:49     ` Di Zhao OS
  2023-08-29  8:09       ` Richard Biener
  2023-08-29 12:34     ` Jeff Law
  1 sibling, 1 reply; 11+ messages in thread
From: Di Zhao OS @ 2023-08-29  7:49 UTC (permalink / raw)
  To: Richard Biener, Jeff Law, Martin Jambor; +Cc: gcc-patches

Hi,

> -----Original Message-----
> From: Richard Biener <richard.guenther@gmail.com>
> Sent: Tuesday, August 29, 2023 3:41 PM
> To: Jeff Law <jeffreyalaw@gmail.com>; Martin Jambor <mjambor@suse.cz>
> Cc: Di Zhao OS <dizhao@os.amperecomputing.com>; gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH] [tree-optimization/110279] swap operands in reassoc to
> reduce cross backedge FMA
> 
> On Tue, Aug 29, 2023 at 1:23 AM Jeff Law via Gcc-patches
> <gcc-patches@gcc.gnu.org> wrote:
> >
> >
> >
> > On 8/28/23 02:17, Di Zhao OS via Gcc-patches wrote:
> > > This patch tries to fix the 2% regression in 510.parest_r on
> > > ampere1 in the tracker. (Previous discussion is here:
> > > https://gcc.gnu.org/pipermail/gcc-patches/2023-July/624893.html)
> > >
> > > 1. Add testcases for the problem. For an op list in the form of
> > > "acc = a * b + c * d + acc", currently reassociation doesn't
> > > Swap the operands so that more FMAs can be generated.
> > > After widening_mul the result looks like:
> > >
> > >     _1 = .FMA(a, b, acc_0);
> > >     acc_1 = .FMA(c, d, _1);
> > >
> > > While previously (before the "Handle FMA friendly..." patch),
> > > widening_mul's result was like:
> > >
> > >     _1 = a * b;
> > >     _2 = .FMA (c, d, _1);
> > >     acc_1 = acc_0 + _2;
> 
> How can we execute the multiply and the FMA in parallel?  They
> depend on each other.  Or is it the uarch can handle dependence
> on the add operand but only when it is with a multiplication and
> not a FMA in some better ways?  (I'd doubt so much complexity)
> 
> Can you explain in more detail how the uarch executes one vs. the
> other case?
> 
> > > If the code fragment is in a loop, some architecture can execute
> > > the latter in parallel, so the performance can be much faster than
> > > the former. For the small testcase, the performance gap is over
> > > 10% on both ampere1 and neoverse-n1. So the point here is to avoid
> > > turning the last statement into FMA, and keep it a PLUS_EXPR as
> > > much as possible. (If we are rewriting the op list into parallel,
> > > no special treatment is needed, since the last statement after
> > > rewrite_expr_tree_parallel will be PLUS_EXPR anyway.)
> > >
> > > 2. Function result_feeds_back_from_phi_p is to check for cross
> > > backedge dependency. Added new enum fma_state to describe the
> > > state of FMA candidates.
> > >
> > > With this patch, there's a 3% improvement in 510.parest_r 1-copy
> > > run on ampere1. The compile options are:
> > > "-Ofast -mcpu=ampere1 -flto --param avoid-fma-max-bits=512".
> > >
> > > Best regards,
> > > Di Zhao
> > >
> > > ----
> > >
> > >          PR tree-optimization/110279
> > >
> > > gcc/ChangeLog:
> > >
> > >          * tree-ssa-reassoc.cc (enum fma_state): New enum to
> > >          describe the state of FMA candidates for an op list.
> > >          (rewrite_expr_tree_parallel): Changed boolean
> > >          parameter to enum type.
> > >          (result_feeds_back_from_phi_p): New function to check
> > >          for cross backedge dependency.
> > >          (rank_ops_for_fma): Return enum fma_state. Added new
> > >          parameter.
> > >          (reassociate_bb): If there's backedge dependency in an
> > >          op list, swap the operands before rewrite_expr_tree.
> > >
> > > gcc/testsuite/ChangeLog:
> > >
> > >          * gcc.dg/pr110279.c: New test.
> > Not a review, but more of a question -- isn't this transformation's
> > profitability uarch sensitive.  ie, just because it's bad for a set of
> > aarch64 uarches, doesn't mean it's bad everywhere.
> >
> > And in general we shy away from trying to adjust gimple code based on
> > uarch preferences.
> >
> > It seems the right place to do this is gimple->rtl expansion.
> 
> Another comment is that FMA forming has this deferring code which I
> think deals exactly with this kind of thing?  CCing Martin who did this
> work based on AMD uarchs also not wanting cross-loop dependences
> on FMAs (or so).  In particular I see
> 
>   if (fma_state.m_deferring_p
>       && fma_state.m_initial_phi)
>     {
>       gcc_checking_assert (fma_state.m_last_result);
>       if (!last_fma_candidate_feeds_initial_phi (&fma_state,
>                                                  &m_last_result_set))
>         cancel_fma_deferring (&fma_state);
> 
> and I think code to avoid FMAs in other/related cases should be here
> as well, like avoid forming back-to-back FMAs.

The changes in this patch are controlled by "param_avoid_fma_max_bits",
so I think they should only affect architectures with similar behavior.
(The parameter was added in a previous patch, "Deferring FMA
transformations in tight loops", which seems to be dealing with the
same issue.)

> Richard.
> 
> > Jeff

Thanks,
Di Zhao


* Re: [PATCH] [tree-optimization/110279] swap operands in reassoc to reduce cross backedge FMA
  2023-08-29  7:49     ` Di Zhao OS
@ 2023-08-29  8:09       ` Richard Biener
  2023-08-29  8:58         ` Di Zhao OS
  0 siblings, 1 reply; 11+ messages in thread
From: Richard Biener @ 2023-08-29  8:09 UTC (permalink / raw)
  To: Di Zhao OS; +Cc: Jeff Law, Martin Jambor, gcc-patches

On Tue, Aug 29, 2023 at 9:49 AM Di Zhao OS
<dizhao@os.amperecomputing.com> wrote:
>
> Hi,
>
> > -----Original Message-----
> > From: Richard Biener <richard.guenther@gmail.com>
> > Sent: Tuesday, August 29, 2023 3:41 PM
> > To: Jeff Law <jeffreyalaw@gmail.com>; Martin Jambor <mjambor@suse.cz>
> > Cc: Di Zhao OS <dizhao@os.amperecomputing.com>; gcc-patches@gcc.gnu.org
> > Subject: Re: [PATCH] [tree-optimization/110279] swap operands in reassoc to
> > reduce cross backedge FMA
> >
> > On Tue, Aug 29, 2023 at 1:23 AM Jeff Law via Gcc-patches
> > <gcc-patches@gcc.gnu.org> wrote:
> > >
> > >
> > >
> > > On 8/28/23 02:17, Di Zhao OS via Gcc-patches wrote:
> > > > This patch tries to fix the 2% regression in 510.parest_r on
> > > > ampere1 in the tracker. (Previous discussion is here:
> > > > https://gcc.gnu.org/pipermail/gcc-patches/2023-July/624893.html)
> > > >
> > > > 1. Add testcases for the problem. For an op list in the form of
> > > > "acc = a * b + c * d + acc", currently reassociation doesn't
> > > > Swap the operands so that more FMAs can be generated.
> > > > After widening_mul the result looks like:
> > > >
> > > >     _1 = .FMA(a, b, acc_0);
> > > >     acc_1 = .FMA(c, d, _1);
> > > >
> > > > While previously (before the "Handle FMA friendly..." patch),
> > > > widening_mul's result was like:
> > > >
> > > >     _1 = a * b;
> > > >     _2 = .FMA (c, d, _1);
> > > >     acc_1 = acc_0 + _2;
> >
> > How can we execute the multiply and the FMA in parallel?  They
> > depend on each other.  Or is it the uarch can handle dependence
> > on the add operand but only when it is with a multiplication and
> > not a FMA in some better ways?  (I'd doubt so much complexity)
> >
> > Can you explain in more detail how the uarch executes one vs. the
> > other case?
> >
> > > > If the code fragment is in a loop, some architecture can execute
> > > > the latter in parallel, so the performance can be much faster than
> > > > the former. For the small testcase, the performance gap is over
> > > > 10% on both ampere1 and neoverse-n1. So the point here is to avoid
> > > > turning the last statement into FMA, and keep it a PLUS_EXPR as
> > > > much as possible. (If we are rewriting the op list into parallel,
> > > > no special treatment is needed, since the last statement after
> > > > rewrite_expr_tree_parallel will be PLUS_EXPR anyway.)
> > > >
> > > > 2. Function result_feeds_back_from_phi_p is to check for cross
> > > > backedge dependency. Added new enum fma_state to describe the
> > > > state of FMA candidates.
> > > >
> > > > With this patch, there's a 3% improvement in 510.parest_r 1-copy
> > > > run on ampere1. The compile options are:
> > > > "-Ofast -mcpu=ampere1 -flto --param avoid-fma-max-bits=512".
> > > >
> > > > Best regards,
> > > > Di Zhao
> > > >
> > > > ----
> > > >
> > > >          PR tree-optimization/110279
> > > >
> > > > gcc/ChangeLog:
> > > >
> > > >          * tree-ssa-reassoc.cc (enum fma_state): New enum to
> > > >          describe the state of FMA candidates for an op list.
> > > >          (rewrite_expr_tree_parallel): Changed boolean
> > > >          parameter to enum type.
> > > >          (result_feeds_back_from_phi_p): New function to check
> > > >          for cross backedge dependency.
> > > >          (rank_ops_for_fma): Return enum fma_state. Added new
> > > >          parameter.
> > > >          (reassociate_bb): If there's backedge dependency in an
> > > >          op list, swap the operands before rewrite_expr_tree.
> > > >
> > > > gcc/testsuite/ChangeLog:
> > > >
> > > >          * gcc.dg/pr110279.c: New test.
> > > Not a review, but more of a question -- isn't this transformation's
> > > profitability uarch sensitive.  ie, just because it's bad for a set of
> > > aarch64 uarches, doesn't mean it's bad everywhere.
> > >
> > > And in general we shy away from trying to adjust gimple code based on
> > > uarch preferences.
> > >
> > > It seems the right place to do this is gimple->rtl expansion.
> >
> > Another comment is that FMA forming has this deferring code which I
> > think deals exactly with this kind of thing?  CCing Martin who did this
> > work based on AMD uarchs also not wanting cross-loop dependences
> > on FMAs (or so).  In particular I see
> >
> >   if (fma_state.m_deferring_p
> >       && fma_state.m_initial_phi)
> >     {
> >       gcc_checking_assert (fma_state.m_last_result);
> >       if (!last_fma_candidate_feeds_initial_phi (&fma_state,
> >                                                  &m_last_result_set))
> >         cancel_fma_deferring (&fma_state);
> >
> > and I think code to avoid FMAs in other/related cases should be here
> > as well, like avoid forming back-to-back FMAs.
>
> The changes in this patch is controlled by "param_avoid_fma_max_bits", so
> I think it should only affect architectures with similar behavior. (The
> parameter was added in a previous patch "Deferring FMA transformations
> in tight loops", which seems to be dealing with the same issue.)

That's what I said.  Is the pipeline behavior properly modeled by the
scheduler description?  Your patch seems to not only affect loops
but the FMA forming case already present does - is the case you are
after not in a tight loop?  In principle with a loop with enough scheduling
freedom this should be a non-issue - unless the pipeline model isn't
accurate and we put the FMAs too close together?

Richard.

> > Richard.
> >
> > > Jeff
>
> Thanks,
> Di Zhao


* RE: [PATCH] [tree-optimization/110279] swap operands in reassoc to reduce cross backedge FMA
  2023-08-29  8:09       ` Richard Biener
@ 2023-08-29  8:58         ` Di Zhao OS
  2023-08-29 11:11           ` Richard Biener
  0 siblings, 1 reply; 11+ messages in thread
From: Di Zhao OS @ 2023-08-29  8:58 UTC (permalink / raw)
  To: Richard Biener; +Cc: Jeff Law, Martin Jambor, gcc-patches

Hi,

> -----Original Message-----
> From: Richard Biener <richard.guenther@gmail.com>
> Sent: Tuesday, August 29, 2023 4:09 PM
> To: Di Zhao OS <dizhao@os.amperecomputing.com>
> Cc: Jeff Law <jeffreyalaw@gmail.com>; Martin Jambor <mjambor@suse.cz>; gcc-
> patches@gcc.gnu.org
> Subject: Re: [PATCH] [tree-optimization/110279] swap operands in reassoc to
> reduce cross backedge FMA
> 
> On Tue, Aug 29, 2023 at 9:49 AM Di Zhao OS
> <dizhao@os.amperecomputing.com> wrote:
> >
> > Hi,
> >
> > > -----Original Message-----
> > > From: Richard Biener <richard.guenther@gmail.com>
> > > Sent: Tuesday, August 29, 2023 3:41 PM
> > > To: Jeff Law <jeffreyalaw@gmail.com>; Martin Jambor <mjambor@suse.cz>
> > > Cc: Di Zhao OS <dizhao@os.amperecomputing.com>; gcc-patches@gcc.gnu.org
> > > Subject: Re: [PATCH] [tree-optimization/110279] swap operands in reassoc
> to
> > > reduce cross backedge FMA
> > >
> > > On Tue, Aug 29, 2023 at 1:23 AM Jeff Law via Gcc-patches
> > > <gcc-patches@gcc.gnu.org> wrote:
> > > >
> > > >
> > > >
> > > > On 8/28/23 02:17, Di Zhao OS via Gcc-patches wrote:
> > > > > This patch tries to fix the 2% regression in 510.parest_r on
> > > > > ampere1 in the tracker. (Previous discussion is here:
> > > > > https://gcc.gnu.org/pipermail/gcc-patches/2023-July/624893.html)
> > > > >
> > > > > 1. Add testcases for the problem. For an op list in the form of
> > > > > "acc = a * b + c * d + acc", currently reassociation doesn't
> > > > > Swap the operands so that more FMAs can be generated.
> > > > > After widening_mul the result looks like:
> > > > >
> > > > >     _1 = .FMA(a, b, acc_0);
> > > > >     acc_1 = .FMA(c, d, _1);
> > > > >
> > > > > While previously (before the "Handle FMA friendly..." patch),
> > > > > widening_mul's result was like:
> > > > >
> > > > >     _1 = a * b;
> > > > >     _2 = .FMA (c, d, _1);
> > > > >     acc_1 = acc_0 + _2;
> > >
> > > How can we execute the multiply and the FMA in parallel?  They
> > > depend on each other.  Or is it the uarch can handle dependence
> > > on the add operand but only when it is with a multiplication and
> > > not a FMA in some better ways?  (I'd doubt so much complexity)
> > >
> > > Can you explain in more detail how the uarch executes one vs. the
> > > other case?

Here's my understanding after consulting our hardware team. For the
second case, some out-of-order processors can calculate "_2" for
several iterations at the same time, since there's no dependency
among different iterations. For the first case, however, the next
iteration has to wait for the current one to finish so that "acc_0"'s
value is known. I assume this is also the case for some i386
processors, since the patch "Deferring FMA transformations in tight
loops" also changed the corresponding files.
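
To illustrate with the two GIMPLE sequences above (this is only a
sketch of the per-iteration dependences, not an actual dump):

   /* Back-to-back FMAs: acc sits on a chain of two FMA latencies.  */
   _1    = .FMA (a, b, acc_0);   /* waits for acc_1 of the previous iteration */
   acc_1 = .FMA (c, d, _1);      /* the next iteration waits for this */

   /* mul + FMA + add: only the final add is loop-carried.  */
   _1    = a * b;                /* independent of acc */
   _2    = .FMA (c, d, _1);      /* independent of acc */
   acc_1 = acc_0 + _2;           /* the only statement that uses acc_0 */

So in the first form the loop-carried critical path is two FMA
latencies per iteration, while in the second it is a single FP add.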

> > >
> > > > > If the code fragment is in a loop, some architecture can execute
> > > > > the latter in parallel, so the performance can be much faster than
> > > > > the former. For the small testcase, the performance gap is over
> > > > > 10% on both ampere1 and neoverse-n1. So the point here is to avoid
> > > > > turning the last statement into FMA, and keep it a PLUS_EXPR as
> > > > > much as possible. (If we are rewriting the op list into parallel,
> > > > > no special treatment is needed, since the last statement after
> > > > > rewrite_expr_tree_parallel will be PLUS_EXPR anyway.)
> > > > >
> > > > > 2. Function result_feeds_back_from_phi_p is to check for cross
> > > > > backedge dependency. Added new enum fma_state to describe the
> > > > > state of FMA candidates.
> > > > >
> > > > > With this patch, there's a 3% improvement in 510.parest_r 1-copy
> > > > > run on ampere1. The compile options are:
> > > > > "-Ofast -mcpu=ampere1 -flto --param avoid-fma-max-bits=512".
> > > > >
> > > > > Best regards,
> > > > > Di Zhao
> > > > >
> > > > > ----
> > > > >
> > > > >          PR tree-optimization/110279
> > > > >
> > > > > gcc/ChangeLog:
> > > > >
> > > > >          * tree-ssa-reassoc.cc (enum fma_state): New enum to
> > > > >          describe the state of FMA candidates for an op list.
> > > > >          (rewrite_expr_tree_parallel): Changed boolean
> > > > >          parameter to enum type.
> > > > >          (result_feeds_back_from_phi_p): New function to check
> > > > >          for cross backedge dependency.
> > > > >          (rank_ops_for_fma): Return enum fma_state. Added new
> > > > >          parameter.
> > > > >          (reassociate_bb): If there's backedge dependency in an
> > > > >          op list, swap the operands before rewrite_expr_tree.
> > > > >
> > > > > gcc/testsuite/ChangeLog:
> > > > >
> > > > >          * gcc.dg/pr110279.c: New test.
> > > > Not a review, but more of a question -- isn't this transformation's
> > > > profitability uarch sensitive.  ie, just because it's bad for a set of
> > > > aarch64 uarches, doesn't mean it's bad everywhere.
> > > >
> > > > And in general we shy away from trying to adjust gimple code based on
> > > > uarch preferences.
> > > >
> > > > It seems the right place to do this is gimple->rtl expansion.
> > >
> > > Another comment is that FMA forming has this deferring code which I
> > > think deals exactly with this kind of thing?  CCing Martin who did this
> > > work based on AMD uarchs also not wanting cross-loop dependences
> > > on FMAs (or so).  In particular I see
> > >
> > >   if (fma_state.m_deferring_p
> > >       && fma_state.m_initial_phi)
> > >     {
> > >       gcc_checking_assert (fma_state.m_last_result);
> > >       if (!last_fma_candidate_feeds_initial_phi (&fma_state,
> > >                                                  &m_last_result_set))
> > >         cancel_fma_deferring (&fma_state);
> > >
> > > and I think code to avoid FMAs in other/related cases should be here
> > > as well, like avoid forming back-to-back FMAs.
> >
> > The changes in this patch is controlled by "param_avoid_fma_max_bits", so
> > I think it should only affect architectures with similar behavior. (The
> > parameter was added in a previous patch "Deferring FMA transformations
> > in tight loops", which seems to be dealing with the same issue.)
> 
> That's what I said.  Is the pipeline behavior properly modeled by the
> scheduler description?  Your patch seems to not only affect loops
> but the FMA forming case already present does - is the case you are
> after not in a tight loop?  In principle with a loop with enough scheduling
> freedom this should be a non-issue - unless the pipeline model isn't
> accurate and we put the FMAs too close together?

The current deferring FMA scheme discards all the FMA candidates in an
FMA chain that has a loop dependency. So for this case too few (zero)
FMAs are generated, while before the "Handle FMA friendly..." patch, or
after this patch, we can still have 1 FMA. I also tried changing the
deferring FMA scheme so that only the last FMA is discarded, but that
still can't solve the regression. I think that's because the deferring
scheme is in the later pass widening_mul, which only decides whether to
form FMAs but doesn't manipulate the tree for better data parallelism.
So it seems better to optimize for data parallelism (by swapping the
operands) in reassoc if we know there can be FMAs with loop dependency,
and leave the current deferring FMA scheme as the last resort?
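
For the tiny example that means widening_mul would leave something like
(a sketch, not an actual dump):

   _1 = a * b;
   _2 = acc_0 + _1;
   _3 = c * d;
   acc_1 = _2 + _3;

i.e. zero FMAs, instead of the one FMA we can still get when the
operands are swapped.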

> 
> Richard.
> 
> > > Richard.
> > >
> > > > Jeff
> >


Thanks,
Di Zhao


* Re: [PATCH] [tree-optimization/110279] swap operands in reassoc to reduce cross backedge FMA
  2023-08-29  8:58         ` Di Zhao OS
@ 2023-08-29 11:11           ` Richard Biener
  2023-08-30  9:33             ` Di Zhao OS
  0 siblings, 1 reply; 11+ messages in thread
From: Richard Biener @ 2023-08-29 11:11 UTC (permalink / raw)
  To: Di Zhao OS; +Cc: Jeff Law, Martin Jambor, gcc-patches

On Tue, Aug 29, 2023 at 10:59 AM Di Zhao OS
<dizhao@os.amperecomputing.com> wrote:
>
> Hi,
>
> > -----Original Message-----
> > From: Richard Biener <richard.guenther@gmail.com>
> > Sent: Tuesday, August 29, 2023 4:09 PM
> > To: Di Zhao OS <dizhao@os.amperecomputing.com>
> > Cc: Jeff Law <jeffreyalaw@gmail.com>; Martin Jambor <mjambor@suse.cz>; gcc-
> > patches@gcc.gnu.org
> > Subject: Re: [PATCH] [tree-optimization/110279] swap operands in reassoc to
> > reduce cross backedge FMA
> >
> > On Tue, Aug 29, 2023 at 9:49 AM Di Zhao OS
> > <dizhao@os.amperecomputing.com> wrote:
> > >
> > > Hi,
> > >
> > > > -----Original Message-----
> > > > From: Richard Biener <richard.guenther@gmail.com>
> > > > Sent: Tuesday, August 29, 2023 3:41 PM
> > > > To: Jeff Law <jeffreyalaw@gmail.com>; Martin Jambor <mjambor@suse.cz>
> > > > Cc: Di Zhao OS <dizhao@os.amperecomputing.com>; gcc-patches@gcc.gnu.org
> > > > Subject: Re: [PATCH] [tree-optimization/110279] swap operands in reassoc
> > to
> > > > reduce cross backedge FMA
> > > >
> > > > On Tue, Aug 29, 2023 at 1:23 AM Jeff Law via Gcc-patches
> > > > <gcc-patches@gcc.gnu.org> wrote:
> > > > >
> > > > >
> > > > >
> > > > > On 8/28/23 02:17, Di Zhao OS via Gcc-patches wrote:
> > > > > > This patch tries to fix the 2% regression in 510.parest_r on
> > > > > > ampere1 in the tracker. (Previous discussion is here:
> > > > > > https://gcc.gnu.org/pipermail/gcc-patches/2023-July/624893.html)
> > > > > >
> > > > > > 1. Add testcases for the problem. For an op list in the form of
> > > > > > "acc = a * b + c * d + acc", currently reassociation doesn't
> > > > > > Swap the operands so that more FMAs can be generated.
> > > > > > After widening_mul the result looks like:
> > > > > >
> > > > > >     _1 = .FMA(a, b, acc_0);
> > > > > >     acc_1 = .FMA(c, d, _1);
> > > > > >
> > > > > > While previously (before the "Handle FMA friendly..." patch),
> > > > > > widening_mul's result was like:
> > > > > >
> > > > > >     _1 = a * b;
> > > > > >     _2 = .FMA (c, d, _1);
> > > > > >     acc_1 = acc_0 + _2;
> > > >
> > > > How can we execute the multiply and the FMA in parallel?  They
> > > > depend on each other.  Or is it the uarch can handle dependence
> > > > on the add operand but only when it is with a multiplication and
> > > > not a FMA in some better ways?  (I'd doubt so much complexity)
> > > >
> > > > Can you explain in more detail how the uarch executes one vs. the
> > > > other case?
>
> Here's my understanding after consulted our hardware team. For the
> second case, the uarch of some out-of-order processors can calculate
> "_2" of several loops at the same time, since there's no dependency
> among different iterations. While for the first case the next iteration
> has to wait for the current iteration to finish, so "acc_0"'s value is
> known. I assume it is also the case in some i386 processors, since I
> saw the patch "Deferring FMA transformations in tight loops" also
> changed corresponding files.

That should be true for all kinds of operations, no?  Thus it means
reassoc should in general associate cross-iteration accumulation
last?  Historically we associated those first because that's how the
vectorizer liked to see them, but I think that's no longer necessary.

It should be achievable by properly biasing the operand during
rank computation (don't we already do that?).

> > > >
> > > > > > If the code fragment is in a loop, some architecture can execute
> > > > > > the latter in parallel, so the performance can be much faster than
> > > > > > the former. For the small testcase, the performance gap is over
> > > > > > 10% on both ampere1 and neoverse-n1. So the point here is to avoid
> > > > > > turning the last statement into FMA, and keep it a PLUS_EXPR as
> > > > > > much as possible. (If we are rewriting the op list into parallel,
> > > > > > no special treatment is needed, since the last statement after
> > > > > > rewrite_expr_tree_parallel will be PLUS_EXPR anyway.)
> > > > > >
> > > > > > 2. Function result_feeds_back_from_phi_p is to check for cross
> > > > > > backedge dependency. Added new enum fma_state to describe the
> > > > > > state of FMA candidates.
> > > > > >
> > > > > > With this patch, there's a 3% improvement in 510.parest_r 1-copy
> > > > > > run on ampere1. The compile options are:
> > > > > > "-Ofast -mcpu=ampere1 -flto --param avoid-fma-max-bits=512".
> > > > > >
> > > > > > Best regards,
> > > > > > Di Zhao
> > > > > >
> > > > > > ----
> > > > > >
> > > > > >          PR tree-optimization/110279
> > > > > >
> > > > > > gcc/ChangeLog:
> > > > > >
> > > > > >          * tree-ssa-reassoc.cc (enum fma_state): New enum to
> > > > > >          describe the state of FMA candidates for an op list.
> > > > > >          (rewrite_expr_tree_parallel): Changed boolean
> > > > > >          parameter to enum type.
> > > > > >          (result_feeds_back_from_phi_p): New function to check
> > > > > >          for cross backedge dependency.
> > > > > >          (rank_ops_for_fma): Return enum fma_state. Added new
> > > > > >          parameter.
> > > > > >          (reassociate_bb): If there's backedge dependency in an
> > > > > >          op list, swap the operands before rewrite_expr_tree.
> > > > > >
> > > > > > gcc/testsuite/ChangeLog:
> > > > > >
> > > > > >          * gcc.dg/pr110279.c: New test.
> > > > > Not a review, but more of a question -- isn't this transformation's
> > > > > profitability uarch sensitive.  ie, just because it's bad for a set of
> > > > > aarch64 uarches, doesn't mean it's bad everywhere.
> > > > >
> > > > > And in general we shy away from trying to adjust gimple code based on
> > > > > uarch preferences.
> > > > >
> > > > > It seems the right place to do this is gimple->rtl expansion.
> > > >
> > > > Another comment is that FMA forming has this deferring code which I
> > > > think deals exactly with this kind of thing?  CCing Martin who did this
> > > > work based on AMD uarchs also not wanting cross-loop dependences
> > > > on FMAs (or so).  In particular I see
> > > >
> > > >   if (fma_state.m_deferring_p
> > > >       && fma_state.m_initial_phi)
> > > >     {
> > > >       gcc_checking_assert (fma_state.m_last_result);
> > > >       if (!last_fma_candidate_feeds_initial_phi (&fma_state,
> > > >                                                  &m_last_result_set))
> > > >         cancel_fma_deferring (&fma_state);
> > > >
> > > > and I think code to avoid FMAs in other/related cases should be here
> > > > as well, like avoid forming back-to-back FMAs.
> > >
> > > The changes in this patch is controlled by "param_avoid_fma_max_bits", so
> > > I think it should only affect architectures with similar behavior. (The
> > > parameter was added in a previous patch "Deferring FMA transformations
> > > in tight loops", which seems to be dealing with the same issue.)
> >
> > That's what I said.  Is the pipeline behavior properly modeled by the
> > scheduler description?  Your patch seems to not only affect loops
> > but the FMA forming case already present does - is the case you are
> > after not in a tight loop?  In principle with a loop with enough scheduling
> > freedom this should be a non-issue - unless the pipeline model isn't
> > accurate and we put the FMAs too close together?
>
> The current deferring FMA scheme discards all the FMA candidates in the FMA
> chain that has loop dependency. So for this case there will be too less (zero)
> FMA generated; while before the "Handle FMA friendly..." patch or after this
> patch we can still have 1 FMA. I also tried to change the deferring FMA scheme
> so only the last FMA is discarded, but it still can't solve the regression.
> I think it's because this deferring scheme is in the later pass widening_mul,
> which only decides whether to form FMA, but doesn't manipulate the tree for
> better data parallelism. So it seems better to optimize for data parallelism
> (swapping the operands) in reassoc if we know there can be FMAs with loop
> dependency, and leave the current deferring FMA scheme as the last resort?
>
> >
> > Richard.
> >
> > > > Richard.
> > > >
> > > > > Jeff
> > >
>
>
> Thanks,
> Di Zhao


* Re: [PATCH] [tree-optimization/110279] swap operands in reassoc to reduce cross backedge FMA
  2023-08-29  7:41   ` Richard Biener
  2023-08-29  7:49     ` Di Zhao OS
@ 2023-08-29 12:34     ` Jeff Law
  1 sibling, 0 replies; 11+ messages in thread
From: Jeff Law @ 2023-08-29 12:34 UTC (permalink / raw)
  To: Richard Biener, Martin Jambor; +Cc: Di Zhao OS, gcc-patches



On 8/29/23 01:41, Richard Biener wrote:
>>>
>>>      _1 = a * b;
>>>      _2 = .FMA (c, d, _1);
>>>      acc_1 = acc_0 + _2;
> 
> How can we execute the multiply and the FMA in parallel?  They
> depend on each other.  Or is it the uarch can handle dependence
> on the add operand but only when it is with a multiplication and
> not a FMA in some better ways?  (I'd doubt so much complexity)
I've worked on an architecture that could almost do that.    The ops 
didn't run in parallel, but instead serially as "chained" FP ops.

Essentially, in cases where you could chain them they became a single
instruction.  These were fully pipelined, thus issuing every cycle.  Latency
was 1c faster than if you'd issued the ops as distinct instructions. 
More importantly, by combining the two FP ops into a single instruction 
you could issue more FP ops/cycle which significantly helps many FP 
codes.  It's safe to assume this required notable additional FP 
hardware, but it's something we already had in the design for other 
purposes.

I keep hoping that architecture becomes public.  There were some other 
really interesting features in the design that could be incorporated 
into other designs with minimal hardware cost.



> 
> Can you explain in more detail how the uarch executes one vs. the
> other case?
Probably can't say more than I already have.

Anyway, given the architecture in question is still private and no 
longer something I have to champion, if you want to move forward with 
the patch, I won't object.

jeff


* RE: [PATCH] [tree-optimization/110279] swap operands in reassoc to reduce cross backedge FMA
  2023-08-29 11:11           ` Richard Biener
@ 2023-08-30  9:33             ` Di Zhao OS
  2023-08-31 12:22               ` Richard Biener
  0 siblings, 1 reply; 11+ messages in thread
From: Di Zhao OS @ 2023-08-30  9:33 UTC (permalink / raw)
  To: Richard Biener; +Cc: Jeff Law, Martin Jambor, gcc-patches

Hello Richard,

> -----Original Message-----
> From: Richard Biener <richard.guenther@gmail.com>
> Sent: Tuesday, August 29, 2023 7:11 PM
> To: Di Zhao OS <dizhao@os.amperecomputing.com>
> Cc: Jeff Law <jeffreyalaw@gmail.com>; Martin Jambor <mjambor@suse.cz>; gcc-
> patches@gcc.gnu.org
> Subject: Re: [PATCH] [tree-optimization/110279] swap operands in reassoc to
> reduce cross backedge FMA
> 
> On Tue, Aug 29, 2023 at 10:59 AM Di Zhao OS
> <dizhao@os.amperecomputing.com> wrote:
> >
> > Hi,
> >
> > > -----Original Message-----
> > > From: Richard Biener <richard.guenther@gmail.com>
> > > Sent: Tuesday, August 29, 2023 4:09 PM
> > > To: Di Zhao OS <dizhao@os.amperecomputing.com>
> > > Cc: Jeff Law <jeffreyalaw@gmail.com>; Martin Jambor <mjambor@suse.cz>;
> gcc-
> > > patches@gcc.gnu.org
> > > Subject: Re: [PATCH] [tree-optimization/110279] swap operands in reassoc
> to
> > > reduce cross backedge FMA
> > >
> > > On Tue, Aug 29, 2023 at 9:49 AM Di Zhao OS
> > > <dizhao@os.amperecomputing.com> wrote:
> > > >
> > > > Hi,
> > > >
> > > > > -----Original Message-----
> > > > > From: Richard Biener <richard.guenther@gmail.com>
> > > > > Sent: Tuesday, August 29, 2023 3:41 PM
> > > > > To: Jeff Law <jeffreyalaw@gmail.com>; Martin Jambor <mjambor@suse.cz>
> > > > > Cc: Di Zhao OS <dizhao@os.amperecomputing.com>; gcc-
> patches@gcc.gnu.org
> > > > > Subject: Re: [PATCH] [tree-optimization/110279] swap operands in
> reassoc
> > > to
> > > > > reduce cross backedge FMA
> > > > >
> > > > > On Tue, Aug 29, 2023 at 1:23 AM Jeff Law via Gcc-patches
> > > > > <gcc-patches@gcc.gnu.org> wrote:
> > > > > >
> > > > > >
> > > > > >
> > > > > > On 8/28/23 02:17, Di Zhao OS via Gcc-patches wrote:
> > > > > > > This patch tries to fix the 2% regression in 510.parest_r on
> > > > > > > ampere1 in the tracker. (Previous discussion is here:
> > > > > > > https://gcc.gnu.org/pipermail/gcc-patches/2023-July/624893.html)
> > > > > > >
> > > > > > > 1. Add testcases for the problem. For an op list in the form of
> > > > > > > "acc = a * b + c * d + acc", currently reassociation doesn't
> > > > > > > Swap the operands so that more FMAs can be generated.
> > > > > > > After widening_mul the result looks like:
> > > > > > >
> > > > > > >     _1 = .FMA(a, b, acc_0);
> > > > > > >     acc_1 = .FMA(c, d, _1);
> > > > > > >
> > > > > > > While previously (before the "Handle FMA friendly..." patch),
> > > > > > > widening_mul's result was like:
> > > > > > >
> > > > > > >     _1 = a * b;
> > > > > > >     _2 = .FMA (c, d, _1);
> > > > > > >     acc_1 = acc_0 + _2;
> > > > >
> > > > > How can we execute the multiply and the FMA in parallel?  They
> > > > > depend on each other.  Or is it the uarch can handle dependence
> > > > > on the add operand but only when it is with a multiplication and
> > > > > not a FMA in some better ways?  (I'd doubt so much complexity)
> > > > >
> > > > > Can you explain in more detail how the uarch executes one vs. the
> > > > > other case?
> >
> > Here's my understanding after consulted our hardware team. For the
> > second case, the uarch of some out-of-order processors can calculate
> > "_2" of several loops at the same time, since there's no dependency
> > among different iterations. While for the first case the next iteration
> > has to wait for the current iteration to finish, so "acc_0"'s value is
> > known. I assume it is also the case in some i386 processors, since I
> > saw the patch "Deferring FMA transformations in tight loops" also
> > changed corresponding files.
> 
> That should be true for all kind of operations, no?  Thus it means
> reassoc should in general associate cross-iteration accumulation
Yes I think both are true.

> last?  Historically we associated those first because that's how the
> vectorizer liked to see them, but I think that's no longer necessary.
> 
> It should be achievable by properly biasing the operand during
> rank computation (don't we already do that?).

The issue is related to the following code (handling the cases with
three operands left):
      /* When there are three operands left, we want
	 to make sure the ones that get the double
	 binary op are chosen wisely.  */
      int len = ops.length ();
      if (len >= 3 && !has_fma)
	swap_ops_for_binary_stmt (ops, len - 3);

      new_lhs = rewrite_expr_tree (stmt, rhs_code, 0, ops,
				   powi_result != NULL
				   || negate_result,
				   len != orig_len);

Originally (before the "Handle FMA friendly..." patch), for the
tiny example the 2 multiplications would be placed first by
swap_ops_for_binary_stmt and rewrite_expr_tree, according to
ranks. Currently, to preserve more FMAs, swap_ops_for_binary_stmt
isn't called, so the result is MULT_EXPRs and PLUS_EXPRs
interleaved with each other (which is mostly fine if these are not
in such tight loops).

What this patch tries to do can be summarized as: when a cross
backedge dependency is detected (and the uarch doesn't like it),
better to fall back to the "old" way, since we don't want the
loop-carried FMA anyway. (I hope I'm understanding the question
correctly.)
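
For the tiny example the effect can be sketched like this (illustrative
only, based on the dumps quoted at the top of the thread):

   /* FMA-friendly order (no swap); widening_mul turns both adds
      into FMAs:  */
   _1 = a * b;
   _2 = acc_0 + _1;     /* -> _2 = .FMA (a, b, acc_0);  */
   _3 = c * d;
   acc_1 = _2 + _3;     /* -> acc_1 = .FMA (c, d, _2);  */

   /* With the swap this patch does when a cross backedge dependency
      is found, the two multiplications are combined first and the
      accumulation stays a plain add:  */
   _1 = a * b;
   _2 = c * d;
   _3 = _1 + _2;        /* -> _3 = .FMA (c, d, _1);  */
   acc_1 = _3 + acc_0;  /* stays a PLUS_EXPR  */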

> 
> > > > >
> > > > > > > If the code fragment is in a loop, some architecture can execute
> > > > > > > the latter in parallel, so the performance can be much faster
> than
> > > > > > > the former. For the small testcase, the performance gap is over
> > > > > > > 10% on both ampere1 and neoverse-n1. So the point here is to
> avoid
> > > > > > > turning the last statement into FMA, and keep it a PLUS_EXPR as
> > > > > > > much as possible. (If we are rewriting the op list into parallel,
> > > > > > > no special treatment is needed, since the last statement after
> > > > > > > rewrite_expr_tree_parallel will be PLUS_EXPR anyway.)
> > > > > > >
> > > > > > > 2. Function result_feeds_back_from_phi_p is to check for cross
> > > > > > > backedge dependency. Added new enum fma_state to describe the
> > > > > > > state of FMA candidates.
> > > > > > >
> > > > > > > With this patch, there's a 3% improvement in 510.parest_r 1-copy
> > > > > > > run on ampere1. The compile options are:
> > > > > > > "-Ofast -mcpu=ampere1 -flto --param avoid-fma-max-bits=512".
> > > > > > >
> > > > > > > Best regards,
> > > > > > > Di Zhao
> > > > > > >
> > > > > > > ----
> > > > > > >
> > > > > > >          PR tree-optimization/110279
> > > > > > >
> > > > > > > gcc/ChangeLog:
> > > > > > >
> > > > > > >          * tree-ssa-reassoc.cc (enum fma_state): New enum to
> > > > > > >          describe the state of FMA candidates for an op list.
> > > > > > >          (rewrite_expr_tree_parallel): Changed boolean
> > > > > > >          parameter to enum type.
> > > > > > >          (result_feeds_back_from_phi_p): New function to check
> > > > > > >          for cross backedge dependency.
> > > > > > >          (rank_ops_for_fma): Return enum fma_state. Added new
> > > > > > >          parameter.
> > > > > > >          (reassociate_bb): If there's backedge dependency in an
> > > > > > >          op list, swap the operands before rewrite_expr_tree.
> > > > > > >
> > > > > > > gcc/testsuite/ChangeLog:
> > > > > > >
> > > > > > >          * gcc.dg/pr110279.c: New test.
> > > > > > Not a review, but more of a question -- isn't this transformation's
> > > > > > profitability uarch sensitive.  ie, just because it's bad for a set
> of
> > > > > > aarch64 uarches, doesn't mean it's bad everywhere.
> > > > > >
> > > > > > And in general we shy away from trying to adjust gimple code based
> on
> > > > > > uarch preferences.
> > > > > >
> > > > > > It seems the right place to do this is gimple->rtl expansion.
> > > > >
> > > > > Another comment is that FMA forming has this deferring code which I
> > > > > think deals exactly with this kind of thing?  CCing Martin who did
> this
> > > > > work based on AMD uarchs also not wanting cross-loop dependences
> > > > > on FMAs (or so).  In particular I see
> > > > >
> > > > >   if (fma_state.m_deferring_p
> > > > >       && fma_state.m_initial_phi)
> > > > >     {
> > > > >       gcc_checking_assert (fma_state.m_last_result);
> > > > >       if (!last_fma_candidate_feeds_initial_phi (&fma_state,
> > > > >                                                  &m_last_result_set))
> > > > >         cancel_fma_deferring (&fma_state);
> > > > >
> > > > > and I think code to avoid FMAs in other/related cases should be here
> > > > > as well, like avoid forming back-to-back FMAs.
> > > >
> > > > The changes in this patch is controlled by "param_avoid_fma_max_bits",
> so
> > > > I think it should only affect architectures with similar behavior. (The
> > > > parameter was added in a previous patch "Deferring FMA transformations
> > > > in tight loops", which seems to be dealing with the same issue.)
> > >
> > > That's what I said.  Is the pipeline behavior properly modeled by the
> > > scheduler description?  Your patch seems to not only affect loops
> > > but the FMA forming case already present does - is the case you are
> > > after not in a tight loop?  In principle with a loop with enough
> scheduling
> > > freedom this should be a non-issue - unless the pipeline model isn't
> > > accurate and we put the FMAs too close together?
> >
> > The current deferring FMA scheme discards all the FMA candidates in the FMA
> > chain that has loop dependency. So for this case there will be too less
> (zero)
> > FMA generated; while before the "Handle FMA friendly..." patch or after
> this
> > patch we can still have 1 FMA. I also tried to change the deferring FMA
> scheme
> > so only the last FMA is discarded, but it still can't solve the regression.
> > I think it's because this deferring scheme is in the later pass
> widening_mul,
> > which only decides whether to form FMA, but doesn't manipulate the tree for
> > better data parallelism. So it seems better to optimize for data
> parallelism
> > (swapping the operands) in reassoc if we know there can be FMAs with loop
> > dependency, and leave the current deferring FMA scheme as the last resort?
> >
> > >
> > > Richard.
> > >
> > > > > Richard.
> > > > >
> > > > > > Jeff
> > > >
> >
> >
> > Thanks,
> > Di Zhao


Thanks,
Di Zhao

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] [tree-optimization/110279] swap operands in reassoc to reduce cross backedge FMA
  2023-08-30  9:33             ` Di Zhao OS
@ 2023-08-31 12:22               ` Richard Biener
  2023-09-04 10:33                 ` Di Zhao OS
  0 siblings, 1 reply; 11+ messages in thread
From: Richard Biener @ 2023-08-31 12:22 UTC (permalink / raw)
  To: Di Zhao OS; +Cc: Jeff Law, Martin Jambor, gcc-patches

On Wed, Aug 30, 2023 at 11:33 AM Di Zhao OS
<dizhao@os.amperecomputing.com> wrote:
>
> Hello Richard,
>
> > -----Original Message-----
> > From: Richard Biener <richard.guenther@gmail.com>
> > Sent: Tuesday, August 29, 2023 7:11 PM
> > To: Di Zhao OS <dizhao@os.amperecomputing.com>
> > Cc: Jeff Law <jeffreyalaw@gmail.com>; Martin Jambor <mjambor@suse.cz>; gcc-
> > patches@gcc.gnu.org
> > Subject: Re: [PATCH] [tree-optimization/110279] swap operands in reassoc to
> > reduce cross backedge FMA
> >
> > On Tue, Aug 29, 2023 at 10:59 AM Di Zhao OS
> > <dizhao@os.amperecomputing.com> wrote:
> > >
> > > Hi,
> > >
> > > > -----Original Message-----
> > > > From: Richard Biener <richard.guenther@gmail.com>
> > > > Sent: Tuesday, August 29, 2023 4:09 PM
> > > > To: Di Zhao OS <dizhao@os.amperecomputing.com>
> > > > Cc: Jeff Law <jeffreyalaw@gmail.com>; Martin Jambor <mjambor@suse.cz>;
> > gcc-
> > > > patches@gcc.gnu.org
> > > > Subject: Re: [PATCH] [tree-optimization/110279] swap operands in reassoc
> > to
> > > > reduce cross backedge FMA
> > > >
> > > > On Tue, Aug 29, 2023 at 9:49 AM Di Zhao OS
> > > > <dizhao@os.amperecomputing.com> wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Richard Biener <richard.guenther@gmail.com>
> > > > > > Sent: Tuesday, August 29, 2023 3:41 PM
> > > > > > To: Jeff Law <jeffreyalaw@gmail.com>; Martin Jambor <mjambor@suse.cz>
> > > > > > Cc: Di Zhao OS <dizhao@os.amperecomputing.com>; gcc-
> > patches@gcc.gnu.org
> > > > > > Subject: Re: [PATCH] [tree-optimization/110279] swap operands in
> > reassoc
> > > > to
> > > > > > reduce cross backedge FMA
> > > > > >
> > > > > > On Tue, Aug 29, 2023 at 1:23 AM Jeff Law via Gcc-patches
> > > > > > <gcc-patches@gcc.gnu.org> wrote:
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On 8/28/23 02:17, Di Zhao OS via Gcc-patches wrote:
> > > > > > > > This patch tries to fix the 2% regression in 510.parest_r on
> > > > > > > > ampere1 in the tracker. (Previous discussion is here:
> > > > > > > > https://gcc.gnu.org/pipermail/gcc-patches/2023-July/624893.html)
> > > > > > > >
> > > > > > > > 1. Add testcases for the problem. For an op list in the form of
> > > > > > > > "acc = a * b + c * d + acc", currently reassociation doesn't
> > > > > > > > Swap the operands so that more FMAs can be generated.
> > > > > > > > After widening_mul the result looks like:
> > > > > > > >
> > > > > > > >     _1 = .FMA(a, b, acc_0);
> > > > > > > >     acc_1 = .FMA(c, d, _1);
> > > > > > > >
> > > > > > > > While previously (before the "Handle FMA friendly..." patch),
> > > > > > > > widening_mul's result was like:
> > > > > > > >
> > > > > > > >     _1 = a * b;
> > > > > > > >     _2 = .FMA (c, d, _1);
> > > > > > > >     acc_1 = acc_0 + _2;
> > > > > >
> > > > > > How can we execute the multiply and the FMA in parallel?  They
> > > > > > depend on each other.  Or is it the uarch can handle dependence
> > > > > > on the add operand but only when it is with a multiplication and
> > > > > > not a FMA in some better ways?  (I'd doubt so much complexity)
> > > > > >
> > > > > > Can you explain in more detail how the uarch executes one vs. the
> > > > > > other case?
> > >
> > > Here's my understanding after consulted our hardware team. For the
> > > second case, the uarch of some out-of-order processors can calculate
> > > "_2" of several loops at the same time, since there's no dependency
> > > among different iterations. While for the first case the next iteration
> > > has to wait for the current iteration to finish, so "acc_0"'s value is
> > > known. I assume it is also the case in some i386 processors, since I
> > > saw the patch "Deferring FMA transformations in tight loops" also
> > > changed corresponding files.
> >
> > That should be true for all kind of operations, no?  Thus it means
> > reassoc should in general associate cross-iteration accumulation
> Yes I think both are true.
>
> > last?  Historically we associated those first because that's how the
> > vectorizer liked to see them, but I think that's no longer necessary.
> >
> > It should be achievable by properly biasing the operand during
> > rank computation (don't we already do that?).
>
> The issue is related with the following codes (handling cases with
> three operands left):
>       /* When there are three operands left, we want
>          to make sure the ones that get the double
>          binary op are chosen wisely.  */
>       int len = ops.length ();
>       if (len >= 3 && !has_fma)
>         swap_ops_for_binary_stmt (ops, len - 3);
>
>       new_lhs = rewrite_expr_tree (stmt, rhs_code, 0, ops,
>                                    powi_result != NULL
>                                    || negate_result,
>                                    len != orig_len);
>
> Originally (before the "Handle FMA friendly..." patch), for the
> tiny example, the 2 multiplications will be placed first by
> swap_ops_for_binary_stmt and rewrite_expr_tree, according to
> ranks. While currently, to preserve more FMAs,
> swap_ops_for_binary_stmt won't be called, so the result would
> be MULT_EXPRs and PLUS_EXPRs interleaved with each other (which
> is mostly fine if these are not in such tight loops).
>
> What this patch tries to do can be summarized as: when cross
> backedge dependency is detected (and the uarch doesn't like it),
> better fallback to the "old" way, since we don't want the loop
> depending FMA anyway. (I hope I'm understanding the question
> correctly.)

Hmm, I'd argue this should be

 if (len >= 3 && (!has_fma || (width = get_reassociation_width (...)) > 1))
  swap_ops_for_binary_stmt (ops, len - 3);

to cover the case of exactly 3 ops - this is for exactly 3 ops, right?
In principle we could, in swap_ops_for_binary_stmt, for has_fma
avoid doing anything when no PHI recurrence is involved.  But the point
is that the target can execute things in parallel, and the thing to check
for this is the reassociation-width (even if it's for extending the chain
via speculative execution across a backedge).

> >
> > > > > >
> > > > > > > > If the code fragment is in a loop, some architecture can execute
> > > > > > > > the latter in parallel, so the performance can be much faster
> > than
> > > > > > > > the former. For the small testcase, the performance gap is over
> > > > > > > > 10% on both ampere1 and neoverse-n1. So the point here is to
> > avoid
> > > > > > > > turning the last statement into FMA, and keep it a PLUS_EXPR as
> > > > > > > > much as possible. (If we are rewriting the op list into parallel,
> > > > > > > > no special treatment is needed, since the last statement after
> > > > > > > > rewrite_expr_tree_parallel will be PLUS_EXPR anyway.)
> > > > > > > >
> > > > > > > > 2. Function result_feeds_back_from_phi_p is to check for cross
> > > > > > > > backedge dependency. Added new enum fma_state to describe the
> > > > > > > > state of FMA candidates.
> > > > > > > >
> > > > > > > > With this patch, there's a 3% improvement in 510.parest_r 1-copy
> > > > > > > > run on ampere1. The compile options are:
> > > > > > > > "-Ofast -mcpu=ampere1 -flto --param avoid-fma-max-bits=512".
> > > > > > > >
> > > > > > > > Best regards,
> > > > > > > > Di Zhao
> > > > > > > >
> > > > > > > > ----
> > > > > > > >
> > > > > > > >          PR tree-optimization/110279
> > > > > > > >
> > > > > > > > gcc/ChangeLog:
> > > > > > > >
> > > > > > > >          * tree-ssa-reassoc.cc (enum fma_state): New enum to
> > > > > > > >          describe the state of FMA candidates for an op list.
> > > > > > > >          (rewrite_expr_tree_parallel): Changed boolean
> > > > > > > >          parameter to enum type.
> > > > > > > >          (result_feeds_back_from_phi_p): New function to check
> > > > > > > >          for cross backedge dependency.
> > > > > > > >          (rank_ops_for_fma): Return enum fma_state. Added new
> > > > > > > >          parameter.
> > > > > > > >          (reassociate_bb): If there's backedge dependency in an
> > > > > > > >          op list, swap the operands before rewrite_expr_tree.
> > > > > > > >
> > > > > > > > gcc/testsuite/ChangeLog:
> > > > > > > >
> > > > > > > >          * gcc.dg/pr110279.c: New test.
> > > > > > > Not a review, but more of a question -- isn't this transformation's
> > > > > > > profitability uarch sensitive.  ie, just because it's bad for a set
> > of
> > > > > > > aarch64 uarches, doesn't mean it's bad everywhere.
> > > > > > >
> > > > > > > And in general we shy away from trying to adjust gimple code based
> > on
> > > > > > > uarch preferences.
> > > > > > >
> > > > > > > It seems the right place to do this is gimple->rtl expansion.
> > > > > >
> > > > > > Another comment is that FMA forming has this deferring code which I
> > > > > > think deals exactly with this kind of thing?  CCing Martin who did
> > this
> > > > > > work based on AMD uarchs also not wanting cross-loop dependences
> > > > > > on FMAs (or so).  In particular I see
> > > > > >
> > > > > >   if (fma_state.m_deferring_p
> > > > > >       && fma_state.m_initial_phi)
> > > > > >     {
> > > > > >       gcc_checking_assert (fma_state.m_last_result);
> > > > > >       if (!last_fma_candidate_feeds_initial_phi (&fma_state,
> > > > > >                                                  &m_last_result_set))
> > > > > >         cancel_fma_deferring (&fma_state);
> > > > > >
> > > > > > and I think code to avoid FMAs in other/related cases should be here
> > > > > > as well, like avoid forming back-to-back FMAs.
> > > > >
> > > > > The changes in this patch is controlled by "param_avoid_fma_max_bits",
> > so
> > > > > I think it should only affect architectures with similar behavior. (The
> > > > > parameter was added in a previous patch "Deferring FMA transformations
> > > > > in tight loops", which seems to be dealing with the same issue.)
> > > >
> > > > That's what I said.  Is the pipeline behavior properly modeled by the
> > > > scheduler description?  Your patch seems to not only affect loops
> > > > but the FMA forming case already present does - is the case you are
> > > > after not in a tight loop?  In principle with a loop with enough
> > scheduling
> > > > freedom this should be a non-issue - unless the pipeline model isn't
> > > > accurate and we put the FMAs too close together?
> > >
> > > The current deferring FMA scheme discards all the FMA candidates in the FMA
> > > chain that has loop dependency. So for this case there will be too less
> > (zero)
> > > FMA generated; while before the "Handle FMA friendly..." patch or after
> > this
> > > patch we can still have 1 FMA. I also tried to change the deferring FMA
> > scheme
> > > so only the last FMA is discarded, but it still can't solve the regression.
> > > I think it's because this deferring scheme is in the later pass
> > widening_mul,
> > > which only decides whether to form FMA, but doesn't manipulate the tree for
> > > better data parallelism. So it seems better to optimize for data
> > parallelism
> > > (swapping the operands) in reassoc if we know there can be FMAs with loop
> > > dependency, and leave the current deferring FMA scheme as the last resort?
> > >
> > > >
> > > > Richard.
> > > >
> > > > > > Richard.
> > > > > >
> > > > > > > Jeff
> > > > >
> > >
> > >
> > > Thanks,
> > > Di Zhao
>
>
> Thanks,
> Di Zhao

^ permalink raw reply	[flat|nested] 11+ messages in thread

* RE: [PATCH] [tree-optimization/110279] swap operands in reassoc to reduce cross backedge FMA
  2023-08-31 12:22               ` Richard Biener
@ 2023-09-04 10:33                 ` Di Zhao OS
  0 siblings, 0 replies; 11+ messages in thread
From: Di Zhao OS @ 2023-09-04 10:33 UTC (permalink / raw)
  To: Richard Biener; +Cc: Jeff Law, Martin Jambor, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 13508 bytes --]

> -----Original Message-----
> From: Richard Biener <richard.guenther@gmail.com>
> Sent: Thursday, August 31, 2023 8:23 PM
> To: Di Zhao OS <dizhao@os.amperecomputing.com>
> Cc: Jeff Law <jeffreyalaw@gmail.com>; Martin Jambor <mjambor@suse.cz>; gcc-
> patches@gcc.gnu.org
> Subject: Re: [PATCH] [tree-optimization/110279] swap operands in reassoc to
> reduce cross backedge FMA
> 
> On Wed, Aug 30, 2023 at 11:33 AM Di Zhao OS
> <dizhao@os.amperecomputing.com> wrote:
> >
> > Hello Richard,
> >
> > > -----Original Message-----
> > > From: Richard Biener <richard.guenther@gmail.com>
> > > Sent: Tuesday, August 29, 2023 7:11 PM
> > > To: Di Zhao OS <dizhao@os.amperecomputing.com>
> > > Cc: Jeff Law <jeffreyalaw@gmail.com>; Martin Jambor <mjambor@suse.cz>;
> gcc-
> > > patches@gcc.gnu.org
> > > Subject: Re: [PATCH] [tree-optimization/110279] swap operands in reassoc
> to
> > > reduce cross backedge FMA
> > >
> > > On Tue, Aug 29, 2023 at 10:59 AM Di Zhao OS
> > > <dizhao@os.amperecomputing.com> wrote:
> > > >
> > > > Hi,
> > > >
> > > > > -----Original Message-----
> > > > > From: Richard Biener <richard.guenther@gmail.com>
> > > > > Sent: Tuesday, August 29, 2023 4:09 PM
> > > > > To: Di Zhao OS <dizhao@os.amperecomputing.com>
> > > > > Cc: Jeff Law <jeffreyalaw@gmail.com>; Martin Jambor <mjambor@suse.cz>;
> > > gcc-
> > > > > patches@gcc.gnu.org
> > > > > Subject: Re: [PATCH] [tree-optimization/110279] swap operands in
> reassoc
> > > to
> > > > > reduce cross backedge FMA
> > > > >
> > > > > On Tue, Aug 29, 2023 at 9:49 AM Di Zhao OS
> > > > > <dizhao@os.amperecomputing.com> wrote:
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Richard Biener <richard.guenther@gmail.com>
> > > > > > > Sent: Tuesday, August 29, 2023 3:41 PM
> > > > > > > To: Jeff Law <jeffreyalaw@gmail.com>; Martin Jambor
> <mjambor@suse.cz>
> > > > > > > Cc: Di Zhao OS <dizhao@os.amperecomputing.com>; gcc-
> > > patches@gcc.gnu.org
> > > > > > > Subject: Re: [PATCH] [tree-optimization/110279] swap operands in
> > > reassoc
> > > > > to
> > > > > > > reduce cross backedge FMA
> > > > > > >
> > > > > > > On Tue, Aug 29, 2023 at 1:23 AM Jeff Law via Gcc-patches
> > > > > > > <gcc-patches@gcc.gnu.org> wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On 8/28/23 02:17, Di Zhao OS via Gcc-patches wrote:
> > > > > > > > > This patch tries to fix the 2% regression in 510.parest_r on
> > > > > > > > > ampere1 in the tracker. (Previous discussion is here:
> > > > > > > > > https://gcc.gnu.org/pipermail/gcc-patches/2023-
> July/624893.html)
> > > > > > > > >
> > > > > > > > > 1. Add testcases for the problem. For an op list in the form
> of
> > > > > > > > > "acc = a * b + c * d + acc", currently reassociation doesn't
> > > > > > > > > Swap the operands so that more FMAs can be generated.
> > > > > > > > > After widening_mul the result looks like:
> > > > > > > > >
> > > > > > > > >     _1 = .FMA(a, b, acc_0);
> > > > > > > > >     acc_1 = .FMA(c, d, _1);
> > > > > > > > >
> > > > > > > > > While previously (before the "Handle FMA friendly..." patch),
> > > > > > > > > widening_mul's result was like:
> > > > > > > > >
> > > > > > > > >     _1 = a * b;
> > > > > > > > >     _2 = .FMA (c, d, _1);
> > > > > > > > >     acc_1 = acc_0 + _2;
> > > > > > >
> > > > > > > How can we execute the multiply and the FMA in parallel?  They
> > > > > > > depend on each other.  Or is it the uarch can handle dependence
> > > > > > > on the add operand but only when it is with a multiplication and
> > > > > > > not a FMA in some better ways?  (I'd doubt so much complexity)
> > > > > > >
> > > > > > > Can you explain in more detail how the uarch executes one vs. the
> > > > > > > other case?
> > > >
> > > > Here's my understanding after consulted our hardware team. For the
> > > > second case, the uarch of some out-of-order processors can calculate
> > > > "_2" of several loops at the same time, since there's no dependency
> > > > among different iterations. While for the first case the next iteration
> > > > has to wait for the current iteration to finish, so "acc_0"'s value is
> > > > known. I assume it is also the case in some i386 processors, since I
> > > > saw the patch "Deferring FMA transformations in tight loops" also
> > > > changed corresponding files.
> > >
> > > That should be true for all kind of operations, no?  Thus it means
> > > reassoc should in general associate cross-iteration accumulation
> > Yes I think both are true.
> >
> > > last?  Historically we associated those first because that's how the
> > > vectorizer liked to see them, but I think that's no longer necessary.
> > >
> > > It should be achievable by properly biasing the operand during
> > > rank computation (don't we already do that?).
> >
> > The issue is related with the following codes (handling cases with
> > three operands left):
> >       /* When there are three operands left, we want
> >          to make sure the ones that get the double
> >          binary op are chosen wisely.  */
> >       int len = ops.length ();
> >       if (len >= 3 && !has_fma)
> >         swap_ops_for_binary_stmt (ops, len - 3);
> >
> >       new_lhs = rewrite_expr_tree (stmt, rhs_code, 0, ops,
> >                                    powi_result != NULL
> >                                    || negate_result,
> >                                    len != orig_len);
> >
> > Originally (before the "Handle FMA friendly..." patch), for the
> > tiny example, the 2 multiplications will be placed first by
> > swap_ops_for_binary_stmt and rewrite_expr_tree, according to
> > ranks. While currently, to preserve more FMAs,
> > swap_ops_for_binary_stmt won't be called, so the result would
> > be MULT_EXPRs and PLUS_EXPRs interleaved with each other (which
> > is mostly fine if these are not in such tight loops).
> >
> > What this patch tries to do can be summarized as: when cross
> > backedge dependency is detected (and the uarch doesn't like it),
> > better fallback to the "old" way, since we don't want the loop
> > depending FMA anyway. (I hope I'm understanding the question
> > correctly.)
> 
> Hmm, I'd argue this should be
> 
>  if (len >= 3 && (!has_fma || (width = get_reassociation_width (...)) > 1))
>   swap_ops_for_binary_stmt (ops, len - 3);
> 
> to cover the case of exactly 3 ops - this is for exactly 3 ops, right?

Yes, this is better.

> In principle we could, in swap_ops_for_binary_stmt, for has_fma
> avoid doing sth when no PHI recurrence is involved.  But the point
> is the target can execute things in parallel, and the thing to check for
> this is the reassociation-width (even if its for extending the chain
> via speculative execution across a backedge).

Attached is a modified version of the patch. I used a new function
(swap_for_better_reassoc_width_p) to check for cross backedge
dependency rather than extending get_reassociation_width, since we
only want to swap when the width without swapping would be worse
than with swapping. For a case like "acc = a * b + c * d + e"
(without loop dependency), these uarchs can execute the code in
parallel whether it's swapped or not, so in that case it's better
not to swap the operands and more FMAs can be generated.
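
For reference, a minimal (hypothetical, not from the attached
testcase) C example of that no-dependency case would be:

   typedef double data_e;

   /* No accumulator feeding back across a backedge, so the two
      chained FMAs are harmless and the operands need not be swapped.  */
   data_e
   no_backedge_dep (data_e a, data_e b, data_e c, data_e d, data_e e)
   {
     /* Plausibly:  _1 = .FMA (a, b, e);  result = .FMA (c, d, _1);  */
     return a * b + c * d + e;
   }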

> 
> > >
> > > > > > >
> > > > > > > > > If the code fragment is in a loop, some architecture can
> execute
> > > > > > > > > the latter in parallel, so the performance can be much faster
> > > than
> > > > > > > > > the former. For the small testcase, the performance gap is
> over
> > > > > > > > > 10% on both ampere1 and neoverse-n1. So the point here is to
> > > avoid
> > > > > > > > > turning the last statement into FMA, and keep it a PLUS_EXPR
> as
> > > > > > > > > much as possible. (If we are rewriting the op list into
> parallel,
> > > > > > > > > no special treatment is needed, since the last statement
> after
> > > > > > > > > rewrite_expr_tree_parallel will be PLUS_EXPR anyway.)
> > > > > > > > >
> > > > > > > > > 2. Function result_feeds_back_from_phi_p is to check for
> cross
> > > > > > > > > backedge dependency. Added new enum fma_state to describe the
> > > > > > > > > state of FMA candidates.
> > > > > > > > >
> > > > > > > > > With this patch, there's a 3% improvement in 510.parest_r 1-
> copy
> > > > > > > > > run on ampere1. The compile options are:
> > > > > > > > > "-Ofast -mcpu=ampere1 -flto --param avoid-fma-max-bits=512".
> > > > > > > > >
> > > > > > > > > Best regards,
> > > > > > > > > Di Zhao
> > > > > > > > >
> > > > > > > > > ----
> > > > > > > > >
> > > > > > > > >          PR tree-optimization/110279
> > > > > > > > >
> > > > > > > > > gcc/ChangeLog:
> > > > > > > > >
> > > > > > > > >          * tree-ssa-reassoc.cc (enum fma_state): New enum to
> > > > > > > > >          describe the state of FMA candidates for an op list.
> > > > > > > > >          (rewrite_expr_tree_parallel): Changed boolean
> > > > > > > > >          parameter to enum type.
> > > > > > > > >          (result_feeds_back_from_phi_p): New function to
> check
> > > > > > > > >          for cross backedge dependency.
> > > > > > > > >          (rank_ops_for_fma): Return enum fma_state. Added new
> > > > > > > > >          parameter.
> > > > > > > > >          (reassociate_bb): If there's backedge dependency in
> an
> > > > > > > > >          op list, swap the operands before rewrite_expr_tree.
> > > > > > > > >
> > > > > > > > > gcc/testsuite/ChangeLog:
> > > > > > > > >
> > > > > > > > >          * gcc.dg/pr110279.c: New test.
> > > > > > > > Not a review, but more of a question -- isn't this
> transformation's
> > > > > > > > profitability uarch sensitive.  ie, just because it's bad for a
> set
> > > of
> > > > > > > > aarch64 uarches, doesn't mean it's bad everywhere.
> > > > > > > >
> > > > > > > > And in general we shy away from trying to adjust gimple code
> based
> > > on
> > > > > > > > uarch preferences.
> > > > > > > >
> > > > > > > > It seems the right place to do this is gimple->rtl expansion.
> > > > > > >
> > > > > > > Another comment is that FMA forming has this deferring code which
> I
> > > > > > > think deals exactly with this kind of thing?  CCing Martin who
> did
> > > this
> > > > > > > work based on AMD uarchs also not wanting cross-loop dependences
> > > > > > > on FMAs (or so).  In particular I see
> > > > > > >
> > > > > > >   if (fma_state.m_deferring_p
> > > > > > >       && fma_state.m_initial_phi)
> > > > > > >     {
> > > > > > >       gcc_checking_assert (fma_state.m_last_result);
> > > > > > >       if (!last_fma_candidate_feeds_initial_phi (&fma_state,
> > > > > > >
> &m_last_result_set))
> > > > > > >         cancel_fma_deferring (&fma_state);
> > > > > > >
> > > > > > > and I think code to avoid FMAs in other/related cases should be
> here
> > > > > > > as well, like avoid forming back-to-back FMAs.
> > > > > >
> > > > > > The changes in this patch is controlled by
> "param_avoid_fma_max_bits",
> > > so
> > > > > > I think it should only affect architectures with similar behavior.
> (The
> > > > > > parameter was added in a previous patch "Deferring FMA
> transformations
> > > > > > in tight loops", which seems to be dealing with the same issue.)
> > > > >
> > > > > That's what I said.  Is the pipeline behavior properly modeled by the
> > > > > scheduler description?  Your patch seems to not only affect loops
> > > > > but the FMA forming case already present does - is the case you are
> > > > > after not in a tight loop?  In principle with a loop with enough
> > > scheduling
> > > > > freedom this should be a non-issue - unless the pipeline model isn't
> > > > > accurate and we put the FMAs too close together?
> > > >
> > > > The current deferring FMA scheme discards all the FMA candidates in the
> FMA
> > > > chain that has loop dependency. So for this case there will be too less
> > > (zero)
> > > > FMA generated; while before the "Handle FMA friendly..." patch or after
> > > this
> > > > patch we can still have 1 FMA. I also tried to change the deferring FMA
> > > scheme
> > > > so only the last FMA is discarded, but it still can't solve the
> regression.
> > > > I think it's because this deferring scheme is in the later pass
> > > widening_mul,
> > > > which only decides whether to form FMA, but doesn't manipulate the tree
> for
> > > > better data parallelism. So it seems better to optimize for data
> > > parallelism
> > > > (swapping the operands) in reassoc if we know there can be FMAs with
> loop
> > > > dependency, and leave the current deferring FMA scheme as the last
> resort?
> > > >
> > > > >
> > > > > Richard.
> > > > >
> > > > > > > Richard.
> > > > > > >
> > > > > > > > Jeff
> > > > > >
> > > >
> > > >
> > > > Thanks,
> > > > Di Zhao
> >
> >
> > Thanks,
> > Di Zhao

Thanks,
Di Zhao


----
        PR tree-optimization/110279

gcc/ChangeLog:

        * tree-ssa-reassoc.cc (swap_for_better_reassoc_width_p):
        New function to check whether ranking the ops results in
        better parallelism.
        (reassociate_bb): For 3 ops, refine the condition to call
        swap_ops_for_binary_stmt.

gcc/testsuite/ChangeLog:

        * gcc.dg/pr110279.c: New test.

[-- Attachment #2: 0001-PATCH-swap-operands-in-reassoc-to-reduce-cross-backe.patch --]
[-- Type: application/octet-stream, Size: 5045 bytes --]

From 1b6acd6f65acc9905ec933480c3676fd7101c956 Mon Sep 17 00:00:00 2001
From: "dzhao.ampere" <di.zhao@amperecomputing.com>
Date: Mon, 4 Sep 2023 11:36:51 +0800
Subject: [PATCH] [PATCH] swap operands in reassoc to reduce cross backedge FMA

---
 gcc/testsuite/gcc.dg/pr110279.c | 62 ++++++++++++++++++++++++++
 gcc/tree-ssa-reassoc.cc         | 79 ++++++++++++++++++++++++++++++++-
 2 files changed, 140 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.dg/pr110279.c

diff --git a/gcc/testsuite/gcc.dg/pr110279.c b/gcc/testsuite/gcc.dg/pr110279.c
new file mode 100644
index 00000000000..b32931ca318
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr110279.c
@@ -0,0 +1,62 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast --param avoid-fma-max-bits=512 --param tree-reassoc-width=4 -fdump-tree-widening_mul-details" } */
+/* { dg-additional-options "-march=armv8.2-a" } */
+
+#define LOOP_COUNT 800000000
+typedef double data_e;
+
+/* Check that FMAs with backedge dependency are avoided.  Otherwise no FMA
+   would be generated with "--param avoid-fma-max-bits=512".  */
+
+data_e foo1 (data_e a, data_e b, data_e c, data_e d)
+{
+  data_e result = 0;
+
+  for (int ic = 0; ic < LOOP_COUNT; ic++)
+    {
+      result += (a * b + c * d);
+
+      a -= 0.1;
+      b += 0.9;
+      c *= 1.02;
+      d *= 0.61;
+    }
+
+  return result;
+}
+
+data_e foo2 (data_e a, data_e b, data_e c, data_e d)
+{
+  data_e result = 0;
+
+  for (int ic = 0; ic < LOOP_COUNT; ic++)
+    {
+      result = a * b + result + c * d;
+
+      a -= 0.1;
+      b += 0.9;
+      c *= 1.02;
+      d *= 0.61;
+    }
+
+  return result;
+}
+
+data_e foo3 (data_e a, data_e b, data_e c, data_e d)
+{
+  data_e result = 0;
+
+  for (int ic = 0; ic < LOOP_COUNT; ic++)
+    {
+      result = result + a * b + c * d;
+
+      a -= 0.1;
+      b += 0.9;
+      c *= 1.02;
+      d *= 0.61;
+    }
+
+  return result;
+}
+
+/* { dg-final { scan-tree-dump-times "Generated FMA" 3 "widening_mul"} } */
diff --git a/gcc/tree-ssa-reassoc.cc b/gcc/tree-ssa-reassoc.cc
index eda03bf98a6..84439dbcb6b 100644
--- a/gcc/tree-ssa-reassoc.cc
+++ b/gcc/tree-ssa-reassoc.cc
@@ -5427,6 +5427,81 @@ get_required_cycles (int ops_num, int cpu_width)
   return res;
 }
 
+/* Given that LHS is the result SSA_NAME of OPS, returns whether ranking the
+   ops results in better parallelism.  */
+static bool
+swap_for_better_reassoc_width_p (vec<operand_entry *> *ops, tree lhs)
+{
+  /* If there's code like "acc = a * b + c * d + acc" in a tight loop, some
+     uarchs can execute results like:
+
+	_1 = a * b;
+	_2 = .FMA (c, d, _1);
+	acc_1 = acc_0 + _2;
+
+     in parallel, while turning it into
+
+	_1 = .FMA(a, b, acc_0);
+	acc_1 = .FMA(c, d, _1);
+
+     hinders that, because then the first FMA depends on the result of
+     preceding iteration.  */
+  if (maybe_le (tree_to_poly_int64 (TYPE_SIZE (TREE_TYPE (lhs))),
+		param_avoid_fma_max_bits))
+    {
+      /* Look for a cross backedge dependency:
+	1. LHS is a phi argument in the same basic block where it is defined.
+	2. The result of that phi node is used in OPS.  */
+      basic_block bb = gimple_bb (SSA_NAME_DEF_STMT (lhs));
+      gimple_stmt_iterator gsi;
+      for (gsi = gsi_start_phis (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+	{
+	  gphi *phi = dyn_cast<gphi *> (gsi_stmt (gsi));
+	  for (unsigned i = 0; i < gimple_phi_num_args (phi); ++i)
+	    {
+	      tree op = PHI_ARG_DEF (phi, i);
+	      if (!(op == lhs && gimple_phi_arg_edge (phi, i)->src == bb))
+		continue;
+	      tree phi_result = gimple_phi_result (phi);
+	      operand_entry *oe;
+	      unsigned int j;
+	      FOR_EACH_VEC_ELT (*ops, j, oe)
+		{
+		  if (TREE_CODE (oe->op) != SSA_NAME)
+		    continue;
+
+		  /* Result of phi is operand of PLUS_EXPR.  */
+		  if (oe->op == phi_result)
+		    return true;
+
+		  /* Check if the phi result feeds a (possibly negated) MULT_EXPR.  */
+		  gimple *def_stmt = SSA_NAME_DEF_STMT (oe->op);
+		  if (is_gimple_assign (def_stmt)
+		      && gimple_assign_rhs_code (def_stmt) == NEGATE_EXPR)
+		    {
+		      tree rhs = gimple_assign_rhs1 (def_stmt);
+		      if (TREE_CODE (rhs) == SSA_NAME)
+			{
+			  if (rhs == phi_result)
+			    return true;
+			  def_stmt = SSA_NAME_DEF_STMT (rhs);
+			}
+		    }
+		  if (is_gimple_assign (def_stmt)
+		      && gimple_assign_rhs_code (def_stmt) == MULT_EXPR)
+		    {
+		      if (gimple_assign_rhs1 (def_stmt) == phi_result
+			  || gimple_assign_rhs2 (def_stmt) == phi_result)
+			return true;
+		    }
+		}
+	    }
+	}
+    }
+
+  return false;
+}
+
 /* Returns an optimal number of registers to use for computation of
    given statements.  */
 
@@ -7046,7 +7121,9 @@ reassociate_bb (basic_block bb)
 			 to make sure the ones that get the double
 			 binary op are chosen wisely.  */
 		      int len = ops.length ();
-		      if (len >= 3 && !has_fma)
+		      if (len >= 3
+			  && (!has_fma
+			      || swap_for_better_reassoc_width_p (&ops, lhs)))
 			swap_ops_for_binary_stmt (ops, len - 3);
 
 		      new_lhs = rewrite_expr_tree (stmt, rhs_code, 0, ops,
-- 
2.25.1


^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2023-09-04 10:33 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-08-28  8:17 [PATCH] [tree-optimization/110279] swap operands in reassoc to reduce cross backedge FMA Di Zhao OS
2023-08-28 23:22 ` Jeff Law
2023-08-29  7:41   ` Richard Biener
2023-08-29  7:49     ` Di Zhao OS
2023-08-29  8:09       ` Richard Biener
2023-08-29  8:58         ` Di Zhao OS
2023-08-29 11:11           ` Richard Biener
2023-08-30  9:33             ` Di Zhao OS
2023-08-31 12:22               ` Richard Biener
2023-09-04 10:33                 ` Di Zhao OS
2023-08-29 12:34     ` Jeff Law

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).